Kubernetes Troubleshooting Mindset: How I Approach Production Issues

Senior DevOps Engineer with 9+ years of experience across networking, infrastructure, cloud operations, and DevOps. I write about Kubernetes, CNCF certifications, cloud-native technologies, platform engineering, automation, and lessons learned from real-world projects. Currently documenting my journey toward becoming a Kubestronaut while sharing practical insights, study strategies, and hands-on experiences with the Kubernetes ecosystem.

One of the biggest lessons I’ve learned in my Kubernetes journey is this:

Most production incidents are not solved by knowing more commands.

They’re solved by having a systematic troubleshooting process.

When I first started learning Kubernetes, my troubleshooting strategy was simple:

Try random commands until something makes sense.

Sometimes it worked.

Most of the time, it wasted valuable time and created more confusion.

After preparing for the CKA, working with Kubernetes, and spending countless hours troubleshooting broken deployments, failed Pods, networking issues, and cluster problems, I realized that effective troubleshooting is more about mindset than commands.

This article covers the troubleshooting framework I follow whenever I face Kubernetes issues.

The Biggest Mistake Engineers Make

When something breaks, many engineers immediately jump to conclusions.

Examples:

  • “The application must be broken.”

  • “The network is down.”

  • “Kubernetes is acting weird.”

  • “The deployment failed.”

These assumptions often lead troubleshooting in the wrong direction.

Instead, I try to follow a simple principle:

Observe first. Assume later.

My Kubernetes Troubleshooting Framework

Whenever I encounter a problem, I follow this sequence:

Observe
↓
Scope
↓
Gather Evidence
↓
Validate Assumptions
↓
Fix
↓
Verify
↓
Document

Let’s break it down.

Step 1: Observe the Symptoms

Before touching anything, I ask:

What exactly is failing?

Examples:

  • Pod not starting

  • Service unreachable

  • Node NotReady

  • Application crashing

  • Storage issue

  • DNS issue

At this stage, I avoid making assumptions.

I simply gather facts.

Commands I often use:

kubectl get pods -A
kubectl get nodes
kubectl get svc -A

The goal is visibility.
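To keep the first look focused on facts, the listing can be reduced to the Pods that are not healthy. The sketch below is one way to do that, assuming the default column layout of `kubectl get pods -A` (STATUS in the fourth column); the function name `unhealthy_pods` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# only the Pods whose STATUS is neither Running nor Completed.
unhealthy_pods() {
  # With -A the columns are NAMESPACE NAME READY STATUS ..., so $4 is STATUS.
  # NR > 1 skips the header line.
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -A | unhealthy_pods
```

A Pod can be Running and still unhealthy, so an empty result narrows the search but does not close it.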

Step 2: Define the Scope

The next question is:

How big is the problem?

Is it:

  • One Pod?

  • One Deployment?

  • One Namespace?

  • One Node?

  • The entire cluster?

This step is important because it narrows the investigation.

For example:

If only one Pod is affected, the problem is likely specific to that workload: its image, its configuration, or the node it runs on.

If every workload is failing, the issue may be cluster-wide.
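One rough way to see the scope is to count failing Pods per node. If the failures pile up on a single node, the node is a suspect; if they are spread evenly, the workload or the cluster is more likely. This is only a sketch: `failing_by_node` is a hypothetical name, and it assumes the column layout of `kubectl get pods -A -o wide`, where NODE is the eighth column once the "(3m ago)" restart annotations are stripped.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods -A -o wide` output on stdin
# and print "<count> <node>" for Pods that are not Running or Completed.
failing_by_node() {
  awk 'NR > 1 {
         # RESTARTS may look like "12 (3m ago)"; drop the parenthesised part
         # so the column positions stay stable. Changing $0 re-splits fields.
         gsub(/\([^)]*\)/, "")
         if ($4 != "Running" && $4 != "Completed") count[$8]++
       }
       END { for (node in count) print count[node], node }' | sort -rn
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -A -o wide | failing_by_node
```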

Step 3: Gather Evidence

This is where the real investigation begins.

Instead of guessing, collect information.

My favorite commands:

kubectl describe pod <pod-name>
kubectl logs <pod-name>
kubectl logs --previous <pod-name>
kubectl describe node <node-name>
kubectl get events --sort-by=.metadata.creationTimestamp

Events often reveal the story behind the failure.
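On a busy cluster the event list gets long, so it can help to summarise only the Warning events by reason before reading them one by one. The sketch below assumes the default columns of `kubectl get events -A` (NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); `warning_reasons` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get events -A` output on stdin and
# print "<count> <reason>" for Warning events, most frequent first.
warning_reasons() {
  # $3 is TYPE and $4 is REASON when -A adds the NAMESPACE column.
  awk 'NR > 1 && $3 == "Warning" { count[$4]++ }
       END { for (reason in count) print count[reason], reason }' | sort -rn
}

# Usage against a live cluster (not executed here):
#   kubectl get events -A --sort-by=.metadata.creationTimestamp | warning_reasons
```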

Step 4: Follow the Dependency Chain

One of the most powerful techniques for gathering evidence is following dependencies.

For example:

Application not working?

Check:

Application
↓
Pod
↓
Deployment
↓
Service
↓
DNS
↓
Network
↓
Node
↓
Cluster

Instead of randomly jumping between components, I trace the path systematically.

This prevents missing critical clues.
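Tracing the chain in order can be sketched as a list of checks that stops at the first broken link. Everything named here is an assumption for illustration: the helper `first_broken_link`, the three example checks, and the `POD` and `SVC` variables, which would have to be set to a real Pod and Service.

```shell
#!/bin/sh
# Hypothetical helper: run the given checks in order and print the name of
# the first one that fails, or "none" if they all pass.
first_broken_link() {
  for check in "$@"; do
    if ! "$check" >/dev/null 2>&1; then
      echo "$check"
      return 1
    fi
  done
  echo "none"
}

# Example checks (assumptions, adjust to the workload being investigated).
# POD and SVC are placeholder variables the caller must set.
check_pod() {
  kubectl get pod "$POD" -o jsonpath='{.status.phase}' | grep -qx Running
}
check_endpoints() {
  # A Service with no ready Pods behind it has no endpoint addresses.
  [ -n "$(kubectl get endpoints "$SVC" -o jsonpath='{.subsets[*].addresses[*].ip}')" ]
}
check_dns() {
  kubectl exec "$POD" -- nslookup kubernetes.default
}

# Usage against a live cluster (not executed here):
#   POD=web-1 SVC=web first_broken_link check_pod check_endpoints check_dns
```

The order of the arguments is the order of the chain, so the output points at where to dig next rather than at everything that might be wrong.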

Step 5: Validate Assumptions

This is where many engineers get stuck.

Let’s say you believe DNS is broken.

Don’t assume.

Prove it.

Example:

kubectl exec -it <pod-name> -- nslookup kubernetes.default

If DNS works, move on.

If DNS fails, investigate further.

Evidence, not intuition, should drive conclusions.

Step 6: Fix One Thing at a Time

A common mistake during incidents is changing multiple things simultaneously.

Example:

  • Update deployment

  • Restart pods

  • Modify service

  • Change ConfigMap

Now you don’t know which change fixed the issue.

My rule:

Make one change.

Verify the result.

Continue if necessary.

This makes troubleshooting much easier.
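For changes to a Deployment, the rule can be sketched as "apply one manifest, wait for the rollout, then decide". The helper name `apply_one_change`, the 120 second timeout, and the file and Deployment names in the usage line are assumptions, not part of any standard workflow.

```shell
#!/bin/sh
# Hypothetical helper: apply a single manifest and wait for the named
# Deployment to finish rolling out before anything else is changed.
apply_one_change() {
  manifest=$1
  deployment=$2
  kubectl apply -f "$manifest" || return 1
  if kubectl rollout status "deployment/$deployment" --timeout=120s; then
    echo "change verified: $manifest"
  else
    # The change did not settle; revert it before trying the next idea,
    # for example with: kubectl rollout undo deployment/<name>
    echo "change not verified: $manifest"
    return 1
  fi
}

# Usage against a live cluster (not executed here):
#   apply_one_change fix-probe.yaml web
```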

Step 7: Verify the Resolution

Never assume a problem is fixed.

Verify it.

Check:

kubectl get pods
kubectl logs <pod-name>
kubectl get endpoints
kubectl get events

And most importantly:

Verify that the application itself is functioning correctly.

A green Pod does not always mean a healthy application.
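One concrete case of this is a Pod that reports Running while some of its containers are not ready, so it receives no traffic from its Service. The sketch below flags those Pods, assuming the default columns of `kubectl get pods` without `-A` (NAME, READY, STATUS); `not_ready_pods` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# Pods that are Running but have fewer ready containers than total.
not_ready_pods() {
  awk 'NR > 1 && $3 == "Running" {
         # READY looks like "1/2": ready containers / total containers.
         split($2, ready, "/")
         if (ready[1] != ready[2]) print $1 ": " $2 " containers ready"
       }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods | not_ready_pods
```

Even a fully ready Pod only proves that the readiness probe passes, so a request against the application itself is still the final check.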

Step 8: Document the Incident

This step is often skipped.

After resolving an issue, document:

  • Root cause

  • Symptoms

  • Fix applied

  • Prevention steps

Future-you will thank present-you.

Many recurring issues can be solved in minutes if previous incidents were documented properly.

My Mental Checklist for Common Issues

Pod Not Starting

Check:

kubectl describe pod <pod-name>
kubectl logs <pod-name>
kubectl logs --previous <pod-name>

Look for:

  • ImagePullBackOff

  • CrashLoopBackOff

  • Resource limits (OOMKilled containers, or Pods stuck in Pending for lack of CPU or memory)

  • Missing ConfigMaps

  • Missing Secrets
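The checklist above can be written down as a small lookup from the reason Kubernetes reports to the next thing to check. This is a sketch of one possible mapping, not an exhaustive one; the function name `next_step_for` and the wording of the hints are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map a container state reason to a suggested next step.
next_step_for() {
  case $1 in
    ImagePullBackOff|ErrImagePull)
      echo "check image name, tag, and registry credentials" ;;
    CrashLoopBackOff)
      echo "read kubectl logs --previous for the crash output" ;;
    CreateContainerConfigError)
      echo "check referenced ConfigMaps and Secrets" ;;
    OOMKilled)
      echo "compare memory usage with the container memory limit" ;;
    *)
      echo "read the Events section of kubectl describe pod" ;;
  esac
}

# The waiting reason can be read from a live Pod (not executed here) with:
#   kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
```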

Service Not Working

Check:

kubectl get svc
kubectl get endpoints
kubectl describe svc

Verify:

  • Selectors

  • Ports

  • Target Ports

  • Pod readiness
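A selector mismatch is easy to miss by eye, so the comparison can be made explicit. The sketch below checks that every `key=value` pair of a Service selector appears in a Pod's labels. It assumes both are given as comma-separated lists, which is how `kubectl get svc -o wide` prints the SELECTOR column and `kubectl get pods --show-labels` prints LABELS; `selector_matches` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: succeed when every key=value pair in the selector
# (first argument) is present in the labels (second argument).
selector_matches() {
  selector=$1
  labels=$2
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    # Wrap both sides in commas so "app=web" does not match "app=web2".
    case ",$labels," in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; echo "missing: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  echo "match"
}

# Example with values copied from kubectl output:
#   selector_matches 'app=web,tier=frontend' 'app=web,tier=frontend,version=v2'
```

If the selector matches but `kubectl get endpoints` is still empty, the Pods are most likely failing their readiness probe.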

DNS Problems

Check:

kubectl get pods -n kube-system

Verify CoreDNS is healthy.

Test:

kubectl exec <pod-name> -- nslookup kubernetes.default

Node NotReady

Check:

kubectl describe node

Investigate:

  • Kubelet status

  • Disk pressure

  • Memory pressure

  • Networking
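The node conditions behind these items can be pulled out of the node status and filtered down to the ones that need attention. In this sketch `node_problems` is a hypothetical name, and the rule it applies is an assumption that fits the standard conditions: `Ready` should be `True`, while the pressure and `NetworkUnavailable` conditions should not be.

```shell
#!/bin/sh
# Hypothetical helper: read "<type>=<status>" lines on stdin and print the
# node conditions that look wrong.
node_problems() {
  # Ready must be True; every other standard condition must not be True.
  awk -F= '($1 == "Ready" && $2 != "True") ||
           ($1 != "Ready" && $2 == "True") { print $1 " is " $2 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | node_problems
```

A `Ready` status of `Unknown` usually means the control plane has stopped hearing from the kubelet, which points at the kubelet or the node's network rather than at the workloads.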

The Most Important Skill

After years of working in IT and preparing for Kubernetes certifications, I’ve come to believe that the most valuable Kubernetes skill is not:

  • Writing YAML

  • Memorizing commands

  • Passing certifications

It’s troubleshooting.

Production environments are unpredictable.

Applications fail.

Nodes fail.

Networks fail.

Configurations fail.

The engineers who remain calm, gather evidence, and troubleshoot systematically are the ones who solve problems quickly.

Final Thoughts

Kubernetes troubleshooting is not about finding the perfect command.

It’s about developing a repeatable process.

Whenever something breaks, I remind myself:

Observe
↓
Scope
↓
Gather Evidence
↓
Validate Assumptions
↓
Fix
↓
Verify
↓
Document

This simple framework has helped me solve countless Kubernetes issues more effectively than any command cheat sheet ever could.

The goal isn’t to know everything.

The goal is to know how to find the answer when things go wrong.

And that’s what separates Kubernetes users from Kubernetes operators.

Connect With Me

If you’re preparing for Kubernetes certifications, pursuing the Kubestronaut journey, or working in the cloud-native ecosystem, I’d love to connect.

Follow me for more articles on Kubernetes, CNCF certifications, DevOps, Platform Engineering, and Cloud-Native technologies.

LinkedIn: https://www.linkedin.com/in/shahzadaliahmad/

LFX Profile: https://openprofile.dev/profile/shahzadahmad91

Credly: https://www.credly.com/users/shahzadahmad

Website: https://shahzadahmad.dev/

If you found this article helpful, consider following, clapping, and sharing it with others in the Kubernetes community.

My Kubestronaut Journey

Part 18 of 32

Follow my journey from DevOps Engineer to Kubestronaut as I explore Kubernetes, CNCF certifications, cloud-native technologies, and hands-on learning. In this series, I share my experiences preparing for and passing certifications such as CKA, CKAD, and CKS, along with exam strategies, study resources, troubleshooting lessons, and practical insights gained from real-world Kubernetes environments. Whether you're just starting with Kubernetes or pursuing advanced CNCF certifications, I hope these experiences help guide your own cloud-native journey.
