Fixing a NotReady Node in Kubernetes
Introduction/Issue:
One day, while working on our Kubernetes cluster, we noticed that one of the nodes had gone into a NotReady state. The pods scheduled on that node stopped running, disrupting application availability, so we needed to investigate and resolve the issue quickly to restore normal operations.
Why we need to do it/Cause of the issue:
A node goes into a NotReady state when its Ready condition is no longer True: either the kubelet has stopped reporting status to the control plane, or it is reporting that the node is unhealthy. Common causes include:
Resource Issues: The node ran out of CPU, memory, or disk space.
Network Issues: The node lost connectivity to the control plane or to other nodes.
Kubelet Problems: The Kubelet service on the node stopped working or crashed.
Disk Pressure: Disk usage on the node crossed the kubelet's eviction threshold, so the node reports DiskPressure and stops accepting new workloads.
In our case, the issue was caused by disk pressure due to a large number of temporary files.
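If the cause is not obvious from the cluster side, a quick way to rule out a crashed kubelet is to check the service directly on the node (this assumes the kubelet runs as a systemd unit named kubelet, which is the usual setup):
systemctl status kubelet
journalctl -u kubelet --since "15 min ago" --no-pager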
How do we solve it:
Here’s the step-by-step approach we followed to fix the issue:
Check Node Status:
We started by inspecting the node’s status:
kubectl get nodes
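The output looked roughly like the following (node names, ages, and versions here are placeholders):
NAME     STATUS     ROLES           AGE   VERSION
node-1   Ready      control-plane   92d   v1.27.4
node-2   NotReady   <none>          92d   v1.27.4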
Describe the Node:
Next, we described the node to find out why it was flagged:
kubectl describe node <node-name>
The output showed:
Warning: DiskPressure
Node is under disk pressure.
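If you only want the node conditions rather than the full describe output, a jsonpath query like the one below works as well (the node name is a placeholder):
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
In a situation like ours, this would list DiskPressure=True among the conditions.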
Free Up Disk Space:
We logged into the node and checked the disk usage:
ssh user@<node-name>
df -h
The /var/lib/docker directory was using 95% of the disk space due to accumulated temporary files and old container images.
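To see which subdirectories were actually consuming the space, a quick du pass helps (root privileges are usually needed to read Docker's data directory):
sudo du -sh /var/lib/docker/* | sort -rh | head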
To clean up the space, we ran the following command:
docker system prune -a
This removed stopped containers, unused images, unused networks, and the build cache.
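If removing everything unused feels too aggressive, it is worth checking how much space is reclaimable first and limiting the prune to older objects, for example:
docker system df
docker system prune -a --filter "until=24h"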
After freeing up space, we restarted the Kubelet service to refresh the node’s status:
systemctl restart kubelet
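Depending on how the node is provisioned, the restart may need root privileges, and it is worth confirming that the service actually came back:
sudo systemctl restart kubelet
systemctl is-active kubelet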
Verify the Node Status:
Finally, we checked the node status again:
kubectl get nodes
The node was back to Ready, and pods began running as expected.
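To double-check that the DiskPressure condition had cleared and that pods were being scheduled onto the node again, commands along these lines can be used (the node name is a placeholder):
kubectl describe node <node-name> | grep -i diskpressure
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>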
Conclusion:
The NotReady state was resolved by identifying and clearing disk pressure on the node. Regular monitoring and implementing automated cleanup for unused container resources can help prevent such issues in the future. This simple yet effective fix ensured the cluster’s stability and restored application availability.
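One way to automate this kind of cleanup is to let the kubelet garbage-collect unused images more aggressively by tuning its configuration. A minimal sketch, assuming the nodes use a KubeletConfiguration file and with threshold values that are purely illustrative:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 80   # start removing unused images once disk usage passes 80%
imageGCLowThresholdPercent: 60    # keep deleting until usage drops back to 60%
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"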