I’ve recently been doing a fair bit of automation work on bringing up AWS managed Kubernetes clusters using Terraform (with Packer for building out the worker group nodes).
Some of my colleagues have not worked with EKS (or Kubernetes) much before, so I’ve also been sharing knowledge and helping others get up to speed. A colleague who was having trouble with their newly provisioned personal test EKS cluster found that the kube-system / control plane related pods were not starting. I assisted with the troubleshooting process and found the following…
Upon diving into the logs of the kube-system related pods (DNS, AWS CNI, etc.) it was obvious that the pods were not being scheduled on the brand new cluster. The next obvious command to run was kubectl get nodes -o wide, to take a look at the general state of the worker nodes.
Unsurprisingly there were no nodes in the cluster.
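If you want to script this check rather than eyeball it, something like the following could work. It is only a rough sketch: count_ready_nodes is a helper name I made up, and it assumes the default column layout of kubectl get nodes (NAME first, STATUS second).

```shell
#!/usr/bin/env bash
# Count nodes whose STATUS column starts with "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
# count_ready_nodes is a hypothetical helper name, not part of kubectl.
count_ready_nodes() {
  # "NotReady" does not match; "Ready,SchedulingDisabled" does.
  awk '$2 ~ /^Ready/ { n++ } END { print n + 0 }'
}

# Usage (needs kubectl and a kubeconfig pointing at the cluster):
#   kubectl get nodes --no-headers | count_ready_nodes
#   kubectl get pods -n kube-system -o wide
```

A count of 0 on a cluster that is supposed to have workers matches the situation described here: nothing for the scheduler to place the kube-system pods on.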
Troubleshooting worker nodes not joining the cluster
The first thing that comes to mind when you have worker nodes that are not joining the cluster on startup is to check the bootstrapping / startup scripts. In EKS’ case (and more specifically EC2) the worker nodes should be joining the cluster by running a couple of commands in the userdata script that the EC2 machines run on launch.
If you’re customising your worker nodes with your own custom AMI(s) then you’ll most likely be handling this userdata script logic yourself, and this is the first place to check.
The easiest way of checking for userdata script failures on an EC2 instance is to get the cloud-init logs directly from the instance. Locate the EC2 machine in the console (or by its instance-id) and inspect the logs for failures in the section that records the execution of your userdata script.
- In the EC2 console: Right-click your EC2 instance -> Instance Settings -> Get System Log.
- On the instance itself:
  - cat /var/log/cloud-init.log | more
  - cat /var/log/cloud-init-output.log | more
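Paging through these logs by hand gets tedious, so a small filter can help narrow things down. The sketch below is an assumption-heavy starting point: scan_cloud_init_log is a name I invented, and the keyword list is just a guess at common failure markers, not an official or exhaustive set.

```shell
#!/usr/bin/env bash
# Print lines (with line numbers) that look like failures in a cloud-init log.
# scan_cloud_init_log is a hypothetical helper; the keyword list is a guess
# at common failure markers and will need tuning for your own scripts.
scan_cloud_init_log() {
  local log_file="$1"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 1
  fi
  # -n: show line numbers, -i: ignore case, -E: extended regex
  grep -n -i -E 'error|fail|traceback|command not found|permission denied' "$log_file"
}

# Usage on the worker node (may need sudo to read the logs):
#   scan_cloud_init_log /var/log/cloud-init-output.log
#   scan_cloud_init_log /var/log/cloud-init.log
```

Note that grep exits non-zero when nothing matches, so an empty result with a non-zero status just means none of the keywords were found. That does not prove the userdata script succeeded.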
Upon finding the error you can then check (using intuition around the specific error message you found):
- Have any changes been introduced lately that might have caused the breakage?
- Has the base AMI that you’re building on top of changed?
- Have any resources that you might be pulling into the base image builds been modified in any way?
These are the questions to ask and investigate first. You should be storing base image build scripts (Packer templates, for example) in version control / git, so check the recent git commits and image build logs first.
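For the git side of that check, a helper along these lines can list recent commits that touched the image build scripts. Treat it as a sketch: recent_image_changes is a made-up name, and the default packer/ path and two-week window are assumptions about your repository layout, not anything from the cluster setup described above.

```shell
#!/usr/bin/env bash
# List recent commits touching the image build scripts.
# recent_image_changes, the default "packer/" path and the default
# "2 weeks ago" window are all hypothetical; adjust them to your repo.
recent_image_changes() {
  local target="${1:-packer/}"
  local since="${2:-2 weeks ago}"
  # %h: short hash, %ad: author date, %s: commit subject
  git log --since="$since" --date=short --pretty='%h %ad %s' -- "$target"
}

# Usage (run from inside the repository that holds the build scripts):
#   recent_image_changes packer/ "1 month ago"
```

An empty result means nothing under that path changed in the window, which points the investigation towards the base AMI or externally fetched resources instead.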