Troubleshooting Amazon EKS Worker Nodes not joining the cluster


I’ve recently been doing a fair bit of automation work on bringing up AWS managed Kubernetes clusters using Terraform (with Packer for building out the worker group nodes). Read on for some handy tips on troubleshooting EKS worker nodes.

Some of my colleagues have not worked with EKS (or Kubernetes) much before, so I’ve also been sharing knowledge and helping others get up to speed. A colleague who was having trouble with their newly provisioned personal test EKS cluster found that the kube-system / control plane related pods were not starting. I assisted with the troubleshooting process and found the following…

Upon diving into the logs of the kube-system related pods (DNS, AWS CNI, etc.) it was obvious that the pods were not being scheduled on the brand new cluster. The next obvious command to run was kubectl get nodes -o wide to take a look at the general state of the worker nodes.

Unsurprisingly there were no nodes in the cluster.
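
For reference, these are the sort of quick checks worth running against a fresh cluster (the pod name is a placeholder and the output will obviously differ in your environment):

  # Check whether the kube-system pods are scheduled and running
  kubectl get pods -n kube-system -o wide
  # Check the state of the worker nodes (in this case there were none)
  kubectl get nodes -o wide
  # Dig into the scheduling events for a specific stuck pod
  kubectl describe pod <pod-name> -n kube-system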

Troubleshooting worker nodes not joining the cluster

The first thing that comes to mind when you have worker nodes that are not joining the cluster on startup is to check the bootstrapping / startup scripts. In EKS’ case (and more specifically EC2) the worker nodes should be joining the cluster by running a couple of commands in the userdata script that the EC2 machines run on launch.

If you’re customising your worker nodes with your own custom AMI(s) then you’ll most likely be handling this userdata script logic yourself, and this is the first place to check.
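
As a rough sketch, on the standard EKS-optimised AMI the userdata boils down to calling the bootstrap script with your cluster’s details. The cluster name, endpoint and certificate data below are placeholders, and a fully custom AMI may wire this up differently:

  #!/bin/bash
  set -o xtrace
  # Join this instance to the EKS cluster (all values below are placeholders)
  /etc/eks/bootstrap.sh my-test-cluster \
    --apiserver-endpoint https://EXAMPLE1234567890.gr7.eu-west-1.eks.amazonaws.com \
    --b64-cluster-ca <base64-encoded-cluster-ca> \
    --kubelet-extra-args '--node-labels=worker-group=default'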

The easiest way of checking userdata script failures on an EC2 instance is to simply get the cloud-init logs direct from the instance. Locate the EC2 machine in the console (or note its instance-id) and inspect the logs for failures in the section that logs execution of your userdata script.

  • In the EC2 console: Right-click your EC2 instance -> Instance Settings -> Get System Log.
  • On the instance itself:
    • cat /var/log/cloud-init.log | more
    • cat /var/log/cloud-init-output.log | more
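
If you prefer the command line, the same system log can also be pulled with the AWS CLI from your workstation (the instance ID below is a placeholder):

  # Fetch the instance's console/system log without logging on to it
  aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text | less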

Upon finding the error you can then check (using intuition around the specific error message you found):

  • Have any changes been introduced lately that might have caused the breakage?
  • Has the base AMI that you’re building on top of changed?
  • Have any resources that you might be pulling into the base image builds been modified in any way?

These are the questions to ask and investigate first. You should be storing base image build scripts (Packer, for example) in version control / git, so check the recent git commits and image build logs first.
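
Something as simple as the following will surface recent changes quickly (this assumes your Packer templates live under a packer/ directory, which is just an illustrative layout):

  # Show the last few commits that touched the image build scripts
  git log --oneline -n 10 -- packer/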

How to restart a slave FortiGate firewall in an HA cluster

Here’s a quick how-to on restarting a specific member of a High Availability FortiGate hardware firewall cluster. I have only tested this on a cluster of FG60 units, but am quite sure the steps would be similar for a cluster of FG100s, FG310s etc…


First of all, you may want to set up some monitoring against your various WAN connections on the HA cluster. Restarting the slave unit should not have any effect on these connections in theory, as your master unit is the one handling all the work. The slave is merely there to take over should things go pear-shaped on the master unit. When the slave restarts you can watch your ping statistics or other connections just to ensure everything stays up whilst it reboots.
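
A continuous ping towards something out on each WAN connection is usually enough for this (the address below is just an example):

  # Linux/macOS: ping runs continuously by default, Ctrl+C to stop (on Windows add -t)
  ping 8.8.8.8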

1. Start by logging in to the web interface of your firewall cluster: https://ipaddress

2. Specify a custom port number if you have the management GUI on a custom port, for example https://ipaddress:555

3. Log in and look for “HA status” under the status area – this should be the default page that loads. It should show as “Active-passive” if this is the mode your HA cluster is in. Click the [Configure] link next to this.

4. This will give you an overview of your HA cluster – you can view which unit is the Master and which is the slave. This step is optional and just gives you a nice overview of how things are looking at the moment. Click “View HA statistics” near the top right if you would like to view each unit’s CPU/Memory usage and other statistics.

5. Return to the “Status” home page of your firewall GUI. Click in the “CLI Console” black window area to get to your console. (Optionally, you could also just SSH in if you have this enabled).

6. Type the following command to bring up your HA cluster details: get system ha status

7. This will show which firewall is the master and which is the slave in the cluster, e.g.

Master:129 FG60-1 FWF60Bxxxxxxxx65 1
Slave :125 FG60-2 FWF60Bxxxxxxxx06 0

Look for the number right at the end of each line and note down the slave’s. In the above example the slave unit has the number “0”.

8. Next enter the following command: execute ha manage x

Where “x” is the number noted down in step 7.

This will change your management console to that particular firewall unit, i.e. the slave unit in our case. You should notice your command line prompt change to reflect the name of the newly selected HA member.

9. Enter the following command to reboot the slave: execute reboot

10. Press “Y” to confirm and reboot the slave.
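
For reference, the whole CLI sequence looks something like this (the slave’s index was 0 in this example and will likely differ on your cluster, and the exact syntax can vary slightly between FortiOS versions):

  get system ha status
  execute ha manage 0
  execute reboot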

Monitor your ping / connection statistics to ensure everything looks fine. Give it a minute or so to boot up again, then return to your HA statistics page to ensure everything looks good.

That is all there is to it.

Installing VMware ESX using a Dell DRAC card

Here is a how-to on installing VMware ESX 3.5 using a DRAC (Dell Remote Access Controller) card to access the server. I was installing a new cluster in a Dell M1000e Blade Centre for work the other day and wrote up this process so that it is documented for anyone else doing it in the future.

Just for interest’s sake, the basic specs of the system are:

1 x Dell M1000e Blade Centre
3 x Redundant 2000w+ Power supply units
16 x Dell M600 Blades (Each one has 2 x Quad core Xeon CPUs and 32GB RAM).

1. Connect to the M1000e’s chassis DRAC card.
a. Browse to the M1000e chassis DRAC card (https://x.x.x.x) – use the IP for that particular blade centre’s DRAC card.
b. Log in with the agreed DRAC user credentials, or if this is a new setup, the defaults (username: root, password: calvin).


2. Select boot order for Blade and power it up
a. Choose the Blade server number that you will be working with from the Servers list on the left side.
b. Click on the Setup tab, and choose Virtual CD/DVD as the first boot device then click Apply.
c. Select the Power Management tab and choose Power on, then click Apply.


3. Go to the iDRAC console of the blade server
a. Click on Launch iDRAC GUI to access the iDRAC for the blade you have just powered on.
b. You will need to log in again as this is another DRAC we are connecting to (this time the DRAC is for the actual blade server, not the chassis).


4. Configure Mouse
a. Click on the Console tab near the top of the screen and then click the Configuration button near the top.
b. In the mouse mode drop down, select Linux as the mouse type, then click Apply.


5. Launch Console viewer
a. From the console tab we can now select the Launch Viewer button.
b. An ActiveX popup might appear – allow it access and the DRAC console should now appear with the server in its boot process.

6. Mount Virtual ISO media for installation disc (ESX 3.5)
a. Click on Media, and then select Virtual Media Wizard.
b. Select ISO image and then browse to the ISO for ESX 3.5 – this could be on your local drive or a network share.
c. Click the connect CD/DVD button to mount the ISO.
d. Your boot order should be configured correctly to boot off this ISO now. (*Optional* You could always press F11 whilst the server is booting to choose the boot device anyway).


7. Reboot the server if the boot from virtual CD/DVD has already passed
a. Go to Keyboard – Macros – Alt-Ctrl-Del to do this.

8. ESX install should now start.
a. Press enter to go into graphical install mode
b. Select Test to test the media (The ISO should generally be fine).
c. Select OK to start the install.
d. Choose United Kingdom Keyboard layout (or whatever Keyboard layout you use).
e. Leave the mouse on generic 3 button USB.
f. Accept the license terms.



9. Partitioning
a. For partition options, leave on “Recommended”. It should now show the Dell virtual disk of 69GB (in this case) or the Dell RAID virtual disk / disk configuration.
b. Say “Yes” to removing all existing partitions on the disk. (That is if you don’t mind formatting and completely clearing out any existing data that may be on this disk).
c. Alter the partitions to match best practice sizes (see http://vmetc.com/2008/02/12/best-practices-for-esx-host-partitions/).
d. Note: It doesn’t matter if some of these sizes are 2-3MB out. The installer adjusts the sizes slightly. The swap partition should be 1600MB minimum though.
e. The next page is Advanced Options – leave as is (Boot from SCSI drive).


10. Network Configuration
a. Set up the network configuration:
b. IP address (x.x.x.x) – whatever IP you are assigning this particular ESX Host.
c. Subnet mask: 255.255.255.0 for example.
d. Gateway:  Your gateway IP address (x.x.x.x)
e. Primary DNS:  (x.x.x.x)
f. Secondary DNS: (x.x.x.x)
g. Hostname: replace the default (localhost.localdomain) with your host’s name, for example ESXhost01.shogan
h. VLAN ID – Leave this blank if you are not using VLANs. If you are, then specify the VLAN here.
i. Create a default network for virtual machines – Unless you have a specific network configuration in mind leave this ticked on.

11. Time zone
a. Set Location to your location.
b. Leave “System clock uses UTC” ticked.

12. Root password
a. Set the root password (this is your admin password).

13. Finish installation
a. Next page is “About to Install”
b. Check the information is all correct and click Next if all looks fine.

14. Change boot order back and restart the blade server.
a. Via the iDRAC page, change the boot order back to Hard disk for the blade so that it will reboot using the server’s RAID hard disks instead of the ISO.
b. Reboot the host by pressing the Finish button back in the console.
c. Disconnect the Virtual CD from the Media option in the console menu.
d. Watch the console while the server reboots to ensure no errors are reported on startup.

If all went well, you should now have an ESX Host booted to the console. Press Alt-F1 to access the command line (you will need to log in as root or any other user you set up).
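
Once at the service console, a few commands are handy for a quick sanity check. These are the classic ESX 3.x service console tools, so treat this as a rough guide if your build differs:

  # Confirm the ESX version that was installed
  vmware -v
  # List the physical NICs and their link state
  esxcfg-nics -l
  # Show the service console network interface and its IP
  esxcfg-vswif -l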

You can now access your server via the web browser (https://x.x.x.x). From here you can download the Virtual Infrastructure client to manage the ESX Host with.

This host could now be further configured and added to an ESX cluster, for example. SANs could be assigned and vMotion set up so that HA (High Availability) and DRS (Distributed Resource Scheduling) can be put to good use!