First Steps: A Single Windows Node

Spin up a Windows EC2 Instance

Before we can get an autoscaling group going we need to get a single node working first as a proof-of-concept. We’ll use this guide from Microsoft on joining Windows nodes to a Kubernetes cluster as a baseline.

Note that with that guide we have our first caveat: we can only join Windows nodes to a cluster that’s utilizing flannel as the container networking plugin.

So the very first step, if you don’t have one already, is to create a basic sandbox Kubernetes cluster in AWS using flannel as the networking solution. Other properties of the cluster don’t matter too much for this exploratory purpose. For reference we had a cluster of one master, and one worker node, both of which utilized CoreOS AMIs.

Once you have your sandbox cluster you need to spin up a Windows EC2 instance in the same VPC with the same configuration as the single worker node. Initially we just did this via the AWS web interface, but we ended up creating a small Terraform configuration, as we had to repeat this process many, many times. For reference, the key configuration values to set were:

- AMI: we used Windows_Server-1809-English-Core-ContainersLatest-2019.02.13 (ami-0fe3fb8879ef6147c in eu-west-3, where our cluster was). We chose this image because its Windows version matches the one used in Microsoft’s guide.
- VPC and subnet: just ensure that these match the configuration of the other worker node.
- Auto-assign public IP address: set to Enable, unless you want to attach the VPC to your VPN.
- IAM role: since we’re trying to create a worker node, set it to the same role as the worker nodes of the cluster, nodes.${cluster_name}.
- Security group: same reasoning as above; use the security group of the cluster’s nodes, nodes.${cluster_name}.
- Key pair: choose one that you already have stored locally on your machine, or create a new one and download it.
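Our Terraform configuration isn’t reproduced here, but a minimal sketch of the instance resource, with placeholder variables standing in for the cluster-specific values, would look something like:

```hcl
# Illustrative sketch only -- the variable names, instance type, and IDs
# are placeholders for your own cluster's values.
resource "aws_instance" "windows_worker" {
  ami                         = "ami-0fe3fb8879ef6147c" # Windows Server 1809 ContainersLatest, eu-west-3
  instance_type               = "t3.large"              # assumption; pick what fits your workloads
  subnet_id                   = var.worker_subnet_id    # same subnet as the existing worker node
  associate_public_ip_address = true                    # unless the VPC is attached to your VPN
  iam_instance_profile        = "nodes.${var.cluster_name}"
  vpc_security_group_ids      = [var.nodes_security_group_id]
  key_name                    = var.key_pair_name
}
```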

Now you can launch your Windows EC2 instance! Unfortunately you’ll have to interact with it via RDP instead of any kind of remote shell, and to do that you’ll have to wait a few minutes before you can decrypt the Administrator password from AWS using the key pair mentioned previously. Once ready, you’ll be able to log in using your preferred RDP client and start up a PowerShell session.

Ah, the joys of Windows. 🙄
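Incidentally, if you’d rather not click through the console, the Administrator password retrieval above can be scripted with the AWS CLI from your own machine. A sketch, where the instance ID and key path are placeholders for your own values:

```powershell
# Fetch and decrypt the Administrator password for the new instance.
# Instance ID and key path are placeholders -- substitute your own.
aws ec2 get-password-data `
    --instance-id i-0123456789abcdef0 `
    --priv-launch-key ~/.ssh/windows-node.pem `
    --region eu-west-3
```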

Prepare Your System

The first step after getting into your instance is to prepare some basic directories and set a few environment variables. The Microsoft guide covers this so we’ll just refer you to that.

The only additional directory we create in this guide is C:/k/downloads where we store the downloads of the various Kubernetes resources before extracting and installing them onto the system.
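For reference, the preparation boils down to something like the following (paths per the Microsoft guide; treat this as a rough sketch rather than a verbatim copy of it):

```powershell
# Base directory the later scripts assume, plus our staging area for downloads.
mkdir C:\k
mkdir C:\k\downloads

# Put the Kubernetes binaries on the PATH and point tooling at the kubeconfig.
$env:Path += ";C:\k"
$env:KUBECONFIG = "C:\k\config"
[Environment]::SetEnvironmentVariable("Path", $env:Path, [EnvironmentVariableTarget]::Machine)
[Environment]::SetEnvironmentVariable("KUBECONFIG", $env:KUBECONFIG, [EnvironmentVariableTarget]::Machine)
```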

Install Kubernetes Resources

Now that you can interface with the machine, it’s time to start following Microsoft’s instructions, starting with the installation of the windows/nanoserver images used for the Kubernetes infrastructure. Just remember to pull the images with the same version as your host Windows OS!

Fortunately you can skip the setup of Docker as the AMI chosen already has Docker installed and ready to go.

docker pull mcr.microsoft.com/windows/nanoserver:1809

docker tag mcr.microsoft.com/windows/nanoserver:1809 microsoft/nanoserver:latest

Once this is done, the next step is to set up the directory that will contain all of the configuration and services required for Kubernetes. Microsoft chose C:/k, and as it turns out, it’s important that you stick with this poor naming decision. Yes, we tried to do better: we changed it to C:/kubernetes, but many of the scripts we use later are completely reliant on C:/k being the base directory for all of the resources.

At this point, the next step for you is to obtain a copy of a kube config file that will allow your instance to talk to the cluster. If you’re following along with the Microsoft guide you’ll notice that the official solution is to copy the kubeconfig file from a master node into our new instance.

You’re a master Harry!

This presents an issue, as you’ll need this to happen automatically: a freshly booted node has no way to connect to a master and copy anything from it, since it doesn’t know where one is without first contacting the cluster.

The workaround we utilized is to download kops onto the node and use it to export a kubeconfig file. All your node has to know is the location of the kops state store and the name of the cluster. Fortunately both are known ahead of time, and you can just use wget to download files in PowerShell!
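A sketch of what that download looked like for us (the release URL is illustrative; check the kops releases page for the version and build matching your cluster):

```powershell
# wget is an alias for Invoke-WebRequest in Windows PowerShell.
# The URL below is illustrative -- pin the kops version your cluster uses.
wget "https://github.com/kubernetes/kops/releases/download/1.11.1/kops-windows-amd64" `
    -OutFile C:\k\downloads\kops.exe

# Move it onto the path set up earlier.
mv C:\k\downloads\kops.exe C:\k\kops.exe
```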

Once you have the kops executable on your path, it’s simply a matter of running kops export kubecfg --state=s3://${kops_state_store_location} ${cluster_name}! Except for the fact that your EC2 instance role doesn’t grant enough permissions to read everything the export command needs. A quick-but-dirty fix of changing the EC2 instance role from nodes.${cluster_name} to masters.${cluster_name} will allow you to export a functioning kubeconfig file.

The next step is to download the Kubernetes node binaries, which include kubectl, a tool you can use to verify your kubeconfig file. While you’re on the shell downloading resources, you can also go ahead and kick off the download of the flannel resources as well.

Fun fact: If you wrap a download call into a background job it downloads the file significantly faster in Powershell.

I don’t know why the fact above is true, but it cut the time the instance spent downloading by what felt like 80%. I have some hypotheses, but I’m not going to get into them here. Just know that for these “download and install/setup” jobs, you can wrap them with Start-Job -ScriptBlock { code here } and then wait for them all to complete with Get-Job | Wait-Job. This’ll help you significantly reduce your instance spin-up time.
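For example, the node-binary and flannel downloads from the next steps can be kicked off in parallel along these lines (the URLs are assumptions pinned to the era of this guide; substitute your own versions):

```powershell
# Start each download as a background job; in our experience this was
# dramatically faster than downloading in the foreground.
Start-Job -ScriptBlock {
    wget "https://dl.k8s.io/v1.13.3/kubernetes-node-windows-amd64.tar.gz" `
        -OutFile C:\k\downloads\knode.tar.gz
}
Start-Job -ScriptBlock {
    wget "https://github.com/microsoft/SDN/archive/master.zip" `
        -OutFile C:\k\downloads\flannel.zip
}

# Block until every job has finished before extracting anything.
Get-Job | Wait-Job
```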

Once the Kubernetes binaries have finished downloading, you just need to move them into your path.

tar -xzvf c:/k/downloads/knode.tar.gz -C c:/k/downloads

mv c:/k/downloads/kubernetes/node/bin/*.exe c:/k/
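With kubectl now on the path, a quick sanity check that the kubeconfig exported earlier actually works (the config path here is an assumption from the earlier environment setup):

```powershell
# Should list the cluster's existing master and worker nodes.
kubectl --kubeconfig C:\k\config get nodes
```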

Finally, you’ll just need to install flannel.

Expand-Archive c:/k/downloads/flannel.zip -DestinationPath c:/k/downloads/flannel

mv c:/k/downloads/flannel/SDN-master/Kubernetes/flannel/l2bridge/* c:/k/

Configure Kubernetes Resources

Now that you have everything you need, you’ll just need to gather all the necessary information for the start script provided by flannel. You’ll need four pieces of network information: your instance’s private IP address, the Kubernetes cluster CIDR range, the Kubernetes service CIDR range, and the Kubernetes DNS service address. Our solution is fairly simple (read: dumb, but it works) and there’s likely a better way of gathering the info, but here it is.

# Gather the node's IP address.
$env:HostIP = (
    Get-NetIPConfiguration |
        Where-Object {
            $_.IPv4DefaultGateway -ne $null -and
            $_.NetAdapter.Status -ne "Disconnected"
        }
).IPv4Address.IPAddress

# Gather Kubernetes cluster and service CIDRs.
$env:KubeClusterCIDR = (
    kubectl cluster-info dump |
        Select-String -Pattern ("--cluster-cidr=\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/\d{1,2}") -AllMatches |
        % { $_.Matches } | % { $_.Value } |
        Select-String -Pattern ("\d.*") -AllMatches |
        % { $_.Matches } | % { $_.Value } |
        Select-Object -Last 1
)

$env:KubeServiceCIDR = (
    kubectl cluster-info dump |
        Select-String -Pattern ("--service-cluster-ip-range=\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/\d{1,2}") -AllMatches |
        % { $_.Matches } | % { $_.Value } |
        Select-String -Pattern ("\d.*") -AllMatches |
        % { $_.Matches } | % { $_.Value } |
        Select-Object -Last 1
)

# Gather the Kubernetes cluster DNS service cluster IP address.
$env:KubeDNSServiceIP = (
    kubectl describe svc -n kube-system kube-dns |
        Select-String -Pattern ("IP:.*\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}") -AllMatches |
        % { $_.Matches } | % { $_.Value } |
        Select-String -Pattern ("\d.*") -AllMatches |
        % { $_.Matches } | % { $_.Value } |
        Select-Object -Last 1
)

With these values in hand, you only need to change one more thing before attempting to join the cluster. There’s a net-conf.json file into which you’ll need to place your cluster’s CIDR range, replacing the default of 10.244.0.0/16. We just used a simple call to replace.

cp c:/k/net-conf.json c:/k/net-conf-template.json

(Get-Content c:/k/net-conf-template.json).replace("10.244.0.0/16", $env:KubeClusterCIDR) | Set-Content c:/k/net-conf.json

…and with that all of the configuration is squared away! It’s time to attempt to join your cluster!

If only I had been this happy during this project.

Joining the Cluster

Now that you’re ready to join your cluster, you just need to run the start.ps1 script included in the flannel download with the network configuration as parameters, simple enough right?

Nope.

In order to keep this story from becoming a goddamn Odyssey, we’ll cut right to the chase: the start.ps1 script has some flaws in it. Instead of using the provided script, we built our own version into the userdata we were putting together. Perhaps in the future Microsoft will publish a version that fixes the issues we encountered, but for now we’ll list them here.

- The Start-BitsTransfer call located here kept failing when run as part of a userdata script. We’re unsure as to why, but simply using wget as a replacement resolved that issue.
- The StartFlanneld call located here will oftentimes fail and hang at “Waiting for Network to be created”. This is a known issue acknowledged by Microsoft here, to which their solution is “lol, just try again”. Our solution was to wrap it in a retry loop until we see the cbr0 network, to move the kubelet registration line to just before we try to start flannel, and to kill the StartFlanneld job after about fifteen seconds if it hasn’t completed.

...
$hasCbr0Network = $false
while (-not $hasCbr0Network) {
    powershell $BaseDir/start-kubelet.ps1 -RegisterOnly
    Start-Sleep (Get-Random -Minimum 0 -Maximum 5)
    $job = Start-Job -ScriptBlock {
        ipmo c:/k/helper.psm1
        StartFlanneld -ipaddress $env:HostIP -NetworkName $NetworkName
    }
    $job | Wait-Job -Timeout 15
    $job | Where-Object { $_.State -ne "Completed" } | Stop-Job
    Start-Sleep 1
    $hasCbr0Network = (Get-HnsNetwork | ? Name -eq "cbr0")
}
...

Once these changes were in place we were able to consistently bring up a new node into the cluster! It’s important to note though that the node may toggle in-and-out of the cluster as it can take a few tries before flannel is satisfied. If you’re wondering about the random wait time, it can likely be removed. It was primarily an experiment as we were getting inconsistent results for any constant wait time.

Huzzah! You should now have a single Windows node attached to your Kubernetes cluster! Now it’s time to wrap it up into a more manageable solution.