Update: Microsoft recognised cluster creation time as an issue and improved times from over 18 minutes to 10 minutes. Still nowhere near as good as GKE but it’s a start.
Following on from the brutal Google GKE vs Microsoft AKS vs Amazon EKS blog I’ve created a new cloud testing tool called Dolos. This blog analyses the results and provides verifiable proof that Microsoft AKS is considerably worse than Google GKE. Here’s a sneak peek of the spreadsheeting I’ve had to do.
I left Dolos running for around 9 hours. This created, tested and destroyed 18 Azure AKS clusters. Considerably more GKE clusters were created in this timeframe but I’ve limited the analysis to the first 18 clusters for both.
This is the workflow used for both AKS and GKE. First we create the cluster, then we deploy an application. Finally we test it and then destroy the cluster. This continues in a loop forever, logging all of the results to a file. There are some minor differences in the components that get deployed, as you would expect from them being different clouds.
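To make the loop concrete, here is a minimal Python sketch of the same create, deploy, test, destroy cycle. It is not the Dolos source. The phase names, the `run_iteration` and `run_loop` helpers, and the JSON-lines log format are all assumptions made for illustration, and the actual cloud calls are passed in as plain functions.

```python
import json
import time
from typing import Callable, Dict, Optional

# Hypothetical phase names; the real tool may split its phases differently.
PHASES = ("create_cluster", "deploy_app", "test_app", "destroy_cluster")


def run_iteration(steps: Dict[str, Callable[[], None]],
                  clock: Callable[[], float] = time.monotonic) -> dict:
    """Run each phase once and record how long it took, in seconds."""
    record = {"ok": True, "durations": {}}
    for name in PHASES:
        # After a failure, skip the remaining work but still try to clean up.
        if not record["ok"] and name != "destroy_cluster":
            continue
        started = clock()
        try:
            steps[name]()
        except Exception as exc:
            record["ok"] = False
            record.setdefault("error", f"{name}: {exc}")  # keep the first error
        record["durations"][name] = clock() - started
    return record


def run_loop(steps: Dict[str, Callable[[], None]],
             log_path: str,
             iterations: Optional[int] = None) -> None:
    """Repeat the cycle, appending one JSON line per iteration to log_path.

    iterations=None loops forever, as the workflow above describes.
    """
    count = 0
    while iterations is None or count < iterations:
        record = run_iteration(steps)
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
        count += 1
```

Passing the steps in as callables means the same loop can drive either cloud. Only the four functions differ between them.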
All of the results are available to browse in this spreadsheet. There are 3 worksheets there so you’ll need to click between them to see all of the data.
In the spirit of 100% transparency you can also check my raw log files.
We’ll dive into the high level results and then drill down into each cloud and call out some observations. As predicted in the first comparison blog Azure AKS has considerably slower cluster create times than Google GKE.
The time it takes to create a new cluster, deploy an application and test that it’s up and working on GKE is on average 3 minutes 50 seconds.
For an identical test on AKS the average is 17 minutes 52 seconds. You’ll notice a gap in the red AKS series which corresponds to a failure.
The peak create time on AKS was 82 minutes spent waiting for a cluster to come up. The fastest an AKS cluster ever started was 12 minutes and 8 seconds.
Comparatively, the fastest GKE cluster came up in 3 minutes and 5 seconds and the longest creation was 5 minutes and 39 seconds.
Next we’ll look into destroy times.
Again, GKE is much quicker. You can generally expect a GKE cluster delete to happen in under 3 minutes. The average for AKS is closer to 13 minutes.
If you’re building ephemeral test environments for each pull request branch, and want to trigger a rebuild of the same hash, the delete times will start to affect your pipeline.
Average round trip time for create and destroy on Azure is approximately 30 minutes. On Google it’s 7 minutes.
The other thing to notice about these two clouds is the standard deviation. Google is much more predictable whereas Azure is a big wobbly mess.
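For anyone who wants to reproduce the averages and the spread from the raw logs, the calculation is roughly the following. This is a sketch, not the spreadsheet's formulas. The `summarise` and `fmt` helpers are hypothetical names, durations are assumed to be in seconds, and a failed run is assumed to be recorded as `None` so it is counted but left out of the statistics.

```python
import statistics
from typing import Optional, Sequence


def summarise(durations_s: Sequence[Optional[float]]) -> dict:
    """Mean, sample standard deviation, min and max of the successful runs.

    Raises statistics.StatisticsError if every run failed.
    """
    ok = [d for d in durations_s if d is not None]
    return {
        "runs": len(durations_s),
        "failures": len(durations_s) - len(ok),
        "mean_s": statistics.mean(ok),
        # stdev needs at least two data points
        "stdev_s": statistics.stdev(ok) if len(ok) > 1 else 0.0,
        "min_s": min(ok),
        "max_s": max(ok),
    }


def fmt(seconds: float) -> str:
    """Render a duration in seconds as minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"
```

As a usage example, `fmt(230)` gives `3m 50s` and `fmt(1072)` gives `17m 52s`, the two average times quoted earlier. A larger `stdev_s` for one cloud is what shows up as a wobblier line on the graph.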
What makes Azure so slow? I’ve broken out the create times for each individual component so we can examine where the problems lie.
As you can see from the stacked graph the Azure creation time is reasonably static. This always takes between 10 and 12 minutes, unless it fails. There is a reasonable chunk of time added on top waiting for the cluster to become ready.
There is one example of the cluster being created, the IP being provisioned, and then a massive wait for the application to actually become available. Azure really need to focus on fixing their network and making it faster, more consistent and more reliable.
In the 18 iterations of cluster creation AKS failed entirely once. It also took 82 minutes to come up in another case. This isn’t a freak occurrence. This is well documented online in various GitHub issues and the Azure feedback system. I’ve also encountered this a few times before on other Dolos test runs. If you’re running on Azure and doing large scale automation that involves many cluster creations then you’re going to hit this frequently.
As everyone who has used GKE will know this was a pleasant and rather boring experience. It simply works, every time, without failure and it’s quick.
The big surprise here for me was the networking. Cluster create times are actually very quick and it’s not unusual to see the cluster up and ready inside of 2 minutes.
Waiting for the external IP to come up then takes about a minute. Once that’s up the test passes instantly. The log files show mostly 0 seconds from IP address available to tests passing. I was stunned by how quick this is. There are a couple of times this wasn’t instant as shown by the yellow peaks, but even when it’s not instant it’s usually not more than a couple of minutes extra.
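The two waits described here (IP address available, then tests passing) can be timed with a simple polling helper. The sketch below is illustrative only and is not how Dolos is implemented. `wait_until` is a hypothetical name, and the check function you pass in (for example "does the service have an external IP yet?" or "does the application answer?") is left to the caller.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float,
               interval_s: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll check() until it returns True and return the seconds waited.

    Raises TimeoutError if check() is still False once timeout_s has elapsed.
    """
    started = clock()
    while True:
        if check():
            return clock() - started
        if clock() - started >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        sleep(interval_s)
```

Calling it once for the IP and once for the application test gives the two separate numbers. When the application is already reachable by the time the IP appears, the second call returns almost immediately, which matches the 0 seconds seen in the GKE logs.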
Many people argued with me about the Azure AKS creation times I put down on the original table in the first comparison blog. They claimed much faster creation times than the generous 15 minutes that I had put down.
As a result of this blog I’ll now change that table to 20 minutes and refer anyone who argues to the Dolos testing tool.
I think we can now factually argue the case that Google GKE is not only more feature rich than Azure AKS, it’s also more stable and the network is on a whole other level.
As for what’s next with Dolos, I think I’ll add Amazon EKS next and then run the tests across all 3 clouds for a much longer period of time. Perhaps a week or longer so that we can get a better idea of how reliable each cloud is for running automated environments on.
As a side effect of this blog I’m now curious about Azure AKS networking on existing clusters. I was surprised to see how long it took to route traffic once a service comes up. My suspicion is there are some problems on AKS that an automated testing tool would uncover.
Perhaps I should extend Dolos to create and destroy additional services on the cluster once it’s up to see how quickly the network responds and if any traffic is dropped during the equivalent of a blue / green deploy.