Is Azure Kubernetes (AKS) any less terrible?

You may remember reading the feature comparison I did a few months ago between EKS, GKE and AKS. Each of the clouds has improved since then, and AKS in particular has claimed cluster provisioning times of 5 to 6 minutes.

A big improvement if true.

Let’s see if this stands up to scrutiny by running my automated test suite.

A quick recap

Cluster create times and reliability of AKS were measured and proven to be significantly worse than GKE.

| Kubernetes Service | Create Time (max) | Create Time (min) | Create Time (avg) | Destroy Time (avg) |
|---|---|---|---|---|
| GKE | 5 mins 39s | 3 mins 5s | 3 mins 50s | 3 mins |
| AKS | 82 mins | 12 mins 8s | 17 mins 52s | 13 mins |

AKS averaged 17 minutes 52 seconds to create a cluster, and a round trip of create plus destroy took around 30 minutes.

Intermittent failures also plagued AKS.

As somebody who values clean room testing I love the fact that you can spin up an entire GKE cluster in 3 minutes reliably for CI pipelines.

Why is this cool? Because spinning up a Kubernetes cluster is just the beginning. You will perform lots of configuration on top of a running cluster for things like service meshes and Ingress configurations etc. This isn’t something you want to iterate into a drifted state.

Minikube is great for unit tests but you really need a full cluster for integration tests. If you can do integration tests at the pull request stage engineers will fix defects before anyone else sees them.

It wasn’t just the cluster create times that were the problem with AKS. A large amount of time was spent after the cluster was created getting the network interface up and working over the internet.

If I had to guess at what is going on behind the scenes in Azure, I would say 3 teams are contributing to the problem.

  1. Time to boot the compute (IaaS team running Hyper-V?)
  2. Time to configure the OS (AKS team configuring Kubernetes?)
  3. Time to make the network available (Network team(s)?)

AKS is a bit of a black box so you can’t differentiate between 1 and 2. However, the stats showed that Azure networking often accounted for a significant proportion of the time.

If you ignore iteration 15 in that graph, which failed completely, you can see that iteration 6 was the outlier. This was caused by a networking issue. The cluster was up, the lights were on, but nobody was home when trying to connect.

In summary, the clusters were slow to create and sometimes the networking part made things unacceptably worse.

New results

It was Christmas. My Azure trial had expired. I was spending my own money verifying a vendor claim.

For that reason I only left the test running for 10 iterations while I went to the pub.

This was roughly half the data of the previous test, which ran for 19 iterations. Even so, 10 cluster builds is enough to get a good idea of whether things have improved or not.

It’s probably not enough to determine if the failure rates have improved.

Having said that, the new results are a lot better!

| Kubernetes Service | Create Time (max) | Create Time (min) | Create Time (avg) | Destroy Time (avg) |
|---|---|---|---|---|
| GKE | 5 mins 39s | 3 mins 5s | 3 mins 50s | 3 mins |
| AKS | 14 mins 53s | 7 mins 34s | 9 mins 54s | 9 mins 39s |

I didn’t retest GKE but I’ve left it in as a comparison. It’s possible Google has improved.

The main takeaway here is the average create time is below 10 minutes!

AKS still suffers from bumpy metrics where the min, max and average are quite far apart. This doesn’t instill a lot of confidence in their systems behind the scenes.

My definition of a cluster being up is whether I can curl some web content from an application over the internet.
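A minimal sketch of that kind of check, assuming Python's standard library. The function names, the 5 second polling interval and the one hour deadline are illustrative choices, not the actual test suite's code.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, refused connections, timeouts and
        # HTTP error statuses (urllib raises those as URLError/HTTPError).
        return False


def seconds_until_reachable(url, deadline=3600.0, interval=5.0,
                            probe=http_ok, clock=time.monotonic,
                            sleep=time.sleep):
    """Poll url until probe(url) succeeds and return the elapsed seconds.

    Returns None if the deadline passes without a successful probe.
    probe, clock and sleep are injectable so the loop can be tested
    without a real cluster.
    """
    start = clock()
    while clock() - start < deadline:
        if probe(url):
            return clock() - start
        sleep(interval)
    return None
```

Start the timer when the create request is issued and call `seconds_until_reachable` with the application's public URL. The result is then the end to end time until the cluster is usable, rather than the time until the API reports success.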

Microsoft may have used a spurious definition instead: the cluster creation API reporting that the AKS resource has been created.

The average cluster creation time as returned by the API was 6 minutes and 39 seconds.

The vertical axis on the chart is seconds and the horizontal axis is iterations.

Out of 10 iterations only 6 clusters were created in under 6 minutes as judged by the spurious definition. When you look at more correct metrics to determine cluster state, timings are closer to 10 minutes.
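The gap between the two definitions can be worked out from the averages in this post. A quick sketch in Python, using only the 6 minutes 39 seconds and 9 minutes 54 seconds figures reported here; the variable names are illustrative.

```python
# Averages reported in this post, as (minutes, seconds).
API_REPORTED_AVG = (6, 39)   # creation API says the AKS resource exists
USABLE_AVG = (9, 54)         # web content can be fetched over the internet


def to_seconds(duration):
    """Convert a (minutes, seconds) pair to whole seconds."""
    minutes, seconds = duration
    return minutes * 60 + seconds


api_s = to_seconds(API_REPORTED_AVG)
usable_s = to_seconds(USABLE_AVG)
gap_s = usable_s - api_s

gap_min, gap_sec = divmod(gap_s, 60)
print(f"Extra wait after the API reports success: {gap_min} min {gap_sec} s")
print(f"Share of the end-to-end time: {gap_s / usable_s:.0%}")
```

On these averages the extra wait is 3 minutes 15 seconds, so roughly a third of the total happens after the API has already reported the cluster as created.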

The green and yellow bars show that the network performance in Azure is still a big issue.

Summary

AKS has improved a lot in a quarter with respect to cluster creation times.

Average cluster create times were almost 18 minutes a few months ago and now they are just under 10 minutes.
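To put a number on the improvement, here is a small Python sketch that parses the durations as they are written in the results tables and compares the averages. The parser and its name are illustrative, and the GKE figure is the one from the earlier test, since GKE was not retested.

```python
import re

# Matches strings such as "17 mins 52s", "82 mins" or "39s".
_DURATION = re.compile(r"^\s*(?:(\d+)\s*mins?)?\s*(?:(\d+)\s*s)?\s*$")


def parse_duration(text):
    """Convert a table entry like '9 mins 54s' into whole seconds."""
    match = _DURATION.match(text)
    if not match or not (match.group(1) or match.group(2)):
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2) or 0)
    return minutes * 60 + seconds


old_aks = parse_duration("17 mins 52s")   # AKS average, earlier test
new_aks = parse_duration("9 mins 54s")    # AKS average, this test
gke = parse_duration("3 mins 50s")        # GKE average, earlier test only

reduction = (old_aks - new_aks) / old_aks
print(f"AKS average create time fell by {reduction:.0%}")
print(f"AKS vs GKE: {old_aks / gke:.1f}x before, {new_aks / gke:.1f}x now")
```

That works out to a 45% reduction. The AKS average went from about 4.7 times the GKE average to about 2.6 times.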

The press release is a lie any way you calculate the create times. I wish vendors were held accountable for these types of metrics. Cluster create times are not on average between 5 and 6 minutes in all regions. They are closer to 10 minutes from start to finish if you want a usable cluster.

I tested a single instance size in a single region (the defaults) and the claim is obviously untrue. There are GitHub issues stating that different instance sizes also have a negative effect. I’ve not experimented with this but I expect the worst.

Azure is still 3rd place in my estimation compared to Google and Amazon from a user experience perspective.

Credit needs to be given to Microsoft for eventually owning up to the issues after being publicly called out.

People have undoubtedly worked quite hard to fix some of these issues. We are all slaves to Jira tickets and I congratulate those who improved these results in the face of corporate adversity.

Microsoft itself isn’t evil but will bend to whatever strategy results in the most financial gain whenever it suits. Previously it was “Linux is Cancer” and now it’s “embrace open source” to sell compute on Azure.

Blogs like mine that question the technical aspects of Azure are probably viewed as an annoyance and combatted by hiring 3 more DevOps evangelists to talk at conferences about how things have ‘changed’.

I don’t know if Microsoft can ever be technically excellent in the cloud. Their pedigree and DNA make me doubt it.

Microsoft doesn’t have the engineering culture that Google and Amazon have and that’s reflected in the trampoline metrics. I’m not even sure that the performance gains made here can be persisted for a year across the multiple teams involved.

What Microsoft does have is an excellent enterprise sales team.

Second to market with an inferior product and a massive sales force has always been the Microsoft strategy. They are the largest company in the world, which proves good enough usually wins. I wouldn’t bet on that long term as implementers are becoming the decision makers for technology.

My honest hope is that AKS continues to improve. Maybe one day it will improve to a point where I would tolerate it. Or at least to a level where I would only increase my daily rate by double to compensate.
