Project Dolos: Testing Kubernetes on Google and Azure

This is part 1 of a two-part blog series where we look at Kubernetes cluster creation times on Azure AKS and Google GKE. In this post we’ll go over Dolos, the new testing tool I created.

Edit: the 2nd part is now here.

I encourage others to give it a try to verify my results. In the 2nd blog I’ll reveal exactly how bad Azure is with lots of proof to back it up.

You may remember that I previously compared Google GKE vs Microsoft AKS vs Amazon EKS. There were a couple of line items in the comparison table that people disagreed with. Many messages flooded my inbox telling me that Azure AKS is now quick: that somehow, in the past 2 weeks, it has sped up and I should try it again.

Call me pessimistic, but I knew this was bollocks.

Yesterday I started a new project on GitHub called Dolos, and today the first version is complete. I’m currently running it nonstop to collect data for the next blog. The first few hours of results have already confirmed my suspicions, but we may as well collect a few days’ worth of data so it’s absolutely undeniable.

What does Dolos do? Dolos creates a new Kubernetes cluster, deploys a sample application to it, then waits until the application is up and running. Once this is tested successfully Dolos will delete the cluster and do it all again.

It does this in a loop, forever, and outputs timings at every stage to a log file. It also records errors so we can evaluate how stable each cloud is.
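The create, deploy, wait, delete loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not code taken from the Dolos repository: the helpers `run_cycle`, `wait_until` and `run_forever` and the stage names are hypothetical, and the real cloud work (creating a GKE or AKS cluster, deploying the sample app, checking that it responds) is left as plain callables you would plug in.

```python
import logging
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stage type: a label plus a zero-argument action.
# These are placeholders, not Dolos's actual stages.
Stage = Tuple[str, Callable[[], None]]


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 5.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll `check` until it returns True, raising TimeoutError after `timeout_s`."""
    deadline = clock() + timeout_s
    while not check():
        if clock() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s}s")
        sleep(interval_s)


def _timed(name: str, action: Callable[[], None], timings: Dict[str, float],
           errors: List[str], clock: Callable[[], float]) -> bool:
    """Run one stage, record its duration and any error; return True on success."""
    start = clock()
    try:
        action()
        return True
    except Exception as exc:
        errors.append(f"{name}: {exc}")
        return False
    finally:
        timings[name] = clock() - start


def run_cycle(stages: List[Stage], cleanup: Stage,
              clock: Callable[[], float] = time.monotonic) -> Dict[str, object]:
    """Run stages in order, stopping at the first failure, then always run cleanup."""
    timings: Dict[str, float] = {}
    errors: List[str] = []
    for name, action in stages:
        if not _timed(name, action, timings, errors, clock):
            break  # later stages depend on this one, so skip them
    # Assumed metric: sum of the stages that actually ran before cleanup.
    timings["total_create"] = sum(timings[name] for name, _ in stages if name in timings)
    cleanup_name, cleanup_action = cleanup
    # Cleanup (cluster deletion) runs even after a failure so clusters don't leak.
    _timed(cleanup_name, cleanup_action, timings, errors, clock)
    return {"timings": timings, "errors": errors}


def run_forever(stages: List[Stage], cleanup: Stage, log: logging.Logger,
                max_cycles: Optional[int] = None) -> int:
    """Repeat cycles (forever when max_cycles is None), logging timings and errors."""
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        result = run_cycle(stages, cleanup)
        log.info("cycle=%d timings=%s errors=%s", cycle, result["timings"], result["errors"])
        cycle += 1
    return cycle
```

Making cleanup unconditional is a design choice of this sketch: a failed deploy still ends with a delete, so a long-running benchmark doesn’t quietly pile up clusters, and the failure is still counted in the error log.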

Here you can see it running under Docker Compose. When you start Dolos it creates one container per cloud being tested.

Currently this means one container to run the GKE automation and a separate container for AKS. As you may notice from the screenshot, the 3rd line up from the bottom shows “Total create time taken” for GKE as 3 minutes and 5 seconds. I’ve actually seen it take a lot less than that, but I won’t spoil part 2 now.

I’ll add AWS EKS later on once it’s faster. There’s a section in the Dolos project readme that discusses the problems with doing it now.

Why?

Firstly, I was amused by all of the people fixating on a single provable fact from the comparison blog post. I criticised AWS just as much as I did Azure and yet zero people messaged me about EKS.

Currently EKS isn’t very good. I keep telling people to wait and not use it yet, and yet nobody cares about this.

After giving this some thought, I believe most people using AWS are doing it through choice. The main AWS audience are implementors, and they listen to honest advice rather than arguing with it. By contrast, most people messaging me about Azure have some kind of financial motivation. They either work for Microsoft or a partner and have some of their income tied to the Azure cloud.

Secondly, I stand by the fact that performance and reliability are paramount for any cloud service (alongside security).

Let me give you a recent example: creating Dolos yesterday. I started by testing Azure, which was a big mistake. Automating things involves writing code and then running it, over and over. The round-trip time on Azure for creation and deletion was pretty slow, so I’d kick off the script, wait, then come back and fix whatever else broke.

You can see from the GitHub timeline that this took around 10 hours from start to finish. It was a weekend, so I was watching Netflix, eating dinner and doing other stuff in between, and I didn’t mind too much.

When I came to do the same for GKE, the round-trip time was so quick that I got the whole thing done in an hour. If this were an automation task at work, that’s a massive difference. I’m not a big fan of context switching, and having to start lots of jobs in parallel because you’re waiting for something to complete makes work suck.

I also stand by the comment I made in the original comparison blog about working differently when cluster creation is fast.

My team makes changes to the Kubernetes platform that other development teams deploy their applications onto. The platform itself needs to change and therefore we make changes in code, submit pull requests, test the changes and then release them to our customer environments.

You really don’t want to be applying change on top of change to static, long-lived testing environments. This totally invalidates your testing. What you want is to spin up a clean cluster, apply your changes, test them, then deploy them to an integration environment, and then test the customer applications on top.

Hopefully this explains the motivation for this blog and why it’s a real issue. Keep an eye out for part two, which I think many will find very interesting. Sign up to our newsletter to get notified when it’s released. Also, any feedback on Dolos is welcome.

 
