Upgrading a large cluster on GKE

Originally published on the Google Cloud Community blog at https://medium.com/google-cloud/upgrading-a-large-cluster-on-gke-499a7256e7e1

At Olark we’ve been running production workloads on Kubernetes in GKE since early 2017. In the beginning our clusters were small and easily managed. When we upgraded Kubernetes on the nodes, our most common cluster-wide management task, we could just run the process in the GKE console and keep an eye on things for a while. Upgrading involves tearing down and replacing nodes one at a time, and takes about 4–5 minutes per node in the best case. When we were at 20 nodes it might take 90–120 minutes, which was tolerable. It was disruptive, but all our k8s services at the time could deal with that. It was irreversible too, but we mitigated that risk by testing in staging, and by staying current enough that the previous version was still available for a replacement nodepool if needed. This approach seemed to work fine for over a year.
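To make the scaling problem concrete, here is a minimal back-of-the-envelope sketch of the best-case arithmetic. It assumes nodes are replaced strictly one at a time at the 4–5 minutes per node quoted above. The function name and the larger node counts are hypothetical, chosen only for illustration, and real upgrades ran somewhat longer than this lower bound.

```python
def upgrade_minutes(node_count, minutes_per_node):
    """Best-case duration of a one-node-at-a-time upgrade, in minutes."""
    return node_count * minutes_per_node


# 20 nodes matches the cluster size mentioned above; 100 and 250 are
# hypothetical sizes used only to show how the total grows linearly.
for nodes in (20, 100, 250):
    low = upgrade_minutes(nodes, 4)
    high = upgrade_minutes(nodes, 5)
    print(f"{nodes} nodes: {low}-{high} minutes "
          f"({low / 60:.1f}-{high / 60:.1f} hours)")
```

Because the process is sequential, the total grows linearly with node count, so a cluster several times larger turns a tolerable afternoon into most of a working day or more.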

As our clusters grew and we created additional nodepools for specific purposes, a funny thing began to happen: upgrading started to become a hassle. Specifically, it began to take a long time. Not only did we have more nodes, but we also had a greater diversity of services running on them. Some of those used pod disruption budgets and termination grace periods, which slow an upgrade down. Others could not be restarted without downtime due to legacy connection management issues. As the upgrade times got longer, the duration of these scheduled downtimes also grew, impacting our customers and our team. Not surprisingly, we began to fall behind the current GKE release version. Recently we received an email from Google Support letting us know that an upcoming required master update would be incompatible with our node version. Either we upgraded the nodes, or Google would do it for us.
