It’s been a couple of years since I’ve written on software or systems topics. No specific reason for that other than that I wrote a bunch back when kubernetes adoption was ramping up and I just got tired of the topic (even though there have been plenty of new things worth writing about!). Also pandemic, family responsibilities, work stuff, etc., etc. Before that I wrote mostly for publication on Medium. Over time I became less and less thrilled with Medium, and so I’ve decided that for any future work I’m going to publish here and syndicate there, and we’ll see how that goes. For this post I want to talk about something that snuck up and hobbled our RPC services in our production cluster a couple of weeks ago: conntrack table exhaustion.
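Conntrack exhaustion gives little warning, but it is easy to check for once you know where to look: the kernel exposes the live entry count and the table limit as sysctls. Here is a quick sketch of the headroom math; the helper function is invented for this post, though the /proc paths are the standard netfilter ones.

```python
# Illustrative sketch, not our monitoring code: compute how full the
# netfilter conntrack table is. The helper name conntrack_utilization
# is invented for this post.

def conntrack_utilization(count: int, maximum: int) -> float:
    """Return conntrack table utilization as a fraction (0.0 to 1.0)."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum

# On a Linux node you would feed it the live counters, e.g.:
#   count   = int(open("/proc/sys/net/netfilter/nf_conntrack_count").read())
#   maximum = int(open("/proc/sys/net/netfilter/nf_conntrack_max").read())
#   print(f"conntrack table {conntrack_utilization(count, maximum):.1%} full")
```

Anything creeping toward 100% means new connections are about to start getting dropped silently, which is exactly the kind of failure described below.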
It started, as many such episodes do, with a ping in our internal support channel. Some things were slow, and people were seeing errors. On our response video chat we combed through logs until someone noticed a bunch of “temporary failure in name resolution” messages, an error that cluster nodes and other VMs will log when they can’t, you know, resolve a name. We run on Google Cloud + GKE, and all of our DNS zones are on Google Cloud DNS. If DNS lookups were failing it wasn’t our doing. We’ve been on Google for almost seven years and the platform has been amazingly durable and performant… but we had seen the odd, brief episode of network shenanigans before.
I spend a large part of every day shelled into cloud servers, viewing logs, checking alerts in slack channels, looking at pages on my phone, glancing at the kitchen clock as I walk by to get coffee, and otherwise behaving like a typical engineer. These activities have something in common: they all involve timestamps of one form or another, and almost every one of them displays those timestamps differently.
Yeah, I hate time zones, and you probably do too. Our servers are on UTC military time. Our slack channel shows 12-hour local time, as does the kitchen clock and my phone. My colleagues often report timestamps in their own local time, and since we’ve been a remote team for something like forever those might be EST, EDT, CDT, CST, PDT, PST… you get the point… moreover you’ve probably lived it just like the rest of us. I’ve considered just changing everything in my life to UTC military time, but I would irritate my wife and you can’t avoid hitting a disconnect somewhere. Still, I do want to make all the on-the-fly converting I have to do as easy as possible.
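For what it’s worth, the conversion itself is one of the few parts of this that software makes painless. A small sketch of the kind of helper I have in mind, using the standard library’s zoneinfo module (Python 3.9+); the function name `to_utc` is made up for this post:

```python
# Normalize a colleague's "3:47 PM Chicago time" style report to the
# UTC our server logs use. to_utc() is a name invented for this sketch.
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc(naive: datetime, tz_name: str) -> datetime:
    """Interpret a naive local timestamp in tz_name and convert to UTC."""
    local = naive.replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC"))

# 3:47 PM on March 3rd in Chicago is CST (UTC-6), so 21:47 UTC:
reported = datetime(2023, 3, 3, 15, 47)
print(to_utc(reported, "America/Chicago").strftime("%H:%M %Z"))
```

The nice part is that the IANA zone name carries the DST rules with it, so EST-vs-EDT ambiguity stops being my problem.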
I don’t often write “in the trenches” stories about production issues, although I enjoy reading them. One of my favorite sources for that kind of tale is rachelbythebay. I’ve been inspired by her writing in the past, and I’m always on the lookout for opportunities to write more posts like hers. As it happens we experienced an unusual incident of our own recently. The symptoms and sequence of events involved make an interesting story and, as is often the case, contain a couple of valuable lessons. Not fun to work through at the time, but perhaps fun to replay here.
Some background: where I work we run a real-time chat service that provides an important communications tool to tens of thousands of businesses worldwide. Reliability is critical, and to ensure reliability we have invested a lot of engineering time into monitoring and logging. For alerting our general philosophy is to notify on root causes and to page the on-call engineer for customer-facing symptoms. Often the notifications we see in email or slack channels allow us to get ahead of a developing problem before it escalates to a page, and since pages mean our users are affected this is a good thing.
Logging is one of those plumbing things that often gets attention only when it’s broken. That’s not necessarily a criticism. Nobody makes money off their own logs. Rather we use logs to gain insight into what our programs are doing… or have done, so we can keep the things we do make money from running. At small scale, or in development, you can get the necessary insights from printing messages to stdout. Scale up to a distributed system and you quickly develop a need to aggregate those messages to some central place where they can be useful. This need is even more urgent if you’re running containers on an orchestration platform like kubernetes, where processes and local storage are ephemeral.
Since the early days of containers and the publication of the Twelve-Factor manifesto a common pattern has emerged for handling logs generated by container fleets: processes write messages to stdout or stderr, containerd (docker) redirects the standard streams to disk files outside the containers, and a log forwarder tails the files and forwards them to a database. The log forwarder fluentd is a CNCF project, like containerd itself, and has become more or less a de facto standard tool for reading, transforming, and transporting log lines to an indexing back end. If you create a GKE kubernetes cluster with cloud logging enabled (formerly Stackdriver) this is pretty much the exact pattern that you get, albeit using Google’s own flavor of fluentd.
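To make the pattern concrete, a minimal fluentd source/match pair for this setup might look something like the following. This is a sketch, not our actual config: the paths follow the usual kubernetes layout, and the elasticsearch output assumes the fluent-plugin-elasticsearch gem and a hypothetical in-cluster host.

```
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  <parse>
    @type json
  </parse>
</source>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc
  port 9200
</match>
```

The tail input keeps a position file so restarts don’t re-ship old lines, and the match block is where you would swap in whatever database or logging service you actually use.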
I’m a software engineer and so I usually fill this space with software and systems engineering topics. It’s what I do and love, and I enjoy writing about it, but not today. Instead I’m going to talk about what my wife does, and loves doing, and how the times we are living through have affected her job and our lives together. In many ways we’re among the lucky ones: we both have incomes and health insurance, and I already worked from home. In other ways we’re not so fortunate. The current crisis facing the world is like nothing any of us have seen in a generation or more. It’s impacting every single segment of our population and economy, and everyone has a story. This is what ours looks like, almost four weeks into lock-down.
My wife is a registered nurse. She works at a regional hospital in northern New Jersey, about 30 miles from our home. She has been there more than a decade. Her current role is as clinical coordinator on a cardiac critical care unit. You can think of it as sort of the captain of the care team. Some weeks ago, in preparation for what was obviously coming, her unit was converted into a negative pressure floor for the care of Covid-19 cases. This means that a lot of work was done to seal the floor off and install ventilation that lowers the air pressure inside, preventing the escape of infectious material. The same was done to one other unit in the hospital, and a lot of work was also done to prepare to provide intensive respiratory care for patients in those units.
Last night we migrated a key service to a new environment. Everything went smoothly and we concluded the maintenance window early, exchanged a round of congratulations and killed the zoom call. This morning I settled in at my desk and realized that this key service’s builds were breaking on master. My initial, and I think understandable, impulse was that somehow I had broken the build when I merged my work branch for the migration into master the night before. Nothing pours sand on your pancakes like waking up to find out the thing you thought went so well last evening is now a smoking pile of ruin.
Except that wasn’t the problem. There was no difference between the commit that triggered the last good build and the merge commit to master that was now failing. I’m fine with magic when it fixes problems. We even have an emoji for it. “Hey, that thingamajig is working now!” Magic. I do not like it when it breaks things, although it is possible to use the same emoji for those cases as well. The first clue as to what was really happening was that the broken thing was a strict requirements check we run on a newly built image before unit tests. It has a list of packages it expects to find, and fails if it finds any discrepancy between that and the image contents.
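The check itself is conceptually simple: diff the pins we expect against what the image actually contains, in both directions. A toy version, with function and variable names invented for this post (the real check is more involved):

```python
# Toy version of a strict requirements check: compare the packages an
# image actually contains (e.g. the output of `pip freeze`) against a
# pinned manifest, and fail on any discrepancy in either direction.
# check_requirements() is a name made up for this sketch.

def check_requirements(expected: list[str], installed: list[str]) -> list[str]:
    """Return a list of human-readable discrepancies; empty means pass."""
    exp, got = set(expected), set(installed)
    problems = [f"missing: {pkg}" for pkg in sorted(exp - got)]
    problems += [f"unexpected: {pkg}" for pkg in sorted(got - exp)]
    return problems

expected = ["flask==2.0.1", "requests==2.26.0"]
installed = ["flask==2.0.1", "requests==2.26.1"]
for problem in check_requirements(expected, installed):
    print(problem)
# An upstream version bump shows up as one missing pin plus one
# unexpected pin, even though nothing in our repo changed.
```

That last property is the important one here: a check like this can start failing with no change on our side at all, which is exactly what was happening.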
A couple of years ago I lost all of what I would have considered, up to that point, my intellectual life, not to mention a lot of irreplaceable photos, in a hard drive failure. And while this post is not about the technical and behavioral missteps that allowed the loss to occur those things nonetheless make up a part of the story. How does it happen that an experienced software engineer, someone who is often responsible for corporate data and has managed to not get fired for losing any of it, suffers a hard drive failure and finds himself in possession of zero backups? Almost effortlessly, as it turned out.
Since the early 1980s I’ve kept all my digital self in a single directory tree off the root of my system’s boot disk. Over the years this directory structure was faithfully copied every time I upgraded, travelling on floppies, zip drives, CD-Rs, DVD-Rs, USB thumb drives, flash drives, from my first 8088 to my second and ridiculously expensive 80286 and so on through all of the machines I’ve bought or built in three decades. Along the way it grew, becoming the repository for all my software and writing work. The first VGA code I wrote was in there. The complete source code for my shareware backgammon game was in there. All the articles I wrote for Dr. Dobbs, Software Development and other journals were in there.
At Olark we’ve been running production workloads on kubernetes in GKE since early 2017. In the beginning our clusters were small and easily managed. When we upgraded kubernetes on the nodes, our most common cluster-wide management task, we could just run the process in the GKE console and keep an eye on things for a while. Upgrading involves tearing down and replacing nodes one at a time, and consumes about 4–5 minutes per node in the best case. When we were at 20 nodes it might take 90–120 minutes, which is in a tolerable range. It was disruptive, but all our k8s services at the time could deal with that. It was irreversible too, but we mitigated that risk by testing in staging, and by staying current enough that the previous version was still available for a replacement nodepool if needed. This approach seemed to work fine for over a year.
As our clusters grew and we created additional nodepools for specific purposes a funny thing began to happen: upgrading started to become a hassle. Specifically it began to take a long time. Not only did we have more nodes, but we also had a greater diversity of services running on them. Some of those implemented things like pod disruption budgets and termination grace periods that slow an upgrade down. Others could not be restarted without a downtime due to legacy connection management issues. As the upgrade times got longer the duration of these scheduled downtimes also grew, impacting our customers and our team. Not surprisingly we began to fall behind the current GKE release version. Recently we received an email from Google Support letting us know that an upcoming required master update would be incompatible with our node version. We had to upgrade them, or they would.
Usually my posts here are about some thing I think I might have figured out and want to share. Today’s post is about a thing I’m pretty sure I haven’t figured out and want to share. I want to talk about a problem we’ve been wrestling with over the last couple of weeks; one for which we can suggest a potential fix but whose root cause we do not yet know. In short, if you are running certain types of services behind a GCE class ingress on GKE you might be getting traffic even when your pods are unready, during a deployment for example. Before I get into the details here is the discovery story. If you just want the tl;dr and recommendations jump to the end.
[Update 4/17/2019 — Google’s internal testing confirmed that this is a problem with the front end holding open connections to kube-proxy and reusing them. Because the netfilter NAT table rules only apply to new connections this effectively short-circuited kubernetes’ internal service load balancing and directed all traffic to a given node/nodeport to the same pod. Google also confirmed that removing the keep-alive header from the server response is a work-around, and we’ve confirmed this internally. If you need the keep-alive header then the next best choice is to move to container native load balancing with a VPC-native cluster, since this takes the nodeport hop right out of the equation. Unfortunately that means building a new cluster if yours is not already VPC-native. So that is the solution… if you’re still interested in the story read on!]
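In practice the work-around amounts to making sure the server answers with `Connection: close` rather than a keep-alive header, so the front end cannot hold a connection open and reuse it. As a sketch of what that might look like for a Python WSGI app (the middleware class name is invented for this post; your framework or server likely offers a cleaner hook):

```python
# Hypothetical WSGI middleware that forces "Connection: close" on every
# response, so an upstream load balancer cannot keep the connection
# alive and pin all traffic through a nodeport to one pod.
class CloseConnectionMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any keep-alive header the app set, then force close.
            headers = [(k, v) for k, v in headers
                       if k.lower() != "connection"]
            headers.append(("Connection", "close"))
            return start_response(status, headers, exc_info)
        return self.app(environ, patched_start_response)
```

Forcing a fresh connection per request costs some latency, which is why moving to container native load balancing is the better long-term answer if you can manage it.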
If you’re a GKE user and you’ve created a cluster within the last six months or so you might have noticed a new option: the ability to make the cluster “VPC-native.”
You may also have caught the press release announcing this feature back in May, or the announcement last October of container-native load balancing for GKE pods, a related thing. VPC-native, container-native, alias IP: these all seem like fairly intimidating terms, and since this networking architecture is set to become the default for new clusters “soon” I thought it would be useful to relate what we’ve learned about it, based on creating and running both types of clusters in production and comparing the way they work.
First, the anxiety-mitigation portion of the post: running a cluster as VPC-native changes almost nothing inside the cluster itself. Nothing about the way your workloads are deployed, discovered or connected to by other workloads inside the cluster is affected. In fact if you compare two clusters, one using VPC-native and the other using the legacy approach, now inexplicably called “advanced routing,” you’ll find they’re pretty much identical from the inside down to the command line arguments passed to the kubelet, kube-dns and kube-proxy on startup. So you’re not going to break anything switching your workloads to a VPC-native cluster, unless you’re doing something stranger than I can currently imagine as I write this.