Exhausting conntrack table space crippled our k8s cluster

It’s been a couple of years since I’ve written on software or systems topics. No specific reason, other than that I wrote a bunch back when kubernetes adoption was ramping up and I got tired of the topic (even though there have been plenty of new things worth writing about!). Also pandemic, family responsibilities, work stuff, etc., etc. Before that I wrote mostly for publication on Medium. Over time I became less and less thrilled with Medium, so I’ve decided that for any future work I’ll publish here and syndicate there, and we’ll see how that goes. For this post I want to talk about something that snuck up and hobbled the RPC services in our production cluster a couple of weeks ago: conntrack table exhaustion.

It started, as many such episodes do, with a ping in our internal support channel. Some things were slow, and people were seeing errors. On our incident response video chat we combed through logs until someone noticed a bunch of “temporary failure in name resolution” messages, an error that cluster nodes and other VMs will log when they can’t, you know, resolve a name. We run on Google Cloud + GKE, and all of our DNS zones are on Google Cloud DNS. If DNS lookups were failing, it wasn’t our doing. We’ve been on Google for almost seven years and the platform has been amazingly durable and performant… but the odd, brief episode of network shenanigans was not, up to that point, an unknown thing.
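If you want a quick read on how close a node is to this failure mode, the kernel exposes the relevant counters under /proc. Below is a minimal sketch of that check in Python; the paths are the standard netfilter ones, but the 90% threshold is just an illustrative number, not something taken from our actual monitoring.

```python
# Minimal sketch: compare a node's conntrack usage against its limit.
# Threshold and output format are illustrative only.

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def conntrack_usage():
    count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
    limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
    return count / limit

if __name__ == "__main__":
    usage = conntrack_usage()
    print(f"conntrack table is {usage:.1%} full")
    if usage > 0.9:
        print("warning: connection tracking table is nearly exhausted")
```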

Continue reading

So you want Windows to show 24-hour time?

Originally published at https://medium.com/@betz.mark/so-you-want-windows-to-show-24-hour-time-eeac41062b73

I spend a large part of every day shelled into cloud servers, viewing logs, checking alerts in slack channels, looking at pages on my phone, glancing at the kitchen clock as I walk by to get coffee, and otherwise behaving like a typical engineer. These activities have something in common: they all involve timestamps of one form or another, and almost no two of them display time the same way.

Yeah, I hate time zones, and you probably do too. Our servers are on UTC military time. Our slack channel shows 12-hour local time, as do the kitchen clock and my phone. My colleagues often report timestamps in their own local time, and since we’ve been a remote team for something like forever, those might be EST, EDT, CDT, CST, PDT, PST… you get the point, and you’ve probably lived it just like the rest of us. I’ve considered just changing everything in my life to UTC military time, but that would irritate my wife, and you can’t avoid hitting a disconnect somewhere. Still, I do want to make all the on-the-fly converting I have to do as easy as possible.
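When I can’t avoid the conversion I at least script it. A throwaway sketch like the one below handles most of the timestamps that cross my desk (Python 3.9+ for zoneinfo; the zone list is just an example, not a statement about where my colleagues live):

```python
# Quick converter: render a UTC server timestamp in a few US time zones.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def show_local_times(utc_ts):
    dt = datetime.fromisoformat(utc_ts).replace(tzinfo=timezone.utc)
    for zone in ("America/New_York", "America/Chicago", "America/Los_Angeles"):
        local = dt.astimezone(ZoneInfo(zone))
        print(f"{zone:22s} {local:%Y-%m-%d %H:%M %Z}")

show_local_times("2021-03-15T17:42:00")
```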

Continue reading

The day a chat message blew up prod

Originally published at https://medium.com/@betz.mark/the-day-a-chat-message-blew-up-prod-2c30941db07a

I don’t often write “in the trenches” stories about production issues, although I enjoy reading them. One of my favorite sources for that kind of tale is rachelbythebay. I’ve been inspired by her writing in the past, and I’m always on the lookout for opportunities to write more posts like hers. As it happens we experienced an unusual incident of our own recently. The symptoms and sequence of events involved make an interesting story and, as is often the case, contain a couple of valuable lessons. Not fun to work through at the time, but perhaps fun to replay here.

Some background: where I work we run a real-time chat service that provides an important communications tool to tens of thousands of businesses worldwide. Reliability is critical, and to ensure it we have invested a lot of engineering time in monitoring and logging. For alerting, our general philosophy is to notify on root causes and to page the on-call engineer for customer-facing symptoms. Often the notifications we see in email or slack channels let us get ahead of a developing problem before it escalates to a page, and since pages mean our users are affected, that is a good thing.

Continue reading

The cost of tailing logs in kubernetes

Originally published at https://medium.com/@betz.mark/the-cost-of-tailing-logs-in-kubernetes-aca2bfc6fe43

Logging is one of those plumbing things that often gets attention only when it’s broken. That’s not necessarily a criticism. Nobody makes money off their own logs. Rather, we use logs to gain insight into what our programs are doing… or have done, so we can keep the things we do make money from running. At small scale, or in development, you can get the necessary insights by printing messages to stdout. Scale up to a distributed system and you quickly develop a need to aggregate those messages in some central place where they can be useful. This need is even more urgent if you’re running containers on an orchestration platform like kubernetes, where processes and local storage are ephemeral.

Since the early days of containers and the publication of the Twelve-Factor manifesto, a common pattern has emerged for handling logs generated by container fleets: processes write messages to stdout or stderr, the container runtime (containerd or docker) redirects the standard streams to disk files outside the containers, and a log forwarder tails the files and forwards the lines to a database. The log forwarder fluentd is a CNCF project, like containerd itself, and has become more or less the de facto standard tool for reading, transforming, transporting and indexing log lines. If you create a GKE kubernetes cluster with cloud logging (formerly Stackdriver) enabled, this is pretty much the exact pattern you get, albeit using Google’s own flavor of fluentd.
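Stripped of buffering, parsing, retries and log rotation, the forwarding half of that pattern is conceptually tiny. Here is a rough sketch of the idea; the log path and the ship() destination are placeholders, and this is an illustration of the concept rather than how fluentd actually works internally.

```python
# Bare-bones illustration of a log forwarder: follow a container log file
# and ship each line, as a structured record, to some central place.
import json
import time

LOG_PATH = "/var/log/containers/example.log"  # hypothetical file path

def ship(record):
    # A real forwarder would buffer and send to a logging backend;
    # here we just print the structured record.
    print(json.dumps(record))

def follow(path):
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip("\n")

for line in follow(LOG_PATH):
    ship({"log": line, "path": LOG_PATH, "time": time.time()})
```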

Continue reading

Pulling shared docker tags is bad

Originally published at https://medium.com/@betz.mark/pulling-shared-docker-tags-is-bad-5aea48e079c6

Last night we migrated a key service to a new environment. Everything went smoothly and we concluded the maintenance window early, exchanged a round of congratulations and killed the zoom call. This morning I settled in at my desk and realized that the same service’s builds were breaking on master. My initial, and I think understandable, impulse was that I had somehow broken the build when I merged my work branch for the migration into master the night before. Nothing pours sand on your pancakes like waking up to find out the thing you thought went so well last evening is now a smoking pile of ruin.

Except that wasn’t the problem. There was no difference between the commit that triggered the last good build and the merge commit to master that was now failing. I’m fine with magic when it fixes problems. We even have an emoji for it. “Hey, that thingamajig is working now!” Magic. I do not like it when it breaks things, although it is possible to use the same emoji for those cases as well. The first clue as to what was really happening was that the failing step was a strict requirements check we run on a newly built image before unit tests. It has a list of packages it expects to find, and fails if it finds any discrepancy between that list and the image contents.
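The check itself is conceptually simple. Something along these lines (a rough sketch, not our actual tooling; the name of the pinned file is made up) compares what is actually installed in the image against a pinned list and fails the build on any drift:

```python
# Rough sketch of a strict requirements check: fail if the packages
# installed in the image differ at all from a pinned list.
import sys
from importlib.metadata import distributions

def installed_packages():
    return {d.metadata["Name"].lower(): d.version for d in distributions()}

def load_expected(path):
    expected = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, version = line.partition("==")
            expected[name.lower()] = version
    return expected

expected = load_expected("requirements.lock")  # hypothetical pinned file
actual = installed_packages()
if expected != actual:
    print("installed packages drifted from the pinned list", file=sys.stderr)
    sys.exit(1)
```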

Continue reading

Upgrading a large cluster on GKE

Originally published on the Google Cloud Community blog at https://medium.com/google-cloud/upgrading-a-large-cluster-on-gke-499a7256e7e1

At Olark we’ve been running production workloads on kubernetes in GKE since early 2017. In the beginning our clusters were small and easily managed. When we upgraded kubernetes on the nodes, our most common cluster-wide management task, we could just run the process in the GKE console and keep an eye on things for a while. Upgrading involves tearing down and replacing nodes one at a time, and consumes about 4–5 minutes per node in the best case. When we were at 20 nodes it might take 90–120 minutes, which is a tolerable range. It was disruptive, but all our k8s services at the time could deal with that. It was irreversible too, but we mitigated that risk by testing in staging, and by staying current enough that the previous version was still available for a replacement nodepool if needed. This approach seemed to work fine for over a year.

As our clusters grew and we created additional nodepools for specific purposes, a funny thing began to happen: upgrading started to become a hassle. Specifically, it began to take a long time. Not only did we have more nodes, but we also had a greater diversity of services running on them. Some of those implemented things like pod disruption budgets and termination grace periods, which slow an upgrade down. Others could not be restarted without downtime due to legacy connection management issues. As upgrade times got longer, the duration of these scheduled downtimes also grew, impacting our customers and our team. Not surprisingly, we began to fall behind the current GKE release version. Recently we received an email from Google Support letting us know that an upcoming required master update would be incompatible with our node version. We had to upgrade them, or they would.

Continue reading

Ingress load balancing issues on Google’s GKE

Originally published on the Google Cloud Community blog at https://medium.com/google-cloud/ingress-load-balancing-issues-on-googles-gke-f54c7e194dd5

Usually my posts here are about some thing I think I might have figured out and want to share. Today’s post is about a thing I’m pretty sure I haven’t figured out and want to share. I want to talk about a problem we’ve been wrestling with over the last couple of weeks: one we can suggest a potential fix for but do not yet know the root cause of. In short, if you are running certain types of services behind a GCE-class ingress on GKE, you might be getting traffic even when your pods are unready, during a deployment for example. Before I get into the details, here is the discovery story. If you just want the tl;dr and recommendations, jump to the end.

[Update 4/17/2019: Google’s internal testing confirmed that this is a problem with the front end holding open connections to kube-proxy and reusing them. Because the netfilter NAT table rules only apply to new connections, this effectively short-circuited kubernetes’ internal service load balancing and directed all traffic arriving at a given node/nodeport to the same pod. Google also confirmed that removing the keep-alive header from the server response is a work-around, and we’ve confirmed this internally. If you need the keep-alive header, the next best choice is to move to container-native load balancing on a VPC-native cluster, since that takes the nodeport hop out of the equation entirely. Unfortunately that means building a new cluster if yours is not already VPC-native. So that is the solution… if you’re still interested in the story, read on!]

Over the last couple of months we’ve been prepping one of our most critical services for migration to GKE. This service consists partly of an http daemon that handles long poll requests from our javascript client, and runs on 90 GCE instances. These instances handle approximately 15k requests per second at peak load. Because many of these requests are long polls with a timeout of 30 seconds, we need the ability to gracefully shut down instances of this service. To accomplish this, we have a command we can send the service that causes it to take itself out of rotation, wait 60 seconds for all existing long polls to complete, and then exit.
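That drain behavior is the piece that matters for this story. Here is a bare-bones sketch of the pattern; the signal choice, the timing, and the readiness hook are illustrative, not our actual implementation.

```python
# Sketch of a graceful drain: on a shutdown signal, start failing the
# readiness check so the load balancer stops sending new requests, wait
# out the longest long poll, then exit.
import signal
import sys
import threading
import time

draining = threading.Event()
DRAIN_SECONDS = 60  # longest long-poll timeout, plus a little margin

def ready():
    # Wired into whatever health check the load balancer probes; once
    # draining starts we report "not ready" so new traffic stops arriving.
    return not draining.is_set()

def handle_shutdown(signum, frame):
    draining.set()

signal.signal(signal.SIGTERM, handle_shutdown)

# Stand-in for the real request-serving loop: serve until asked to drain,
# then wait for in-flight long polls to finish before exiting.
while not draining.is_set():
    time.sleep(1)
time.sleep(DRAIN_SECONDS)
sys.exit(0)
```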

Continue reading

VPC-native clusters on Google Kubernetes Engine

Originally published on the Google Cloud Community blog at https://medium.com/google-cloud/vpc-native-clusters-on-google-kubernetes-engine-b7c022c07510

If you’re a GKE user and you’ve created a cluster within the last six months or so, you might have noticed a new option in the cluster creation settings.

You may also have caught the press release announcing this feature back in May, or the announcement last October of container-native load balancing for GKE pods, a related feature. VPC-native, container-native, alias IP: these all seem like fairly intimidating terms, and since this networking architecture is set to become the default for new clusters "soon," I thought it would be useful to relate what we’ve learned about it, based on creating and running both types of clusters in production and comparing the way they work.

First, the anxiety-mitigation portion of the post: running a cluster as VPC-native changes almost nothing inside the cluster itself. Nothing about the way your workloads are deployed, discovered or connected to by other workloads inside the cluster is affected. In fact, if you compare two clusters, one using VPC-native and the other using the legacy approach, now inexplicably called "advanced routing," you’ll find they’re pretty much identical from the inside, right down to the command line arguments passed to the kubelet, kube-dns and kube-proxy on startup. So you’re not going to break anything by switching your workloads to a VPC-native cluster, unless you’re doing something stranger than I can currently imagine as I write this.

Continue reading

Understanding resource limits in kubernetes: memory

Originally published at https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-memory-6b41e9a955f9

When I started working with kubernetes at scale I began encountering something that didn’t happen when I was just running experiments on it: occasionally a pod would get stuck in pending status because no node had sufficient cpu or ram available to run it. You can’t add cpu or ram to a node, so how do you un-stick the pod? The simplest fix is to add another node, and I admit to resorting to this answer more than once. Eventually it became clear that this strategy fails to leverage one of kubernetes’ greatest strengths: its ability to efficiently utilize compute resources. The real problem in many of these cases was not that the nodes were too small, but that we had not accurately specified resource limits for the pods.

Resource limits are the operating parameters that you provide to kubernetes to tell it two critical things about your workload: what resources it requires to run properly, and the maximum resources it is allowed to consume. The first is a critical input to the scheduler that enables it to choose the right node on which to run the pod. The second is important to the kubelet, the daemon on each node that is responsible for pod health. While most readers of these posts probably have at least a basic familiarity with the concept of resources and limits, there is a lot of interesting detail under the hood. In this two-part series I’m going to first look closely at memory limits, and then follow up with a second post on cpu limits.
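To make the two settings concrete, here is roughly what they look like on a single container, expressed with the official Kubernetes Python client; the values are arbitrary, picked only to show the shape of the object.

```python
# Illustration of memory requests vs. limits on a single container,
# using the official Kubernetes Python client. Values are examples only.
from kubernetes import client

container = client.V1Container(
    name="app",
    image="example/app:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"memory": "64Mi"},   # what the scheduler reserves on a node
        limits={"memory": "128Mi"},    # the ceiling enforced on the node via cgroups
    ),
)
```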

Continue reading

Understanding resource limits in kubernetes: cpu time

Originally published at https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-cpu-time-9eff74d3161b

In the first post of this two-part series on resource limits in kubernetes I discussed how the ResourceRequirements object was used to set memory limits on containers in a pod, and how those limits were implemented by the container runtime and linux control groups. I also talked about the difference between requests, used to inform the scheduler of a pod’s requirements at schedule time, and limits, used to assist the kernel in enforcing usage constraints when the host system is under memory pressure. In this post I want to continue by looking in detail at cpu time requests and limits. Having read the first post is not a prerequisite to getting value from this one, but I encourage you to read them both at some point to get a complete picture of the controls available to engineers and cluster administrators.

CPU limits

As I mentioned in the first post, cpu limits are more complicated than memory limits, for reasons that will become clear below. The good news is that cpu limits are controlled by the same cgroups mechanism we just looked at, so all the same ideas and tools for introspection apply, and we can just focus on the differences. Let’s start by adding cpu limits back into the example resources object that we looked at last time.
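That object appears in the full post. Purely as a hypothetical stand-in, and again using the Python client for illustration, adding cpu alongside the memory values from part one might look something like this:

```python
# Hypothetical continuation of the memory example: the same resources
# object with cpu requests and limits added. Values are illustrative only.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"memory": "64Mi", "cpu": "250m"},  # "250m" means 250 millicores, a quarter of one cpu
    limits={"memory": "128Mi", "cpu": "500m"},   # enforced on the node via the cgroup cpu quota
)
```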

Continue reading