It’s been a couple of years since I’ve written on software or systems topics. No specific reason for that other than that I wrote a bunch back when kubernetes adoption was ramping up and I just got tired of the topic (even though there have been plenty of new things worth writing about!). Also pandemic, family responsibilities, work stuff, etc., etc. Before that I wrote mostly for publication on Medium. Over time I became less and less thrilled with Medium, and so I’ve decided that for any future work I’m going to publish here and syndicate there, and we’ll see how that goes. For this post I want to talk about something that snuck up and hobbled our RPC services in our production cluster a couple of weeks ago: conntrack table exhaustion.
It started, as many such episodes do, with a ping in our internal support channel. Some things were slow, and people were seeing errors. On our response video chat we combed through logs until someone noticed a bunch of “temporary failure in name resolution” messages, an error that cluster nodes and other VMs will log when they can’t, you know, resolve a name. We run on Google Cloud + GKE, and all of our DNS zones are on Google Cloud DNS. If DNS lookups were failing it wasn’t our doing. We’ve been on Google for almost seven years and the platform has been amazingly durable and performant… but the odd, brief episode of network shenanigans had not been, up to this point, an unknown thing.
We were tempted to declare this one a “not us,” and it did seem like things were getting better as time went on. Since we didn’t have an actual answer, though, we kept poking around in logs and dashboards, and before long evidence began piling up that this was more than DNS failing. Latencies to our higher-volume RPC services were up, and some connections to external resources such as Firebase were occasionally failing or timing out. We added pods to a few key deployments and things continued to improve over the evening, but the next morning, at almost exactly the same time, the problem started up again. Still under the happy delusion that there was nothing we could have done to cause these network issues, I filed a ticket with Google Cloud support.
People on various message boards like to bash Google’s support (looking at you HN), but our experience with the GCP support team has been excellent since day one, and it was just an hour or so before their rep Cristian got back to me. After a little back and forth he was able to provide me with VM logs showing that we were experiencing a lot of errors with the message “nf_conntrack: table full, dropping packet.” I’m not one of those really smart engineers but the “dropping packet” part seemed relevant to our issue with network performance. The source of the log message, “nf_conntrack,” was strange to me, so I had to do a little digging. Turns out that conntrack is a netfilter module that provides stateful connection-tracking abilities to the linux firewall. I’ve poked around in iptables, another netfilter module, a fair bit back when I was working out how kubernetes networks pods and services, but conntrack was totally new to me.
The problem ended up not being very mysterious at all. Just one of those things where you don’t hit it until you hit it, and then you get educated. There’s lots of info out there on conntrack and what it does, but the tl;dr for the purposes of this tale is simple: conntrack maintains a table of active connections; the table has a size limit, which on GKE nodes is set using a formula based on system RAM; if you have lots and lots of connections active on a VM then that limit might be reached; and when the limit is reached and a packet that is part of a new connection attempt arrives, conntrack drops it and logs the message we already looked at. There are two solutions: you can make the conntrack table bigger, or you can reduce the number of connections. There’s actually a third option, which is to open connections in such a way that conntrack ignores them, but neither that nor changing the size of the table was something we were enthused about inflicting on our production kubernetes cluster. That left reducing the number of connections.
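A node’s position relative to that limit can be checked directly from the netfilter sysctls. A minimal sketch (the numbers printed are whatever your node reports; the paths are the standard conntrack knobs):

```shell
# Read the conntrack table limit and current usage on a node. Requires the
# nf_conntrack module to be loaded; the limit varies with node RAM on GKE.
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
echo "$count $max" | awk '{ printf "conntrack: %d of %d entries (%.1f%% full)\n", $1, $2, 100*$1/$2 }'
```

Watching that percentage over time (or exporting it to your metrics system) is a cheap way to see table exhaustion coming before packets start getting dropped.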
Which quickly led to the question of why we had so many in the first place. If you want to know how many active connections conntrack is handling on a given VM you can log in and run sudo conntrack -L. That lists out the contents of the table, and if you want to get fancy you can grep out specific connection states like ESTABLISHED or TIME_WAIT, but for a rough view of how many things are talking to other things on a given machine it will do. When we did this we noticed something a little eye-opening: most of our cluster nodes had a couple of thousand tracked connections, but two of the nodes, the ones logging the conntrack errors, had over 100k connections being tracked. Those two nodes had another thing in common that seemed relevant: they happened to be the ones running the linkerd pods that provide routing for our RPC requests. Basically linkerd is a load balancer for RPC calls: all our RPC clients open connections to it, and it opens connections to all the servers.
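That state-grepping trick can be sketched as a small pipeline (assuming the conntrack CLI is installed on the node; the state list here covers only a few common TCP states):

```shell
# Count conntrack entries by TCP state; sudo because the table is root-only.
sudo conntrack -L 2>/dev/null \
  | awk '{ for (i = 1; i <= NF; i++)
             if ($i ~ /^(ESTABLISHED|TIME_WAIT|SYN_SENT|CLOSE_WAIT|CLOSE)$/) print $i }' \
  | sort | uniq -c | sort -rn
```

A large pile of TIME_WAIT entries versus ESTABLISHED ones tells you something different about the workload, which is why the breakdown is worth having.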
It was a pretty straightforward matter at that point to figure out which pods were opening a shit-ton of connections to linkerd. The conntrack table contains the four-tuple of source ip/port and destination ip/port for the connection flow in both directions, and those map directly to pods. Here’s a thing, however: the number of lines in the conntrack table does not tell you how many connections a pod actually has open. Conntrack keeps a connection’s entry in the table for 120 seconds after it closes (by default; this is no doubt configurable someplace). So the list of conntrack table entries is a good indication of magnitude, but if you want to know how many connections a given container actually has open at any specific time you can use one of my favorite tools for debugging container networking problems: nsenter. If you have the PID of a container you can use nsenter to run netstat inside its network namespace.
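Since the four-tuples in the table map to pod IPs, grouping entries by source address is a quick way to see which pods own the most of them. A sketch (the first src= field on each line is the original direction of the flow):

```shell
# Top source IPs by number of conntrack table entries.
sudo conntrack -L 2>/dev/null \
  | awk 'match($0, /src=[0-9.]+/) { print substr($0, RSTART + 4, RLENGTH - 4) }' \
  | sort | uniq -c | sort -rn | head
```

Cross-referencing the top IPs against kubectl get pods -o wide points you at the suspect workloads.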
The way to get that PID on GKE is to run crictl ps on the node to get the list of running containers, find the one you care about, note the container ID, then run crictl inspect &lt;container ID&gt; to get the PID. You can then run nsenter -t &lt;PID&gt; -n netstat and list out the open connections for the container. We began poking around with these tools and were able to identify the RPC client that was opening all of the connections to linkerd, and to determine how many were active at a given time. When we chased this thread all the way to the spool we discovered that a misconfigured CDN edge name was sending far more requests to this service than we would expect based on historical data, which just goes to show that you don’t know what’s really giving you fits until you know what’s really giving you fits.
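Strung together, that lookup might look something like this. It’s a sketch: the container name is hypothetical, and it assumes jq is available on the node to pull the PID out of the JSON that crictl inspect emits.

```shell
# On the GKE node. "crictl inspect" emits JSON whose .info.pid field is the
# container's host PID; nsenter then runs the host's netstat inside the
# container's network namespace.
CID=$(crictl ps --name my-rpc-client -q | head -n 1)
PID=$(crictl inspect "$CID" | jq -r '.info.pid')

# Count established TCP connections open inside the container right now.
nsenter -t "$PID" -n netstat -tn | awk '$6 == "ESTABLISHED"' | wc -l
```

Comparing that live count against the pod’s share of conntrack entries tells you how much of the table is closed-but-not-yet-expired connections.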
Once we corrected the misconfigured CDN the problem was alleviated. Based on our shiny new understanding of conntrack we also decided to add pod anti-affinity to the linkerd pods so that they would not colocate on the same node, and to increase the replica count to spread bursts of network activity across a larger number of nodes. Together these changes kept the size of the conntrack table well under the limit. Another thing we learned during this process was that we had been too quick in the past to assume that low-level issues with network performance had to be on our cloud provider. We had access to the same logs our support rep did, and yet we didn’t go looking for issues with dropped packets because somewhere in the back of our minds we all assumed it was a problem on Google’s end that would be quickly resolved. You get used to dealing with your system at the level of fleets of containers running in a cluster, and it’s easy to forget that low-level stuff like CPU, RAM, and network connection tracking state still has to be managed and conserved.
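For reference, the anti-affinity change is a small addition to the deployment spec. This is a sketch, not our actual manifest; the label is illustrative:

```yaml
# Keep replicas matching this label off nodes that already run one.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: linkerd
      topologyKey: kubernetes.io/hostname
```

Using kubernetes.io/hostname as the topology key is what makes the rule per-node; a preferredDuringScheduling variant would be the softer choice if you have fewer nodes than replicas.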
If you’re interested in learning more about the netfilter conntrack module and how it works, this Cloudflare post is not a bad place to start.