It’s been a couple of years since I’ve written on software or systems topics. No specific reason, other than that I wrote a bunch back when Kubernetes adoption was ramping up and I just got tired of the topic (even though there have been plenty of new things worth writing about!). Also pandemic, family responsibilities, work stuff, etc., etc. Before that I wrote mostly for publication on Medium. Over time I became less and less thrilled with Medium, so I’ve decided that any future work will be published here and syndicated there, and we’ll see how that goes. For this post I want to talk about something that snuck up and hobbled the RPC services in our production cluster a couple of weeks ago: conntrack table exhaustion.
It started, as many such episodes do, with a ping in our internal support channel. Some things were slow, and people were seeing errors. On our response video chat we combed through logs until someone noticed a bunch of “temporary failure in name resolution” messages, an error that cluster nodes and other VMs will log when they can’t, you know, resolve a name. We run on Google Cloud + GKE, and all of our DNS zones are on Google Cloud DNS. If DNS lookups were failing, it wasn’t our doing. We’ve been on Google for almost seven years and the platform has been amazingly durable and performant… but the odd, brief episode of network shenanigans was not, up to this point, unheard of.