The day a chat message blew up prod

I don’t often write “in the trenches” stories about production issues, although I enjoy reading them. One of my favorite sources for that kind of tale is rachelbythebay. I’ve been inspired by her writing in the past, and I’m always on the lookout for opportunities to write more posts like hers. As it happens, we recently experienced an unusual incident of our own. The symptoms and sequence of events make an interesting story and, as is often the case, contain a couple of valuable lessons. It was not fun to work through at the time, but perhaps it will be fun to replay here.

Some background: where I work, we run a real-time chat service that provides an important communications tool to tens of thousands of businesses worldwide. Reliability is critical, and to ensure it we have invested a lot of engineering time in monitoring and logging. For alerting, our general philosophy is to notify on root causes and to page the on-call engineer for customer-facing symptoms. Often the notifications we see in email or Slack channels allow us to get ahead of a developing problem before it escalates to a page, and since a page means our users are already affected, getting ahead of one is a good thing.

