Leaving Olark after nine years

In June of 2016 I thought I was interviewing for a python gig at a small chat company. Instead, Olark’s engineering director noticed “kubernetes” on my resume and I ended up on a zoom with two engineers from the company’s Engineering Operations team. They were building an SRE culture to drive improvements in reliability via a migration to kubernetes for container orchestration. Was I interested in a role with them instead? I was. A couple of weeks later, during onboarding in Ann Arbor, we started hacking out the first version of our tooling for deploying containers to a k8s cluster from a CI/CD pipeline. We would build out at least two more versions of that tooling over the next nine years, a period which has been one of the most rewarding and productive of my career. I was privileged to work with a group of very smart engineers, at a company small enough to be friendly and comfortable, and yet large enough to have systems you could get your teeth into.

Alas, nothing lasts forever, and August 29th will be my final day on the job with Olark. I’m not sure yet what my next thing is, but I am sure I will miss the friends I made there and the work we did together. I wish them all much success and endlessly incrementing karma. To help me put a sort of mental wrap on it all, I am going to give a quick rundown of what I feel are the highlights of my time with a company whose name was never, as best I can recall, fully explained to me. I did get enough to know it’s not about the bird.

Migrating to Google Cloud

When I first joined Olark we ran on a bunch of VMs in the Rackspace public cloud. Olark was a Y Combinator alum, and I think one of their graduation gifts was credits or something that enticed them in that direction. That was my first experience with pager rotation, and in my first week I was awakened at least once because one of those VMs had inexplicably lost its network interface or boot disk and sent nagios screaming into the night. Incidents were depressingly common. Our initial plan had been to move services directly into GKE and leave all that VM + puppet + jenkins stuff behind, but the situation got so bad that we decided to pick it up lock, stock and barrel and move to Google. We all worked on everything, but my daily focus was building out the terraform scaffolding to create service instances and networking, and building the base images that allowed the systems to join our puppet show. I cannot fail to mention the time I ran a terraform command locally in the wrong folder and deleted 180 VMs. Anyway, the Saturday in March of 2017 when we flipped the DNS switch and lit it all up remains one of the high points of my career.
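
For a sense of what that scaffolding produced, here is a minimal sketch of the kind of terraform involved. The names, sizes and image family are invented for illustration; the real thing was considerably more involved.

```hcl
# Hypothetical, simplified example of one service VM and its network.
resource "google_compute_network" "services" {
  name                    = "services"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "services" {
  name          = "services-us-central1"
  region        = "us-central1"
  network       = google_compute_network.services.id
  ip_cidr_range = "10.10.0.0/20"
}

resource "google_compute_instance" "chat_worker" {
  name         = "chat-worker-01"      # invented service name
  machine_type = "n1-standard-4"
  zone         = "us-central1-b"

  boot_disk {
    initialize_params {
      # A custom base image baked so new instances could register with puppet.
      image = "projects/example-project/global/images/family/base-puppet"
    }
  }

  network_interface {
    subnetwork = google_compute_subnetwork.services.id
  }
}
```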

Kubernetes: it’s Greek for something

Kubernetes on GKE was always where we were heading. Some of our services were already containerized, but most were python processes deployed by jenkins and configured by shell scripts dumped onto the instances by a custom daemon. When we updated config we had to ssh to all the boxes and restart everything. Moving it all to containers and into kubernetes took probably two years, and we had a lot to learn along the way, from cluster sizing to resource management, networking and autoscaling. As this migration evolved we replaced our initial custom CI/CD tooling with helm charts, and later mostly moved from those to a toolchain that uses kustomize to patch yaml. Before joining Olark I had played with running http servers and whatnot in early beta clusters. Here at the end of my time with the company we run five clusters, with nearly 100 nodes. The curve of our adoption of k8s closely paralleled the curve of Google’s rollout and evolution of GKE. In those years from 2017 through 2022 we worked through a constant stream of changes: everything from statefulsets to ephemeral storage and container-native load balancing. Throughout it all GKE has been an extremely performant and reliable platform for us, and I think it’s fair to say we met all of our initial goals for the migration and then some.
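
To give a rough feel for the kustomize approach (the file layout and service name here are invented, not our actual repo), each environment gets a small overlay that patches a shared base, and the CI/CD pipeline only has to stamp in the image tag to deploy:

```yaml
# overlays/production/kustomization.yaml -- hypothetical layout
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base              # shared Deployment, Service, ConfigMap, etc.

images:
  - name: chat-api          # invented image name
    newTag: "1.42.0"        # the pipeline sets the tag to deploy

patches:
  - path: replicas.yaml     # small patch, e.g. bump replicas for production
    target:
      kind: Deployment
      name: chat-api
```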

All you need is logs (and then metrics)

We hadn’t been on GCP for long when our monthly bills convinced us that we were going to need our own aggregation, storage and visualization pipeline for logs and metrics. Google had just bought Stackdriver at the time, and while I don’t remember the pricing model or numbers, I do remember it wasn’t working for us. I think the value prop has since flipped, but at the time building our own was the only option. We ended up with fluent-bit on our remaining VMs, fluentd in the cluster (a daemonset tailing container logs on every node plus a dedicated indexing deployment), elasticsearch for storage and kibana for querying. At peak utilization we were handling 15-20k log lines per second. Figuring out the sizing of the elasticsearch cluster was a bit of black magic and we ended up replacing the whole thing a year in, but it’s been doing its job now for seven years. Later on I got to implement prometheus and grafana for monitoring, and probably the best part of that was working with all the other engineers to instrument the services they were responsible for.
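
Most of that instrumentation boiled down to adding a few counters and histograms to each service and exposing an endpoint for prometheus to scrape. A minimal sketch with the prometheus_client library (the metric names and the work being measured are invented for illustration):

```python
# Hypothetical instrumentation for one of the python services.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Counter: how many chat messages we processed, labeled by outcome.
MESSAGES = Counter(
    "chat_messages_processed_total",
    "Chat messages processed",
    ["status"],
)

# Histogram: how long each message took to handle.
LATENCY = Histogram(
    "chat_message_handle_seconds",
    "Time spent handling a chat message",
)

def handle_message():
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
    MESSAGES.labels(status="ok").inc()

if __name__ == "__main__":
    start_http_server(9100)  # serves /metrics for the prometheus scraper
    while True:
        handle_message()
```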

Out of state

State is always the hard bit. We migrated into GCP with a bunch of databases, elasticsearch and redis clusters, etcd clusters and memcached instances. We messed around with moving some of it into k8s, but it was early days for statefulsets and we weren’t happy with the storage performance at the time. It quickly became a policy, and a medium-term goal, to get as much of that stuff onto managed services as we could. We moved the dbs to Cloud SQL first, then moved all the redis instances to Memorystore. We retained the elasticsearch clusters for lack of a fully managed GCP-hosted alternative, but if one had been available we would certainly have been tempted. In every case where we did migrate, the hosted alternative has been more performant and cost-effective, and the migrations took a big chunk of responsibility off the backs of our small engineering team.

Migrating our DNS

After the migration to GCP our public DNS zones remained hosted on Rackspace, while our private zones lived in AWS Route 53. In mid-2023 we decided to move it all to Google Cloud DNS. The motivation was to finally close out our Rackspace account, clean up a lot of ancient cruft in the zones, and bring it all under terraform management so that it could be more accessible to the engineering team. We extracted all of our existing records as bind-formatted zone files, a process that was simple at AWS but required a support request at Rackspace. We then cleaned out all the obsolete records, wrote code to generate terraform markup from the zone files, created the new zones, wrote some more code to compare the records in the new and old zones and validate that they were identical, and finally updated the name server records at our registrar. The whole thing went very smoothly and without any disruption for our customers. We’ve since been quite happy with the performance of Cloud DNS and with managing our zones through terraform. The terraform markup for DNS records is pretty cumbersome, since each record is a named terraform resource, but it is still an improvement over the previous imperative approach to managing our DNS.
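
The validation step is the part that sketches nicely in code: load the record sets from the old and new zones and diff them. Something roughly like this with dnspython (the file names are hypothetical, and this is a simplification of whatever we actually ran):

```python
# Hypothetical check: compare two bind-formatted zone files and report
# any records that differ between the old and new zones.
import dns.rdatatype
import dns.zone

def record_set(path, origin):
    """Load a zone file and return a set of (name, type, rdata) tuples."""
    zone = dns.zone.from_file(path, origin=origin, relativize=False)
    records = set()
    for name, ttl, rdata in zone.iterate_rdatas():
        rdtype = dns.rdatatype.to_text(rdata.rdtype)
        # Skip SOA and NS records, which legitimately differ once the
        # zone is served by a new provider.
        if rdtype in ("SOA", "NS"):
            continue
        records.add((str(name), rdtype, rdata.to_text()))
    return records

old = record_set("rackspace/olark.com.zone", "olark.com")
new = record_set("clouddns/olark.com.zone", "olark.com")

for missing in sorted(old - new):
    print("missing from new zone:", missing)
for extra in sorted(new - old):
    print("unexpected in new zone:", extra)
```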

Thanks a bunch Edge.io

The looming holiday season of 2024 brought a fun email from Edge.io, the CDN platform we had been caching our content on since way back when they were Verizon Digital Media Services. They were shutting down the network in mid-January of 2025, and we would have to find a new place to cache our stuff. A notification that you have to switch to a new CDN and have only a couple of weeks to do it is not the kind of holiday card I like to find in the mailbox. After a crash program to investigate the capabilities and pricing of a number of leading CDN providers, we settled on Cloudflare and were able to migrate our objects and switch over seamlessly, while at the same time saving a substantial amount of money. Cloudflare, of course, is much more than a CDN, and once we had the relationship in place we began finding uses for some of their other tools, like edge workers. Maybe that was the point. Well played, Cloudflare.

Shout out to Gitlab

It’s not easy being a small engineering team running a complex distributed system built from dozens of critical open source components. One of the ways in which it is not easy is keeping up with the pace of change while ensuring that new versions of things play nice with older versions of other things, and that nothing breaks horribly because you had the temerity to move from happyfreestuff 2.02 to 2.03. The perfect storm is when something is critical, complex, and scary to upgrade. Such things tend to languish. Since not long after our migration to GCP we had been running our own Gitlab server. I’m not proud of it, but by mid-2024 the current release was 17.4, we were still chugging along on 13.whatever, and we had a problem: Gitlab’s license file format had changed, and we needed to upgrade in order to renew ours. Fortunately Gitlab has an awesome upgrade path planning tool: you give it the version you’re running and the version you’re targeting, and it tells you which intermediate versions you need to install to get from A to B. We cloned the server, ran through the upgrade on the clone successfully, then did it again for reals. Building some new runners turned out to be easier than upgrading the ones we had, and we took the opportunity to give them a little more muscle too.

What it was that made Olark fun

I’ve been with Olark longer than any other company I’ve worked for, including the one I co-founded back in the late ’90s. When I think about what kept me there I end up with a short list of things that made it a great place to work. First, a sense of mission. The company had serious platform problems when I arrived. A lot of things had to be fixed. Big changes had to be made, and there was support for making them. That’s always exciting. Second, a really sharp team of people I identified with (though I am older than most of them). I remember a point during my work-along day when one of the other guys had broken my demo and we were laughing at a joke I cracked about it. I mentioned something about my computer and one of them asked if I had built it. Yeah, of course I built it. Asus motherboard, Fractal Design case, EVGA 1070 Ti. We’re all geeks on this call. We all enjoyed building stuff, and the virtual high fives when you got a thing running were as real as virtual high fives get. That is the thing that still gets me up and to work every day, and the reason why I am still doing engineering after 30 years. So thanks for that, Olark. I wish you all luck… and I’ll just leave this here…