Leaving Olark after nine years

In June of 2016 I thought I was interviewing for a python gig at a small chat company. Instead, Olark’s engineering director noticed “kubernetes” on my resume and I ended up on a zoom with two engineers from the company’s Engineering Operations team. They were building an SRE culture to drive improvements in reliability via a migration to kubernetes for container orchestration. Was I interested in a role with them instead? I was. A couple of weeks later, during onboarding in Ann Arbor, we started hacking out the first version of our tooling for deploying containers to a k8s cluster from a CI/CD pipeline. We would build out at least two more versions over the next nine years, a period which has been one of the most rewarding and productive of my career. I was privileged to work with a group of very smart engineers, at a company small enough to be friendly and comfortable, and yet large enough to have systems you could get your teeth into.


Alas, nothing lasts forever, and August 29th will be my final day on the job with Olark. I’m not sure yet what my next thing is, but I am sure I will miss the friends I made there and the work we did together. I wish them all much success and endlessly incrementing karma. To help me put a sort of mental wrap on it all, I am going to give a quick list of what I feel are the highlights of my time with a company whose name was never, as best I can recall, fully explained to me. I did get enough to know it’s not about the bird.

Migrating to Google Cloud

When I first joined Olark we ran on a bunch of VMs on the Rackspace public cloud. Olark was a Y Combinator alum, and I think one of their graduation gifts was Rackspace credits, or something along those lines, to entice them in that direction. That was my first experience with pager rotation, and in my first week I was awakened at least once because one of those VMs had inexplicably lost its network interface or boot disk and sent nagios screaming into the night. Incidents were depressingly common. Our initial plan had been to move services directly into GKE and leave all that VM + puppet + jenkins stuff behind, but the situation got so bad that we decided to pick it up lock, stock and barrel and move to Google. We all worked on everything, but my daily focus was building out the terraform scaffolding to create service instances and networking, and building the base images that allowed the systems to join our puppet show. I cannot fail to mention the time I ran a terraform command locally in the wrong folder and deleted 180 VMs. Anyway, the Saturday in March of 2017 when we flipped the DNS switch and lit it all up remains one of the high points of my career.

Kubernetes: it’s Greek for something

Kubernetes on GKE was always where we were heading. Some of our services were already containerized, but most were python processes deployed by jenkins and configured by shell scripts dumped onto the instances by a custom daemon. When we updated config we had to ssh to all the boxes and restart everything. Moving it all to containers and into kubernetes took probably two years, and we had a lot to learn along the way, from cluster sizing to resource management, networking and autoscaling. As this migration evolved we replaced our initial custom CI/CD tooling with helm charts, and later mostly moved from those to a toolchain using kustomize to patch yaml. Before joining Olark I had played with running http servers and whatnot in early beta clusters; here at the end of my time with the company we run five clusters with nearly 100 nodes. The curve of our adoption of k8s closely paralleled the curve of Google’s rollout and evolution of GKE. In those years from 2017 through 2022 we worked through a constant stream of changes: everything from statefulsets to ephemeral storage and container-native load balancing. Throughout it all GKE has been an extremely performant and reliable platform for us, and I think it’s fair to say we met all of our initial goals for the migration and then some.

All you need is logs (and then metrics)

We hadn’t been on GCP for long when our monthly bills convinced us that we were going to need our own aggregation, storage and visualization pipeline for logs and metrics. Google had bought Stackdriver a few years earlier, and while I don’t remember the pricing model or numbers I do remember it wasn’t working for us. I think the value prop has since flipped, but at the time building our own was the only option. We ended up with fluent-bit on our remaining VMs, fluentd in the cluster (a daemonset tailing container logs on every node and a dedicated indexing deployment), elasticsearch for storage and kibana for querying. At peak utilization we were handling 15-20k log lines per second. Figuring out the sizing of the elasticsearch cluster was a bit of black magic, and we ended up replacing the whole thing a year in, but it’s been doing its job now for seven years. Later on I got to implement prometheus and grafana for monitoring, and probably the best part of that was working with all the other engineers to instrument the services they were responsible for.
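
Since most of our services were python, instrumenting one mostly came down to a few lines with the prometheus_client library. The sketch below is just an illustration of that pattern; the metric names, labels and port are invented for the example, not anything we actually shipped.

```python
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Hypothetical metrics for a chat-ish service; names and labels are
# illustrative only.
MESSAGES = Counter("chat_messages_total", "Messages processed", ["result"])
LATENCY = Histogram("chat_message_seconds", "Time spent handling a message")

@LATENCY.time()
def handle_message():
    time.sleep(random.uniform(0.001, 0.05))  # stand-in for real work
    MESSAGES.labels(result="ok").inc()

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for the prometheus scraper
    while True:
        handle_message()
```

Once a service exposes an endpoint like this, prometheus scrapes it on a schedule and grafana queries the resulting time series for dashboards and alerts.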

Out of state

State is always the hard bit. We migrated into GCP with a bunch of databases, elasticsearch and redis clusters, etcd clusters and memcached instances. We messed around with moving some of it into k8s, but it was early days for statefulsets and we weren’t happy with the storage performance at the time. It quickly became a policy and medium-term goal to get as much of that stuff onto managed services as we could. We moved the dbs to Cloud SQL first, then moved all the redis instances to MemoryStore. We retained the elasticsearch clusters for lack of a fully managed GCP-hosted alternative, but if one had been available we would certainly have been tempted. In all of these cases the hosted alternative has been more performant and cost-effective, and the migrations took a big chunk of responsibility off the backs of our small engineering team.

Migrating our DNS

After the migration to GCP our public DNS zones remained hosted on Rackspace, while our private zones were implemented on AWS Route 53. In mid-2023 we decided to move it all to Google Cloud DNS. The motivation was to finally close out our Rackspace account, clean up a lot of ancient cruft in the zones, and bring it all under terraform management so that it could be more accessible to the engineering team. We extracted all of our existing records as bind-formatted zone files, a process that was simple at AWS but required a support request at Rackspace. We then cleaned out all the obsolete records, wrote code to generate terraform markup from the zone files, created the new zones, wrote some more code to compare the records in the new and old zones and validate that they were identical, and finally updated the name server records at our registrar. The whole thing went very smoothly and without any disruption for our customers. We’ve since been quite happy with the performance of Cloud DNS and the process of managing our zones with terraform. The terraform markup for DNS records is pretty cumbersome (each record is a named terraform resource), but it is still an improvement over the previous imperative approach to managing our DNS.
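
I won’t reproduce the actual tool here, but the generation step amounted to walking the zone data and emitting one google_dns_record_set resource per record set. Here is a rough sketch of the idea in python using the dnspython library; it is not our actual code, and the file and zone names are made up.

```python
import dns.rdatatype
import dns.zone

def zone_to_terraform(zone_path: str, origin: str, managed_zone: str) -> str:
    """Emit a google_dns_record_set block for each record set in a
    bind-formatted zone file. A sketch only; a real tool also has to
    deal with TXT quoting and other escaping edge cases."""
    zone = dns.zone.from_file(zone_path, origin=origin, relativize=False)
    blocks = []
    for name, rdataset in zone.iterate_rdatasets():
        rtype = dns.rdatatype.to_text(rdataset.rdtype)
        if rtype == "SOA":
            continue  # Cloud DNS manages the SOA record itself
        label = f"{str(name).rstrip('.').replace('.', '_').replace('*', 'star')}_{rtype.lower()}"
        rrdatas = ", ".join(f'"{rdata}"' for rdata in rdataset)
        blocks.append(
            f'resource "google_dns_record_set" "{label}" {{\n'
            f'  managed_zone = "{managed_zone}"\n'
            f'  name         = "{name}"\n'
            f'  type         = "{rtype}"\n'
            f'  ttl          = {rdataset.ttl}\n'
            f'  rrdatas      = [{rrdatas}]\n'
            f'}}\n'
        )
    return "\n".join(blocks)

if __name__ == "__main__":
    # Hypothetical file and zone names.
    print(zone_to_terraform("example.com.zone", "example.com.", "example-com"))
```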

Thanks a bunch Edge.io

The looming holiday season of 2024 brought a fun email from Edge.io, the CDN we had been caching our content on since way back when they were Verizon Digital Media Services. They were shutting down the network in mid-January of 2025, and we would have to find a new place to cache our stuff. A notification that you have to switch to a new CDN and have only a couple of weeks to do it in is not the kind of holiday card I like to find in the mailbox. After a crash program to investigate the capabilities and pricing of a number of leading CDN providers we settled on Cloudflare, and were able to migrate our objects and switch over seamlessly while saving a substantial amount of money at the same time. Cloudflare, of course, is much more than a CDN, and once we had the relationship in place we began finding uses for some of their other tools, like edge workers. Maybe that was the point. Well played, Cloudflare.

Shout out to Gitlab

It’s not easy being a small engineering team running a complex distributed system built from dozens of critical open source components. One of the ways in which it is not easy is keeping up with the pace of change while ensuring that new versions of things play nice with older versions of other things, and that nothing breaks horribly because you had the temerity to move from happyfreestuff 2.02 to 2.03. The perfect storm is when something is critical, complex, and scary to upgrade. Such things tend to languish. Since not long after our migration to GCP we had been running our own Gitlab server. I’m not proud of it, but by mid-2024 the current version was 17.4, we were still chugging along on 13.whatever, and we had a problem: Gitlab’s license file format had changed and we needed to upgrade in order to renew ours. Fortunately Gitlab has an awesome upgrade path planning tool: you give it the currently running and target versions and it tells you which versions you need to install to get from A to B. We cloned the server, ran through the upgrade on the clone successfully, then did it again for reals. Building some new runners was easier than upgrading the ones we had, and we took the opportunity to give them a little more muscle too.

What it was that made Olark fun

I’ve been with Olark longer than any other company I’ve worked for, including the one I co-founded back in the late ’90s. When I think about what it was that kept me there I end up with a short list of things that made it a great place to work. First, a sense of mission. The company had serious platform problems when I arrived. A lot of things had to be fixed. Big changes had to be made, and there was support for making them. That’s always exciting. Second, a really sharp team of people I identified with (though I am older than most of them). I remember a moment during my work-along day when one of the other guys had broken my demo and we were laughing at a joke I cracked about it. I mentioned something about my computer and one of them asked if I built it. Yeah, of course I built it. Asus motherboard, Fractal Design case, EVGA 1070 Ti. We’re all geeks on this call. We all enjoyed building stuff, and the virtual high fives when you got a thing running were as real as virtual high fives get. That is the thing that still gets me up and to work every day, and the reason why I am still doing engineering after 30 years. So thanks for that, Olark. I wish you all luck… and I’ll just leave this here…

The sincerest form of flattery

Originally published at https://medium.com/@betz.mark/the-sincerest-form-of-flattery-6688486327c6

Writing technical articles is hard work. I wrote my first one in 1993 for Dr. Dobb’s Journal, and since then I have written a couple of dozen more. Last year I wrote three posts here on kubernetes networking that proved pretty popular and were picked up by the Google Cloud community blog. Each of those posts took dozens of hours of writing and research, not to mention creating the accompanying graphics. And each of them got things wrong, despite my several years of experience with the platform and all the aforementioned research. As readers have chimed in with clarifications and corrections I have revisited the work and updated it where changes were needed. I know a lot of people are reading these posts and I’d like them to continue to be useful.

In all the years of writing I have never, as far as I know, been the source for a plagiarist. This is probably a testament to the level of obscurity in which I toiled. So I was fairly surprised when a kind reader named Ian Douglas reached out to me last week while I was attending Olark’s company retreat to let me know he had run into some content that was suspiciously similar to mine. I didn’t really have time to look into it until I returned home last night. When I did, sure enough, the content was suspiciously similar to mine. I’ll let you draw your own conclusions. Here’s a link to my post on pod networking, the first in the series, and the other guy’s post on the same topic:

Me: Understanding kubernetes networking: pods

Him: How Does The Kubernetes Networking Work? : Part 1

The pattern continues for the whole series, but it would be tiresome to post them all. At least the author took the time to rewrite, rather than simply copy and paste text extracted from my posts. But the graphics were just snatched wholesale, and of course none of it is attributed to me.

Now, to be clear, I don’t make any money off these posts. Nobody has even offered me a job because of these posts, which is fine because I’m not looking for one. I don’t really give a shit if someone copies them. My instinctive reaction would usually be “whatever.” If this had proven to be some small outfit in a developing nation copying my stuff for their website, I’d be like: hey, if copying my stuff helps you get your business off the ground and make some money, have at it. But the author of these derivative works is someone by the name of James Lee, whose profile identifies him as an ex-Googler who lives in San Francisco.

I mean come on, man. I don’t even get to live in San Francisco. I live in one of the more expensive parts of New Jersey, where people from San Francisco come to downscale and improve monthly cash flow. Ok, that’s false, but it does strangely bother me more that I’ve been ripped off by someone who is probably a verified member of the privileged tech class. Maybe this is related to why he’s an ex-Googler. Who knows? But seriously, privileged people should not steal. It’s like taking two flutes of champagne from the tray at a fundraiser. There’s only so much ripping off that can be tolerated in a given period of time, and I think we should save that for people who need it.

So, James, if you’re looking to raise your profile and give your company a boost, the best way is persistent work. You can try an end run around that lamentable fact, but it will almost always come back to haunt you later. Like this.

Moving to a New Host

After five years running my site on Network Solutions I finally grew tired of their performance issues. It was taking 60-120 seconds to load the home page on some occasions, and inquiries to tech support just got me form replies reminding me I was on a shared platform and advising me to optimize my javascript and load my images from a CDN. Yes, CDNs are good. If you have tons of traffic, using a CDN can move a lot of load off your server. Some day, if I have tons of traffic, I might care about CDNs. At the moment I don’t have tons of traffic, don’t aspire to having tons of traffic, and don’t think it’s taking Chrome 120 seconds to download and process the javascript in my WordPress install. But thanks for the advice, Network Solutions. I’ve moved my site to Rochen Hosting, and the pages load in about 2 seconds. Meanwhile, you might want to take a gander at the load on your MySQL servers.

GSearch 1.1 Update

Trying to get a few things cleared away before I start a new position on Monday morning. Shortly after the January release of my GSearch libraries for .NET 3.5 and Silverlight 2, CodePlex user FBrink discovered that my conversions for latitude/longitude in the local search class were naive, in that they failed when the current culture uses the comma character as a decimal separator. Tonight I released version 1.1 of the .NET and SL libraries, which corrects that issue, as well as updating some similarly naive code in the search class event-raising machinery. You can check out the release notes and grab the latest source and runtimes at the CodePlex project page.
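
For anyone curious about the class of bug: in many locales 48.8566 renders as “48,8566”, and separator-naive conversions choke when such values cross a boundary that expects periods. The usual remedy in .NET is to format and parse with CultureInfo.InvariantCulture rather than the current culture. Here is a quick illustration of the failure mode in python, purely as a stand-in since the library itself is C#:

```python
# A stand-in for the failure mode: a coordinate formatted with a comma
# decimal separator will not parse with a separator-naive routine.
def naive_parse(text: str) -> float:
    return float(text)  # raises ValueError on "48,8566"

def tolerant_parse(text: str) -> float:
    # Accept either decimal separator; fine for lat/long values, which
    # never contain thousands separators.
    return float(text.replace(",", ".", 1))

print(tolerant_parse("48,8566"))   # 48.8566
print(tolerant_parse("48.8566"))   # 48.8566
try:
    naive_parse("48,8566")
except ValueError as exc:
    print(f"naive parse fails: {exc}")
```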

Gradient Editor Update

I’ve released version 1.0 of the Silverlight Gradient Editor. This version fixes a few small bugs and usability issues, and adds support for transparent gradient stops using the system color picker. This will probably be the last release of the editor unless I discover some problems. There’s not really much more to do with it. It came into existence because I needed a gradient editor control for the drawing application I’m working on, and once I had one it was a very small leap to add some code output and slap it into a web page. Hope you find it useful. If you do, or if you run into some problems, drop me a note and let me know.

The GSearch Lib – Google Searches from .NET and Silverlight

In the process of working on GMemory, a Silverlight 2 game I wrote as an exercise a couple of weeks back, I became familiar with Google’s RESTful web service API. Using this API, applications can execute searches and receive results back. The API is not a complete drop-in replacement for the full-blown Google search engine – it can produce at most 64 results (8 items per page over 8 pages), for example – but it is an interesting and useful way to incorporate search results into your applications. Results come back in the form of nested JSON types, which are easily deserialized into .NET classes once you understand the structure and get the type definitions correct. I had to do that for image searches to make GMemory work, so once that was done I decided to go ahead and implement the rest of the search types as well.

The result is GSearch, a library of classes for searching Google from .NET 3.5 and Silverlight 2 managed code. The library encompasses all the supported search types in the current version of the Google API, meaning blogs, books, images, locations, news, patents, video, and web pages. The classes are very easy to use, and you’ll find some examples in the readme files accompanying the runtime packages. The .NET distribution also includes GSearchPad, a WPF example program that will allow you to execute any of the search types with custom arguments and display the results.

GSearch is copyrighted software released under a permissive BSD license. Feel free to play around with it and use it in your own commercial or noncommercial apps. This is the first release, and there are sure to be some warts left in it. If you find one, or have a question, please feel free to drop a comment here or shoot me an email.

OBX Time

Over the last week my family and I had the pleasure of joining the rest of our far-flung and extended clan in Nag’s Head for the wedding of my brother and his delightful fiancee. Which is to say that by the end of the week the clan was even more extended and far-flung than it had been when we started. The wedding was held on Coquina beach, across from the access road to Bodie Island Light. The weather, and the relatives (for the most part), behaved admirably, and an excellent time was had by all. I had never been in the Outer Banks area before, and was captivated by the dunes, the long stretches of pristine beach, the Atlantic breakers rolling in after three days of steady winds. I managed to get out and take some pictures, capturing some scenes from Oregon Inlet all the way to Hatteras Light. I’ve collected the better ones in this gallery. Have a look and let me know what you think!

WordPress 2.6 Follow-up

I figured out that the little balloon with the number “2” in it that appeared next to the “Plugins” menu choice in the admin screen meant that WordPress thought there were two plugins that required upgrading. I’m not sure why it thought that, as there was really just one. But that one didn’t show up right after upgrading to 2.6. It showed up the next day. Once I upgraded Sociable the tooltip disappeared.

WordPress 2.6

Well, I got around to upgrading this evening. I’m not sure whether I would ever go through this if it weren’t for the pale yellow nag bar that appears in the admin screen whenever there’s a new version. That’s a pretty effective device. Everything went smoothly, though it took a few minutes longer because I decided to actually follow advice this time and grab a database backup and a copy of the existing files. One little weirdness: since upgrading, one of those little “tool tip balloons” appears next to “plugins” in the admin menu. It has the number “2” in it. Apparently there are two plugins that want… something. I understand when comments want to be admin’d, but I’m not sure what these two plugins want, or which plugins it is that want something. Clicking the balloon just goes to the plugin admin screen. Other than that I haven’t noticed any difference, which I’ll take as a good thing. Well, the nag bar did go away, and I guess that’s good too.

Hawk News

I left a message this morning at The Raptor Trust inquiring as to the progress of the young Broadwing Hawk we brought them last week, something they encourage people to do through an option on their voicemail system. Not all animal welfare organizations will provide progress updates to the people who rescue wildlife, so my daughters and I very much appreciate this aspect of their operation. A short time later I received a call from Donna, who told me that the bird was doing well and that she had been in an outdoor aviary since the 6th of July. They were unable to find any medical problem; however, Donna noted that she is a young bird and that, to paraphrase her comments, fledglings often get into trouble. Whether she was just confused and lost, or had been sideswiped by a car without sustaining serious injury, we’ll never know, but at least she is well now, and we can look forward to her release back into the skies of Northern New Jersey. I can’t say enough about the professionalism and care that the folks at the Trust have displayed, and I encourage anyone who finds an injured raptor to contact them.