When I started working with kubernetes at scale I began encountering something that didn’t happen when I was just running experiments on it: occasionally a pod would get stuck in pending status because no node had sufficient cpu or ram available to run it. You can’t add cpu or ram to a node, so how do you un-stick the pod? The simplest fix is to add another node, and I admit resorting to this answer more than once. Eventually it became clear that this strategy fails to leverage one of kubernetes’ greatest strengths: its ability to efficiently utilize compute resources. The real problem in many of these cases was not that the nodes were too small, but that we had not accurately specified resource limits for the pods.
Resource limits are the operating parameters you provide to kubernetes to tell it two critical things about your workload: what resources it requires to run properly, and the maximum resources it is allowed to consume. The first is a critical input to the scheduler that enables it to choose the right node on which to run the pod. The second is important to the kubelet, the daemon on each node that is responsible for pod health. While most readers of these posts probably have at least a basic familiarity with the concept of resources and limits, there is a lot of interesting detail under the hood. In this two-part series I’m going to first look closely at memory limits, and then follow up with a second post on cpu limits.
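To make that distinction concrete, here is a minimal sketch of what the two settings look like in a container spec. The pod name, image, and values are purely illustrative, not taken from any real workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limit-demo                # hypothetical pod name
spec:
  containers:
  - name: app
    image: busybox                # placeholder image
    command: ["sleep", "3600"]
    resources:
      requests:
        memory: 64Mi              # what the scheduler reserves when placing the pod
      limits:
        memory: 128Mi             # the most the container is allowed to consume
```

The request is what the scheduler uses to find a node with enough room; the limit is what the kubelet and the kernel enforce once the container is running.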
In the first post of this two-part series on resource limits in kubernetes I discussed how the ResourceRequirements object was used to set memory limits on containers in a pod, and how those limits were implemented by the container runtime and linux control groups. I also talked about the difference between requests, used to inform the scheduler of a pod’s requirements at schedule time, and limits, used to assist the kernel in enforcing usage constraints when the host system is under memory pressure. In this post I want to continue by looking in detail at cpu time requests and limits. Having read the first post is not a prerequisite to getting value from this one, but I encourage you to read them both at some point to get a complete picture of the controls available to engineers and cluster administrators.
CPU limits
As I mentioned in the first post, cpu limits are more complicated than memory limits, for reasons that will become clear below. The good news is that cpu limits are controlled by the same cgroups mechanism that we just looked at, so all the same ideas and tools for introspection apply, and we can just focus on the differences. Let’s start by adding cpu limits back into the example resources object that we looked at last time (the values here are illustrative rather than recommendations):
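```yaml
resources:
  requests:
    memory: 64Mi
    cpu: 50m        # 50 millicores: the slice of cpu the scheduler reserves for the container
  limits:
    memory: 128Mi
    cpu: 100m       # a hard ceiling, enforced at runtime by the kernel's cpu bandwidth control
```

As with memory, the request matters at schedule time and the limit matters at runtime, but cpu is a compressible resource: a container that hits its cpu limit gets throttled rather than killed.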
A post by Caleb Doxsey a week or so back generated a fair bit of discussion when it was shared on hacker news. In it he talked about deploying a small project on kubernetes and demonstrated some of the techniques he used. The resulting comment thread highlighted a debate that is currently common in certain software communities: either kubernetes is the awesome-est sauce ever and you should use it for everything, or it’s a massive pile of overkill for all but the largest organizations. Whichever side a commenter takes, pro or con, the opinion often extends to containers themselves. Docker is awesome… docker is a crutch for people who can’t even chroot, etc.
In this post I want to talk about why the truth is, unsurprisingly, somewhere in the middle, but more than that I want to explain why I think containers and orchestration are a fundamental shift in the level of abstraction at which we interact with compute resources. Many of the most significant changes in the way we practice software development, deployment and operations over the last 50 years have been changes in the level of abstraction of the interface between us and the things we’re working with. Abstractions are incredibly potent and useful tools for grappling with (seemingly) unbounded complexity, but when good new ones come along they are not always welcomed with open arms. I think there are some fundamental reasons for that.
In twenty-five years as a professional software person I’ve done quite a few things, but they have all focused on, or at least started with, writing code. I am basically a programmer. Somewhere along the way, around the 2000s if I recall, the term “software engineer” became the fashionable title. I always felt a little silly using it because I don’t have a degree in software anything, and my Dad is an actual engineer with stuff hanging on his walls to prove it. I didn’t go to family parties and talk about what an engineer I was. In fact I’m not sure I ever actually had the word “engineer” in my title until now. In this post I’m going to talk a little bit about how that changed.
Back in 2015 I had just finished up a gig writing a specialty search engine from the ground up, working on a two-man team with friend and repeat colleague Joey Espinosa. With just two of us working for a somewhat tech-savvy business person, that project was hands-on full-stack everything. We did the data layer, scraping engine, customized spiders for horrible ancient broken sites, web layer, networking, admin, everything. We deployed the app in docker containers using custom scaffolding on AWS instances. It was a ton of fun almost all the time, but business-wise it went nowhere.
Writing technical articles is hard work. I wrote my first one in 1993 for Dr. Dobb’s Journal (a link, more or less), and since then I have written a couple of dozen more. Last year I wrote three posts here on kubernetes networking that proved pretty popular and were picked up by the Google Cloud community blog. Each of these posts took dozens of hours of writing and research, not to mention creating accompanying graphics. And each of the posts got things wrong, despite my several years of experience with the platform and all the aforementioned research. As readers have chimed in with clarifications and corrections I have revisited the work and updated it where changes were needed. I know a lot of people are reading them and I’d like them to continue to be useful.
In all the years of writing I have never, as far as I know, been the source for a plagiarist. This is probably a testament to the level of obscurity in which I toiled. So I was fairly surprised when a kind reader named Ian Douglas reached out to me last week while I was attending Olark’s company retreat to let me know he had run into some content that was suspiciously similar to mine. I didn’t really have time to look into it until I returned home last night. When I did, sure enough, the content was suspiciously similar to mine. I’ll let you draw your own conclusions. Here’s a link to my post on pod networking, the first in the series, and the other guy’s post on the same topic:
The pattern continues for the whole series, but it would be tiresome to post them all. At least the author took the time to rewrite, rather than simply copy and paste text extracted from my posts. But the graphics were just snatched wholesale, and of course none of it is attributed to me.
Now, to be clear, I don’t make any money off these posts. Nobody has even offered me a job because of these posts. Which is fine because I’m not looking for one. I don’t really give a shit if someone copies them. My instinctive reaction would usually be “whatever.” If this had proven to be some small outfit in a developing nation copying my stuff for their website I’d be like: hey, if copying my stuff helps you get your business off the ground and make some money have at it. But the author of these derivative works is someone by the name of James Lee whose profile identifies him as an ex-Googler who lives in San Francisco.
I mean come on, man. I don’t even get to live in San Francisco. I live in one of the more expensive parts of New Jersey, where people from San Francisco come to downscale and improve monthly cash flow. Ok, that’s false, but it does strangely bother me more that I’ve been ripped off by someone who is probably a verified member of the privileged tech class. Maybe this is related to why he’s an ex-Googler. Who knows? But seriously, privileged people should not steal. It’s like taking two flutes of champagne from the tray at a fundraiser. There’s only so much ripping off that can be tolerated in a given period of time, and I think we should save that for people who need it.
So, James, if you’re looking to raise your profile and give your company a boost, the best way is persistent work. You can try an end run around that lamentable fact, but it will almost always come back to haunt you later. Like this.
A couple of years ago my Dad and I began sifting through a treasure trove of family history that we had received, piecing together the story of our earliest ancestors in the U.S. Among these materials were many original documents in German, dating from the decades 1840 to 1870. These documents proved extremely difficult to translate, as documented elsewhere on this site. Nevertheless using various tools I was able to put together transcriptions and translations of many of the official Bavarian documents. I find these immensely interesting, and I hope they are useful not just to curious members of our sprawling family tree (Alois had ten children with two wives, the majority of whom survived into adulthood) but perhaps also to anyone interested in 19th century German writing and emigration stories.
The general theory of pod scheduling in kubernetes is to let the scheduler handle it. You tell the cluster to start a pod, the cluster looks at all the available nodes and decides where to put the new thing, based on comparing available resources with what the pod declares it needs. That’s scheduling in a nutshell. Sometimes, however, you need a little more input into the process. For example you may have been asked to run a thing that requires more resources than any single node in your cluster offers. You can add a new node with enough juice, maybe using a nodepool if you’re running on GKE, but how do you make sure the right pods run on it? How do you make sure the wrong pods don’t run on it?
You can often nudge the scheduler in the right direction simply by setting resource requests appropriately. If your new pod needs 5 GB of ram and the only node big enough is the one you added for it to run on, then setting the memory request for that pod to 5 GB will force the scheduler to put it there. This is a fairly fragile approach, however, and while it will get your pod onto a node with sufficient resources it won’t keep the scheduler from putting other things there as well, as long as they will fit. Maybe that’s not important, but if it is, or if for some other reason you need positive control over which nodes your pod schedules to, then you need the finer level of scheduling control that kubernetes offers through the use of taints, tolerations and affinity.
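Here is a rough sketch of what that finer control can look like. The node label, taint key, and values are hypothetical, and the taint and label are assumed to have been applied to the large node beforehand:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: big-workload                  # hypothetical pod name
spec:
  # assumes the large node was prepared with something like:
  #   kubectl label nodes <node-name> workload=big
  #   kubectl taint nodes <node-name> dedicated=big:NoSchedule
  tolerations:
  - key: dedicated
    operator: Equal
    value: big
    effect: NoSchedule                # lets this pod onto the tainted node; pods without the toleration stay off
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload
            operator: In
            values: ["big"]           # requires this pod to land on the labeled node
  containers:
  - name: app
    image: busybox                    # placeholder image
    resources:
      requests:
        memory: 5Gi                   # still worth declaring what the workload actually needs
```

The taint keeps everything else off the node, the toleration lets this pod in, and the affinity rule makes sure it goes there and nowhere else.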
As discussed in my recent post on kubernetes ingress, there is really only one way for traffic from outside your cluster to reach services running in it. You can read that article for more detail, but the tl;dr is that all outside traffic gets into the cluster by way of a nodeport, which is a port opened on every host/node. Nodes are ephemeral things and clusters are designed to scale up and down, and because of this you will always need some sort of load balancer between clients and the nodeports. If you’re running on a cloud platform like GKE then the usual way to get there is to use a type LoadBalancer service or an ingress, either of which will build out a load balancer to handle the external traffic.
This isn’t always, or even most often, what you want. Your case may vary, but at Olark we deploy a lot more internal services than we do external ones. Up until recently the load balancers created by kubernetes on GKE were always externally visible, i.e. they were allocated a public IP that is reachable from outside the project. Maintaining firewall rules to sandbox lots of internal services is not a tradeoff we want to make, so for these use cases we created our services as type NodePort, and then provisioned an internal TCP load balancer for them using terraform.
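For reference, a NodePort service for one of these internal workloads looks roughly like the sketch below. The names, ports, and selector are placeholders, and the internal TCP load balancer that targets the node port is provisioned separately (in our case with terraform):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: internal-api                  # placeholder service name
spec:
  type: NodePort
  selector:
    app: internal-api                 # matches the labels on the backing pods
  ports:
  - port: 80                          # cluster-internal service port
    targetPort: 8080                  # container port on the pods
    nodePort: 30080                   # opened on every node; the internal load balancer points here
```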
There’s a reason why the kubernetes project is the current crown jewel of the cloud native community, with attendance at Kubecon 2017 in Austin nearly four times that of last year’s conference in Seattle and seemingly every major enterprise vendor perched behind a booth in the exhibit hall eager to help attendees take advantage of the platform. The reason is that the advantages are significant, especially in those areas that matter most to developers and system engineers: application reliability, observe-ability, control-ability and life-cycle management. If Docker built the engine of the container revolution then it was kubernetes that supplied the chassis and got it up to highway speed.
But driving at highway speed means keeping your hands on the wheel and obeying the rules of the road. Kubernetes has its own rules, and applications that adhere to best practices with respect to certain key touch points are much less likely to wipe out and take a few neighboring lanes of traffic with them. In this post I am going to briefly discuss five important design features that will affect how well your application behaves when running on kubernetes: configuration, logging, signal handling, health checks, and resource limits. The treatment of each topic will be necessarily high level, but I will provide links to more detailed information where it will be useful.
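To give a flavor of two of those touch points, health checks and resource limits, here is an illustrative container spec fragment; the image, endpoint path, and values are placeholders rather than recommendations:

```yaml
containers:
- name: web
  image: example/web:1.0              # placeholder image
  ports:
  - containerPort: 8080
  readinessProbe:                     # tells kubernetes when the pod is ready to receive traffic
    httpGet:
      path: /healthz                  # hypothetical health endpoint
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:                      # tells kubernetes when the pod needs to be restarted
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 20
  resources:
    requests:
      memory: 128Mi
      cpu: 100m
    limits:
      memory: 256Mi
      cpu: 250m
```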
In the first post of this series I described the network that enables pods to connect to each other across nodes in a kubernetes cluster. The second focused on how the service network provides load balancing for pods so that clients inside the cluster can communicate with them reliably. For this third and final installment I want to build on those concepts to show how clients outside the cluster can connect to pods using the same service network. For various reasons this will likely be the most involved of the three, and the concepts introduced in parts one and two are prerequisites to getting much value out of what follows.
First, having just returned from kubecon 2017 in Austin I’m reminded of something I might have made clear earlier in the series. Kubernetes is a rapidly maturing platform. Much of the architecture is pluggable, and this includes networking. What I have been describing here is the default implementation on Google Kubernetes Engine. I haven’t seen Amazon’s Elastic Kubernetes Service yet, but I expect the default implementation there will be similar. To the extent that kubernetes has a “standard” way of handling networking, I think these posts describe it in its fundamental aspects. You have to start somewhere, and getting these concepts well in hand will help when you start to think about alternatives like unified service meshes, etc. With that said, let’s talk ingress.
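As a starting point, a minimal ingress looks something like the sketch below, using the extensions/v1beta1 API that was current when this was written; the hostname and service name are placeholders:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: demo-ingress                  # placeholder name
spec:
  rules:
  - host: app.example.com             # placeholder hostname
    http:
      paths:
      - path: /
        backend:
          serviceName: demo-service   # an existing service; traffic ultimately reaches its pods via nodeports
          servicePort: 80
```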