Understanding resource limits in kubernetes: cpu time

Originally published at https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-cpu-time-9eff74d3161b

In the first post of this two-part series on resource limits in kubernetes I discussed how the ResourceRequirements object was used to set memory limits on containers in a pod, and how those limits were implemented by the container runtime and linux control groups. I also talked about the difference between requests, used to inform the scheduler of a pod’s requirements at schedule time, and limits, used to assist the kernel in enforcing usage constraints when the host system is under memory pressure. In this post I want to continue by looking in detail at cpu time requests and limits. Having read the first post is not a prerequisite to getting value from this one, but I encourage you to read them both at some point to get a complete picture of the controls available to engineers and cluster administrators.

CPU limits

As I mentioned in the first post, cpu limits are more complicated than memory limits, for reasons that will become clear below. The good news is that cpu limits are controlled by the same cgroups mechanism that we looked at in that post, so all the same ideas and tools for introspection apply, and we can just focus on the differences. Let’s start by adding cpu limits back into the example resources object that we looked at last time:

resources:
  requests:
    memory: 50Mi
    cpu: 50m
  limits:
    memory: 100Mi
    cpu: 100m

The unit suffix m stands for “thousandth of a core,” so this resources object specifies that the container process needs 50/1000 of a core (5%) and is allowed to use at most 100/1000 of a core (10%). Likewise 2000m would be two full cores, which can also be specified as 2 or 2.0. Let’s create a pod with just a request for cpu and see how this is configured at the docker and cgroup levels:

$ kubectl run limit-test --image=busybox --requests "cpu=50m" --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created
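The cpu=50m quantity used above is expressed in millicores. As a rough sketch of how such quantities relate to whole cores, here is a small Python helper. The function name cpu_to_millicores is my own, and it only handles the plain-number and m-suffix forms mentioned in this post, not every quantity format kubernetes accepts:

```python
from decimal import Decimal


def cpu_to_millicores(quantity: str) -> int:
    """Convert a cpu quantity such as "50m", "2" or "2.0" to millicores.

    Hypothetical helper for illustration only; it understands just the
    "m" suffix and plain decimal numbers.
    """
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # Already expressed in thousandths of a core.
        return int(quantity[:-1])
    # A bare number is a count of whole cores.
    return int(Decimal(quantity) * 1000)


for q in ("50m", "100m", "2", "2.0", "0.5"):
    print(q, "->", cpu_to_millicores(q), "millicores")
```

Decimal is used instead of float so that values like 0.1 convert without rounding surprises.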

We can see that kubernetes configured the 50m cpu request:

$ kubectl get pods limit-test-5b4c495556-p2xkr -o=jsonpath='{.spec.containers[0].resources}'
map[requests:map[cpu:50m]]

We can also see that docker configured the container with the same request:

$ docker ps | grep busy | cut -d' ' -f1
f2321226620e
$ docker inspect f2321226620e --format '{{.HostConfig.CpuShares}}'
51

Why 51, and not 50? The cpu control group and docker both divide a core into 1024 shares, whereas kubernetes divides it into 1000, so 50m becomes 50 × 1024 / 1000 = 51.2, which is truncated to 51. How does docker apply this request to the container process? In the same way that setting memory limits caused docker to configure the process’s memory cgroup, setting cpu requests and limits causes it to configure the cpu,cpuacct cgroup:

$ ps ax | grep /bin/sh
   60554 ?      Ss     0:00 /bin/sh -c while true; do sleep 2; done
$ sudo cat /proc/60554/cgroup
$ ls -l /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pode12b33b1-db07-11e8-b1e1-42010a800070/3be263e7a8372b12d2f8f8f9b4251f110b79c2a3bb9e6857b2f1473e640e8e75
total 0
drwxr-xr-x 2 root root 0 Oct 28 23:19 .
drwxr-xr-x 4 root root 0 Oct 28 23:19 ..
-rw-r--r-- 1 root root 0 Oct 28 23:19 cpu.shares

Docker’s HostConfig.CpuShares container property maps to the cpu.shares property of the cgroup, so let’s look at that:

$ sudo cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podb5c03ddf-db10-11e8-b1e1-42010a800070/64b5f1b636dafe6635ddd321c5b36854a8add51931c7117025a694281fb11444/cpu.shares
51
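The value 51 discussed above comes from rescaling millicores (1000 per core) to cpu shares (1024 per core). Here is a minimal sketch of that arithmetic. The function name is hypothetical, and I am assuming integer truncation plus a floor of 2 shares, which is the smallest value the kernel accepts for cpu.shares:

```python
SHARES_PER_CPU = 1024  # cgroup cpu.shares units for one full core
MILLI_PER_CPU = 1000   # kubernetes millicores for one full core
MIN_SHARES = 2         # assumed floor: smallest cpu.shares the kernel accepts


def millicores_to_shares(millicores: int) -> int:
    """Rescale a cpu request in millicores to a cpu.shares value."""
    shares = millicores * SHARES_PER_CPU // MILLI_PER_CPU  # truncates 51.2 to 51
    return max(shares, MIN_SHARES)


for m in (50, 100, 1000, 2000):
    print(f"{m}m -> cpu.shares={millicores_to_shares(m)}")
```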

You might be surprised to see that setting a cpu request propagates a value to the cgroup, given that in the last post we saw that setting a memory request did not. The bottom line is that kernel behavior with respect to memory soft limits is not very useful to kubernetes, whereas setting cpu.shares is useful. I’ll talk more about why below. So what happens when we also set a cpu limit? Let’s find out:

$ kubectl run limit-test --image=busybox --requests "cpu=50m" --limits "cpu=100m" --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created

Now we can also see the limit in the kubernetes pod resource object:

$ kubectl get pods limit-test-5b4fb64549-qpd4n -o=jsonpath='{.spec.containers[0].resources}'
map[limits:map[cpu:100m] requests:map[cpu:50m]]

And in the docker container config:

$ docker ps | grep busy | cut -d' ' -f1
472abbce32a5
$ docker inspect 472abbce32a5 --format '{{.HostConfig.CpuShares}} {{.HostConfig.CpuQuota}} {{.HostConfig.CpuPeriod}}'
51 10000 100000

The cpu request is stored in the HostConfig.CpuShares property as we saw above. The cpu limit, though, is a little less obvious. It is represented by two values: HostConfig.CpuPeriod and HostConfig.CpuQuota. These docker container config properties map to two additional properties of the process’s cpu,cpuacct cgroup: cpu.cfs_period_us and cpu.cfs_quota_us. Let’s take a look at those:

$ sudo cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod2f1b50b6-db13-11e8-b1e1-42010a800070/f0845c65c3073e0b7b0b95ce0c1eb27f69d12b1fe2382b50096c4b59e78cdf71/cpu.cfs_period_us
100000
$ sudo cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod2f1b50b6-db13-11e8-b1e1-42010a800070/f0845c65c3073e0b7b0b95ce0c1eb27f69d12b1fe2382b50096c4b59e78cdf71/cpu.cfs_quota_us
10000

As expected these are set to the same values as specified in the docker container config. But how do the values of these two properties derive from the 100m cpu limit setting in our pod, and how do they implement that limit? The answer lies in the fact that cpu requests and cpu limits are implemented using two separate control systems. Requests use the cpu shares system, the earlier of the two. Cpu shares divide each core into 1024 slices and guarantee that each process will receive its proportional share of those slices. If there are 1024 slices and each of two processes sets cpu.shares to 512, then they will each get about half of the available time. The cpu shares system, however, cannot enforce upper bounds. If one process doesn’t use its share the other is free to.
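To make the proportional behavior of cpu shares concrete, here is a toy model in Python. It is only a sketch: the function name and the idea of passing in a set of "busy" processes are my own simplifications, and the real scheduler is far more involved. It does show the two properties described above: busy processes split the cpu in proportion to their shares, and an idle process's share is simply available to the others.

```python
from typing import Dict, Set


def cpu_fractions(shares: Dict[str, int], busy: Set[str]) -> Dict[str, float]:
    """Fraction of cpu time each process receives in a simplified shares model.

    Only processes that currently want to run (busy) compete; each gets
    its shares divided by the total shares of all busy processes.
    """
    total = sum(shares[name] for name in busy)
    return {
        name: (shares[name] / total if name in busy else 0.0)
        for name in shares
    }


procs = {"a": 512, "b": 512}
print(cpu_fractions(procs, busy={"a", "b"}))  # both busy: half each
print(cpu_fractions(procs, busy={"a"}))       # b idle: a may use everything
```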

Around 2010 Google and others noticed that this could cause issues. In response a second and more capable system was added: cpu bandwidth control. The bandwidth control system defines a period, which is usually 1/10 of a second, or 100000 microseconds, and a quota, which is the maximum amount of cpu time, in microseconds, that a process is allowed to use within that period. In this example we asked for a cpu limit of 100m on our pod. That is 100/1000 of a core, or 10000 out of 100000 microseconds of cpu time. So our limit request translates to setting cpu.cfs_period_us=100000 and cpu.cfs_quota_us=10000 on the process’s cpu,cpuacct cgroup. The cfs in those names, by the way, stands for Completely Fair Scheduler, which is the default linux cpu scheduler. There’s also a realtime scheduler with its own corresponding quota values.
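The translation from a millicore limit to a quota is simple arithmetic, sketched below. The function name is hypothetical, and the 1000 microsecond floor is an assumption on my part about the smallest quota worth setting, not something shown in the output above:

```python
CFS_PERIOD_US = 100_000  # default period: 1/10 of a second in microseconds
MIN_QUOTA_US = 1_000     # assumed floor for very small limits


def millicores_to_quota(millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Microseconds of cpu time allowed per period for a given cpu limit."""
    quota = millicores * period_us // 1000
    return max(quota, MIN_QUOTA_US)


for m in (100, 500, 2000):
    print(f"{m}m -> cpu.cfs_quota_us={millicores_to_quota(m)} "
          f"with cpu.cfs_period_us={CFS_PERIOD_US}")
```

Note that a limit above 1000m produces a quota larger than the period, which is how a multi-threaded process is allowed more than one core's worth of time.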

So we’ve seen that setting a cpu request in kubernetes ultimately sets the cpu.shares cgroup property, and setting cpu limits engages a different system through setting cpu.cfs_period_us and cpu.cfs_quota_us. As with memory limits, the request is primarily useful to the scheduler, which uses it to find a node with at least that many cpu shares available. Unlike memory requests, setting a cpu request also sets a property on the cgroup that helps the kernel actually allocate that number of shares to the process. Cpu limits are also treated differently from memory limits. Exceeding a memory limit makes your container process a candidate for oom-killing, whereas your process basically can’t exceed the set cpu quota, and will never get evicted for trying to use more cpu time than allocated. The system enforces the quota at the scheduler so the process just gets throttled at the limit.
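To get a feel for what throttling means in practice, here is an idealized model of how long a cpu-bound task takes in wall-clock time under a quota. It is a sketch under some loud assumptions: a single thread, work that starts exactly at the beginning of a period, and a process that always gets the cpu immediately when it is allowed to run. The function name is my own.

```python
def wall_time_us(cpu_needed_us: int,
                 quota_us: int = 10_000,
                 period_us: int = 100_000) -> int:
    """Wall-clock microseconds to finish cpu_needed_us of work under a quota.

    Idealized: the task runs for quota_us at the start of each period and
    is then throttled until the next period begins.
    """
    if cpu_needed_us <= 0:
        return 0
    full_periods, remainder = divmod(cpu_needed_us, quota_us)
    if remainder == 0:
        # The work ends exactly when the quota of the last period runs out.
        return (full_periods - 1) * period_us + quota_us
    return full_periods * period_us + remainder


# 200ms of cpu work under the 100m limit from this post.
print(wall_time_us(200_000) / 1_000_000, "seconds")
```

In this model 200 milliseconds of work takes about 1.9 seconds of wall-clock time under a 100m limit, which is why a limit set too low shows up as latency rather than as a crash.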

What happens if you don’t set these properties on your container, or set them to inaccurate values? As with memory, if you set a limit but don’t set a request kubernetes will default the request to the limit. This can be fine if you have very good knowledge of how much cpu time your workload requires. How about setting a request with no limit? In this case kubernetes is able to accurately schedule your pod, and the kernel will make sure it gets at least the number of shares asked for, but your process will not be prevented from using more than the amount of cpu requested, which will be stolen from other processes’ cpu shares when available. Setting neither a request nor a limit is the worst case scenario: the scheduler has no idea what the container needs, and the process’s use of cpu shares is unbounded, which may affect the node adversely. And that’s a good segue into the last thing I want to talk about: ensuring default limits in a namespace.

Default limits

Given everything we’ve just discussed about the negative effects of ignoring resource limits on your pod containers, you might think it would be nice to be able to set defaults, so that every pod admitted to the cluster has at least some limits set. Kubernetes allows us to do just that, on a per namespace basis, using the LimitRange v1 api object. To establish default limits you create the LimitRange object in the namespace you want them to apply to. Here’s an example:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limit
spec:
  limits:
  - default:
      memory: 100Mi
      cpu: 100m
    defaultRequest:
      memory: 50Mi
      cpu: 50m
    max:
      memory: 512Mi
      cpu: 500m
    min:
      memory: 50Mi
      cpu: 50m
    type: Container

The naming here can be a little confusing so let’s break it down briefly. The default key under limits represents the default limits for each resource. In this case any pod admitted to the namespace without a memory limit will be assigned a limit of 100Mi. Any pod without a cpu limit will be assigned a limit of 100m. The defaultRequest key is for resource requests. If a pod is created without a memory request it will be assigned the default request of 50Mi, and if it has no cpu request it will get a default of 50m. The max and min keys are something a little different: basically if these are set a pod will not be admitted to the namespace if it sets a request or limit that violates these bounds. I haven’t found a use for these, but perhaps you have, and if so leave a comment and let us know what you did with them.
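The defaulting and bounds checking described above can be sketched in a few lines of Python. To be clear, this is a simplified model based on the description in this post, not the actual admission code: the function and variable names are hypothetical, cpu values are plain millicores, memory values are plain Mi, and the check treats min and max as applying to both requests and limits.

```python
from typing import Dict

Resources = Dict[str, Dict[str, int]]

# The example LimitRange from this post: cpu in millicores, memory in Mi.
LIMIT_RANGE: Resources = {
    "default": {"memory": 100, "cpu": 100},
    "defaultRequest": {"memory": 50, "cpu": 50},
    "max": {"memory": 512, "cpu": 500},
    "min": {"memory": 50, "cpu": 50},
}


def apply_limit_range(resources: Resources,
                      limit_range: Resources = LIMIT_RANGE) -> Resources:
    """Fill in missing requests and limits, then enforce min and max."""
    requests = dict(resources.get("requests", {}))
    limits = dict(resources.get("limits", {}))

    # Defaults only apply to values the container did not set itself.
    for name, value in limit_range.get("defaultRequest", {}).items():
        requests.setdefault(name, value)
    for name, value in limit_range.get("default", {}).items():
        limits.setdefault(name, value)

    # Reject anything outside the configured bounds.
    for kind, values in (("request", requests), ("limit", limits)):
        for name, value in values.items():
            low = limit_range.get("min", {}).get(name)
            high = limit_range.get("max", {}).get(name)
            if low is not None and value < low:
                raise ValueError(f"{name} {kind} {value} is below min {low}")
            if high is not None and value > high:
                raise ValueError(f"{name} {kind} {value} is above max {high}")

    return {"requests": requests, "limits": limits}


print(apply_limit_range({}))
print(apply_limit_range({"limits": {"cpu": 200}}))
```

A container that asks for a cpu limit of 1000 would raise a ValueError here, which corresponds to the pod not being admitted to the namespace.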

The defaults set forth in the LimitRange are applied to pods by the LimitRanger plugin, which is a kubernetes admission controller. Admission controllers are plugins that get a chance to modify podSpecs after the object has been accepted by the api, but before the pod is created. In the case of the LimitRanger it looks at each pod, and if it does not specify a given request or limit for which there is a default set in the namespace, it applies that default. You can see that the LimitRanger has set a default on your pod by examining the annotations in the pod metadata. Here’s an example where the LimitRanger applied a default cpu request of 100m:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu request for container
      limit-test'
  name: limit-test-859d78bc65-g6657
  namespace: default
spec:
  containers:
  - args:
    - /bin/sh
    - -c
    - while true; do sleep 2; done
    image: busybox
    imagePullPolicy: Always
    name: limit-test
    resources:
      requests:
        cpu: 100m

And that wraps up this look at resource limits in kubernetes. I hope you find this information useful. If you’re interested in reading more about using resource limits and defaults, linux cgroups, or memory management I’ve provided some links to more detailed information on these subjects below.

Further reading