The cost of tailing logs in kubernetes

Logging is one of those plumbing things that often gets attention only when it’s broken. That’s not necessarily a criticism. Nobody makes money off their own logs. Rather we use logs to gain insight into what our programs are doing… or have done, so we can keep the things we do make money from running. At small scale, or in development, you can get the necessary insights from printing messages to stdout. Scale up to a distributed system and you quickly develop a need to aggregate those messages to some central place where they can be useful. This need is even more urgent if you’re running containers on an orchestration platform like kubernetes, where processes and local storage are ephemeral.

Since the early days of containers and the publication of the Twelve-Factor manifesto, a common pattern has emerged for handling logs generated by container fleets: processes write messages to stdout or stderr, containerd (docker) redirects the standard streams to disk files outside the containers, and a log forwarder tails the files and forwards them to a database. The log forwarder fluentd is a CNCF project, like containerd itself, and has become more or less the de facto standard tool for reading, transforming, transporting and indexing log lines. If you create a GKE kubernetes cluster with cloud logging enabled (formerly Stackdriver), this is pretty much the exact pattern you get, albeit using Google’s own flavor of fluentd.
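In fluentd terms, the node-level half of this pattern boils down to a tail source pointed at the container log directory. The sketch below is illustrative, not Google's actual configuration; the paths and tag are the conventional ones but may differ on your nodes:

```
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
  </parse>
</source>
```

A real cluster config adds metadata filters and an output plugin; this fragment just names the mechanism whose cost is measured below.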

It was the pattern we got when Olark first began migrating services to GKE-flavored k8s four years ago, and when we outgrew logging-as-a-service it was the pattern we followed for our own log aggregation system that processes 15–20k lines per second at peak load. There are reasons why this pattern works well, and why the Twelve-Factor guidelines specifically advise writing logs to standard streams. It decouples the application from the concerns of log routing, and makes containers easily observable in development or production. And if your logging system goes south there is at least some chance the logs are still there on the cluster node host disks.

The downside is that tailing log files is relatively expensive in terms of CPU utilization. We started looking at this after a recent tune-up of our logging system, when we realized that fluentd was consuming 1/8 of all our production CPU request quota. Part of that is due to cluster topology: if fluentd is deployed per-node to tail local files (a daemonset in k8s parlance), and you run four-core nodes and need to reserve 50% of a core to handle logs… well, you get the picture. Another chunk is expended on all the text munging we make fluentd perform. Who can resist cleaning up messy log records? The rest, though, is the overhead of using inotify to monitor disk files, and of handling the reads and tracking file positions. We wanted to know what that cost was, since there are other ways to get logs to fluentd. We could push them to a forward port, for example. How would that compare?

To isolate the costs of acquiring log lines by tailing files, I put together a test framework. The main pieces are a python tool that spawns some number of log writers with configurable message rates and sizes, and a docker compose file that spins up fluentd to receive the logs, cadvisor to monitor the fluentd container, prometheus to scrape cadvisor, and grafana to visualize the data in prometheus. The log writers create messages in the same general json format containerd uses, and can either write to files or push to fluentd’s forward port. When writing to files they use a RotatingFileHandler to better mimic the setup on a cluster node. The fluentd configuration discards all records through a null output and does no regexp-ing or retagging, so most of the work should be in receiving the log lines. Lastly, in addition to scraping cadvisor, prometheus also scrapes fluentd for log throughput metrics.
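A minimal sketch of what such a writer can look like. This is my reconstruction, not the actual test harness: the function names are hypothetical, and the log/stream/time fields approximate what the docker json-file driver writes to disk.

```python
import json
import logging
import time
from logging.handlers import RotatingFileHandler

def containerd_line(message, stream="stdout"):
    # One JSON object per line with log/stream/time fields, as the
    # docker json-file driver writes them (my approximation).
    return json.dumps({
        "log": message + "\n",
        "stream": stream,
        "time": time.strftime("%Y-%m-%dT%H:%M:%S.000000000Z", time.gmtime()),
    })

def make_writer(path, max_bytes=10 * 1024 * 1024, backups=5):
    # RotatingFileHandler stands in for node-level log rotation
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    logger = logging.getLogger(path)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

def run_writer(logger, lines_per_sec, duration_sec, line_len=100):
    # Pace the writes so each writer emits a steady lines_per_sec
    interval = 1.0 / lines_per_sec
    payload = "x" * line_len
    deadline = time.monotonic() + duration_sec
    while time.monotonic() < deadline:
        logger.info(containerd_line(payload))
        time.sleep(interval)
```

In the push case the same generator would hand each record to a forward-port client instead of a file handler.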

Developing the parameters for the comparison was fairly subjective. I wrote another tool to take a rough measurement of the log volume being produced across our cluster nodes. Not surprisingly, it varied widely, from a few tens of lines per second to 500 or more on the busiest nodes. This is another area of concern, since if you’re using a daemonset, fluentd has to be provisioned for the noisiest nodes in the cluster. You might avoid some of the imbalance by labeling heavy log producers and using soft anti-affinity to get the scheduler to spread them out, but that’s beyond the scope of this post. My initial thought was to test 500 and 1000 lines per second using from one to ten writers, comparing the tail and push delivery mechanisms.
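For the curious, that spreading trick would look something like the fragment below: label the chatty pods and give them soft (preferred, not required) anti-affinity so the scheduler tries to keep them on separate nodes. The `logging-profile: heavy` label is hypothetical:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                logging-profile: heavy   # assumed label on noisy pods
            topologyKey: kubernetes.io/hostname
```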

One thing that jumped out of these early tests and surprised me was that the real driver of CPU usage when tailing files was simply lines per second, regardless of how many log files were watched. These first two graphs compare 1000 lines per second from a single writer with the same output from ten, and show little difference.

I did not include the graph here, but an interesting side note is that ten loggers writing 100 lines per second each can reach a higher combined throughput than a single process writing 1000 lines per second on my machine. That might be an artifact of my code, but I haven’t looked into it. In any case, I had expected the number of open log files to be a big factor, but it proved to be essentially irrelevant. Another variable that had basically no effect was line length. The test above used the standard line length of 100 characters. Rerunning the test with lines ten times as long showed no appreciable change in CPU usage over the duration of the test, which was 180 seconds for all iterations.

Based on the above, I decided to test two writer processes, since a single process appeared to be hitting some internal limits and we clearly didn’t need ten of them. I ran these at 500 and 1000 lines per second. The following set of graphs shows the results for tailing files and pushing lines to the forward port.

After running these tests a bunch of different ways over the last week, two key points emerged for me. The first is that pushing to a forward socket consistently uses 30–50% less CPU than reading lines from tailed log files at the same rate. One possible explanation for at least part of that is the transfer of messagepack encoding responsibility from fluentd to the logging client. When fluentd reads log records from disk it gets text, and has to convert it to messagepack internally. When my python client pushes logs to the forward port, the records are converted before they’re sent. Even if that proved to be a large part of the difference it would not necessarily negate the advantage: doing that work in the client might still be more efficient than reserving CPU to have fluentd do it on every node.

The second major learning was simply the realization that lines per second is what drives CPU usage, not the number of discrete sources. In fact that proved equally true for pushing to the forward port as it had for tailing files. These two graphs show the results of rerunning the first test, with 1000 lines per second from a single writer vs the same rate spread over ten writers, but for the push case this time.

So do these results mean we should all be logging to a socket rather than to files? Obviously it’s not that simple, and if we could just change the way things log most of the problems we have with logging wouldn’t be problems. Logging to stdout makes containers a lot easier to observe and work with in development. Setting them up to log both ways depending on context would increase complexity, as would setting up fluentd to visualize them in development (using a stdout output plugin for example). Perhaps a more practical interpretation of these results would be to suggest running larger nodes. If you have to provision fluentd for the noisiest nodes in the cluster then having fewer of them is better. Combined with using anti-affinity to spread out the heavy log producers this could be a good strategy, but there are a lot of considerations and trade-offs in sizing nodes that go well beyond the needs of the logging system.

Scale also obviously matters. At smaller scale the inconvenience and added complexity probably aren’t worth it, and you usually have more important fires to put out anyway. If you’re just starting out and have that new paint smell on your engineering processes, you could always standardize the logging format and libraries up front, and take advantage of the lower costs of pushing logs without unduly aggravating engineers. For those running at huge scale this post is irrelevant, since Google-scale log producers have already done a much more thorough and scientific version of the same analysis I’m stumbling through here. Of course, at that scale you’re also deploying your own clusters and can do whatever you like with the logging pipeline, so you can have the best of both worlds.

To close, let me anticipate a couple of questions and hopefully provide answers. First, isn’t this post really about fluentd, and is it even about kubernetes? The answer to both parts of that question is… eh, maybe? My general reading and experience suggest this is a thing inherent in tailing files on linux in situations where you have a lot of disk i/o. I haven’t run these tests with another log forwarder like logstash, but it would be interesting to. As for kubernetes, the differences in CPU usage I’m noting here have nothing to do with it, and I expect you’d get the same results however you run your loggers and forwarders. Having said that, I mean the post as a specific heads-up to people running kubernetes, because tailing log files is how most k8s-as-a-service clusters are set up, at least by default.

Lastly, about that other expendable resource: memory. I was initially going to include it, and the dashboard provisioned with the project visualizes fluentd ram usage, but it ended up just not being important. All of the tests consumed a max of 85 MB of ram, with the variance between individual tests rarely greater than 10 MB. The fairly low ram usage is of course due to not using any buffered output plugins. More importantly, it didn’t vary between the file tailing and socket push delivery cases by any meaningful amount, and the post was getting long enough. If a more detailed test of this kind is worth doing, there are lots of corners that could be poked into, including breaking down the CPU states and syscalls fluentd is spending its time in, but that’s going to have to wrap it up for this one.
