Logstash ate my events

I’ve mentioned previously that I’m testing some infrastructure components for a new system at work. One of those components is Logstash, which along with redis and Elastic Search will function as our distributed logging system. As a test setup I have python scripts generating log messages into a redis list, which is then popped by Logstash for indexing into Elastic Search. The whole thing is running in a Docker container, the building of which I discussed in my last post. These processes only generate a couple of events per second, based on the other work they are doing. I’m running four of them for now, so I am getting six to eight events per second written to redis. This is very low volume, but sufficient for me to get a handle on how everything works.
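
The producer side of that pipeline is simple. As a rough sketch (the list key and event fields here are illustrative stand-ins, not my exact schema), each script does something like this with redis-py:

import json
import time

import redis

# Connect to the same redis instance that logstash reads from
r = redis.StrictRedis(host="localhost", port=6379, db=0)

def emit(message, data=None):
    event = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "message": message,
        "data": data,  # variable payload: a number in some messages, a string in others
    }
    # logstash's redis input pops from the head of the list, so push onto the tail
    r.rpush("logstash", json.dumps(event))

emit("page fetched", data=200)
emit("page fetched", data="connection reset")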

The first time I ran the test and connected to Kibana to see the collected events stream I thought it was so cool that I just paged around with a silly smile on my face for fifteen minutes, without paying a lot of attention to the details. The next time I ran it I noticed that some events seemed to be missing. I know the pattern of events emitted by these scripts very well, and I wasn’t seeing them all. So I shut down, checked a few things, convinced myself that no logic in the code could cause that pattern, and then fired the test back up again. This time things looked a lot closer to normal, but I couldn’t quite be sure without having another definitive source to compare to.

Fortunately I had already built in code to enable/disable both file-based and redis-based logging, so I simply enabled file logging and rebuilt the container. I fired it up, ran a short test, and the number of lines in the log files exactly matched the number of events in the logstash index on ES. The events in Kibana looked complete, too. So ok, problem solved.
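
The toggle itself is nothing fancy. Conceptually it is just a pair of flags deciding which handlers get attached, along the lines of this sketch (the handler, key, and file names are illustrative, not my actual implementation):

import json
import logging

import redis

class RedisListHandler(logging.Handler):
    """Format each record as JSON and push it onto the list logstash pops."""
    def __init__(self, key="logstash"):
        logging.Handler.__init__(self)
        self.key = key
        self.client = redis.StrictRedis()

    def emit(self, record):
        self.client.rpush(self.key, json.dumps({"message": record.getMessage()}))

def build_logger(log_to_file, log_to_redis):
    log = logging.getLogger("spider")
    log.setLevel(logging.INFO)
    if log_to_file:
        log.addHandler(logging.FileHandler("spider.log"))
    if log_to_redis:
        log.addHandler(RedisListHandler())
    return log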

I got pulled onto other things for a day or two, and when I came back and started running this test again I forgot to disable file logging. I realized it after a short time and killed the test. Before rebuilding the container I decided to check the counts again. They were off. There were something like 500 fewer events in the logstash index than in the files, out of a total of several thousand. I started the test again, intending to run a longer one and do a detailed comparison, and this time the earlier pattern I had seen, with obviously missing events, was back. I let this test run for a bit and then compared counts: ~6500 in the files vs. ~3200 in the index. Half my events had disappeared.
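
Counting both sides is a one-liner on each end. Roughly this is the comparison, with placeholder log paths and index pattern:

import glob

import requests

file_count = 0
for path in glob.glob("/var/log/spiders/*.log"):
    with open(path) as fh:
        file_count += sum(1 for _ in fh)

es_count = requests.get("http://myhost.com:9200/logstash-*/_count").json()["count"]

print("files: %d   index: %d   missing: %d" % (file_count, es_count, file_count - es_count))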

I was pretty sure the culprit wasn’t redis, which is pretty much bulletproof, but to quickly rule it out I shut down logstash and ran the test. With nothing reading them, the events piled up in redis, and at the end llen showed the counts to be exactly the same. Redis wasn’t mysteriously losing anything. I next checked the logstash log, but there was nothing of note in it. I connected directly to ES and queried the index. The counts all matched what Kibana had shown me. There was no indication anywhere as to why half the events had evaporated.
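
The redis check amounts to a single command; redis-cli llen does the job, or from Python (the key name is a placeholder):

import redis

# With logstash stopped, nothing pops the list, so its length should
# match the number of events the scripts emitted.
r = redis.StrictRedis(host="localhost", port=6379, db=0)
print(r.llen("logstash"))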

I was poking around on the logstash site looking for clues when I noticed the verbosity settings in the command line. I quickly edited the init.d script that launches the daemon to set -vv, the highest logging verbosity level for logstash. I reran the test and again saw the same pattern of missing events. I let it run for a minute or so and then shut it down. The logstash.log file was now over 7 megs, compared to a few hundred k in the last test. I dove in and quickly noticed some Java exceptions getting logged. These were field parser errors, caused by an inner number format exception, and the field that caused the issue was named ‘data.’ I grep’d the log to see how many of these there were, and the number came back ~3200. Well, whaddaya know?

The data field is one that I use in my log event format to carry variable data. In one message it may hold a number, in another a string. Looking at the exceptions in the log made it very clear what was going on: ES was dynamically assigning a type to the field based on the data it received. Not all my log events include data, and depending on the results of processing, the first populated ‘data’ field to get into the queue might contain a number or it might contain a string. If a string arrived first all was well, because ES could easily store a number as a string. But if a number arrived first, ES decided that all subsequent data had to be converted to a number, which was not possible in about half the cases.
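
You can reproduce the behavior without logstash in the picture at all. Something along these lines (host and index name are throwaways) shows the lock-in: once a numeric ‘data’ value has been indexed, a later string value fails with the same kind of number format error I found in the log:

import json

import requests

ES = "http://myhost.com:9200"
HEADERS = {"Content-Type": "application/json"}

# First document in wins: 'data' gets dynamically mapped as a number
requests.post(ES + "/mapping-test/event/", headers=HEADERS,
              data=json.dumps({"data": 42}))

# This one is rejected with a mapping parse error (number format) instead of being indexed
resp = requests.post(ES + "/mapping-test/event/", headers=HEADERS,
                     data=json.dumps({"data": "connection reset"}))
print(resp.status_code, resp.text)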

The solution turned out to be pretty simple. You can include a custom template for your index in the logstash configuration, and this template can include field mappings. To make this work through logstash you have to be using ES version 0.90.5 or later. Fortunately I am using the embedded instance for this test, which is at 0.90.9. Implementing the template solution required modifying the logstash output to reference the template:

output {
  elasticsearch {
    embedded => true
    template => "/opt/logstash/es-event-template.json"
  }
}

It also required the template itself. At first I tried creating a very simple template that only added the single field mapping I needed. That didn’t work, and I assumed it was because I was overwriting logstash’s settings in a way that was breaking things. I had hoped my single setting would just get merged in, but either that’s not happening or there is something else I don’t understand. What I ended up doing was grabbing the current template from:

http://myhost.com:9200/_template

And then editing it to add the part I needed:

"data": { 
    "index": "not_analyzed",
    "type": "string"
}
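
The same fetch-and-edit can be scripted. Here is a sketch; the “logstash” template name and the _default_/properties path may vary with the logstash version, so check what GET /_template actually returns before trusting them:

import json

import requests

templates = requests.get("http://myhost.com:9200/_template").json()
template = templates["logstash"]            # name of logstash's default template (verify)

# Graft the 'data' mapping into the default mapping's properties
props = (template.setdefault("mappings", {})
                 .setdefault("_default_", {})
                 .setdefault("properties", {}))
props["data"] = {"type": "string", "index": "not_analyzed"}

# Write it where the logstash output config expects to find it
with open("/opt/logstash/es-event-template.json", "w") as fh:
    json.dump(template, fh, indent=2)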

And with that the problem was solved. The lesson of the day for me is: if logstash appears to be eating your events, crank up the logging verbosity level and see what’s giving it indigestion.

Containers are going to hurt Windows in the datacenter

As anyone who accidentally stumbles on this blog knows, I’ve been playing with Docker for a few weeks now, which makes me an expert and entitled to opine on its future. Hell, you don’t need to be an expert to know where this is going. Docker rocks. It’s going to revolutionize the way applications and dependencies are managed in the datacenter. No technology has captivated me this way since… I don’t even know. Virtualization came close. But it is containers that are poised to really deliver the value virtualization promised. And Windows has nothing comparable to what they offer. There are plenty of virtualization solutions that work on Windows, but containers are not virtualization: they’re about environment isolation, and they are a lot lighter weight, a lot easier to use, and a lot more manageable than virtualization technologies.

Over the weekend I finished up an experiment to encapsulate the complete environment for my spider application in a Docker container. It was a huge success from my point of view. Just to recap, I’m working on an application that uses a bunch of spider processes to get information off the web. The environment these things run in is fairly sophisticated at this point: they get their work input from, and stream work product to, redis. They also stream log events to redis, from which the events are picked up by logstash and indexed into Elastic Search. The whole process runs under the control of supervisord. So there are a few pieces in play.
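
In skeletal form a spider worker is just a loop over a couple of redis lists; the queue names and work format below are placeholders, not my real ones:

import json

import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def fetch(url):
    # stand-in for the real spidering logic
    return {"url": url, "status": "done"}

def run_worker():
    while True:
        # Block until a work item shows up, then push the result back for the next stage
        _, raw = r.blpop("spider:work")
        job = json.loads(raw)
        r.rpush("spider:results", json.dumps(fetch(job["url"])))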

Configuring the environment on EC2 previously required approximately 50 steps, including installing the base dependencies from the python, debian, and java repositories. After this weekend’s work it should require two: install Docker, and build the container image.

All those manual steps are now captured declaratively in a Docker build file. The base image that the container build depends on is an Ubuntu 13.10 debootstrap install with python and a few other things added, which I pushed to my repository on the Docker index. It can now be pulled to anywhere that Docker is installed. I could also build and push the final environment image, but all images on the Docker index are public. The way around that is either to run your own repository or to build the final container image in place on the server. I’m going to take the latter approach for now.

So, with the base image in the repo, and the Docker build file created and tested, launching a new environment on a server with Docker installed looks like this:

sudo docker build -t="my:image" -rm=true - < mydockerfile
sudo docker run -t -i -n="my:name" my:image /bin/bash

Boom, the server is running. In my case what that means is that redis is up, logstash is up, Elastic Search is up, supervisord is up and has launched the spider processes, and the proper ports are exposed and bridged over to the host system. All I have to do is point my web browser at http://myhostname.com:9292 and the Kibana3 dashboard pops up with the event stream visualized so I can monitor the work progress. That. Is. Cool.

Which leads me back to my opening thought. As a developer who spent 20 years writing code for Microsoft platforms I don’t want to come off like an ex-smoker who has seen the light and needs to let everyone know it. But at the same time, I can’t help but wonder how Microsoft will respond to this. Docker came out of nowhere, igniting a wildfire by making a formerly opaque technology (LXC containers) easier to understand and use. I think within another year or so containers will be as ubiquitous and as important in the datacenter as virtualization is now, perhaps more so. And Windows will be increasingly relegated to running Exchange and SharePoint servers.