I’ve mentioned previously that I’m testing some infrastructure components for a new system at work. One of those components is Logstash, which along with redis and Elasticsearch will function as our distributed logging system. As a test setup I have python scripts generating log messages into a redis list, which Logstash then pops for indexing into Elasticsearch. The whole thing runs in a Docker container, the building of which I discussed in my last post. Each process generates only a couple of events per second, depending on the other work it is doing. I’m running four of them for now, so I am getting six to eight events per second written to redis. This is very low volume, but sufficient for me to get a handle on how everything works.
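For context, the producer side is nothing fancy. Here's a minimal sketch of what the scripts do, assuming redis-py, a list key named "logstash", and a much-simplified event format (the real scripts carry more fields than this):

import json
import time

import redis

# Illustrative only: assumes redis on localhost and a list key named
# "logstash"; adjust host and key to match your logstash redis input.
r = redis.StrictRedis(host="localhost", port=6379, db=0)

def log_event(message, data=None):
    """Serialize a log event and push it onto the redis list."""
    event = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "message": message,
        "data": data,
    }
    r.rpush("logstash", json.dumps(event))

log_event("worker started")
log_event("processed batch", data=42)

On the logstash side, a redis input pointed at the same key (data_type => "list") pops these off and ships them along for indexing.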
The first time I ran the test and connected to Kibana to see the collected events stream I thought it was so cool that I just paged around with a silly smile on my face for fifteen minutes, without paying a lot of attention to the details. The next time I ran it I noticed that some events seemed to be missing. I know the pattern of events emitted by these scripts very well, and I wasn’t seeing them all. So I shut down, checked a few things, convinced myself that no logic in the code could cause that pattern, and then fired the test back up again. This time things looked a lot closer to normal, but I couldn’t quite be sure without having another definitive source to compare to.
Fortunately I had already built in code to enable/disable both file-based and redis-based logging, so I simply enabled file logging and rebuilt the container. I fired it up, ran a short test, and the number of lines in the log files exactly matched the number of events in the logstash index on ES. The events in Kibana looked complete, too. So ok, problem solved.
I got pulled onto other things for a day or two, and when I came back and started running this test again I forgot to disable file logging. I realized it after a short time and killed the test. Before rebuilding the container I decided to check the counts again. They were off. There were something like 500 fewer events in the logstash index than in the files, out of a total of several thousand. I started the test again, intending to run a longer one and do a detailed comparison, and this time the earlier pattern I had seen, with obviously missing events, was back. I let this test run for a bit and then compared counts: ~6500 in the files vs. ~3200 in the index. Half my events had disappeared.
I was pretty sure the culprit wasn’t redis, which is close to bulletproof, but to quickly rule it out I shut down logstash and ran the test. With nothing reading them the events piled up in redis, and at the end llen showed a count that exactly matched the log files. Redis wasn’t mysteriously losing anything. I next checked the logstash log, but there was nothing of note in it. I connected directly to ES and queried the index. The counts all matched what Kibana had shown me. There was no indication anywhere as to why half the events had evaporated.
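If you want to run the same sanity check, the comparison is short. A rough sketch, assuming the list key from above, the default logstash daily index name, and placeholder hostnames:

import time

import redis
import requests

r = redis.StrictRedis(host="localhost", port=6379, db=0)

# With logstash shut down nothing pops the list, so llen reflects every
# event the scripts pushed.
print("events waiting in redis: %d" % r.llen("logstash"))

# Count what actually made it into ES (default logstash daily index name).
index = "logstash-" + time.strftime("%Y.%m.%d", time.gmtime())
resp = requests.get("http://localhost:9200/%s/_count" % index)
print("events in elasticsearch: %d" % resp.json()["count"])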
I was poking around on the logstash site looking for clues when I noticed the command-line verbosity settings. I quickly edited the init.d script that launches the daemon to set -vv, the highest logging verbosity level for logstash. I reran the test and again saw the same pattern of missing events. I let it run for a minute or so and then shut it down. The logstash.log file was now over 7 megs, compared to a few hundred k in the last test. I dove in and soon spotted some Java exceptions getting logged. These were field parser errors, caused by an inner number format exception, and the field that caused the issue was named ‘data.’ I grep’d the log to see how many of these there were, and the number came back ~3200. Well, whaddaya know?
The data field is one I use in my log event format to carry variable data. In one message it may hold a number, and in another a string. Looking at the exceptions in the log made it very clear what was going on: ES was dynamically assigning a type to the field based on the data it received. Not all my log events include data, and depending on how the processing goes, the first event with a populated ‘data’ field to reach the index might contain a number or it might contain a string. If a string arrived first all was well, because ES can easily convert a number to a string. But if a number arrived first ES decided that all subsequent data had to be converted to a number, which was not possible in about half the cases.
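To make that concrete, here is a contrived pair of events along the lines of what my scripts emit (the values are invented for illustration):

# Hypothetical events; only the type of 'data' matters here.
event_a = {"message": "retry limit reached", "data": 3}          # number
event_b = {"message": "request failed", "data": "timeout"}       # string

# If event_a is indexed first, ES dynamically maps 'data' as a numeric
# field, and event_b then blows up with a number format exception and never
# reaches the index. If event_b arrives first, 'data' is mapped as a string
# and both index fine, since a number converts to a string without complaint.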
The solution turned out to be pretty simple. You can reference a custom template for your index in the logstash configuration, and that template can include field mappings. To make this work through logstash you have to be running ES version 0.90.5 or later. Fortunately I am using the embedded instance for this test, which is at 0.90.9. Implementing the template solution required modifying the logstash output to reference the template:
output {
  elasticsearch {
    embedded => true
    template => "/opt/logstash/es-event-template.json"
  }
}
It also required the template itself. At first I tried creating a very simple template that added only the single field mapping I needed. That didn’t work, and I assumed it was because I was overwriting logstash’s settings in a way that broke things. I had hoped my single setting would just get merged in, but either that’s not happening or there is something else I don’t understand. What I ended up doing was grabbing the current template from:
http://myhost.com:9200/_template
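Any HTTP client will do for this. A quick sketch using requests (the output filename is arbitrary, and which template body you keep out of the response is up to you):

import json
import requests

# GET /_template returns every registered template keyed by name; the body
# logstash wants is the value under the relevant name, not the whole wrapper.
templates = requests.get("http://myhost.com:9200/_template").json()
with open("current-templates.json", "w") as f:
    json.dump(templates, f, indent=2)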
And then editing it to add the part I needed:
"data": { "index": "not_analyzed", "type": "string" }
And with that the problem was solved. The lesson of the day for me is: if logstash appears to be eating your events, crank up the logging verbosity level and see what’s giving it indigestion.