For the last week I’ve been evaluating some infrastructure technologies for our new platform at work. Most recently that effort has focused on using Docker for dependency management and deployment, as well as Elasticsearch, Logstash, and Kibana for log flow handling. Since our embryonic system does not yet produce much in the way of log data, I turned to an existing web spider framework I had built that generates tons of it. Well, according to my standards of scale, anyway. Running the spider against a typical list of 30,000 target domains yields close to a gig of log data. At maturity our new system will likely generate more than that, but the spider makes a great initial test of the logging pipeline I have in mind.
I brought Docker into the mix because we deploy onto AWS, and the promise of automating the build and deployment of independent containers to the server (continuous integration) was too good to pass up. As I began to work on the structure of the container for the spider framework, I ran into a decision that will eventually confront everyone designing a container-based deployment: whether to make a single fat container with all the necessary internal dependencies, or several skinnier containers that work together. For example, the spider framework has the following major components:
- The Python spider scripts
- Redis
- Logstash
- Elasticsearch
- Kibana
- Supervisor
The Supervisor daemon runs the spiders. The spiders read their input from Redis, write their work output back to it, and log their messages into it as well. Logstash reads the log messages out of Redis and indexes them in Elasticsearch, and Kibana queries Elasticsearch to provide log visualization and analysis.
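To make the logging leg of that flow concrete, here’s a rough sketch of what the spider-side change might look like: a small Python logging handler that pushes each record onto a Redis list as JSON. Everything here (the key name, the field layout) is a placeholder, not code from the actual framework:

import json
import logging
import redis

class RedisLogHandler(logging.Handler):
    """Push each log record onto a Redis list as a JSON document."""

    def __init__(self, host='localhost', port=6379, key='spider-logs'):
        logging.Handler.__init__(self)
        self._redis = redis.StrictRedis(host=host, port=port)
        self._key = key

    def emit(self, record):
        try:
            doc = {
                'timestamp': record.created,
                'level': record.levelname,
                'logger': record.name,
                'message': record.getMessage(),
            }
            # rpush so Logstash drains the list in FIFO order
            self._redis.rpush(self._key, json.dumps(doc))
        except Exception:
            self.handleError(record)

# in the spider script
logger = logging.getLogger('spider')
logger.addHandler(RedisLogHandler())
logger.setLevel(logging.INFO)

And on the other side, the Logstash config that drains that list and feeds the embedded Elasticsearch instance would be something along these lines (again, the key name is just the same placeholder):

# sketch only; "spider-logs" matches the placeholder key used above
input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "spider-logs"
    codec => "json"
  }
}
output {
  # index into the embedded Elasticsearch instance for now
  elasticsearch {
    embedded => true
  }
}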
Looking at that list of components, there are a few different ways you could package it up. The most granular approach would be to run each component in its own container with just its own dependencies. That would also be the best option from a future scalability perspective: if you needed to cluster Redis or Elasticsearch, it would be a lot easier to do if everything ran separately. Docker makes it pretty easy to specify the network links between containers in a stable, declarative manner, so this could be manageable.
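To give a sense of what that looks like, wiring a separate Logstash container up to a Redis container amounts to a couple of commands like the following. The image and container names here are hypothetical, not images I actually have:

sudo docker run -d --name redis markbnj/redis
sudo docker run -d --name logstash --link redis:redis markbnj/logstash

The --link flag injects the Redis container’s address into the Logstash container’s environment, so the Logstash config doesn’t have to hard-code an IP.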
On the other hand, for this test iteration at least, I am also attracted to the idea of a single “appliance” container that has everything the framework needs, with the proper ports exposed so I can connect to it from the outside to monitor and control the operation. In that case configuring a new server at AWS would be a simple matter of installing Docker, pulling the image, launching the container, and then connecting to it from a browser. For simplicity’s sake I find this prospect attractive, and since I am currently using Logstash with the embedded Elasticsearch and Kibana instances, I decided to try this approach first.
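In rough terms the appliance approach boils down to something like this on a fresh box (the port numbers are the ones I care about for Redis, Elasticsearch, and Kibana, and the launch script is described below):

sudo docker pull markbnj/spider
sudo docker run -d -p 6379:6379 -p 9200:9200 -p 9292:9292 markbnj/spider /usr/local/bin/launch.sh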
Probably the main thing you need to get around in this scenario is the fact that Docker wants to run one process per container. It launches that process when the container starts, and when the process exits the container exits. In my case I have at least nine processes to run: four instances of Python running the spider script, two Redis daemons, and the daemons for Logstash, Elasticsearch, and Kibana. This situation isn’t uncommon for Docker deployments, and there is a lot of discussion of it on the web, usually leading to the overarching question that is the subject of this post: do you really want to run a bunch of stuff in one container? Assuming you do, there are a few ways to go about it.
One thing to note about Docker (and LXC containers in general, I think) is that they aren’t virtual machines. They don’t boot up, per se, and they don’t start daemons through rc or initctl. Even so, it is nice to have the core daemons running as services because you get some useful control and lifecycle semantics, like automatic respawning if a daemon faults and crashes. You can get this by installing them as services, and then starting them manually from a launch script when the container runs. So the command to run your container might look like:
sudo docker run -d markbnj/spider /usr/local/bin/launch.sh
And then the launch.sh script looks something like:
/etc/init.d/redis_6379 start
/etc/init.d/logstash start
Not quite good enough, though, because this script will exit, and then the container will exit with it. You have to do something at the end to keep things running. In my case I need to get Redis and Logstash spun up, and then launch my spiders. That’s where Supervisor comes in. The last line of my launch script will be a call to run supervisord, which will launch the four spider instances and control their lifecycles. The container will keep running as long as supervisord is running, which will be until I connect to it and tell it to shut down.
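So the finished launch script will probably look something like the sketch below. The paths are placeholders, and note that supervisord has to run in the foreground (the -n switch) or the container will exit anyway:

#!/bin/bash
/etc/init.d/redis_6379 start
/etc/init.d/logstash start
# keep supervisord in the foreground so the container stays up
exec supervisord -n -c /etc/supervisord.conf

And the relevant chunk of the Supervisor config, again with placeholder paths, would be along these lines:

[program:spider]
command=python /opt/spider/spider.py
process_name=spider_%(process_num)02d
numprocs=4
autorestart=true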
As of last night I had everything but Supervisor set up. I need to do some more reading on parsing JSON with Logstash, and I need to write some code in the spiders to change the logging method to use the Redis queue. After that I will be able to deploy the last pieces, convert my recipe into a Dockerfile I can automate the container build from, and then test. If it all works the way I intend, then I will be able to simply launch the container and the spiders will start working. I will be able to connect to port 6379 to monitor the Redis work output queue, and to ports 9200/9292 to query and view the log data. Pretty neat stuff.
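For what it’s worth, I expect the Dockerfile version of the recipe to end up looking roughly like the sketch below. The base image, package names, and paths are all placeholders, and the real thing will need the Logstash/Elasticsearch package repositories and a JVM set up before that install line will actually work:

FROM ubuntu:14.04
# placeholder package list; Logstash needs its own repo and a JVM in practice
RUN apt-get update && apt-get install -y python python-pip redis-server logstash supervisor
ADD spiders /opt/spider
ADD launch.sh /usr/local/bin/launch.sh
ADD supervisord.conf /etc/supervisord.conf
EXPOSE 6379 9200 9292
CMD ["/usr/local/bin/launch.sh"]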