Using Celery to schedule Python tasks

Many Python programmers are familiar with Celery because of its integration with Django, a hugely popular web framework. A colleague recommended it to me recently when the need to schedule pruning of old logstash indices came up in a development discussion. As I soon discovered, Celery is a fast and powerful way to turn any Python callable into a scheduled task or message-processing worker. We’re now using it to prune the aforementioned indices, as well as for other jobs such as injecting work payloads into a distributed queue.

Celery is a Python package, so the easiest way to get it into your virtualenv (or Docker container, or Vagrant env) is simply:

pip install celery

The gzipped tarball is only 1.3 megs, and while there are a few other dependencies (billiard, kombu, pytz) the installation takes less than a minute. Once it is complete you’re ready to create a task. Let’s start with a simple script that downloads a web page and saves it to a file. We’ll create the script and set it up to run every ten minutes. For downloading the page we’ll use the awesome requests package. First, the script itself.

from datetime import timedelta
from celery import Celery
import requests

app = Celery('page_saver')
app.conf.update(
    BROKER_URL='redis://localhost:6379/0',
    CELERY_TASK_SERIALIZER='json',
    CELERY_ACCEPT_CONTENT=['json'],
    CELERYBEAT_SCHEDULE={
        'save_page': {
            'task': 'page_saver.save_page',
            'schedule': timedelta(minutes=10),
            'args': ('http://example.com', 'saved_page.html')
        }
    }
)

@app.task(name='page_saver.save_page')
def save_page(url, file_name):
    response = requests.get(url)
    if response.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(response.content)

After importing Celery and the requests package the first thing this script does is create the app object and initialize it. The Celery app object marks this module as a Celery app, and the only parameter we’re passing here is the name, which must match the name of the module. In this example the script would be saved on the file system as “page_saver.py”.

The call to app.conf.update is one way of passing in configuration data to the Celery object. There are several others, and in general most of the configuration options and settings are well beyond the scope of this post. You can find a good intro and links to more information here.

The first setting, ‘BROKER_URL’, specifies the pipeline that Celery will use for passing messages between clients and workers. I’m using redis here because I always have it lying around, but you can also use RabbitMQ, or a database, although that isn’t recommended in production.

The next two settings, ‘CELERY_TASK_SERIALIZER’, and ‘CELERY_ACCEPT_CONTENT’ instruct Celery to use json encoding when talking to the task as a client, and when accepting messages in the task as a server. Without these settings Celery will also allow pickle (and warn on startup), and nobody should receive pickle from a network port unless it is with a pair of tweezers. In any event, pickle is deprecated so json is the way to go for that reason as well.
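To make the tweezers point concrete, here is a toy illustration of why unpickling data from an untrusted source is dangerous (nothing Celery-specific; the class name is made up):

```python
import pickle

class Evil:
    # __reduce__ tells pickle how to rebuild the object on load; a hostile
    # sender can make "rebuilding" mean "call any function I choose"
    def __reduce__(self):
        return (eval, ("40 + 2",))

payload = pickle.dumps(Evil())   # looks like innocent bytes on the wire
result = pickle.loads(payload)   # silently evaluates the expression instead
```

Swap `eval` for something nastier and you have remote code execution on the worker, which is exactly why restricting the accepted content types to json matters.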

The last setting contains the schedule on which our task will be executed. This requires some explanation, and it is easiest to do it in tandem with an explanation of the actual task method. As you can see in the script we define a method named save_page, and decorate it with “@app.task()”, passing in a name for the task. The naming seems pretty much arbitrary, but I like the suggested convention of “appname.taskname.”

The decorator turns this callable into the entry point for a Celery task with the given name. The script could contain many more tasks, each being individually callable, schedulable, etc., but for this example one will suffice, and I think I like a 1:1 mapping between app and task anyway. The actual implementation of the save_page method is self-explanatory, and contains no error handling or retry logic for brevity’s sake.

With the task defined the script above constitutes a complete Celery worker, and it can be run at any time using Celery to activate it and send it a message. For example, save the script to a folder and then cd into that folder and do this:

celery -A page_saver worker --loglevel=INFO

You should see a bunch of output indicating that celery is starting the worker. Once it is up and running open up another terminal and start the python interpreter. Enter the following statements:

>>> from page_saver import save_page
>>> save_page.delay('http://example.com', 'saved_page.html')

The task object, save_page, serves as a proxy to call itself in the worker process running in the other terminal session. The delay method executes the call asynchronously, and Celery provides additional methods to access the results when they become available. In this case the only result is the file output.
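Stripped of everything that makes Celery robust, that client/worker exchange amounts to a json message pushed through the broker and dispatched to a registered callable. This stdlib-only sketch is a toy stand-in for the mechanism, not Celery's implementation, using a plain list in place of redis:

```python
import json

queue = []      # stand-in for the redis broker
registry = {}   # task name -> callable, like Celery's task registry

def task(name):
    """Register a function and give it a .delay() that enqueues a json message."""
    def wrap(fn):
        registry[name] = fn
        fn.delay = lambda *args: queue.append(
            json.dumps({"task": name, "args": args}))
        return fn
    return wrap

@task("page_saver.save_page")
def save_page(url, file_name):
    # real code would download and save; return the args so we can see the call
    return (url, file_name)

# client side: .delay() serializes the call and pushes it to the broker
save_page.delay("http://example.com", "example.html")

# worker side: pop the message, look up the task by name, dispatch it
msg = json.loads(queue.pop(0))
result = registry[msg["task"]](*msg["args"])
```

Celery adds result backends, retries, acknowledgement, and much more on top, but the shape of the round trip is the same.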

So, that brings us back to the schedule. To have a scheduled task we need a scheduler – a client to wake up and essentially call delay() on our task when a timer expires. Fortunately Celery includes celery beat, which does exactly that. You can run celery beat as a stand-alone service and use the same schedule configuration schema, but you can also run it in tandem with the worker using the -B command line switch:

celery -A page_saver worker -B --loglevel=INFO

The schedule configuration in the example script establishes a single schedule named “save_page” and tells it to run the task named “page_saver.save_page” every ten minutes using a timedelta. You can also set the ‘schedule’ field to a crontab object and use all of the same options available to a cron job, so there is a lot of flexibility.
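For example, a crontab-based entry for the same task might look like this (a sketch; the schedule values and args are arbitrary):

```python
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    'save_page_nightly': {
        'task': 'page_saver.save_page',
        # run at 03:30 every day, cron-style
        'schedule': crontab(hour=3, minute=30),
        'args': ('http://example.com', 'saved_page.html'),
    }
}
```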

And that’s about it for this simple example, which incidentally is complete enough to handle a lot of system maintenance tasks. Add some error-handling, logging and notification and you’re ready to go. If your needs go beyond simple repeatable tasks you’ll find Celery has a lot of options for executing distributed workloads and returning results asynchronously. Have fun!

Elasticsearch discovery in EC2

Elasticsearch out of the box does a good job of locating nodes and building a cluster. Just assign everyone the same cluster name and ES does the rest. But running a cluster on Amazon’s EC2 presents some additional challenges specific to that environment. I recently set up a docker-ized Elasticsearch cluster on two EC2 medium instances, and thought I would put down in one place the main things that I ran into setting it up.

The most fundamental constraint in EC2 is that multicast discovery (unsurprisingly) will not work. EC2 doesn’t allow multicast traffic. This gives you two choices: use unicast discovery, or set up EC2 dynamic discovery. Unicast discovery requires that all nodes have the host IPs for the other nodes in the cluster. I don’t like that idea, so I chose to set up dynamic discovery using the EC2 APIs. Getting EC2 dynamic discovery to work requires changes in three places: the Elasticsearch installation, the configuration file at /etc/elasticsearch/elasticsearch.yml, and finally the instance and security configuration on Amazon. I’ll take these in order.

The best way to integrate EC2 into the Elasticsearch discovery workflow is to use the cloud-aws plugin module. You can install this at any time with the following command:

sudo /usr/share/elasticsearch/bin/plugin -install \

This will pull the latest plugin from the download site and install it, which is basically just extracting files. Note that the version in the command is the latest one. You can check here to see if it is still current. And that’s all there is to that. Adding the cloud-aws plugin enables the discovery.ec2 settings in elasticsearch.yml, which is where we’ll head next.

The Elasticsearch config file is located at /etc/elasticsearch/elasticsearch.yml, and we’ll want to change/add a few things there. First, obviously, give everyone the same cluster name:

cluster.name: my_cluster

Another setting that makes sense in production, at least, is to require the cloud-aws plugin to be present on start:

plugin.mandatory: cloud-aws

The next two settings are required for the cloud-aws plugin to communicate on your behalf with the great AWS brain in the sky:

cloud.aws.access_key: NOTMYACCESSKEYATALL
cloud.aws.secret_key: ITw0UlDnTB3seCRetiFiPuTItHeR3

Not to digress too much into AWS access management, but if you’ve set things up the right way then the access key and secret key used above will be for an IAM sub-account that grants just the specific permissions needed. The next two settings deal with discovery directly:

discovery.type: ec2
discovery.zen.ping.multicast.enabled: false

The first one just sets the discovery type to use the EC2 plugin. Pretty self-explanatory. The second one disables the use of multicast discovery, on the principle of “it doesn’t work, so don’t try it.” The last two settings we’ll discuss can be seen as alternatives to one another, but are not mutually exclusive, which requires a bit of explanation.

Basically, we may need to filter the instances that get returned to the discovery process. When the cloud-aws plugin queries EC2 and gets a list of addresses back, it is going to assume they are all Elasticsearch nodes. During discovery it will try to contact them, and if some are not actually nodes it will just keep trying. This behavior makes sense with the multicast discovery process, because if you are not listening for multicast traffic then you don’t respond to it. But the EC2 discovery APIs will return all the instances in an availability zone, so we need some way to identify to Elasticsearch discovery which ones are really nodes.

One way to do this is by using a security group. You can create a security group and assign all the instances that you want in a particular cluster to that group, then make the following addition to the config:

discovery.ec2.groups: my_security_group

This setting tells the plugin to return only those instances that are members of the security group ‘my_security_group.’ Since you will need a security group anyway, as explained below, this is a convenient way to separate them from the crowd. But there can be cases where you don’t want to partition on the security group name. You might, for example, want to have one security group to control the access rules for a set of instances representing more than one cluster. In that case you can use tags:

discovery.ec2.tag.my_tag: my_tag_value

This setting tells the plugin to return only those instances that have the tag ‘my_tag’ with the value ‘my_tag_value.’ This is even more convenient, since it doesn’t involve mucking about with security groups, and setting the tag on an instance is easily done. Finally, as mentioned before these aren’t mutually exclusive. You can use the groups option to select a set of instances, and then partition them into clusters using tags.
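Conceptually the filtering amounts to intersecting those attributes over the instances EC2 returns. This is a schematic sketch of that logic, not the plugin's actual code; the records and names are invented for illustration:

```python
# Toy records standing in for what the EC2 describe-instances API returns
instances = [
    {"ip": "10.0.0.1", "groups": ["my_security_group"],
     "tags": {"my_tag": "my_tag_value"}},
    {"ip": "10.0.0.2", "groups": ["my_security_group"],
     "tags": {"my_tag": "other_cluster"}},
    {"ip": "10.0.0.3", "groups": ["web"], "tags": {}},
]

def discovery_candidates(instances, group=None, tags=None):
    """Keep instances matching the security group AND all required tags."""
    found = []
    for inst in instances:
        if group and group not in inst["groups"]:
            continue
        if tags and any(inst["tags"].get(k) != v for k, v in tags.items()):
            continue
        found.append(inst["ip"])
    return found

# the group selects the superset, the tag partitions it into a cluster
ips = discovery_candidates(instances,
                           group="my_security_group",
                           tags={"my_tag": "my_tag_value"})
```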

And that’s it for the elasticsearch.yml settings, or at least the ones I had to change to make this work on EC2. There are a lot of other options if your specific case requires tweaking, and you can find an explanation of them here. The last thing I want to go into is the set of steps to take in the Amazon EC2 console with respect to configuration. These fall into two areas: security groups and instance configuration. I don’t want to digress far into the specific AWS console steps, but I’ll outline in general what needs to happen.

Security groups first. You’re going to need all the nodes in your cluster to be able to talk to each other, and you’re going to want some outside clients to be able to talk to the nodes in the cluster as well. There are lots of different cases for connecting clients for querying purposes, so I’m going to ignore that topic and focus on communications between the nodes.

The nodes in a cluster need to use the Elasticsearch transport protocol on port 9300, so you’ll need to create a security group, or modify an existing one that applies to your instances, and create a rule allowing this traffic. For my money the easiest and most durable way to do this is to have a single security group covering just the nodes, and to add a rule allowing inbound TCP traffic on port 9300 from any instance in the group. If you are using the discovery.ec2.groups method discussed above, make sure to give your group the same name you used in the settings.
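With the AWS CLI installed and configured, that rule can be added with something along these lines (the group name is the example from above; adjust to your own setup):

```shell
aws ec2 authorize-security-group-ingress --group-name my_security_group \
    --protocol tcp --port 9300 --source-group my_security_group
```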

The last point is instance configuration, and for my purposes here it’s really just setting the tag or security group membership appropriately so that the instance gets included in the list returned from discovery. There are lots of other specifics regarding how to set up an instance to run Elasticsearch efficiently, but those are topics for another time (after I figure them out myself!)

The very last thing I want to mention is a tip for Docker users. If you’re running Elasticsearch inside a container on EC2 your discovery is going to fail. The first node that starts is going to elect itself as master, and if you query cluster health on that node it will succeed and tell you there is one node in the cluster, which is itself. The other nodes will hang when you try the same query and that is because they are busy failing discovery. If you look in /var/log/elasticsearch/cluster_name.log you’re going to see some exceptions that look like this:

[2014-03-18 18:20:04,425][INFO ][discovery.ec2            ]
[MacPherran] failed to send join request to master 
reason [org.elasticsearch.transport.RemoteTransportException:
Node [[MacPherran][aHWxgYAhSpaWlPEXNhs7RA][4fb92271cce6][inet[/]]] 
not master for join request from 

I cut out a lot of detail, but basically the reason this is happening is that, in this example, node MacPherran is talking to itself and doesn’t know it. The problem is caused by the fact that a running Docker container has a different IP address than the host instance it is running on. So when the node does discovery it first finds itself at the container IP (on Docker’s default bridge that is something like 172.17.0.2), and then finds itself again at the instance IP returned from EC2.
From there things do not go well for the discovery process. Fortunately this is easily fixed with another addition to elasticsearch.yml:

network.publish_host: <the host instance's IP>
This setting tells Elasticsearch that this node should advertise itself as being at the specified host IP address. Set this to the host address of the instance and Elasticsearch will now tell everyone that is its address, which it is, assuming you are mapping the container ports over to the host ports. Now, of course, you have the problem of how to inject the host IP address into the container, but hopefully that is a simpler problem, and is left as an exercise for the reader.

Docker: run startup scripts then exit to a shell

As I’ve mentioned before, Docker is set up to run one command at container start up, and when that command exits the container stops. There are all sorts of creative ways to get around that if, for example, you want to mimic init behavior by launching some daemons before opening a shell to work in the container. Here’s a little snippet from a Docker build file that illustrates one simple approach:

CMD bash -C '/path/to/';'bash'

In the startup script (which I usually copy to /usr/local/bin during container build) you can put any start up commands you want. For example the one in the container I’m working on now just has:

service elasticsearch start

The result when launching the container is that bash runs, executes the startup script (which spools up Elasticsearch), and then executes another, interactive shell; when that shell exits, the container stops:

mark:~$ sudo docker run -i -t mark/example
 * Starting Elasticsearch Server                                         [ OK ] 
root@835600d4d0b2:/home/root# curl http://localhost:9200
{
  "status" : 200,
  "name" : "Box IV",
  "version" : {
    "number" : "1.0.1",
    "build_hash" : "5c03844e1978e5cc924dab2a423dc63ce881c42b",
    "build_timestamp" : "2014-02-25T15:52:53Z",
    "build_snapshot" : false,
    "lucene_version" : "4.6"
  },
  "tagline" : "You Know, for Search"
}
root@835600d4d0b2:/home/root# exit

Docker Builds and -no-cache

I was building out a search server container with Elasticsearch 1.0.1 today, and I ran into one of those irritating little problems that I could solve a lot faster if I would just observe more carefully what is actually going on. One of the steps in the build is to clone some stuff from our git repo that includes config files that will get copied to various places. In the process of testing I added a new file and pushed it, then re-ran the build. Halfway through I got a stat error from a cp command that couldn’t find the file.

But, but, I had pushed it, and pulled the repo, so where the hell was it? Yesterday something similar had happened when building a logstash/redis container. One of the nice things about a Docker build is that it leaves the interim containers installed until the end of the build (or forever if you don’t use the -rm=true option). So you can start up the container from the last successful build step and look around inside it. In yesterday’s case it turned out I was pushing to one branch and cloning from another.

But that problem had been solved yesterday. Today’s problem was different, because I was definitely cloning the right branch. I took a closer look at the output from the Docker build, and where I expected to see…

Step 4 : RUN git clone blahblahblah.git
 ---> Running in 51c842191693

I instead saw…

Step 4 : RUN git clone blahblahblah.git
 ---> Using cache

Docker was assuming the effect of the RUN command was deterministic and was reusing the interim image from the last time I ran the build. Interestingly it did the same thing with a later wget command that downloaded an external package. I’m not sure how those commands could ever be considered deterministic, since they pull data from outside sources, but whatever. The important thing is you can add the -no-cache option to the build command to get Docker to ignore the cache.

sudo docker build -no-cache -rm=true - < DockerFile

Note that this applies to the whole build, so if you do have some other commands that are in fact deterministic they are not going to use the cache either. It would be nice to have an argument to the RUN command to do this on per-step basis, but at least -no-cache will make sure all your RUN steps get evaluated every time you build.
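One partial workaround, if you only need to bust the cache from a certain step onward, is to insert a throwaway instruction whose value you change whenever you want everything after it re-evaluated. A sketch (the ENV name and date are arbitrary):

```
# changing this value invalidates the cache for every step below it
ENV REFRESHED_AT 2014-03-20
RUN git clone blahblahblah.git
```

Steps above the ENV line still come from the cache, so you keep the speed benefit for the deterministic parts of the build.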

Logstash ate my events

I’ve mentioned previously that I’m testing some infrastructure components for a new system at work. One of those components is Logstash, which along with redis and Elastic Search will function as our distributed logging system. As a test setup I have python scripts generating log messages into a redis list, which is then popped by Logstash for indexing into Elastic Search. The whole thing is running in a Docker container, the building of which I discussed in my last post. These processes only generate a couple of events per second, based on the other work they are doing. I’m running four of them for now, so I am getting six to eight events per second written to redis. This is very low volume, but sufficient for me to get a handle on how everything works.

The first time I ran the test and connected to Kibana to see the collected events stream I thought it was so cool that I just paged around with a silly smile on my face for fifteen minutes, without paying a lot of attention to the details. The next time I ran it I noticed that some events seemed to be missing. I know the pattern of events emitted by these scripts very well, and I wasn’t seeing them all. So I shut down, checked a few things, convinced myself that no logic in the code could cause that pattern, and then fired the test back up again. This time things looked a lot closer to normal, but I couldn’t quite be sure without having another definitive source to compare to.

Fortunately I had already built in code to enable/disable both file-based and redis-based logging, so I simply enabled file logging and rebuilt the container. I fired it up, ran a short test, and the number of lines in the log files exactly matched the number of events in the logstash index on ES. The events in Kibana looked complete, too. So ok, problem solved.

I got off on other things for a day or two, and when I came back and started running this test again I forgot to disable file logging. I realized it after a short time and killed the test. Before rebuilding the container I decided to check the counts again. They were off. There were something like 500 fewer events in the logstash index than in the files, out of a total of several thousand. I started the test again, intending to run a longer one and do a detailed comparison, and this time the earlier pattern I had seen, with obviously missing events, was back. I let this test run for a bit and then compared counts: ~6500 in the files vs. ~3200 in the index. Half my events had disappeared.

I was pretty sure the culprit wasn’t redis, which is pretty bulletproof, but to quickly rule it out I shut down logstash and ran the test. With nothing reading them the events piled up in redis and at the end llen showed the counts to be exactly the same. Redis wasn’t mysteriously losing anything. I next checked the logstash log, but there was nothing in it of note. I connected directly to ES and queried the index. The counts all matched with what Kibana had shown me. There was no indication anywhere as to why half the events had evaporated.

I was poking around on the logstash site looking for clues when I noticed the verbosity settings in the command line. I quickly edited the init.d script that launches the daemon to set -vv, the highest logging verbosity level for logstash. I reran the test and again saw the same pattern of missing events. I let it run for a minute or so and then shut it down. The logstash.log file was now over 7 megs, compared to a few hundred k in the last test. I dove in and quickly noticed some Java exceptions getting logged. These were field parser errors, caused by an inner number format exception, and the field that caused the issue was named ‘data.’ I grep’d the log to see how many of these there were, and the number came back ~3200. Well, whaddaya know?

The data field is one that I use in my log event format to contain variable data. In one message it may have a number, and in another a string. Looking at the exceptions in the log made it very clear what was going on: ES was dynamically assigning a type to the field based on the data it received. Not all my log events include data, and depending on the results of processing, the first populated ‘data’ field to get into the queue might hold a number, or it might hold a string. If a string field arrived first all was well, because ES could easily convert a number to a string. But if a number field arrived first ES decided that all subsequent data had to be converted to a number, which was not possible in about half the cases.
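Schematically, the failure mode amounts to this (a toy model of dynamic mapping, not ES code): the first concrete value locks the field's type, and later values must coerce to it or the event is rejected:

```python
def index_events(events):
    """Toy model: the first concrete 'data' value fixes the field's type."""
    field_type = None
    indexed, failed = [], []
    for ev in events:
        value = ev.get("data")
        if value is None:
            indexed.append(ev)          # events without 'data' are fine
            continue
        if field_type is None:
            field_type = type(value)    # dynamic mapping: first value wins
        try:
            ev["data"] = field_type(value)
            indexed.append(ev)
        except (TypeError, ValueError):
            failed.append(ev)           # these are the "eaten" events
    return indexed, failed

ok, lost = index_events([
    {"data": 31337},         # a number arrives first: field becomes numeric
    {"data": "fetching x"},  # a string can't coerce to a number: rejected
    {"data": 42},            # numbers keep working, so about half survive
])
```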

The solution turned out to be pretty simple. You can include a custom template for your index in the logstash configuration, and this can include field mappings. In order to make this work through logstash you have to be using ES version 0.90.5 or later. Fortunately I am using the embedded instance for this test, which is at 0.90.9. To implement the template solution required modifying the logstash output to reference the template:

output {
  elasticsearch {
    embedded => true
    template => "/opt/logstash/es-event-template.json"
  }
}
It also required the template itself. At first I tried creating a very simple template that only added the single field mapping I needed. That didn’t work, and I assumed it was because I was overwriting logstash’s settings in a way that was breaking things. I had hoped my single setting would just get merged in, but either that’s not happening or there is something else I don’t understand. What I ended up doing is grabbing the current template from:

And then editing it to add the part I needed:

"data": { 
    "index": "not_analyzed",
    "type": "string"
}
And with that the problem was solved. The lesson of the day for me is: if logstash appears to be eating your events, crank up the logging verbosity level and see what’s giving it indigestion.

Containers are going to hurt Windows in the datacenter

As anyone who accidentally stumbles on this blog knows, I’ve been playing with Docker for a few weeks now, which makes me an expert and entitled to opine on its future. Hell, you don’t need to be an expert to know where this is going. Docker rocks. It’s going to revolutionize the way applications and dependencies are managed in the datacenter. No technology has captivated me this way since… I don’t even know. Virtualization came close. But it is containers that are poised to really deliver the value virtualization promised. And Windows has nothing comparable to what they offer. There are plenty of virtualization solutions that work on Windows, but containers are not virtualization: they’re about environment isolation, and they are a lot lighter weight, a lot easier to use, and a lot more manageable than virtualization technologies.

Over the weekend I finished up an experiment to encapsulate the complete environment for my spider application in a Docker container. It was a huge success from my point of view. Just to recap, I’m working on an application that uses a bunch of spider processes to get information off the web. The environment these things run in is fairly sophisticated at this point: they get their work input from, and stream work product to, redis. They also stream log events to redis, from which the events are picked up by logstash and indexed into Elastic Search. The whole process runs under the control of supervisord. So there are a few pieces in play.

Configuring the environment on EC2 previously required approximately 50 steps, including installing the base dependencies from the python, debian, and java repositories. After this weekend’s work it should require two: install Docker, and build the container image.

All those manual steps are now captured declaratively in a Docker build file. The base image that the container build depends on is an Ubuntu 13.10 debootstrap install with python and a few other things added, that I pushed to my repository on the Docker index. It can now be pulled to anywhere that Docker is installed. I could also build and push the final environment image, but all images on the Docker index are public. The way around that is either to run your own repository, or build the final container image in place on the server. I’m going to take the latter approach for now.

So, with the base image in the repo, and the Docker build file created and tested, launching a new environment on a server with Docker installed looks like this:

sudo docker build -t="my:image" -rm=true - < mydockerfile
sudo docker run -t -i -n="my:name" my:image /bin/bash

Boom, the server is running. In my case what that means is that redis is up, logstash is up, Elastic Search is up, supervisord is up and has launched the spider processes, and the proper ports are exposed and bridged over to the host system. All I have to do is point my web browser at the host and the Kibana3 dashboard pops up with the event stream visualized so I can monitor the work progress. That. Is. Cool.

Which leads me back to my opening thought. As a developer who spent 20 years writing code for Microsoft platforms I don’t want to come off like an ex-smoker who has seen the light and needs to let everyone know it. But at the same time, I can’t help but wonder how Microsoft will respond to this. Docker came out of nowhere, igniting a wildfire by making a formerly obtuse technology (LXC containers) easier to understand and use. I think within another year or so containers will be as ubiquitous and as important in the datacenter as virtualization is now, perhaps more so. And Windows will be increasingly relegated to running Exchange and SharePoint servers.

Docker build files: the context of the RUN command

Docker is a game changing tool that is simplifying server dependency management in a wide variety of applications. For many of these applications simply spinning up a new container and installing a few things may be sufficient. Once the container is completed and tested you commit it to an image, push the image to your Docker repo, and you’re ready to pull and run it anywhere.

If your image has a lot of working parts and a more complicated install, however, this workflow is probably not good enough. That’s where a DockerFile comes in. A DockerFile is basically a makefile for Docker containers. You use a declarative syntax to specify the base image to build from, and the steps to take to transform it into the image that you want. Those steps usually include executing arbitrary shell statements using the RUN command.

The format of the RUN command is simply “RUN the-command-text”. Initially you might be tempted to look at RUN as essentially a shell prompt, from which you can do anything you would at an interactive shell, but that isn’t quite the way things work. For example, have a look at this minimal DockerFile:

# A minimal DockerFile example
FROM my-image
RUN mkdir /home/root/test
RUN touch /home/root/test/test.txt
RUN cd /home/root/test
RUN rm test.txt

This seems pretty straightforward: start with the base my-image, then create a directory, create a file in that directory, cd into that directory, and finally remove the file. If we try to execute this file using “docker build”, however, we get the following output:

Uploading context 2.048 kB
Uploading context 
Step 1 : FROM mn:saucy-base
 ---> 69e9b7adc04c
Step 2 : RUN mkdir /home/root
 ---> Running in d4802792515c
 ---> 4be0a443060a
Step 3 : RUN touch /home/root/test.txt
 ---> Running in 27aee53a2a17
 ---> b67284690b98
Step 4 : RUN cd /home/root
 ---> Running in 58d5fedeee98
 ---> 3a5826ad206c
Step 5 : RUN rm test.txt
 ---> Running in 02f11782a5e7
rm: cannot remove 'test.txt': No such file or directory
2014/01/31 15:11:34 The command [/bin/sh -c rm test.txt] returned a non-zero code: 255

The reason why the rm command was unable to find test.txt is hinted at by the output above the error. In particular, note the following:

Step 4 : RUN cd /home/root
 ---> Running in 58d5fedeee98
 ---> 3a5826ad206c
Step 5 : RUN rm test.txt
 ---> Running in 02f11782a5e7

Every instance of the RUN command that Docker processes gets applied to a new container that resulted from the changes created by the previous command. That’s what “Running in 58d5fedeee98” tells us. The command is being executed in the container with that ID, which is clearly different from the ID of the container in which the next command runs.

What this means is that the context of each RUN command is essentially a new instance of the shell, and any previous non-persistent changes like setting the current working directory are lost. The following revised DockerFile shows one way around this issue:

FROM my-image
RUN mkdir /home/root/test
RUN touch /home/root/test/test.txt
RUN cd /home/root/test;rm test.txt

Now the command that sets the working directory and the command that removes the file execute in the same context. If we re-run the build command we get the following output:

Uploading context 2.048 kB
Uploading context 
Step 1 : FROM mn:saucy-base
 ---> 69e9b7adc04c
Step 2 : RUN mkdir /home/root
 ---> Running in 633dd0266b8e
 ---> 7b2a80409513
Step 3 : RUN touch /home/root/test.txt
 ---> Running in d8122e2fb2ec
 ---> 70d091a60051
Step 4 : RUN cd /home/root;rm test.txt
 ---> Running in 68589850d97c
 ---> b88df827ad5f
Successfully built b88df827ad5f
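Docker also provides a WORKDIR instruction that, unlike a RUN cd, persists across subsequent steps, so the same DockerFile could be written as:

```
FROM my-image
RUN mkdir /home/root/test
RUN touch /home/root/test/test.txt
WORKDIR /home/root/test
RUN rm test.txt
```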

One other quick note: when you build a container from a DockerFile containing many steps, a lot of intermediate containers are generated. The way to avoid having to manually delete them is to use the -rm flag to build:

sudo docker build -rm=true - < DockerFile

This will remove all the intermediate containers, as long as the script completed successfully. If any of the commands in the script failed, then it will leave all those containers behind. In that case, the easy way to get rid of them is:

sudo docker rm $(sudo docker ps -a -q)

Thanks to Dan Sosedoff for the tip.

Docker: fat container vs. skinny container

For the last week I’ve been evaluating some infrastructure technologies for our new platform at work. Most recently that effort has focused on using Docker for dependency management and deployment, as well as ElasticSearch, Logstash, and Kibana for log flow handling. Since our embryonic system does not yet produce much in the way of log data, I turned to an existing web spider framework I had built that generates tons of it. Well, according to my standards of scale, anyway. Running the spider against a typical list of 30,000 target domains yields close to a gig of log data. At maturity our new system will likely generate more than this, but the spider makes a great initial test of the logging pipeline I have in mind.

I brought Docker into the mix because we deploy onto AWS, and the promise of automating the build and deployment of independent containers to the server (continuous integration) was too good to pass up. As I began to work on the structure of the container for the spider framework I ran into a decision which will eventually confront everyone who is designing a container-based deployment: whether to make a single fat container with all the necessary internal dependencies, or several skinnier containers that work together. For example, the spider framework has the following major components:

  • The python spider scripts
  • Redis
  • Logstash
  • Elasticsearch
  • Kibana
  • Supervisor

The supervisor daemon runs the spiders. The spiders take input from, write work output to, and log messages into Redis. Logstash reads log messages from Redis and indexes them in Elasticsearch, and Kibana queries Elasticsearch to provide log visualization and analysis.
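To make that flow concrete, a logstash pipeline for it might look something like the following sketch. The port, the list key, and the use of the embedded Elasticsearch instance are my assumptions for illustration, not the framework’s actual settings:

```
# Sketch only: the redis port and list key are illustrative
input {
  redis {
    host => "127.0.0.1"
    port => 6380
    data_type => "list"
    key => "spider:log"
  }
}
output {
  elasticsearch { embedded => true }
}
```

The redis input pops messages off the named list, and the embedded option spares you a separate Elasticsearch install for small setups like this one.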

Looking at that list there are a few different ways you could can it up. The most granular would be having each component run in its own container with just its own dependencies. That would also be the best option from a future scalability perspective. If you needed to cluster Redis or Elasticsearch, it would be a lot easier to do if everything ran separately. Docker makes it pretty easy to specify the network links between containers in a stable, declarative manner. So this could be manageable.

On the other hand, for at least this test iteration I am also attracted to the idea of a single “appliance” container that has everything needed by the framework, with the proper ports exposed so I can connect to it from the outside to monitor and control the operation. In that case configuring a new server at AWS would be a simple matter of installing Docker, pulling the image, launching the container, and then connecting to it from a browser. For simplicity’s sake I find this prospect attractive, and since I am currently using Logstash with the embedded Elastic Search and Kibana instances, I decided to try this approach first.

Probably the main thing you need to get around in this scenario is the fact that Docker wants to run one process per container. It launches that process when the container starts, and when the process exits the container exits. In my case I have at least nine processes I need to run: four instances of Python running the spider script, two Redis daemons, and the daemons for Logstash, Elasticsearch, and Kibana. This situation isn’t uncommon for Docker deployments. There is a lot of discussion on the web, usually leading to the overarching question that is the subject of this post: do you really want to run a bunch of stuff in one container? Assuming you do, there are a few ways to go about it.

One thing to note about Docker (and LXC containers in general, I think) is that they aren’t virtual machines. They don’t boot up, per se, and they don’t start daemons through rc or initctl. Even so, it is nice to have core daemons running as services because you get some nice control and lifecycle semantics, like automatically respawning if something faults and a daemon crashes. You can do this by installing them as services, and then starting them manually from a launch script when the container runs. So the command to run your container might look like:

sudo docker run -d markbnj/spider /usr/local/bin/

And then the script looks something like:

/etc/init.d/redis_6379 start
/etc/init.d/logstash start

Not quite good enough, though, because this script will exit, and then the container will exit. You have to do something at the end to keep things running. In my case I need to get redis and logstash spun up, and then launch my spiders. That’s where supervisor comes in. The last line of my launch script will be a call to run supervisord, which will launch and control the lifecycles of the four spider instances. The container will remain open as long as supervisord is running, which will be until I connect to it and tell it to shut down.
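For the supervisord piece, a config along these lines would keep four spider instances alive. The program name, command, and paths are placeholders of mine, not the framework’s actual layout:

```
; Sketch: program name and paths are illustrative
[supervisord]
nodaemon=true

[program:spider]
command=python /opt/spider/spider.py
process_name=%(program_name)s_%(process_num)02d
numprocs=4
autorestart=true
```

The nodaemon=true setting is the important bit: it makes supervisord run in the foreground, so the launch script blocks on it and the container stays up until supervisord is told to shut down.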

As of last night I had everything but Supervisor set up. I need to do some more reading on parsing json with Logstash, and I need to write some code in the spiders to change the logging method to use the Redis queue. After that I will be able to deploy the last pieces, convert my recipe into a Dockerfile I can automate the container build from, and then test. If it all works the way I intend then I will be able to simply launch the container and the spiders will start working. I will be able to connect to port 6379 to monitor the redis work output queue, and 9200/9292 to query log data. Pretty neat stuff.

Using Docker to turn my server app into an appliance

After messing with Docker enough to get comfortable with the way it works, I started thinking about a project I could do that would actually make my life easier right now. I have this server application that consists of a number of pieces. The main part is a Python script that scans URLs for information. I can run as many of these as I want. They get their target URLs from redis, and write their work results back out to it as well. Currently they log to files in a custom format, including the pid so I can tell which instance was doing what.

When I want to run these things on AWS I open up a terminal session, load the initial redis database, then use screen to run as many instances of the spider as I want. Once they’re running I monitor output queues in redis, and sometimes tail the logs as well. I’d like to can this whole thing up in Docker so that I can just pull it down from my repo to a new AWS instance and connect to it from outside to monitor progress. I’d like the container to start up as many instances of the spider as I tell it to, and collect the log information for me in a more usable way.

To do this my plan is to change the spider as follows: I will rewrite the entry code so that it takes the number of instances to run as an argument, and then uses the multiprocessing Process class to launch the instances in separate processes. I will also rewrite the logging code to log messages to a redis list instead of files. I will probably do this by launching a second redis instance on a different port, because the existing instance runs in AOF (append-only file) mode, and that’s overkill for logging a ton of messages. Lastly I am going to install Logstash and Kibana. I’ll tell Logstash to consume the logging messages from redis and insert them into its internal Elasticsearch db, and I’ll use Kibana to search and visualize this log data.
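As a sketch of that entry-point change, something like the following would do it. All the names here are mine, and a multiprocessing.Queue stands in for the redis list so the example is self-contained; the real logger would LPUSH JSON messages to redis instead:

```python
# Sketch of an entry point that spawns N spider instances with
# multiprocessing.Process. A Queue stands in for the redis log list.
import json
import multiprocessing


def log_message(queue, worker_id, msg):
    # Stand-in for something like redis.lpush('spider:log', payload)
    queue.put(json.dumps({'worker': worker_id, 'msg': msg}))


def spider_worker(queue, worker_id):
    # The real worker would pop target URLs from redis and crawl them
    log_message(queue, worker_id, 'spider started')


def launch(num_instances):
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=spider_worker, args=(queue, i))
             for i in range(num_instances)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [json.loads(queue.get()) for _ in range(num_instances)]
```

Calling launch(4) spawns four worker processes and collects one log record from each; swapping the Queue for a redis client is the only change the container version would need.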

Redis, Logstash, and Kibana will all be set to run as daemons when the container starts, and the main container command will be a shell script that launches the spiders. The Docker image will expose the two Redis ports, and the Kibana web port. If all goes as planned I should be able to launch the container and connect to Redis and Kibana from outside to monitor progress. I have a quick trip to Miami at the start of this week, so I won’t be able to set this up until I get back, but when I do I’ll post here about my experiences and results.

Python: If you have Docker, do you need virtualenv?

I’ve been working in Python a lot over the last year. I actually started learning it about three years ago, but back then my day gig was 100% .NET/C# on the back-end, and just when I decided to do something interesting with Python in Ubuntu a series of time-constrained crises would erupt on the job and put me off the idea. Since that time my employer and my role have changed, the first totally, and the second at least incrementally. At KIP we use Python for daily development, and run all our software on CentOS on Amazon Web Services. As such I got to dive back in and really learn to use the language, and it’s been a lot of fun.

One of the de facto standard components of the Python development toolchain is virtualenv. We use it for Django development, and I have used it for my personal projects as well. What virtualenv does is pretty simple, but highly useful: it copies your system Python install along with supporting packages to a directory associated with your project, and then updates the Python paths to point to this copied environment. That way you can pip install to your heart’s content, without making any changes to your global Python install. It’s a great way of localizing dependencies, but only at the level of Python packages and modules. If your project requires a redis-server install, or you need lxml and have to install libxml2-dev and three or four other dependencies first, virtualenv doesn’t capture changes to the system at that level. You’re still basically polluting your environment with project-specific dependencies.

Dependencies in general are a hassle. Back when I started messing with VMs a few years ago I thought that technology would solve a lot of these problems, and indeed it has in some applications. I still use VMs heavily, and in fact my daily development environment is an Ubuntu Saucy VirtualBox VM running on Windows 7. But VMs are a heavyweight solution. They consume a fixed amount of RAM, and for best performance a fixed amount of disk. They take time to spin up. They’re not easy to move from place to place and it’s fairly complicated to automate their creation in a controllable way. Given all these factors I never quite saw myself having two or three VMs running with different projects at the same time. It’s just cumbersome.

And then along comes Docker. Docker is… I guess wrapper is too minimizing a term… let’s call it a management layer, over Linux containers. Linux containers are a technology I don’t know enough about yet. I plan to learn more soon, but for the moment it’s enough to know that they enable all sorts of awesome. For example, once you have installed Docker and have the daemon running, you can do this:

sudo debootstrap saucy saucy > /dev/null
sudo tar -C saucy -c . | sudo docker import - myimages:saucy

The first command uses debootstrap to download a minimal Ubuntu 13.10 image into a directory named saucy. It will take a little while to pull the files, but that’s as long as you’ll ever have to wait when building and using a Docker image.

The second command tars up the saucy install and feeds the tarball to docker’s import command, which causes it to create a new image and register it locally as myimages:saucy. Now that we have a base Ubuntu 13.10 image the following coolness becomes possible:

sudo docker run -i -t -name="saucy" -h="saucy" myimages:saucy /bin/bash

This command tells Docker to launch a container from the image we just created, with an interactive shell. The -name option will give the container the name “saucy,” which will be a convenient way to refer to it in future commands. The -h option causes Docker to assign the container the host name “saucy” as well. The bit at the end ‘/bin/bash’ tells Docker what command to run when the container launches. Hit enter on this and we’re running as root in a shell in a self-contained minimal install of Ubuntu. The coolest thing of all is that once we got past the initial image building step, starting a container from that image was literally as fast as starting Sublime Text. Maybe faster.

So we have our base image running in a container. Now what? Now we install stuff. Add in a bunch of things the minimal install doesn’t include, like man, nano, wget, and screen. Install Python and pip. Create a project folder and install Git. Whatever we want in our base environment. Once that is done we can exit the container by typing ‘exit’ or hitting control-d. Once back at the host system shell prompt the command:

sudo docker ps -a

…will show all the containers we’ve launched. Ordinarily the ps command only shows running containers, but the -a option causes it to show those that are stopped, like the one we just exited from. That container still has all of our changes. If we want to preserve those changes so that we can easily launch new containers with the same stuff inside, we can do this:

sudo docker commit saucy myimages:saucy-dev

This just tells Docker to commit the current state of the container saucy to a new image named myimages:saucy-dev. This image can now serve as our base development image, which can be launched in a couple of seconds anytime we want to start a new project or just try something out. I can’t overemphasize how much speed contributes to the usefulness of this tool. You can launch a new Docker container before mkvirtualenv can get through copying the base Python install for a new virtual environment. And it launches completely configured and ready to run commands in.

Given that, I found myself wondering just what use virtualenv is to me at this point? Unlike virtualenv, Docker captures the complete state of the system. I can fully localize all dependencies on a “virtualization” platform that is as easy to use as a text editor in terms of speed and accessibility. Even better, I can create a “Dockerfile” that describes all of the steps to create my custom image from a known base, and now any copy of Docker, running anywhere, can recreate my container environment from a version-controlled script. That is way cool.
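Such a Dockerfile might look like the following sketch, rebuilding the dev image from the saucy base created earlier. The package list and project path are illustrative, echoing the interactive setup described above:

```
# Sketch: recreate the dev image from the known saucy base
FROM myimages:saucy
RUN apt-get update
RUN apt-get install -y man nano wget screen python python-pip git
RUN mkdir -p /root/projects
```

Feeding this to docker build yields an image equivalent to the committed myimages:saucy-dev, but now the recipe lives in version control instead of in a container’s history.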

Now, the truth is, there are probably some valid reasons why the picture isn’t quite that rosy yet. The Docker people will be the first to tell you it is a very young tool, and in fact they have warnings about this splashed all over the website. There are issues with not running as root in the container. I haven’t been able to get .bashrc to run when launching a container for a non-root user. There are issues running screen due to the way the pseudo-tty is allocated. And if you’re doing a GUI app, then things may be more complicated still. All this stuff is being worked on and improved, and it’s very exciting to think of where this approachable container technology might take us.