Docker build files: the context of the RUN command

Docker is a game-changing tool that simplifies server dependency management in a wide variety of applications. For many of these applications, spinning up a new container and installing a few things may be sufficient. Once the container is complete and tested, you commit it to an image, push the image to your Docker repo, and you’re ready to pull and run it anywhere.

If your image has a lot of working parts and a more complicated install, however, this workflow is probably not good enough. That’s where a Dockerfile comes in. A Dockerfile is basically a makefile for Docker images. You use a declarative syntax to specify the base image to build from, and the steps to take to transform it into the image that you want. Those steps usually include executing arbitrary shell statements using the RUN command.

The format of the RUN command is simply “RUN the-command-text”. Initially you might be tempted to look at RUN as essentially a shell prompt, from which you can do anything you would at an interactive shell, but that isn’t quite the way things work. For example, have a look at this minimal Dockerfile:

# A minimal Dockerfile example
FROM my-image
RUN mkdir /home/root/test
RUN touch /home/root/test/test.txt
RUN cd /home/root/test
RUN rm test.txt

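This seems pretty straightforward: start with the base image my-image, then create a directory, create a file in that directory, cd into that directory, and finally remove the file. If we try to build from this file using “docker build”, however, we get the following output (captured from a run that used the base image mn:saucy-base and created the file directly in /home/root, so the image name and paths differ slightly from the listing above):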

Uploading context 2.048 kB
Uploading context 
Step 1 : FROM mn:saucy-base
 ---> 69e9b7adc04c
Step 2 : RUN mkdir /home/root
 ---> Running in d4802792515c
 ---> 4be0a443060a
Step 3 : RUN touch /home/root/test.txt
 ---> Running in 27aee53a2a17
 ---> b67284690b98
Step 4 : RUN cd /home/root
 ---> Running in 58d5fedeee98
 ---> 3a5826ad206c
Step 5 : RUN rm test.txt
 ---> Running in 02f11782a5e7
rm: cannot remove 'test.txt': No such file or directory
2014/01/31 15:11:34 The command [/bin/sh -c rm test.txt] returned a non-zero code: 255

The reason the rm command was unable to find test.txt is hinted at by the output above the error. In particular, note the following:

Step 4 : RUN cd /home/root
 ---> Running in 58d5fedeee98
 ---> 3a5826ad206c
Step 5 : RUN rm test.txt
 ---> Running in 02f11782a5e7

Docker applies each RUN command to a new container, created from the image that the previous step produced. That’s what the “Running in” lines tell us: the cd command executed in container 58d5fedeee98, while the rm command executed in a different container, 02f11782a5e7.

This means that each RUN command executes in what is essentially a new instance of the shell. Changes to the filesystem carry over from step to step, but non-persistent state such as the current working directory is lost. The following revised Dockerfile shows one way around this issue:

FROM my-image
RUN mkdir /home/root/test
RUN touch /home/root/test/test.txt
# cd and rm share one RUN, so both execute in the same shell
RUN cd /home/root/test;rm test.txt

Now the command that sets the working directory and the command that removes the file execute in the same context. If we re-run the build command we get the following output:

Uploading context 2.048 kB
Uploading context 
Step 1 : FROM mn:saucy-base
 ---> 69e9b7adc04c
Step 2 : RUN mkdir /home/root
 ---> Running in 633dd0266b8e
 ---> 7b2a80409513
Step 3 : RUN touch /home/root/test.txt
 ---> Running in d8122e2fb2ec
 ---> 70d091a60051
Step 4 : RUN cd /home/root;rm test.txt
 ---> Running in 68589850d97c
 ---> b88df827ad5f
Successfully built b88df827ad5f
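
The same behavior can be reproduced without Docker. The build log shows each step being handed to /bin/sh -c, so this small Python sketch (not part of the original build; the run_step helper is hypothetical, and it assumes a POSIX system with /bin/sh and a /tmp directory) runs commands through separate shells in the same way:

```python
import subprocess


def run_step(command, cwd="/"):
    """Run one command the way a RUN step does: in a fresh /bin/sh."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


# Two separate steps: the cd happens in a shell that exits right away,
# so the second shell starts from the original directory again.
run_step("cd /tmp")
separate = run_step("pwd")

# One combined step: cd and pwd share a shell, so the cd takes effect.
combined = run_step("cd /tmp; pwd")

print("separate steps:", separate)
print("combined step: ", combined)
```

The first line should report the starting directory and the second should report /tmp, which mirrors the difference between the failing and the working Dockerfile.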

One other quick note: when you build an image from a Dockerfile containing many steps, a lot of intermediate containers are generated. To avoid having to delete them manually, pass the -rm flag to build (newer Docker releases spell it --rm):

sudo docker build -rm=true - < Dockerfile

This will remove all the intermediate containers, as long as the build completed successfully. If any of the commands in the Dockerfile failed, it will leave all those containers behind. In that case, the easy way to get rid of them is the following command. Be aware that it removes every stopped container on the host, not just the intermediates:

sudo docker rm $(sudo docker ps -a -q)

Thanks to Dan Sosedoff for the tip.

Docker: fat container vs. skinny container

For the last week I’ve been evaluating some infrastructure technologies for our new platform at work. Most recently that effort has focused on using Docker for dependency management and deployment, as well as Elasticsearch, Logstash, and Kibana for log flow handling. Since our embryonic system does not yet produce much in the way of log data, I turned to an existing web spider framework I had built that generates tons of it. Well, by my standards of scale, anyway. Running the spider against a typical list of 30,000 target domains yields close to a gig of log data. At maturity our new system will likely generate more than this, but the spider makes a great initial test of the logging pipeline I have in mind.

I brought Docker into the mix because we deploy onto AWS, and the promise of automating the build and deployment of independent containers to the server (continuous integration) was too good to pass up. As I began to work on the structure of the container for the spider framework I ran into a decision which will eventually confront everyone who is designing a container-based deployment: whether to make a single fat container with all the necessary internal dependencies, or several skinnier containers that work together. For example, the spider framework has the following major components:

  • The Python spider scripts
  • Redis
  • Logstash
  • Elasticsearch
  • Kibana
  • Supervisor

The Supervisor daemon runs the spiders. The spiders take input from, write work output to, and log messages into Redis. Logstash reads log messages from Redis and indexes them in Elasticsearch, and Kibana queries Elasticsearch to provide log visualization and analysis.

Looking at that list there are a few different ways you could can it up. The most granular would be to run each component in its own container with just its own dependencies. That would also be the best option from a future scalability perspective. If you needed to cluster Redis or Elasticsearch, it would be a lot easier to do if everything ran separately. Docker makes it pretty easy to specify the network links between containers in a stable, declarative manner, so this could be manageable.

On the other hand, for at least this test iteration I am also attracted to the idea of a single “appliance” container that has everything needed by the framework, with the proper ports exposed so I can connect to it from the outside to monitor and control the operation. In that case configuring a new server at AWS would be a simple matter of installing Docker, pulling the image, launching the container, and then connecting to it from a browser. For simplicity’s sake I find this prospect attractive, and since I am currently using Logstash with the embedded Elasticsearch and Kibana instances, I decided to try this approach first.

Probably the main thing you need to get around in this scenario is the fact that Docker wants to run one process per container. It launches that process when the container starts, and when the process exits the container exits. In my case I have at least nine processes I need to run: four instances of Python running the spider script, two Redis daemons, and the daemons for Logstash, Elasticsearch, and Kibana. This situation isn’t uncommon for Docker deployments. There is a lot of discussion on the web, usually leading to the overarching question that is the subject of this post: do you really want to run a bunch of stuff in one container? Assuming you do, there are a few ways to go about it.

One thing to note about Docker containers (and LXC containers in general, I think) is that they aren’t virtual machines. They don’t boot up, per se, and they don’t start daemons through rc or initctl. Even so, it is nice to have core daemons running as services because you get some nice control and lifecycle semantics, like automatic respawning if something faults and a daemon crashes. You can do this by installing them as services, and then starting them manually from a launch script when the container runs. So the command to run your container might look like:

sudo docker run -d markbnj/spider /usr/local/bin/

And then the script looks something like:

/etc/init.d/redis_6379 start
/etc/init.d/logstash start

Not quite good enough, though, because this script will exit, and then the container will exit. You have to do something at the end to keep things running. In my case I need to get Redis and Logstash spun up, and then launch my spiders. That’s where Supervisor comes in. The last line of my launch script will be a call to run supervisord, which will launch and control the lifecycles of the four spider instances. For this to work supervisord has to stay in the foreground rather than daemonize, which is what its nodaemon setting is for. The container will remain open as long as supervisord is running, which will be until I connect to it and tell it to shut down.
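
As a rough illustration of that launch sequence, here is a minimal sketch written in Python rather than shell. The launch helper is hypothetical. The init script paths are the ones from the script above, and the commented wiring assumes supervisord is on the PATH and is started with its -n (foreground) flag so that the final call blocks.

```python
import subprocess


def launch(service_commands, foreground_command):
    """Start background services, then block on one foreground process.

    Returns the exit code of the foreground process.
    """
    for command in service_commands:
        # Init scripts daemonize and return right away; check=True raises
        # CalledProcessError if a service fails to start.
        subprocess.run(command, check=True)
    # This call blocks until the foreground process exits. While it blocks,
    # the launcher (the container's main process) stays alive.
    return subprocess.call(foreground_command)


# Hypothetical wiring for the services named above (not executed here):
#
#   launch(
#       [["/etc/init.d/redis_6379", "start"],
#        ["/etc/init.d/logstash", "start"]],
#       ["supervisord", "-n"],  # -n keeps supervisord in the foreground
#   )
```

The point of the design is only that the last command never returns while the work is running; anything that blocks in the foreground would keep the container open in the same way.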

As of last night I had everything but Supervisor set up. I need to do some more reading on parsing JSON with Logstash, and I need to write some code in the spiders to change the logging method to use the Redis queue. After that I will be able to deploy the last pieces, convert my recipe into a Dockerfile I can automate the container build from, and then test. If it all works the way I intend then I will be able to simply launch the container and the spiders will start working. I will be able to connect to port 6379 to monitor the Redis work output queue, and 9200/9292 to query log data. Pretty neat stuff.
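
One possible shape for that logging change is a handler that appends each record to a Redis list as JSON, which is a format Logstash can parse. This is only a sketch under assumptions: the RedisListHandler class, the spider:log key, and the field names are all hypothetical, and FakeRedis is an in-memory stand-in so the example runs without a server. A real client object exposing the same rpush(key, value) call, such as the one from the redis-py package, could be passed in instead.

```python
import json
import logging


class RedisListHandler(logging.Handler):
    """Logging handler that appends each record, as JSON, to a Redis list."""

    def __init__(self, client, key="spider:log"):
        super().__init__()
        self.client = client  # any object with an rpush(key, value) method
        self.key = key        # hypothetical list name

    def emit(self, record):
        try:
            entry = json.dumps({
                "time": record.created,
                "pid": record.process,  # tells the spider instances apart
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })
            self.client.rpush(self.key, entry)
        except Exception:
            self.handleError(record)


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.lists = {}

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])


client = FakeRedis()
logger = logging.getLogger("spider")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo record out of the root logger
logger.addHandler(RedisListHandler(client))
logger.info("fetched %s", "http://example.com/")
```

Including the process ID in each entry preserves what the file-based logs were used for, namely telling which instance did what.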

Using Docker to turn my server app into an appliance

After messing with Docker enough to get comfortable with the way it works, I started thinking about a project I could do that would actually make my life easier right now. I have this server application that consists of a number of pieces. The main part is a Python script that scans URLs for information. I can run as many of these as I want. They get their target URLs from Redis, and write their work results back out to it as well. Currently they log to files in a custom format, including the PID so I can tell which instance was doing what.

When I want to run these things on AWS I open up a terminal session, load the initial Redis database, then use screen to run as many instances of the spider as I want. Once they’re running I monitor output queues in Redis, and sometimes tail the logs as well. I’d like to can this whole thing up in Docker so that I can just pull it down from my repo to a new AWS instance and connect to it from outside to monitor progress. I’d like the container to start up as many instances of the spider as I tell it to, and collect the log information for me in a more usable way.

To do this my plan is to change the spider as follows: I will rewrite the entry code so that it takes the number of instances to run as an argument, and then uses the multiprocessing Process class to launch the instances in separate processes. I will also rewrite the logging code to log messages to a Redis list instead of files. I will probably do this by launching a second Redis instance on a different port, because the existing instance runs in AOF (append-only file) mode, and that’s overkill for logging a ton of messages. Lastly I am going to install Logstash and Kibana. I’ll tell Logstash to consume the logging messages from Redis and insert them into its internal Elasticsearch db, and I’ll use Kibana to search and visualize this log data.
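
A minimal sketch of that entry code might look like the following. The function names are hypothetical and the spider body is a placeholder that only reports its process ID. The default of four instances is borrowed from the four-spider setup mentioned elsewhere on this page, and the sketch assumes a POSIX host (such as a Linux container) because it asks multiprocessing for the fork start method explicitly.

```python
import argparse
import multiprocessing
import os

# Assumption: a POSIX host where the "fork" start method is available.
_ctx = multiprocessing.get_context("fork")


def spider(instance, results):
    """Placeholder for the real spider loop; reports which process ran it."""
    results.put((instance, os.getpid()))


def run_spiders(count):
    """Start `count` spider processes and wait for all of them to exit."""
    results = _ctx.Queue()
    workers = [
        _ctx.Process(target=spider, args=(number, results))
        for number in range(count)
    ]
    for worker in workers:
        worker.start()
    # Drain the queue before joining so no worker blocks on a full pipe.
    reports = [results.get() for _ in workers]
    for worker in workers:
        worker.join()
    return sorted(reports)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Launch spider instances")
    parser.add_argument("instances", type=int, nargs="?", default=4,
                        help="number of spider processes to start")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    for instance, pid in run_spiders(args.instances):
        print("spider %d ran in process %d" % (instance, pid))


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as spiders.py, this would be started with the instance count as its only argument, and a shell script used as the container command could invoke it the same way.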

Redis, Logstash, and Kibana will all be set to run as daemons when the container starts, and the main container command will be a shell script that launches the spiders. The Docker image will expose the two Redis ports, and the Kibana web port. If all goes as planned I should be able to launch the container and connect to Redis and Kibana from outside to monitor progress. I have a quick trip to Miami at the start of this week, so I won’t be able to set this up until I get back, but when I do I’ll post here about my experiences and results.

Python: If you have Docker, do you need virtualenv?

I’ve been working in Python a lot over the last year. I actually started learning it about three years ago, but back then my day gig was 100% .NET/C# on the back-end, and just when I decided to do something interesting with Python in Ubuntu a series of time-constrained crises would erupt on the job and put me off the idea. Since that time my employer and my role have changed, the first totally, and the second at least incrementally. At KIP we use Python for daily development, and run all our software on CentOS on Amazon Web Services. As such I got to dive back in and really learn to use the language, and it’s been a lot of fun.

One of the de facto standard components of the Python development toolchain is virtualenv. We use it for Django development, and I have used it for my personal projects as well. What virtualenv does is pretty simple, but highly useful: it copies your system Python install along with supporting packages to a directory associated with your project, and then updates the Python paths to point to this copied environment. That way you can pip install to your heart’s content, without making any changes to your global Python install. It’s a great way of localizing dependencies, but only for your Python code and Python packages and modules. If your project requires a redis-server install, or you need lxml and have to install libxml2-dev and three or four other dependencies first, virtualenv doesn’t capture changes to the system at that level. You’re still basically polluting your environment with project-specific dependencies.
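
To see what that path switch looks like from inside the interpreter, here is a small sketch. The environment_report helper is hypothetical. It relies on two attributes: sys.real_prefix, which older virtualenv releases set, and sys.base_prefix, which the standard-library venv module and newer virtualenv releases use.

```python
import sys


def environment_report():
    """Describe where this interpreter looks for its own files and packages."""
    # Older virtualenv releases record the original interpreter location in
    # sys.real_prefix; venv and newer virtualenv releases use sys.base_prefix.
    base = getattr(sys, "real_prefix", None) or getattr(
        sys, "base_prefix", sys.prefix
    )
    return {
        "prefix": sys.prefix,          # where this interpreter thinks it lives
        "base_prefix": base,           # the install it was created from
        "isolated": base != sys.prefix,
        "search_path": list(sys.path),
    }


report = environment_report()
print("interpreter prefix:", report["prefix"])
print("inside a virtual environment:", report["isolated"])
```

Everything in the report is Python-level state, which is the limitation in question: nothing here knows about system packages such as redis-server or libxml2-dev.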

Dependencies in general are a hassle. Back when I started messing with VMs a few years ago I thought that technology would solve a lot of these problems, and indeed it has in some applications. I still use VMs heavily, and in fact my daily development environment is an Ubuntu Saucy VirtualBox VM running on Windows 7. But VMs are a heavyweight solution. They consume a fixed amount of RAM, and for best performance a fixed amount of disk. They take time to spin up. They’re not easy to move from place to place, and it’s fairly complicated to automate their creation in a controllable way. Given all these factors I never quite saw myself having two or three VMs running with different projects at the same time. It’s just cumbersome.

And then along comes Docker. Docker is… I guess wrapper is too minimizing a term… let’s call it a management layer, over Linux containers. Linux containers are a technology I don’t know enough about yet. I plan to learn more soon, but for the moment it’s enough to know that they enable all sorts of awesome. For example, once you have installed Docker and have the daemon running, you can do this:

sudo debootstrap saucy saucy > /dev/null
sudo tar -C saucy -c . | sudo docker import - myimages:saucy

The first command uses debootstrap to download a minimal Ubuntu 13.10 image into a directory named saucy. It will take a little while to pull the files, but that’s as long as you’ll ever have to wait when building and using a Docker image.

The second command tars up the saucy install and feeds the tarball to docker’s import command, which causes it to create a new image and register it locally as myimages:saucy. Now that we have a base Ubuntu 13.10 image the following coolness becomes possible:

sudo docker run -i -t -name="saucy" -h="saucy" myimages:saucy /bin/bash

This command tells Docker to launch a container from the image we just created, with an interactive shell. The -name option (spelled --name in newer Docker releases) will give the container the name “saucy,” which will be a convenient way to refer to it in future commands. The -h option causes Docker to assign the container the host name “saucy” as well. The bit at the end, ‘/bin/bash’, tells Docker what command to run when the container launches. Hit enter on this and we’re running as root in a shell in a self-contained minimal install of Ubuntu. The coolest thing of all is that once we got past the initial image building step, starting a container from that image was literally as fast as starting Sublime Text. Maybe faster.

So we have our base image running in a container. Now what? Now we install stuff. Add in a bunch of things the minimal install doesn’t include, like man, nano, wget, and screen. Install Python and pip. Create a project folder and install Git. Whatever we want in our base environment. Once that is done we can exit the container by typing ‘exit’ or hitting Ctrl-D. Once back at the host system shell prompt the command:

sudo docker ps -a

…will show all the containers we’ve launched. Ordinarily the ps command only shows running containers, but the -a option causes it to show those that are stopped, like the one we just exited from. That container still has all of our changes. If we want to preserve those changes so that we can easily launch new containers with the same stuff inside, we can do this:

sudo docker commit saucy myimages:saucy-dev

This just tells Docker to commit the current state of the container saucy to a new image named myimages:saucy-dev. This image can now serve as our base development image, which can be launched in a couple of seconds anytime we want to start a new project or just try something out. I can’t overemphasize how much speed contributes to the usefulness of this tool. You can launch a new Docker container before mkvirtualenv can get through copying the base Python install for a new virtual environment. And it launches completely configured and ready to run commands in.

Given that, I found myself wondering just what use virtualenv is to me at this point. Unlike virtualenv, Docker captures the complete state of the system. I can fully localize all dependencies on a “virtualization” platform that is as easy to use as a text editor in terms of speed and accessibility. Even better, I can create a “Dockerfile” that describes all of the steps to create my custom image from a known base, and now any copy of Docker, running anywhere, can recreate my container environment from a version controlled script. That is way cool.

Now, the truth is, there are probably some valid reasons why the picture isn’t quite that rosy yet. The Docker people will be the first to tell you it is a very young tool, and in fact they have warnings about this splashed all over the website. There are issues with not running as root in the container. I haven’t been able to get .bashrc to run when launching a container for a non-root user. There are issues running screen due to the way the pseudo-tty is allocated. And if you’re doing a GUI app, then things may be more complicated still. All this stuff is being worked on and improved, and it’s very exciting to think of where this approachable container technology might take us.

New Year, New Theme

I’ve been slow in posting to this site for a couple of years, now. Time is at a premium these days, or at least, it is when you subtract all the time I spend playing guitar and starting but not finishing games. In the run-up to the symbolic passage of the old year and arrival of the new I gave some thought to the site and whether I wanted to keep it up. I decided that I did, but that the theme I originally chose for it, Palaam, had grown quite dated. This new theme is called Elucidate, and I think it’s pretty sharp. It’s responsive, and scales well on mobile devices. I’ve made some tweaks to it, reducing the size of article titles and adding the social icons at the top, nothing major. I also reorganized some content, but not so much that anyone will notice.

Let me know what you think. The team at work and I are ramping up development on a new system, and we’re in the rare position of being able to sample a lot of technologies and make our own choices with regard to our stack and architecture. I hope to write a lot more about that stuff in the coming weeks, so hopefully the bit on Docker above is just the beginning.