Bunch of Yahoos

Having a great set of developer tools can help make your platform ubiquitous and loved. When Microsoft first launched its Developer Network it revolutionized the way programmers got access to operating systems, tools, and documentation. Microsoft successfully migrated that set of resources to the web, and it remains invaluable for Windows developers. If you’ve ever set up access to a Google API, or deployed a set of EC2 resources on Amazon’s AWS cloud infrastructure, you know how impactful a clean, functional web interface is.

By the same token, a clunky, dysfunctional interface can make a platform loathed and avoided. Take Yahoo’s Developer Network and their BOSS premium APIs, for example.

We’re working on a system that needs to geolocate placenames in blocks of free text. This isn’t a trivial problem. There’s been a lot of work done on it, and we’ve explored most of it. During that exploration we wanted to try Yahoo’s PlaceSpotter API. It’s a pay service, but if it works well the cost could be reasonable, and just because we have built our system on free and open-source components doesn’t mean we won’t pay for something if it improves our business.

With that in mind I set out to test it, just as I had previously set out to test Google’s Places API. In that experiment I simply created a Google application under my user name, grabbed the creds, and wrote a python wrapper in about five minutes to submit text queries and print out the results. That’s my idea of a test.

In order to test Yahoo’s PlaceSpotter I needed access to the BOSS API. To get access to the BOSS API I needed to create a developer account. Ok, that’s not an issue. I will happily create a developer account. To create a developer account, it turns out, requires a bunch of personal info, including an active mobile number. Ok, I’ll do that too, albeit not quite as happily because all I want to do is figure out if this thing is worth exploring.

I should note that there is a free way to get to the same Geo data that BOSS uses, and the same functionality, through YQL queries. Maybe I was shooting myself in the foot right from the beginning, but I had no experience with YQL, I needed to move quickly and make some decisions, and I just wanted an API I could fling http requests at. Since the billing is per 1000 queries I had no problem paying for the first 1000 to test with. Not that big a deal.

After creating the account, during which I had to change the user name four times because of the cryptic message that it was “inappropriate” (no, I was not trying to use b1tch as a user name, or anything else objectionable), I finally ended up on a control panel-ish account dashboard. There I could retrieve my OAUTH key (ugh) and other important stuff, and activate access to the BOSS API.

I clicked the button to activate the API, and the panel changed to display another button labelled “BOSS Setup.” Next to that was a red rectangle stating that access to BOSS was not enabled because billing had not been set up. It wasn’t obvious to me that in order to set up billing you have to click “BOSS Setup.” I assumed billing would be at the account level. Well, there are billing options at the account level! They’re just not the ones you want, and unless the verbiage triggers some warnings, as it did for me, you might go ahead and set up your credit card there, only to find it didn’t help.

Not to be deterred, I googled a bit and found that, indeed, I had to click “BOSS Setup.” It would have been nice if they had mentioned that in the red-colored billing alert. So I clicked, entered my login again because, you know, I was using the account control panel and so obviously might be an impostor, and ultimately found the place to enter my payment information. Once that was done, submitted, and authorized I received a confirmation and invoice in my email. Now, I could finally toss a few queries at the API.

Except no. When I returned to the account dashboard the same red-colored billing alert appeared. No access. I am a patient man, some of the time. Maybe their systems are busy handshaking. I waited. Nope. I waited some more. Nope. I gave up and waited overnight, and checked again this morning. Nope. Ok, dammit, I’ll click the “BOSS Setup” button again. I do that and what they show me is the confirmation page for my order again, with an active submit button. But wait… I got an invoice? Was I charged? Will I be charged again? Should I resubmit, or email Yahoo, or call my bank?

Maybe I should just not use the API. Oh, and did I mention that they have a “BOSS Setup” tutorial? It’s a download-only PDF. And 2/3 of it is about setting up ads.

Adding swap to an EBS-backed Ubuntu EC2 instance

Another one of those recipes I need to capture for future use. We were running a bunch of memory-intensive processes on a medium (m2) Ubuntu instance on EC2 yesterday, and things were not going well. It looked like some processes were dying and being restarted. Poking around in the kernel messages we came across several events reading:

"Out of memory: Kill process ... blah blah"

Well, damn. There is less than 4GB of usable RAM on a medium, but it should be swapping, right? Wrong. We checked free and there was no swap file. That was my screw-up, since I set up the instance and did not realize it had no swap. It turns out that what Amazon considers to be memory-constrained instances (smalls and micros, for example) get some swap in the default config, but apparently mediums and larges do not. We decided to rebuild the instance as a large to get more RAM and another processing unit, and to add some swap at the same time.

I wasn’t able to figure out how to configure swap space in the instance details prior to launch, so I searched around and put together this recipe for adding it post-launch. This applies to Ubuntu EBS-backed instances. It should work generally for any debian-based distro, I think. If your instance is not EBS backed you can use many of the same techniques, but you’ll have to figure out the deltas because we don’t use any non-EBS instances. Anyway, to the details.

EBS-backed instances have their root volume on EBS, which is what EBS-backed means. But you don’t want to put swap space on EBS. EBS use incurs I/O charges, and although they are very low, they aren’t nothing. If you were to locate swap on EBS then there would at least be some chance of some process going rogue and causing a lot of swapping and associated costs. Not good.

Fortunately, EBS-backed instances still get so-called “ephemeral” or “instance storage” at launch. Unlike EBS this storage is physically attached to the system, and also unlike EBS it goes away when the instance is stopped (but not when it is rebooted). That’s why it is called “ephemeral.” Something else that is ephemeral is the data stored in a swap file, so it seems like instance storage and swap files are made for each other. Good news: every EBS-backed instance on EC2 still gets ephemeral storage by default. It consists of either a 32GB SSD, or two 320GB magnetic disks, depending on your choices. For instances that do not get swap allocated by default, this storage is mounted at /mnt after startup (either the SSD, or the first of the magnetic disks; if you want the second disk you need to mount it manually).

So before you start make sure your config matches what I’ve described above, i.e. you have an EBS-backed instance with ephemeral storage mounted at /mnt (which you can confirm with the lsblk command), and no swap space allocated (which you can confirm with the swapon -s command).

First thing to do is create a swapfile in /mnt. The one thing I am not going to do is opine on what size it should be, because there is a lot of info out there on the topic. In this example I made the swap 2GB.

$ sudo dd if=/dev/zero of=/mnt/swapfile bs=1M count=2048

This command just writes two gigabytes’ worth of zeros to the file ‘swapfile’ on /mnt. Next make sure the permissions on this new file are set appropriately.

$ sudo chown root:root /mnt/swapfile
$ sudo chmod 600 /mnt/swapfile

Next use mkswap to turn the file full of zeros into an actual Linux swapfile.

$ sudo mkswap /mnt/swapfile
$ sudo swapon /mnt/swapfile

The first command formats the file as swap space (more precisely, mkswap writes a swap signature and header to it), and the second enables it as the system swap file. Now you should update fstab so that this swap space gets mounted and used when the system is rebooted. Use your preferred editor and add the following line to /etc/fstab:

/mnt/swapfile swap swap defaults 0 0

Lastly, turn on swapping.

$ sudo swapon -a

You can now use swapon -s to confirm that the swap space is in use, and the free or top commands to confirm the amount of swap space. One thing to note is that this swap space will not survive an instance stop/start. When the instance stops the ephemeral storage will be destroyed. On restart the changes made in fstab and elsewhere will still be there, because the system root is on EBS, but the actual file we created at /mnt/swapfile will be gone, and will need to be recreated.
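Since the swapfile has to be recreated after every stop/start, the steps above are a natural candidate for a small boot-time script. Here is a sketch of what that might look like; nothing in it comes from the original recipe except the commands themselves, and the DRY_RUN guard (which defaults to just printing the commands) is there so the script can be exercised safely before you set DRY_RUN=0 on the actual instance.

```shell
#!/bin/bash
# Hypothetical boot-time helper: rebuild the 2GB swapfile at /mnt/swapfile
# if the ephemeral storage was wiped by a stop/start, then enable swapping.
# DRY_RUN=1 (default) only echoes the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Rebuild the swapfile only if it is missing (i.e. after a stop/start).
if [ ! -f /mnt/swapfile ]; then
    run dd if=/dev/zero of=/mnt/swapfile bs=1M count=2048
    run chown root:root /mnt/swapfile
    run chmod 600 /mnt/swapfile
    run mkswap /mnt/swapfile
fi
run swapon -a
```

Invoked as root from /etc/rc.local or a similar boot hook, this would make the swap setup survive stop/start cycles without manual intervention.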

 

Running haproxy at system startup on Ubuntu

Sometimes I just have to capture this sort of thing because I know damn well I am going to forget it long before the next time it’s needed. We decided to use haproxy as a round-robin matchmaker in front of a cluster of elasticsearch nodes. I installed it on Ubuntu 14.04, configured it, and tested it by running it from the command line. Everything worked. But when I rebooted the instance the proxy didn’t start. I checked the init file and run levels. Everything seemed as it should be… but no dice. To make it worse, executing commands like…

sudo service haproxy start

or

sudo service haproxy status

… resulted in nothing at all. No errors, no output, nada. I searched around a little bit and stumbled on the following line in /etc/default/haproxy…

ENABLED=0

Apparently this flag, when set to 0, prevents the init script from running the proxy. The second-hand info I found on the reasoning is that there is no rational default config for haproxy, so they want you to explicitly configure it and then set this flag to 1. So… fine. Makes sense, but I do wish the init script was a little more verbose about what was happening. In any case, once I did this…

ENABLED=1

… haproxy launched at start as desired.
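For the record, the flag flip is easily scripted. Here’s a hypothetical one-liner, shown against a scratch copy of the file so it can be tried safely; on the real box you’d target /etc/default/haproxy with sudo.

```shell
# Create a scratch copy standing in for /etc/default/haproxy
printf 'ENABLED=0\n' > /tmp/haproxy.default

# Flip the flag so the init script will actually start the proxy
sed -i 's/^ENABLED=0/ENABLED=1/' /tmp/haproxy.default

cat /tmp/haproxy.default
```

Handy if you’re provisioning the box with a script or a Dockerfile rather than editing the file by hand.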

Re-raising exceptions in Python

I often run into situations where an error occurs, an exception is caught, and some action is taken before the exception is re-raised so that higher levels of the program can deal with it. Exactly how the exception is re-raised can have a big impact on how much information is available to someone trying to debug a problem. I got this wrong as a newbie, and I’ve seen even experienced python programmers get tripped up by these subtleties so I thought I would try to post a clear example. Consider the following program:

def bar():
    raise Exception("Oops!")

def foo():
    bar()

def dinkum():
    foo()

def fair():
    try:
        dinkum()
    except Exception as e:
        print "Something failed!"
        raise e

fair()

As you can see I’ve created a few methods in order to have a stack trace to look at, and raised an exception in the last one to be called, bar(). That exception is caught in fair() and some information is printed before the exception is re-raised by the statement “raise e.” This syntax looks pretty reasonable, so let’s run it and see what happens:

mark@viking:~/workspace$ python raise_test1.py
Something failed!
Traceback (most recent call last):
  File "raise_test.py", line 21, in <module>
    fair()
  File "raise_test.py", line 19, in fair
    raise e
Exception: Oops!

Hmm. Looks like we’re missing some useful information. The exception was raised on line 6, but the stack trace begins on line 19. Let’s change the program slightly and run it again:

def bar():
    raise Exception("Oops!")

# ... snip for brevity ...

def fair():
    try:
        dinkum()
    except Exception as e:
        print "Something failed!"
        raise

fair()

The only change here is that now we’re simply using “raise” rather than re-raising the exception by name with “raise e.” The output is quite different:

mark@viking:~/workspace$ python raise_test2.py
Something failed!
Traceback (most recent call last):
  File "raise_test.py", line 21, in <module>
    fair()
  File "raise_test.py", line 16, in fair
    dinkum()
  File "raise_test.py", line 12, in dinkum
    foo()
  File "raise_test.py", line 9, in foo
    bar()
  File "raise_test.py", line 6, in bar
    raise Exception("Oops!")
Exception: Oops!

That’s more like it. Now we can see the whole stack trace and see that the error originated on line 6 in the call to bar(). This would be helpful stuff to have in a debugging situation. So what’s going on here?

In some languages and runtimes exception objects carry their own stack trace information, but in python the interpreter is responsible for capturing the state of the call stack when an exception occurs. To access this information a program can call the sys.exc_info() function, which returns a tuple whose third element is the traceback for the exception currently being handled. We can use this function to get a better look at the state of the call stack when we catch an exception:

import sys
import traceback

def bar():
    raise Exception("Oops!")

# ... snip for brevity ...

def fair():
    try:
        dinkum()
    except Exception as e:
        print "Something failed!"
        traceback.print_tb(sys.exc_info()[2])
        raise
try:
    fair()
except Exception as e:
    print "It sure looks like it!"
    traceback.print_tb(sys.exc_info()[2])

The print_tb() method defined in the traceback module is a handy way to get a prettyprint of the traceback object. When run this code produces the following:

mark@viking:~/workspace$ python raise_test3.py
Something failed!
File "raise_test.py", line 15, in fair
dinkum()
File "raise_test.py", line 11, in dinkum
foo()
File "raise_test.py", line 8, in foo
bar()
File "raise_test.py", line 5, in bar
raise Exception("Oops!")
It sure looks like it!
File "raise_test.py", line 21, in <module>
fair()
File "raise_test.py", line 15, in fair
dinkum()
File "raise_test.py", line 11, in dinkum
foo()
File "raise_test.py", line 8, in foo
bar()
File "raise_test.py", line 5, in bar
raise Exception("Oops!")

The interpreter creates a new traceback when the exception is initially raised on line 5. As the stack is unwound back to the first “except” block each call point (method name and line number) is recorded on the traceback. As you can see above when we first catch the exception there are four call points on the traceback. When we catch it a second time after using “raise” to re-raise it, a fifth call point has been added. The crucial thing is that “raise” did not prompt the interpreter to start a new stack trace. What happens when instead of “raise” we use “raise e” to re-raise the exception?

Something failed!
File "raise_test.py", line 15, in fair
dinkum()
File "raise_test.py", line 11, in dinkum
foo()
File "raise_test.py", line 8, in foo
bar()
File "raise_test.py", line 5, in bar
raise Exception("Oops!")
It sure looks like it!
File "raise_test.py", line 21, in <module>
fair()
File "raise_test.py", line 19, in fair
raise e

In this case the interpreter started a new stack trace when it saw the “raise e” statement. In fact this is the exact same behavior you’d get by raising a whole new exception with the statement raise Exception("Sorry, no traceback for you").

So when would you actually want to use “raise e”? To the extent that ‘e’ is an exception that you’ve just caught, never. Doing so will always lose the traceback information from the previous frames on the stack, and obscure the point of origin of the original error. However, it can sometimes make sense to translate one exception into another. Something like the following isn’t an uncommon pattern:

try:
    some_vulnerable_op()
except LowLevelException as e:
    very_important_cleanup()
    raise HighLevelException("Low level stuff failed")

While this is a valid pattern I’m not a huge fan of it unless information specific to the inner exception is logged or otherwise made available. There are a number of ways you could accomplish this goal, but we can leave that for another time. For now, suffice it to say that exception handling complicates reasoning about the control flow and behavior of a program, so it makes sense to keep your exception handling patterns as simple as you can.
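One way to translate an exception without throwing away the inner details is to capture the formatted traceback before raising the higher-level error. Here is a sketch; the names are made up for illustration, and note that in Python 3 the raise ... from e syntax chains the original exception onto the new one automatically.

```python
import traceback

class HighLevelException(Exception):
    pass

def some_vulnerable_op():
    # Stand-in for the real low-level operation
    raise ValueError("low level failure")

def do_work():
    try:
        some_vulnerable_op()
    except ValueError:
        # Capture the inner traceback before translating, so the point of
        # origin survives in the message (or could be sent to a logger)
        details = traceback.format_exc()
        raise HighLevelException("Low level stuff failed:\n" + details)

try:
    do_work()
except HighLevelException as e:
    msg = str(e)
    print(msg)
```

The caller sees the high-level exception, but the message still pinpoints the original failure in some_vulnerable_op().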

Using Celery to schedule python tasks

Many python programmers are familiar with Celery due to its integration with django, a hugely popular web framework. A colleague recommended it to me recently when the need to schedule pruning of old logstash indices came up during a development discussion. As I soon discovered, Celery is a fast and powerful way to turn any python callable into a scheduled task, or message-processing worker. We’re now using it to clip the aforementioned indices, as well as perform other tasks such as injecting work payloads into a distributed queue.

Celery is a python package, so the easiest way to get it into your virtualenv (or Docker container, or vagrant env) is simply:

pip install celery

The gzipped tarball is only 1.3 megs, and while there are a few other dependencies (billiard, kombu, pytz) the installation takes less than a minute. Once it is complete you’re ready to create a task. Let’s start with a simple script that downloads a web page and saves it to a file. We’ll create the script and set it up to run every ten minutes. For downloading the page we’ll use the awesome requests package. First, the script itself.

from datetime import timedelta
from celery import Celery
import requests

app = Celery('page_saver')
app.conf.update(
    BROKER_URL='redis://localhost',
    CELERY_TASK_SERIALIZER='json',
    CELERY_ACCEPT_CONTENT=['json'],
    CELERYBEAT_SCHEDULE={
        'save_page':{
            'task':'page_saver.save_page',
            'schedule': timedelta(minutes=10)
        }
    }
)

@app.task(name='page_saver.save_page')
def save_page(url, file_name):
    response = requests.get(url)
    if response.status_code == requests.codes.ok:
        # response.content is bytes, which matches the binary 'wb' mode
        with open(file_name, 'wb') as f:
            f.write(response.content)

After importing Celery and the requests package the first thing this script does is create the app object and initialize it. The Celery app object marks this module as a Celery app, and the only parameter we’re passing here is the name, which must match the name of the module. In this example the script would be saved on the file system as “page_saver.py”.

The call to app.conf.update is one way of passing in configuration data to the Celery object. There are several others, and in general most of the configuration options and settings are well beyond the scope of this post. You can find a good intro and links to more information here.

The first setting, ‘BROKER_URL’, specifies the pipeline that Celery will use for passing messages between clients and workers. I’m using redis here because I always have it lying around, but you can also use RabbitMQ, or a database, although that isn’t recommended in production.

The next two settings, ‘CELERY_TASK_SERIALIZER’ and ‘CELERY_ACCEPT_CONTENT’, instruct Celery to use json encoding when talking to the task as a client, and when accepting messages in the task as a server. Without these settings Celery will also allow pickle (and warn on startup), and nobody should receive pickle from a network port unless it is with a pair of tweezers. In any event, the pickle serializer is deprecated in Celery, so json is the way to go for that reason as well.

The last setting contains the schedule on which our task will be executed. This requires some explanation, and it is easiest to do it in tandem with an explanation of the actual task method. As you can see in the script we define a method named save_page, and decorate it with “@app.task()”, passing in a name for the task. The naming seems pretty much arbitrary, but I like the suggested convention of “appname.taskname.”

The decorator turns this callable into the entry point for a Celery task with the given name. The script could contain many more tasks, each being individually callable, schedulable, etc., but for this example one will suffice, and I think I like a 1:1 mapping between app and task anyway. The actual implementation of the save_page method is self-explanatory, and contains no error handling or retry logic for brevity’s sake.

With the task defined the script above constitutes a complete Celery worker, and it can be run at any time using Celery to activate it and send it a message. For example, save the script to a folder and then cd into that folder and do this:

celery -A page_saver worker --loglevel=INFO

You should see a bunch of output indicating that celery is starting the task. Once it is up and running open up another terminal and start the python interpreter. Enter the following statements:

>>>from page_saver import save_page
>>>save_page.delay('https://news.ycombinator.com/','hn.txt')

The task object, save_page, serves as a proxy to call itself in the worker process running in the other terminal session. The delay method executes the call asynchronously, and Celery provides additional methods to access the results when they become available. In this case the only result is the file output.

So, that brings us back to the schedule. To have a scheduled task we need a scheduler – a client to wake up and essentially call delay() on our task when a timer expires. Fortunately Celery includes celery beat, which does exactly that. You can run celery beat as a stand-alone service and use the same schedule configuration schema, but you can also run it in tandem with the worker using the -B command line switch:

celery -A page_saver worker -B --loglevel=INFO

The schedule configuration in the example script establishes a single schedule named “save_page” and tells it to run the task named “page_saver.save_page” every ten minutes using a timedelta. You can also set the ‘schedule’ field to a crontab object and use all of the same options available to a cron job, so there is a lot of flexibility.
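For instance, the ‘schedule’ entry from the example config could be swapped for a crontab object. A sketch (the entry name and times are just for illustration):

```
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    'save_page_nightly': {
        'task': 'page_saver.save_page',
        # fire at 03:30 every day rather than on a fixed interval
        'schedule': crontab(hour=3, minute=30),
    },
}
```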

And that’s about it for this simple example, which incidentally is complete enough to handle a lot of system maintenance tasks. Add some error-handling, logging and notification and you’re ready to go. If your needs go beyond simple repeatable tasks you’ll find Celery has a lot of options for executing distributed workloads and returning results asynchronously. Have fun!

Elasticsearch discovery in EC2

Elasticsearch out of the box does a good job of locating nodes and building a cluster. Just assign everyone the same cluster name and ES does the rest. But running a cluster on Amazon’s EC2 presents some additional challenges specific to that environment. I recently set up a docker-ized Elasticsearch cluster on two EC2 medium instances, and thought I would put down in one place the main things that I ran into setting it up.

The most fundamental constraint in EC2 is that multicast discovery (unsurprisingly) will not work. EC2 doesn’t allow multicast traffic. This gives you two choices: use unicast discovery, or set up EC2 dynamic discovery. Unicast discovery requires that all nodes have the host IPs for the other nodes in the cluster. I don’t like that idea, so I chose to set up dynamic discovery using the EC2 APIs. Getting EC2 dynamic discovery to work requires changes in three places: the Elasticsearch installation, the configuration file at /etc/elasticsearch/elasticsearch.yml, and finally the instance and security configuration on Amazon. I’ll take these in order.

The best way to integrate EC2 into the Elasticsearch discovery workflow is to use the cloud-aws plugin module. You can install this at any time with the following command:

sudo /usr/share/elasticsearch/bin/plugin -install \
elasticsearch/elasticsearch-cloud-aws/2.0.0.RC1

This will pull the latest plugin from the download site and install it, which is basically just extracting files. Note that the version in the command is the latest one. You can check here to see if it is still current. And that’s all there is to that. Adding the cloud-aws plugin enables the discovery.ec2 settings in elasticsearch.yml, which is where we’ll head next.

The Elasticsearch config file is located in /etc/elasticsearch/elasticsearch.yml, and we’ll want to change/add a few things there. First, obviously, give everyone the same cluster name:

cluster.name: my_cluster

Another setting that makes sense in production, at least, is to require the cloud-aws plugin to be present on start:

plugin.mandatory: cloud-aws

The next two settings are required for the cloud-aws plugin to communicate on your behalf with the great AWS brain in the sky:

cloud.aws.access_key: NOTMYACCESSKEYATALL
cloud.aws.secret_key: ITw0UlDnTB3seCRetiFiPuTItHeR3

Not to digress too much into AWS access management, but if you’ve set things up the right way then the access key and secret key used above will be for an IAM sub-account that grants just the specific permissions needed. The next two settings deal with discovery directly:

discovery.type: ec2
discovery.zen.ping.multicast.enabled: false

The first one just sets the discovery type to use the EC2 plugin. Pretty self-explanatory. The second one disables the use of multicast discovery, on the principle of “it doesn’t work, so don’t try it.” The last two settings we’ll discuss can be seen as alternatives to one another, but they are not mutually exclusive, which requires a bit of explanation.

Basically, we may need to filter the instances that get returned to the discovery process. When the cloud-aws plugin queries EC2 and gets a list of addresses back, it is going to assume they are all Elasticsearch nodes. During discovery it will try to contact them, and if some are not actually nodes it will just keep trying. This behavior makes sense with the multicast discovery process, because if you are not listening for multicast traffic then you don’t respond to it. But the EC2 discovery APIs will return all the instances in an availability zone, so we need some way to identify to Elasticsearch discovery which ones are really nodes.

One way to do this is by using a security group. You can create a security group and assign all the instances that you want in a particular cluster to that group, then make the following addition to the config:

discovery.ec2.groups: my_security_group

This setting tells the plugin to return only those instances that are members of the security group ‘my_security_group.’ Since you will need a security group anyway, as explained below, this is a convenient way to separate them from the crowd. But there can be cases where you don’t want to partition on the security group name. You might, for example, want to have one security group to control the access rules for a set of instances representing more than one cluster. In that case you can use tags:

discovery.ec2.tag.my_tag: my_tag_value

This setting tells the plugin to return only those instances that have the tag ‘my_tag’ with the value ‘my_tag_value.’ This is even more convenient, since it doesn’t involve mucking about with security groups, and setting the tag on an instance is easily done. Finally, as mentioned before these aren’t mutually exclusive. You can use the groups option to select a set of instances, and then partition them into clusters using tags.
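Pulled together, the additions to elasticsearch.yml discussed above look like this (showing both the group and tag filters, though you may only want one of them):

```
cluster.name: my_cluster
plugin.mandatory: cloud-aws
cloud.aws.access_key: NOTMYACCESSKEYATALL
cloud.aws.secret_key: ITw0UlDnTB3seCRetiFiPuTItHeR3
discovery.type: ec2
discovery.zen.ping.multicast.enabled: false
discovery.ec2.groups: my_security_group
discovery.ec2.tag.my_tag: my_tag_value
```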

And that’s it for the elasticsearch.yml settings, or at least the ones I had to change to make this work on EC2. There are a lot of other options if your specific case requires tweaking, and you can find an explanation of them here. The last thing I want to go into is the set of necessary steps to take in the Amazon EC2 console with respect to configuration. These fall into two areas: security groups and instance configuration. I don’t want to digress far into the specific AWS console steps, but I’ll outline in general what needs to happen.

Security groups first. You’re going to need all the nodes in your cluster to be able to talk to each other, and you’re going to want some outside clients to be able to talk to the nodes in the cluster as well. There are lots of different cases for connecting clients for querying purposes, so I’m going to ignore that topic and focus on communications between the nodes.

The nodes in a cluster need to use the Elasticsearch transport protocol on port 9300, so you’ll need to create a security group, or modify an existing one that applies to your instances, and create a rule allowing this traffic. For my money the easiest and most durable way to do this is to have a single security group covering just the nodes, and to add a rule allowing inbound TCP traffic on port 9300 from any instance in the group. If you are using the discovery.ec2.groups method discussed above, make sure to give your group the same name you used in the settings.

The last point is instance configuration, and for my purposes here it’s really just setting the tag or security group membership appropriately so that the instance gets included in the list returned from discovery. There are lots of other specifics regarding how to set up an instance to run Elasticsearch efficiently, but those are topics for another time (after I figure them out myself!)

The very last thing I want to mention is a tip for Docker users. If you’re running Elasticsearch inside a container on EC2 your discovery is going to fail. The first node that starts is going to elect itself as master, and if you query cluster health on that node it will succeed and tell you there is one node in the cluster, which is itself. The other nodes will hang when you try the same query and that is because they are busy failing discovery. If you look in /var/log/elasticsearch/cluster_name.log you’re going to see some exceptions that look like this:

[2014-03-18 18:20:04,425][INFO ][discovery.ec2            ]
[MacPherran] failed to send join request to master 
[[Amina][Pi6vJ470SYy4fEQhGXiwEA][38d8cffbdcd5][inet[/172.17.0.2:9300]]],
reason [org.elasticsearch.transport.RemoteTransportException:
[MacPherran][inet[/172.17.0.2:9300]][discovery/zen/join]; 
org.elasticsearch.ElasticsearchIllegalStateException: 
Node [[MacPherran][aHWxgYAhSpaWlPEXNhs7RA][4fb92271cce6][inet[/172.17.0.2:9300]]] 
not master for join request from 
[[MacPherran][aHWxgYAhSpaWlPEXNhs7RA][4fb92271cce6][inet[/172.17.0.2:9300]]]]

I cut out a lot of detail, but basically the reason this is happening is that, in this example, node MacPherran is talking to itself and doesn’t know it. The problem is caused because a running Docker container has a different IP address than the host instance it is running on. So when the node does discovery it first finds itself at the container IP, something like:

[MacPherran][inet[/172.17.0.2:9300]]

And then finds itself at the instance IP returned from EC2:

[MacPherran][inet[/10.147.39.21:9300]]

From there things do not go well for the discovery process. Fortunately this is easily fixed with another addition to elasticsearch.yml:

network.publish_host: 10.147.39.21

This setting tells Elasticsearch that this node should advertise itself as being at the specified host IP address. Set this to the host address of the instance and Elasticsearch will now tell everyone that is its address, which it is, assuming you are mapping the container ports over to the host ports. Now, of course, you have the problem of how to inject the host IP address into the container, but hopefully that is a simpler problem, and is left as an exercise for the reader.
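One way to do that injection, sketched below: fetch the instance’s private IP from the EC2 metadata service and pass it into the container as an environment variable at launch. The image name, variable name, and the idea of an entrypoint script that writes the variable into elasticsearch.yml are all assumptions, not anything from the original setup; launch deliberately echoes the docker command rather than running it, so you can inspect it first.

```shell
# Hypothetical launch wrapper for a docker-ized Elasticsearch node on EC2.
# METADATA_URL is overridable so the functions can be exercised off-EC2.
METADATA_URL=${METADATA_URL:-http://169.254.169.254/latest/meta-data/local-ipv4}

get_host_ip() {
    # EC2 instances can query their own private IP from the metadata service
    curl -s "$METADATA_URL"
}

launch() {
    # Echo (rather than run) the docker command, with the host IP handed to
    # the container; an entrypoint script inside the container would write
    # $ES_PUBLISH_HOST into elasticsearch.yml as network.publish_host.
    echo docker run -d -p 9200:9200 -p 9300:9300 \
        -e ES_PUBLISH_HOST="$1" my/elasticsearch
}

# On the instance you would run: launch "$(get_host_ip)"
launch 10.147.39.21
```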

Docker: run startup scripts then exit to a shell

As I’ve mentioned before, Docker expects to run one command at container startup, and when that command exits the container stops. There are all sorts of creative ways to get around that if, for example, you want to mimic init behavior by launching some daemons before opening a shell to work in the container. Here’s a little snippet from a Docker build file that illustrates one simple approach:

CMD bash -c '/path/to/start.sh'; bash

In the ‘start.sh’ script (which I usually copy to /usr/local/bin during the container build) you can put any startup commands you want. For example, the one in the container I’m working on now just has:

service elasticsearch start

The result when launching the container is that bash runs the start script, which spools up Elasticsearch, and then opens an interactive shell; when that shell exits, the container stops:

mark:~$ sudo docker run -i -t mark/example
 * Starting Elasticsearch Server                                         [ OK ] 
root@835600d4d0b2:/home/root# curl http://localhost:9200
{
  "status" : 200,
  "name" : "Box IV",
  "version" : {
    "number" : "1.0.1",
    "build_hash" : "5c03844e1978e5cc924dab2a423dc63ce881c42b",
    "build_timestamp" : "2014-02-25T15:52:53Z",
    "build_snapshot" : false,
    "lucene_version" : "4.6"
  },
  "tagline" : "You Know, for Search"
}
root@835600d4d0b2:/home/root# exit
exit
mark:~$

Docker Builds and --no-cache

I was building out a search server container with Elasticsearch 1.0.1 today, and I ran into one of those irritating little problems that I could have solved a lot faster if I had just observed more carefully what was actually going on. One of the steps in the build is to clone some stuff from our git repo, including config files that get copied to various places. In the process of testing I added a new file and pushed it, then re-ran the build. Halfway through I got a stat error from a cp command that couldn’t find the file.

But, but, I had pushed it, and pulled the repo, so where the hell was it? Yesterday something similar had happened when building a logstash/redis container. One of the nice things about a Docker build is that it leaves the interim containers in place until the end of the build (or forever if you don’t use the --rm=true option). So you can start up the container from the last successful build step and look around inside it. In yesterday’s case it turned out I was pushing to one branch and cloning from another.

But that problem had been solved yesterday. Today’s problem was different, because I was definitely cloning the right branch. I took a closer look at the output from the Docker build, and where I expected to see…

Step 4 : RUN git clone blahblahblah.git
 ---> Running in 51c842191693

I instead saw…

Step 4 : RUN git clone blahblahblah.git
 ---> Using cache

Docker was assuming the effect of the RUN command was deterministic and was reusing the interim image from the last time I ran the build. Interestingly it did the same thing with a later wget command that downloaded an external package. I’m not sure how those commands could ever be considered deterministic, since they pull data from outside sources, but whatever. The important thing is that you can add the --no-cache option to the build command to get Docker to ignore the cache.

sudo docker build --no-cache --rm=true - < Dockerfile

Note that this applies to the whole build, so if you do have some other commands that are in fact deterministic they are not going to use the cache either. It would be nice to have an argument to the RUN command to do this on a per-step basis, but at least --no-cache will make sure all your RUN steps get evaluated every time you build.
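One workaround for per-step control (a convention, not a Docker feature, and the value here is just a placeholder): since any change to an instruction invalidates the cache from that point onward, you can put a dummy instruction just above the non-deterministic step and edit it whenever you want those steps to re-run:

```dockerfile
# Bump this value to invalidate the cache from here on down, forcing the
# clone and wget steps to re-run without using --no-cache for the whole build.
ENV CACHE_BUST 2014-03-12
RUN git clone blahblahblah.git
```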

Logstash ate my events

I’ve mentioned previously that I’m testing some infrastructure components for a new system at work. One of those components is Logstash, which along with redis and Elastic Search will function as our distributed logging system. As a test setup I have python scripts generating log messages into a redis list, which is then popped by Logstash for indexing into Elastic Search. The whole thing is running in a Docker container, the building of which I discussed in my last post. These processes only generate a couple of events per second, based on the other work they are doing. I’m running four of them for now, so I am getting six to eight events per second written to redis. This is very low volume, but sufficient for me to get a handle on how everything works.

The first time I ran the test and connected to Kibana to see the collected events stream I thought it was so cool that I just paged around with a silly smile on my face for fifteen minutes, without paying a lot of attention to the details. The next time I ran it I noticed that some events seemed to be missing. I know the pattern of events emitted by these scripts very well, and I wasn’t seeing them all. So I shut down, checked a few things, convinced myself that no logic in the code could cause that pattern, and then fired the test back up again. This time things looked a lot closer to normal, but I couldn’t quite be sure without having another definitive source to compare to.

Fortunately I had already built in code to enable/disable both file-based and redis-based logging, so I simply enabled file logging and rebuilt the container. I fired it up, ran a short test, and the number of lines in the log files exactly matched the number of events in the logstash index on ES. The events in Kibana looked complete, too. So ok, problem solved.

I got off on other things for a day or two, and when I came back and started running this test again I forgot to disable file logging. I realized it after a short time and killed the test. Before rebuilding the container I decided to check the counts again. They were off. There were something like 500 fewer events in the logstash index than in the files, out of a total of several thousand. I started the test again, intending to run a longer one and do a detailed comparison, and this time the earlier pattern I had seen, with obviously missing events, was back. I let this test run for a bit and then compared counts: ~6500 in the files vs. ~3200 in the index. Half my events had disappeared.

I was pretty sure the culprit wasn’t redis, which is pretty much bulletproof, but to quickly rule it out I shut down logstash and ran the test. With nothing reading them, the events piled up in redis, and at the end llen showed the counts to be exactly the same. Redis wasn’t mysteriously losing anything. I next checked the logstash log, but there was nothing in it of note. I connected directly to ES and queried the index. The counts all matched what Kibana had shown me. There was no indication anywhere as to why half the events had evaporated.

I was poking around on the logstash site looking for clues when I noticed the verbosity settings in the command line. I quickly edited the init.d script that launches the daemon to set -vv, the highest logging verbosity level for logstash. I reran the test and again saw the same pattern of missing events. I let it run for a minute or so and then shut it down. The logstash.log file was now over 7 megs, compared to a few hundred k in the last test. I dove in and quickly noticed some Java exceptions getting logged. These were field parser errors, caused by an inner number format exception, and the field that caused the issue was named ‘data.’ I grep’d the log to see how many of these there were, and the number came back ~3200. Well, whaddaya know?

The data field is one that I use in my log event format to contain variable data. In one message it may have a number, and in another a string. Looking at the exceptions in the log made it very clear what was going on: ES was dynamically assigning a type to the field based on the data it received. Not all my log events include data, and depending on the results of processing, the first populated ‘data’ field to get into the queue might have a number, or it might have a string. If a string field arrived first all was well, because ES could easily convert a number to a string. But if a number field arrived first, ES decided that all subsequent data had to be converted to a number, which was not possible in about half the cases.
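For illustration (these two events are invented, not copied from my scripts), an arrival order like this is exactly what triggers the failure:

```
{"message": "fetch complete", "data": 404}
{"message": "fetch complete", "data": "no response"}
```

The first event locks the dynamic mapping to a numeric type, and the second one then blows up with the number format exception.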

The solution turned out to be pretty simple. You can include a custom template for your index in the logstash configuration, and this can include field mappings. To make this work through logstash you have to be using ES version 0.90.5 or later. Fortunately I am using the embedded instance for this test, which is at 0.90.9. Implementing the template solution required modifying the logstash output to reference the template:

output {
  elasticsearch {
    embedded => true
    template => "/opt/logstash/es-event-template.json"
  }
}

It also required the template itself. At first I tried creating a very simple template that only added the single field mapping I needed. That didn’t work, and I assumed it was because I was overwriting logstash’s settings in a way that was breaking things. I had hoped my single setting would just get merged in, but either that’s not happening or there is something else I don’t understand. What I ended up doing was grabbing the current template from:

http://myhost.com:9200/_template

And then editing it to add the part I needed:

"data": { 
    "index": "not_analyzed",
    "type": "string"
}
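For context, that fragment ends up nested under the template’s mappings section, roughly like this (heavily abbreviated; the real logstash template carries many more settings and properties):

```
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "data": {
          "index": "not_analyzed",
          "type": "string"
        }
      }
    }
  }
}
```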

And with that the problem was solved. The lesson of the day for me is: if logstash appears to be eating your events, crank up the logging verbosity level and see what’s giving it indigestion.

Containers are going to hurt Windows in the datacenter

As anyone who accidentally stumbles on this blog knows, I’ve been playing with Docker for a few weeks now, which makes me an expert and entitled to opine on its future. Hell, you don’t need to be an expert to know where this is going. Docker rocks. It’s going to revolutionize the way applications and dependencies are managed in the datacenter. No technology has captivated me this way since… I don’t even know. Virtualization came close. But it is containers that are poised to really deliver the value virtualization promised. And Windows has nothing comparable to what they offer. There are plenty of virtualization solutions that work on Windows, but containers are not virtualization: they’re about environment isolation, and they are a lot lighter weight, a lot easier to use, and a lot more manageable than virtualization technologies.

Over the weekend I finished up an experiment to encapsulate the complete environment for my spider application in a Docker container. It was a huge success from my point of view. Just to recap, I’m working on an application that uses a bunch of spider processes to get information off the web. The environment these things run in is fairly sophisticated at this point: they get their work input from, and stream work product to, redis. They also stream log events to redis, from which the events are picked up by logstash and indexed into Elastic Search. The whole process runs under the control of supervisord. So there are a few pieces in play.

Configuring the environment on EC2 previously required approximately 50 steps, including installing the base dependencies from the python, debian, and java repositories. After this weekend’s work it should require two: install Docker, and build the container image.

All those manual steps are now captured declaratively in a Docker build file. The base image the container build depends on is an Ubuntu 13.10 debootstrap install with python and a few other things added, which I pushed to my repository on the Docker index. It can now be pulled to anywhere Docker is installed. I could also build and push the final environment image, but all images on the Docker index are public. The way around that is either to run your own repository, or to build the final container image in place on the server. I’m going to take the latter approach for now.
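To give a flavor of what that build file looks like (a sketch only; the image name, repo URL, script names, and ports are placeholders, not the actual project’s):

```dockerfile
# Illustrative build file: all names here are stand-ins.
# Base: the debootstrap Ubuntu 13.10 + python image pushed to the index.
FROM mark/base
# Pull config files and an install script from the repo, then run it to
# install redis, logstash, Elasticsearch, and supervisord.
RUN git clone https://example.com/spider-configs.git /opt/configs
RUN /opt/configs/install.sh
ADD start.sh /usr/local/bin/start.sh
# redis, elasticsearch http, kibana
EXPOSE 6379 9200 9292
CMD bash -c '/usr/local/bin/start.sh'; bash
```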

So, with the base image in the repo, and the Docker build file created and tested, launching a new environment on a server with Docker installed looks like this:

sudo docker build -t my:image --rm=true - < mydockerfile
sudo docker run -t -i --name myname my:image /bin/bash

Boom, the server is running. In my case what that means is that redis is up, logstash is up, Elastic Search is up, supervisord is up and has launched the spider processes, and the proper ports are exposed and bridged over to the host system. All I have to do is point my web browser at http://myhostname.com:9292 and the Kibana3 dashboard pops up with the event stream visualized so I can monitor the work progress. That. Is. Cool.

Which leads me back to my opening thought. As a developer who spent 20 years writing code for Microsoft platforms I don’t want to come off like an ex-smoker who has seen the light and needs to let everyone know it. But at the same time, I can’t help but wonder how Microsoft will respond to this. Docker came out of nowhere, igniting a wildfire by making a formerly obscure technology (LXC containers) easier to understand and use. I think within another year or so containers will be as ubiquitous and as important in the datacenter as virtualization is now, perhaps more so. And Windows will be increasingly relegated to running Exchange and Sharepoint servers.