Ten tips for debugging Docker containers

Containers are awesome, but sometimes it can feel like your code has been shut up in a black box, stuck off in the cloud where you can’t see what it’s doing. When your app runs the way it’s supposed to, this isn’t a problem: containers come, containers go, everyone’s happy. But then there’s that moment when you push a new version of your image and check the cluster status only to see it crashlooping. What’s going on? In my latest post on medium.com I list ten tips that will make debugging your Docker containers easier.

Ten tips for debugging Docker containers

Comparing Amazon’s Elastic Container Service and Google Kubernetes

Over the last two years Docker containers have completely changed the way I look at software builds, deployment, and infrastructure. Orchestration platforms like Amazon’s Elastic Container Service and Google Container Engine, built on Google’s open-source Kubernetes framework, are closing the gap between what we have been able to do with software in containers and what we want to be able to do with infrastructure and deployments in the cloud. My first piece on medium.com is a comparison of these two “container orchestration” platforms.

Comparing Amazon Elastic Container Service and Google Kubernetes

Using rsyslog to ship logs to a remote logstash server

Wanted to give author dreed a shout-out for a truly awesome tutorial on using rsyslog to format and ship logs to a remote logstash server listening on a TCP endpoint. Followed this last night and was able to get our logs from haproxy over to logstash and into elasticsearch, all running in Google’s Kubernetes-based Container Engine. Nice work, dreed. I would have commented on the site, but for Disqus and having to sign in.
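For reference, the heart of that setup is a forwarding rule in an rsyslog config fragment. A minimal sketch, where the filename, hostname, and port are placeholders rather than the tutorial’s actual values:

# /etc/rsyslog.d/60-logstash.conf
# "@@" forwards over TCP; a single "@" would use UDP
*.* @@logstash.example.com:5514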

Sometimes a hack is all you’ve got

So it’s late, and I’ve been messing with feed parsing again. I have this project that I’ve been assembling off and on for a while, and it involves ingesting and analyzing RSS and Atom news feeds. I’m using Python 2.5.9 and lxml to parse the content from these feeds. The lxml package is a powerful and very fast XML/HTML parser, but it has its quirks.

There are actually two parsers in lxml. The etree parser deals formally with XML documents, and is rather fussy about things like namespaces, something that I incidentally care nothing at all about. The html parser is a lot less fussy, but it can cause problems when you use it to parse feeds, because some of the stuff in feeds gets interpreted as malformed HTML. An example:

<rss>
    <channel>
        <!-- blah -->
        <item>
            <!-- blah -->
            <link>http://popupmedia.com/link</link>
        </item>
    </channel>
</rss>

Yep, a link tag in HTML is supposed to be self-closing. So the html parser figures that you don’t need the closing tag, and it drops it. You end up with:

<rss>
    <channel>
        <!-- blah -->
        <item>
            <!-- blah -->
            <link>http://popupmedia.com/link\n
        </item>
    </channel>
</rss>

And that is not well-formed XML, and it does not help when you are trying to do something like this:

from lxml import html

tag = html.fromstring(text)
hrefs = tag.xpath("//channel/item/link/text()")

That xpath query finds nothing. Incidentally, Atom feeds don’t have this problem, because they look like this:

<feed>
    <entry>
        <!-- blah -->
        <link rel="alternate" href="http://popupmedia.com/link" />
    </entry>
</feed>

So, as I said, it’s late, and I just wanted to close the book on this chunk of code, and in order to test it I need this method that returns a list of the item links to work. How to get them? It turns out that after dropping the closing tag the html parser is able to locate the now-unclosed link tag just fine, and since the text that was originally enclosed in the link tag now follows the unclosed tag, that text becomes the tag’s tail. So this works:

tag = html.fromstring(text)
hrefs = [l.tail.strip() for l in tag.xpath("//channel/item/link")]

Go figure.
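For what it’s worth, when a feed has no namespaces (as with plain RSS 2.0), the fussy etree parser actually handles this markup fine, since it’s well-formed XML. A sketch, assuming text holds the raw feed as a byte string:

from lxml import etree

# the XML parser preserves the closing </link> tag, so the link
# text stays inside the element where xpath can find it
root = etree.fromstring(text)
hrefs = [l.text.strip() for l in root.xpath("//channel/item/link")]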

Environment-specific settings for python 2 modules

A little trick we came up with for a recent project. When developing back-end software in Python or any other language there is often a need to load different values for configuration settings based on environment. A typical case is a database connection address and port that would differ between local, test, and production environments. There are lots of ways to do this, but this one worked well for us. The technique relies on setting an environment variable with the name of the environment, and then using that name to load a default settings file and an environment-specific settings file and merge them both into the global namespace.
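A rough sketch of the idea (the variable name APP_ENV and the settings module names are illustrative assumptions, not what we actually used):

import os

# name the environment via an environment variable, defaulting to local
_env = os.environ.get("APP_ENV", "local")

# load the defaults first, then let the environment-specific module
# override them, merging both into this module's global namespace
for _name in ("settings_default", "settings_%s" % _env):
    _mod = __import__(_name)
    globals().update((k, v) for k, v in vars(_mod).items()
                     if not k.startswith("_"))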

Microsoft .NET coreclr on Ubuntu

I’ve said for years that .NET should be open-sourced and cross-platform, and now it’s finally happening. Today Microsoft announced a preview of coreclr running on Ubuntu, and this evening I was able to build it on Ubuntu 14.04 running in a docker container.

HelloWorld.exe on Ubuntu 14.04 in a docker container

There were quite a few steps involved, and as this is a preview there is also quite a bit still missing. Notably, compilation of managed code on Linux (using Roslyn) is not available, so after building the coreclr on Ubuntu you have to pop into Windows to build it and the corefx libraries, then copy a bunch of crap over to your Linux system. You also still need Mono for some callable wrappers, and NuGet to grab a bunch of dependent packages. Still, all in all it feels fairly historic, and coming on the heels of Microsoft’s announcement of their new cross-platform code editor I’d say it’s been a good week for them and for those of us who are fans of their tools (whatever platform we find ourselves working on).

If you want to try it yourself, the ubuntu:14.04 docker image is a good starting point. Note that you’ll want to install both wget and curl before following the instructions to get coreclr running.
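Something like this gets you a container with those prerequisites in place (a sketch; the coreclr build itself involves many more steps):

docker run -it ubuntu:14.04 /bin/bash

# then, inside the container:
apt-get update
apt-get install -y wget curl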


phantomjs, about:blank, and --ssl-protocol

This was an odd one. On our current web data aggregation project we had a class of sites that were causing an attribute error in our Python code. After some troubleshooting it turned out that the problem was in the response to our call out to phantomjs to render a page. We were expecting to get either an error or a valid response with a valid url, and instead we were getting a blank response with the url “about:blank.” Ok, knowing this made it easy to avoid the attribute error, but it didn’t get the data back. The real question was why we were getting about:blank.

Various posts on Stack Overflow and other places discuss this error in the context of the --ignore-ssl-errors phantomjs command line option. Apparently if you don’t tell phantom to ignore SSL errors, and you get some on a site, you can end up on about:blank. Fair enough, but we were already passing that option to phantom, so that wasn’t our issue.

I decided to fire up Fiddler on Windows and tell phantom to use it as a proxy. This proved a little disconcerting, because when I did this the sites magically started working again. Clearly using Fiddler as a proxy was masking the issue somehow. I disabled the proxy to confirm that the problem returned, and it did. I then ran the process through Fiddler again and checked the resulting output.

At first glance the data looked normal, but then I noticed that, while the offending sites were using ‘https:’ in their urls, and would redirect the browser to the ‘https:’ address if you tried ‘http:’, they were not returning an encoded response. Was this somehow the culprit? I confess that the idea of simply running a local proxy to make the issue go away did occur to me. Instead I decided to scan the phantomjs command line options, and there I saw the “--ssl-protocol” option. This setting defaults to SSLv3, but one of the acceptable alternatives is “any.” I added “--ssl-protocol=any” to our startup options for phantomjs, and the sites started working again. Well, three of them did. The fourth is still causing some JavaScript error, but I’ll count it a win anyway.
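For reference, the resulting invocation looked roughly like this (the script name and URL are placeholders, not our actual values):

phantomjs --ignore-ssl-errors=true --ssl-protocol=any render_page.js https://example.com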

As for exactly why this worked, I haven’t had time to puzzle that out yet. If anyone has some ideas post them below!

TBODY: When it’s there, and when it ain’t

My current job involves aggregating a lot of information from many websites, all of which are constructed using different tools, frameworks, approaches, etc. So, pretty much a standard web scraping challenge, except perhaps with some additional hurdles introduced by the fact that our industry hosts some of the oldest, ugliest websites you’ve ever seen. About that I will have no more to say today, but for the purposes of this post it is sufficient to note that many of these sites are… sigh… table-driven.

If you’ve spent a lot of time with xpath expressions, using tools like FirePath to develop good expressions and then exporting them for use in running applications, then you probably knew where I was going with this as soon as you saw the title. For the rest of you, maybe this will save you a few minutes somewhere down the road.

The TBODY element is a child of the TABLE element, and is used to encapsulate a body within the table. Here’s an example:

<TABLE><TBODY><TR><TD>ugh caps</TD></TR></TBODY></TABLE>

According to the standard the TBODY element is optional unless a table has more than one body. Most tables have just one body, and most tables that I see in the wild omit the TBODY element from the markup.

So can we forget about it? Not really. If you’re developing an xpath expression in Firefox, for example, and looking at the page DOM as you work, then you’re seeing TBODY in the tables whether it was present in the page markup or not. Most browser rendering engines insert a TBODY around table contents. I’m not sure why, but I assume it makes the parsing or rendering path more efficient.

If you’re using something like FirePath then the expression you’re developing is evaluated against the DOM, meaning that it won’t match if you don’t insert the TBODY. But what if the TBODY isn’t actually in the markup? Then that xpath expression won’t match when you move it to your application. Maybe.

It won’t match if you’re running your xpath against the page markup, and the TBODY is not in the page markup. But what if you render the page in memory using something like phantomjs, a task that is more often than not required in order to access the full content displayed on the page? If you do, then what you end up scraping is the rendered DOM, serialized back into html markup. Guess what? That means the TBODY is back.

The bottom line is: if you’re scraping page markup text retrieved from the site server, and the TBODY element is not used in a table, then you don’t want it in your xpath either. If you’re scraping page markup retrieved from a server-side in-memory rendering engine then you will need the TBODY, whether it was present in the markup or not.
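Here’s a minimal sketch of the difference using lxml (the markup is a made-up example; note that lxml’s HTML parser, unlike a browser, does not insert the missing TBODY):

from lxml import html

# markup as it might come straight from the site's server: no TBODY
raw = "<table><tr><td>cell</td></tr></table>"
doc = html.fromstring(raw)

print doc.xpath("//table/tr/td/text()")        # ['cell']
print doc.xpath("//table/tbody/tr/td/text()")  # [] -- no TBODY in the markup

If the same page instead came back from an in-memory rendering engine like phantomjs, the serialized DOM would include the TBODY and the two expressions would trade places.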

Running haproxy at system startup on Ubuntu

Sometimes I just have to capture this sort of thing because I know damn well I am going to forget it long before the next time it’s needed. We decided to use haproxy as a round-robin matchmaker in front of a cluster of elasticsearch nodes. I installed it on Ubuntu 14.04, configured it, and tested it by running it from the command line. Everything worked. But when I rebooted the instance the proxy didn’t start. I checked the init file and run levels. Everything seemed as it should be… but no dice. To make it worse, executing commands like…

sudo service haproxy start

or

sudo service haproxy status

… resulted in nothing at all. No errors, no output, nada. I searched around a little bit and stumbled on the following line in /etc/default/haproxy…

ENABLED=0

Apparently this flag, when set to 0, prevents the init script from running the proxy. The second-hand info I found on the reasoning is that there is no rational default config for haproxy, so they want you to explicitly configure it and then set this flag to 1. So… fine. Makes sense, but I do wish the init script were a little more verbose about what was happening. In any case, once I did this…

ENABLED=1

… haproxy launched at start as desired.

Re-raising exceptions in Python

I often run into situations where an error occurs, an exception is caught, and some action is taken before the exception is re-raised so that higher levels of the program can deal with it. Exactly how the exception is re-raised can have a big impact on how much information is available to someone trying to debug a problem. I got this wrong as a newbie, and I’ve seen even experienced Python programmers get tripped up by these subtleties, so I thought I would try to post a clear example. Consider the following program:

def bar():
    raise Exception("Oops!")

def foo():
    bar()

def dinkum():
    foo()

def fair():
    try:
        dinkum()
    except Exception as e:
        print "Something failed!"
        raise e

fair()

As you can see I’ve created a few functions in order to have a stack trace to look at, and raised an exception in the last one to be called, bar(). That exception is caught in fair() and some information is printed before the exception is re-raised by the statement “raise e.” This syntax looks pretty reasonable, so let’s run it and see what happens:

mark@viking:~/workspace$ python raise_test1.py
Something failed!
Traceback (most recent call last):
  File "raise_test.py", line 21, in <module>
    fair()
  File "raise_test.py", line 19, in fair
    raise e
Exception: Oops!

Hmm. Looks like we’re missing some useful information. The exception was raised on line 6 but the stack trace begins on line 19. Let’s change the program slightly and run it again:

def bar():
    raise Exception("Oops!")

# ... snip for brevity ...

def fair():
    try:
        dinkum()
    except Exception as e:
        print "Something failed!"
        raise

fair()

The only change here is that now we’re simply using “raise” rather than re-raising the exception by name with “raise e.” The output is quite different:

mark@viking:~/workspace$ python raise_test2.py
Something failed!
Traceback (most recent call last):
  File "raise_test.py", line 21, in <module>
    fair()
  File "raise_test.py", line 16, in fair
    dinkum()
  File "raise_test.py", line 12, in dinkum
    foo()
  File "raise_test.py", line 9, in foo
    bar()
  File "raise_test.py", line 6, in bar
    raise Exception("Oops!")
Exception: Oops!

That’s more like it. Now we can see the whole stack trace and see that the error originated on line 6 in the call to bar(). This would be helpful stuff to have in a debugging situation. So what’s going on here?

In some languages and runtimes exception objects carry their own stack trace information, but in Python the interpreter is responsible for capturing the state of the call stack when an exception occurs. To access this information a program can call the sys.exc_info() function, which returns a tuple whose third element (index 2) is the most recent traceback. We can use this to get a better look at the state of the call stack when we catch an exception:

import sys
import traceback

def bar():
    raise Exception("Oops!")

# ... snip for brevity ...

def fair():
    try:
        dinkum()
    except Exception as e:
        print "Something failed!"
        traceback.print_tb(sys.exc_info()[2])
        raise

try:
    fair()
except Exception as e:
    print "It sure looks like it!"
    traceback.print_tb(sys.exc_info()[2])

The print_tb() function defined in the traceback module is a handy way to get a pretty-printed view of the traceback object. When run, this code produces the following:

mark@viking:~/workspace$ python raise_test3.py
Something failed!
  File "raise_test.py", line 15, in fair
    dinkum()
  File "raise_test.py", line 11, in dinkum
    foo()
  File "raise_test.py", line 8, in foo
    bar()
  File "raise_test.py", line 5, in bar
    raise Exception("Oops!")
It sure looks like it!
  File "raise_test.py", line 21, in <module>
    fair()
  File "raise_test.py", line 15, in fair
    dinkum()
  File "raise_test.py", line 11, in dinkum
    foo()
  File "raise_test.py", line 8, in foo
    bar()
  File "raise_test.py", line 5, in bar
    raise Exception("Oops!")

The interpreter creates a new traceback when the exception is initially raised on line 5. As the stack is unwound back to the first “except” block, each call point (function name and line number) is recorded on the traceback. As you can see above, when we first catch the exception there are four call points on the traceback. When we catch it a second time after using “raise” to re-raise it, a fifth call point has been added. The crucial thing is that “raise” did not prompt the interpreter to start a new stack trace. What happens when instead of “raise” we use “raise e” to re-raise the exception?

Something failed!
  File "raise_test.py", line 15, in fair
    dinkum()
  File "raise_test.py", line 11, in dinkum
    foo()
  File "raise_test.py", line 8, in foo
    bar()
  File "raise_test.py", line 5, in bar
    raise Exception("Oops!")
It sure looks like it!
  File "raise_test.py", line 21, in <module>
    fair()
  File "raise_test.py", line 19, in fair
    raise e

In this case the interpreter started a new stack trace when it saw the “raise e” statement. In fact this is the exact same behavior you’d get by raising a whole new exception with the statement raise Exception("Sorry, no traceback for you").

So when would you actually want to use “raise e”? To the extent that ‘e’ is an exception that you’ve just caught, never. Doing so will always lose the traceback information from the previous frames on the stack, and obscure the point of origin of the original error. However, it can sometimes make sense to translate one exception into another. Something like the following isn’t an unknown pattern:

try:
    some_vulnerable_op()
except LowLevelException as e:
    very_important_cleanup()
    raise HighLevelException("Low level stuff failed")

While this is a valid pattern, I’m not a huge fan of it unless information specific to the inner exception is logged or otherwise made available. There are a number of ways you could accomplish this, but we can leave a fuller treatment for another time. For now, suffice it to say that exception handling complicates reasoning about the control flow and behavior of a program, so it makes sense to keep your exception handling patterns as simple as you can.
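One simple option, as a sketch (using the standard logging and traceback modules; the exception names are the made-up ones from the pattern above):

import logging
import traceback

try:
    some_vulnerable_op()
except LowLevelException:
    very_important_cleanup()
    # record the inner traceback before it is replaced by the
    # translated exception's brand-new one
    logging.error("low level failure:\n%s", traceback.format_exc())
    raise HighLevelException("Low level stuff failed")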