Sometimes a hack is all you’ve got

So it’s late, and I’ve been messing with feed parsing again. I have this project that I’ve been assembling off and on for a while, and it involves ingesting and analyzing RSS and Atom news feeds. I’m using python 2.5.9 and lxml to parse the content from these feeds. The lxml package is a powerful and very fast xml/html parser, but it has its quirks.

There are actually two parsers in lxml. The etree parser deals formally with xml documents, and is rather fussy about things like namespaces, something that I incidentally care nothing at all about. The html parser is a lot less fussy, but it can cause problems when you use it to parse feeds, because some of the stuff in feeds gets interpreted as malformed html. An example:

        <!-- blah -->
            <!-- blah -->

Yep, a link tag in html is supposed to be self-closing. So the html parser figures that you don’t need the closing tag, and it drops it. You end up with:

        <!-- blah -->
            <!-- blah -->

And that is not well-formed xml, and it does not help when you are trying to do something like this:

from lxml import html

tag = html.fromstring(text)
hrefs = tag.xpath("//channel/item/link/text()")

That xpath query finds nothing. Incidentally Atom feeds don’t have this problem, because they look like this:

        <!-- blah -->
        <link rel="alternate" href="" />

So, as I said, it’s late, and I just wanted to close the book on this chunk of code, and in order to test it I need this method that returns a list of the item links to work. How to get them? It turns out that after dropping the closing tag the html parser is able to locate the now unclosed link tag just fine, and since the text that was originally enclosed in the link tag is now following the unclosed tag, that makes it the tail. So this works:

tag = html.fromstring(text)
hrefs = [l.tail.strip() for l in tag.xpath("//channel/item/link")]

Go figure.
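
For anyone who hasn’t bumped into tail before, here is a minimal sketch of the idea. It uses the standard library’s ElementTree instead of lxml so it runs without extra packages; the tail attribute means the same thing in both. The FEED_TREE string and its example.com url are made up for illustration, and only approximate the tree described above, where the link element is empty and the url sits after it. The guard for a missing tail is my own addition.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the tree once the closing link tags are gone:
# each link element is empty and the url follows it as its tail.
FEED_TREE = (
    "<channel>"
    "<item><title>First</title><link/>http://example.com/first </item>"
    "<item><title>Second</title><link/></item>"
    "</channel>"
)


def item_links(channel):
    """Return the text that trails each item's link element."""
    links = []
    for link in channel.findall("./item/link"):
        # tail is None when nothing follows the element, so guard before strip()
        tail = (link.tail or "").strip()
        if tail:
            links.append(tail)
    return links


channel = ET.fromstring(FEED_TREE)
print(item_links(channel))
```

The second item has no text after its link, so it is skipped rather than raising an AttributeError on None.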

Environment-specific settings for python 2 modules

A little trick we came up with for a recent project. When developing back-end software in python or any other language there is often a need to load different values for configuration settings based on environment. A typical case is a database connection address and port, for example, that would be different when working locally vs. test vs. production. There are lots of ways to do this, but this one worked well for us. The technique relies on setting an environment variable with the name of the environment, and then using that name to load a default settings file and an environment-specific settings file and merge them both into the global namespace.
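
A rough sketch of how that can look, under a few assumptions of my own: the environment variable is called APP_ENV, the settings files are plain python files named default.py and <environment>.py sitting in one directory, and “local” is the fallback environment. None of those names come from the original project.

```python
import os


def load_settings(namespace, settings_dir, env_var="APP_ENV", default_env="local"):
    """Merge default.py, then <environment>.py, into the given namespace.

    The file names, APP_ENV and the "local" fallback are illustrative choices.
    """
    env_name = os.environ.get(env_var, default_env)
    for name in ("default", env_name):
        path = os.path.join(settings_dir, name + ".py")
        if not os.path.exists(path):
            continue  # an environment is allowed to have no overrides
        loaded = {}
        with open(path) as handle:
            exec(compile(handle.read(), path, "exec"), loaded)
        # Copy only public names so __builtins__ and private helpers stay out.
        namespace.update(
            (key, value) for key, value in loaded.items() if not key.startswith("_")
        )
    return env_name
```

Calling load_settings(globals(), os.path.dirname(__file__)) at the top of a settings module would put the merged values into that module’s global namespace. Because the environment-specific file is loaded second, its values win over the defaults.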

Thought for the day

“Enterprise customers” are like the deserted island that software CEOs wash up on when their ship sinks. At first it seems like you’ve been rescued: it’s land, it’s dry and all the customers that were on the ship with you are there too. Could be fun! But in a short time it becomes clear that you’re stuck on the island, and the world is sailing away from you. Your customers call in their choppers and one by one they leave the island. Unfortunately none of them have room for another passenger, but they will send someone back for you. They promise.

Do I want Windows 10?

I sat down in my office this morning and found a new icon in the system tray notification area of my Windows 7 Enterprise desktop. Right-clicking it showed four options, none of which was “Exit.” Left-clicking it brought up this window…

I’m not sure when Microsoft installed this program, but it must have been last week when the Tuesday update batch hit. None of the actions in the window appealed to me this morning. I don’t know what it means to “reserve” my free upgrade, and I am still not sold on Windows 10. Since I couldn’t get rid of the program (at least not without hunting down the process and killing it, after which it would in all likelihood return on the next restart) I used the notifications manager to hide it.

It’s not that I’m uninterested in Windows 10. On the contrary I’ve been very interested in all of Microsoft’s most recent decisions with their ecosystem, including the move to open source .NET and make the CLR portable across systems. And the truth is that I was a Microsoft platform developer for twenty years. It’s safe to say I have never before been two whole versions behind the current release of the operating system. Why, then, am I still running Windows 7? Should I upgrade to Windows 10?

I’m still running Windows 7 because Windows 8 sucked hard, in my opinion. I installed it. I used it. My wife has it on her laptop and I tried to help her a bit during the acclimation phase, and I thought it was horrible. I’m a software developer and the interface of Windows 8 did nothing other than make everything I already knew how to do cumbersome and difficult, with no compensating benefit that I could detect. I don’t have two 24″ monitors so I can cover them with big tiles.

I felt pretty much the same way about Windows Vista, but for different and more technical reasons. Windows Vista just wasn’t ready. Windows 8 just didn’t make any sense, ready or not. However Windows Vista became Windows 7, easily the best version of the OS that I have used, and a pretty high bar which Windows 8 certainly did not manage to leap over, at least in my view. Now that Windows 8 is becoming Windows 10 is it time to switch?

Even without thinking too deeply about it I’m biased against. The main reason is the deprecation of Windows Media Center. Microsoft is giving up on the desktop entertainment functions of the PC, apparently. But for me WMC has long been my Netflix solution of choice. It looks and works great and my ten-year-old Firefly RF remote works awesomely with it. I realize, unfortunately, that this technology is aging, so maybe I should just get over it. What else does Windows 10 offer me, other than correcting the things people saw as mistakes in Windows 8? Let’s have a look at their news release. What are the big features a professional developer like me should care about?

Cortana, the world’s first truly personal digital assistant helps you get things done. Cortana learns your preferences to provide relevant recommendations, fast access to information, and important reminders. Interaction is natural and easy via talking or typing. And the Cortana experience works not just on your PC, but can notify and help you on your smartphone too.

Awesome, a digital assistant. I don’t really need one on my desktop. It might be useful in a mobile context, but I don’t run Windows Phone and I am not likely to anytime soon. Then again my kids all have iPhones with Siri, and I don’t hear them talking to their phones. As far as I know they don’t use Siri for anything. Maybe if I went to San Jose I’d encounter lots of people asking their phones to do things, but I just don’t see it happening around here, at least not in public.

Microsoft Edge, is an all-new browser designed to get things done online in new ways, with built-in commenting on the web – via typing or inking — sharing comments, and a reading view that makes reading web sites much faster and easier. With Cortana integrated, Microsoft Edge offers quick results and content based on your interests and preferences. Fast, streamlined and personal, you can focus on just the content that matters to you and actively engage with the web.

Well, web apps are still a big part of what I do so I will be getting familiar with Edge whether I want to or not. Am I excited about getting Windows 10 so Edge can replace my current browser? Let’s see… it has Cortana integrated… w00t. See above. And a new “reading view!” I’ll be keeping an eye on Edge in terms of standards compliance and performance, of course, and I will be testing web apps on it, but if those are the big draw features I’ll continue to bounce back and forth between Chrome and Firefox as one or the other alternately pisses me off.

Office on Windows: In addition to the Office 2016 full featured desktop suite, Windows 10 users will be able to experience new universal Windows applications for Word, Excel, and PowerPoint, all available separately. These offer a consistent, touch-first experience across a range of devices to increase your productivity …

I’m not even going to bother with that whole quote. I love Windows, but if there was ever a good reason to hate Windows it would have to be related to Office somehow. From a word processor so horribly complicated that no living human can enumerate more than 10% of the feature set to an email cum personal productivity tool that set a new standard for how long a legacy code base can continue to be crammed into ever more ill-fitting skin, there is literally nothing about Office that I like. Been using Google Docs for years and the words “touch-first experience” in the quote above certainly don’t give me any reason to rethink that choice. Yuck.

Xbox Live and the integrated Xbox App bring new game experiences to Windows 10. Xbox on Windows 10 brings the expansive Xbox Live gaming network to both Windows 10 PCs and tablets. Communicate with your friends on Windows 10 PCs and Xbox One – while playing any PC game.

Ok, that’s fine. I’m not a console gamer, and I don’t own an Xbox, but this is still pretty cool and if I were a console gamer, or was willing to purchase an Xbox to replace Windows Media Center, this might be exciting.

New Photos, Videos, Music, Maps, People, Mail & Calendar apps have updated designs that look and feel familiar from app to app and device to device.  You can start something on one device and continue it on another since your content is stored on and synched through OneDrive.

Wow, now this is what I was waiting for. Not. There are better services available now for all this stuff. Definitely will be meaningful to the thousands of people on Windows Phone, though. Next.

Windows Continuum enables today’s best laptops and 2-in-1 devices to elegantly transform from one form factor to the other, enabling smooth transitions of your tablet into a PC, and back. And new Windows phones with Continuum can be connected to a monitor, mouse and physical keyboard to make your phone work like a PC.

I’m not writing this off. Device convergence has been talked about for a long time, and I certainly hope somebody is able to make it happen. The vision of being able to use one device across different inputs and display form factors is compelling. But the problem Microsoft has is they don’t have the dynamic mobile ecosystem to pull this off and make it relevant to lots of people. Jury is out, but give them credit for the attempt anyway.

Windows Hello, greets you by name and with a smile, letting you log in without a password and providing instant, more secure access to your Windows 10 devices. With Windows Hello, biometric authentication is easy with your face, iris, or finger, providing instant recognition.

Same thing. Not writing it off, despite the ridiculous name. Also not relevant to me sitting here at my desk. So like the Continuum feature it is interesting, but provides no motivation to move me off 7 for my work system.

Windows Store, with easy install and uninstall of trusted applications, supported by the broadest range of global payment methods.

When lots of people are running Windows on their phones this will be a very compelling offering. Which is just a bit like saying that when I win the Powerball I will be living in a beachfront home in Santa Barbara. I understand Microsoft’s dilemma, and I don’t envy them. They have to be relevant to mobile, even though they are barely relevant to mobile. You can’t build the future of your business on desktop computing, even though most of the people who use your current product use it at a desktop, for work.

It’s a tough situation, but the reality for me is that reading through this feature list doesn’t make me wonder whether I should upgrade to Windows 10 on my desktop: it makes me wonder why I am still running Windows on my desktop at all. I also have an Ubuntu development box and switching would be pretty painless for me. The answer is: games and Steam, and to a lesser extent a huge file called outlook.pst. I play games like Battlefield 4 and Planetside 2, and I like a mouse and keyboard for that. And I have ten years of history in my outlook file. I never look at it, but for some reason I haven’t been able to just delete it. If I ever get to that point and also stop playing shooters (which I should do since I basically just get slaughtered by teenagers) that will probably be it for Windows.

So the answer to the original question I posed to myself in the title appears to still be “no.” I’ll be giving Windows 10 a look in a VM at some point, and at some other point, hopefully still well into the future, I am going to be faced with the fact that I just can’t continue running Windows 7. I suspect that what will happen then is my Ubuntu and Windows machines will switch roles. Instead of having Windows drive my two monitors and using NX to access Ubuntu in a window it will be the other way around, and I will be switching into Windows every now and then just to play a game, or to actually try and find something in that huge Outlook file.

Microsoft .NET coreclr on Ubuntu

I’ve said for years that .NET should be open sourced and cross-platform, and that development is finally taking place. Today Microsoft announced a preview of coreclr running on Ubuntu, and this evening I was able to build it on Ubuntu 14.04 running in a docker container.

HelloWorld.exe on Ubuntu 14.04 in a docker container

There were quite a few steps involved, and as this is a preview there is also quite a bit still missing. Notably compilation of managed code on linux (using roslyn) is not available, so after building the coreclr on Ubuntu you have to pop into Windows and build it, and the corefx libraries, then copy a bunch of crap over to your linux system. You also still need mono for some callable wrappers, and nuget to grab a bunch of dependent packages. Still, all in all it feels fairly historic, and coming on the heels of Microsoft’s announcement of their new cross-platform code editor I’d say it’s been a good week for them and those of us who are fans of their tools (whatever platform we find ourselves working on).

If you want to try it yourself the ubuntu:14.04 docker image is a good starting point. Note that you’ll want to install both wget and curl before following the instructions to get coreclr running.


Gazela Primeiro, 1985

In the Spring of 1985 I was a member of the crew of a 180′ wooden barkentine named “Gazela Primeiro” on a trip from Philadelphia to Quebec City, Canada. A survivor of the Portuguese Grand Banks cod fishing fleet, Gazela had been constructed in Lisbon in 1883, and had worked the banks until 1969, making her the last working wooden three-masted ship in the world. I took a bunch of pictures on that trip, and they all languished in a closet for 30 years until I dug them out recently and began to scan them. Here are 34 of the best images from the black and white collection. I also have some color images I will be scanning in at some point in the future.

Bluenose II

Bluenose II under sail off the coast of Nova Scotia, 1985.

Chesapeake Bay Skipjacks, 1986

In the mid-1980s I spent a couple of seasons working for Captain Ed Farley on his skipjack “Stanley Norman” out of Tilghman Island, Maryland. One day he graciously agreed to bring in a fill-in and allow me to play photographer for the day with my old Pentax SLR. I had the photos developed and then stashed them away in a box. 28 years later I came upon them while searching for old pictures for a project my daughter Olivia was working on. There were 42 images good enough to display, and so here they are resized but otherwise unretouched.

A bay skipjack pushes down Harris Creek under yawl power.


phantomjs, about:blank, and --ssl-protocol

This was an odd one. On our current web data aggregation project we had a class of sites that were causing an attribute error in our python code. After some troubleshooting it turned out that the problem was in the response to our call out to phantomjs to render a page. We were expecting to get either an error or a valid response with a valid url, and instead we were getting a blank response with the url “about:blank.” Ok, knowing this made it easy to avoid the attribute error, but it didn’t get the data back. The real question was why we were getting about:blank.

Various posts on Stack Overflow and other places discuss this error in the context of the --ignore-ssl-errors phantomjs command line option. Apparently if you don’t tell phantom to ignore ssl errors, and you get some on a site, you can end up on about:blank. Fair enough, but we were already passing that option to phantom, so that wasn’t our issue.

I decided to fire up Fiddler on Windows and tell phantom to use it as a proxy. This proved a little disconcerting, because when I did this the sites magically started working again. Clearly using Fiddler as a proxy was masking the issue somehow. I disabled the proxy to confirm that the problem returned, and it did. I then ran the process through Fiddler again and checked the resulting output.

At first glance the data looked normal, but then I noticed that, while the offending sites were using ‘https:’ in their urls, and would redirect the browser to the ‘https:’ address if you tried ‘http:’, they were not returning an encoded response. Was this somehow the culprit? I confess that the idea of simply running a local proxy to make the issue go away did occur to me. Instead I decided to scan the phantomjs command line options, and there I saw the “--ssl-protocol” option. This setting defaults to sslv3, but one of the acceptable alternatives is “any.” I added “--ssl-protocol=any” to our startup options for phantomjs, and the sites started working again. Well, three of them did. The fourth is still causing some javascript error, but I’ll count it a win anyway.
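
For reference, here is a sketch of how the two options can be passed when launching phantomjs from python. The render.js script name, the function names and the about:blank check are placeholders of mine and not our actual code; the two command line options are the ones discussed above, and they have to come before the script path.

```python
import subprocess


def phantomjs_command(script_path, url, binary="phantomjs"):
    """Build the argument list; phantomjs options go before the script path."""
    return [
        binary,
        "--ignore-ssl-errors=true",  # don't bail out on certificate errors
        "--ssl-protocol=any",        # let phantomjs negotiate the protocol version
        script_path,
        url,
    ]


def looks_blank(final_url):
    """True when the render ended up on about:blank instead of a real page."""
    return final_url.strip() == "about:blank"


def render(script_path, url):
    """Run phantomjs and return whatever the (hypothetical) script prints."""
    return subprocess.check_output(phantomjs_command(script_path, url))
```

A call like render("render.js", "https://example.com/") assumes phantomjs is on the path and that render.js prints the page it ends up with.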

As for exactly why this worked, I haven’t had time to puzzle that out yet. If anyone has some ideas post them below!

TBODY: When it’s there, and when it ain’t

My current job involves aggregating a lot of information from many websites, all of which are constructed using different tools, frameworks, approaches, etc. So, pretty much a standard web scraping challenge, except perhaps with some additional hurdles introduced by the fact that our industry hosts some of the oldest, ugliest websites you’ve ever seen. About that I will have no more to say today, but for the purposes of this post it is sufficient to note that many of these sites are… sigh… table-driven.

If you’ve spent a lot of time with xpath expressions and have used tools like FirePath and tried to develop good expressions and export them to use in running applications, then you probably knew where I was going with this as soon as you saw the title. For the rest of you, maybe this will save you a few minutes somewhere down the road.

The TBODY element is a child of the TABLE element, and is used to encapsulate a body within the table. Here’s an example:

<TABLE><TBODY><TR><TD>ugh caps</TD></TR></TBODY></TABLE>

According to the standard the TBODY element is optional unless a table has more than one body. Most tables have just one body, and most tables that I see in the wild omit the TBODY element from the markup.

So can we forget about it? Not really. If you’re developing an xpath expression in Firefox, for example, and looking at the page DOM as you work, then you’re seeing TBODY in the tables whether it was present in the page markup, or not. Most browser rendering engines insert a TBODY around table contents. I’m not sure why, but I assume it makes the parsing or rendering path more efficient.

If you’re using something like FirePath then the expression you’re developing is evaluated against the DOM, meaning that it won’t match if you don’t insert the TBODY. But what if the TBODY isn’t actually in the markup? Then that xpath expression won’t match when you move it to your application. Maybe.

It won’t match if you’re running your xpath against the page markup, and the TBODY is not in the page markup. But what if you render the page in memory using something like phantomjs, a task that is more often than not required in order to access the full content displayed on the page? If you do, then what you end up scraping is the rendered DOM, serialized back into html markup. Guess what? That means the TBODY is back.

The bottom line is: if you’re scraping page markup text retrieved from the site server, and the TBODY element is not used in a table, then you don’t want it in your xpath either. If you’re scraping page markup retrieved from a server-side in-memory rendering engine then you will need the TBODY, whether it was present in the markup or not.
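
Here is a small sketch of the difference. It uses the standard library’s ElementTree on two well-formed, lowercase snippets so that it runs anywhere; the same idea should carry over to lxml xpath calls. The two markup strings and the function names are made up for illustration: one stands in for raw server markup, the other for a rendered DOM serialized back to html.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: markup as served, and the same table after rendering.
RAW_MARKUP = "<html><body><table><tr><td>ugh caps</td></tr></table></body></html>"
RENDERED_DOM = (
    "<html><body><table><tbody><tr><td>ugh caps</td></tr></tbody></table>"
    "</body></html>"
)

WITHOUT_TBODY = ".//table/tr/td"
WITH_TBODY = ".//table/tbody/tr/td"


def cell_texts(markup, path):
    """Return the text of every cell matched by path."""
    root = ET.fromstring(markup)
    return [td.text for td in root.findall(path)]


def cell_texts_either(markup):
    """Try both shapes, for code that may see raw or rendered markup."""
    return cell_texts(markup, WITHOUT_TBODY) + cell_texts(markup, WITH_TBODY)


print(cell_texts(RAW_MARKUP, WITH_TBODY))       # no match: tbody is not in the markup
print(cell_texts(RENDERED_DOM, WITH_TBODY))     # matches: the renderer added tbody
print(cell_texts_either(RAW_MARKUP))
```

If the same code has to handle both kinds of input, trying both expressions as in cell_texts_either is one way to avoid caring which one you got.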

IBM Research report on performance of Linux containers

At Knowledge In Practice we were pretty early adopters of Docker, and after more than six months of use nearly all of our production services are now deployed to Amazon’s EC2 as linux containers. While the lower overhead of containers was a draw, as a small team the main benefits for us have been ease of deployment and increased environmental stability due to the use of Docker build files to declaratively specify the content of each service’s run-time environment. Launching a new instance of a service is literally as easy as adding one line to the cloudinit script for the instance, then running “docker pull” to get the image we want, and “docker run” to get the container going. Those steps could easily be automated as well. It’s a workflow that’s hard to beat.

Late last month IBM Research released a paper (PDF) comparing the performance of linux containers vs. traditional types of hardware and software virtualization. Not surprisingly containers fare quite well, although the paper notes that both VMs and containers need to be fine-tuned for high I/O workloads. Section 2.3 of the paper provides an excellent quick overview of how containers are implemented in linux using kernel namespaces and cgroups, and in fact I found that part of the document more valuable than the performance comparisons. Well worth a scan, at least, if you have an interest in this technology.