Sometimes a hack is all you’ve got

So it’s late, and I’ve been messing with feed parsing again. I have this project that I’ve been assembling¬†off and on for awhile, and it involves ingesting and analyzing RSS and Atom news feeds. I’m using python 2.5.9 and lxml to parse the content from these feeds. The lxml package is a powerful and very fast xml/html parser, but it has its quirks.

There are actually two parsers in lxml. The etree parser deals formally with xml documents, and is rather fussy about things like namespaces, something that I incidentally care nothing at all about. The html parser is a lot less fussy, but it can cause problems when you use it to parse feeds, because some of the stuff in feeds gets interpreted as malformed html. An example:

<rss>
    <channel>
        <!-- blah -->
        <item>
            <!-- blah -->
            <link>http://popupmedia.com/link</link>
        </item>
    </channel>
</rss>

Yep, a link tag in html is supposed to be self-closing. So the html parser figures that you don’t need the closing tag, and it drops it. You end up with:

<rss>
    <channel>
        <!-- blah -->
        <item>
            <!-- blah -->
            <link>http://popupmedia.com/link\n
        </item>
    </channel>
</rss>

And that is not well-formed xml, and it does not help when you are trying to do something like this:

tag = html.fromstring(text)
hrefs = tag.xpath("//channel/item/link/text()")

That xpath query finds nothing. Incidentally Atom feeds don’t have this problem, because they look like this:

<feed>
    <entry>
        <!-- blah -->
        <link rel="alternate" href="http://popupmedia.com/link" />
    </entry>
</feed>

So, as I said, it’s late, and I just wanted to close the book on this chunk of code, and in order to test it I need this method that returns a list of the item links to work. How to get them? It turns out that after dropping the closing tag the html parser is able to locate the now unclosed link tag just fine, and since the text that was originally enclosed in the link tag is now following the unclosed tag, that makes it the tail. So this works:

tag = html.fromstring(text)
hrefs = 
    [l.tail.strip() for l in tag.xpath("//channel/item/link")]

Go figure.