My current job involves aggregating a lot of information from many websites, all of which are constructed using different tools, frameworks, approaches, etc. So, pretty much a standard web scraping challenge, except perhaps with some additional hurdles introduced by the fact that our industry hosts some of the oldest, ugliest websites you’ve ever seen. About that I will have no more to say today, but for the purposes of this post it is sufficient to note that many of these sites are… sigh… table-driven.
If you’ve spent a lot of time with xpath expressions, developing them in tools like FirePath and then exporting them for use in running applications, then you probably knew where I was going with this as soon as you saw the title. For the rest of you, maybe this will save you a few minutes somewhere down the road.
The TBODY element is a child of the TABLE element, and is used to encapsulate a body within the table. Here’s an example:
<TABLE><TBODY><TR><TD>ugh caps</TD></TR></TBODY></TABLE>
According to the standard the TBODY element is optional unless a table has more than one body. Most tables have just one body, and most tables that I see in the wild omit the TBODY element from the markup.
So can we forget about it? Not really. If you’re developing an xpath expression in Firefox, for example, and looking at the page DOM as you work, then you’re seeing TBODY in the tables whether it was present in the page markup or not. Browsers insert a TBODY around table rows that lack one. This is not just an optimization: the HTML parsing rules treat a missing TBODY start tag as implied, so the parser creates the element when it finds a TR directly inside a TABLE.
If you’re using something like FirePath then the expression you’re developing is evaluated against the DOM, meaning that it won’t match if you don’t insert the TBODY. But what if the TBODY isn’t actually in the markup? Then that xpath expression won’t match when you move it to your application. Maybe.
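Here is a minimal sketch of that mismatch in Python, using only the standard library. The markup string and both paths are made up for illustration, and I am assuming the snippet is well-formed enough for xml.etree.ElementTree to parse; real pages usually need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page markup as the server sends it: no TBODY in the table.
RAW_MARKUP = "<html><body><table><tr><td>price</td></tr></table></body></html>"

root = ET.fromstring(RAW_MARKUP)  # root is the <html> element

# The path as a browser tool would show it, with the implied TBODY step.
browser_path = "./body/table/tbody/tr/td"
# The same path with the TBODY step removed, matching the raw markup.
markup_path = "./body/table/tr/td"

browser_hits = [td.text for td in root.findall(browser_path)]
markup_hits = [td.text for td in root.findall(markup_path)]

print(browser_hits)  # no TBODY in the markup, so nothing matches
print(markup_hits)   # the cell is found once TBODY is dropped
```

The only difference between the two paths is the tbody step, which is exactly the part the browser added on its own.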
It won’t match if you’re running your xpath against the page markup, and the TBODY is not in the page markup. But what if you render the page in memory using something like PhantomJS, which is often necessary to get at the full content displayed on the page? If you do, then what you end up scraping is the rendered DOM, serialized back into HTML markup. Guess what? That means the TBODY is back.
The bottom line is: if you’re scraping page markup text retrieved from the site server, and the TBODY element is not used in a table, then you don’t want it in your xpath either. If you’re scraping page markup retrieved from a server-side in-memory rendering engine then you will need the TBODY, whether it was present in the markup or not.
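If the same expression has to work against both kinds of input, one option is to try it as written and then retry with the TBODY steps removed. The sketch below is just one way to do that: the helper name find_with_optional_tbody and both sample strings are hypothetical, and the "rendered" string is hand-written to stand in for a serialized DOM rather than produced by a real rendering engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: markup from the server, and the same table as a
# rendering engine would serialize it (with the implied TBODY added).
RAW_MARKUP = "<table><tr><td>price</td></tr></table>"
RENDERED_MARKUP = "<table><tbody><tr><td>price</td></tr></tbody></table>"


def find_with_optional_tbody(root, path):
    """Try the path as written, then again with every /tbody step removed."""
    for candidate in (path, path.replace("/tbody", "")):
        hits = root.findall(candidate)
        if hits:
            return hits
    return []


# Path copied from a browser tool, relative to the <table> element.
browser_path = "./tbody/tr/td"

raw_cells = [
    td.text
    for td in find_with_optional_tbody(ET.fromstring(RAW_MARKUP), browser_path)
]
rendered_cells = [
    td.text
    for td in find_with_optional_tbody(ET.fromstring(RENDERED_MARKUP), browser_path)
]

print(raw_cells)
print(rendered_cells)
```

This only covers the direction described in this post, where the expression was developed against a DOM that has TBODY. An expression written without TBODY will still miss rows in rendered output unless you use a descendant step such as //tr, which can also pick up rows from nested tables.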