The New York Times has an interesting article today on various efforts to search the “deep web”, including a project at Google (of course). The deep web is defined as all the information in web-connected databases, in contrast with all the information that is encoded in HTML on static pages.
I’m using the Times’ descriptions, but obviously there’s a lot of slop here. Strictly speaking, the deep web would be anything that is hidden behind a form POST or querystring, as opposed to being statically encoded in a page. Even that is too broad: querystrings can be linked to, and search engines can already crawl those links when they appear on static pages. For example, Google does a very good job of indexing messages written on forums, which are typically accessed through a querystring or posted form data.
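To make the querystring-versus-POST distinction concrete, here is a minimal sketch using only Python's standard library. The forum URL and parameter names are made up for illustration, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical forum endpoint and parameters, used only for illustration.
BASE = "https://forum.example.com/viewtopic.php"
params = {"t": "1234", "page": "2"}

# GET: the parameters live in the URL itself, so any page can link to the
# result and a crawler can follow that link like any other.
get_url = f"{BASE}?{urlencode(params)}"

# POST: the same parameters travel in the request body. The URL alone says
# nothing about which thread or page was requested, so there is nothing to link to.
post_req = Request(BASE, data=urlencode(params).encode("ascii"), method="POST")

print(get_url)
print(post_req.full_url, post_req.get_method(), post_req.data)
```

The first form is what search engines already handle; the second is the part that stays hidden unless something fills in and submits the form.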
So, the term “web-connected database” must be taken loosely, even though it applies to the vast majority of cases. The real challenge is how to get past the data submission layer which human surfers use to tell the back-end engine what they are interested in. This is, as Alon Halevy of Google says in the piece, “the most interesting data integration problem imaginable.”
But it is also a business problem, and one that is unlikely to be solved. The problem isn’t really distinguishable from issues I wrestled with back in the late 1990s, while engaged in the development of online banking technology. The holy grail of such technology, for companies that are not banks, has always been to integrate all the account information from various financial institutions. Quicken and Microsoft have been pseudo-collaborating for years to try to make this happen. They may have given up by now.
If they have, then it’s because such integration is not at all the holy grail for companies that are banks. They have no interest in having their information, which they pay to create, store, maintain, and retrieve on demand, de-branded and used to add value to some third party’s website.
These “deep web” initiatives face the same challenge. The NYT piece talks about the difficulties of understanding the semantics of the search interfaces, with solutions described that include querying on the entire dictionary to find out what the database contains. Well, those semantics are perfectly understandable to the customers and users of the people who paid to create those interfaces. If I owned one, and noticed Google’s bot driving up my database utilization by trying to query huge blocks of semantically related terms, I would block it in an instant.
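Blocking that kind of probing would not take much. The following is a rough sketch, not a description of how any real site does it: the class name, the thresholds, and the idea of counting distinct search terms per client in a sliding time window are all my own assumptions.

```python
from collections import defaultdict, deque

class ProbeDetector:
    """Flags clients that submit too many distinct search terms in a short window.

    Hypothetical helper: a human repeats or refines a handful of terms,
    while a dictionary-walking bot sends a stream of unrelated ones.
    """

    def __init__(self, max_distinct_terms=50, window_seconds=60.0):
        self.max_distinct_terms = max_distinct_terms
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client id -> deque of (timestamp, term)

    def allow(self, client, term, now):
        """Record one query and return False if the client should be blocked."""
        history = self._history[client]
        # Forget queries that have fallen out of the time window.
        while history and now - history[0][0] > self.window_seconds:
            history.popleft()
        history.append((now, term))
        distinct_terms = {t for _, t in history}
        return len(distinct_terms) <= self.max_distinct_terms


# Tiny thresholds so the behaviour is easy to see.
detector = ProbeDetector(max_distinct_terms=3, window_seconds=10.0)
bot_results = [detector.allow("bot", term, now=float(i))
               for i, term in enumerate(["aardvark", "abacus", "abbey", "abbot"])]
human_results = [detector.allow("human", "mortgage", now=float(i)) for i in range(10)]
```

With these settings the fourth distinct term from the bot is refused, while the human repeating one search is never affected.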
These efforts are misguided, because they attempt to do an end-run around the whole value proposition for putting this data on the web in the first place. It would be better to focus on a collaborative API that allows database owners to make certain slices of their data visible to search engines via semantic queries, in return for greater traffic and more convenience for their users.
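As a sketch of what such a collaborative arrangement might look like from the database owner's side, consider the following. Every name here (`ExposurePolicy`, `crawler_query`, the sample records and fields) is invented for illustration; it does not describe any existing search engine API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExposurePolicy:
    """What the owner agrees to show a search engine (hypothetical)."""
    public_fields: frozenset  # fields a crawler may query on and see
    max_results: int = 10     # cap on rows returned per query


# Made-up sample data standing in for the owner's database.
RECORDS = [
    {"title": "30-year fixed mortgage", "category": "loans", "internal_margin": 0.8},
    {"title": "Student checking", "category": "accounts", "internal_margin": 0.2},
    {"title": "Auto loan", "category": "loans", "internal_margin": 1.1},
]


def crawler_query(records, policy, field_name, value):
    """Answer a crawler's query using only the slice the owner has exposed."""
    if field_name not in policy.public_fields:
        raise PermissionError(f"field {field_name!r} is not exposed to crawlers")
    matches = [r for r in records if r.get(field_name) == value]
    # Strip every field the owner has not chosen to publish.
    return [
        {k: v for k, v in r.items() if k in policy.public_fields}
        for r in matches[: policy.max_results]
    ]


policy = ExposurePolicy(public_fields=frozenset({"title", "category"}))
loans = crawler_query(RECORDS, policy, "category", "loans")
print(loans)
```

The point of the design is that the owner, not the crawler, decides which slices are visible, which is the opposite of guessing at a form's inputs from the outside.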