The continued need for semantic HTML

Two new search tools, Wolfram Alpha and Google Squared, have been released, and both demonstrate the need for a standards-based approach to web semantics and data.

June 4, 2009
Updated September 19, 2024

HTML is king

Since the web was first proposed in 1989, HTML has been the foundation for its documents. Documents are marked up by humans or machines using HTML and pushed into the ether of the web, where they are subsequently consumed by humans and robots. Initially the HTML specification worked well for creators and consumers alike, and Google, the most ubiquitous search service, grew out of the ability to parse and intelligently index HTML content. Google did this so well that the company name became a verb synonymous with discovering things via the internet. Fait accompli, you might think.

Innovation

Google built a product that saw off its rivals, but with the emergence of Web 2.0 (a much-maligned term if ever there was one) its standard search engine is beginning to look behind the times, for a number of reasons. Content is being generated more quickly and more frequently. Services like Twitter return results in near real time, and APIs allow developers in userland to create custom views and mashups of disparate sets of data.

Huge effort for low return

Of course Google continues to innovate, but in using Google Squared I couldn’t help feeling that a massive amount of human and computational effort had been expended on my search for British beer. I can only imagine the data harvesting and algorithms going on behind the scenes to give me three results. Granted, Google Squared is still a Labs product.

The same is true of Wolfram Alpha, which is attempting to make search computational. To aggregate disparate pieces of information it must rely on some hefty algorithms and a lot of processing power.

More complex than it should be

Indexing, storing and processing data seems more complex than it should be. Yahoo’s YQL seems to turn the data-harvesting, algorithm-heavy model on its head. By making data available through Open Data Tables, it offers a RESTful route to the primary source data that is available to anyone. Essentially the World Wide Web becomes its own database. The web is both the data and a giant relational database, with the ability to create meta views of that data. There is no need for the data to exist in multiple instances in search engine datacentres. I am not saying that traditional search engines are going to go away, but from a design pattern perspective the Open Data Tables model seems a cleaner and more efficient way of retrieving data from the web.
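To make the "web as its own database" idea concrete, here is a minimal sketch of what a YQL-style request looks like: a SQL-like statement sent as a query parameter to a REST endpoint. The endpoint below is the historical public one and is assumed here purely for illustration, so the sketch only builds the request URL rather than sending it. The helper name build_yql_url and the example.com page are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the historical public YQL endpoint, used only to illustrate
# the shape of a request. Nothing in this sketch contacts the network.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(statement, fmt="json"):
    """Return a request URL that carries a YQL statement as the q parameter."""
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


# A SQL-like query that treats a web page as a table and selects its headings.
# example.com is a placeholder page, not one taken from the article.
STATEMENT = 'select * from html where url="http://example.com/" and xpath="//h1"'

if __name__ == "__main__":
    print(build_yql_url(STATEMENT))
```

The point of the design is that the query names the primary source (the page URL) directly, so no intermediate copy of the data is needed.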

In addition to APIs and YQL (which is essentially an attempt to standardise an interface to APIs), there is a long-standing move to add more semantic meaning to HTML. Microformats have emerged as a genuine means of doing this, giving third-party services access to data in a standardised, structured way. Microformats exist for contact information, calendars and reviews. This allows parsers to access the primary source of data and use it however they like. I even produced an example of how Microformats can return contact details from a URL. No need for a third-party search engine: instant access to the primary data source in real time.
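As a rough illustration of how little machinery this needs, the following sketch pulls contact details out of hCard markup using only the Python standard library. It is not the example mentioned above: the class HCardParser, the function parse_hcards and the sample markup are all hypothetical, it handles only a handful of hCard properties, it ignores nested cards, and it parses an HTML string rather than fetching a URL.

```python
from html.parser import HTMLParser

# A small subset of hCard property class names (assumption: enough for a demo).
HCARD_PROPERTIES = ("fn", "org", "email", "tel", "url")
# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HCardParser(HTMLParser):
    """Collect a few hCard properties from elements inside class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per vcard found
        self._open = []   # open elements: (tag, is_card, props, text_parts)

    def _in_card(self):
        return any(is_card for _, is_card, _, _ in self._open)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        is_card = "vcard" in classes
        props = []
        if is_card:
            self.cards.append({})
        elif self._in_card():
            props = [p for p in HCARD_PROPERTIES if p in classes]
            href = attr.get("href")
            # For links, the href carries the value of url and email.
            if href and "url" in props:
                self.cards[-1].setdefault("url", href)
            if href and "email" in props:
                address = href[7:] if href.startswith("mailto:") else href
                self.cards[-1].setdefault("email", address)
        self._open.append((tag, is_card, props, []))

    def handle_data(self, data):
        # Text belongs to every open element that carries a property.
        for _, _, props, parts in self._open:
            if props:
                parts.append(data)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag to tolerate sloppy markup.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                for _, _, props, parts in self._open[i:]:
                    text = " ".join("".join(parts).split())
                    for prop in props:
                        if text:
                            # Values already taken from an href are kept.
                            self.cards[-1].setdefault(prop, text)
                del self._open[i:]
                return


def parse_hcards(html):
    """Return a list of dicts, one per hCard found in the HTML string."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


# Hypothetical sample markup; the person and addresses are placeholders.
SAMPLE = """
<div class="vcard">
  <a class="url fn" href="http://example.com/">Jane Doe</a>
  <span class="org">Example Ltd</span>
  <a class="email" href="mailto:jane@example.com">Email me</a>
  <span class="tel">+44 20 7946 0000</span>
</div>
"""

if __name__ == "__main__":
    print(parse_hcards(SAMPLE))
```

Because the class names are standardised, the same parser works on any page that uses them, which is the whole argument for putting the semantics in the markup.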

Standards needed

What struck me about Google Squared in particular was that pretty much everything it was trying to do could be solved by giving HTML more semantic meaning. Of course defining this is a massive task, but in a previous job as an Information Professional I learned that, for good reason, there are international standards for indexing books, journals and data that knowledge professionals rely on. In my opinion the web is still badly lacking such standards and frameworks for giving content semantic meaning.

Google’s goal is “to organize the world’s information and make it universally accessible and useful”. The goal of the World Wide Web in general should now be to make the world’s data universally accessible. The means for controlling access to open data are coming along nicely, and as a community I feel we should redouble our efforts to make sure data is open to all. YQL and Microformats show that standardising APIs and semantic HTML respectively can be done at a meta level. For me, harvesting massive amounts of data in the search space is not the way to go. Instead we should work towards creating a standards-based approach to open data on the web.