HTML5 and The Semantic Web

Linking Open Data dataset cloud as of July 14t...

Image via Wikipedia

Since HTML5’s uptake in mainstream browsers, there’s been a lot of talk about the next version of the web, web 3.0 (even though Tim Berners-Lee dislikes that term).

The “next version” of the web, web 3.0, is also called the “Semantic Web” by many of the leading engineers working on HTML5 and other web standards. The driving idea behind the semantic web is that computers, more than humans, will be reading, interpreting, and digesting information from websites and web pages. Unfortunately, data on the web is often in forms that are computationally complex to parse or recognize, so new HTML tags and standards had to be developed and integrated with HTML5 to provide this functionality. Let’s look at a few examples of why this problem required such a solution.

Let’s take a simple date, like the 21st of March, 2011. A human can read this instantly and understand that I am talking about a date. A computer, however, has to read the line, check it against the date patterns it knows, and convert any match into a date value before it can use it.
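Here is a rough Python sketch of what that pattern matching can look like. The regular expression and the find_date helper are my own illustrative choices, not part of any standard, and they only handle this one phrasing of a date.

```python
import re
from datetime import date

# Month names mapped to month numbers (January -> 1, ..., December -> 12).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Matches phrases such as "21st of March, 2011" (an assumed pattern).
PATTERN = re.compile(r"(\d{1,2})(?:st|nd|rd|th) of ([A-Z][a-z]+), (\d{4})")


def find_date(text):
    """Return the first date written like '21st of March, 2011', or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name)
    if month is None:
        return None  # looked like a date, but the month is unknown
    return date(int(year), month, int(day))


print(find_date("Let's take a simple date, like the 21st of March, 2011."))
# 2011-03-21
```

Every other way of writing a date ("March 21", "21.03.2011", and so on) would need its own pattern, which is exactly the burden that explicit markup removes.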

An even harder sample to interpret would be 2/3/2011. In most countries other than the United States, this means the 2nd of March, 2011. If it were written by an American, it could mean the 3rd of February. The computer must do additional research or ask the user to verify the actual date. Either solution is undesirable. To fix this problem we have some new tags in HTML5. In this case, the <time> tag will help us out.
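To make the ambiguity concrete, here is a minimal Python sketch that reads the same string both ways. Both format strings are valid for this input, so the parser has no way to know which reading the writer intended.

```python
from datetime import datetime

raw = "2/3/2011"

# Day-first reading, common outside the United States.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()

# Month-first reading, common in the United States.
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(day_first.isoformat())    # 2011-03-02
print(month_first.isoformat())  # 2011-02-03
```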

Instead of just writing that the concert is on the 4th of March, 2011 at 8pm, we write:
The concert is on <time datetime="2011-03-04T20:00">the 4th of March, 2011 at 8pm</time>.

The datetime attribute allows you to specify the time by placing a “T” after the date, followed by the 24-hour time and then the timezone offset with a + or - sign (use +00:00, or “Z”, for Zulu time). The pubdate attribute is a boolean flag on the same tag that marks the given date as the publication date of the article.
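From the machine's side, reading such a tag takes only a few lines. The sketch below uses Python's standard html.parser and datetime modules; the TimeTagParser class, the sample markup, and the +00:00 offset are assumptions made up for illustration.

```python
from datetime import datetime
from html.parser import HTMLParser


class TimeTagParser(HTMLParser):
    """Collects the datetime attribute of every <time> tag it sees."""

    def __init__(self):
        super().__init__()
        self.stamps = []

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value is not None:
                # The attribute is unambiguous, so no guessing is needed.
                self.stamps.append(datetime.fromisoformat(value))


# Hypothetical markup; the date, time, and offset are illustrative.
markup = (
    'The concert is on <time datetime="2011-03-04T20:00+00:00">'
    "the 4th of March, 2011 at 8pm</time>."
)

parser = TimeTagParser()
parser.feed(markup)
concert = parser.stamps[0]
print(concert.isoformat())  # 2011-03-04T20:00:00+00:00
```

The human-readable text inside the tag can be phrased any way the author likes, because the program only ever looks at the attribute.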

While time is probably the best example of the semantic web changes in HTML5, it is far from the only one. The following is a list of some of the semantic web tags included in HTML5 (let me know if you have any good ones you’d like to see added!):

  • <address>: Specifies contact information for the author of an article.
  • <article>: Denotes an article, a block of information that stands on its own.
  • — Group elements together to create a cleaner DOM flow for interpreters.
  • <details>: Gives additional information/controls to be shown or hidden.
  • <summary>: Gives a summary or label for the hidden content, used inside the <details> tag.
  • <figure>: Denotes a figure (like a chart, picture, or other self-contained object).
  • <figcaption>: Gives the figure a caption.
  • <abbr>: Denotes an abbreviation and its expansion.
  • <del>: Marks the text between the tags as deleted (usually shown struck out).
  • <ins>: Denotes inserted new text. Usually after <del>.
  • rel="": Attribute for links and hyperlinks. It tells the browser (and more importantly, search engines) what relationship the linked document has to the current document. Some examples are provided here.
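As a rough illustration of how a program can use a few of these tags, the Python sketch below pulls link relations, abbreviation expansions, and figure captions out of a page. The SemanticParser class, the sample page, and the example.com URL are all hypothetical.

```python
from html.parser import HTMLParser


class SemanticParser(HTMLParser):
    """Gathers link relations, abbreviation expansions, and figure captions."""

    def __init__(self):
        super().__init__()
        self.link_rels = []      # (rel, href) pairs
        self.abbreviations = {}  # short form -> expansion
        self.captions = []       # text of each <figcaption>
        self._current = None     # tag whose text is being collected
        self._attrs = {}
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "rel" in attrs:
            self.link_rels.append((attrs["rel"], attrs.get("href")))
        elif tag in ("abbr", "figcaption"):
            self._current, self._attrs, self._text = tag, attrs, []

    def handle_data(self, data):
        if self._current:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = "".join(self._text).strip()
        if tag == "abbr":
            self.abbreviations[text] = self._attrs.get("title")
        else:
            self.captions.append(text)
        self._current = None


# Hypothetical page used only for this example.
page = """
<article>
  <p><abbr title="Resource Description Framework">RDF</abbr> is covered later.</p>
  <figure>
    <img src="cloud.png" alt="">
    <figcaption>Linking Open Data dataset cloud</figcaption>
  </figure>
  <a rel="license" href="http://example.com/license">License</a>
</article>
"""

parser = SemanticParser()
parser.feed(page)
print(parser.link_rels)      # [('license', 'http://example.com/license')]
print(parser.abbreviations)  # {'RDF': 'Resource Description Framework'}
print(parser.captions)       # ['Linking Open Data dataset cloud']
```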

The reason these tags are so important really boils down to the heart of semantics, which is the ability of machines to understand the data that we are feeding them. By adding these tags, we make much more targeted searches possible. For example, imagine a search engine in which you can search for all the news articles published between 2010 and 2011 by author “X”, but only those that link to videos.
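Once the author, publication date, and links have been read from the markup, that kind of search becomes a simple filter. The sketch below assumes the metadata has already been extracted into a hypothetical Article record; the sample index and URLs are invented for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Article:
    """Metadata assumed to have been extracted from semantic markup."""
    title: str
    author: str
    published: date
    video_links: list = field(default_factory=list)


def search(articles, author, start, end):
    """Return articles by `author` published in [start, end] that link to a video."""
    return [
        a for a in articles
        if a.author == author and start <= a.published <= end and a.video_links
    ]


# Invented sample data.
index = [
    Article("Launch recap", "X", date(2010, 6, 1), ["http://example.com/launch.webm"]),
    Article("Year in review", "X", date(2010, 12, 30)),
    Article("Old news", "X", date(2009, 3, 2), ["http://example.com/old.webm"]),
    Article("Guest post", "Y", date(2011, 1, 15), ["http://example.com/guest.webm"]),
]

hits = search(index, "X", date(2010, 1, 1), date(2011, 12, 31))
print([a.title for a in hits])  # ['Launch recap']
```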

This is only one example of everyday consumer use. Enterprise use could have much more of an impact on internal search engines and document management, especially for law and security firms that need to keep hundreds of thousands or even millions of documents. Instead of being overwhelmed with search results or having to somehow add all sorts of document information by hand, these firms could eliminate many document management problems by storing documents with simple HTML markup. Imagine a world where you can cross-reference all articles containing the date “1-1-2011” with mentions of “New Year’s Party” in the document summary. That’s what the semantic web is all about: easy, built-in data mining.

That’s pretty cool, but the descriptors the HTML standards give us fall well short of what we need to describe vague things like hot or cold, or people who aren’t authors of the page. Of course, you could add extra information to HTML tags through class attributes (many microformats do this!), but this turns out to be a pretty inelegant solution because it quickly balloons your file size if you have many things to tag or mark in your documents.

A Brief Intro To RDF

Enter the bold new world of RDF, the Resource Description Framework, where everything can be linked to anything and even queried for information, much like a SQL relational database (RDF even has a similarly named query language, SPARQL). The idea behind this initiative is to give software on the web an easier way to crawl a document and understand what it contains. Anything can be described by RDF, as illustrated in this example from the Wikipedia article on RDF:

                Tony Benn
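To give a feel for what a machine gets out of such a description, here is a minimal Python sketch that reads a small RDF/XML document and lists its statements as subject, property, value triples. The RDF snippet is my own reconstruction modeled on the Wikipedia example, so treat its exact contents as an assumption; the triples helper is likewise hypothetical and only handles this simple shape of document.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumed RDF/XML, modeled on the Wikipedia example about Tony Benn.
rdf_xml = """
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>
"""


def triples(document):
    """Yield (subject, property, value) for each simple property element."""
    root = ET.fromstring(document)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            yield subject, prop.tag, prop.text


for subject, prop, value in triples(rdf_xml):
    print(subject, prop, value)
```

Each printed line says one thing about one resource, which is the whole model: a web of small statements that a query language such as SPARQL can then search. A real application would use a dedicated RDF library rather than a plain XML parser.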

In the future, look for more RDF and a few other formats that make use of the same principles, as well as another article about RDF and how to leverage it. RDF and its brethren are far too complex and comprehensive to fit in as a side note to HTML5, as the example illustrates well.

In my next article about HTML5, I will go over the security features present within HTML5 and any security challenges that HTML5 presents to organizations.