HTML5 versus HTML 4 and XHTML
Listening to marketing departments, you would think that HTML5 is poised to take over the world, but I found that the status of HTML5 versus HTML 4 and XHTML as standards in typical web developer practice is not that simple. The W3C working group is attempting to set a timeline for reaching a "last call" version of HTML5 by mid-2011. However, it is not clear how many of the desirable features, such as microdata, will be part of the HTML5 spec and how many will become separate specifications.
The working group for XHTML 2 was officially closed as of December 2010, with the intent of letting developers concentrate on HTML5 and XHTML5. It is too early to tell how many of the ideas the working group developed will live on in XHTML5. However, the RDFa (Resource Description Framework in Attributes) recommendation for expressing structured data now has a home in HTML5+RDFa.
[Ed. Note: Establishing a standard Video codec feature, too, has been an issue in recent HTML 5 development and specification activity. See related story: "HTML5 video codec war and lax support hinder adoption says Forrester."]
A compact HTML history
There are four previous major HTML specifications as well as three previous XHTML specifications. Note that the IETF (Internet Engineering Task Force) work on the standard was moved to the W3C in 1996.
- HTML original Although the IETF started an HTML working group in Dec. 1994, no HTML 1 standard was published.
- HTML 2.0 Published by IETF in Nov. 1995, but the working group closed in Sept. 1996, transferring control to the W3C.
- HTML 3.2 W3C recommendation published Jan. 1997 attempted to catch up with browser developments.
- HTML 4.0 The specification of Dec. 1997 adds more features to 3.2 such as style sheets.
- HTML 4.01 This Dec. 1999 version cleans up some minor problems.
- XHTML 1.0 Early draft in 1998, spec. release Jan 2000, DTDs for "strict", "transitional" and "frameset".
- XHTML 1.1 Modularized version of 1.0 released May 2001, revision Oct 2008, 2nd ed. July 2010.
- XHTML 2.0 Various working drafts through 2009 but never a specification.
- HTML5 First draft Jan 2008, 9th draft Jan 18, 2011.
- XHTML5 When HTML5 is serialized as a valid XML document, it will be called XHTML5. Any page served as XHTML5 must have a media type specifying XML such as "application/xhtml+xml".
- HTML5+RDFa Latest draft Jan 2011 gives syntax for adding RDFa (Resource Description Framework in Attributes) to HTML5 and HTML 4 documents to support the semantic web.
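The media type rule for XHTML5 in the list above can be illustrated with a small Python sketch. The function name `serialization_for` and the set of XML media types it accepts are my own choices for illustration, not something taken from the specifications, and the sketch assumes the document itself is already HTML5 content.

```python
# Hypothetical helper: guess which HTML5 serialization a Content-Type implies.
# Assumption: the listed XML media types are treated as "served as XML".
XML_MEDIA_TYPES = ("application/xhtml+xml", "application/xml", "text/xml")


def serialization_for(content_type):
    """Return 'XHTML5', 'HTML5' or 'unknown' for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in XML_MEDIA_TYPES:
        return "XHTML5"
    if media_type == "text/html":
        return "HTML5"
    return "unknown"
```

For example, `serialization_for("application/xhtml+xml; charset=utf-8")` falls into the XHTML5 branch, while a plain `text/html` response is treated as the HTML serialization.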
After the ferment of competing browser feature development in the early years, it was a relief to have HTML 4.0 and XHTML 1.0 provide some degree of stability for web authors for nearly a decade. Now it appears we are in for another period of dramatic change.
What the DOCTYPE reveals
Markup languages derived from Standard Generalized Markup Language (SGML), such as XML and all versions of HTML before HTML5, embrace the role of DOCTYPE declarations to associate a document type definition (DTD) with a document. DTDs use a compact formal syntax which defines exactly which elements can occur where in any SGML-compliant language. The DOCTYPE declaration, which should be the first item in a document, guides a client program such as a browser in interpreting the markup.
For a number of reasons, the developers of HTML5 have abandoned SGML compliance, and the DOCTYPE declaration in HTML5 does not cite a DTD. I found confusing recommendations on the need for a DOCTYPE declaration. A W3C working draft dated January 13, 2011 states "A DOCTYPE is a required preamble." The use of <!DOCTYPE html> (case insensitive) is recommended, the idea being that this simple DOCTYPE will make browsers use "standards mode" for rendering. Since XHTML5 requires all elements to be in lower case, the DOCTYPE for XHTML5 will be case sensitive. Other sources, such as a WHATWG page of Jan 19, 2011, state that the DOCTYPE declaration is actually optional for XHTML5.
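To make the difference concrete, here is a minimal Python sketch that pulls the DOCTYPE out of a page and reports whether it cites a DTD. The names `DOCTYPE_RE`, `describe_doctype` and `cites_dtd` are hypothetical, and the pattern only handles the common PUBLIC form with double-quoted identifiers; it is not a full SGML or XML parser.

```python
import re

# Assumption: only the bare form and the PUBLIC "..." ["..."] form are handled.
DOCTYPE_RE = re.compile(
    r"""
    <!DOCTYPE\s+(?P<name>[^\s>]+)          # root element name
    (?:\s+PUBLIC\s+"(?P<public>[^"]*)"     # optional public identifier
       (?:\s+"(?P<system>[^"]*)")?)?       # optional system identifier (DTD URL)
    \s*>
    """,
    re.IGNORECASE | re.VERBOSE,
)


def describe_doctype(markup):
    """Return (name, public_id, system_id), or None if no DOCTYPE is found."""
    # Only look near the top of the page, where the declaration belongs.
    match = DOCTYPE_RE.search(markup[:1024])
    if match is None:
        return None
    return match.group("name"), match.group("public"), match.group("system")


def cites_dtd(markup):
    """True when the DOCTYPE carries a public or system identifier."""
    info = describe_doctype(markup)
    return info is not None and (info[1] is not None or info[2] is not None)
```

With this sketch, a page starting with `<!DOCTYPE html>` yields a name and no identifiers, while an HTML 4.01 Transitional declaration yields both a public identifier and the URL of its DTD.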
A little field work
Given all of the above, I thought it would be interesting to find out exactly which versions of markup are actually in use on the web today. Starting with the web crawler I wrote for my previous look at XHTML use, I collected counts of the DOCTYPE names in use on over 12,000 pages, with interesting results.
- XHTML 1.0 - With a few XHTML 1.1, about 74% total.
- HTML 4.0 and 4.01 - Mostly 4.01 and using the "transitional" DTD, about 15%.
- Possible HTML5 and XHTML5 - As indicated by the use of "<!DOCTYPE html>", about 10%.
- HTML 3 and 2 - Astonishing but true, about 1% use these very old standards. I am guessing that either some pages have not been touched in years or people are using very out-of-date authoring tools.
- Undecipherable - A bit more than 0.5% were obviously invalid declarations.
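A tally like the one above can be sketched in a few lines of Python. This is not the crawler used for the survey; the function names, the category labels and the substring tests on the public identifier are all assumptions made for illustration, and fetching the pages is left out entirely.

```python
import re
from collections import Counter

# Grab whatever sits between "<!DOCTYPE" and the closing ">".
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)


def doctype_family(markup):
    """Map a page's DOCTYPE to a coarse family label (labels are hypothetical)."""
    match = DOCTYPE_RE.search(markup[:1024])
    if match is None:
        return "none"
    body = match.group(1).strip().upper()
    if body == "HTML":
        return "HTML5/XHTML5"
    if "DTD XHTML" in body:
        return "XHTML 1.x"
    if "DTD HTML 4" in body:
        return "HTML 4.x"
    if "DTD HTML 3" in body or "DTD HTML 2" in body:
        return "HTML 3 and 2"
    return "undecipherable"


def tally(pages):
    """Count DOCTYPE families over an iterable of page sources."""
    return Counter(doctype_family(page) for page in pages)


def percentages(counts):
    """Convert raw counts to percentages rounded to one decimal place."""
    total = sum(counts.values())
    return {family: round(100.0 * n / total, 1) for family, n in counts.items()}
```

Feeding `tally` the text of each crawled page and passing the result to `percentages` gives a breakdown in the same shape as the list above. Pages with no DOCTYPE at all are counted separately as "none" in this sketch.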
Web pages using the HTML5 and XHTML5 DOCTYPE declarations are starting to appear on the web. However, XHTML 1.0 remains the most common and HTML 4 use is still very common. I think there will be a "long tail" of HTML 4 and older documents on the web for a long time, continuing to complicate the work of browser developers.