A grammar for HTML5

March 24, 2012

The HTML5 specification uses pseudo-code to specify how HTML documents should be parsed. Here’s a taste:

If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x2F (ASCII /) then advance position to the next byte and redo this substep.

If the byte at position is 0x3E (ASCII >), then abort the “get an attribute” algorithm. There isn’t one.

Otherwise, the byte at position is the start of the attribute name. Let attribute name and attribute value be the empty string.

…
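For readers who find prose pseudo-code hard to follow, the three quoted steps translate into ordinary code fairly directly. What follows is only a rough sketch of those three steps, not of the whole “get an attribute” algorithm, which continues past the ellipsis. The function name `find_attribute_start`, the `SKIPPED` set, and the choice to return `None` when the input runs out are all assumptions made for illustration; none of them come from the specification.

```python
from typing import Optional

# Bytes skipped by the first quoted step: TAB, LF, FF, CR, space, and "/".
SKIPPED = {0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F}


def find_attribute_start(data: bytes, position: int) -> Optional[int]:
    """Return the index where an attribute name starts, or None if there is none.

    Hypothetical helper covering only the three quoted steps.
    """
    while position < len(data):
        byte = data[position]
        if byte in SKIPPED:
            # "advance position to the next byte and redo this substep"
            position += 1
            continue
        if byte == 0x3E:
            # ASCII ">": abort, there is no attribute here.
            return None
        # Otherwise this byte is the start of the attribute name.
        return position
    # Assumption: the quoted excerpt does not say what happens at end of
    # input, so this sketch simply reports "no attribute".
    return None
```

Calling `find_attribute_start(b"  /\tcharset=utf-8>", 0)` skips the two spaces, the slash, and the tab, and returns the index of the `c` in `charset`.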

This style of specification has provoked consternation among some, who prefer the “declarative” style of the HTML 4 specification, based on grammars.

Fortunately, I have spent a great deal of time over the past ten years learning about parsing and its security aspects, and I believe I can give here a very succinct grammar for HTML5.

A few preliminaries. I will be using a variant of Backus-Naur Form (BNF) grammars, in which “.” will denote any single input character, and postfix “*” will denote zero or more repetitions of the preceding construct (Kleene closure). I will use capitalized identifiers for the nonterminals of the grammar.

Here, then, is the grammar of HTML5:

HTML5 = .*

Yes! No kidding, that really is the grammar. Any input that matches this grammar—which is to say, any input at all—is going to be accepted by just about any web browser, which will do its best to render something sensible. You can think of this as a degenerate case of Postel’s Law, in which browsers are extremely liberal in what they accept from others. They accept everything!
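The grammar can be checked mechanically with a regular expression, as in the minimal sketch below. It uses Python's `re` module; the names `HTML5` and `is_accepted` are made up for this example. One caveat: in Python's regex dialect “.” does not match a newline by default, so the `re.DOTALL` flag is needed to get the “any single input character” meaning used above. The sketch says nothing about what a browser would render, only that the grammar accepts the input.

```python
import re

# The whole grammar: HTML5 = .*
# re.DOTALL makes "." match newlines too, i.e. any single input character.
HTML5 = re.compile(r".*", re.DOTALL)


def is_accepted(document: str) -> bool:
    """Return True if the entire document matches the grammar."""
    return HTML5.fullmatch(document) is not None


samples = [
    "",
    "<!DOCTYPE html><title>ok</title>",
    "<p>unclosed <b>tags\n<<<>>>",
    "not html at all",
]
for sample in samples:
    print(repr(sample), is_accepted(sample))
```

Every sample, including the empty string, is reported as accepted, and so is any other string you care to try.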

To be fair, the HTML5 specification does discuss “parse errors”, and says that HTML user agents can abort processing when they encounter them. But it also says that they can continue processing, and that’s what browsers seem to do.

This is no different from HTML 4, whose grammar really should be the same as this. Web browsers have always been very tolerant of “errors” in web pages; they try to render as much of their input as possible. The more complicated grammar that you will find in the HTML 4 specification is incomplete because it does not say what happens when a browser encounters “garbled” HTML; browsers have been left to decide this for themselves. Naturally enough, this leads to browsers that behave differently on the same input: browser incompatibility. And that leads, in turn, to certain security vulnerabilities.

The defenders of “declarative” specifications will note that HTML 4’s syntax specification is not only a grammar. That’s true; there is also a lot of English prose confusing things. Here are some questions for the defenders: is the HTML 4 specification equivalent to “.*”? If not, then when an input does not conform to the grammar of HTML 4, what DOM tree will a browser produce? (The answers are “no” and “only your browser knows”.)

The pseudo-code of the HTML5 specification is charmless, but it is pretty easy to convince yourself that it accepts “.*”.