Research in parsing computer languages (as opposed to human languages) has focused mostly on the regular languages and the context-free languages. So much has been learned about these two classes of languages that many researchers now consider parsing a solved problem.
What most people don’t realize is that regular and context-free languages almost never arise in practice. We have been solving the wrong problem.
“Regular expressions” are widely used, but they don’t correspond to the regular grammars of language theory. For example, regular expression libraries use a “longest-match” (greedy) semantics, which means that they are context-sensitive: they are not even context-free, much less regular. Other common features, such as anchors that match the beginning of a line or backreferences that match a previous portion of the input, are also context-sensitive. These features are implemented because they are needed: regular languages alone are not sufficient for the parsing problems we actually need to solve.
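Both points can be seen in a few lines of Python with the standard `re` module. This is only an illustration, but the underlying facts are standard: greedy matching makes a subexpression’s match depend on the rest of the input, and a backreference recognizes a repeated word, essentially the language {ww}, which is not context-free.

```python
import re

# Longest-match ("greedy") semantics: how much '.*' consumes depends on
# what the rest of the pattern still has to match against the input.
m = re.match(r"(.*)b", "abab")
print(m.group(1))  # "aba": the greedy '.*' takes as much as it can

# A backreference matches a previous portion of the input. This pattern
# recognizes a word repeated twice -- essentially the language {ww},
# which is not context-free, let alone regular.
repeated = re.compile(r"^(\w+) \1$")
print(bool(repeated.match("hello hello")))  # True
print(bool(repeated.match("hello world")))  # False
```

No regular grammar can express either behavior, yet every mainstream regex library offers both.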
Obviously we do have parsers for the languages we actually use, and they are often built using regular expression libraries and context-free parser generators, plus a few hacks. Just what is wrong with this situation is subtle, but to get an idea of its seriousness, consider that we don’t know how to specify language syntax.
For an example, you need look no further than Haskell. You would think that the Haskell community would know how to specify its own language, but apparently not: none of the Haskell implementations implement the spec. The layout rule, for instance, is defined in terms of whether a parse error occurs, a side condition that no real parser checks as written. It’s not even clear whether any two implementations agree with each other, because they are all trying to approximate the ill-defined spec with ad hoc methods.
The situation in other communities is even worse. The IETF, for example, defines hundreds of data formats crucial to the operation of the Internet; many of their specifications rely on context-free grammars, but not in a rigorous way: the grammars are ambiguous and supplemented with English prose. They do not have the tools to write down an unambiguous specification, which leads first to implementations that do not interoperate, and then to security vulnerabilities.
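A toy sketch (not an actual IETF grammar) shows how an ambiguous grammar plus prose produces non-interoperating implementations. The rule `expr = expr "-" expr / num` gives the input “8-3-2” two parse trees, and two implementations that each faithfully follow the grammar can compute different answers:

```python
# Toy illustration of grammar ambiguity (not an actual IETF grammar):
#   expr = expr "-" expr / num
# gives "8-3-2" two parse trees, so two conforming parsers can disagree.

def eval_left(tokens):
    # Resolves the ambiguity left-associatively: (8 - 3) - 2
    result = int(tokens[0])
    for t in tokens[1:]:
        result -= int(t)
    return result

def eval_right(tokens):
    # Resolves the ambiguity right-associatively: 8 - (3 - 2)
    if len(tokens) == 1:
        return int(tokens[0])
    return int(tokens[0]) - eval_right(tokens[1:])

tokens = "8-3-2".split("-")
print(eval_left(tokens))   # 3
print(eval_right(tokens))  # 7
```

When the disagreement is over operator grouping, the bug is merely annoying; when it is over where a message header ends, it becomes a security hole.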