C and C++ are not context free

January 31, 2013 ∞

(Part 5 of a series: 1, 2, 3, 4)

It seems that the Internet is still confused about this, so let’s consider the question:

Are C and C++ context-free languages?

And the answer is

No, C and C++ are context-sensitive languages.

There are several reasons.

To parse C and C++, you start by using a very powerful preprocessor. These preprocessors are inevitably written by hand (they are not based on a theoretic foundation like regular expressions or context-free grammars).
C and C++ lexers require lexical feedback to differentiate between typedef names and identifiers. That is, the context-sensitive lexer needs help from the “context-free” parser to distinguish between an identifier “foo” and a typedef name “foo”. In this snippet,
```
     int foo;
     typedef int foo;
     foo x;
```
the first “foo” is an identifier while the second and third are typedef names. You need to parse typedef declarations to figure this out (and since types have nested parentheses, this is definitely at least as hard as parsing a context-free language).

This means that the parser and lexer are mutually recursive, so it doesn’t make sense to say that the parser is context free while the lexer is context sensitive.
The grammar is ambiguous: it has LR conflicts, such as the if-then-else conflict. Parsers typically resolve this using context (an “else” matches the closest “if”).

If you believe that these reasons are not sufficient to show that the languages are context sensitive, please consult my previous post on the subject.

C and C++ are important languages so it’s worth saying a bit more.

The C and C++ preprocessors are a real problem for the languages. Among other things, they make it hard to write tools that operate on programs. For example, suppose you would like to refactor your program by renaming a function. Preprocessors can introduce new tokens, so ideally your tool would work by first running the preprocessor, then the lexer/parser to find all occurrences of the function and replacing it. But then, you have a refactored program that has already been preprocessed. At this point you would like to run the preprocessor in reverse, so that that result is both readable and similar to the original. Good luck with that.

C++ is much harder to parse than C. McPeak’s work on Elkhound and Elsa has a good discussion of the issues.

If you ever look closely at language specifications, you’ll see that they have a lot of problems. They have many ambiguities and even contradictions. Moreover, language implementations rarely meet the specification, and no two implementations work the same on all programs (cf. my posts on Postel’s Law). A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World is a delightful article by Coverity on the difficulties of writing static analyzers for C programs “in the wild.” The whole article is worth reading several times over, but here are a couple of relevant quotes:

Law: You can’t check code you can’t parse. Checking code deeply requires understanding the code’s semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

The C language does not exist; neither does Java, C++, and C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as “C code” no matter how blatantly illegal its contents may be to a language scholar. Fed this illegal not-C code, a tool’s C front-end will reject it. This problem is the tool’s problem.

and

If divergence-induced parse errors are isolated events scattered here and there, then they don’t matter. An unsound tool can skip them. Unfortunately, failure often isn’t modular. In a sad, too-common story line, some crucial, purportedly “C” header file contains a blatantly illegal non-C construct. It gets included by all files. The no-longer-potential customer is treated to a constant stream of parse errors as your compiler rips through the customer’s source files, rejecting each in turn. The customer’s derisive stance is, “Deep source code analysis? Your tool can’t even compile code. How can it find bugs?” It may find this event so amusing that it tells many friends.

(Coverity, by the way, is now looking at HTML5 parsing to find security vulnerabilities, cf. this post and BEEP.)

I don’t blame C for these problems. C is old, and it was created when there were few widely available parser generators. In fact, it looks to me like yacc and C were essentially born together! Here’s an inspirational interview with Stephen Johnson that discusses those early days. It wasn’t easy; Johnson says

Over the next several years, I rewrote the program over a dozen times, speeding it up by a factor of 10,000 or so.

That’s the level of effort and craftsmanship you need to create one of the most influential programs ever written.