How to prove that a programming language is context free
November 16, 2012  

(Part 4 of a series: 1, 2, 3)

I’ve now given a few examples of programming languages that I claim are not context-free languages (and I think that most programming languages are not context free).

What I haven’t done is given a proof. If you think that my lack of a proof is a fatal flaw, let me point out that if you want to claim that these languages are context free, then you really ought to give me a proof of that.

The difficulty in either case is that it is not clear what the definition of a context-free language should be for programming languages whose parsers are split into a “lexing” phase and a “parsing” phase (see part 1).

Having gone through a few examples, let’s take a stab at a definition.

Definition: a programming language is context free if it can be defined by a lexer and a grammar with the following properties:

  1. the lexer is defined by a sequence of regular expressions that maps character sequences to tokens, with ambiguities resolved by the longest-match, first-match rule;
  2. there is no lexical feedback; and
  3. the grammar is an unambiguous context-free grammar over tokens.

The first and second conditions are meant to limit the power of the lexer what you get with LEX, the canonical lexer generator. The third condition is motivated by my Java example: for a programming language we surely want a deterministic parser, and we know that context-free grammars plus disambiguation can go beyond the context-free languages.

I don’t know of many programming languages whose parsers meet this definition. The s-expression-based languages (Lisp and Scheme) would qualify, but I have never seen a parser for Ruby, Python, Haskell, C, C++, or Javascript that meets this definition.

It should be possible to take this definition and formulate a variant of Ogden’s Lemma that would allow us to prove that a given language is not context-free. But this would only be interesting if others accept my definition and there is disagreement about a particular language. Perhaps when someone comes up with a parser for Python that satisfies this definition I’ll get around to it.

In any case, the point is not that these languages or their parsers are bad; the point is that we should not pretend that they are context-free. They are context-sensitive, and so we should be working on the theory and implementation of context-sensitive languages and parsers.