(Part 3 of a series; see part 1 and part 2.)
This might be a surprising question to most parser experts: is the syntax of Java a context free language? I myself am inclined to say that it is, but it’s actually not straightforward.
The Java Language Specification (Gosling, Joy, Steele 1996) includes three “context-free grammars”:
- the lexical grammar;
- the syntactic grammar;
- an LALR(1) version of the syntactic grammar.
These grammars are used to define a lexer and parser in the standard methodology:
The lexical grammar defines the lexer, and either of the two syntactic grammars can be used to define the parser. No lexical feedback, and a bog-standard lexer, about as vanilla as you can get.
However, let’s look at these “context-free grammars” more closely.
The lexical grammar is defined by grammar rules that look like Chomsky’s definition of context-free grammar rules, but they are interpreted differently. If we interpret them as a true context-free grammar then that grammar is highly ambiguous. For example, by the standard definition of context-free grammars, “ifthen” could be interpreted as (keyword) “if” followed by (keyword) “then”, or as (identifier) “ifthen”, or as (identifier) “if” followed by (identifier) “then”, etc.
This isn’t what is intended, of course. Ambiguities are supposed to be resolved in the usual way: the rules are (implicitly) ordered and we use a first-match, longest-match rule to break ties. Consequently, keywords get precedence over identifiers, and whitespace becomes mandatory between certain tokens. All perfectly standard, but this is not context-free semantics.
It’s sloppy terminology.
The syntactic grammar is presented over several chapters. It is described as “very similar to the LALR(1) grammar but more readable”. That means it requires more than one token of lookahead to parse and (I think) the authors hope that it describes the same language as the grammar that will be used to build actual parsers.
The fact that the authors believe that humans and parser generators need a different presentation of the syntax, and that they aren’t sure whether their two definitions are equivalent deserves comment, but I’ll pass on that for now.
The LALR(1) grammar is not as readable as the first grammar (which is “much better for exposition”) but it has been mechanically verified to require only a single token of lookahead.
This would seem to be definitive. What goes unsaid, however, is that the grammar has shift-reduce conflicts. That is, the grammar is ambiguous, and the parser applies some heuristics to resolve ambiguities.
The canonical example is the if-then-else statement. The program fragment
if (c == '0') x++; if (c == '1') x--; else y++;
has more than one parse according to the LALR(1) grammar: the “else” could be associated with the first “if” or the second “if”. This shows up as a shift-reduce conflict that the parser resolves as a shift, which means that the “else” gets associated with the closer, second “if”.
This has no effect on the language accepted by the grammar, so in that narrow sense the syntax is context free.
However, in practice we care about the actual parse tree determined by the grammar. In that sense, I think that for us to call the syntax context free we require an unambiguous grammar. The LALR(1) grammar doesn’t reach that standard.
As far as I know it’s an open question whether there is an unambiguous context-free grammar for Java. It’s known that you can write an unambiguous grammar for if-then-else statements, but it’s also known that an ambiguous grammar plus disambiguating rules can result in a non-context-free language.
There’s nothing wrong with Java’s syntax, or even its specification. The point is that it doesn’t use context-free grammars. Instead of pretending that parsing is solved by context-free parsers, we should instead study what is used in practice.