Open access should not mean sole access
May 12, 2012  
There's a big mistake that most open access publications are making: they require readers to visit the publication web site to read articles. That is, anyone can read articles for free at the web site, and even copy them for their own use, but no one is allowed to copy articles for republication on another site.

The obvious problem with this is that the operation of the web site becomes critical. If the web server goes down, then readers can't access the articles. If the publication forgets to renew its domain name, or some country decides to blacklist the domain name, then readers can't access the articles. If the publication goes out of business, or the hosting company goes out of business, or a disk fails and the publisher doesn't have a good backup system in place, then readers can't access the articles. The web site becomes a single point of failure.

A less-obvious problem is cost. All-electronic publications can be run cheaply, but not if you need to hire and manage a staff to develop and maintain a fancy web site. Web site expenses are part of the reason that some open access journals charge authors thousands of dollars to publish each article, a charge that is out of reach for many authors.

The purpose of academic publishing is to further the advancement of science by disseminating peer-reviewed research as quickly and as widely as possible. Republishing—replication—is clearly aligned with this purpose, and restricting republication is clearly at cross purposes.

Here's what we should do: move to a publication model that encourages replication and republication of the entire contents of open access journals. For example, libraries should be able to republish journals, and their patrons should be able to read articles through the libraries' web sites. Libraries—or anyone else—should be able to copy not only the articles but the table of contents of the journal, as well as other metadata. This provides multiple continuously-tested backups of the publication which can even survive the publication going out of business: once an article is published, it will always be available.

Journals and authors may wish to prevent some republication, e.g., commercial (for-profit) republication, or republication without attribution. This can be handled as a copyright and licensing issue. Preventing all republication, however, is a mistake.

(See also: “gold” and “green” open access, and Stallman on redistributable scientific publishing.)

Good news: Harvard is broke
April 24, 2012  
Harvard's Faculty Advisory Council on the Library has declared that
major periodical subscriptions, especially to electronic journals published by historically key providers, cannot be sustained: continuing these subscriptions on their current footing is financially untenable.

This is excellent news. If Harvard cannot afford its journal subscriptions, then you can be sure that no academic institution can continue to go along with the extortion of today's academic publishing industry. And if that's the case, perhaps there is hope for change.

What should be done? According to the Council, faculty should no longer submit papers to closed access journals, and they should resign from editorial boards for closed access journals. Instead, they should support open access publishers.

This is great advice. Open access publishing can be cheap, as explained by Harvard's own Stuart Shieber. Most academic journals already obtain the research, peer review, and many editorial positions at no cost; the contrast between costs and prices charged has been a major impetus behind the open access movement. Shieber points out in addition that most authors do a better job of typesetting and copy editing than traditional publishers. And by running his own press, he has demonstrated that the cost of a printed journal can be brought to under 10 cents per page.

I agree with everything Shieber says in his article, but I would go further. Many journals, whether closed or open access, run elaborate web sites. These sites serve as the primary, and often sole, source for downloading articles, and they can provide other services like search, statistics on readership and impact factor, and reader comments. This is unnecessary: these services are already better provided by others, e.g., CiteULike, Mendeley, CiteSeer, etc.

Journals should have minimal web sites. The only real purpose of a journal is to provide certification for its articles: it certifies that its articles have met its standards, including peer review. This certification can be provided simply as a list of accepted articles, plus information regarding the constitution of the editorial board and its standards.

The site can certainly provide the articles themselves, but this should not be the sole or primary way to access the articles. Instead, we should do what we have done for thousands of years: rely on libraries to provide access. Partner with libraries to mirror the content; this requires only using cryptographic hashes to validate the articles. Multiple copies at libraries around the world ensure preservation of the journal, even if the journal ceases publication. Open access should embrace (verifiable) replication, by libraries, by authors, by anyone.
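As a minimal sketch of what hash-based validation could look like, assume the journal publishes a manifest that maps each article identifier to a SHA-256 digest. The manifest format, the identifier, and the function name below are hypothetical, chosen only for illustration.

```python
import hashlib

def verify_copy(article_id, data, manifest):
    """Return True if a mirrored copy matches the digest the journal published."""
    expected = manifest.get(article_id)
    if expected is None:
        return False  # the journal never certified this article
    return hashlib.sha256(data).hexdigest() == expected

# Hypothetical example: the journal certifies a single article.
original = b"Full text of an accepted article."
manifest = {"vol1/article1": hashlib.sha256(original).hexdigest()}

print(verify_copy("vol1/article1", original, manifest))               # True
print(verify_copy("vol1/article1", original + b" edited", manifest))  # False
```

With something like this, a reader who fetches an article from any mirror needs only the journal's short manifest to check that the copy is the certified one.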

Journals should not be printed. Paper documents are much more expensive and inefficient than electronic documents, and they are no better at ensuring preservation. Moreover, reading is increasingly happening on electronic devices, where reformatting is a requirement. Even when readers prefer paper, they will usually make their own printout from an electronic document, rather than using their library's printed copy.

By following Shieber's advice, and by further eliminating the costs of a web site and printing, we can make the cost of the journal dependent on just peer review and editorial functions. This is a cost that Harvard and the rest of the academic community can easily afford.

The curious incident of the semicolon at the newline
April 23, 2012  
Once again, someone is wrong about syntax on the Internet! As an author of several papers on parsing, I can't resist the spectacle.

The fight is over the following snippet of JavaScript:

    clearMenus();
    !isActive && $parent.toggleClass('open')

Note the semicolon after clearMenus(). As Holmes would say, what is curious about this semicolon is that it does nothing at the newline. That is, as far as the JavaScript parser is concerned, it might as well be omitted:

    clearMenus()
    !isActive && $parent.toggleClass('open')

When a JavaScript parser reads this, it notices that a semicolon is missing after clearMenus() and, because a newline follows, acts as if the semicolon were there (this is automatic semicolon insertion). So the variant without the semicolon parses just fine.

The problem is that while the parser produces the very same result for these two snippets, other tools do not. In particular, JSMin was transforming the syntactically-correct second snippet into code that would not parse.
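To see how a character filter can break the second snippet, here is a deliberately naive sketch in Python. It is not JSMin's actual algorithm; it simply deletes line breaks and indentation, which is enough to show the failure mode.

```python
def strip_newlines(source):
    """A naive 'minifier': drop line breaks and surrounding whitespace."""
    return "".join(line.strip() for line in source.splitlines())

with_semicolon = "clearMenus();\n!isActive && $parent.toggleClass('open')\n"
without_semicolon = "clearMenus()\n!isActive && $parent.toggleClass('open')\n"

print(strip_newlines(with_semicolon))
# clearMenus();!isActive && $parent.toggleClass('open')
print(strip_newlines(without_semicolon))
# clearMenus()!isActive && $parent.toggleClass('open')
```

The first output is still two statements. In the second, the newline that triggered semicolon insertion is gone, and clearMenus()!isActive on a single line is a syntax error.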

I'm not going to weigh in on whether this is a bug in JSMin, or whether the code in question should be changed—the debate on those issues, and their resolution, is already entertaining enough.

What I will say is that JSMin can never live up to this blurb from its own README:

JSMin is a filter that omits or modifies some characters. This does not change the behavior of the program that it is minifying.

Douglas Crockford, JSMin's author, provides several counterexamples himself later in the same README, so he knows this statement is false. What I am stating, however, is much stronger:

All filters that omit or modify some characters must change the behavior of some JavaScript programs.

To see this, just consider that JavaScript programs are typically run in browsers as part of a web page, where they have access to their own text via the DOM. A program that can examine its own text can alter its behavior if its text has been altered.
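The argument can be sketched in a few lines of Python. Here a hypothetical run function stands in for the browser: it hands the program its own text under the name SOURCE, much as the DOM exposes a script's text. The program and the one-character filter are made up for illustration.

```python
def run(source):
    """Run a program that can read its own text via the name SOURCE."""
    env = {"SOURCE": source}
    exec(source, env)
    return env["result"]

# The program's behavior depends on the parity of its own length.
program = 'result = "even" if len(SOURCE) % 2 == 0 else "odd"\n'

# A filter that omits exactly one character: the trailing newline.
filtered = program.rstrip("\n")

print(run(program) != run(filtered))  # True
```

Any filter that changes the text at all can be defeated the same way, by a program that compares its text against the original.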

This may seem like a pedantic edge case of an obscure program. The truth is, what is true for this obscure program is also true for every programming tool you have ever used, including every compiler. None of these tools are semantics preserving. The “pedantic edge case” may be different for different tools, but I guarantee you, they all have one.

A grammar for HTML5
March 24, 2012  
The HTML5 specification uses pseudo-code to specify how HTML documents should be parsed. Here's a taste:
  1. If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x2F (ASCII /) then advance position to the next byte and redo this substep.
  2. If the byte at position is 0x3E (ASCII >), then abort the “get an attribute” algorithm. There isn't one.
  3. Otherwise, the byte at position is the start of the attribute name. Let attribute name and attribute value be the empty string.
  4. ...
This style of specification has provoked consternation among some, who prefer the “declarative” style of the HTML 4 specification, based on grammars.

Fortunately, I have spent a great deal of time over the past ten years learning about parsing and its security aspects, and I believe I can give here a very succinct grammar for HTML5.

A few preliminaries. I will be using a variant of Backus-Naur Form (BNF) grammars, in which “.” will denote any single input character, and postfix “*” will denote zero or more repetitions of the preceding construct (Kleene closure). I will use capitalized identifiers for the nonterminals of the grammar.

Here then is the grammar of HTML5:

HTML5 = .*
Yes! No kidding, that really is the grammar. Any input that matches this grammar—which is to say, any input at all—is going to be accepted by just about any web browser, which will do its best to render something sensible. You can think of this as a degenerate case of Postel's Law, in which browsers are extremely liberal in what they accept from others. They accept everything!
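For the skeptical, the grammar is easy to try out as a regular expression. This is a small sketch using Python's re module; the DOTALL flag is needed so that "." matches newlines too, and the sample inputs are arbitrary.

```python
import re

# "." is any single character; "*" is Kleene closure.
HTML5 = re.compile(r".*", re.DOTALL)

samples = [
    "<p>hello</p>",
    "<b><i>misnested</b></i>",
    "",
    "not \x00 even\nmarkup <<<>",
]

print(all(HTML5.fullmatch(s) is not None for s in samples))  # True
```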

To be fair, the HTML5 specification does discuss “parse errors”, and says that HTML user agents can abort processing when they encounter them. But it also says that they can continue processing, and that's what browsers seem to do.

This is no different from HTML 4, whose grammar really should be the same as this. Web browsers have always been very tolerant of “errors” in web pages; they try to render as much of their input as possible. The more complicated grammar that you will find in the HTML 4 specification is incomplete: it does not say what happens when a browser encounters “garbled” HTML, so browsers have been left to decide this for themselves. Naturally enough, this leads to browsers that behave differently on the same input: browser incompatibility. And that leads, in turn, to certain security vulnerabilities.

The defenders of “declarative” specifications will note that HTML 4's syntax specification is not only a grammar. That's true, there is also a lot of English prose confusing things. Here are some questions for the defenders: is the HTML 4 specification equivalent to “.*”? If not, then when an input does not conform to the grammar of HTML 4, what DOM tree will a browser produce? (The answers are “no” and “only your browser knows”.)

The pseudo-code of the HTML5 specification is charmless, but it is pretty easy to convince yourself that it accepts “.*”.

Postel's Law and network security
March 22, 2012  
Postel's Law, which I've written about before, goes like this:
Be conservative in what you do, be liberal in what you accept from others.
The graybeards tell me that Postel's Law was an important engineering guideline for the development of the Internet. To me it seems not only an example of pragmatic engineering but a necessary property of network protocols, which are naturally designed and implemented in a distributed fashion. If you are bootstrapping a network and its protocols with collaborators across the globe, and you find a mismatch between your implementation and another's, you aren't going to stop the network, call them on the phone, get them to fix their bug, and restart—you are going to bumble on as best you can. There aren't many successful network protocols whose implementations ignore Postel's Law.

That said, any protocol implemented according to Postel's Law is going to fall prey to an immediate corollary that has unfortunate security implications:

Corollary: Everyone is liberal in a different way.
Here are three examples.

NUL characters in SSL certificates

In 2009 or so, Marlinspike noted that SSL certificate authorities were signing certificates for domains containing the NUL (ASCII 0) character. For example, consider this domain where 0 denotes NUL:
gmail.com0.ev.il
To an SSL certificate authority, this may look like a subdomain that should be “owned” by the ev.il domain. If so, they will happily issue an SSL certificate for the subdomain to the owner of ev.il. This is actually a liberal treatment of domain names in the sense of Postel, because domain names aren't supposed to include NUL.

Now consider a browser trying to reach https://gmail.com/. Its communication could be intercepted by agents of ev.il (there are lots of ways this can happen, but the details are not important here). That's fine so far, because SSL should detect this. However, ev.il has a certificate for gmail.com0.ev.il, which it gives to the browser. The browser interprets the certificate as a certificate for gmail.com, because it uses NUL as a string terminator (common practice in programming languages like C), so ev.il has succeeded in impersonating gmail.

The browser is interpreting the bogus domain name liberally, in the sense of Postel, but the browser's “liberal” is different from the certificate authority's “liberal”.
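The two readings of the name can be sketched as follows. Both functions are hypothetical simplifications: real certificate authorities and browsers do much more, but the disagreement comes down to these two ways of reading one string.

```python
NUL = "\x00"
cert_name = "gmail.com" + NUL + ".ev.il"

def ca_view(name):
    """Hypothetical CA check: the owner is the domain formed by the last two labels."""
    return ".".join(name.split(".")[-2:])

def browser_view(name):
    """Hypothetical C-style client: the string ends at the first NUL."""
    return name.split(NUL, 1)[0]

print(ca_view(cert_name))       # ev.il
print(browser_view(cert_name))  # gmail.com
```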

HTTP request splitting

Here's a second example, derived from one by Watchfire:
    POST /form.html HTTP/1.1
    Host: example.com
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 0
    Content-Length: 44

    GET /good HTTP/1.1
    Host: example.com
    Bla: GET /evil HTTP/1.1
    Host: example.com

This looks like a sequence of HTTP requests (the messages your browser sends to web sites to retrieve content). However, something is strange because the first request has two Content-Length fields, when it should have at most one. If we want to be liberal like Postel, there are a couple of ways to handle this. One way is to believe the first Content-Length header, in which case we get two requests,
    POST /form.html HTTP/1.1
    Host: example.com
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 0
    Content-Length: 44

and
    GET /good HTTP/1.1
    Host: example.com
    Bla: GET /evil HTTP/1.1
    Host: example.com

Here we are posting to /form.html and getting the resource /good.

A second way to be liberal is to believe the second Content-Length header, in which case we get two different requests,

    POST /form.html HTTP/1.1
    Host: example.com
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 0
    Content-Length: 44

    GET /good HTTP/1.1
    Host: example.com
    Bla: 

and
    GET /evil HTTP/1.1
    Host: example.com

In this case we are posting to /form.html again, though the contents of the posting are different (the posting includes the GET /good); and we are getting the resource /evil. (In a real attack, /evil would be a well-chosen request that would cause trouble for the web site.)

Once again, two ways of being liberal. This can be a problem when you have two programs working together, each of which is liberal in a different way. A common example is a proxy that sits in front of a web site, filtering requests, and throwing out any “bad” requests that it sees.

Here if the proxy sees our input and uses the first interpretation, it will let it through. If the web site uses the second interpretation, it will be exploited, and we will have defeated the proxy's protection.
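The two interpretations can be reproduced with a small sketch of a request splitter. It assumes CRLF line endings, under which the 44 bytes run exactly from "GET /good" through "Bla: ". The function and its policy parameter are hypothetical; no real proxy or server is being modeled.

```python
RAW = (
    b"POST /form.html HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Type: application/x-www-form-urlencoded\r\n"
    b"Content-Length: 0\r\n"
    b"Content-Length: 44\r\n"
    b"\r\n"
    b"GET /good HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Bla: GET /evil HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"\r\n"
)

def request_lines(stream, policy):
    """Split a byte stream into requests and return each request line.

    policy is "first" or "last": which duplicate Content-Length to believe.
    """
    found = []
    pos = 0
    while pos < len(stream):
        head_end = stream.index(b"\r\n\r\n", pos)
        head = stream[pos:head_end].decode("ascii").split("\r\n")
        found.append(head[0])
        lengths = [int(h.split(":", 1)[1]) for h in head[1:]
                   if h.lower().startswith("content-length:")]
        body_len = 0
        if lengths:
            body_len = lengths[0] if policy == "first" else lengths[-1]
        pos = head_end + 4 + body_len  # skip the blank line and the body
    return found

print(request_lines(RAW, "first"))  # ['POST /form.html HTTP/1.1', 'GET /good HTTP/1.1']
print(request_lines(RAW, "last"))   # ['POST /form.html HTTP/1.1', 'GET /evil HTTP/1.1']
```

A proxy running the "first" policy in front of a server running the "last" policy is exactly the mismatch described above.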

Script injection

As a final example, I'll use the current most-reported security vulnerability: script injection. Script injection becomes possible when a web site serves up input provided by its users. Facebook is one example (of many). Most of the web pages you look at on Facebook contain content provided by its users; after all, that's the whole point of Facebook. Script injection happens when a user provides the site with content in such a way that the resulting web page includes a script of the user's choice. This script will be executed in the browser of other users who view the page.

Script injection is common but complicated. I'm not going to spell out all of the details (see my BEEP project if you want to know more), but one of the complications is relevant to Postel's Law. Namely, the main defense against script injection is for the web site to be careful with user input: it should filter out user input that includes scripts.

Of course, this requires knowing exactly what constitutes a script. And browsers do not agree on this. Browsers, famously, have incompatibilities.

For example, most browsers think that this snippet of HTML

    <img src=java
    script:alert(0)>
does not contain a script. However, some browsers will treat it as a script, essentially treating it as if it were this correct snippet:
    <img src=javascript:alert(0)>
Something similar happens for the following snippet:
    <img """><script>alert(0)
    </script>">
In short, the answer to the question of whether a bit of user input includes a script depends on the browser—there is no one answer. Each browser, and each version of a browser, may have a different answer; they can all be liberal in a different way. This is one of the major hurdles for any defense against script injection.
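Here is a sketch of the mismatch for the first snippet. Both functions are hypothetical: the filter is a simplistic stand-in for a web site's input check, and the "lenient browser" models only the one behavior described above, ignoring a line break inside the markup.

```python
def filter_allows(html):
    """Hypothetical server-side filter: reject input containing an obvious script."""
    lowered = html.lower()
    return "javascript:" not in lowered and "<script" not in lowered

def lenient_browser_reads(html):
    """Hypothetical lenient browser: ignores line breaks inside the markup."""
    return html.replace("\r", "").replace("\n", "")

user_input = "<img src=java\nscript:alert(0)>"

print(filter_allows(user_input))                         # True: the filter sees no script
print(filter_allows(lenient_browser_reads(user_input)))  # False: the browser's reading has one
```

The filter and the browser are each being reasonable about what they accept; the vulnerability lies in the gap between them.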

Summary

It is very hard to build two independent implementations of a network protocol which behave identically (in fact the point may well be that they behave differently, e.g., one may be faster). When implementations differ, they still need to interact, and they do so by Postel's Law. If you are interested in finding (and preventing) attacks on protocols, one of the best places to start looking is wherever the implementations are liberal in different ways.

Archival journals require open access
March 18, 2012  
Many academics don't trust electronic-only journals; they think that paper documents are more durable, more archival, than electronic documents. Evidence for this is all around us: link rot is pervasive in blogs and other web sites and approaches 3% in digital libraries, and in some cases publishers have lost the full text of articles for journals that cease publication or change ownership.

The funny thing is, we've known since at least the burning of the Library of Alexandria that paper (or papyrus, or any other physical medium) is no guarantee of durability. Brewster Kahle, who started the Internet Archive and knows a thing or two about document preservation, says that

the lesson of the first Library of Alexandria is “don’t have just one copy.”

If paper journals are more durable than electronic journals, then it must be because paper journals are using replication more effectively than electronic journals. This is strange, given how much easier it is to copy bits than paper, and how much computer scientists know about the uses of replication; yet it is so.

Consider how an issue of a typical academic journal gets published. The publisher prints up some number of copies of the issue, stores some in its own archives, and ships the rest to its subscribers. These subscribers include university libraries, who catalog their new acquisitions and store them in their stacks. University libraries can contain millions of volumes, and they maintain their holdings in climate- and moisture-controlled facilities, employing dozens of librarians and custodians to monitor the stacks around the clock.

At this point, if the publishing house were to burn down like the Library of Alexandria, the journal itself would survive. It is still available at dozens of libraries. Furthermore, the authenticity of any journal article can be established. Libraries are trustworthy: patrons know that they are dedicated to scholarship and its preservation, and tampering with the holdings is unlikely. If there were ever a question of tampering, the holdings of one library could be compared with others. (Libraries seem to be providing quite a lot of value to publishers; perhaps publishers should be paying libraries, instead of the other way round?)

In contrast, a typical electronic journal sits behind a paywall. The journal maintains the canonical copy; if it has mirrors or backups, the rest of the world is none the wiser. Libraries do not have copies of the electronic documents. Readers have to go through the publisher to read articles. Sometimes, authors are allowed to put a copy of their article on their web sites, but very often they cannot put up the “official” version.

The replication in this system, if any, is unsatisfactory. Libraries are not partners in the critical task of preservation. There are not multiple, independent copies of publications that can be compared for validation.

The problem here is not with the nature of electronic documents, but rather with publishers who are afraid of electronic documents. Open access publishers generally do better. The ArXiv is run by a library and has mirrors at other libraries. PLoS ONE has at least one mirror, and PLoS is experimenting with aggregating content published in other journals.

Open access enables replication, and replication is the essential ingredient of preservation. Therefore open access journals should emphasize this advantage over closed access journals, and they should pursue partnerships with libraries. (They should also start using cryptographic hashes to help authenticate replicated documents, and abandon PDF, a historical relic.)
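The comparison of holdings across libraries described earlier carries over directly to electronic copies. As a rough sketch, assuming each library can hand over its copy of an article as bytes, the copies can be hashed and any holder that disagrees with the majority flagged. The library names and the function are hypothetical.

```python
import hashlib
from collections import Counter

def find_outliers(copies):
    """Return the holders whose copy differs from the most common digest."""
    digests = {holder: hashlib.sha256(data).hexdigest()
               for holder, data in copies.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(h for h, d in digests.items() if d != majority)

copies = {
    "library-a": b"article text",
    "library-b": b"article text",
    "library-c": b"article text (altered)",
}

print(find_outliers(copies))  # ['library-c']
```

This only works when there are several independent copies to compare, which is the point: the check is unavailable when the publisher holds the only one.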

Paper journals are on their way out. Nowadays a publication is more than just text—there is audio, there is video, there is source code, there are data sets. Electronic publications run by closed access publishers are also on their way out—they aren't archival. Open access electronic journals are the only way forward.

Paper harms scholarship
March 12, 2012  
I've just finished up a conference submission, which means I've spent quite a bit of time forcing my paper into compliance with the publisher's formatting requirements:
Papers must adhere to the standard ACM conference format: two columns, nine-point font on a ten-point baseline, with columns 20pc (3.33in) wide and 54pc (9in) tall, with a column gutter of 2pc (0.33in).
Furthermore, papers must be
at most 12 pages, including bibliography and figures.
Is there any doubt that these requirements actively harm scholarship? Anyone who has tried to comply with them understands that:
  • They force authors to spend a great deal of time eliminating widows and orphans, and otherwise trimming text according to the geometry of the page, regardless of whether that text could better explain the research.
  • Including the bibliography in the page limit makes it impossible to properly cite the literature. In practice, this requirement means that papers written by program committee members will be cited (otherwise, the paper stands no chance), but many relevant papers must be omitted.
  • A nine-point font is almost unreadable on paper, the narrow margins leave no room for annotations when reading, and a two-column format is difficult to read on an electronic device.
A fair reviewing process requires some limit on the length of submissions, but basing it on a number of US letter-sized pages is archaic. On top of that, all of these requirements are driven by the desire to keep the cost of printing low, at a time when I, for one, don't even want printed documents.

I once sat on a plane on the way to a conference next to a colleague who wrote up a submission for a different conference in pencil, during the plane ride. The submission was accepted. I'm sure there is some sensible way to fairly enforce a limit on the length of submissions that does not require them to be printed out in a nine-point font.

Some people argue that the best (only?) way to read a paper is to print it out. But of course a paper that has been formatted for print is best read in print! That has no bearing on how we format future papers. Going forward, academia should abandon paper.