Open access should not mean sole access
There's a big mistake that most open access publications are making:
they require readers to visit the publication web site to read
articles. That is, anyone can read articles for free at the web site,
and even copy them for their own use, but no one is allowed to copy
articles for republication on another site.
The obvious problem with this is that the operation of the web site
becomes critical. If the web server goes down, then readers can't
access the articles. If the publication forgets to renew its domain
name, or some country decides to blacklist the domain name, then
readers can't access the articles. If the publication goes out of
business, or the hosting company goes out of business, or a disk fails
and the publisher doesn't have a good backup system in place, then
readers can't access the articles. The web site becomes a single
point of failure.
A less-obvious problem is cost. All-electronic publications can be run
cheaply, but not if you need to hire and manage a staff to develop and
maintain a fancy web site. Web site expenses are part of the reason
that some open access journals charge authors thousands of dollars to
publish each article, a charge that is out of reach for many authors.
The purpose of academic publishing is to further the advancement of
science by disseminating peer-reviewed research as quickly and as
widely as possible. Republishing—replication—is
clearly aligned with this purpose, and restricting republication is
clearly at cross purposes.
Here's what we should do: move to a publication model that encourages
replication and republication of the entire contents of open access
journals. For example, libraries should be able to republish
journals, and their patrons should be able to read articles through
the libraries' web sites. Libraries—or anyone else—should
be able to copy not only the articles but the table of contents of the
journal, as well as other metadata. This provides multiple
continuously-tested backups of the publication which can even survive
the publication going out of business: once an article is published,
it will always be available.
Journals and authors may wish to prevent some republication, e.g.,
commercial (for-profit) republication, or republication without
attribution. This can be handled as a copyright and licensing issue.
Preventing all republication, however, is a mistake.
(See also: “gold” and “green” open access, and Stallman on
redistributable scientific publishing.)
Good news: Harvard is broke
Harvard's Faculty Advisory Council on the Library has declared that
“major periodical subscriptions, especially to electronic journals
published by historically key providers, cannot be sustained:
continuing these subscriptions on their current footing is financially
untenable.”
This is excellent news. If Harvard cannot afford its journal
subscriptions, then you can be sure that no academic
institution can continue to go along with the extortion of today's
academic publishing industry. And if that's the case, perhaps there
is hope for change.
What should be done? According to the Council, faculty should no
longer submit papers to closed access journals, and they should resign
from editorial boards for closed access journals. Instead, they
should support open access publishers.
This is great advice. Open access publishing can be cheap, as
explained by Harvard's own Stuart Schieber. Most academic
journals already obtain the research, peer review, and many editorial
positions at no cost—the contrast between costs and prices
charged has been a major impetus behind the open access movement.
Schieber points out in addition that most authors do a better job of
typesetting and copy editing than traditional publishers. And by
running his own press, he has demonstrated that the cost of a printed
journal can be brought to under 10 cents per page.
I agree with everything Schieber says in his article, but I would go
further. Many journals, whether closed or open access, run elaborate
web sites. These sites serve as the primary, and often sole, source
for downloading articles, and they can provide other services like
search, statistics on readership and impact factor, and reader
comments. This is unnecessary: these services are already better
provided by others, such as CiteULike, Mendeley, and CiteSeer.
Journals should have minimal web sites. The only real purpose of a
journal is to provide certification for its articles: it certifies
that its articles have met its standards, including peer review. This
certification can be provided simply as a list of accepted articles,
plus information regarding the constitution of the editorial board and
its standards.
The site can certainly provide the articles themselves, but this
should not be the sole or primary way to access the articles.
Instead, we should do what we have done for thousands of years: rely
on libraries to provide access. Partner with libraries to mirror the
content; this requires only using cryptographic hashes to validate
the articles. Multiple copies at libraries around the world ensure
preservation of the journal, even in the case that the journal ceases
publication. Open access should embrace (verifiable) replication, by
libraries, by authors, by anyone.
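As a rough illustration of the hash-based validation mentioned above, here is a minimal Python sketch. It assumes the journal publishes a SHA-256 digest alongside each accepted article; the manifest format, the article identifier, and the function names are hypothetical, not part of any existing journal's system.

```python
import hashlib

def digest(article_bytes: bytes) -> str:
    """Return the SHA-256 digest of an article as a hex string."""
    return hashlib.sha256(article_bytes).hexdigest()

def verify_copy(article_id: str, mirrored_bytes: bytes, manifest: dict) -> bool:
    """Check a mirrored copy against the digest the journal published.

    `manifest` is a hypothetical mapping from article identifiers to hex digests.
    """
    expected = manifest.get(article_id)
    return expected is not None and digest(mirrored_bytes) == expected

# The journal publishes the article once and lists its digest.
original = b"Open access should not mean sole access.\n"
manifest = {"vol1/article1": digest(original)}

# A library mirror holding an identical copy validates; a modified copy does not.
faithful_ok = verify_copy("vol1/article1", original, manifest)
tampered_ok = verify_copy("vol1/article1", original + b"(edited)", manifest)
print(faithful_ok, tampered_ok)
```

With something like this, the journal's own site only needs to serve the small manifest; the articles themselves can come from any mirror.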
Journals should not be printed.
Paper documents are much more expensive and inefficient than
electronic documents, and they are no better at ensuring
preservation. Moreover, reading is increasingly happening on
electronic devices, where reformatting is a requirement.
Even when readers prefer paper, they will usually make their own
printout from an electronic document, rather than using their
library's printed copy.
By following Schieber's advice, and by further eliminating the costs
of a web site and printing, we can make the cost of the journal
dependent on just peer review and editorial functions. This is a cost
that Harvard and the rest of the academic community can easily afford.
The curious incident of the semicolon at the newline
Once again, someone is wrong about syntax on the Internet! As an
author of several papers on parsing, I can't resist the spectacle.
The fight is over the following snippet of JavaScript:
clearMenus();
!isActive && $parent.toggleClass('open')
Note the semicolon after clearMenus(). As Holmes would
say, what is curious about this semicolon is that it does
nothing at the newline. That is, as far as the JavaScript parser
is concerned, it might as well be omitted:
clearMenus()
!isActive && $parent.toggleClass('open')
When JavaScript parses this, it notices that a semicolon is
missing after clearMenus(), but acts as if it
were not missing. So the variant without the semicolon parses
just fine.
The problem is that while the parser produces the very same result
for these two snippets, other tools do not. In particular, JSMin was
transforming the syntactically-correct second snippet into code that
would not parse.
I'm not going to weigh in on whether this is a bug in JSMin, or
whether the code in question should be changed; the debate on those
issues, and their resolution, is already entertaining enough.
What I will say is that JSMin can never live up to this blurb from its
own README:
JSMin is a filter that omits or modifies some characters. This does not change
the behavior of the program that it is minifying.
Douglas Crockford, JSMin's author, provides several counterexamples himself
later in the same README, so he knows this statement is false. What I
am stating, however, is much stronger:
All filters that omit or modify some characters must
change the behavior of some JavaScript programs.
To see this, just consider that JavaScript programs are typically
run in browsers as part of a web page, where they have access to
their own text via the DOM. A program that can examine its own
text can alter its behavior if its text has been altered.
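To make this concrete, here is a small sketch. It is written in Python rather than JavaScript, and it models "access to its own text" by handing the program its source in a variable named SOURCE. That variable, the run helper, and the toy minifier are all assumptions made for illustration; this is not how a browser or JSMin actually works.

```python
def run(program_text: str) -> int:
    """Run a program that can read its own text through the name SOURCE."""
    env = {"SOURCE": program_text}
    exec(program_text, env)
    return env["result"]

def toy_minify(program_text: str) -> str:
    """A stand-in for a minifier: it only deletes space characters."""
    return program_text.replace(" ", "")

# This program's behavior depends on its own text: it reports its length.
program = "result = len(SOURCE)\n"

before = run(program)
after = run(toy_minify(program))
print(before, after)
```

The two runs report different numbers, even though the filter removed nothing but spaces. Any filter that changes even one character can be detected the same way.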
This may seem like a pedantic edge case of an obscure program.
The truth is, what is true for this obscure program is also true
for every programming tool you have ever used, including every
compiler. None of these tools are semantics preserving. The
“pedantic edge case” may be different for different tools,
but I guarantee you, they all have one.
A grammar for HTML5
The HTML5 specification uses pseudo-code to specify how HTML documents
should be parsed. Here's a taste:
- If the byte at position is one of 0x09 (ASCII TAB), 0x0A
(ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or
0x2F (ASCII /) then advance position to the next byte and redo
this substep.
- If the byte at position is 0x3E (ASCII >), then abort
the “get an attribute” algorithm. There isn't one.
- Otherwise, the byte at position is the start of the
attribute name. Let attribute name and attribute value be the empty
string.
- ...
This style of specification has provoked consternation among some,
who prefer the “declarative” style of the HTML 4 specification, based
on grammars.
Fortunately, I have spent a great deal of time over the past ten years
learning about parsing and its security aspects, and I believe I can
give here a very succinct grammar for HTML5.
A few preliminaries. I will be using a variant of Backus-Naur Form
(BNF) grammars, in which “.” will denote any single input
character, and postfix “*” will denote zero or more
repetitions of the preceding construct (Kleene closure). I will use
capitalized identifiers for the nonterminals of the grammar.
Here then is the grammar of HTML5:
HTML5 = .*
Yes! No kidding, that really is the grammar. Any input that matches
this grammar—which is to say, any input at all—is going to
be accepted by just about any web browser, which will do its best to
render something sensible. You can think of this as a degenerate case
of Postel's Law, in which browsers are extremely liberal in what they
accept from others. They accept everything!
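The grammar is simple enough to check mechanically. The sketch below uses Python's re module as a stand-in recognizer; the DOTALL flag is there so that "." also matches newlines, as "any single input character" requires. The sample inputs are made up for illustration.

```python
import re

# HTML5 = .*   (with "." matching any character, including newlines)
HTML5 = re.compile(r".*", re.DOTALL)

def is_html5(document: str) -> bool:
    """Return True if the whole input matches the grammar."""
    return HTML5.fullmatch(document) is not None

samples = [
    "<!DOCTYPE html><title>ok</title><p>Hello",
    "<b><i>mis</b>nested</i>",
    "<img \"\"\"><script>alert(0)\n</script>\">",
    "",
    "not even close to markup \x00\x01",
]
results = [is_html5(s) for s in samples]
print(results)
```

Every sample is accepted, including the empty string. The recognizer says nothing about what tree a browser builds from each input, which is the part the pseudo-code of the specification has to pin down.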
To be fair, the HTML5 specification does discuss “parse errors”, and
says that HTML user agents can abort processing when they encounter
them. But it also says that they can continue processing, and that's
what browsers seem to do.
This is no different from HTML 4, whose grammar really should be the
same as this. Web browsers have always been very tolerant of
“errors” in web pages; they try to render as much of their
input as possible. The more complicated grammar that you will find in
the HTML 4 specification is incomplete: it does not say what happens
when a browser encounters “garbled” HTML, so browsers have been left
to decide this for themselves. Naturally enough, this leads to
browsers that behave differently on the same input: browser
incompatibility. And that leads, in turn, to certain security
vulnerabilities.
The defenders of “declarative” specifications will note
that HTML 4's syntax specification is not only a grammar.
That's true; there is also a lot of English prose confusing things.
Here are some questions for the defenders: is the HTML 4 specification
equivalent to “.*”? If not, then when an input
does not conform to the grammar of HTML 4, what DOM tree will a
browser produce? (The answers are “no” and “only
your browser knows”.)
The pseudo-code of the HTML5 specification is charmless, but it is
pretty easy to convince yourself that it accepts “.*”.
Postel's Law and network security
Postel's Law, which I've written about
before, goes like this:
Be conservative in what you do, be liberal in what you accept from others.
The graybeards tell me that Postel's Law was an important engineering
guideline for the development of the Internet. To me it seems not
only an example of pragmatic engineering but a necessary property of
network protocols, which are naturally designed and implemented in a
distributed fashion. If you are bootstrapping a
network and its protocols with collaborators across the globe, and you
find a mismatch between your implementation and another's, you aren't
going to stop the network, call them on the phone, get them to fix
their bug, and restart—you are going to bumble on as best you
can. There aren't many successful network protocols whose
implementations ignore Postel's Law.
That said, any protocol implemented according to Postel's Law is going
to fall prey to an immediate corollary that has unfortunate security
implications:
Corollary: Everyone is liberal in a different way.
Here are a couple of examples.
NUL characters in SSL certificates
In 2009 or so, Marlinspike noted that SSL certificate authorities were
signing certificates for domains containing the NUL (ASCII 0)
character. For example, consider this domain, where the 0 after
gmail.com denotes NUL:
gmail.com0.ev.il
To an SSL certificate authority, this may look like a subdomain that
should be “owned” by the ev.il domain. If so, they will
happily issue an SSL certificate for the subdomain to the owner of
ev.il. This is actually a liberal treatment of domain names in the
sense of Postel, because domain names aren't supposed to include NUL.
Now consider a browser trying to reach https://gmail.com/. Its
communication could be intercepted by agents of ev.il (there are lots
of ways this can happen, but the details are not important here).
That's fine so far, because SSL should detect this. However, ev.il
has a certificate for gmail.com0.ev.il, which it gives to the
browser. The browser, however, interprets the certificate as a
certificate for gmail.com, because it is using NUL as a string
terminator (common practice in programming languages like C), so ev.il
has succeeded in impersonating gmail.
The browser is interpreting the bogus domain name liberally, in the
sense of Postel, but the browser's “liberal” is different
from the certificate authority's “liberal”.
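The mismatch can be sketched in a few lines of Python. The two functions below are deliberately simplified models, not the logic of any real certificate authority or browser: one treats the name as an opaque, length-counted string and looks only at its last two labels, and the other stops at the first NUL the way a C string routine would.

```python
NUL = "\x00"

def ca_owner_domain(name: str) -> str:
    """Simplified CA view: the owner is identified by the last two labels."""
    return ".".join(name.split(".")[-2:])

def c_string_view(name: str) -> str:
    """Simplified browser view: a C-style string ends at the first NUL."""
    return name.split(NUL, 1)[0]

# The certificate name from the example, with a real NUL in place of the 0.
cert_name = "gmail.com" + NUL + ".ev.il"

ca_sees = ca_owner_domain(cert_name)     # whose approval the CA asks for
browser_sees = c_string_view(cert_name)  # who the browser thinks it has reached
print(ca_sees, browser_sees)
```

Both functions accept a name that should have been rejected, and they disagree about what it means: the first answers ev.il, the second gmail.com.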
HTTP request splitting
Here's a second example, derived from one by Watchfire:
POST /form.html HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 0
Content-Length: 44
GET /good HTTP/1.1
Host: example.com
Bla: GET /evil HTTP/1.1
Host: example.com
This looks like a sequence of HTTP requests (the messages your browser
sends to web sites to retrieve content). However, something is
strange because the first request has two Content-Length fields, when
it should have at most one. If we want to be liberal
like Postel, there are a couple of ways to handle this. One way is to
believe the first Content-Length header, in which case we get two
requests,
POST /form.html HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 0
Content-Length: 44
and
GET /good HTTP/1.1
Host: example.com
Bla: GET /evil HTTP/1.1
Host: example.com
Here we are posting to /form.html and getting the resource /good.
A second way to be liberal is to believe the second Content-Length
header, in which case, we get two different requests,
POST /form.html HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 0
Content-Length: 44
GET /good HTTP/1.1
Host: example.com
Bla:
and
GET /evil HTTP/1.1
Host: example.com
In this case we are posting to /form.html again, though the contents
of the posting are different (the posting includes the GET /good); and
we are getting the resource /evil. (In a real attack, /evil would be
a well-chosen request that would cause trouble for the web site.)
Once again, two ways of being liberal. This can be a problem when you
have two programs working together, each of which is liberal in a
different way. A common example is a proxy that sits in front of a
web site, filtering requests, and throwing out any “bad”
requests that it sees.
Here if the proxy sees our input and uses the first interpretation, it
will let it through. If the web site uses the second interpretation,
it will be exploited, and we will have defeated the proxy's
protection.
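The two readings can be reproduced with a toy request splitter. This is a sketch under stated assumptions, not a real HTTP parser: it assumes CRLF line endings and the blank line that ends each header block (neither is visible in the listings above), and the policy parameter is a hypothetical knob that selects which duplicate Content-Length to believe.

```python
RAW = (
    b"POST /form.html HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Type: application/x-www-form-urlencoded\r\n"
    b"Content-Length: 0\r\n"
    b"Content-Length: 44\r\n"
    b"\r\n"
    b"GET /good HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Bla: GET /evil HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"\r\n"
)

def request_lines(stream: bytes, policy: str) -> list:
    """Split a byte stream into requests and return each request line.

    `policy` is "first" or "last": which duplicate Content-Length wins.
    """
    lines = []
    while stream:
        # Headers end at the first blank line; the rest is body plus later requests.
        head, _, rest = stream.partition(b"\r\n\r\n")
        header_lines = head.split(b"\r\n")
        lengths = [
            int(h.split(b":", 1)[1].strip())
            for h in header_lines[1:]
            if h.lower().startswith(b"content-length:")
        ]
        if not lengths:
            body_len = 0
        elif policy == "first":
            body_len = lengths[0]
        else:
            body_len = lengths[-1]
        lines.append(header_lines[0].decode("ascii"))
        stream = rest[body_len:]  # skip the body, continue with what follows
    return lines

proxy_view = request_lines(RAW, "first")
server_view = request_lines(RAW, "last")
print(proxy_view)
print(server_view)
```

A proxy using the "first" policy sees a POST followed by GET /good, while a server using the "last" policy swallows 44 bytes as the POST body and then sees GET /evil. The 44 is exactly the length of the text from "GET /good" through "Bla: ", counting the CRLFs.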
Script injection
As a final example, I'll use the current most-reported security
vulnerability, script injection. Script injection can happen when a
web site serves up content provided by its users.
Facebook is one example (of many). Most of the web pages you look at
on Facebook contain content provided by its users—after all,
that's the whole point of Facebook. Script injection happens when a
user provides the site with content in such a way that the resulting
web page includes a script of the user's choice. This script will be
executed in the browser of other users who view the page.
Script injection is common but complicated. I'm not going to spell
out all of the details (see my BEEP
project if you want to know more), but one of the complications is
relevant to Postel's Law. Namely, the main defense against script
injection is for the web site to be careful with user input: it should
filter out user input that includes scripts.
Of course, this requires knowing exactly what constitutes a
script. And browsers do not agree on this. Browsers,
famously, have incompatibilities.
For example, most browsers think that this snippet of HTML
<img src=java
script:alert(0)>
does not contain a script. However, some browsers will treat it as a
script, essentially treating it as if it were this correct snippet:
<img src=javascript:alert(0)>
Something similar happens for the following snippet:
<img """><script>alert(0)
</script>">
In short, the answer to the question of whether a bit of user input
includes a script depends on the browser; there is no one answer. Each
browser, and each version of a browser, may have a different answer;
they can all be liberal in a different way. This is one of the major
hurdles for any defense against script injection.
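Here is a sketch of that disagreement for the first snippet. All three functions are hypothetical, heavily simplified models written for illustration; they look only at the value of the src attribute and do not reflect the code of any real filter or browser.

```python
def filter_sees_script(src: str) -> bool:
    """Simplified server-side filter: flag values that start with 'javascript:'."""
    return src.lower().startswith("javascript:")

def literal_browser_runs_script(src: str) -> bool:
    """A browser that reads the value literally agrees with the filter."""
    return src.lower().startswith("javascript:")

def liberal_browser_runs_script(src: str) -> bool:
    """A browser that drops tabs and newlines before reading the scheme."""
    cleaned = "".join(ch for ch in src if ch not in "\t\r\n")
    return cleaned.lower().startswith("javascript:")

# The attribute value from the first snippet, with the newline inside "javascript".
user_input = "java\nscript:alert(0)"

blocked = filter_sees_script(user_input)
runs_in_literal = literal_browser_runs_script(user_input)
runs_in_liberal = liberal_browser_runs_script(user_input)
print(blocked, runs_in_literal, runs_in_liberal)
```

The filter lets the input through, and the literal browser agrees that there is no script, but the liberal browser runs it. A filter is only safe if it is at least as liberal as every browser its users might have.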
Summary
It is very hard to build two independent implementations of a network
protocol which behave identically (in fact the point may well be that
they behave differently, e.g., one may be faster). Even when their
behavior differs, the implementations still need to interact, and they
do so by Postel's Law. If you are interested in finding (and preventing)
attacks on protocols, one of the best places to start looking is
somewhere that the implementations are liberal in different ways.
Archival journals require open access
Many academics don't trust electronic-only journals; they think that
paper documents are more durable, more archival, than electronic
documents. Evidence for this is all around us: link rot in blogs and
other web sites is pervasive, it approaches 3% in digital libraries,
and in some cases publishers have lost the full text of articles for
journals that cease publication or change ownership.
The funny thing is, we've known since at least the burning of the
Library of Alexandria that paper (or papyrus, or any other physical
medium) is no guarantee of durability. Brewster Kahle, who started
the Internet Archive and knows a thing or two about document
preservation, says that the lesson of the first Library of Alexandria
is “don’t have just one copy.”
If paper journals are more durable than electronic journals, then
it must be because paper journals are using replication more
effectively than electronic journals. This is strange, given how
much easier it is to copy bits than paper, and how much computer
scientists know about the uses of replication; yet it is so.
Consider how an issue of a typical academic journal gets published.
The publisher prints up some number of copies of the issue, stores
some in its own archives, and ships the rest to its subscribers.
These subscribers include university libraries, who catalog their new
acquisitions and store them in their stacks. University libraries can
contain millions of volumes, and they maintain their holdings in
climate- and moisture-controlled facilities, employing dozens of
librarians and custodians to monitor the stacks around the clock.
At this point, if the publishing house were to burn down like the
Library of Alexandria, the journal itself would survive. It would
still be available at dozens of libraries. Furthermore, the
authenticity of any journal article can be established.
Libraries are trustworthy: patrons know that they are dedicated to
scholarship and its preservation, and tampering with the holdings is
unlikely. If there were ever a question of tampering, the holdings of
one library could be compared with others. (Libraries seem to be
providing quite a lot of value to publishers; perhaps publishers
should be paying libraries, instead of the other way round?)
In contrast, a typical electronic journal sits behind a paywall. The
journal maintains the canonical copy; if it has mirrors or backups,
the rest of the world is none the wiser. Libraries do not have copies
of the electronic documents. Readers have to go through the publisher
to read articles. Sometimes, authors are allowed to put a copy of
their article on their web sites, but very often they cannot put up
the “official” version.
The replication in this system, if any, is unsatisfactory. Libraries
are not partners in the critical task of preservation. There are not
multiple, independent copies of publications that can be compared for
validation.
The problem here is not with the nature of electronic documents, but
rather with publishers who are afraid of electronic documents. Open
access publishers generally do better. The ArXiv is run by a library
and has mirrors at other libraries. PLoS ONE has at least one mirror,
and PLoS is experimenting with aggregating content published in other
journals.
Open access enables replication, and replication is the essential
ingredient of preservation. Therefore open access journals should
emphasize this advantage over closed access journals, and they should
pursue partnerships with libraries. (They should also start using
cryptographic hashes to help authenticate replicated documents, and
abandon PDF, a historical relic.)
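Comparing independent copies does not even require a trusted master. The following sketch hashes each holder's copy and reports the ones that disagree with the majority. The holder names and the majority rule are assumptions for illustration, and the function expects at least one copy.

```python
import hashlib
from collections import Counter

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a document as a hex string."""
    return hashlib.sha256(data).hexdigest()

def suspect_copies(holdings: dict) -> list:
    """Return the names of holders whose copy differs from the majority."""
    digests = {holder: digest(data) for holder, data in holdings.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(h for h, d in digests.items() if d != majority)

article = b"Archival journals require open access.\n"
holdings = {
    "library_a": article,
    "library_b": article,
    "library_c": article.replace(b"require", b"forbid"),  # an altered copy
    "publisher": article,
}
suspects = suspect_copies(holdings)
print(suspects)
```

Agreement among the copies is evidence rather than proof, but it is evidence that a single canonical copy behind a paywall cannot offer.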
Paper journals are on their way out. Nowadays a publication is more
than just text—there is audio, there is video, there is source
code, there are data sets. Electronic publications run by closed
access publishers are also on their way out—they aren't
archival. Open access electronic journals are the only way forward.
Paper harms scholarship
I've just finished up a conference submission, which means I've spent
quite a bit of time forcing my paper into compliance with the
publisher's formatting requirements:
Papers must adhere to the standard ACM conference format: two
columns, nine-point font on a ten-point baseline, with columns 20pc
(3.33in) wide and 54pc (9in) tall, with a column gutter of 2pc
(0.33in).
Furthermore, papers must be at most 12 pages, including bibliography
and figures.
Is there any doubt that these requirements actively harm scholarship?
Anyone who has tried to comply with them understands that:
- They force authors to spend a great deal of time eliminating
widows and orphans, and otherwise trimming text according to the
geometry of the page, regardless of whether that text could better
explain the research.
- Including the bibliography in the page limit makes it impossible
to properly cite the literature. In practice, this requirement
means that papers written by program committee members will be cited
(otherwise, the paper stands no chance), but many relevant papers
must be omitted.
- A nine-point font is almost unreadable on paper, the narrow
margins leave no room for annotations when reading, and a two-column
format is difficult to read on an electronic device.
A fair reviewing process requires some limit on the length of
submissions, but basing it on a number of US letter-sized pages is
archaic. On top of that, all of these requirements are driven by the
desire to keep the cost of printing low, at a time when I, for one,
don't even want printed documents.
I once sat on a plane on the way to a conference next to a colleague
who wrote up a submission for a different conference in pencil,
during the plane ride. The submission was accepted. I'm sure there
is some sensible way to fairly enforce a limit on the length of
submissions that does not require them to be printed out in a
nine-point font.
Some people argue that the best (only?) way to read a paper is to
print it out. But of course a paper that has been formatted for print
is best read in print! That has no bearing on how we format future
papers. Going forward, academia should abandon paper.