Hashing FTW
February 29, 2012  

I was surprised to discover recently that my research organization is not able to issue technical reports. When we were part of Bell Laboratories, we had the Computer Science Technical Reports, which date from the early days of Unix. Some of my favorites:

CSTR 31: The C Programming Language
CSTR 32: Yacc: Yet Another Compiler-Compiler CSTR 57: MAKE: A Program for Maintaining Computer Programs
CSTR 68: AWK: A Pattern Scanning and Processing Language

The CSTR series seems to have fizzled out in 1993. Alcatel-Lucent, which acquired most of Bell Laboratories in a series of spinoffs and mergers, apparently doesn’t maintain web access to these anymore, but I was able to find a 2008 snapshot at the Internet Archive. (UPDATE: they’re back.)

It’s sad to see these important historical documents neglected. It’s also a reminder that even though many of us are fed up with traditional publishers, they actually do make some contributions to scholarship. They actively work to make their publications available (for purchase) on the web and in printed form in libraries. And, more subtly, they provide an authoritative naming scheme for publications (e.g., journal name, volume and issue number, title of article, and page numbers).

It’s this second issue that prevents me from issuing a technical report. At the moment there’s no one here in charge of issuing technical report numbers and saving an archival copy of the publication. The number would let others cite the work, and the archival copy would let others verify that they have the actual work and not a modified version.

Hashing to the rescue?

You might think that this is exactly the sort of problem that cryptographic hashes were meant to solve. This is almost right. A cryptographic hash of a file is effectively an authoritative naming scheme. It can be cited, and if you have a purported copy of the document, you can compute its hash to verify that it is, in fact, the right document. Moreover, hashes have the crucial property that you don’t have to rely on a separate authority to issue the hash, you can do that yourself.

If you look at any technical report, though, you’ll see one little advantage of the old way: the technical report number is included on the title page of the technical report itself. So, the techreport is self-identifying, which is rather nice: if you have it in your hands, you know how to cite it.

It seems impossible to get this property with a numbering scheme based on a cryptographic hash function. On the face of it, it would seem that you would have to (1) choose a hash, so you could use it in your report to identify the report, and then (2) write the rest of the report in such a way that the hash of the report is the same as what you chose in step (1). This amounts to computing a collision for a cryptographic hash function, which is supposed to be infeasible.

Nevertheless, it can be done. You are reading a document with SHA-1 hash to be determined, and as you can see, that hash appears several times in the document itself. Verify it for yourself:

    $ curl trevorjim.com/hashing-ftw/ | sha1sum
    to-be-determined -

Of course, I haven’t done the impossible. It’s a trick.

A two-part solution

Here’s how it works. The document you are reading is a web page. That means it’s written in HTML, and to read it you use a web browser, which does not show you the HTML, but rather, it renders the HTML. Among other things, rendering can involve evaluating JavaScript embedded in the page. (Unless you have disabled JavaScript, in which case, this won’t work at all.)

I use two scripts to get the hash of a web page into the rendered page. The first script calculates the hash and assigns it to a variable, __the_hash. Then a second script inserts it into the rendered page at the desired DOM elements. For the page you are reading I used the following script:

  ...
  <script>
  function doit() {
    var spans = document.getElementsByTagName('span');
    for (var i = 0; i < spans.length; i++) {
      spans[i].innerHTML = __the_hash;
    }
  }
  </script>
  </head>
  <body onload="doit()">
  ...

All that’s left is to calculate __the_hash.

Cheating

One way to calculate __the_hash is to cheat: just use an external script

  <script src="myhash.js"></script>

in your web page. The file myhash.js can be constructed after the web page itself, and it should just assign the hash to the variable __the_hash. The cheat here is that the web page is not self-contained. We can change the contents of myhash.js without changing the hash of the HTML file.

Of course, this would completely miss the point. The hash is supposed to verify that you are seeing the right document, but myhash.js can include any code we like, including code that will completely change the rendering of the page. The hash of the page can only achieve our goal if the page is self-contained.

Plan B

In order to make the page self-contained, we need to build a script into the page that can grab the contents of the page, including the script itself, and pass that through the hash function. Thanks to Microsoft, there is a convenient way of grabbing the contents of the page from a script in the page: innerHTML. So we could try something like this:

  __the_hash =
      hash(document.documentElement.innerHTML);

Unfortunately, I’ve found that innerHTML is kind of flaky. It isn’t consistent from browser to browser, or even between two versions of the same browser. And the innerHTML of a document is not guaranteed to be the same as the original HTML file. For a cryptographic hash, we need it to be exactly the same.

Self-hashing

It’s time to break out the oldest trick in the book: the self-reproducing program, i.e., a program that prints itself out. Self-reproducing programs seem useless, except as a programming exercise, but there are a few pragmatic applications. The best one that I know of is a bootstrapping compiler (see Thompson’s Reflections on Trusting Trust to see how this works).

We can solve our problem with a twist on self-reproducing programs: the self-hashing program. This web page is an HTML+JavaScript program that prints itself out, sends that output to the SHA-1 cryptographic hash function, and inserts the resulting hash into the web page as rendered by the browser.

I’m not going to explain how to write a self-reproducing program, since you can find good explanations elsewhere (Russ Cox’s Zip Files All The Way Down is a fun example). Once you’ve figured that out, it’s easy to turn your self-reproducing program into a self-hashing program. If you do try it yourself, you’ll probably find that, paradoxically, it’s easier to write a program that generates self-hashing programs than it is to write a standalone self-hashing program. That’s what I did: selfhash is a JavaScript program that takes an HTML file as input, and inserts a script that stores the SHA-1 hash of the input plus inserted script into the variable __the_hash. A second script (like the one above) in the input file can use __the_hash to insert the hash into the rendered page. Give it a try if you want to create your own self-hashing web pages.

I should point out that it is possible to do this with any document format that is Turing-complete, including print formats like PostScript; the hardest part is writing SHA-1 in PostScript. I don’t know whether it’s possible to write a self-hashing PDF file, as PDF is more restricted than PostScript, but in any case I don’t think we should write for paper anymore.

Should you self-hash?

I’m not seriously suggesting that we should all move to self-hashing documents. There are a variety of issues that make this impractical: users may disable JavaScript; search engines may not properly index documents that require JavaScript for rendering; documents may be composed of multiple files; we still need to verify that a document is self-contained; and, due to bugs and differences in browsers, your HTML+JavaScript web page is not guaranteed to evaluate the same way every time.

For these reasons and more, we can’t eliminate the need for the authority in an authoritative naming scheme. The best current option we have for such an authority is something like arXiv, because arXiv accepts submissions from anyone, and immediately gives you an authoritative number by an automated process.

One thing that arXiv does not do right, however, is that it does not rely on cryptographic hashes to verify document contents. This should be a requirement for an authority. As it stands, there is no way to detect whether an arXiv insider has modified a document, or whether a document has been modified because the site has been compromised. Put another way, without hashes we must still rely on arXiv to securely maintain the canonical copy of the document for distribution and verification. That’s too much power and responsibility to give to the publisher.