Do not DOI: Searchable hashes instead
April 29, 2014  

I missed this bit of news from December: GitHub, Mozilla, and figshare are trying to enable academic citations for code. They do this by generating a DOI for code in a GitHub repository. This is an awful idea, backwards in every respect.

For those who don’t know, a DOI is a digital object identifier and it is used, badly, by publishers as a canonical reference for publications. The idea of having a canonical reference is good, but the reality is that publishers routinely botch the job.

For example, I recently mentioned an old paper of mine in a blog post. To get the link to the paper for my post, I started from my publications page, where I found that the link to the journal was dead; publishers don’t maintain their web sites well. After messing around with some web searches and clicking around the publisher’s site, I managed to find a link to the paper. That page claims that the DOI of the paper is this: http://dx.doi.org. That is not a DOI at all; it is a link to a web page where you can look up an article if you have its DOI. In other words, as far as I can tell, my publisher hasn’t assigned my article a DOI.

So not only are publishers bad at maintaining their publications online, they are bad at assigning DOIs and keeping them up to date. Furthermore, this paper—my paper—published in 2000, costs $45. FORTY-FIVE DOLLARS! What kind of value are they providing for this?

Now consider the start of this rant: software is becoming an important part of publications, so we need to have a canonical way to cite it in other academic work. But every version of software kept in a GitHub repository already has a canonical identifier, namely, its git commit hash. And this is better than a DOI because it is a cryptographic fingerprint: if you obtain the software you can actually verify that you have a pristine copy, one that has not been tampered with.
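
To make the verification claim concrete, here is a minimal sketch in Python of how Git fingerprints the contents of a single file (a "blob"), assuming the SHA-1 object format. A commit hash is built on the same principle, but it also covers the directory tree, the parent commits, and the commit metadata. The function names git_blob_hash and is_pristine are my own, chosen for illustration; they are not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a file with these contents.

    Git hashes the header "blob <size in bytes>\\0" followed by the raw
    contents, not the contents alone.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def is_pristine(content: bytes, cited_hash: str) -> bool:
    """Check a downloaded file against the hash given in a citation."""
    return git_blob_hash(content) == cited_hash.lower()


if __name__ == "__main__":
    # The empty file has a well-known object id in every SHA-1 Git repository.
    print(git_blob_hash(b""))
```

Changing even one byte of the file produces a different hash, which is what lets a reader check that the copy they obtained is the one that was cited.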

Really what we should be working towards is replacing DOIs with hashes. Instead of assigning DOIs to code repositories, we should be putting academic articles into git repositories and citing them by hash. Then we can abandon the DOI system (and publishers!) altogether.

There is one missing piece: given a hash we need to be able to find the repository. This should be easy: search engines should simply start indexing git repositories by hash. Strangely, they don’t already do so. Google, for instance, does not seem to be able to find GitHub repositories by hash. Even GitHub itself does not seem to index its repositories by hash. For example, 066f1e14c22b57791bb40bb783ab03b3cad9935b is the hash of one of my GitHub commits, and you can see it here on GitHub, but if you search for 066f1e14c22b57791bb40bb783ab03b3cad9935b on GitHub you get nothing.
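
The lookup I have in mind is not complicated. As a rough sketch, and not a description of how any real search engine works, an index from commit hash to the repositories containing that commit could be as simple as the following. The HashIndex class and the example.org URL are hypothetical, made up for illustration.

```python
from collections import defaultdict


class HashIndex:
    """Toy in-memory index from a full commit hash to repository URLs."""

    def __init__(self) -> None:
        self._repos_by_hash: dict[str, set[str]] = defaultdict(set)

    def add(self, commit_hash: str, repo_url: str) -> None:
        # The same commit can live in many clones and forks, so keep a set.
        self._repos_by_hash[commit_hash.lower()].add(repo_url)

    def lookup(self, commit_hash: str) -> list[str]:
        # Use .get so that a failed lookup does not create an empty entry.
        return sorted(self._repos_by_hash.get(commit_hash.lower(), ()))


if __name__ == "__main__":
    index = HashIndex()
    # Hypothetical repository URL; the hash is the one mentioned above.
    index.add(
        "066f1e14c22b57791bb40bb783ab03b3cad9935b",
        "https://example.org/hypothetical/repo.git",
    )
    print(index.lookup("066f1e14c22b57791bb40bb783ab03b3cad9935b"))
```

A crawler that already visits public repositories would only need to call add once per commit it sees; the hard part is the crawling, not the index.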