Convert untrusted PDFs to trusted PDFs

February 28, 2013

This post by Joanna Rutkowska describes a nice technique for “disarming” malware. I’ve seen it used before, but it doesn’t seem to be widely known, or, at least, not widely applied.

Rutkowska applies it to malware in PDF files, which has been a persistent threat over the last couple of years (like this week’s MiniDuke). The problem (of course) is in parsing:

Anyway, the fundamental problem with the approaches mentioned above, is that all of them require parsing of the original PDF file. And parsing is where the “big bang” usually happens. Parsing is where our, normally pretty decent, code, comes in close, intimate contact with some unknown complex input data, which often leads to a successful abuse or exploitation.

So she parses the PDF within a sandbox (a Qubes virtual machine) and produces a much simpler, “equivalent” PDF:

The idea is that our parser (that runs in a Disposable VM) will be expected to return the Simple Representation of the original PDF. Of course, it might very well go wild (as a result of exploitation by the PDF it parses), and don’t obey our expectations, and instead return something totally different and potentially malicious. But that doesn’t matter! The whole point of the Simple Representation is that it should be, well… simple to parse it safely and discard in case what we’re getting doesn’t look like the Simple Representation.

Ok, so what’s the simplest possible representation of an arbitrary PDF file? Yes, it’s the RGB format, which is essentially just a raw array of RGB values for each pixel. In fact, I’m not sure there could be anything simpler in the Known Universe to represent a PDF file…

Very nice. Now you can have a PDF viewer that only knows how to render a PDF in Simple Representation, which is much easier to parse; such a viewer would itself be much simpler and easier to get correct.
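To make the "simple to parse safely" point concrete, here is a minimal sketch in Python of what the trusted side's check could look like. The wire format is my own assumption for illustration (an 8-byte header with width and height as big-endian 32-bit integers, followed by raw RGB bytes), as are the names `parse_simple_representation`, `HEADER` and `MAX_DIM`; it is not the format Rutkowska's converter actually uses.

```python
import struct
from typing import Tuple

# Hypothetical wire format, assumed for this sketch only:
# width and height as big-endian uint32, then width*height*3 raw RGB bytes.
HEADER = struct.Struct(">II")
MAX_DIM = 10_000  # hypothetical upper bound on page width/height in pixels


def parse_simple_representation(blob: bytes) -> Tuple[int, int, bytes]:
    """Return (width, height, pixels), or raise ValueError if blob is malformed."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    width, height = HEADER.unpack_from(blob, 0)
    # Reject absurd dimensions before doing any arithmetic with them.
    if not (1 <= width <= MAX_DIM and 1 <= height <= MAX_DIM):
        raise ValueError("dimensions out of range")
    pixels = blob[HEADER.size:]
    # The only structural rule: exactly three bytes per pixel, nothing more.
    if len(pixels) != width * height * 3:
        raise ValueError("pixel data length does not match dimensions")
    return width, height, pixels
```

Whatever the sandboxed parser sends back, the trusted side only ever runs these few checks: every byte value is a legal colour component, so there is nothing else to interpret, and anything that does not fit is discarded.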

I’ve seen exactly the same approach used to protect against script injection attacks, in the paper BluePrint: Robust Prevention of Cross-site Scripting Attacks for Existing Browsers, by Ter Louw and Venkatakrishnan. In that case, the idea was to render HTML and produce a simpler form of HTML as output.

This is also related to proof-carrying code and certified evaluation. For example, in my work on SD3, I evaluate security policy queries and produce as output an answer along with a simple, checkable proof.