Math is better than data
October 15, 2015  

As I write this in the early part of the twenty-first century, the tech industry is in a love affair with data. Instead of making product decisions based on hunches or wishful thinking, we use A/B testing to gather data on the alternatives and we choose according to what the data shows.

The dark side of the data obsession is an insistence that every decision be backed up by data: you can’t make a move unless you prove that it’s worthwhile. Or so the thinking goes.

There are lots of pitfalls with this thinking. One in particular is that, sometimes, you don’t need data to prove your point. You can do better: you can use math.

Math, for example, tells us that the sum of the lengths of any two sides of a triangle is greater than the length of the third side. It would be extremely dumb to demand that someone go out and measure a bunch of triangles to “prove” or “disprove” this. If we take a measurement that violates the Triangle Inequality, it can only mean that we have a broken ruler, poor measurement skills, or a busted “triangle” with crooked sides.
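For reference, here is the Triangle Inequality written out. The labels a, b, and c for the three side lengths are my own, and the strict inequality assumes a genuine (non-degenerate) triangle:

```latex
\[
  a + b > c, \qquad b + c > a, \qquad a + c > b
\]
```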

And this brings me to the subject of safe and unsafe programming languages, which I can’t stop writing about (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13).

Many people don’t realize that the word “safe” is not just wishful thinking. “Safe” is a technical term with a mathematical meaning. The behavior of these languages can be written down mathematically and it can be proved that their behavior does not include memory errors.
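To make the guarantee concrete, here is a minimal sketch in Python, which I am using only as a convenient stand-in for a safe language. The helper name read_at is hypothetical. The sketch illustrates the guaranteed behavior; it is not a proof of it. Reading past the end of a list is not a memory error, because the language defines exactly what happens: an IndexError is raised.

```python
def read_at(buffer, index):
    """Return buffer[index], or a description of the error the language raises."""
    try:
        return buffer[index]
    except IndexError as exc:
        # The out-of-range access is caught by the language itself;
        # no adjacent memory is ever read.
        return f"IndexError: {exc}"


buffer = [1, 2, 3]

in_range = read_at(buffer, 1)        # a valid index
out_of_range = read_at(buffer, 10)   # far past the end of the list

print(in_range)
print(out_of_range)
```

In an unsafe language, the second read could silently return whatever happens to sit in memory past the end of the buffer. Here the result is the same defined error every time.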

Some people won’t believe this until they see the data. They want a comprehensive review of all the literature, vulnerability reports, etc., something which would require an immense effort. If the point is to show that safe languages do not have memory errors, that would be a waste. Math has already told us that.

This is not to say that data is useless. As with the Triangle Inequality, we may find a data point that seems to show that a safe language has a memory error—but math tells us that this would inevitably show a completely different problem. Namely, the safe language compiles to an unsafe language, or the safe language is interpreted by a program written in an unsafe language, or the safe language is calling a library written in an unsafe language. In other words, the data can help us find bugs in safe language implementations.

If safe language implementations can have memory errors, does this make their mathematical guarantee worthless? No, for two reasons.

First, fixing a bug in a safe language implementation makes all programs written in that language safer. This is the same reason it makes sense to use a well-tested and widely used library for a critical task rather than roll your own code.

Second, remember that we have been working on safe languages for more than 40 years, and we know a lot about how to make their implementations safe. In particular, we know how to create safe machine languages. That is, we can compile a safe high-level language into a safe machine language that can run directly on the hardware. Turtles all the way down, as they say. Or better yet: math all the way down.