Strong Typing For Statistics

Background

Twenty years ago, the fashion in programming languages was strongly in favor of loosely typed languages. Some languages went as far as automatically promoting numeric strings to numbers in numeric context; Perl is a prime example of how loose “typing” could be. Pundits predicted disasters; in practice, it wasn’t horrible, but it did limit opportunities for static analysis and optimization, and it introduced security issues.

These days, the fashion is for strongly typed languages, but with significant support from the compiler or interpreter, which infers most types and does the type checking for you. This gets the best of both worlds: the speed and ease of writing in a loosely typed language, with the correctness and optimizations of a strongly typed language. Swift and Hack are two prime examples of this “modern synthesis.”

R is a pretty good example of a loosely typed language, and once you start looking into the details of the S3 object model, it’s even looser. On many levels, this is great; it means it’s fast and easy to explore data and to build new approaches to analysis. However, I think it also severely restricts opportunities for static analysis.

Static Analysis

There is nothing in R (presently) that prevents you from doing something like this:

result <- mean(df$favorite_color)

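However, this is wrong from a statistics perspective; $favorite_color is categorical data, and taking the mean here just doesn’t make sense. (R will return NA with a warning if the column is a character vector or a factor, but if the colors are stored as integer codes it happily returns a number.) It would be nice to have a type system related to statistical analysis, so that you would get errors (warnings?) when you do something that is considered bad practice.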
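As a rough illustration of what such a check could look like, here is a minimal sketch built on R's S3 dispatch. The `nominal()` constructor and the `mean.nominal()` method are hypothetical names invented for this example, not part of any existing package.

```r
# Hypothetical constructor: tag a vector as nominal (categorical) data.
nominal <- function(x) {
  structure(as.character(x), class = "nominal")
}

# S3 method: mean() dispatches here for "nominal" objects and refuses to run.
mean.nominal <- function(x, ...) {
  stop("mean() is not defined for nominal (categorical) data", call. = FALSE)
}

df <- data.frame(id = 1:4)
df$favorite_color <- nominal(c("red", "blue", "red", "green"))

# The mistake now fails loudly; tryCatch() captures the message for inspection.
outcome <- tryCatch(
  mean(df$favorite_color),
  error = function(e) conditionMessage(e)
)

# Ordinary numeric columns are unaffected.
mean_id <- mean(df$id)
```

This is a runtime check rather than true static analysis, but it shows the basic idea: the statistical meaning travels with the data, and operations can consult it.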

Now, “bad practice” is a relative term; it’s fairly common to take an average of an ordinal measure. It would be interesting to see if such a package could be written in a way that makes it easy for people to customize their own “best practice”; teams could make decisions about the level of statistical “purity” they were willing to accept. (From a software design standpoint, this would be a fascinating endeavour.)
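One way a team-level setting might work is through R's `options()` mechanism. The sketch below is only an assumption about how it could be done: the `ordinal()` constructor and the option name `stattypes.ordinal_mean` are made up for illustration.

```r
# Hypothetical constructor: integer codes plus the ordered labels they stand for.
ordinal <- function(x, levels) {
  structure(match(x, levels), labels = levels, class = "ordinal")
}

# The policy is read from a (hypothetical) option: "allow", "warn", or "error".
mean.ordinal <- function(x, ...) {
  policy <- getOption("stattypes.ordinal_mean", default = "warn")
  msg <- "mean() of ordinal data assumes equally spaced levels"
  if (policy == "error") stop(msg, call. = FALSE)
  if (policy == "warn") warning(msg, call. = FALSE)
  # Drop the class and attributes, then average the underlying codes.
  mean(as.integer(unclass(x)), ...)
}

satisfaction <- ordinal(
  c("low", "high", "medium", "high"),
  levels = c("low", "medium", "high")
)

# A lenient team allows the mean of the codes 1, 3, 2, 3.
options(stattypes.ordinal_mean = "allow")
lenient <- mean(satisfaction)

# A strict team turns the same call into an error.
options(stattypes.ordinal_mean = "error")
strict <- tryCatch(mean(satisfaction), error = function(e) "refused")
```

A real package would probably want something richer than a single option per operation, but a global policy with a sensible default is the simplest thing that lets teams choose their own level of strictness.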

What This Misses

Although this helps, it’s not a magic bullet. The hard part here is coming up with the right set of “types,” and then scaffolding novices to use the types properly. (An interesting elaboration here would be error messages that are actually useful, e.g., for the earlier example of taking the average of a categorical variable, something like “A better measure of ‘average’ (central tendency) for categorical variables is the mode.”)
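To make the idea of helpful messages concrete, here is a small sketch that builds a message with a suggestion attached. The `suggestions` table, `misuse_message()`, and `stat_mode()` are all hypothetical names. The helper is needed because base R's `mode()` reports an object's storage mode, not the statistical mode, which is the usual measure of central tendency for unordered categories.

```r
# Hypothetical lookup: which summary to suggest for each measurement level.
suggestions <- list(
  nominal = "the mode (most frequent value)",
  ordinal = "the median"
)

# Build an error message, appending a suggestion when one is known.
misuse_message <- function(op, level) {
  hint <- suggestions[[level]]
  msg <- sprintf("%s() is not meaningful for %s data.", op, level)
  if (!is.null(hint)) msg <- paste(msg, "Consider", hint, "instead.")
  msg
}

# Statistical mode: the most frequent value(s); ties return every winner.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

message_text <- misuse_message("mean", "nominal")
favorite <- stat_mode(c("red", "blue", "red", "green"))
```

Keeping the suggestions in a plain table, separate from the checks themselves, would also make it easy for a team to reword them for their own audience.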

The next step is probably to think hard about how the type system would work. There is definitely a notion of categorical, ordinal, interval, and ratio data, and then time and location, but what else should there be? I’m still trying to think it through.
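As a starting point, the four classic measurement levels can be treated as an ordered ladder, with each summary statistic permitted from some rung upward. The sketch below assumes the conventional assignments (mode for nominal, median for ordinal, mean for interval, geometric mean for ratio); the names `minimum_level` and `is_permitted()` are hypothetical, and time and location are left out because it is not obvious where they belong.

```r
# The classic levels of measurement, from weakest to strongest.
levels_of_measurement <- c("nominal", "ordinal", "interval", "ratio")

# Lowest level at which each summary is conventionally considered meaningful.
minimum_level <- c(
  mode = "nominal",
  median = "ordinal",
  mean = "interval",
  geometric_mean = "ratio"
)

# A summary is permitted when the data's level is at or above its minimum.
is_permitted <- function(op, level) {
  have <- match(level, levels_of_measurement)
  need <- match(minimum_level[[op]], levels_of_measurement)
  have >= need
}

mean_on_ordinal <- is_permitted("mean", "ordinal")      # FALSE
median_on_ordinal <- is_permitted("median", "ordinal")  # TRUE
mean_on_ratio <- is_permitted("mean", "ratio")          # TRUE
```

A strict ladder like this is almost certainly too simple (it has no place for circular data such as time of day, for instance), but it gives the design something concrete to push against.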
