Finally, with the X2012 conference out of the way, the summer holidays a distant memory and the last of the grant proposals I was to submit before the end of September also done, it is time to breathe out, get a cup of coffee and enjoy some mathematics. Not the really difficult type, obviously, but the type that is interesting and useful (to my immediate surroundings). This resulted in a publication in the Annals of Occupational Hygiene entitled “The Use of Benford’s Law for Evaluation of Quality of Occupational Hygiene Data”, for which the abstract can be found here and, if you have access, the full paper can be downloaded here.

So what was it all about? Well, a long, long time ago (1881 to be more precise) in a far, far away country (well, the US) there lived a man called Simon Newcomb. Simon was an astronomer with, as it appears, enough time on his hands to wonder why the pages near the front cover of his book of logarithm tables (I know…people did strange things in the old days) seemed to be used more often than the pages towards the end of the book. Apparently, people needed to look up numbers starting with a “1” more often than numbers starting with other digits. Clearly still having some time on his hands, he proposed a mathematical law that everybody ignored.

Fast forward more than five decades to 1938 and to a man with the beautiful name Frank; Frank Benford that is. Frank was an American physicist who must have had an interest in ignored mathematical laws, since he dusted off Newcomb’s law, tested it on 20 wildly different datasets, and published it. Somewhere in the process the law was re-branded to Benford’s law.

Essentially, what this law states is that in many (but not all) large datasets of real-life processes the first digits 1 to 9 each have a specific probability of occurring, with low digits being more common than high ones. Have a look at the following figure I nicked off Wikipedia (link).

Similar figures can be made for the second digit, the third, and so on, where the distribution differs for each digit position.
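The proportions in such figures come from a simple formula: the probability that the first digit equals d is log10(1 + 1/d), and the second-digit probabilities follow by summing over all possible first digits. A minimal R sketch of this (the function names `benford_first` and `benford_second` are made up here for illustration and do not come from the paper):

```r
# Expected Benford probability that the first digit equals d (d = 1..9)
benford_first <- function(d = 1:9) {
  log10(1 + 1 / d)
}

# Expected Benford probability that the second digit equals d (d = 0..9),
# obtained by summing over all possible first digits k = 1..9
benford_second <- function(d = 0:9) {
  sapply(d, function(x) sum(log10(1 + 1 / (10 * (1:9) + x))))
}

round(benford_first(), 3)
# 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046

round(benford_second(), 3)
```

The second-digit distribution is much flatter than the first-digit one, running from roughly 0.12 for a 0 down to roughly 0.085 for a 9.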

It’s very interesting, and somewhat counter-intuitive, that this is in fact the case, and also that it applies to so many, widely different sets of data. For example, Frank Benford in his original work tested it for the surface areas of US rivers, the sizes of US populations, a number of physical constants, molecular weights, entries from a mathematical handbook, numbers contained in an issue of Reader’s Digest, street addresses and death rates. Apparently, in 1995 the underlying maths was proven using mixed distributions (link), but let’s not get lost in details…

Just the fact that random processes in nature behave this way is, I think, amazing as well as…well, weird. I don’t seem to be the only one thinking this, since there is a vast amount of stuff available on the internet. For example, Ben Goldacre on his *Bad Science blog* has been discussing this (here) and showed an example where “the economy of Greece showed the largest and most suspicious deviations from Benford’s law of any country in the Euro”. I am not going to comment on macroeconomic events, but hmmm…anyone surprised?

I’ve also found a website called “Testing Benford’s Law” (link), which basically does what you would expect from a website with such a name, on as many datasets as possible, and the collection is ever expanding. So at the moment you can find that the *population of Spanish cities, Twitter users by followers, most common iPhone passcodes, distance of stars from the earth, loan amounts on kiva.com, total numbers of print materials in US libraries, file sizes in the Linux 2.6.39.2 source tree* (whatever that means), *the population of Mexican counties, Stack Overflow user reputation* (again, no clue), *line counts of the Rails 3.0.9 core source code* (anyone???), *UK Government spending* and *2011 Russian parliamentary elections results* all approximately follow Benford’s law. It’s amazing(ly weird).

So what does this have to do with the normal things I am writing about on this blog?

It’s interesting to think about what it means if a dataset does not follow Benford’s Law. In essence there are 4 reasons why this could occur of which the first two are not that exciting:

1. The data do not follow Benford’s law

2. The data in general really do follow Benford’s law, but due to random chance this particular set of observations does not (type I error)

However, if we have thought about the data and have excluded both of these first 2 reasons, then the other explanations are:

3. There is a reasonable explanation for the deviation from Benford’s law

4. Some of the measurements in the data set are fraudulent, or others have been omitted.
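To get a feel for reason 2, here is a small simulation sketch in R. It draws first digits that follow Benford’s law exactly and counts how often a chi-square test at the 5% level flags a deviation anyway. The seed, the number of simulations and the sample size of 500 are arbitrary choices made for this illustration, and the digits are simulated, not real measurements.

```r
set.seed(42)  # arbitrary seed, only so the illustration is reproducible

# Benford's expected first-digit probabilities
p_benford <- log10(1 + 1 / (1:9))

n_sim <- 2000  # number of simulated datasets (assumption)
n_obs <- 500   # observations per dataset (assumption)

# For each simulated dataset: draw Benford-distributed first digits,
# count them, and record whether the chi-square test rejects at 5%
rejected <- replicate(n_sim, {
  digits <- sample(1:9, n_obs, replace = TRUE, prob = p_benford)
  counts <- as.vector(table(factor(digits, levels = 1:9)))
  chisq.test(counts, p = p_benford)$p.value < 0.05
})

# Share of perfectly "honest" datasets that were flagged by chance
mean(rejected)
```

The share should come out close to 0.05, which is exactly what a 5% significance level means: about 1 in 20 datasets that truly follow Benford’s law will still look suspicious.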

As discussed in the paper, this makes it very interesting for use with environmental exposure datasets. Brown earlier showed (link) that pollutant concentrations in ambient air from monitoring networks should follow Benford’s Law (if they are large enough) and similarly so should occupational hygiene datasets.

Of course, we would all like to be one of the film noir detectives, using specialist methodology and our brilliant, investigative yet under-appreciated brains to unmask this undercover plot by devious individuals who attempt to hide high exposure levels of dangerous chemicals produced in their factories by deleting certain measurements, changing the numbers, or making up datasets (or maybe that’s just me…). And yes, Benford’s Law is suitable for this.

More likely, however, routinely scanning any datasets one has to work with for compliance with Benford’s Law will flag deviations, which leads to thinking about them, doing some further interrogation of the data, and ending up at number 3: there is a reasonable explanation for the deviation from Benford’s law…

For example, while coding data you (or of course more likely, someone else) made an error, there is an error in the scripts used to recode data from one format to another, something Microsoft developed surprisingly didn’t do as intended (spot the sarcasm) or, as we showed in the paper for some real datasets of rubber industry measurements, a decision was made to replace all measurements below the limit of detection with a constant (the limit of detection divided by sqrt(2) in this particular case).
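The effect of such a substitution is easy to see in a simulation. The sketch below is only an illustration under stated assumptions: the exposure data are simulated from a log-normal distribution (not the rubber industry data from the paper), the limit of detection of 0.5 is hypothetical, and the helper names `first_digit` and `digit_share` are invented here.

```r
set.seed(1)  # arbitrary seed, only so the illustration is reproducible

# Simulated (not real) exposure data: log-normal with a wide spread,
# which approximately follows Benford's law
x <- rlnorm(5000, meanlog = 0, sdlog = 2)

# Hypothetical limit of detection; every value below it is replaced
# by the constant LOD / sqrt(2)
lod <- 0.5
x_sub <- ifelse(x < lod, lod / sqrt(2), x)

# First significant digit, read from scientific notation (e.g. "3.5e-01" -> 3)
first_digit <- function(v) {
  as.numeric(substr(sprintf("%.10e", abs(v)), 1, 1))
}

# Observed proportion of each first digit 1..9
digit_share <- function(v) {
  as.vector(table(factor(first_digit(v), levels = 1:9))) / length(v)
}

# Compare Benford's expectation with the data before and after substitution
comparison <- rbind(benford     = log10(1 + 1 / (1:9)),
                    original    = digit_share(x),
                    substituted = digit_share(x_sub))
colnames(comparison) <- 1:9
round(comparison, 3)
```

Because 0.5 / sqrt(2) is about 0.354, every substituted value gets a first digit of 3, so the substituted data show a large spike at 3 that the original simulated data do not have. A real dataset with this pattern deviates from Benford’s law for a perfectly reasonable (and documentable) reason.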

Maybe not as glamorous as unmasking (inter)national occupational hygiene measurement manipulation terrorism, but nonetheless pretty important; and don’t forget the other buzz words: it’s free, easy, and quick!

It only takes Excel to use this yourself. For example, a *step-by-step guide* on how to do this can be found here.

For those using **R software**, like me, here are some R scripts that should help you get on your way:

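*# function to extract the first significant digit (also works for values below 1)*

*benford <- function(k){*

*as.numeric(substr(sprintf("%.10e", abs(as.numeric(k))), 1, 1))}*

*first.digit <- sapply(variable, benford)*

*# zeros and missing values have no first digit, so drop them*

*first.digit <- first.digit[!is.na(first.digit) & first.digit > 0]*

*# Benford’s expected distribution of the first digit*

*benford.1st <- c(0.301,0.176,0.125,0.097,0.079,0.067,0.058,0.051,0.046)*

*# observed counts for digits 1 to 9 (digits that never occur are kept as 0)*

*observed <- as.vector(table(factor(first.digit, levels = 1:9)))*

*# Test*

*chisq.test(observed, p = benford.1st, rescale.p = TRUE)*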

So yes, I think this is a great and straightforward method to improve the quality of the data we use, and I hope we will use Benford (with or without the scripts provided above) to routinely check the data we create, are given and/or handle.

**Also, if you do find some interesting things, please let me know. I’d be very interested!**

*Environmental Health, General, Occupational Health, Public Health, R tips and tricks*

Posted on October 11, 2012