December 19, 2012

Lately at work, I’ve been tinkering with ways to get R to do a lot of what we currently do using Gauss.  The reasons for using R are numerous: It’s cross-platform, open-source, and highly flexible.  Gauss, on the other hand, is proprietary, expensive, PC-only, and quite rigid in how we use it: A specific set of routines is hard-coded into a script, and the user is completely limited by pre-set options.  Moreover, most of the routines that we use regularly were first developed in the late 1970s and early 1980s and written in the Gauss language by Hector Neff in the late 1980s and early 1990s.  Since then, the routines have remained essentially the same, though they have been modified by a number of folks at the laboratory here to fix bugs and keep up with changes in the underlying Gauss environment.

Needless to say, there have been a lot of advances in computing technology, statistical methodologies, and the ability to address “big data”.  Bayesian statistics, for example, were rarely practical 30 years ago because computer systems simply could not handle the calculations they require.  That’s not to say that the routines we use…er…routinely are not still valid.  Rather, it just means that there are additional ways in which we can examine data, and that these may lead to a greater understanding of the underlying patterns.

Though it’d be pretty fantastic to write a whole new set of routines allowing the Bayesian analysis of geochemical data, my first project here has far simpler goals.  At some point, my boss modified the statistical routines we use to replace missing values in a dataset.  Previously, missing values were replaced using an iterative least-squares regression process that produced unique values for each specimen based on the other known values.  I’m not sure why, but this got changed a few years ago to replacing missing values with the average (mean) of the entire dataset in which a specimen was located.  I’ve noticed that this leads to some pretty bizarre behavior in a dataset and, especially when using canonical discriminant analysis or principal component analysis, potentially erroneous results.
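
Just to make the contrast concrete, here is a minimal sketch of what dataset-wide mean replacement amounts to in R; the function name and the data frame it expects are my own placeholders, not anything taken from our Gauss code.

    # Hypothetical sketch of dataset-wide mean replacement (not our actual
    # Gauss routine): every missing value for an element is replaced with
    # that element's mean across the whole dataset.
    replace_with_means <- function(dat) {
      for (col in names(dat)) {
        if (is.numeric(dat[[col]])) {
          gaps <- is.na(dat[[col]])
          dat[[col]][gaps] <- mean(dat[[col]], na.rm = TRUE)
        }
      }
      dat
    }

Every specimen with a gap in a given element receives the identical value, which pulls those specimens toward the centroid of the dataset and mutes real variation, and that strikes me as one likely source of the odd behavior described above.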

One interesting development in statistics and computing over the past 30 years is the use of bootstrapping (a resampling technique used to obtain summary estimates), and I recently came across “Amelia II”, an R package developed by some folks at Harvard and UCLA that uses a bootstrapping algorithm to “multiply impute” missing values.  At its most basic level, the software uses bootstrapping to produce a predefined number of datasets in which missing data have been replaced using an iterative regression process.
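
To give a sense of what this looks like in practice, here is a minimal, hedged sketch of calling Amelia from R.  The data frame my_data is a placeholder for a table of numeric elemental concentrations with NA values, and m = 5 is simply the package’s default number of imputed datasets; this isn’t the exact script I ended up writing.

    # Minimal sketch: "my_data" is a hypothetical data frame of numeric
    # elemental concentrations containing NA values.
    library(Amelia)

    # Produce m = 5 bootstrapped, imputed versions of the dataset
    a.out <- amelia(my_data, m = 5)

    # a.out$imputations is a list of m completed data frames; one simple way
    # to get a single set of replacement values is to average across them
    filled <- Reduce(`+`, a.out$imputations) / length(a.out$imputations)

Strictly speaking, multiple imputation is meant to carry all m completed datasets through the analysis and combine the results, but averaging across the imputations is a convenient shortcut when the goal is just to fill gaps in a table.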

So, I figured this might make an interesting package to use for replacing values in my own datasets.  There are other packages out there for doing similar things with explicitly compositional data; however, I’ve not had much luck figuring out exactly how to get them to do what I need.

So far, the script seems to work brilliantly.  I tested it out with a dataset of 250 clay specimens: I arbitrarily removed some values, imputed replacements, and found that the imputed values were often quite close to the original measurements (within the detection limits of our analytical techniques).  It’s certainly not perfect, but it’s the first real script I’ve written in R that does exactly what it’s supposed to do.
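
For the curious, the general shape of that test looks something like the sketch below; the 5% removal rate, the object names, and the matrix bookkeeping are my own choices for illustration rather than the script itself.

    # Rough sketch of the test: remove a fraction of known values at random,
    # impute them with Amelia, and compare the results to the originals.
    library(Amelia)
    set.seed(42)

    complete <- as.matrix(na.omit(my_data))  # keep only fully observed specimens
    holes <- complete
    drop <- sample(length(holes), size = round(0.05 * length(holes)))
    holes[drop] <- NA                        # knock out roughly 5% of the values

    a.out <- amelia(as.data.frame(holes), m = 5)
    filled <- Reduce(`+`, a.out$imputations) / length(a.out$imputations)

    # How far do the imputed values fall from the measurements we removed?
    summary(as.matrix(filled)[drop] - complete[drop])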

You can download the script on my Web site.