Wednesday, January 25, 2012

Reproducibility of Galacticus Models (and any other kind of model)

The blog has been a little too quiet recently as I've been focussing on a couple of new science results which should be finished in the next month or two. But, here's something from a side-project that grew out of a conversation with Matthew Turk over on Google+.

I suspect that 99.999% of all scientists have had this experience: You do some calculation and you make a figure showing the results. All good stuff. Then, six months later, after you've finished teaching for the year, you come back to that calculation and check that you've remembered how to do it by remaking that figure. But this time, it doesn't look quite the same...

This is a common example of the problem of reproducibility in science. Often, the final output is the result of a long and complicated chain of calculations, in which many different choices had to be made at each step. Sure, we should keep careful notes of how every step proceeded and all of the choices made, but... sometimes we forget.

So, why not have all of the relevant information stored automatically? It seems like an obvious idea. The next question is where to store it. The Google+ discussion I mentioned above led to the idea of using the fact that many image formats (e.g. JPG, PDF, etc.) can store arbitrary metadata alongside their graphical contents. (In JPEGs, for example, this is often used to store information about the camera settings used to take the photo.) Making use of this facility, it's possible to store all relevant metadata required to recreate a figure derived from Galacticus - and to store it in the same file as the figure. That way, the metadata is never separated from the figure and never gets lost. If those figures form part of a scientific publication that gets uploaded to, then the instructions necessary to reproduce the result are stored forever.
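To make the idea concrete, here is a minimal Python sketch of embedding metadata in an image file and reading it back. It is an illustration only, not the Galacticus implementation. It uses PNG rather than JPG or PDF because a PNG text chunk (tEXt) can be written with nothing but the standard library. The keyword "provenance", the JSON encoding, and all function names are assumptions made up for this example.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then a CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_metadata(png: bytes, keyword: str, metadata: dict) -> bytes:
    """Return a copy of the PNG with a tEXt chunk inserted just before IEND."""
    # json.dumps escapes non-ASCII by default, so the text is valid Latin-1.
    text = json.dumps(metadata, sort_keys=True)
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    out = PNG_SIGNATURE
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            out += make_chunk(b"tEXt", payload)
        out += make_chunk(chunk_type, data)
    return out


def extract_metadata(png: bytes, keyword: str) -> dict:
    """Read back the JSON stored under the given tEXt keyword."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, text = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return json.loads(text.decode("latin-1"))
    raise KeyError(keyword)


def tiny_png() -> bytes:
    """A 1x1 grayscale PNG, standing in for a real figure."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))
```

Calling embed_metadata(tiny_png(), "provenance", {...}) and then extract_metadata on the result should return the same dictionary. Image viewers ignore the extra chunk, so the figure itself is unchanged.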

So, today's revision of Galacticus includes some functionality to achieve this. When a figure is created, a whole bunch of information is now stored inside it, including:
  • the specific version and revision of Galacticus used;
  • the entire source changeset (i.e. any differences between the source code used to compile Galacticus for this model and the mainline branch of Galacticus) - this now gets stored in the main Galacticus output file too (thanks to Matt for this suggestion);
  • library and compiler versions, and Makefile options used to build Galacticus;
  • all input parameters passed to Galacticus;
  • the complete analysis script used to create the figure;
  • the UUID of the Galacticus model from which the results were derived.
There's also a tool provided which will extract this information from files, list the commands needed to recreate the exact version of Galacticus used, and generate a suitable input parameter file to reproduce the original model.
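As a rough sketch of what such a record and the parameter-file step might look like, the Python below gathers the items listed above into one dictionary and regenerates a parameter file from it. This is not the actual Galacticus tool. The field names, the function names, and the simple XML layout of name/value pairs are all assumptions chosen for illustration.

```python
import json
import uuid
import xml.etree.ElementTree as ET


def build_record(version, revision, changeset, build, parameters, script,
                 model_uuid=None):
    """Collect the provenance items into one JSON-serialisable dict."""
    return {
        "version": version,
        "revision": revision,
        "changeset": changeset,          # diff against the mainline branch
        "build": build,                  # compiler, libraries, Makefile options
        "parameters": dict(parameters),  # all input parameters of the model
        "analysis_script": script,       # full text of the plotting script
        "model_uuid": model_uuid or str(uuid.uuid4()),
    }


def parameter_file(record):
    """Regenerate an input parameter file from a stored record.

    The XML layout here is a placeholder, not a documented format.
    """
    root = ET.Element("parameters")
    for name, value in sorted(record["parameters"].items()):
        entry = ET.SubElement(root, "parameter")
        ET.SubElement(entry, "name").text = name
        ET.SubElement(entry, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Because the record is plain JSON, json.dumps(record) gives a string that can be stored in a figure's metadata, and json.loads recovers it for parameter_file later on.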

I think this is a really useful way to improve the reproducibility of results - and the ideas are applicable to any scientific work. I'll be advocating loudly for adopting a similar approach in other scientific tools!