Friday, July 22, 2011

Reconstructing Other Models Using Galacticus

One of the original goals of the Galacticus project was to be able to reproduce the behavior of other semi-analytic models - and to do so not just approximately, but precisely and algorithmically (i.e. using exactly the same algorithms as those models). Why would you want to do this? Read on...

Semi-analytic models are complex - Galacticus is around 100 KLOC (kilo-lines of code), not a huge software package by commercial standards, but more than big enough that maintaining it and understanding all of its behavior takes a lot of work. Given this complexity, it's impossible to fully define such a model by describing the algorithms in journal articles - the traditional way of communicating how a calculation was performed. If done well, a description of this form can probably get close to fully defining a model, but there will always be details that are missed.

The best, and perhaps only, way to fully define the model then is through the source code - which is why Galacticus is Free and Open Source Software. Want to know how a specific calculation is implemented? Download the source and go look.

This matters because the scientific method rests on the idea of reproducibility of results - if you can't reproduce my results, then they're probably wrong. You could argue that any model which is so complex that you can't reproduce its results from the published description in a journal isn't really a robust physical model. But that's not the point! Whether you want to run my code to do your own calculations, or instead want to see precisely what I did so you can check whether your own implementation of the same calculation gives the same result, the ultimate, precise, complete definition of the algorithm is the source code itself (and, perhaps, the relevant programming language specification).

While Galacticus is Free and Open Source Software, other semi-analytic codes are not. But Galacticus is highly modular, so in principle we can implement the algorithms of those other models in Galacticus and it should give the same results that they do. This would be very useful for understanding differences between models, and allowing us to test the reproducibility of those results.

The problem, of course, is that typically all we have to go on is the published journal articles describing those other models... So, reconstructing them with precision might be impossible, and will certainly be very difficult. But, let's try anyway!

Specifically, I've been working on recreating the Baugh et al. (2005) model - it has some features which differ significantly from what was available in Galacticus, making it a suitably challenging test. I'd love to have a video montage to insert here, showing me coding while inspiring music plays... But I don't, so you'll have to imagine it instead.

The bottom line is that it took about three weeks to get this to work - the biggest challenges were introducing the concept of "halo formation events" into Galacticus, and precisely matching the adiabatic contraction algorithm (crucial since it determines the rotation speeds of galaxies, and star formation rates in the Baugh et al. model are proportional to the third power of the rotation speed). But, once those were done correctly, the results are very good. Here's a simple example showing, as a function of redshift, the mass density in galaxies and in stars in the original Baugh et al. model and in the reconstructed version using Galacticus.
You can see that the match is very good - although not perfect. There remain some differences - which can be traced to the fundamentally different way in which the two models handle timestepping - but overall, it works! Other quantities (e.g. sizes, luminosities etc.) of galaxies also look good.
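For anyone curious about what the adiabatic contraction step mentioned above involves, here's a minimal sketch of the classic Blumenthal et al. (1986)-style calculation, written in Python purely for illustration - it is not the Galacticus (or Baugh et al.) implementation, and the mass profiles and numbers in the usage example are invented. It just shows the fixed-point iteration that yields the contracted radius, and hence the rotation speed, once baryons have condensed at the halo center.

import numpy as np

G = 4.301e-6  # gravitational constant in kpc (km/s)^2 / Msun

def contracted_radius(r_i, M_dm, M_baryon, f_baryon, tol=1.0e-6):
    # Solve the Blumenthal et al. (1986) adiabatic contraction relation
    #   r_f [ (1 - f_baryon) M_dm(r_i) + M_baryon(r_f) ] = r_i M_dm(r_i)
    # for the final radius r_f of a dark matter shell initially at r_i,
    # by simple fixed-point iteration. M_dm and M_baryon are callables
    # returning enclosed mass [Msun] as a function of radius [kpc].
    M_i = M_dm(r_i)
    r_f = r_i
    for _ in range(100):
        r_new = r_i * M_i / ((1.0 - f_baryon) * M_i + M_baryon(r_f))
        if abs(r_new - r_f) < tol * r_i:
            return r_new
        r_f = r_new
    return r_f

def rotation_speed(r_f, r_i, M_dm, M_baryon, f_baryon):
    # Circular velocity [km/s] at the contracted radius.
    M_total = (1.0 - f_baryon) * M_dm(r_i) + M_baryon(r_f)
    return np.sqrt(G * M_total / r_f)

# Purely illustrative usage: an NFW halo plus an exponential disk.
def M_nfw(r, M_vir=1.0e12, r_s=20.0, c=10.0):
    x = r / r_s
    return M_vir * (np.log(1.0 + x) - x / (1.0 + x)) / (np.log(1.0 + c) - c / (1.0 + c))

def M_disk(r, M_d=5.0e10, r_d=3.0):
    return M_d * (1.0 - (1.0 + r / r_d) * np.exp(-r / r_d))

r_f = contracted_radius(8.0, M_nfw, M_disk, f_baryon=0.17)
print(r_f, rotation_speed(r_f, 8.0, M_nfw, M_disk, 0.17))

Because star formation rates in the Baugh et al. model scale as the third power of this rotation speed, even small differences in the contraction calculation show up clearly in the comparison - which is why it had to be matched precisely.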

So, while it's not easy, it is possible to reconstruct other models using Galacticus. Hopefully I'll repeat this process for some other published models - so that we can compare them on an equal footing. For now, if you want to run the Baugh et al. model using Galacticus, a suitable input parameter file is available here - works with revision 510 or later of Galacticus v0.9.0.

Thursday, July 21, 2011

Accuracy and Semi-Analytic Modelling

I submitted an article today (you can find it on arXiv here) which explores how the accuracy of semi-analytic models of galaxy formation is limited by the input dark matter halo merger trees.

Often, merger trees are extracted from N-body simulations of large scale structure. The usual approach in such simulations has been to dump out all of the particle information at a number of "snapshot" times. These are then post-processed to find dark matter halos which are then linked together to make merger trees. Those merger trees are fed into semi-analytic models which populate them with galaxies.
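To make the tree-building step concrete, a merger tree boils down to a very simple linked structure - something like the following sketch (illustrative Python only, not the format used by Galacticus or by any particular halo finder):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HaloNode:
    # A dark matter halo identified at one snapshot of the simulation.
    halo_id: int
    snapshot: int        # index of the snapshot at which the halo was found
    mass: float          # halo mass [Msun]
    descendant: Optional["HaloNode"] = None                      # what it becomes at the next snapshot
    progenitors: List["HaloNode"] = field(default_factory=list)  # halos that merge into it

def link(progenitor: HaloNode, descendant: HaloNode) -> None:
    # Linking halos across snapshots: each halo points to its descendant, and the
    # descendant records it as a progenitor. The resulting structure is the merger
    # tree that a semi-analytic model then populates with galaxies.
    progenitor.descendant = descendant
    descendant.progenitors.append(progenitor)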

An obvious question is, "How many snapshots should I take?". The equally obvious answer is, "As many as you can!". But simulations are large, and storing data from many snapshots quickly uses vast amounts of storage space. So the real question is, "How many snapshots do I need to get an accurate answer?" Surprisingly, this question hasn't been answered. There are a few short statements in a few papers that more-or-less say "We checked that we have enough snapshots," but which don't give any details. But, beyond that, nothing.

Fortunately, Galacticus is ideally suited to answering this question. Using Galacticus, I was able to construct merger trees with effectively infinite time resolution, and then degrade these to appear as they would if they'd been extracted from an N-body simulation with a finite number of snapshots. I could then see how the average properties of galaxies changed as I gradually increased the number of snapshots used.
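The degrading step is conceptually simple. Here's a minimal sketch (illustrative Python, not the Galacticus internals) of the key operation: snapping the exact node times of a high-resolution tree onto a finite grid of snapshot times, as a halo finder running on stored snapshots would effectively do. A full treatment also has to merge progenitors that fall between the same pair of snapshots; that bookkeeping is omitted here.

import numpy as np

def degrade_node_times(node_times, snapshot_times):
    # Assign each tree node to the first snapshot at or after its true time,
    # i.e. the snapshot at which it would first be identified in a simulation
    # that only stored output at `snapshot_times`.
    snapshot_times = np.sort(np.asarray(snapshot_times))
    node_times = np.asarray(node_times)
    indices = np.searchsorted(snapshot_times, node_times, side="left")
    indices = np.clip(indices, 0, len(snapshot_times) - 1)
    return snapshot_times[indices]

# Example: 64 snapshots uniformly spaced in time over ~13.8 Gyr (illustrative spacing only).
snapshots = np.linspace(0.2, 13.8, 64)
print(degrade_node_times([0.37, 1.02, 9.63], snapshots))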

Quantitatively, the answer is that 128 or more snapshots are sufficient to get most average galaxy properties to an accuracy of 5%. With fewer snapshots, errors begin to increase rapidly. For reference, the much-used Millennium simulation used 64 snapshots - not too bad, but few enough to lead to 20-30% errors in some quantities.

You might argue that errors at this level aren't too important (certainly there are other uncertainties in the galaxy formation calculation that are larger than this). I would agree that, quantitatively, that's true. Of course, we didn't know that the errors were this small until the test was done! Perhaps more importantly, checking for convergence with respect to numerical parameters (such as the number of timesteps) is just good practice in any kind of computational physics and should always be done.
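If you want to do the same kind of check in your own pipeline, the logic is no more complicated than this toy Python sketch - the function computing the galaxy property is a stand-in for whatever expensive calculation you actually run:

def galaxy_property_with_n_snapshots(n):
    # Stand-in for the real (expensive) calculation: here just a toy in which
    # the answer approaches its true value of 1.0 as the snapshot count grows.
    return 1.0 + 10.0 / n

def has_converged(results, tol=0.05):
    # Converged when the last doubling of the snapshot count changed the
    # answer by less than `tol` (fractionally).
    return len(results) >= 2 and abs(results[-1] - results[-2]) <= tol * abs(results[-1])

results = []
n_snapshots = 16
while True:
    results.append(galaxy_property_with_n_snapshots(n_snapshots))
    if has_converged(results):
        break
    n_snapshots *= 2
print("converged with", n_snapshots, "snapshots")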

Another nice feature of Galacticus: because it's free, anyone can run similar tests themselves (example parameter files are available here), and figure out how many snapshots they should use before they run an expensive N-body simulation.

Sunday, July 10, 2011

Code Profiling and the Slowest Steps of Galaxy Formation


Galacticus isn't the fastest semi-analytic model of galaxy formation out there. It was never meant to be. Instead, the goal was to focus on flexibility and accuracy. Want it to go faster? Buy more computers!

Speed is still an issue, of course, so it's useful to know which parts of the calculation are the slowest. Building merger trees is actually very fast - it's a minor contributor to the overall run time. Solving the baryonic physics is the slow part. I regularly run Galacticus through valgrind to profile the code. As a result, many parts have been optimized and there's now no single part where additional optimization would result in a significant improvement in speed.

Instead, the limiting factor is the size and number of timesteps that the ODE solver must take to advance galaxies forward in time with the requested accuracy. I recently added some new functionality to Galacticus to track the sizes of timesteps taken by the ODE solver as it solves the physics of galaxy formation (*). Perhaps more importantly, it also figures out which properties of galaxies are responsible for limiting those timesteps - i.e. which cause the bottleneck in evolving galaxies forward in time.
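The idea behind that bookkeeping is simple, even though the real version lives inside the (Fortran/FGSL) solver. Here's a Python cartoon of it - not the actual Galacticus code, and the state vector and tolerances are hypothetical: a step-size controller scales each component's local error estimate by its tolerance, and whichever galaxy property has the largest scaled error is the one limiting the step.

import numpy as np

def limiting_property(y, y_error, names, abs_tol=1.0, rel_tol=1.0e-3):
    # Scale each component's local error estimate by its tolerance, as a
    # step-size controller does; the property with the largest scaled error
    # is the one forcing the solver to shrink its steps.
    scale = abs_tol + rel_tol * np.abs(y)
    scaled_error = np.abs(y_error) / scale
    worst = int(np.argmax(scaled_error))
    return names[worst], scaled_error[worst]

# Hypothetical state vector: disk gas, disk stars, spheroid gas [Msun].
names = ["disk gas mass", "disk stellar mass", "spheroid gas mass"]
y = np.array([1.0e10, 5.0e9, 1.0e6])
y_error = np.array([1.0e6, 5.0e5, 8.0e3])
print(limiting_property(y, y_error, names))  # the tiny spheroid gas mass dominates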

Here's some typical output from the new ODE evolver meta-data collecting routines, showing the distribution of timesteps taken:


and which properties limit the timesteps:

The timestep distribution peaks around 10⁻³ Gyr, which is a small fraction of typical galaxy dynamical times. It falls off rapidly below this, but has an extended tail to long times. From the histogram it's clear that it's the gas component of the spheroid that mostly limits the stepsize taken. Some exploration shows that this is due to disk instabilities.

An unstable disk can funnel gas (and stars) to the spheroid on a timescale as short as the disk dynamical time. The timescale for changing the gas mass of the spheroid in this way is therefore (M_spheroid/M_disk) t_disk, which can be arbitrarily short if the spheroid mass is low (as happens when a spheroid is first created as a result of a disk instability event). Even with sensible absolute tolerance limits set on the spheroid evolution this can require some short timesteps to accurately track the evolution of the spheroid. Probably the best solution for this would be a better motivated model for disk instabilities...
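Spelling that estimate out (a one-line derivation, nothing specific to the code): if the unstable disk drains onto the spheroid at a rate of roughly its own mass per disk dynamical time, then

\dot{M}_{\mathrm{spheroid}} \sim \frac{M_{\mathrm{disk}}}{t_{\mathrm{disk}}}
\qquad\Longrightarrow\qquad
t \equiv \frac{M_{\mathrm{spheroid}}}{\dot{M}_{\mathrm{spheroid}}} \sim \left(\frac{M_{\mathrm{spheroid}}}{M_{\mathrm{disk}}}\right) t_{\mathrm{disk}} ,

which vanishes as the spheroid mass goes to zero - hence the very short steps just after a spheroid is first created.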

(* The ODE profiling code is available in the current development version of Galacticus, but is deactivated by default as it needs a hacked FGSL library to work. If you want to use it, let me know and I'll send you the modified FGSL.)