Meditations on the 'Archivability Crisis' in Science and the Long-Term Reproducibility of Scientific Analyses

By John Pellman · Nov 19, 2018

This post is a response to C. Titus Brown’s How I learned to stop worrying and love the coming archivability crisis in scientific software, informed both by Emulation & Virtualization as Preservation Strategies by David S. H. Rosenthal and past experiences attending Vintage Computer Festival East.

I intend to react to several of Brown’s assertions, namely: 1) that one can’t save all the software necessary to faithfully reproduce an analysis pipeline, 2) that containers and VM images, as black boxes, are bad for inspectability, and 3) that analyses have a “half-life of utility”, and this in turn renders literal reproducibility undesirable due to cost and effort. Note that Brown’s own views on this may have converged with my own to some degree by this point. I have not been able to fully keep up with his writings, so I apologize if he has made similar points elsewhere. There have quite likely also been many other developments on the subject of scientific reproducibility that I am unaware of.

Software Preservation, Digital Darwinism, and the Role of Packaging Systems in Promoting Reproducibility

I’ll start by putting my own biases out for inspection: I’m candidly a lot more bullish about the long-term viability of replicating scientific experiments in silico going forward. My gut reaction to Brown’s post is that the claim that we can’t save all software is superficially true, in the sense that it would be near impossible to save the entire statistical population of all written software, but hyperbolic, in the sense that the most popular scientific software stacks will be sufficiently preserved into the distant future. Why do I think this? Statistically, the more popular a software package is, the more copies of it will be made, and the more likely it is to survive. Furthermore, a substantial number of the software layers that scientific computing relies upon are not unique to science, and even more copies of them are made by non-science actors due to their general purpose nature. Essentially, I have confidence in a sort of digital Darwinism that will ensure that core packages essential to scientific analyses remain around for a while. Whether or not this Darwinism applies to binaries and source code equally is debatable, but I strongly believe that most software will survive in some form in this manner (e.g., LINPACK in its 1988 incarnation is still downloadable today).

Beyond an evolutionary argument, modern software stacks have an advantage in the sense that most environments, whether they be Linux, BSD, or OS X (heck, even Windows), now have package managers that allow one to install specific versions of a piece of software. So long as one mirrors the package repositories (or even just individual package files for dependencies) for all the applications and libraries used in an analysis, one can easily unpack prerequisite software and re-create an environment for one’s operating system of choice. As an added bonus, for general purpose package repositories, there are already several mirrors of packages going back to at least the mid-2000s. Furthermore, most good package managers will also perform checksums on packages after they are installed to ensure an executable’s integrity, which prevents corruption in binaries from potentially influencing the result of an analysis. Prior to the development of package managers such as apt and yum in the late 90s and early 2000s, software stacks definitely were much harder to reproduce due to inconsistent software distribution methods, but dependency resolution and software installation seem to have largely been solved problems since. Unfortunately, due to the admittedly non-sexy work involved in maintaining package repositories, and time constraints due to the present publish or perish culture (which precludes academic service work other than paper production), it seems as though this key tool will not be leveraged to its full potential, although there are notable efforts such as the conda package manager, the Fedora SciTech SIG, DebianScience, NeuroFedora and NeuroDebian. Admittedly, there are also issues with the ability to reproduce the binaries themselves in package repositories reliably (see here), but as long as the binaries in the repositories themselves remain the same, a particular stack should be re-creatable (albeit potentially flawed).
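To make the mirroring-plus-checksum idea a little more concrete, here is a minimal Python sketch of how one might check a local mirror of package files against a manifest of recorded SHA-256 digests. The function names, the manifest layout (file name mapped to hex digest), and the choice of SHA-256 are my own assumptions for illustration; real package managers use their own metadata formats and verification steps.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_mirror(mirror_dir: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Compare each file named in the manifest against its recorded digest.

    The manifest format (file name -> hex digest) is hypothetical.
    Returns a mapping of file name -> "ok", "missing", or "mismatch".
    """
    report = {}
    for name, expected in manifest.items():
        candidate = mirror_dir / name
        if not candidate.is_file():
            report[name] = "missing"
        elif sha256_of(candidate) == expected.lower():
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report
```

A manifest like this could be archived alongside a paper, so that anyone restoring the mirror years later can at least tell whether the package files they found are byte-for-byte the ones the analysis used.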

My hope is also anchored in the fact that non-scientists, such as librarians and law enforcement officers, have a vested interest in maintaining software stacks on a decades-long time scale as well. Librarians do so to preserve cultural heritage. Law enforcement, in contrast, does so for more practical and less abstract reasons (although both groups often collaborate). Forensics specialists, in order to rapidly investigate born-digital evidence, have a need to have rapid access to both signatures identifying applications and key files included with those applications. To this end, NIST maintains a substantial collection of software packages as part of its National Software Reference Library. The public metadata for this collection indicates that a large number of popular pre-1999 scientific packages, such as MATLAB and SPSS, are already included in the collection.

Hardware and Software Emulation

Thus far I’ve focused primarily on software itself, but what about lower levels of the stack, such as the operating system and hardware? For legacy ecosystems such as SPARC, 68k, or PPC, these lower layers can be faithfully reproduced through software emulation at a minimum, and through hardware-based strategies if some other property (more authentic execution, lower power consumption, etc.) is necessary. New developments in emulation, while primarily driven by retrocomputing and gaming enthusiasts instead of scientific researchers, hold great promise for the long-term near-literal reproducibility of older analyses (an especially notable project by Brian Stuart can even run reference programs for the ENIAC). To better contextualize my thoughts on this, I’d like to note the distinction between two forms of fidelity when a digital artefact (such as an analysis or digital art exhibit) is emulated: execution fidelity, which pertains to the accuracy of instructions performed by a computer, and experiential fidelity, which pertains to the ability of a simulation to accurately mimic the first-person subjective experience of a technology within its context (Rosenthal, 2.4.3). The former is trivially achievable due to Turing equivalence. As David Rosenthal notes:

In a Turing sense all computers are equivalent to each other, so it is possible for an emulator to replicate the behavior of the target machine’s CPU and memory exactly, and most emulators do that.

The latter form of fidelity cannot be implemented via emulation alone, but matters little for in silico reproducibility since any context related to the appearance of the hardware is irrelevant to the veracity of the results. Experiential fidelity firmly falls into that category of things so impertinent to the experiment that they belong to statistical error at best (e.g., the color of Rutherford’s tie when he first performed his gold foil experiment, if he even wore a tie). It simply does not matter whether a study’s result is displayed on a monochrome Mac Plus CRT display or a modern MacBook Pro Retina monitor (with the very rare exception perhaps of disciplines that rely on imaging). Truth that approaches objective truth is timeless, and while the paradigms that guide science may be influenced by a contemporary culture as science evolves, the results that science ultimately converges upon belong to nature itself and not any specific cultural worldview.

It could be argued that while Turing equivalence makes it theoretically possible that any analysis can be reproduced literally, it may not be pragmatic to achieve this ideal. This is a fair point, which is why I think pragmatically science should strive for near-literal reproducibility rather than literal reproducibility. I believe that for scientific analyses at least, if we cannot have execution fidelity that is 100% accurate and precise, we should at the very least strive to minimize the range of our confidence intervals, almost in the same way that equipment manufacturers strive to create components within certain tolerance ranges. It’s also worth pointing out that even if the ideal execution fidelity were fulfilled, nondeterministic steps in an analysis pipeline could yield different end results anyways.
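As a rough illustration of what “near-literal” might mean in practice, the sketch below compares the named numeric outputs of an original run against a re-run and reports which ones fall outside a tolerance. The function name, the dictionary layout, and the default tolerances are assumptions of mine, not an established standard; an appropriate tolerance would have to be chosen per analysis.

```python
import math


def compare_results(original: dict[str, float], rerun: dict[str, float],
                    rel_tol: float = 1e-6, abs_tol: float = 1e-12) -> list[str]:
    """Return the names of results whose re-run value falls outside tolerance.

    A result that is missing from the re-run also counts as a failure.
    The default tolerances are placeholders, not recommendations.
    """
    failures = []
    for name, expected in original.items():
        actual = rerun.get(name)
        if actual is None or not math.isclose(
            actual, expected, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            failures.append(name)
    return failures
```

An empty list would mean the re-run reproduced every recorded result to within the stated tolerance, which is the component-manufacturer notion of “in spec” rather than bit-for-bit identity.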

In terms of how we might emulate analyses produced today, it is important to note that two instruction set architectures, Intel/AMD x86_64 and ARM, underpin the vast majority of all computing due to market consolidation in the CPU space (Rosenthal, 3.2.4). Within scientific computing in particular, this market consolidation is even more acute, with only 6.2% of the Top 500 supercomputers using architectures other than AMD/Intel’s x86_64 as of June 2018 (see here). This means that, relative to the heterogeneity of hardware architectures in the 90s, the total set of hardware we would need to emulate in the future is quite small for a modern scientific analysis. Furthermore, it is abundantly clear that the majority of Top 500 HPC clusters are running Linux, which indicates that there are very few degrees of freedom when it comes to operating system choice in modern scientific environments.

In his blog, Brown cites how some researchers have proposed co-opting the software development concept of “continuous integration” as a potential solution for the reproducibility crisis, re-running analyses constantly as data is recorded. While this concept is intriguing, I’d suggest that scientists adapt another concept from the software development world, code coverage, but with the “coverage” element not relating to how many functions in their code have unit tests, but rather how many components of their software/hardware stack will likely be emulatable in the future. While there’s no guarantee that a software developer will write an emulator that implements every single feature of a given instruction set architecture in the future, this kind of “emulation coverage” might be informed by factors such as a) how niche the hardware is, b) how large the community of end-users is (and how fanatical they are), and c) how well-documented the hardware’s interfaces and underlying implementations are.
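To sketch how such a metric might be computed, the toy Python example below rates each component of a stack on the three factors above and reports the fraction of components that clear a threshold, by analogy with the fraction of code covered by tests. Everything here is hypothetical: the 0-to-1 ratings, the equal weighting, the threshold, and the names are illustrative assumptions rather than a validated measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One layer of a software/hardware stack, rated from 0.0 to 1.0 (hypothetical scale)."""
    name: str
    commonness: float     # 1.0 = ubiquitous hardware, 0.0 = extremely niche
    community: float      # 1.0 = large, committed end-user community
    documentation: float  # 1.0 = interfaces and implementation fully documented

    def score(self) -> float:
        # Equal weighting of the three factors is an arbitrary choice.
        return (self.commonness + self.community + self.documentation) / 3


def emulation_coverage(stack: list[Component], threshold: float = 0.5) -> float:
    """Return the fraction of components whose score meets the threshold."""
    if not stack:
        raise ValueError("stack must contain at least one component")
    covered = sum(1 for component in stack if component.score() >= threshold)
    return covered / len(stack)
```

The ratings themselves would be subjective judgements, so the number is best read as a prompt for discussion about which layers of a stack are most at risk, not as a measurement.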

Beyond software-based emulation of older hardware, there’s also been a growing trend of using FPGAs to implement older hardware directly. There are numerous advantages to this approach, from better utilization of resources such as electricity and compute cycles (i.e., many cycles are idle when emulating a legacy machine on a powerful machine and nothing else is running in the background), to the ability to run custom hardware in the cloud via Amazon’s F1 instance types. A good example of this in the education space is Stephen Edwards’s Apple2fpga, which re-implements an Apple II+ using an FPGA board. This example also illustrates how the Apple II might be considered a platform with good emulation coverage, since as Edwards notes on his site:

The Apple II has been documented in great detail. Starting with the first Apple II “Redbook” Reference Manual, Apple itself has published the schematics for the Apple II series. When Woz spoke at Columbia, he mentioned this was intentional: he wanted to share as much technical information as possible to educate the users.

It should be noted that historically speaking, it wasn’t uncommon to run multiple software architectures on a single machine. At Vintage Computer Festival East 2018, an exhibit entitled “Microcomputers With an Identity Crisis” by Douglas Crawford, Chris Fala, and Todd George demonstrated how ASICs (such as the Apple IIe compatibility card for the Mac Color Classic) and add-on cards (such as a 486 CPU that operated in tandem with a PowerPC processor in the PowerMac 6100) were used to consolidate the large number of incompatible hardware architectures during the 80s and 90s onto single machines. Modern cloud infrastructure with FPGA instances can serve a similar purpose in allowing multiple hardware architectures to be run alongside each other, and it’s feasible that on-premises servers with FPGAs could also be used to re-create hardware architectures on-demand (although these would likely be for custom hardware such as GPUs or other co-processors, since CPU emulation via software in many cases seems more pragmatic).

On Docker/Virtualization, Inspectability and Configuration Management Tooling

I find myself in agreement with Brown’s assessment of Docker, in that treating scientific software and analysis pipelines as black boxes is dangerous for reproducibility and reduces transparency, although admittedly it could be argued that Dockerfiles fulfill some of the requirements of inspectability that Brown advocates. The primary issue I have with Dockerfiles for purposes of inspectability is that they aren’t sufficiently portable, and can’t readily be used to deploy to bare metal, virtual machines, or other containerization technologies. Anecdotally, it seems that many researchers (and companies) are in a rush to embrace Docker without fully considering that it could quite easily be supplanted by Kata Containers, jails or zones should the tech industry embrace these alternative technologies (which owing to some of Docker’s historical failings wouldn’t surprise me). This particular nitpick of mine is related to the “emulation coverage” I discussed earlier, where it’s important to consider the long-term viability of a technology before employing it.

Instead of Dockerfiles, I think that the inspectability criterion should be fulfilled by configuration management tools such as Ansible, Puppet, Chef, and Cfengine. Configuration management tools use a declarative syntax that allows one to specify exactly how one’s environment is set up (i.e., which packages are installed, which versions are used, which hard disk volumes are mounted, etc.). While this syntax varies between tools, a specification written in it can both be applied against a base installation of an operating system to bring an execution host to a desired state and serve as documentation that future researchers can look at and remix. Intriguingly, Cfengine was created with the explicit goal of setting up research environments for physicists in the early 90s, which consequently means that a large number of computing environments from that time period might be very well-characterized.

Conda environment.yml files also fulfill inspectability in a manner similar to configuration management tools. In fact, these files are one way to construct a Docker image using HHMI’s Binder tool. At this point, it’s important to note that the effort needed to reproduce a pipeline doesn’t always need to focus on OS-level dependencies. If a pipeline is implemented solely in a higher-level language such as Python, a researcher may have considerably less work to do in terms of documenting their environment. Indeed, Docker is basically to C and C libraries what a Python virtualenv is to Python and Python packages. If a researcher has no need for a specific version of a C library, then a requirements.txt or environment.yml that can create a Python virtualenv (along with documentation of which base installation it was installed atop, which could be provided by a systems administrator) might be adequate.
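For the pure-Python case, a pinned requirements list can be generated from whatever is installed in the current environment using only the standard library. The sketch below is one way this might look; the function names and the comment-style header recording the interpreter and platform are my own conventions, not part of any tool mentioned above.

```python
import platform
import sys
from importlib import metadata


def environment_header() -> list[str]:
    """Return comment lines noting the interpreter and base platform."""
    return [
        f"# python {sys.version.split()[0]}",
        f"# platform {platform.platform()}",
    ]


def pinned_requirements() -> list[str]:
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Print a requirements.txt-style listing; pip ignores the '#' comment lines.
    print("\n".join(environment_header() + pinned_requirements()))
```

Redirecting the output to a file and committing it next to the analysis code would give a future reader a record of exactly which package versions were present, though it says nothing about C libraries or other OS-level dependencies.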

Finally, I believe that the inspectability criterion must be fulfilled by well-documented workflows that indicate how tools in an analysis fit together. My opinion is that this would be best accomplished by a common syntax such as Common Workflow Language (CWL). I’ve discussed my beliefs about the potential benefits of CWL at length elsewhere, so I won’t re-hash them here, but the short of it is that I strongly believe in a neutral, platform-agnostic way for describing how tools fit together (essentially high-level programming).

Thoughts on Literal Reproducibility, and the Utility of Research Products

While I agree with Brown that literal reproducibility is impractical, I strongly believe that the ability to reproduce analyses with the highest precision possible has value beyond the short-term. For one, I think that there is a moral imperative for analyses to be repeatable, since most analyses are not produced with private funding, but rather governmental funding, and as such should be available to the general public in a form that they can inspect and run with relative ease (should they obtain the hardware resources necessary to do so). I believe that this is essential for public instruction, to narrow the gap between scientists in an “ivory tower” and everyday citizens, to demonstrate good use of public funds, and to increase accountability / reduce scientific fraud. I also believe that it makes science less vulnerable to attack, especially in the case of climate science since it’s harder for climate science deniers to have as much traction if the exact analyses that they want to discredit are accessible and readily runnable in some fashion. Indeed, greater workflow transparency might mitigate against allegations that scientists are engaging in a conspiracy to further a political agenda (e.g., Climategate).

Additionally, I think that it’s dangerous to make assumptions about the utility of exact repeatability (or anything else in science or life for that matter) in the long-term. The value that we assign to any given research product, such as a reproducible analysis, is constantly in flux, and there’s no reasonable way to predict the worth of that analysis at any given point in the distant future. Historical judgements anticipating the future value of research products have led to extremely suboptimal decisions, such as the alleged destruction of data (see disclaimer below at [1]). Furthermore, value judgements about analyses have led to systematic issues within science itself such as the file drawer problem wherein null results are undervalued and thus never communicated to other scientists (at best).

It’s also worth pointing out that the problem-solving process in science itself benefits from having a legacy analysis available for revival if necessary in the distant future. In cognitive science, it is theorized that two methods of problem-solving, difference reduction and means-ends analysis, are employed to get to a goal state, such as an analysis-backed conclusion. In difference reduction, or hill climbing, actions are continually performed that minimize the difference between the current state and a desired goal state. This works fine for simple problems, but has the caveat that a problem solver can get stuck in a rut (or local maximum) and not step back to go down a path that would lead them to the ideal goal state (a global maximum). In means-ends analysis, multiple sub-goals are created as responses to blocking states as the problem-solving process occurs. These sub-goals are then considered separately to build a path to an overall goal state. Means-ends analysis is the ideal problem-solving strategy; however, it assumes a well-characterized goal, which is often not the case in more exploratory scientific analyses. Due to the high uncertainty of scientific outcomes, I would assume that difference reduction reasoning strategies are more prevalent than means-ends analysis in the scientific process. Within the context of the computational reproducibility conversation, what this means is that a scientific field could go down a kind of garden path for a while, get stuck in a local maximum, and then need to back up to unblock itself. If there is a need to back up to the point before a scientific paradigm branched towards a particular direction, we would want to be able to re-run analyses from that exact point of branching to determine our next directions.

Other Considerations: Instrumentation Fidelity and the Role of Data in Reproducible Analyses

Up until now, I’ve focused on reproducing the CPU instructions / program that performs an analysis, but what about the data? While I’ve been confident about our ability to preserve software stacks and re-run them via emulation, I’m not quite as confident about our ability to preserve raw data into the distant future. This is in large part because packaging tooling doesn’t seem nearly as mature for data as for the instructions that will run on that data, and also because most data out there isn’t general purpose enough for non-scientists to want to preserve. As I expressed earlier, even packaging systems such as yum or apt don’t seem to be fully leveraged by the scientific community, and tooling based around datasets seems to be even worse off. There’s some cause for optimism, however, as projects/products like Datalad/git-annex, dat, osfclient, and Quilt have progressed. This space is still very new, and I wouldn’t expect it to really take off until the scientific and analytics/data science communities converge upon a standard.

My cynicism about the long-term preservation of scientific data is rooted in the fact that data seems to be treated as a second-class citizen and is more readily thrown out. This disposal of data is possibly in part due to a perception that new data can always be gathered and an analysis re-run, but also due to the fact that scientific and governmental institutions give little thought to sustainable funding models for data storage. Furthermore, the long-term preservation of any data, not exclusively scientific data, can be an expensive and complicated affair if one wants to do it right, and historically there hasn’t been a large interest in proactively engaging in data preservation (see here). Instead, conversations about data preservation typically tend to be reactive and come up within the context of political censorship. There are other contributors to my pessimism, such as poor documentation/metadata indicating how raw data can be used and, as Brown points out in his blog, a disconnect between older formats and newer tools, which I might rant about later. My cynicism on this point is also slightly colored by the fate of the fMRI Data Center, the first large-scale effort at open access data sharing for neuroimaging data, which disappeared seemingly overnight after government funding dried up. Simply put, the structure of scientific grants does not always make it easy to create reliable core infrastructure, and more often than not data storage (both short-term and long-term) seems to be given short shrift.

There’s also another form of fidelity at play in computational analyses that can get lost in the discussion when general-purpose computation is emphasized heavily; ideally the equipment and methods used for data collection should be repeatable in the future as well. This repeatability would constitute a variant of the concept of execution fidelity that I’d like to call “instrumentation fidelity”. While the execution fidelity discussed previously deals with Turing-complete general-purpose computers, instrumentation fidelity deals with niche processing units that take in input (sensations of natural phenomena), process these through a variety of sensors and other hardware, and produce output in the form of scientific data.

While such instrumentation fidelity is not as important if we consider data to be fixed in order to create a single repeatable version of a specific analysis, it does have implications for remixability and the general-purpose suitability of an analysis pipeline. Indeed, I don’t believe that it would be wise to consider collected data to be “hard-coded”. If an analysis is to be truly generalizable, it should also be able to produce results that fall within a confidence interval with an entirely novel set of data. In fact, the ability to re-run an analysis with an entirely novel set of data gathered by an independent lab might be a very useful form of blind peer review that increases the quality of published findings tremendously.

But how would we achieve inspectability with data gathering instruments, and how could we ensure that the instrumentation itself wasn’t flawed in some systematic way? While Brown asserts that “closed source software is useless crap because it satisfies neither repeatability nor inspectability”, it’s also true that most equipment designs for life sciences are proprietary, and in many cases it’s impossible to double-check for fundamental design flaws that might systematically distort the data a pipeline acts upon. Though there are some efforts to create open hardware for labs, in many cases instrumentation is too niche to have an open source equivalent or tool manufacturers benefit from economies of scale that individual labs don’t have.

Instrumentation fidelity also opens up a number of other questions about how literally an analysis should be reproduced. Should a neuroimaging researcher build his or her own open source EEG or MRI scanner using designs from 30 years ago to replicate a past analysis? Given the advances in technology over those 30 years, which would presumably lead to higher-quality data (unless newer instrumentation has its own systematic flaws that render it inferior to older equipment) this would seem to be an absurd proposal. My intuition tells me that reconstructing legacy instrumentation should not be a priority, though this seems as though it could be a dizzying debate for another time.

Summary

In brief, my thoughts are that:

It seems unlikely that scientific software stacks will be incredibly difficult to preserve in many (if not most) cases. Many elements of scientific software stacks, such as OS libraries, are more likely to be preserved due to their general purpose use. Non-scientists such as librarians and law enforcement have a vested interest in preserving scientific software on a decades-long time scale. Software packaging, if adopted more widely, would make reproducibility even easier to achieve, although at a minimum the use of version control tools such as git could provide some of the same functionality as packaging (such as the ability to install specific software versions via commit number).
While literal reproducibility is impractical (though theoretically possible) striving for narrow confidence intervals / tolerance limits for acceptable outputs from a scientific analysis is less so.
Hardware and software emulation of lower levels of the scientific computing stack should for the most part be adequate. Researchers, when creating computational analyses, should consider how difficult it will be to emulate their analysis in the future (“emulation coverage”).
I agree that the creation of VM images or Docker containers is a non-solution to the reproducibility of in silico scientific analyses. In lieu of relying solely on images and containers, I think that it is more important to document environments via configuration management tools or other environment specifications. These tools both provide high inspectability and can be used to re-create environments on top of a base operating system installation.
Systematic documentation of workflows via a syntax such as Common Workflow Language (CWL) is essential for providing inspectability and documentation for how software components are linked together in an analysis.
More broadly, the value of declarative syntax in facilitating the inspectability of analyses cannot be overstated.
Changes in instrumentation for data collection also throw a wrench into the problem of reproducibility, especially since many tools used to collect data are proprietary and data repositories don’t have the same well-established tooling that software packages have in terms of packaging or version control.
Some of what I’ve said in this post is rooted in a deeper belief that scientific reproducibility is, in some ways, more suited to an engineering mindset than a scientific one.

[1] I realize that the particular example I cite here (the case study of Henry Molaison) is controversial and that the jury is still out on whether any data was actually destroyed. Most likely no malicious intent was involved. Nevertheless, destruction of research materials definitely does occur on some scale throughout science and can lead to great distrust in the court of public opinion.