John Pellman - Systems Administrator

On How Computers Did (or Didn't) Break Science

Sun, 14 Jun 2020 17:23:49 -0400

Recently I came across an article (How computers broke science – and what we can do to fix it by Ben Marwick) that argues that electronic computers are breaking science. Namely, computers are blamed for:

Making data processing methods more opaque and converting processes that were formerly transparent into black boxes.
Being too versatile, and thus complicating methods reporting in journal articles. Furthermore, making results reproducible now involves documenting both your software and your data management efforts.

While the article contains many positions that I agree with to varying degrees, such as increased sharing of data, the use of open source toolkits, the use of open formats, and a shift away from exclusively point-and-click applications, I find the central premise (that computers in and of themselves have somehow disrupted the scientific process) to be poorly supported. I do think that there is an element of truth to what the author says in this regard, and I think that by and large we are in agreement on finer matters, but his central position lacks precision.

Causes of Computational Reproducibility Issues

From my perspective as a technical professional with a different mindset and different background from many researchers, I regard problems with computational reproducibility as originating from two separate causes:

Nondeterministic algorithms, which do not necessarily produce the same outputs given the same input across multiple runs.
Human factors issues.
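To make the first cause concrete, here is a minimal Python sketch of a nondeterministic computation. The function name and sample count are made up for illustration. A Monte Carlo estimate of pi generally differs from run to run when the random number generator is left unseeded, and becomes repeatable once the seed is fixed and reported.

```python
import random


def monte_carlo_pi(n_samples, rng):
    """Estimate pi by sampling points in the unit square (illustrative only)."""
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


# Unseeded generators draw their state from the operating system, so these
# two estimates will usually differ even though the "input" is identical.
unseeded_a = monte_carlo_pi(10_000, random.Random())
unseeded_b = monte_carlo_pi(10_000, random.Random())

# Fixing (and reporting) the seed makes the same computation repeatable.
seeded_a = monte_carlo_pi(10_000, random.Random(42))
seeded_b = monte_carlo_pi(10_000, random.Random(42))
```

Seeding only addresses this one source of run-to-run variation; others, such as the ordering of parallel floating point reductions, need separate handling.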

The article in question does not address mathematical causes of irreproducibility, and instead focuses on the human factors front. From this vantage point, there are two possibilities for why computational reproducibility might be challenging:

That scientists are misusing digital tools because computational illiteracy is reasonably widespread in the sciences, and the publish or perish incentive structure of the modern academy does not reward scientists who take the time to properly understand computation and its role in bringing about their study’s conclusions.
That there is a fundamental issue with how computing interfaces are designed for scientists, and that this leads them to perform actions that are maladaptive. Computational reproducibility is fundamentally hampered by poor user interface design decisions rather than by the researchers themselves. Scientific computation would benefit from more of what Don Norman has called user-centered design.

It is my belief that both possibilities are responsible for issues of computational reproducibility in varying proportions. The remedy for the former possibility is more education, and it is for this reason that efforts such as Software Carpentry exist. In an ideal world, education about scientific computing would begin even earlier at the undergraduate level, since computing is becoming essential to all areas of research, and would be better learned before the competing burden of needing to publish scholarly articles comes into play.

The latter possibility is addressed by efforts such as Jupyter Notebooks, Galaxy, brainlife, and NeuroCaaS, which simplify computing by abstracting away elements of general-purpose computing that are irrelevant to science while keeping elements that fit within a researcher’s cognitive schema / understanding of the world. Jupyter, for instance, uses a notebook analogy, similar to how a researcher might make notes in a literal notebook while performing benchwork. The other tools perform specific tasks, in well-defined pipelines, with fixed inputs and outputs, deliberately constraining the problem space / software elements that a researcher must manage while increasing the consistency of research outputs. When running the mriqc pipeline on brainlife, for instance, functionality is restricted to a clear and obvious goal: ensuring that data quality is acceptable. While to some extent these are black boxes, they are also based upon incredibly transparent components that can be audited if need be. It is for this reason that I must clarify that I am not wholesale against the use of point-and-click applications, as long as such applications are built upon versatile and reasonably transparent components.

As a brief disclaimer, it is also worth pointing out that, at the time that the article I’m responding to was written, Jupyter Notebooks were relatively new and not as established as they are today, although other notebook interfaces such as MATLAB and Mathematica were (but were not yet web-based).

Human factors issues related to data management in particular are also being partially addressed by metadata standards such as BIDS, NIDM, EML, and CF Conventions. These standards encourage reproducibility by decreasing the number of ways in which files on a filesystem can be organized, constraining researchers with a default set of good data management practices. Efforts such as Datajoint go even further, encouraging researchers to manage data within structured database tables. In the long term, I believe that the data science world will come to influence data management practices within science positively, and that most analyses will be performed on data stored within highly structured databases in a transactional manner instead of on files directly, while files enriched with metadata schemas will come to be used as intermediate, portable representations of datasets that can be imported into databases via various connectors. Phrased differently, a structured database abstraction will force researchers to keep their data and its provenance organized by restricting the number of operations that can be performed on them, much like how photo organizer programs such as Apple Photos or digiKam tame the chaos of managing one’s own personal photos.

Are Computers Fundamentally Different from Other Instrumentation?

The author of How Computers Broke Science also cites a claim by Victoria Stodden that a computer is fundamentally different from other pieces of scientific instrumentation in his article. I am slightly skeptical of this claim as it stands in the modern world, in no small part because many pieces of modern instrumentation themselves contain full-blown onboard computers (similar to Raspberry Pis) that perform a portion of data processing to produce the “raw” data.

Phrased differently, unless you’re using analog instrumentation, your microscope almost certainly is running a full-fledged copy of Linux, a BSD, Windows, or Minix to ensure that the output you receive is encoded in a digital format. In fMRI research, MRI scanners don’t even use onboard PCs, instead using whole workstations that perform some rudimentary image processing steps as part of data acquisition (e.g., k-space transformations) that often go unreported and are typically forgotten about.

Even outside the realm of science, printers, ATMs, and New York City subway kiosks have been running complete copies of Windows for years. Even components within a computer, such as hard disk controllers or CPUs, have themselves become computers running their own operating systems. In 2017, it was even revealed that a large number of modern Intel processors have been secretly running the Minix operating system.

If the author is to critique computers for being too opaque, he cannot claim that most modern instrumentation is somehow fundamentally less opaque, since modern instruments are by-and-large application-specific digital computers in reality. In fact, such application-specific computers are arguably even more opaque than your average general-purpose computer, since the processes that they use to transform data into a “raw” format during data acquisition are often proprietary, undocumented by the instrument manufacturer, or both.

Beyond considerations of the opaqueness of on-board computers used by digital instrumentation, it’s important to note that even analog instrumentation performs processing steps upon data as it is acquired that are not too dissimilar from the processing done by digital instruments. Most methods reporting sections do not delve into the engineering details of such instrumentation and stop at mentioning the make and model of a particular data acquisition device, effectively making most instrumentation just as much of a black box as a digital computer.

Methods Reporting and Rosy Retrospection

In the article, the author claims that “For most of the history of science, researchers have reported their methods in a way that enabled independent reproduction of their results.” I suspect that this characterization of methods reporting before electronic computers as more transparent is an example of rosy retrospection. To back up this hunch, I’d like to explore a few historical instances of methods reporting: the modeling of the action potential in Hodgkin and Huxley (1952), irreplicable findings on the chemical basis of schizophrenia in Heath et al (1958), and the (probable) discovery of the Red Spot storm on Jupiter by Hooke (1667).

Firstly, how transparent was the methods section of Hodgkin and Huxley? While it was transparent in giving the necessary formulae to reproduce its results (these formulae would be analogous to analysis source code or a Jupyter notebook today), nowhere in the article does it indicate how these formulae were applied. The most popular contemporary means of calculating results for articles were human computers, mechanical calculators, and vacuum tube computers such as Cambridge University’s EDSAC. Hodgkin and Huxley says nothing about these methods or the pitfalls and potential for error that they introduce; all of these methods were treated interchangeably and either the inclusion of the specific calculating method was regarded as superfluous or it never even occurred to Hodgkin and Huxley to include this detail. By modern standards of methods reporting, which often require that you report your computer’s CPU and the software used at a minimum, the Hodgkin and Huxley study is almost certainly more opaque. Eventually, it came out in 1992 that Hodgkin and Huxley had wanted to use EDSAC, but were forced to use a Brunsviga instead due to an extended maintenance window on EDSAC (see Schwiening (2012)).

Heath et al is a very dramatic historical case of poor transparency in methods reporting. This psychiatric study concluded that a chemical substance called taraxein was a direct cause of schizophrenic behavior. It was never replicated due to deliberately vague methods reporting by a coauthor (Matthew Cohen). As Matthew Cobb explains in his book The Idea of the Brain (2020):

[Matthew Cohen] had deliberately withheld key parts of the relevant protocol from their scientific publications, rendering their work impossible to replicate. Cohen was in fact a fraud with no scientific training; he was a gangster on the run and had kept part of the taraxein technique secret as an insurance policy in case of discovery.

Lastly, we have Hooke (1667), one of the first published articles in the first recognized journal (The Philosophical Transactions). Hooke (1667) is short enough to quote in full:

The Ingenious Mr. Hook did, some months since, intimate to a friend of his, that he had, with an excellent twelve foot Telescope, observed, some days before he then spoke of it (viz. on the ninth of May, 1664. about 9 of the clock at night) a small Spot in the biggest of the 3 obscurer Belts of Jupiter, and that, observing it from time to time, he found, that within 2. hours after, the said Spot had moved from East to West, about half the length of the Diameter of Jupiter.

This article, while brief (it is, in fact, shorter than most modern abstracts), is essentially an observation with precise details about the instrumentation used in the observation and fairly imprecise measurements (“about half”). There is no obvious computation, analysis or statistical testing. In spite of this, there is certainly computation occurring; it’s just occurring within Hooke’s brain as he identifies an object and tracks it with his eyes. This is no different from the methodology of most modern pseudoscientific UFOlogists, whose main motto appears to be “seeing is believing” as they stare at objects in the sky. Even with the precision reported, which likely wasn’t even reported by Hooke himself (Henry Oldenburg wrote most of the articles in Philosophical Transactions on behalf of other researchers), key details about instrumentation, such as the nature of the lenses (who manufactured them, lens thickness, what combination of different lens types was used, etc.) are omitted. Because of the vagueness of this article, it has long been debated whether Hooke or Cassini discovered the Great Red Spot on Jupiter (Falorni (1987)). Creating an authentic reproduction of Hooke’s analysis (an effort more likely to be undertaken by a museum than a scientist) is thus somewhat difficult because of Hooke (1667)’s brevity and lack of clarity.

In Brief

I do not think that it can be said that computers, in and of themselves, have broken science fundamentally. Rather, I think that science tends to break down whenever there is a lack of focus or absence of constraints, as occurs any time there is a paradigm shift that disrupts what Thomas Kuhn has called “normal science”.
Computational reproducibility can be improved, not only by education, but also by re-engineering general-purpose computers into domain-specific computers through better interfaces.
Scientific instrumentation is often not as transparent or straightforward as one would think.
When discussing issues of reproducibility, it’s important not to romanticize the past. If we apply modern standards of academic publishing and methods reporting against previous decades, we will find that most articles fall woefully short of our contemporary expectations.

The National Academies of Science and Engineering and ACM Definitions of Reproducibility

Mon, 25 May 2020 09:36:14 -0400

In November of last year, I attended the SC19 conference, which brings together an assortment of computer scientists, systems administrators, and vendors. One of the various birds of a feather sessions I attended (The National Academies’ Report on Reproducibility and Replicability in Science: Inspirations for the SC Reproducibility Initiative) discussed the issue of reproducibility in great detail. The big news at this session was that several operationalizations for the concept of reproducibility that were developed independently by the ACM (see here) and the National Academies (see here) were being harmonized across the two organizations.

A primary motivation for this reconciliation of jargon was that the ACM and National Academies were, in fact, using the same words to refer to nearly opposite concepts. Namely, the ACM defined the term replicability to mean an attempt by another team to reproduce an article’s results using the exact same experimental setup, while the term reproducibility indicated the concept of convergent results obtained by a different team using a different experimental setup.

The National Academies, in contrast, defined reproducibility as the act of an independent team obtaining the same results through rigid adherence to the same methods and setup while also using the same dataset, while replicability referred to convergent results emerging from non-identical setups.

My personal take on this news is that the original operationalizations used by the ACM and the National Academies are both flawed, not by dint of their actual definitions, but in the manner that they have been formulated. Both sets of definitions treat varying degrees of reproducibility as belonging to discrete categories, whereas I think that a better characterization of reproducibility would be to think of it as a continuous spectrum. Very rarely are replication attempts so black and white that they can be binned into one of two categories, and by trying to lump studies together in this way, valuable details about how a replication was performed could be lost.

It is my personal opinion that it would be better to use one term (reproducibility) consistently, and maybe treat other terms such as “replicability” as redundant aliases for the main term, and then to characterize reproducibility as being composed of several continuous features (such as similarity between measurement apparatuses or computing environments) that can be reported alongside a replication attempt.

A Bigger Threat than Unfriendly AI

Fri, 11 Oct 2019 17:50:55 -0400

If you sample the discourse of futurists and transhumanists, you’ll quickly discover the following common talking point: if humanity develops a form of artificial intelligence that equals (via strong AI) or even exceeds (via a technological singularity) the intelligence endowed to us by natural selection, what prevents this artificial intelligence from being actively hostile towards humanity? Such a so-called “unfriendly AI” would strive to blow us all to dust similar to Skynet from the Terminator film franchise.

I would (in a brief moment) like to dispel such speculative fears and replace them with some very real, contemporary fears. To provide some context, it’s helpful to go back to the years after World War II, when artificial intelligence research experienced one of many frenzied booms. During these years, two parallel, non-contradictory schools of thought developed regarding computing. In one camp, you might find Marvin Minsky and Seymour Papert (proponents of symbolic AI, which has largely been abandoned for neural nets at this point, but remains influential via the legacy of LISP machines) and Frank Rosenblatt (the creator of the original perceptron). Complementary to this, however, there was a decent clique of researchers discussing intelligence amplification.

These researchers (most notably the “father of the internet”- J.C.R. Licklider) did not preclude the possibility of generalized artificial intelligence, but also did not think that such intelligence would arrive for some time (an assumption that is likely true today as well). Instead, they viewed computers as adjuncts or co-processors to the human brain that would greatly increase its efficiency when engaging in goal-directed behavior. Their perspective has had tangible results that have served humanity greatly, and is largely why an individual today can achieve in seconds (through scripting, automation, or even just by engaging with online interfaces) tasks that took weeks in bygone days. In more concrete terms, this group accurately forecast my ability to find a recipe for oat milk in under a minute. More importantly, the intelligence amplification provided by services such as Google has allowed me to find items that I never would have been able to find using practical pre-internet means, such as velociraptor pin-ups.

This brings me to my next point: general intelligence is not strictly related to the belief systems held by an individual. The fact that an individual human mind might be better at processing sensory input and making logical connections (i.e., have a higher IQ) does not automatically guarantee that that same human mind will reach accurate conclusions about reality. Even the most intelligent humans among us cling to erroneous beliefs- it doesn’t take too long to discover a Nobel Laureate who has succumbed to Nobel disease or an engineer who has caught engineers’ syndrome. One of the highest IQ humans (Christopher Langan) appears to be applying his exceptional mental abilities to generate a philosophy that is largely indistinguishable from Timecube (or similar ravings of madmen). More broadly, it should be noted that individuals at all levels of intelligence are susceptible to cognitive biases.

To use a computer analogy, general intelligence is like hardware, and belief systems are like the operating system or a program. You could make a powerful desktop build with an Intel i9, an NVIDIA 2080, and 64 GB of RAM to run a protein-folding simulation, but if the code you’re running doesn’t use floating point arithmetic, the results of the simulations would be wildly inaccurate. Similarly, you could try running a very advanced program such as a raytracer on a 386 with 16 MB of RAM. Assuming Turing completeness (and a lot of painstaking debugging and porting work, perhaps with software implementations of modern hardware features), the 386 could theoretically complete such a program- it would just take a substantially longer amount of time for execution to finish. Similarly, a high IQ individual (the powerful desktop) could be running a belief system that is glitchy or lacking important features, while a low IQ individual (the 386) could have a mental model that is closer to reality. The latter individual would stumble and sputter slowly through existence, but ultimately their beliefs would be more accurate, even if they never did come to any definitive conclusions about reality during their lifetime.

Given that even the most intelligent among us are imperfect, and that the applications of our intelligence do not always lead to accurate conclusions, it stands to reason that intelligence amplification might not only be increasing our efficiency with common tasks- it might also be increasing the efficiency with which we arrive at erroneous conclusions right now. These erroneous conclusions might be strengthened via phenomena described by psychology such as the “echo chambers” of groupthink or the tendency for members of groups to conform when under social pressure. We can see amplifications of erroneous beliefs today in the many internet mobs of trolls that lurk out there, and in the active misinformation campaigns being orchestrated by foreign powers such as Russia. In short, the internet- through its ability to form collective intelligences on platforms such as Twitter and Facebook- might actually constitute a bigger threat to humanity than any hypothetical synthetic life form.

Using the Gzip File Format as a Metadata Container

Tue, 03 Sep 2019 16:32:18 -0400

It’s been a while since I last posted, but I’ve been itching to commit to words a few thoughts I’ve been kicking around regarding one particular approach to adding metadata to arbitrary file formats. In the distant past, I was heavily involved in a community that was developing a standard for enriching neuroimaging datasets with metadata. I’m not going to re-hash all of the advantages that creating such a standard confers upon the neuroscience community, since others have done so extensively elsewhere, but the short of it is that the standard, the Brain Imaging Data Structure (BIDS), has done much to increase both the discoverability and reusability of fMRI/EEG/MEG datasets that otherwise would have remained “hidden” in a proverbial file drawer somewhere. If you’re interested in understanding BIDS more, there is a paper that goes into more detail. For this post, I’m going to focus on one particular technical choice that was made about how BIDS stores supplementary data (such as what instrumentation was used and how equipment was configured).

In fMRI data analysis, the two most heavily used formats are NIfTI (described in great detail here and here) and MINC. In terms of format adoption, the majority of tools available to neuroscientists favor NIfTI support, which led the designers of BIDS to focus on finding ways to inject additional metadata into this format. The NIfTI file header already contains quite a few fields for neuroscience-specific metadata, but adding additional fields to this header requires that a working group convene and engage in debate, and that some degree of backwards compatibility be maintained so that existing neuroimaging tools don’t break. This change control process, while ensuring stability, also makes it difficult for researchers and tool developers to rapidly experiment with additional header fields [0], iterate upon standards, and observe how the addition of such fields affects data sharing and reusability.

The primary BIDS developers wanted to have that ability to contextualize NIfTI files with new tags while not breaking existing tools. The technical solution that they arrived at was to use a JSON-formatted sidecar file, wherein a file with the same name as a NIfTI file but a different file extension would contain various metadata entries as key-value pairs. Such sidecar files are a fairly common way to enhance rigidly-defined file formats with additional metadata, and are readily seen elsewhere in the wild. For instance, sidecar files are used by gmvault to store feature-specific Gmail fields that are not part of an official e-mail message format standard. Additionally, Kodi uses sidecar files to store video metadata (see here). In the case of BIDS, a long-term goal was to engage the NIfTI working group so that this JSON file could itself be embedded within the NIfTI file header in a future version of the NIfTI format:

Storing metadata in JSON files has advantages of accessibility, but can be error prone because data and metadata do not live in the same file. In future revisions of BIDS we will explore the possibility of storing metadata as a JSON text extension of the NIfTI header.

The statement above reflects what I have come to believe is a general best practice for adding metadata to a file- metadata should always be embedded if possible. I base this best practice on the observation that in any structured system, there is a natural tendency towards entropy. The ease with which certain operations can be applied in a computing environment often serves to accelerate the decay of order. Filesystem operations such as file moves and deletions are particularly easy to perform. When using a sidecar file to store metadata, it is more likely that metadata will be lost, since the original file and its sidecar have no obvious relation to each other, and are treated as independent entities in the filesystem. In brief, there is no “stickiness” that ensures that both metadata and data will be equally affected by filesystem changes.
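The lack of “stickiness” is easy to demonstrate. The Python sketch below uses a made-up file name, a placeholder byte string instead of a real NIfTI image, and a hypothetical helper (sidecar_for) that assumes the sidecar shares the data file’s name with a .json extension. Moving only the data file, as any file manager or mv command would, silently leaves the metadata behind.

```python
import json
import shutil
import tempfile
from pathlib import Path


def sidecar_for(data_path: Path) -> Path:
    """Hypothetical helper: same name as the data file, but ending in .json."""
    name = data_path.name
    for suffix in (".nii.gz", ".nii"):
        if name.endswith(suffix):
            return data_path.with_name(name[: -len(suffix)] + ".json")
    return data_path.with_suffix(".json")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "sub-01_task-rest_bold.nii.gz"  # illustrative file name
    data.write_bytes(b"placeholder bytes, not a real NIfTI image")
    sidecar_for(data).write_text(json.dumps({"RepetitionTime": 2.0}))
    has_metadata_before = sidecar_for(data).exists()

    # Move the data file on its own; the filesystem knows nothing about
    # the relationship between the two files.
    archive = root / "archive"
    archive.mkdir()
    moved = Path(shutil.move(str(data), str(archive / data.name)))
    has_metadata_after = sidecar_for(moved).exists()
```

After the move, has_metadata_after is False: the sidecar still exists, but no longer sits next to the file it describes.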

Interestingly enough, the above opinion appears to have been a primary motivating factor for the creation of the NIfTI format to begin with. NIfTI was preceded by the older ANALYZE format, which specified that metadata for an image file (with the “.img” extension) was to be stored in a sidecar file with the “.hdr” extension. As often is the case, time is cyclical and old problems in computing often recur.

To the extent that I am aware of developments in the wonderful world of NIfTI, a JSON text extension was never added to the NIfTI header. However, there’s an interesting workaround. NIfTI files are commonly gzipped, and all major tools I’ve worked with seamlessly decompress such NIfTI files when reading them in. According to the gzip file format specification (RFC 1952, section 2.3), if a specific bit (FLG.FCOMMENT) is set in the gzip file header, you can add an arbitrary amount of zero-terminated, Latin-1 encoded text to the header. That means that, rather than embedding JSON within the NIfTI header (which, again, is hard to change without convening a working group), someone could instead embed the JSON within the gzip header. In this way, gzip can be used not only to compress data, but also as a container format for additional metadata.
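As a rough sketch of what this looks like in practice, the Python below assembles a single gzip member by hand, following the header layout in RFC 1952. The function names and payload are invented for this example, and this is not the code linked from this post. It sets FLG.FCOMMENT, writes the comment as zero-terminated Latin-1 text, and then appends a raw DEFLATE stream plus the CRC32 and length trailer.

```python
import gzip
import json
import struct
import zlib

FEXTRA, FNAME, FCOMMENT = 0x04, 0x08, 0x10  # FLG bits from RFC 1952


def gzip_with_comment(payload: bytes, comment: str) -> bytes:
    """Build one gzip member whose header carries `comment` (illustrative)."""
    comment_bytes = comment.encode("latin-1")
    if b"\x00" in comment_bytes:
        raise ValueError("the comment field is zero-terminated; no NUL bytes")
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FCOMMENT, 0, 0, 255)
    deflater = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate
    body = deflater.compress(payload) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + comment_bytes + b"\x00" + body + trailer


def read_comment(blob: bytes) -> str:
    """Return the header comment of the first gzip member, or '' if absent."""
    if blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flg, pos = blob[3], 10
    if flg & FEXTRA:  # skip the optional extra field
        pos += 2 + struct.unpack_from("<H", blob, pos)[0]
    if flg & FNAME:   # skip the optional original file name
        pos = blob.index(b"\x00", pos) + 1
    if not flg & FCOMMENT:
        return ""
    return blob[pos:blob.index(b"\x00", pos)].decode("latin-1")


payload = b"pretend this is an uncompressed NIfTI image"
blob = gzip_with_comment(payload, json.dumps({"RepetitionTime": 2.0}))
recovered_payload = gzip.decompress(blob)           # ordinary readers still work
recovered_metadata = json.loads(read_comment(blob))
```

Standard gzip readers skip over the comment field, so existing tools should keep decompressing the data without noticing the extra metadata.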

The restriction that text be encoded as Latin-1, however, poses some difficulty. What if desirable metadata values are incompatible with this character set? Additionally, what if we wanted to embed metadata stored in a different file format into the gzip header? Fortunately for us, Latin-1 encoding is a superset of the commonly used US-ASCII encoding, and storing non-ASCII characters or arbitrary binary data as ASCII text is a problem that was solved long ago. In fact, e-mail relies extensively upon such solutions, via the MIME standard. This means that, instead of restricting ourselves to adding JSON text to the gzip header’s comment field, we can add pretty much anything, so long as that anything is formatted as an e-mail message with the relevant MIME headers.
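To illustrate the MIME route, here is a small Python sketch that uses the standard library’s email package. The helper names are hypothetical. Arbitrary bytes are wrapped in a MIME part, which base64-encodes them by default, so the result is plain ASCII text with no NUL bytes and therefore fits in a zero-terminated Latin-1 comment field.

```python
from email import message_from_string
from email.mime.application import MIMEApplication


def to_ascii_mime(binary_metadata: bytes, subtype: str = "octet-stream") -> str:
    """Wrap arbitrary bytes in a MIME part (base64 is the default encoder)."""
    return MIMEApplication(binary_metadata, _subtype=subtype).as_string()


def from_ascii_mime(text: str) -> bytes:
    """Parse the MIME text and decode the payload back into the original bytes."""
    return message_from_string(text).get_payload(decode=True)


binary_metadata = bytes(range(256))  # every possible byte value, NUL included
ascii_text = to_ascii_mime(binary_metadata)
round_tripped = from_ascii_mime(ascii_text)
```

The Content-Type header travels with the payload, so a reader can tell what kind of metadata it has found before decoding it.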

The main disadvantage to this approach is that the data-to-text encoding most commonly used with MIME, base64, results in data taking up 1.37 times what it would otherwise (see here). This is pretty counterproductive from gzip’s perspective, considering that its main goal is to decrease the size of a dataset! Nevertheless, I’m not sure if this disadvantage outweighs the advantage of having metadata embedded directly within a dataset. Presumably the metadata takes up a trivial amount of space anyways, so the additional overhead doesn’t amount to much.
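The overhead figure can be checked with a few lines of Python (a sketch using the standard base64 module; the input is just a run of zero bytes). Base64 by itself maps every 3 bytes to 4 characters, and MIME additionally breaks the output into 76-character lines, each terminated by CRLF when sent as e-mail.

```python
import base64

raw = bytes(57 * 1000)  # 57,000 zero bytes; the content does not matter here

plain = base64.b64encode(raw)        # no line breaks: 4 characters per 3 bytes
mime_lf = base64.encodebytes(raw)    # newline after every 76 characters
mime_crlf = mime_lf.replace(b"\n", b"\r\n")  # CRLF line endings, as in e-mail

ratio_plain = len(plain) / len(raw)      # 4/3, about 1.33
ratio_crlf = len(mime_crlf) / len(raw)   # 78/57, about 1.37
```

For a few kilobytes of JSON metadata next to a multi-megabyte image, either ratio amounts to a negligible number of extra bytes.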

If anyone is curious about this approach, I’ve been playing around with it using some incredibly hacky Python code that can be found here. Unfortunately, Python’s gzip module doesn’t give you the ability to populate the gzip header comment field directly, so you have to do some more low-level coding to get the desired effect.

[0] Note (5/14/20): It has come to my attention that the NIfTI header provides its own native mechanism for adding embedded metadata, described here, which I missed during my read-through of the standard. This does not change the fact that the gzip header can be used to add embedded metadata to any arbitrary file.

Meditations on the 'Archivability Crisis' in Science and the Long-Term Reproducibility of Scientific Analyses

Mon, 19 Nov 2018 06:52:07 -0500

This post is a response to C. Titus Brown’s How I learned to stop worrying and love the coming archivability crisis in scientific software, informed both by Emulation & Virtualization as Preservation Strategies by David S. H. Rosenthal and past experiences attending Vintage Computer Festival East.

I intend to react to several of Brown’s assertions. Namely: 1) that one can’t save all the software necessary to faithfully reproduce an analysis pipeline, 2) that containers and VM images, as black boxes, are bad for inspectability, and 3) that analyses have a “half-life of utility”, and this in turn renders literal reproducibility undesirable due to cost and effort. Note that Brown’s own views on this may have converged with my own at this point to some degree- I have not been able to fully keep up with his writings, so I apologize if he has made similar points to my own elsewhere. There quite likely have also been many other developments on the subject of scientific reproducibility that I am ignorant of as well.

Software Preservation, Digital Darwinism, and the Role of Packaging Systems in Promoting Reproducibility

I’ll start by putting my own biases out for inspection- I’m candidly a lot more bullish about the long-term viability of replicating scientific experiments in silico going forward, and my gut reaction to Brown’s post is that the claim that we can’t save all software, while superficially true in the sense that it would be near impossible to save the entire statistical population of all written software, is hyperbolic in the sense that the most popular scientific software stacks will be sufficiently preserved into the distant future. Why do I think this? Statistically, the more popular a software package is, the more copies of it will be made, and the more likely it is to survive. Furthermore, a substantial number of the software layers that scientific computing relies upon are not unique to science, and even more copies of them are made by non-science actors due to their general purpose nature. Essentially, I have confidence in a sort of digital Darwinism that will ensure that core packages essential to scientific analyses remain around for a while. Whether or not this Darwinism applies to binaries and source code equally is debatable, but I strongly believe that most software will survive in some form in this manner (e.g., LINPACK in its 1988 incarnation is still downloadable today).

Beyond an evolutionary argument, modern software stacks have an advantage in the sense that most environments, whether they be Linux, BSD, or OS X (heck even Windows) now have package managers that allow one to install specific versions of a piece of software. So long as one mirrors the package repositories (or even just individual package files for dependencies) for all the applications and libraries used in an analysis, one can easily unpack pre-requisite software and re-create an environment for one’s operating system of choice. As an added bonus, for general purpose package repositories, there are already several mirrors of packages going back to at least the mid-2000s. Furthermore, most good package managers will also perform checksums on packages after they are installed to ensure an executable’s integrity, which prevents corruption in binaries from potentially influencing the result of an analysis. Prior to the development of package managers such as apt and yum in the late 90s and early 2000s, software stacks definitely were much harder to reproduce due to inconsistent software distribution methods, but dependency resolution and software installation seem to have largely been solved problems since. Unfortunately, due to the admittedly non-sexy work involved in maintaining package repositories, and time constraints due to the present publish or perish culture (which precludes academic service work other than paper production) it seems as though this key tool will not be leveraged to its full potential, although there are notable efforts such as the conda package manager, the Fedora SciTech SIG, DebianScience, NeuroFedora and NeuroDebian. Admittedly, there are also issues with the ability to reproduce the binaries themselves in package repositories reliably (see here), but as long as the binaries in the repositories themselves remain the same, a particular stack should be re-creatable (albeit potentially flawed).
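To make the mirroring-plus-checksum idea concrete, here is a minimal sketch in Python of checking a directory of mirrored package files against recorded SHA-256 digests. The manifest layout (file name mapped to hex digest) and the function names are my own assumptions for illustration, not the format that any particular package manager uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_mirror(manifest: dict, mirror_dir: Path) -> list:
    """Return the file names that are missing or whose digest differs.

    `manifest` is a hypothetical mapping of package file name to the
    SHA-256 digest recorded when the analysis was originally run.
    """
    bad = []
    for filename, expected in manifest.items():
        candidate = mirror_dir / filename
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(filename)
    return bad
```

An empty list from verify_mirror would suggest that the mirrored files are byte-for-byte what was originally installed; it says nothing about whether those binaries were built correctly in the first place.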

My hope is also anchored in the fact that non-scientists, such as librarians and law enforcement officers, have a vested interest in maintaining software stacks on a decades-long time scale as well. Librarians do so to preserve cultural heritage. Law enforcement, in contrast, does so for more practical and less abstract reasons (although both groups often collaborate). Forensics specialists, in order to rapidly investigate born-digital evidence, have a need to have rapid access to both signatures identifying applications and key files included with those applications. To this end, NIST maintains a substantial collection of software packages as part of its National Software Reference Library. The public metadata for this collection indicates that a large number of popular pre-1999 scientific packages, such as MATLAB and SPSS, are already included in the collection.

Hardware and Software Emulation

Thus far I’ve focused primarily on software itself, but what about lower levels of the stack, such as the operating system and hardware? For legacy ecosystems such as SPARC, 68k, or PPC, these lower layers can be faithfully reproduced through software emulation at a minimum, and through hardware-based strategies if some other property (more authentic execution, lower power consumption, etc.) is necessary. New developments in emulation, while primarily driven by retrocomputing and gaming enthusiasts instead of scientific researchers, hold great promise for the long-term near-literal reproducibility of older analyses (an especially notable project by Brian Stuart can even run reference programs for the ENIAC). To better contextualize my thoughts on this, I’d like to note the distinction between two forms of fidelity when a digital artefact (such as an analysis or digital art exhibit) is emulated: execution fidelity, which pertains to the accuracy of instructions performed by a computer, and experiential fidelity, which pertains to the ability of a simulation to accurately mimic the first-person subjective experience of a technology within its context (Rosenthal, 2.4.3). The former is achievable in principle due to Turing equivalence. As David Rosenthal notes:

In a Turing sense all computers are equivalent to each other, so it is possible for an emulator to replicate the behavior of the target machine’s CPU and memory exactly, and most emulators do that.

The latter form of fidelity cannot be implemented via emulation alone, but matters little for in silico reproducibility since any context related to the appearance of the hardware is irrelevant to the veracity of the results. Experiential fidelity firmly falls into that category of things so impertinent to the experiment that they belong to statistical error at best (e.g., the color of Rutherford’s tie when he first performed his gold foil experiment, if he even wore a tie). It simply does not matter whether or not a study’s result is displayed on a monochrome Mac Plus CRT display or a modern MacBook Pro Retina monitor (with the very rare exception perhaps of disciplines that rely on imaging). Truth that approaches objective truth is timeless and while the paradigms that guide science may be influenced by a contemporary culture as science evolves, the results that science ultimately converges upon belong to nature itself and not any specific cultural worldview.

It could be argued that while Turing equivalence makes it theoretically possible that any analysis can be reproduced literally, it may not be pragmatic to achieve this ideal. This is a fair point, which is why I think pragmatically science should strive for near-literal reproducibility rather than literal reproducibility. I believe that for scientific analyses at least, if we cannot have execution fidelity that is 100% accurate and precise, we should at the very least strive to minimize the range of our confidence intervals, almost in the same way that equipment manufacturers strive to create components within certain tolerance ranges. It’s also worth pointing out that even if the ideal execution fidelity were fulfilled, nondeterministic steps in an analysis pipeline could yield different end results anyways.
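As a rough illustration of what treating reproducibility like a manufacturing tolerance might look like, here is a minimal Python sketch that compares the numeric outputs of an original run and a re-run. The function name and the default tolerance are assumptions of mine; an appropriate tolerance would have to be chosen per analysis.

```python
import math


def within_tolerance(original, rerun, rel_tol=1e-9, abs_tol=0.0):
    """Return True if both runs have the same number of values and every
    paired value agrees within the given relative/absolute tolerance."""
    if len(original) != len(rerun):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(original, rerun)
    )
```

Under this framing a re-run on emulated hardware would not need to match bit for bit; it would only need to land inside a tolerance band that the original authors declared up front.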

In terms of how we might emulate analyses produced today, it is important to note that two instruction set architectures, Intel/AMD x86_64 and ARM, underpin the vast majority of all computing due to market consolidation in the CPU space (Rosenthal, 3.2.4). Within scientific computing in particular, this market consolidation is even more acute, with only 6.2% of the Top 500 supercomputers using architectures other than AMD/Intel’s x86_64 as of June 2018 (see here). This means that, relative to the heterogeneity of hardware architectures in the 90s, the total set of hardware we would need to emulate in the future is quite small for a modern scientific analysis. Furthermore, it is abundantly clear that the majority of Top 500 HPC clusters are running Linux, which indicates that there are very few degrees of freedom when it comes to operating system choice in modern scientific environments.

In his blog, Brown cites how some researchers have proposed co-opting the software development concept of “continuous integration” as a potential solution for the reproducibility crisis, re-running analyses constantly as data is recorded. While this concept is intriguing, I’d suggest that scientists adapt another concept from the software development world, code coverage, but with the “coverage” element not relating to how many functions in their code have unit tests, but rather how many components of their software/hardware stack will likely be emulatable in the future. While there’s no guarantee that a software developer will write an emulator that implements every single feature of a given instruction set architecture in the future, this kind of “emulation coverage” might be informed by factors such as a) how niche the hardware is, b) how large the community of end-users is (and how fanatical they are), and c) how well-documented the hardware’s interfaces and underlying implementations are.
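By analogy with code coverage, one very simple way to express “emulation coverage” would be the fraction of components in a stack that are judged likely to be emulatable. The sketch below is purely illustrative: the Component class, the example stack, and the yes/no judgements in it are all hypothetical, and a real assessment would weigh the factors listed above rather than a single boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    emulatable: bool  # a human judgement call, not something measured


def emulation_coverage(stack):
    """Return the fraction of stack components judged likely to be emulatable."""
    if not stack:
        raise ValueError("stack must contain at least one component")
    return sum(c.emulatable for c in stack) / len(stack)


# Hypothetical stack with made-up judgements, for illustration only.
example_stack = [
    Component("x86_64 CPU", True),
    Component("Linux kernel and userland", True),
    Component("proprietary GPU", False),
    Component("vendor-specific instrument driver", False),
]
coverage = emulation_coverage(example_stack)
```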

Beyond software-based emulation of older hardware, there’s also been a growing trend of using FPGAs to implement older hardware directly. There are numerous advantages to this approach, from better utilization of resources such as electricity and compute cycles (i.e., many cycles are idle when emulating a legacy machine on a powerful machine and nothing else is running in the background), to the ability to run custom hardware in the cloud via Amazon’s F1 instance types. A good example of this in the education space is Stephen Edwards’s Apple2fpga, which re-implements an Apple II+ using an FPGA board. This example also illustrates how the Apple II might be considered a platform with good emulation coverage, since as Edwards notes on his site:

The Apple II has been documented in great detail. Starting with the first Apple II “Redbook” Reference Manual, Apple itself has published the schematics for the Apple II series. When Woz spoke at Columbia, he mentioned this was intentional: he wanted to share as much technical information as possible to educate the users.

It should be noted that historically speaking, it wasn’t uncommon to run multiple software architectures on a single machine. At Vintage Computer Festival East 2018, an exhibit entitled “Microcomputers With an Identity Crisis” by Douglas Crawford, Chris Fala, and Todd George demonstrated how ASICs (such as the Apple IIe compatibility card for the Mac Color Classic) and add-on cards (such as a 486 CPU that operated in tandem with a PowerPC processor in the PowerMac 6100) were used to consolidate the large number of incompatible hardware architectures during the 80s and 90s onto single machines. Modern cloud infrastructure with FPGA instances can serve a similar purpose in allowing multiple hardware architectures to be run alongside each other, and it’s feasible that on-premises servers with FPGAs could also be used to re-create hardware architectures on-demand (although these would likely be for custom hardware such as GPUs or other co-processors, since CPU emulation via software in many cases seems more pragmatic).

On Docker/Virtualization, Inspectability and Configuration Management Tooling

I find myself in agreement with Brown’s assessment of Docker, in that treating scientific software and analysis pipelines as black boxes is dangerous for reproducibility and reduces transparency, although admittedly it could be argued that Dockerfiles fulfill some of the requirements of inspectability that Brown advocates. The primary issue I have with Dockerfiles for purposes of inspectability is that they aren’t sufficiently portable, and can’t readily be used to deploy to bare metal, virtual machines, or other containerization technologies. Anecdotally, it seems that many researchers (and companies) are in a rush to embrace Docker without fully considering that it could quite easily be supplanted by Kata Containers, jails or zones should the tech industry embrace these alternative technologies (which owing to some of Docker’s historical failings wouldn’t surprise me). This particular nitpick of mine is related to the “emulation coverage” I discussed earlier, where it’s important to consider the long-term viability of a technology before employing it.

Instead of Dockerfiles, I think that the inspectability criterion should be fulfilled by configuration management tools such as Ansible, Puppet, Chef, and Cfengine. Configuration management tools use a declarative syntax that allows one to specify exactly how one’s environment is set up (i.e., which packages are installed, which versions are used, which hard disk volumes are mounted, etc.). While this syntax varies between tools, it serves two purposes: it can be applied against a base installation of an operating system to bring an execution host to a desired state, and it can serve as documentation that future researchers can look at and remix. Intriguingly, Cfengine was created with the explicit goal of setting up research environments for physicists in the early 90s, which consequently means that a large number of computing environments from that time period might be very well-characterized.
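The core idea behind declarative configuration, namely describing the desired end state and letting a tool work out which steps are needed, can be sketched in a few lines of Python. This is a toy model under my own assumptions (package name mapped to version, with made-up action names); it is not how Ansible, Puppet, Chef, or Cfengine are actually implemented or invoked.

```python
def plan(desired, installed):
    """Return the (action, package, version) steps needed to reach `desired`."""
    steps = []
    for name, version in sorted(desired.items()):
        current = installed.get(name)
        if current is None:
            steps.append(("install", name, version))
        elif current != version:
            steps.append(("change_version", name, version))
    return steps


def apply(steps, installed):
    """Return a new package state with the planned steps applied."""
    result = dict(installed)
    for _action, name, version in steps:
        result[name] = version
    return result
```

The useful property is idempotence: once a host is in the desired state, planning again yields no steps, so the same description can double as documentation of what the environment is supposed to look like.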

Conda environment.yml files also fulfill inspectability in a manner similar to configuration management tools. In fact, these files are one way to construct a Docker image using HHMI’s Binder tool. At this point, it’s important to note that the effort needed to reproduce a pipeline doesn’t always need to focus on OS-level dependencies. If a pipeline is implemented solely in a higher-level language such as Python, a researcher may have considerably less work in terms of documenting his/her environment. Indeed, Docker is basically to C and C libraries what a Python virtualenv is to Python and Python packages. If a researcher has no need for a specific version of a C library then a requirements.txt or environment.yml that can create a Python virtualenv (along with documentation for which base installation it was installed atop, which could be provided by a systems administrator) might be adequate.
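For a pure-Python pipeline, capturing the environment can be as small as recording the exact versions of the installed packages. The sketch below uses the standard library’s importlib.metadata to produce requirements.txt-style pins; the function name is my own, and in practice something like pip freeze or conda env export serves the same purpose.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' pins for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(pinned_requirements()))
```

Note that this records only Python packages; the base operating system and any C libraries underneath would still need to be documented separately.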

Finally, I believe that the inspectability criterion must be fulfilled by well-documented workflows that indicate how tools in an analysis fit together. My opinion is that this would be best accomplished by a common syntax such as Common Workflow Language (CWL). I’ve discussed my beliefs about the potential benefits of CWL at length elsewhere, so I won’t re-hash them here, but the short of it is that I strongly believe in a neutral, platform-agnostic way for describing how tools fit together (essentially high-level programming).

Thoughts on Literal Reproducibility, and the Utility of Research Products

While I agree with Brown that literal reproducibility is impractical, I strongly believe that the ability to reproduce analyses with the highest precision possible has value beyond the short-term. For one, I think that there is a moral imperative for analyses to be repeatable, since most analyses are not produced with private funding, but rather governmental funding, and as such should be available to the general public in a form that they can inspect and run with relative ease (should they obtain the hardware resources necessary to do so). I believe that this is essential for public instruction, to narrow the gap between scientists in an “ivory tower” and everyday citizens, to demonstrate good use of public funds, and to increase accountability / reduce scientific fraud. I also believe that it makes science less vulnerable to attack, especially in the case of climate science, since it’s harder for climate science deniers to gain traction if the exact analyses that they want to discredit are accessible and readily runnable in some fashion. Indeed, greater workflow transparency might mitigate allegations that scientists are engaging in a conspiracy to further a political agenda (e.g., Climategate).

Additionally, I think that it’s dangerous to make assumptions about the utility of exact repeatability (or anything else in science or life for that matter) in the long-term. The value that we assign to any given research product, such as a reproducible analysis, is constantly in flux, and there’s no reasonable way to predict the worth of that analysis at any given point in the distant future. Historical judgements anticipating the future value of research products have led to extremely suboptimal decisions, such as the alleged destruction of data (see disclaimer below at [1]). Furthermore, value judgements about analyses have led to systematic issues within science itself such as the file drawer problem wherein null results are undervalued and thus never communicated to other scientists (at best).

It’s also worth pointing out that the problem-solving process in science itself benefits from having a legacy analysis available for revival if necessary in the distant future. In cognitive science, it is theorized that two methods of problem-solving, difference reduction and means-ends analysis, are employed to get to a goal state, such as an analysis-backed conclusion. In difference reduction, or hill climbing, actions are continually performed that minimize the difference between the current state and a desired goal state. This works fine for simple problems, but has the caveat that a problem solver can get stuck in a rut (or local maximum) and not step back to go down a path that would lead them to the ideal goal state (a global maximum). In means-ends analysis, multiple sub-goals are created as responses to blocking states as the problem-solving process occurs. These sub-goals are then considered separately to build a path to an overall goal state. Means-ends analysis is the ideal problem-solving strategy; however, it assumes a well-characterized goal, which is often not the case in more exploratory scientific analyses. Due to the high uncertainty of scientific outcomes, I would assume that difference reduction reasoning strategies are more prevalent than means-ends analysis in the scientific process. Within the context of the computational reproducibility conversation, what this means is that a scientific field could go down a kind of garden path for a while, get stuck in a local maximum, and then need to back up to unblock itself. If there is a need to back up to the point before a scientific paradigm branched towards a particular direction, we would want to be able to re-run analyses from that exact point of branching to determine our next directions.
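The way hill climbing gets stuck can be shown with a toy example. The Python sketch below is a minimal greedy climber over a one-dimensional list of scores; the landscape values are made up for illustration and stand in for how good a line of inquiry looks at each step.

```python
def hill_climb(scores, start):
    """Greedy difference reduction: keep moving to the better neighbour
    until no neighbour improves on the current position."""
    position = start
    while True:
        neighbours = [i for i in (position - 1, position + 1) if 0 <= i < len(scores)]
        if not neighbours:
            return position
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[position]:
            return position
        position = best


# Made-up landscape: a small peak at index 2 and the global peak at index 6.
landscape = [1, 3, 5, 4, 2, 6, 9, 7]
stuck_at = hill_climb(landscape, start=0)    # stops on the small peak
reaches = hill_climb(landscape, start=4)     # restarting elsewhere finds the global peak
```

Starting from index 0 the climber halts at index 2, while restarting from index 4 reaches index 6. Being able to re-run an old analysis is what makes that kind of restart possible from an earlier branching point.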

Other Considerations: Instrumentation Fidelity and the Role of Data in Reproducible Analyses

Up until now, I’ve focused on reproducing the CPU instructions / program that performs an analysis, but what about the data? While I’ve been confident about our ability to preserve software stacks and re-run them via emulation, I’m not quite as confident about our ability to preserve raw data into the distant future. This is in large part because the packaging tooling around datasets doesn’t seem nearly as mature as the tooling for the instructions that will run on that data, and also because most data out there isn’t general purpose enough for non-scientists to want to preserve. As I expressed earlier, even packaging systems such as yum or apt don’t seem to be fully leveraged by the scientific community, and tooling based around datasets seems to be even worse. There’s some cause for optimism, however, as projects/products like Datalad/git-annex, dat, osfclient, and Quilt have progressed. This space is still very new, though, and I wouldn’t expect it to really take off until the scientific and analytics/data science communities converge upon a standard.

My cynicism about the long-term preservation of scientific data is rooted in the fact that data seems to be treated as a second-class citizen and is more readily thrown out. This disposal of data is possibly in part due to a perception that new data can always be gathered and an analysis re-run, but also due to the fact that scientific and governmental institutions give little thought to sustainable funding models for data storage. Furthermore, the long-term preservation of any data, not exclusively scientific data, can be an expensive and complicated affair if one wants to do it right, and historically there hasn’t been a large interest in proactively engaging in data preservation (see here). Instead, conversations about data preservation typically tend to be reactive and come up within the context of political censorship. There are other contributors to my pessimism, such as poor documentation/metadata indicating how raw data can be used and, as Brown points out in his blog, a disconnect between older formats and newer tools, which I might rant about later. My cynicism on this point is also slightly colored by the fate of the fMRI Data Center, the first large-scale effort at open access data sharing for neuroimaging data, which disappeared seemingly overnight after government funding dried up. Simply put, the structure of scientific grants does not always make it easy to create reliable core infrastructure, and more often than not data storage (both short-term and long-term) seems to be given short shrift.

There’s also another form of fidelity at play in computational analyses that can get lost in the discussion when general-purpose computation is emphasized heavily; ideally the equipment and methods used for data collection should be repeatable in the future as well. This repeatability would constitute a variant of the concept of execution fidelity that I’d like to call “instrumentation fidelity”. While the execution fidelity discussed previously deals with Turing-complete general-purpose computers, instrumentation fidelity deals with niche processing units that take in input (sensations of natural phenomena), process these through a variety of sensors and other hardware, and produce output in the form of scientific data.

While such instrumentation fidelity is not as important if we consider data to be fixed in order to create a single repeatable version of a specific analysis, it does have implications for remixability and the general-purpose suitability of an analysis pipeline. Indeed, I don’t believe that it would be wise to consider collected data to be “hard-coded”. If an analysis is to be truly generalizable, it should also be able to produce results that fall within a confidence interval with an entirely novel set of data. In fact, the ability to re-run an analysis with an entirely novel set of data gathered by an independent lab might be a very useful form of blind peer review that increases the quality of published findings tremendously.

But how would we achieve inspectability with data gathering instruments, and how could we ensure that the instrumentation itself wasn’t flawed in some systematic way? While Brown asserts that “closed source software is useless crap because it satisfies neither repeatability nor inspectability”, it’s also true that most equipment designs for life sciences are proprietary, and in many cases it’s impossible to double-check for fundamental design flaws that might systematically distort the data a pipeline acts upon. Though there are some efforts to create open hardware for labs, in many cases instrumentation is too niche to have an open source equivalent or tool manufacturers benefit from economies of scale that individual labs don’t have.

Instrumentation fidelity also opens up a number of other questions about how literally an analysis should be reproduced. Should a neuroimaging researcher build his or her own open source EEG or MRI scanner using designs from 30 years ago to replicate a past analysis? Given the advances in technology over those 30 years, which would presumably lead to higher-quality data (unless newer instrumentation has its own systematic flaws that render it inferior to older equipment) this would seem to be an absurd proposal. My intuition tells me that reconstructing legacy instrumentation should not be a priority, though this seems as though it could be a dizzying debate for another time.

Summary

In brief, my thoughts are that:

It seems unlikely that scientific software stacks will be incredibly difficult to preserve in many (if not most) cases. Many elements of scientific software stacks, such as OS libraries, are more likely to be preserved due to their general purpose use. Non-scientists such as librarians and law enforcement have a vested interest in preserving scientific software on a decades-long time scale. Software packaging, if adopted more widely, would make reproducibility even easier to achieve, although at a minimum the use of version control tools such as git could provide some of the same functionality as packaging (such as the ability to install specific software versions via commit number).
While literal reproducibility is impractical (though theoretically possible), striving for narrow confidence intervals / tolerance limits for acceptable outputs from a scientific analysis is less so.
Hardware and software emulation of lower levels of the scientific computing stack should for the most part be adequate. Researchers, when creating computational analyses, should consider how difficult it will be to emulate their analysis in the future (“emulation coverage”).
I agree that the creation of VM images or Docker containers is a non-solution to the reproducibility of in silico scientific analyses. In lieu of relying solely on images and containers, I think that it is more important to document environments via configuration management tools or other environment specifications. These tools both provide high inspectability and can be used to re-create environments on top of a base operating system installation.
Systematic documentation of workflows via a syntax such as Common Workflow Language (CWL) is essential for providing inspectability and documentation for how software components are linked together in an analysis.
More broadly, the value of declarative syntax in facilitating the inspectability of analyses cannot be overstated.
Changes in instrumentation for data collection also throw a wrench into the problem of reproducibility, especially since many tools used to collect data are proprietary and data repositories don’t have the same well-established tooling that software packages have in terms of packaging or version control.
Some of what I’ve said in this post is rooted in a deeper belief that scientific reproducibility is, in some ways, more suited to an engineering mindset than a scientific one.

[1] I realize that the particular example I cite here (the case study of Henry Molaison) is controversial and that the jury is still out on whether any data was actually destroyed. Most likely no malicious intent was involved. Nevertheless, destruction of research materials definitely does occur on some scale throughout science and can lead to great distrust in the court of public opinion.

Why I Support the Common Workflow Language

Wed, 05 Sep 2018 19:48:32 -0400

I’ve been wanting to write a post about Common Workflow Language (CWL) for a while now and, realizing that if I don’t do so now I likely never will, have decided to embark upon an attempt at articulating my thoughts about why I support this project. For those who are unfamiliar with CWL, it is essentially a simple YAML-based syntax for expressing input-output relations between programs in a workflow. This is similar to the concept of piping inputs between commands in a Unix shell, or defining steps that need to be performed to compile a program using a makefile. I’ve been following it sporadically since I stopped working in science because it isolates the pipeline definition functionality of other flow-based tools used by scientists such as Nipype or Galaxy in a platform and field agnostic way.

This agnosticism is the key innovation that CWL brings to the table; by separating pipeline specification from the logic that determines how a pipeline is executed scientists and data engineers using CWL can switch between different job schedulers with different ways of optimizing machine resources. Essentially, programs wrapped in CWL can be thought of like a CPU instruction set and a job scheduler that knows how to interpret CWL can be thought of like a hardware backend that knows certain tricks to make those CWL-wrapped commands execute more quickly. If a scientist or engineer is using one backend, but then discovers that another backend uses slightly more clever algorithms that result in a 15% speedup in performance, this portability of CWL would allow him/her to jump from the first backend to the other with relative ease.

CWL also enforces good practices in scientific computing, by emphasizing interfaces over implementations. When you wrap a command line tool in CWL, you not only have something practical that allows you to construct a workflow in Rabix or Cromwell, you also have a specification of what inputs and outputs a typical scientist in your field would want to have for a particular step in an analysis. If you have such a specification and suddenly find that current tools in your field are inadequate for some reason (speed, method has inadequate precision, incorrect outputs), you can create another tool with the same name that takes the same inputs and produces the same outputs but uses totally different logic under the hood. In this way, a tool that would benefit from GPU computing can be converted without breaking a previously defined workflow, or a tool that was originally CPU-limited can be rewritten to leverage FPGAs or ASICs for great performance gains. This is related to my point above about how using an agnostic markup for describing workflows can allow scientists to switch between backends- if you think of a command line tool as a component in a backend (e.g., a floating point unit), CWL allows you to further modify that backend by replacing slower parts with spiffier ones. It’s even conceivable that a scheduler could also be written that is aware of niche hardware and knows how to schedule steps that make use of it effectively.
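To give a feel for what such an interface looks like, here is a minimal sketch of a CWL v1.0 CommandLineTool description for echo, built as a Python dictionary and serialized as JSON (CWL documents are normally written in YAML, but JSON is also accepted). The input and output names, the interface helper, and the printf replacement are my own hypothetical choices for illustration.

```python
import json

# A CWL v1.0 description of `echo`; the names "message" and "printed" are made up.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "outputs": {
        "printed": {"type": "stdout"},
    },
    "stdout": "output.txt",
}


def interface(tool):
    """Return the input and output names and types, ignoring how the tool runs."""
    inputs = {name: spec["type"] for name, spec in tool["inputs"].items()}
    outputs = {name: spec["type"] for name, spec in tool["outputs"].items()}
    return inputs, outputs


# Hypothetical replacement: a different command behind the same interface.
replacement_tool = dict(echo_tool, baseCommand="printf")

print(json.dumps(echo_tool, indent=2))
```

Because interface(echo_tool) and interface(replacement_tool) are identical, a workflow that only depends on the declared inputs and outputs would not need to change when the underlying command does.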

This emphasis on interfaces could also enforce a healthy separation between scientists and the software engineers that develop scientific software. It’s fairly well-established that while many scientists are experienced in their respective fields, not very many are necessarily good at both writing software and keeping up with domain knowledge. In fact, most scientists don’t see the value in formal instruction in software development and consequently produce less than ideal code, a fact that is bemoaned by a Nature editorial. While I agree that scientists need to become more computationally literate, I disagree that they should be expected to fully absorb the expertise necessary to be a competent software developer. Instead, I’m generally supportive of the movement to legitimize research software engineers as distinct specialists in their own right who sit at the intersection of research and engineering. That said, focusing on defining an interface for a black-box function is something that would not require scientists to stray too far from their realms of expertise, while also allowing an individual who was specialized in software engineering to understand their needs. In other words, a CWL tool could serve as a common ground where domain specialists from scientific fields and research software engineering could meet in the middle. Scientists wouldn’t need to write shoddy code but would still be able to steer development for their tools.

Lastly, I like CWL because it enforces the use of the Unix shell. This last point is admittedly something I still haven’t fully fleshed out and might not be super well-substantiated, but I intuitively feel like using general purpose programming languages as glue languages in scientific computing is a mistake. Essentially, I think that the shell is superior for orchestrating complex pipelines because it doesn’t lock you into a particular language ecosystem and is expressly purpose-built to be a user interface to the operating system that programs are running on. In contrast, committing to Python or Perl to glue together all or even some of the steps in a pipeline is riskier due to the fact that these languages are optimized for functionality that programmers want, rather than what a program execution environment needs. Because they cater to programmer desires, these languages are much more likely to evolve quickly, with the consequence that older tools written in prior versions of a language may not interoperate with tools written in a newer version of the language if forced to use the same interpreter. A programmer might even find that a new domain-specific language offers enough of an advantage over incumbent languages that he/she wants to write new tools in that language. Science candidly does not have the resources to constantly concern itself with porting code from one iteration of a language to the next (or to a different language altogether), so there needs to be some way to guarantee that old, creaky code can interoperate well with shiny, sexy new code. Shell functionality, by comparison, is fairly rigid and unchanging (assuming POSIX compliance) and allows you to more readily switch between language environments.

There are other reasons for my interest in CWL, but I don’t have enough time to discuss them in depth at the moment. To quickly touch on them though:

I believe that everything I’ve said above about scientists can also be true of practitioners in other fields- librarians could benefit from CWL for defining preservation workflows, doctors and medical lab scientists could benefit when systematizing their own protocols.
CWL forces researchers to document how their analyses were performed, which leads to more reproducible science. It’s much easier to hand over markup specifying how the results of a study were obtained than to write in more verbose journal-ese, and since the markup is necessary for the analysis itself to execute, it’s impossible to leave off steps (unlike the current status quo).

2018: A Digitization and Data Migration Odyssey

Sun, 24 Jun 2018 19:37:50 -0400

Recently I journeyed into the hinterlands of upstate New York to visit my mother for the entire week of Memorial Day weekend. This was partially to be a good son and keep my mother company, partially to escape the air and noise pollution of New York City for a world of grass and open spaces, and partially to help with another large family project- a general cleanup and decluttering of my mom’s house. Since my dad’s passing, it’s become increasingly obvious that my childhood home is too packed with odd objects and artifacts that needlessly complicate my mom’s life, and I wanted to do my part to get rid of some of those bits and pieces.

As someone working in the library space, my focus was drawn immediately to perhaps the biggest contributor to clutter in the house- obsolete analogue and digital media formats. From 8mm tapes to audio cassettes to VHS tapes, Jaz disks and floppies, the family house is littered with all manner of data that can now be stored on thumb drives and other modern media that take up a fraction of the physical space. With a mission in mind, I began preparing myself in advance by ordering supplies on Amazon and eBay to facilitate (as the title would imply)… a home-based digitization and data migration odyssey! Now, I know that there are a number of companies that are out there that would do this professionally, but costs can be prohibitive for an individual household (and candidly not worth it for some of the items I brought into the 21st century) and I wanted a fun and relaxing project to work on during my break from work (the irony is not lost on me).

One of the first tasks that I needed to do involved winding up an earlier endeavor that I had nearly finished last December (I had begun this particular subgoal around 2010 but had only been able to work on it piecemeal since then). There were a number of VHS and VHS-C tapes that needed to be migrated to a digital medium (only unique home videos; nothing that could be found on Netflix). The bulk of these tapes were converted through a process that involved playing the tapes back in real time and recording them to DVD with a DVD/VHS combo player and then ripping the DVDs using Handbrake with the Fast 720p30 setting (chosen primarily because it matched the DVD resolution and was already more than enough to work well given the quality of VHS recordings; admittedly it might not be the ideal setting). Most of the VHS tapes around the house had undergone this process, but there were a few stragglers (some of which were duplicates of tapes that I ripped earlier, but possibly with higher quality). And thus I set out to digitize the remaining few at a slightly cluttered station:


The VHS-to-DVD Conversion Station, complete with cranberry juice, disk sleeves, DVDs, labels, a pen, and love

There also were a number of audio cassettes scattered about the house, which were rather simple to rip. For these I simply used an older iMac with an audio-in jack connected to a boombox, the aforementioned boombox, and a version of Audacity compatible with the version of Mac OS X on the iMac. It was tedious, but my approach was simply to play the cassettes back in real time and keep a 30 or 45 minute timer (depending on the side length) to alert me when the cassettes needed to be flipped.


A lightly improvised cassette ripping studio

Concurrent with my analogue ripping tasks, I also juggled finding means to migrate data from numerous obsolete digital media to my laptop’s hard drive. I had mixed success on this front, ripping a small number of 3.5 inch floppies, all of my family’s 5.25 inch floppies (this was actually the easiest migration, somewhat counterintuitively), and various other SCSI-based media.

I started with the 3.5 inch floppies because I had pre-emptively purchased a cheap USB floppy drive (from Chuanganzhuo) and I figured this would be the easiest task. My strategy was simply to run dd on each disk as I inserted it and then eject it afterwards. On Macs this can be a bit tedious, since you first need to run diskutil unmountDisk /path/to/dev to unmount the disk before invoking dd, so I wrote a short script to wrap this command, dd, and a sleep between the two for safety. Unfortunately, I was only able to get through about 47 floppies (including a few curiosities from a local mid-90s ISP) due to a hitch involving Apple Macintosh floppy disks. According to Sonic Purity, Apple (in the true spirit of “Think Different”) had outfitted their floppy drives with variable speed motors to cram an extra 40-80k onto each disk, and the only way to migrate floppies designed for these drives was to use the original hardware. This explained why numerous disks simply refused to mount with my USB drive, leaving a large number of disks orphaned until a suitable machine for a future migration can be found (both my Mac Plus and Centris were busted, unfortunately).


The 3.5 floppy migration setup
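For anyone curious, the wrapper script mentioned above amounts to very little. The following is a minimal sketch of the same idea in Python rather than a copy of the script I actually used; the device path, image name, block size, and pause length are all placeholders that would need adjusting for a real setup.

```python
import subprocess
import time


def imaging_commands(device, image_path):
    """Return the unmount and dd commands for one floppy as argument lists."""
    unmount = ["diskutil", "unmountDisk", device]
    copy = ["dd", f"if={device}", f"of={image_path}", "bs=512"]
    return unmount, copy


def image_floppy(device, image_path, pause_seconds=2, run=subprocess.run):
    """Unmount the disk, wait briefly, then copy it block for block with dd."""
    unmount, copy = imaging_commands(device, image_path)
    run(unmount, check=True)
    time.sleep(pause_seconds)  # assumed pause so the OS can release the device
    run(copy, check=True)


if __name__ == "__main__":
    # Hypothetical device node and output name; check `diskutil list` first.
    image_floppy("/dev/disk2", "floppy_001.img")
```

The run parameter exists only so that the commands can be inspected or swapped out without touching real hardware.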

The 5.25 inch floppies were refreshingly easier, which shocked me since the hardware I was dealing with was substantially older. For these I used an Apple IIe with ADTPro, a Super Serial Card (though I also had several Serial Pro cards that I was tempted to experiment with), a DB9 to DB25 null modem connector, and a DB9 to USB adapter.


The Super Serial card, before setting the jumpers

I had originally considered using the audio jacks on the Apple IIe instead of the serial interface but after realizing how ridiculously slow this would be (about 13 minutes for one 140k floppy according to the rate here) and how fatuously stingy I would have been, I went ahead and went the serial route. The actual transfers were relatively seamless, although I was unable to use the Speediboot bootstrap for some reason and had to send ProDOS first followed by the ADTPro Serial program before I could save the ADTPro client to a 5.25 inch floppy. There were a few floppies that crashed the ADTPro client and gave the monitor a ‘Snow Crash’-like error of various characters, but I was able to avoid this by either retrying or transferring the disk as a nibble image. Naturally, some of the disks experienced data corruption due to their age, but I was pleasantly surprised to find that most were migrated without errors.


My 5.25 inch floppy migration setup

Afterwards, to enjoy the fruits of my labor I tested some of the 5.25 disks I ripped with Virtual II and had a good time replaying a few games of questionable quality that I wrote in BASIC during elementary school.


First screen of a game of questionable quality


Second screen of a game of questionable quality

Last (but not least), I migrated a number of SCSI-based media to my laptop by buying a SCSI II card and then installing it into a PC I built when I was in 7th grade (900 MHz Duron; MS-6330 motherboard). Aside from a few hiccups (Ubuntu Server 10.04, one of the last releases of a modern distro I could find for a 32-bit x86 chip, was not happy with my ATI Radeon 9800 Pro and I had to replace it with an ATI Rage 128), I was able to connect several devices to my old PC, dd their contents into files and then transfer these files to my laptop via a USB thumbdrive. In this way I ripped several Jaz disks and an older hard drive that I used for storing 3D models when I was in 5th grade. I’m slightly less sure that the Jaz disks ripped properly though- the Jaz drive started to complain when the dd output got to around the 730 MB mark, which makes me think that I encountered some form of the infamous click of death and will need to try another drive to migrate the data on those disks with higher fidelity (the resulting images were mountable within a Sheepshaver environment though, so I was still able to see most of the possibly corrupt files).


SCSI Migrations with “Montagine”

I also tried to migrate a 30 MB Mac Plus external hard drive and several 88MB SyQuest disks with this setup, but experienced little luck, possibly due to lack of driver support or some other quirk that I’m not familiar with.

Some particularly observant readers might notice stickers on the media that I was digitizing. The labels on these stickers are essentially base 26 numbers written using only letters and no numerals, and they are part of an ad-hoc classification scheme wherein certain media are allotted specific ranges. This allowed me to easily match tagged photographs I had taken of each item I migrated / digitized with the copies that made their way to modern media.


My ad-hoc classification system
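To make the labeling idea concrete, here is a rough sketch of how letters-only base 26 labels and per-medium ranges could be handled in Python. The digit order (A=0 through Z=25), the three-letter width, and the specific ranges are assumptions made up for the example and are not the actual allotments on my stickers.

```python
import string

LETTERS = string.ascii_uppercase  # assumed digit order: A=0, B=1, ..., Z=25


def to_label(number, width=3):
    """Encode a non-negative integer as a fixed-width base 26 label."""
    if number < 0 or number >= 26 ** width:
        raise ValueError("number does not fit in the requested width")
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, 26)
        digits.append(LETTERS[remainder])
    return "".join(reversed(digits))


def from_label(label):
    """Decode a letters-only base 26 label back into an integer."""
    value = 0
    for letter in label.upper():
        value = value * 26 + LETTERS.index(letter)
    return value


# Hypothetical allotment of label ranges to kinds of media.
RANGES = {
    "VHS": (from_label("AAA"), from_label("AZZ")),
    "floppy": (from_label("BAA"), from_label("BZZ")),
}


def medium_for(label):
    """Return the medium whose range contains the label, or None."""
    value = from_label(label)
    for medium, (low, high) in RANGES.items():
        if low <= value <= high:
            return medium
    return None
```

With ranges like these, the sticker alone is enough to tell which kind of medium an image or photograph came from.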

Thus concludes the chronicle of my most recent great foray into the world of hobbyist data migration / digital preservation. I’m nowhere near done with completely migrating everything, but I made good progress and am confident that I’ll be able to finish the rest in due time (I’ve already been thinking about the possibility of using a Wolverine 8mm digitizer for some old family movies that my grandparents made). I can also rest a little easier at night, knowing that a large chunk of otherwise atrophying physical media has been given new life on modern storage devices.

On the Use of Distributed Databases for File Format Identification

Sun, 06 May 2018 15:07:58 -0400

A perennial issue in the field of digital preservation is how to unambiguously identify an incoming file that is being stored for long-term archival. The Unix file command uses magic numbers stored in a text file to determine what format a file is, but this text file might not be uniform across Unix/Linux installations in use by libraries, and it is tedious to maintain across multiple institutions. Additionally, DOS/Windows-based files rely on file extensions for identification.
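As a quick illustration of what magic-number identification involves, the sketch below checks the first few bytes of a file against a small table of well-known signatures. It is a deliberate simplification: the real file command reads a much richer rule format (offsets, data types, nested tests), and the table and function names here are purely illustrative.

```python
# A tiny, hand-picked signature table; real magic databases are far larger.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP archive"),
]


def identify_bytes(header):
    """Return a format name if the header starts with a known signature."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return None


def identify_file(path):
    """Read the first bytes of a file and look them up in the table."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(16))
```

Two installations with different tables will give different answers for the same file, which is exactly the uniformity problem described above.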

Enter PRONOM, which aspires to be a definitive source of truth for file format identification regardless of the platform those files were encoded on. Two similar efforts, GDFR and UDFR, also took up this challenge, but are currently inactive, leaving PRONOM as the last file format registry standing. Unfortunately, adding entries to PRONOM can be a bit of a pain since you need to go through a submission form and the registry is structured as a centralized store. This means that the barrier to entry for new records could be intimidating for non-librarians (or really anyone who isn’t a librarian at the British National Archives) and that if something were to happen to PRONOM, such as a loss of funding, the registry would cease to be actively maintained (as occurred with GDFR and UDFR). David Rosenthal, in his Emulation & Virtualization as Preservation Strategies, also points out that PRONOM is ill-suited for identifying files contained within disk images and for emulation, since it doesn’t include technical metadata that emulators would need to reproduce an archived computing environment. Since large portions of digital content will likely be saved as whole images using dd commands or similar, this constitutes a pretty significant flaw.

A good solution to these issues (i.e., lack of robustness since PRONOM isn’t decentralized, lack of appeal due to a high barrier to entry for new file type records, lack of additional records needed to reproduce computing environments) would be the development of a distributed database similar to DNS or Handle (basically used for persistent identifiers for library resources; DOIs are the most well-known implementation of Handle). In fact, it seems that the PRONOM folks at one point were working towards this:

However, The National Archives is planning to develop a range of services to expose PRONOM registry content, including a resolution service for PUIDs. -From PRONOM Unique Identifiers

If a DNS-like system for file identification were to exist, I’d be interested in seeing the following kinds of records in it:

PUID
A list of potential natural language names for a file format
A list of regular expressions that can be applied against text files for more accurate identification (e.g., to identify text-based formats such as JSON or TextGrid, which is used in linguistics research). The number of regexes that successfully match could produce a confidence score, which, alongside a threshold, could be used for identification.
Technical metadata relating to how best to access a file; configuration management style data, possibly spread across multiple records.
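The regular expression record in particular is easy to prototype. Below is a minimal sketch in which the confidence score is simply the fraction of a format’s patterns that match; the patterns, record layout, and threshold are all invented for illustration and are not taken from PRONOM or any existing registry.

```python
import re

# Hypothetical registry records: format name -> patterns hinting at that format.
RECORDS = {
    "JSON": [
        r'^\s*[\{\[]',      # starts with an object or array
        r'"[^"]*"\s*:',     # a quoted key followed by a colon
        r'[\}\]]\s*$',      # ends with a closing brace or bracket
    ],
    "TextGrid": [
        r'File type = "ooTextFile"',
        r'Object class = "TextGrid"',
    ],
}


def confidence(text, patterns):
    """Fraction of the patterns that match somewhere in the text."""
    hits = sum(1 for pattern in patterns if re.search(pattern, text))
    return hits / len(patterns)


def identify(text, records=RECORDS, threshold=0.6):
    """Return (format, score) for the best match, or None if below threshold."""
    scored = [(confidence(text, patterns), name)
              for name, patterns in records.items()]
    score, name = max(scored)
    return (name, score) if score >= threshold else None
```

A plain fraction treats every pattern as equally informative, which is crude; weighting patterns differently would be a natural refinement.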

Coming from a scientific background, I’m also a bit quizzical about PRONOM’s goal of making “unambiguous” file identifications. In my experience working alongside folks studying nature, ambiguity is never going away because the world is apathetic about our quest to understand it and won’t actively help. We’ll never have a fully accurate model of our world, and the most we can hope for are probabilistic models that can give us explanations of what we’re likely seeing, given our data (i.e., Bayesian file identification). I think that a better approach would be for tools that make use of a file signature repository like PRONOM to continuously query that repository to check for new data regarding the identity of the files they manage, and then to update their beliefs accordingly based on the identification with the highest confidence. A DNS-style system would be well-suited for this sort of thing since tools could easily repeat their queries at selected intervals (similar to how entries in a DNS cache might expire). This would be kind of conceptually similar to the software engineering practice of continuous integration.
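To sketch what updating beliefs could mean in practice, the snippet below applies Bayes’ rule to a set of candidate formats each time new evidence arrives from the registry. The candidate formats and the numbers are invented, and a real system would have to decide where the likelihoods come from; this is only meant to show the shape of the computation.

```python
def update_beliefs(prior, likelihood):
    """Apply Bayes' rule: posterior is proportional to prior times likelihood.

    prior maps each candidate format to its current probability;
    likelihood maps each format to P(evidence | format).
    """
    unnormalized = {fmt: prior[fmt] * likelihood.get(fmt, 0.0) for fmt in prior}
    total = sum(unnormalized.values())
    if total == 0:
        # The evidence is impossible under every candidate; keep the prior.
        return dict(prior)
    return {fmt: weight / total for fmt, weight in unnormalized.items()}


def best_guess(beliefs):
    """Return the format with the highest current probability."""
    return max(beliefs, key=beliefs.get)


# Hypothetical example: start undecided, then fold in one piece of evidence.
beliefs = {"JSON": 0.5, "TextGrid": 0.5}
beliefs = update_beliefs(beliefs, {"JSON": 0.9, "TextGrid": 0.1})
```

Each periodic query would feed its result through update_beliefs, so an identification can be revised later instead of being fixed at ingest time.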

Unfortunately, the pace of development on these sorts of projects in the libraries world can often be sluggish unless adequate collaboration with fields outside of libraries can be found. A friend of mine, however, noted that this sort of system is something that security researchers would be very much interested in, since file signatures for common viruses are often stored centrally by companies such as Norton and McAfee, and a distributed database of virus signatures might be more robust and open. In fact, it seems that someone at USC did put together a distributed database of malware signatures, albeit using blockchain (see BitAV: Fast Anti-Malware by Distributed Blockchain Consensus and Feedforward Scanning). Thus, it might be fruitful for librarians to reach out more to security researchers in the future.

Scientific Shower Thoughts - The Holocaust, Contextual Psycholinguistics and Holograms

Fri, 13 Apr 2018 20:26:52 -0400

I recently came across an interesting article in the New York Times discussing the Holocaust, the increasing ignorance amongst members of my generation about certain key facts, and the looming issue wherein concentration camp survivors are dying off due to old age, making it impossible to continue to hear their stories firsthand. I myself was fortunate enough to hear from a local area survivor, Helen Sperling, when I was in high school, and was always struck by the intimacy of being in the same room as someone who had lived through an indisputably horrific experience. My most vivid memory of Helen’s story was how her best friend rapidly came to perceive Helen as “dirty” due to her Jewishness (an event summarized here at some level of detail).

The Times article piqued my interest when, at the very end, it mentioned that the Illinois Holocaust Museum and Educational Center had started presenting holograms to try to capture the experience of interacting with a Holocaust survivor and hearing them speak, as I did in my teenage years. This led me to take a look at the following video:

What struck me in particular about the Shoah Foundation’s New Dimensions in Testimony program was the use of a geodesic dome (seen in some of the photos on their website) for recording the survivors’ testimonies in 3D. The use of an NLP model for answering questions was similarly intriguing, although it didn’t catch my imagination as much. And now I must apologize in advance if this is too meandering, but in my mind this harkened back to an experience in my first lab studying psycholinguistics.

When I was working in the LAB Lab as a research assistant we were studying speech processing in real-world contexts using high density EEG headsets (see here). We were using a methodology to localize brain activity based off electrical activity on the scalp using a forward model to describe how known electrical activity should look on the scalp and an inverse model to infer where novel data would be localized to given the forward model. For a very short period, we were able to try out a piece of equipment that could capture the exact position of electrodes on an individual’s skull, thus allowing one to better estimate the forward model for an individual participant. This piece of hardware was essentially a much smaller and more limited version of the geodesic dome used to record the Holocaust survivors (see below).


The mothership

This got me to thinking: what if a much larger dome, similar to the dome used to record the Shoah interviewees, were used to simultaneously record both EEG signals and an individual’s motion? Motion and position data could be used to both constrain a head model and also provide a means of potentially correcting head motion artifacts in the EEG data. There is some precedent for performing motion correction using camera tracking via a technique known as REEGMAS, and there has been recent work at Carnegie Mellon indicating that motion can be tracked effectively with a large number of cameras on a dome. If these feats of engineering were combined and elaborated upon, one could envision an experiment in which several individuals wearing EEG headsets could move about naturally within a confined space and communicate as they would normally in a real-world context. Overall, the addition of comprehensive and immersive audiovisual data could provide great insight into how we use speech and communicate.

If I put my mad scientist cap on, this also makes me wonder if in the future we might better be able to constrain NLP models based on the understanding we could gain from such a setup to be substantially more accurate and better convey a communicator’s thoughts, even after they have moved on from life. Or perhaps I’ve just watched the season finale of Black Mirror series 4 too many times.

The Immortality of Writers

Tue, 03 Apr 2018 19:04:20 -0400

I have a post in the works for this blog (I swear!) although it’s not quite ready yet. In the meantime, I’m going to leave a few words of wisdom that will hopefully inspire me to actually write:

If you would only accomplish this, becoming expert in writing: Those writers of knowledge from the time of events after the gods, those who foretold the future, their names have become fixed for eternity, though they are gone, they have completed their lifespan, and all their kin are forgotten.

They did not make for themselves a chapel of copper, or a stela for it of iron from the sky. They did not manage to leave heirs, from their children, to pronounce their names, but they have achieved heirs out of writings, out of the teachings in those.

…

The doors of their chapels are undone, Their ka-priests have gone. Their tombstones are smeared with mud, their tombs are forgotten, but their names are read out on their scrolls, written when they were young. Being remembered makes them, to the limits of eternity.

A man is dead, his corpse is in the ground: when all his family are laid in the earth, It is writing that lets him be remembered, in the mouth of the reciter of the formula. Scrolls are more useful than a built house, than chapels on the west, they are more perfect than palace towers, longer-lasting than a monument in a temple.

–Papyrus Chester Beatty IV, “The Immortality of Writers”

From UCL (originally from Lichtheim 1976)