Using the Gzip File Format as a Metadata Container

It’s been a while since I last posted, but I’ve been itching to commit to words a few thoughts I’ve been kicking around regarding one particular approach to adding metadata to arbitrary file formats. In the distant past, I was heavily involved in a community that was developing a standard for enriching neuroimaging datasets with metadata. I’m not going to re-hash all of the advantages that creating such a standard confers upon the neuroscience community, since others have done so extensively elsewhere, but the short of it is that the standard, the Brain Imaging Data Structure (BIDS), has done much to increase both the discoverability and reusability of fMRI/EEG/MEG datasets that otherwise would have remained “hidden” in a proverbial file drawer somewhere. If you’re interested in understanding BIDS more, there is a paper that goes into more detail. For this post, I’m going to focus on one particular technical choice that was made about how BIDS stores supplementary data (such as what instrumentation was used and how equipment was configured).

In fMRI data analysis, the two most heavily used formats are NIfTI (described in great detail here and here) and MINC. In terms of format adoption, the majority of tools available to neuroscientists favor NIfTI support, which led the designers of BIDS to focus on finding ways to inject additional metadata into this format. The NIfTI file header already contains quite a few fields for neuroscience-specific metadata, but adding fields to this header requires that a working group convene and engage in debate, and that some degree of backwards compatibility be maintained so that existing neuroimaging tools don’t break. This change control process, while ensuring stability, also makes it difficult for researchers and tool developers to rapidly experiment with additional header fields, iterate upon standards, and observe how the addition of such fields affects data sharing and reusability.

The primary BIDS developers wanted the ability to contextualize NIfTI files with new tags without breaking existing tools. The technical solution that they arrived at was to use a JSON-formatted sidecar file: a file with the same name as a NIfTI file but a different file extension, containing various metadata entries as key-value pairs. Such sidecar files are a fairly common way to enhance rigidly-defined file formats with additional metadata, and are readily seen elsewhere in the wild. For instance, sidecar files are used by gmvault to store feature-specific Gmail fields that are not part of an official e-mail message format standard. Additionally, Kodi uses sidecar files to store video metadata (see here). In the case of BIDS, a long-term goal was to engage the NIfTI working group so that this JSON file could itself be embedded within the NIfTI file header in a future version of the NIfTI format:

Storing metadata in JSON files has advantages of accessibility, but can be error prone because data and metadata do not live in the same file. In future revisions of BIDS we will explore the possibility of storing metadata as a JSON text extension of the NIfTI header.

The statement above reflects what I have come to believe is a general best practice for adding metadata to a file: metadata should always be embedded if possible. I base this best practice on the observation that in any structured system, there is a natural tendency towards entropy. The ease with which certain operations can be applied in a computing environment often serves to accelerate the decay of order. Filesystem operations such as file moves and deletions are particularly easy to perform. When using a sidecar file to store metadata, it is more likely that metadata will be lost, since nothing beyond a shared file name ties the original file to its sidecar, and the two are treated as independent entities in the filesystem. In brief, there is no “stickiness” that ensures that both metadata and data will be equally affected by filesystem changes.

Interestingly enough, the above opinion appears to have been a primary motivating factor for the creation of the NIfTI format to begin with. NIfTI was preceded by the older ANALYZE format, which specified that metadata for an image file (with the “.img” extension) was to be stored in a sidecar file with the “.hdr” extension. As often is the case, time is cyclical and old problems in computing often recur.

To the extent that I am aware of developments in the wonderful world of NIfTI, a JSON text extension was never added to the NIfTI header. However, there’s an interesting workaround. NIfTI files are commonly gzipped, and all major tools I’ve worked with seamlessly decompress such NIfTI files when reading them in. According to the gzip file format specification, RFC 1952 (section 2.3), if a specific bit (FLG.FCOMMENT) is set in the gzip file header, you can add an arbitrary amount of zero-terminated, Latin-1 encoded text to the header. That means that, rather than embedding JSON within the NIfTI header (which, again, is hard to change without convening a working group), someone could instead embed the JSON within the gzip header. In this way, gzip can be used not only to compress data, but also as a container format for additional metadata.
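To make the header layout concrete, here is a minimal sketch of writing and reading the comment field by hand. It is not the code linked later in this post; the function names `gzip_with_comment` and `read_gzip_comment` are made up for illustration, and the sketch assumes a single-member gzip stream held entirely in memory.

```python
import gzip
import json
import struct
import zlib

FEXTRA, FNAME, FCOMMENT = 0x04, 0x08, 0x10  # FLG bits from RFC 1952


def gzip_with_comment(payload: bytes, comment: str) -> bytes:
    """Build a single-member gzip stream whose header carries `comment`."""
    text = comment.encode("latin-1")
    if b"\x00" in text:
        raise ValueError("the comment is zero-terminated, so it cannot contain NUL")
    # Fixed 10-byte header: ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FCOMMENT, 0, 0, 255)
    # Negative wbits asks zlib for a raw deflate stream with no zlib wrapper
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()
    # Trailer: CRC-32 of the uncompressed data, then its length modulo 2**32
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + text + b"\x00" + body + trailer


def read_gzip_comment(blob: bytes):
    """Return the header comment of a gzip stream, or None if there is none."""
    id1, id2, method, flags = struct.unpack_from("<BBBB", blob, 0)
    if (id1, id2, method) != (0x1F, 0x8B, 8):
        raise ValueError("not a deflate-compressed gzip stream")
    pos = 10  # skip the fixed part of the header
    if flags & FEXTRA:  # optional extra field: 2-byte length, then data
        (xlen,) = struct.unpack_from("<H", blob, pos)
        pos += 2 + xlen
    if flags & FNAME:  # optional zero-terminated original file name
        pos = blob.index(b"\x00", pos) + 1
    if not flags & FCOMMENT:
        return None
    end = blob.index(b"\x00", pos)
    return blob[pos:end].decode("latin-1")


# Hypothetical metadata; the key is only an example
metadata = json.dumps({"RepetitionTime": 2.0})
blob = gzip_with_comment(b"pretend this is a NIfTI image", metadata)
```

Because the comment lives in the header, ordinary readers skip it: `gzip.decompress(blob)` returns the original payload, while `read_gzip_comment(blob)` hands back the JSON text.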

The restriction that text be encoded as Latin-1, however, poses some difficulty. What if desirable metadata values are incompatible with this character set? Additionally, what if we wanted to embed metadata stored in a different file format into the gzip header? Fortunately for us, Latin-1 encoding is a superset of the commonly used US-ASCII encoding, and storing non-ASCII characters or arbitrary binary data as ASCII text is a problem that was solved long ago. In fact, e-mail relies extensively upon such solutions, via the MIME standard. This means that, instead of restricting ourselves to adding JSON text to the gzip header’s comment field, we can add pretty much anything, so long as that anything is formatted as an e-mail message with the relevant MIME headers.
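As a rough illustration of the MIME idea, the sketch below uses Python's standard `email` package to wrap arbitrary bytes in a message whose serialized form is plain ASCII, and then to unwrap it again. The helper names `pack_metadata` and `unpack_metadata` are hypothetical, and the content type is just an example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def pack_metadata(data: bytes, maintype: str, subtype: str) -> bytes:
    """Wrap raw bytes in a MIME message that serializes to ASCII text."""
    msg = EmailMessage()
    # For bytes content, set_content() applies base64 transfer encoding by default
    msg.set_content(data, maintype=maintype, subtype=subtype)
    return msg.as_bytes()


def unpack_metadata(text: bytes):
    """Recover (content type, original bytes) from a packed message."""
    msg = message_from_bytes(text, policy=policy.default)
    return msg.get_content_type(), msg.get_content()


binary_blob = bytes(range(256))  # includes NUL and non-Latin-1-safe bytes
packed = pack_metadata(binary_blob, "application", "octet-stream")
content_type, recovered = unpack_metadata(packed)
```

The packed message contains no NUL bytes and only ASCII characters, so it fits the constraints of the gzip comment field, and the `Content-Type` header tells a reader how to interpret whatever was stored.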

The main disadvantage to this approach is that the data-to-text encoding most commonly used with MIME, base64, results in data taking up about 1.33 times what it would otherwise, or roughly 1.37 times once MIME’s line breaks are counted (see here). This is pretty counterproductive from gzip’s perspective, considering that its main goal is to decrease the size of a dataset! Nevertheless, I doubt that this disadvantage outweighs the advantage of having metadata embedded directly within a dataset. Presumably the metadata takes up a trivial amount of space anyway, so the additional overhead doesn’t amount to much.
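The overhead is easy to check with a few lines of Python. This sketch assumes MIME-style wrapping at 76 characters per line with CRLF line endings; the helper name `base64_ratios` is made up for illustration.

```python
import base64
import os


def base64_ratios(n_bytes: int):
    """Return (unwrapped, CRLF-wrapped) size ratios for base64-encoding n_bytes of data."""
    raw = os.urandom(n_bytes)
    unwrapped = base64.b64encode(raw)  # 4 output characters per 3 input bytes
    # encodebytes() inserts "\n" after every 76 output characters (57 input bytes);
    # swapping in CRLF mimics the line endings used in MIME messages
    wrapped = base64.encodebytes(raw).replace(b"\n", b"\r\n")
    return len(unwrapped) / n_bytes, len(wrapped) / n_bytes


plain_ratio, mime_ratio = base64_ratios(57_000)
print(f"{plain_ratio:.2f} {mime_ratio:.2f}")  # prints "1.33 1.37"
```

Each 57-byte chunk becomes 76 characters plus a two-byte line ending, and 78/57 is where the 1.37 figure comes from.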

If anyone is curious about this approach, I’ve been playing around with it using some incredibly hacky Python code that can be found here. Unfortunately, Python’s gzip module doesn’t give you the ability to populate the gzip header comment field directly, so you have to do some more low-level coding to get the desired effect.