On the Use of Distributed Databases for File Format Identification

A perennial issue in the field of digital preservation is how to unambiguously identify an incoming file that is being stored for long-term archiving. The Unix file command uses magic numbers stored in a text file to determine a file's format, but this text file might not be uniform across the Unix/Linux installations in use by libraries, and it is tedious to maintain across multiple institutions. DOS/Windows systems, meanwhile, rely on file extensions for identification.
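To make the magic-number approach concrete, here is a minimal sketch of how a `file`-style tool matches leading bytes against known signatures. The signatures below are real, well-known magic numbers, but the table and function names are illustrative; a real tool consults a much larger, compiled magic database.

```python
# A minimal sketch of magic-number identification, in the spirit of
# the Unix `file` command. The byte signatures are real, well-known
# magic numbers; everything else here is illustrative.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"GIF89a": "GIF image",
}

def identify_by_magic(path):
    """Return a format name if the file's leading bytes match a known signature."""
    with open(path, "rb") as f:
        header = f.read(16)  # longest signature above is 8 bytes; 16 is plenty
    for magic, name in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return name
    return "unknown"
```

The fragility the paragraph describes lives in that `MAGIC_NUMBERS` table: every installation carries its own copy, and keeping those copies synchronized across institutions is exactly the maintenance burden at issue.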

Enter PRONOM, which aspires to be a definitive source of truth for file format identification regardless of the platform those files were encoded on. Two similar efforts, GDFR and UDFR, also took up this challenge, but both are currently inactive, leaving PRONOM as the last file format registry standing. Unfortunately, adding entries to PRONOM can be a bit of a pain, since you need to go through a submission form and the registry is structured as a centralized store. This means that the barrier to entry could be intimidating for non-librarians (or really anyone who isn't a librarian at the UK's National Archives), and that if something were to happen to PRONOM, such as a loss of funding, the registry would cease to be actively maintained (as occurred with GDFR and UDFR). David Rosenthal, in his Emulation & Virtualization as Preservation Strategies, also points out that PRONOM is ill-suited for identifying files contained within disk images and for emulation, since it doesn't include the technical metadata that emulators would need to reproduce an archived computing environment. Since large portions of digital content will likely be saved as whole disk images using dd or similar, this is a pretty significant flaw.

A good solution to these issues (i.e., lack of robustness since PRONOM isn’t decentralized, lack of appeal due to a high barrier to entry for new file type records, lack of additional records needed to reproduce computing environments) would be the development of a distributed database similar to DNS or Handle (basically used for persistent identifiers for library resources; DOIs are the most well-known implementation of Handle). In fact, it seems that the PRONOM folks at one point were working towards this:

However, The National Archives is planning to develop a range of services to expose PRONOM registry content, including a resolution service for PUIDs. (From PRONOM Unique Identifiers)

If a DNS-like system for file identification were to exist, I’d be interested in seeing the following kinds of records in it:

  • PUID
  • A list of potential natural language names for a file format
  • A list of regular expressions that can be applied against text files for more accurate identification (i.e., to identify text-based formats such as JSON or TextGrid, used in linguistics research). The number of regexes that successfully match could produce a confidence score, which, alongside a threshold, could be used for identification.
  • Technical metadata relating to how best to access a file; configuration-management-style data, possibly spread across multiple records.
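A record like the ones above, together with the regex-based confidence scoring, might be sketched as follows. All field names here are hypothetical and not part of any real PRONOM schema; the PUID fmt/817 is, to my knowledge, PRONOM's identifier for JSON, and the regexes are illustrative rather than exhaustive.

```python
from dataclasses import dataclass, field
import re

# Hypothetical record structure for a distributed format registry.
# Field names are illustrative, not drawn from any real PRONOM schema.
@dataclass
class FormatRecord:
    puid: str                    # PRONOM unique identifier
    names: list                  # natural-language names for the format
    regexes: list = field(default_factory=list)  # patterns for text-based formats
    threshold: float = 0.5       # fraction of regexes that must match

    def confidence(self, text):
        """Fraction of this record's regexes that match the sample text."""
        if not self.regexes:
            return 0.0
        hits = sum(1 for pattern in self.regexes if re.search(pattern, text))
        return hits / len(self.regexes)

    def matches(self, text):
        return self.confidence(text) >= self.threshold

# Example record; the regexes are a toy approximation of JSON's shape.
json_record = FormatRecord(
    puid="fmt/817",
    names=["JSON", "JavaScript Object Notation"],
    regexes=[r"^\s*[\[{]", r"\"[^\"]*\"\s*:"],
)
```

The confidence score here is just the match fraction; a real registry entry would want weighted patterns, but the thresholding idea is the same.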

Coming from a scientific background, I'm also a bit skeptical of PRONOM's goal of making "unambiguous" file identifications. In my experience working alongside folks studying nature, ambiguity is never going away, because the world is apathetic about our quest to understand it and won't actively help. We'll never have a fully accurate model of our world, and the most we can hope for are probabilistic models that explain what we're likely seeing, given our data (i.e., Bayesian file identification). I think a better approach would be for tools that make use of a file signature repository like PRONOM to periodically re-query that repository for new data about the files they hold, and then to update their beliefs accordingly, keeping the identification with the highest confidence. A DNS-style system would be well-suited for this sort of thing, since tools could easily re-query at selected intervals (similar to how entries in a DNS cache expire). This is conceptually similar to the software engineering practice of continuous integration.
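The re-query-on-expiry idea can be sketched as a small cache whose entries carry a TTL, DNS-style. Everything here is hypothetical: `query_registry` stands in for whatever protocol a distributed registry would actually speak, and the TTL and cache shape are assumptions for illustration.

```python
import time

# Hypothetical sketch of DNS-style re-identification: each cached
# identification carries a TTL, and an expired entry triggers a fresh
# registry query whose highest-confidence answer replaces the old belief.
class IdentificationCache:
    def __init__(self, query_registry, ttl_seconds=86400):
        # query_registry is a callable: path -> [(puid, confidence), ...]
        self.query_registry = query_registry
        self.ttl = ttl_seconds
        self._cache = {}  # path -> (puid, confidence, expires_at)

    def identify(self, path):
        entry = self._cache.get(path)
        if entry and entry[2] > time.time():
            return entry[0], entry[1]  # cached belief is still fresh
        # Belief expired (or never formed): consult the registry again
        # and keep the candidate identification with the highest confidence.
        candidates = self.query_registry(path)
        puid, conf = max(candidates, key=lambda c: c[1])
        self._cache[path] = (puid, conf, time.time() + self.ttl)
        return puid, conf
```

The point of the sketch is the expiry semantics: an archive never treats an identification as final, only as the current best belief, to be revisited when its TTL runs out.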

Unfortunately, the pace of development on these sorts of projects in the library world can often be sluggish unless adequate collaboration with fields outside of libraries can be found. A friend of mine, however, noted that this sort of system is something security researchers would be very interested in, since file signatures for common viruses are often stored centrally by companies such as Norton and McAfee, and a distributed database of virus signatures might be more robust and open. In fact, it seems that someone at USC did put together a distributed database of malware signatures, albeit using blockchain (see BitAV: Fast Anti-Malware by Distributed Blockchain Consensus and Feedforward Scanning). Thus, it might be fruitful for librarians to reach out more to security researchers in the future.