Why I Support the Common Workflow Language
I’ve been wanting to write a post about the Common Workflow Language (CWL) for a while now and, realizing that if I don’t do so now I likely never will, I have decided to try to articulate why I support this project. For those who are unfamiliar with CWL, it is essentially a simple YAML-based syntax for expressing input-output relations between programs in a workflow. This is similar to piping inputs between commands in a Unix shell, or to defining the steps needed to compile a program in a makefile. I’ve been following it sporadically since I stopped working in science, because it isolates the pipeline definition functionality of other flow-based tools used by scientists, such as Nipype or Galaxy, in a platform- and field-agnostic way.
This agnosticism is the key innovation that CWL brings to the table: by separating pipeline specification from the logic that determines how a pipeline is executed, scientists and data engineers using CWL can switch between different job schedulers, each with its own way of optimizing machine resources. Essentially, programs wrapped in CWL can be thought of as a CPU instruction set, and a job scheduler that knows how to interpret CWL can be thought of as a hardware backend that knows certain tricks to make those CWL-wrapped commands execute more quickly. If a scientist or engineer is using one backend but then discovers that another backend uses slightly more clever algorithms that yield, say, a 15% speedup, the portability of CWL would allow them to jump from the first backend to the other with relative ease.
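To make the specification/execution split concrete, here is a minimal Python sketch. It is not real CWL and not how any actual runner is implemented: the `WORKFLOW` dictionary, the step names, and both `run_*` functions are hypothetical stand-ins I made up for illustration. The point is only that one description of steps and their inputs can be handed to two different "backends" that schedule the work differently and still arrive at the same result.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, CWL-like description: each step names the function it runs and
# the values it consumes. Nothing here says how, where, or in what order to run.
WORKFLOW = {
    "trimmed": {"run": str.strip, "inputs": ["raw"]},
    "upper": {"run": str.upper, "inputs": ["trimmed"]},
    "length": {"run": len, "inputs": ["trimmed"]},
    "report": {
        "run": lambda text, n: f"{text} ({n} chars)",
        "inputs": ["upper", "length"],
    },
}


def _ready_steps(pending, values):
    """Return the names of steps whose inputs have all been produced."""
    return [
        name
        for name, step in pending.items()
        if all(key in values for key in step["inputs"])
    ]


def run_sequential(workflow, initial):
    """Backend 1: execute one ready step at a time."""
    values, pending = dict(initial), dict(workflow)
    while pending:
        ready = _ready_steps(pending, values)
        if not ready:
            raise ValueError("some step inputs can never be satisfied")
        step = pending.pop(ready[0])
        values[ready[0]] = step["run"](*(values[k] for k in step["inputs"]))
    return values


def run_parallel(workflow, initial, max_workers=4):
    """Backend 2: execute all currently ready steps concurrently."""
    values, pending = dict(initial), dict(workflow)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            ready = _ready_steps(pending, values)
            if not ready:
                raise ValueError("some step inputs can never be satisfied")
            futures = {
                name: pool.submit(
                    pending[name]["run"],
                    *(values[k] for k in pending[name]["inputs"]),
                )
                for name in ready
            }
            for name, future in futures.items():
                values[name] = future.result()
                del pending[name]
    return values


if __name__ == "__main__":
    # Same workflow description, two different execution strategies.
    print(run_sequential(WORKFLOW, {"raw": "  acgt  "})["report"])
    print(run_parallel(WORKFLOW, {"raw": "  acgt  "})["report"])
```

Switching backends here means changing one function call while the workflow description stays untouched, which is the property I'm arguing CWL gives you at a much larger scale.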
CWL also enforces good practices in scientific computing by emphasizing interfaces over implementations. When you wrap a command line tool in CWL, you not only have something practical that allows you to construct a workflow in Rabix or Cromwell, you also have a specification of the inputs and outputs a typical scientist in your field would want for a particular step in an analysis. If you have such a specification and suddenly find that current tools in your field are inadequate for some reason (too slow, insufficiently precise, incorrect outputs), you can create another tool with the same name that takes the same inputs and produces the same outputs but uses totally different logic under the hood. In this way, a tool that would benefit from GPU computing can be converted without breaking a previously defined workflow, or a tool that was originally CPU-limited can be rewritten to leverage FPGAs or ASICs for great performance gains. This is related to my point above about how an agnostic markup for describing workflows allows scientists to switch between backends: if you think of a command line tool as a component in a backend (e.g., a floating point unit), CWL allows you to further modify that backend by replacing slower parts with spiffier ones. It’s even conceivable that a scheduler could be written that is aware of niche hardware and knows how to schedule steps that make use of it effectively.
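The same idea can be sketched in a few lines of Python. Again, this is an illustration under my own assumptions rather than anything CWL prescribes: `MeanTool`, `mean_naive`, `mean_fsum`, and `center` are hypothetical names. The downstream step depends only on the agreed interface (a sequence of floats in, one float out), so a more precise implementation can replace the original one without touching the step that calls it.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface for one analysis step: a sequence of floats goes in,
# a single float (their mean) comes out. Callers rely on this contract only.
MeanTool = Callable[[Sequence[float]], float]


def mean_naive(values: Sequence[float]) -> float:
    """Original tool: plain left-to-right summation, which can lose precision."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


def mean_fsum(values: Sequence[float]) -> float:
    """Replacement tool: same inputs and output, but uses math.fsum, which
    tracks intermediate rounding error while summing."""
    return math.fsum(values) / len(values)


def center(values: Sequence[float], mean_tool: MeanTool) -> List[float]:
    """Downstream step: subtract the mean computed by whichever tool is given."""
    mean = mean_tool(values)
    return [value - mean for value in values]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    # The downstream step is identical; only the tool behind the interface changes.
    print(center(data, mean_naive))
    print(center(data, mean_fsum))

    # An input where summation order matters for the naive version.
    tricky = [1e16, 1.0, -1e16]
    print(mean_naive(tricky), mean_fsum(tricky))
```

In CWL terms, the interface would be the declared inputs and outputs of the wrapped tool, and the two functions would be two different programs sitting behind the same wrapper.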
This emphasis on interfaces could also enforce a healthy separation between scientists and the software engineers who develop scientific software. It’s fairly well-established that while many scientists are experienced in their respective fields, not very many are good at both writing software and keeping up with domain knowledge. In fact, most scientists don’t see the value in formal instruction in software development and consequently produce less-than-ideal code, a fact that is bemoaned by a Nature editorial. While I agree that scientists need to become more computationally literate, I disagree that they should be expected to fully absorb the expertise necessary to be a competent software developer. Instead, I’m generally supportive of the movement to legitimize research software engineers as distinct specialists in their own right who sit at the intersection of research and engineering. That said, defining an interface for a black-box function is something that would not require scientists to stray too far from their realms of expertise, while still allowing someone who specializes in software engineering to understand their needs. In other words, a CWL tool could serve as common ground where domain specialists from scientific fields and research software engineers meet in the middle. Scientists wouldn’t need to write shoddy code but would still be able to steer the development of their tools.
Lastly, I like CWL because it enforces the use of the Unix shell. This last point is admittedly something I still haven’t fully fleshed out and might not be super well-substantiated, but I intuitively feel that using general purpose programming languages as glue languages in scientific computing is a mistake. Essentially, I think that the shell is superior for orchestrating complex pipelines because it doesn’t lock you into a particular language ecosystem and is expressly purpose-built to be a user interface to the operating system that programs are running on. In contrast, committing to Python or Perl to glue together all or even some of the steps in a pipeline is riskier, because these languages are optimized for functionality that programmers want rather than for what a program execution environment needs. Because they cater to programmer desires, these languages are much more likely to evolve quickly, with the consequence that older tools written in prior versions of a language may not interoperate with tools written in a newer version of the language if forced to use the same interpreter. A programmer might even find that a new domain-specific language offers enough of an advantage over incumbent languages that he/she wants to write new tools in that language. Science candidly does not have the resources to constantly concern itself with porting code from one iteration of a language to the next (or to a different language altogether), so there needs to be some way to guarantee that old, creaky code can interoperate well with shiny, sexy new code. Shell functionality, by comparison, is fairly rigid and unchanging (assuming POSIX compliance) and allows you to more readily switch between language environments.
There are other reasons for my interest in CWL, but I don’t have enough time to discuss them in depth at the moment. To quickly touch on them though:
- I believe that everything I’ve said above about scientists can also be true of practitioners in other fields: librarians could benefit from CWL for defining preservation workflows, and doctors and medical lab scientists could benefit when systematizing their own protocols.
- CWL forces researchers to document how their analyses were performed, which leads to more reproducible science. It’s much easier to hand over markup specifying how the results of a study were obtained than to write it up in more verbose journal-ese, and since the markup is necessary for the analysis itself to execute, it’s impossible to leave off steps (unlike the current status quo).