I just realized that the current project I'm working on basically should be a reproducible workflow. After all the hype about Common Workflow Language (CWL) at the Bioinformatics Open Source Conference (BOSC) 2015, I decided to give this a spin myself.
For starters, I wondered how hard it would be to wrap our antiSMASH tool in a Docker container with all the dependencies and databases installed, and then use CWL to take care of handling input/output mapping. Essentially, I wanted to see how hard it'd be to create a CWL version of my standalone antiSMASH setup.
If you want to follow along, I've put my example files in a git repository.
The first step is to install a cwl-runner tool that can run CWL descriptions/scripts/workflows or whatever you call them. So off to the CWL homepage to install it. But of course CWL is a specification, and there are a bunch of implementations. At the time of writing, the CWL homepage lists 10 different implementations! Fortunately, the very active CWL Gitter community quickly pointed me towards the "reference implementation". Installing it on Ubuntu 14.04 is nice and easy. In a Python virtualenv, all it took was pip install cwltool. I also installed the cwl-runner convenience wrapper by running pip install cwl-runner. Note that the name of this package might change to cwltool-cwl-runner soon to match the name of the wrappers for the other implementations.
Now that CWL is installed, it's time to create a wrapper for the command line tool. A nice overview can be found in the "Gentle introduction to Common Workflow Language" document. For my use case, it was a bit too simplistic at times, but it is indeed a gentle introduction. Notably, it does not cover how to build actual workflows, which is what CWL is all about after all.
CWL wrappers are written in a YAML-like format, and most parameters in the file should be reasonably self-explanatory. A very basic description of the antiSMASH wrapper looks like this:
#!/usr/bin/env cwl-runner
cwlVersion: cwl:draft-3
class: CommandLineTool
description: "Run antiSMASH"
baseCommand: run_antismash.py
inputs:
  - id: sequence
    type: File
    inputBinding:
      position: 1
  - id: outputfolder
    type: string
    inputBinding:
      prefix: "--outputfolder"
outputs:
  - id: result
    type: File
    outputBinding:
      glob: "$(inputs.outputfolder)/*.final.gbk"
The first line allows us to run the .cwl file directly, which is neat. Then come some administrative details, like the version of the CWL spec used and what kind of process the CWL file is about. In my case, I'm wrapping a command line tool.
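To make the input mapping concrete, here is a minimal Python sketch of how a runner might turn the inputs section above into a command line. It is a simplified illustration, not cwltool's actual implementation: the function name build_command is hypothetical, and the only rules assumed are that bindings are sorted by position (defaulting to 0, with the input id as a tie-breaker) and that a prefixed argument is emitted as the prefix followed by the value.

```python
def build_command(base_command, inputs, job):
    """Build an argv list from CWL-style input bindings (simplified sketch)."""
    bound = []
    for spec in inputs:
        binding = spec.get("inputBinding")
        value = job.get(spec["id"])
        if binding is None or value is None:
            # Inputs without a binding or without a value add nothing.
            continue
        args = []
        if "prefix" in binding:
            args.append(binding["prefix"])
        args.append(str(value))
        # Assumption: a missing position counts as 0.
        bound.append((binding.get("position", 0), spec["id"], args))

    # Sort by position first, then by input id to keep the order stable.
    bound.sort(key=lambda item: (item[0], item[1]))

    argv = [base_command]
    for _, _, args in bound:
        argv.extend(args)
    return argv


# The two inputs from the antiSMASH wrapper, written as Python dicts.
inputs = [
    {"id": "sequence", "type": "File", "inputBinding": {"position": 1}},
    {"id": "outputfolder", "type": "string",
     "inputBinding": {"prefix": "--outputfolder"}},
]
job = {"sequence": "myseqence.gbk", "outputfolder": "my_analysis"}

command = build_command("run_antismash.py", inputs, job)
print(" ".join(command))
# prints: run_antismash.py --outputfolder my_analysis myseqence.gbk
```

Under these assumptions the sequence file ends up last, because it has position 1 while the output folder falls back to position 0.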
With this description, you can already run antiSMASH like this:
./antismash.cwl --sequence myseqence.gbk --outputfolder my_analysis
That is, if you have antiSMASH installed along with all the databases and the like. For software as complex as antiSMASH, getting that installation right is a bit involved as well. Fortunately, that's where Docker comes in. As I wrote before in my standalone antiSMASH setup post, Docker solves part of this problem. Unfortunately, mapping inputs and outputs to the container is a bit of a bother, and in the previous post I described some wrapper script magic dance to deal with this. Now let's see how to do this in CWL. In the wrapper, I add these three lines:
hints:
  - class: DockerRequirement
    dockerPull: antismash/antismash:3.0.5
And that's it. Now running
./antismash.cwl --sequence myseqence.gbk --outputfolder my_analysis
automatically runs in a Docker container. Neat. Ok, I cheated a bit and had to create a Docker image for antiSMASH without my magic mapping dance entrypoint script first, but that was easy. I just deleted the extra magic from the Dockerfile and rebuilt the image.
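For a feeling of what the runner is saving me from, here is a rough Python sketch of the kind of docker invocation the input/output mapping boils down to. This is not the command cwltool actually generates: the function name docker_command and the container mount points /input and /output are made up for illustration. The sketch only assumes that the input file is mounted read-only and that the output folder is created inside a mounted working directory.

```python
from pathlib import Path


def docker_command(image, base_command, sequence, outputfolder, host_outdir="."):
    """Assemble a docker run argv that maps one input file and one output dir."""
    sequence_path = Path(sequence).resolve()
    outdir = Path(host_outdir).resolve()
    # Hypothetical mount point for the input file inside the container.
    container_input = "/input/" + sequence_path.name
    return [
        "docker", "run", "--rm",
        # Input file, mounted read-only.
        "--volume", "%s:%s:ro" % (sequence_path, container_input),
        # Host directory that receives the results.
        "--volume", "%s:/output" % outdir,
        "--workdir", "/output",
        image,
        base_command, "--outputfolder", outputfolder, container_input,
    ]


docker_argv = docker_command(
    "antismash/antismash:3.0.5", "run_antismash.py",
    "myseqence.gbk", "my_analysis",
)
print(" ".join(docker_argv))
```

Keeping track of which host path maps to which container path is exactly the bookkeeping my old entrypoint script did by hand.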
One thing that I found missing from the Common Workflow introduction was how to deal with boolean parameters. antiSMASH has a lot of those, and the introduction doesn't show how to use them. Fortunately again, the CWL Gitter community is really helpful and pointed me in the right direction. So if, for example, I want to add the --verbose flag to the wrapper, all I need to add are the following lines in the inputs section:
  - id: verbose
    type: boolean
    inputBinding:
      prefix: "--verbose"
Now I can specify the --verbose flag while running ./antismash.cwl and antiSMASH produces verbose output. But wait, something is wrong. If I don't want verbose output and leave out the flag, the following happens:
./antismash.cwl: error: argument --verbose is required
That's not what we want. But of course we told our wrapper definition that we wanted a boolean parameter, and then didn't supply one, so the runner treats it as missing. Also, because of the way command line arguments are parsed, --verbose false doesn't work either.
Digging through some other example CWL files, I found the way to tell CWL about optional parameters. Basically, the type of the input needs to be changed from type: boolean to type: ["null", boolean] and we're all set. Now I just need to add mappings for the other 52 parameters antiSMASH has. But not today, I guess.
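The difference between the two type declarations can be sketched in a few lines of Python. Again, this is only an illustration of the behaviour described above, not how cwltool is implemented: the function name flag_arguments is hypothetical, and the error text just imitates the message I got.

```python
def flag_arguments(spec, job):
    """Return the argv fragment for one boolean input (simplified sketch)."""
    declared = spec["type"]
    # A list such as ["null", "boolean"] means any of the listed types is fine.
    allowed = declared if isinstance(declared, list) else [declared]

    value = job.get(spec["id"])
    if value is None:
        if "null" not in allowed:
            # Plain "boolean": leaving the parameter out is an error.
            raise ValueError("argument --%s is required" % spec["id"])
        return []
    if not isinstance(value, bool):
        raise TypeError("%s must be true or false" % spec["id"])
    # A boolean flag only contributes its prefix, and only when true.
    return [spec["inputBinding"]["prefix"]] if value else []


binding = {"prefix": "--verbose"}
required_spec = {"id": "verbose", "type": "boolean", "inputBinding": binding}
optional_spec = {"id": "verbose", "type": ["null", "boolean"],
                 "inputBinding": binding}

print(flag_arguments(optional_spec, {}))                  # prints: []
print(flag_arguments(optional_spec, {"verbose": True}))   # prints: ['--verbose']
print(flag_arguments(optional_spec, {"verbose": False}))  # prints: []
```

Calling flag_arguments(required_spec, {}) raises the ValueError instead, which mirrors the "is required" error from before.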
From a reproducible research perspective, specifying your parameters on the command line is a bit dangerous, as you need to make sure to record those parameters somewhere. CWL has a solution to that as well. You can provide parameters as a JSON file. To analyse the nisin gene cluster, this is the file I use (called job.json):
{
  "sequence": {
    "class": "File",
    "path": "nisin.gbk"
  },
  "verbose": true,
  "outputfolder": "nisin"
}
Now I can run ./antismash.cwl job.json, and all my parameters are documented in a nice text file I can keep in version control.
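If you have more than a handful of sequences to analyse, writing these job files by hand gets old quickly. A small Python sketch like the following could generate them. The helper name render_job is my own invention. It simply reproduces the structure of the job.json above, with sorted keys so that diffs in version control stay small.

```python
import json


def render_job(sequence, outputfolder, verbose=False):
    """Render a job file for the antiSMASH wrapper as a JSON string."""
    job = {
        # File inputs are objects with a class and a path, as in job.json.
        "sequence": {"class": "File", "path": sequence},
        "verbose": verbose,
        "outputfolder": outputfolder,
    }
    return json.dumps(job, indent=2, sort_keys=True)


job_text = render_job("nisin.gbk", "nisin", verbose=True)
print(job_text)
```

Saving the printed text as job.json gives a file equivalent to the one above, ready to be passed to ./antismash.cwl.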
If you want to give this a try yourself, clone the git repo and follow the instructions in the README file.
As a next step, I'll have to build an actual analysis workflow that uses the antiSMASH tool description I just created. I'm not quite there yet, and I'll have to come up with a workflow that is actually going to be useful.
In summary, I'm beginning to see why people are excited about workflow management systems. Most of my day-to-day work involves analyses that are frequently one-off and don't fit a workflow system too well, but for a workflow you want to repeat again and again, I see how a system like CWL is a great fit. Also, creating the wrapper descriptions for the tools is the hard work (anyone want to handle the remaining 52 command line flags for antiSMASH?). Once the wrappers have been written, using them is pretty straightforward. Also, CWL's YAML is much easier to read and write than the XML descriptions some other systems use. I'm not completely sold on workflow systems / CWL yet, but I guess this will find a place in my toolbox.