Tales from the underfunded cousin of DevOps, while trying to get research done.

I just realized that the current project I'm working on basically should be a reproducible workflow. After all the hype about the Common Workflow Language (CWL) at the Bioinformatics Open Source Conference (BOSC) 2015, I decided to give it a spin myself.

For starters, I wondered how hard it would be to wrap our antiSMASH tool in a Docker container with all the dependencies and databases installed, and then use CWL to take care of the input/output mapping. Essentially, I wanted to see how hard it'd be to create a CWL version of my standalone antiSMASH setup.

If you want to follow along, I've put my example files in a git repository.

The first step is to install a cwl-runner tool that can run CWL descriptions/scripts/workflows or whatever you call them. So off to the CWL homepage to install it. But of course CWL is a specification, and there are a bunch of implementations. At the time of writing, the CWL homepage lists 10 different implementations! Fortunately, the very active CWL Gitter community quickly pointed me towards the "reference implementation". Installing it on Ubuntu 14.04 is nice and easy. In a Python virtualenv, all it took was pip install cwltool. I also installed the cwl-runner convenience wrapper by running pip install cwl-runner. Note that the name of this package might change to cwltool-cwl-runner soon to match the names of the wrappers for the other implementations.

Now that CWL is installed, it's time to create a wrapper for the command line tool. A nice overview can be found in the "Gentle introduction to Common Workflow Language" document. For my use case, it was a bit too simplistic at times, but it is indeed a gentle introduction. Notably, it does not cover how to build actual workflows, which is what CWL is all about after all.

CWL wrappers are written in a YAML-like format, and most parameters in the file should be reasonably self-explanatory. A very basic description of the antiSMASH wrapper looks like this:

#!/usr/bin/env cwl-runner
cwlVersion: cwl:draft-3

class: CommandLineTool

description: "Run antiSMASH"

baseCommand: run_antismash.py

inputs:
  - id: sequence
    type: File
    inputBinding:
      position: 1
  - id: outputfolder
    type: string
    inputBinding:
      prefix: "--outputfolder"

outputs:
  - id: result
    type: File
    outputBinding:
      glob: "$(inputs.outputfolder)/*.final.gbk"

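The first line allows us to run the .cwl file directly, which is neat. Next come some administrative fields, like the version of the CWL spec used and what kind of process the .cwl file describes. In my case, I'm wrapping a command line tool. The inputs section maps the sequence file to the first positional argument and the output folder name to the --outputfolder option, and the outputs section picks up the final GenBank file from that folder with a glob pattern.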

With this description, you can already run antiSMASH like this:

./antismash.cwl --sequence myseqence.gbk --outputfolder my_analysis

That is, if you have antiSMASH installed, along with all the databases and the like. With software as complex as antiSMASH, the installation is a bit involved as well. Fortunately, that's where Docker comes in. As I wrote before in my standalone antiSMASH setup post, Docker solves part of this problem. Unfortunately, mapping inputs and outputs to the container is a bit of a bother, and in the previous post I described a magic wrapper script dance to deal with this. Now let's see how to do this in CWL. In the CWL description, I add these three lines:

hints:
  - class: DockerRequirement
    dockerPull: antismash/antismash:3.0.5

And that's it. Now running

./antismash.cwl --sequence myseqence.gbk --outputfolder my_analysis

automatically runs in a Docker container. Neat. Ok, I cheated a bit and had to create a Docker image for antiSMASH without my magic mapping dance entrypoint script first, but that was easy. I just deleted the extra magic from the Dockerfile and rebuilt the image.

One thing that I found missing from the "Gentle introduction to Common Workflow Language" was how to deal with boolean parameters. antiSMASH has a lot of those, and the introduction doesn't show how to use them. Fortunately again, the CWL Gitter community is really helpful and pointed me in the right direction. So if, for example, I want to add the --verbose flag to the wrapper, all I need to add are the following lines in the inputs section:

  - id: verbose
    type: boolean
    inputBinding:
      prefix: "--verbose"

Now I can specify the --verbose flag while running ./antismash.cwl, and antiSMASH produces verbose output. But wait, something is wrong. If I don't want verbose output and leave out the flag, the following happens:

./antismash.cwl: error: argument --verbose is required

That's not what we want. But of course, we told our wrapper definition that it takes a boolean parameter, and then didn't provide a value for it. Also, because of the way command line arguments are parsed, --verbose false doesn't work either.

Digging through some other example CWL files, I found the way to tell CWL about optional parameters. Basically, the type of the input needs to be changed from type: boolean to type: ["null", boolean], which allows the parameter to be left unset, and we're all set. Now I just need to add mappings for the other 52 parameters antiSMASH has. But not today, I guess.
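
Putting the fragments from this post together, the whole description would look roughly like the sketch below. It is assembled from the pieces shown above rather than copied from the example repository, so the order of the sections and the comments are assumptions:

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: cwl:draft-3

class: CommandLineTool

description: "Run antiSMASH"

# Ask the runner to execute the tool inside the antiSMASH Docker image
hints:
  - class: DockerRequirement
    dockerPull: antismash/antismash:3.0.5

baseCommand: run_antismash.py

inputs:
  # Positional argument: the sequence file to analyse
  - id: sequence
    type: File
    inputBinding:
      position: 1
  - id: outputfolder
    type: string
    inputBinding:
      prefix: "--outputfolder"
  # "null" in the type list makes the flag optional
  - id: verbose
    type: ["null", boolean]
    inputBinding:
      prefix: "--verbose"

outputs:
  # Collect the final GenBank file from the output folder
  - id: result
    type: File
    outputBinding:
      glob: "$(inputs.outputfolder)/*.final.gbk"
```

With the optional type in place, leaving out --verbose on the command line should no longer trigger the "argument --verbose is required" error.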

From a reproducible research perspective, specifying your parameters on the command line is a bit dangerous, as you need to make sure to record those parameters somewhere. CWL has a solution to that, as well. You can provide parameters as a JSON file. To analyse the nisin gene cluster, this is the file I use (called job.json):

{
    "sequence": {
        "class": "File",
        "path": "nisin.gbk"
    },
    "verbose": true,
    "outputfolder": "nisin"
}

Now I can run ./antismash.cwl job.json, and all my parameters are documented in a nice text file I can keep in version control. If you want to give this a try yourself, clone the git repo and follow the instructions in the README file.
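
As far as I know, cwltool also accepts the job file in YAML, which is a bit less noisy to type than JSON. A minimal sketch with the same three parameters as job.json; the file name job.yml is only an example and is not part of the example repository:

```yaml
# job.yml: the same parameters as job.json, written as YAML
sequence:
  class: File
  path: nisin.gbk
verbose: true
outputfolder: nisin
```

Assuming the runner treats it like the JSON version, it would be used the same way: ./antismash.cwl job.yml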

As a next step, I'll have to build an actual analysis workflow that uses the antiSMASH tool description I just created. I'm not quite there yet, and I'll have to come up with a workflow that is actually going to be useful.

In summary, I'm beginning to see why people are excited about workflow management systems. Most of my day-to-day work involves analyses that are frequently one-off and don't fit too well, but for a workflow you want to repeat again and again, I see how a system like CWL is a great fit. Creating the wrapper descriptions for the tools is the hard part (anyone want to handle the remaining 52 command line flags for antiSMASH?). Once the wrappers have been written, using them is pretty straightforward. Also, CWL's YAML is much easier to read and write than the XML descriptions some other systems use. I'm not completely sold on workflow systems / CWL yet, but I guess this will find a place in my toolbox.

