Containerized Workflows

Workflow Management Using Snakemake

In this breakout session you'll learn about Snakemake, a workflow management system consisting of a text-based workflow specification language and a scalable execution environment. You will be introduced to the Snakemake workflow definition language and learn how to use the execution environment to scale workflows to compute servers and clusters while adapting to hardware-specific constraints.

Snakemake is designed specifically for computationally intensive and/or complex data analysis pipelines. The name is a reference to the programming language Python, which forms the basis for the Snakemake syntax.

See Snakemake Slides here and pdf.

Setup

Right-click the button below and log in to the CyVerse Discovery Environment for a quick launch of the Snakemake VICE JupyterLab app.
To run Snakemake inside a Docker container, run the following on an instance with Docker installed:

docker run -it --entrypoint bash cyversevice/jupyterlab-snakemake

Click here for a Snakemake tutorial by NBISweden.
Clone RNAseq Snakemake tutorial repository

git clone https://github.com/NBISweden/workshop-reproducible-research.git

cd workshop-reproducible-research/docker/

git checkout devel

ls

Dry-Run RNAseq Snakefile

snakemake -n

Run RNAseq Snakefile (recent Snakemake releases require an explicit core count for a real run)

snakemake --cores 1
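To make the difference between the dry run and the real run above more concrete, here is a toy Python sketch of the planning step a make-style workflow manager performs: starting from a target file, it works backwards through the rules to find which steps are still needed. This is an illustration only, not Snakemake's implementation or API. The `Rule` class, the `plan` function, and the file names are hypothetical, and the sketch ignores timestamps, wildcards, and everything else Snakemake takes into account.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified rule: a named step with input and output files."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def plan(target: str, rules: List[Rule], existing: Set[str]) -> List[str]:
    """Return the rule names, in execution order, needed to create `target`.

    Files listed in `existing` are treated as already present, so the rules
    that would produce them are skipped. Nothing is executed; this only
    builds the plan, loosely like a dry run.
    """
    producers = {out: rule for rule in rules for out in rule.outputs}
    ordered: List[str] = []
    in_progress: Set[str] = set()

    def visit(path: str) -> None:
        if path in existing:
            return  # file is already there, nothing to do
        rule = producers.get(path)
        if rule is None:
            raise FileNotFoundError(f"no rule produces missing file: {path}")
        if rule.name in ordered:
            return  # this rule is already scheduled
        if rule.name in in_progress:
            raise ValueError(f"cyclic dependency involving rule: {rule.name}")
        in_progress.add(rule.name)
        for dependency in rule.inputs:
            visit(dependency)  # schedule upstream rules first
        in_progress.discard(rule.name)
        ordered.append(rule.name)

    visit(target)
    return ordered


# Hypothetical three-step pipeline, made up for this illustration.
RULES = [
    Rule("trim", ("reads.fastq",), ("trimmed.fastq",)),
    Rule("align", ("trimmed.fastq", "genome.fa"), ("aligned.bam",)),
    Rule("count", ("aligned.bam",), ("counts.tsv",)),
]

if __name__ == "__main__":
    # Only the raw inputs exist: every step is planned.
    print(plan("counts.tsv", RULES, {"reads.fastq", "genome.fa"}))
    # prints ['trim', 'align', 'count']

    # The alignment already exists: only the last step is planned.
    print(plan("counts.tsv", RULES, {"reads.fastq", "genome.fa", "aligned.bam"}))
    # prints ['count']
```

The second call shows why re-running part of an analysis is cheap: steps whose outputs are already present drop out of the plan.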

Why Snakemake

From where and how to get data for your analysis, to where and how to treat the outputs, workflow managers can help you achieve better scientific reproducibility and scalability. Once you learn to use Snakemake (or a similar workflow management tool) properly, keeping track of and sharing your work becomes second nature. This saves you time whenever you need to re-run all or part of an analysis, and it reduces the errors that naturally creep in whenever a step is done by hand (part of the human condition of doing computational science and not being a bot!).

Other Workflow Managers

CCTools offers Makeflow, a workflow management system similar to Snakemake, and WorkQueue, which scales workflows up through distributed computing for customized and efficient use of resources. Read more here.