
Reproducible bioinformatics: how?

Brian M. Schilder edited this page Aug 11, 2021 · 2 revisions

The Turing Institute has written an extensive guide to making computational research reproducible, called The Turing Way, and we should endeavour to follow it. A video covering the general philosophy and demonstrating how to create a reproducible paper within RStudio is available here, and the associated resources are here.

It is important that all lab code is stored on GitHub, prepared as R packages as early as possible, and ideally prepared as workflows and shared via TERRA. Read about the benefits of workflow systems here.

Nature wrote a good article about best practices for working with large datasets. If we're not yet using anything from the article (like Harvard Dataverse, Zenodo (https://zenodo.org/) or NextJournal), it would be great if you could try it out and let me know how you get on.

Docker containers

Learning NextFlow

A complete 10-hour workshop on learning NextFlow has been digitised by Seqera Labs: the videos and the online resources.

The lab's full set of tutorials for NextFlow is available here, but these remain a work in progress. We held a workshop in Feb 2020 and kept the discussion and logs in the Slack channel #nextflow-workshop; take a look there for example scripts and relevant links.

This example script shows how to launch an Rscript from Nextflow in parallel on the cluster:

#!/usr/bin/env nextflow
// Nextflow DSL1 syntax: one job is submitted per dataset name
params.datasets = ['iris', 'mtcars']
process writeDataset {
    // directives are written without '=' inside a process block
    executor 'pbspro'
    clusterOptions '-lselect=1:ncpus=1:mem=1Gb -l walltime=24:00:00 -V'
    // load R on the compute node before the script body runs
    beforeScript 'module load R'
    tag "${dataset}"
    publishDir "$baseDir/data/", mode: 'copy', overwrite: false, pattern: "*.tsv"
    input:
    each dataset from params.datasets
    output:
    file '*.tsv' into datasets_ch
    script:
    """
    #!/usr/bin/env Rscript
    data("${dataset}")
    write.table(${dataset}, file = "${dataset}.tsv", sep = "\\t", col.names = TRUE, row.names = FALSE)
    """
}
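
The cluster settings in the writeDataset process above do not have to live in the script itself. As a minimal sketch, assuming a file named nextflow.config sits in the same directory as the script (the file is not part of the original example), the same two settings could be applied to that process by name:

```groovy
// nextflow.config (assumed to sit next to the workflow script)
process {
    // apply these settings only to the process named writeDataset
    withName: writeDataset {
        executor = 'pbspro'
        clusterOptions = '-lselect=1:ncpus=1:mem=1Gb -l walltime=24:00:00 -V'
    }
}
```

Unlike directives inside a process block, settings in a config file are assigned with '='. Keeping them here means the same script can run on a different cluster by swapping only the config file.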

Some tutorial information for Nextflow from a workshop at the Sanger is available here.

An active NextFlow chatroom where you can ask questions is on Gitter.

NextFlow on Google Cloud Life Sciences Platform (GCP)

Follow the guide here: https://cloud.google.com/life-sciences/docs/tutorials/nextflow

Learning WDL / TERRA

Here are screencasts from Lynn Langit explaining how to use WDL: https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM. This should be your starting point. Here's Lynn's GitHub repo with all the example code: https://github.com/openwdl/learn-wdl. The screencasts show how to take code from this repo and run it in TERRA.

I've prepared a basic introduction to using TERRA here, but it doesn't really cover WDL yet. The best tutorial resources for learning WDL are here: https://support.terra.bio/hc/en-us/sections/360007274612.

There is a tutorial on how to use notebooks within TERRA.
