Reproducible bioinformatics: how?
The Turing Institute has written an extensive guide to making computational research reproducible, called The Turing Way, and we should endeavour to follow it. A video covering the general philosophy and demonstrating how to create a reproducible paper within RStudio is available here, and the associated resources are here.
It is important that all lab code is stored on GitHub, prepared as R packages as early as possible, and ideally prepared as workflows and shared via TERRA. Read about the benefits of workflow systems here.
Nature wrote a good article about best practices for working with large datasets. If there is anything from the article that we're not using yet (like Harvard Dataverse, [Zenodo](https://zenodo.org/) or NextJournal), it would be great if you could try it out and let me know how you get on.
- Ten simple rules for writing Dockerfiles for reproducible data science
- See also our Wiki page on Docker.
A complete 10-hour workshop on learning Nextflow has been digitised by Seqera Labs: the videos and the online resources.
The lab's full set of tutorials for Nextflow is available here, but these remain a work in progress. We held a workshop in Feb 2020 and kept the discussion and logs in the Slack channel #nextflow-workshop; take a look there for example scripts and relevant links.
This example script shows how to launch an Rscript from Nextflow in parallel on the cluster:
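#!/usr/bin/env nextflow
// Written for Nextflow DSL1 (`from`/`into` channel syntax).
params.datasets = ['iris', 'mtcars']

process writeDataset {
    // Process directives take no '=' (that form belongs in nextflow.config).
    executor 'pbspro'
    clusterOptions '-lselect=1:ncpus=1:mem=1Gb -l walltime=24:00:00 -V'
    // Load R before the task script runs; a separate string block would be ignored.
    beforeScript 'module load R'
    tag "${dataset}"
    publishDir "$baseDir/data/", mode: 'copy', overwrite: false, pattern: "*.tsv"

    input:
    // 'each' runs the process once per dataset name, so the tasks run in parallel.
    each dataset from params.datasets

    output:
    file '*.tsv' into datasets_ch

    script:
    """
    #!/usr/bin/env Rscript
    data("${dataset}")
    write.table(${dataset}, file = "${dataset}.tsv", sep = "\t", col.names = TRUE, row.names = FALSE)
    """
}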
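The script above uses the older DSL1 channel syntax (`from`/`into`), which recent Nextflow releases no longer accept. A minimal sketch of the same process in DSL2 might look like the following. It has not been run on the cluster. Using `beforeScript` to load the R module is an assumption about how the module should be made available, and `projectDir` is the DSL2 name for what the original script calls `baseDir`.

```nextflow
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.datasets = ['iris', 'mtcars']

process writeDataset {
    executor 'pbspro'
    clusterOptions '-lselect=1:ncpus=1:mem=1Gb -l walltime=24:00:00 -V'
    // Assumption: R is provided through the cluster's module system.
    beforeScript 'module load R'
    tag "${dataset}"
    publishDir "${projectDir}/data/", mode: 'copy', overwrite: false, pattern: '*.tsv'

    input:
    val dataset

    output:
    path '*.tsv'

    script:
    """
    #!/usr/bin/env Rscript
    data("${dataset}")
    write.table(${dataset}, file = "${dataset}.tsv", sep = "\t", col.names = TRUE, row.names = FALSE)
    """
}

workflow {
    // One task is submitted per dataset name emitted by the channel.
    datasets_ch = Channel.fromList(params.datasets)
    writeDataset(datasets_ch)
}
```

In DSL2 the channel wiring moves out of the process and into the `workflow` block, so the same process can be reused with a different list of datasets without editing its definition.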
Some tutorial information for Nextflow from a workshop at the Sanger is available here.
An active Nextflow chatroom where you can ask questions is on Gitter.
To run Nextflow on Google Cloud, follow the guide here: https://cloud.google.com/life-sciences/docs/tutorials/nextflow
Here are screencasts from Lynn Langit explaining how to use WDL: https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM. This should be your starting point. Here's Lynn's GitHub repo with all the example code: https://github.com/openwdl/learn-wdl. The screencasts show how to take code from this repo and put it into TERRA to run it.
I've prepared a basic introduction to using TERRA here, but it doesn't really cover WDL yet. The best tutorial resources for learning WDL are here: https://support.terra.bio/hc/en-us/sections/360007274612.
There is a tutorial on how to use notebooks within TERRA.