Data pre-processing

Datasets used for this session

The dataset used during this session can be found in the following location on Abel:

/work/projects/nn9305k/tmp/Files_for_Dec14/

NB: Replace <your_user_name> with your Abel username.

Create a new folder called Data_pre_processing_Dec14 in your home area and move there.

cd /work/projects/nn9305k/home/<your_user_name>/
mkdir Data_pre_processing_Dec14
cd Data_pre_processing_Dec14

Create three folders here and move to the data folder:

mkdir data
mkdir raw_fastqc
mkdir trim

cd data

Type the following command to link the files (not copy):

ln -s /work/projects/nn9305k/tmp/Files_for_Dec14/*fq.gz .
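To verify that the symbolic links were created, list the directory contents; each entry should point back to the shared folder:

ls -l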

Fastq quality check

We will use FastQC to check the quality of the raw sequence data. It is important to check most of the graphs and make sure that the data represent the sample that was library-prepped and sequenced. For more information regarding the graphs, either visit the FastQC website or watch this video.


Task

  1. Run fastqc on the file Ha_R1.fq.gz (in the login node).
  2. Type in the following (don't include the $):
$ fastqc Ha_R1.fq.gz
  3. Try to find help for fastqc (as shown below) and discuss which flags one can use to process multiple samples. You will use the -t option to run multiple threads; one thread analyses one file at a time.
  4. Use SLURM to process the other four files.
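For step 3, a quick way to view the available flags on the login node (the help text documents the -t/--threads option):

fastqc --help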

SLURM script

#!/bin/bash
#
# Job name:
#SBATCH --job-name=raw_fastq
#
# Project: 
#SBATCH --account=nn9305k
#
# Wall clock limit: 
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=4 
#
# Max memory usage: 
## A good suggestion here is 4GB
#SBATCH --mem-per-cpu=4Gb
## Set up job environment

source /cluster/bin/jobsetup
  
module load fastqc
fastqc -t 4 Br_R* Ed_R*
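Save the script in the data folder (for example as raw_fastqc.slurm; the filename is just an illustration) and submit it with sbatch. You can monitor the job with squeue:

sbatch raw_fastqc.slurm
squeue -u <your_user_name>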
  5. Move the html and zip files to raw_fastqc:
cd ../raw_fastqc
mv ../data/*html .
mv ../data/*.zip .
  6. Copy the raw_fastqc folder from Abel to BioLinux, into a folder called Data_pre_processing_Dec14 on the Desktop, using scp. The -r option (recursive) copies a folder and all of its contents (and do not forget the '.' at the end of the command).

In Biolinux:

cd 
cd Desktop
mkdir Data_pre_processing_Dec14
cd Data_pre_processing_Dec14
scp -r <your_user_name>@abel.uio.no:/work/projects/nn9305k/home/<your_user_name>/Data_pre_processing_Dec14/raw_fastqc .
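To view a report, open one of the html files in a browser (FastQC names each report after its input file; using firefox as the browser here is an assumption about the BioLinux setup):

firefox raw_fastqc/Ha_R1_fastqc.html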
  7. Go through the html files and discuss.

Trimmomatic - adapter trimming and removal

We will use Trimmomatic to trim/remove adapters and low-quality reads. This tool is NOT available via module load on Abel, but it is available at /work/projects/nn9305k/bin/. Make sure you know where the adapter sequences are located; you can list them as shown below.
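The adapter files live in the trimmomatic_adapters folder referenced by the commands further down; a quick listing shows what is available:

ls /work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/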


Task

  1. Move to the trim folder:
cd ../trim
  2. Run trimmomatic-0.36.jar on the Ha_R1.fq.gz file.

SLURM script

#!/bin/bash
#
# Job name:
#SBATCH --job-name=trim
#
# Project:
#SBATCH --account=nn9305k
#
# Wall clock limit:
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=12
#
# Max memory usage:
## A good suggestion here is 4GB
#SBATCH --mem-per-cpu=4Gb
## Set up job environment
  
source /cluster/bin/jobsetup

java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar SE -threads 12 -phred33 ../data/Ha_R1.fq.gz Ha_trim_R1.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15 CROP:75
Trimmomatic description:

SE - single-end mode
-threads - use 12 threads to run on multiple CPUs in parallel
-phred33 - the quality encoding of fastq files generated by Illumina sequencers
ILLUMINACLIP - trim/remove adapter sequences using an adapter file; the numbers 2:30:10 are the seed mismatches, palindrome clip threshold and simple clip threshold
LEADING - remove low-quality bases from the beginning of the read
TRAILING - remove low-quality bases from the end of the read
SLIDINGWINDOW - trim/remove low-quality reads using a sliding window; 4:15 means a window of 4 bases with a required average quality of 15
MINLEN - minimum length for a read to be retained. In general this should be 36, but since we are dealing with miRNAs here it should be 15
CROP - trim the reads to a specific length
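As before, save the script (e.g. as trim.slurm; an illustrative name) and submit it from the trim folder, since the input path is given relative to it:

sbatch trim.slurm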
  3. Run fastqc on the output fastq files, copy the html and zip files to BioLinux, and view them in the browser (a sketch follows below).
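A minimal sketch for this step, reusing the earlier commands (run on Abel in the trim folder, then fetch from BioLinux; paths assume the folder layout created above):

fastqc Ha_trim_R1.fq.gz

And on BioLinux:

scp '<your_user_name>@abel.uio.no:/work/projects/nn9305k/home/<your_user_name>/Data_pre_processing_Dec14/trim/*_fastqc*' .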

Homework

Use trimmomatic to trim/remove adapters and low-quality reads in Br_R1.fq.gz and Br_R2.fq.gz (and/or Ed_R1.fq.gz and Ed_R2.fq.gz).

  1. Remember that you are working with paired-end data (change SE to PE).
  2. There are two input files and four output files.
  3. Use TruSeq3-PE-2.fa instead of TruSeq3-SE.fa since we are dealing with paired-end reads.
  4. Change the MINLEN parameter to 36.
  5. Use an appropriate value for CROP (check the fastqc output for the raw reads and use the correct value).
  6. Also, remember to change #SBATCH --mem-per-cpu=4Gb to #SBATCH --mem-per-cpu=12Gb; this is a bigger job and needs more memory (12Gb instead of 4Gb).

SLURM script

#!/bin/bash
#
# Job name:
#SBATCH --job-name=trim
#
# Project:
#SBATCH --account=nn9305k
#
# Wall clock limit:
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=12
#
# Max memory usage:
## A good suggestion here is 12GB
#SBATCH --mem-per-cpu=12Gb
## Set up job environment

source /cluster/bin/jobsetup
  
java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar PE -threads 12 -phred33 ../data/Br_R1.fq.gz ../data/Br_R2.fq.gz Br_trim_R1.fq.gz Br_trim_R1_UNPAIRED.fq.gz Br_trim_R2.fq.gz Br_trim_R2_UNPAIRED.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 CROP:150
  
java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar PE -threads 12 -phred33 ../data/Ed_R1.fq.gz ../data/Ed_R2.fq.gz Ed_trim_R1.fq.gz Ed_trim_R1_UNPAIRED.fq.gz Ed_trim_R2.fq.gz Ed_trim_R2_UNPAIRED.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 CROP:150
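Once the job finishes, the same fastqc check applies to the trimmed paired output files (a sketch; the UNPAIRED files can be checked the same way):

fastqc -t 4 Br_trim_R1.fq.gz Br_trim_R2.fq.gz Ed_trim_R1.fq.gz Ed_trim_R2.fq.gz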

FastX-Toolkit

I mentioned the FastX-Toolkit and showed one example of collapsing fastq reads to identify the list of unique reads. There are more than 10 tools in this toolkit; do read the FastX-Toolkit page to familiarize yourself with them.

FastX-Toolkit does not accept compressed files, so we need to uncompress the .gz file to a normal fastq file using gunzip (the -c flag writes the uncompressed data to stdout, leaving the original .gz file in place). In the trim folder:

gunzip -c Ha_trim_R1.fq.gz > Ha_trim_R1.fq
fastx_collapser -v -i Ha_trim_R1.fq -o Ha_trim_R1_Collapse.fq -Q33
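Note that fastx_collapser writes FASTA output: each header encodes a sequence's rank and its read count (e.g. >1-1000 for the most abundant sequence seen 1000 times; the numbers here are illustrative). You can inspect the first records with head:

head -4 Ha_trim_R1_Collapse.fq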