Data pre-processing

Datasets used for this session

The dataset used during this session can be found in the following location on Abel:

/work/projects/nn9305k/tmp/Files_for_Dec14/

NB: Replace <your_user_name> with your Abel username.

Create a new folder called Data_pre_processing_Dec14 in your home area and move there.

cd /work/projects/nn9305k/home/<your_user_name>/
mkdir Data_pre_processing_Dec14
cd Data_pre_processing_Dec14

Create three folders here and move to the data folder:

mkdir data
mkdir raw_fastqc
mkdir trim

cd data

Type the following command to link the files (not copy):

ln -s /work/projects/nn9305k/tmp/Files_for_Dec14/*fq.gz .
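To verify that the symbolic links were created, list the directory contents; each entry should point back to the shared folder:

ls -l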

Fastq quality check

We will use FastQC to check the quality of the raw sequence data. It is important to check most of the graphs and make sure that the data represent the sample that was library-prepped and sequenced. For more information regarding the graphs, either visit the FastQC website or watch this video.


Task

  1. Run fastqc on the file Ha_R1.fq.gz (in the login node).
  2. Type in the following (don't include the $):
$ fastqc Ha_R1.fq.gz
  3. Try to find help for fastqc (as shown below) and discuss which flags one can use to process multiple samples. You will use the -t option to run multiple threads; one thread analyses one file at a time.
  4. Use SLURM to process the other four files.
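For step 3, a quick way to view the available flags on the login node (the help text documents the -t/--threads option):

fastqc --help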

SLURM script

#!/bin/bash
#
# Job name:
#SBATCH --job-name=raw_fastq
#
# Project: 
#SBATCH --account=nn9305k
#
# Wall clock limit: 
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=4 
#
# Max memory usage: 
## A good suggestion here is 4GB
#SBATCH --mem-per-cpu=4Gb
## Set up job environment

source /cluster/bin/jobsetup
  
module load fastqc
fastqc -t 4 Br_R* Ed_R*
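Save the script in the data folder (for example as raw_fastqc.slurm; the filename is just an illustration) and submit it with sbatch. You can monitor the job with squeue:

sbatch raw_fastqc.slurm
squeue -u <your_user_name>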
  5. Move the html and zip files to raw_fastqc:
cd ../raw_fastqc
mv ../data/*html .
mv ../data/*.zip .
  6. Copy the raw_fastqc folder from Abel to BioLinux, into a folder called Data_pre_processing_Dec14 on the Desktop, using scp. The -r option (recursive) copies a folder and all of its contents (and do not forget the '.' at the end of the command).

In Biolinux:

cd 
cd Desktop
mkdir Data_pre_processing_Dec14
cd Data_pre_processing_Dec14
scp -r <your_user_name>@abel.uio.no:/work/projects/nn9305k/home/<your_user_name>/Data_pre_processing_Dec14/raw_fastqc .
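To view a report, open one of the html files in a browser (FastQC names each report after its input file; using firefox as the browser here is an assumption about the BioLinux setup):

firefox raw_fastqc/Ha_R1_fastqc.html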
  7. Go through the html files and discuss.

Trimmomatic - adapter trimming and removal

We will use Trimmomatic to trim/remove adapters and low-quality reads. This tool is NOT available via module load on Abel, but it is available at /work/projects/nn9305k/bin/. Make sure you know where the adapter sequences are located; you can list them as shown below.
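The adapter files live in the trimmomatic_adapters folder referenced by the commands further down; a quick listing shows what is available:

ls /work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/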


Task

  1. Move to the trim folder:
cd ../trim
  2. Run trimmomatic-0.36.jar on the Ha_R1.fq.gz file.

SLURM script

#!/bin/bash
#
# Job name:
#SBATCH --job-name=trim
#
# Project:
#SBATCH --account=nn9305k
#
# Wall clock limit:
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=12
#
# Max memory usage:
## A good suggestion here is 4GB
#SBATCH --mem-per-cpu=4Gb
## Set up job environment
  
source /cluster/bin/jobsetup

java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar SE -threads 12 -phred33 ../data/Ha_R1.fq.gz Ha_trim_R1.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15 CROP:75
Trimmomatic description:

SE - single-end mode
-threads - use 12 threads to run on multiple CPUs in parallel
-phred33 - the quality encoding of fastq files generated by Illumina sequencers
ILLUMINACLIP - trim/remove adapter sequences using an adapter file; the numbers 2:30:10 are the seed mismatches, palindrome clip threshold and simple clip threshold
LEADING - remove low-quality bases from the beginning of the read
TRAILING - remove low-quality bases from the end of the read
SLIDINGWINDOW - trim/remove low-quality reads using a sliding window; 4:15 means a window of 4 bases with a required average quality of 15
MINLEN - minimum length for a read to be retained. In general this should be 36, but since we are dealing with miRNAs here it should be 15
CROP - trim the reads to a specific length
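As before, save the script (e.g. as trim.slurm; an illustrative name) and submit it from the trim folder, since the input path is given relative to it:

sbatch trim.slurm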
  3. Run fastqc on the output fastq files, copy the html and zip files to BioLinux, and view them in the browser (a sketch follows below).
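A minimal sketch for this step, reusing the earlier commands (run on Abel in the trim folder, then fetch from BioLinux; paths assume the folder layout created above):

fastqc Ha_trim_R1.fq.gz

And on BioLinux:

scp '<your_user_name>@abel.uio.no:/work/projects/nn9305k/home/<your_user_name>/Data_pre_processing_Dec14/trim/*_fastqc*' .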

Homework

Use trimmomatic to trim/remove adapters and low-quality reads in Br_R1.fq.gz and Br_R2.fq.gz (and/or Ed_R1.fq.gz and Ed_R2.fq.gz).

  1. Remember that you are working with paired-end data (change SE to PE).
  2. There are two input files and four output files.
  3. Use TruSeq3-PE-2.fa instead of TruSeq3-SE.fa since we are dealing with paired-end reads.
  4. Change the MINLEN parameter to 36.
  5. Use an appropriate value for CROP (check the fastqc output for the raw reads and use the correct value).
  6. Also, remember to change #SBATCH --mem-per-cpu=4Gb to #SBATCH --mem-per-cpu=12Gb; this is a bigger job and needs more memory (12Gb instead of 4Gb).

SLURM script

#!/bin/bash
#
# Job name:
#SBATCH --job-name=trim
#
# Project:
#SBATCH --account=nn9305k
#
# Wall clock limit:
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=12
#
# Max memory usage:
## A good suggestion here is 12GB
#SBATCH --mem-per-cpu=12Gb
## Set up job environment

source /cluster/bin/jobsetup
  
java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar PE -threads 12 -phred33 ../data/Br_R1.fq.gz ../data/Br_R2.fq.gz Br_trim_R1.fq.gz Br_trim_R1_UNPAIRED.fq.gz Br_trim_R2.fq.gz Br_trim_R2_UNPAIRED.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 CROP:150
  
java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar PE -threads 12 -phred33 ../data/Ed_R1.fq.gz ../data/Ed_R2.fq.gz Ed_trim_R1.fq.gz Ed_trim_R1_UNPAIRED.fq.gz Ed_trim_R2.fq.gz Ed_trim_R2_UNPAIRED.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 CROP:150
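Once the job finishes, the same fastqc check applies to the trimmed paired output files (a sketch; the UNPAIRED files can be checked the same way):

fastqc -t 4 Br_trim_R1.fq.gz Br_trim_R2.fq.gz Ed_trim_R1.fq.gz Ed_trim_R2.fq.gz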

FastX-Toolkit

I mentioned the FastX-Toolkit and showed one example of collapsing fastq reads to identify the list of unique reads. There are more than 10 tools in this toolkit; do read the FastX-Toolkit page to familiarize yourself with them.

FastX-Toolkit does not accept compressed files, so we need to uncompress the .gz file to a normal fastq file using gunzip (the -c flag writes the uncompressed data to stdout, leaving the original .gz file in place). In the trim folder:

gunzip -c Ha_trim_R1.fq.gz > Ha_trim_R1.fq
fastx_collapser -v -i Ha_trim_R1.fq -o Ha_trim_R1_Collapse.fq -Q33
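Note that fastx_collapser writes FASTA output: each header encodes a sequence's rank and its read count (e.g. >1-1000 for the most abundant sequence seen 1000 times; the numbers here are illustrative). You can inspect the first records with head:

head -4 Ha_trim_R1_Collapse.fq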