The dataset used during this session can be found at the following location on abel:
/work/projects/nn9305k/tmp/Files_for_Dec14/
Create a new folder called Data_pre_processing_Dec14 in your home area and move there.
cd /work/projects/nn9305k/home/<your_user_name>/
mkdir Data_pre_processing_Dec14
cd Data_pre_processing_Dec14
Create three folders here and move to the data folder:
mkdir data
mkdir raw_fastqc
mkdir trim
cd data
Type the following command to link the files (not copy):
ln -s /work/projects/nn9305k/tmp/Files_for_Dec14/*fq.gz .
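To confirm that the files were linked and not copied (a quick check, not part of the original task list), list the directory:
ls -l
Each entry should appear as a small symbolic link with an arrow pointing back to /work/projects/nn9305k/tmp/Files_for_Dec14/.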
We will use FastQC to check the quality of the raw sequencing data. It is important to go through most of the graphs and make sure that the data represents the sample that was library-prepped and sequenced. For more information regarding the graphs, either visit the above website or check this video.
Task
- Run fastqc on the file Ha_R1.fq.gz (on the login node).
- Type in the following (don't include the $):
$ fastqc Ha_R1.fq.gz
- Try to find the help for fastqc and discuss which flags can be used to process multiple samples. You will use the -t option to run multiple threads; one thread analyses one file at a time (see the quick sketch after this list).
- Use SLURM to process the other four files.
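A quick interactive sketch of exploring the help text and the threading option mentioned above (the actual processing of the four files is done with the SLURM script below; the file names here are placeholders):
fastqc --help | less
fastqc -t 4 file1.fq.gz file2.fq.gz file3.fq.gz file4.fq.gz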
SLURM script
#!/bin/bash
#
# Job name:
#SBATCH --job-name=raw_fastq
#
# Project:
#SBATCH --account=nn9305k
#
# Wall clock limit:
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=4
#
# Max memory usage:
## A good suggestion here is 4GB
#SBATCH --mem-per-cpu=4Gb
## Set up job environment
source /cluster/bin/jobsetup
module load fastqc
fastqc -t 4 Br_R* Ed_R*
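To run this, save the script in the data folder (for example as raw_fastqc.slurm, a name chosen here only for illustration) and submit it with sbatch; squeue shows whether the job is still queued or running:
sbatch raw_fastqc.slurm
squeue -u <your_user_name>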
- Move the html and zip files to raw_fastqc
cd ../raw_fastqc
mv ../data/*html .
mv ../data/*.zip .
- Copy the raw_fastqc folder from abel to BioLinux, into a folder called Data_pre_processing_Dec14 on the Desktop, using scp. The -r option (which stands for recursive) copies a folder and all of its contents (and do not forget the '.' at the end of the command).
In BioLinux:
cd
cd Desktop
mkdir Data_pre_processing_Dec14
cd Data_pre_processing_Dec14
scp -r <your_user_name>@abel.uio.no:/work/projects/nn9305k/home/<your_user_name>/Data_pre_processing_Dec14/raw_fastqc .
- Go through the html files and discuss.
We will use Trimmomatic to trim/remove adapters and low-quality reads. This tool is NOT available via module load on abel, but it is available at /work/projects/nn9305k/bin/. Make sure you know where the adapter sequences are located.
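As a quick check of where the adapter sequences live (this is the same path used in the Trimmomatic commands further down), list the adapter directory:
ls /work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/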
Task
- Move to trim folder.
cd ../trim
- Run trimmomatic-0.36.jar on Ha_R1.fq.gz file.
SLURM script
#!/bin/bash
#
# Job name:
#SBATCH --job-name=trim
#
# Project:
#SBATCH --account=nn9305k
#
# Wall clock limit:
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=12
#
# Max memory usage:
## A good suggestion here is 4GB
#SBATCH --mem-per-cpu=4Gb
## Set up job environment
source /cluster/bin/jobsetup
java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar SE -threads 12 -phred33 ../data/Ha_R1.fq.gz Ha_trim_R1.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15 CROP:75
Trimmomatic | Description |
---|---|
SE | single-end reads |
-threads | use 12 threads so multiple CPUs work in parallel |
-phred33 | the quality encoding of fastq files generated by Illumina sequencers |
ILLUMINACLIP | trim/remove adapter sequences using an adapter file; the settings 2:30:10 are seed mismatches, palindrome clip threshold and simple clip threshold |
LEADING | remove low-quality bases from the beginning of the read |
TRAILING | remove low-quality bases from the end of the read |
SLIDINGWINDOW | trim reads once the average quality in a sliding window drops below a threshold; 4:15 means window size 4 and required quality 15 |
MINLEN | the minimum length a read must have to be retained. In general this should be 36, but since we are dealing with miRNAs here, it is set to 15 |
CROP | cut the read down to a specified length |
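When the job has finished, Trimmomatic's summary (input reads, surviving reads and dropped reads) is written to the job's output file. Assuming the default SLURM output name slurm-<jobid>.out, you can pull out the summary line with:
grep Surviving slurm-*.out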
- Run fastqc on the output fastq files, copy the html and zip files to BioLinux, and view them in the browser (a short sketch of these steps follows the paired-end SLURM script below).
Use trimmomatic to trim/remove adapters and low-quality reads in Br_R1.fq.gz and Br_R2.fq.gz (and/or Ed_R1.fq.gz and Ed_R2.fq.gz)
- Remember that you are working with paired end data (Change SE to PE).
- There are two input files and four output files.
- Use TruSeq3-PE-2.fa instead of TruSeq3-SE.fa since we are dealing with paired end reads.
- Change MINLEN parameter to 36.
- Use appropriate value for CROP (check the fastqc output for raw reads and use the correct value).
- Also, remember to change #SBATCH --mem-per-cpu=4Gb to #SBATCH --mem-per-cpu=12Gb. This is a bigger job and needs more memory (12Gb instead of 4Gb).
SLURM script
#!/bin/bash
#
# Job name:
#SBATCH --job-name=trim
#
# Project:
#SBATCH --account=nn9305k
#
# Wall clock limit:
#SBATCH --time=01:00:00
#
#SBATCH --ntasks=12
#
# Max memory usage:
## A good suggestion here is 12GB
#SBATCH --mem-per-cpu=12Gb
## Set up job environment
source /cluster/bin/jobsetup
java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar PE -threads 12 -phred33 ../data/Br_R1.fq.gz ../data/Br_R2.fq.gz Br_trim_R1.fq.gz Br_trim_R1_UNPAIRED.fq.gz Br_trim_R2.fq.gz Br_trim_R2_UNPAIRED.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 CROP:150
java -jar /work/projects/nn9305k/bin/trimmomatic-0.36.jar PE -threads 12 -phred33 ../data/Ed_R1.fq.gz ../data/Ed_R2.fq.gz Ed_trim_R1.fq.gz Ed_trim_R1_UNPAIRED.fq.gz Ed_trim_R2.fq.gz Ed_trim_R2_UNPAIRED.fq.gz ILLUMINACLIP:/work/projects/nn9305k/db_flatfiles/trimmomatic_adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 CROP:150
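A sketch of the follow-up step from the task list above, using the file names produced by the scripts in this section: run FastQC on the trimmed files (in the trim folder on abel), then copy the reports to BioLinux as before.
fastqc -t 4 Ha_trim_R1.fq.gz Br_trim_R1.fq.gz Br_trim_R2.fq.gz Ed_trim_R1.fq.gz Ed_trim_R2.fq.gz
In BioLinux:
cd ~/Desktop/Data_pre_processing_Dec14
scp <your_user_name>@abel.uio.no:/work/projects/nn9305k/home/<your_user_name>/Data_pre_processing_Dec14/trim/*.html .
scp <your_user_name>@abel.uio.no:/work/projects/nn9305k/home/<your_user_name>/Data_pre_processing_Dec14/trim/*.zip .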
I mentioned the FastX-Toolkit and showed one example of collapsing fastq reads to identify the list of unique reads. There are more than 10 tools in this toolkit, so do read the above page to familiarize yourself with them.
FastX-Toolkit does not accept compressed files, so we need to uncompress the .gz file into a normal fastq file using gunzip. In the trim folder:
gunzip -c Ha_trim_R1.fq.gz > Ha_trim_R1.fq
fastx_collapser -v -i Ha_trim_R1.fq -o Ha_trim_R1_Collapse.fq -Q33
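With -v, fastx_collapser reports how many reads were read in and how many unique sequences were written out. The output is FASTA-formatted: each unique sequence appears once, and by default its header records its rank and the number of identical reads collapsed into it (so the .fq extension used above is kept only for convenience). A quick look at the first records:
head -4 Ha_trim_R1_Collapse.fq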