Updated markdown documents to reflect recent changes for the Minor Da…

…ta Science
mkempenaar · May 8, 2024 · 5182244 · 5182244
1 parent 6a624fb
commit 5182244
Show file tree

Hide file tree

Showing 12 changed files with 370 additions and 440 deletions.
diff --git a/.gitignore b/.gitignore
@@ -10,5 +10,4 @@ _bookdown_files/
 data/
 _main*.*
 gene_expression.Rmd
-libs/
 rscripts/
diff --git a/_bookdown.yml b/_bookdown.yml
@@ -7,10 +7,11 @@ delete_merged_file: true
 
 rmd_files: ['index.Rmd',
             ## UNCOMMENT FOR BFVH15CAPSTONE BOOK ##
-            ##'chapters/01-introduction_BFVH15CAPSTONE.Rmd',
-            'chapters/02-eda.Rmd',
-            'chapters/03-deg.Rmd',
-            'chapters/04-data_analysis.Rmd',
+            ##'chapters/introduction_BFVH15CAPSTONE.Rmd',
+            'chapters/datasets.Rmd',
+            'chapters/eda.Rmd',
+            'chapters/deg.Rmd',
+            'chapters/data_analysis.Rmd',
             'chapters/a1-batch_data_loading.Rmd',
             'chapters/a2-annotation.Rmd',
-            'chapters/a3-datasets.Rmd']
+            'chapters/a3-read-mapping.Rmd']
diff --git a/_output.yml b/_output.yml
@@ -4,7 +4,7 @@ bookdown::gitbook:
   config:
     toc:
       before: |
-        <li><a href="./">2.1.2 Genomics & Transcriptomics</a></li>
+        <li><a href="./">Capstone Project</a></li>
 
 bookdown::pdf_book:
   includes:
@@ -13,3 +13,8 @@ bookdown::pdf_book:
   citation_package: natbib
 bookdown::epub_book: default
 bookdown::html_book: default
+bookdown::bs4_book:
+  css: style.css
+  theme:
+    primary: "#096B72"
+  repo: https://github.com/rstudio/bookdown-demo
diff --git a/chapters/a1-batch_data_loading.Rmd b/chapters/a1-batch_data_loading.Rmd
@@ -1,6 +1,6 @@
 # (APPENDIX) Appendix {-} 
 
-# Appendix A: Loading Expression Data in R {#a1-data_loading}
+# Loading Expression Data in R {#a1-data_loading}
 
 ## Exploring the Available Project Data {#expl-data}
 

diff --git a/chapters/a2-annotation.Rmd b/chapters/a2-annotation.Rmd
@@ -186,15 +186,15 @@ The `biomaRt` library interfaces with the [biomart.org](http://www.biomart.org)
 
 You are required to explore a number of objects and the given example code is most likely not suitable for your data/ organism. The project is well documented on the [Bioconductor biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) website that links to the [biomaRt users guide](https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html).
 
-```{r load_biomart, echo=TRUE, eval=FALSE}
+```{r load_biomart, echo=TRUE, eval=TRUE}
 # Load the library
 library(biomaRt)
 
 # Use an alternative database server as the regular one sometimes has issues..
 ensembl=useMart("ENSEMBL_MART_ENSEMBL", host="https://www.ensembl.org")
 ```
 
-```{r .pander_marts, echo=FALSE, eval=TRUE}
+```{r .pander_marts, echo=FALSE, eval=FALSE}
 ## << LOADED FROM SESSION FILE >>
 pander(listMarts(), caption = 'Available databases in the Ensembl biomaRt')
 ```
@@ -254,7 +254,7 @@ We now have all three needed elements (species, filter and attributes) so we are
 
 The example below retrieves data using a set of five Ensembl gene IDs since not all genes have an Ensembl ID as we've seen above, so we filter those out first and use this as the `values` parameter below.
 
-```{r retrieve_biomart_results, echo=TRUE, eval=TRUE}
+```{r retrieve_biomart_results, echo=TRUE, eval=FALSE}
 # Set the 'attributes' values
 attrs.get <- c("ensembl_gene_id", "chromosome_name", 
                "start_position","end_position", "description")
@@ -270,7 +270,7 @@ results$gene_length <- abs(results$end_position - results$start_position)
 
 The `results` object is a `data.frame` with 5 columns that we can merge with our data set giving us the following annotation columns (combined from the `AnnotationDBI` and `biomaRt` libraries).
 
-```{r .pander_biomart_results, echo=FALSE, eval=TRUE}
+```{r .pander_biomart_results, echo=FALSE, eval=FALSE}
 ipsc.tri.vs.di.annot <- merge(x = ipsc.tri.vs.di, y = results, 
                         by.x = 'Ensembl', by.y = 'ensembl_gene_id')
 
@@ -279,6 +279,4 @@ pander(ipsc.tri.vs.di.annot[, c('Ensembl', 'EntrezID',
                                         'end_position', 'gene_length', 'description')])
 ```
 
-This was just an example on how to use the `biomaRt` library and it comes down to selecting the correct filter and looking for interesting attributes to retrieve. Further information can be found in the documentation avaialble with `vignette('biomaRt')
-
-## References
+This was just an example on how to use the `biomaRt` library and it comes down to selecting the correct filter and looking for interesting attributes to retrieve. Further information can be found in the documentation avaialble with `vignette('biomaRt')
diff --git a/chapters/a3-read-mapping.Rmd b/chapters/a3-read-mapping.Rmd
@@ -0,0 +1,222 @@
+# From Reads to Counts {#a3-read-mapping}
+
+This chapter can be used to gain insight into how the count files are generated. Step one in any RNA-Seq data analysis is generating the gene or transcript counts for each sample. This overlaps partly with steps taken in a genomics workflow (that we performed in Galaxy):
+
+* Download SRA files and extract FastQ files
+* Perform quality control on raw reads (FastQC)
+* Perform trimming as needed (Trimmomatic)
+* Perform quality control post trimming (FastQC)
+* Perform mapping against reference genome (most often STAR)
+* Generate gene/ transcript counts (either STAR, HTseq or featurecounts)
+* *Downstream analysis* (starting from chapter 3 of this document)
+
+This chapter shows these steps for an example data set (`GSE149995`)
+
+```{block, type='rmdtip'}
+**Note about storage**:
+
+The fastq-files used for an RNA-Seq experiment can be fairly large and depending on the number of samples, the total storage required for downloading and processing these files can run into multiple terrabytes. For students there is enough storage in the `/students/` folder. Please create a new directory in the `/students/202x-202x/minor/` folder that will be used for all the following steps. Note that you need to replace the `/local-fs/staff/marcel/` path with the path to your own directory.
+
+Furthermore, it is adviced (or basically required) to run all commands on the `assemblix` server as that has the capacity to process multiple samples efficiently in parallel. 
+```
+
+## Downloading and extracting FastQ files
+
+Given the SRA identifier for this project, select all samples and download as accession file (SRX identifiers). Using these identifiers as input for the `prefetch` system command (requires the NCBI SRA-Toolkit installed) downloads the SRA files and we convert them using the `fasterq-dump` command to fastq files.
+```{bash SRA_download, eval=FALSE}
+prefetch $(<GSE149995_SRA_ACC_RNA-Seq.csv) --output-directory \
+  /local-fs/staff/marcel/GSE149995/SRA
+```
+
+The `fasterq-dump` by default performs a `split 3` operation, see [the SRA toolkit manual](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) for further details.
+
+```{bash extract_fastq, eval=FALSE}
+find /local-fs/staff/marcel/GSE149995/SRA/ -name "*.sra" | \
+  parallel 'fasterq-dump -O /local-fs/staff/marcel/GSE149995/fastq/ {}'
+```
+
+## (Optional) Annotation
+
+The NCBI GEO page lists 9 samples with 3 conditions (`WT`, `potent` and `arf7`). However, this experiment includes 18 sequence files. By downloading the 'series matrix file' from the GEO (using the `GEOquery` library) and combining this with the run info downloadable from the SRA, we can list the mapping of run to sample. 
+
+**Note**: This is an optional step for now and is only useful when there are doubts about the group/sample setup.
+
+```{r, echo=FALSE, message=FALSE, warning=FALSE}
+library(tidyr)
+library(pander)
+```
+
+```{r sample_annotation, message=FALSE, warning=FALSE}
+library(GEOquery)
+gse <- getGEO(GEO = "GSE149995")
+# Combine the condition with a sample number and replicate number
+Condition = paste(rep(gse[[1]]@phenoData@data$`genotype/variation:ch1`, each = 2),
+                  rep(1:3, each = 2),
+                  paste0('r', rep(1:2, 9)), sep = '_')
+run_info <- read.csv(file = "data/GSE149995_Sra_RunInfo.csv")
+setup <- cbind(run_info %>% dplyr::select(Run, Experiment), Condition)
+pander(setup)
+```
+
+## Quality Control
+
+Here we perform the FastQC -> Trimmomatic -> FastQC steps
+
+### FastQC
+
+Running FastQC 
+```{r fastQC, eval=FALSE}
+fqdir <- "/local-fs/staff/marcel/GSE149995/fastq/"
+fastqc(fq.dir = "/local-fs/staff/marcel/GSE149995/fastq/",
+       qc.dir = "/local-fs/staff/marcel/GSE149995/fqc",
+       fastqc.path = "/usr/bin/fastqc",
+       threads = 32)
+```
+
+QC-report using `multiqc`
+
+```{bash MultiQC, eval=FALSE}
+multiqc \
+  --filename ~/Development/2.1.2-Transcriptomics/GSE149995_FQC_multiqc_report.html \
+  /local-fs/staff/marcel/GSE149995/fqc/ 
+```
+
+## Trimming
+
+Proper trimming requires experimentation and research regarding the used sequencing technology. The following command uses Trimmomatic (release 0.39, available on [Github](https://github.com/usadellab/Trimmomatic/releases)) to trim all fastq files. Note that this example is for *single-end* sequencing only, use the `TrimmomaticPE` command for paired-end data processing and the `TruSeq` FASTA file ending in `PE` instead.
+
+```{bash Trimmomatic, eval=FALSE}
+cat data/GSE149995_Sra_RunInfo.csv | \
+  parallel 'TrimmomaticSE -threads 4 ' \
+                  '/local-fs/staff/marcel/GSE149995/fastq/{}.fastq.gz ' \
+                  '/local-fs/staff/marcel/GSE149995/fastq-trimmed/{}.trimmed.fastq.gz ' \
+                  'ILLUMINACLIP:/homes/marcelk/Development/2.1.2-Transcriptomics/TruSeq3-SE.fa:2:30:10 ' \
+                  'MINLEN:40 ' \
+                  'SLIDINGWINDOW:4:20'
+```
+
+
+Re-running FastQC
+```{r fastQC-trimmed, eval=FALSE}
+fastqc(fq.dir = "/local-fs/staff/marcel/GSE149995/fastq-trimmed/",
+       qc.dir = "/local-fs/staff/marcel/GSE149995/fqc-trimmed",
+       fastqc.path = "/usr/bin/fastqc",
+       threads = 32)
+```
+
+```{bash multiqc-fastqc-trimmed, eval=FALSE}
+multiqc --filename \
+  ~/Development/2.1.2-Transcriptomics/GSE149995_FQC_Trimmed.html \
+  /local-fs/staff/marcel/GSE149995/fqc-trimmed/
+```
+
+## Mapping with STAR
+
+### Building STAR index
+
+Other than with `bwa`, STAR requires an extra annotation file (GFF or GTF format) describing features on the genome. This data set requires the FASTA of the Arabidopsis thaliana genome and its accompanying annotation file. The `genomeDir` specifies the output location for the index.
+```{bash star_index, eval=FALSE}
+cd /local-fs/staff/marcel/GSE149995/reference
+mkdir star; cd star
+STAR \
+  --runThreadN 12 \
+  --runMode genomeGenerate \
+  --genomeDir /local-fs/staff/marcel/GSE149995/reference/star \
+  --genomeFastaFiles /local-fs/staff/marcel/GSE149995/reference/TAIR10_chr_all.fas \
+  --sjdbGTFfile /local-fs/staff/marcel/GSE149995/reference/TAIR10_GTF_genes.gtf \
+  --genomeSAindexNbases 12 \
+  --sjdbOverhang 49
+```
+
+### Read mapping and transcript counting
+
+The following command performs mapping against the reference genome, saves this as a sorted BAM file and directly performs a read count for all transcripts (NOTE: requires the `--sjdbGTFfile` parameter during index building). For each input file, there is a `ReadsPerGenes.out.tab` file generated containing 4 columns:
+
+* column 1: gene ID
+* column 2: counts for unstranded RNA-seq
+* column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
+* column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
+
+See an interesting discussion on [BioStars](https://www.biostars.org/p/218995/) regarding which column to use.
+
+When planning to use a different tool for calculating gene counts, you can remove the `quantMode` argument which will only generate the sorted BAM files. Please read the manual of the tool of choice to see if an indexed BAM file as the BAM file does not have an index.
+```{bash star_mapping, eval=FALSE}
+find /local-fs/staff/marcel/GSE149995/fastq-trimmed/ -name "*.fastq.gz" -exec basename {} \; | \
+  sed 's/\.fastq.gz$//' | parallel 'STAR --runThreadN 6 ' \
+    '--genomeDir /local-fs/staff/marcel/GSE149995/reference/star ' \
+    '--readFilesIn /local-fs/staff/marcel/GSE149995/fastq-trimmed/{}.fastq ' \
+    '--outSAMtype BAM SortedByCoordinate ' \
+    '--quantMode GeneCounts ' \
+    '--outFileNamePrefix /local-fs/staff/marcel/GSE149995/mapping/star/{}_star_'
+```
+
+Generating MultiQC report for STAR mapping phase
+
+```{bash multiqc-star-mapping, eval=FALSE}
+~/miniconda3/bin/multiqc --filename \
+  ~/Development/2.1.2-Transcriptomics/STAR-mapping.html \
+  /local-fs/staff/marcel/GSE149995/mapping/star/
+```
+
+### Combining STAR Count Files
+
+STAR generates a `[sample]_star_ReadsPerGene.out.tab` file for each sample, containing the gene IDs and the count numbers. We can merge these files into a single data frame for further processing (see also [Appendix A1 in the manual](https://mkempenaar.github.io/gene_expression_analysis/a1-data_loading.html#a1-batch_data_loading)). This example takes the 2nd column for the count values. Make sure that you know which column to use based on the description above. The reason that we're merging the data is because it might not always be guaranteed that you get the exact same transcripts in each file.
+
+Note: if you use the `featurecounts` tool, you will already get a single file containing the count values for all samples.
+
+```{r merge_count_files_function, eval=FALSE}
+file.names <- list.files('/local-fs/staff/marcel/GSE149995/mapping/star/',
+                         pattern = '*_star_ReadsPerGene.out.tab')
+
+## Function for reading in files
+read_sample <- function(file.name) {
+  ## Extract the sample name for naming the column (retaining the 'SRR....' part)
+  sample.name <- strsplit(file.name, ".", fixed = TRUE)[[1]][1]
+  sample <- read.table(file.name, header = FALSE, sep="\t", 
+                       row.names = NULL, skip = 4)
+  ## Rename the count column
+  names(sample)[2] <- sample.name
+  ## Return a subset containing the transcript ID and sample name columns
+  return(sample[c(1, 2)])
+}
+```
+
+Now we read all count files and merge them into a single data frame.
+
+```{r merge_count_files, eval=FALSE}
+setwd('/local-fs/staff/marcel/GSE149995/mapping/star/')
+
+## Read the FIRST sample
+counts <- read_sample(file.names[1])
+
+## Read the remaining files and merge the contents
+for (file.name in file.names[2:length(file.names)]) {
+  sample <- read_sample(file.name)
+  counts <- merge(counts, sample, by = 1)
+}
+
+# Set the row names to the transcript IDs
+rownames(counts) <- counts$V1
+counts <- counts[-1]
+```
+
+We now have a single data frame with count values for all samples and transcripts. The names however are nondescriptive since they are just the `SRR` identifiers. Since we already did some annotation, we're going to rename the columns to something more meaningful:
+
+```{r rename_cols, eval=FALSE}
+# For each SRR id, find the corresponding column in counts and rename
+# using the name in the 'Condition' column
+for (i in 1:nrow(setup)) {
+  idx <- grep(setup$Run[i], names(counts))
+  names(counts)[idx] <- setup$Condition[i]
+}
+
+# Show first 6 columns and rows
+pander(counts[1:6, 1:6])
+```
+
+```{r, echo=FALSE}
+load("data/star_counts.RData")
+pander(counts[1:6, 1:6])
+```
+
+## References
diff --git a/chapters/data_analysis.Rmd b/chapters/data_analysis.Rmd
@@ -233,7 +233,7 @@ data(gene.idtype.list);
 pander(gene.idtype.list)
 ```
 
-```{r pathview_data_prep, eval=TRUE}
+```{r pathview_data_prep, eval=FALSE}
 ## Prepare data for visualization
 deseq.degs.logfc <- subset(deseq.results, padj < pval_threshold, select = log2FoldChange)
 
@@ -255,12 +255,12 @@ To also look at the *topology* of *how* genes interact with each other, we can u
 
 The code below uses data objects that were not always shown before, see the tables below for their contents.
 
-```{r spia_data_load, echo=FALSE}
+```{r spia_data_load, echo=FALSE, eval=FALSE}
 load("./data/annotated_ipsc_di_vs_tri.RData")
 deseq.results <- ipsc.tri.vs.di
 ```
 
-```{r spia_data_prep, message=FALSE, warning=FALSE, error=FALSE, eval=TRUE}
+```{r spia_data_prep, message=FALSE, warning=FALSE, error=FALSE, eval=FALSE}
 # Bioconductor package for "Signaling Pathway Impact Analysis"
 # (install with biocLite("SPIA") if it is not installed)
 library(SPIA)
@@ -276,7 +276,7 @@ all_genes <- deseq.results$EntrezID
 ```
 
 The input `sig_genes` named vector contents:
-```{r spia_contents, echo=FALSE, eval=TRUE}
+```{r spia_contents, echo=FALSE, eval=FALSE}
 idx = !is.na(names(sig_genes))
 .sig.genes <- head(data.frame(EntrezID=names(sig_genes[idx]),
                               "logFC"=sig_genes[idx], 
@@ -286,7 +286,7 @@ pander(.sig.genes, caption = "(subset) of DESeq2 DEGs with their Entrez ID and l
 ```
 
 And the `all_genes` vector or Entrez IDs input (note that this does not contain any expression information, just IDs):
-```{r, eval=TRUE}
+```{r, eval=FALSE}
 pander(all_genes[which(idx)[100:110]])
 ```
 
@@ -323,12 +323,14 @@ Done pathway 135 : Viral myocarditis..
 The resulting data frame is ordered by a p-value and contains the following information:
 
 ```{r spia_result_table, echo=FALSE, eval=TRUE}
+library(SPIA)
 load(file = "./data/spia.RData")
 pander(head(spia_result[,1:11], n=5), caption = "Top signaling pathways ")
 ```
 
 To get an overal picture we can create a plot where each pathway is a point which is placed according to the `spia` calculated over-representation p-value and pNDE (which is the probability to observe at least nDE genes on the pathway). In this case it is obvious that Alzheimer's and Parkinson's pathways with ID's 05010 and 05012 respectively are the most influenced by this experiment, according to this method.
 
 ```{r spiaResultPlotP, eval=TRUE, fig.cap="SPIA two-way evidence plot showing the most influenced pathways"}
+
 plotP(spia_result)
 ```
diff --git a/chapters/datasets.Rmd b/chapters/datasets.Rmd
@@ -1,4 +1,4 @@
-# Discovering Data Sets {#a3-annotation}
+# Discovering Data Sets {#datasets}
 
 ## Finding Public Data Sets of Interest {#finding-public-data}
 

diff --git a/chapters/deg.Rmd b/chapters/deg.Rmd
@@ -296,7 +296,7 @@ plotBCV(y)
 
 Or the log-fold changes for all genes, once we have the output of the `exactTest` function (output `et` is an `DGEExact` object) with the `plotSmear` function. The `abline` shows a log-FC threshold:
 
-```{r edgerResults, eval=TRUE, fig.cap="edgeR results; genes marked in red are DEGs"}
+```{r edgerResults, eval=FALSE, fig.cap="edgeR results; genes marked in red are DEGs"}
 deGenes <- decideTestsDGE(qlf, p=0.05)
 deGenes <- rownames(qlf)[as.logical(deGenes)]
 plotSmear(qlf, de.tags=deGenes)