---
title: "2.1.2 Genomics & Transcriptomics"
author: "Marcel Kempenaar"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output: bookdown::gitbook
fig_caption: yes
documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
github-repo: mkempenaar/gene_expression_analysis
description: "Course Syllabus for course 2.1.2"
---
# Introduction {#chapter-1}
Course 2.1.2 involves two separate projects performed on a single published data set. First, we perform a *Genomics* analysis similar to the case study from module 2.1.1. Then, we take the result of the mapping step, process this into read counts and perform a subsequent *Transcriptomics* analysis.
## 2.1.2 Genomics
For this Genomics analysis we are going to look at variants between the chosen data and a reference genome. An important aspect throughout the analysis is assessing quality, both of the input data set and of the output of certain steps. Once a set of variants has been determined, they can be annotated and finally linked to the biology underlying the original experiment, either reproducing the published findings or aiming for novel insights.
### Project Deliverables
* **Project Proposal**: A short written proposal of the project clearly describing the data set, the research question and the methods + tools to be used.
* **Genomics pipeline**: The genomics analysis pipeline is well documented in one or more *lab journals* and contains the correct tools based on the research question and data set.
* **Presentation**: The research results are presented in an attractive manner and at a level appropriate for the target audience.
### Data Processing
Similar to the 2.1.1 NGS & Mapping course, we go through the same steps to process the data (quality control, trimming, mapping, variant calling, etc.). This time, however, we are not using a workflow manager such as Galaxy but execute all tools on the *command line*. Two reasons for this are that the number and size of the input files are far larger than in the previously used data sets, and the Galaxy server cannot handle that many large data sets. Also, having proper terminal experience is very valuable.
::: rmdnote
**Note on writing the pipeline**: all steps need to be documented in RMarkdown documents. Each step must be introduced with a short description of the tool used and its version, followed by the command used to execute the tool; see the example below. As these are terminal commands, we use code chunks with the `bash` language. **All** code chunks **must** have the option `eval=FALSE` set. This prevents the code from being executed when knitting the document.
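For example, a single documented step could look as follows (a minimal sketch; the tool description, chunk name and paths are placeholders):

````markdown
FastQC (v0.12.1) is used to assess the quality of the raw sequencing reads.

```{bash fastqc_raw, eval=FALSE}
fastqc --threads 8 --outdir /students/2024-2025/Thema05/myteam/fqc/ \
    /students/2024-2025/Thema05/myteam/fastq/*.fastq.gz
```
````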
Furthermore, all commands must be run on the RStudio server provided at [https://bioinf.nl/rstudio](https://bioinf.nl/rstudio). This RStudio server is hosted on our `assemblix` server, which is capable of both accessing the data and running tools on large data sets. The workstations we provide in the classrooms are *not* capable of these tasks. Also, this server already has a number of tools pre-installed (FastQC, Trimmomatic, the BWA/STAR mappers, etc.). Lastly, running the commands from an RMarkdown document on the RStudio server forces you to document them, as opposed to simply typing them in an `ssh` session.
**Note on storing (intermediate) data**: **all** data relevant to this project **must** be stored in the `/students/2024-2025/Thema05/` directory. Within it, create a project directory for your team and create sub-directories as you go. Your home folder has a hard limit of 25GB and is therefore not sufficient for storing these project files. Use commands such as `chmod` and `chown` to allow your project partners to access the data too, as sketched below.
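A minimal sketch of setting up a shared team directory (the team name and sub-directories are hypothetical; check which group you share with your partners using `ls -l`):

```{bash share_project_dir, eval=FALSE}
# Create the team directory with sub-directories for each data type
mkdir -p /students/2024-2025/Thema05/myteam/{fastq,fastq-trimmed,mapping}
# Grant group members read/write access ('X' adds execute on directories only)
chmod -R g+rwX /students/2024-2025/Thema05/myteam
```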
**Note on processing large amounts of data**: building correct terminal commands to execute, for instance, a trimming step on a large number of files can be quite challenging. It is therefore **required** that test data is used for **all** steps (see the sketch after the list below). Reasons for this are:
* Spotting errors quickly instead of having to wait several hours to see an error message
* Inspecting resource usage (CPU, memory, disk space) to prevent overloading the system
- Some steps can be very memory intensive (over 1 TB on large data sets!) and can crash the system
* Creating a small output set that can be used as test input for the next step
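A minimal sketch of creating such a test set, assuming gzipped single-end FastQ files (all paths are hypothetical):

```{bash make_test_set, eval=FALSE}
# Keep the first 10,000 reads (each FastQ record spans 4 lines)
zcat /students/2024-2025/Thema05/myteam/fastq/sample1.fastq.gz | head -n 40000 | \
    gzip > /students/2024-2025/Thema05/myteam/fastq/sample1.test.fastq.gz
```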
**Note on the use of GATK**: The Genome Analysis Toolkit (GATK) is a widely used set of tools for variant calling. Many publications cite the use of GATK but unfortunately do not include all the details needed to reproduce the results. Due to the complexity of GATK we do not recommend its use, unless the publication is very clear on the exact workflow and settings used. Expect to spend a lot of time browsing manuals on the official [GATK website](https://gatk.broadinstitute.org/hc/en-us) before you can use it effectively.
:::
### Finding a Data Set
Finding a suitable data set can be challenging as there are a few requirements such a data set needs to meet:
* It needs to have both a DNA- and an RNA-Seq data set available.
* The data set needs to be published (the publication should be open access).
* For the RNA-Seq part, a minimum number of samples is required:
    - At least two sample *groups* with a minimum of 3 samples per group (*replicates*).
    - See appendix [chapter A1](#datasets) for examples of valid experimental setups; that chapter also contains tips on what to do with the data (mostly relevant for Transcriptomics).
Data is available from the NCBI [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) and the [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/).
The proper way to search for a data set suitable for both the Genomics and Transcriptomics projects is by creating a query that searches for both: [("expression profiling by high throughput sequencing"[DataSet Type]) AND "genome variation profiling by high throughput sequencing"[DataSet Type]](https://www.ncbi.nlm.nih.gov/gds/?term=(%22expression+profiling+by+high+throughput+sequencing%22%5BDataSet+Type%5D)+AND+%22genome+variation+profiling+by+high+throughput+sequencing%22%5BDataSet+Type%5D). Whether this selection actually contains anything of interest is up to you; if you cannot find something interesting, please use a different search strategy. One option is to search the GEO for RNA-Seq experiments where one of the groups is, preferably, a mutant.
Note that for the genomics part, not all available samples have to be processed. Unlike transcriptomics, there is often no case/control experimental setup but rather a single condition or group, possibly with a few replicates. If the data set does include many samples, it is recommended to select a subset of samples to process (do document the reasoning behind this selection).
## 2.1.2 Transcriptomics
The use of RNA-Sequencing techniques for measuring gene expression is relatively new and replaces *microarrays*, though in some cases microarrays are still used. Gene expression data gives valuable insight into the workings of cells under certain conditions. Especially when comparing, for instance, healthy and diseased samples, it can become clear which genes are causal for, or influenced by, a specific condition. Finding the genes of interest (genes showing differing expression across conditions, called Differentially Expressed Genes (DEGs)) is the goal of this project.
While there is no gold standard for analyzing RNA-Seq data sets, as there are many tools (all manufacturers of sequencing equipment also deliver software packages), we will use proven R libraries for processing, visualizing and analyzing publicly available data sets. This manual describes the steps often performed in a transcriptomics experiment; use it as a general guideline for processing the data accompanying the chosen published research. Note that this manual was originally written for processing *count* data and therefore does not describe the processing of NGS read files into count data, which is part of the *Genomics* analysis performed in this course. RNA-Seq counts describe the number of reads mapped to a gene, which corresponds to the relative number of transcripts (mRNA sequences) of that gene present in the cell at the time of sampling.
### Project Deliverables
* **Project Proposal**: A short written proposal of the project clearly describing the data set, the research question and the methods + libraries to be used.
* **Lab Journal**: The lab journal is readable, complete and aimed at the reproducibility of the research. Relevant visualisations are included and statistical methods are applied to support the biological interpretation of the results.
## *Lab Journal*
As you may know from previous projects, it is essential to keep a proper journal detailing every step you have taken during the analysis. This journal is to be kept in an R Markdown file showing which steps have been taken in the analysis of the data set. This file should be *knitted* into a single PDF once the project is completed, containing text detailing the steps and the decisions you've made, R code (always visible!) and the resulting output/images.
```{block, type='rmdtip'}
**Notes**:
As general advice: do not wait until the project is done to knit the whole document, as knitting is very prone to errors and fixing these in a large document is not easy. Give each code chunk the proper attributes, including at minimum a name; this helps to spot errors during knitting, as that process mentions which chunk is being processed. Note that chunk names *must* be unique. Also make proper use of chapters and sections and include an (optional) table of contents.
```
## Schedule
The Genomics and Transcriptomics parts both have their own schedule, as listed below.
### Genomics
An empty row means the previous topic or task will most likely take more than one session.
<style type="text/css">
.tg {border-collapse:collapse;border-color:#ccc;border-spacing:0;}
.tg td{background-color:#fff;border-color:#ccc;border-style:solid;border-width:1px;color:#333;
font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#f0f0f0;border-color:#ccc;border-style:solid;border-width:1px;color:#333;
font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr>
<th class="tg-baqh">Week</th>
<th class="tg-baqh">Lesson</th>
<th class="tg-baqh">Subject</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky" rowspan="3">1</td>
<td class="tg-c3ow">1</td>
<td class="tg-0pky">Finding a suitable experiment</td>
</tr>
<tr>
<td class="tg-c3ow">2</td>
<td class="tg-0pky">Statistics - recap and experimental setup</td>
</tr>
<tr>
<td class="tg-c3ow">3</td>
<td class="tg-0pky">Writing a Project Proposal</td>
</tr>
<tr>
<td class="tg-0pky" rowspan="3">2</td>
<td class="tg-c3ow">1</td>
<td class="tg-0pky">Presenting Project Proposal</td>
</tr>
<tr>
<td class="tg-c3ow">2</td>
<td class="tg-0pky">NGS Quality Control</td>
</tr>
<tr>
<td class="tg-c3ow">3</td>
<td class="tg-0pky"></td>
</tr>
<tr>
<td class="tg-0pky" rowspan="3">3</td>
<td class="tg-c3ow">1</td>
<td class="tg-0pky">Read Mapping and QC</td>
</tr>
<tr>
<td class="tg-c3ow">2</td>
<td class="tg-0pky"></td>
</tr>
<tr>
<td class="tg-c3ow">3</td>
<td class="tg-0pky">Variant Calling</td>
</tr>
<tr>
<td class="tg-0pky" rowspan="4">4</td>
<td class="tg-c3ow">1</td>
<td class="tg-0pky">Annotation</td>
</tr>
<tr>
<td class="tg-c3ow">2</td>
<td class="tg-0pky"></td>
</tr>
<tr>
<td class="tg-c3ow">3</td>
<td class="tg-0pky"></td>
</tr>
<tr>
<td class="tg-c3ow">4</td>
<td class="tg-0pky"></td>
</tr>
<tr>
<td class="tg-0pky" rowspan="2">5</td>
<td class="tg-c3ow">1</td>
<td class="tg-0pky">Visualisation</td>
</tr>
<tr>
<td class="tg-c3ow">2</td>
<td class="tg-0pky"></td>
</tr>
</tbody>
</table>
### Transcriptomics
<style type="text/css">
.tg {border-collapse:collapse;border-color:#ccc;border-spacing:0;}
.tg td{background-color:#fff;border-color:#ccc;border-style:solid;border-width:1px;color:#333;
font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#f0f0f0;border-color:#ccc;border-style:solid;border-width:1px;color:#333;
font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr>
<th class="tg-baqh">Week</th>
<th class="tg-baqh">Lesson</th>
<th class="tg-baqh">Subject</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0lax" rowspan="2">5</td>
<td class="tg-baqh">1</td>
<td class="tg-0lax">Writing a Project Proposal</td>
</tr>
<tr>
<td class="tg-baqh">2</td>
<td class="tg-0lax"></td>
</tr>
<tr>
<td class="tg-0lax" rowspan="4">6</td>
<td class="tg-baqh">1</td>
<td class="tg-0lax">Presenting Project Proposal</td>
</tr>
<tr>
<td class="tg-baqh">2</td>
<td class="tg-0lax">Read Mapping and Quantification</td>
</tr>
<tr>
<td class="tg-baqh">3</td>
<td class="tg-0lax"></td>
</tr>
<tr>
<td class="tg-baqh">4</td>
<td class="tg-0lax">Statistics (distributions, normalization, PCA)</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="5">7</td>
<td class="tg-baqh">1</td>
<td class="tg-0lax">Exploratory Data Analysis (<a href="#chapter-3">chapter 3</a>)</td>
</tr>
<tr>
<td class="tg-baqh">2</td>
<td class="tg-0lax"></td>
</tr>
<tr>
<td class="tg-baqh">3</td>
<td class="tg-0lax">Statistics (batch effects, regression, linear models)</td>
</tr>
<tr>
<td class="tg-baqh">4</td>
<td class="tg-0lax">Finding Differentially Expressed Genes (<a href="#chapter-4">chapter 4</a>)</td>
</tr>
<tr>
<td class="tg-baqh">5</td>
<td class="tg-0lax">Statistics (Enrichment analysis)</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="6">8</td>
<td class="tg-baqh">1</td>
<td class="tg-0lax">Data Analysis and Visualization (<a href="#chapter-5">chapter 5</a>)</td>
</tr>
<tr>
<td class="tg-baqh">2</td>
<td class="tg-0lax"></td>
</tr>
<tr>
<td class="tg-baqh">3</td>
<td class="tg-0lax"></td>
</tr>
<tr>
<td class="tg-baqh">4</td>
<td class="tg-0lax"></td>
</tr>
<tr>
<td class="tg-baqh">5</td>
<td class="tg-0lax"></td>
</tr>
<tr>
<td class="tg-baqh">6</td>
<td class="tg-0lax">Hand in Lab Journal</td>
</tr>
</tbody>
</table>
## Literature
The following (online) documents can be used throughout this course:
* Gene Set Enrichment Analysis – Theory. Available from: [https://www.pathwaycommons.org/guide/primers/data_analysis/gsea/](https://www.pathwaycommons.org/guide/primers/data_analysis/gsea/)
* Li D. Statistical Methods for RNA Sequencing Data Analysis. In: Husi H, editor. Computational Biology [Internet]. Brisbane (AU): Codon Publications; 2019 Nov 21. Chapter 6. Available from: [https://www.ncbi.nlm.nih.gov/books/NBK550334/](https://www.ncbi.nlm.nih.gov/books/NBK550334/)
* Yu, L., Fernandez, S. & Brock, G. Power analysis for RNA-Seq differential expression studies. BMC Bioinformatics 18, 234 (2017). [https://doi.org/10.1186/s12859-017-1648-2](https://doi.org/10.1186/s12859-017-1648-2)
* Sun, S., Hood, M., Scott, L., Peng, Q., Mukherjee, S., Tung, J., & Zhou, X. (2017). Differential expression analysis for RNAseq using Poisson mixed models. Nucleic Acids Research, 45(11), e106. [https://doi.org/10.1093/nar/gkx204](https://doi.org/10.1093/nar/gkx204)
# From Reads to Counts
Step one in any RNA-Seq data analysis is generating the gene or transcript counts for each sample. This overlaps partly with steps taken in a genomics workflow:
* Download SRA files and extract FastQ files
* Perform quality control on raw reads (FastQC)
* Perform trimming as needed (Trimmomatic)
* Perform quality control post trimming (FastQC)
* Perform mapping against reference genome (most often STAR)
* Generate gene/transcript counts (using STAR, HTSeq or featureCounts)
* *Downstream analysis* (chapter 4 of this document)
This chapter shows these steps for an example data set (`GSE149995`).
## Downloading and extracting FastQ files
Given the SRA identifier for this project, select all samples and download the accession list (SRX identifiers). Using these identifiers as input for the `prefetch` command (requires the NCBI SRA-Toolkit to be installed) downloads the SRA files, which we then convert to FastQ files with the `fasterq-dump` command.
```{bash SRA_download, eval=FALSE}
prefetch $(<GSE149995_SRA_ACC_RNA-Seq.csv) --output-directory \
/local-fs/staff/marcel/GSE149995/SRA
```
By default, `fasterq-dump` performs a `--split-3` operation; see [the SRA toolkit manual](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) for further details.
```{bash extract_fastq, eval=FALSE}
find /local-fs/staff/marcel/GSE149995/SRA/ -name "*.sra" | \
parallel 'fasterq-dump -O /local-fs/staff/marcel/GSE149995/fastq/ {}'
```
## (Optional) Annotation
The NCBI GEO page lists 9 samples with 3 conditions (`WT`, `potent` and `arf7`). However, this experiment includes 18 sequence files. By downloading the 'series matrix file' from the GEO (using the `GEOquery` library) and combining this with the run info downloadable from the SRA, we can list the mapping of run to sample.
**Note**: This is an optional step for now and is only useful when there are doubts about the group/sample setup.
```{r, echo=FALSE}
library(tidyr)
library(pander)
```
```{r sample_annotation, message=FALSE, warning=FALSE}
library(GEOquery)
gse <- getGEO(GEO = "GSE149995")
# Combine the condition with a sample number and replicate number
Condition = paste(rep(gse[[1]]@phenoData@data$`genotype/variation:ch1`, each = 2),
                  rep(1:3, each = 2),
                  paste0('r', rep(1:2, 9)),
                  sep = '_')
run_info <- read.csv(file = "data/GSE149995_Sra_RunInfo.csv")
setup <- cbind(run_info %>% dplyr::select(Run, Experiment), Condition)
pander(setup)
```
## Quality Control
Here we perform the FastQC -> Trimmomatic -> FastQC steps.
### FastQC
Running FastQC
```{r fastQC, eval=FALSE}
# The fastqc() function used here comes from the 'fastqcr' package
library(fastqcr)
fqdir <- "/local-fs/staff/marcel/GSE149995/fastq/"
fastqc(fq.dir = fqdir,
       qc.dir = "/local-fs/staff/marcel/GSE149995/fqc",
       fastqc.path = "/usr/bin/fastqc",
       threads = 32)
```
QC-report using `multiqc`
```{bash MultiQC, eval=FALSE}
multiqc \
--filename ~/Development/2.1.2-Transcriptomics/GSE149995_FQC_multiqc_report.html \
/local-fs/staff/marcel/GSE149995/fqc/
```
## Trimming
Proper trimming requires experimentation and research regarding the sequencing technology used. The following command uses Trimmomatic (release 0.39, available on [Github](https://github.com/usadellab/Trimmomatic/releases)) to trim all FastQ files. Note that `fasterq-dump` writes uncompressed `.fastq` files; compress them first (e.g. with `gzip`) if you use the `.fastq.gz` names below. Also note that this example is for *single-end* sequencing only; use the `TrimmomaticPE` command for paired-end data and the `TruSeq` FASTA file ending in `PE` instead.
```{bash Trimmomatic, eval=FALSE}
# Extract the SRR run accessions; assumes 'Run' is the first column of the CSV
tail -n +2 data/GSE149995_Sra_RunInfo.csv | cut -d',' -f1 | \
    parallel 'TrimmomaticSE -threads 4 ' \
    '/local-fs/staff/marcel/GSE149995/fastq/{}.fastq.gz ' \
    '/local-fs/staff/marcel/GSE149995/fastq-trimmed/{}.trimmed.fastq.gz ' \
    'ILLUMINACLIP:/homes/marcelk/Development/2.1.2-Transcriptomics/TruSeq3-SE.fa:2:30:10 ' \
    'MINLEN:40 ' \
    'SLIDINGWINDOW:4:20'
```
Re-running FastQC
```{r fastQC-trimmed, eval=FALSE}
fastqc(fq.dir = "/local-fs/staff/marcel/GSE149995/fastq-trimmed/",
qc.dir = "/local-fs/staff/marcel/GSE149995/fqc-trimmed",
fastqc.path = "/usr/bin/fastqc",
threads = 32)
```
```{bash multiqc-fastqc-trimmed, eval=FALSE}
multiqc --filename \
~/Development/2.1.2-Transcriptomics/GSE149995_FQC_Trimmed.html \
/local-fs/staff/marcel/GSE149995/fqc-trimmed/
```
## Mapping with STAR
### Building STAR index
Unlike `bwa`, STAR takes an extra annotation file (GFF or GTF format) describing the features on the genome. For this data set we need the FASTA of the *Arabidopsis thaliana* genome and its accompanying annotation file. The `--genomeDir` argument specifies the output location for the index. Note that `--sjdbOverhang` should be set to the read length minus one, and that `--genomeSAindexNbases` is scaled down here because the *A. thaliana* genome is relatively small.
```{bash star_index, eval=FALSE}
cd /local-fs/staff/marcel/GSE149995/reference
mkdir star; cd star
STAR \
--runThreadN 12 \
--runMode genomeGenerate \
--genomeDir /local-fs/staff/marcel/GSE149995/reference/star \
--genomeFastaFiles /local-fs/staff/marcel/GSE149995/reference/TAIR10_chr_all.fas \
--sjdbGTFfile /local-fs/staff/marcel/GSE149995/reference/TAIR10_GTF_genes.gtf \
--genomeSAindexNbases 12 \
--sjdbOverhang 49
```
### Read mapping and transcript counting
The following command performs mapping against the reference genome, saves the result as a sorted BAM file and directly performs a read count for all transcripts (NOTE: this requires the `--sjdbGTFfile` parameter during index building). For each input file, a `ReadsPerGene.out.tab` file is generated containing 4 columns:
* column 1: gene ID
* column 2: counts for unstranded RNA-seq
* column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
* column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
See an interesting discussion on [BioStars](https://www.biostars.org/p/218995/) regarding which column to use.
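A quick way to check is to compare the column totals for a single sample: for an unstranded library the totals of columns 3 and 4 are roughly equal (each about half of column 2), while for a stranded library one of them clearly dominates. A minimal sketch (the file name is hypothetical):

```{r check_strandedness, eval=FALSE}
# Skip the 4 summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous)
tab <- read.table("SRR0000000_star_ReadsPerGene.out.tab", sep = "\t", skip = 4)
# V2 = unstranded counts, V3 = forward-stranded, V4 = reverse-stranded
colSums(tab[, 2:4])
```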
When planning to use a different tool for calculating gene counts, you can remove the `quantMode` argument, which makes STAR generate only the sorted BAM files. Please read the manual of the tool of choice to see whether an indexed BAM file is required, as STAR does not create an index for its output BAM files.
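If an index is needed, it can be created afterwards with `samtools` (a minimal sketch, following the paths used in this example):

```{bash index_bam, eval=FALSE}
# Create a .bai index file next to each sorted BAM file
find /local-fs/staff/marcel/GSE149995/mapping/star/ -name "*.bam" | \
    parallel 'samtools index {}'
```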
```{bash star_mapping, eval=FALSE}
find /local-fs/staff/marcel/GSE149995/fastq-trimmed/ -name "*.fastq.gz" -exec basename {} \; | \
    sed 's/\.fastq\.gz$//' | parallel 'STAR --runThreadN 6 ' \
    '--genomeDir /local-fs/staff/marcel/GSE149995/reference/star ' \
    '--readFilesIn /local-fs/staff/marcel/GSE149995/fastq-trimmed/{}.fastq.gz ' \
    '--readFilesCommand zcat ' \
    '--outSAMtype BAM SortedByCoordinate ' \
    '--quantMode GeneCounts ' \
    '--outFileNamePrefix /local-fs/staff/marcel/GSE149995/mapping/star/{}_star_'
```
Generating MultiQC report for STAR mapping phase
```{bash multiqc-star-mapping, eval=FALSE}
~/miniconda3/bin/multiqc --filename \
~/Development/2.1.2-Transcriptomics/STAR-mapping.html \
/local-fs/staff/marcel/GSE149995/mapping/star/
```
### Combining STAR Count Files
STAR generates a `[sample]_star_ReadsPerGene.out.tab` file for each sample, containing the gene IDs and the count values. We can merge these files into a single data frame for further processing (see also [Appendix A1 in the manual](https://mkempenaar.github.io/gene_expression_analysis/a1-data_loading.html#a1-batch_data_loading)). This example takes the 2nd column for the count values; make sure that you know which column to use based on the description above. We merge the data (instead of simply pasting the columns together) because it is not guaranteed that every file lists the exact same transcripts in the same order.
Note: if you use the `featureCounts` tool, you will already get a single file containing the count values for all samples.
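For reference, a `featureCounts` run (part of the Subread package) could look as follows (a minimal sketch; paths follow the STAR example above):

```{bash featurecounts_example, eval=FALSE}
# Count reads per gene for all BAM files at once; writes a single count table
featureCounts -T 8 \
    -a /local-fs/staff/marcel/GSE149995/reference/TAIR10_GTF_genes.gtf \
    -o /local-fs/staff/marcel/GSE149995/mapping/gene_counts.txt \
    /local-fs/staff/marcel/GSE149995/mapping/star/*_star_Aligned.sortedByCoord.out.bam
```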
```{r merge_count_files_function, eval=FALSE}
file.names <- list.files('/local-fs/staff/marcel/GSE149995/mapping/star/',
pattern = '*_star_ReadsPerGene.out.tab')
## Function for reading in files
read_sample <- function(file.name) {
## Extract the sample name for naming the column (retaining the 'SRR....' part)
sample.name <- strsplit(file.name, ".", fixed = TRUE)[[1]][1]
sample <- read.table(file.name, header = FALSE, sep="\t",
row.names = NULL, skip = 4)
## Rename the count column
names(sample)[2] <- sample.name
## Return a subset containing the transcript ID and sample name columns
return(sample[c(1, 2)])
}
```
Now we read all count files and merge them into a single data frame.
```{r merge_count_files, eval=FALSE}
setwd('/local-fs/staff/marcel/GSE149995/mapping/star/')
## Read the FIRST sample
counts <- read_sample(file.names[1])
## Read the remaining files and merge the contents
for (file.name in file.names[2:length(file.names)]) {
sample <- read_sample(file.name)
counts <- merge(counts, sample, by = 1)
}
# Set the row names to the transcript IDs
rownames(counts) <- counts$V1
counts <- counts[-1]
```
We now have a single data frame with count values for all samples and transcripts. The column names, however, are nondescriptive, as they are just the `SRR` identifiers. Since we already did some annotation, we rename the columns to something more meaningful:
```{r rename_cols, eval=FALSE}
# For each SRR id, find the corresponding column in counts and rename
# using the name in the 'Condition' column
for (i in 1:nrow(setup)) {
idx <- grep(setup$Run[i], names(counts))
names(counts)[idx] <- setup$Condition[i]
}
# Show first 6 columns and rows
pander(counts[1:6, 1:6])
```
```{r, echo=FALSE}
load("data/star_counts.RData")
pander(counts[1:6, 1:6])
```