# rnateam.rnaseq.yaml

name: Galaxy RNA-workbench RNA-Seq Introduction Exercise
description: The Galaxy RNA-workbench RNA-Seq tour
title_default: "Galaxy RNA-workbench RNA-Seq Exercise"
tags:
  - "RNA"

steps:

    - title: "<b>Welcome to the Galaxy RNA-workbench RNA-Seq example tour!</b>"
      content: "This tour will walk you through an exercise that shows you how to use the Galaxy RNA-workbench. <br><br>
                Read and follow the instructions before clicking <b>'Next'</b>.<br><br>
                Click <b>'Prev'</b> in case you missed any step."
      backdrop: true

    - title: "<b>Scenario</b>"
      content: "In the study of <a href=\"http://genome.cshlp.org/content/21/2/193.long\" target=\"_blank\">Brooks et al. 2011</a>, the Pasilla (PS) gene, the <i>Drosophila</i> homologue of the human splicing regulator proteins Nova-1 and Nova-2, was depleted in <i>Drosophila melanogaster</i> by RNAi. The authors wanted to identify exons that are regulated by the Pasilla gene using RNA sequencing data.<br>
      Total RNA was isolated and used for preparing either single-end or paired-end RNA-seq libraries for treated (PS depleted) samples and untreated samples. These libraries were sequenced to obtain a collection of RNA sequencing reads for each sample. The effects of Pasilla gene depletion on splicing events can then be analyzed by comparing the RNA sequencing data of the treated (PS depleted) and the untreated samples.<br>
      The genome of <i>Drosophila melanogaster</i> is known and assembled. It can be used as a reference genome to ease this analysis. In a reference-based RNA-seq data analysis, the reads are aligned (or mapped) against a reference genome, <i>Drosophila melanogaster</i> here, to significantly improve the ability to reconstruct transcripts and then identify differences in expression between several conditions."
      backdrop: true

    - title: "<b>Goal</b>"
      content: "The goal of this exercise is to <b>become familiar with basic RNA-Seq analysis</b>."
      backdrop: true

    - title: "<b>Disclaimer</b>"
      content: "We are <b>not affiliated</b> with the authors of the paper and we don't make a statement about the relevance or quality of the paper. It is <b>just a fitting example</b> and nothing else.<br>"
      backdrop: true

    - title: "<b>Overview</b>"
      content: "Together we will go through the following:<br>
                <b>Pretreatments, Mapping and Analysis of differential expression</b>
                 <dir>
                   <li>Step 1: Create and name a new history</li>
                   <li>Step 2: Download data</li>
                   <li>Step 3: Quality control</li>
                   <li>Step 4: Mapping</li>
                   <li>Step 5: Inspection of TopHat results</li>
                   <li>Step 6: IGV</li>
                   <li>Step 7: Analysis of differential gene expression</li>
                   <li>Step 8: Count the number of reads per annotated gene</li>
                   <li>Step 9: Analysis of DGE</li>
                   <li>Step 10: Inspect DGE</li>
                   <li>Step 11: Visualize DGE</li>
                   <li>Step 12: Analysis of the functional enrichment among differentially expressed genes</li>
                   <li>Step 13: Inference of the differential exon usage</li>
                   <li>Step 14: Annotation of the result tables with gene information</li>
                   <li>Step 15: Make a workflow</li>
                   <li>Step 16: Share workflow</li>
                   <li>Conclusion</li>
                 </dir>"
      backdrop: true

    - title: "<b>Step 1: Create and name a new history</b>"
      element: "#current-history-panel > div.controls > div.title > div"
      intro: "Change the name of your history."
      position: "bottom"

    - title: "<b>Step 2: Download data</b>"
      content: "We will now proceed to download data to Galaxy. The original data is available at NCBI Gene Expression Omnibus (GEO) under accession number <a href=\"http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18508\" target=\"_blank\">GSE18508</a>. We will look at the first 7 samples: <br>
      - 3 treated samples with Pasilla (PS) gene depletion: <a href=\"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461179\" target=\"_blank\">GSM461179</a>, <a href=\"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461180\" target=\"_blank\">GSM461180</a>, <a href=\"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461181\" target=\"_blank\">GSM461181</a><br>
        - 4 untreated samples: <a href=\"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461176\" target=\"_blank\">GSM461176</a>, <a href=\"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461177\" target=\"_blank\">GSM461177</a>, <a href=\"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461178\" target=\"_blank\">GSM461178</a>, <a href=\"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461182\" target=\"_blank\">GSM461182</a><br>
          Each sample constitutes a separate biological replicate of the corresponding condition (treated or untreated). Moreover, two of the treated and two of the untreated samples are from a paired-end sequencing assay, while the remaining samples are from a single-end sequencing experiment.<br>
          We have extracted sequences from the Sequence Read Archive (SRA) files to build FASTQ files."
      backdrop: true

    - title: "<b>Step 2: Download fastq files with RNA sequences</b>"
      element: ".upload-button"
      intro: "Use the upload button to upload the file to Galaxy.<br><br>
              Click <b>'Next'</b> and the tour will take you to the Upload screen"
      position: "right"
      postclick:
        - ".upload-button"

    - title: "<b>Step 2: Download fastq files with RNA sequences</b>"
      element: ".upload-text-content:first"
      intro: "We now paste the links to a fastq dataset pair into the upload-box. Click next to do so."
      preclick:
        - ".upload-button"
        - "button#btn-new"
      textinsert: |
        https://zenodo.org/record/61771/files/GSM461177_untreat_paired_subset_1.fastq
        https://zenodo.org/record/61771/files/GSM461177_untreat_paired_subset_2.fastq

    - title: "<b>Step 2: Download fastq files with RNA sequences</b>"
      element: "button#btn-start"
      intro: "Now that you've selected the file, select <b>'dm3'</b> as the genome
              and fastqsanger as file format.<br><br>
              Click <b>'Next'</b> and the tour will <b>'Start'</b> the upload.<br>
              Galaxy will automatically unpack the file."
      position: "bottom"
      postclick:
        - "button#btn-start"
        - "button#btn-close"

    - title: "<b>Step 2: Download fastq files with RNA sequences</b>"
      element: "#right"
      intro: "This is your history!<br><br>
              All <b>analysis steps will be recorded</b> and can be redone at any time.<br><br>
              You should be able to see your uploaded file here in a few moments."
      position: "left"

    - title: "<b>Step 3: Quality Control</b>"
      intro: "These files contain the first 100,000 paired-end reads of one sample. The sequences are raw sequences from the sequencing machine, without any pretreatments. Their quality needs to be checked.<br>
      For quality control, we use tools similar to those described in the <a href=\"https://www.github.com/bgruening/training-material/NGS-QC\">NGS-QC tutorial</a>."
      position: "right"

    - title: "<b>Step 3: Quality Control</b>"
      element: '#tool-search-query'
      intro: "We now want to examine the quality of our RNA-Seq reads using <b>FastQC</b>.<br>
      This Galaxy instance has FastQC already integrated, so we don't need to install it.<br>
      <b>Note:</b> You can use 'tool search' to locate tools. Tools may take a couple of moments to load, please bear with us."
      position: "right"

    - title: "<b>Step 3: Quality Control</b>"
      element: '#tool-search-query'
      intro: "You can now type and select <b>'FastQC'</b>.<br><br>
              <b>Follow this set of instructions once the tool has loaded:</b><br>
              <dir>
              <li>Select one of the samples from the paired dataset.</li>
              <li>Keep the rest of the options at their default values.</li>
              <li>Click button 'Execute' and wait for the tool to finish.</li>
              <li>Repeat with the other dataset.</li>
              </dir>"
      position: "right"

    - title: "<b>Step 3: Quality Control</b>"
      element: '#current-history-panel'
      intro: "To inspect the results of the FastQC run:<br><br>
              <dir>
                <li>Click on the <b>eye icon</b> of the latest dataset and have a look at the output. What do you see? What is the read length? Is there anything you notice when you compare both datasets?</li>
              </dir>"
      position: "left"

    - title: "<b>Step 3: Quality Control</b>"
      element: '#current-history-panel'
      intro: "You should notice the following:<br><br>
              <dir>
              <li>The read length is 37 bp</li>
              <li>The report for GSM461177_untreat_paired_subset_1 is quite good compared to the one for GSM461177_untreat_paired_subset_2. For the latter, the per base sequence quality is bad around the 25th bp (same for the per base N content), because the quality in the 2nd tile is bad (maybe because of some event during sequencing). We need to process these samples according to this quality control and keep in mind the paired-end information.</li>
              </dir>"
      position: "left"

    - title: "<b>Step 3: Quality Control</b>"
      element: '#current-history-panel'
      intro: "We now process the samples according to the quality of the sequences by running <b>Trim Galore</b> on the paired-end datasets."
      position: "left"

    - title: "<b>Step 3: Quality Control</b>"
      element: '#tool-search-query'
      intro: "You can now type and select <b>'TrimGalore'</b>.<br><br>
              <b>Follow this set of instructions once the tool has loaded:</b><br>
              <ol>
                <li>Select the samples from the paired dataset and set the sequencing type to paired-end.</li>
                <li>Keep the rest of the options at their default values.</li>
                <li>Click button 'Execute' and wait for the tool to finish.</li>
              </ol>"
      position: "right"

    - title: "<b>Step 3: Quality Control</b>"
      element: '#tool-search-query'
      intro: "You can now type and select <b>'FastQC'</b>.<br><br>
              <b>Follow this set of instructions once the tool has loaded:</b><br>
              <ol>
                <li>Select one of the TrimGalore processed samples.</li>
                <li>Keep the rest of the options at their default values.</li>
                <li>Click button 'Execute' and wait for the tool to finish.</li>
                <li>Repeat with the other dataset.</li>
              </ol>"
      position: "right"

    - title: "<b>Step 3: Quality Control</b>"
      element: '#current-history-panel'
      intro: "To inspect the results of the second FastQC run:<br><br>
              <dir>
                <li>Click on the <b>eye icon</b> of the latest dataset and have a look at the output. What do you see? What is the read length now? Is there anything you notice when you compare both datasets with their state before the quality processing?</li>
              </dir>"
      position: "left"

    - title: "<b>Step 4: Mapping</b>"
      intro: "Now that we have quality controlled our samples, we want to continue our analysis.<br>
      As the genome of <i>Drosophila melanogaster</i> is known and assembled, we can use this information and map the sequences to this genome to identify the effects of Pasilla gene depletion on splicing events.<br>
      To make sense of the reads, their positions within the <i>Drosophila melanogaster</i> genome must be determined. This process is known as aligning or 'mapping' the reads to the reference genome."
      position: "center"

    - title: "<b>Step 4: Mapping</b>"
      intro: "In a eukaryotic transcriptome, most reads originate from processed mRNAs lacking introns, so they cannot simply be mapped back to the genome as we normally do for DNA data. Instead, the reads must be separated into two categories:<br>
      <dir>
      <li>Reads that map entirely within exons</li>
      <li>Reads that cannot be mapped within an exon across their entire length because they span two or more exons</li>
      </dir>"
      position: "center"

    - title: "<b>Step 4: Mapping</b>"
      intro: "Spliced mappers have been developed to efficiently map transcript-derived reads against genomes.<br>
      <a href=\"https://ccb.jhu.edu/software/tophat/index.shtml\">TopHat</a> was one of the first tools designed specifically to address this problem. It proceeds in three steps:<br>
      <dir>
      <li>1. Identification of potential exons using reads that do map to the genome</li>
      <li>2. Generation of possible splices between neighboring exons</li>
      <li>3. Comparison of reads that did not initially map to the genome against these <i>in silico</i> created junctions</li>
      </dir>"
      position: "center"

    - title: "<b>Step 4: Mapping</b>"
      intro: "TopHat needs to know two important parameters about the sequencing library:<br>
      <dir>
      <li>The library type</li>
      <li>The mean inner distance between the mate pairs for paired end data</li>
      </dir><br>
      This information should usually come with your FASTQ files; ask your sequencing facility! If not, try to find it on the site where you downloaded the data or in the corresponding publication.<br>
      Another option is to estimate these parameters with a <i>preliminary mapping</i> of a <i>downsampled</i> file and some analysis programs. Afterward, the actual mapping can be redone on the original files with the optimized parameters.<br>
      To help find this information and later annotate the RNA sequences, we can take advantage of already known reference gene annotations."
      position: "center"

    - title: "<b>Step 4: Mapping</b>"
      element: ".upload-button"
      intro: "Use the upload button to upload the dm3 reference genome annotation file to Galaxy.<br><br>
              Click <b>'Next'</b> and the tour will take you to the Upload screen"
      position: "right"
      postclick:
        - ".upload-button"

    - title: "<b>Step 4: Mapping</b>"
      element: ".upload-text-content:first"
      intro: "We now paste the link to the Ensembl gene annotation GTF file into the upload-box. Click next to do so."
      preclick:
        - ".upload-button"
        - "button#btn-new"
      textinsert: |
        https://zenodo.org/record/61771/files/Drosophila_melanogaster.BDGP5.78.gtf

    - title: "<b>Step 4: Mapping</b>"
      element: "button#btn-start"
      intro: "Now that you've selected the file, select <b>'dm3'</b> as the genome
              and gtf as file format.<br><br>
              Click <b>'Next'</b> and the tour will <b>'Start'</b> the upload.<br>
              Galaxy will automatically unpack the file."
      position: "bottom"
      postclick:
        - "button#btn-start"
        - "button#btn-close"

    - title: "<b>Step 4: Mapping</b>"
      element: "#tool-search-query"
      intro: "Now we will use TopHat to map our reads to the dm3 genome. You can now type and select <b>'TopHat'</b> and use the full parameter set to get the best mapping results.<br><br>
              <b>Follow this set of instructions once the tool has loaded:</b><br>
              <ol>
                <li>Paired-end instead of single-end</li>
                <li>TrimGalore output as input in correct order (forward and reverse reads)</li>
                <li>Unstranded</li>
                <li>'dm3' as reference genome</li>
                <li>Mean inner distance to 112</li>
                <li>Library type to</li>
                <li>Minimum length of read segments to 18</li>
                <li>'Yes' to use own junction data</li>
                <li>'Yes' to use Gene Annotation Model</li>
                <li><code>Drosophila_melanogaster.BDGP5.78.gtf</code> as Gene Model Annotations (to enable transcriptome alignment)</li>
                <li>'No (--coverage-search)' to use coverage-based search for junctions, as it needs a lot of time. But consider this option for real-world data.<br>
                The TopHat algorithm splits reads into segments to map the reads across splice junctions. Coverage-based search for junctions increases the sensitivity.</li>
                <li>Keep the rest of the options at their default values.</li>
                <li>Click button 'Execute' and wait for the tool to finish.</li>
              </ol>"
      position: "right"

    - title: "<b>Step 5: Inspect TopHat Output</b>"
      element: '#current-history-panel'
      intro: "<b>To inspect the output of <b>TopHat</b>:</b><br>
              <dir>
                <li>Click on the <b>'eye icon'</b> of the corresponding dataset</li>
                <li>Inspect the 'align summary' file</li>
                <li>How many forward and reverse reads were mapped?</li>
                <li>What is the 'overall read mapping rate' and the 'concordant pair alignment rate'?</li>
                <li>Why do some reads have multiple alignments?</li>
              </dir>"
      position: "left"

    - title: "<b>Step 5: Inspect TopHat Output</b>"
      intro: "<b>You should see the following:</b><br>
              <dir>
                <li>90.7% of the forward reads were mapped and 85.8% of the reverse reads</li>
                <li>The 'overall read mapping rate' is the rate of mapping when we take into account all reads (forward and reverse reads). Here it is 88.3%.<br>
                The 'concordant pair alignment rate' is (number of aligned pairs - number of discordant alignments)/(number of paired reads). Here the value is 80.3%, quite a good value. Maximizing this value is the goal.</li>
                <li>The reads are short, and because of pseudogenes and other valid genome duplications, a read can map to multiple locations</li>
              </dir>"
      position: "center"

    - title: "<b>Step 5: Inspect TopHat Output</b>"
      intro: "<b>TopHat generates a BAM file with the mapped reads and three BED files containing splice junctions, insertions and deletions.<br>
        The datasets we used were a subset of the original data. They are therefore too small to give you a good impression of what real data looks like. So we have run TopHat for you on the real datasets. We extracted only the reads mapped to chromosome 4 of Drosophila, which we will now inspect using 'IGV'</b>"
      position: "center"

    - title: "<b>Step 6: IGV</b>"
      element: "#current-history-panel > div.controls > div.title > div"
      intro: "Create and name a new history"
      position: "bottom"

    - title: "<b>Step 6: IGV</b>"
      element: ".upload-text-content:first"
      intro: "We now paste the links to the new dataset into the upload-box. Click next to do so."
      preclick:
        - ".upload-button"
        - "button#btn-new"
      textinsert: |
        https://zenodo.org/record/61771/files/GSM461177_untreat_paired_chr4.bam
        https://zenodo.org/record/61771/files/GSM461177_untreat_paired_deletions_chr4.bed
        https://zenodo.org/record/61771/files/GSM461177_untreat_paired_insertions_chr4.bed
        https://zenodo.org/record/61771/files/GSM461177_untreat_paired_junctions_chr4.bed

    - title: "<b>Step 6: IGV</b>"
      element: "button#btn-start"
      intro: "Now that you've selected the file, select <b>'dm3'</b> as the genome
              and the corresponding file formats.<br><br>
              Click <b>'Next'</b> and the tour will <b>'Start'</b> the upload.<br>
              Galaxy will automatically unpack the files."
      position: "bottom"
      postclick:
        - "button#btn-start"
        - "button#btn-close"

    - title: "<b>Step 6: IGV</b>"
      element: '#current-history-panel'
      intro: "Visualize this BAM file and the three BED files, particularly the region on chromosome 4 between 560 kb and 600 kb (<code>chr4:560,000-600,000</code>). Click on the 'IGV' symbol of the BAM dataset you just uploaded to Galaxy to start IGV.<br>
      <dir>
      <li>Open the dataset and click on 'display with IGV' and 'web current'</li>
      <li>Open the file with a Java plugin (e.g., IcedTea)</li>
      <li>Go to View, then Preferences, then Alignments, and set the visibility range to &ge;50 kb</li>
      <li>To inspect the region on chr4 between 560 kb and 600 kb, copy <code>chr4:560000-600000</code> into the locus window and click 'Go'</li>
      <li>Now import the BED output into IGV: open each BED dataset and click on 'display with IGV' and 'local'</li>
      <li>Inspect the results using a Sashimi plot (right-click on the BAM file and select Sashimi Plot from the context menu)</li>
      </dir>"
      position: "center"

    - title: "<b>Step 6: IGV</b>"
      intro: "Try to answer the following questions:<br>
      Which information does the <code>GSM461177_untreat_paired_junctions_chr4.bed</code> BED file contain?<br>
      How is this information represented in the BED file? And in IGV?<br>
      Where is the 'JUNC00013368' junction situated? What is its score?<br>
      How many reads span the 'JUNC00013368' junction, visible when we zoom in on <code>chr4:568,476-571,814</code>? Can you relate that to the score?<br>
      And how many reads span the 'JUNC00013369' junction?"
      backdrop: true

    - title: "<b>Step 6: IGV</b>"
      intro: "Answers<br>
      <dir>
      <li>The <code>GSM461177_untreat_paired_junctions_chr4.bed</code> BED file contains the splicing events, <i>i.e.</i> positions where at least one read splits across two exons in the alignment track</li>
      <li>The BED file is a tabular file with the columns Chrom, Start, End, Name, Score, Strand, ThickStart, ThickEnd, ItemRGB, BlockCount, BlockSizes, BlockStarts. In IGV, the junctions are represented by an arc from the beginning to the end of the junction. The color of the arc represents the strand on which the junction is found. The height of the arc, and its thickness, are proportional to the depth of read coverage.</li>
      <li>The 'JUNC00013368' junction starts at 568,736 and ends at 569,905. It has a score of 6.</li>
      <li>6 reads split across 'JUNC00013368', exactly the score</li>
      <li>8 reads split across 'JUNC00013369'. 3 reads also map within the region spanned by the junction: these reads are part of an exon and may be involved in a different splicing event.</li>
      </dir>"

    - title: "<b>Step 6: IGV</b>"
      intro: "Sashimi Plot<br>
      In the IGV window, right-click on the BAM file and select <b>Sashimi Plot</b> from the context menu.<br>
      <dir>
      <li>What does the bar graph represent? And the numbered lines?</li>
      <li>What do the numbers mean?</li>
      <li>What is the name of the junction across which 10 reads split? What is its position on the genome?</li>
      </dir>"

    - title: "<b>Step 6: IGV</b>"
      intro: "Sashimi Plot<br>
      <dir>
      <li>The coverage for each alignment track is plotted as a bar graph. Arcs represent the splice junctions connecting exons.</li>
      <li>Arcs display the number of reads split across the junction (junction depth). </li>
      <li>JUNC00013370 starts at 574338 and ends at 578091.</li>
      </dir>"

    - title: "<b>Step 7: Analysis of the differential gene expression</b>"
      intro: "To identify exons that are regulated by the Pasilla gene, we need to identify genes and exons which are differentially expressed between samples with PS gene depletion and control samples.<br>
      To compare the expression of single genes between different conditions (e.g. with or without PS depletion), an essential first step is to quantify the number of reads per gene. <a href=\"http://www-huber.embl.de/users/anders/HTSeq/doc/count.html\">HTSeq-count</a> is one of the most popular tools for gene expression quantification.<br>
      To quantify the number of reads mapped to a gene, an annotation of the genomic features is needed. We already uploaded the <a href=\"https://zenodo.org/record/61771/files/Drosophila_melanogaster.BDGP5.78.gtf\">Drosophila_melanogaster.BDGP5.78.gtf</a> file with the Ensembl gene annotation for <i>Drosophila melanogaster</i> to Galaxy."
      position: "center"

    - title: "<b>Step 8: Count the number of reads per annotated gene</b>"
      content: "In principle, the counting of reads overlapping with genomic features is a fairly simple task, but there are some details that need to be decided. HTSeq-count offers 3 modes ('union', 'intersection_strict' and 'intersection_nonempty') to handle reads mapping to multiple locations, reads overlapping introns, or reads that overlap more than one genomic feature.<br>
      The recommended mode is 'union', which counts overlaps even if a read only shares parts of its sequence with a genomic feature and disregards reads that overlap more than one feature."

    - title: "<b>Step 8: Count the number of reads per annotated gene</b>"
      element: '#current-history-panel'
      intro: "Copy the <code>Drosophila_melanogaster.BDGP5.78.gtf</code> file from the first history:<br>
      Click on 'View all histories' in the top right.<br>
      Drag and drop the file you want to copy to your new history.<br>
      Click on 'Done' in the top left."

    - title: "<b>Step 8: Count the number of reads per annotated gene</b>"
      element: "#tool-search-query"
      intro: "You can now type and select <b>'HTSeq-count'</b>.<br><br>
              <b>Follow this set of instructions once the tool has loaded:</b><br>
              <ol>
                <li>Use the sorted BAM file uploaded before as input.</li>
                <li><code>Drosophila_melanogaster.BDGP5.78.gtf</code> as 'GFF file'</li>
                <li>The 'union' mode</li>
                <li>A 'Minimum alignment quality' of 10</li>
                <li>Click button 'Execute' and wait for the tool to finish.</li>
              </ol>"
      position: "right"

    - title: "<b>Step 8: Count the number of reads per annotated gene</b>"
      intro: "Which feature has the most reads mapped on it?<br>
      To display the most often found feature, we first need to sort the output file by the number of reads found for each feature. We do that with the Sort tool, sorting on the second column in descending order. This shows us that FBgn0017545 is the feature with the most reads mapped to it, with 4,030 reads."

    - title: "<b>Step 9: Analysis of DGE</b>"
      intro: "In the previous section, we counted only reads that mapped to chromosome 4 for only one sample. To be able to identify differential gene expression induced by PS depletion, all datasets (3 treated and 4 untreated) must be analyzed with the same procedure.<br>
      You can export a workflow from the previous steps and rerun it on the 7 samples whose raw sequences are available on <a href=\"http://dx.doi.org/10.5281/zenodo.61771\" target=\"_blank\">Zenodo</a>. To save time, we have run the previous steps for you and obtained 7 count files, also available on <a href=\"http://dx.doi.org/10.5281/zenodo.61771\" target=\"_blank\">Zenodo</a>.<br>
      These files contain, for each gene, the number of reads mapped to it. We could compare the files directly to obtain the differential gene expression, but the number of sequenced reads mapped to a gene depends on:<br>
      <dir>
      <li> Its own expression level</li>
      <li> Its length</li>
      <li> The sequencing depth</li>
      <li> The expression of all other genes within the sample</li>
      </dir>"

    - title: "<b>Step 9: Analysis of DGE</b>"
      content: "For both within-sample and between-sample comparisons, the counts need to be normalized. We can then run Differential Gene Expression (DGE) analysis, which has two basic tasks:<br>
        <dir>
        <li> Estimate the biological variance using the replicates for each condition</li>
        <li> Estimate the significance of expression differences between any two conditions</li>
        </dir>
        This expression analysis is estimated from read counts, and attempts are made to correct for variability in measurements using replicates, which are <b>absolutely essential</b> for accurate results. For your own analysis, we advise you to use at least 3, or better 5, biological replicates."


    - title: "<b>Step 9: Analysis of DGE</b>"
      content: "
      <a href=\"https://bioconductor.org/packages/release/bioc/html/DESeq2.html\">DESeq2</a> is a great tool for DGE analysis. It takes read counts produced by <b>HTSeq-count</b> and applies size factor normalization:
      <dir>
      <li> Computation for each gene of the geometric mean of read counts across all samples</li>
      <li> Division of every gene count by the geometric mean</li>
      <li> Use of the median of these ratios as the sample's size factor for normalization</li>
      </dir>
      Multiple factors can then be incorporated in the analysis. In our example, we have samples with two varying factors:
        <dir>
        <li> Treatment (either treated or untreated)</li>
        <li> Sequencing type (paired-end or single-end)</li>
        </dir>
        Here treatment is the primary factor that we are interested in. The sequencing type is further information that we know about the data and that might affect the analysis. This multi-factor analysis allows us to assess the effect of the treatment while also taking the sequencing type into account."

    - title: "<b>Step 9: Analysis of DGE</b>"
      element: "#current-history-panel > div.controls > div.title > div"
      intro: "Create and name a new history"
      position: "bottom"

    - title: "<b>Step 9: Analysis of DGE</b>"
      element: ".upload-button"
      intro: "Use the upload button to upload the file to Galaxy.<br><br>
              Click <b>'Next'</b> and the tour will take you to the Upload screen"
      position: "right"
      postclick:
        - ".upload-button"

    - title: "<b>Step 9: Analysis of DGE</b>"
      element: ".upload-text-content:first"
      intro: "We now paste the links to the new dataset into the upload-box. Click next to do so."
      preclick:
        - ".upload-button"
        - "button#btn-new"
      textinsert: |
        https://zenodo.org/record/61771/files/GSM461176_untreat_single.counts
        https://zenodo.org/record/61771/files/GSM461177_untreat_paired.counts
        https://zenodo.org/record/61771/files/GSM461178_untreat_paired.counts
        https://zenodo.org/record/61771/files/GSM461179_treat_single.counts
        https://zenodo.org/record/61771/files/GSM461180_treat_paired.counts
        https://zenodo.org/record/61771/files/GSM461181_treat_paired.counts
        https://zenodo.org/record/61771/files/GSM461182_untreat_single.counts

    - title: "<b>Step 9: Analysis of DGE</b>"
      element: "button#btn-start"
      intro: "Now that you've selected the file, select <b>'dm3'</b> as the genome.<br><br>
              Click <b>'Next'</b> and the tour will <b>'Start'</b> the upload.<br>
              Galaxy will automatically unpack the files."
      position: "bottom"
      postclick:
        - "button#btn-start"
        - "button#btn-close"

    - title: "<b>Step 9: Analysis of DGE</b>"
      element: "#tool-search-query"
      intro: "You can now type and select <b>'DESeq2'</b>.<br><br>
              <b>Follow this set of instructions once the tool has loaded:</b><br>
              <ol>
                <li>'Treatment' as first factor with 'treated' and 'untreated' as levels, and selection of the count files corresponding to both levels</li>
                <li>You can select several files by keeping the CTRL (or COMMAND) key pressed and clicking on the files of interest</li>
                <li>'Sequencing' as second factor with 'PE' and 'SE' as levels, and selection of the count files corresponding to both levels</li>
                <li>Keep the rest of the options at their default values.</li>
                <li>Click button 'Execute' and wait for the tool to finish.</li>
              </ol>"
      position: "right"

    - title: "Step 10: Inspect DGE"
      element: '#current-history-panel'
      intro: "<b>To inspect the output of <b>DESeq2</b>:</b><br>
              <dir>
              <li>Click on the <b>'eye icon'</b> of the corresponding dataset</li>
              <li>First inspect the tabular file. Its first columns are: gene identifiers; mean normalized counts, averaged over all samples from both conditions; and the logarithm (to base 2) of the fold change.</li>
              </dir>"
      position: "left"

    - title: "Step 10: Inspect DGE"
      content: "The log2 fold changes are based on primary factor level 1 vs. factor level 2, so the order of the factor levels matters. For example, for the factor 'Treatment', DESeq2 computes fold changes of 'treated' samples against 'untreated', i.e. the values correspond to up- or downregulation of genes in treated samples.<br>
      The remaining columns are:<br>
      <dir>
      <li>Standard error estimate for the log2 fold change estimate</li>
      <li><a href=\"https://en.wikipedia.org/wiki/Wald_test\">Wald</a> statistic</li>
      <li><i>p</i>-value for the statistical significance of this change</li>
      <li><i>p</i>-value adjusted for multiple testing with the Benjamini-Hochberg procedure, which controls the false discovery rate (<a href=\"https://en.wikipedia.org/wiki/False_discovery_rate\">FDR</a>)</li>
      </dir>"
      position: "center"

    - title: "Step 10: Inspect DGE"
      content: "Run <b>Filter</b> to extract genes with a significant change in gene expression (adjusted <i>p</i>-value equal to or below 0.05) between treated and untreated samples.<br>
      <b>Find out: How many genes have a significant change in gene expression between these conditions?</b>"
      backdrop: true

    - title: "Step 10: Inspect DGE"
      content: "Run <b>Filter</b> to extract genes with a significant change in gene expression (adjusted <i>p</i>-value equal to or below 0.05) between treated and untreated samples.<br>
      <b>Find out: How many genes have a significant change in gene expression between these conditions?</b><br>
      Filter the DESeq2 result file for all genes with an adjusted <i>p</i>-value of 0.05 or below (Filter tool: condition c7&lt;=0.05). Please note that the output is already sorted by adjusted <i>p</i>-value.<br>
      We get 751 genes (5.05%) with a significant change in gene expression between treated and untreated samples."

    - title: "Step 10: Inspect DGE"
      content: "The file with the independently filtered results can be used for downstream analysis: it excludes genes with only a few read counts, which would not be considered significantly differentially expressed anyway."

    - title: "Step 10: Inspect DGE"
      element: '#current-history-panel'
      intro: "Rename your filtered datasets for downstream analysis"

    - title: "Step 10: Inspect DGE"
      content: "<b>Are there more upregulated or downregulated genes in the treated samples?</b><br>
      To obtain the up-regulated genes, we filter the previously generated file (genes with a significant change in expression) with the condition 'c3>0' (the log2 fold change must be greater than 0). We obtain 331 genes (44.07% of the genes with a significant change in gene expression). For the down-regulated genes, we apply the inverse condition ('c3&lt;0') and find 420 genes (55.93% of the genes with a significant change in gene expression)."

    - title: "Step 11: Visualize DGE"
      element: '#current-history-panel'
      intro: "In addition to the list of genes, <b>DESeq2</b> outputs a graphical summary of the results, useful to evaluate the quality of the experiment:<br>
      To inspect the histogram of <i>p</i>-values for all tests:<br>
      <dir>
      <li>Click on the <b>'eye icon'</b> of the corresponding dataset</li>
      </dir>"
      position: "right"

    - title: "Step 11: Visualize DGE"
      element: '#current-history-panel'
      intro: "<a href=\"https://en.wikipedia.org/wiki/MA_plot\">MA plot</a>: global view of the relationship between the expression change of conditions (log ratios, M), the average expression strength of the genes (average mean, A), and the ability of the algorithm to detect differential gene expression. The genes that passed the significance threshold (adjusted p-value < 0.1) are colored in red.<br>
      To inspect the MA-plot:<br>
      <dir>
      <li>Click on the <b>'eye icon'</b> of the corresponding dataset</li>
      </dir>"
      position: "right"

    - title: "Step 11: Visualize DGE"
      element: '#current-history-panel'
      intro: "Principal Component Analysis (<a href=\"https://en.wikipedia.org/wiki/Principal_component_analysis\">PCA</a>)<br>
      Each replicate is plotted as an individual data point. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.<br>
      To inspect the PCA-plot:<br>
      <dir>
      <li>Click on the <b>'eye icon'</b> of the corresponding dataset</li>
      </dir><br>
      <b>What do the two axes separate?</b>"
      position: "right"
      backdrop: true

    - title: "Step 11: Visualize DGE"
      intro: "<dir>
      <li>The first axis separates the treated samples from the untreated samples, as defined when DESeq2 was launched</li>
      <li>The second axis separates the single-end datasets from the paired-end datasets</li>
      </dir>"

    - title: "Step 11: Visualize DGE"
      element: '#current-history-panel'
      intro: "Heatmap of the sample-to-sample distance matrix: an overview of the similarities and dissimilarities between samples<br>
      To inspect the Heatmap:<br>
      <dir>
      <li>Click on the <b>'eye icon'</b> of the corresponding dataset</li>
      </dir><br>
      <b>How are the samples grouped?</b>"
      position: "right"
      backdrop: true

    - title: "Step 11: Visualize DGE"
      intro: "They are first grouped by treatment (the first factor) and then by library type (the second factor), as defined when DESeq2 was launched."
      position: "center"

    - title: "Step 11: Visualize DGE"
      element: '#current-history-panel'
      intro: "Dispersion estimates: gene-wise estimates (black), the fitted values (red), and the final maximum a posteriori estimates used in testing (blue)<br>
      To inspect the Dispersion estimates plot:<br>
      <dir>
      <li>Click on the <b>'eye icon'</b> of the corresponding dataset</li>
      </dir><br>"
      position: "right"
      backdrop: true

    - title: "Step 11: Visualize DGE"
      intro: "This dispersion plot is typical, with the final estimates shrunk from the gene-wise estimates towards the fitted estimates. <br>
      Some gene-wise estimates are flagged as outliers and not shrunk towards the fitted value.<br>
      The amount of shrinkage can be more or less than seen here, depending on the sample size, the number of coefficients, the row mean and the variability of the gene-wise estimates.<br>
      For more information about <b>DESeq2</b> and its outputs, have a look at the <a href=\"https://www.bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf\">DESeq2 documentation</a>."
      position: "center"
      backdrop: true

    - title: "Step 12: Analysis of the functional enrichment among differentially expressed genes"
      content: "We have extracted genes that are differentially expressed in treated (with PS gene depletion) samples compared to untreated samples. We would like to know the functional enrichment among the differentially expressed genes.<br>
      The Database for Annotation, Visualization and Integrated Discovery (<a href=\"https://david.ncifcrf.gov/\">DAVID</a>) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.<br>
      We then use DAVID to identify functional annotations of the upregulated genes and the downregulated genes."
      position: "center"

    - title: "Step 12: Analysis of the functional enrichment among differentially expressed genes"
      element: '#current-history-panel'
      content: "Sort the two datasets generated previously (upregulated genes and downregulated genes) by log2 fold change: in descending order for the upregulated genes and in ascending order for the downregulated genes, so that the largest absolute log2 fold changes are at the top."
      position: "right"

    - title: "Step 12: Analysis of the functional enrichment among differentially expressed genes"
      element: '#current-history-panel'
      content: "Extract the first 100 lines of sorted files and then run DAVID on these files"

    - title: "Step 12: Analysis of the functional enrichment among differentially expressed genes"
      element: "#tool-search-query"
      intro: "You can now type and select <b>'DAVID'</b>.<br><br>
      <b>Follow this set of instructions once the tool has loaded:</b><br>
      <dir>
      <li>Input from first 100 lines of sorted files</li>
      <li>First column as 'Column with identifiers'</li>
      <li>'FLYBASE_GENE_ID' as 'Identifier type'</li>
      <li>Click button 'Execute' and wait for the tool to finish.</li>
      </dir>"
      position: "right"

    - title: "Step 12: Analysis of the functional enrichment among differentially expressed genes"
      content: "The output of the <b>DAVID</b> tool is an HTML file with a link to the DAVID website.<br>
      Inspect the Functional Annotation Chart.<br>
      Which functional categories are the most represented?"

    - title: "Step 12: Analysis of the functional enrichment among differentially expressed genes"
      content: "The up-regulated genes are mostly related to membranes (by number of genes). For the down-regulated genes, the most represented functional categories are linked to signal and pathways.<br>
      Now inspect the Functional Annotation Clustering.<br>
      Which functional annotations are the first clusters related to?"

    - title: "Step 12: Analysis of the functional enrichment among differentially expressed genes"
      content: "For the up-regulated genes, the first cluster mostly comprises functions related to chaperones and stress response. The down-regulated genes are more closely linked to ligase activity."

    - title: "Step 13: Inference of the differential exon usage"
      content: "Now, we would like to know the differential exon usage between treated (PS depleted) and untreated samples using RNA-seq exon counts.<br>
      We will therefore go back and work on the mapping results <a href=\"https://zenodo.org/record/61771/files/GSM461177_untreat_paired_chr4.bam\">GSM461177_untreat_paired_chr4.bam</a>.<br>
      Copy the Drosophila_melanogaster.BDGP5.78.gtf file and the BAM file from the first and second history.<br>
      We use <a href=\"http://www.bioconductor.org/packages/release/bioc/html/DEXSeq.html\">DEXSeq</a>, which detects with high sensitivity genes, and in many cases exons, that are subject to differential exon usage."

    - title: "Step 13: Inference of the differential exon usage"
      element: "#tool-search-query"
      intro: "First, we need to count the number of reads mapping to the exons. This step is similar to counting the number of reads per annotated gene, but here we use DEXSeq-Count instead of HTSeq-count.<br>
      You can now type and select <b>'DEXSeq-Count'</b>.<br><br>
      <b>Follow this set of instructions once the tool has loaded:</b><br>
      <dir>
      <li>Run on Drosophila_melanogaster.BDGP5.78.gtf</li>
      <li>'Prepare annotation' as 'Mode of operation'</li>
      <li>Keep the rest of the options at their default values.</li>
      <li>Click button 'Execute' and wait for the tool to finish.</li>
      <li>The output is again a GTF file that is ready to use for counting.</li>
      </dir>"
      position: "right"

    - title: "Step 13: Inference of the differential exon usage"
      element: "#tool-search-query"
      intro: "Now we count reads<br>
      You can now type and select <b>'DEXSeq-Count'</b>.<br><br>
      <b>Follow this set of instructions once the tool has loaded:</b><br>
      <dir>
      <li>Run on GSM461177_untreat_paired_chr4.bam</li>
      <li>'Count Reads' as 'Mode of operation'</li>
      <li>Keep the rest of the options at their default values.</li>
      <li>Click button 'Execute' and wait for the tool to finish.</li>
      </dir>"
      position: "right"

    - title: "Step 13: Inference of the differential exon usage"
      element: '#current-history-panel'
      intro: "Inspect the output<br>
      <b>Which exon has the most reads mapped to it? Which gene does this exon belong to? Is this consistent with the previous result from HTSeq-count?</b>"
      position: "right"

    - title: "Step 13: Inference of the differential exon usage"
      intro: "FBgn0017545:004 is the exon with the most reads mapped to it. It is part of FBgn0017545, the feature with the most reads mapped according to HTSeq-count."
      position: "center"

    - title: "Step 13: Inference of the differential exon usage"
      intro: "DEXSeq usage is similar to DESeq2. It uses similar statistics to find differentially used exons.<br>
      As with the DESeq2 analysis, in the previous step we counted only reads that mapped to exons on chromosome 4, and for only one sample. To identify differential exon usage induced by PS depletion, all datasets (3 treated and 4 untreated) must be analyzed with the same procedure.<br>
      To save time, we did that for you. The results are available on <a href=\"http://dx.doi.org/10.5281/zenodo.61771\">Zenodo</a>; we will load them into your history using the same file upload procedure as before. You can now create a new history."
      position: "center"
      backdrop: true

    - title: "Step 13: Inference of the differential exon usage"
      element: ".upload-text-content:first"
      intro: "We now paste the links to the new datasets into the upload box. Click <b>'Next'</b> to do so."
      preclick:
        - ".upload-button"
        - "button#btn-new"
      textinsert: |
        https://zenodo.org/record/61771/files/dexseq.gtf
        https://zenodo.org/record/61771/files/treated1_singlea.txt
        https://zenodo.org/record/61771/files/treated2_paired.txt
        https://zenodo.org/record/61771/files/treated3_paired.txt
        https://zenodo.org/record/61771/files/untreated1_single.txt
        https://zenodo.org/record/61771/files/untreated2_single.txt
        https://zenodo.org/record/61771/files/untreated3_paired.txt
        https://zenodo.org/record/61771/files/untreated4_paired.txt

    - title: "Step 13: Inference of the differential exon usage"
      element: "button#btn-start"
      intro: "Now that the links have been pasted, select <b>'dm3'</b> as the genome.<br><br>
              Click <b>'Next'</b> and the tour will <b>'Start'</b> the upload.<br>
              Galaxy will automatically unpack the files."
      position: "bottom"
      postclick:
        - "button#btn-start"
        - "button#btn-close"

    - title: "Step 13:  Inference of the differential exon usage"
      element: "#tool-search-query"
      intro: "You can now type and select <b>'DEXSeq'</b>.<br><br>
      <b>Follow this set of instructions once the tool has loaded:</b><br>
      <ol>
      <li>Set 'condition' as the first factor with 'treated' and 'untreated' as levels, and select the count files corresponding to each level.</li>
      <li>You can select several files by holding the CTRL (or COMMAND) key while clicking on the files of interest.</li>
      <li>Unlike DESeq2, DEXSeq does not allow flexible primary factor names. Always name your primary factor 'condition'.</li>
      <li>Set 'Sequencing' as the second factor with 'PE' and 'SE' as levels, and select the count files corresponding to each level.</li>
      <li>Keep the rest of the options at their default values.</li>
      <li>Click button 'Execute' and wait for the tool to finish.</li>
      </ol>"
      position: "right"

    - title: "Step 13:  Inference of the differential exon usage"
      content: "Similarly to DESeq2, DEXSeq generates a table with:<br>
      <dir>
      <li>Exon identifiers</li>
      <li>Gene identifiers</li>
      <li>Exon identifiers in the Gene</li>
      <li>Mean normalized counts, averaged over all samples from both conditions</li>
      <li>Logarithm (to basis 2) of the fold change</li>
      <li>The log2 fold changes are based on primary factor level 1 vs. factor level 2, so the order of the factor levels matters. For example, for the factor 'condition', DEXSeq computes fold changes of 'treated' samples against 'untreated', i.e. the values correspond to up- or downregulation in treated samples.</li>
      <li>Standard error estimate for the log2 fold change estimate</li>
      <li>p-value for the statistical significance of this change</li>
      <li>p-value adjusted for multiple testing with the Benjamini-Hochberg procedure which controls false discovery rate <a href=\"https://en.wikipedia.org/wiki/False_discovery_rate\">FDR</a></li>
      </dir>"
      position: "center"

    - title: "Step 13:  Inference of the differential exon usage"
      content: "Run <b>Filter</b> to extract exons with a significant change in usage (adjusted <i>p</i>-value equal to or below 0.05) between treated and untreated samples.<br>
      How many exons have a significant change in usage between these conditions?"

    - title: "Step 13:  Inference of the differential exon usage"
      content: "We get 38 exons (12.38%) with a significant change in usage between treated and untreated samples."

    - title: "Step 14:  Annotation of the result tables with gene information"
      content: "Unfortunately, in the process of counting, we lose all information about the gene except its identifier. To add this information back to our final counting tables, we can use the tool 'Annotate DE(X)Seq result', which links identifiers to their annotation."

    - title: "Step 14:  Annotation of the result tables with gene information"
      element: "#tool-search-query"
      content: "Run <b>Annotate DE(X)Seq result</b> on a counting table (from DESeq2 or DEXSeq) using Drosophila_melanogaster.BDGP5.78.gtf as the annotation file."

    - title: "<b>Step 15: Name your history</b>"
      element: "#current-history-panel > div.controls > div.title > div"
      intro: "Change the name of your history."
      position: "bottom"

    - title: "<b>Step 16: Make a workflow out of steps 5 to 9</b>"
      element: '#history-options-button'
      intro: "Please extract your history to a workflow.<br>
      <b>(History options :: Extract workflow)</b><br><br>
      <b>Do not include:</b> 'RNAplot'<br><br>
      Click <b>'Create Workflow'</b>."
      position: "left"

    - title: "<b>Step 16: Make a workflow</b>"
      element: '#history-options-button'
      intro: "To make sure the workflow is correct, check it in the editor and make some small adjustments.<br><br>
      <dir>
      <li>Click on the name of your new workflow and select <b>'Edit'</b></li>
      <li>The individual steps are displayed as boxes, and their <b>outputs and inputs are connected by lines</b></li>
      <li>When you click on a box, you see the tool options on the right. Besides the tools, you should see two additional boxes titled <b>'Input dataset'</b>. These represent the data we want to feed into our workflow.</li>
      </dir>"
      position: "left"

    - title: "<b>Step 16: Make a workflow</b>"
      element: '#history-options-button'
      intro: "To make sure our workflow is correct, we look at it in the editor and make some small adjustments.<br><br>
      <dir>
      <li>Although we have several inputs in the workflow, they are missing their connections to some of the tools we used, because we didn't carry over the intermediate steps</li>
      <li><b>Connect</b> each input dataset to the Intersect tool by <b>dragging</b> the arrow pointing outwards on the right of its box (which denotes an output) to an arrow on the left of the Intersect box pointing inwards (which denotes an input). Connect each input dataset to a different input of Intersect</li>
      <li>You can also <b>change the names</b> of the input datasets. Don't forget to save the workflow at the end by clicking on <b>'Options'</b> (top right) and selecting <b>'Save'</b></li>
      </dir>"
      position: "left"

    - title: "<b>Step 17: Sharing workflow</b>"
      element: 'a[href$="/workflow/list_for_run"]'
      intro: "You can share your new workflow.<br>
      <dir>
      <li>Click on your workflow's name and select <b>'Share or publish'</b></li>
      <li>Click <b>'Share with a user'</b></li>
      <li>Enter the email address of the person you wish to share your workflow with (the same address they use to log in to Galaxy)</li>
      <li>Hit <b>'Share'</b></li>
      </dir>"
      position: "top"

    - title: "<b>Step 17: Sharing workflow</b>"
      element: '#center-panel'
      intro: "If a workflow has been shared with you, you can find it under <b>'Workflows shared with you by others'</b>:<br>
      <dir>
      <li>Click on a workflow name and select <b>'View'</b></li>
      <li>You can compare the workflows of others with your workflow</li>
      </dir>"
      position: "right"

    - title: "Concluding remarks"
      content: "In this tutorial, we have analyzed real RNA sequencing data to extract useful information, such as which genes are up- or downregulated by depletion of the Pasilla gene and which exons are regulated by the Pasilla gene. To answer these questions, we analyzed RNA sequence datasets using a reference-based RNA-seq data analysis approach."

    - title: "<b>Enjoy the Galaxy RNA-workbench</b>"
      intro: "Thanks for taking this tour! Happy research with Galaxy!"