From 4683251b5f3b5774c8b688e6403fdb6cb99163f5 Mon Sep 17 00:00:00 2001
From: HDash <16350928+HDash@users.noreply.github.com>
Date: Wed, 17 Apr 2024 11:42:27 +0100
Subject: [PATCH 1/4] Populate main package vignette
---
vignettes/MotifStats.Rmd | 172 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 170 insertions(+), 2 deletions(-)
diff --git a/vignettes/MotifStats.Rmd b/vignettes/MotifStats.Rmd
index f0e87af..967e380 100644
--- a/vignettes/MotifStats.Rmd
+++ b/vignettes/MotifStats.Rmd
@@ -7,13 +7,181 @@ vignette: >
%\VignetteEncoding{UTF-8}
---
-```{r, include = FALSE}
+```{r include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
-```{r setup}
+
+## Introduction
+
+`MotifStats` is a simple R package to calculate the the metrics to quantify the
+relationship between peaks and motifs. It is based on [Analysis of Motif
+Enrichment (AME)](https://meme-suite.org/meme/doc/ame.html) and [Find Individual
+Motif Occurrences (FIMO)](https://meme-suite.org/meme/doc/fimo.html) from the
+[MEME suite](https://meme-suite.org/meme/index.html).
+
+
+It has two distinct functions:
+
+1. Calculate motif enrichment motif enrichment relative to a set of background
+sequences using AME.
+2. Calculate the distance between each motif and its nearest peak summit, where
+each motif is identified using FIMO.
+
+## Data
+The `MotifStats` package comes with motif and peak data for transcription
+factors CTCF and CREB1. Details of the files are as follows:
+
+- **CREB1 TIP-seq peaks (narrowPeaks)**[^f1]
+-- *Data not publicly available*
+
+- **CTCF TIP-seq peaks (narrowPeaks)**[^f1]
+-- Retrieved from NCBI GEO under
+accession
+[GSE188512](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE188512)
+
+- **CREB motif**
+-- Retrieved from JASPAR2024 under matrix ID
+[MA0018.5](http://jaspar.genereg.net/matrix/MA0018.5/)
+
+- **CTCF motif**
+-- Retrieved from JASPAR2024 under matrix ID
+[MA1930.2](http://jaspar.genereg.net/matrix/MA1930.2/)
+
+
+## Installation
+`MotifMarker` relies on [MEME suite](https://meme-suite.org/meme/index.html) as
+a system dependency. Directions for installation can be found [here](https://www.bioconductor.org/packages/release/bioc/vignettes/memes/inst/doc/install_guide.html).
+
+To install the package, use the following command:
+```{r eval = FALSE}
+if(!require("remotes")) install.packages("remotes")
+remotes::install_github("neurogenomics/MotifStats")
+```
+
+
+## Usage
+In this example analysis, we will compare the enrichment of the CREB motif in
+CTCF TIP-seq peaks relative to the background. We will also calculate the
+distance between the centre of each CTCF motif and its nearest peak summit.
+
+### Load packages
+Load the installed package.
+```{r include = FALSE}
library(MotifStats)
```
+
+For this example, we will also load the `BSgenome.Hsapiens.UCSC.hg38` package to
+provide the genome build our peaks have been derived from.
+```{r include = FALSE}
+library(BSgenome.Hsapiens.UCSC.hg38)
+```
+
+### Prepare input data
+First, we need to load the motif and peak data. The motif data obtained from
+JASPAR can be loaded as follows:
+```{r eval = FALSE}
+name_motif <- read_motif_file(
+ "/path/to/motfif/file.jaspar",
+ motif_id = "id_of_motif",
+ file_format = "jaspar"
+ )
+```
+
+Next, we load the narrowPeak files from MACS2/3.
+```{r eval = FALSE}
+name_peaks <- read_peak_file("/path/to/peak/file.narrowPeak")
+```
+
+For this example, we will load the built-in CREB1 TIP-seq peaks (as `GRanges`
+object) and CREB motif (as `PWMatrix` object) data.
+```{r include = TRUE}
+data("ctcf_motif")
+data("ctcf_peaks")
+```
+
+### Calculate motif enrichment
+To calculate the motif relative to a set of background sequences, we use
+`peak_proportion()`.
+
+- An additional `out_dir` argument can be used to specify the
+directory to save the AME output files[^f2].
+
+```{r include = TRUE}
+ctcf_read_prop <- peak_proportion(
+ peak_input = ctcf_peaks,
+ motif = ctcf_motif,
+ genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
+)
+```
+
+`ctcf_read_prob` is a list with two components:
+1. `$tp` (True positives) with the number of true positive motif occurrences in
+the given sequence followed by its relative percentage.
+1. `$fp` (False positives) with the number of false positive motif occurrences
+in the given sequence followed by its relative percentage.
+
+
+### Find motif-summit distances
+To calculate the distance between each motif and its nearest peak summit, we use
+`summit_to_motif()`.
+
+- `fp_rate` argument specifies the desired false-positive rate for FIMO. A
+p-value is calculated using the formula:
+$$p = \frac{\text{fp_rate}}{2 * \text{promoter_length}}$$
+- An additional `out_dir` argument can be used to specify the
+directory to save the 0-order background file.
+
+```{r include = TRUE}
+ctcf_read_sum_motif <- summit_to_motif(
+ peak_input = ctcf_peaks,
+ motif = ctcf_motif,
+ fp_rate = 0.05,
+ genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
+)
+```
+
+`ctcf_read_sum_motif` is a list with two objects:
+1. `peak_set` with peak information.
+2. `distance_to_summit` with distances between the centre of each motif and its
+nearest peak summit.
+
+
+#### Visualize results
+We can optionally visualize the distribution of distances by using `density_plot()`.
+```{r include = TRUE, fig.width = 7, fig.height = 4}
+density_plot(
+ ctcf_read_sum_motif$distance_to_summit,
+ plot_title = "CTCF motif distance to summit",
+ x_label = "Distance to summit (bp)",
+ y_label = "Density"
+)
+```
+
+
+> **NOTE:** Since AME and FIMO accept different parameters and are calculated
+independently, it is not possible to obtain directly comparable results.
+
+
+## Future Enchancements
+- Calculate metrics for more than one motif at a time.
+
+
+## Session Info
+
+
+
+```{r echo = FALSE}
+sessionInfo()
+```
+
+
+
+
+[^f1]: The peak file is a subset of chromosome 19 to reduce the size of the
+built-in data.
+[^f2]: For more information on the output files, refer to the
+[AME documentation](https://meme-suite.org/meme/doc/ame.html).
From 910db57a0cb4624eda8739ebbc4a769d6d507bc7 Mon Sep 17 00:00:00 2001
From: HDash <16350928+HDash@users.noreply.github.com>
Date: Thu, 18 Apr 2024 15:37:45 +0100
Subject: [PATCH 2/4] Correct typos in main vignette
---
vignettes/MotifStats.Rmd | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/vignettes/MotifStats.Rmd b/vignettes/MotifStats.Rmd
index 967e380..324cf03 100644
--- a/vignettes/MotifStats.Rmd
+++ b/vignettes/MotifStats.Rmd
@@ -53,7 +53,7 @@ accession
## Installation
-`MotifMarker` relies on [MEME suite](https://meme-suite.org/meme/index.html) as
+`MotifStats` relies on [MEME suite](https://meme-suite.org/meme/index.html) as
a system dependency. Directions for installation can be found [here](https://www.bioconductor.org/packages/release/bioc/vignettes/memes/inst/doc/install_guide.html).
To install the package, use the following command:
@@ -64,9 +64,10 @@ remotes::install_github("neurogenomics/MotifStats")
## Usage
-In this example analysis, we will compare the enrichment of the CREB motif in
+In this example analysis, we will compare the enrichment of the CTCF motif in
CTCF TIP-seq peaks relative to the background. We will also calculate the
-distance between the centre of each CTCF motif and its nearest peak summit.
+distance between the centre of each motif occurrence and its nearest peak
+summit.
### Load packages
Load the installed package.
@@ -96,7 +97,7 @@ Next, we load the narrowPeak files from MACS2/3.
name_peaks <- read_peak_file("/path/to/peak/file.narrowPeak")
```
-For this example, we will load the built-in CREB1 TIP-seq peaks (as `GRanges`
+For this example, we will load the built-in CTCF TIP-seq peaks (as `GRanges`
object) and CREB motif (as `PWMatrix` object) data.
```{r include = TRUE}
data("ctcf_motif")
@@ -118,7 +119,7 @@ ctcf_read_prop <- peak_proportion(
)
```
-`ctcf_read_prob` is a list with two components:
+`ctcf_read_prop` is a list with two components:
1. `$tp` (True positives) with the number of true positive motif occurrences in
the given sequence followed by its relative percentage.
1. `$fp` (False positives) with the number of false positive motif occurrences
From c1dad8eaa317f26b5faf3dc9088c5930582a2099 Mon Sep 17 00:00:00 2001
From: HDash <16350928+HDash@users.noreply.github.com>
Date: Thu, 18 Apr 2024 16:59:06 +0100
Subject: [PATCH 3/4] Add further explanations to the main vignette
---
vignettes/MotifStats.Rmd | 63 +++++++++++++++++++++++++++++++++-------
1 file changed, 52 insertions(+), 11 deletions(-)
diff --git a/vignettes/MotifStats.Rmd b/vignettes/MotifStats.Rmd
index 324cf03..f784c47 100644
--- a/vignettes/MotifStats.Rmd
+++ b/vignettes/MotifStats.Rmd
@@ -31,7 +31,9 @@ sequences using AME.
2. Calculate the distance between each motif and its nearest peak summit, where
each motif is identified using FIMO.
+
## Data
+
The `MotifStats` package comes with motif and peak data for transcription
factors CTCF and CREB1. Details of the files are as follows:
@@ -53,6 +55,7 @@ accession
## Installation
+
`MotifStats` relies on [MEME suite](https://meme-suite.org/meme/index.html) as
a system dependency. Directions for installation can be found [here](https://www.bioconductor.org/packages/release/bioc/vignettes/memes/inst/doc/install_guide.html).
@@ -64,27 +67,32 @@ remotes::install_github("neurogenomics/MotifStats")
## Usage
+
In this example analysis, we will compare the enrichment of the CTCF motif in
CTCF TIP-seq peaks relative to the background. We will also calculate the
distance between the centre of each motif occurrence and its nearest peak
summit.
+
### Load packages
+
Load the installed package.
-```{r include = FALSE}
+```{r include = TRUE, message = FALSE, warning = FALSE}
library(MotifStats)
```
For this example, we will also load the `BSgenome.Hsapiens.UCSC.hg38` package to
provide the genome build our peaks have been derived from.
-```{r include = FALSE}
+```{r include = TRUE, message = FALSE, warning = FALSE}
library(BSgenome.Hsapiens.UCSC.hg38)
```
+
### Prepare input data
+
First, we need to load the motif and peak data. The motif data obtained from
JASPAR can be loaded as follows:
-```{r eval = FALSE}
+```{r eval = FALSE}
name_motif <- read_motif_file(
"/path/to/motfif/file.jaspar",
motif_id = "id_of_motif",
@@ -92,8 +100,8 @@ name_motif <- read_motif_file(
)
```
-Next, we load the narrowPeak files from MACS2/3.
-```{r eval = FALSE}
+Next, we load the narrowPeak files from MACS2/3[^f2].
+```{r eval = FALSE}
name_peaks <- read_peak_file("/path/to/peak/file.narrowPeak")
```
@@ -104,12 +112,20 @@ data("ctcf_motif")
data("ctcf_peaks")
```
+
### Calculate motif enrichment
-To calculate the motif relative to a set of background sequences, we use
-`peak_proportion()`.
+To calculate the motif relative to a set of background sequences, we use
+`peak_proportion()`.
+
+- Under the hood, it calls `meme::runAme` for motif
+enrichment scoring. In context of this call, it identifies the occurrences of
+input motif in the input sequences compared with background sequences and
+outputs relevant statistics.
+- A 0-order background model with the same nucleotide composition as the input
+sequences is generated for comparison.
- An additional `out_dir` argument can be used to specify the
-directory to save the AME output files[^f2].
+directory to save the AME output files[^f3] and the background model.
```{r include = TRUE}
ctcf_read_prop <- peak_proportion(
@@ -125,14 +141,23 @@ the given sequence followed by its relative percentage.
1. `$fp` (False positives) with the number of false positive motif occurrences
in the given sequence followed by its relative percentage.
+In context of this function, the true positives represent to the
+number/percentage of peaks with an associated motif occurrence, while the false
+positives represent the number/percentage of peaks without an associated motif
+occurrence.
+
+
### Find motif-summit distances
+
To calculate the distance between each motif and its nearest peak summit, we use
`summit_to_motif()`.
- `fp_rate` argument specifies the desired false-positive rate for FIMO. A
p-value is calculated using the formula:
$$p = \frac{\text{fp_rate}}{2 * \text{promoter_length}}$$
+- A 0-order background model with the same nucleotide composition as the input
+sequences is generated for comparison.
- An additional `out_dir` argument can be used to specify the
directory to save the 0-order background file.
@@ -146,13 +171,19 @@ ctcf_read_sum_motif <- summit_to_motif(
```
`ctcf_read_sum_motif` is a list with two objects:
-1. `peak_set` with peak information.
+1. `peak_set` with peak information, as a `GRanges` object.
2. `distance_to_summit` with distances between the centre of each motif and its
nearest peak summit.
+**NOTE**: When a motif is found multiple times within a single peak, the
+`peak_set` and `distance_to_summit` objects will contain multiple entries (rows)
+corresponding to the same peak. Each of these entries represents a distinct
+occurrence of the motif within that peak.
#### Visualize results
-We can optionally visualize the distribution of distances by using `density_plot()`.
+
+We can optionally visualize the distribution of distances by using
+`density_plot()`.
```{r include = TRUE, fig.width = 7, fig.height = 4}
density_plot(
ctcf_read_sum_motif$distance_to_summit,
@@ -162,12 +193,19 @@ density_plot(
)
```
+For this given example, we can observe the distribution of distances between the
+centre of each motif, and its nearest peak summit follows a normal distribution.
+With a mean of the distribution around 0 base pairs (bp), we can infer that the
+motif is likely to be located at the peak summit, suggesting that the identified
+peak summits are associated with binding of regulatory proteins.
+
> **NOTE:** Since AME and FIMO accept different parameters and are calculated
independently, it is not possible to obtain directly comparable results.
## Future Enchancements
+
- Calculate metrics for more than one motif at a time.
@@ -181,8 +219,11 @@ sessionInfo()
+
[^f1]: The peak file is a subset of chromosome 19 to reduce the size of the
built-in data.
-[^f2]: For more information on the output files, refer to the
+[^f2]: narrowPeak files from both version 2 and 3 of [MACS: Model-based Analysis
+for ChIP-Seq](https://github.com/macs3-project/MACS) is supported.
+[^f3]: For more information on the output files, refer to the
[AME documentation](https://meme-suite.org/meme/doc/ame.html).
From e22b77d17620faccdf12b9c17e7f84cdb3bcb93d Mon Sep 17 00:00:00 2001
From: HDash <16350928+HDash@users.noreply.github.com>
Date: Thu, 18 Apr 2024 17:17:54 +0100
Subject: [PATCH 4/4] Update `out_dir` behavior for `runAme` per roxygen docs
for `peak_proportion`
---
R/peak_proportion.R | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/R/peak_proportion.R b/R/peak_proportion.R
index b250644..8505400 100644
--- a/R/peak_proportion.R
+++ b/R/peak_proportion.R
@@ -52,7 +52,7 @@ peak_proportion <- function(peak_input,
out_dir = out_dir)
ame_out <- memes::runAme(peak_sequences,
database = list(motif),
- outdir = "./")
+ outdir = out_dir)
return(
list(