Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Populate main package vignette #2

Merged
merged 4 commits into from
Apr 19, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion R/peak_proportion.R
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot

Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ peak_proportion <- function(peak_input,
out_dir = out_dir)
ame_out <- memes::runAme(peak_sequences,
database = list(motif),
outdir = "./")
outdir = out_dir)

return(
list(
Expand Down
214 changes: 212 additions & 2 deletions vignettes/MotifStats.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -7,13 +7,223 @@ vignette: >
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
```{r include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

```{r setup}

## Introduction

`MotifStats` is a simple R package to calculate the the metrics to quantify the
relationship between peaks and motifs. It is based on [Analysis of Motif
Enrichment (AME)](https://meme-suite.org/meme/doc/ame.html) and [Find Individual
Motif Occurrences (FIMO)](https://meme-suite.org/meme/doc/fimo.html) from the
[MEME suite](https://meme-suite.org/meme/index.html).

<br>
It has two distinct functions:

1. Calculate motif enrichment motif enrichment relative to a set of background
sequences using AME.
2. Calculate the distance between each motif and its nearest peak summit, where
each motif is identified using FIMO.


## Data

The `MotifStats` package comes with motif and peak data for transcription
factors CTCF and CREB1. Details of the files are as follows:

- **CREB1 TIP-seq peaks (narrowPeaks)**[^f1]
-- *Data not publicly available*

- **CTCF TIP-seq peaks (narrowPeaks)**[^f1]
-- Retrieved from NCBI GEO under
accession
[GSE188512](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE188512)

- **CREB motif**
-- Retrieved from JASPAR2024 under matrix ID
[MA0018.5](http://jaspar.genereg.net/matrix/MA0018.5/)

- **CTCF motif**
-- Retrieved from JASPAR2024 under matrix ID
[MA1930.2](http://jaspar.genereg.net/matrix/MA1930.2/)


## Installation

`MotifStats` relies on [MEME suite](https://meme-suite.org/meme/index.html) as
a system dependency. Directions for installation can be found [here](https://www.bioconductor.org/packages/release/bioc/vignettes/memes/inst/doc/install_guide.html).
<br>
To install the package, use the following command:
```{r eval = FALSE}
if(!require("remotes")) install.packages("remotes")
remotes::install_github("neurogenomics/MotifStats")
```


## Usage

In this example analysis, we will compare the enrichment of the CTCF motif in
CTCF TIP-seq peaks relative to the background. We will also calculate the
HDash marked this conversation as resolved.
Show resolved Hide resolved
distance between the centre of each motif occurrence and its nearest peak
summit.


### Load packages

Load the installed package.
```{r include = TRUE, message = FALSE, warning = FALSE}
library(MotifStats)
```

For this example, we will also load the `BSgenome.Hsapiens.UCSC.hg38` package to
provide the genome build our peaks have been derived from.
```{r include = TRUE, message = FALSE, warning = FALSE}
library(BSgenome.Hsapiens.UCSC.hg38)
```


### Prepare input data

First, we need to load the motif and peak data. The motif data obtained from
JASPAR can be loaded as follows:
```{r eval = FALSE}
name_motif <- read_motif_file(
"/path/to/motfif/file.jaspar",
motif_id = "id_of_motif",
file_format = "jaspar"
)
```

Next, we load the narrowPeak files from MACS2/3[^f2].
```{r eval = FALSE}
name_peaks <- read_peak_file("/path/to/peak/file.narrowPeak")
```

For this example, we will load the built-in CTCF TIP-seq peaks (as `GRanges`
object) and CREB motif (as `PWMatrix` object) data.
```{r include = TRUE}
data("ctcf_motif")
data("ctcf_peaks")
```


### Calculate motif enrichment

To calculate the motif relative to a set of background sequences, we use
`peak_proportion()`.

- Under the hood, it calls `meme::runAme` for motif
enrichment scoring. In context of this call, it identifies the occurrences of
input motif in the input sequences compared with background sequences and
outputs relevant statistics.
- A 0-order background model with the same nucleotide composition as the input
sequences is generated for comparison.
- An additional `out_dir` argument can be used to specify the
directory to save the AME output files[^f3] and the background model.

```{r include = TRUE}
ctcf_read_prop <- peak_proportion(
peak_input = ctcf_peaks,
motif = ctcf_motif,
genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
)
```

`ctcf_read_prop` is a list with two components:
1. `$tp` (True positives) with the number of true positive motif occurrences in
the given sequence followed by its relative percentage.
1. `$fp` (False positives) with the number of false positive motif occurrences
HDash marked this conversation as resolved.
Show resolved Hide resolved
in the given sequence followed by its relative percentage.

In context of this function, the true positives represent to the
number/percentage of peaks with an associated motif occurrence, while the false
positives represent the number/percentage of peaks without an associated motif
occurrence.



### Find motif-summit distances

To calculate the distance between each motif and its nearest peak summit, we use
`summit_to_motif()`.

- `fp_rate` argument specifies the desired false-positive rate for FIMO. A
p-value is calculated using the formula:
$$p = \frac{\text{fp_rate}}{2 * \text{promoter_length}}$$
- A 0-order background model with the same nucleotide composition as the input
sequences is generated for comparison.
- An additional `out_dir` argument can be used to specify the
HDash marked this conversation as resolved.
Show resolved Hide resolved
directory to save the 0-order background file.

```{r include = TRUE}
ctcf_read_sum_motif <- summit_to_motif(
peak_input = ctcf_peaks,
motif = ctcf_motif,
fp_rate = 0.05,
genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
)
```

`ctcf_read_sum_motif` is a list with two objects:
1. `peak_set` with peak information, as a `GRanges` object.
2. `distance_to_summit` with distances between the centre of each motif and its
nearest peak summit.

**NOTE**: When a motif is found multiple times within a single peak, the
`peak_set` and `distance_to_summit` objects will contain multiple entries (rows)
corresponding to the same peak. Each of these entries represents a distinct
occurrence of the motif within that peak.

#### Visualize results

We can optionally visualize the distribution of distances by using
`density_plot()`.
```{r include = TRUE, fig.width = 7, fig.height = 4}
density_plot(
HDash marked this conversation as resolved.
Show resolved Hide resolved
ctcf_read_sum_motif$distance_to_summit,
plot_title = "CTCF motif distance to summit",
x_label = "Distance to summit (bp)",
y_label = "Density"
)
```

For this given example, we can observe the distribution of distances between the
centre of each motif, and its nearest peak summit follows a normal distribution.
With a mean of the distribution around 0 base pairs (bp), we can infer that the
motif is likely to be located at the peak summit, suggesting that the identified
peak summits are associated with binding of regulatory proteins.


> **NOTE:** Since AME and FIMO accept different parameters and are calculated
independently, it is not possible to obtain directly comparable results.


## Future Enchancements

- Calculate metrics for more than one motif at a time.


## Session Info

<details>

```{r echo = FALSE}
sessionInfo()
```

</details>


<!-- Footnotes -->
[^f1]: The peak file is a subset of chromosome 19 to reduce the size of the
built-in data.
[^f2]: narrowPeak files from both version 2 and 3 of [MACS: Model-based Analysis
for ChIP-Seq](https://github.com/macs3-project/MACS) is supported.
[^f3]: For more information on the output files, refer to the
[AME documentation](https://meme-suite.org/meme/doc/ame.html).
Loading