Fibril_Seurat.RMD

---
title: "Parkinson Disease Stimulation Exp - Unstim Samples"
author: "Ashley Richardson"
date: "2024-01-22"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Rmarkdown to pre-process scRNA seq from fibril stimulated PBMCs. 


## Create the Seurat Object. 
```{r Library Loading, warning=FALSE} 

.libPaths(c("/hpc/packages/minerva-centos7/rpackages/4.2.0/site-library", "/hpc/packages/minerva-centos7/rpackages/bioconductor/3.15", .libPaths()))

library(sctransform, lib.loc = "/hpc/users/richaa21/.Rlib")
library(Seurat, lib.loc = "/hpc/users/richaa21/.Rlib")
library(patchwork)
library(dplyr)
```

```{r Create Seurat Object, paged.print=TRUE, warning=FALSE}
pbmc.data <- Read10X(data.dir = "/sc/arion/projects/ad-omics/ashley/data/aF/outs/per_sample_outs/aF/count/sample_filtered_feature_bc_matrix")

pbmc.f <- CreateSeuratObject(counts = pbmc.data, project = "fibril", min.cells = 3, min.features = 200)
pbmc.f
```

Lets check how many cells and features we are starting with. 
```{r}
length(colnames(pbmc.f)) #number of cells
```
```{r}
length(rownames(pbmc.f)) #number of features 
```

## Quality Control (QC)

    a. QC mitochondiral genes. The PercentFeatureSet() calculates the % of counts originating from a set of features - here we are first looking at mitochondiral features, which are genes starting with MT. 
    b. By using [[ ]], I am adding columns to the pbmc matrix to store this QC data. 
```{r QC: Mitochondria content, warning=FALSE}
pbmc.f[["percent.mt"]] <- PercentageFeatureSet(pbmc.f, pattern = "^MT-")

# Show QC metrics for the first 5 cells
head(pbmc.f@meta.data, 5)
```


```{r QC: VLNplot, warning = FALSE}
library(farver, lib.loc = "/hpc/packages/minerva-centos7/rpackages/4.2.0/site-library")
VlnPlot(pbmc.f, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
```

```{r QC: ScatterPLot, warning=FALSE}
plot1 <- FeatureScatter(pbmc.f, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc.f, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 
plot2
```

```{r QC: remove cells with high MT, warning = FALSE}
# I will select for %mt under 10% and features greater than 200 to capture alive healthy cells. 
pbmc.f <- subset(pbmc.f, subset = nFeature_RNA > 200 & percent.mt < 10)
head(pbmc.f@meta.data)
```

## Normalize the data. 
The data is normalized based on the feature (number of genes in a cell) by the total expression. This number is multiplied by 10,000 and then log transformed. The function to do this is "NormalizeData." The values specied below are the default values of this function. 
```{r Normalize Data}
pbmc.f <- NormalizeData(pbmc.f, normalization.method = "LogNormalize", scale.factor = 10000)
```


## Identification of highly variable features
```{r Identify Highly Variable Features, warning = FALSE}
pbmc.f <- FindVariableFeatures(pbmc.f, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc.f), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc.f)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
plot1 
plot2
```
## Scaling the Data
The data is scaled by doing a linear transformation. 
The ScaleData function does this by shifting the expression of genes so that the mean expression becomes 0 and the variance is 1. 
By default, only variable genes are scaled. This is changed by features = all.genes 
```{r Scale Data, warning=FALSE}
##Scaling RNA data, we only scale the variable features here for efficiency
all.genes <- rownames(pbmc.f)
pbmc.f <- ScaleData(pbmc.f, vars.to.regress = c("percent.mt"))
```

## Perform linear dimensional reduction 
For the first principal components, RunPCA shows the most positive (correlated) and most negative (anticorrelated) genes
```{r Linear Dimensional Reduction, warning=FALSE}
pbmc.f <- RunPCA(pbmc.f, features = VariableFeatures(object = pbmc.f))
print(pbmc.f[["pca"]], dims = 1:5, nfeatures = 5)
```

The PCA results can be visualized in different ways. 
```{r Dim Plot of PCs, warning=FALSE}
VizDimLoadings(pbmc.f, dims = 1:2, reduction = "pca")
```

```{r PC2 vs PC1 scatter plot, warning=FALSE}
DimPlot(pbmc.f, reduction = "pca") + NoLegend()
```

```{r Heatmap of PC1, warning=FALSE}
DimHeatmap(pbmc.f, dims = 1, cells = 500, balanced = TRUE)
```

```{r Heatmap of all PCs}
DimHeatmap(pbmc.f, dims = 1:15, cells = 500, balanced = TRUE)
```


## Determine the ‘dimensionality’ of the dataset
Cells will be clustered based on PCA. How many PC to use is dependent on many factors. For example, if trying to analyze a rare cell subset, you might want to add more PCs. Usually, the first 10 is good to see dimensionality of the data. 

```{r ElbowPlot of PCs, warning=FALSE}
ElbowPlot(pbmc.f)
```

## Cluster the cells. 
```{r Cluster Cells, warning=FALSE}
##We select the top 15 PCs for clustering and tSNE based on PCElbowPlot
pbmc.f <- FindNeighbors(pbmc.f, reduction = "pca", dims = 1:15)

pbmc.f <- FindClusters(pbmc.f, resolution = 0.5, verbose = FALSE)
```

```{r Head Clusters, warning=FALSE}
head(Idents(pbmc.f), 5)
```

## Run non-linear dimensional reduction (UMAP/tSNE)

```{r, warning=FALSE}
pbmc.f <- RunUMAP(pbmc.f, dims = 1:15)
```

```{r umap, warning=FALSE}
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc.f, reduction = "umap", label = TRUE)
```

```{r}
pbmc.f <- RunTSNE(pbmc.f, reduction = "pca", dims = 1:15)
DimPlot(pbmc.f, reduction = "tsne", label = TRUE)
```

Lets check our metadata now of the seurat object to see what has been added. 
```{r}
head(pbmc.f@meta.data)

#We now have seurat clusters and RNA_snn_res.0.5 columns added. 
```


## Finding differentially expressed features (cluster biomarkers)

```{r markers for every cluster, warning=FALSE}
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
markers_f <- FindAllMarkers(pbmc.f, only.pos = TRUE)

cluster5.markers <- FindMarkers(pbmc.f, ident.1 = 5, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
```


VlnPlot will display the differential expression across the clusters. 
For example, I am looking here at CD8A and CD4 expression in the clusters. 
```{r vlnplot of specified marker, warning=FALSE}
VlnPlot(pbmc.f, features = c("CD8A", "CD4"))
```


The raw counts can also be shown instead by adding some parameters.   
```{r, warning=FALSE}
VlnPlot(pbmc.f, features = c("CD8A", "CD4"), slot = "counts", log = TRUE)
```

```{r umap feature plot, warning=FALSE}
FeaturePlot(pbmc.f, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
    "CD8A"))
```


### Combine Seurat object with demuxlet output.  

## First, load and edit the .best demuxlet output file to make it more compatible. 
```{r}
.libPaths(c("/hpc/packages/minerva-centos7/rpackages/4.2.0/site-library", "/hpc/packages/minerva-centos7/rpackages/bioconductor/3.15", .libPaths()))

library(patchwork)
library(tidyr)
f_demuxlet = read.delim("/sc/arion/projects/ad-omics/ashley/data/aF.best", header = T, stringsAsFactors = F, check.names = F)
head(f_demuxlet)
```
To edit: I will split the Best column into multiple columns. 
```{r}
f_demuxlet_edit = f_demuxlet %>% 
  mutate(BEST = gsub("-01","", BEST)) %>%
  separate(BEST, into=c("DMX_classification.global","DMX_maxID","DMX_secondID"), sep="-") %>%
  separate(DMX_maxID, into=c("DMX_garbage1","DMX_garbage2","DMX_maxID"), sep ="_") %>%
  separate(DMX_secondID, into=c("DMX_garbage3","DMX_garbage4","DMX_secondID"), sep ="_") %>%
  select(-contains("garbage"))

head(f_demuxlet_edit)
```

```{r}
table(f_demuxlet_edit$DMX_classification.global) #num of singlets and doublets
```

```{r}
table(f_demuxlet_edit$DMX_maxID) #number of cells identified as each donor
```


```{r}
table(f_demuxlet_edit[,c("DMX_classification.global","DMX_maxID")]) #number of singlets or doublets identified as each donor
```


## Next, I will add this edited demuxlet to the Seurat object. 

```{r}
f_demuxlet_edit.subset <- f_demuxlet_edit[f_demuxlet_edit$BARCODE %in% colnames(pbmc.f),]

pbmc.f@meta.data <- cbind(pbmc.f@meta.data,f_demuxlet_edit.subset$DMX_maxID,f_demuxlet_edit.subset$DMX_classification.global, f_demuxlet_edit.subset$BARCODE)

head(pbmc.f@meta.data)
```

## Pre-Processing the new object. 
The Seurat Object has already been preprocessed in my case, so this should be clean)

```{r}
pbmc.f[["percent.mt"]] <- PercentageFeatureSet(pbmc.f, pattern = "^MT-")

library(cowplot)
# look at distribution of metrics by classification
plot_grid(VlnPlot(pbmc.f, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, group.by = "f_demuxlet_edit.subset$DMX_maxID"))

```


```{r}
VlnPlot(pbmc.f, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, group.by = "f_demuxlet_edit.subset$DMX_classification.global")
```

Now, select for MT content under 10% and for nfeatureRNA > 200 to ensure getting alive cells. 
I previously did this, so should see no change in cell numnbers. 
```{r}
pbmc.f <- subset(pbmc.f, subset = nFeature_RNA > 200 & percent.mt < 10)
length(colnames(pbmc.f))
length(rownames(pbmc.f))
```
Add column to meta data to identify seurat object as Basline condition. Also rename some columns for clarity purposes. 
```{r}
pbmc.f@meta.data$condition <- 'Fibril'
names(pbmc.f@meta.data)[names(pbmc.f@meta.data) == "f_demuxlet_edit.subset$DMX_classification.global"] <- "DMX_classification.global"
names(pbmc.f@meta.data)[names(pbmc.f@meta.data) == "f_demuxlet_edit.subset$DMX_maxID"] <- "DMX_maxID"
names(pbmc.f@meta.data)[names(pbmc.f@meta.data) == "f_demuxlet_edit.subset$BARCODE"] <- "Barcode"
head(pbmc.f@meta.data)
```


```{r Save  as .finalRDS file}
saveRDS(pbmc.f, file = "/sc/arion/projects/ad-omics/ashley/PD_Stim/pbmc.fibril.final.RDS")
```