Commit

v1.0.9
jsxlei committed Feb 23, 2021
1 parent d6f10b9 commit 707fede
Showing 7 changed files with 360 additions and 261 deletions.
55 changes: 24 additions & 31 deletions README.md
@@ -1,6 +1,9 @@
# Single-Cell ATAC-seq analysis via Latent feature Extraction
![](https://github.com/jsxlei/SCALE/wiki/png/model.png)

## News
2021.01.14 Updated to be compatible with [h5ad](https://anndata.readthedocs.io/en/latest/anndata.AnnData.html) files and [scanpy](https://scanpy.readthedocs.io/en/stable/index.html)

## Installation

The SCALE neural network is implemented in the [Pytorch](https://pytorch.org/) framework.
@@ -17,29 +20,27 @@ Installation only requires a few minutes.
## Quick Start

#### Input
* either a **count matrix file**:
* h5ad file
* **count matrix file**:
    * rows are peaks and columns are barcodes, in **txt** / **tsv** (sep=**"\t"**) or **csv** (sep=**","**) format
* or a **folder** containing **three files**:
* mtx **folder** containing **three files**:
    * **count file**: counts in **mtx** format, filename contains the keyword **"count"** / **"matrix"**
    * **peak file**: 1 column of peaks **chr_start_end**, filename contains the keyword **"peak"**
    * **barcode file**: 1 column of barcodes, filename contains the keyword **"barcode"**
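Either input style can be sanity-checked before running SCALE. A minimal sketch using scanpy and pandas (file names are illustrative, not part of the repository):

    import scanpy as sc
    import pandas as pd

    # h5ad input: an AnnData object (cells x peaks)
    adata = sc.read_h5ad('data.h5ad')

    # count matrix input: rows are peaks, columns are barcodes
    counts = pd.read_csv('counts.tsv', sep='\t', index_col=0)

    print(adata.shape, counts.shape)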

#### Run
with known cluster number k:

    SCALE.py -d [input] -k [k]

with cluster number k estimated by SCALE, if k is unknown:
#### Run

    SCALE.py -d [input]

if cluster number k is known:

    SCALE.py -d [input] -k [k]

#### Output
Output will be saved in the output folder, including:
* **model.pt**: saved model for reproducing results, used with the option --pretrain
* **feature.txt**: latent feature representation of each cell, used for clustering or visualization
* **cluster_assignments.txt**: cluster assignment of each cell
* **emb_tsne.txt**: 2d t-SNE embedding of each cell
* **emb_tsne.pdf**: visualization of the 2d t-SNE embeddings
* **adata.h5ad**: saved data including the Leiden cluster assignments, latent feature matrix, and UMAP results
* **umap.pdf**: visualization of the 2d UMAP embeddings
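The saved adata.h5ad can be inspected directly; a minimal sketch (paths assume the default output folder, and the field names follow the SCALE.py changes in this commit):

    import scanpy as sc

    adata = sc.read_h5ad('output/adata.h5ad')
    print(adata.obs['leiden'].value_counts())  # Leiden cluster assignments
    latent = adata.obsm['latent']              # latent feature matrix
    umap = adata.obsm['X_umap']                # 2d UMAP coordinates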

#### Imputation
Get binary imputed data in folder **binary_imputed** with option **--binary** (recommended for saving storage)
@@ -52,21 +53,13 @@ or get numerical imputed data in file **imputed_data.txt** with option **--impute**
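In this version the imputed values are also stored in the saved adata.h5ad (see the SCALE.py changes below); a hedged access sketch:

    import scanpy as sc

    adata = sc.read_h5ad('output/adata.h5ad')
    imputed = adata.obsm['impute'] if 'impute' in adata.obsm else None  # with --impute
    binary = adata.obsm['binary'] if 'binary' in adata.obsm else None   # with --binary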
#### Useful options
* save results in a specific folder: [-o] or [--outdir]
* embed feature by tSNE or UMAP: [--emb] TSNE/UMAP
* filter rare peaks if peak quality is not good or there are too many peaks, default is 0.01: [-x]
* filter low-quality cells by number of valid peaks, default 100: [--min_peaks]
* embed feature by tSNE or UMAP: [--embed] tSNE/UMAP
* filter low-quality cells by number of valid peaks, default 100: [--min_peaks]
* filter low-quality peaks by number of valid cells, default 10: [--min_cells]
* modify the initial learning rate, default is 0.002: [--lr]
* change the number of iterations by watching the convergence of loss, default is 30000: [-i] or [--max_iter]
* change the random seed for parameter initialization, default is 18: [--seed]
* binarize the imputation values: [--binary]
* run with a scRNA-seq dataset: [--log_transform]
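For example, a hypothetical run combining several of these options (input path illustrative):

    SCALE.py -d data.h5ad -k 8 -o results --min_peaks 600 --lr 0.0002 --embed UMAP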

#### Note
If you come across NaN loss:
* try another random seed
* filter peaks with a harsher threshold, e.g. -x 0.04 or 0.06
* filter low-quality cells, e.g. --min_peaks 400 or 600
* change the initial learning rate, e.g. --lr 0.0002


#### Help
@@ -97,14 +90,14 @@ Use functions in SCALE packages.


#### Data availability
* [Forebrain](http://zhanglab.net/SCALE_SOURCE_DATA/Forebrain.tar)
* [Splenocyte](http://zhanglab.net/SCALE_SOURCE_DATA/Splenocyte.tar)
* [mouse_atlas](http://zhanglab.net/SCALE_SOURCE_DATA/mouse_atlas.tar)
* [InSilico](http://zhanglab.net/SCALE_SOURCE_DATA/InSilico.tar)
* [Leukemia](http://zhanglab.net/SCALE_SOURCE_DATA/Leukemia.tar)
* [GM12878vsHEK](http://zhanglab.net/SCALE_SOURCE_DATA/GM12878vsHEK.tar)
* [GM12878vsHL](http://zhanglab.net/SCALE_SOURCE_DATA/GM12878vsHL.tar)
* [Breast_Tumor](http://zhanglab.net/SCALE_SOURCE_DATA/Breast_Tumor.tar)
* [Forebrain](http://zhanglab.net/SCALE_SOURCE_DATA/Forebrain.h5ad)
* [Splenocyte](http://zhanglab.net/SCALE_SOURCE_DATA/Splenocyte.h5ad)
* [mouse_atlas](http://zhanglab.net/SCALE_SOURCE_DATA/mouse_atlas.h5ad)
* [InSilico](http://zhanglab.net/SCALE_SOURCE_DATA/InSilico.h5ad)
* [Leukemia](http://zhanglab.net/SCALE_SOURCE_DATA/Leukemia.h5ad)
* [GM12878vsHEK](http://zhanglab.net/SCALE_SOURCE_DATA/GM12878vsHEK.h5ad)
* [GM12878vsHL](http://zhanglab.net/SCALE_SOURCE_DATA/GM12878vsHL.h5ad)
* [Breast_Tumor](http://zhanglab.net/SCALE_SOURCE_DATA/Breast_Tumor.h5ad)


## Reference
139 changes: 69 additions & 70 deletions SCALE.py
@@ -20,44 +20,44 @@
import numpy as np
import pandas as pd
import os
import scanpy as sc
import argparse

from scale import SCALE
from scale.dataset import SingleCellDataset
from scale.dataset import load_dataset
from scale.utils import read_labels, cluster_report, estimate_k, binarization
from scale.plot import plot_embedding

from sklearn.preprocessing import MaxAbsScaler
from sklearn.cluster import KMeans
from torch.utils.data import DataLoader


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='SCALE: Single-Cell ATAC-seq Analysis via Latent feature Extraction')
    parser.add_argument('--dataset', '-d', type=str, help='input dataset path')
    parser.add_argument('--data_list', '-d', type=str, nargs='+', default=[])
    parser.add_argument('--n_centroids', '-k', type=int, help='cluster number')
    parser.add_argument('--outdir', '-o', type=str, default='output/', help='Output path')
    parser.add_argument('--verbose', action='store_true', help='Print loss of training process')
    parser.add_argument('--pretrain', type=str, default=None, help='Load the trained model')
    parser.add_argument('--lr', type=float, default=0.002, help='Learning rate')
    parser.add_argument('--lr', type=float, default=0.0002, help='Learning rate')
    parser.add_argument('--batch_size', '-b', type=int, default=32, help='Batch size')
    parser.add_argument('--gpu', '-g', default=0, type=int, help='Select gpu device number when training')
    parser.add_argument('--seed', type=int, default=18, help='Random seed for repeat results')
    parser.add_argument('--encode_dim', type=int, nargs='*', default=[1024, 128], help='encoder structure')
    parser.add_argument('--decode_dim', type=int, nargs='*', default=[], help='decoder structure')
    parser.add_argument('--latent', '-l', type=int, default=10, help='latent layer dim')
    parser.add_argument('--low', '-x', type=float, default=0.01, help='Remove low ratio peaks')
    parser.add_argument('--high', type=float, default=0.9, help='Remove high ratio peaks')
    parser.add_argument('--min_peaks', type=float, default=100, help='Remove low quality cells with few peaks')
    parser.add_argument('--min_cells', type=float, default=0.01, help='Remove low quality peaks')
    parser.add_argument('--log_transform', action='store_true', help='Perform log2(x+1) transform')
    parser.add_argument('--max_iter', '-i', type=int, default=30000, help='Max iteration')
    parser.add_argument('--weight_decay', type=float, default=5e-4)
    parser.add_argument('--impute', action='store_true', help='Save the imputed data')
    parser.add_argument('--binary', action='store_true', help='Save binary imputed data')
    # parser.add_argument('--no_tsne', action='store_true', help='Not save the tsne embedding')
    parser.add_argument('--emb', type=str, default='UMAP')
    parser.add_argument('--reference', '-r', default=None, type=str, help='Reference celltypes')
    parser.add_argument('--transpose', '-t', action='store_true', help='Transpose the input matrix')
    parser.add_argument('--impute', action='store_true', help='Save the imputed data in layer impute')
    parser.add_argument('--binary', action='store_true', help='Save binary imputed data in layer binary')
    parser.add_argument('--embed', type=str, default='UMAP')
    parser.add_argument('--reference', type=str, default='celltype')
    parser.add_argument('--cluster_method', type=str, default='leiden')

    args = parser.parse_args()

@@ -72,35 +72,41 @@
    else:
        device = 'cpu'
    batch_size = args.batch_size

    print("\n**********************************************************************")
    print("  SCALE: Single-Cell ATAC-seq Analysis via Latent feature Extraction")
    print("**********************************************************************\n")

    normalizer = MaxAbsScaler()
    dataset = SingleCellDataset(args.dataset, low=args.low, high=args.high, min_peaks=args.min_peaks,
                                transpose=args.transpose, transforms=[normalizer.fit_transform])
    trainloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
    testloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, drop_last=False)

    cell_num = dataset.shape[0]
    input_dim = dataset.shape[1]
    adata, trainloader, testloader = load_dataset(
        args.data_list,
        batch_categories=None,
        join='inner',
        batch_key='batch',
        batch_name='batch',
        min_genes=args.min_peaks,
        min_cells=args.min_cells,
        batch_size=args.batch_size,
        log=None,
    )
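    # load_dataset (per this diff) returns the filtered AnnData plus shuffled
    # train / ordered test DataLoaders; min_genes (passed min_peaks here) and
    # min_cells control cell- and peak-level filtering.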

    cell_num = adata.shape[0]
    input_dim = adata.shape[1]

    if args.n_centroids is None:
        k = min(estimate_k(dataset.data.T), 30)
        print('Estimate k {}'.format(k))
        k = estimate_k(adata.X.T)
        print('Estimate k = {}'.format(k))
    else:
        k = args.n_centroids
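    # k: cluster number for the model, estimated from adata.X when -k is not
    # supplied on the command line.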
    lr = args.lr
    name = args.dataset.strip('/').split('/')[-1]
    args.min_peaks = int(args.min_peaks) if args.min_peaks >= 1 else args.min_peaks

    outdir = args.outdir

    outdir = args.outdir + '/'
    if not os.path.exists(outdir):
        os.makedirs(outdir)

    print("\n**********************************************************************")
    print("  SCALE: Single-Cell ATAC-seq Analysis via Latent feature Extraction")
    print("**********************************************************************\n")
    print("======== Parameters ========")
    print('Cell number: {}\nPeak number: {}\nn_centroids: {}\nmax_iter: {}\nbatch_size: {}\ncell filter by peaks: {}\nrare peak filter: {}\ncommon peak filter: {}'.format(
        cell_num, input_dim, k, args.max_iter, batch_size, args.min_peaks, args.low, args.high))
    print("\n======== Parameters ========")
    print('Cell number: {}\nPeak number: {}\nn_centroids: {}\nmax_iter: {}\nbatch_size: {}\ncell filter by peaks: {}\npeak filter by cells: {}'.format(
        cell_num, input_dim, k, args.max_iter, batch_size, args.min_peaks, args.min_cells))
    print("============================")

    dims = [input_dim, args.latent, args.encode_dim, args.decode_dim]
@@ -116,53 +122,46 @@
            verbose=args.verbose,
            device=device,
            max_iter=args.max_iter,
            name=name,
            # name=name,
            outdir=outdir
        )
        # torch.save(model.to('cpu').state_dict(), os.path.join(outdir, 'model.pt'))  # save model
        torch.save(model.state_dict(), os.path.join(outdir, 'model.pt'))  # save model
    else:
        print('\n## Loading Model: {}\n'.format(args.pretrain))
        model.load_model(args.pretrain)
        model.to(device)

    ### output ###
    print('outdir: {}'.format(outdir))
    # 1. latent feature
    feature = model.encodeBatch(testloader, device=device, out='z')
    pd.DataFrame(feature, index=dataset.barcode).to_csv(os.path.join(outdir, 'feature.txt'), sep='\t', header=False)

    # 2. cluster assignments
    pred = model.predict(testloader, device)
    pd.Series(pred, index=dataset.barcode).to_csv(os.path.join(outdir, 'cluster_assignments.txt'), sep='\t', header=False)

    # 3. imputed data
    if args.impute or args.binary:
        recon_x = model.encodeBatch(testloader, device, out='x', transforms=[normalizer.inverse_transform])
        if args.binary:
            import scipy
            print("Saving binary imputed data")
            recon_x = binarization(recon_x, dataset.data).T
            imputed_dir = outdir+'/binary_imputed/'
            os.makedirs(imputed_dir, exist_ok=True)
            scipy.io.mmwrite(imputed_dir+'count.mtx', recon_x)
            pd.Series(dataset.peaks).to_csv(imputed_dir+'peak.txt', sep='\t', index=False, header=None)
            pd.Series(dataset.barcode).to_csv(imputed_dir+'barcode.txt', sep='\t', index=False, header=None)
        elif args.impute:
            print("Saving imputed data")
            recon_x = pd.DataFrame(recon_x.T, index=dataset.peaks, columns=dataset.barcode)
            recon_x.to_csv(os.path.join(outdir, 'imputed_data.txt'), sep='\t')

    # torch.save(model.to('cpu').state_dict(), os.path.join(outdir, 'model.pt'))  # save model
    adata.obsm['latent'] = model.encodeBatch(testloader, device=device, out='z')

    # 2. cluster
    sc.pp.neighbors(adata, n_neighbors=30, use_rep='latent')
    if args.cluster_method == 'leiden':
        sc.tl.leiden(adata)
    elif args.cluster_method == 'kmeans':
        kmeans = KMeans(n_clusters=k, n_init=20, random_state=0)
        adata.obs['kmeans'] = kmeans.fit_predict(adata.obsm['latent']).astype(str)
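    # note: leiden chooses the number of clusters itself from the neighbor
    # graph built on obsm['latent']; kmeans uses the k estimated or given above.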

    # if args.reference in adata.obs:
    #     cluster_report(adata.obs[args.reference].cat.codes, adata.obs[args.cluster_method].astype(int))

    sc.settings.figdir = outdir
    sc.set_figure_params(dpi=80, figsize=(6,6), fontsize=10)
    if args.embed == 'UMAP':
        sc.tl.umap(adata, min_dist=0.1)
        color = [c for c in ['celltype', args.cluster_method] if c in adata.obs]
        sc.pl.umap(adata, color=color, save='.pdf', wspace=0.4, ncols=4)
    elif args.embed == 'tSNE':
        sc.tl.tsne(adata, use_rep='latent')
        color = [c for c in ['celltype', args.cluster_method] if c in adata.obs]
        sc.pl.tsne(adata, color=color, save='.pdf', wspace=0.4, ncols=4)
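    # note: both branches color the plot by 'celltype' (when present in
    # adata.obs) and by the chosen cluster_method, saving a .pdf into figdir.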

    # if not args.no_tsne:
    print("Plotting embedding")
    if args.reference:
        ref = pd.read_csv(args.reference, sep='\t', header=None, index_col=0)[1]
        labels = ref.reindex(dataset.barcode, fill_value='unknown')
    else:
        labels = pred
    plot_embedding(feature, labels, method=args.emb,
                   save=os.path.join(outdir, 'emb_{}.pdf'.format(args.emb)),
                   save_emb=os.path.join(outdir, 'emb_{}.txt'.format(args.emb)))
    # plot_embedding(feature, labels,
    #                save=os.path.join(outdir, 'tsne.pdf'), save_emb=os.path.join(outdir, 'tsne.txt'))

    if args.impute:
        adata.obsm['impute'] = model.encodeBatch(testloader, device=device, out='x')
    if args.binary:
        adata.obsm['impute'] = model.encodeBatch(testloader, device=device, out='x')
        adata.obsm['binary'] = binarization(adata.obsm['impute'], adata.X)
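    # note: with --binary the reconstruction is recomputed and then binarized
    # against the observed adata.X, so obsm can hold both matrices.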

    adata.write(outdir+'adata.h5ad', compression='gzip')