KnowEnG's Data Cleanup Pipeline

This is the Data Cleanup Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence. The pipeline cleans up the data in a given spreadsheet for subsequent processing by the KnowEnG Analytics Platform.

Detailed cleanup steps for each pipeline

geneset_characterization_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet is empty, reject.
if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only non-negative real values, accept. If not, reject.
if the spreadsheet contains a NaN value, reject.
if the spreadsheet contains duplicate column names, remove the duplicated columns.
if the spreadsheet contains duplicate row names, remove the duplicated rows.
if the spreadsheet gene names can be mapped to Ensembl gene names, generate the mapping files.
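
The checks above can be sketched with pandas roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function name `check_geneset_spreadsheet` is hypothetical, `None` stands in for "reject", keeping the first occurrence of a duplicated name is an assumption, and the gene name mapping step is left out.

```python
import pandas as pd


def check_geneset_spreadsheet(df):
    """Return a cleaned copy of df, or None if the spreadsheet is rejected.

    Hypothetical helper that only illustrates the listed checks.
    """
    # Remove rows and columns that are entirely empty.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    if df.empty:
        return None  # empty spreadsheet: reject

    # Drop every column that still contains a NaN value.
    df = df.dropna(axis=1, how="any")
    if df.empty:
        return None

    # Reject unless every remaining value is a non-negative real number.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    if numeric.isna().any().any() or (numeric < 0).any().any():
        return None

    # Keep the first occurrence of each duplicated column and row name
    # (which copy to keep is an assumption of this sketch).
    numeric = numeric.loc[:, ~numeric.columns.duplicated()]
    numeric = numeric.loc[~numeric.index.duplicated()]
    return numeric
```

A caller would treat a `None` result as a rejected spreadsheet and otherwise continue with the returned dataframe.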

samples_clustering_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only real, positive values, accept. If not, reject.
if the spreadsheet contains NaN values in the gene names, remove the corresponding rows.
if the spreadsheet contains duplicate column names, remove the duplicate columns.
if the spreadsheet contains duplicate row names, remove the duplicate rows.
map the spreadsheet gene names to Ensembl names and generate the mapping files.

If the user provides network data, check:

if the network data is empty, reject.
if the network data cannot be intersected with the genomic spreadsheet, reject.

If the user provides phenotype data, after removing empty rows and columns, check:

if the phenotypic data cannot be intersected with the genomic spreadsheet, reject.
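
The intersection checks above amount to asking whether two sets of names overlap. A minimal sketch, assuming the comparison is made on sample names (for phenotype data) or gene names (for network data); the function name `intersect_or_reject` is hypothetical and `None` stands in for "reject":

```python
def intersect_or_reject(spreadsheet_names, other_names):
    """Return the shared names in spreadsheet order, or None to signal rejection."""
    other_set = set(other_names)
    common = [name for name in spreadsheet_names if name in other_set]
    # An empty intersection means the two inputs cannot be combined.
    return common or None
```

For example, a spreadsheet with samples `s1`, `s2`, `s3` and phenotype rows for `s3`, `s1`, `s9` share `s1` and `s3`, so the data would be accepted.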

gene_prioritization_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

for the spreadsheet, based on the impute option the user selected:
  a. reject: reject the user spreadsheet if it contains an NA value.
  b. average: replace each NA value with the mean of its row.
  c. remove: drop every column that contains an NA value.
if the genomic or phenotypic data is empty, reject.
if a spreadsheet column contains NaN values, drop the corresponding columns.
if the spreadsheet contains only real values, accept. If not, reject.
if the spreadsheet contains NaN values in the gene names, remove the corresponding rows.
if the spreadsheet contains duplicate column names, remove the duplicate columns.
if the spreadsheet contains duplicate row names, remove the duplicate rows.
map the spreadsheet gene names to Ensembl names and generate the mapping files.
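
The three impute options can be illustrated with a short pandas sketch. The option names `reject`, `average` and `remove` come from the list above; the function name `impute_na` and the use of `None` to signal rejection are assumptions made for this example only.

```python
import pandas as pd


def impute_na(df, option):
    """Apply one impute option; return None when the spreadsheet is rejected."""
    if option == "reject":
        # Reject the spreadsheet as soon as any NA value is present.
        return None if df.isna().any().any() else df
    if option == "average":
        # Replace each NA with the mean of the non-NA values in its row.
        row_means = df.mean(axis=1)
        return df.apply(lambda col: col.fillna(row_means))
    if option == "remove":
        # Drop every column that contains at least one NA.
        return df.dropna(axis=1, how="any")
    raise ValueError("unknown impute option: " + repr(option))
```

With `average`, a row holding 1.0, NA and 3.0 becomes 1.0, 2.0 and 3.0, because the row mean is computed over the non-NA values only.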

After removing empty rows and columns from the phenotype data, check:

for t_test, check whether the phenotypic data satisfies the following conditions:
  a. if the number of unique values/categories is less than 2, reject.
  b. if the number of elements per category is less than 2, reject.
  c. expand the phenotypic data and keep the original NAs.
for the pearson test, check whether the phenotypic data contains only numeric values. If not, reject.
for every single drug:
  1. drop NAs for the current drug.
  2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list. Repeat until all drugs have been processed.
  3. check whether the common list returned by step 2 is empty. If it is empty, reject.
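
Conditions a and b of the t_test check can be sketched with the standard library. The function name `t_test_phenotype_ok` is hypothetical, NA entries are represented by `None`, and ignoring NAs while counting categories is an assumption of this sketch; the expansion in condition c is not shown.

```python
from collections import Counter


def t_test_phenotype_ok(values):
    """Return True if one phenotype column passes conditions a and b."""
    # Count how many elements fall into each category, skipping NA entries.
    counts = Counter(v for v in values if v is not None)
    if len(counts) < 2:
        return False  # condition a: fewer than two categories
    # Condition b: every category needs at least two elements.
    return all(n >= 2 for n in counts.values())
```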

pasted_gene_list

After removing empty rows and columns from the user spreadsheet data, check:

if the input gene names contain NaN values, remove the corresponding rows.
cast the index of the input genes dataframe to string type.
retrieve the gene mapping status from the database and add a status column to the existing dataframe.
if the dataframe from the previous step intersects with the universal genes list from the Redis database, mark the intersected genes with the value 1, and all others with 0.
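
The pasted gene list steps can be pictured with plain Python containers. In this sketch a dictionary and a set stand in for the database lookups, the identifiers are made-up placeholders, the function name `mark_pasted_genes` is hypothetical, and comparing the mapped identifier against the universal list is an assumption about how the intersection is made.

```python
def mark_pasted_genes(gene_names, mapping, universe):
    """Return (name, mapped id or None, flag) rows for a pasted gene list.

    mapping and universe are in-memory stand-ins for the database lookups.
    """
    rows = []
    for name in gene_names:
        if name is None:
            continue  # drop rows whose gene name is missing
        name = str(name)  # cast every gene name to a string
        mapped = mapping.get(name)  # mapping status: None when unmapped
        flag = 1 if mapped in universe else 0
        rows.append((name, mapped, flag))
    return rows


rows = mark_pasted_genes(["geneA", None, 42], {"geneA": "ID_A"}, {"ID_A"})
```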

general_clustering_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only real, positive values, accept. If not, reject.
if the spreadsheet contains NaN values in the gene names, remove the corresponding rows.
if the spreadsheet contains NaN values in the header, remove the corresponding columns.
if the spreadsheet contains duplicate row names, remove the duplicate rows.
if the spreadsheet contains duplicate column names, remove the duplicate columns.

If the user provides phenotype data, after removing empty rows and columns, check:

if the phenotypic spreadsheet contains duplicate column names, remove the duplicate columns.
if the phenotypic spreadsheet contains duplicate row names, remove the duplicate rows.
if the phenotypic spreadsheet intersects with the genomic spreadsheet, accept. If not, reject.

signature_analysis_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only positive, real values, accept. If not, reject.
if the spreadsheet contains duplicate row names, reject.
if the spreadsheet contains duplicate column names, reject.
if the spreadsheet contains at least two unique values per column, accept. If not, reject.
map the spreadsheet gene names to Ensembl names and generate the mapping files.
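
Unlike the clustering pipelines, this pipeline rejects duplicate names instead of removing them. A minimal sketch of the duplicate and unique-value checks only, using plain lists; the function name `signature_spreadsheet_ok` and the list-of-columns layout are assumptions for illustration:

```python
def signature_spreadsheet_ok(row_names, column_names, columns):
    """columns holds one list of values per entry in column_names."""
    # Duplicate row or column names lead to rejection.
    if len(set(row_names)) != len(row_names):
        return False
    if len(set(column_names)) != len(column_names):
        return False
    # Every column must contain at least two unique values.
    return all(len(set(values)) >= 2 for values in columns)
```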

After removing empty rows and columns from the signature data, check:

whether the signature data can be intersected with the spreadsheet.

If the user provides network data, check:

whether the unique genes in the network data intersect with the signature data and the spreadsheet data.

feature_prioritization_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

based on the impute option the user selected:
  a. reject: reject the user spreadsheet if it contains an NA value.
  b. average: replace each NA value with the mean of its row.
  c. remove: drop every column that contains an NA value.
if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only real values, accept. If not, reject.
if correlation_measure is t_test, perform phenotype expansion.

After removing empty rows and columns from the phenotype data, check:

for t_test, check whether the phenotypic data satisfies the following conditions:
  a. if the number of unique values/categories is less than 2, reject.
  b. if the number of elements per category is less than 2, reject.
  c. expand the phenotypic data and keep the original NAs.
for the pearson test, check whether the phenotypic data contains only numeric values. If not, reject.
for every single drug:
  1. drop NAs for the current drug.
  2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list. Repeat until all drugs have been processed.
  3. check whether the common list returned by step 2 is empty. If it is empty, reject.

phenotype_prediction_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only real values, accept. If not, reject.
if the spreadsheet contains duplicate row names, remove the duplicate rows.
if the spreadsheet contains duplicate column names, remove the duplicate columns.
map the spreadsheet gene names to Ensembl names and generate the mapping files.

After removing empty rows and columns from the phenotype data, check:

whether the phenotypic data intersects with the spreadsheet on phenotype.
whether the phenotypic data for the pearson test contains only real values or NaN.
for every single drug:
  1. drop NAs for the current drug.
  2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list. Repeat until all drugs have been processed.
  3. check whether the common list returned by step 2 is empty. If it is empty, reject.
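
The per-drug loop described in steps 1 to 3 can be sketched as follows. The nested dictionary layout (drug name to sample name to value, with `None` for NA), the function name `drugs_with_common_samples`, and the reading of "header" as the spreadsheet's sample names are all assumptions made for this illustration.

```python
def drugs_with_common_samples(phenotype, spreadsheet_header):
    """Return drugs sharing a non-NA sample with the header, or None to reject."""
    header = set(spreadsheet_header)
    common_drugs = []
    for drug, responses in phenotype.items():
        # Step 1: drop NA entries for the current drug.
        samples = {s for s, v in responses.items() if v is not None}
        # Step 2: keep the drug if its remaining samples intersect the header.
        if samples & header:
            common_drugs.append(drug)
    # Step 3: an empty common list means the data is rejected.
    return common_drugs or None


phenotype = {"d1": {"s1": 0.5, "s2": None}, "d2": {"s9": 1.0}, "d3": {"s2": None}}
kept = drugs_with_common_samples(phenotype, ["s1", "s2"])
```

Here only `d1` is kept: `d2` has no sample in the header, and the only value for `d3` is NA.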

simplified_inpherno_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the expression_sample data contains only real values, accept. If not, reject.
if the expression_sample data's gene names can be mapped to Ensembl gene names, generate the mapping files.

After removing empty rows and columns from the Pvalue_gene_phenotype data, check:

if the Pvalue_gene_phenotype data contains only real values, accept. If not, reject.
if the Pvalue_gene_phenotype data's gene names can be mapped to Ensembl gene names, generate the mapping files.

After removing empty rows and columns from the TFexpression data, check:

if the TFexpression data contains only real values and no NA values, accept. If not, reject.
if the TFexpression data's gene names can be mapped to Ensembl gene names, generate the mapping files.

How to run this pipeline with our data

1. Clone the Data_Cleanup_Pipeline Repo

 git clone https://github.com/KnowEnG/Data_Cleanup_Pipeline.git

2. Install the following (Ubuntu or Linux)

 apt-get install -y python3-pip
 apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
 pip3 install numpy
 pip3 install pandas
 pip3 install scipy==0.19.1
 pip3 install scikit-learn==0.19.2
 apt-get install -y libfreetype6-dev libxft-dev
 pip3 install xmlrunner
 pip3 install pyyaml
 pip3 install knpackage
 pip3 install redis

3. Change directory to Data_Cleanup_Pipeline

cd Data_Cleanup_Pipeline

4. Change directory to test

cd test

5. Create a local directory "run_dir" and place all the run files in it

make env_setup

6. Use one of the following "make" commands to select and run a data cleanup pipeline

Command	Option
make run_data_cleaning	example test with large dataset
make run_samples_clustering_pipeline	samples clustering test
make run_gene_prioritization_pipeline_pearson	pearson correlation test
make run_gene_prioritization_pipeline_t_test	t-test correlation test
make run_geneset_characterization_pipeline	geneset characterization test
make run_general_clustering_pipeline	general clustering test
make run_pasted_gene_list	pasted gene list test
make run_phenotype_prediction_pipeline	phenotype prediction pipeline test
make run_feature_prioritization_pipeline	feature prioritization pipeline test
make run_signature_analysis_pipeline	signature analysis pipeline test
make run_simplified_inpherno_pipeline	simplified_inpherno_pipeline test

How to run this pipeline with your data

Follow steps 1-4 above then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_parameters file (YAML Format)

Look for examples of run_parameters in ./Data_Cleanup_Pipeline/data/run_files/TEMPLATE_data_cleanup.yml

* Modify run_parameters file (YAML Format)

Set the spreadsheet and drug_response (phenotype data) file names to point to your data.

* Run the Data Cleanup Pipeline:

Update the PYTHONPATH environment variable

export PYTHONPATH='../src':$PYTHONPATH

Run (these relative paths assume you are in the test directory with setup as described above)

python3 ../src/data_cleanup.py -run_directory ./run_dir -run_file TEMPLATE_data_cleanup.yml

Description of "run_parameters" file

Key	Value	Comments
pipeline_type	gene_prioritization_pipeline, ...	Choose pipeline cleaning type
spreadsheet_name_full_path	directory+spreadsheet_name	Path and file name of user genomic spreadsheet
phenotype_full_path	directory+phenotype_data_name	Path and file name of user phenotypic spreadsheet
gg_network_name_full_path	directory+gg_network_name	Path and file name of user network
results_directory	directory	Directory to save the output files
redis_credential	host, password and port	Credential to access gene names lookup
taxonid	9606	Taxon id of the genes
source_hint	' '	Hint for lookup ensembl names
correlation_measure	t_test/pearson	Correlation measure for the gene prioritization pipeline

spreadsheet_name_full_path = TEST_1_gene_expression.tsv
phenotype_full_path = TEST_1_phenotype.tsv

Description of Output files saved in results directory

Output files

input_file_name_ETL.tsv. Input file after Extract Transform Load (cleaning)

input_file_name_MAP.tsv.

(translated gene)	(input gene name)
ENS00000012345	abc_def_er
...	...
ENS00000054321	def_org_ifi

input_file_name_UNMAPPED.tsv.

(input gene name)	(unmapped-none)
abcd_iffe	unmapped-none
...	...
abdcefg_hijk	unmapped-none
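
The naming pattern of the three output files can be sketched as below. This assumes that "input_file_name" means the input file's base name without its extension, which the listing above suggests but does not state; the function name `output_file_names` is hypothetical.

```python
import os


def output_file_names(input_path, results_directory):
    """Build the ETL, MAP and UNMAPPED output paths for one input file."""
    # Assumed: the stem is the input file name without directory or extension.
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return {
        suffix: os.path.join(results_directory, stem + "_" + suffix + ".tsv")
        for suffix in ("ETL", "MAP", "UNMAPPED")
    }


names = output_file_names("data/TEST_1_gene_expression.tsv", "results")
```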
