KnowEnG's Data Cleanup Pipeline

This is the Data Cleanup Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence. The pipeline cleans up the data in a given spreadsheet for subsequent processing by the KnowEnG Analytics Platform.

Detailed cleanup steps for each pipeline

geneset_characterization_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

  1. if the spreadsheet is empty, reject.
  2. if the spreadsheet contains NaN values, drop the corresponding columns.
  3. if the spreadsheet contains only non-negative real values, accept. If not, reject.
  4. if the spreadsheet contains NaN values, reject.
  5. if the spreadsheet contains duplicate column names, remove the duplicate columns.
  6. if the spreadsheet contains duplicate row names, remove the duplicate rows.
  7. if the spreadsheet gene names can be mapped to Ensembl gene names, generate the mapping files.
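
The spreadsheet checks above can be sketched with pandas roughly as follows. This is a minimal illustration and not the pipeline's actual code: the function name `clean_geneset_spreadsheet` is hypothetical, the gene-name mapping step is left out, and "remove the duplicate" is assumed to mean keeping the first occurrence.

```python
import pandas as pd


def clean_geneset_spreadsheet(df):
    """Return a cleaned copy of df, or None if the spreadsheet is rejected."""
    # Remove rows and columns that are entirely empty.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    if df.empty:
        return None
    # Drop every column that still contains a NaN value.
    df = df.dropna(axis=1, how="any")
    if df.empty:
        return None
    # Reject unless every value is numeric and non-negative.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    if numeric.isna().any().any() or (numeric < 0).any().any():
        return None
    # Assumption: keep the first occurrence of each duplicated column/row name.
    numeric = numeric.loc[:, ~numeric.columns.duplicated()]
    numeric = numeric.loc[~numeric.index.duplicated()]
    return numeric
```

A return value of `None` stands for "reject"; any other return value is the cleaned spreadsheet.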

samples_clustering_pipeline

After removing empty rows and columns for user spreadsheet data, check:

  1. if spreadsheet contains NaN value/s, drop the corresponding columns.
  2. if spreadsheet contains only real, positive values, accept. If not, reject.
  3. if spreadsheet contains NaN value in gene name, remove corresponding rows.
  4. if spreadsheet contains duplicate column name, remove duplicate columns.
  5. if spreadsheet contains duplicate row name, remove duplicate rows.
  6. map spreadsheet gene names to Ensembl names and generate the mapping files.

If the user provides network data, check:

  1. if the network data is empty, reject.
  2. if the network data cannot be intersected with the genomic spreadsheet, reject.

If the user provides phenotype data, after removing empty rows and columns, check:

  1. if the phenotypic data cannot be intersected with the genomic spreadsheet, reject.
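
The two intersection checks might look like the sketch below. The function names are hypothetical, and the data layout is an assumption: samples are taken to be spreadsheet columns and phenotype rows, and the network is taken to be an edge list with gene names in its first two columns.

```python
import pandas as pd


def shared_samples(spreadsheet_df, phenotype_df):
    """Return the sample names found in both tables (empty list means reject)."""
    # Assumption: samples are spreadsheet columns and phenotype rows.
    return sorted(set(spreadsheet_df.columns) & set(phenotype_df.index))


def network_overlaps(spreadsheet_df, network_df):
    """True if any gene named in the network is also a spreadsheet row name."""
    if network_df.empty:
        return False  # empty network data is rejected outright
    # Assumption: the network is an edge list with genes in its first two columns.
    network_genes = set(network_df.iloc[:, 0]) | set(network_df.iloc[:, 1])
    return bool(network_genes & set(spreadsheet_df.index))
```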

gene_prioritization_pipeline

After removing empty rows and columns for user spreadsheet data, check :

  1. for the spreadsheet, based on the impute option the user selected:
     a. reject: reject the user spreadsheet if it contains NA values.
     b. average: replace each NA value with the mean of its row.
     c. remove: drop every column that contains an NA value.
  2. if the genomic or phenotypic data is empty, reject.
  3. if a spreadsheet column contains NaN values, drop the corresponding columns.
  4. if the spreadsheet contains only real values, accept. If not, reject.
  5. if the spreadsheet contains NaN in a gene name, remove the corresponding rows.
  6. if the spreadsheet contains duplicate column names, remove the duplicate columns.
  7. if the spreadsheet contains duplicate row names, remove the duplicate rows.
  8. map spreadsheet gene names to Ensembl names and generate the mapping files.
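
The three impute options in step 1 can be illustrated with a small pandas sketch. The function name `impute_spreadsheet` is hypothetical and the code only approximates the behaviour described above; it is not taken from the pipeline source.

```python
import pandas as pd


def impute_spreadsheet(df, option):
    """Apply one impute option; return None when the spreadsheet is rejected."""
    if option == "reject":
        return None if df.isna().any().any() else df
    if option == "average":
        # Fill each NaN with the mean of the non-missing values in its row.
        return df.T.fillna(df.mean(axis=1)).T
    if option == "remove":
        # Drop every column that contains at least one NaN.
        return df.dropna(axis=1, how="any")
    raise ValueError(f"unknown impute option: {option!r}")
```

The "average" branch transposes the table so that `fillna` can match row means to row names; it assumes the row names are unique at this point.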

After removing empty rows and columns from the phenotype data, check:

  1. for t_test, check that the phenotypic data satisfies the following conditions:
     a. if the number of unique values/categories < 2, reject.
     b. if the number of elements per category < 2, reject.
     c. expand the phenotypic data and keep the original NAs.
  2. for the Pearson test, check that the phenotypic data contains only numeric values. If not, reject.
  3. for every single drug:
     1. drop NA values for the current drug.
     2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list. Repeat until all drugs have been iterated over.
     3. check if the common list returned by step 2 is empty. If it is empty, reject.
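
One possible reading of the t_test conditions is sketched below for a single phenotype column. It is an interpretation, not the pipeline's implementation: "expand" is assumed to mean one 0/1 indicator column per category, with NaN kept wherever the original value was missing. The function name and the `name_category` column naming are hypothetical.

```python
import pandas as pd


def check_and_expand_phenotype(series):
    """Validate one categorical phenotype and expand it; None means reject."""
    counts = series.dropna().value_counts()
    # Reject if there are fewer than 2 categories or any category has < 2 members.
    if len(counts) < 2 or (counts < 2).any():
        return None
    expanded = pd.DataFrame(index=series.index)
    for category in sorted(counts.index):
        indicator = (series == category).astype(float)
        # Keep the original missing values as NaN instead of 0.
        expanded[f"{series.name}_{category}"] = indicator.where(series.notna())
    return expanded
```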

pasted_gene_list

After removing empty rows and columns in user spreadsheet data, check:

  1. if the input gene names contain NaN values, remove the corresponding rows.
  2. cast the index of the input genes dataframe to string type.
  3. retrieve the gene mapping status from the database and add a status column to the existing dataframe.
  4. if the dataframe from step 3 intersects with the universal genes list from the Redis database, mark the intersected genes with value 1, else 0.
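
The four steps can be sketched as below, with a plain dictionary and a set standing in for the Redis lookups. Everything here is illustrative: `mark_gene_status`, `name_to_ensembl`, `universe`, and the column name `in_universe` are hypothetical, and the real pipeline queries Redis rather than in-memory objects.

```python
import pandas as pd


def mark_gene_status(input_df, name_to_ensembl, universe):
    """Add a mapping status and a 0/1 universe flag for each input gene."""
    # Step 1: drop rows whose gene name (the index) is missing.
    df = input_df.loc[input_df.index.notna()].copy()
    # Step 2: cast the index to string type.
    df.index = df.index.astype(str)
    # Step 3: look up each gene; a dict stands in for the database here.
    df["status"] = [name_to_ensembl.get(name, "unmapped-none") for name in df.index]
    # Step 4: flag genes whose mapped name is in the universal gene list.
    df["in_universe"] = [int(status in universe) for status in df["status"]]
    return df
```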

general_clustering_pipeline

After removing empty rows and columns for user spreadsheet data, check :

  1. if spreadsheet contains NaN value/s, drop the corresponding columns.
  2. if spreadsheet contains only real, positive values, accept. If not, reject.
  3. if spreadsheet contains NaN value in gene name, remove corresponding rows.
  4. if spreadsheet contains NaN value in header, remove corresponding columns.
  5. if spreadsheet contains duplicate row names, remove duplicate rows.
  6. if spreadsheet contains duplicate column names, remove duplicate columns.

If the user provides phenotype data, after removing empty rows and columns, check:

  1. if phenotypic spreadsheet contains duplicate column name, remove duplicate column.
  2. if phenotypic spreadsheet contains duplicate row name, remove duplicate row.
  3. if phenotypic spreadsheet intersects with the genomic spreadsheet, accept. If not, reject.

signature_analysis_pipeline

After removing empty rows and columns for user spreadsheet data, check :

  1. if spreadsheet contains NaN value/s, drop the corresponding columns.
  2. if spreadsheet contains only positive, real value, accept. If not, reject.
  3. if spreadsheet contains duplicate row names, reject.
  4. if spreadsheet contains duplicate column names, reject.
  5. if the spreadsheet contains at least two unique values per column, accept. If not, reject.
  6. map spreadsheet gene names to Ensembl names and generate the mapping files.
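
Unlike the other pipelines, this one rejects duplicate names instead of removing them. Steps 3 to 5 could be expressed as the following sketch; the function name is hypothetical and the code is only an approximation of the checks listed above.

```python
import pandas as pd


def signature_spreadsheet_ok(df):
    """Return True if the spreadsheet passes the duplicate and uniqueness checks."""
    # Steps 3 and 4: any duplicate row or column name causes a rejection.
    if df.index.duplicated().any() or df.columns.duplicated().any():
        return False
    # Step 5: every column needs at least two distinct values.
    return bool((df.nunique(axis=0) >= 2).all())
```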

After removing empty rows and columns for signature data, check :

  1. if signature data can be intersected with spreadsheet.

If the user provides network data, check:

  1. if the unique genes in the network data intersect with the signature data and the spreadsheet data.

feature_prioritization_pipeline

After removing empty rows and columns in user spreadsheet data, check :

  1. based on the impute option the user selected:
     a. reject: reject the user spreadsheet if it contains NA values.
     b. average: replace each NA value with the mean of its row.
     c. remove: drop every column that contains an NA value.
  2. if the spreadsheet contains NaN values, drop the corresponding columns.
  3. if the spreadsheet contains only real values, accept. If not, reject.
  4. if correlation_measure is t_test, perform phenotype expansion.

After removing empty rows and columns from the phenotype data, check:

  1. for t_test, check that the phenotypic data satisfies the following conditions:
     a. if the number of unique values/categories < 2, reject.
     b. if the number of elements per category < 2, reject.
     c. expand the phenotypic data and keep the original NAs.
  2. for the Pearson test, check that the phenotypic data contains only numeric values. If not, reject.
  3. for every single drug:
     1. drop NA values for the current drug.
     2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list. Repeat until all drugs have been iterated over.
     3. check if the common list returned by step 2 is empty. If it is empty, reject.
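
The per-drug loop in step 3 might be written as in this sketch. The layout is an assumption: drugs are taken to be phenotype columns and samples to be phenotype rows and spreadsheet columns. The function name is hypothetical.

```python
import pandas as pd


def drugs_with_shared_samples(spreadsheet_df, phenotype_df):
    """Return the drugs that share at least one non-NA sample with the spreadsheet."""
    spreadsheet_samples = set(spreadsheet_df.columns)
    common = []
    for drug in phenotype_df.columns:
        # Drop NA values for the current drug, then intersect the sample names.
        samples = set(phenotype_df[drug].dropna().index)
        if samples & spreadsheet_samples:
            common.append(drug)
    # An empty list corresponds to the "reject" outcome.
    return common
```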

phenotype_prediction_pipeline

After removing empty rows and columns in user spreadsheet data, check :

  1. if spreadsheet contains NaN value/s, drop the corresponding columns.
  2. if spreadsheet contains only real value, accept. If not, reject.
  3. if spreadsheet contains duplicate row names, remove duplicate rows.
  4. if spreadsheet contains duplicate column names, remove duplicate columns.
  5. map spreadsheet gene names to Ensembl names and generate the mapping files.

After removing empty rows and columns in phenotype data, check :

  1. if phenotypic data intersects with spreadsheet on phenotype.
  2. for the Pearson test, if the phenotypic data contains only real values or NaN.
  3. for every single drug:
     1. drop NA values for the current drug.
     2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list. Repeat until all drugs have been iterated over.
     3. check if the common list returned by step 2 is empty. If it is empty, reject.
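
The Pearson check in step 2 (only real values or NaN) can be approximated as below. This is a sketch with a hypothetical function name; note that `pd.to_numeric` also accepts numeric strings such as "2.5", which may or may not match what the pipeline treats as a real value.

```python
import pandas as pd


def pearson_phenotype_ok(phenotype_df):
    """Return True if every phenotype value is numeric or missing."""
    coerced = phenotype_df.apply(pd.to_numeric, errors="coerce")
    # A value that was present but became NaN after coercion is not numeric.
    non_numeric = coerced.isna() & phenotype_df.notna()
    return not non_numeric.any().any()
```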

simplified_inpherno_pipeline

After removing empty rows and columns in user spreadsheet data, check :

  1. if expression_sample data contains only real value, accept. If not, reject.
  2. if the expression_sample data's gene names can be mapped to Ensembl gene names, generate the mapping files.

After removing empty rows and columns in Pvalue gene phenotype data, check :

  1. if Pvalue_gene_phenotype data contains only real value, accept. If not, reject.
  2. if the Pvalue_gene_phenotype data's gene names can be mapped to Ensembl gene names, generate the mapping files.

After removing empty rows and columns in TF expression data, check :

  1. if the TFexpression data contains only real values and no NA, accept. If not, reject.
  2. if the TFexpression data's gene names can be mapped to Ensembl gene names, generate the mapping files.

How to run this pipeline with our data


1. Clone the Data_Cleanup_Pipeline Repo

 git clone https://github.com/KnowEnG/Data_Cleanup_Pipeline.git

2. Install the following (Ubuntu or Linux)

 apt-get install -y python3-pip
 apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
 pip3 install numpy
 pip3 install pandas
 pip3 install scipy==0.19.1
 pip3 install scikit-learn==0.19.2
 apt-get install -y libfreetype6-dev libxft-dev
 pip3 install xmlrunner
 pip3 install pyyaml
 pip3 install knpackage
 pip3 install redis

3. Change directory to Data_Cleanup_Pipeline

cd Data_Cleanup_Pipeline 

4. Change directory to test

cd test

5. Create a local directory "run_dir" and place all the run files in it

make env_setup

6. Use one of the following "make" commands to select and run a data cleanup pipeline

| Command | Option |
| --- | --- |
| make run_data_cleaning | example test with large dataset |
| make run_samples_clustering_pipeline | samples clustering test |
| make run_gene_prioritization_pipeline_pearson | pearson correlation test |
| make run_gene_prioritization_pipeline_t_test | t-test correlation test |
| make run_geneset_characterization_pipeline | geneset characterization test |
| make run_general_clustering_pipeline | general clustering test |
| make run_pasted_gene_list | pasted gene list test |
| make run_phenotype_prediction_pipeline | phenotype prediction pipeline test |
| make run_feature_prioritization_pipeline | feature prioritization pipeline test |
| make run_signature_analysis_pipeline | signature analysis pipeline test |
| make run_simplified_inpherno_pipeline | simplified_inpherno_pipeline test |

How to run this pipeline with your data


Follow steps 1-4 above then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_parameters file (YAML Format)

Look for examples of run_parameters in ./Data_Cleanup_Pipeline/data/run_files/TEMPLATE_data_cleanup.yml

* Modify run_parameters file (YAML Format)

Set the spreadsheet and drug_response (phenotype data) file names to point to your data.

* Run the Data Cleanup Pipeline:

  • Update PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH    
  • Run (these relative paths assume you are in the test directory with setup as described above)
python3 ../src/data_cleanup.py -run_directory ./run_dir -run_file TEMPLATE_data_cleanup.yml

Description of "run_parameters" file


| Key | Value | Comments |
| --- | --- | --- |
| pipeline_type | gene_prioritization_pipeline, ... | Choose pipeline cleaning type |
| spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user genomic spreadsheet |
| phenotype_full_path | directory+phenotype_data_name | Path and file name of user phenotypic spreadsheet |
| gg_network_name_full_path | directory+gg_network_name | Path and file name of user network |
| results_directory | directory | Directory to save the output files |
| redis_credential | host, password and port | Credential to access gene names lookup |
| taxonid | 9606 | Taxon id of the genes |
| source_hint | ' ' | Hint for looking up Ensembl names |
| correlation_measure | t_test/pearson | Correlation measure for the gene prioritization pipeline |

spreadsheet_name_full_path = TEST_1_gene_expression.tsv
phenotype_full_path = TEST_1_phenotype.tsv
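
Before launching a run it can help to confirm that the input files named in the run parameters exist. The helper below is a hypothetical convenience and not part of the pipeline; it assumes the run parameters have already been loaded into a dictionary (for example with `yaml.safe_load` from pyyaml) and that every key ending in `_full_path` names an input file.

```python
import os


def missing_input_files(run_parameters):
    """Return the *_full_path keys whose files do not exist on disk."""
    missing = []
    for key, value in run_parameters.items():
        # Assumption: keys ending in "_full_path" point at input files.
        if key.endswith("_full_path") and not os.path.isfile(str(value)):
            missing.append(key)
    return missing
```

An empty list means every referenced input file was found; otherwise the list names the keys to correct in the run_parameters file.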


Description of Output files saved in results directory


  • Output files

input_file_name_ETL.tsv. Input file after Extract Transform Load (cleaning)

input_file_name_MAP.tsv.

| (translated gene) | (input gene name) |
| --- | --- |
| ENS00000012345 | abc_def_er |
| ... | ... |
| ENS00000054321 | def_org_ifi |

input_file_name_UNMAPPED.tsv.

| (input gene name) | (unmapped-none) |
| --- | --- |
| abcd_iffe | unmapped-none |
| ... | ... |
| abdcefg_hijk | unmapped-none |
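
A mapping file can be loaded back into a dictionary for downstream use. The sketch below assumes the `_MAP.tsv` file has two tab-separated columns in the order shown above (translated gene, then input gene name) and no header row; neither assumption is confirmed here, and the function name is hypothetical.

```python
import csv


def read_gene_map(lines):
    """Build {input gene name: translated gene} from tab-separated lines."""
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank or malformed lines that lack both columns.
        if len(row) >= 2:
            translated_gene, input_name = row[0], row[1]
            mapping[input_name] = translated_gene
    return mapping
```

It accepts any iterable of lines, so an open file handle works, for example `with open("input_file_name_MAP.tsv", newline="") as handle: mapping = read_gene_map(handle)`.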