This is the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence, Data Cleanup Pipeline. This pipeline cleans up the data of a given spreadsheet for subsequent processing by the KnowEnG Analytics Platform.
After removing empty rows and columns from the user spreadsheet data, check:
- if spreadsheet is empty, reject.
- if spreadsheet contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only non-negative real values, accept. If not, reject.
- if spreadsheet contains NaN value, reject.
- if spreadsheet contains duplicate column names, remove the duplicated column.
- if spreadsheet contains duplicate row names, remove the duplicated row.
- if spreadsheet gene names can be mapped to Ensembl gene names, generate mapping files.
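The checks above can be sketched with pandas. This is a minimal illustration rather than the pipeline's own code: the function name `clean_spreadsheet` and its return convention are assumptions, the whole-spreadsheet NaN rejection is folded into the column drop, and the Ensembl mapping step is left out because it needs the redis lookup.

```python
import pandas as pd


def clean_spreadsheet(df: pd.DataFrame):
    """Hypothetical helper: return (cleaned_df, message); cleaned_df is None on rejection."""
    # Remove rows and columns that are entirely empty.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    if df.empty:
        return None, "spreadsheet is empty"

    # Drop every column that still contains a NaN value.
    df = df.dropna(axis=1, how="any")
    if df.empty:
        return None, "no columns left after dropping NaN"

    # Reject anything that is not a non-negative real number.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    if numeric.isna().any().any() or (numeric < 0).any().any():
        return None, "spreadsheet contains non-numeric or negative values"

    # Keep the first occurrence of each duplicated column and row name.
    numeric = numeric.loc[:, ~numeric.columns.duplicated()]
    numeric = numeric.loc[~numeric.index.duplicated()]
    return numeric, "ok"
```

Returning a message next to the result lets a caller report why a spreadsheet was rejected.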
After removing empty rows and columns for user spreadsheet data, check:
- if spreadsheet contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only real, positive values, accept. If not, reject.
- if spreadsheet contains NaN value in gene name, remove corresponding rows.
- if spreadsheet contains duplicate column name, remove duplicate columns.
- if spreadsheet contains duplicate row name, remove duplicate rows.
- map spreadsheet gene names to Ensembl names and generate mapping files.
If the user provides network data, check:
- if network data is empty, reject.
- if network data can not be intersected with genomic spreadsheet, reject.
If the user provides phenotype data, after removing empty rows and columns, check:
- if phenotypic data cannot be intersected with the genomic spreadsheet, reject.
After removing empty rows and columns for user spreadsheet data, check :
- for the spreadsheet, based on the impute option the user selected: a. reject: reject the user spreadsheet if it contains an NA value. b. average: replace each NA value with the mean of its row. c. remove: drop every column that contains an NA value.
- if genomic or phenotypic data is empty, reject.
- if spreadsheet column contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only real value, accept. If not, reject.
- if spreadsheet contains NaN in gene name, remove corresponding rows.
- if spreadsheet contains duplicate column names, remove duplicate columns.
- if spreadsheet contains duplicate row names, remove duplicate rows.
- map spreadsheet gene names to Ensembl names and generate mapping files.
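The three impute options listed above (reject, average, remove) could look roughly like this in pandas. The function name `impute_na` and the use of `None` to signal rejection are assumptions made for this sketch.

```python
import pandas as pd


def impute_na(df: pd.DataFrame, option: str):
    """Hypothetical helper: apply the selected impute option; None means rejected."""
    if option == "reject":
        # Reject the spreadsheet as soon as any NA value is present.
        return None if df.isna().any().any() else df
    if option == "average":
        # Replace each NA with the mean of the row it belongs to.
        return df.apply(lambda row: row.fillna(row.mean()), axis=1)
    if option == "remove":
        # Drop every column that contains at least one NA.
        return df.dropna(axis=1, how="any")
    raise ValueError(f"unknown impute option: {option}")
```

Note that a row consisting only of NA values has no mean, so the `average` branch leaves such a row unchanged.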
After removing empty rows and columns from phenotype data, check:
- for t_test, check that the phenotypic data satisfies the following conditions: a. if the number of unique values/categories < 2, reject. b. if the number of elements per category < 2, reject. c. expand the phenotypic data and keep the original NAs.
- for pearson test, check that the phenotypic data contains only numeric values. If not, reject.
- for every single drug: 1. drop NA for the current drug. 2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list, until all drugs have been iterated through. 3. check whether the common list returned by step 2 is empty. If it is empty, reject.
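The per-drug loop above can be sketched as follows. The layout is an assumption: the phenotype data is taken to have samples as rows and one uniquely named column per drug, while the spreadsheet has samples as columns. The function name `find_common_drugs` is hypothetical.

```python
import pandas as pd


def find_common_drugs(spreadsheet: pd.DataFrame, phenotype: pd.DataFrame):
    """Hypothetical helper: list drugs whose non-NA samples overlap the spreadsheet."""
    common = []
    for drug in phenotype.columns:
        # Step 1: drop NA for the current drug and keep the remaining sample names.
        samples = phenotype[drug].dropna().index
        # Step 2: intersect those samples with the spreadsheet header.
        if len(samples.intersection(spreadsheet.columns)) > 0:
            common.append(drug)
    # Step 3: the caller rejects the data when this list is empty.
    return common
```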
After removing empty rows and columns in user spreadsheet data, check:
- if the spreadsheet's input gene names contain NaN value/s, remove corresponding rows.
- cast the index of the input genes dataframe to string type.
- retrieve gene mapping status from the database and add a status column to the existing dataframe.
- if the dataframe from step 3 intersects with the universal gene list from the redis database, mark the intersected genes with value 1, otherwise 0.
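As a rough illustration of the status marking described above, the sketch below uses an in-memory collection in place of the redis lookup. The function name `mark_gene_status` and the `universe` argument are assumptions, not part of the pipeline's API.

```python
import pandas as pd


def mark_gene_status(input_genes, universe):
    """Hypothetical helper: flag each pasted gene as 1 if it is in `universe`, else 0."""
    # Remove NaN gene names, then cast the remaining names to strings.
    genes = pd.Series(list(input_genes)).dropna().astype(str)
    # `universe` stands in for the universal gene list fetched from redis.
    status = genes.isin(list(universe)).astype(int)
    return pd.DataFrame({"status": status.to_numpy()}, index=genes.to_numpy())
```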
After removing empty rows and columns for user spreadsheet data, check :
- if spreadsheet contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only real, positive values, accept. If not, reject.
- if spreadsheet contains NaN value in gene name, remove corresponding rows.
- if spreadsheet contains NaN value in header, remove corresponding columns.
- if spreadsheet contains duplicate row names, remove duplicate rows.
- if spreadsheet contains duplicate column names, remove duplicate columns.
If the user provides phenotype data, after removing empty rows and columns, check:
- if phenotypic spreadsheet contains duplicate column name, remove duplicate column.
- if phenotypic spreadsheet contains duplicate row name, remove duplicate row.
- if phenotypic spreadsheet intersects with the genomic spreadsheet, accept. If not, reject.
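The intersection check above might be sketched like this, assuming that spreadsheet columns and phenotype rows both carry sample names. The function name `intersect_samples` is hypothetical.

```python
import pandas as pd


def intersect_samples(spreadsheet: pd.DataFrame, phenotype: pd.DataFrame):
    """Hypothetical helper: restrict both tables to shared samples; None means rejected."""
    common = spreadsheet.columns.intersection(phenotype.index)
    if len(common) == 0:
        # No shared samples, so the phenotype data cannot be used.
        return None
    return spreadsheet[common], phenotype.loc[common]
```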
After removing empty rows and columns for user spreadsheet data, check :
- if spreadsheet contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only positive, real value, accept. If not, reject.
- if spreadsheet contains duplicate row names, reject.
- if spreadsheet contains duplicate column names, reject.
- if spreadsheet contains at least two unique values per column, accept. If not, reject.
- map spreadsheet gene names to Ensembl names and generate mapping files.
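The "at least two unique values per column" rule from the list above can be checked with `DataFrame.nunique`. A minimal sketch, with a hypothetical function name:

```python
import pandas as pd


def columns_have_variation(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True when every column holds at least two distinct values."""
    # nunique() counts distinct non-NA values per column.
    return bool((df.nunique(dropna=True) >= 2).all())
```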
After removing empty rows and columns for signature data, check :
- if signature data can be intersected with spreadsheet.
If the user provides network data, check:
- if the unique genes in the network data intersect with the signature data and the spreadsheet data.
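One way to read the network check above is as a three-way intersection of gene names. The sketch below follows that reading, which is an assumption. It also assumes the network is given as edge tuples whose first two fields are gene names. The function name `common_network_genes` is hypothetical.

```python
def common_network_genes(edges, spreadsheet_genes, signature_genes):
    """Hypothetical helper: genes present in the network, spreadsheet and signature."""
    # Collect the unique genes from both endpoints of every edge.
    network_genes = {gene for edge in edges for gene in edge[:2]}
    return network_genes & set(spreadsheet_genes) & set(signature_genes)
```

An empty result would mean the network shares no genes with the other two inputs.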
After removing empty rows and columns in user spreadsheet data, check :
- based on impute option user selected: a. reject: reject user spreadsheet if there is NA. b. average: replace NA value with mean of each row. c. remove: drop entire column which contains NA value.
- if a spreadsheet contains NaN value/s, drop the corresponding columns.
- if a spreadsheet contains only real value, accept. If not, reject.
- if correlation_measure is t_test, perform phenotype expansion.
After removing empty rows and columns, check:
- for t_test, check that the phenotypic data satisfies the following conditions: a. if the number of unique values/categories < 2, reject. b. if the number of elements per category < 2, reject. c. expand the phenotypic data and keep the original NAs.
- for pearson test, check that the phenotypic data contains only numeric values. If not, reject.
- for every single drug: 1. drop NA for the current drug. 2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list, until all drugs have been iterated through. 3. check whether the common list returned by step 2 is empty. If it is empty, reject.
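The t_test and pearson conditions above can be illustrated for a single phenotype column. Both function names are hypothetical, NA values are ignored before checking, and the phenotype expansion step is not shown.

```python
import pandas as pd


def passes_t_test_check(values: pd.Series) -> bool:
    """Hypothetical helper: at least two categories, each with at least two elements."""
    counts = values.dropna().value_counts()
    return len(counts) >= 2 and bool((counts >= 2).all())


def passes_pearson_check(values: pd.Series) -> bool:
    """Hypothetical helper: every non-NA entry must be interpretable as a number."""
    non_na = values.dropna()
    # Entries that cannot be parsed as numbers become NaN under errors="coerce".
    return bool(pd.to_numeric(non_na, errors="coerce").notna().all())
```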
After removing empty rows and columns in user spreadsheet data, check :
- if spreadsheet contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only real value, accept. If not, reject.
- if spreadsheet contains duplicate row names, remove duplicate rows.
- if spreadsheet contains duplicate column names, remove duplicate columns.
- map spreadsheet gene names to Ensembl names and generate mapping files.
After removing empty rows and columns in phenotype data, check :
- if phenotypic data intersects with spreadsheet on phenotype.
- for pearson test, if phenotypic data contains only real values or NaN.
- for every single drug: 1. drop NA for the current drug. 2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list, until all drugs have been iterated through. 3. check whether the common list returned by step 2 is empty. If it is empty, reject.
After removing empty rows and columns in user spreadsheet data, check :
- if expression_sample data contains only real value, accept. If not, reject.
- if expression_sample data's gene names can be mapped to Ensembl gene names, generate mapping files.
After removing empty rows and columns in Pvalue gene phenotype data, check :
- if Pvalue_gene_phenotype data contains only real value, accept. If not, reject.
- if Pvalue_gene_phenotype's gene names can be mapped to Ensembl gene names, generate mapping files.
After removing empty rows and columns in TF expression data, check :
- if TFexpression data contains only real value and doesn't contain NA, accept. If not, reject.
- if TFexpression data's gene names can be mapped to Ensembl gene names, generate mapping files.
git clone https://github.com/KnowEnG/Data_Cleanup_Pipeline.git
apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy
pip3 install pandas
pip3 install scipy==0.19.1
pip3 install scikit-learn==0.19.2
apt-get install -y libfreetype6-dev libxft-dev
pip3 install xmlrunner
pip3 install pyyaml
pip3 install knpackage
pip3 install redis
cd Data_Cleanup_Pipeline
cd test
make env_setup
Command | Option |
---|---|
make run_data_cleaning | example test with large dataset |
make run_samples_clustering_pipeline | samples clustering test |
make run_gene_prioritization_pipeline_pearson | pearson correlation test |
make run_gene_prioritization_pipeline_t_test | t-test correlation test |
make run_geneset_characterization_pipeline | geneset characterization test |
make run_general_clustering_pipeline | general clustering test |
make run_pasted_gene_list | pasted gene list test |
make run_phenotype_prediction_pipeline | phenotype prediction pipeline test |
make run_feature_prioritization_pipeline | feature prioritization pipeline test |
make run_signature_analysis_pipeline | signature analysis pipeline test |
make run_simplified_inpherno_pipeline | simplified_inpherno_pipeline test |
Follow steps 1-4 above then do the following:
mkdir run_directory
cd run_directory
mkdir results_directory
Look for examples of run_parameters in ./Data_Cleanup_Pipeline/data/run_files/TEMPLATE_data_cleanup.yml
Set the spreadsheet and drug_response (phenotype data) file names to point to your data.
- Update the PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
- Run (these relative paths assume you are in the test directory with setup as described above)
python3 ../src/data_cleanup.py -run_directory ./run_dir -run_file TEMPLATE_data_cleanup.yml
Key | Value | Comments |
---|---|---|
pipeline_type | gene_prioritization_pipeline, ... | Choose pipeline cleaning type |
spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user genomic spreadsheet |
phenotype_full_path | directory+phenotype_data_name | Path and file name of user phenotypic spreadsheet |
gg_network_name_full_path | directory+gg_network_name | Path and file name of user network |
results_directory | directory | Directory to save the output files |
redis_credential | host, password and port | Credential to access gene names lookup |
taxonid | 9606 | Taxon id of the genes |
source_hint | ' ' | Hint for looking up Ensembl names |
correlation_measure | t_test/pearson | Correlation measure for the gene prioritization pipeline |
spreadsheet_name_full_path = TEST_1_gene_expression.tsv

phenotype_full_path = TEST_1_phenotype.tsv
- Output files
input_file_name_ETL.tsv. Input file after Extract Transform Load (cleaning)
input_file_name_MAP.tsv.
(translated gene) | (input gene name) |
---|---|
ENS00000012345 | abc_def_er |
... | ... |
ENS00000054321 | def_org_ifi |
input_file_name_UNMAPPED.tsv.
(input gene name) | (unmapped-none) |
---|---|
abcd_iffe | unmapped-none |
... | ... |
abdcefg_hijk | unmapped-none |