
This work aims to integrate diverse, curated biological databases with a comprehensive analysis of the growing body of literature on the potential long-term neurological consequences of COVID-19 infections, employing advanced natural language processing and text mining techniques.


SCAI-BIO/covid-NDD-comorbidity-NLP


Exploring the Current State of Knowledge on the Link Between COVID-19 and Neurodegeneration

This repository contains the data, scripts, and analyses used in the research titled "Understanding the Co-Morbidity between COVID-19 and Neurodegenerative Diseases at Mechanism-Level: Comprehensive Analysis Integrating Databases and Text Mining". The project uses the Neo4j platform for graph-based analysis and integrates natural language processing to explore relationships between COVID-19 and neurodegenerative diseases (NDDs).


Overview

This project explores the connections between COVID-19 and neurodegenerative diseases by:

  1. Integrating database information about COVID-19 and NDDs and storing it in a graph structure.
  2. Extracting textual data from scientific literature and using natural language processing pipelines for information extraction and knowledge graph (KG) construction.
  3. Loading all KGs into Neo4j to identify and analyse relationships and pathways between entities such as genes, diseases, and chemicals.
  4. Constructing a hypothesis database for comorbidity between COVID-19 and NDDs to explore, analyse, and visualise testable comorbidity hypotheses.

Data

The repository includes the following directories:

  1. Expert-curated-publications: Contains manually curated publications relevant to the study, ensuring high-quality and accurate information.

  2. PubTator3-results: Includes results from PubTator3, a web-based system that offers a comprehensive set of features and tools for exploring biomedical literature using advanced text mining and AI techniques.

  3. Sherpa-results: Houses outputs from Sherpa, a tool designed to assist in the curation of biomedical literature by providing automated annotations and insights.

  4. Textual-corpora-for-textmining: Comprises textual corpora prepared for text mining purposes, facilitating the extraction of meaningful patterns and relationships regarding COVID-19 and NDD.

Sources

1. comorbidity-hypothesis-db.py

  • Purpose: Automatically opens the Neo4j Browser with prefilled credentials to connect to the AuraDB instance for comorbidity hypothesis exploration.
  • Key Features:
    • Simplifies connection to Neo4j by generating a pre-configured URL.
    • Useful for direct interaction with the knowledge graph.
  • Usage: Run the script, and the Neo4j Browser will open in your default web browser:
    python comorbidity-hypothesis-db.py
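
The pre-configured URL mentioned above could be assembled roughly as follows. This is a minimal sketch, not the script's actual code: it assumes the hosted Neo4j Browser accepts a `connectURL` query parameter, and the function names `build_browser_url` and `open_browser` are hypothetical. The password is deliberately left out of the URL here.

```python
import webbrowser
from urllib.parse import quote

BROWSER_BASE = "https://browser.neo4j.io/"


def build_browser_url(host: str, user: str, scheme: str = "neo4j+s") -> str:
    """Return a Neo4j Browser URL with the connection form prefilled.

    Assumption: the Browser reads the target from a `connectURL`
    query parameter of the form scheme://user@host.
    """
    connect_url = f"{scheme}://{user}@{host}"
    # Percent-encode everything so '+', ':', '/' and '@' survive as a query value.
    return f"{BROWSER_BASE}?connectURL={quote(connect_url, safe='')}"


def open_browser(host: str, user: str) -> None:
    """Open the prefilled Browser URL in the default web browser."""
    webbrowser.open(build_browser_url(host, user))
```

Calling `open_browser("<your-instance>.databases.neo4j.io", "neo4j")` would then open the Browser with the host and user filled in, leaving only the password to type.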
    

2. comorbidity-space-neo4j-upload.py

  • Purpose: Uploads the comorbidity hypothesis paths to the AuraDB instance for comorbidity hypothesis exploration. The curated candidate paths, along with their PMIDs and supporting evidence, are stored in 'src/hypothesis_pmid_evidences.csv'.
  • Key Features:
    • Simplifies uploading the hypothesis comorbidity candidates.
  • Usage: Run the script to upload the paths to the configured AuraDB instance:
    python comorbidity-space-neo4j-upload.py
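
An upload of this kind can be sketched with the official `neo4j` Python driver. The column names (`subject`, `object`, `pmid`, `evidence`), the `Entity` label, and the `ASSOCIATED_WITH` relationship type below are assumptions made for illustration; the real layout of `hypothesis_pmid_evidences.csv` and the script's graph schema may differ.

```python
import csv

# Hypothetical column names; adjust to the actual CSV header.
COLUMNS = ("subject", "object", "pmid", "evidence")

# One batched statement: MERGE keeps the upload idempotent on re-runs.
MERGE_QUERY = """
UNWIND $rows AS row
MERGE (s:Entity {name: row.subject})
MERGE (o:Entity {name: row.object})
MERGE (s)-[r:ASSOCIATED_WITH {pmid: row.pmid}]->(o)
SET r.evidence = row.evidence
"""


def read_rows(handle):
    """Read CSV rows from an open text handle, keeping only the expected columns."""
    rows = []
    for record in csv.DictReader(handle):
        row = {col: (record.get(col) or "").strip() for col in COLUMNS}
        # Skip rows that lack either endpoint of the relationship.
        if row["subject"] and row["object"]:
            rows.append(row)
    return rows


def upload(rows, uri, user, password):
    """Send all rows to Neo4j in a single parameterised query."""
    from neo4j import GraphDatabase  # imported lazily; requires `pip install neo4j`

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        driver.execute_query(MERGE_QUERY, parameters_={"rows": rows})
```

Batching through `UNWIND` avoids one network round trip per row, which matters when the instance is a remote AuraDB.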
    

3. hypothesis-graph-database-upload.py

  • Purpose: Manages the upload of hypothesis-based graph data to Neo4j.

  • Key Features:

    • Dedicated notebook for hypothesis data integration
    • Structured data validation
    • Automated graph relationship creation
  • Usage:

    • Open in Jupyter environment
    • Configure data paths
    • Execute cells sequentially
  • Pipeline overview:

A data integration pipeline for analyzing relationships between COVID-19 and neurodegenerative diseases (NDDs). It processes and uploads three types of biomedical data to Neo4j:

  1. Triples hypothesis (filtered triples from all dbs)
  2. Pathway hypothesis (filtered pathways)
  3. GWAS Data (shared variants)

The project leverages Neo4j for graph-based analysis and integrates various data sources to explore disease relationships.

Quick Start

  1. Install Dependencies
pip install pandas neo4j requests rapidfuzz fuzzywuzzy python-Levenshtein
  2. Configure Neo4j Connection: create config.json:
{
    "neo4j": {
        "uri": "neo4j+s://09f8d4e9.databases.neo4j.io",
        "user": "neo4j",
        "password": "your-password"
    }
}
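  3. Run Pipeline
import importlib

# The file name contains hyphens, so a plain `import` statement cannot load it;
# importlib can, provided the script's directory is on sys.path.
pipeline = importlib.import_module("hypothesis-graph-database-upload")
DataPipelineRunner = pipeline.DataPipelineRunner
Neo4jConfig = pipeline.Neo4jConfig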

# Configure Neo4j connection
config = Neo4jConfig(
    uri="your_neo4j_uri",
    user="your_username",
    password="your_password"
)

# Run pipeline
runner = DataPipelineRunner(config)
runner.run(
    triple_file="path/to/cleaned_all_db_association.csv",
    pathway_file="path/to/your/hypothesis_pmid_evidences.csv",
    gwas_file="path/to/your/shared-variants.xlsx"
)
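
To avoid hard-coding credentials, the values from the config.json created in the Quick Start could be read and checked before building the connection. The helper below is a sketch; `load_neo4j_settings` is a hypothetical name and is not part of the repository.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("uri", "user", "password")


def load_neo4j_settings(path="config.json"):
    """Return the uri, user and password stored under the "neo4j" key.

    Raises ValueError when a required entry is missing or empty.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    settings = data.get("neo4j", {})
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError(f"config is missing neo4j settings: {', '.join(missing)}")
    return {key: settings[key] for key in REQUIRED_KEYS}
```

The returned dictionary has exactly the keyword names used in the Quick Start, so it could be passed on as `Neo4jConfig(**load_neo4j_settings())`, assuming `Neo4jConfig` accepts those keywords as shown there.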

Getting Started

Prerequisites

  • Neo4j AuraDB: Ensure you have access to a Neo4j AuraDB instance. Use the provided connection details or set up your own.

  • Python Environment: Install the required libraries:

    pip install neo4j pandas
    

Notebooks

1. analyze-neo4j.ipynb

  • Purpose: Analyzes the knowledge graph loaded to Neo4j to extract insights.
  • Key Features:
    • Counts nodes and edges in the graph.
    • Executes community detection algorithms like Louvain using Neo4j's Graph Data Science (GDS) library.
    • Retrieves and visualizes properties of detected clusters
  • Usage: Open the Jupyter Notebook and follow the instructions to:
    • Query the Neo4j database.
    • Get general statistics about nodes, triples, and pathways, and analyze them.
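
The counting and community-detection steps listed above can be sketched with the `neo4j` Python driver and GDS procedures. This is an illustration rather than the notebook's code: the projection name `covid_ndd` is hypothetical, the projection takes every label and relationship type, and it assumes the Graph Data Science plugin is available on the target instance.

```python
from collections import Counter

GRAPH_NAME = "covid_ndd"  # hypothetical name for the in-memory GDS projection

COUNT_NODES = "MATCH (n) RETURN count(n) AS total"
COUNT_EDGES = "MATCH ()-[r]->() RETURN count(r) AS total"
PROJECT = "CALL gds.graph.project($name, '*', '*')"
LOUVAIN = (
    "CALL gds.louvain.stream($name) "
    "YIELD nodeId, communityId RETURN nodeId, communityId"
)
DROP = "CALL gds.graph.drop($name)"


def community_sizes(records):
    """Return (communityId, member count) pairs, largest community first."""
    counts = Counter(record["communityId"] for record in records)
    return counts.most_common()


def analyse(uri, user, password):
    """Count nodes and edges, then run Louvain on a temporary projection."""
    from neo4j import GraphDatabase  # imported lazily; requires `pip install neo4j`

    params = {"name": GRAPH_NAME}
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        nodes = driver.execute_query(COUNT_NODES).records[0]["total"]
        edges = driver.execute_query(COUNT_EDGES).records[0]["total"]
        driver.execute_query(PROJECT, parameters_=params)
        try:
            louvain = driver.execute_query(LOUVAIN, parameters_=params)
            sizes = community_sizes(louvain.records)
        finally:
            # Always release the in-memory projection, even if Louvain fails.
            driver.execute_query(DROP, parameters_=params)
    return {"nodes": nodes, "edges": edges, "communities": sizes}
```

Streaming the assignments and aggregating them client-side leaves the stored graph untouched; writing community IDs back as node properties would be an alternative if the clusters need to be queried later.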

2. import-neo4j-all-dbs.ipynb

  • Purpose: This notebook uploads multiple databases into Neo4j, providing a streamlined workflow for graph-based data integration and analysis.
  • Prerequisites:
    • bel_json_import package for BEL data conversion to eBEL format
    • Properly formatted database extracts
  • Key Features:
    • Efficiently import graph data into Neo4j using a common schema
    • Seamless integration of complex biological networks
    • Privacy-aware data handling
  • Usage:
    • Open the notebook in Jupyter Notebook or JupyterLab
    • Place data in required locations
    • Run cells specific to each source

Exploring the Covid-NDD Comorbidity Database

To manually explore the comorbidity graph database:

  1. Open the Neo4j Browser:

    Navigate to https://browser.neo4j.io.

  2. Enter the Connection Details:

  3. Run Cypher Queries:

    Once connected, you can execute Cypher queries to explore the graph. For example, to retrieve a sample of nodes:

    MATCH (n) RETURN n LIMIT 10;

Contact

For any questions, suggestions, or collaborations, please contact:

Negin Babaiha
Email: [email protected]
Google Scholar Profile

Feel free to reach out for discussions regarding the project!
