- questions_analysis.ipynb
- data_cleaning.ipynb
- visualization.ipynb

- nlp_preprocessing.ipynb
- nlp_processing.ipynb
- processed_preparing.ipynb

- data_splitting.ipynb
- model_building.ipynb
- model_testing.ipynb
This project addresses the problem of predicting duplicate questions in question-answering systems. The aim is to develop an effective deep learning model capable of accurately identifying redundant queries, thereby improving search efficiency and user experience.
This is a group project for the course Professional Personal Project at the National Institute of Applied Science and Technology, Tunisia.
The project consists of the following folders:

- config: Contains the necessary configuration files, such as __init__.py, which appends the 'src' directory to the system path.
- data: Stores the dataset and its variations throughout the project, so transformed data can be loaded whenever needed instead of redoing the transformations.
- models: Stores trained model versions.
- notebooks: Houses the Jupyter notebooks used for the different processes (see the order of execution at the top).
- reports: Holds generated reports, such as the model graph.
- src: Contains the scripts for the functions used in the notebooks, to promote code organization and maintainability.
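The config step can be illustrated with a minimal sketch of what a `config/__init__.py` that appends the 'src' directory to the system path might look like (the exact contents of the project's file may differ):

```python
# Hypothetical sketch of config/__init__.py: add the project's 'src'
# directory to sys.path so notebooks can import the shared scripts.
import os
import sys

# Resolve ../src relative to this file's directory (the config folder).
SRC_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "src"))

if SRC_DIR not in sys.path:
    sys.path.append(SRC_DIR)
```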
To run the project, follow the steps below:

- Clone the repository:

  git clone https://github.com/Dhouib-Mohamed/Duplicate-Question-Predictor

- Install the required packages listed in requirements.txt:

  pip install -r requirements.txt

- Run the necessary configuration in the config folder:

  python .\config\__init__.py

- Run each notebook in the correct order (see the top of this document).
The data pre-processing stage includes the following steps:
- Case Normalization: Convert all text to lowercase.
- Data Cleaning: Remove special characters and punctuation.
- Stopword Removal: Remove stopwords from the text.
- Lemmatization: Extract the lemma from each word.
The feature engineering stage includes the following step:
- Gensim Vectorization: Convert the text into a matrix of features using Gensim.
The model training and evaluation stage includes the following steps:
- Train/Test Split: Split the data into training and testing sets.
- Model Training: Train a classifier model on the training set.
- Model Evaluation: Evaluate the model on the testing set.
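The split/train/evaluate workflow above can be sketched with scikit-learn (another of the project's packages); the synthetic features, labels, and logistic-regression classifier are stand-ins for the actual question vectors and deep learning model:

```python
# Hedged sketch of the pipeline: split the data, fit a classifier,
# and print the same kind of report shown in the results below.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                # stand-in feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # stand-in duplicate labels

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model Training
clf = LogisticRegression().fit(X_train, y_train)

# Model Evaluation (precision / recall / f1-score per class)
print(classification_report(y_test, clf.predict(X_test)))
```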
Accuracy: 0.68513

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| Positive     | 0.72      | 0.82   | 0.76     | 45989   |
| Negative     | 0.61      | 0.47   | 0.53     | 28090   |
| accuracy     |           |        | 0.69     | 74079   |
| macro avg    | 0.66      | 0.64   | 0.65     | 74079   |
| weighted avg | 0.68      | 0.69   | 0.67     | 74079   |
- The project utilizes various Python packages such as pandas, NLTK, scikit-learn, Matplotlib, seaborn, and Keras. Make sure to install these packages, as listed in the requirements.txt file.