- questions_analysis.ipynb
- data_cleaning.ipynb
- visualization.ipynb

- nlp_preprocessing.ipynb
- nlp_processing.ipynb
- processed_preparing.ipynb

- data_splitting.ipynb
- model_building.ipynb
- model_testing.ipynb
This project addresses the problem of predicting duplicate questions in question-answering systems. The aim is to develop an effective deep learning model capable of accurately identifying redundant queries, thereby improving search efficiency and user experience.
This is a group project for the course Professional Personal Project at the National Institute of Applied Science and Technology, Tunisia.
The project consists of the following folders:

- config: Contains the necessary configuration files, such as __init__.py, which appends the 'src' directory to the system path.
- data: Stores the dataset and its variations throughout the project, so transformed data can be loaded whenever needed instead of redoing the transformations.
- models: Stores trained model versions.
- notebooks: Houses the Jupyter notebooks used for the different processes (see the order of execution at the top).
- reports: Holds generated reports, such as the model graph.
- src: Contains the scripts for the functions used in the notebooks, to promote code organization and maintainability.
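The config step can be illustrated with a minimal sketch of what a `config/__init__.py` that appends the 'src' directory to the system path might look like (the exact contents of the project's file may differ):

```python
# Hypothetical sketch of config/__init__.py: add the project's 'src'
# directory to sys.path so notebooks can import the shared scripts.
import os
import sys

# Resolve ../src relative to this file's directory (the config folder).
SRC_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "src"))

if SRC_DIR not in sys.path:
    sys.path.append(SRC_DIR)
```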
To run the project, follow the steps below:

- Clone the repository:

  git clone https://github.com/Dhouib-Mohamed/Duplicate-Question-Predictor

- Install the required packages listed in requirements.txt:

  pip install -r requirements.txt

- Run the necessary configuration in the config folder:

  python .\config\__init__.py

- Run each notebook in the correct order (see the top of this document).
The data pre-processing stage includes the following steps:
- Case Normalization: Convert all text to lowercase.
- Data Cleaning: Remove special characters and punctuation.
- Stopword Removal: Remove stopwords from the text.
- Lemmatization: Extract the lemma from each word.
The feature engineering stage includes the following step:
- Gensim Vectorization: Convert the text into a matrix of features using Gensim.
The model training and evaluation stage includes the following steps:
- Train/Test Split: Split the data into training and testing sets.
- Model Training: Train a classifier model on the training set.
- Model Evaluation: Evaluate the model on the testing set.
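The split/train/evaluate workflow above can be sketched with scikit-learn (another of the project's packages); the synthetic features, labels, and logistic-regression classifier are stand-ins for the actual question vectors and deep learning model:

```python
# Hedged sketch of the pipeline: split the data, fit a classifier,
# and print the same kind of report shown in the results below.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                # stand-in feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # stand-in duplicate labels

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model Training
clf = LogisticRegression().fit(X_train, y_train)

# Model Evaluation (precision / recall / f1-score per class)
print(classification_report(y_test, clf.predict(X_test)))
```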
Accuracy: 0.68513

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| Positive     | 0.72      | 0.82   | 0.76     | 45989   |
| Negative     | 0.61      | 0.47   | 0.53     | 28090   |
| accuracy     |           |        | 0.69     | 74079   |
| macro avg    | 0.66      | 0.64   | 0.65     | 74079   |
| weighted avg | 0.68      | 0.69   | 0.67     | 74079   |
- The project utilizes various Python packages such as pandas, NLTK, scikit-learn, Matplotlib, seaborn, and Keras. Make sure to install these packages, as listed in the requirements.txt file.