This repository contains a solution to a text classification problem on the bbc_news dataset, implemented in Python in a Jupyter Notebook environment.
The dataset consists of 2,225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. There are five natural classes within this dataset, namely business, entertainment, politics, sport, and tech.
Based on this dataset, data pre-processing, feature selection, and the training and evaluation of a machine learning model were performed to classify news articles. Additionally,
- Three types of features were used to train the model: named entities (from named entity recognition), TF-IDF, and paragraph embeddings.
- Feature selection was performed to reduce the dimensionality of all features.
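As a rough illustration of the TF-IDF feature and of dimensionality reduction, the sketch below computes TF-IDF weights by hand on three made-up sentences and then keeps the `k` highest-weighted terms. It is not the notebook's code: the function names `tfidf` and `select_top_k`, the toy sentences, the whitespace tokenisation, and the top-k-by-total-weight rule are all assumptions chosen for brevity, and the notebook may use a different tokeniser and selection method.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)
    # Document frequency: number of documents each term appears in.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


def select_top_k(vocab, rows, k):
    """Keep the k columns with the largest total weight (illustrative stand-in)."""
    totals = [sum(row[j] for row in rows) for j in range(len(vocab))]
    ranked = sorted(range(len(vocab)), key=lambda j: totals[j], reverse=True)
    keep = sorted(ranked[:k])
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in rows]


# Made-up example sentences, not taken from the dataset.
docs = [
    "shares rise as bank profits grow",
    "team wins cup final match",
    "bank shares fall after profits warning",
]
vocab, matrix = tfidf(docs)
small_vocab, small_matrix = select_top_k(vocab, matrix, k=5)
print(len(vocab), "terms reduced to", len(small_vocab))
```

The reduced matrix has one row per document and `k` columns, which is the shape a classifier would then be trained on.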
- Launch Jupyter Notebook from a terminal with the command `jupyter notebook`
- Download the Jupyter Notebook `Coursework report – part 2.ipynb`
- Make sure the required Python libraries are installed
- Download the zip file named `bbc.zip` and make sure that the unzipped folder is in the same folder as `Coursework report – part 2.ipynb`
- Open the Jupyter Notebook and run all the cells
- Note that some cells are commented out because they take a relatively long time to execute. Uncomment them if you want to run that code.
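Before running the notebook, it can help to confirm that the unzipped data sits where the steps above expect it. The sketch below is a minimal check, assuming the archive unpacks to a folder named `bbc` with one sub-folder per category, each holding `.txt` files. The folder name, that layout, the `latin-1` encoding, and the helper name `load_bbc` are assumptions, not details confirmed by this repository.

```python
from pathlib import Path


def load_bbc(root="bbc"):
    """Read every .txt file under root/<category>/ into (texts, labels)."""
    texts, labels = [], []
    root = Path(root)
    if not root.is_dir():
        # Folder not found: return empty lists instead of raising.
        return texts, labels
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for path in sorted(category_dir.glob("*.txt")):
            # latin-1 is assumed here because it never fails to decode.
            texts.append(path.read_text(encoding="latin-1"))
            labels.append(category_dir.name)
    return texts, labels


texts, labels = load_bbc("bbc")
print(f"Loaded {len(texts)} documents across {len(set(labels))} categories")
```

If the folder is in the right place, the counts printed should match the dataset description: 2,225 documents in five categories.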
Python version: 3.11.5
Jupyter Notebook version: 6.5.4
spacy version: 3.7.4
pandas version: 2.0.3
xgboost version: 2.0.3
sklearn version: 1.3.0
gensim version: 4.3.2
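To compare an installed environment against the versions listed above, a short check like the following could be run before opening the notebook. It is only a sketch: the distribution names `notebook` (for Jupyter Notebook) and `scikit-learn` (for sklearn) are assumed to be the installed package names, and `check_versions` is a hypothetical helper, not part of this repository.

```python
import sys
from importlib import metadata

# Versions listed in this README, keyed by assumed distribution name.
EXPECTED = {
    "notebook": "6.5.4",
    "spacy": "3.7.4",
    "pandas": "2.0.3",
    "xgboost": "2.0.3",
    "scikit-learn": "1.3.0",
    "gensim": "4.3.2",
}


def check_versions(expected=EXPECTED):
    """Map each package name to (expected version, installed version or None)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found)
    return report


print("Python", sys.version.split()[0], "(README lists 3.11.5)")
for name, (wanted, found) in check_versions().items():
    status = "ok" if found == wanted else "differs" if found else "missing"
    print(f"{name}: expected {wanted}, found {found} [{status}]")
```

A "differs" result does not necessarily mean the notebook will fail; the listed versions are simply the ones it was run with.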