alexzzkk/NLP_bbc_news_classification

Description

This repository contains a solution to a text classification problem on the bbc_news dataset, implemented in Python in a Jupyter Notebook.

Description of dataset

The dataset consists of 2,225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. There are five natural classes within this dataset, namely business, entertainment, politics, sport, and tech.

Description of tasks

Using this dataset, news articles were classified through data pre-processing, feature selection, and the training and evaluation of a machine learning model. In addition:

  • Three types of features were used to train the model: named entity recognition, TF-IDF, and paragraph embeddings.
  • Feature selection was performed to reduce the dimensionality of all features.
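As an illustration of the TF-IDF and feature selection steps listed above, here is a minimal sketch using scikit-learn. It is not the notebook's actual code: the toy corpus, the chi-squared selector, the value k=10, and the logistic regression classifier are all assumptions chosen to keep the example small and runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical); the real dataset has 2,225 articles.
texts = [
    "shares rose as the bank reported strong profits",
    "the company cut its profit forecast for the year",
    "the striker scored twice in the cup final",
    "the team won the match after extra time",
    "the new phone has a faster chip and better software",
    "the software update fixes a security flaw in the browser",
]
labels = ["business", "business", "sport", "sport", "tech", "tech"]

pipeline = Pipeline([
    # Turn raw text into TF-IDF weighted term vectors.
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Keep the 10 terms with the highest chi-squared score (assumed selector and k).
    ("select", SelectKBest(chi2, k=10)),
    # Assumed classifier; any scikit-learn compatible estimator could be used here.
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

n_all = len(pipeline.named_steps["tfidf"].vocabulary_)
n_kept = int(pipeline.named_steps["select"].get_support().sum())
predictions = pipeline.predict(texts)
print(f"features before selection: {n_all}, after selection: {n_kept}")
```

Putting the vectorizer and the selector in one pipeline means both are fitted only on the training data, which avoids leaking information from a held-out test split.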

Instructions for execution

  • Launch Jupyter Notebook from a terminal with the command jupyter notebook
  • Download the Jupyter Notebook Coursework report – part 2.ipynb
  • Make sure the required Python libraries are installed
  • Download the zip file named bbc.zip and make sure that the unzipped folder is in the same folder as Coursework report – part 2.ipynb
  • Open the Jupyter Notebook and run all the cells
  • Note that some cells are commented out because they take a relatively long time to execute. Uncomment them if you want to run that code.
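Once the archive is unzipped next to the notebook, the documents can be read into memory along the following lines. This sketch assumes the unzipped folder contains one subfolder per class, each holding plain-text files; that layout and the helper name load_documents are assumptions, not taken from the notebook. The demo builds a small temporary folder so the function can be tried without the real data.

```python
from pathlib import Path
import tempfile


def load_documents(root):
    """Return (texts, labels) read from root/<class_name>/*.txt (assumed layout)."""
    texts, labels = [], []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for file_path in sorted(class_dir.glob("*.txt")):
            # errors="replace" keeps loading even if a file is not valid UTF-8.
            texts.append(file_path.read_text(encoding="utf-8", errors="replace"))
            labels.append(class_dir.name)
    return texts, labels


# Demo on a throwaway folder that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("business", "sport"):
        class_dir = Path(tmp) / name
        class_dir.mkdir()
        (class_dir / "001.txt").write_text(f"A short {name} story.", encoding="utf-8")
    demo_texts, demo_labels = load_documents(tmp)

print(demo_labels)
```

With the real data, the call would point at the unzipped folder instead, for example load_documents("bbc") if that is the folder's name.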

Requirements

Python version: 3.11.5
Jupyter Notebook version: 6.5.4
spacy version: 3.7.4
pandas version: 2.0.3
xgboost version: 2.0.3
sklearn version: 1.3.0
gensim version: 4.3.2
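To compare an environment against the versions listed above, a small check along these lines may help. It is a sketch rather than part of the repository: the helper name check_versions is hypothetical, and it assumes that sklearn is installed under its distribution name scikit-learn.

```python
from importlib import metadata

# Versions copied from the list above; "scikit-learn" is the distribution
# name under which sklearn is installed (assumption noted in the lead-in).
REQUIRED = {
    "spacy": "3.7.4",
    "pandas": "2.0.3",
    "xgboost": "2.0.3",
    "scikit-learn": "1.3.0",
    "gensim": "4.3.2",
}


def check_versions(required):
    """Map each package to (installed_version_or_None, wanted_version)."""
    report = {}
    for package, wanted in required.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, wanted)
    return report


for package, (installed, wanted) in check_versions(REQUIRED).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{package}: installed={installed}, listed={wanted} ({status})")
```

A differing version does not necessarily break the notebook; the listed versions are simply the ones the author reports using.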
