Skip to content

In this notebook i implement clinical text classfication on the medical transcription dataset from kaggle

Notifications You must be signed in to change notification settings

rsreetech/ClinicalTextClassification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 

Repository files navigation

ClinicalTextClassification

In this notebook i implement clinical text classfication on the medical transcription dataset from kaggle Here we look at Medical Transcriptions dataset from Kaggle https://www.kaggle.com/tboyle10/medicaltranscriptions

Pre-requisites:

1)NLTK: https://github.com/nltk/nltk

2)Pandas: https://pandas.pydata.org/

3)Matplotlib: https://matplotlib.org/

4)scikit-learn : https://scikit-learn.org/stable/index.html

5)spaCy (https://spacy.io/)

6)https://imbalanced-learn.readthedocs.io/en/stable/#

This data was scraped from mtsamples.com

The task is to correcrtly classify the medical specialties based on the transcription text

Here is the disctibution of samples in the dataset per medical specilaty

Number of sentences in transcriptions column: 140214

Number of unique words in transcriptions column: 35822

===========Original Categories =======================

Cat:1 Allergy / Immunology : 7

Cat:2 Autopsy : 8

Cat:3 Bariatrics : 18

Cat:4 Cardiovascular / Pulmonary : 371

Cat:5 Chiropractic : 14

Cat:6 Consult - History and Phy. : 516

Cat:7 Cosmetic / Plastic Surgery : 27

Cat:8 Dentistry : 27

Cat:9 Dermatology : 29

Cat:10 Diets and Nutritions : 10

Cat:11 Discharge Summary : 108

Cat:12 ENT - Otolaryngology : 96

Cat:13 Emergency Room Reports : 75

Cat:14 Endocrinology : 19

Cat:15 Gastroenterology : 224

Cat:16 General Medicine : 259

Cat:17 Hematology - Oncology : 90

Cat:18 Hospice - Palliative Care : 6

Cat:19 IME-QME-Work Comp etc. : 16

Cat:20 Lab Medicine - Pathology : 8

Cat:21 Letters : 23

Cat:22 Nephrology : 81

Cat:23 Neurology : 223

Cat:24 Neurosurgery : 94

Cat:25 Obstetrics / Gynecology : 155

Cat:26 Office Notes : 50

Cat:27 Ophthalmology : 83

Cat:28 Orthopedic : 355

Cat:29 Pain Management : 61

Cat:30 Pediatrics - Neonatal : 70

Cat:31 Physical Medicine - Rehab : 21

Cat:32 Podiatry : 47

Cat:33 Psychiatry / Psychology : 53

Cat:34 Radiology : 273

Cat:35 Rheumatology : 10

Cat:36 SOAP / Chart / Progress Notes : 166

Cat:37 Sleep Medicine : 20

Cat:38 Speech - Language : 9

Cat:39 Surgery : 1088

Cat:40 Urology : 156

==================================

I try to perform classification on reduced dataset which has the distribution as follows:

============Reduced Categories ======================

Cat:1 Cardiovascular / Pulmonary : 371

Cat:2 Consult - History and Phy. : 516

Cat:3 Discharge Summary : 108

Cat:4 ENT - Otolaryngology : 96

Cat:5 Emergency Room Reports : 75

Cat:6 Gastroenterology : 224

Cat:7 General Medicine : 259

Cat:8 Hematology - Oncology : 90

Cat:9 Nephrology : 81

Cat:10 Neurology : 223

Cat:11 Neurosurgery : 94

Cat:12 Obstetrics / Gynecology : 155

Cat:13 Ophthalmology : 83

Cat:14 Orthopedic : 355

Cat:15 Pain Management : 61

Cat:16 Pediatrics - Neonatal : 70

Cat:17 Psychiatry / Psychology : 53

Cat:18 Radiology : 273

Cat:19 SOAP / Chart / Progress Notes : 166

Cat:20 Surgery : 1088

Cat:21 Urology : 156

============ Reduced Categories ======================

Applying domain knowledge i further reduce this dataset for meaningful results:

===========Reduced Categories======================

Cat:1 Cardiovascular / Pulmonary : 371

Cat:2 ENT - Otolaryngology : 96

Cat:3 Gastroenterology : 224

Cat:4 Hematology - Oncology : 90

Cat:5 Neurology : 317

Cat:6 Obstetrics / Gynecology : 155

Cat:7 Ophthalmology : 83

Cat:8 Orthopedic : 355

Cat:9 Pediatrics - Neonatal : 70

Cat:10 Psychiatry / Psychology : 53

Cat:11 Radiology : 273

Cat:12 Urology : 237

============Reduced Categories======================

On this dataset i get around 67% accuracy.

The conclusions are:

This dataset is very noisy.

Lot of text in transcriptions overlaps across categories

We can apply domain knowledge to reduce the categories

It is imbalanced dataset and using SMOTE can improve the results

Hand coded features may improve results on this dataset but may not apply to generic transcription datasets.

About

In this notebook i implement clinical text classfication on the medical transcription dataset from kaggle

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published