Janarish Saju C
AI/ML Engineer
20th January 2022
Participants/Problem (P), Intervention (I), Comparison (C) and Outcome (O)
Successful evidence-based medicine (EBM) applications rely on answering clinical questions by analyzing large medical literature databases. To formulate a well-defined, focused clinical question, the widely used PICO framework identifies the sentences in a given medical text that belong to one of four components: Participants/Problem (P), Intervention (I), Comparison (C), and Outcome (O).
https://github.com/jind11/PubMed-PICO-Detection
https://pubmed.ncbi.nlm.nih.gov
Meta-information about the data:
- structured_abstracts_PICO contains the original abstracts. A line starting with ### indicates the PMID. After that line, each line contains the original section heading, the assigned gold label (for train and test), and the section content, separated by the symbol |. The gold label is derived from keywords in the section heading; the mapping rules can be found in the paper mentioned above.
- structured_abstracts_sentences_PICO is almost the same as structured_abstracts_PICO, except that each section's content has been split into sentences with the Stanford CoreNLP toolkit, so each line holds exactly one sentence, and all numbers have been replaced by @.
- The folder splitted contains the train, validation, and test sets, randomly split from the file structured_abstracts_sentences_PICO in an 8:1:1 ratio.
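Based on the format described above (a ### line carrying the PMID, followed by heading|label|content lines), the files can be parsed with a short sketch like the following. The function name and the exact field order are assumptions drawn from the description, not from the repository's own code:

```python
def parse_pico_file(lines):
    """Parse PICO abstract lines into {pmid: [(heading, label, content), ...]}.

    A line starting with '###' opens a new record and carries the PMID;
    each following line holds 'heading|label|content' (split on the
    first two '|' symbols, since the content itself may contain '|').
    """
    records = {}
    pmid = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("###"):
            pmid = line.lstrip("#").strip()
            records[pmid] = []
        elif pmid is not None:
            heading, label, content = line.split("|", 2)
            records[pmid].append((heading, label, content))
    return records
```

In the sentence-split file, each content field is a single sentence with numbers replaced by @, so the same parser applies.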
(The steps below summarize the pipeline carried out in the shared code.)
- Read/import the data
- Convert the data as per the model's requirements
- Format the data
- Encode the labels to a numeric representation
- Tokenize and embed the datasets
- Initialize the BERT model
- Define the task name
- Define the tokenizer method
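The label-encoding step above can be sketched as a simple label-to-id mapping; the exact label set used in the shared code is an assumption here (four PICO classes plus a catch-all for non-PICO sentences):

```python
# Assumed label set; the shared code may use different names/ordering.
LABELS = ["P", "I", "C", "O", "OTHER"]
label2id = {lab: i for i, lab in enumerate(LABELS)}
id2label = {i: lab for lab, i in label2id.items()}

def encode_labels(labels):
    """Map a list of PICO label strings to their numeric ids."""
    return [label2id[lab] for lab in labels]

encode_labels(["P", "O", "I"])  # -> [0, 3, 1]
```

The inverse mapping (id2label) is reused later when converting model predictions back to label strings.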
- The following training parameters were used:
- evaluation_strategy = "epoch",
- learning_rate=1e-4,
- per_device_train_batch_size=16,
- per_device_eval_batch_size=16,
- num_train_epochs=6,
- weight_decay=1e-5,
- Train the model with the following inputs:
- train_dataset,
- eval_dataset,
- tokenizer,
- compute_metrics
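The parameter and input names listed above match the Hugging Face Trainer API, so the fine-tuning setup can be sketched roughly as below. The checkpoint name, the dataset variables, and the compute_metrics function are placeholders/assumptions, not confirmed details of the shared code:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # assumed checkpoint; the report only says "BERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

args = TrainingArguments(
    output_dir="pico-bert",          # assumed output directory
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    weight_decay=1e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,      # tokenized training split (placeholder)
    eval_dataset=eval_dataset,        # tokenized validation split (placeholder)
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # e.g. accuracy (placeholder)
)
trainer.train()
```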
- Evaluation was performed on the 20 percent of the training data held out for validation.
- Read the unseen data
- Data Conversion
- Feed the unseen data to the fine-tuned model and get predictions
- Get Label Predictions
- Store the results in a DataFrame
- Export test results
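Turning raw model outputs into label predictions, as in the last steps above, amounts to an argmax over the class logits followed by an id-to-label lookup. A minimal dependency-free sketch, assuming a four-class P/I/C/O id mapping:

```python
ID2LABEL = {0: "P", 1: "I", 2: "C", 3: "O"}  # assumed id-to-label mapping

def predict_labels(logits_batch):
    """Map each row of class logits to its highest-scoring PICO label."""
    preds = []
    for logits in logits_batch:
        best = max(range(len(logits)), key=lambda i: logits[i])
        preds.append(ID2LABEL[best])
    return preds

# Toy logits for two sentences:
predict_labels([[0.1, 2.3, -0.5, 0.0], [1.8, 0.2, 0.1, 0.4]])  # -> ['I', 'P']
```

The resulting label list can then be stored in a pandas DataFrame alongside the input sentences and exported, as described above.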
- BERT has an advantage over other machine learning and deep learning models: as a transformer pretrained on huge datasets, it saves a lot of training time.
- Its disadvantage is that the heavier BERT models are computationally expensive.
Number of Total Records: 24668
Accuracy (80:20 split): 90.8%