SNOBERT: A Benchmark for clinical notes entity linking in the SNOMED CT clinical terminology

Method

[Figure: method scheme]

The method consists of two stages: Candidate Selection (I) and Candidate Matching (II).

  • (I) Candidate selection, framed as an NER classification problem;
  • (II) Candidate matching, linking each extracted mention to the corresponding concept ID in the SNOMED CT terminology.

For more information and details about the dataset see: https://arxiv.org/abs/2405.16115

First stage

Train an ensemble of BERT models on an NER task in B-I-O format (7 classes: B-find, I-find, B-proc, I-proc, B-body, I-body, O).
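
As an illustration, here is a minimal sketch of the label scheme and a token-classification head in a standard HuggingFace setup; the backbone name is one of the options listed under "Run training", and the actual training loop is configured via Hydra (see below):

from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O", "B-find", "I-find", "B-proc", "I-proc", "B-body", "I-body"]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in enumerate(LABELS)}

# one of the backbones from the table under "Run training"
model_name = "microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(LABELS), id2label=id2label, label2id=label2id
)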

Second stage

Using a pretrained embedder, predict the concept ID based on cosine similarity.

  1. From the whole SNOMED database, extract the concepts under the [Body structure, Findings, Procedure] nodes (about 200k unique IDs).

  2. Get a database of embeddings:

    # pseudocode: build one averaged embedding vector per SNOMED concept ID
    ID2EMB = {}
    for concept_id in all_ids:
        # embed every synonym of the concept and average the vectors
        embeds = torch.stack([embedder(s) for s in get_all_synonyms(concept_id)])
        ID2EMB[concept_id] = embeds.mean(dim=0)
    
  3. Match the extracted mentions (from stage 1) against the embedding database:

    # pseudocode: link each mention to the nearest concept by cosine similarity
    ids = list(ID2EMB.keys())
    emb_matrix = torch.stack(list(ID2EMB.values()))  # (num_ids, dim)
    for mention in predicted_mentions:
        qvec = embedder(mention)
        similarities = emb_matrix @ qvec  # dot products of L2-normalized vectors
        predicted_id = ids[similarities.argmax()]
    
    • Embedder: cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token (sketched below)
    • Postprocessing with a static dictionary (see Preprocessing).
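
The pseudocode above assumes an embedder function; here is a minimal sketch for the SapBERT checkpoint, assuming mean pooling over the last hidden state (as the "mean-token" variant name suggests) followed by L2 normalization:

import torch
from transformers import AutoModel, AutoTokenizer

name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def embedder(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    vec = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over tokens
    return torch.nn.functional.normalize(vec, dim=-1).squeeze(0)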

Data

Folder structure

├── configs
├── data
│   ├── competition_data
│   │   ├── mimic-iv_notes_training_set.csv
│   │   ├── SnomedCT_InternationalRF2_PRODUCTION_20230531T120000Z_Challenge_Edition
│   │   └── train_annotations.csv
│   ├── first_stage
│   │   ├── S0_0_score_0.1129
│   │   └── Sall_9_score_0.2807
│   ├── preprocess_data
│   └── second_stage
│       └── sapbert
│           ├── embeds
│           └── model
├── docker
├── output
│   ├── 03-20
│   │   ├── 22_00_40
│   │   │   ├── models
│   │   │   │   └── Sall_56_score_0.9491
│   │   │   ├── src
│   │   │   ├── tb
│   │   │   └── tokenizer
│   │   └── 22_22_46
│   │       ├── models
│   │       │   └── S0_11_score_0.4197
│   │       ├── src
│   │       ├── tb
│   │       └── tokenizer
├── src
└── submission

Raw data

Prior to training, add the following raw data sources to data/competition_data:

data
└── competition_data
    ├── mimic-iv_notes_training_set.csv
    ├── SnomedCT_InternationalRF2_PRODUCTION_20230531T120000Z_Challenge_Edition
    └── train_annotations.csv

Fixing train annotations: To address annotation inaccuracies in train_annotations.csv, such as offsets shifted by <br> tags, we used the NER pipeline in Label Studio. Approximately 8-10 notes (out of 204) were corrected, adjusting around 100-200 annotation IDs (out of over 50,000); the corrections specifically targeted shifted annotations and changed the overall score by approximately 0.002.

Non-annotated parts: Several sections of almost every note carry no annotations, so we excluded them from training. Excluded section headers: ['medications on admission:', '___ on admission:', 'discharge medications:']
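
A minimal sketch of this exclusion step; the helper below and the choice of cutting from an excluded header to the next blank line are assumptions for illustration, and the actual boundary logic in src/preprocess.py may differ:

EXCLUDED_HEADERS = ["medications on admission:", "___ on admission:", "discharge medications:"]

def excluded_spans(note: str) -> list[tuple[int, int]]:
    # return character ranges to drop from training, one per excluded header
    spans = []
    lower = note.lower()
    for header in EXCLUDED_HEADERS:
        start = lower.find(header)
        if start != -1:
            end = note.find("\n\n", start)  # assume the section ends at the next blank line
            spans.append((start, len(note) if end == -1 else end))
    return spans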

Setup

To run a container using the image:

make docker

Or Anaconda:

conda create -n snomed-snobert python=3.10
conda activate snomed-snobert
pip install .  # installs requirements from pyproject.toml

Technical details and hardware

We used 4 x NVIDIA A100-SXM4-40GB, but the solution should be replicable on a single A10 (24 GB).

Optional (small score improvement): an MLM-pretrained model, ~24 hours on 4 x A100.

Training time:

  • preprocess: 4 minutes
  • train (4GPU): ~30 minutes
  • train (1GPU): ~60 minutes

Inference time: ~1 minute

Preprocessing

Preprocess the raw data into intermediate features:

python src/preprocess.py

To create data splits for validation run: python src/preprocess.py --val

This performs a variety of preprocessing steps, including:

  • Exclude ['medications on admission:', '___ on admission:', 'discharge medications:'] from the train data
  • Download pretrained weights from HuggingFace (HF)
  • Generate synonym dictionaries for each concept category from SNOMED CT → data/preprocess_data/{proc, find, body}_sctid_syn.json
  • Calculate embeddings of the synonym dictionaries → data/second_stage/{embedder}/embeds/{name}_{concept_type}.pth
  • Build a static dictionary of span texts and the concept IDs found in the training data → data/preprocess_data/most_common_concept.pkl (a minimal sketch follows)
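
A minimal sketch of how such a static dictionary could be built; the column names text and concept_id are hypothetical, and the actual schema of train_annotations.csv may differ:

import pickle
from collections import Counter, defaultdict
import pandas as pd

anns = pd.read_csv("data/competition_data/train_annotations.csv")
counts = defaultdict(Counter)
for text, concept_id in zip(anns["text"], anns["concept_id"]):  # hypothetical columns
    counts[str(text).lower()][concept_id] += 1

# for each span text, keep its most frequent concept ID in the training data
most_common_concept = {t: c.most_common(1)[0][0] for t, c in counts.items()}
with open("data/preprocess_data/most_common_concept.pkl", "wb") as f:
    pickle.dump(most_common_concept, f)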

Run training

We use a Hydra config to set up training of the first stage, the NER task. Set the value of OUTPUTS in configs/snom.yaml to the desired output directory (where model weights will be saved).

To run training in Distributed Data Parallel (DDP) mode:

# DDP mode (multiple GPUs)
torchrun --nproc-per-node=4 src/main.py PARALLEL.DDP=true

# or in single GPU mode
torchrun --nproc-per-node=1 src/main.py PARALLEL.DDP=false

As an example:

torchrun --nproc-per-node=1 src/main.py split=0 class_weights=[0.142,0.142,0.142,0.142,0.142,0.142,0.142] epochs=100 chunked_repeat=2

trains in 80 minutes on a single A10 24 GB GPU and results in an IoU of 0.4277 and a first-stage score of 0.7410.

The final leaderboard submission was an ensemble of six models:

torchrun --nproc-per-node=1 src/main.py split=0 class_weights=[0.142,0.142,0.142,0.142,0.142,0.142,0.142] epochs=100
torchrun --nproc-per-node=1 src/main.py split=3 class_weights=[0.142,1,1,1,1,1,1] epochs=100
torchrun --nproc-per-node=1 src/main.py split=all class_weights=[1,1,1,1,1,1,1] epochs=110
torchrun --nproc-per-node=1 src/main.py split=all class_weights=[0.142,0.142,0.142,0.142,0.142,0.142,0.142] epochs=110
torchrun --nproc-per-node=1 src/main.py split=all class_weights=[0.142,0.571,0.571,0.571,0.571,0.571,0.571] epochs=120
torchrun --nproc-per-node=1 src/main.py split=all class_weights=[0.142,0.571,0.571,0.571,0.571,0.571,0.571] epochs=120

Training can be started from one of the following pretrained models. This should be set in configs/snom.yaml, field model.

Model                                        GPUs  First stage score  IoU     Epoch
MLM pretrained (420000 epochs)               1     0.7429             0.4231  75
BiomedNLP-BiomedBERT-large-uncased-abstract  1     0.7487             0.4199  76
MLM pretrained (420000 epochs)               4     0.7514             0.4302  74
BiomedNLP-BiomedBERT-large-uncased-abstract  4     0.7499             0.4257  72

Output model folder structure

output
├── 03-20           # date
│   ├── 22_00_40    # time
│   │   ├── models  # last 3 checkpoints in HF format
│   │   │   ├── Sall_54_score_0.9418  # S - split; all - all-data mode; 54 - epoch; 0.9418 - F1 macro score
│   │   │   ├── Sall_55_score_0.9462
│   │   │   └── Sall_56_score_0.9491
│   │   ├── src        # source code for debugging
│   │   ├── tb         # tensorboard logs
│   │   └── tokenizer

Run inference

To run inference:

  • Select checkpoints from output/<date>/<time>/models (e.g., output/03-20/22_00_40/Sall_54_score_0.9418) and place them in the data/first_stage folder.
  • Add test notes to data/competition_data/test_notes.csv

Then run the following:

python submission/main.py

To run in validation-scoring mode, run python submission/main.py --val.

main.py contains variables that point to assets required for inference:

  • model checkpoints:
    • FIRST_STAGE_CHECKPOINTS (default data/first_stage): list of paths to model weights for ensembling (see the sketch after this list)
    • SECOND_STAGE_CHECKPOINTS (default data/second_stage/sapbert): path to SapBERT weights
  • static dict path: STATIC_DICT_PATH (default data/preprocess_data/most_common_concept.pkl)
  • train data:
    • TRAIN_NOTES_PATH (data/competition_data/cutmed_fixed_train_annotations.csv)
    • TRAIN_ANNOTAIONS_PATH (data/competition_data/cutmed_fixed_train_annotations.csv)
  • inference data: TEST_NOTES_PATH (data/competition_data/test_notes.csv)
  • path to save the results of inference: SUBMISSION_PATH (submission.csv)
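
A minimal sketch of how the first-stage ensembling could work, assuming per-token class probabilities are averaged across the checkpoints (the actual combination rule in submission/main.py may differ):

import torch
from transformers import AutoModelForTokenClassification

def ensemble_probs(checkpoint_dirs, inputs):
    # average per-token class probabilities over all first-stage checkpoints
    probs = []
    for ckpt in checkpoint_dirs:
        model = AutoModelForTokenClassification.from_pretrained(ckpt).eval()
        with torch.no_grad():
            probs.append(model(**inputs).logits.softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)  # (batch, seq_len, num_labels)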

Citation

@misc{kulyabin2024snobert,
      title={SNOBERT: A Benchmark for clinical notes entity linking in the SNOMED CT clinical terminology}, 
      author={Mikhail Kulyabin and Gleb Sokolov and Aleksandr Galaida and Andreas Maier and Tomas Arias-Vergara},
      year={2024},
      eprint={2405.16115},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
