This directory contains code to download and pre-process the GAP-Replay corpus.
MediTron’s domain-adaptive pre-training corpus, GAP-Replay, combines 48.1B tokens from four datasets:
- Clinical Guidelines: a new dataset of 46K clinical practice guidelines from various healthcare-related sources,
- Paper Abstracts: abstracts from 16.1M closed-access PubMed and PubMed Central papers,
- Medical Papers: full-text articles extracted from 5M publicly available PubMed and PubMed Central papers,
- Replay dataset: general domain data distilled to compose 1% of the entire corpus.
To download all datasets and combine them into a single GAP-Replay corpus, run:
./download.sh
To download and pre-process PubMed papers and abstracts from the S2ORC API, run:
./pubmed/download.sh
To download and sub-sample replay data from the RedPajama-v1 dataset, run:
./replay/download.sh
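As a rough illustration of what sub-sampling to a token budget can look like, the sketch below shuffles documents and keeps those that fit within a budget. It is not the logic of `replay/download.sh`: the helper names `replay_budget` and `subsample_to_budget`, the whitespace-based token count, and the toy documents are all assumptions made for illustration. The budget uses the figures quoted above (1% of 48.1B tokens is about 481M tokens).

```python
import random

CORPUS_TOKENS = 48_100_000_000  # total GAP-Replay size quoted in this README
REPLAY_PERCENT = 1              # replay share quoted in this README


def replay_budget(corpus_tokens=CORPUS_TOKENS, percent=REPLAY_PERCENT):
    """Return the approximate replay token budget (integer arithmetic)."""
    return corpus_tokens * percent // 100


def subsample_to_budget(docs, budget, seed=0):
    """Hypothetical helper: shuffle docs and keep those that fit the budget.

    Token counts are approximated by whitespace splitting (an assumption;
    a real pipeline would use the model tokenizer).
    """
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    kept, used = [], 0
    for doc in shuffled:
        n_tokens = len(doc.split())
        if used + n_tokens > budget:
            continue  # skip documents that would overshoot the budget
        kept.append(doc)
        used += n_tokens
    return kept, used


# Toy documents standing in for general-domain text (illustrative only).
toy_docs = ["a b c", "d e", "f g h i", "j"]
kept, used = subsample_to_budget(toy_docs, budget=5, seed=1)
print(replay_budget(), len(kept), used)
```

Fixing the seed keeps the sample reproducible across runs, which matters when the same replay subset has to be regenerated later.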
Only 8 of the 16 clinical guideline sources allow redistribution (CCO, CDC, CMA, ICRC, NICE, SPOR, WHO, and WikiDoc). For these 36K open-access articles, we release raw and clean versions of the data on the Hugging Face datasets hub, which can be loaded as follows:
from datasets import load_dataset
dataset = load_dataset("epfl-llm/guidelines")
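A minimal sketch of working with the loaded records, assuming each record is a dictionary with a `source` field holding one of the source tags. That field name is an assumption, so check the dataset's column names before relying on it. The helper `count_by_source` is hypothetical and accepts any iterable of dictionaries, so it can be tried on toy data without downloading anything.

```python
from collections import Counter


def count_by_source(records, field="source"):
    """Count records per source tag; `field` is an assumed column name."""
    return Counter(record[field] for record in records)


# Toy records standing in for rows of the dataset (illustrative only).
toy_records = [
    {"source": "who", "title": "Example A"},
    {"source": "nice", "title": "Example B"},
    {"source": "who", "title": "Example C"},
]
counts = count_by_source(toy_records)
print(counts.most_common())
```

With the real data, something like `count_by_source(dataset["train"])` should work if the split is named `train`, which is also an assumption.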
To scrape all 16 sources, you can use our web scrapers and cleaning code in guidelines/ after first installing the dependencies:
# Install dependencies
pip install -r guidelines/requirements.txt
# Use spacy to get the English language pipeline
python -m spacy download en_core_web_sm
# Install scipdf from GitHub to convert PDFs to text
pip install git+https://github.com/titipata/scipdf_parser
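Before running the scrapers, it can help to confirm that the installation worked. The sketch below checks whether modules are importable using only the standard library. The import names `spacy`, `en_core_web_sm`, and `scipdf` are assumptions about what the packages above install, and `missing_modules` is a hypothetical helper; adjust the list if your environment differs.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the dependencies installed above.
REQUIRED = ["spacy", "en_core_web_sm", "scipdf"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All assumed dependencies are importable.")
```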
Then, to download and pre-process all 46K clinical practice guidelines, run:
./guidelines/download.sh
All sources of clinical practice guidelines supported by our scrapers are shown below.
Source | Full Name | Source tag | Total guidelines | Total words | Audience | Released |
---|---|---|---|---|---|---|
AAFP | American Academy of Family Physicians | aafp | 50 | 9.4K | Doctor | No |
CCO | Cancer Care Ontario | cco | 87 | 199K | Doctor | Yes |
CDC | Centers for Disease Control and Prevention | cdc | 621 | 6.7M | Doctor | Yes |
CMA | Canadian Medical Association | cma | 431 | 1.7M | Doctor | Yes |
CPS | Canadian Paediatric Society | cps | 54 | 133K | Doctor | No |
drugs.com | Drugs.com | drugs | 6548 | 4.1M | Both | No |
GuidelineCentral | GuidelineCentral | gc | 1029 | 1M | Doctor | No |
ICRC | International Committee of the Red Cross | icrc | 49 | 1.2M | Doctor | Yes |
IDSA | Infectious Diseases Society of America | idsa | 47 | 646K | Doctor | No |
MAGIC | Making GRADE The Irresistible Choice | magic | 52 | 415K | Doctor | No |
MayoClinic | MayoClinic | mayo | 1100 | 2.2M | Patient | No |
NICE | National Institute for Health and Care Excellence | nice | 1656 | 8.1M | Doctor | Yes |
RCH | Royal Children's Hospital Melbourne | rch | 384 | 410K | Doctor | No |
SPOR | Strategy for Patient-Oriented Research | spor | 217 | 1.1M | Doctor | Yes |
WHO | World Health Organization | who | 223 | 3.1M | Both | Yes |
WikiDoc | WikiDoc | wikidoc | 33058 | 34M | Both | Yes |
NOTE: The endpoints or data shape of some of the sources may have changed since we scraped them, so the scrapers may be outdated.
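The counts in the table can be cross-checked against the totals quoted earlier (46K guidelines overall, 36K from the 8 redistributable sources). The sketch below copies the "Total guidelines" and "Released" columns into a dictionary keyed by source tag and sums them; the variable names are illustrative.

```python
# (total guidelines, released?) per source tag, copied from the table above.
GUIDELINES = {
    "aafp": (50, False), "cco": (87, True), "cdc": (621, True),
    "cma": (431, True), "cps": (54, False), "drugs": (6548, False),
    "gc": (1029, False), "icrc": (49, True), "idsa": (47, False),
    "magic": (52, False), "mayo": (1100, False), "nice": (1656, True),
    "rch": (384, False), "spor": (217, True), "who": (223, True),
    "wikidoc": (33058, True),
}

total = sum(count for count, _ in GUIDELINES.values())
released = sum(count for count, ok in GUIDELINES.values() if ok)
released_sources = sorted(tag for tag, (_, ok) in GUIDELINES.items() if ok)

print(total)                  # 45606, i.e. about 46K
print(released)               # 36342, i.e. about 36K
print(len(released_sources))  # 8
```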