Notebooks used in the project Przona

Contact person: Erik Tjong Kim Sang [email protected]

Notebooks for scraping websites that publish medical guidelines and for analysing the text of those guidelines

Website richtlijnendatabase.nl

1. Run scrape_website.ipynb to retrieve the HTML files. They will be stored in the directory ../data/richtlijnendatabase.nl
2. Run get_paragraphs.ipynb to extract the text paragraphs from the downloaded files. They will be stored in the file csv/paragraphs_20210712.csv
3. Run steps 1 and 4 of text_ranking.ipynb to find the paragraphs that contain relevant medical terms regarding eHealth. This information will be stored in the files paragraphs.json and index.html
4. Run json_diff.ipynb to compare the JSON file produced in step 3 with a previous version and to classify the HTML pages according to treatment steps. The results will be stored in the file index.html
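
The paragraph extraction and term search described above can be sketched roughly as follows. This is only an illustration using the Python standard library, not the code in the notebooks: the class ParagraphParser, the functions extract_paragraphs and find_relevant, the sample HTML and the term list are all hypothetical, and the notebooks may use different libraries and a different matching strategy.

```python
from html.parser import HTMLParser


class ParagraphParser(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed <p> is ended implicitly by the next <p>.
            self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <b> is still part of the paragraph.
        if self._in_p:
            self._buffer.append(data)

    def _flush(self):
        # Join the collected pieces and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in an HTML string."""
    parser = ParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def find_relevant(paragraphs, terms):
    """Keep paragraphs containing at least one term (case-insensitive substring match)."""
    lowered = [term.lower() for term in terms]
    return [p for p in paragraphs if any(t in p.lower() for t in lowered)]


# Made-up sample page and term list, for illustration only.
sample = (
    "<html><body><h1>Richtlijn</h1>"
    "<p>Gebruik van  een <b>app</b> voor zelfmanagement.</p>"
    "<p>Overige tekst.</p></body></html>"
)
paragraphs = extract_paragraphs(sample)
print(find_relevant(paragraphs, ["app", "ehealth"]))
```

Plain substring matching is the simplest choice here; it also matches terms inside longer words, so a real term search would likely need tokenisation or word-boundary checks.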

Repository contents:

.github/workflows
.gitignore
LICENSE
README.md
get_paragraphs.ipynb
json_diff.ipynb
keyword_search.ipynb
nhg.ipynb
przona.py
scrape_website.ipynb
scrape_website_javascript.ipynb
text_diff.ipynb
text_ranking.ipynb
wget.sh