This repository contains Jupyter notebooks showing how to use MACULA Greek New Testament data. These notebooks were prepared for a tutorial at the 2023 Global Missional AI Summit on "Greek and Hebrew Datasets for Natural Language Processing". The tutorial was led by Jonathan Robie, Sean Boisen, and Randall Tan of Clear Bible, Inc.
New to Colab/Jupyter Notebooks: If you have never used Google Colaboratory or Jupyter notebooks, check out the Getting Started tutorial, and click the 'Open in Colab' button at the top of the file.
Experienced with Colab/Jupyter Notebooks: If you have used notebooks like this before, but you are new to MACULA Greek and Hebrew data, head to MACULA Data Overview to get started.
The original Greek and Hebrew texts are at the heart of Bible translation, and researchers have analyzed them in many different ways. But most NLP practitioners do not know Hebrew or Greek. MACULA is a set of linguistic datasets that describe these original texts.
Because the datasets include English glosses, semantic domains, and various descriptions of the text, NLP practitioners can use them without knowledge of the original languages. These datasets were developed by Clear Bible, United Bible Societies, SIL International, unfoldingWord, Translatable Exegetical Tools, Faith Comes by Hearing, the Groves Center, OpenScriptures, Cherith Analytics, and others, and they have been integrated to work together.
In this workshop, we will use Google Colab notebooks to show how to work with this data, then demonstrate some useful NLP tasks such as exploratory data analysis, topic modeling, identifying important vocabulary in a passage using TF-IDF, and text summarization.
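For readers unfamiliar with TF-IDF, the following is a minimal sketch of the idea, not the code used in the notebooks. It assumes each passage has already been reduced to a list of English glosses; the passage names, the glosses, and the helper functions `tf_idf` and `top_terms` are all hypothetical and are not drawn from the MACULA datasets. It uses one common weighting (term frequency times `log(N / df)`); other variants exist.

```python
import math
from collections import Counter

# Hypothetical toy passages: each is a list of English glosses.
# Illustrative only; these are NOT taken from the MACULA datasets.
passages = {
    "passage_a": ["light", "world", "darkness", "light", "life"],
    "passage_b": ["bread", "life", "world", "hunger"],
    "passage_c": ["shepherd", "sheep", "life", "voice"],
}


def tf_idf(passages):
    """Return {passage: {term: score}} using tf * log(N / df)."""
    n_docs = len(passages)
    # Document frequency: number of passages that contain each term.
    df = Counter()
    for tokens in passages.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in passages.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, name, k=3):
    """Return the k highest-scoring terms for one passage."""
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores[name].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


scores = tf_idf(passages)
print(top_terms(scores, "passage_a", k=2))  # prints ['light', 'darkness']
```

In this toy data, "life" appears in every passage, so its score is zero everywhere, while "light" ranks highest in `passage_a` because it is both frequent there and absent elsewhere.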
Participants will be encouraged to work at their own pace and ask questions. They are also welcome to work on their own projects using this data, or build on the notebooks we present.
The notebooks were created by Ryder Wishart and Nathan Brock.
All code in this repository is released under the MIT License. For data licensing, see the data README.