This repository contains the MACULA linguistic datasets for the Hebrew Bible, including data from:
- The text of the Westminster Leningrad Codex, released into the public domain by the Groves Center, and available at tanach.us.
- Morphology from the Open Scriptures Hebrew Bible, available on GitHub.
- Syntax trees developed by Clear Bible, Inc. together with the Groves Center. (Note: Clear was known as Global Bible Initiative from 2014 to 2020 and as Asia Bible Society before that.) Recently, the Groves Center graciously released Westminster Hebrew Syntax without Morphology under a Creative Commons CC BY 4.0 license.
- Word sense data from the United Bible Societies MARBLE project, based on the Semantic Dictionary of Biblical Hebrew.
- Cherith Glosses for the Hebrew Old Testament, by Andi Wu, Copyright (C) 2022 by Cherith Analytics, is licensed under a Creative Commons Attribution 4.0 International License ("CC BY 4.0").
- Semantic roles: Who does what to whom? (Agent, Verb, Patient …)
- Participant referents: Who is “he,” “she,” or “it” in this sentence?
We are adding further datasets, one at a time.
This data has been combined into a single set of trees. There are three variants of this data, found in the following directories:
- WLC/nodes contains this data in a set of nested Node elements suitable for many NLP systems and other systems that use recursive algorithms.
- WLC/lowfat contains the same data in a form more suitable for some kinds of query systems and some kinds of display.
- WLC/tsv contains the word-level data in a TSV table, without syntactic tree structure. This is simpler for many programs that do not need the complexity of graph structures.
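
To illustrate the difference between the nested and the flat variants, here is a minimal Python sketch. It is not taken from the repository: only the element name Node comes from the description above, while the sample XML, the attribute names (Cat, id), and the TSV column names are hypothetical and chosen purely for illustration. The real files may use a different schema, so check them before adapting this.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample. Only the element name "Node" comes from the README;
# the attributes (Cat, id) and the word text are invented for illustration.
SAMPLE_XML = """
<Node Cat="S" id="n1">
  <Node Cat="V" id="w1">created</Node>
  <Node Cat="NP" id="n2">
    <Node Cat="N" id="w2">God</Node>
  </Node>
</Node>
""".strip()


def leaf_words(node):
    """Recursively collect (id, text) for Node elements without Node children."""
    children = node.findall("Node")
    if not children:
        # A leaf is treated as a word in this sketch.
        return [(node.get("id"), (node.text or "").strip())]
    words = []
    for child in children:
        words.extend(leaf_words(child))
    return words


def to_tsv(rows):
    """Write (id, text) pairs as a tab-separated table with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "text"])  # hypothetical column names
    writer.writerows(rows)
    return buf.getvalue()


# Tree view: a recursive walk over the nested elements.
root = ET.fromstring(SAMPLE_XML)
words = leaf_words(root)

# Flat view: the same words as a TSV table, read back row by row.
tsv_text = to_tsv(words)
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
```

The recursive function mirrors the shape of the tree, which is why the nested form suits recursive algorithms. The TSV form needs only a row-by-row reader, at the cost of losing the syntactic structure.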
Copyright statements for the individual sources can be found in the MACULA Hebrew license.