Using the solidus to separate morpheme segments is against OSIS philosophy #50

DavidHaslam · 2017-12-30T17:53:51Z

The general philosophy of OSIS is to use XML elements for all the semantic markup.

Using the solidus within the text to separate morpheme segments within Hebrew words goes against this OSIS philosophy. One friend has described this as "bad, bad, very bad".

By comparison, the XML files for the CrossWire WLC module conform better to this principle: they use the XML seg element for this purpose. The original data was obtained from the website tanach.us, but further preprocessing was done before building the latest version of the module, which differs from its earliest version in this respect.

Edited: @DavidHaslam

For example, the mod2imp output of the CrossWire WLC module generally looks like this:

$$$Genesis 1:1
<w><seg type="x-morph">בְּ</seg><seg type="x-morph">רֵאשִׁ֖ית</seg> </w>
<w><seg type="x-morph">בָּרָ֣א</seg> </w>
<w><seg type="x-morph">אֱלֹהִ֑ים</seg> </w>
<w><seg type="x-morph">אֵ֥ת</seg> </w>
<w><seg type="x-morph">הַ</seg><seg type="x-morph">שָּׁמַ֖יִם</seg> </w>
<w><seg type="x-morph">וְ</seg><seg type="x-morph">אֵ֥ת</seg> </w>
<w><seg type="x-morph">הָ</seg><seg type="x-morph">אָֽרֶץ</seg> </w>
<w type="x-sofpasuq">׃ </w>

NB. In this extract, the output was also converted to Word Per Line format afterwards.
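
As a rough illustration of the markup shown above, here is a minimal Python sketch that turns one solidus-delimited word into a w element with seg children. The function name `word_to_osis` and the `type="x-morph"` default are assumptions made for this example (the attribute value is copied from the extract); this is not the preprocessing that CrossWire actually used.

```python
import xml.etree.ElementTree as ET


def word_to_osis(word: str, sep: str = "/", seg_type: str = "x-morph") -> str:
    """Wrap each morpheme segment of one word in a <seg> inside a <w>.

    Hypothetical helper: `word` is a single Hebrew word whose segments are
    separated by `sep` (the solidus). Empty segments are skipped.
    """
    w = ET.Element("w")
    for part in word.split(sep):
        if part:
            seg = ET.SubElement(w, "seg", {"type": seg_type})
            seg.text = part
    # encoding="unicode" returns a str and keeps Hebrew characters as-is
    return ET.tostring(w, encoding="unicode")


if __name__ == "__main__":
    print(word_to_osis("הַ/שָּׁמַ֖יִם"))
```

Note that this sketch emits no trailing space inside the w element, so whitespace between words would have to be written outside it by the caller.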

Aside: That is not to say that the WLC module is perfect.
Irrespective of any text critical issues, at least these mistakes were made when it was first built.

The Hebrew text should not have been normalized to NFC.
There should not be a space either before or after each MAQAF.
The space between Hebrew words should be outside the w elements.
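
The first two points can be checked mechanically. The following is only a sketch using the Python standard library; the helper names `changed_by_nfc` and `spaced_maqaf_positions` are made up for this example and are not part of any CrossWire tool.

```python
import re
import unicodedata

MAQAF = "\u05BE"  # HEBREW PUNCTUATION MAQAF

# A maqaf that has whitespace immediately before or after it
_SPACED_MAQAF = re.compile(rf"(?<=\s){MAQAF}|{MAQAF}(?=\s)")


def changed_by_nfc(text: str) -> bool:
    """True if NFC normalization would alter the text, e.g. by reordering
    Hebrew combining marks according to their canonical combining class."""
    return unicodedata.normalize("NFC", text) != text


def spaced_maqaf_positions(text: str) -> list[int]:
    """Indexes of every maqaf that is adjacent to whitespace."""
    return [m.start() for m in _SPACED_MAQAF.finditer(text)]
```

If `changed_by_nfc` returns True for the source text, normalizing it would change the stored mark order, which is the effect that should have been avoided. A non-empty result from `spaced_maqaf_positions` points at each maqaf with a stray space next to it.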

These are not your responsibility. I mention them merely in passing.

Those defects were rectified in the WLC module after I created this issue in 2017.

Edited: @DavidHaslam


DavidHaslam · 2023-07-21T21:08:55Z

@dowens76 @DavidTroidl

Does nobody involved in this project take any notice of issues?

This was posted in December 2017 so what's going on?

jag3773 · 2023-07-26T14:23:00Z

Hi @DavidHaslam, I suspect many people agree with you on that, myself included. Making such a change in the text as it is now would certainly cause all sorts of backwards incompatibility issues.

I'd be in favor of offering an alternate version of the files in the repo that has the fields separated according to OSIS philosophy. If you want to put in a PR with the changes you suggest, I think we'd be willing to incorporate it.

DavidHaslam · 2023-07-26T15:47:14Z

@jag3773

Since I added this issue in 2017, the website tanach.us has had a change of title.

Instead of Westminster Leningrad Codex
it's now Unicode XML Leningrad Codex

There are other significant changes, but the one relevant to this issue is that the solidus (/) markers that used to separate morphological segments have all been removed!
