Convert to Scripture Burrito Proposed Format #6
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
3.10 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
import os | ||
import json | ||
|
||
|
||
# These are not in the standard format. | ||
ALIGNMENT_EXCEPTIONS = [ | ||
    "WLC-NET-manual.json", | ||
    "WLC-CSBE-manual.json", | ||
    "WLC-YLT-manual.json", | ||
    "WLC-SGS-manual.json", | ||
    "NA27-SGS-manual.json", | ||
    "NA27-HSB-manual.json", | ||
    "NA27-CUVMP-manual.json", | ||
] | ||
|
||
|
||
def find_alignment_file_paths_for_conversion(): | ||
    alignment_file_paths = [] | ||
    for root, dirs, files in os.walk("data/alignments"): | ||
        for file in files: | ||
            if file.endswith("-manual.json") and file not in ALIGNMENT_EXCEPTIONS: | ||
                alignment_file_paths.append(os.path.join(root, file)) | ||
    return alignment_file_paths | ||
|
||
|
||
@themikejr no problems with this, but at some point you might want to look at bible_alignments/config.py for dealing with the various pieces of alignment files. |
||
def convert(): | ||
    alignment_file_paths = find_alignment_file_paths_for_conversion() | ||
|
||
    for alignment_file_path in alignment_file_paths: | ||
        print(alignment_file_path) | ||
        sb_alignment = create_sb_json_structure() | ||
|
||
        with open(alignment_file_path, "r") as file: | ||
            alignment_data = json.load(file) | ||
            for alignment_datum in alignment_data: | ||
                try: | ||
                    sb_alignment_record = create_sb_alignment_record() | ||
                    sb_alignment_record["id"] = alignment_datum["id"] | ||
|
||
                    for source_id in alignment_datum["source_ids"]: | ||
                        sb_alignment_record["source"].append(source_id) | ||
                    for target_id in alignment_datum["target_ids"]: | ||
                        sb_alignment_record["target"].append(target_id) | ||
|
||
                    sb_alignment["records"].append(sb_alignment_record) | ||
                except: | ||
                    print("Error in alignment_datum") | ||
                    print(f"\t{alignment_file_path}") | ||
                    print(f"\t{alignment_datum}") | ||
|
||
        new_path = create_new_file_name(alignment_file_path) | ||
        json.dump(sb_alignment, open(new_path, "w"), indent=2) | ||
        # print("MIKE\n\n\n\n") | ||
        # print(sb_alignment) | ||
|
||
|
||
def create_sb_json_structure(): | ||
    sb_alignment = {} | ||
    sb_alignment["type"] = "translation" | ||
I wonder if this is the appropriate type. @jtauber can confirm, but I think the 'translation' type was meant for cases where we know the source is indeed the source, not for cases where we merely assume a source for the sake of alignment. Perhaps the type should be 'alignment' by default? A 'translation' example would be a text we machine-translated, where we know exactly what the source and target are.
Good question. I don't pretend to have the answer, but at least we now know what we need to find out. I interpret the question to be: when source-target affinity is dubious (as I expect it would be with nearly every Bible translation we work with), what is the correct |
||
    sb_alignment["meta"] = {} | ||
    sb_alignment["meta"]["creator"] = "GrapeCity" | ||
Should we add a 'note' about the GrapeCity ones for posterity, perhaps something Randall uses to describe the provenance and alignment process, in case he's not available to answer questions about it at some point? |
||
    sb_alignment["records"] = [] | ||
    return sb_alignment | ||
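The two review comments on this function (whether `"translation"` is the right type, and whether to record a provenance note) could both be accommodated by parameterizing the builder. The following is only a sketch: the `"alignment"` type value is the reviewer's suggestion rather than a confirmed part of the proposed format, and the `note` key under `meta` is a hypothetical name used for illustration.

```python
def create_sb_json_structure(alignment_type="translation", creator="GrapeCity", note=None):
    """Build the top-level structure, leaving the type and provenance configurable."""
    meta = {"creator": creator}
    if note is not None:
        # Hypothetical key: the proposed format may name this field differently.
        meta["note"] = note
    return {"type": alignment_type, "meta": meta, "records": []}


# Default call reproduces the structure built in the diff above.
default_structure = create_sb_json_structure()

# Assumed variant using the reviewer's suggested type and a provenance note.
annotated_structure = create_sb_json_structure(
    alignment_type="alignment",
    note="Manual alignment; provenance details to be supplied.",
)
```

Keeping the defaults identical to the current values means existing callers are unaffected until the type question is settled.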
|
||
|
||
def create_sb_alignment_record(): | ||
    sb_alignment_record = {} | ||
    sb_alignment_record["id"] = "" | ||
    sb_alignment_record["source"] = [] | ||
    sb_alignment_record["target"] = [] | ||
    return sb_alignment_record | ||
|
||
|
||
def create_new_file_name(existing_path): | ||
    old_path_parts = existing_path.split("/") | ||
@themikejr you should use pathlib for working with paths (and eventually config.py for constructing filenames: see https://github.com/Clear-Bible/Alignments/blob/main/bible_alignments/config.py#L53).
I just got us started; I'm not a Python pro. I could maybe come back to this later, but I'm happy for anyone to push commits that make the code more idiomatic or make better use of existing utilities. |
||
    old_name_parts = old_path_parts[4].split("-") | ||
    new_filename = f"{old_name_parts[0]}-{old_name_parts[1]}-manual.sb.json" | ||
    new_path = f"{old_path_parts[0]}/{old_path_parts[1]}/{old_path_parts[2]}/{old_path_parts[3]}/{new_filename}" | ||
    return new_path | ||
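As one way to act on the pathlib suggestion in the review comment above, here is a minimal sketch of the same renaming logic. It assumes file names follow the `SOURCE-TARGET-manual.json` pattern seen in the exceptions list; the example path is hypothetical, and the sketch does not use `bible_alignments/config.py`, whose API is not shown in this diff.

```python
from pathlib import Path


def create_new_file_name(existing_path):
    """Return the sibling 'SOURCE-TARGET-manual.sb.json' path for an alignment file."""
    path = Path(existing_path)
    # Keep only the first two dash-separated parts of the file name.
    source, target = path.name.split("-")[:2]
    # with_name swaps the final path component and keeps the directory as-is.
    return path.with_name(f"{source}-{target}-manual.sb.json")


# Hypothetical path, for illustration only.
new_path = create_new_file_name("data/alignments/eng/BSB/WLC-BSB-manual.json")
print(new_path.as_posix())
```

Unlike the index-based version, this does not depend on the file sitting exactly four directories deep. It returns a `Path` rather than a string, which `open()` accepts directly.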
|
||
|
||
convert() |
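For readers who want to see the record mapping in isolation, the sketch below restates the inner loop of `convert()` as a pure function. The field names (`id`, `source_ids`, `target_ids`, `source`, `target`) come from the diff; the function name `convert_records`, the sample IDs, and the choice to catch only `KeyError` (instead of the bare `except:`) are assumptions made for illustration.

```python
import json


def convert_records(alignment_data):
    """Map legacy records to the proposed shape; return (converted, skipped)."""
    converted, skipped = [], []
    for datum in alignment_data:
        try:
            converted.append(
                {
                    "id": datum["id"],
                    "source": list(datum["source_ids"]),
                    "target": list(datum["target_ids"]),
                }
            )
        except KeyError:
            # A record missing one of the expected keys is set aside, not dropped silently.
            skipped.append(datum)
    return converted, skipped


# Hypothetical sample data; real IDs in the repository will look different.
sample = [
    {"id": "rec-1", "source_ids": ["s1"], "target_ids": ["t1", "t2"]},
    {"id": "rec-2", "source_ids": ["s2"]},  # missing "target_ids"
]
records, skipped = convert_records(sample)
print(json.dumps(records, indent=2))
```

Returning the skipped records lets the caller decide how to report them, and separating the mapping from file I/O makes it straightforward to write the output inside a `with open(...)` block so the file handle is closed.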
Sorry I let these languish, @themikejr: they should be updated now in Alignments.