Describe the bug
When using the partition function with overlap and chunking parameters, the overlapped content from the previous chunk is included in the text of the new chunk, but the coordinates metadata for this overlapped portion is missing. This creates an inconsistency between the chunk's text and its available coordinate information.
To Reproduce
Examine the resulting chunks, particularly focusing on the overlapped portions.
Expected behavior
When text content is included in a chunk due to overlap settings, its corresponding coordinate metadata should also be included in the chunk's metadata. This ensures consistency between the text content and the available coordinate information for each chunk.
The coordinates for the overlapped portion ("Built the entire app infrastructure...") are missing, though the text is present in the chunk.
Screenshots
N/A
Environment Info
# Please run `python scripts/collect_env.py` and paste the output here
Additional context
This issue is particularly important when the coordinate information is needed for downstream tasks such as:
- Highlighting text in the original document
- Maintaining spatial relationships between text elements
- Performing layout-aware text processing
The missing coordinates for overlapped content can lead to inconsistencies in applications that rely on both the text content and its spatial information.
scanny changed the title from "bug/Overlap does not include coordinates of the overlapped text" to "feat: Include coordinates of overlapped text in elements" on Dec 15, 2024
@darrayes This is the expected behavior, so removing the "bug" label.
Some things to keep in mind:
- Position metadata is captured for whole elements, not character-by-character or word-by-word.
- Overlap is computed during chunking, in a completely separate process from partitioning. The only available information is the metadata present in the elements produced by partitioning. In particular, there is no access to or involvement of the original document in chunking. In practice these two steps can happen on two separate machines and on two separate days.
- Coordinate information present in partitioned elements is retained in the `.metadata.orig_elements` field on chunks. This is roughly what you show in the code snippet, although the `"orig_elements":` key is missing in the elided JSON you provided.
You may be able to recover some sort of location information from there. Overlap is always a suffix of the prior element used as a prefix for the current chunk. So if you get the "lower-right-hand-corner" of the last element in the prior chunk, that will be the neighborhood of where the text came from.
I don't see this as likely to make the roadmap soon, but I'll leave it open as an enhancement request.