feat: Include coordinates of overlapped text in elements #3810

Open
darrayes opened this issue Dec 5, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

darrayes commented Dec 5, 2024

Describe the bug
When using the partition function with overlap and chunking parameters, the overlapped content from the previous chunk is included in the text of the new chunk, but the coordinates metadata for this overlapped portion is missing. This creates inconsistency between the text content and the available coordinate information.

To Reproduce

  1. Create a PDF document with multiple sections
  2. Use the following code to partition the document:
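from unstructured.partition.auto import partition

filename = "example.pdf"  # placeholder path; the report does not name the file

elements = partition(
    filename=filename,
    strategy="hi_res",
    chunking_strategy="by_title",
    max_characters=1500,
    combine_text_under_n_chars=300,
    unique_element_ids=True,
    overlap=170,
    overlap_all=True,
    skip_infer_table_types=[]
)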
  3. Examine the resulting chunks, particularly focusing on the overlapped portions

Expected behavior
When text content is included in a chunk due to overlap settings, its corresponding coordinate metadata should also be included in the chunk's metadata. This ensures consistency between the text content and the available coordinate information for each chunk.

For example, in the current output:

{
    "type": "CompositeElement",
    "text": "Built the entire app infrastructure with Flutter... [overlapped content] ... SKILLS\n\nProgramming Languages Go, Python...",
    "metadata": [
        {
            "type": "Title",
            "text": "SKILLS",
            "metadata": {
                "coordinates": {
                    "points": [[75.8, 1981.9], ...]
                }
            }
        },
        {
            "type": "NarrativeText",
            "text": "Programming Languages Go, Python...",
            "metadata": {
                "coordinates": {
                    "points": [[75.8, 2033.6], ...]
                }
            }
        }
    ]
}

The coordinates for the overlapped portion ("Built the entire app infrastructure...") are missing, though the text is present in the chunk.

Screenshots
N/A

Environment Info

# Please run `python scripts/collect_env.py` and paste the output here

Additional context
This issue is particularly important when the coordinate information is needed for downstream tasks such as:

  • Highlighting text in the original document
  • Maintaining spatial relationships between text elements
  • Performing layout-aware text processing

The missing coordinates for overlapped content can lead to inconsistencies in applications that rely on both the text content and its spatial information.

darrayes added the "bug" label Dec 5, 2024
scanny added the "enhancement" label and removed the "bug" label Dec 15, 2024
scanny changed the title from "bug/Overlap does not include coordinates of the overlapped text" to "feat: Include coordinates of overlapped text in elements" Dec 15, 2024
scanny (Collaborator) commented Dec 15, 2024

@darrayes This is the expected behavior, so removing the "bug" label.

Some things to keep in mind:

  • Position metadata is captured for whole elements, not character-by-character or word-by-word.
  • Overlap is computed during chunking, in a completely separate process from partitioning. The only available information is the metadata present in the elements produced by partitioning. In particular, there is no access to or involvement of the original document in chunking. In practice these two steps can happen on two separate machines and on two separate days.
  • Coordinate information present in partitioned elements is retained in the .metadata.orig_elements field on chunks. This is roughly what you show in the code snippet although the "orig_elements": key is missing in the elided JSON you provided.
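
As a rough illustration of the last point, the sketch below collects a bounding box for each original element attached to a chunk. It assumes each chunk exposes `metadata.orig_elements` as a list of elements, and that each element's `metadata.coordinates.points` is a sequence of (x, y) pairs. The helper names `orig_element_boxes` and `make_element` are hypothetical, and the stand-in objects and numbers are made up so the snippet runs without the library.

```python
from types import SimpleNamespace


def orig_element_boxes(chunk):
    """Return (text, (x_min, y_min, x_max, y_max)) for each original element
    of a chunk that carries coordinates.

    Assumption: chunk.metadata.orig_elements is a list of elements whose
    metadata.coordinates.points is a sequence of (x, y) pairs.
    """
    boxes = []
    metadata = getattr(chunk, "metadata", None)
    for el in getattr(metadata, "orig_elements", None) or []:
        coords = getattr(el.metadata, "coordinates", None)
        points = getattr(coords, "points", None)
        if not points:
            continue  # this element has no position information
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        boxes.append((el.text, (min(xs), min(ys), max(xs), max(ys))))
    return boxes


def make_element(text, points):
    """Hypothetical stand-in for a partitioned element (illustration only)."""
    coords = SimpleNamespace(points=points) if points else None
    return SimpleNamespace(text=text, metadata=SimpleNamespace(coordinates=coords))


# Made-up data: one element with a box, one without coordinates.
title = make_element(
    "Title text", ((10.0, 20.0), (10.0, 40.0), (110.0, 40.0), (110.0, 20.0))
)
untracked = make_element("No position", None)
chunk = SimpleNamespace(
    text="Title text\n\nNo position",
    metadata=SimpleNamespace(orig_elements=[title, untracked]),
)

boxes = orig_element_boxes(chunk)
```

Note that the overlapped prefix of a chunk's text has no entry in that chunk's own original elements, so it will not show up in this list.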

It's possible you may be able to recover some sort of location information from there. Overlap is always a suffix of the prior element used as a prefix for the current chunk. So if you get the "lower-right-hand-corner" of the last element in the prior chunk that will be the neighborhood of where the text came from.
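
One way to approximate that, as a sketch only: take the prior chunk, walk its original elements from the end, and return the lower-right corner of the last one that has coordinates. This assumes the same attribute layout as above (`metadata.orig_elements`, `metadata.coordinates.points`) and a coordinate system in which y grows downward, so "lower" means the largest y. For a bottom-left-origin system the y comparison would need to flip. The function name `overlap_neighborhood` and the stand-in data are hypothetical.

```python
from types import SimpleNamespace


def overlap_neighborhood(prior_chunk):
    """Return the (x, y) lower-right corner of the last positioned original
    element in prior_chunk, or None if no element has coordinates.

    Assumption: y grows downward, so the lower-right corner is (max x, max y).
    """
    orig_elements = getattr(prior_chunk.metadata, "orig_elements", None) or []
    for el in reversed(orig_elements):
        coords = getattr(el.metadata, "coordinates", None)
        points = getattr(coords, "points", None)
        if points:
            return (max(x for x, _ in points), max(y for _, y in points))
    return None


def make_element(text, points):
    """Hypothetical stand-in for a partitioned element (illustration only)."""
    coords = SimpleNamespace(points=points) if points else None
    return SimpleNamespace(text=text, metadata=SimpleNamespace(coordinates=coords))


# Made-up prior chunk with two positioned elements.
first = make_element(
    "First paragraph", ((50.0, 100.0), (50.0, 120.0), (200.0, 120.0), (200.0, 100.0))
)
last = make_element(
    "Last paragraph", ((50.0, 240.0), (50.0, 260.0), (300.0, 260.0), (300.0, 240.0))
)
prior_chunk = SimpleNamespace(metadata=SimpleNamespace(orig_elements=[first, last]))

corner = overlap_neighborhood(prior_chunk)
```

The result is only a neighborhood: since positions are recorded per element, it cannot say where within that element the overlapped text begins.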

I don't see this as likely to make the roadmap soon, but I'll leave it open as an enhancement request.
