feat: skip ocr for certain element types (Issue #3163) #3182

Open
wants to merge 4 commits into
base: main


beez2022

feat: skip ocr for certain element types

Only works for PDF and image documents.

Only works when ocr_mode="individual_blocks".

Introduces a new kwarg: pdf_skip_ocr_element_types (list).
Code changes are in partition/pdf.py and partition/pdf_image/ocr.py.

To use it, pass a list of element types when calling partition_pdf or partition_image:
pdf_skip_ocr_element_types=['Image'] skips OCR for elements of type Image.
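
For readers who want to see the idea in isolation, here is a minimal, self-contained sketch of per-element OCR skipping. It is not the code from this PR: `LayoutElement`, `ocr_individual_blocks`, and `fake_ocr` are hypothetical names made up for illustration, and the real changes live in partition/pdf.py and partition/pdf_image/ocr.py. It only shows the general pattern of checking each element's type against a skip list before calling the OCR backend.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class LayoutElement:
    """Hypothetical stand-in for a detected layout region (not the library's class)."""

    type: str       # e.g. "Image", "Title", "NarrativeText"
    text: str = ""  # stays empty until OCR fills it in


def ocr_individual_blocks(
    elements: List[LayoutElement],
    run_ocr: Callable[[LayoutElement], str],
    skip_ocr_element_types: Iterable[str] = (),
) -> List[LayoutElement]:
    """Run OCR block by block, skipping any element whose type is in the skip list."""
    skip = set(skip_ocr_element_types)
    for element in elements:
        if element.type in skip:
            # Leave the element's text untouched and avoid the OCR call entirely.
            continue
        element.text = run_ocr(element)
    return elements


# Fake OCR backend (an assumption for this sketch) that records what it processed.
ocr_calls: List[str] = []


def fake_ocr(element: LayoutElement) -> str:
    ocr_calls.append(element.type)
    return f"text from {element.type}"


page = [LayoutElement("Title"), LayoutElement("Image"), LayoutElement("NarrativeText")]
result = ocr_individual_blocks(page, fake_ocr, skip_ocr_element_types=["Image"])
```

In this sketch the Image element never reaches the OCR backend, so its text stays empty while the other two elements are processed as usual.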

@beez2022 changed the title from "feat: skip ocr for certain element types #3163" to "feat: skip ocr for certain element types (Issue #3163)" on Jun 11, 2024
@beez2022
Author

beez2022 commented Jul 2, 2024

@christinestraub @scanny This is the PR for issue #3163 (feat: skip ocr for certain element types).

@beez2022
Author

beez2022 commented Jul 3, 2024

@christinestraub I have updated changelog.md and run make test with no errors. I would appreciate it if you could merge the PR. Thank you.

@christinestraub
Collaborator

@beez2022

Thank you for your contribution and for proposing this new parameter to the partition() function. We appreciate the effort to enhance its functionality.

After careful consideration, we have some reservations about merging this PR at this time:

  • Parameter Complexity: Our partition() function already has numerous parameters. We're cautious about adding more unless it addresses a widespread user need. Can you provide more context on how many users this new parameter would benefit?

  • Interface Simplicity: Adding another parameter further complicates the function's interface. This might indicate that we're trying to accommodate too many processes within a single function.

We'd like to gather more user feedback before making a decision. Let's keep this PR open for now to allow for more community input and discussion.

Thank you again for your contribution and for helping us improve our codebase.

@beez2022
Author

beez2022 commented Jul 4, 2024 via email

@ChiNoel-osu

This would be very helpful: partition_pdf (hi_res) is super slow because its OCR runs on a single thread. Some people don't even need OCR; they just want the layout detection.
+1 user will benefit from this 😃

@Masterchen09

There is also another issue, with votes from several users, that this PR could solve: #2467


4 participants