feat: skip ocr for certain element types (Issue #3163) #3182
@christinestraub @scanny This is the PR for issue #3163 ([Unstructured-IO/unstructured] feat/skip ocr for certain element types).
@christinestraub I have updated changelog.md and run make test with no errors. I would appreciate it if you could merge the PR. Thank you.
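Thank you for your contribution and for proposing this new parameter to the partition() function. We appreciate the effort to enhance its functionality.
After careful consideration, we have some reservations about merging this PR at this time:
- Parameter Complexity: Our partition() function already has numerous parameters. We're cautious about adding more unless it addresses a widespread user need. Can you provide more context on how many users this new parameter would benefit?
- Interface Simplicity: Adding another parameter further complicates the function's interface. This might indicate that we're trying to accommodate too many processes within a single function.
We'd like to gather more user feedback before making a decision. Let's keep this PR open for now to allow for more community input and discussion. Thank you again for your contribution and for helping us improve our codebase.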
Hi Christine,
Thank you for your reply. I really appreciate your comments.
Currently, my organization needs to skip OCR for images found in documents in order to shorten the response time of partition_pdf(). However, I am not sure how many other users would benefit from this.
Thank you for keeping this PR open.
I am thankful for the opportunity to contribute to your repo.
Best regards,
Bee
On Thursday, 4 July 2024 at 02:12:40 AM SGT, Christine Straub ***@***.***> wrote:
@beez2022
Thank you for your contribution and for proposing this new parameter to the partition() function. We appreciate the effort to enhance its functionality.
After careful consideration, we have some reservations about merging this PR at this time:
- Parameter Complexity: Our partition() function already has numerous parameters. We're cautious about adding more unless it addresses a widespread user need. Can you provide more context on how many users this new parameter would benefit?
- Interface Simplicity: Adding another parameter further complicates the function's interface. This might indicate that we're trying to accommodate too many processes within a single function.
We'd like to gather more user feedback before making a decision. Let's keep this PR open for now to allow for more community input and discussion.
Thank you again for your contribution and for helping us improve our codebase.
This will be very helpful, because partition_pdf (hi_res) is very slow due to its single-threaded OCR. Some people don't even need OCR; they just want the layout detection.
There is also another issue (with votes from several users) that could be solved by this PR: #2467
feat: skip ocr for certain element types
Only works for PDF or image document types.
Only works when ocr_mode="individual_blocks".
Introduces a new kwarg: pdf_skip_ocr_element_types (a list).
Code changes are in partition/pdf.py and partition/pdf_image/ocr.py.
Specify the list of element types when calling partition_pdf or partition_image:
pdf_skip_ocr_element_types=['Image'] will skip OCR for elements of type Image.
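To illustrate the behaviour described above, here is a minimal, self-contained sketch of per-block OCR that honours a skip list. It is not the code from this PR or from the unstructured library: `Block`, `ocr_blocks`, and `run_ocr` are hypothetical names made up for this example, and the real change lives in partition/pdf.py and partition/pdf_image/ocr.py.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Block:
    """Hypothetical stand-in for a detected layout element (not a library class)."""

    type: str  # e.g. "Image", "Title", "NarrativeText"
    text: Optional[str] = None


def ocr_blocks(
    blocks: Iterable[Block],
    run_ocr: Callable[[Block], str],
    skip_types: Optional[List[str]] = None,
) -> List[Block]:
    """Run OCR block by block, leaving blocks whose type is in skip_types untouched.

    This mirrors the idea of pdf_skip_ocr_element_types in "individual blocks"
    mode; the function itself is an assumption made for illustration.
    """
    skip = set(skip_types or [])
    result = []
    for block in blocks:
        if block.type not in skip:
            # Only pay the OCR cost for element types that were not excluded.
            block.text = run_ocr(block)
        result.append(block)
    return result


if __name__ == "__main__":
    # A fake OCR function so the sketch runs without any OCR engine installed.
    page = [Block("Title"), Block("Image"), Block("NarrativeText")]
    for b in ocr_blocks(page, lambda blk: f"<text of {blk.type}>", ["Image"]):
        print(b.type, "->", b.text)
```

With the kwarg proposed in this PR, the equivalent request would presumably be passed as `pdf_skip_ocr_element_types=['Image']` to partition_pdf or partition_image; that parameter exists only in this PR's branch.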