legacy office doc type conversion is not thread-safe in a container setup with Rocky Linux (potentially in general) #3763
Comments
@cwang A couple questions:
What I think happened is that threads are running in a ThreadPoolExecutor, and when more than one thread (running concurrently) calls
Yes, we built a system on top of
Given a batch of docs: a.pdf, b.pdf, c.pdf and p.doc, q.ppt
Hope it helps, and happy to explain in more detail.
The |
Changing to enhancement because I don't believe thread-safety is a promised behavior.
Right call to re-classify this as an improvement, because it depends on the usage scenario. Multiprocessing is also a good suggestion, but it's a different topic: multiprocessing with unstructured isn't easy within a container, at least from what I saw in our production environment, mostly because of the heaviness of dependencies like Tesseract. Thanks @scanny!
Describe the bug
The convert_office_doc function, used to convert legacy file types such as ppt and doc to their modern equivalents (pptx and docx respectively), is NOT thread-safe: the subprocess spawned in a thread randomly returns exit code 1 without performing the actual conversion via soffice, in a container setup with Rocky Linux base images.
See unstructured/unstructured/partition/common/common.py, line 256 in 340a07f.
To Reproduce
Take a bundle of legacy office docs, such as a few .doc and a few .ppt files, and call the partition function on them in a thread pool setup. Randomly, one of the docs fails to get converted, so the whole partition call for that file fails. It is not always the same file: any legacy file in the bundle can fail, which suggests to me that this is not a file issue but an issue with running subprocesses from threads.
Expected behavior
The legacy-to-modern office file conversion should always work, whether or not threading is used.
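The reproduction steps above can be sketched as a small harness. This is only an illustration, not code from the project: `find_failures` is a hypothetical helper, and `partition_fn` stands in for whatever callable partitions a single file, for example a thin wrapper around the library's partition function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def find_failures(
    paths: Iterable[str],
    partition_fn: Callable[[str], object],
    max_workers: int = 8,
) -> Dict[str, BaseException]:
    """Run partition_fn on every path concurrently and collect raised exceptions.

    Hypothetical helper for illustration: partition_fn is supplied by the caller.
    """
    failures: Dict[str, BaseException] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for.
        futures = {pool.submit(partition_fn, path): path for path in paths}
        for future in as_completed(futures):
            error = future.exception()
            if error is not None:
                failures[futures[future]] = error
    return failures
```

Running this several times over the same bundle and comparing the keys of the returned dict shows whether the failing file changes between runs. If it does, the failure is more likely tied to concurrency than to any particular document.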
Screenshots
N/A
Environment Info
I've tested with a wide range of Rocky base images + Python 3.10/3.11/3.12 for this issue.
Additional context
My workaround is to always process the legacy office docs in a bundle sequentially: I pick out all the legacy office docs and put them in a single thread, where they are processed one at a time. It's not ideal, but maybe it should be mentioned in the OSS docs if no fix is coming any time soon?
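A minimal sketch of that workaround, under some assumptions: `process_batch` and `LEGACY_EXTENSIONS` are hypothetical names introduced here, the extension set only covers the .doc and .ppt types mentioned in this issue, and `partition_fn` is whatever callable partitions one file.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Assumption: only the legacy types named in this issue; extend as needed.
LEGACY_EXTENSIONS = {".doc", ".ppt"}


def process_batch(
    paths: Iterable[str],
    partition_fn: Callable[[str], object],
    max_workers: int = 4,
) -> Dict[str, object]:
    """Partition modern files concurrently and legacy office files one at a time."""
    all_paths = list(paths)
    legacy = [p for p in all_paths if Path(p).suffix.lower() in LEGACY_EXTENSIONS]
    modern = [p for p in all_paths if Path(p).suffix.lower() not in LEGACY_EXTENSIONS]

    def run_legacy_sequentially() -> List[object]:
        # A single task handles every legacy file in order, so at most one
        # legacy conversion is in flight at any time.
        return [partition_fn(p) for p in legacy]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        legacy_future = pool.submit(run_legacy_sequentially)
        modern_results = list(pool.map(partition_fn, modern))
        legacy_results = legacy_future.result()

    results: Dict[str, object] = dict(zip(modern, modern_results))
    results.update(zip(legacy, legacy_results))
    return results
```

With a batch such as a.pdf, b.pdf, c.pdf, p.doc and q.ppt, the three PDFs are partitioned in parallel while p.doc and q.ppt run back to back in one worker. Any exception raised by `partition_fn` propagates to the caller, so error handling per file would still need to be added.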