Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

legacy office doc type conversion is not thread-safe in a container setup with Rocky Linux (potentially in general) #3763

Open
cwang opened this issue Oct 30, 2024 · 5 comments
Labels
doc Related to Microsoft Word (.doc) legacy file format enhancement New feature or request ppt Related to Microsoft PowerPoint (.ppt) legacy file format

Comments

@cwang
Copy link

cwang commented Oct 30, 2024

Describe the bug

The convert_office_doc function used to convert file types such as ppt and doc to their modern equivalents (pptx and docx respectively for example) is NOT thread safe as in the subprocess spun in a thread would randomly return exit code 1 without doing actual conversion via soffice in a container setup with Rocky Linux base images.

See

To Reproduce
Take a bundle of legacy office docs such as a few .doc and a few .ppt files, and call partition function in a thread pool setup, to see that randomly one of the doc would fail to get converted (therefore the whole partition function for that file fails). BUT it's definitely not always one file but can be any legacy file in that pack of documents, which suggests to me it's not a file issue but a threading with subprocess issue.

Expected behavior
The legacy to modern office file conversion should always work despite threading or not.

Screenshots
N/A

Environment Info
I've tested with a wide range of Rocky base images + Python 3.10/3.11/3.12 for this issue.

Additional context
My workaround is to always do sequential processing among a pack of documents, by picking out all the legacy office docs and put them in a single thread to be processed sequentially. It's not ideal but maybe it should be mentioned in the OSS docs if no fix is coming any time soon?

@cwang cwang added the bug Something isn't working label Oct 30, 2024
@scanny scanny added doc Related to Microsoft Word (.doc) legacy file format ppt Related to Microsoft PowerPoint (.ppt) legacy file format labels Oct 30, 2024
@scanny
Copy link
Collaborator

scanny commented Oct 30, 2024

@cwang A couple questions:

  1. Is there any clue as to the cause of the failure? Like some message that occurs? Or does it just silently fail?
  2. How does the multi-threading arise? Is that something you add? How do you accomplish the single-threading that avoids the problem?

@cwang
Copy link
Author

cwang commented Oct 30, 2024

  1. Is there any clue as to the cause of the failure? Like some message that occurs? Or does it just silently fail?

What I think happened is running threads in a threadpoolexecutor and when more than one threads (running concurrently) calling subprocess(), some of them failed with blank message and exit code 1.

  1. How does the multi-threading arise? Is that something you add? How do you accomplish the single-threading that avoids the problem?

Yes we built a system on top of partition function call for office docs, and the only way we can avoid this happening is demonstrated below:

Given a batch of docs: a.pdf, b.pdf, c.pdf and p.doc, q.ppt
As we know the fact that: both .doc and .ppt calls the convert_office_doc function mentioned above that calls subprocess()
Then running processing using 5 threads concurrently in a thread pool would always result in one legacy office doc (sometimes p.doc other times q.ppt) not being partitioned because of the exit code 1 from subprocess() call;
However using 4 threads and sequentially processing p.dpc and q.ppt in a single thread while the other 3 threads processing 3 pdf files concurrently would make sure every file is processed successfully.
It also explains why tests don't usually caught this kind of errors because they tend to be ran sequentially or even in a parallel testing scenario, it's less unlikely you run these conversion tasks exactly at the same time.

Hope it helps and happy to explain in more details.

@scanny
Copy link
Collaborator

scanny commented Oct 30, 2024

The soffice command actually loads the LibreOffice application. On my Mac I can see the application icon pop up in the dock when running soffice and then disappear after the conversion is completed. I suspect that's where this particular thread-safety problem lies. You might try running it with multiprocessing rather than multithreading to see if that remedies the situation.

@scanny scanny changed the title bug/legacy office doc type conversion is not thread-safe in a container setup with Rocky Linux (potentially in general) legacy office doc type conversion is not thread-safe in a container setup with Rocky Linux (potentially in general) Oct 30, 2024
@scanny scanny added enhancement New feature or request and removed bug Something isn't working labels Oct 30, 2024
@scanny
Copy link
Collaborator

scanny commented Oct 30, 2024

Changing to enhancement because I don't believe thread-safety is a promised behavior.

@cwang
Copy link
Author

cwang commented Nov 1, 2024

Right call to re-classify this as an improvement because it's usage scenario dependent.

Also multiprocessing is a good suggestion however it's a different topic as in why multiprocessing with unstructured isn't easy in within a container, at least that's what I saw in our production environment, mostly because of the heaviness of stuff like Tesseract.

Thanks @scanny!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Related to Microsoft Word (.doc) legacy file format enhancement New feature or request ppt Related to Microsoft PowerPoint (.ppt) legacy file format
Projects
None yet
Development

No branches or pull requests

2 participants