bug/dont-clean-bullets-in-partition-docx #3463
Comments
@jgen1 what is the problem this produces? Or is it just a matter of principle like segregation of responsibilities? In several formats, like HTML and DOCX off the top of my head, a list-item is indicated semantically, like by being a Removing a "manual" bullet character makes
Thank you for the quick response, @scanny. In my use case, I want to capture the bullet or the list number, whatever marker the list item actually has. I can't speak to HTML, but for Word I know that if you create a docx file, add a bunch of bulleted and numbered lists, and run it through python-docx, the text of each "List Item" will not include the bullet or number.
If a file with the content above is loaded into python-docx, it will show each item's text as "This is number 1", "This is number 2", "This is a", without the actual 1., 2., and a. in the text. The same is true for bullets. This is because the marker of a numbered list item isn't stored directly in the text. So clean_bullets is not helpful there, because those List Items don't contain bullets in the text anyway. Now for my use case, where I want those bullets to appear as text, I am able to run a docx macro to convert all the numbered lists to text, but when I then pass the result to partition_docx, it cleans the bullets away.
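To illustrate why the marker is missing from the text, here is a minimal sketch using only the standard library. The XML fragment is hand-written to resemble what Word stores for a numbered list item (an assumption for illustration, not taken from a real file), and `paragraph_text` and `is_numbered` are hypothetical helpers, not python-docx functions. The numbering lives in the paragraph properties (`w:numPr`), while the text runs hold only the item's words.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written fragment resembling one numbered list item (illustrative only).
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:pStyle w:val="ListParagraph"/>
    <w:numPr>
      <w:ilvl w:val="0"/>
      <w:numId w:val="1"/>
    </w:numPr>
  </w:pPr>
  <w:r><w:t>This is number 1</w:t></w:r>
</w:p>
"""


def paragraph_text(p: ET.Element) -> str:
    """Hypothetical helper: join the text of every w:t element in the paragraph."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))


def is_numbered(p: ET.Element) -> bool:
    """Hypothetical helper: True when the paragraph carries numbering properties."""
    return p.find("w:pPr/w:numPr", NS) is not None


p = ET.fromstring(PARAGRAPH_XML.strip())
text = paragraph_text(p)
numbered = is_numbered(p)
print(repr(text), numbered)
```

The "1." that Word renders is computed from the numbering definition referenced by `w:numId`, so it never appears in the run text. A bullet that was typed in or converted to text by a macro, on the other hand, is part of the run text and is what a bullet-cleaning step would strip.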
Removed the I think the enhancement idea is to capture bullet metadata, in particular for numbered list-items.
Quick distinction: python-docx does not currently capture the bullet metadata for lists, so that would be a feature they would have to implement.
Problem
Using partition_docx removes bullets from text.
This is unstructured/unstructured/docx/partition.py
Lines 474-484
Solution
Personally, I am in favor of item 2. In my opinion, cleaning of text should not occur in the partitioning function. My use case requires all text, including bullets, to be pulled from word documents. Unstructured has separate steps for cleaning, including removing bullets, so it seems that this code shouldn't be in the partitioning.
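The separation argued for here can be sketched as follows. This is not the unstructured implementation: `partition_paragraphs`, `strip_leading_marker`, and the regular expression are hypothetical, simplified stand-ins chosen for illustration. Partitioning returns the text untouched, and marker removal happens only when the caller opts in.

```python
import re
from typing import Callable, Iterable, List, Optional

# Simplified, hypothetical pattern: a leading bullet glyph, or a marker
# such as "1.", "2)", or "a.", followed by whitespace.
_LEADING_MARKER = re.compile(r"^\s*(?:[\u2022\u00b7\-\*]|\d+[.)]|[A-Za-z][.)])\s+")


def strip_leading_marker(text: str) -> str:
    """Hypothetical cleaner: remove one leading bullet or list marker."""
    return _LEADING_MARKER.sub("", text, count=1)


def partition_paragraphs(
    paragraphs: Iterable[str],
    cleaner: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Hypothetical partitioner: keep text as-is unless a cleaner is supplied."""
    elements = []
    for text in paragraphs:
        if not text.strip():
            continue  # skip empty paragraphs
        elements.append(cleaner(text) if cleaner else text)
    return elements


raw = ["1. This is number 1", "a. This is a", "\u2022 A bullet"]
untouched = partition_paragraphs(raw)
cleaned = partition_paragraphs(raw, cleaner=strip_leading_marker)
print(untouched)
print(cleaned)
```

With this shape, a caller who needs the markers simply omits the cleaner, and a caller who wants them removed applies the cleaning step explicitly.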
@scanny I see you have commented on this code chunk. Do you have any thoughts?