bug/dont-clean-bullets-in-partition-docx #3463

jgen1 · 2024-08-01T15:02:18Z

Problem

partition_docx removes bullet characters from list-item text.
The code responsible is in unstructured/unstructured/docx/partition.py, lines 474-484:

        # NOTE(scanny) - a list-item gets some special treatment, mutating the text to remove a
        # bullet-character if present.
        if self._is_list_item(paragraph):
            clean_text = clean_bullets(text).strip()
            if clean_text:
                yield ListItem(
                    text=clean_text,
                    metadata=metadata,
                    detection_origin=DETECTION_ORIGIN,
                )
            return

Solution

1. Make this a configurable parameter.
2. Just remove this from the docx partitioning.

Personally, I am in favor of item 2. In my opinion, cleaning of text should not occur in the partitioning function. My use case requires all text, including bullets, to be pulled from word documents. Unstructured has separate steps for cleaning, including removing bullets, so it seems that this code shouldn't be in the partitioning.
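For illustration, a configurable version could look roughly like the sketch below. This is not the library's code: `strip_bullets` is a hypothetical flag name, `BULLET_RE` and this `clean_bullets` are simplified stand-ins for the real cleaner, and `ListItem` is a minimal stand-in class. With the flag off, the text comes through untouched, which is what removing the call outright would do unconditionally.

```python
import re
from dataclasses import dataclass
from typing import Iterator

# Simplified stand-in for the library's bullet pattern (assumption, not the real regex).
BULLET_RE = re.compile(r"^\s*[\u2022\u25CF\u25E6\u00B7*-]\s+")


def clean_bullets(text: str) -> str:
    """Stand-in cleaner: drop one leading bullet character, if present."""
    return BULLET_RE.sub("", text, count=1).strip()


@dataclass
class ListItem:
    """Minimal stand-in for the library's ListItem element."""

    text: str


def iter_list_item(text: str, strip_bullets: bool = True) -> Iterator[ListItem]:
    """Yield a ListItem, cleaning the bullet only when `strip_bullets` is set.

    `strip_bullets` is a hypothetical parameter name used for this sketch.
    """
    item_text = clean_bullets(text) if strip_bullets else text.strip()
    if item_text:
        yield ListItem(text=item_text)


# Flag off: the manual bullet survives. Flag on (default): it is stripped.
kept = [e.text for e in iter_list_item("\u2022 First point", strip_bullets=False)]
stripped = [e.text for e in iter_list_item("\u2022 First point")]
```

Defaulting the flag to True would keep today's output for existing callers, at the cost of one more parameter to document.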

@scanny I see you commented on this code chunk, do you have any thoughts?

scanny · 2024-08-01T17:14:45Z

@jgen1 what is the problem this produces? Or is it just a matter of principle like segregation of responsibilities?

In several formats, like HTML and DOCX off the top of my head, a list-item is indicated semantically, like by being a <li> HTML element or in DOCX by having a List Item paragraph style applied. So there is no bullet character present in the text in those cases.

Removing a "manual" bullet character makes ListItem elements consistent (text only, no leading bullet character) across the various document types, so that's a plus to the way it is at present.
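A small sketch of that consistency argument, using only the standard library. The HTML snippet, the sample paragraphs, and the bullet regex are all made up for illustration; the regex is a simplified stand-in, not the library's pattern.

```python
import re
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the text of each <li> element."""

    def __init__(self) -> None:
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


# Semantic list items: the markup says "list item", the text has no bullet.
parser = ListItemText()
parser.feed("<ul><li>Buy milk</li><li>Call Bob</li></ul>")
semantic_items = [s.strip() for s in parser.items]

# Paragraphs where the author typed the bullet by hand.
manual_paragraphs = ["\u2022 Buy milk", "- Call Bob"]
# Simplified stand-in for bullet cleaning (assumption, not the library's regex).
manual_items = [re.sub(r"^\s*[\u2022*-]\s+", "", p) for p in manual_paragraphs]
```

After stripping, `manual_items` matches `semantic_items`, which is the uniformity the current behavior is meant to provide.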

jgen1 · 2024-08-01T17:28:44Z

Thank you for the quick response @scanny.

In my use case, I want to capture bullet points, or whatever the numbered list item actually is.

I can't speak to HTML, but I know for Word, if you go create a docx file, add in a bunch of bullets/numbered lists and run that through python-docx, it will not include the bullet or numbered list in the text of that "List Item".

1. This is number 1
2. This is number 2
   a. This is a

If a file with the content above is loaded into python-docx, it will show each item's text as "This is number 1", "This is number 2", "This is a" without the actual 1., 2., and a. in the text. The same is true for bullets. This is because the number or bullet of an automatic list item isn't stored in the paragraph text. So clean_bullets is not needed there: those List Items don't contain a bullet in their text anyway. For my use case, where I want the bullets to appear as text, I can run a docx macro to convert all the numbered lists to text, but then partition_docx cleans those bullets away.
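The reason shows up in the file format itself. The sketch below parses a hand-written WordprocessingML fragment with the standard library; the fragment is an assumption about what a typical auto-numbered paragraph looks like, not something extracted from a real file. The visible text sits in the `w:t` run, while the numbering is only a `w:numPr` reference to a list definition stored elsewhere in the package.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written fragment in the shape Word uses for an auto-numbered paragraph.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
  </w:pPr>
  <w:r><w:t>This is number 1</w:t></w:r>
</w:p>
"""

paragraph = ET.fromstring(PARAGRAPH_XML)
ns = {"w": W}

# The visible text lives in the w:t runs ...
text = "".join(t.text or "" for t in paragraph.iterfind(".//w:t", ns))

# ... while the numbering is only a reference to a list definition.
num_pr = paragraph.find("w:pPr/w:numPr", ns)
num_id = num_pr.find("w:numId", ns).get(f"{{{W}}}val")

print(repr(text))  # the run text, with no "1." prefix
print(num_id)      # id of the referenced numbering definition
```

Anything that reads only the run text, as described above for python-docx, therefore never sees the rendered "1." or bullet glyph.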

scanny · 2024-08-05T05:31:06Z

Removed the bug label since the current behavior is the expected behavior.

I think the enhancement idea is to capture bullet metadata, in particular for numbered list-items.

jgen1 · 2024-08-05T18:45:28Z

Quick distinction: python-docx does not currently capture the bullet metadata for lists, so that would be a feature they would have to implement.
What I would want here is: if a List Item does happen to contain a bullet string, don't remove that bullet. If the list item string contained a "1. " as a numbered list, that would not get removed. The same should hold for other bullet-type characters like "-" and "o". So to me it seems clean_bullets should not happen here.
So this enhancement would just be to take clean_bullets out of the partitioning - see PR

jgen1 added the bug Something isn't working label Aug 1, 2024

This was referenced Aug 1, 2024

bug/dont-clean-bullets-in-partition-docx #3455

Closed

Remove clean_bullets from partition_docx #3464

Open

scanny added enhancement New feature or request and removed bug Something isn't working labels Aug 5, 2024
