bug/dont-clean-bullets-in-partition-docx #3455

jgen1 · 2024-07-31T20:02:52Z

Problem

Using partition_docx removes bullet characters from list-item text.
The relevant code is in unstructured/unstructured/docx/partition.py, lines 474-484:

        # NOTE(scanny) - a list-item gets some special treatment, mutating the text to remove a
        # bullet-character if present.
        if self._is_list_item(paragraph):
            clean_text = clean_bullets(text).strip()
            if clean_text:
                yield ListItem(
                    text=clean_text,
                    metadata=metadata,
                    detection_origin=DETECTION_ORIGIN,
                )
            return

Solution

1. Make this a configurable parameter.
2. Just remove this from the docx partitioning.

Personally, I am in favor of item 2. In my opinion, text cleaning should not occur in the partitioning function. My use case requires all text, including bullets, to be pulled from Word documents. Unstructured has separate steps for cleaning, including removing bullets, so it seems that this code shouldn't be in the partitioning step.
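For comparison, item 1 could look roughly like the sketch below. This is only an illustration, not the library's code: `list_item_text`, the `strip_bullets` flag, and `BULLET_RE` are hypothetical names, and the bullet pattern is a simplified stand-in for whatever `clean_bullets` actually matches.

```python
import re

# Hypothetical, simplified bullet pattern for illustration only;
# the pattern used by the library's clean_bullets may differ.
BULLET_RE = re.compile(r"^\s*[\u2022\u25CF\u25E6\u2043\u2023\*\-]\s+")


def list_item_text(text: str, strip_bullets: bool = True) -> str:
    """Return list-item text, optionally removing one leading bullet character.

    `strip_bullets` is an assumed parameter name; True mirrors the current
    behavior described above, False keeps the text as extracted.
    """
    if strip_bullets:
        # Remove at most one leading bullet plus the whitespace after it.
        text = BULLET_RE.sub("", text, count=1)
    return text.strip()


# Current behavior: the bullet is dropped.
cleaned = list_item_text("\u2022 First item")
# Requested behavior: the bullet is kept.
preserved = list_item_text("\u2022 First item", strip_bullets=False)
```

With a flag like this, the default would stay backward compatible, while callers who need the raw text could opt out and run the cleaning step separately.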

@scanny I see you commented on this code chunk, do you have any thoughts?


jgen1 · 2024-08-01T15:02:54Z

I rewrote this issue as a bug issue. #3463

jgen1 added the enhancement label Jul 31, 2024

jgen1 changed the title ~~feat/dont-clean-bullets-in-partition-docx~~ bug/dont-clean-bullets-in-partition-docx Aug 1, 2024

jgen1 closed this as not planned Aug 1, 2024
