bug/dont-clean-bullets-in-partition-docx #3455

Closed
jgen1 opened this issue Jul 31, 2024 · 1 comment
Labels
enhancement New feature or request

Comments


jgen1 commented Jul 31, 2024

Problem

partition_docx strips bullet characters from the text of list items.
The code responsible is in unstructured/unstructured/docx/partition.py, lines 474-484:

        # NOTE(scanny) - a list-item gets some special treatment, mutating the text to remove a
        # bullet-character if present.
        if self._is_list_item(paragraph):
            clean_text = clean_bullets(text).strip()
            if clean_text:
                yield ListItem(
                    text=clean_text,
                    metadata=metadata,
                    detection_origin=DETECTION_ORIGIN,
                )
            return

Solution

  1. Make the bullet cleaning a configurable parameter.
  2. Remove the bullet cleaning from the docx partitioning entirely.
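To make option 1 concrete, here is a minimal, self-contained sketch of what a configurable flag could look like. Everything in it is an assumption for illustration: the `strip_bullets` parameter, the `iter_list_item` function, the `SketchListItem` class, and the regex-based `strip_leading_bullet` helper are hypothetical stand-ins, not the library's actual API or its `clean_bullets` implementation.

```python
import re
from dataclasses import dataclass
from typing import Iterator

# Hypothetical stand-in for the library's bullet cleaner: removes one leading
# bullet character (a few common ones only) plus the whitespace after it.
_BULLET_RE = re.compile(r"^[\u2022\u25E6\u25AA\u2043\-\*]\s+")


def strip_leading_bullet(text: str) -> str:
    """Remove a single leading bullet character, if present."""
    return _BULLET_RE.sub("", text, count=1)


@dataclass
class SketchListItem:
    """Simplified stand-in for a list-item element (text only)."""

    text: str


def iter_list_item(text: str, strip_bullets: bool = True) -> Iterator[SketchListItem]:
    """Yield a list item, cleaning the bullet only when strip_bullets is True.

    The default of True mirrors the current behavior described in this issue;
    passing False keeps the text exactly as it appears in the paragraph.
    """
    item_text = strip_leading_bullet(text) if strip_bullets else text
    item_text = item_text.strip()
    if item_text:
        yield SketchListItem(text=item_text)


if __name__ == "__main__":
    print([item.text for item in iter_list_item("\u2022 First point")])
    print([item.text for item in iter_list_item("\u2022 First point", strip_bullets=False)])
```

Defaulting the flag to the existing behavior would keep current callers unaffected, while callers who need the raw text could opt out.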

Personally, I favor option 2. In my opinion, text cleaning should not happen inside the partitioning function. My use case requires all text, including bullets, to be extracted from Word documents. Unstructured already has separate cleaning steps, including bullet removal, so this code does not seem to belong in the partitioning.
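The separation argued for here can be sketched as two independent steps: partitioning that returns text untouched, and an opt-in cleaning pass applied afterwards. This is only an illustration under stated assumptions: `partition_paragraphs`, `remove_bullet`, and `apply_cleaners` are hypothetical names written for this example and do not correspond to functions in unstructured.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical bullet pattern covering a few common bullet characters.
_BULLET_RE = re.compile(r"^\s*[\u2022\u25E6\u25AA\u2043]\s*")


def partition_paragraphs(paragraphs: Iterable[str]) -> List[str]:
    """Partitioning step: split out non-empty paragraphs without altering content."""
    return [p.strip() for p in paragraphs if p.strip()]


def remove_bullet(text: str) -> str:
    """Cleaning step: drop one leading bullet character, if present."""
    return _BULLET_RE.sub("", text, count=1)


def apply_cleaners(texts: Iterable[str], cleaners: Iterable[Callable[[str], str]]) -> List[str]:
    """Apply each cleaner in order to every text; callers choose which cleaners run."""
    cleaner_list = list(cleaners)
    cleaned = []
    for text in texts:
        for cleaner in cleaner_list:
            text = cleaner(text)
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = partition_paragraphs(["\u2022 Alpha", "", "  \u2022 Beta "])
    print(raw)                                   # bullets preserved
    print(apply_cleaners(raw, [remove_bullet]))  # bullets removed on request
```

With this split, a caller who needs the bullets simply skips the cleaning pass, and no information is lost during partitioning.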

@scanny I see you commented on this code chunk. Do you have any thoughts?

jgen1 added the enhancement label Jul 31, 2024
jgen1 changed the title from feat/dont-clean-bullets-in-partition-docx to bug/dont-clean-bullets-in-partition-docx Aug 1, 2024

jgen1 commented Aug 1, 2024

I rewrote this issue as a bug report: #3463

jgen1 closed this as not planned Aug 1, 2024