bug/dont-clean-bullets-in-partition-docx #3463

Open
jgen1 opened this issue Aug 1, 2024 · 4 comments · May be fixed by #3464
Labels
enhancement New feature or request

Comments

jgen1 commented Aug 1, 2024

Problem

Using partition_docx removes bullets from text.
The relevant code is in unstructured/unstructured/docx/partition.py, lines 474-484:

        # NOTE(scanny) - a list-item gets some special treatment, mutating the text to remove a
        # bullet-character if present.
        if self._is_list_item(paragraph):
            clean_text = clean_bullets(text).strip()
            if clean_text:
                yield ListItem(
                    text=clean_text,
                    metadata=metadata,
                    detection_origin=DETECTION_ORIGIN,
                )
            return

Solution

  1. Make this a configurable parameter
  2. Just remove this from the docx partitioning.

Personally, I am in favor of item 2. In my opinion, cleaning of text should not occur in the partitioning function. My use case requires all text, including bullets, to be pulled from word documents. Unstructured has separate steps for cleaning, including removing bullets, so it seems that this code shouldn't be in the partitioning.
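
For illustration, option 1 could look roughly like the sketch below. It is self-contained and does not use unstructured itself: `strip_bullets` is a hypothetical parameter name, and `strip_leading_bullet` is a simplified stand-in for `clean_bullets`, not the library's actual pattern.

```python
import re

# Simplified stand-in for a bullet cleaner. This pattern is an assumption made
# for the example and is NOT the pattern unstructured uses.
_LEADING_BULLET = re.compile(r"^\s*[\u2022\u00B7\u25E6\-\*]\s*")


def strip_leading_bullet(text: str) -> str:
    """Remove one leading bullet character, if present."""
    return _LEADING_BULLET.sub("", text, count=1).strip()


def list_item_text(text: str, strip_bullets: bool = True) -> str:
    """Return the text for a list item.

    `strip_bullets` is a hypothetical flag: True keeps the current behavior,
    False leaves a manually typed bullet in place.
    """
    return strip_leading_bullet(text) if strip_bullets else text.strip()


print(list_item_text("\u2022 First point"))
print(list_item_text("\u2022 First point", strip_bullets=False))
```

With the default, the bullet is stripped as it is today; passing `strip_bullets=False` returns the text untouched apart from surrounding whitespace. Option 2 would amount to always taking the `False` branch.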

@scanny I see you commented on this code chunk; do you have any thoughts?

jgen1 added the bug (Something isn't working) label Aug 1, 2024

scanny (Collaborator) commented Aug 1, 2024

@jgen1 what is the problem this produces? Or is it just a matter of principle like segregation of responsibilities?

In several formats, like HTML and DOCX off the top of my head, a list-item is indicated semantically, like by being a <li> HTML element or in DOCX by having a List Item paragraph style applied. So there is no bullet character present in the text in those cases.

Removing a "manual" bullet character makes ListItem elements consistent (text only, no leading bullet character) across the various document types, so that's a plus to the way it is at present.

jgen1 (Author) commented Aug 1, 2024

Thank you for the quick response @scanny.

In my use case, I want to capture bullet points, or whatever the numbered list item actually is.

I can't speak to HTML, but I know that for Word, if you create a docx file, add a bunch of bulleted/numbered lists, and run it through python-docx, the text of each "List Item" will not include the bullet or list number.

  1. This is number 1
  2. This is number 2
    a. This is a

If a file with the content above is loaded into python-docx, each item's text is "This is number 1", "This is number 2", and "This is a", without the actual 1., 2., and a. markers. The same is true for bullets. This is because the marker of a numbered or bulleted list item isn't stored in the paragraph text. So clean_bullets is not needed to remove bullets there, because those List Items don't contain them in the text anyway.

For my use case, where I want those bullets to appear as text, I can run a docx macro to convert all the numbered lists to text, but when I then run the result through partition_docx, it cleans the bullets away.
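
To make the point about list markers concrete, here is a small sketch using only the Python standard library rather than python-docx. The XML is a hand-written example of how a numbered paragraph is typically stored in a .docx file: the numbering is referenced from the paragraph properties (`w:numPr`), while the run text holds only the words. The `numId` and `ilvl` values are made up for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written example of a numbered list paragraph; the ids are illustrative.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:numPr>
      <w:ilvl w:val="0"/>
      <w:numId w:val="1"/>
    </w:numPr>
  </w:pPr>
  <w:r><w:t>This is number 1</w:t></w:r>
</w:p>
"""

p = ET.fromstring(PARAGRAPH_XML.strip())

# The visible text comes from the w:t elements only.
text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))

# The list membership lives in the paragraph properties, not in the text.
num_pr = p.find("./w:pPr/w:numPr", NS)
ilvl = num_pr.find("w:ilvl", NS).get(f"{{{W}}}val")
num_id = num_pr.find("w:numId", NS).get(f"{{{W}}}val")

print(repr(text))
print("list level:", ilvl, "numbering id:", num_id)
```

The "1." never appears in the text; Word renders it from the numbering definition that `numId` points to. A bullet that has been typed or converted into the text itself is a different case, and that is the one affected by the cleaning step.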

scanny added the enhancement (New feature or request) label and removed the bug (Something isn't working) label Aug 5, 2024

scanny (Collaborator) commented Aug 5, 2024

Removed the bug label since the current behavior is the expected behavior.

I think the enhancement idea is to capture bullet metadata, in particular for numbered list-items.

jgen1 (Author) commented Aug 5, 2024

Quick distinction: python-docx does not currently capture the bullet metadata for lists, so that would be a feature they would have to implement.
What I would want here is: if a List Item does happen to contain a bullet string, don't remove that bullet. A list-item string that starts with "1. " as a list number would not get removed, and the same should hold for other bullet-type characters like "-" and "o". So to me it seems clean_bullets should not be called here.
So this enhancement would just be to take the clean_bullets call out of the partitioning; see PR
