Problem
Using partition_docx removes bullets from text.
The relevant code is in unstructured/unstructured/docx/partition.py, lines 474-484:
# NOTE(scanny) - a list-item gets some special treatment, mutating the text to remove a
# bullet-character if present.
if self._is_list_item(paragraph):
    clean_text = clean_bullets(text).strip()
    if clean_text:
        yield ListItem(
            text=clean_text,
            metadata=metadata,
            detection_origin=DETECTION_ORIGIN,
        )
    return
Solution
1. Make this a configurable parameter.
2. Just remove this from the docx partitioning.
Personally, I am in favor of item 2. In my opinion, cleaning of text should not occur in the partitioning function. My use case requires all text, including bullets, to be pulled from Word documents. Unstructured already has separate cleaning steps, including one for removing bullets, so this code doesn't seem to belong in partitioning.
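To make the two options concrete, here is a rough, self-contained sketch of what I have in mind. None of these names are the library's API: `strip_leading_bullet` is a hypothetical stand-in for the real bullet cleaner, `iter_list_item_text` stands in for the list-item branch of the partitioner, and the `strip_bullets` flag is an assumed parameter name. With the flag off (item 2 behavior) the text comes through untouched, and cleaning can be applied later as a separate step.

```python
import re
from typing import Iterable, Iterator

# Hypothetical stand-in for a bullet cleaner; NOT unstructured's clean_bullets.
# Matches one leading bullet-like character followed by whitespace.
_BULLET_RE = re.compile(r"^\s*[\u2022\u25CF\u25E6\u2043\u2219\-\*·]\s+")


def strip_leading_bullet(text: str) -> str:
    """Remove a single leading bullet character, if present."""
    return _BULLET_RE.sub("", text, count=1)


def iter_list_item_text(
    paragraphs: Iterable[str], strip_bullets: bool = False
) -> Iterator[str]:
    """Yield list-item text, mutating it only when the caller opts in.

    `strip_bullets` is an assumed parameter name used for illustration.
    """
    for text in paragraphs:
        out = strip_leading_bullet(text) if strip_bullets else text
        out = out.strip()
        if out:  # skip items that are empty after (optional) cleaning
            yield out


raw = ["• first point", "- second point", "•  "]

# Default: partitioning leaves the text alone, bullets included.
kept = list(iter_list_item_text(raw))

# Opt-in: same result as the current hard-coded behavior.
cleaned = list(iter_list_item_text(raw, strip_bullets=True))

# Alternatively, clean afterwards as a separate step.
cleaned_later = [strip_leading_bullet(t) for t in kept]
```

Either way, the caller decides whether bullets are dropped, rather than the partitioner deciding for them.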
@scanny I see you commented on this code chunk, do you have any thoughts?