Fix: partition on empty or whitespace-only text files (#3675)
This fixes [this bug](#3674): auto partition fails on text files that are empty or contain only whitespace.

File-type inference for a .txt file fails if the file contains only whitespace.

To Reproduce:

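```
from tempfile import NamedTemporaryFile

from unstructured.partition.auto import partition

with NamedTemporaryFile(mode="w", suffix=".txt") as f:
    # Write whitespace only; nothing here is partitionable text.
    f.write("   \n")
    # Seeking flushes the buffered write, so the content is on disk.
    f.seek(0)
    elements = partition(filename=f.name)
```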
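The reproduction above reopens a still-open `NamedTemporaryFile` by name, which may not work on every platform (Windows is the usual exception). A minimal sketch of an alternative setup follows, assuming a temporary directory is acceptable; `write_whitespace_file` is a hypothetical helper introduced here for illustration and is not part of unstructured.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def write_whitespace_file(directory: str, content: str = "   \n") -> Path:
    """Hypothetical helper: create a .txt file that holds only whitespace."""
    path = Path(directory) / "whitespace-only.txt"
    path.write_text(content, encoding="utf-8")
    return path


with TemporaryDirectory() as tmp_dir:
    path = write_whitespace_file(tmp_dir)
    text = path.read_text(encoding="utf-8")
    # The file is not empty, yet nothing remains once whitespace is stripped.
    assert len(text) > 0
    assert text.strip() == ""
    # With unstructured installed, the file could then be partitioned:
    # elements = partition(filename=str(path))
```

The file is closed before anything else opens it, so the `partition` call in the comment would read a fully written file.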
tc360950 authored Sep 29, 2024
1 parent 50d75c4 commit 75c4998
Showing 5 changed files with 28 additions and 16 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.15.14-dev4
+## 0.15.14-dev5

### Enhancements

@@ -11,6 +11,7 @@
* **Update Python SDK usage in `partition_via_api`.** Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
* **Remove "unused" `date_from_file_object` parameter.** As part of simplifying partitioning parameter set, remove `date_from_file_object` parameter. A file object does not have a last-modified date attribute so can never give a useful value. When a file-object is used as the document source (such as in Unstructured API) the last-modified date must come from the `metadata_last_modified` argument.
* **Fix occasional `KeyError` when mapping parent ids to hash ids.** Occasionally the input elements into `assign_and_map_hash_ids` can contain duplicated element instances, which lead to error when mapping parent id.
+* **Allow empty text files.** Fixes an issue where text files with only white space would fail to be partitioned.

## 0.15.13

3 changes: 3 additions & 0 deletions example-docs/fake-text-all-whitespace.txt
@@ -0,0 +1,3 @@



34 changes: 21 additions & 13 deletions test_unstructured/partition/test_auto.py
@@ -749,22 +749,30 @@ def test_auto_partition_tsv_from_filename():
 # ================================================================================================
 # TXT
 # ================================================================================================
-
-
-def test_auto_partition_text_from_filename():
-    file_path = example_doc_path("fake-text.txt")
+@pytest.mark.parametrize(
+    ("filename", "expected_elements"),
+    [
+        (
+            "fake-text.txt",
+            [
+                NarrativeText(text="This is a test document to use for unit tests."),
+                Address(text="Doylestown, PA 18901"),
+                Title(text="Important points:"),
+                ListItem(text="Hamburgers are delicious"),
+                ListItem(text="Dogs are the best"),
+                ListItem(text="I love fuzzy blankets"),
+            ],
+        ),
+        ("fake-text-all-whitespace.txt", []),
+    ],
+)
+def test_auto_partition_text_from_filename(filename: str, expected_elements: list[Element]):
+    file_path = example_doc_path(filename)

     elements = partition(filename=file_path, strategy=PartitionStrategy.HI_RES)

-    assert elements == [
-        NarrativeText(text="This is a test document to use for unit tests."),
-        Address(text="Doylestown, PA 18901"),
-        Title(text="Important points:"),
-        ListItem(text="Hamburgers are delicious"),
-        ListItem(text="Dogs are the best"),
-        ListItem(text="I love fuzzy blankets"),
-    ]
-    assert all(e.metadata.filename == "fake-text.txt" for e in elements)
+    assert elements == expected_elements
+    assert all(e.metadata.filename == filename for e in elements)
     assert all(e.metadata.file_directory == example_doc_path("") for e in elements)


2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.15.14-dev4" # pragma: no cover
+__version__ = "0.15.14-dev5" # pragma: no cover
2 changes: 1 addition & 1 deletion unstructured/file_utils/filetype.py
@@ -601,7 +601,7 @@ def _is_json(self) -> bool:
         text_head = self._ctx.text_head

         # -- an empty file is not JSON --
-        if not text_head:
+        if not text_head.lstrip():
             return False

         # -- has to be a list or object, no string, number, or bool --
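The one-line change above matters because a whitespace-only string such as `"   \n"` is truthy in Python, so `if not text_head` does not catch it. The sketch below illustrates the guard in isolation. It assumes that the step after the guard inspects the first non-whitespace character (the trailing comment in the hunk suggests this, but that code is not shown here). `looks_like_json_collection` is a hypothetical stand-in, not the library's `_is_json`.

```python
import json


def looks_like_json_collection(text_head: str) -> bool:
    """Hypothetical stand-in for the check in the hunk above, not library code."""
    stripped = text_head.lstrip()

    # An empty or whitespace-only head is not JSON. Testing the stripped text
    # is the point of the fix: "   \n" is truthy, so `not text_head` is False.
    if not stripped:
        return False

    # Assumed follow-up step: only a list or an object counts as JSON here.
    # Without the guard above, indexing an empty string would raise IndexError.
    if stripped[0] not in "[{":
        return False

    try:
        json.loads(text_head)
    except json.JSONDecodeError:
        return False
    return True


# Whitespace-only and empty input are both rejected without raising.
assert looks_like_json_collection("   \n") is False
assert looks_like_json_collection("") is False
assert looks_like_json_collection(' {"a": 1}') is True
```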