Added 'inline' to content disposition check #3489

Open: wants to merge 4 commits into main

1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -11,6 +11,7 @@
* **Renames Astra to Astra DB** Conforms with DataStax internal naming conventions.
* **Accommodate single-column CSV files.** Resolves a limitation of `partition_csv()` where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
* **Accommodate `image/jpg` in PPTX as alias for `image/jpeg`.** Resolves problem partitioning PPTX files having an invalid `image/jpg` (should be `image/jpeg`) MIME-type in the `[Content_Types].xml` member of the PPTX Zip archive.
* **EML File Partitioning** EML parts with a `Content-Disposition` of `inline` are now included in the content map when creating elements.
* **Fixes an issue in Object Detection metrics** The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.

## 0.15.1
47 changes: 47 additions & 0 deletions example-docs/eml/text-part-marked-inline.eml
@@ -0,0 +1,47 @@
msip_labels:
    MSIP_Label_5b083577-197b-450c-831d-519cf3f56cd2_ActionId=e50e55f0-3ca8-4485-cd8a-f13d1558cae5;
Received: Thu, 1 Feb 2024 00:00:00 +1000
From: Example User <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Project Proposal
Date: Thu, 1 Feb 2024 00:00:00 +1000
MIME-Version: 1.0
Content-type: Multipart/mixed; charset=us-ascii;
    boundary="B614747692556E4F8C3F55D8444354BC-2432FD0F_message_boundary"
Content-Description: Multipart message


--B614747692556E4F8C3F55D8444354BC-2432FD0F_message_boundary
Content-type: Multipart/related; charset=ISO-8859-1;
    boundary="A32785A2178ABE448C898C485850D5DD-2432FD0F_message_boundary"
Content-Description: Multipart message


--A32785A2178ABE448C898C485850D5DD-2432FD0F_message_boundary
Content-type: Multipart/alternative; charset=ISO-8859-1;
    boundary="EF8ECD0282019B449D4B1EBC186DDD07-2432FD0F_message_boundary"
Content-Description: Multipart message


--EF8ECD0282019B449D4B1EBC186DDD07-2432FD0F_message_boundary
Content-type: text/plain; charset=us-ascii
Content-Transfer-Encoding: Quoted-printable
Content-Disposition: inline
Content-Description: Message text

Hi

=20

Please find attached a project proposal.

Please let us know if you have any questions or would like to discuss=
anything at this stage.

=20

Kind regards

User

=20
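
For readers less familiar with the fixture's encoding, here is a minimal sketch using only the Python standard library. The message text in `RAW` is a shortened, hypothetical stand-in for the fixture (it is not the file itself, and the `outer` boundary is made up). It shows that a `text/plain` part marked `Content-Disposition: inline` is reported as `"inline"` by `get_content_disposition()`, and that `get_payload(decode=True)` resolves the quoted-printable `=20` escapes and the trailing `=` soft line break.

```python
from email import message_from_string

# Hypothetical, shortened stand-in for the fixture above (not the real file).
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="outer"\n'
    "\n"
    "--outer\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: Quoted-printable\n"
    "Content-Disposition: inline\n"
    "\n"
    "Hi\n"
    "=20\n"
    "Please let us know if you would like to discuss=\n"
    " anything at this stage.\n"
    "--outer--\n"
)

msg = message_from_string(RAW)

# Pick the first text/plain part while walking the MIME tree.
text_part = next(p for p in msg.walk() if p.get_content_type() == "text/plain")

# The disposition is normalized to lowercase, without parameters.
disposition = text_part.get_content_disposition()

# decode=True applies the Content-Transfer-Encoding and returns bytes.
decoded = text_part.get_payload(decode=True).decode("us-ascii")

print(disposition)
print(decoded)
```

Note that the soft line break joins the two lines without adding a space, so the space before "anything" has to come from the continuation line itself.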
10 changes: 10 additions & 0 deletions test_unstructured/partition/test_email.py
@@ -679,3 +679,13 @@ def test_partition_eml_respects_detect_language_per_element():

    assert "eng" in langs
    assert "spa" in langs

def test_partition_reads_message_part_with_inline_content_disposition():
    elements = partition_email(
        example_doc_path("eml/text-part-marked-inline.eml"), process_attachments=False
    )

    assert len(elements) == 1
    e = elements[0]
    assert e.text.startswith("Hi Please find attached a project proposal.")
    assert e.text.endswith("Kind regards User ")
4 changes: 2 additions & 2 deletions unstructured/partition/email.py
@@ -387,9 +387,9 @@ def partition_email(
     is_encrypted = False
     content_map: dict[str, str] = {}
     for part in msg.walk():
-        # NOTE(robinson) - content disposition is None for the content of the email itself.
+        # content disposition is None/inline for the content of the email itself.
         # Other dispositions include "attachment" for attachments
-        if part.get_content_disposition() is not None:
+        if part.get_content_disposition() not in (None, "inline"):
             continue
         content_type = part.get_content_type()

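
The effect of the changed condition can be checked in isolation with the standard library. The following is a minimal sketch, not code from `unstructured`: the two-part message (an `inline` text body plus a PDF `attachment`) and the names `build_message`, `kept_content_types`, `old_skip`, and `new_skip` are all hypothetical, introduced only to compare the two predicates.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build a hypothetical message: inline text body plus one attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Project Proposal"
    # The text body is explicitly marked inline, as in the new fixture.
    msg.set_content("Please find attached a project proposal.", disposition="inline")
    # Adding an attachment converts the message to multipart/mixed.
    msg.add_attachment(
        b"%PDF-1.4", maintype="application", subtype="pdf", filename="proposal.pdf"
    )
    return msg


def kept_content_types(msg, skip):
    """Return content types of the parts that `skip` does not filter out."""
    return [
        part.get_content_type()
        for part in msg.walk()
        if not skip(part.get_content_disposition())
    ]


def old_skip(disposition):
    # Condition before this change: skip anything with a disposition.
    return disposition is not None


def new_skip(disposition):
    # Condition after this change: skip only non-inline dispositions.
    return disposition not in (None, "inline")


print(kept_content_types(build_message(), old_skip))
print(kept_content_types(build_message(), new_skip))
```

Under these assumptions the old condition keeps only the multipart container, so the inline text body never reaches the content map, while the new condition also keeps the `text/plain` part and still skips the attachment.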