rfctr(email): eml partitioner rewrite (#3694)
**Summary**
Initial attempts to incrementally refactor `partition_email()` into a
shape that allows pluggable partitioning quickly became too complex for
ready code review. Instead, prepare a separate rewritten module and tests
and swap them in whole.

**Additional Context**
- Use the modern stdlib `email` module to reliably perform several
decoding steps that the legacy code did manually.
- Remove obsolete email-specific element-types which were replaced 18
months or so ago with email-specific metadata fields for things like Cc:
addresses, subject, etc.
- Remove support for passing an email as `text: str`, because MIME email is
inherently a binary format which can, and often does, contain multiple and
contradictory character encodings.
- Remove the `encoding` parameter, as it is now unused. An email file is not
a text file and as such does not have a single overall encoding.
Character encoding is specified individually for each MIME-part within
the message and often varies from one part to another in the same
message.
- Remove the need for a caller to specify `attachment_partitioner`.
There is only one reasonable choice for this which is
`auto.partition()`, consistent with the same interface and operation in
`partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a
file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple
transport-encoding/charset combinations.
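
The point about per-part character encodings can be illustrated with the stdlib alone. The following is a minimal sketch, not the code in this commit: the message bytes, the addresses, and the helper name `decoded_text_parts` are all made up for illustration. It parses a message from bytes with `email.policy.default` and lets `get_content()` undo each part's own transfer-encoding and charset.

```python
import base64
from email import message_from_bytes, policy

# Hypothetical message, for illustration only: the plain-text part is
# ISO-8859-1 sent as quoted-printable, the HTML part is UTF-8 sent as base64.
html_b64 = base64.b64encode("<p>Café</p>".encode("utf-8"))

RAW_EMAIL = b"\n".join(
    [
        b"From: alice@example.com",
        b"To: bob@example.com",
        b"Subject: Mixed charsets",
        b"MIME-Version: 1.0",
        b'Content-Type: multipart/alternative; boundary="b1"',
        b"",
        b"--b1",
        b'Content-Type: text/plain; charset="iso-8859-1"',
        b"Content-Transfer-Encoding: quoted-printable",
        b"",
        b"Caf=E9",
        b"--b1",
        b'Content-Type: text/html; charset="utf-8"',
        b"Content-Transfer-Encoding: base64",
        b"",
        html_b64,
        b"--b1--",
        b"",
    ]
)


def decoded_text_parts(raw: bytes) -> list:
    """Return (content_type, charset, text) for every text/* part in `raw`."""
    # policy.default selects the modern EmailMessage API.
    msg = message_from_bytes(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        # Skip the multipart container itself and any non-text parts.
        if part.get_content_maintype() != "text":
            continue
        # get_content() decodes the transfer-encoding, then the part's charset.
        text = part.get_content().strip()
        parts.append((part.get_content_type(), part.get_content_charset(), text))
    return parts


PARTS = decoded_text_parts(RAW_EMAIL)
for content_type, charset, text in PARTS:
    print(content_type, charset, text)
```

Both parts decode to the same word even though no single encoding would decode the whole file, which is why a file-level `encoding` argument has nothing useful to do here.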

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
3 people authored Oct 16, 2024
1 parent 9049e4e commit 1eceac2
Showing 29 changed files with 2,061 additions and 1,195 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,13 @@
## 0.16.1-dev0

### Enhancements

### Features

### Fixes

* **Rewrite of `partition.email` module and tests.** Use modern Python stdlib `email` module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.

## 0.16.0

### Enhancements
3 changes: 3 additions & 0 deletions example-docs/eml/empty.eml
@@ -0,0 +1,3 @@



934 changes: 934 additions & 0 deletions example-docs/eml/mime-attach-mp3.eml

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions example-docs/eml/mime-different-plain-html.eml
@@ -0,0 +1,34 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example MIME Email
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="boundary123"

--boundary123
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
This is the text/plain part.
Did you know that the first email was sent by Ray Tomlinson in 1971? He used the "@" symbol to separate the user's name from the computer name, a practice that is still in use today.
Another interesting fact is that the first known instance of email spam occurred in 1978. A marketing message was sent to 393 recipients on ARPANET, marking the beginning of what we now know as email spam.
--boundary123
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: 7bit

<!DOCTYPE html>
<html>
<head>
<title>Example MIME Email</title>
</head>
<body>
<p>This is the <code>text/html</code> part.</p>
<p>Did you know that the first <b>networked email</b> was sent by Ray Tomlinson in 1971? He used the "@" symbol to separate the user's name from the computer name, a practice that is still in use today.</p>
<p>Another interesting fact is that the first known instance of <i>email spam</i> occurred in 1978. A marketing message was sent to 393 recipients on ARPANET, marking the beginning of what we now know as email spam.</p>
</body>
</html>

--boundary123--
14 changes: 14 additions & 0 deletions example-docs/eml/mime-html-only.eml
@@ -0,0 +1,14 @@
MIME-Version: 1.0
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example HTML Only MIME Email
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: base64

PHA+VGhpcyBpcyBhIHRleHQvaHRtbCBwYXJ0LjwvcD4KPGRpdiBpZD0iY29udGVudCI+PHA+VGhl
IGZpcnN0IGVtb3RpY29uLCA6KSAsIHdhcyBwcm9wb3NlZCBieSBTY290dCBGYWhsbWFuIGluIDE5
ODIgdG8gaW5kaWNhdGUganVzdCBvciBzYXJjYXNtIGluIHRleHQgZW1haWxzLjwvcD4KPHA+R21h
aWwgd2FzIGxhdW5jaGVkIGJ5IEdvb2dsZSBpbiAyMDA0IHdpdGggMSBHQiBvZiBmcmVlIHN0b3Jh
Z2UsIHNpZ25pZmljYW50bHkgbW9yZSB0aGFuIHdoYXQgb3RoZXIgc2VydmljZXMgb2ZmZXJlZCBh
dCB0aGUgdGltZS48L3A+PC9kaXY+
10 changes: 10 additions & 0 deletions example-docs/eml/mime-multi-to-cc-bcc.eml
@@ -0,0 +1,10 @@
From: [email protected]
To: Bob <[email protected]>, Sue <[email protected]>
Cc: Tom <[email protected]>, Alice <[email protected]>
Bcc: John <[email protected]>, Mary <[email protected]>
Subject: Example Plain-Text MIME Message
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a plain-text message.
37 changes: 37 additions & 0 deletions example-docs/eml/mime-multipart-digest.eml
@@ -0,0 +1,37 @@
From: [email protected]
To: [email protected]
Cc: [email protected]
Bcc: [email protected]
Subject: Example Multipart Digest Email
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: multipart/digest; boundary="boundary123"

--boundary123
Content-Type: message/rfc822
From: [email protected]
To: [email protected]
Subject: First Message
This is the first message in the digest.
--boundary123
Content-Type: message/rfc822
From: [email protected]
To: [email protected]
Subject: Second Message
This is the second message in the digest.
--boundary123
Content-Type: message/rfc822
From: [email protected]
To: [email protected]
Subject: Third Message
This is the third message in the digest.
--boundary123--
22 changes: 22 additions & 0 deletions example-docs/eml/mime-no-body.eml
@@ -0,0 +1,22 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Image Only Email
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="boundary123"

--boundary123
Content-Type: image/jpeg
Content-Disposition: attachment; filename="image.jpg"
Content-Transfer-Encoding: base64
/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxISEBAQEhISEBAWFRUVFhUVFRUWFRUWFhUWFhUV
FRUYHSggGBolGxUVITEhJSkrLi4uFx8zODMtNygtLisBCgoKDg0OGhAQGi0fHx8rLS0rLS0rLS0t
LS0rLS0rLS0rLS0rLS0rLS0rLS0rLS0rLS0tLS0rLS0rLS0rLS0rLf/AABEIAMgAyAMBIgACEQED
EQH/xAAbAAEAAgMBAQAAAAAAAAAAAAAABAUCAwYBB//EAD0QAAIBAwMBBgQEBgIDCQAAAAECAwAE
ERIhBTFBBhMiUWFxgZEykaGxFCNCUrHB0fAUM2JygpLwFySTwsL/xAAYAQEBAQEBAAAAAAAAAAAA
AAAABQEDBP/EAB8RAQEBAQEBAQEBAQEAAAAAAAABEQIhEjEEQVFhcf/aAAwDAQACEQMRAD8A+6qK
CiiggqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgq
CiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCo
[Base64 encoded image data continues]
--boundary123--
6 changes: 6 additions & 0 deletions example-docs/eml/mime-no-subject.eml
@@ -0,0 +1,6 @@
From: [email protected]
To: [email protected]
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a simple email message without a subject.
8 changes: 8 additions & 0 deletions example-docs/eml/mime-no-to.eml
@@ -0,0 +1,8 @@
From: [email protected]
Cc: Tom <[email protected]>, Alice <[email protected]>
Bcc: John <[email protected]>, Mary <[email protected]>
Subject: Example Plain-Text MIME Message
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a plain-text message.
22 changes: 22 additions & 0 deletions example-docs/eml/mime-simple.eml
@@ -0,0 +1,22 @@
From: [email protected]
To: [email protected]
Subject: Example Multipart/Alternative Email
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="boundary123"

--boundary123
Content-Type: text/plain; charset="UTF-8"
This is a simple email message.
--boundary123
Content-Type: text/html; charset="UTF-8"

<html>
<body>
<p>This is a simple email message.</p>
</body>
</html>

--boundary123--
7 changes: 7 additions & 0 deletions example-docs/eml/mime-word-encoded-subject.eml
@@ -0,0 +1,7 @@
From: [email protected]
To: [email protected]
Subject: =?UTF-8?B?U2ltcGxlIGVtYWlsIHdpdGgg4pi44pi/IFVuaWNvZGUgc3ViamVjdA==?=
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a simple email message with Unicode characters in the subject.
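
The `Subject:` header in this fixture is an RFC 2047 encoded-word (`=?charset?B?base64?=`). As a rough sketch of how the stdlib handles such a header, and not the code under test, the snippet below builds a similar message with a made-up subject and made-up addresses, then reads the header under both the modern `policy.default` and the legacy default policy (`compat32`).

```python
import base64
from email import message_from_bytes, policy

# Hypothetical subject, for illustration only; encoded as an RFC 2047 "B" word.
SUBJECT = "Unicode subject ☸☿"
encoded_word = (
    "=?UTF-8?B?" + base64.b64encode(SUBJECT.encode("utf-8")).decode("ascii") + "?="
)

RAW = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    f"Subject: {encoded_word}\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="UTF-8"\n'
    "\n"
    "Body text.\n"
).encode("ascii")

# Modern API: header values come back already decoded to str.
modern = message_from_bytes(RAW, policy=policy.default)
DECODED_SUBJECT = str(modern["Subject"])

# Legacy API (compat32 is the default policy): the encoded-word is returned
# as-is and the caller has to decode it manually.
legacy = message_from_bytes(RAW)
RAW_SUBJECT = str(legacy["Subject"])

print(DECODED_SUBJECT)
print(RAW_SUBJECT)
```

This is the kind of manual decoding step that the modern policy takes over from the caller.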
5 changes: 5 additions & 0 deletions example-docs/eml/rfc822-no-date.eml
@@ -0,0 +1,5 @@
From: [email protected]
To: [email protected]
Subject: Example Email Without Date Header

This is an example email message without a Date header. Note that this is non-standard and may be flagged or corrected by email servers.
10 changes: 10 additions & 0 deletions example-docs/eml/simple-rfc-822.eml
@@ -0,0 +1,10 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example RFC 822 Email

This is an RFC 822 email message.

An RFC 822 message is characterized by its simple, text-based format, which includes a header and a body. The header contains structured fields such as "From", "To", "Date", and "Subject", each followed by a colon and the corresponding information. The body follows the header, separated by a blank line, and contains the main content of the email.

The structure ensures compatibility and readability across different email systems and clients, adhering to the standards set by the Internet Engineering Task Force (IETF).
96 changes: 0 additions & 96 deletions test_unstructured/documents/test_email_elements.py

This file was deleted.

16 changes: 6 additions & 10 deletions test_unstructured/partition/test_auto.py
@@ -45,6 +45,7 @@
 )
 from unstructured.file_utils.model import FileType
 from unstructured.partition.auto import _PartitionerLoader, partition
+from unstructured.partition.common import UnsupportedFileFormatError
 from unstructured.partition.utils.constants import PartitionStrategy
 from unstructured.staging.base import elements_from_json, elements_to_dicts, elements_to_json

@@ -200,14 +201,6 @@ def test_auto_partition_email_from_file():
     assert elements == EXPECTED_EMAIL_OUTPUT


-def test_auto_partition_eml_add_signature_to_metadata():
-    elements = partition(example_doc_path("eml/signed-doc.p7s"))
-
-    assert len(elements) == 1
-    assert elements[0].text == "This is a test"
-    assert elements[0].metadata.signature == "<SIGNATURE>\n"
-
-
 # ================================================================================================
 # EPUB
 # ================================================================================================
@@ -911,7 +904,10 @@ def test_auto_partition_raises_with_bad_type(request: FixtureRequest):
         request, "unstructured.partition.auto.detect_filetype", return_value=FileType.UNK
     )

-    with pytest.raises(ValueError, match="Invalid file made-up.fake. The FileType.UNK file type "):
+    with pytest.raises(
+        UnsupportedFileFormatError,
+        match="Invalid file made-up.fake. The FileType.UNK file type is not supported in partiti",
+    ):
         partition(filename="made-up.fake", strategy=PartitionStrategy.HI_RES)

     detect_filetype_.assert_called_once_with(
@@ -1239,7 +1235,7 @@ def test_auto_partition_applies_the_correct_filetype_for_all_filetypes(
     partition_fn = getattr(module, partition_fn_name)

     # -- partition the example-doc for this filetype --
-    elements = partition_fn(file_path)
+    elements = partition_fn(file_path, process_attachments=False)

     assert elements
     assert all(