rfctr(email): eml partitioner rewrite #3694

Merged: 9 commits, Oct 16, 2024
10 changes: 10 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,13 @@
## 0.16.1-dev0

### Enhancements

### Features

### Fixes

* **Rewrite of `partition.email` module and tests.** Use modern Python stdlib `email` module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
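
As a rough illustration of what the "modern Python stdlib `email` module interface" refers to, the sketch below parses a message with `policy.default`, which yields an `email.message.EmailMessage` with decoded headers, `get_body()`, and `iter_attachments()`. It is not the code from this PR; the sample message, addresses, and variable names are made up for the example.

```python
from email import message_from_bytes, policy

# Hypothetical message (not one of the fixtures in this PR).
RAW = b"""\
From: alice@example.com
To: bob@example.com
Subject: =?UTF-8?B?SGVsbG8gd29ybGQ=?=
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="b1"

--b1
Content-Type: text/plain; charset="UTF-8"

This is the body.

--b1
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="note.bin"
Content-Transfer-Encoding: base64

aGVsbG8=

--b1--
"""

# policy.default selects the modern API (EmailMessage) rather than the legacy Message class.
msg = message_from_bytes(RAW, policy=policy.default)

# RFC 2047 encoded-word headers are decoded when accessed.
subject = str(msg["Subject"])

# Prefer the text/plain body, falling back to text/html when no plain part exists.
body_part = msg.get_body(preferencelist=("plain", "html"))
body_text = body_part.get_content() if body_part is not None else ""

# Attachments come back with the Content-Transfer-Encoding already undone.
attachments = [(part.get_filename(), part.get_content()) for part in msg.iter_attachments()]
```

With the legacy `compat32` policy, header decoding and payload decoding have to be done by hand, which is presumably part of what the rewrite removes.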

## 0.16.0

### Enhancements
3 changes: 3 additions & 0 deletions example-docs/eml/empty.eml
@@ -0,0 +1,3 @@



934 changes: 934 additions & 0 deletions example-docs/eml/mime-attach-mp3.eml

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions example-docs/eml/mime-different-plain-html.eml
@@ -0,0 +1,34 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example MIME Email
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="boundary123"

--boundary123
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit

This is the text/plain part.

Did you know that the first email was sent by Ray Tomlinson in 1971? He used the "@" symbol to separate the user's name from the computer name, a practice that is still in use today.

Another interesting fact is that the first known instance of email spam occurred in 1978. A marketing message was sent to 393 recipients on ARPANET, marking the beginning of what we now know as email spam.

--boundary123
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: 7bit

<!DOCTYPE html>
<html>
<head>
<title>Example MIME Email</title>
</head>
<body>
<p>This is the <code>text/html</code> part.</p>
<p>Did you know that the first <b>networked email</b> was sent by Ray Tomlinson in 1971? He used the "@" symbol to separate the user's name from the computer name, a practice that is still in use today.</p>
<p>Another interesting fact is that the first known instance of <i>email spam</i> occurred in 1978. A marketing message was sent to 393 recipients on ARPANET, marking the beginning of what we now know as email spam.</p>
</body>
</html>

--boundary123--
14 changes: 14 additions & 0 deletions example-docs/eml/mime-html-only.eml
@@ -0,0 +1,14 @@
MIME-Version: 1.0
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example HTML Only MIME Email
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: base64

PHA+VGhpcyBpcyBhIHRleHQvaHRtbCBwYXJ0LjwvcD4KPGRpdiBpZD0iY29udGVudCI+PHA+VGhl
IGZpcnN0IGVtb3RpY29uLCA6KSAsIHdhcyBwcm9wb3NlZCBieSBTY290dCBGYWhsbWFuIGluIDE5
ODIgdG8gaW5kaWNhdGUganVzdCBvciBzYXJjYXNtIGluIHRleHQgZW1haWxzLjwvcD4KPHA+R21h
aWwgd2FzIGxhdW5jaGVkIGJ5IEdvb2dsZSBpbiAyMDA0IHdpdGggMSBHQiBvZiBmcmVlIHN0b3Jh
Z2UsIHNpZ25pZmljYW50bHkgbW9yZSB0aGFuIHdoYXQgb3RoZXIgc2VydmljZXMgb2ZmZXJlZCBh
dCB0aGUgdGltZS48L3A+PC9kaXY+
10 changes: 10 additions & 0 deletions example-docs/eml/mime-multi-to-cc-bcc.eml
@@ -0,0 +1,10 @@
From: [email protected]
To: Bob <[email protected]>, Sue <[email protected]>
Cc: Tom <[email protected]>, Alice <[email protected]>
Bcc: John <[email protected]>, Mary <[email protected]>
Subject: Example Plain-Text MIME Message
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a plain-text message.
37 changes: 37 additions & 0 deletions example-docs/eml/mime-multipart-digest.eml
@@ -0,0 +1,37 @@
From: [email protected]
To: [email protected]
Cc: [email protected]
Bcc: [email protected]
Subject: Example Multipart Digest Email
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: multipart/digest; boundary="boundary123"

--boundary123
Content-Type: message/rfc822

From: [email protected]
To: [email protected]
Subject: First Message

This is the first message in the digest.

--boundary123
Content-Type: message/rfc822

From: [email protected]
To: [email protected]
Subject: Second Message

This is the second message in the digest.

--boundary123
Content-Type: message/rfc822

From: [email protected]
To: [email protected]
Subject: Third Message

This is the third message in the digest.

--boundary123--
22 changes: 22 additions & 0 deletions example-docs/eml/mime-no-body.eml
@@ -0,0 +1,22 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Image Only Email
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="boundary123"

--boundary123
Content-Type: image/jpeg
Content-Disposition: attachment; filename="image.jpg"
Content-Transfer-Encoding: base64

/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxISEBAQEhISEBAWFRUVFhUVFRUWFRUWFhUWFhUV
FRUYHSggGBolGxUVITEhJSkrLi4uFx8zODMtNygtLisBCgoKDg0OGhAQGi0fHx8rLS0rLS0rLS0t
LS0rLS0rLS0rLS0rLS0rLS0rLS0rLS0rLS0tLS0rLS0rLS0rLS0rLf/AABEIAMgAyAMBIgACEQED
EQH/xAAbAAEAAgMBAQAAAAAAAAAAAAAABAUCAwYBB//EAD0QAAIBAwMBBgQEBgIDCQAAAAECAwAE
ERIhBTFBBhMiUWFxgZEykaGxFCNCUrHB0fAUM2JygpLwFySTwsL/xAAYAQEBAQEBAAAAAAAAAAAA
AAAABQEDBP/EAB8RAQEBAQEBAQEBAQEAAAAAAAABEQIhEjEEQVFhcf/aAAwDAQACEQMRAD8A+6qK
CiiggqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgq
CiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCo
[Base64 encoded image data continues]
--boundary123--
6 changes: 6 additions & 0 deletions example-docs/eml/mime-no-subject.eml
@@ -0,0 +1,6 @@
From: [email protected]
To: [email protected]
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a simple email message without a subject.
8 changes: 8 additions & 0 deletions example-docs/eml/mime-no-to.eml
@@ -0,0 +1,8 @@
From: [email protected]
Cc: Tom <[email protected]>, Alice <[email protected]>
Bcc: John <[email protected]>, Mary <[email protected]>
Subject: Example Plain-Text MIME Message
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a plain-text message.
22 changes: 22 additions & 0 deletions example-docs/eml/mime-simple.eml
@@ -0,0 +1,22 @@
From: [email protected]
To: [email protected]
Subject: Example Multipart/Alternative Email
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="boundary123"

--boundary123
Content-Type: text/plain; charset="UTF-8"

This is a simple email message.

--boundary123
Content-Type: text/html; charset="UTF-8"

<html>
<body>
<p>This is a simple email message.</p>
</body>
</html>

--boundary123--
7 changes: 7 additions & 0 deletions example-docs/eml/mime-word-encoded-subject.eml
@@ -0,0 +1,7 @@
From: [email protected]
To: [email protected]
Subject: =?UTF-8?B?U2ltcGxlIGVtYWlsIHdpdGgg4pi44pi/IFVuaWNvZGUgc3ViamVjdA==?=
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a simple email message with Unicode characters in the subject.
5 changes: 5 additions & 0 deletions example-docs/eml/rfc822-no-date.eml
@@ -0,0 +1,5 @@
From: [email protected]
To: [email protected]
Subject: Example Email Without Date Header

This is an example email message without a Date header. Note that this is non-standard and may be flagged or corrected by email servers.
10 changes: 10 additions & 0 deletions example-docs/eml/simple-rfc-822.eml
@@ -0,0 +1,10 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example RFC 822 Email

This is an RFC 822 email message.

An RFC 822 message is characterized by its simple, text-based format, which includes a header and a body. The header contains structured fields such as "From", "To", "Date", and "Subject", each followed by a colon and the corresponding information. The body follows the header, separated by a blank line, and contains the main content of the email.

The structure ensures compatibility and readability across different email systems and clients, adhering to the standards set by the Internet Engineering Task Force (IETF).
96 changes: 0 additions & 96 deletions test_unstructured/documents/test_email_elements.py

This file was deleted.

16 changes: 6 additions & 10 deletions test_unstructured/partition/test_auto.py
@@ -45,6 +45,7 @@
 )
 from unstructured.file_utils.model import FileType
 from unstructured.partition.auto import _PartitionerLoader, partition
+from unstructured.partition.common import UnsupportedFileFormatError
 from unstructured.partition.utils.constants import PartitionStrategy
 from unstructured.staging.base import elements_from_json, elements_to_dicts, elements_to_json

@@ -200,14 +201,6 @@ def test_auto_partition_email_from_file():
     assert elements == EXPECTED_EMAIL_OUTPUT


-def test_auto_partition_eml_add_signature_to_metadata():
-    elements = partition(example_doc_path("eml/signed-doc.p7s"))
-
-    assert len(elements) == 1
-    assert elements[0].text == "This is a test"
-    assert elements[0].metadata.signature == "<SIGNATURE>\n"
-
-
 # ================================================================================================
 # EPUB
 # ================================================================================================
@@ -911,7 +904,10 @@ def test_auto_partition_raises_with_bad_type(request: FixtureRequest):
         request, "unstructured.partition.auto.detect_filetype", return_value=FileType.UNK
     )

-    with pytest.raises(ValueError, match="Invalid file made-up.fake. The FileType.UNK file type "):
+    with pytest.raises(
+        UnsupportedFileFormatError,
+        match="Invalid file made-up.fake. The FileType.UNK file type is not supported in partiti",
+    ):
         partition(filename="made-up.fake", strategy=PartitionStrategy.HI_RES)

     detect_filetype_.assert_called_once_with(
@@ -1239,7 +1235,7 @@ def test_auto_partition_applies_the_correct_filetype_for_all_filetypes(
     partition_fn = getattr(module, partition_fn_name)

     # -- partition the example-doc for this filetype --
-    elements = partition_fn(file_path)
+    elements = partition_fn(file_path, process_attachments=False)

     assert elements
     assert all(