Commit
rfctr(email): eml partitioner rewrite (#3694)
**Summary**

Initial attempts to incrementally refactor `partition_email()` into shape to allow pluggable partitioning quickly became too complex for ready code-review. Prepare a separate rewritten module and tests and swap them out whole.

**Additional Context**

- Uses the modern stdlib `email` module to reliably accomplish several manual decoding steps in the legacy code.
- Remove obsolete email-specific element-types which were replaced 18 months or so ago with email-specific metadata fields for things like Cc: addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME-email is inherently a binary format which can and often does contain multiple and contradictory character-encodings.
- Remove the `encoding` parameter as it is now unused. An email file is not a text file and as such does not have a single overall encoding. Character encoding is specified individually for each MIME-part within the message and often varies from one part to another in the same message.
- Remove the need for a caller to specify `attachment_partitioner`. There is only one reasonable choice for this, which is `auto.partition()`, consistent with the same interface and operation in `partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple transport-encoding/charset combinations.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
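The point about per-part character encoding can be illustrated with the stdlib `email` module on its own. This is a minimal sketch, not the rewritten partitioner: the message bytes and the names `RAW` and `decoded` are made up for illustration. It parses a message from bytes and lets each MIME part decode itself using the charset that part declares, which is why a single message-wide `encoding` argument has nothing to say.

```python
from email import message_from_bytes, policy

# Hypothetical message: two text parts that declare *different* charsets.
RAW = b"\r\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/mixed; boundary="b1"',
    b"",
    b"--b1",
    b'Content-Type: text/plain; charset="utf-8"',
    b"Content-Transfer-Encoding: 8bit",
    b"",
    "caf\u00e9 (UTF-8)".encode("utf-8"),
    b"--b1",
    b'Content-Type: text/plain; charset="iso-8859-1"',
    b"Content-Transfer-Encoding: 8bit",
    b"",
    "caf\u00e9 (Latin-1)".encode("iso-8859-1"),
    b"--b1--",
    b"",
])

# Parse from bytes; policy.default selects the modern EmailMessage API.
msg = message_from_bytes(RAW, policy=policy.default)

# Each part is decoded with its own declared charset by get_content().
decoded = [
    (part.get_content_charset(), part.get_content().strip())
    for part in msg.walk()
    if part.get_content_type() == "text/plain"
]
```

Decoding the whole byte string with either charset up front would corrupt one of the two parts, which is the reason given above for dropping `text: str` input.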
1 parent 9049e4e, commit 1eceac2
Showing 29 changed files with 2,061 additions and 1,195 deletions.
@@ -0,0 +1,3 @@
@@ -0,0 +1,34 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example MIME Email
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="boundary123"

--boundary123
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
This is the text/plain part.
Did you know that the first email was sent by Ray Tomlinson in 1971? He used the "@" symbol to separate the user's name from the computer name, a practice that is still in use today.
Another interesting fact is that the first known instance of email spam occurred in 1978. A marketing message was sent to 393 recipients on ARPANET, marking the beginning of what we now know as email spam.
--boundary123
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: 7bit

<!DOCTYPE html>
<html>
<head>
<title>Example MIME Email</title>
</head>
<body>
<p>This is the <code>text/html</code> part.</p>
<p>Did you know that the first <b>networked email</b> was sent by Ray Tomlinson in 1971? He used the "@" symbol to separate the user's name from the computer name, a practice that is still in use today.</p>
<p>Another interesting fact is that the first known instance of <i>email spam</i> occurred in 1978. A marketing message was sent to 393 recipients on ARPANET, marking the beginning of what we now know as email spam.</p>
</body>
</html>

--boundary123--
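A `multipart/alternative` message like the fixture above carries the same content in more than one rendering, so a reader has to pick one. The sketch below shows how the stdlib `EmailMessage.get_body()` call makes that choice from a preference list. It builds its own small message rather than loading the fixture, and the order of preference shown is only an illustration; this page does not say which rendering the rewritten partitioner prefers.

```python
from email.message import EmailMessage

# Build a small multipart/alternative message (hypothetical content).
msg = EmailMessage()
msg["Subject"] = "Example MIME Email"
msg.set_content("This is the text/plain part.")
msg.add_alternative(
    "<p>This is the <code>text/html</code> part.</p>", subtype="html"
)

# get_body() returns the first available candidate in preference order.
html_part = msg.get_body(preferencelist=("html", "plain"))
plain_part = msg.get_body(preferencelist=("plain",))

html_type = html_part.get_content_type()
plain_text = plain_part.get_content().strip()
```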
@@ -0,0 +1,14 @@
MIME-Version: 1.0
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example HTML Only MIME Email
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: base64

PHA+VGhpcyBpcyBhIHRleHQvaHRtbCBwYXJ0LjwvcD4KPGRpdiBpZD0iY29udGVudCI+PHA+VGhl
IGZpcnN0IGVtb3RpY29uLCA6KSAsIHdhcyBwcm9wb3NlZCBieSBTY290dCBGYWhsbWFuIGluIDE5
ODIgdG8gaW5kaWNhdGUganVzdCBvciBzYXJjYXNtIGluIHRleHQgZW1haWxzLjwvcD4KPHA+R21h
aWwgd2FzIGxhdW5jaGVkIGJ5IEdvb2dsZSBpbiAyMDA0IHdpdGggMSBHQiBvZiBmcmVlIHN0b3Jh
Z2UsIHNpZ25pZmljYW50bHkgbW9yZSB0aGFuIHdoYXQgb3RoZXIgc2VydmljZXMgb2ZmZXJlZCBh
dCB0aGUgdGltZS48L3A+PC9kaXY+
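This fixture combines a base64 transfer-encoding with an ISO-8859-1 charset, two of the decoding steps the commit message says the stdlib now handles. As a hedged sketch (the HTML string and variable names are invented here, and the message is built in code rather than read from the fixture), `get_content()` undoes both layers in one call:

```python
import base64
from email import message_from_bytes, policy

# Hypothetical HTML body containing non-ASCII Latin-1 characters.
html = "<p>D\u00e9j\u00e0 vu</p>"
body = base64.b64encode(html.encode("iso-8859-1"))

raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/html; charset="ISO-8859-1"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + body + b"\r\n"
)

msg = message_from_bytes(raw, policy=policy.default)

# get_content() reverses the base64 transfer-encoding, then decodes the
# resulting bytes with the charset declared in Content-Type.
decoded_html = msg.get_content()
```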
@@ -0,0 +1,10 @@
From: [email protected]
To: Bob <[email protected]>, Sue <[email protected]>
Cc: Tom <[email protected]>, Alice <[email protected]>
Bcc: John <[email protected]>, Mary <[email protected]>
Subject: Example Plain-Text MIME Message
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a plain-text message.
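Headers such as To:, Cc: and Bcc: are the kind of data the commit message says now lives in email-specific metadata fields. The sketch below only shows how the stdlib exposes them as structured addresses under `policy.default`; the helper `addresses()` and the `example.com` addresses are hypothetical (the addresses on this page are obfuscated), and how the partitioner maps them to metadata is not shown here.

```python
from email import message_from_bytes, policy

# Hypothetical message; real addresses in the fixture are obfuscated.
raw = (
    b"From: sender@example.com\r\n"
    b"To: Bob <bob@example.com>, Sue <sue@example.com>\r\n"
    b"Cc: Tom <tom@example.com>\r\n"
    b"Subject: Example Plain-Text MIME Message\r\n"
    b"\r\n"
    b"This is a plain-text message.\r\n"
)
msg = message_from_bytes(raw, policy=policy.default)


def addresses(message, name):
    """Return (display_name, addr_spec) pairs, or [] if the header is absent."""
    header = message[name]
    if header is None:
        return []
    # Under policy.default, address headers expose parsed Address objects.
    return [(a.display_name, a.addr_spec) for a in header.addresses]


to_addrs = addresses(msg, "To")
bcc_addrs = addresses(msg, "Bcc")  # header missing in this message
```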
@@ -0,0 +1,37 @@
From: [email protected]
To: [email protected]
Cc: [email protected]
Bcc: [email protected]
Subject: Example Multipart Digest Email
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: multipart/digest; boundary="boundary123"

--boundary123
Content-Type: message/rfc822
From: [email protected]
To: [email protected]
Subject: First Message
This is the first message in the digest.
--boundary123
Content-Type: message/rfc822
From: [email protected]
To: [email protected]
Subject: Second Message
This is the second message in the digest.
--boundary123
Content-Type: message/rfc822
From: [email protected]
To: [email protected]
Subject: Third Message
This is the third message in the digest.
--boundary123--
@@ -0,0 +1,22 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Image Only Email
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="boundary123"

--boundary123
Content-Type: image/jpeg
Content-Disposition: attachment; filename="image.jpg"
Content-Transfer-Encoding: base64
/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxISEBAQEhISEBAWFRUVFhUVFRUWFRUWFhUWFhUV
FRUYHSggGBolGxUVITEhJSkrLi4uFx8zODMtNygtLisBCgoKDg0OGhAQGi0fHx8rLS0rLS0rLS0t
LS0rLS0rLS0rLS0rLS0rLS0rLS0rLS0rLS0tLS0rLS0rLS0rLS0rLf/AABEIAMgAyAMBIgACEQED
EQH/xAAbAAEAAgMBAQAAAAAAAAAAAAAABAUCAwYBB//EAD0QAAIBAwMBBgQEBgIDCQAAAAECAwAE
ERIhBTFBBhMiUWFxgZEykaGxFCNCUrHB0fAUM2JygpLwFySTwsL/xAAYAQEBAQEBAAAAAAAAAAAA
AAAABQEDBP/EAB8RAQEBAQEBAQEBAQEAAAAAAAABEQIhEjEEQVFhcf/aAAwDAQACEQMRAD8A+6qK
CiiggqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgq
CiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCo
[Base64 encoded image data continues]
--boundary123--
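The commit message says attachments whose file-type has no partitioner are now silently skipped (the fix for #3671). The following is a rough sketch of that idea using only the stdlib, not the library's implementation: the `SUPPORTED` set and the `partitionable_attachments()` helper are hypothetical stand-ins for whatever file-type detection and dispatch the real code performs.

```python
from email.message import EmailMessage

# Hypothetical message with a body, an image attachment and a text attachment.
msg = EmailMessage()
msg["Subject"] = "Mixed attachments"
msg.set_content("Body text.")
msg.add_attachment(
    b"\xff\xd8\xff", maintype="image", subtype="jpeg", filename="image.jpg"
)
msg.add_attachment("notes", subtype="plain", filename="notes.txt")

# Assumption for this sketch only: content types we pretend can be partitioned.
SUPPORTED = {"text/plain"}


def partitionable_attachments(message):
    """Yield (filename, content_type) for attachments we know how to handle."""
    for attachment in message.iter_attachments():
        content_type = attachment.get_content_type()
        if content_type not in SUPPORTED:
            continue  # skip silently rather than raising
        yield attachment.get_filename(), content_type


all_names = [a.get_filename() for a in msg.iter_attachments()]
found = list(partitionable_attachments(msg))
```

Skipping instead of raising means one unsupported attachment does not prevent the message body and the remaining attachments from being processed.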
@@ -0,0 +1,6 @@
From: [email protected]
To: [email protected]
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a simple email message without a subject.
@@ -0,0 +1,8 @@
From: [email protected]
Cc: Tom <[email protected]>, Alice <[email protected]>
Bcc: John <[email protected]>, Mary <[email protected]>
Subject: Example Plain-Text MIME Message
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a plain-text message.
@@ -0,0 +1,22 @@
From: [email protected]
To: [email protected]
Subject: Example Multipart/Alternative Email
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="boundary123"

--boundary123
Content-Type: text/plain; charset="UTF-8"
This is a simple email message.
--boundary123
Content-Type: text/html; charset="UTF-8"

<html>
<body>
<p>This is a simple email message.</p>
</body>
</html>

--boundary123--
@@ -0,0 +1,7 @@
From: [email protected]
To: [email protected]
Subject: =?UTF-8?B?U2ltcGxlIGVtYWlsIHdpdGgg4pi44pi/IFVuaWNvZGUgc3ViamVjdA==?=
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

This is a simple email message with Unicode characters in the subject.
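The Subject: header in this fixture is an RFC 2047 encoded-word, one of the things that previously needed manual decoding. The sketch below uses a made-up subject (not the fixture's) to show the difference between the modern and legacy stdlib policies; the variable names are illustrative only.

```python
import base64
from email import message_from_bytes, policy

# Hypothetical non-ASCII subject, encoded as a UTF-8 base64 encoded-word.
subject = "Caf\u00e9 \u2615 menu"
encoded_word = b"=?UTF-8?B?" + base64.b64encode(subject.encode("utf-8")) + b"?="
raw = b"Subject: " + encoded_word + b"\r\n\r\nBody.\r\n"

# policy.default decodes encoded-words when the header is accessed.
modern = message_from_bytes(raw, policy=policy.default)
decoded_subject = str(modern["Subject"])

# The legacy default policy (compat32) returns the raw header text instead.
legacy = message_from_bytes(raw)
raw_subject = legacy["Subject"]
```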
@@ -0,0 +1,5 @@
From: [email protected]
To: [email protected]
Subject: Example Email Without Date Header

This is an example email message without a Date header. Note that this is non-standard and may be flagged or corrected by email servers.
@@ -0,0 +1,10 @@
From: [email protected]
To: [email protected]
Date: Tue, 01 Oct 2024 12:34:56 -0500
Subject: Example RFC 822 Email

This is an RFC 822 email message.

An RFC 822 message is characterized by its simple, text-based format, which includes a header and a body. The header contains structured fields such as "From", "To", "Date", and "Subject", each followed by a colon and the corresponding information. The body follows the header, separated by a blank line, and contains the main content of the email.

The structure ensures compatibility and readability across different email systems and clients, adhering to the standards set by the Internet Engineering Task Force (IETF).
This file was deleted.