bug/partition_msg halts for attachments with UNK type #3671
Comments
#3605 PR (Draft) |
@S1M0N38 that was removed on purpose. @Paul-Cornell can you remove that parameter from the docs for us? |
When a .msg file contains an attachment of an unsupported type (UNK), the partition_msg function halts. I've implemented a custom attachment_partitioner to filter out unsupported types. Is there another way to process the supported types and ignore the unsupported ones? |
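For readers looking for the kind of workaround described above, here is a minimal sketch of a filtering wrapper. It does not call unstructured itself: `skip_unsupported`, `SUPPORTED_EXTENSIONS`, and the assumption that the wrapped partitioner accepts a `filename` keyword are all hypothetical and only illustrate the idea of returning no elements for unsupported attachment types instead of raising.

```python
import os
from typing import Any, Callable, FrozenSet, List

# Hypothetical allow-list for illustration only; the real set of supported
# types depends on the library version and the extras that are installed.
SUPPORTED_EXTENSIONS: FrozenSet[str] = frozenset({".txt", ".pdf", ".docx", ".eml"})


def skip_unsupported(
    partitioner: Callable[..., List[Any]],
    supported: FrozenSet[str] = SUPPORTED_EXTENSIONS,
) -> Callable[..., List[Any]]:
    """Wrap a partitioner so unsupported file types yield no elements."""

    def wrapper(filename: str, **kwargs: Any) -> List[Any]:
        # Compare on the lower-cased extension so "REPORT.PDF" still matches.
        ext = os.path.splitext(filename)[1].lower()
        if ext not in supported:
            # Skip silently instead of halting the whole message.
            return []
        # Assumption: the wrapped partitioner takes a `filename` keyword.
        return list(partitioner(filename=filename, **kwargs))

    return wrapper
```

A wrapper like this would be passed wherever a custom attachment partitioner is accepted. Filtering by extension is a simplification; a real implementation would more likely rely on the library's own file-type detection.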
@scanny I'm not quite sure how to "remove" that parameter from the docs. Searching for |
@S1M0N38 Ahh, okay, so that's a bug then. You shouldn't have to provide a custom partitioner for that :) Shall we make this into a bug report for that or do you want to open a new one? The correct behavior would be for |
@Paul-Cornell it looks like the text is about the same for
from unstructured.partition.auto import partition
from unstructured.partition.email import partition_email
filename = "example-docs/eml/fake-email-attachment.eml"
elements = partition_email(
filename=filename, process_attachments=True, attachment_partitioner=partition
)
You can change it like this in each case:
from unstructured.partition.email import partition_email
filename = "example-docs/eml/fake-email-attachment.eml"
elements = partition_email(filename=filename, process_attachments=True)
So:
|
let's make this issue into a bug report and modify the title accordingly. |
**Summary**
Initial attempts to incrementally refactor `partition_email()` into shape to allow pluggable partitioning quickly became too complex for ready code-review. Prepare separate rewritten module and tests and swap them out whole.

**Additional Context**
- Uses the modern stdlib `email` module to reliably accomplish several manual decoding steps in the legacy code.
- Remove obsolete email-specific element-types which were replaced 18 months or so ago with email-specific metadata fields for things like Cc: addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME-email is inherently a binary format which can and often does contain multiple and contradictory character-encodings.
- Remove the `encoding` parameter as it is now unused. An email file is not a text file and as such does not have a single overall encoding. Character encoding is specified individually for each MIME-part within the message and often varies from one part to another in the same message.
- Remove the need for a caller to specify `attachment_partitioner`. There is only one reasonable choice for this which is `auto.partition()`, consistent with the same interface and operation in `partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple transport-encoding/charset combinations.

---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
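To illustrate the point that charset and transfer encoding are declared per MIME part, here is a small stdlib-only sketch. It is not the code from the PR: `build_sample` and `describe_text_parts` are hypothetical helper names, and the sample message is made up for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import List, Optional, Tuple


def build_sample() -> bytes:
    """Build a message whose body and attachment use different encodings."""
    msg = EmailMessage()
    msg["Subject"] = "Mixed encodings"
    # Body: Latin-1 text transported as base64.
    msg.set_content("caf\u00e9", charset="iso-8859-1", cte="base64")
    # Attachment: UTF-8 text transported as quoted-printable.
    msg.add_attachment(
        "na\u00efve", charset="utf-8", cte="quoted-printable", filename="note.txt"
    )
    return msg.as_bytes()


def describe_text_parts(raw: bytes) -> List[Tuple[Optional[str], str, str]]:
    """Return (charset, transfer-encoding, decoded text) for each text part."""
    # Parse from bytes: the message as a whole has no single text encoding.
    msg = message_from_bytes(raw, policy=policy.default)
    described = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips the multipart container and any binary parts
        described.append(
            (
                part.get_content_charset(),
                str(part["Content-Transfer-Encoding"]),
                part.get_content().strip(),  # decoded using the part's own charset
            )
        )
    return described
```

Calling `describe_text_parts(build_sample())` reports one charset and transfer encoding for the body and a different pair for the attachment, which is why a single message-level `encoding` argument has nothing meaningful to describe.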
Currently, the `attachment_partitioner` is hardcoded to `partition` in the following file:
unstructured/unstructured/partition/msg.py
Lines 259 to 277 in 50d75c4
However, according to the official documentation, the partition-msg function accepts `attachment_partitioner` as an argument.