Some fixes #55

Open
wants to merge 134 commits into
base: master
b919419
language support
atc0m Feb 21, 2019
2517a41
json conf
atc0m Feb 21, 2019
54ef8f5
Merge pull request #1 from atc0m/language-support
atc0m Feb 21, 2019
da6ea80
languages support path
atc0m Mar 6, 2019
d5a8581
Merge pull request #2 from atc0m/language-support
atc0m Mar 6, 2019
38da775
move json into parser module
atc0m Mar 6, 2019
ba8f818
Merge pull request #3 from atc0m/language-support
atc0m Mar 6, 2019
929e78c
dir_path
atc0m Mar 6, 2019
c05378d
Merge pull request #4 from atc0m/language-support
atc0m Mar 6, 2019
aa5b02b
os
atc0m Mar 6, 2019
44250ed
Merge pull request #5 from atc0m/language-support
atc0m Mar 6, 2019
7fdecf3
build with language support
atc0m Mar 29, 2019
d8dd270
Merge pull request #6 from atc0m/language-support
atc0m Mar 29, 2019
ffca453
load json once
atc0m Mar 29, 2019
1eb7e86
mailto
atc0m Mar 29, 2019
6569397
Merge pull request #7 from atc0m/opt
atc0m Mar 29, 2019
0d50859
dir path
atc0m Mar 29, 2019
e87cf1e
Merge pull request #8 from atc0m/opt
atc0m Mar 29, 2019
ea11d23
mailto
atc0m Mar 29, 2019
cad4130
Merge pull request #9 from atc0m/opt
atc0m Mar 29, 2019
83b2361
mailto
atc0m Mar 29, 2019
6393dfe
refactor
atc0m Mar 29, 2019
4ca4624
Merge pull request #10 from atc0m/opt
atc0m Mar 29, 2019
ba88be3
french support
atc0m May 2, 2019
6227114
fix key
atc0m May 2, 2019
8d3bf59
cordialement regex
atc0m May 2, 2019
c9b1fc7
french signatures
atc0m May 13, 2019
d1915e6
ignore empty lines
atc0m May 13, 2019
7b28cc1
correct translation
atc0m Sep 4, 2019
6fc5df3
multiple signatures
atc0m Sep 5, 2019
6cc9442
multi quote header rgx
atc0m Sep 5, 2019
2d60020
html escaped brackets
atc0m Sep 5, 2019
5c2e1ce
multi quote fix
atc0m Sep 5, 2019
80d0bf8
syntax
atc0m Sep 5, 2019
44c3b52
selected character
atc0m Sep 6, 2019
1a9dc52
rm
atc0m Sep 6, 2019
5d372b5
Merge pull request #11 from atc0m/finnish
atc0m Sep 6, 2019
15b03cd
single quote header
atc0m Sep 9, 2019
ba3e0c7
Merge pull request #12 from atc0m/finnish-update
atc0m Sep 9, 2019
ee03139
whitespace
atc0m Sep 26, 2019
a75fb66
Merge pull request #13 from atc0m/header-fix
atc0m Sep 26, 2019
4360c6c
omit sigs, fix header regex
atc0m Nov 14, 2019
9b8bf60
signature appended to hidden fragment
atc0m Nov 14, 2019
aae3d44
Merge pull request #14 from atc0m/signature-header-fix
atc0m Nov 15, 2019
e52ba6d
finish fragment after signature
atc0m Nov 15, 2019
33da801
Merge pull request #15 from atc0m/signature-header-fix
atc0m Nov 15, 2019
915f2c0
test txts
atc0m Jan 23, 2020
2851339
ignore json
atc0m Jan 23, 2020
6347b2b
warnings regex
atc0m Jan 23, 2020
667192d
warnings
atc0m Jan 23, 2020
45c38a4
Merge pull request #16 from atc0m/warnings
atc0m Jan 23, 2020
b68075f
notice, do not reply
atc0m Jan 29, 2020
35f8927
Merge pull request #17 from atc0m/caution-extended
atc0m Jan 29, 2020
e25be22
more warnings
atc0m Jan 30, 2020
27a92f2
character syntax fix
atc0m Jan 30, 2020
a746d24
warnings extended
atc0m Feb 18, 2020
dd2d657
Merge pull request #18 from atc0m/warning2
atc0m Feb 18, 2020
7f742e6
extend
atc0m Mar 5, 2020
95c8a9a
Merge pull request #19 from atc0m/legal
atc0m Mar 5, 2020
ceb175d
confidential
atc0m Mar 5, 2020
ecc0f1f
Merge pull request #20 from atc0m/legal
atc0m Mar 5, 2020
36ccc98
communication
atc0m Mar 5, 2020
5fdae32
Merge branch 'master' into legal2
atc0m Mar 5, 2020
ad948c9
Merge pull request #21 from atc0m/legal2
atc0m Mar 5, 2020
b7da5c8
extra quote
atc0m Mar 5, 2020
40b2c25
Merge pull request #22 from atc0m/legal2
atc0m Mar 5, 2020
0017f9f
more quote signs
atc0m Mar 5, 2020
dd51d4e
rm stop
atc0m Mar 5, 2020
1adc119
Merge pull request #23 from atc0m/legal2
atc0m Mar 5, 2020
be17320
chinese update
atc0m Mar 31, 2020
696b6d1
french extension
atc0m Apr 2, 2020
a0cbc45
french extension
atc0m Apr 2, 2020
62a81ca
japanese quoted header support
atc0m Apr 3, 2020
f8c51af
chinese sent from
atc0m Apr 3, 2020
55ebfbc
add follow up to quote
atc0m Jun 11, 2020
7d97bed
Merge pull request #24 from atc0m/zendesk-followup
atc0m Jun 11, 2020
f24576b
forward/multi header test
atc0m Jun 19, 2020
2d4765c
Merge pull request #25 from atc0m/multi-quote
atc0m Jun 19, 2020
4ec9dd7
confidentiality notice variations
atc0m Oct 23, 2020
8a74f85
Merge pull request #26 from atc0m/confidentail_footer
atc0m Oct 23, 2020
95365d3
styling
atc0m Oct 23, 2020
8235343
only replace follow up if it's not the first line
atc0m Oct 23, 2020
00ae300
strip whitespace
atc0m Oct 23, 2020
8ee7a5f
Merge pull request #27 from atc0m/follow-up
atc0m Oct 23, 2020
2d4beb1
webform subject fix
atc0m Dec 30, 2020
35ecf1c
Merge pull request #28 from atc0m/webform
atc0m Dec 30, 2020
aef3fe7
added spanish stuff
mohamedalani Mar 8, 2021
9a154b5
Merge pull request #29 from atc0m/spanish-fix
mohamedalani Mar 8, 2021
5a9854a
fix multiple * before and after From
mohamedalani Apr 22, 2021
2375cfc
Merge pull request #30 from atc0m/from-fix
mohamedalani Apr 22, 2021
a7749ca
half fixing tests, still need some work but at least we see which one…
mohamedalani May 7, 2021
f9b740c
fixing a bug with from:\n messages and fixing sent from messages
mohamedalani May 7, 2021
144a772
fix *
mohamedalani May 7, 2021
216d1b9
fix *
mohamedalani May 7, 2021
0cea073
Merge pull request #31 from atc0m/some-fixes
mohamedalani May 7, 2021
0e49171
replace \r
atc0m May 19, 2021
a2217cc
Merge pull request #32 from atc0m/r-break
atc0m May 19, 2021
98bd8b7
outlook signature
atc0m May 27, 2021
a4bc4b0
Merge pull request #33 from atc0m/r-break
atc0m May 27, 2021
70b8b5b
Generalize and extend confidentiality notices
atc0m Sep 21, 2021
a5a270d
readme update
atc0m Sep 21, 2021
bf0aedb
Merge pull request #34 from atc0m/confidential-notices-extended
atc0m Sep 21, 2021
3669a24
single newline in warning message
atc0m Sep 22, 2021
054c73a
readme
atc0m Sep 22, 2021
9efbc79
Merge pull request #35 from atc0m/confidential-notices-extended
atc0m Sep 22, 2021
c6fbefd
include tabs
atc0m Sep 22, 2021
d02b11d
Merge pull request #36 from atc0m/confidential-notices-extended
atc0m Sep 22, 2021
023a0b0
variation of terms for confidentiality footers
atc0m Sep 22, 2021
569c9c4
Merge pull request #37 from atc0m/confidential-notices-extended
atc0m Sep 22, 2021
a83e853
filter spam
atc0m Sep 23, 2021
c8c4ac8
Merge pull request #38 from atc0m/confidential-notices-extended
atc0m Sep 23, 2021
b76c9a1
Emails are not secure signature
atc0m Oct 5, 2021
9971020
Merge pull request #39 from atc0m/email-are-not-secure
atc0m Oct 5, 2021
37b890e
German email header variation
atc0m Oct 13, 2021
20ab0a8
simpler
atc0m Oct 13, 2021
22d0c1b
Merge pull request #40 from atc0m/german-email-header-variation
atc0m Oct 13, 2021
6a2b3d5
fixing warnings
mohamedalani Oct 26, 2021
49e8d98
Merge pull request #41 from atc0m/some-fixes
mohamedalani Oct 26, 2021
23dd987
no newlines allowed after known confidentiality notice
atc0m Nov 16, 2021
731231d
Merge branch 'master' into confidential-notices-extended
atc0m Nov 16, 2021
cd6b27e
Merge pull request #42 from atc0m/confidential-notices-extended
atc0m Nov 17, 2021
8cc32d1
Find contacts from email threads
atc0m Nov 23, 2021
6c9d5d1
Merge pull request #43 from atc0m/email-contacts
atc0m Nov 23, 2021
54a3524
add copyright
mohamedalani Dec 6, 2021
4071827
Merge branch 'master' of https://github.com/atc0m/email-reply-parser …
mohamedalani Dec 6, 2021
ac13f25
Merge pull request #44 from atc0m/some-fixes
mohamedalani Dec 6, 2021
c4202dc
Remove (mailto:) header
atc0m Jan 10, 2022
22eb443
Merge pull request #45 from atc0m/remove-mailto
atc0m Jan 10, 2022
b780240
only match cautions at start of new sentence
atc0m Jan 19, 2022
cdd4734
Merge pull request #46 from atc0m/caution-new-sentence
atc0m Jan 19, 2022
210ef2e
add signatures in english
mohamedalani Jul 19, 2022
61c53b0
add signatures in english
mohamedalani Jul 19, 2022
6cf15f9
Merge pull request #47 from atc0m/en-signature
mohamedalani Jul 19, 2022
8fbf3eb
fixing signature for wickes
mohamedalani Oct 17, 2022
6 changes: 5 additions & 1 deletion .gitignore
@@ -6,6 +6,10 @@ tests/.DS_Store
.DS_Store
*.egg-info
.project
env/
venv/
dist/
dist/*

*.csv
__pycache__/
*.json
33 changes: 28 additions & 5 deletions README.md
@@ -1,5 +1,28 @@
# Email Reply Parser for Python
A port of GitHub's Email Reply Parser library, by the fine folks at [Zapier](https://zapier.com/).
A port of GitHub's Email Reply Parser library, by the fine folks at [Zapier](https://zapier.com/), with added language support.

Currently supported languages:

Arabic
German
English
Spanish
Finnish
French
Hebrew
Indonesian
Italian
Japanese
Korean
Dutch
Polish
Portuguese
Russian
Slovak
Thai
Turkish
Vietnamese
Chinese

## Summary

@@ -45,7 +68,8 @@ from email_reply_parser import EmailReplyParser
Step 2: Provide email message as type String

```python
EmailReplyParser.read(email_message)
parser = EmailReplyParser(language='en')
parser.read(email_message)
```

### How to only retrieve the reply message
@@ -56,10 +80,9 @@ Step 1: Import email reply parser package
from email_reply_parser import EmailReplyParser
```

Step 2: Provide email message as type string using parse_reply class method.
Step 2: Provide email message as type string using parse_reply.

```python
parser = EmailReplyParser(language='en')
parser.parse_reply(email_message)
```


232 changes: 184 additions & 48 deletions email_reply_parser/__init__.py
@@ -1,80 +1,202 @@
"""
email_reply_parser is a python library port of GitHub's Email Reply Parser.

For more information, visit https://github.com/zapier/email-reply-parser
email_reply_parser is a python library port of GitHub's Email Reply Parser.
For more information, visit https://github.com/zapier/email_reply_parser
"""

import os
import re
import json


class EmailReplyParser(object):
""" Represents an email message that is parsed.
"""

@staticmethod
def read(text):
""" Factory method that splits email into list of fragments
def __init__(self, language='en'):
dir_path = os.path.dirname(__file__)
with open(dir_path + "/languages_support.json", "r") as read_file:
self.words_map = json.load(read_file)
if language in self.words_map:
self.language = language
else:
self.language = 'en'
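
Review note: the constructor above loads the language map from a JSON file next to the module and falls back to `'en'` when the requested language is missing. A minimal, self-contained sketch of that behaviour follows. The helper names `load_words_map` and `resolve_language` and the sample JSON contents are hypothetical and are not part of this PR; the sketch also uses `os.path.join` instead of string concatenation, which is only a suggestion.

```python
import json
import os
import tempfile


def load_words_map(path):
    """Read the per-language word map from a JSON file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as read_file:
        return json.load(read_file)


def resolve_language(language, words_map, default="en"):
    """Return the requested language if it is supported, else the default."""
    return language if language in words_map else default


# Demo with a throwaway file; the real languages_support.json is not shown in this diff,
# so the entries below are made up for illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "languages_support.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump({"en": {"From": "From"}, "fr": {"From": "De"}}, handle)
    words_map = load_words_map(sample_path)

print(resolve_language("fr", words_map))
print(resolve_language("xx", words_map))
```

Keeping the fallback in one small function would make it easy to unit-test without touching the file system.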

def read(self, text):
""" Factory method that splits email into list of fragments
text - A string email body

Returns an EmailMessage instance
"""
return EmailMessage(text).read()
return EmailMessage(text, self.language, self.words_map).read()

@staticmethod
def parse_reply(text):
def parse_reply(self, text):
""" Provides the reply portion of email.

text - A string email body

Returns reply body message
"""
return EmailReplyParser.read(text).reply
a = self.read(text).reply
return a

def find_contacts(self, text):
"""Provides a list of From To emails and the dates of these emails"""
contacts_dict = EmailContacts(text, self.language, self.words_map).contacts()
return contacts_dict


class EmailMessage(object):
""" An email message represents a parsed email body.
"""

SIG_REGEX = re.compile(r'(--|__|-\w)|(^Sent from my (\w+\s*){1,3})')
QUOTE_HDR_REGEX = re.compile('On.*wrote:$')
QUOTED_REGEX = re.compile(r'(>+)')
HEADER_REGEX = re.compile(r'^\*?(From|Sent|To|Subject):\*? .+')
_MULTI_QUOTE_HDR_REGEX = r'(?!On.*On\s.+?wrote:)(On\s(.+?)wrote:)'
MULTI_QUOTE_HDR_REGEX = re.compile(_MULTI_QUOTE_HDR_REGEX, re.DOTALL | re.MULTILINE)
MULTI_QUOTE_HDR_REGEX_MULTILINE = re.compile(_MULTI_QUOTE_HDR_REGEX, re.DOTALL)

def __init__(self, text):
def __init__(self, text, language, words_map):
self.fragments = []
self.fragment = None
self.text = text.replace('\r\n', '\n')
self.text = text.replace('\r\n', '\n').replace('\r', '\n')
self.found_visible = False
self.SIG_REGEX = None
self.QUOTE_HDR_REGEX = None
self.QUOTED_REGEX = None
self.HEADER_REGEX = None
self._MULTI_QUOTE_HDR_REGEX = None
self.MULTI_QUOTE_HDR_REGEX = None
self.MULTI_QUOTE_HDR_REGEX_MULTILINE = None
self.WARNING_REGEX = None
self.words_map = words_map
self.language = language
self.default_language = 'en'
self.set_regex()

def default_quoted_header(self):
self.QUOTED_REGEX = re.compile(r'(>+)')
self.HEADER_REGEX = re.compile(
r'^[* ]*(' + self.words_map[self.language]['From']
+ '|' + self.words_map[self.language]['Sent']
+ '|' + self.words_map[self.language]['To']
+ r')\s*:[\s\n\*]*.*'
)
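
Review note: the header pattern above is assembled from the language word map. The sketch below shows the same idea in isolation, assuming a plain list of header words; `build_header_regex` is a hypothetical helper, and the use of `re.escape` and raw strings is a suggestion rather than what the PR does.

```python
import re


def build_header_regex(words):
    """Build a pattern matching lines such as '*From:* Jane' (hypothetical helper)."""
    # re.escape guards against regex metacharacters in translated header words.
    alternatives = '|'.join(re.escape(word) for word in words)
    return re.compile(r'^[* ]*(' + alternatives + r')\s*:[\s*]*.*')


HEADER_REGEX = build_header_regex(['From', 'Sent', 'To'])

for line in ['*From:* Jane Doe', 'Sent: Monday', 'Hello From: x']:
    print(repr(line), HEADER_REGEX.match(line) is not None)
```

Because `match` anchors at the start of the line, a header word that appears later in a sentence is not treated as a header.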

def warnings(self):
dot = '\u200b'
single_space = f'[ {dot}\xA0\t]'
space = f'[,()]?{single_space}{{0,3}}[\n\r]?{single_space}{{0,3}}[,()]?'
sentence_start = f'(?:[\n\r.!?]|^){single_space}{{0,3}}'
confidential_variations = f'(privileged|confidential|private|sensitive|{space}(/|and|or|and{space}/{space}or|,){space}){{1,3}}'
message_variations = rf'(electronic{space}|e[\-]?mail{space}|message{space}|communication{space}|transmission{space}){{1,3}}'
self.WARNING_REGEX = re.compile(
f'(CAUTION:|NOTICE:|Disclaimer:|Warning:|{confidential_variations}{space}Notice:|Please{space}do{space}not{space}reply'
f'|{confidential_variations}{space}information'
f'|{sentence_start}(The|This){space}information{space}(provided|transmitted|contained)?{space}(with)?in{space}this{space}{message_variations}'
f'|{sentence_start}(The|This){space}information{space}(may also be|is){space}legally'
f'|{sentence_start}(The|This){space}content[s]?{space}of{space}this{space}{message_variations}'
f'|{sentence_start}(The|This){space}{message_variations}{space}'
f'(may{space}contain|(and|or|and{space}/{space}or)?{space}(any|all)?{space}(files{space}transmitted|the{space}information{space}(contained|it{space}contains)|attach|associated)'
f'|[(]?including{space}(any|all)?{space}attachments[)]?|(is|are|contains){space}{confidential_variations}'
f'|is{space}for{space}the{space}recipients|is{space}intended{space}only|is{space}for{space}the{space}sole{space}user|has{space}been{space}scanned|with{space}its{space}contents'
f')|{sentence_start}(The|This){space}publication,{space}copying'
f'|{sentence_start}(The|This){space}sender{space}(cannot{space}guarantee|believes{space}that{space}this{space}{message_variations})'
f'|{sentence_start}If{space}you{space}have{space}received{space}this{space}{message_variations}{space}in{space}error'
f'|{sentence_start}The{space}contents{space}are{space}{confidential_variations}'
f'|{sentence_start}(Under|According to){space}(the)?{space}(General{space}Data{space}Protection{space}Regulation|GDPR)'
f'|{sentence_start}Click{space}here{space}to'
f'|{sentence_start}Copyright{space}'
rf'|{sentence_start}Was{space}this{space}email{space}helpful\?'
f'|{sentence_start}For{space}Your{space}Information:'
f'|{sentence_start}Emails{space}are{space}not{space}secure'
f'|{sentence_start}To make{space}sure{space}you{space}continue{space}to{space}receive'
f'|{sentence_start}Please{space}choose{space}one{space}of{space}the{space}options{space}below'
f'|{sentence_start}Please{space}consider{space}the{space}environment{space}before{space}printing{space}this{space}{message_variations}'
f'|{sentence_start}This{space}e-mail{space}and{space}any{space}attachments{space}are{space}confidential'
rf')[a-zA-Z0-9:;.,?!<>()@&/\'\"\“\” {dot}\xA0\t\-]*',
re.IGNORECASE
)
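
Review note: the warning pattern above only fires at the start of a sentence and then consumes the rest of the notice up to the next line break. The sketch below is a heavily reduced version of that idea with a single notice phrase. The names `SPACE`, `SENTENCE_START`, `TAIL`, `NOTICE` and `strip_notices` are illustrative and the phrase list is an assumption, not the PR's full pattern.

```python
import re

# Ordinary space, tab, non-breaking space and zero-width space.
SPACE = r'[ \t\xa0\u200b]'
# A notice must begin a sentence: start of text, a line break, or end punctuation.
SENTENCE_START = rf'(?:[\n\r.!?]|^){SPACE}{{0,3}}'
# Characters allowed in the remainder of the notice; a newline ends it.
TAIL = r'[a-zA-Z0-9:;.,?!<>()@&/ \t-]*'

NOTICE = re.compile(
    rf'(CAUTION:|{SENTENCE_START}This{SPACE}+e-?mail{SPACE}+is{SPACE}+confidential){TAIL}',
    re.IGNORECASE,
)


def strip_notices(text):
    """Replace each matched notice with a single newline."""
    return NOTICE.sub('\n', text)


body = (
    "Thanks for the update.\n"
    "This email is confidential and intended only for the recipient.\n"
    "See you soon"
)
print(repr(strip_notices(body)))
```

The same phrase in the middle of a sentence, for example "I think this email is confidential enough", is left alone because nothing before "this" marks a sentence start.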

def nl_support(self):
self.SIG_REGEX = re.compile(
r'(--|__|-\w)|(^' + self.words_map[self.language]['Sent from'] + r'(\w+\s*){1,3})'
)
self.QUOTE_HDR_REGEX = re.compile('Op.*schreef.*>:$')
self._MULTI_QUOTE_HDR_REGEX = r'(?!Op.*Op\s.+?schreef.*>:)(Op\s(.+?)schreef.*>:)'

def de_support(self):
self.SIG_REGEX = re.compile(r'(--|__|-\w)|(^' + self.words_map[self.language]['Sent from'] + r'(\w+\s*){1,3})')
self.QUOTE_HDR_REGEX = re.compile('[a-zA-Z]{2,5}.*schrieb.*:$')
self._MULTI_QUOTE_HDR_REGEX = r'(?!Am.*Am\s.+?schrieb.*:)(Am\s(.+?)schrieb.*:)'

def fr_support(self):
self.SIG_REGEX = re.compile(
r'(--|__|-\w)|(^' + self.words_map[self.language]['Sent from'] \
+ r'(\w+\s*){1,3})|(.*(cordialement|bonne r[ée]ception|salutations'
r'|cdlt|cdt|crdt|regards|best regard|bonne journ[ée]e))',
re.IGNORECASE
)
self.QUOTE_HDR_REGEX = re.compile('Le.*a écrit.*[> ]:$')
self._MULTI_QUOTE_HDR_REGEX = r'(?!Le.*Le\s.+?a écrit[a-zA-Z0-9.:;<>()&@ -]*:)(Le\s(.+?)a écrit[a-zA-Z0-9.:;<>()&@ -]*:)'

def en_support(self):
self.SIG_REGEX = re.compile(r'(--|__|-\w)|(^(sent from|get outlook)\s(\w+\s*){1,6})|(Best regards|Kind Regards|Thanks,|Thank you,|Best,|All the best|regards,)', flags=re.IGNORECASE)
self.QUOTE_HDR_REGEX = re.compile(r'\s*On.*wrote\s*:$')
self.QUOTED_REGEX = re.compile(r'(>+)|((&gt;)+)')
self._MULTI_QUOTE_HDR_REGEX = r'(?!On.*On\s.+?wrote\s*:)(On\s(.+?)wrote\s*:)'
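
Review note: the English signature pattern can be exercised on its own to see which lines it flags. The snippet copies the expression from `en_support` above; the sample lines are made up. Since the pattern is applied with `match`, any line that merely begins with one of the closing phrases is also flagged, which may be worth keeping in mind now that signature fragments are excluded from the reply.

```python
import re

# Copied from en_support in this diff.
SIG_REGEX = re.compile(
    r'(--|__|-\w)|(^(sent from|get outlook)\s(\w+\s*){1,6})'
    r'|(Best regards|Kind Regards|Thanks,|Thank you,|Best,|All the best|regards,)',
    flags=re.IGNORECASE,
)

samples = [
    'Sent from my iPhone',
    'Best regards',
    '-- ',
    'Hello there',
    'Thanks, that fixed it',  # starts with a closing phrase, so it is flagged too
]
for line in samples:
    print(repr(line), SIG_REGEX.match(line) is not None)
```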

def es_support(self):
self.SIG_REGEX = re.compile(r'(--|__|-\w)|(^Enviado desde (\w+\s*){1,6})')
self.QUOTE_HDR_REGEX = re.compile(r'\s*El.*escribió\s*:$')
self._MULTI_QUOTE_HDR_REGEX = r'(?!El.*El\s.+?escribió\s*:)(El\s(.+?)escribió\s*:)'

def ja_support(self):
self.SIG_REGEX = re.compile(r'--|__|-\w')
self.QUOTE_HDR_REGEX = re.compile(
r'[0-9]*年[0-9]*月[0-9]*日[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\uFF00-\uFFEF\u4E00-\u9FAF\u2605-\u2606\u2190-\u2195\u203Ba-zA-Z0-9.:;<>()&@ -]*:?$'
)
self.QUOTED_REGEX = re.compile(r'(>+)|((&gt;)+)')
self._MULTI_QUOTE_HDR_REGEX = r'(?!On.*On\s.+?wrote\s*:)(On\s(.+?)wrote\s*:)' # Dummy multiline: doesn't work for Japanese due to BeautifulSoup inserting new lines before the ":" character

def fi_support(self):
self.SIG_REGEX = re.compile(r'(--|__|-\w)|(^Lähetetty (\w+\s*){1,3})|(^Hanki Outlook for.*)')
self.QUOTE_HDR_REGEX = re.compile('(.+?kirjoitti(.+?kello.+?)?:)')
self.QUOTED_REGEX = re.compile(r'(>+)|((&gt;)+)')
self._MULTI_QUOTE_HDR_REGEX = r'(?!.+?kirjoitti.+?kirjoitti[a-zA-Z0-9.:;<>()&@ -]*:$)((.+?)kirjoitti[a-zA-Z0-9.:;<>()&@ -]*:$)'

def set_regex(self):
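if hasattr(self, self.language + "_support"):
# Apply the shared defaults first so a language-specific QUOTED_REGEX (e.g. the &gt; variant) is not overwritten
self.default_quoted_header()
getattr(self, self.language + "_support")()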
else:
self.SIG_REGEX = re.compile(
r'(--|__|-\w)|(^(' + self.words_map[self.language]['Sent from']
+ '|' + self.words_map[self.default_language]['Sent from']
+ r')(\w+\s*){1,3})'
)
self.QUOTE_HDR_REGEX = re.compile('.*' + self.words_map[self.language]['wrote'] + r'\s?:$')
self.default_quoted_header()
self._MULTI_QUOTE_HDR_REGEX = r'(?!.+?' + self.words_map[self.language]['wrote'] \
+ r'\s*:\s*)(On\s(.+?)' + self.words_map[self.language]['wrote'] + ':)'
self.warnings()
self.FOLLOW_UP_HDR_REGEX = re.compile(r'(?<!^)This is a follow-up to your previous request.*', re.DOTALL)
self.MULTI_QUOTE_HDR_REGEX = re.compile(self._MULTI_QUOTE_HDR_REGEX, re.DOTALL | re.MULTILINE)
self.MULTI_QUOTE_HDR_REGEX_MULTILINE = re.compile(self._MULTI_QUOTE_HDR_REGEX, re.DOTALL)
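
Review note: when shared defaults and per-language overrides set the same attribute, the order in which they run decides which one wins. The sketch below illustrates one way to make that explicit with a dictionary merge, where defaults are applied first and overrides second. `build_quoted_regex` and the `overrides` table are hypothetical and not taken from this PR; only the two patterns come from the diff.

```python
import re


def build_quoted_regex(language):
    """Return the quote-marker pattern for a language (hypothetical helper)."""
    # Shared default: one or more '>' characters.
    patterns = {'QUOTED_REGEX': re.compile(r'(>+)')}
    # Language-specific overrides; English also accepts HTML-escaped '&gt;'.
    overrides = {'en': {'QUOTED_REGEX': re.compile(r'(>+)|((&gt;)+)')}}
    # Overrides are merged last so they are never clobbered by the defaults.
    patterns.update(overrides.get(language, {}))
    return patterns['QUOTED_REGEX']


print(build_quoted_regex('en').match('&gt; quoted') is not None)
print(build_quoted_regex('nl').match('&gt; quoted') is not None)
```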

def read(self):
""" Creates new fragment for each line
and labels as a signature, quote, or hidden.

Returns EmailMessage instance
"""

self.text = self.text.strip()
self.found_visible = False

is_multi_quote_header = self.MULTI_QUOTE_HDR_REGEX_MULTILINE.search(self.text)
if is_multi_quote_header:
self.text = self.MULTI_QUOTE_HDR_REGEX.sub(is_multi_quote_header.groups()[0].replace('\n', ''), self.text)

self.text = self.FOLLOW_UP_HDR_REGEX.sub('', self.text)
# Fix any outlook style replies, with the reply immediately above the signature boundary line
# See email_2_2.txt for an example
# Flags must be passed by keyword: the fourth positional argument of re.sub is count, not flags
self.text = re.sub('([^\n])(?=\n ?[_-]{7,})', '\\1\n', self.text, flags=re.MULTILINE)

self.text = re.sub(self.WARNING_REGEX, '\n', self.text)
self.lines = self.text.split('\n')
self.lines.reverse()

for line in self.lines:
self._scan_line(line)
if line.strip():
self._scan_line(line.strip())

self._finish_fragment()

self.fragments.reverse()

return self
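
Review note: `read()` starts by collapsing a quote header that was wrapped over several lines, so that the later line-by-line scan sees it as one line. A small sketch of that step follows, using the English pattern from this diff. `collapse_quote_headers` is a hypothetical helper, and it differs from the PR in one respect: it uses a function as the replacement so that each match is collapsed individually and backslashes in the header are not interpreted as escapes.

```python
import re

# English multi-line quote header pattern, as in en_support.
MULTI_QUOTE_HDR = r'(?!On.*On\s.+?wrote\s*:)(On\s(.+?)wrote\s*:)'
MULTI_QUOTE_HDR_REGEX = re.compile(MULTI_QUOTE_HDR, re.DOTALL | re.MULTILINE)


def collapse_quote_headers(text):
    """Join each wrapped 'On ... wrote:' header onto a single line."""
    return MULTI_QUOTE_HDR_REGEX.sub(lambda m: m.group(1).replace('\n', ''), text)


email_body = (
    "Sounds good.\n\n"
    "On Mon, Jan 1, 2020 at 10:00 AM Jane Doe\n"
    "<jane@example.com> wrote:\n"
    "> earlier message"
)
print(collapse_quote_headers(email_body))
```

Note that the line break is removed without inserting a space, so "Jane Doe" and the address end up adjacent, which matches what the replacement in `read()` does.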
@@ -85,42 +207,29 @@ def reply(self):
"""
reply = []
for f in self.fragments:
if not (f.hidden or f.quoted):
if not (f.hidden or f.quoted or f.signature):
reply.append(f.content)
return '\n'.join(reply)
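
Review note: with this change the reply is built from fragments that are neither hidden, quoted, nor signatures. The sketch below shows that filter in isolation. The `Fragment` namedtuple here is a stand-in for illustration only and is not the `Fragment` class from this module.

```python
from collections import namedtuple

# Stand-in for the module's Fragment class, with only the fields the filter needs.
Fragment = namedtuple("Fragment", "content hidden quoted signature")


def visible_reply(fragments):
    """Join the content of fragments that are not hidden, quoted, or signatures."""
    return "\n".join(
        f.content for f in fragments
        if not (f.hidden or f.quoted or f.signature)
    )


fragments = [
    Fragment("Sounds good, see you then.", False, False, False),
    Fragment("Best regards\nJane", True, False, True),
    Fragment("> Are you free on Monday?", True, True, False),
]
print(visible_reply(fragments))
```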

def _scan_line(self, line):
""" Reviews each line in email message and determines fragment type

line - a row of text from an email message
"""
is_quote_header = self.QUOTE_HDR_REGEX.match(line) is not None
is_quoted = self.QUOTED_REGEX.match(line) is not None
is_header = is_quote_header or self.HEADER_REGEX.match(line) is not None

if self.fragment and len(line.strip()) == 0:
if self.SIG_REGEX.match(self.fragment.lines[-1].strip()):
self.fragment.signature = True
self._finish_fragment()

if self.fragment and self.SIG_REGEX.match(self.fragment.lines[-1].strip()):
self.fragment.signature = True
self._finish_fragment()
if self.fragment \
and ((self.fragment.headers == is_header and self.fragment.quoted == is_quoted) or
(self.fragment.quoted and (is_quote_header or len(line.strip()) == 0))):

(self.fragment.quoted and (is_quote_header or len(line.strip()) == 0))):
self.fragment.lines.append(line)
else:
self._finish_fragment()
self.fragment = Fragment(is_quoted, line, headers=is_header)

def quote_header(self, line):
""" Determines whether line is part of a quoted area

line - a row of the email message

Returns True or False
"""
return self.QUOTE_HDR_REGEX.match(line[::-1]) is not None

def _finish_fragment(self):
""" Creates fragment
"""
@@ -146,6 +255,33 @@ def _finish_fragment(self):
self.fragment = None


class EmailContacts(EmailMessage):

def contacts(self):
self.text = self.text.strip()
HEADER_BLOCK = re.compile(
r'('
+ '[>* ]*' + self.words_map[self.language]['From'] + '[ ]*:(.*)\n'
+ '[>* ]*(?:' + self.words_map[self.language]['Sent'] + '|Date)[ ]*:(.*)\n'
+ '[>* ]*' + self.words_map[self.language]['To'] + '[ ]*:(.*)\n'
+ ')'
)
EMAIL = re.compile(r'([a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5})')
headers = HEADER_BLOCK.findall(self.text)
# Named contacts rather than json to avoid shadowing the imported json module
contacts = []
for header in headers:
contact = {'from': '', 'to': '', 'date': ''}
from_email = EMAIL.search(header[1])
if from_email:
contact['from'] = from_email.groups()[0]
contact['date'] = header[2]
to_email = EMAIL.search(header[3])
if to_email:
contact['to'] = to_email.groups()[0]
contacts.append(contact)
return contacts
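
Review note: the contact extraction above can be tried without the rest of the parser. The sketch below hard-codes the English header words instead of reading them from the word map, and it strips whitespace from the date, which the PR does not do. `find_contacts` here is a standalone function written for illustration, and the sample thread is invented.

```python
import re

# From / Sent-or-Date / To on three consecutive lines, optionally quoted or starred.
HEADER_BLOCK = re.compile(
    r'('
    r'[>* ]*From[ ]*:(.*)\n'
    r'[>* ]*(?:Sent|Date)[ ]*:(.*)\n'
    r'[>* ]*To[ ]*:(.*)\n'
    r')'
)
EMAIL = re.compile(r'([a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5})')


def find_contacts(text):
    """Return one dict per header block with the sender, recipient and date."""
    contacts = []
    for _block, sender, date, recipient in HEADER_BLOCK.findall(text):
        from_email = EMAIL.search(sender)
        to_email = EMAIL.search(recipient)
        contacts.append({
            'from': from_email.group(1) if from_email else '',
            'to': to_email.group(1) if to_email else '',
            'date': date.strip(),
        })
    return contacts


thread = (
    "Thanks!\n\n"
    "From: Jane Doe <jane@example.com>\n"
    "Sent: Monday, 1 June 2020 10:00\n"
    "To: support@example.org\n"
    "Subject: Help\n"
)
print(find_contacts(thread))
```

Only the first address on each line is captured, so a message with several recipients would report just one of them.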


class Fragment(object):
""" A Fragment is a part of
an Email Message, labeling each part.