Develop #33

Merged
merged 63 commits into from
Aug 18, 2024
63 commits
35ac870
Intial commit of squeaky clean text
rhnfzl Jun 15, 2024
e285bc4
updated the sct.py script with modular code
rhnfzl Jun 15, 2024
b80aeae
updated the sct.py script with pipeline method, which would ideally w…
rhnfzl Jun 15, 2024
ee9f47e
removed unnecessary direction code
rhnfzl Jun 15, 2024
a8abf5c
adding to do list
rhnfzl Jun 15, 2024
ab921e2
adding to do list
rhnfzl Jun 15, 2024
64c9851
added requiremnt.txt file
rhnfzl Jun 15, 2024
cabb678
added setup.py file
rhnfzl Jun 15, 2024
5b6e759
added test cases
rhnfzl Jun 15, 2024
b46b2dc
updated config file
rhnfzl Jun 15, 2024
d97a01e
merging back
rhnfzl Jun 16, 2024
f7d8cfd
Merge branch 'main' into develop
rhnfzl Jun 16, 2024
90f743c
Develop (#2) (#3)
rhnfzl Jun 16, 2024
d746606
Merge branch 'main' of https://github.com/rhnfzl/SqueakyCleanText int…
rhnfzl Jun 16, 2024
d03567f
rebase
rhnfzl Jun 16, 2024
8182b94
update the license
rhnfzl Jun 16, 2024
f4c6add
added German and Spanish support
rhnfzl Jun 16, 2024
5f8ab49
Updated file for pypi
rhnfzl Jun 16, 2024
4e65a24
Updated readme file
rhnfzl Jun 16, 2024
e2a0973
Add GitHub Actions workflow for publishing to PyPI
rhnfzl Jun 16, 2024
f1e1cc9
Updated readme file
rhnfzl Jun 16, 2024
c3ed47b
Merge branch 'main' into develop
rhnfzl Jun 16, 2024
54f2714
Updated readme file
rhnfzl Jun 16, 2024
fb90a31
added the username to the publish.yml
rhnfzl Jun 16, 2024
30587e4
update the API vriable name
rhnfzl Jun 16, 2024
883d309
Merge branch 'main' into develop
rhnfzl Jun 16, 2024
1394a97
update the API user name
rhnfzl Jun 16, 2024
2b3d8fb
Bump version to 0.1.1
rhnfzl Jun 16, 2024
5bb8285
Merge branch 'main' into develop
rhnfzl Jun 16, 2024
b7f7ca5
updated the readme file
rhnfzl Jun 16, 2024
f3ef342
updated the version
rhnfzl Jun 16, 2024
2823846
Merge branch 'main' into develop
rhnfzl Jun 16, 2024
a2458d3
Update NER Process and added tag removal
rhnfzl Aug 9, 2024
0687a6f
Updated congig file
rhnfzl Aug 9, 2024
4d67b00
Merge branch 'main' into develop
rhnfzl Aug 9, 2024
d2aeb02
updated the code to have the option to not output language
rhnfzl Aug 16, 2024
8d171e9
fixed the bug for NER which was refrencing to the wrong model variabl…
rhnfzl Aug 17, 2024
ed3ce21
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
6940597
fixed the Anonomyser Engine
rhnfzl Aug 17, 2024
03ef4e0
fixed the Anonomyser Engine
rhnfzl Aug 17, 2024
fb90dcd
added the test.yml file
rhnfzl Aug 17, 2024
9d82e47
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
cc59d71
added the test.yml file
rhnfzl Aug 17, 2024
eeaec6b
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
e905ef1
added the test.yml file
rhnfzl Aug 17, 2024
dede5a2
added the German and Spanish language support in lingua
rhnfzl Aug 17, 2024
17cc400
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
8f849f2
added the ability in the config to change the model name
rhnfzl Aug 17, 2024
b0c3f8b
added the ability in the config to change the model name
rhnfzl Aug 17, 2024
a2a3335
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
8385c92
added the ability in the config to change the model name and fixed sp…
rhnfzl Aug 17, 2024
96746bb
Merge branch 'develop' of https://github.com/rhnfzl/SqueakyCleanText …
rhnfzl Aug 17, 2024
b7af17b
squased some bugs
rhnfzl Aug 17, 2024
4e0403f
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
cc9e4d3
added the language passing support
rhnfzl Aug 17, 2024
c646034
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
44ab951
Refactored the code
rhnfzl Aug 17, 2024
8d58b19
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
612753c
fixed typing issue
rhnfzl Aug 17, 2024
39b3d37
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
1c186d1
reverted the refactor
rhnfzl Aug 17, 2024
5c43731
Merge branch 'main' into develop
rhnfzl Aug 17, 2024
d35cb18
Added the flow diagram of the pacckage in the readme
rhnfzl Aug 18, 2024
2 changes: 2 additions & 0 deletions README.md
@@ -19,6 +19,8 @@ SqueakyCleanText simplifies the process by automatically addressing common text
- Supports English, Dutch, German, and Spanish languages.
- Provides text formatted for both Language Model processing and Statistical Model processing.

![Default Flow of cleaning Text](resources/sct_flow.png)

##### Benefits for Statistical Models
When working with statistical models, further optimization is often required, such as removing stopwords, special symbols, and punctuation.
SqueakyCleanText streamlines this process, ensuring your text data is in optimal shape for classification and other downstream tasks.
Binary file added resources/sct_flow.png
91 changes: 24 additions & 67 deletions sct/config.py
@@ -1,100 +1,57 @@
"""
Module containing the configuration parameters for the SCT package.
detect_language : to detect the language automatically, but would consume more time if done on a batch
fix_bad_unicode : if True, fix "broken" unicode such as mojibake and garbled HTML entities
to_ascii_unicode : if True, convert non-ASCII characters into their closest ASCII equivalents
replace_with_url : special URL token, default "",
replace_with_email : special EMAIL token, default "",
replace_years : replace year, default "",
replace_with_phone_number : special PHONE token, default "",
replace_with_number : special NUMBER token, default "",
no_currency_symbols : if True, replace all currency symbols with the respective alphabetical ones,
ner_process : To execute NER Process to remove the positional tags, PER, LOC, ORG, MISC
remove_isolated_letters : remove any isolated letters which don't add any value to the text
remove_isolated_symbols : remove any isolated symbols which shouldn't be present in the text, usually those that aren't
immediately prefixed and suffixed by a letter or number
normalize_whitespace : remove any unnecessary whitespace
statistical_model_processing : to get the statistical model text, like for fastText, SVM, LR etc
casefold : to lower the text
remove_stopwords : remove stopwords based on the language, uses NLTK stopwords
remove_punctuation : removes all the special symbols
"""

# Flag to detect the language automatically. If True, the language will be detected for each text.
CHECK_DETECT_LANGUAGE = True

# Flag to fix "broken" unicode such as mojibake and garbled HTML entities.
CHECK_FIX_BAD_UNICODE = True

# Flag to convert non-ASCII characters into their closest ASCII equivalents.
CHECK_TO_ASCII_UNICODE = True

# Flag to replace HTML tags with a special token.
CHECK_REPLACE_HTML = True

# Flag to replace URLs with a special token.
CHECK_REPLACE_URLS = True

# Flag to replace email addresses with a special token.
CHECK_REPLACE_EMAILS = True

# Flag to replace years with a special token.
CHECK_REPLACE_YEARS = True

# Flag to replace phone numbers with a special token.
CHECK_REPLACE_PHONE_NUMBERS = True

# Flag to replace numbers with a special token.
CHECK_REPLACE_NUMBERS = True

# Flag to replace currency symbols with their respective alphabetical equivalents.
CHECK_REPLACE_CURRENCY_SYMBOLS = True

# Flag to execute Named Entity Recognition (NER) to remove positional tags such as PER, LOC, ORG, MISC.
CHECK_NER_PROCESS = True

# Flag to remove any isolated letters which do not add any value to the text.
CHECK_REMOVE_ISOLATED_LETTERS = True

# Flag to remove any isolated symbols which should not be present in the text.
CHECK_REMOVE_ISOLATED_SPECIAL_SYMBOLS = True

# Flag to remove any unnecessary whitespace.
CHECK_NORMALIZE_WHITESPACE = True

# Flag to get the statistical model text, such as for fastText, SVM, LR.
CHECK_STATISTICAL_MODEL_PROCESSING = True

# Flag to convert all characters to lowercase.
CHECK_CASEFOLD = True

# Flag to remove stopwords based on the language. Uses NLTK stopwords.
CHECK_REMOVE_STOPWORDS = True

# Flag to remove all special symbols.
CHECK_REMOVE_PUNCTUATION = True

# Flag to remove custom stopwords specific to the SCT package.
CHECK_REMOVE_STEXT_CUSTOM_STOP_WORDS = True

# Special token to replace URLs.
REPLACE_WITH_URL = "<URL>"

# Special token to replace HTML tags.
REPLACE_WITH_HTML = "<HTML>"

# Special token to replace email addresses.
REPLACE_WITH_EMAIL = "<EMAIL>"

# Special token to replace years.
REPLACE_WITH_YEARS = "<YEAR>"

# Special token to replace phone numbers.
REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"

# Special token to replace numbers.
REPLACE_WITH_NUMBERS = "<NUMBER>"

# Special token to replace currency symbols. If None, symbols will be replaced with their 3-letter abbreviations.
REPLACE_WITH_CURRENCY_SYMBOLS = None

# List of positional tags to be removed by NER.
POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']

# Confidence threshold for NER.
NER_CONFIDENCE_THRESHOLD = 0.85

# Language to be used for NER. If None, the language will be detected automatically.
LANGUAGE = None

# List of pre-trained NER models in order of importance.
NER_MODELS_LIST = [
"FacebookAI/xlm-roberta-large-finetuned-conll03-english", # English Model
"FacebookAI/xlm-roberta-large-finetuned-conll02-dutch", # Dutch Model
"FacebookAI/xlm-roberta-large-finetuned-conll03-german", # German Model
"FacebookAI/xlm-roberta-large-finetuned-conll02-spanish", # Spanish Model
"Babelscape/wikineural-multilingual-ner" # Multilingual Model
]

# Order of the model is Important : English Model, Dutch Model, German Model, Spanish Model, MULTILINGUAL Model
NER_MODELS_LIST = ["FacebookAI/xlm-roberta-large-finetuned-conll03-english",
"FacebookAI/xlm-roberta-large-finetuned-conll02-dutch",
"FacebookAI/xlm-roberta-large-finetuned-conll03-german",
"FacebookAI/xlm-roberta-large-finetuned-conll02-spanish",
"Babelscape/wikineural-multilingual-ner"]
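The comment on `NER_MODELS_LIST` says the order of the models matters (English, Dutch, German, Spanish, multilingual). The sketch below is one way that ordering could be used to pick a model for a given language. It is not taken from this PR: `LANGUAGE_ORDER` and `select_ner_model` are hypothetical names, and the two-letter language codes are an assumption about how languages are identified.

```python
# Model list copied from sct/config.py; the order is significant.
NER_MODELS_LIST = [
    "FacebookAI/xlm-roberta-large-finetuned-conll03-english",  # English
    "FacebookAI/xlm-roberta-large-finetuned-conll02-dutch",    # Dutch
    "FacebookAI/xlm-roberta-large-finetuned-conll03-german",   # German
    "FacebookAI/xlm-roberta-large-finetuned-conll02-spanish",  # Spanish
    "Babelscape/wikineural-multilingual-ner",                  # Multilingual
]

# Assumption: languages are identified by ISO 639-1 codes, listed here in the
# same order as the first four entries of NER_MODELS_LIST.
LANGUAGE_ORDER = ("en", "nl", "de", "es")


def select_ner_model(language, models=NER_MODELS_LIST):
    """Hypothetical helper: return the model name for a language code.

    Falls back to the last entry (the multilingual model) when the language
    is None or is not one of the four supported codes.
    """
    if language is not None:
        code = language.lower()
        if code in LANGUAGE_ORDER:
            return models[LANGUAGE_ORDER.index(code)]
    return models[-1]


if __name__ == "__main__":
    for lang in ("en", "nl", "fr", None):
        print(lang, "->", select_ner_model(lang))
```

Because the lookup is positional, reordering the list in the config without updating the language order would silently route text to the wrong model, which is presumably why the config comment stresses the ordering.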