Develop (#38)
* Initial commit of squeaky clean text

* updated the sct.py script with modular code

* updated the sct.py script with a pipeline method, which should ideally make changes to the processing easier

* removed unnecessary direction code

* adding to do list

* adding to do list

* added requirements.txt file

* added setup.py file

* added test cases

* updated config file

* merging back

* Develop (#2) (#3)

* Initial commit of squeaky clean text

* updated the sct.py script with modular code

* updated the sct.py script with a pipeline method, which should ideally make changes to the processing easier

* removed unnecessary direction code

* adding to do list

* adding to do list

* added requirements.txt file

* added setup.py file

* added test cases

* updated config file

* merging back

* rebase

* update the license

* added German and Spanish support

* Updated file for pypi

* Updated readme file

* Add GitHub Actions workflow for publishing to PyPI

* Updated readme file

* Updated readme file

* added the username to the publish.yml

* update the API variable name

* update the API user name

* Bump version to 0.1.1

* updated the readme file

* updated the version

* Update NER Process and added tag removal

* Updated config file

* updated the code to have the option to not output language

* fixed the bug for NER which was referencing the wrong model variable names, added GPU support

* fixed the Anonymizer Engine

* fixed the Anonymizer Engine

* added the test.yml file

* added the test.yml file

* added the test.yml file

* added the German and Spanish language support in lingua

* added the ability in the config to change the model name

* added the ability in the config to change the model name

* added the ability in the config to change the model name and fixed spanish model name

* squashed some bugs

* added the language passing support

* Refactored the code

* fixed typing issue

* reverted the refactor

* Added the flow diagram of the package in the readme

* Added the flow diagram of the package in the readme

* fixed the diagram

* Resolved conflict by replacing sct_flow.png

* Added improved NER process and the batch processing method to enhance the speed
rhnfzl authored Nov 13, 2024
1 parent 2cf6876 commit 289e203
Showing 9 changed files with 841 additions and 311 deletions.
38 changes: 31 additions & 7 deletions .github/workflows/publish.yml
@@ -6,7 +6,37 @@ on:
- main

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
    - uses: actions/checkout@v4

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -e .
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
        pip install hypothesis faker flake8 pytest
    - name: Download NLTK stopwords
      run: |
        python -m nltk.downloader stopwords
    - name: Test with pytest
      run: |
        pytest

  publish:
    needs: test
    runs-on: ubuntu-latest

    steps:
@@ -15,19 +45,13 @@ jobs:
    - name: Set up Python
      uses: actions/setup-python@v3
      with:
        python-version: '3.10' # Use a specific Python version for publishing
        python-version: '3.10' # Use stable version for publishing

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Install package dependencies
      run: |
        pip install -e .
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
        pip install hypothesis faker
    - name: Build the package
      run: python setup.py sdist bdist_wheel

2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -16,7 +16,7 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
        python-version: ["3.10", "3.11", "3.12"]

    steps:
    - uses: actions/checkout@v4
2 changes: 2 additions & 0 deletions .gitignore
@@ -166,3 +166,5 @@ sct/utils/__pycache__
sct/scripts/__pycache__
tests/.hypothesis
SqueakyCleanText.egg-info
test_performance.py
OldSqueakyCleanText/
226 changes: 73 additions & 153 deletions README.md
@@ -7,182 +7,102 @@ In the world of machine learning and natural language processing, clean and well
SqueakyCleanText simplifies the process by automatically addressing common text issues, ensuring your data is clean and well-structured with minimal effort on your part.

### Key Features
- Encoding Issues: Corrects text encoding problems.
- HTML and URLs: Removes unnecessary long HTML tags and URLs, or replaces them with special tokens.
- Contact Information: Strips emails, phone numbers, and other contact details, or replaces them with special tokens.
- Isolated Characters: Eliminates isolated letters or symbols that add no value.
- NER Support: Uses a soft voting ensemble technique to handle named entities like location, person, and organization names, which can be replaced with special tokens if not needed in the text.
- Stopwords and Punctuation: For statistical models, it optimizes text by removing stopwords, special symbols, and punctuation.
- Currency Symbols: Replaces all currency symbols with their alphabetical equivalents.
- Whitespace Normalization: Removes unnecessary whitespace.
- Detects the language of the processed text, useful for downstream tasks.
- Supports English, Dutch, German, and Spanish languages.
- Provides text formatted for both Language Model processing and Statistical Model processing.
- **Encoding Issues**: Corrects text encoding problems and handles bad Unicode characters.
- **HTML and URLs**: Removes or replaces HTML tags and URLs with configurable tokens.
- **Contact Information**: Handles emails, phone numbers, and other contact details with customizable replacement tokens.
- **Named Entity Recognition (NER)**:
- Multi-language support (English, Dutch, German, Spanish)
- Ensemble voting technique for improved accuracy
- Configurable confidence thresholds
- Efficient batch processing
- Automatic text chunking for long documents
- GPU acceleration support
- **Text Normalization**:
- Removes isolated letters and symbols
- Normalizes whitespace
- Handles currency symbols
- Year detection and replacement
- Number standardization
- **Language Support**:
- Automatic language detection
- Language-specific NER models
- Language-aware stopword removal
- **Dual Output Formats**:
- Language Model format (preserves structure with tokens)
- Statistical Model format (optimized for classical ML)
- **Performance Optimization**:
- Batch processing support
- Configurable batch sizes
- Memory-efficient processing of large texts
- GPU memory management

![Default Flow of cleaning Text](resources/sct_flow.png)

##### Benefits for Statistical Models
When working with statistical models, further optimization is often required, such as removing stopwords, special symbols, and punctuation.
SqueakyCleanText streamlines this process, ensuring your text data is in optimal shape for classification and other downstream tasks.
### Benefits

##### Advantage for Ensemble NER Process
Relying on a single model for Named Entity Recognition (NER) may not be ideal, as there is a significant chance that it might miss some entities. Combining language-specific NER models increases specificity and reduces the risk of missing entities.
The NER model in this package includes a chunking mechanism, enabling effective NER processing even when the text exceeds the model's token size limit.
#### For Language Models
- Maintains text structure while anonymizing sensitive information
- Configurable token replacements
- Preserves context while removing noise
- Handles long documents through intelligent chunking

By automating these text cleaning steps, SqueakyCleanText ensures your data is prepared efficiently and effectively, saving time and improving model performance.
#### For Statistical Models
- Removes stopwords and punctuation
- Case normalization
- Special symbol removal
- Optimized for classification tasks

## Installation
#### Advanced NER Processing
- Ensemble approach reduces missed entities
- Language-specific models improve accuracy
- Confidence thresholds for precision control
- Efficient batch processing for large datasets
- Automatic handling of long documents
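The ensemble idea described above can be sketched roughly as follows. The tuple format and the keep-if-any-model-is-confident rule are illustrative assumptions, not the package's exact voting logic:

```python
from collections import defaultdict

def ensemble_entities(predictions_per_model, threshold=0.85):
    """Combine NER predictions from several models (illustrative sketch).

    Each model yields (entity_text, label, score) tuples. An entity
    survives if any model predicts it above the confidence threshold,
    so an entity missed by one model can still be recovered by another.
    """
    best = defaultdict(float)
    for predictions in predictions_per_model:
        for span, label, score in predictions:
            best[(span, label)] = max(best[(span, label)], score)
    return [key for key, score in best.items() if score >= threshold]
```

With a threshold mirroring `config.NER_CONFIDENCE_THRESHOLD = 0.85`, low-confidence spans are dropped while entities found confidently by any single model are kept.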

To install SqueakyCleanText, use the following pip command:
## Installation

```sh
pip install SqueakyCleanText
```

## Usage

Here are a few examples of how to use the SqueakyCleanText package:

Examples:
```python
english_text = "Hey John Doe, wanna grab some coffee at Starbucks on 5th Avenue? I'm feeling a bit tired after last night's party at Jane's place. BTW, I can't make it to the meeting at 10:00 AM. LOL! Call me at +1-555-123-4567 or email me at [email protected]. Check out this cool website: https://www.example.com."

dutch_text = "Hé Jan Jansen, wil je wat koffie halen bij Starbucks op de 5e Avenue? Ik voel me een beetje moe na het feest van gisteravond bij Annes huis. Btw, ik kan niet naar de vergadering om 10:00 uur. LOL! Bel me op +31-6-1234-5678 of mail me op [email protected]. Kijk eens naar deze coole website: https://www.voorbeeld.com."
```

- Using default configuration settings:

### Basic Usage
```python
# The first time you import the package, it may take some time because it will download the NER models. Please be patient.
from sct import sct

# Initialize the TextCleaner
sx = sct.TextCleaner()

# Process the text
# lmtext : Text for Language Models;
# cmtext : Text for Classical/Statistical ML;
# language : Processed text language

#### --- English Text
lmtext, cmtext, language = sx.process(english_text)
print(f"Language Model Text : {lmtext}")
print(f"Statistical Model Text : {cmtext}")
print(f"Language of the Text : {language}")

# Output the result
# Language Model Text : Hey <PERSON> wanna grab some coffee at Starbucks on <LOCATION> I'm feeling a bit tired after last night's party at <PERSON>'s place. BTW, can't make it to the meeting at <NUMBER><NUMBER> AM. LOL! Call me at <PHONE> or email me at <EMAIL> Check out this cool website: <URL>
# Statistical Model Text : hey person wanna grab coffee starbucks location im feeling bit tired last nights party persons place btw cant make meeting numbernumber am lol call phone email email check cool website url
# Language of the Text : ENGLISH

#### --- Dutch Text
lmtext, cmtext, language = sx.process(dutch_text)
print(f"Language Model Text : {lmtext}")
print(f"Statistical Model Text : {cmtext}")
print(f"Language of the Text : {language}")

# Output the result
# Language Model Text : He <PERSON> wil je wat koffie halen bij <ORGANISATION> op de <LOCATION> Ik voel me een beetje moe na het feest van gisteravond bij Annes huis. Btw, ik kan niet naar de vergadering om <NUMBER><NUMBER> uur. LOL! Bel me op <NUMBER><NUMBER><PHONE> of mail me op <EMAIL> Kijk eens naar deze coole website: <URL>
# Statistical Model Text : he person koffie halen organisation location voel beetje moe feest gisteravond annes huis btw vergadering numbernumber uur lol bel numbernumberphone mail email kijk coole website url
# Language of the Text : DUTCH
# Process single text
text = "Hey John Doe, email me at [email protected]"
lm_text, stat_text, language = sx.process(text)

# Process multiple texts efficiently
texts = ["Text 1", "Text 2", "Text 3"]
results = sx.process_batch(texts, batch_size=2)
```

- Using the package with custom configuration:
You can modify the package’s functionality by changing settings in the configuration file before initializing TextCleaner().

- Deactivating NER altogether:

```python

from sct import sct, config

config.CHECK_NER_PROCESS = False
sx = sct.TextCleaner()

lmtext, cmtext, language = sx.process(english_text)
print(f"Language Model Text : {lmtext}")
print(f"Statistical Model Text : {cmtext}")
print(f"Language of the Text : {language}")

# Output the result
# Language Model Text : Hey John Doe, wanna grab some coffee at Starbucks on 5th Avenue? I'm feeling a bit tired after last night's party at Jane's place. BTW, can't make it to the meeting at <NUMBER><NUMBER> AM. LOL! Call me at <PHONE> or email me at <EMAIL> Check out this cool website: <URL>
# Statistical Model Text : hey john doe wanna grab coffee starbucks 5th avenue im feeling bit tired last nights party janes place btw cant make meeting numbernumber am lol call phone email email check cool website url
# Language of the Text : ENGLISH
```

- In case the Statistical Model text is not needed:

```python

from sct import sct, config

config.CHECK_STATISTICAL_MODEL_PROCESSING = False
sx = sct.TextCleaner()

lmtext, language = sx.process(english_text)
print(f"Language Model Text : {lmtext}")
print(f"Language of the Text : {language}")

# Output the result
# Language Model Text : Hey John Doe, wanna grab some coffee at Starbucks on 5th Avenue? I'm feeling a bit tired after last night's party at Jane's place. BTW, can't make it to the meeting at <NUMBER><NUMBER> AM. LOL! Call me at <PHONE> or email me at <EMAIL> Check out this cool website: <URL>
# Language of the Text : ENGLISH
```
### Full List of Configurable Settings:

Similarly, other aspects of the configuration can be changed. Simply modify the settings before initializing TextCleaner(). Below is the full list of configurable settings:

```python
from sct import sct, config
# CHECK_DETECT_LANGUAGE is only honoured as False when NER processing and
# Statistical Model stopword removal are also disabled, since both rely on the detected language.
config.CHECK_DETECT_LANGUAGE = True
config.CHECK_DETECT_LANGUAGE = True
config.CHECK_FIX_BAD_UNICODE = True
config.CHECK_TO_ASCII_UNICODE = True
config.CHECK_REPLACE_HTML = True
config.CHECK_REPLACE_URLS = True
config.CHECK_REPLACE_EMAILS = True
config.CHECK_REPLACE_YEARS = True
config.CHECK_REPLACE_PHONE_NUMBERS = True
config.CHECK_REPLACE_NUMBERS = True
config.CHECK_REPLACE_CURRENCY_SYMBOLS = True
config.CHECK_NER_PROCESS = True
config.CHECK_REMOVE_ISOLATED_LETTERS = True
config.CHECK_REMOVE_ISOLATED_SPECIAL_SYMBOLS = True
config.CHECK_NORMALIZE_WHITESPACE = True
config.CHECK_STATISTICAL_MODEL_PROCESSING = True
config.CHECK_CASEFOLD = True
config.CHECK_REMOVE_STOPWORDS = True
config.CHECK_REMOVE_PUNCTUATION = True
config.CHECK_REMOVE_SCT_CUSTOM_STOP_WORDS = True
# Replacement tags can be customised; pass "" if no special tags are needed
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_HTML = "<HTML>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.REPLACE_WITH_YEARS = "<YEAR>"
config.REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"
config.REPLACE_WITH_NUMBERS = "<NUMBER>"
config.REPLACE_WITH_CURRENCY_SYMBOLS = None
# You can remove any of the tags
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
config.NER_CONFIDENCE_THRESHOLD = 0.85
# Set to ENGLISH, DUTCH, GERMAN, etc. if you know the language of the text beforehand.
config.LANGUAGE = None

# Order of the models is important: English Model, Dutch Model, German Model, Spanish Model, MULTILINGUAL Model
# All models passed need to support transformers AutoModel
config.NER_MODELS_LIST = [
"FacebookAI/xlm-roberta-large-finetuned-conll03-english",
"FacebookAI/xlm-roberta-large-finetuned-conll02-dutch",
"FacebookAI/xlm-roberta-large-finetuned-conll03-german",
"FacebookAI/xlm-roberta-large-finetuned-conll02-spanish",
"Babelscape/wikineural-multilingual-ner"
]

sx = sct.TextCleaner()
```
### Advanced Configuration
```python
from sct import sct, config

# Customize NER settings
config.CHECK_NER_PROCESS = True
config.NER_CONFIDENCE_THRESHOLD = 0.85
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']

# Customize replacement tokens
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"

# Set known language (skips detection)
config.LANGUAGE = "ENGLISH" # Options: ENGLISH, DUTCH, GERMAN, SPANISH

# Initialize with custom settings
sx = sct.TextCleaner()
```
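The long-document handling mentioned above (automatic text chunking) can be sketched roughly like this. The window and overlap sizes are assumptions; the real implementation works on tokenizer output and merges entity predictions across chunk boundaries:

```python
def chunk_tokens(tokens, max_len=512, overlap=50):
    """Split a token sequence into overlapping windows of at most max_len.

    Illustrative sketch of the chunking idea: the overlap keeps entities
    that straddle a chunk boundary visible in at least one window.
    """
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    if len(tokens) <= max_len:
        return [list(tokens)]
    chunks = []
    step = max_len - overlap
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
    return chunks
```

Each chunk then fits within the NER model's token limit, and batching the chunks is what enables the speed-up the batch processing method provides.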

## API

7 changes: 7 additions & 0 deletions requirements.txt
@@ -7,3 +7,10 @@ transformers>=4.30.0
beautifulsoup4==4.12.2
presidio_anonymizer>=2.2.0
lingua-language-detector>=2.0.2
hypothesis==6.82.7
faker==20.1.0
flake8==6.1.0
pytest==8.3.3
coverage==7.3.1
pytest-cov==4.1.0
timeout-decorator==0.5.0
