This repo provides a clean implementation of a language detection system in TensorFlow 2, following best practices. The following 21 languages are supported:
- Bulgarian
- Czech
- Danish
- Dutch
- English (Of course)
- Estonian
- Finnish
- French
- German
- Greek
- Hungarian
- Italian
- Latvian
- Lithuanian
- Polish
- Portuguese
- Romanian
- Slovak
- Slovenian
- Spanish
- Swedish
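The classifier distinguishes the 21 languages above. As a minimal sketch, mapping an argmax'd model output index back to a language name could look like the following (the label ordering here is an assumption for illustration, not taken from the repo; the real order depends on how the labels were encoded during training):

```python
# Hypothetical label list: the actual index order depends on how the
# label encoder was fit during training in this repo.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "Finnish", "French", "German", "Greek", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Polish", "Portuguese", "Romanian",
    "Slovak", "Slovenian", "Spanish", "Swedish",
]

def index_to_language(class_index: int) -> str:
    """Map a predicted class index (e.g. argmax of model output) to a label."""
    return LANGUAGES[class_index]
```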
```bash
# TensorFlow CPU
conda activate <your-env>   # activate the conda environment you use for TensorFlow
pip install -r requirements.txt
```
NOTE: Each model requires its respective tokenizer to work; kindly download models along with their tokenizers.
```bash
# Model (append ?raw=true so wget fetches the file itself, not the GitHub HTML page)
wget -O model.h5 "https://github.com/saahiluppal/langdet/blob/master/model.h5?raw=true"
# Tokenizer
wget -O tokenizer.json "https://github.com/saahiluppal/langdet/blob/master/tokenizer.json?raw=true"
```
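A common pitfall is ending up with a GitHub HTML page instead of the actual files. A small sanity check, sketched here with the stdlib only (filenames match the wget commands above; the HDF5 magic bytes are a documented format constant):

```python
import json

def looks_like_hdf5(path: str) -> bool:
    """True if the file starts with the HDF5 magic bytes (valid .h5 file)."""
    with open(path, "rb") as f:
        return f.read(8) == b"\x89HDF\r\n\x1a\n"

def looks_like_tokenizer(path: str) -> bool:
    """True if the file parses as a JSON object (as tokenizer.json should)."""
    with open(path) as f:
        return isinstance(json.load(f), dict)
```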
Not sure which model to use? You can find information about the models here.
```bash
# Wanna detect language? (we recommend using more than 5 words for better accuracy)
# File dependencies soon to be added
python detect.py
```
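Conceptually, detection takes raw text through the tokenizer and into the model: convert words to integer ids, pad to a fixed length, then predict. A stdlib-only sketch of those first two steps with a toy vocabulary (the real ids come from tokenizer.json; note that Keras `pad_sequences` pre-pads by default, while post-padding is shown here for simplicity):

```python
# Toy vocabulary standing in for the word_index inside tokenizer.json.
word_index = {"this": 1, "is": 2, "a": 3, "sentence": 4}

def text_to_sequence(text, vocab, oov=0):
    """Map each word to its integer id; unknown words get the OOV id."""
    return [vocab.get(word, oov) for word in text.lower().split()]

def pad(seq, maxlen, value=0):
    """Truncate or post-pad a sequence to a fixed length."""
    seq = seq[:maxlen]
    return seq + [value] * (maxlen - len(seq))

seq = pad(text_to_sequence("This is a sentence", word_index), maxlen=6)
# seq is now a fixed-length integer vector the model can consume.
```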
```bash
# Training a custom model (we recommend adjusting the code to better suit your needs)
python manual_tokens.py
# Jupyter notebook for the same
jupyter notebook manual_tokens.ipynb
```
```bash
# Wanna preprocess downloaded data for custom use?
python extraction.py
```
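The Europarl corpus ships one file of sentences per language, so preprocessing essentially means labelling each line with its language to build (text, label) training pairs. A minimal sketch of that idea (the dict below is illustrative sample data, not the repo's actual extraction code):

```python
# Illustrative stand-in for per-language Europarl files: language code -> lines.
samples = {
    "en": ["the session is resumed"],
    "fr": ["la session est reprise"],
}

# Flatten into (text, label) pairs suitable for supervised training.
labeled = [
    (text, lang)
    for lang, lines in samples.items()
    for text in lines
]
```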
I used the dataset from the European Parliament Parallel Corpus, which can be found here. The full dataset is large (1.5 GB unextracted), so you might want to use the smaller preprocessed dataset, which can be found here.