PHP library that identifies the human language of a text sample and returns a confidence score for each candidate language.
$ pear install Text_LanguageDetect
$ composer require pear/text_languagedetect
Also see the examples in the docs/ directory and the official documentation.
Simple language detection:
    <?php
    require_once 'Text/LanguageDetect.php';

    $text = 'Was wäre, wenn ich Ihnen das jetzt sagen würde?';

    $ld = new Text_LanguageDetect();
    $language = $ld->detectSimple($text);

    echo $language;
    //output: german
Show the three most probable languages with their confidence scores:
    <?php
    require_once 'Text/LanguageDetect.php';

    $text = 'Was wäre, wenn ich Ihnen das jetzt sagen würde?';

    $ld = new Text_LanguageDetect();
    //3 most probable languages
    $results = $ld->detect($text, 3);

    foreach ($results as $language => $confidence) {
        echo $language . ': ' . number_format($confidence, 2) . "\n";
    }

    //output:
    //german: 0.35
    //dutch: 0.25
    //swedish: 0.20
    ?>
Instead of returning the full language name, ISO 639-1 two-letter or ISO 639-2 three-letter codes can be returned:
    <?php
    require_once 'Text/LanguageDetect.php';

    $ld = new Text_LanguageDetect();

    //will output the ISO 639-1 two-letter language code
    // "de"
    $ld->setNameMode(2);
    echo $ld->detectSimple('Das ist ein kleiner Text') . "\n";

    //will output the ISO 639-2 three-letter language code
    // "deu"
    $ld->setNameMode(3);
    echo $ld->detectSimple('Das ist ein kleiner Text') . "\n";
    ?>
Supported languages:
- albanian
- arabic
- azeri
- bengali
- bulgarian
- cebuano
- croatian
- czech
- danish
- dutch
- english
- estonian
- farsi
- finnish
- french
- german
- hausa
- hawaiian
- hindi
- hungarian
- icelandic
- indonesian
- italian
- kazakh
- kyrgyz
- latin
- latvian
- lithuanian
- macedonian
- mongolian
- nepali
- norwegian
- pashto
- pidgin
- polish
- portuguese
- romanian
- russian
- serbian
- slovak
- slovene
- somali
- spanish
- swahili
- swedish
- tagalog
- turkish
- ukrainian
- urdu
- uzbek
- vietnamese
- welsh
- Homepage: http://pear.php.net/package/Text_LanguageDetect
- Bug tracker: http://pear.php.net/bugs/search.php?cmd=display&package_name[]=Text_LanguageDetect
- Documentation: http://pear.php.net/package/Text_LanguageDetect/docs
- Unit test status: https://travis-ci.org/pear/Text_LanguageDetect
Where is the data from?
I don't recall where I got the original data set. It is just the frequencies of 3-letter combinations (trigrams) in each supported language. A comparable data set could be generated from a few random Wikipedia pages in each language.
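To make the idea of "frequencies of 3-letter combinations" concrete, here is a minimal sketch of how trigram counts could be collected from a text sample. It is not the library's own code or data format: `countTrigrams()` is a hypothetical helper written for illustration, the lowercasing, letter-only filtering and space padding are assumptions, and it requires the mbstring extension and valid UTF-8 input.

```php
<?php
// Hypothetical helper, NOT part of Text_LanguageDetect: counts how often
// each 3-character sequence (trigram) occurs in a UTF-8 text sample.
function countTrigrams(string $text): array
{
    // Assumption: case is ignored and anything that is not a letter
    // is treated as a single word separator.
    $text = mb_strtolower($text, 'UTF-8');
    $text = trim(preg_replace('/[^\p{L}]+/u', ' ', $text) ?? '');

    // Pad with spaces so that word starts and ends form trigrams too.
    $text = ' ' . $text . ' ';

    $counts = [];
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i <= $length - 3; $i++) {
        $trigram = mb_substr($text, $i, 3, 'UTF-8');
        $counts[$trigram] = ($counts[$trigram] ?? 0) + 1;
    }

    // Most frequent trigrams first.
    arsort($counts);
    return $counts;
}

// Print the five most frequent trigrams of a short sample.
$counts = countTrigrams('Das ist ein kleiner Text');
foreach (array_slice($counts, 0, 5, true) as $trigram => $count) {
    echo "'$trigram': $count\n";
}
```

Running such a function over a larger body of text per language, for example a handful of Wikipedia pages, would give a frequency table of the kind described above.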