langdata

Source training data for Tesseract, covering many languages.

Want to retrain Tesseract for a specific language by modifying or augmenting the original training data? Then you have come to the right place!

If you are looking for ready-made language data to run Tesseract, see our tessdata repository instead.

To re-create the training of a single language, lang, you need the following:

All the data in the lang directory.
The corresponding unicharset/xheights files for the script(s) used by lang.
All the remaining non-lang-specific files in the top-level directory, such as font_properties.
You also need the fonts used to train the language. Some languages were trained with commercially available fonts, so to reproduce the training exactly you will need to buy them; otherwise, use substitutes.
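
The requirements above can be sketched as a small Python helper that gathers the relevant files from a local checkout. This is only an illustration, not part of the repository: the function name `collect_training_inputs` is hypothetical, and it assumes the script files sit in the top-level directory and end in `.unicharset` or `.xheights`. It does not know which script(s) a language uses, so it returns the files for every script and leaves that choice to you. Fonts are not covered.

```python
from pathlib import Path


def collect_training_inputs(langdata_root, lang):
    """Group the files needed to re-create training for one language.

    Hypothetical helper. Assumes langdata_root is a local checkout with
    one directory per language and the shared files at the top level.
    """
    root = Path(langdata_root)
    lang_dir = root / lang
    if not lang_dir.is_dir():
        raise FileNotFoundError(f"no language directory: {lang_dir}")

    # 1. All the data in the lang directory.
    lang_files = sorted(p for p in lang_dir.rglob("*") if p.is_file())

    # Regular files in the top-level directory (directories are skipped).
    top_level = sorted(p for p in root.iterdir() if p.is_file())

    # 2. unicharset/xheights files. Assumption: they live at the top level
    #    and are recognisable by suffix. Every script's files are returned;
    #    pick the ones for the script(s) your language uses.
    script_suffixes = {".unicharset", ".xheights"}
    script_files = [p for p in top_level if p.suffix in script_suffixes]

    # 3. The remaining non-lang-specific files, such as font_properties.
    shared_files = [p for p in top_level if p.suffix not in script_suffixes]

    return {"lang": lang_files, "script": script_files, "shared": shared_files}
```

For example, `collect_training_inputs("langdata", "eng")` would return a dictionary with the keys `lang`, `script`, and `shared`, assuming a checkout exists in a directory named `langdata`.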

Language directories:

afr, amh, ara, asm, aze, aze_cyrl, bel, ben, bih, bod,
bos, bul, cat, ceb, ces, chi_sim, chi_tra, chr, cym, dan,
deu, dzo, ell, eng, enm, epo, est, eus, fas, fin,
fra, frk, frm, gle, gle_uncial, glg, guj, hat, heb, hin,
hrv, hun, iku, ind, isl, ita, ita_old, jav, jpn, kan,
kat, kat_old, kaz, khm, kir, kor, kur, lao, lat, lav,
lit, mal, mar, mkd, mlt, msa, mya, nep, nld, nor,
ori, pan, per, pol, por, pus, ron, rus, san, sin,
slk, slv, spa, spa_old, sqi, srp, srp_latn, swa, swe, syr,
tam, tel, tgk, tgl, tha, tir, tur, uig, ukr, urd
