Korean Text Data Generator (kotdg) for OCR tasks. It depends heavily on TRDG by Belval.
Install the trdg and Pillow packages with pip if they are not installed yet. You may want to use venv or conda.
pip install trdg Pillow
All resources, including dictionaries, fonts, and background images, are located in their respective folders under resources/.
You may add more resources to the corresponding folders.
Run builds/ksx*.sh to build preconfigured datasets.
You can run ./run.py to run kotdg. ./run.py --help will show a (shockingly long) list of options.
The options generally behave the same as in the original TRDG.
- -c or --count is the only required option. It specifies the number of images to be generated.
- -l or --language specifies the language to be used. You can refer to the ISO language codes.
  - FYI: Korean is ko, and English is en.
- -o or --output_dir <dir> specifies the location where the generated images and labels will be saved.
  - Defaults to out
- Specifying text source
  - --dict <name> will load the dictionary named <name> from resources/dicts/
    - If none of the source-specifying options are set, this option (with <name> as words.txt) is used by default.
    - Each line of the file is treated as a separate entry, and at most 200 characters (from the beginning) are used.
    - Lines are randomly picked.
    - e.g. ./run.py --dict anthem.txt -c 5
  - --input_file <path> will load a file from the given path.
    - Note that <path> is not relative to resources/, unlike the other cases.
    - Lines are used sequentially.
    - e.g. ./run.py --input_file resources/dicts/ksx1001.txt -c 5
  - --random will load randomly picked letters from the corresponding pool
    - Hangul is sampled at random from the Unicode range 0xAC00 to 0xD7A3, which covers all possible precomposed Hangul syllables.
    - --include_** options configure which letters can be included in the pool
    - ko and cn each have their own pool, and other Latin-script languages use the ASCII pool
  - --wikipedia will load random words from a random Wikipedia page corresponding to the given language code.
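As a rough illustration of the --dict and --random sources described above, the sketch below mimics their documented behavior: dictionary lines are cut to 200 characters and picked at random, and random Korean text is drawn from the Unicode range 0xAC00 to 0xD7A3. The helpers pick_dict_lines and random_hangul are hypothetical names, not part of kotdg or TRDG, and skipping blank lines is an assumption; the actual implementation may differ.

```python
import random

MAX_CHARS = 200        # at most 200 characters of each dictionary line are used
HANGUL_FIRST = 0xAC00  # first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable


def pick_dict_lines(path, count, rng=random):
    """Hypothetical helper: pick `count` random lines, each cut to 200 chars."""
    with open(path, encoding="utf-8") as f:
        # Assumption: blank lines are skipped.
        lines = [line.rstrip("\n")[:MAX_CHARS] for line in f if line.strip()]
    return [rng.choice(lines) for _ in range(count)]


def random_hangul(length, rng=random):
    """Hypothetical helper: build a string of random Hangul syllables."""
    return "".join(
        chr(rng.randint(HANGUL_FIRST, HANGUL_LAST)) for _ in range(length)
    )
```

Passing a seeded random.Random instance as rng would make the picks reproducible, which can help when regenerating a dataset.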
- Text options
  - --font <name> specifies which font to use
    - The font will be loaded from resources/fonts
    - The default value (used when this option is not set) is 'NanumGothic.ttf'.
    - Note: all fonts should be in .ttf or .otf format.
      - TRDG uses the PIL.ImageFont.truetype function. Check its documentation for compatibility.
    - e.g. ./run.py -c 5 --font "Maplestory Bold.ttf"
  - --font_dir <dir> specifies the directory where fonts are located
    - All fonts in the directory will be tried.
    - Note that the total number of generated images is still equal to the number passed to --count.
    - e.g. ./run.py -c 5 --font_dir resources/fonts
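The --font_dir behavior above amounts to collecting every usable font file in a directory. A minimal sketch, assuming fonts are recognized by their .ttf or .otf extension; list_fonts is a hypothetical helper and not a kotdg or TRDG function:

```python
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf"}  # formats the README says are supported


def list_fonts(font_dir):
    """Hypothetical helper: return sorted paths of .ttf/.otf files in font_dir."""
    return sorted(
        str(p)
        for p in Path(font_dir).iterdir()
        if p.is_file() and p.suffix.lower() in FONT_SUFFIXES
    )
```

Each returned path could then be handed to PIL.ImageFont.truetype to check that Pillow can actually open the font.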
- Image options
  - --format specifies the 'height' of each image
    - If the text orientation is vertical, this specifies the width.
    - Defaults to 32.
  - --width specifies the width of each image.
    - If not set, the width will be 10 + <width of text>
    - Not yet tested with vertical text
  - -b <mode> or --background <mode> specifies which background will be used.
    - Options for <mode>: 0 (Gaussian noise), 1 (plain white), 2 (quasi-crystal), 3 (custom image)
    - Note that <mode>=3 needs --image_dir to locate the custom background images
  - --image_dir specifies the location of background images.
    - Only used with --background 3.
    - Ideally, this should be resources/images. Making that the default is planned.
    - e.g. ./run.py --background 3 --image_dir resources/images -c 5
  - --text_color "color_code" specifies the color of the text.
    - The color code should be in hex format (e.g. #00FFFF), and it must be enclosed in double quotes ("").
    - e.g. ./run.py -c 5 --text_color "#00FFFF"
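To make the hex color format for --text_color concrete, here is a small sketch that validates a code such as "#00FFFF" and converts it to an RGB tuple. parse_hex_color is a hypothetical helper written for illustration; it is not how kotdg or TRDG necessarily handles the value.

```python
import re

HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def parse_hex_color(code):
    """Hypothetical helper: convert '#RRGGBB' into an (R, G, B) tuple."""
    if not HEX_COLOR.fullmatch(code):
        raise ValueError(f"expected a color like '#00FFFF', got {code!r}")
    # Characters 1-2, 3-4, and 5-6 hold the red, green, and blue channels.
    return tuple(int(code[i:i + 2], 16) for i in (1, 3, 5))


print(parse_hex_color("#00FFFF"))  # (0, 255, 255)
```

The quotes matter on the command line because most shells treat an unquoted word starting with # as the beginning of a comment, so the value would never reach run.py.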
- Output options
  - --name_format specifies the format of the generated file names. <text>: text written on the image, <idx>: integer index of the image, <ext>: extension of the image
    - 0: <text>_<idx>.<ext>
    - 1: <idx>_<text>.<ext>
    - 2: <idx>.<ext>, and labels.txt containing the <idx>-<text> mapping
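The three --name_format modes listed above can be sketched as a single function. This is only an illustration of the naming patterns; format_name is a hypothetical helper, and the way labels.txt is written for mode 2 is not shown because its exact layout is not specified here.

```python
def format_name(mode, text, idx, ext):
    """Hypothetical helper: build a file name for --name_format modes 0, 1, 2."""
    if mode == 0:
        return f"{text}_{idx}.{ext}"   # <text>_<idx>.<ext>
    if mode == 1:
        return f"{idx}_{text}.{ext}"   # <idx>_<text>.<ext>
    if mode == 2:
        return f"{idx}.{ext}"          # <idx>.<ext>; text goes to labels.txt
    raise ValueError(f"unknown name format: {mode}")


print(format_name(1, "한글", 7, "jpg"))  # 7_한글.jpg
```

Mode 2 keeps file names free of the label text, which is useful when the text contains characters that are awkward in file names.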
Import KoreanTextGenerator with from kotdg.generator import KoreanTextGenerator.
Refer to demo.py for general usage.
resources/
- fonts, dictionaries, and background images.
  - font files should be in .ttf or .otf format
kotdg/
- core files.
- KoreanTextGenerator in generator.py wraps the TRDG generators
  - use it with the same parameters as the TRDG generators, plus an extra meta-parameter, source
  - source should have a value among source_options = ("string", "random", "dict", "wiki", "file")
demo.py
- testing scripts for development
- refer to this when you want to call kotdg from code
out/
- default image saving directory
- ignored by git by default
Based on TextRecognitionDataGenerator
Some files are from:
딥러닝을 활용한 한글문서 OCR 연구 (A Study on Korean Document OCR Using Deep Learning),
Jeongeun So
Note that I do not hold the rights to most of the resources in this repository.