According to the CMU Sphinx tutorial on acoustic model (AM) training, the directory tree for new projects must follow the structure below:
              my_db_dir/
                   │
    .--------------:--------------.
    │                             │
    etc/                          wav/
    ├─ my_db.dic                  ├─ spkr_1/
    ├─ my_db.phone                │  ├─ s1_file_1.wav
    ├─ my_db.lm                   │  ├─ s1_file_2.wav
    ├─ my_db.filler               │  └─ s1_file_n.wav
    ├─ my_db_train.fileids        ├─ spkr_2/
    ├─ my_db_train.transcription  │  ├─ s2_file_1.wav
    ├─ my_db_test.fileids         │  ├─ s2_file_2.wav
    └─ my_db_test.transcription   │  └─ s2_file_n.wav
                                  └─ spkr_n/
                                     ├─ sn_file_1.wav
                                     ├─ sn_file_2.wav
                                     └─ sn_file_n.wav
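As an illustration of the layout above, the skeleton can be created with a few shell commands. This is only a minimal sketch and not the actual `fb_00_create_envtree.sh`: the function name `create_envtree` and its two arguments are hypothetical, and the `spkr_X` folders are deliberately left out because they depend on the data.

```shell
#!/bin/sh
# Sketch only: NOT the actual fb_00_create_envtree.sh.
# create_envtree PROJECT_DIR DB_NAME   (hypothetical helper)
# Creates etc/ and wav/ and initialises the etc/ files as empty placeholders.
create_envtree() (
    root="$1"
    db="$2"
    mkdir -p "$root/etc" "$root/wav" || exit 1
    for f in "$db.dic" "$db.phone" "$db.lm" "$db.filler" \
             "${db}_train.fileids" "${db}_train.transcription" \
             "${db}_test.fileids" "${db}_test.transcription"; do
        # touch creates the file if missing and keeps existing content intact.
        touch "$root/etc/$f" || exit 1
    done
)
```

Calling `create_envtree my_db_dir my_db` would produce the `etc/` and `wav/` folders with eight empty files inside `etc/`.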
The following scripts cover the "Data Preparation" section of CMU Sphinx's official AM training tutorial.
fb\_00\_create\_envtree.sh
: This script creates the directory structure shown above, except for the `spkr_X` dirs inside the `wav` folder. Notice that the data-dependent files (inside the `etc` dir), although created, DO NOT have any content yet. In other words, they are only initialized as empty files. A stupid choice of ours. This script also checks for dependencies that must be installed before running fb_01 and fb_02, such as `sox` and `wget`.

fb\_01\_split\_train\_test.sh
: This script fills in the `fileids` and `transcription` files in the `etc/` dir. The data is divided into a training set and a test set, so the contents of these files depend on the data. The `wav/spkr_X` folders contain symbolic links to the actual wav-transcription base dir.

fb\_02\_define\_etclang.sh
: This script fills in the remaining files inside the `my_db_dir/etc` dir: .dic, .filler, .phone, and .lm. A dependency is our `g2p` software, which must be previously downloaded/cloned from https://gitlab.com/fb-nlp/nlp-generator.git.
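To make the train/test split more concrete, the sketch below shows one way the `fileids` and `transcription` files could be filled in while symlinking the audio. It is not the actual `fb_01_split_train_test.sh`. Several things are assumptions made only for illustration: the function name `split_train_test`, a corpus laid out as `SRC_DIR/<speaker>/<utt>.wav` with a one-line transcription in a matching `.txt` file, and a split that sends every 10th utterance to the test set. The line formats (`speaker/utt` for file ids, `<s> text </s> (utt)` for transcriptions) follow the CMU Sphinx tutorial.

```shell
#!/bin/sh
# Sketch only: NOT the actual fb_01_split_train_test.sh.
# split_train_test SRC_DIR PROJECT_DIR DB_NAME   (hypothetical helper)
# Assumed corpus layout: SRC_DIR/<speaker>/<utt>.wav, each with a one-line
# transcription in SRC_DIR/<speaker>/<utt>.txt.
split_train_test() (
    src="$(cd "$1" && pwd)" || exit 1   # absolute path so the symlinks resolve
    root="$2"
    db="$3"
    mkdir -p "$root/etc" "$root/wav" || exit 1
    for s in train test; do             # start from empty list files
        : > "$root/etc/${db}_${s}.fileids"
        : > "$root/etc/${db}_${s}.transcription"
    done
    n=0
    for wav in "$src"/*/*.wav; do
        [ -e "$wav" ] || continue       # the glob matched nothing
        spkr="$(basename "$(dirname "$wav")")"
        utt="$(basename "$wav" .wav)"
        n=$((n + 1))
        # Assumption: every 10th utterance goes to the test set.
        if [ $((n % 10)) -eq 0 ]; then s=test; else s=train; fi
        mkdir -p "$root/wav/$spkr"
        ln -sf "$wav" "$root/wav/$spkr/$utt.wav"
        # File ids are paths relative to wav/, without the extension.
        printf '%s\n' "$spkr/$utt" >> "$root/etc/${db}_${s}.fileids"
        text="$(head -n 1 "${wav%.wav}.txt")"
        printf '%s\n' "<s> $text </s> ($utt)" >> "$root/etc/${db}_${s}.transcription"
    done
)
```

A call such as `split_train_test /path/to/corpus my_db_dir my_db` would then leave the file ids and transcriptions of each set in the same order, line by line.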
The next steps will then be (please refer to the section "Setting up the training scripts" for details):
- run `sphinxtrain -t my_db_dir setup` inside your project dir
- edit the recently created `etc/sphinx_train.cfg` file
- run `sphinxtrain run` to begin training the AM.
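Before running the steps above, it may help to check that the prepared files are consistent, since a mismatch between file ids, transcriptions, and audio is easier to fix before training starts. The sketch below is not part of the scripts described here: `check_set` is a hypothetical helper that assumes the `etc/` and `wav/` layout shown earlier.

```shell
#!/bin/sh
# Sketch only: a hypothetical pre-flight check, not part of the fb_* scripts.
# check_set PROJECT_DIR DB_NAME SET   (SET is "train" or "test")
check_set() (
    root="$1"
    db="$2"
    s="$3"
    ids="$root/etc/${db}_${s}.fileids"
    trans="$root/etc/${db}_${s}.transcription"
    if [ ! -f "$ids" ] || [ ! -f "$trans" ]; then
        echo "$s: missing fileids or transcription file" >&2
        exit 1
    fi
    # Both files must list the same number of utterances.
    n_ids=$(( $(wc -l < "$ids") ))
    n_trans=$(( $(wc -l < "$trans") ))
    if [ "$n_ids" -ne "$n_trans" ]; then
        echo "$s: $n_ids file ids but $n_trans transcriptions" >&2
        exit 1
    fi
    # Every file id must point to an existing wav (dangling symlinks fail too).
    while IFS= read -r id; do
        if [ ! -e "$root/wav/$id.wav" ]; then
            echo "$s: missing audio for $id" >&2
            exit 1
        fi
    done < "$ids"
)
```

For example, `check_set my_db_dir my_db train && check_set my_db_dir my_db test` returns success only when both sets pass, after which the `sphinxtrain` steps can be run.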
Grupo FalaBrasil (2020) - https://ufpafalabrasil.gitlab.io/
Universidade Federal do Pará (UFPA) - https://portal.ufpa.br/
Cassio Batista - https://cassota.gitlab.io/