This repository contains source code for learning audio embeddings via a Speech2Vec CPC model.
See the following slides for more info:
First, clone this repo, and run:
```bash
git submodule init
git submodule update
```
You can set up an environment and install prerequisites as follows:
```bash
sudo apt-get install libsndfile1  # required for librosa
conda create -n myenv python=3.7
conda activate myenv
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
```
The dataset generation script uses the LibriSpeech text transcripts to generate the Morse code datasets (the LibriSpeech audio files themselves are not used). You can download and extract the datasets as follows:
```bash
mkdir data; cd data
wget https://www.openslr.org/resources/12/dev-clean.tar.gz
tar zxvf dev-clean.tar.gz
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
tar zxvf train-clean-100.tar.gz
wget https://www.openslr.org/resources/12/train-clean-360.tar.gz
tar zxvf train-clean-360.tar.gz
```
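After extraction, the transcripts sit in the standard LibriSpeech layout (`data/LibriSpeech/<dataset>/<speaker>/<chapter>/<speaker>-<chapter>.trans.txt`). Since only the text is used, a quick way to sanity-check the download is to print a few transcript lines. This is just an illustrative snippet, not part of the repository:

```python
from pathlib import Path

# Each .trans.txt line has the form "<utterance-id> <TRANSCRIPT>".
# Only these transcript files matter for Morse generation; the .flac audio is ignored.
for trans in sorted(Path("data/LibriSpeech/dev-clean").rglob("*.trans.txt"))[:3]:
    with open(trans) as f:
        utt_id, text = f.readline().strip().split(" ", 1)
        print(utt_id, text)
```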
The next step is to generate the Morse code datasets based on the LibriSpeech data. You can use the `morse.py` script to do this:
python unsup_st/morse.py --src data/LibriSpeech --dest data/LibriMorse --dataset dev-clean
python unsup_st/morse.py --src data/LibriSpeech --dest data/LibriMorse --dataset train-clean-100
python unsup_st/morse.py --src data/LibriSpeech --dest data/LibriMorse --dataset train-clean-360
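If you are curious how a transcript turns into Morse-code audio, the sketch below shows the general idea. The tone frequency, timing unit, sample rate, and the abbreviated code table are hypothetical choices for illustration; the actual `morse.py` may use different parameters and a different output format.

```python
import numpy as np

# Abbreviated Morse table for illustration only.
MORSE = {"A": ".-", "B": "-...", "E": ".", "S": "...", "T": "-"}

def morse_audio(text, unit=0.06, freq=700.0, sr=16000):
    """Render text as a Morse waveform: dot = 1 unit, dash = 3 units."""
    tone = lambda dur: np.sin(2 * np.pi * freq * np.arange(int(dur * sr)) / sr)
    gap = lambda dur: np.zeros(int(dur * sr))
    pieces = []
    for ch in text.upper():
        if ch == " ":
            # Extend the trailing 3-unit character gap to a 7-unit word gap.
            pieces.append(gap(4 * unit))
            continue
        for symbol in MORSE.get(ch, ""):
            pieces.append(tone(unit if symbol == "." else 3 * unit))
            pieces.append(gap(unit))       # 1-unit gap between symbols
        pieces.append(gap(2 * unit))       # total 3-unit gap between characters
    return np.concatenate(pieces) if pieces else gap(unit)

waveform = morse_audio("TEST SET")
```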
The training script assumes that the input audio has been converted to feature representations (MFCCs). You can use `cache_dataset.py` to do this:
```bash
python unsup_st/cache_dataset.py --src data/LibriMorse --dest data/LibriMorse.cache --dataset dev-clean
python unsup_st/cache_dataset.py --src data/LibriMorse --dest data/LibriMorse.cache --dataset train-clean-100
python unsup_st/cache_dataset.py --src data/LibriMorse --dest data/LibriMorse.cache --dataset train-clean-360
```
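For reference, MFCC extraction of a single file looks roughly like the snippet below, using librosa (listed as a prerequisite above). The file path and the number of coefficients are hypothetical; `cache_dataset.py` may use different settings and a different on-disk cache format.

```python
import librosa

# Hypothetical path to one generated Morse utterance.
waveform, sr = librosa.load("data/LibriMorse/dev-clean/some_utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)  # shape: (n_mfcc, frames)
print(mfcc.shape)
```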
Now you are ready to train the audio representation model.
To train an RNN-based CPC Speech2Vec model, run:
```bash
python unsup_st/speech2vec_cpc.py \
    --model rnn \
    --datadir data/LibriMorse.cache \
    --train-dataset data/LibriMorse/train-clean-100/metadata.jsonl,data/LibriMorse/train-clean-360/metadata.jsonl \
    --valid-dataset data/LibriMorse/dev-clean/metadata.jsonl \
    --outdir data/embed \
    --hidden-size 128 \
    --lr 1e-3 \
    --batch-size 450 \
    --additive-margin 0.2
```
To train a Transformer-based CPC Speech2Vec model, run:
```bash
python unsup_st/speech2vec_cpc.py \
    --model transformer \
    --datadir data/LibriMorse.cache \
    --train-dataset data/LibriMorse/train-clean-100/metadata.jsonl,data/LibriMorse/train-clean-360/metadata.jsonl \
    --valid-dataset data/LibriMorse/dev-clean/metadata.jsonl \
    --outdir data/embed \
    --hidden-size 128 \
    --layers 2 \
    --lr 1e-3 \
    --batch-size 450 \
    --additive-margin 0.2
```
Here are the command line options:

- `--model`: either `rnn` (RNN-based) or `transformer` (Transformer-based)
- `--datadir`: directory containing audio features (output of `cache_dataset.py` above)
- `--train-dataset`: comma-delimited list of `metadata.jsonl` files generated by `morse.py`
- `--valid-dataset`: this is read but not used for training
- `--outdir`: directory for writing the learned embedding files
- `--hidden-size`: size of the hidden states (= size of the final embedding vectors)
- `--layers`: number of Transformer layers (this does not apply to the RNN-based model, which always has one layer)
- `--lr`: learning rate
- `--batch-size`: batch size
- `--additive-margin`: additive margin for contrastive learning; see Yang et al. 2019 for more info (a sketch of such a loss follows this list)
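For intuition, here is a minimal PyTorch sketch of an additive-margin, in-batch softmax contrastive loss in the spirit of Yang et al. 2019. Treating the batch diagonal as the positive pairs and all other rows as negatives, and using cosine similarity, are assumptions about how `speech2vec_cpc.py` applies `--additive-margin`; the actual implementation may differ.

```python
import torch
import torch.nn.functional as F

def additive_margin_loss(queries, targets, margin=0.2):
    """queries, targets: (batch, dim) embeddings of positive pairs."""
    queries = F.normalize(queries, dim=-1)
    targets = F.normalize(targets, dim=-1)
    scores = queries @ targets.t()                                   # (batch, batch) cosine similarities
    scores = scores - margin * torch.eye(scores.size(0), device=scores.device)  # penalize positives only
    labels = torch.arange(scores.size(0), device=scores.device)      # positive pair sits on the diagonal
    return F.cross_entropy(scores, labels)

# Example with random embeddings; margin matches the --additive-margin flag above.
loss = additive_margin_loss(torch.randn(8, 128), torch.randn(8, 128), margin=0.2)
```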
This training script will produce `epochxxx.vec` files (where `xxx` is the epoch number) under the directory specified by `--outdir`.
Finally, you can use `plot_embeddings.py` to visualize the learned embeddings:
```bash
python unsup_st/plot_embeddings.py \
    --input-file data/embed/epochxxx.vec \
    --train-dataset data/LibriMorse/train-clean-100/metadata.jsonl,data/LibriMorse/train-clean-360/metadata.jsonl \
    --output-file embed.html
```
You can open `embed.html` in your browser and interact with the visualization!
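As a rough idea of how such an interactive HTML plot can be produced, the sketch below projects embeddings to 2D and writes an HTML file. The assumptions that `epochxxx.vec` is in word2vec text format and that gensim, scikit-learn, and plotly are available (they may not be in `requirements.txt`) are purely illustrative, and `plot_embeddings.py` may work differently; the epoch number in the path is hypothetical.

```python
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE
import plotly.express as px

# Load embeddings (assumed word2vec text format) and project them to 2D with t-SNE.
vectors = KeyedVectors.load_word2vec_format("data/embed/epoch010.vec")  # hypothetical epoch
points = TSNE(n_components=2, random_state=0).fit_transform(vectors.vectors)
words = list(vectors.index_to_key)

# Interactive scatter plot: hovering over a point shows its word.
fig = px.scatter(x=points[:, 0], y=points[:, 1], hover_name=words)
fig.write_html("embed.html")  # open this file in a browser
```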