This repository contains source code for learning audio embeddings via a Speech2Vec CPC model.
See the following slides for more info:
First, clone this repo, and run:
```bash
git submodule init
git submodule update
```
You can set up an environment and install prerequisites as follows:
```bash
sudo apt-get install libsndfile1  # required for librosa
conda create -n myenv python=3.7
conda activate myenv
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
```
The dataset generation script uses the LibriSpeech text transcripts to generate the Morse code datasets (the LibriSpeech audio files themselves are not used). You can download and extract the datasets as follows:
```bash
mkdir data; cd data
wget https://www.openslr.org/resources/12/dev-clean.tar.gz
tar zxvf dev-clean.tar.gz
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
tar zxvf train-clean-100.tar.gz
wget https://www.openslr.org/resources/12/train-clean-360.tar.gz
tar zxvf train-clean-360.tar.gz
```
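After extraction, the transcripts sit in the standard LibriSpeech layout (`data/LibriSpeech/<dataset>/<speaker>/<chapter>/<speaker>-<chapter>.trans.txt`). Since only the text is used, a quick way to sanity-check the download is to print a few transcript lines. This is just an illustrative snippet, not part of the repository:

```python
from pathlib import Path

# Each .trans.txt line has the form "<utterance-id> <TRANSCRIPT>".
# Only these transcript files matter for Morse generation; the .flac audio is ignored.
for trans in sorted(Path("data/LibriSpeech/dev-clean").rglob("*.trans.txt"))[:3]:
    with open(trans) as f:
        utt_id, text = f.readline().strip().split(" ", 1)
        print(utt_id, text)
```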
The next step is to generate the Morse code datasets based on the LibriSpeech data. You can use the `morse.py` script to do this:
python unsup_st/morse.py --src data/LibriSpeech --dest data/LibriMorse --dataset dev-clean
python unsup_st/morse.py --src data/LibriSpeech --dest data/LibriMorse --dataset train-clean-100
python unsup_st/morse.py --src data/LibriSpeech --dest data/LibriMorse --dataset train-clean-360
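If you are curious how a transcript turns into Morse-code audio, the sketch below shows the general idea. The tone frequency, timing unit, sample rate, and the abbreviated code table are hypothetical choices for illustration; the actual `morse.py` may use different parameters and a different output format.

```python
import numpy as np

# Abbreviated Morse table for illustration only.
MORSE = {"A": ".-", "B": "-...", "E": ".", "S": "...", "T": "-"}

def morse_audio(text, unit=0.06, freq=700.0, sr=16000):
    """Render text as a Morse waveform: dot = 1 unit, dash = 3 units."""
    tone = lambda dur: np.sin(2 * np.pi * freq * np.arange(int(dur * sr)) / sr)
    gap = lambda dur: np.zeros(int(dur * sr))
    pieces = []
    for ch in text.upper():
        if ch == " ":
            # Extend the trailing 3-unit character gap to a 7-unit word gap.
            pieces.append(gap(4 * unit))
            continue
        for symbol in MORSE.get(ch, ""):
            pieces.append(tone(unit if symbol == "." else 3 * unit))
            pieces.append(gap(unit))       # 1-unit gap between symbols
        pieces.append(gap(2 * unit))       # total 3-unit gap between characters
    return np.concatenate(pieces) if pieces else gap(unit)

waveform = morse_audio("TEST SET")
```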
The training script assumes that the input audio has been converted to feature representations (MFCCs). You can use `cache_dataset.py` to do this:
```bash
python unsup_st/cache_dataset.py --src data/LibriMorse --dest data/LibriMorse.cache --dataset dev-clean
python unsup_st/cache_dataset.py --src data/LibriMorse --dest data/LibriMorse.cache --dataset train-clean-100
python unsup_st/cache_dataset.py --src data/LibriMorse --dest data/LibriMorse.cache --dataset train-clean-360
```
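For reference, MFCC extraction of a single file looks roughly like the snippet below, using librosa (listed as a prerequisite above). The file path and the number of coefficients are hypothetical; `cache_dataset.py` may use different settings and a different on-disk cache format.

```python
import librosa

# Hypothetical path to one generated Morse utterance.
waveform, sr = librosa.load("data/LibriMorse/dev-clean/some_utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)  # shape: (n_mfcc, frames)
print(mfcc.shape)
```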
Now you are ready to train the audio representation model.
To train an RNN-based CPC Speech2Vec model, run:
```bash
python unsup_st/speech2vec_cpc.py \
    --model rnn \
    --datadir data/LibriMorse.cache \
    --train-dataset data/LibriMorse/train-clean-100/metadata.jsonl,data/LibriMorse/train-clean-360/metadata.jsonl \
    --valid-dataset data/LibriMorse/dev-clean/metadata.jsonl \
    --outdir data/embed \
    --hidden-size 128 \
    --lr 1e-3 \
    --batch-size 450 \
    --additive-margin 0.2
```
To train a Transformer-based CPC Speech2Vec model, run:
```bash
python unsup_st/speech2vec_cpc.py \
    --model transformer \
    --datadir data/LibriMorse.cache \
    --train-dataset data/LibriMorse/train-clean-100/metadata.jsonl,data/LibriMorse/train-clean-360/metadata.jsonl \
    --valid-dataset data/LibriMorse/dev-clean/metadata.jsonl \
    --outdir data/embed \
    --hidden-size 128 \
    --layers 2 \
    --lr 1e-3 \
    --batch-size 450 \
    --additive-margin 0.2
```
Here are the command line options:

- `--model`: either `rnn` (RNN-based) or `transformer` (Transformer-based)
- `--datadir`: directory containing audio features (output of `cache_dataset.py` above)
- `--train-dataset`: comma-delimited list of `metadata.jsonl` files generated by `morse.py`
- `--valid-dataset`: this is read but not used for training
- `--outdir`: directory for writing the learned embedding files
- `--hidden-size`: size of the hidden states (= size of the final embedding vectors)
- `--layers`: number of Transformer layers (this does not apply to the RNN-based model, which always has one layer)
- `--lr`: learning rate
- `--batch-size`: batch size
- `--additive-margin`: additive margin for contrastive learning; see Yang et al. 2019 for more info (a sketch of such a loss follows this list)
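For intuition, here is a minimal PyTorch sketch of an additive-margin, in-batch softmax contrastive loss in the spirit of Yang et al. 2019. Treating the batch diagonal as the positive pairs and all other rows as negatives, and using cosine similarity, are assumptions about how `speech2vec_cpc.py` applies `--additive-margin`; the actual implementation may differ.

```python
import torch
import torch.nn.functional as F

def additive_margin_loss(queries, targets, margin=0.2):
    """queries, targets: (batch, dim) embeddings of positive pairs."""
    queries = F.normalize(queries, dim=-1)
    targets = F.normalize(targets, dim=-1)
    scores = queries @ targets.t()                                   # (batch, batch) cosine similarities
    scores = scores - margin * torch.eye(scores.size(0), device=scores.device)  # penalize positives only
    labels = torch.arange(scores.size(0), device=scores.device)      # positive pair sits on the diagonal
    return F.cross_entropy(scores, labels)

# Example with random embeddings; margin matches the --additive-margin flag above.
loss = additive_margin_loss(torch.randn(8, 128), torch.randn(8, 128), margin=0.2)
```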
This training script will produce `epochxxx.vec` files (where `xxx` is the epoch number) under the directory specified by `--outdir`.
Finally, you can use `plot_embeddings.py` to visualize the learned embeddings:
```bash
python unsup_st/plot_embeddings.py \
    --input-file data/embed/epochxxx.vec \
    --train-dataset data/LibriMorse/train-clean-100/metadata.jsonl,data/LibriMorse/train-clean-360/metadata.jsonl \
    --output-file embed.html
```
You can open `embed.html` in your browser and interact with the visualization!
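As a rough idea of how such an interactive HTML plot can be produced, the sketch below projects embeddings to 2D and writes an HTML file. The assumptions that `epochxxx.vec` is in word2vec text format and that gensim, scikit-learn, and plotly are available (they may not be in `requirements.txt`) are purely illustrative, and `plot_embeddings.py` may work differently; the epoch number in the path is hypothetical.

```python
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE
import plotly.express as px

# Load embeddings (assumed word2vec text format) and project them to 2D with t-SNE.
vectors = KeyedVectors.load_word2vec_format("data/embed/epoch010.vec")  # hypothetical epoch
points = TSNE(n_components=2, random_state=0).fit_transform(vectors.vectors)
words = list(vectors.index_to_key)

# Interactive scatter plot: hovering over a point shows its word.
fig = px.scatter(x=points[:, 0], y=points[:, 1], hover_name=words)
fig.write_html("embed.html")  # open this file in a browser
```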