
A comparison of E2E and Cascading Speech-to-Speech translation systems (ES-EN)

In this project, we evaluate end-to-end (E2E) and cascading speech-to-speech translation (S2ST) systems on the CVSS-C Spanish-English dataset. We measure differences in performance using a suite of evaluation metrics, such as COMET, METEOR, and BLASER 2.0, that go beyond n-gram overlap metrics like BLEU. For the cascading baseline, we used the untuned pre-trained OWSM 3.1 S2T model from ESPnet, together with ESPnet's FastSpeech2 conformer TTS model and the FastSpeech2 conformer HiFi-GAN vocoder. For the end-to-end system, we used an untuned discrete-unit S2ST ESPnet model pre-trained on a Spanish-to-English subset of the CVSS-C dataset, with a Parallel WaveGAN HuBERT vocoder to synthesize the output speech. We then fine-tuned our cascading system on CVSS-C data (for a fairer comparison between the E2E and cascading systems) using LoRA (rank 4) for 10 epochs.
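As a rough illustration, the cascading pipeline can be sketched with ESPnet's inference wrappers. The model tags, language/task symbols, and filenames below are assumptions for illustration (the repo's scripts configure the actual checkpoints, including a multi-speaker TTS model), not the exact setup used in our experiments:

```python
# Sketch of the cascading S2ST pipeline: OWSM 3.1 S2T -> ESPnet TTS + vocoder.
# Model tags and symbols are illustrative assumptions, not the repo's exact config.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text
from espnet2.bin.tts_inference import Text2Speech

# Speech-to-text translation (Spanish speech -> English text) with OWSM 3.1.
s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",   # assumed OWSM 3.1 checkpoint tag
    lang_sym="<spa>",         # source language: Spanish
    task_sym="<st_eng>",      # task: speech translation into English
    beam_size=5,
    device="cpu",
)

# Text-to-speech (English text -> English speech) with a FastSpeech2 conformer
# model and a HiFi-GAN vocoder (single-speaker stand-ins for the repo's models).
tts = Text2Speech.from_pretrained(
    model_tag="kan-bayashi/ljspeech_conformer_fastspeech2",
    vocoder_tag="parallel_wavegan/ljspeech_hifigan.v1",
)

speech, rate = sf.read("example_es.wav")  # expects 16 kHz mono input
english_text = s2t(speech)[0][-2]         # hypothesis text without special tokens
wav = tts(english_text)["wav"]
sf.write("example_en.wav", wav.numpy(), tts.fs)
```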

This project was completed under the guidance of Prof. Shinji Watanabe at LTI, CMU. For more details, refer to the project report.

Results

Comparison of the performance of E2E and Cascading S2ST models on the CVSS-C Spanish-to-English test set. Baselines: e2e-oob is the out-of-the-box end-to-end model and casc-oob is the out-of-the-box cascading model. The casc-ft models are the cascading models fine-tuned via LoRA: casc-ft-best-val-acc (learning rate 1e-5, epoch 3); casc-ft-1-epoch (learning rate 1e-5, epoch 1); casc-ft-5-epoch (learning rate 1e-5, epoch 5); and casc-ft-low-lr (learning rate 1e-7, epoch 1).

BP is Brevity Penalty (scale 0-1); HRR is Hypothesis-to-Reference Ratio (scale 0-1); ASR-BLEU (scale 1-100); COMET (scale 0-1); METEOR (scale 0-1); and BLASER 2.0 (scale 1-5).

| Model                | ASR-BLEU | BP    | HRR   | COMET | METEOR | BLASER 2.0 |
|----------------------|----------|-------|-------|-------|--------|------------|
| e2e-oob              | 14.901   | 0.82  | 0.928 | 0.538 | 0.283  | 3.188      |
| casc-oob             | 17.692   | 0.88  | 0.975 | 0.619 | 0.338  | 3.604      |
| casc-ft-best-val-acc | 15.062   | 0.709 | 0.785 | 0.599 | 0.323  | 3.435      |
| casc-ft-1-epoch      | 14.930   | 0.705 | 0.784 | 0.593 | 0.318  | 3.386      |
| casc-ft-5-epoch      | 14.383   | 0.702 | 0.784 | 0.601 | 0.314  | 3.428      |
| casc-ft-low-lr       | 13.031   | 0.636 | 0.722 | 0.570 | 0.281  | 3.298      |

Our results show that the cascading systems still outperform their E2E counterparts: the E2E systems are unable to overcome the cascading systems' advantages, namely the large quantities of training data and the high-quality pretrained models available for text-based components. Additionally, our qualitative analysis revealed that ASR-BLEU scores do not always correlate well with human judgements, so ASR-BLEU alone is insufficient for a holistic evaluation of S2ST systems.

Please refer to the project report for more details.

Directory Structure

  • cvss-c_en_wavegan_hubert_vocoder - Contains the config file for the vocoder.
  • dev_dataset - Contains TSV files listing the source and target audio files and transcripts (CVSS-C dataset).
  • espnet_recipe_scripts - ESPnet recipes for s2st_inference and st_inference.
  • results - The CSV outputs of the E2E and cascaded models (oob, fine-tuned) with all 4 translation metrics.
  • tts_multi_speaker_model - The exp folder contains the config file for the TTS model.
  • utils:
    • macro_average_results.py - The script to macro-average the 4 translation metrics across all dev dataset samples.
    • sampling_rate_converter.py - The script to convert all clips in the input dataset to a 16 kHz sampling rate.
  • expanded_translation_metrics.py - The script to evaluate the predicted texts and return the 4 translation metrics (ASR-BLEU, COMET, METEOR, BLASER 2.0); see the metrics sketch after this list.
  • finetune_s2t.py - The script to fine-tune the S2T model on the CoVoST dataset.
  • forward_feed_cascaded_finetuned_oob.py - The script to forward-feed the cascaded S2ST model (oob/fine-tuned) on the CVSS-C dev dataset (with metrics).
  • forward_feed_e2e.py - The script to forward-feed the end-to-end S2ST model on the CVSS-C dev dataset (with metrics).
  • live_s2st_demonstration.py - The demonstration that compares the cascaded and end-to-end models on a single audio file.
  • tts_config.yaml - The config file for the Text2Speech model.
  • lora_config.yaml - The config file for the LoRA adapter (see the LoRA sketch below).
  • environment.txt - The virtual environment packages (with versions) listed explicitly.
  • report.pdf - The report containing details about the experimental design and results.
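As a rough sketch of how the text-side metrics can be computed (assuming the sacrebleu, nltk, and unbabel-comet packages; BLASER 2.0, which scores audio directly via SONAR embeddings, is omitted here), with toy data standing in for the ASR transcripts of system output:

```python
# Hedged sketch of the text-side translation metrics, not the repo's exact
# implementation. `hyps` are ASR transcripts of the synthesized output (hence
# "ASR-BLEU"), `refs` are English references, `srcs` are Spanish sources.
import sacrebleu
from nltk.translate.meteor_score import meteor_score  # needs nltk.download("wordnet")
from comet import download_model, load_from_checkpoint

hyps = ["the cat sat on the mat"]
refs = ["the cat sat on a mat"]
srcs = ["el gato se sentó en una alfombra"]

# ASR-BLEU plus brevity penalty (BP) and hypothesis-to-reference ratio (HRR).
bleu = sacrebleu.corpus_bleu(hyps, [refs])
hrr = bleu.sys_len / bleu.ref_len
print(f"ASR-BLEU={bleu.score:.3f}  BP={bleu.bp:.3f}  HRR={hrr:.3f}")

# Corpus-averaged METEOR over whitespace-tokenized text.
meteor = sum(meteor_score([r.split()], h.split()) for r, h in zip(refs, hyps)) / len(hyps)
print(f"METEOR={meteor:.3f}")

# COMET (reference-based neural metric).
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(f"COMET={comet.predict(data, batch_size=8, gpus=0).system_score:.3f}")
```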

Note: The model files (for the TTS model, vocoder, S2T model, etc.) are not included due to their size; however, all config files are in the respective directories.
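The overview above mentions rank-4 LoRA fine-tuning, configured in this repo via lora_config.yaml. As a minimal sketch of what such an adapter setup looks like, here using the Hugging Face peft library on a toy module as a stand-in (ESPnet's own adapter interface and the real attention-projection module names will differ):

```python
# Illustrative rank-4 LoRA setup with Hugging Face peft on a toy model.
# The actual adapter is configured via lora_config.yaml; target module
# names in the real S2T model are assumptions here.
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Toy stand-in for attention projections in the S2T encoder-decoder.
base_model = nn.Sequential()
base_model.add_module("linear_q", nn.Linear(256, 256))
base_model.add_module("linear_v", nn.Linear(256, 256))

lora_config = LoraConfig(
    r=4,                                      # rank 4, as in the report
    lora_alpha=16,                            # assumed scaling factor
    lora_dropout=0.05,                        # assumed dropout
    target_modules=["linear_q", "linear_v"],  # assumed target projections
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```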

Run Commands

  • Environment Setup:
    • Since this project extensively uses ESPnet recipes, refer to the official ESPnet installation instructions.
  • Fine-tune the Cascaded model on CoVoST 2 data:
    • Download the CoVoST 2 es_en dataset (edit the corresponding path variable in the script).
    • python finetune_s2t.py
  • Run inference with the models on the dev dataset and calculate all metrics:
    • Download the appropriate audio files in the CVSS-C es-en dev dataset from the CommonVoice release version 4.
    • Run python utils/sampling_rate_converter.py with the appropriate paths (a resampling sketch appears after this list).
    • End-to-end model: python forward_feed_e2e.py
    • Cascaded oob model: python forward_feed_cascaded_finetuned_oob.py --inference_mode=oob
    • Cascaded finetuned model: python forward_feed_cascaded_finetuned_oob.py --inference_mode=finetuned
  • Run the Demonstration:
    • The demo requires the finetuning step to have been completed prior.
    • Select a single file from the CVSS-C es-en dataset to run inference on and change the 'demo_sample_filename' variable.
    • Run the demo: python live_s2st_demonstration.py
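The sampling-rate conversion step referenced above can be approximated as follows; this is a minimal sketch using librosa and soundfile with placeholder paths, which may differ from what utils/sampling_rate_converter.py actually does:

```python
# Minimal 16 kHz resampling sketch with placeholder paths; the repo's
# utils/sampling_rate_converter.py may differ in details.
from pathlib import Path
import librosa
import soundfile as sf

SRC_DIR = Path("cvss_c_es_en/dev")      # placeholder: original clips
DST_DIR = Path("cvss_c_es_en/dev_16k")  # placeholder: resampled output
DST_DIR.mkdir(parents=True, exist_ok=True)

for clip in sorted(SRC_DIR.glob("*.wav")):
    audio, _ = librosa.load(clip, sr=16000, mono=True)  # resamples on load
    sf.write(DST_DIR / clip.name, audio, 16000)
```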