This master's thesis addresses the challenging task of text-to-speech (TTS) voice cloning. It will involve developing a novel approach that leverages recent advances in deep neural networks. While current methods have made progress, there is still room for improvement in synthesizing natural-sounding voices that can be personalized to match target speakers. This work will explore designing a new model architecture and training methodology to better capture speaker identity and improve prosody in the generated speech. The thesis will also investigate how to integrate specialized markup languages (e.g., SSML) into current architectures to enable fine-grained control over the generated speech (e.g., emphasis and pauses).
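To make the kind of fine-grained control mentioned above concrete, the following is a minimal sketch of how an SSML input with emphasis and a pause could be assembled. The `speak`, `emphasis`, and `break` elements come from the W3C SSML specification; the helper name `build_ssml` and its parameters are hypothetical and are not part of any existing TTS system or of the thesis itself.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, emphasized: str, after: str, pause_ms: int = 500) -> str:
    """Return an SSML string with one emphasized span followed by a pause.

    Hypothetical helper for illustration only.
    """
    speak = ET.Element("speak")
    speak.text = before

    # <emphasis level="strong"> marks the span to be stressed.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = emphasized

    # <break time="..."/> inserts a pause; the text after it is stored as its tail.
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = after

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("This is ", "really", " important.")
print(ssml)
```

How a neural TTS model would consume such markup (for example, as extra conditioning tokens) is exactly the open question the thesis would investigate; the sketch only shows the input side.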
A comprehensive literature review will identify open challenges and help select a promising research direction. The model will be rigorously evaluated on standard datasets using objective and, if possible, subjective metrics. Its capabilities, such as how closely it can match new voices, will be systematically compared to state-of-the-art baselines.
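As an illustration of what an objective measure of voice matching could look like, the sketch below computes the cosine similarity between speaker embeddings of reference and generated utterances, a commonly used proxy for speaker similarity. It assumes the embeddings have already been extracted by some speaker encoder; the function names and the toy vectors are hypothetical, and the thesis may well adopt different metrics.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_speaker_similarity(
    pairs: Sequence[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Average similarity over (reference, generated) embedding pairs."""
    scores = [cosine_similarity(ref, gen) for ref, gen in pairs]
    return sum(scores) / len(scores)


# Toy embeddings standing in for real speaker-encoder outputs (assumption).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
score = mean_speaker_similarity(pairs)
print(score)
```

Scores closer to 1 indicate that the generated voice lies closer to the target speaker in the embedding space; subjective listening tests would still be needed to confirm perceived similarity.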
Objectives:
The main objectives of this thesis are:
- O1: Perform a literature review on TTS and voice cloning.
- O2: Design a new model architecture and training methodology. It will leverage pre-trained or self-supervised speech representations and will be able to generate speech with a specific speaker identity.
- O3: Evaluate the model on standard datasets using objective and subjective metrics. Compare its capabilities to state-of-the-art baselines.
- O4: Write a thesis report and present the results.
- O5: (Optional) Implement a working demo.
📚 Suggested Readings:
Tools:
Technical reports:
- TorToiSe TTS technical report by James Betker [paper access]
- SSML: A Speech Synthesis Markup Language by Taylor and Isard [paper access]
Research papers:
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model by Fujita et al. [paper access]
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale by Le et al. [paper access]
- Robust Speech Recognition via Large-Scale Weak Supervision by Radford et al. [paper access]
This research has the potential to advance both TTS and voice cloning techniques. It also aims to foster more natural and personalized speech technologies for Italian, which could benefit applications in education, accessibility, and entertainment.
📝 Notes:
- 💻 The thesis involves developing new deep learning models and requires a strong background in audio-related topics and Python programming.
- 🗺️ The research will primarily focus on Romance and Germanic languages (such as Italian and English).
- 💼 This thesis is in collaboration with AudioBoost, a startup focusing on speech technologies.
- 🤖 The lab has recently acquired two Ameca robots. Depending on the progress of the thesis, it may be possible to create a working demo that gives the robots a more natural voice.
📞 Contacts: