The YouTube Text-To-Speech dataset is comprised of waveform audio extracted from YouTube videos alongside their English transcriptions

ryanrudes/YTTTS

YTTTS

The YouTube Text-To-Speech (YTTTS) dataset consists of waveform audio extracted from YouTube videos, paired with their English transcriptions.

videos.txt is a text file that consists of concatenated YouTube video IDs. YouTube video URLs are in the format https://www.youtube.com/watch?v=<video-id>, for example:

https://www.youtube.com/watch?v=BRRolKTlF6Q

A YouTube video ID is always 11 characters long, so to read the video IDs from the example file provided, you simply read its contents in 11-character chunks until the file is exhausted:

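with open('videos.txt', 'r') as f:
  while True:
    # Each ID is exactly 11 characters long
    video_id = f.read(11)
    # A shorter chunk means end of file (or a trailing newline), so stop
    if len(video_id) < 11:
      break
    print(video_id)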
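
As a small extension of the snippet above, here is a minimal sketch that splits a string of concatenated IDs into individual IDs and builds the watch URL for each one, using the URL format shown earlier. The helper names `split_video_ids` and `watch_url` are hypothetical and are not part of this repository. Stripping surrounding whitespace is an assumption, made in case the file ends with a newline.

```python
ID_LENGTH = 11  # length of a YouTube video ID, as stated above


def split_video_ids(blob):
    """Split a string of concatenated IDs into a list of 11-character IDs."""
    blob = blob.strip()  # assumption: ignore a trailing newline, if any
    if len(blob) % ID_LENGTH != 0:
        raise ValueError("length is not a multiple of 11; input may be corrupt")
    return [blob[i:i + ID_LENGTH] for i in range(0, len(blob), ID_LENGTH)]


def watch_url(video_id):
    """Build the watch URL for a single video ID."""
    return "https://www.youtube.com/watch?v=" + video_id
```

To use it on the provided list, read the whole file into a string first, for example `split_video_ids(open('videos.txt').read())`, and pass each resulting ID to `watch_url`.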

scrape.py scrapes YouTube video IDs and continuously appends them to the file videos.txt.
Once you are satisfied with the quantity that has been scraped (or if you simply want to use the provided list of video IDs), running main.py will iterate through the scraped videos and download both the audio and the captions for each one. It then extracts each video's subtitles, which are parsed from a .srt file, together with their corresponding audio clips, and organizes them into a tree of subdirectories within that video's data folder. Each subdirectory contains a text file with the phrase uttered in the short audio clip (subtitles.txt) and the corresponding audio in waveform (audio.wav).
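
The extraction step described above can be sketched roughly as follows. This is not the code in main.py, only an illustration under stated assumptions: the audio is already an uncompressed PCM .wav file, the captions are available as .srt text, and each clip goes into a subdirectory named after its cue index. The function names `parse_srt` and `export_clips` and the index-based directory naming are hypothetical. Only the Python standard library is used.

```python
import re
import wave
from pathlib import Path

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500"
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, phrase) tuples."""
    cues = []
    # Cues are separated by blank lines
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the spoken phrase
                phrase = " ".join(part.strip() for part in lines[i + 1:]).strip()
                if phrase:
                    cues.append((start, end, phrase))
                break
    return cues


def export_clips(wav_path, srt_text, out_dir):
    """Write one subdirectory per cue with subtitles.txt and audio.wav."""
    out_dir = Path(out_dir)
    cues = parse_srt(srt_text)
    with wave.open(str(wav_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for index, (start, end, phrase) in enumerate(cues):
            clip_dir = out_dir / str(index)  # hypothetical naming scheme
            clip_dir.mkdir(parents=True, exist_ok=True)
            (clip_dir / "subtitles.txt").write_text(phrase, encoding="utf-8")
            # Seek to the cue start and read frames up to the cue end
            first = min(int(start * rate), src.getnframes())
            src.setpos(first)
            frames = src.readframes(max(0, int(end * rate) - first))
            with wave.open(str(clip_dir / "audio.wav"), "wb") as dst:
                dst.setparams(params)  # frame count is corrected on close
                dst.writeframes(frames)
    return len(cues)
```

Calling `export_clips("full.wav", srt_text, "data/<video-id>")` would then produce `data/<video-id>/0/subtitles.txt`, `data/<video-id>/0/audio.wav`, and so on, where `full.wav` and the `data` folder are placeholder names.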

You can also try it out with the included file LastWeekTonight.txt, which contains the concatenated video IDs of every video posted on the YouTube channel of John Oliver's Last Week Tonight as of March 22, 2021.

Some Demos via Google Drive

Uses

  • Voice Cloning
  • TTS Engines
  • Speaker Embedding
  • Speaker Recognition

Download
