youtube-audio-and-transcript-extract

Split YouTube audio into segments with their corresponding transcripts, for building YouTube speech datasets.

How can I build a dataset from YouTube audio of Indian-accented speakers, segmented with the correct transcripts?

First, download the playlist of Indian speakers from YouTube as .mp3, along with the .vtt subtitle files.

A .vtt file lists each transcript line together with its start and end time. I segmented each YouTube audio file using those start-end times.
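The timing lookup described above can be sketched as follows. This is a minimal illustration, not the code in text1.py: the names `parse_vtt`, `TIMING` and `_seconds` are hypothetical, and it assumes plain WebVTT cues of the form `00:00:01.500 --> 00:00:04.000` followed by one or more text lines.

```python
import re

# Matches a cue timing line such as "00:00:01.500 --> 00:00:04.000 align:start".
# The hours field is optional in WebVTT, so "00:01.500" is accepted too.
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields to seconds as a float."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def parse_vtt(text):
    """Return a list of (start_sec, end_sec, transcript) tuples, one per cue."""
    cues = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        match = TIMING.search(lines[i])
        i += 1
        if not match:
            continue  # header, blank line, or cue identifier
        fields = match.groups()
        start, end = _seconds(*fields[:4]), _seconds(*fields[4:])
        payload = []
        # Cue text runs until the next blank line.
        while i < len(lines) and lines[i].strip():
            # Drop inline tags such as <c> or <00:00:02.000>.
            payload.append(re.sub(r"<[^>]+>", "", lines[i]).strip())
            i += 1
        cues.append((start, end, " ".join(payload)))
    return cues
```

Each returned tuple gives the start and end offsets needed to cut one audio segment, plus the text that belongs to it. Auto-generated YouTube subtitles often repeat lines across overlapping cues, so the output may need de-duplication before use.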

I then applied some preprocessing (data cleaning, conversion to 16-bit, 16 kHz, mono WAV) and used the result for DeepSpeech training.
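One possible shape for this preprocessing step is sketched below. It is an assumption-laden illustration rather than the repository's actual code: `clean_transcript` and `ffmpeg_segment_command` are hypothetical names, the cleaning rule assumes an English alphabet of lowercase letters, apostrophe and space, and the cutting step assumes that ffmpeg is installed and on the PATH.

```python
import re


def clean_transcript(text):
    """Lowercase the text and keep only a-z, apostrophes and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)  # digits and punctuation become spaces
    return re.sub(r"\s+", " ", text).strip()


def ffmpeg_segment_command(src, dst, start, end):
    """Build (but do not run) an ffmpeg command that cuts src between
    start and end (in seconds) and writes 16-bit, 16 kHz, mono WAV to dst."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ss", f"{start:.3f}",    # segment start, in seconds
        "-to", f"{end:.3f}",      # segment end, in seconds
        "-ac", "1",               # mono
        "-ar", "16000",           # 16 kHz sample rate
        "-acodec", "pcm_s16le",   # 16-bit signed PCM
        dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command separately from running it makes it easy to inspect or log each segment before any audio is written.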

youtube_news.txt

python3 youtube_download.py

python3 text1.py
