Skip to content

SinSpeech-Development/Refined-OpenSLR-52-Corpus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Refined OpenSLR-52 Corpus

This repository contains a refined set of transcriptions of a speech corpus for Sinhala.

Our goal is to provide a clean speech corpus for Sinhala, which is readily-usable with ASR toolkits.

We have used an open-source Sinhala speech corpus titled "Large Sinhala ASR training data set" published by Google. The complete dataset is available here.

By going through the transcriptions, we have identified two kinds of issues:

  1. Issues related to textual characters
  2. Issues related to linguistics

We have addressed the above issues by following a systematic approach of filtration and correction. The tasks done for each kind of issue are listed below.

Issues related to textual characters

  • Removing punctuation marks (e.g. . , / " " : -)
  • Excluding utterances which contain English characters
  • Replacing numerical characters with their textual forms
  • Removing unnecessarily applied non-printable characters (e.g. ZWJ, NZWJ, ZWSP)

Issues related to linguistics

  • Applying spelling corrections on misspelled words
  • Applying a consistent way of spacing between words, complying with Sinhala grammar

When applying the refinements, certain transcriptions were excluded because, in each such utterance, the speaker has spoken words incorrectly as those words have been misspelled in the original transcription (i.e. prompt). The statistics of the corpus after applying refinements and corresponding transcription files are listed in the below table.

Version identifier Description Total utterances Unique utterances Unique words
V0 Original, unmodified 185,293 102,576 69,581
V1 Corrected textual characters 178,409 98,435 63,376
V2 Applied lingustic corrections on V1 178,096 98,127 57,029

In addition to the above refinements, we have also tagged each speaker with his/her gender, as that information is required by ASR toolkits like Kaldi. The speaker to gender mapping is contained in this file.

We have developed Python scripts to automate most of the steps of the refinement process. Those scripts can be accessed in this repository.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published