MLS and MAILABS: considerations and issues ( Have you seen my apostrophe?) #124
Comments
Even if the texts of the examples are the same, the speakers may be different. In the example you mention (Pirandello's Mattia Pascal) the speaker is the same (both clips are derived from the same LibriVox clips), but they are different segments: the MLS one is longer, so they are not duplicates. I think it's hard to find real duplicates; could we just keep them?
To solve the apostrophe bug in M-AILABS and MLS, we would need to parse both strings (original and normalized) and compare them.
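A minimal sketch of that idea, assuming the original transcription is available next to the normalized one. The function name `restore_apostrophes` and the simple tokenization are hypothetical and not taken from any existing importer; tokens whose apostrophe placement is ambiguous are left untouched.

```python
import re

def restore_apostrophes(original: str, normalized: str) -> str:
    """Put apostrophes back into `normalized` using `original` as reference."""
    # Map the typographic apostrophe to the ASCII one and lowercase.
    reference = original.lower().replace("\u2019", "'")

    # Index original tokens by their apostrophe-free spelling.
    candidates = {}
    for token in re.findall(r"[\w']+", reference):
        candidates.setdefault(token.replace("'", ""), set()).add(token)

    restored = []
    for token in normalized.split():
        forms = candidates.get(token, set())
        # Replace only when exactly one spelling is known for this token.
        restored.append(next(iter(forms)) if len(forms) == 1 else token)
    return " ".join(restored)
```

For instance, `restore_apostrophes("L’anima dell’uomo, disse.", "lanima delluomo disse")` gives back the two elided forms, while a sentence containing both `loro` and `l'oro` is returned unchanged because the match is ambiguous.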
In M-AILABS my fix works fine (we have the original text!).
MAILABS list apostrophe error
So, starting from mailabs_fixed_token, I tried to detect the problematic MLS books. Right now I have checked: Verga, Novelle, "Vita dei campi", book id: 656.
In each zip file you'll find a different set of around 50 wrong words. Some of them already have a correction; most of them don't. There is also a file with sentences that behave strangely (strange characters, bigger errors such as words run together without spaces, and so on). I will check those tokens in a later step.
If you can, please choose one set or subset and fill in the correct word; that would be awesome! The format is the wrong token followed by the corrected one: dellanima,dell'anima
If you think a token could be ambiguous (e.g. loro,l'oro), please flag it with SKIP: loro,l'oro,SKIP
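As a rough illustration of how such correction files could be consumed, here is a small sketch. The helper names `load_corrections` and `apply_corrections` are made up for this example, and it assumes one `wrong,correct[,SKIP]` row per line; rows without a correction or flagged SKIP are ignored.

```python
import csv

def load_corrections(lines):
    """Build {wrong_token: corrected_token} from 'wrong,correct[,SKIP]' rows."""
    corrections = {}
    for row in csv.reader(lines):
        fields = [field.strip() for field in row]
        # Rows that have no correction filled in yet are skipped.
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue
        # Ambiguous tokens (e.g. loro / l'oro) are flagged and left alone.
        if len(fields) >= 3 and fields[2].upper() == "SKIP":
            continue
        corrections[fields[0]] = fields[1]
    return corrections

def apply_corrections(text, corrections):
    """Replace whole whitespace-separated tokens that have a known fix."""
    return " ".join(corrections.get(token, token) for token in text.split())
```

With the rows `dellanima,dell'anima` and `loro,l'oro,SKIP`, only `dellanima` is rewritten; `loro` stays as it is until someone checks the audio.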
@nefastosaturo I'll take the first one: Verga, Novelle, id=656.
Et voilà, I think I've done everything (I hope it's correct): 656_Verga_Novelle.zip
To check all the texts in MLS, the CSV generated by the importer may help.
On M-AILABS there are other examples to exclude:
(folder mix\novelle_per_un_anno_06) novelle06_17_pirandello_f000387
I was able to find them because 3 of them were filtered out by the importer (see the too_short audio check).
I was checking these two datasets.
The first thing that came to mind was the duplication of the same audio samples and, yes, we can confirm this:
from mailabs:
from MLS:
The second thing is a big, annoying error: the apostrophe character is missing. Why?
MAILABS metadata.csv contains:
audio_id|transcription|NORMALIZEDtranscription
Given that a LOT of transcriptions use ’ instead of ', the normalized version (which is the one used by the DeepSpeech import_mailabs script) ends up without apostrophes.
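To make the failure mode concrete, here is a toy sketch. `toy_normalize` is a stand-in written for this example (it is NOT the normalizer that produced the M-AILABS metadata), and `unify_apostrophes` is a hypothetical pre-processing step that maps typographic variants to the ASCII apostrophe before any filtering happens.

```python
import re

# Characters that are commonly used in place of the ASCII apostrophe.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u0060\u00b4"
_TO_ASCII = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})

def unify_apostrophes(text: str) -> str:
    """Map typographic apostrophe variants to the plain ASCII apostrophe."""
    return text.translate(_TO_ASCII)

def toy_normalize(text: str) -> str:
    """Toy normalizer: lowercase and keep only letters, ASCII ' and spaces."""
    return re.sub(r"[^a-zàèéìòù' ]", "", text.lower())

raw = "L\u2019anima dell\u2019uomo"
broken = toy_normalize(raw)                    # typographic ’ is dropped
fixed = toy_normalize(unify_apostrophes(raw))  # apostrophe survives
```

Any filter that only whitelists the ASCII `'` silently drops `’`, which would be consistent with the normalized column having no apostrophes.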
BUT
the MLS sample above shows that MLS IS MISSING apostrophes too!
So, a recap:
the ' character is sometimes missing from MLS and sometimes from MAILABS.
So, what's the best strategy?
EDIT:
added the book lists from MAILABS and MLS
MAILABS_book_list.txt
MLS_book_list.txt