MLS and MAILABS: considerations and issues ( Have you seen my apostrophe?) #124

nefastosaturo · 2021-01-21T17:29:22Z

I was checking these two datasets.

The first thing that came to mind was that they might contain the same audio samples, and yes, we can verify this:

from mailabs:

$ it_IT/by_book/male/riccardo_fasol/il_fu_mattia_pascal/metadata.csv
mattiapascal_08_pirandello_f000160|E che avventure! Una più ardita dell'altra...|E che avventure! Una più ardita dell'altra...

from MLS:

$ mls_italian/train/transcripts.txt
1595_4194_001172	e che avventure una più ardita dellaltra ecco qua per dare un altro saggio un brano di dialogo tra lui e una donna maritata
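A minimal sketch of how this overlap could be detected automatically, assuming both transcripts are already loaded as strings. The function names `comparable` and `overlaps` are hypothetical and not part of any importer. Apostrophes are dropped on both sides so that the missing-apostrophe problem does not hide a match. Note that this only detects overlapping text; it does not prove that the audio segments are identical.

```python
def comparable(text: str) -> str:
    """Reduce a transcript to a form that ignores case, punctuation and apostrophes."""
    text = text.lower().replace("\u2019", "'")
    # Drop apostrophes entirely so "dell'altra" and "dellaltra" compare equal.
    text = text.replace("'", "")
    # Turn every remaining non-alphanumeric character into a space.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


def overlaps(mailabs_text: str, mls_text: str) -> bool:
    """Return True if one transcript is contained in the other on word boundaries."""
    a = f" {comparable(mailabs_text)} "
    b = f" {comparable(mls_text)} "
    return a in b or b in a


mailabs = "E che avventure! Una più ardita dell'altra..."
mls = ("e che avventure una più ardita dellaltra ecco qua per dare un altro "
       "saggio un brano di dialogo tra lui e una donna maritata")
print(overlaps(mailabs, mls))
```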

The second thing: there is a big, annoying error, namely that the apostrophe character is missing. Why?

MAILABS metadata.csv contains: audio_id | transcription | NORMALIZED transcription

Since A LOT OF transcriptions use ’ instead of ' , the normalized version (which is the one the DeepSpeech import_mailabs script reads) ends up without apostrophes

BUT

from the example above WE CAN SEE that MLS IS MISSING apostrophes too!

So, to recap:
the ' character is sometimes missing from MLS and sometimes from MAILABS

So, what's the best strategy?

leave MLS and MAILABS as they are, with some (or a lot of) overlapping samples, and just use a NEW import_mailabs parser that replaces the ’ character
try to fix the MLS transcriptions using the raw MAILABS ones and, after checking whether MAILABS is a subset of MLS, discard MAILABS
others...
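For the first option, a rough sketch of what such a parser step might look like, assuming each metadata.csv line has exactly the three pipe-separated fields described above. `normalize_apostrophes` and `parse_metadata_line` are hypothetical names, not functions of the existing import_mailabs script. The idea is to ignore the normalized column and read the raw transcription after mapping ’ (U+2019) to a plain '.

```python
def normalize_apostrophes(text: str) -> str:
    """Replace the typographic apostrophe (U+2019) with the ASCII one."""
    return text.replace("\u2019", "'")


def parse_metadata_line(line: str) -> tuple[str, str]:
    """Split 'audio_id|transcription|normalized' and return (audio_id, fixed raw text).

    The normalized column is deliberately ignored, because it is the one
    that has lost its apostrophes.
    """
    audio_id, raw, _normalized = line.rstrip("\n").split("|", 2)
    return audio_id, normalize_apostrophes(raw)


# Illustrative line (made up for this sketch, not copied from the dataset).
sample = "clip_0001|Una più ardita dell\u2019altra...|Una più ardita dellaltra..."
print(parse_metadata_line(sample))
```

Any further cleanup (lowercasing, punctuation removal) would still have to happen afterwards, since the raw column is not normalized.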

EDIT:

added the book lists from MAILABS and MLS
MAILABS_book_list.txt
MLS_book_list.txt


eziolotta · 2021-01-29T23:01:27Z

Even if the texts of the examples are the same, the speakers may be different.
I think different speakers can be useful even if they say the same thing.

In the example you mention (Pirandello's Il fu Mattia Pascal) the speaker is the same (both clips are derived from the same LibriVox source), but they are different segments: the MLS one is longer, so they are not duplicates.

I think it's hard to find real duplicates, so maybe we could keep them?

eziolotta · 2021-01-30T13:00:14Z

To solve the apostrophe bug in M-AILABS and MLS, we would need to parse both strings (original and normalized).
I have made this fix, along with other changes, and I'll open a PR soon.
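One possible way to use both strings, shown only as a sketch and not as the fix that went into the PR: collect the words that contain an apostrophe in the original text, then put the apostrophe back wherever the normalized text has the same word without it. The name `restore_apostrophes` is hypothetical, and words ending in an apostrophe (such as po') are not handled.

```python
import re

# A word made of letters with one or more internal apostrophes, e.g. dell'altra.
_APOSTROPHE_WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)+")


def restore_apostrophes(original: str, normalized: str) -> str:
    """Re-insert apostrophes into `normalized`, using `original` as the reference."""
    original = original.replace("\u2019", "'")
    # Map each apostrophe-less spelling to the spelling found in the original.
    fixes = {}
    for token in _APOSTROPHE_WORD.findall(original):
        fixes[token.replace("'", "").lower()] = token
    fixed_words = []
    for word in normalized.split():
        core = word.strip(".,;:!?\"()")  # ignore surrounding punctuation
        fix = fixes.get(core.lower())
        fixed_words.append(word.replace(core, fix, 1) if fix else word)
    return " ".join(fixed_words)


original = "E che avventure! Una più ardita dell\u2019altra..."
normalized = "E che avventure! Una più ardita dellaltra..."
print(restore_apostrophes(original, normalized))
```

Because the mapping is built per line, an ambiguous pair such as loro / l'oro is only rewritten when the original of that same line actually contains the apostrophe form.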

eziolotta · 2021-01-30T13:09:00Z

In M-AILABS my fix works fine (we have the original text!).
In MLS we may need to reuse the raw M-AILABS data, as you say. I'll try...

eziolotta · 2021-01-30T17:35:49Z

List of MAILABS apostrophe errors:
mailabs_fixed_token.txt

nefastosaturo · 2021-02-12T16:18:46Z

So, starting from mailabs_fixed_token, I tried to detect the problematic MLS books.

Right now I have checked:

Verga, Novelle, "Vita dei campi", book id: 656
656_Verga_Novelle.zip
Pascoli, Myricae, book id: 1590
1590_pascoli.zip
Machiavelli, Il Principe, book id: 10624 <--- I was thinking of discarding this one, there are too many Latinisms

In each zip file you'll find a different set of around 50 wrong words. Some of them already have a correction, most of them don't.

There is also a file listing sentences with strange behaviour (strange chars, bigger errors such as words run together without spaces, and so on). I will check those tokens in a later step.

If you can, please choose one set or subset and fill in the correct words, that would be awesome!

The format is one pair per line, the wrong word followed by its correction, separated by a comma, e.g.:

dellanima,dell'anima
damore,d'amore
unaltro,un altro

if you think that a token could be ambiguous (e.g. loro,l'oro), please flag it with SKIP

loro,l'oro,SKIP
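A small sketch of how such a correction list could later be applied to MLS transcripts, assuming the lines follow the format above. `load_corrections` and `apply_corrections` are hypothetical helpers, not part of the importers. Entries flagged with SKIP and entries that have no correction yet are left out.

```python
def load_corrections(lines):
    """Build a {wrong: correct} dict, ignoring SKIP-flagged and incomplete entries."""
    corrections = {}
    for line in lines:
        fields = [field.strip() for field in line.strip().split(",")]
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue  # blank line or word that has no correction yet
        if len(fields) > 2 and fields[2].upper() == "SKIP":
            continue  # ambiguous token, needs a manual check
        corrections[fields[0]] = fields[1]
    return corrections


def apply_corrections(transcript: str, corrections: dict) -> str:
    """Replace whole tokens only, so substrings of longer words are untouched."""
    return " ".join(corrections.get(token, token) for token in transcript.split())


rows = ["dellanima,dell'anima", "damore,d'amore", "unaltro,un altro", "loro,l'oro,SKIP"]
table = load_corrections(rows)
print(apply_corrections("loro parlano damore e dellanima", table))
```

Matching whole tokens keeps the replacement conservative: a SKIP entry such as loro stays as it is in every sentence until someone reviews it in context.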

Sav22999 · 2021-02-12T16:35:47Z

@nefastosaturo I'll take the first one, Verga Novelle, id=656.

Sav22999 · 2021-02-12T17:29:56Z

Et voilà, I think I've done everything (I hope it's correct) 656_Verga_Novelle.zip

eziolotta · 2021-02-13T10:02:33Z

To check all the texts in MLS, the CSV generated by the importer may help.
train_full.zip

eziolotta · 2021-03-13T16:44:47Z

On M-AILABS there are other samples to exclude:

the transcription does not match the spoken words :-(
the audio is truncated before the end of the transcription

(folder mix\novelle_per_un_anno_06)
novelle06_16_pirandello_f000028
novelle06_16_pirandello_f000029
novelle06_16_pirandello_f000030
novelle06_16_pirandello_f000031
novelle06_16_pirandello_f000032
novelle06_16_pirandello_f000033
novelle06_16_pirandello_f000034
novelle06_16_pirandello_f000035
novelle06_16_pirandello_f000036
novelle06_16_pirandello_f000037
novelle06_16_pirandello_f000038
novelle06_16_pirandello_f000039
novelle06_16_pirandello_f000040
novelle06_16_pirandello_f000041

novelle06_17_pirandello_f000387

I was able to find them because 3 of them were filtered out by the importer (see the too_short audio check),
then I checked the whole novelle06_16 and novelle06_17 blocks by hand.
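To surface more candidates of this kind before checking by hand, one rough heuristic is to compare the clip duration with the transcript length. This is only a sketch and not the importer's actual too_short check: the function names and the threshold of 25 characters per second are assumptions that would need tuning on real data.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def looks_truncated(duration_s: float, transcript: str,
                    max_chars_per_second: float = 25.0) -> bool:
    """Flag clips whose transcript is implausibly long for the audio duration.

    max_chars_per_second is an assumed threshold, not a value taken from
    the importer.
    """
    if duration_s <= 0:
        return True
    return len(transcript) / duration_s > max_chars_per_second


# A 2-second clip paired with a 120-character transcript would be flagged.
print(looks_truncated(2.0, "x" * 120))
```

A clip flagged this way is only a candidate: a transcript that simply does not match the spoken words can have a perfectly plausible length, so listening to the audio is still needed.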

nefastosaturo added the bug, help wanted and dataset labels Jan 21, 2021

nefastosaturo changed the title from "MLS and MAILABS: considerations and issues ( Have you see my apostrophe?)" to "MLS and MAILABS: considerations and issues ( Have you seen my apostrophe?)" Jan 25, 2021

eziolotta mentioned this issue Feb 10, 2021

Importers: fix and changes #127

Merged
