This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

Evaluate Voci Parlate of Wikipedia #34

Open
Mte90 opened this issue Nov 14, 2019 · 1 comment
Labels
dataset question Further information is requested

Comments

@Mte90
Member

Mte90 commented Nov 14, 2019

There is an Italian Wikipedia project in which pages are read aloud in Italian.
The category listing all the recordings and their pages: https://it.wikipedia.org/wiki/Categoria:Voci_parlate

Some of these recordings are in the public domain, such as https://it.wikipedia.org/wiki/File:Itwiki-Barile_(unit%C3%A0_di_misura).ogg
Another problem is that the recordings were made from old versions of the pages, so we need to recover the revision that was actually read and associate it with the recording.

@Mte90 Mte90 added the dataset label Nov 22, 2019
@Mte90 Mte90 added the question Further information is requested label Nov 8, 2020
@eziolotta
Contributor

eziolotta commented Nov 29, 2020

Yes, the audio refers to the text of past revisions.
However, the link to the past revision is on the page of the latest version, next to the audio it refers to.
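
Once the revision ID has been read from that link, the revision text could be pulled through the MediaWiki Action API. A minimal sketch, assuming the ID has already been extracted by hand or by a scraper; the function names and the `User-Agent` string are made up for illustration, and the response layout assumed here is the `formatversion=2` one:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://it.wikipedia.org/w/api.php"


def revision_query_url(oldid: int) -> str:
    """Build an Action API URL asking for the wikitext of one revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": str(oldid),
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response: dict) -> str:
    """Pull the wikitext out of a parsed formatversion=2 response."""
    page = response["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_revision_wikitext(oldid: int) -> str:
    """Download the revision (network access required)."""
    request = urllib.request.Request(
        revision_query_url(oldid),
        headers={"User-Agent": "voci-parlate-sketch/0.1"},  # hypothetical agent name
    )
    with urllib.request.urlopen(request) as resp:
        return extract_wikitext(json.load(resp))
```

The result is raw wikitext, so templates and markup would still have to be stripped before it can be aligned with the audio.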

Another problem: the audio contains a header and a footer that are not present in the text.

  • Page title
  • sub-titles
    'http://it.wikipedia.org' <<< Not present
  • Text Audio
  • Long pause of silence
  • 'Questa registrazione e il testo dell'articolo sono rilasciati secondo i termini della GNU Free Documentation Licence disponibile all'indirizzo internet www.gnu.org/copyleft/fdl.html' <<< Not present

The text would need preprocessing to insert the missing parts.
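
A rough sketch of that preprocessing, assuming the article text has already been cleaned to plain text and that every recording follows the order listed above. `build_transcript` and its parameters are hypothetical names, and the spoken sub-title is passed in by the caller because its exact wording is not confirmed here:

```python
# Spoken parts that are in the audio but missing from the article text.
HEADER_URL = "http://it.wikipedia.org"
FOOTER = (
    "Questa registrazione e il testo dell'articolo sono rilasciati secondo "
    "i termini della GNU Free Documentation Licence disponibile "
    "all'indirizzo internet www.gnu.org/copyleft/fdl.html"
)


def build_transcript(title: str, subtitle: str, body: str) -> str:
    """Assemble the transcript in the order the parts are spoken."""
    parts = [title, subtitle, HEADER_URL, body, FOOTER]
    # Skip empty parts so a recording without a sub-title still works.
    return "\n".join(p.strip() for p in parts if p and p.strip())
```

The long pause before the footer carries no text, so it is left to the alignment step rather than represented in the transcript.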
