This repository houses the crawler code for building the Anuvaad parallel corpus. The ultimate goal is to build high-quality parallel datasets across various domains (General, Judicial, Educational, Financial, Press, etc.) and various Indian languages.
The current set of crawlers is built to scrape, tokenize, and align multilingual reports/documents available from various sources:
- Press Information Bureau (http://pib.gov.in)
- Press Information Bureau Archives (http://pibarchive.nic.in)
- Wikipedia (https://www.wikipedia.org)
- Prothomalo (https://www.prothomalo.com)
- Newsonair (http://newsonair.com)
- Indianexpress (https://indianexpress.com)
- DW (https://dw.com)
- Goodreturns (https://www.goodreturns.in/)
- Jagran-Josh (https://www.jagran.com/)
- Tribune (https://tribuneindia.com)
- Times of India (https://timesofindia.indiatimes.com/)
- Zee News (https://zeenews.india.com/)
- Pranab Mukherjee (http://pranabmukherjee.nic.in/)
- Eparliament (http://eparlib.nic.in/)
- Ebalbook (https://cart.ebalbharati.in/BalBooks/ebook.aspx)
- National Institute of Open Schooling (https://nios.ac.in/)
- tntextbooks (https://www.tntextbooks.in/p/school-books.html)
- keralatextbooks (https://samagra.kite.kerala.gov.in/#/textbook/page)
The broader steps involved in all the tools can be generalized as follows:

1. Crawling: hit the required web page and download its contents in the respective languages (see the fetch sketch below).
2. Tokenization: split each scraped document into individual sentences using the tokenizer (see the splitting sketch below).
3. Alignment: pair sentences across languages that have the same meaning; this involves both model-based validation and generating a representative sample for manual review (see the alignment sketch below).
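As a rough illustration of the crawling step, the sketch below fetches a page and extracts its paragraph text. It assumes the `requests` and `beautifulsoup4` packages; the URL and tag selectors are hypothetical, not the ones the crawlers in this repository actually use (each source needs its own handling).

```python
import requests
from bs4 import BeautifulSoup

def fetch_article_text(url: str) -> str:
    """Download a page and return its visible paragraph text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect paragraph text; real crawlers use source-specific selectors.
    return "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

if __name__ == "__main__":
    # Hypothetical example URL; each supported source has its own crawler.
    print(fetch_article_text("https://www.prothomalo.com/")[:500])
```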
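The splitting (tokenization) step can be illustrated with the Indic NLP Library's sentence splitter. This repository ships its own tokenizer, so treat this as a stand-in rather than the actual implementation:

```python
# Requires: pip install indic-nlp-library
from indicnlp.tokenize import sentence_tokenize

def split_sentences(text: str, lang: str = "hi") -> list:
    """Split a scraped document into individual sentences."""
    return sentence_tokenize.sentence_split(text, lang=lang)

if __name__ == "__main__":
    doc = "यह पहला वाक्य है। यह दूसरा वाक्य है।"  # small Hindi example
    for sentence in split_sentences(doc, lang="hi"):
        print(sentence)
```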
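For the alignment step, one common model-based approach is to embed the sentences of both languages with a multilingual encoder such as LaBSE and keep pairs whose cosine similarity clears a threshold. The sketch below assumes the `sentence-transformers` package; the model choice and the `0.8` threshold are illustrative assumptions and may differ from the aligner used here:

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

def align_sentences(src_sents, tgt_sents, threshold=0.8):
    """Pair each source sentence with its best-scoring target sentence."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src_emb = model.encode(src_sents, convert_to_tensor=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True)
    scores = util.cos_sim(src_emb, tgt_emb)  # pairwise cosine similarities
    pairs = []
    for i, row in enumerate(scores):
        j = int(row.argmax())
        if float(row[j]) >= threshold:  # keep only confident pairs
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs
```

Low-scoring pairs left out by the threshold are natural candidates for the manual-review sample mentioned above.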
The parallel corpora built from the above sources are available under: anuvaad-parallel-corpus