Corpora for ASR in the 5 Most Spoken Languages in the World

According to the "Anuario 2013" of the "Instituto Cervantes"1 and the "Atlas de la lengua española en el mundo"2, the five most spoken languages in the world are Mandarin Chinese, English, Spanish, Hindi, and Arabic.

This section presents comparison tables of several corpora in these five languages, drawn from the catalogs of the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA).


LDC Tables                       ELRA Tables
Mandarin-Chinese : LDC Corpus    Mandarin-Chinese : ELRA Corpus
English : LDC Corpus             English : ELRA Corpus
Spanish : LDC Corpus             Spanish : ELRA Corpus
Hindi : LDC Corpus               Hindi : ELRA Corpus
Arabic : LDC Corpus              Arabic : ELRA Corpus



1. The Instituto Cervantes (http://www.cervantes.es/) is a public organization founded on March 21st, 1991 by the government of Spain and sponsored by the King of Spain. It reports to the "Ministerio de Asuntos Exteriores" (Ministry of Foreign Affairs), and its main goal is to promote the teaching of the Spanish language and the culture of Spain and Hispanic America all over the world.

2. http://cvc.cervantes.es/lengua/anuario/anuario_13/