Link | AfroLID Tool Demo

Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world’s 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multidomain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves an F1-score of 95.89. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding that it outperforms them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically motivated error analysis that allow us to showcase both AfroLID’s powerful capabilities and its limitations.

This collaborative platform was established to enable the community of the International Decade of Indigenous Languages (IDIL 2022–2032) to share events, activities, and resources. The content published on the platform is the responsibility of registered users and does not commit the Secretariat of the Decade (UNESCO) and/or the Members of the Global Task Force for Making a Decade of Action for Indigenous Languages. Please note that the platform has been inactive since February 2025 and no longer accepts new uploads. While work is underway to provide an updated solution, the Secretariat of IDIL 2022–2032 remains available for any inquiries at: indigenous.languages@unesco.org.