#4. Africa as the AI architect
Africa is not the dataset. It’s the architect.
The Great Mosque of Djenné in Mali, depicting Africa as the architect of AI.
David Ifeoluwa Adelani, a researcher at McGill University, Mila – Quebec AI Institute, and Canada CIFAR AI Chair, is confronting one of the most foundational, and most overlooked, challenges in building inclusive AI: language.
You can’t build effective Natural Language Processing (NLP) systems without data.
And you certainly can’t build inclusive ones without the right kind of data.
That’s why Adelani’s work focuses on the people-first creation of high-quality, human-annotated datasets across more than 20 African languages.
Datasets like MasakhaNER and MasakhaPOS support languages such as Hausa, Igbo, Swahili, Yoruba, Amharic, Twi, Wolof, and Zulu. His team’s efforts also include AfroBench, an open benchmark suite co-created by African researchers that spans 64 languages and rigorously tests how large language models (LLMs) handle tasks like translation, classification, and question answering.
These aren’t scraped web pages or auto-transcribed audio clips. They’re carefully curated, verified by native speakers, and annotated for meaning, tone, and nuance.
Human annotation ensures that people, not just machines, define what’s linguistically correct, culturally accurate, and socially respectful. It’s slower work, but it preserves the dignity and complexity of language.
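To make this concrete, here is a minimal sketch of what working with one of these human-annotated datasets can look like in code. It assumes the MasakhaNER data is published on the Hugging Face Hub under an id like "masakhaner" with per-language configurations (e.g. "yor" for Yoruba); the exact id and configuration names are assumptions to verify.

```python
from datasets import load_dataset

# Load the Yoruba portion of MasakhaNER (dataset id and config name are assumptions).
masakhaner = load_dataset("masakhaner", "yor")

# Each record pairs tokens with named-entity tags assigned and checked by native speakers.
example = masakhaner["train"][0]
label_names = masakhaner["train"].features["ner_tags"].feature.names

# Print each token next to its human-assigned label (e.g. B-PER, B-LOC, O).
for token, tag_id in zip(example["tokens"], example["ner_tags"]):
    print(f"{token}\t{label_names[tag_id]}")
```

Printing tokens beside their tags shows, at the data level, what “annotated for meaning” actually means.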
AfroBench doesn’t just reveal what’s broken - it gives us the tools to fix it.
In a recent study, Adelani and collaborators tested four major LLMs (mT0, Aya, LLaMA 2, and GPT-4) across 60 African languages and six NLP tasks (topic classification, sentiment classification, machine translation, summarisation, question answering, and named entity recognition). The results were clear: even the most advanced models perform significantly worse on African languages than on English, especially on generative tasks like summarisation and translation. This benchmark data transforms the vague sense that “AI isn’t working for us” into measurable, actionable evidence.
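To show how that kind of evidence gets quantified, here is a minimal, hypothetical sketch of the sort of scoring a benchmark run relies on: comparing a model’s translations against human references with a corpus-level metric (chrF, via the sacrebleu library). The sentences below are illustrative placeholders, not data from AfroBench or the study above.

```python
import sacrebleu

# Hypothetical model outputs for an English -> Yoruba translation task.
system_outputs = [
    "Oju ojo dara loni.",
    "Mo fe ra iwe titun.",
]

# One stream of human reference translations, in the same order as the outputs.
references = [[
    "Ojú ọjọ́ dára lónìí.",
    "Mo fẹ́ ra ìwé tuntun.",
]]

# Corpus-level chrF reduces "how well does the model handle this language?"
# to a single, comparable score per language and task.
chrf = sacrebleu.corpus_chrf(system_outputs, references)
print(f"chrF: {chrf.score:.1f}")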
Adelani’s work provides more than infrastructure; it sets a new global standard for language inclusion.
It’s a blueprint for how AI can serve people in their own words, not just the world’s dominant tongues.
Not just included. Embedded.
This body of work hasn’t stayed in research papers. It has shaped the real-world behaviour of major tech platforms. Meta’s “No Language Left Behind” and Google Translate have integrated Masakhane’s datasets and benchmarks to improve their models for African languages.
Meanwhile, local startups are using these open tools to build health chatbots, civic engagement apps, and educational platforms that understand the languages spoken by their users.
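As one illustration of what building on these open resources can look like, here is a minimal sketch that translates English into Yoruba with Meta’s publicly released NLLB-200 checkpoint through the Hugging Face transformers pipeline. The model id ("facebook/nllb-200-distilled-600M") and the FLORES-200 language codes ("eng_Latn", "yor_Latn") should be checked against the current Hub listing before use.

```python
from transformers import pipeline

# Translation pipeline backed by NLLB-200; language codes follow FLORES-200.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",  # source: English
    tgt_lang="yor_Latn",  # target: Yoruba
)

result = translator("Wash your hands before eating.", max_length=64)
print(result[0]["translation_text"])
```

The same pattern, with a different target code, is what lets a small team ship a health chatbot or civic app in the language its users actually speak.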
Why Inclusion Won't Scale Without Policy
Despite these breakthroughs, Adelani is clear-eyed about the obstacles: underfunding, limited compute power, and low political prioritisation. These aren’t technical gaps—they’re structural ones. Africa currently holds less than 1% of global AI computing power and would require hundreds more data centres to train and deploy competitive models at scale.
Some players are starting to respond. Cassava Technologies, for instance, isn’t waiting for global infrastructure to trickle down. It’s building its own.
Cassava Technologies is building Africa’s first AI factory.
By launching Africa’s first “AI factory,” Cassava is turning compute into opportunity:
GPU-as-a-Service makes high-performance AI tools available to startups, researchers, and public sector innovators.
Partnerships with Zindi and SAAIA plug local talent directly into the new compute layer.
Pan-African fibre networks ensure that infrastructure is both fast and reachable.
For the first time, thousands of African engineers can train large language models without relying on Silicon Valley’s cloud - preserving data sovereignty, accelerating time to market, and building for African realities. As Adelani reminds us, inclusion isn’t just about language or datasets. It’s about who holds the power to compute. Cassava is making sure that power isn’t just imported; it’s built right here.
Linguistic Infrastructure Is the Missing Layer in Global AI
Language equity in AI won’t be solved by researchers alone. It demands that governments treat linguistic data as critical infrastructure, funders rethink what “inclusive AI” truly means, and industry expand its imagination beyond dominant-language markets.
But researchers like Adelani are showing what’s possible when that foundation is laid. As detailed in peer-reviewed work (Nekoto et al., 2020; Adelani et al., 2022), this isn’t just African innovation. It’s a blueprint for building multilingual AI systems globally, from the ground up.
Adelani’s work enables AI to perform complex tasks - like translation, named entity recognition, and question answering - in languages that were previously invisible to machines. This makes search, learning tools, and public services more accessible to millions.
This is data equity in action.