Methodology

How coverage is measured

Every region on the map gets a single 0–100% coverage score. That score combines five distinct signals so a region can't game the picture by being strong on just one — a country with many datasets but only one contributor and one language still reads as fragile.

The five signals

Each is computed independently per bucket (country, region, continent, global). A signal is treated as missing — not zero — when nothing rolls up to it, so a region isn't penalised for absence of data.

  1. Datasets

    How many distinct datasets fall under this region.

    Number of unique datasets

  2. Examples

    Total records / rows across every config of every dataset in the bucket.

    Sum of example counts across all datasets

  3. Languages

    Distinct ISO 639-3 codes seen across the region's catalog; measures breadth.

    Number of unique languages represented

  4. Contributors

    Distinct, non-anonymous people / orgs who submitted at least one dataset.

    Number of unique named contributors

  5. Recent

    Datasets added in the last 365 days. Captures momentum, not just historical inventory.

    Datasets added within the past 12 months
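
As a rough illustration of how the five signals could be derived for one bucket, here is a minimal sketch. The `Dataset` record, its field names, and the `bucket_signals` helper are hypothetical and not taken from the ATLAS codebase; the sketch only assumes that each dataset carries per-config example counts, ISO 639-3 codes, an optional contributor name, and an added-on date.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Dataset:
    # Hypothetical record shape; field names are illustrative only.
    dataset_id: str
    example_counts: list[int]   # one entry per config
    languages: list[str]        # ISO 639-3 codes
    contributor: Optional[str]  # None means anonymous
    added_on: date


def bucket_signals(datasets: list[Dataset], today: date) -> dict[str, Optional[int]]:
    """Compute the five raw signals for one bucket; None marks a missing signal."""
    unique = {d.dataset_id: d for d in datasets}.values()  # de-duplicate by id
    cutoff = today - timedelta(days=365)
    raw = {
        "datasets": len(unique),
        "examples": sum(sum(d.example_counts) for d in unique),
        "languages": len({code for d in unique for code in d.languages}),
        "contributors": len({d.contributor for d in unique if d.contributor}),
        "recent": sum(d.added_on >= cutoff for d in unique),
    }
    # Nothing rolled up to the signal: treat it as missing rather than zero.
    return {name: (value or None) for name, value in raw.items()}
```

Returning `None` instead of `0` keeps the "missing, not zero" rule explicit, so a later ranking step can skip the signal rather than place it at the bottom of the scale.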

From signals to one score

  1. Rank each signal among peers

    Within each granularity (country vs country, region vs region, continent vs continent) each non-zero signal is given a percentile rank — 0% means "smallest", 100% means "largest". Zero is treated as missing data, not as the bottom of the scale.

  2. Take the average of what's there

    The coverage score is the mean of the available percentile ranks. A region at the 80th percentile on datasets, the 60th on examples and the 70th on contributors, with the other two signals missing, scores (80 + 60 + 70) / 3 = 70%.

  3. Require at least three signals

    If fewer than three of the five signals are populated, the score is withheld entirely and the region renders in grey — "insufficient data" rather than "poor coverage". That keeps a region with a single dataset from looking either world-class or catastrophic.
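
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the production formula: the function names are hypothetical, ties are handled by counting strictly smaller peers, and a peer that is the only one with a non-zero value for a signal is given 100. The page does not specify either of those edge cases.

```python
from statistics import mean
from typing import Optional

SIGNALS = ("datasets", "examples", "languages", "contributors", "recent")
MIN_SIGNALS = 3  # fewer populated signals means the score is withheld


def percentile_ranks(values: dict[str, float]) -> dict[str, float]:
    """Rank peers with a non-zero value from 0 (smallest) to 100 (largest)."""
    present = {name: v for name, v in values.items() if v}  # zero/None = missing
    if len(present) < 2:
        # Assumption: a lone peer has nothing to be ranked against, so give it 100.
        return {name: 100.0 for name in present}
    ordered = sorted(present.values())
    return {
        name: 100.0 * sum(o < v for o in ordered) / (len(ordered) - 1)
        for name, v in present.items()
    }


def coverage_scores(peers: dict[str, dict[str, float]]) -> dict[str, Optional[float]]:
    """Return a 0-100 score per peer, or None when too few signals are populated."""
    ranks_by_signal = {
        s: percentile_ranks({p: sig.get(s, 0) for p, sig in peers.items()})
        for s in SIGNALS
    }
    scores: dict[str, Optional[float]] = {}
    for p in peers:
        ranks = [ranks_by_signal[s][p] for s in SIGNALS if p in ranks_by_signal[s]]
        scores[p] = mean(ranks) if len(ranks) >= MIN_SIGNALS else None
    return scores
```

All peers passed to `coverage_scores` should share one granularity (countries with countries, regions with regions), since ranks are only meaningful among like buckets. A `None` result corresponds to the grey "insufficient data" rendering.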

Why this combination, not a single number

Each individual signal has a known blind spot: dataset_count can be inflated by many tiny releases, example_count is skewed by audio datasets, and contributor_count rewards single prolific authors. Averaging percentile ranks across five independent signals smooths those out: a region needs to do reasonably well on several axes to score high.

Why the map and the language list disagree

On the Coverage page you see two rankings, one by region and one by language. They use different yardsticks:

Map / top regions

Composite coverage score: the 0–100% score from this page's formula, which averages the available percentile ranks of the five signals into one health number per region.

"By language, raw count" section

Just one signal: dataset count. No weighting, no smoothing. It surfaces which languages have the most attached datasets and nothing more.

When the two disagree, the disagreement is itself the signal. A language can show up high on the raw count (many small datasets) while its region scores poorly on the composite (few contributors, no recent additions). That tells you something neither chart shows alone, which is why we show both.
