How coverage is measured

Methodology

Every region on the map gets a single 0–100% coverage score. That score combines five distinct signals so a region can't game the picture by being strong on just one — a country with many datasets but only one contributor and one language still reads as fragile.

The five signals

Each is computed independently per bucket (country, region, continent, global). A signal is treated as missing — not zero — when nothing rolls up to it, so a region isn't penalised for absence of data.

01 Datasets: the number of unique datasets that fall under this region.
02 Examples: the total records (rows) summed across every config of every dataset in the bucket.
03 Languages: the number of distinct ISO 639-3 codes seen across the region's catalog; measures breadth.
04 Contributors: the number of distinct, non-anonymous people or orgs who submitted at least one dataset.
05 Recent: the number of datasets added in the last 365 days; captures momentum, not just historical inventory.
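
As a rough illustration of the five definitions above, the sketch below derives the raw signals for one bucket from a list of dataset records. The `DatasetRecord` shape, the `bucket_signals` name, de-duplication by `dataset_id`, and the inclusive 365-day cutoff are assumptions made for this example, not the site's actual schema or code. A signal that comes out as zero is returned as `None`, matching the "missing, not zero" rule.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical record shape; field names are illustrative only."""
    dataset_id: str
    example_count: int            # rows summed over every config
    languages: FrozenSet[str]     # ISO 639-3 codes
    contributor: Optional[str]    # None for an anonymous submission
    added_on: date


def bucket_signals(records: Iterable[DatasetRecord], today: date) -> Dict[str, Optional[int]]:
    """Return the five raw signals for one bucket; None marks a missing signal."""
    cutoff = today - timedelta(days=365)  # assumption: the cutoff day itself counts as recent
    unique = list({r.dataset_id: r for r in records}.values())  # count each dataset once
    raw = {
        "datasets": len(unique),
        "examples": sum(r.example_count for r in unique),
        "languages": len(set().union(*(r.languages for r in unique))),
        "contributors": len({r.contributor for r in unique if r.contributor}),
        "recent": sum(1 for r in unique if r.added_on >= cutoff),
    }
    # Nothing rolled up to a signal: report it as missing rather than zero.
    return {name: (value if value > 0 else None) for name, value in raw.items()}
```

A bucket whose only dataset is anonymous and several years old would therefore report datasets, examples and languages, but leave contributors and recent as missing.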

From signals to one score

Rank each signal among peers
Within each granularity (country vs country, region vs region, continent vs continent) each non-zero signal is given a percentile rank — 0% means "smallest", 100% means "largest". Zero is treated as missing data, not as the bottom of the scale.
Take the average of what's there
The coverage score is the mean of the available percentile ranks. A region at the 80th percentile on datasets, the 60th on examples and the 70th on contributors, with the other two signals missing, ends up at (80 + 60 + 70) / 3 = 70%.
Require at least three signals
If fewer than three of the five signals are populated, the score is withheld entirely and the region renders in grey — "insufficient data" rather than "poor coverage". That keeps a region with a single dataset from looking either world-class or catastrophic.
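
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the site's implementation: the function names are made up for the example, ties share a mid-rank, and a signal populated for only one peer is placed at 50% because the text does not say how either case is handled.

```python
from statistics import mean
from typing import Dict, Optional

SIGNALS = ("datasets", "examples", "languages", "contributors", "recent")
MIN_SIGNALS = 3  # fewer populated signals than this: the score is withheld


def percentile_ranks(values: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Rank one signal among peers: 0 = smallest, 100 = largest.

    Zero or None is treated as missing, so that peer gets no rank at all.
    """
    present = {peer: v for peer, v in values.items() if v}
    n = len(present)
    ranks = {}
    for peer, v in present.items():
        if n == 1:
            ranks[peer] = 50.0  # assumption: a lone peer sits mid-scale
            continue
        smaller = sum(1 for other in present.values() if other < v)
        tied = sum(1 for other in present.values() if other == v) - 1
        ranks[peer] = 100.0 * (smaller + 0.5 * tied) / (n - 1)  # assumption: mid-rank for ties
    return ranks


def coverage_scores(peers: Dict[str, Dict[str, Optional[float]]]) -> Dict[str, Optional[float]]:
    """Score peers of one granularity (for example, all countries) against each other."""
    per_signal = {
        s: percentile_ranks({p: sig.get(s) for p, sig in peers.items()})
        for s in SIGNALS
    }
    scores = {}
    for p in peers:
        available = [per_signal[s][p] for s in SIGNALS if p in per_signal[s]]
        # None stands for the grey "insufficient data" state.
        scores[p] = mean(available) if len(available) >= MIN_SIGNALS else None
    return scores
```

Because missing signals are dropped before averaging, a peer with only three populated signals is scored on those three alone, and a peer with one or two gets `None` instead of a misleading number.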

Why this combination, not a single number

Each individual signal has a known blind spot: dataset_count is gameable by counting tiny releases, example_count is skewed by audio datasets, contributor_count rewards single prolific authors. Averaging percentile ranks across five independent signals smooths those out — a region needs to do reasonably well on several axes to score high.

Why the map and the language list disagree

On the Coverage page you see two rankings, one by region and one by language. They use different yardsticks:

Map / top regions

Composite coverage score: the 0–100% score from this page's formula (a mean of percentile ranks), combining all five signals into one health number per region.

"By language, raw count" section

Just one signal — dataset count. No weighting, no smoothing. Surfaces which languages have the most attached datasets, full stop.

When the two disagree, the disagreement IS the signal. A language can show up high on the raw count (many small datasets) while its region scores poorly on the composite (few contributors, no recent additions). That tells you something different from either chart alone, which is why we show both.
