How composition is measured

Methodology

Composition asks ‘what kinds of data are in ATLAS’ from four angles at once: modality, tasks, domains, and multilinguality. This page explains each angle, how the percentages are computed, and why some breakdowns can sum past 100%.

The four dimensions

Each dimension is a separate breakdown of the catalog — not a slice of a single pie.

01
Modality
Shape of the data
Text, audio, image, video, or multimodal. Derived from the modalities array on dataset_metadata — a dataset with both audio and text counts for both.
02
Tasks
What the data is for
Task families derived from the dataset's task tags, rolled up to a coarse grouping (classification, generation, translation, etc.). One dataset can serve multiple task families.
03
Domains
Subject areas covered
Broad subject areas (healthcare, legal, finance, government, etc.) assigned to each dataset by the contribution pipeline. A dataset can belong to multiple domains.
04
Multilinguality
How many languages per dataset
A single bucket per dataset — monolingual, bilingual, or multilingual — based on the count of distinct language codes attached to it.
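
The bucketing rule for multilinguality can be sketched as below. This is a minimal illustration, not the ATLAS implementation: the thresholds (one distinct code is monolingual, two is bilingual, three or more is multilingual) are an assumption inferred from the bucket names, and `multilinguality_bucket` is a hypothetical helper name.

```python
from typing import Iterable


def multilinguality_bucket(language_codes: Iterable[str]) -> str:
    """Map a dataset's language codes to exactly one bucket.

    Assumed thresholds (not stated on this page):
    1 distinct code -> monolingual, 2 -> bilingual, 3 or more -> multilingual.
    """
    # Collapse duplicates so only distinct codes are counted.
    distinct = set(language_codes)
    if not distinct:
        # Assumption: a dataset with no language codes cannot be bucketed.
        raise ValueError("dataset has no language codes")
    if len(distinct) == 1:
        return "monolingual"
    if len(distinct) == 2:
        return "bilingual"
    return "multilingual"
```

Because the function returns a single value per dataset, each dataset lands in exactly one bucket, unlike the other three dimensions.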

How the percentages work

Count per segment
For each dimension, count the distinct datasets whose tags include the segment value. A dataset can be counted in multiple segments within the same dimension.
Divide by total visible datasets
The percentage is segment_count ÷ total_datasets × 100. The denominator is the same for every segment in the dimension.
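
The two steps above can be sketched in a few lines. This is a hedged illustration under stated assumptions: `segment_percentages` and the sample `catalog_modalities` list are made up for the example, and each inner list stands for the tag values of one visible dataset in a single dimension.

```python
from collections import Counter
from typing import Dict, Iterable, List


def segment_percentages(datasets: Iterable[List[str]]) -> Dict[str, float]:
    """Return the percentage of datasets tagged with each segment value.

    `datasets` holds one list of tag values per visible dataset,
    all taken from the same dimension (for example, modality).
    """
    rows = list(datasets)
    total = len(rows)  # same denominator for every segment
    if total == 0:
        return {}
    counts: Counter = Counter()
    for tags in rows:
        # set() ensures a dataset is counted once per segment,
        # even if a tag is repeated in its list.
        counts.update(set(tags))
    return {segment: 100.0 * n / total for segment, n in counts.items()}


# Made-up sample: four datasets and their modalities.
catalog_modalities = [
    ["text"],
    ["audio", "text"],
    ["image"],
    ["text", "video"],
]
shares = segment_percentages(catalog_modalities)
```

In this sample, `text` appears in three of four datasets, so its share is 75%, while `audio`, `image`, and `video` each get 25%.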

Why some breakdowns sum past 100%

Datasets can have more than one modality, task, or domain: audio AND text, classification AND generation. Each segment is its own count, not a slice of a pie, so the percentages within one of these dimensions can add up to more than 100%.

Multilinguality is the exception: every dataset is exactly one of monolingual / bilingual / multilingual, so it always sums to 100%.
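
A small check makes the difference concrete. In the sketch below, the catalog, its field names (`modalities`, `bucket`), and the helper `total_share` are all invented for illustration. It shows that a multi-valued dimension sums to 100% times the average number of tags per dataset, while a single-valued dimension sums to exactly 100%.

```python
from collections import Counter

# Made-up catalog of four datasets; field names are illustrative only.
catalog = [
    {"modalities": ["text"], "bucket": "monolingual"},
    {"modalities": ["audio", "text"], "bucket": "bilingual"},
    {"modalities": ["image"], "bucket": "monolingual"},
    {"modalities": ["text", "video"], "bucket": "multilingual"},
]
total = len(catalog)


def total_share(tag_lists):
    """Sum of all segment percentages for one dimension."""
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    return sum(100.0 * n / total for n in counts.values())


# Multi-valued dimension: a dataset may contribute to several segments.
modality_total = total_share(d["modalities"] for d in catalog)

# Single-valued dimension: every dataset contributes to exactly one segment.
bucket_total = total_share([d["bucket"]] for d in catalog)

# Average number of distinct modalities per dataset.
mean_modalities = sum(len(set(d["modalities"])) for d in catalog) / total
```

Here the six modality tags spread over four datasets give an average of 1.5 tags per dataset, so the modality percentages total 150%, while the multilinguality buckets total 100%.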
