Methodology
How composition is measured
Composition asks ‘what kinds of data are in ATLAS’ from four angles at once. This page explains each angle, how the percentages are computed, and why some breakdowns can sum past 100%.
The four dimensions
Each dimension is a separate breakdown of the catalog — not a slice of a single pie.
- 01
Modality
Shape of the data
Text, audio, image, video, or multimodal. Derived from the modalities array on dataset_metadata — a dataset with both audio and text counts for both.
- 02
Tasks
What the data is for
Task families derived from the dataset's task tags, rolled up to a coarse grouping (classification, generation, translation, etc.). One dataset can belong to multiple task families.
- 03
Domains
Subject areas covered
Broad subject areas (healthcare, legal, finance, government, etc.) assigned to each dataset by the contribution pipeline. A dataset can belong to multiple domains.
- 04
Multilinguality
How many languages per dataset
A single bucket per dataset — monolingual, bilingual, or multilingual — based on the count of distinct language codes attached to it.
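As a rough illustration of the multilinguality rule, the sketch below buckets a dataset by its count of distinct language codes. The function name, the thresholds (1, 2, 3 or more), and the handling of a dataset with no language codes are assumptions made for this example; the page only states that the bucket depends on the count.

```python
def multilinguality_bucket(language_codes):
    """Return one bucket per dataset from its language codes.

    Assumed thresholds (not confirmed by the source):
    1 code -> monolingual, 2 -> bilingual, 3 or more -> multilingual.
    """
    # Duplicates collapse in a set, so only distinct codes are counted.
    distinct = set(language_codes)
    if not distinct:
        # Assumption: a dataset with no language codes is treated as an error.
        raise ValueError("dataset has no language codes")
    if len(distinct) == 1:
        return "monolingual"
    if len(distinct) == 2:
        return "bilingual"
    return "multilingual"


print(multilinguality_bucket(["en"]))              # monolingual
print(multilinguality_bucket(["en", "fr", "en"]))  # bilingual
print(multilinguality_bucket(["en", "fr", "de"]))  # multilingual
```

Because each dataset gets exactly one bucket, this dimension behaves like a conventional pie, unlike the other three.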
How the percentages work
Count per segment
For each dimension, count the distinct datasets whose tags include the segment value. A dataset can be counted in multiple segments within the same dimension.
Divide by total visible datasets
The percentage is segment_count ÷ total_datasets × 100, where total_datasets is the number of visible datasets. The denominator is the same for every segment in the dimension.
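The two steps can be sketched as follows. This is a minimal illustration rather than the actual ATLAS query: the function name, the list-of-dicts input, and the sample records are hypothetical; only the modalities field name comes from the description of the Modality dimension.

```python
from collections import Counter


def segment_percentages(datasets, dimension):
    """Percentage of datasets tagged with each segment of one dimension."""
    total = len(datasets)  # shared denominator for every segment
    if total == 0:
        return {}
    counts = Counter()
    for ds in datasets:
        # set() ensures a dataset is counted at most once per segment,
        # even if a tag is repeated in its list.
        for value in set(ds.get(dimension, [])):
            counts[value] += 1
    return {segment: 100.0 * n / total for segment, n in counts.items()}


# Hypothetical catalog of four visible datasets.
catalog = [
    {"modalities": ["audio", "text"]},
    {"modalities": ["text"]},
    {"modalities": ["image"]},
    {"modalities": ["text", "video"]},
]

shares = segment_percentages(catalog, "modalities")
print(shares["text"])   # 75.0  (3 of 4 datasets)
print(shares["audio"])  # 25.0  (1 of 4 datasets)
```

The first dataset contributes to both the audio and the text count, while the denominator stays at four for every segment.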
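Why some breakdowns sum past 100%
Datasets can have more than one modality, task, or domain: audio AND text, classification AND generation. Each segment is its own count, not a slice of a pie, so the segments of a dimension can add up to more than 100% even though no single segment can exceed 100%.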
Multilinguality is the exception: every dataset is exactly one of monolingual / bilingual / multilingual, so it always sums to 100%.
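A small worked example makes the difference concrete. The sum of a dimension's percentages equals the total number of segment memberships divided by the number of datasets, so it exceeds 100% whenever datasets carry more than one tag on average. The catalog and field names below are hypothetical.

```python
# Hypothetical four-dataset catalog; field names are illustrative only.
catalog = [
    {"modalities": {"audio", "text"}, "multilinguality": "bilingual"},
    {"modalities": {"text"}, "multilinguality": "monolingual"},
    {"modalities": {"image", "text"}, "multilinguality": "monolingual"},
    {"modalities": {"video"}, "multilinguality": "multilingual"},
]
total = len(catalog)


def share_sum(memberships):
    """Sum of all segment percentages: total memberships / total datasets."""
    return 100.0 * sum(len(m) for m in memberships) / total


# Multi-valued dimension: 6 memberships across 4 datasets.
modality_sum = share_sum(ds["modalities"] for ds in catalog)

# Single-valued dimension: exactly 1 membership per dataset.
multilinguality_sum = share_sum({ds["multilinguality"]} for ds in catalog)

print(modality_sum)         # 150.0
print(multilinguality_sum)  # 100.0
```

Even in this example the largest single modality bar is text at 75% (3 of 4 datasets); it is only the sum across bars that passes 100%.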