
Knowledge Base

Documentation

Everything you need to navigate the ATLAS ecosystem — from contributing high-quality SEA language data to understanding our governance framework.

Methodology

How composition is measured

Composition asks ‘what kinds of data are in ATLAS’ from four angles at once. This page explains each angle, how the percentages are computed, and why some breakdowns can sum past 100%.

The four dimensions

Each dimension is a separate breakdown of the catalog — not a slice of a single pie.

  1. Modality

    Shape of the data

    Text, audio, image, video, or multimodal. Derived from the modalities array on dataset_metadata — a dataset with both audio and text counts for both.

  2. Tasks

    What the data is for

    Task families, derived by rolling up each dataset's task tags into coarse groups (classification, generation, translation, etc.). One dataset can serve multiple task families.

  3. Domains

    Subject areas covered

    Broad subject areas (healthcare, legal, finance, government, etc.) assigned to each dataset by the contribution pipeline. A dataset can belong to multiple domains.

  4. Multilinguality

    How many languages per dataset

    A single bucket per dataset — monolingual, bilingual, or multilingual — based on the count of distinct language codes attached to it.
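The multilinguality bucketing can be sketched as follows. This is a minimal illustration, not the ATLAS implementation: the function name `multilinguality_bucket` is hypothetical, and the thresholds (one code is monolingual, two is bilingual, three or more is multilingual) are an assumption, since this page only states that the bucket depends on the count of distinct language codes.

```python
def multilinguality_bucket(language_codes):
    """Return a single bucket for a dataset from its language codes.

    Assumed thresholds (not confirmed by the source):
    1 distinct code -> monolingual, 2 -> bilingual, 3 or more -> multilingual.
    """
    distinct = set(language_codes)  # repeated codes count once
    if not distinct:
        # How ATLAS treats a dataset with no language codes is not specified.
        raise ValueError("dataset has no language codes")
    if len(distinct) == 1:
        return "monolingual"
    if len(distinct) == 2:
        return "bilingual"
    return "multilingual"
```

Because the function returns exactly one value per dataset, the multilinguality dimension partitions the catalog, unlike the other three dimensions.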

How the percentages work

  1. Count per segment

    For each dimension, count the distinct datasets whose tags include the segment value. A dataset can be counted in multiple segments within the same dimension.

  2. Divide by total visible datasets

    The percentage is segment_count ÷ total_datasets. Same denominator across all segments in the dimension.
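The two steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `segment_percentages`, the in-memory list of dicts, and the sample catalog are all hypothetical, and only the `modalities` field name comes from this page.

```python
from collections import Counter


def segment_percentages(datasets, dimension):
    """Percentage of datasets per segment for one dimension.

    `datasets` is a list of dicts, one per dataset (hypothetical shape);
    `dimension` is the key holding that dataset's list of segment values.
    """
    total = len(datasets)
    if total == 0:
        return {}
    counts = Counter()
    for ds in datasets:
        # set() so a dataset with a repeated tag is counted once per segment
        for segment in set(ds.get(dimension, [])):
            counts[segment] += 1
    # Same denominator (total datasets) for every segment in the dimension
    return {segment: 100.0 * n / total for segment, n in counts.items()}


# Made-up catalog for illustration only
catalog = [
    {"id": "a", "modalities": ["text"]},
    {"id": "b", "modalities": ["audio", "text"]},
    {"id": "c", "modalities": ["image"]},
    {"id": "d", "modalities": ["audio", "text"]},
]
shares = segment_percentages(catalog, "modalities")
# text: 3 of 4 = 75.0, audio: 2 of 4 = 50.0, image: 1 of 4 = 25.0
```

In this sample no single segment exceeds 100%, yet the three shares add up to 150% because two datasets are counted under both audio and text.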

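Why some breakdowns sum past 100%

A dataset can have more than one modality, task, or domain: audio AND text, classification AND generation. Each segment is counted on its own rather than as a slice of a pie, so no single bar exceeds 100%, but the bars in a dimension can add up to more than 100%.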

Multilinguality is the exception: every dataset is exactly one of monolingual / bilingual / multilingual, so it always sums to 100%.
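One way to see the difference: the sum of a dimension's percentages equals 100 times the average number of distinct segment values per dataset. The sketch below checks this on made-up data; the function name `percentage_sum` and both sample lists are hypothetical.

```python
def percentage_sum(tag_lists):
    """Sum of per-segment percentages for one dimension.

    Each element of `tag_lists` holds the segment values of one dataset.
    """
    total = len(tag_lists)
    if total == 0:
        return 0.0
    # Every (dataset, segment) pair adds 1/total to exactly one segment's share
    assignments = sum(len(set(tags)) for tags in tag_lists)
    return 100.0 * assignments / total


# Made-up data for illustration only
modalities = [["text"], ["audio", "text"], ["image"], ["audio", "text"]]
buckets = [["monolingual"], ["bilingual"], ["multilingual"], ["monolingual"]]

modality_sum = percentage_sum(modalities)  # 6 assignments over 4 datasets = 150.0
bucket_sum = percentage_sum(buckets)       # 4 assignments over 4 datasets = 100.0
```

A dimension sums to exactly 100% only when every dataset carries exactly one value, which is the case for multilinguality.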
