Knowledge Base

Documentation

Everything you need to navigate the ATLAS ecosystem — from contributing high-quality SEA language data to understanding our governance framework.

Methodology

How the connections network is built

The Connections chapter treats the catalog as a network. Each dataset is a node; an edge means two datasets share enough semantic ground to count as related.

What the hero stats mean

Each number describes the network as a whole, not any specific dataset.

  1. Largest community

    Number of datasets in the most populous community (sector) in this view. It shows where the catalog's mass currently sits.

  2. Islands

    Datasets with no qualifying link to anything else. These sit on their own at the edge of the canvas and mark the network's blind spot.

  3. Avg neighbours

    Mean number of edges per dataset. Higher means the catalog is more interconnected; lower means more datasets stand alone.
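As a rough illustration, the three numbers could be derived from an edge list along the following lines. This is a minimal sketch, not the ATLAS implementation: the function name `hero_stats`, the input shapes, and the precomputed `community_of` mapping are all assumptions (the page does not say how communities are detected), and "mean number of edges per dataset" is read here as the mean node degree.

```python
from collections import Counter


def hero_stats(datasets, edges, community_of):
    """Compute the three network-wide numbers from an undirected edge list.

    datasets:     iterable of dataset ids (hypothetical identifiers)
    edges:        iterable of (a, b) pairs, one per qualifying link
    community_of: dict of dataset id -> community label, assumed to be
                  computed elsewhere (the detection method is not specified)
    """
    nodes = list(datasets)
    degree = {n: 0 for n in nodes}
    seen = set()
    for a, b in edges:
        key = frozenset((a, b))
        if a == b or key in seen:
            continue  # ignore self-links and duplicate pairs
        seen.add(key)
        degree[a] += 1
        degree[b] += 1

    # Largest community: size of the most populous community label.
    sizes = Counter(community_of[n] for n in nodes if n in community_of)
    largest_community = max(sizes.values(), default=0)

    # Islands: datasets with no qualifying link at all.
    islands = sorted(n for n, d in degree.items() if d == 0)

    # Avg neighbours: mean degree, i.e. 2 * |edges| / |datasets|.
    avg_neighbours = sum(degree.values()) / len(nodes) if nodes else 0.0

    return {
        "largest_community": largest_community,
        "islands": islands,
        "avg_neighbours": avg_neighbours,
    }
```

For five datasets with links a-b and b-c and nothing else, this reports two islands and an average of 0.8 neighbours per dataset.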

Description similarity

Each dataset's description is embedded into a vector. Pairs whose vectors point in similar directions (cosine similarity ≥ 0.75) move on to the next stage. This step catches semantically related datasets even when their tags don't match.
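The similarity filter can be sketched as follows. Only the 0.75 threshold comes from the text above; the embedding model is not named on this page, so the sketch assumes the vectors already exist, and the function names are hypothetical. It compares every pair directly, which is fine for illustration but would likely be replaced by an approximate nearest-neighbour search on a large catalog.

```python
import math
from itertools import combinations

COSINE_THRESHOLD = 0.75  # threshold stated in the methodology


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_pairs(embeddings, threshold=COSINE_THRESHOLD):
    """Return the dataset pairs whose description vectors clear the threshold.

    embeddings: dict of dataset id -> list of floats (assumed precomputed)
    """
    ids = sorted(embeddings)
    return [
        (a, b)
        for a, b in combinations(ids, 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]
```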

Tag overlap

Pairs that survive the similarity filter are scored on how much their structured tags actually overlap. A Jaccard index is computed for each of six facets, and the six are blended into one number:

Tasks (tag-level)
Exact task tags in common (e.g. classification, NER, translation).
Tasks (group-level)
Overlap rolled up to task families. Two datasets in different specific tasks but the same family still earn partial credit.
Domains (tag-level)
Exact domain tags in common (e.g. biomedical, legal, news, code).
Domains (group-level)
Domain family roll-up, same logic as task groups.
Modality
Text, speech, image, video, multimodal. Coarser than tasks and domains, so it is weighted more lightly.
Language
ISO 639-3 codes in common. Coarsest of the six, so it nudges the score rather than driving it.

Tasks and domains carry the most weight (what the data is for); modality and language only nudge the score (surface form). The blended score must clear 0.70, well above the catalog default of 0.45, so only the strongest bonds make the cut.
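The blend described above might look something like this. The 0.70 threshold and the six facets come from the text; the weights in `FACET_WEIGHTS` are purely illustrative placeholders (the actual values are not published here), chosen only to respect the stated ordering: tasks and domains heaviest, modality lighter, language lightest. The facet key names and the choice to score two empty tag sets as 0 are also assumptions.

```python
EDGE_THRESHOLD = 0.70  # threshold stated in the methodology

# Hypothetical weights: NOT the real ATLAS values, only the stated ordering.
FACET_WEIGHTS = {
    "task_tags": 0.25,
    "task_groups": 0.20,
    "domain_tags": 0.25,
    "domain_groups": 0.20,
    "modality": 0.06,
    "language": 0.04,
}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets score 0.0 (an assumption)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tag_overlap_score(left, right, weights=FACET_WEIGHTS):
    """Weighted average of per-facet Jaccard scores, in the range [0, 1]."""
    total = sum(weights.values())
    blended = sum(
        w * jaccard(left.get(facet, ()), right.get(facet, ()))
        for facet, w in weights.items()
    )
    return blended / total


def is_edge(left, right, threshold=EDGE_THRESHOLD):
    """True when the blended score clears the edge threshold."""
    return tag_overlap_score(left, right) >= threshold
```

With these placeholder weights, two datasets that differ only in their exact task tag but share the task family and every other facet score 0.75 and still earn an edge, which matches the "partial credit" behaviour described for group-level overlap.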

Hub and island callouts

Two ringed nodes on the canvas put a real dataset name behind the abstract counts.

Hub

The single most-connected dataset in this view. Often a benchmark or anchor that everything else gravitates toward.

Island

The largest disconnected dataset. A useful seed for asking why it doesn't overlap with anything else.
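Picking the two callouts could be sketched as below. The hub rule (highest edge count) follows the description above. For the island, the page does not say what "largest" is measured by, so the `size` mapping is a hypothetical stand-in (for example, a record count), and the alphabetical tie-break in both functions is likewise an assumption.

```python
def pick_hub(degree):
    """Return the dataset with the most edges, or None if there are no datasets.

    degree: dict of dataset id -> number of edges
    Ties are broken alphabetically (an assumption).
    """
    if not degree:
        return None
    return min(degree, key=lambda n: (-degree[n], n))


def pick_island(degree, size):
    """Return the largest dataset that has no edges, or None if every dataset is linked.

    size: dict of dataset id -> hypothetical size measure (not defined by the source)
    """
    islands = [n for n, d in degree.items() if d == 0]
    if not islands:
        return None
    return min(islands, key=lambda n: (-size.get(n, 0), n))
```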
