How the connections network is built

The Connections chapter treats the catalog as a network. Each dataset is a node; an edge means two datasets share enough semantic ground to count as related.

What the hero stats mean

Each number describes the network as a whole, not any specific dataset.

01 Largest community: Number of datasets in the most populous sector in this view. It shows where the catalog's mass currently sits.
02 Islands: Datasets with no qualifying link to anything else. These sit on their own at the edge of the canvas and mark the network's blind spot.
03 Avg neighbours: Mean number of edges attached to each dataset. Higher means the catalog is more interconnected; lower means more datasets stand alone.
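
As a rough illustration, the three stats can be derived from a node list and an edge list. This is a minimal sketch, not the catalog's actual code: the function name `hero_stats` and the `community_of` mapping are hypothetical, community assignment is taken as given, and "edges per dataset" is read as the number of distinct neighbours each dataset has.

```python
from collections import Counter, defaultdict

def hero_stats(nodes, edges, community_of):
    """Return (largest_community, islands, avg_neighbours) for one view.

    nodes: list of dataset ids
    edges: list of (id, id) pairs, undirected
    community_of: hypothetical mapping of dataset id -> community label
    """
    # Build an undirected adjacency map; sets ignore duplicate edges.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Size of the most populous community.
    largest_community = max(Counter(community_of[n] for n in nodes).values())
    # Datasets with no edge at all.
    islands = sum(1 for n in nodes if not neighbours.get(n))
    # Mean number of neighbours per dataset (assumed reading of the stat).
    avg_neighbours = sum(len(neighbours.get(n, ())) for n in nodes) / len(nodes)
    return largest_community, islands, avg_neighbours

# Toy data, invented for illustration only.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c")]
community_of = {"a": "nlp", "b": "nlp", "c": "nlp", "d": "vision", "e": "vision"}
stats = hero_stats(nodes, edges, community_of)
```

On this toy input, three datasets share a community, two have no links, and the five datasets average 0.8 neighbours each.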

What counts as a link

Description similarity

Each dataset's description is embedded into a vector. Pairs whose vectors point in similar directions (cosine ≥ 0.75) move on to the tag-overlap stage. This step catches semantically related datasets even when their tags don't match.
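
The filter can be sketched as follows. The 0.75 threshold comes from the text above. The function names, the brute-force comparison of every pair, and the toy vectors are assumptions for illustration; the embedding model itself is not specified here.

```python
import math
from itertools import combinations

COSINE_THRESHOLD = 0.75  # threshold stated in the methodology

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def candidate_pairs(embeddings):
    """Return dataset pairs whose description vectors reach the threshold.

    embeddings: mapping of dataset id -> vector (list of floats)
    """
    return [
        (a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[a], embeddings[b]) >= COSINE_THRESHOLD
    ]

# Toy two-dimensional vectors; real description embeddings are much longer.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
pairs = candidate_pairs(toy)
```

Here only `a` and `b` point in nearly the same direction, so they are the only pair that moves on.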

Tag overlap

Pairs that pass the similarity filter are then scored on how much their structured tags overlap. A Jaccard score is computed for each of six facets, and the six scores are blended into one number:

Tasks (tag-level): Exact task tags in common (e.g. classification, NER, translation).
Tasks (group-level): Overlap rolled up to task families. Two datasets in different specific tasks but the same family still earn partial credit.
Domains (tag-level): Exact domain tags — biomedical, legal, news, code, etc.
Domains (group-level): Domain family roll-up, same logic as task groups.
Modality: Text, speech, image, video, multimodal. Coarser than tasks/domains so weighted lighter.
Language: ISO 639-3 codes in common. Coarsest of the six, so it nudges the score rather than driving it.

Tasks and domains carry the most weight because they describe what the data is for; modality and language describe surface form and only nudge the score. The blended score must clear 0.70, well above the catalog default of 0.45, so only the strongest bonds make the cut.
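
A weighted Jaccard blend of this kind might look like the sketch below. The exact facet weights are not published here, so the values in `WEIGHTS` are invented placeholders that only respect the stated ordering (tasks and domains heavy, modality lighter, language lightest). The facet keys, the function names, the choice to score two empty tag sets as 0, and reading "clear 0.70" as "at least 0.70" are also assumptions.

```python
# Placeholder weights, NOT the catalog's real values; they sum to 1.0.
WEIGHTS = {
    "tasks": 0.25,
    "task_groups": 0.20,
    "domains": 0.25,
    "domain_groups": 0.20,
    "modality": 0.06,
    "language": 0.04,
}
LINK_THRESHOLD = 0.70  # threshold stated in the methodology

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def blended_score(x, y):
    """Weighted sum of per-facet Jaccard scores for two datasets' tag dicts."""
    return sum(w * jaccard(x.get(f, ()), y.get(f, ())) for f, w in WEIGHTS.items())

def is_link(x, y):
    """True when the blended score reaches the link threshold."""
    return blended_score(x, y) >= LINK_THRESHOLD

# Toy tag sets, invented for illustration.
x = {"tasks": {"classification", "ner"}, "task_groups": {"sequence"},
     "domains": {"biomedical"}, "domain_groups": {"science"},
     "modality": {"text"}, "language": {"eng"}}
y = {"tasks": {"ner"}, "task_groups": {"sequence"},
     "domains": {"biomedical"}, "domain_groups": {"science"},
     "modality": {"text"}, "language": {"eng", "deu"}}
z = {"tasks": {"translation"}, "task_groups": {"generation"},
     "domains": {"legal"}, "domain_groups": {"professional"},
     "modality": {"text"}, "language": {"eng"}}
```

With these placeholder weights, `x` and `y` score about 0.855 and qualify as a link, while `x` and `z` share only modality and language, score about 0.10, and do not.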

Hub and island callouts

Two ringed nodes on the canvas put a real dataset name behind the abstract counts.

Hub

The single most-connected dataset in this view. Often a benchmark or anchor that everything else gravitates toward.

Island

The largest disconnected dataset. A useful seed for asking why it doesn't overlap with anything else.
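
Picking the two callouts could be sketched as below. The text does not say what "largest" measures for the island, so this sketch assumes a hypothetical per-dataset `size` figure; the function name and the tie-breaking (first in list order wins) are likewise assumptions.

```python
from collections import Counter

def callouts(nodes, edges, size):
    """Return (hub, island) for one view.

    nodes: list of dataset ids
    edges: list of unique undirected (id, id) pairs
    size: hypothetical mapping of dataset id -> size measure
    """
    # Count how many edges touch each dataset.
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    # Hub: the dataset with the most edges.
    hub = max(nodes, key=lambda n: degree[n])
    # Island: the largest dataset among those with no edges, if any exist.
    isolated = [n for n in nodes if degree[n] == 0]
    island = max(isolated, key=lambda n: size[n]) if isolated else None
    return hub, island

# Toy data, invented for illustration only.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c")]
size = {"a": 10, "b": 20, "c": 5, "d": 100, "e": 300}
hub, island = callouts(nodes, edges, size)
```

On this toy input, `b` is the hub with two edges, and `e` is the island because it is the bigger of the two unlinked datasets.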
