Methodology
How the connections network is built
The Connections chapter treats the catalog as a network. Each dataset is a node; an edge means two datasets share enough semantic ground to count as related.
What the hero stats mean
Each number describes the network as a whole, not any specific dataset.
- 01
Largest community
Number of datasets in the most populous sector in this view. A read of where the catalog's mass currently sits.
- 02
Islands
Datasets with no qualifying link to anything else. These sit on their own at the edge of the canvas — the network's blind spot.
- 03
Avg neighbours
Mean number of edges per dataset. Higher means more interconnected; lower means most datasets stand more alone.
What counts as a link
Description similarity
Each dataset's description is embedded into a vector. Pairs whose vectors point in similar directions (cosine ≥ 0.75) move on. Catches semantic kin even when their tags don't match.
Tag overlap
Survivors are scored on how much their structured tags actually overlap — Jaccard across six facets, blended into one number:
- Tasks (tag-level)
- Exact task tags in common (e.g. classification, NER, translation).
- Tasks (group-level)
- Overlap rolled up to task families. Two datasets in different specific tasks but the same family still earn partial credit.
- Domains (tag-level)
- Exact domain tags — biomedical, legal, news, code, etc.
- Domains (group-level)
- Domain family roll-up, same logic as task groups.
- Modality
- Text, speech, image, video, multimodal. Coarser than tasks/domains so weighted lighter.
- Language
- ISO 639-3 codes in common. Coarsest of the six, so it nudges the score rather than driving it.
Tasks and domains weigh most (what the data's for); modality and language nudge (surface form). Score must clear 0.70 — well above the catalog default of 0.45 — so only the strongest bonds make the cut.
Hub and island callouts
Two ringed nodes on the canvas put a real dataset name behind the abstract counts.
Hub
The single most-connected dataset in this view. Often a benchmark or anchor that everything else gravitates toward.
Island
The largest disconnected dataset. A useful seed for asking why it doesn't overlap with anything else.