How to contribute
ATLAS lets you contribute datasets in three ways. This guide walks you through every step, from choosing a path to seeing your dataset published in the catalogue.
Before you start
You need a signed-in ATLAS account to submit. Sign in with Google — it takes under a minute. Once you're in, head to the Contribute page and the wizard will guide you through the rest.
Step 1 — Choose your path
Paste a HuggingFace URL
Best for: datasets already public on HuggingFace
Paste your HuggingFace dataset URL (e.g. huggingface.co/datasets/org/name) and ATLAS will auto-extract the dataset name, description, existing tags, version, and citation from the public dataset card. You then verify and complete the remaining fields — languages, license, modality, task group, domain group, and the Responsible AI assessment.
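As a rough illustration of the URL shape this path expects, the sketch below pulls the organisation and dataset name out of a pasted URL. The function name `parse_hf_dataset_url` is hypothetical and this is not ATLAS's actual extraction code. It assumes the `org/name` form shown above and rejects anything else.

```python
from urllib.parse import urlparse


def parse_hf_dataset_url(url: str) -> tuple[str, str]:
    """Return (org, name) for a HuggingFace dataset URL, or raise ValueError."""
    url = url.strip()
    # Accept URLs pasted without a scheme, e.g. "huggingface.co/datasets/org/name".
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    if parsed.netloc.lower() != "huggingface.co":
        raise ValueError(f"not a HuggingFace URL: {url!r}")
    # Drop empty segments so a trailing slash does not matter.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"expected /datasets/<org>/<name>, got {parsed.path!r}")
    return parts[1], parts[2]
```

For example, `parse_hf_dataset_url("huggingface.co/datasets/org/name")` returns `("org", "name")`.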
Upload a file
Best for: fresh data you're publishing for the first time
Upload a single file of up to 5 GB. Accepted formats: zip, tar, gz, csv, json, txt. If your dataset is spread across multiple files, zip them first. Files over 100 MB are uploaded in chunks automatically, so you can close the page and resume the upload later. After the upload finishes, you fill in all metadata yourself.
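If you want to check a file before you start, the limits above can be sketched as a small pre-flight function. The 5 GB cap, the 100 MB chunking threshold, and the accepted extensions come from this guide. The chunk size, the use of binary units, and the name `plan_upload` are assumptions made for illustration only.

```python
from pathlib import PurePath

MAX_BYTES = 5 * 1024**3          # 5 GB cap (binary units assumed)
CHUNK_THRESHOLD = 100 * 1024**2  # files over 100 MB are chunked
CHUNK_BYTES = 50 * 1024**2       # hypothetical chunk size; the guide does not state one
ACCEPTED = {"zip", "tar", "gz", "csv", "json", "txt"}


def plan_upload(filename: str, size_bytes: int) -> int:
    """Return how many upload requests a file would need, or raise ValueError."""
    # Only the final extension is checked, so "data.tar.gz" counts as "gz".
    ext = PurePath(filename).suffix.lstrip(".").lower()
    if ext not in ACCEPTED:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError("file exceeds 5 GB; use the support form instead")
    if size_bytes <= CHUNK_THRESHOLD:
        return 1
    # Ceiling division without floating point.
    return (size_bytes + CHUNK_BYTES - 1) // CHUNK_BYTES
```

A file that fails either check is a candidate for the support-form path described below.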
Need help or have a large dataset?
Best for: datasets over 5 GB, unsupported formats, or complex cases
Use the support form to contact the ATLAS team directly. Describe your dataset, its format, and its approximate size. The team will work with you to find the best onboarding approach — direct transfer, custom pipeline, or staged ingestion.
Steps 2–4 — The wizard
After you choose a path, the wizard guides you through the three remaining steps of the four-step contribution flow. Your progress is auto-saved as a draft every time you change a field, so you can close the browser and return at any time within 7 days of your last change.
Step 2 — Core metadata
These fields describe your dataset so contributors and researchers can find and evaluate it. HuggingFace submissions pre-fill some fields from the public dataset card — check them carefully before continuing.
- Dataset name
- A short, unique identifier for your dataset. Must be 3–60 characters. The wizard checks for close matches in the catalogue in real time and warns you if a similar name already exists.
- Description
- A plain-language summary of what the dataset contains and why it was created. Up to 2,000 characters. Be specific about the language variety, domain, and any collection context.
- Languages
- One or more languages present in the dataset. Select from the canonical SEA language list. If a language is missing, contact the ATLAS team.
- License
- The usage license that governs how others may use this data (e.g. CC-BY-4.0, MIT, Apache-2.0). Choose the most restrictive license that applies. If unsure, consult your organisation's legal team before submitting.
- Modality
- The data type(s) in this dataset. Select up to two: Text, Audio, Image, Video, or Multimodal. Choose Multimodal only if the dataset genuinely combines multiple modalities in each sample.
- Task group
- The primary machine-learning task category: NLP, Speech, Vision, or Multimodal. This determines which task tags are available in the next field.
- Task tags
- Specific tasks within the task group — for example, Translation, NER, or Summarisation under NLP; ASR or TTS under Speech. Add as many as apply.
- Domain group
- The primary subject domain(s): General, Medical, Legal, Education, or Other. Select up to three.
- Domain tags
- Finer-grained domain labels within the selected groups (e.g. Clinical Notes, Court Proceedings). Add as many as are relevant.
- Free tags
- Any additional labels that help with discovery — dataset series names, project codes, topic keywords. These do not affect taxonomy-based filtering.
- Version
- The version string for this dataset release. Defaults to 1.0.0. Use semantic versioning if you plan to release updates.
- Publication date
- The date the dataset was originally created or first published. If the data spans multiple collection periods, use the date of the earliest records.
- Citation
- Optional. BibTeX or plain-text citation for any paper or report that should be credited when the dataset is used. If your dataset has no associated publication, leave this blank.
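The limits listed above can be summarised as a short validation sketch. It is illustrative only and does not reproduce the wizard's own checks. The class and function names are hypothetical, and because the guide does not say which fields may be left empty, the sketch enforces only the stated length limits, option lists, and selection caps.

```python
from dataclasses import dataclass

MODALITIES = {"Text", "Audio", "Image", "Video", "Multimodal"}
DOMAIN_GROUPS = {"General", "Medical", "Legal", "Education", "Other"}


@dataclass
class CoreMetadata:
    name: str
    description: str
    modalities: list[str]
    domain_groups: list[str]
    version: str = "1.0.0"  # default stated in the guide


def validate_core(meta: CoreMetadata) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []
    if not 3 <= len(meta.name) <= 60:
        errors.append("name must be 3-60 characters")
    if len(meta.description) > 2000:
        errors.append("description must be at most 2,000 characters")
    if len(meta.modalities) > 2 or not set(meta.modalities) <= MODALITIES:
        errors.append("select up to two modalities from the listed options")
    if len(meta.domain_groups) > 3 or not set(meta.domain_groups) <= DOMAIN_GROUPS:
        errors.append("select up to three domain groups from the listed options")
    return errors
```

Returning every problem at once, rather than stopping at the first, mirrors how a form usually highlights all invalid fields together.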
Step 3 — Responsible AI assessment
The RAI assessment documents how the data was gathered and any risks users should be aware of. It is required for all submissions. Be as accurate and specific as possible — incomplete or vague answers may delay review.
- Collection method
- How the raw data was obtained. Options: Community contribution, Web scrape, Human annotation, Existing corpus, Synthetic, Other. Select the method that best describes the majority of the data.
- Annotation protocol
- Describe how the data was labelled or structured. If it was not annotated (e.g. raw text corpus), write "No annotation — raw text". Mention any annotation tools, guidelines, or inter-annotator agreement measures used.
- Preprocessing protocol
- Describe any cleaning, filtering, normalisation, or deduplication steps applied before submission. If the data is submitted as-is, write "No preprocessing applied".
- Intended use
- The primary purpose for which the dataset was created. Options: Pre-training, Fine-tuning, Evaluation, RAG, Research, Other.
- Known limitations
- Describe specific gaps, coverage issues, or scenarios where the dataset should not be used. Minimum 10 characters. For example: limited to formal written text, low speaker diversity for dialect X, collected from a single time period.
- Potential biases
- Select all known bias types that may be present. Options: Geographic, Demographic, Language-variety, Domain, Temporal, Other. Selecting Other reveals a free-text field to describe the bias.
- Social impact
- The positive societal applications this dataset is intended to support. Select all that apply: Language preservation, Education, Research, Healthcare, Governance, Cultural heritage, Commercial, Other.
- Contains PII
- Indicate whether the dataset contains any Personally Identifiable Information (names, IDs, contact details, locations). If yes, a text field appears — describe what PII is present and what steps were taken to anonymise or handle it.
- Maintenance plan
- How this dataset will be kept up to date after publication. Options: Actively maintained (regular updates planned), Community maintained (open to community contributions), No plan (one-time release), Other.
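Several RAI fields depend on earlier answers. The sketch below shows one way those conditional rules could be expressed. The 10-character minimum comes from this guide. Treating the "Other" bias description and the PII description as mandatory once their fields appear is an assumption, as are the names `RaiAnswers` and `validate_rai` and the choice to ignore surrounding whitespace.

```python
from dataclasses import dataclass


@dataclass
class RaiAnswers:
    known_limitations: str
    biases: list[str]
    bias_other_text: str = ""
    contains_pii: bool = False
    pii_description: str = ""


def validate_rai(a: RaiAnswers) -> list[str]:
    """Return a list of problems found in the conditional RAI fields."""
    errors = []
    # Minimum length is stated in the guide; trimming whitespace is an assumption.
    if len(a.known_limitations.strip()) < 10:
        errors.append("known limitations must be at least 10 characters")
    # Assumed: the free-text field revealed by "Other" must be filled in.
    if "Other" in a.biases and not a.bias_other_text.strip():
        errors.append("describe the 'Other' bias")
    # Assumed: the PII description is required whenever PII is declared.
    if a.contains_pii and not a.pii_description.strip():
        errors.append("describe the PII present and how it was handled")
    return errors
```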
Step 4 — Review and submit
The final step shows a read-only summary of everything you've entered. Check each section carefully. If anything looks wrong, use the Back button to return to that step and correct it. When you're ready, accept the contributor terms and click Submit. Your draft is cleared on successful submission.
Helpful features
Auto-saved draft
Every field change is saved automatically to your browser. If you close the tab or lose your connection, return to the Contribute page and choose "Resume draft" to pick up where you left off. Drafts expire after 7 days of inactivity.
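The expiry rule amounts to comparing the time of the last change with the current time. A minimal sketch follows, assuming a draft that is exactly 7 days old is still valid (the guide does not specify the boundary). The function name is hypothetical.

```python
from datetime import datetime, timedelta

DRAFT_TTL = timedelta(days=7)  # inactivity window stated in the guide


def draft_is_expired(last_change: datetime, now: datetime) -> bool:
    """True once more than 7 days have passed since the last field change."""
    return now - last_change > DRAFT_TTL
```

Because every field change resets `last_change`, editing a draft on day 6 gives it another full 7 days.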
Duplicate name detection
As you type the dataset name, ATLAS searches the catalogue for close matches and shows a warning card if similar names exist. Review the matches — if your dataset is genuinely different, you can dismiss the warning and continue.
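The guide does not say how ATLAS measures name similarity. As an illustration of the idea, the sketch below uses Python's standard `difflib.get_close_matches` with a case-insensitive comparison. The 0.8 cutoff, the limit of three results, and the function name are arbitrary choices for this example.

```python
from difflib import get_close_matches


def find_similar_names(candidate: str, existing: list[str]) -> list[str]:
    """Return existing dataset names that closely resemble the candidate."""
    # Compare lowercased names, but report them in their original spelling.
    by_lower = {name.lower(): name for name in existing}
    hits = get_close_matches(candidate.lower(), list(by_lower), n=3, cutoff=0.8)
    return [by_lower[h] for h in hits]
```

A non-empty result corresponds to the warning card: the contributor reviews the matches and decides whether the new dataset is genuinely different.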
What happens after you submit
Your submission enters the ATLAS contribution pipeline. Here is what happens at each stage:
- Automated processing: modality detection, domain augmentation, and dataset card generation run automatically.
- Human review: the ATLAS team inspects the metadata, RAI assessment, and data quality.
- Published: once all checks pass, your dataset appears in the catalogue with a unique dataset ID.
- If changes are needed: the status moves to "pending" and the team will contact you with feedback.
You can track the status of all your contributions — reviewing, pending, published, or rejected — from the Contributions section of your profile page.
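One way to picture the four statuses is as a small state machine. The guide confirms only two moves: a submission under review becomes "published" when all checks pass, or "pending" when changes are needed. Every other transition in the sketch below (for instance, a pending submission returning to review or being rejected) is an assumption added for illustration.

```python
from enum import Enum


class Status(Enum):
    REVIEWING = "reviewing"
    PENDING = "pending"
    PUBLISHED = "published"
    REJECTED = "rejected"


# Allowed next statuses. Only REVIEWING -> PUBLISHED and REVIEWING -> PENDING
# are described in the guide; the rest are assumed.
ALLOWED = {
    Status.REVIEWING: {Status.PUBLISHED, Status.PENDING, Status.REJECTED},
    Status.PENDING: {Status.REVIEWING, Status.REJECTED},
    Status.PUBLISHED: set(),
    Status.REJECTED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """True if moving from current to target is allowed in this sketch."""
    return target in ALLOWED[current]
```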