Search and filter the global catalog of language and culture data.
Showing 1–10 of 458
A small, fully hand-checked wordnet for Abui, containing over 1,400 concepts and 3,600 senses, was created. A bootstrapping technique is introduced that uses the information in the gloss fields (English, national, and regional) to generate sense candidates with a naive algorithm based on multilingual sense intersection.
This is an automatically produced question answering dataset generated from Indonesian Wikipedia articles. Each entry in the dataset consists of a context paragraph, the question and answer, and the question's equivalent SPARQL query. Questions are separated into two subsets: simple (the question consists of a single SPARQL triple pattern) and complex (the question consists of two triples plus an optional typing triple).
Dataset Card for "ai2_arc" translated into Hindi. This is a Hindi translation of "ai2_arc" produced with the IndicTrans2 model (Gala et al., 2023). We recommend visiting the "ai2_arc" Hugging Face dataset card (link) for details.
Dataset Card for Aksharantar. Dataset Summary: Aksharantar is the largest publicly available transliteration dataset for 20 Indic languages. The corpus has 26M Indic language-English transliteration pairs. Supported Tasks and Leaderboards: [More Information Needed]. Languages: Assamese (asm), Hindi (hin), Maithili (mai), Marathi (mar), Punjabi (pan), Tamil (tam), Bengali (ben), Kannada (kan), Malayalam (mal), Nepali (nep), Sanskrit (san), Telugu… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/Aksharantar.
ALICE-THI is a Thai handwritten script dataset containing 24,045 character images, split into a Thai handwritten character dataset (THI-C68) of 14,490 images and a Thai handwritten digit dataset (THI-D10) of 9,555 images. The data was collected from 150 native writers aged 20 to 23. Participants were allowed to write only isolated Thai script on the form, with at least 100 samples per character. The character images in this dataset generally have no background noise.
A 20,000-sentence Burmese (Myanmar) treebank of news articles with complete phrase structure annotation. As the final result of the Burmese component of the Asian Language Treebank Project, this is the first large-scale, open-access treebank for the Burmese language.
The dataset contribution of this study is a compilation of short fictional stories written in Bikol for readability assessment. The data was combined with other collected Philippine language corpora, such as Tagalog and Cebuano. The data from these languages are all distributed across the first three grade levels of the Philippine elementary system (L1, L2, L3). We sourced this dataset from Let's Read Asia (LRA), Bloom Library, the Department of Education, and Adarna House.
This open-source dataset consists of 4.54 hours of transcribed Indonesian conversational speech on certain topics, comprising seven conversations between two pairs of speakers. To download the data, create an account and log in at https://magichub.com.
This open-source dataset consists of 5 hours of transcribed Malay conversational speech on certain topics, comprising ten conversations between five pairs of speakers.
This open-source dataset consists of 3.5 hours of transcribed Indonesian scripted speech focusing on daily-use sentences, comprising 3,296 utterances contributed by ten speakers.