Project ATLAS

Abui Wordnet

A small, fully hand-checked wordnet for Abui, containing over 1,400 concepts and 3,600 senses. It was bootstrapped from the information in the gloss fields (English, national, and regional), which is used to generate sense candidates with a naive algorithm based on multilingual sense intersection.

Text | Abui (abz) | cc-by-4.0
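
The multilingual sense intersection idea mentioned in the Abui Wordnet description can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicons, the sense labels (`dog.animal` and so on), the choice of Indonesian as a second gloss language, and the `min_languages` threshold are all hypothetical and exist only for illustration. Each gloss language votes for the senses its gloss word can express, and a sense is kept as a candidate when enough languages agree.

```python
from collections import Counter

# Hypothetical toy lexicons: gloss word -> set of sense labels it can express.
# Real data would come from existing wordnets for the gloss languages.
LEXICONS = {
    "eng": {"dog": {"dog.animal", "dog.person"}, "house": {"house.building"}},
    "ind": {"anjing": {"dog.animal"}, "rumah": {"house.building", "house.family"}},
}


def sense_candidates(glosses, lexicons=LEXICONS, min_languages=2):
    """Return sense labels supported by at least `min_languages` gloss languages.

    `glosses` maps a language code to the gloss of one target-language word.
    """
    votes = Counter()
    for lang, gloss in glosses.items():
        # Each language votes once for every sense its gloss can express.
        for sense in lexicons.get(lang, {}).get(gloss, set()):
            votes[sense] += 1
    return sorted(s for s, n in votes.items() if n >= min_languages)


print(sense_candidates({"eng": "dog", "ind": "anjing"}))  # ['dog.animal']
```

Requiring agreement across languages filters out senses that only one ambiguous gloss suggests, which is why the candidates still need hand-checking but start from a shorter list.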

Ac Iquad

This is an automatically produced question-answering dataset generated from Indonesian Wikipedia articles. Each entry consists of a context paragraph, a question and its answer, and the question's equivalent SPARQL query. Questions are separated into two subsets: simple (the query consists of a single SPARQL triple pattern) and complex (the query consists of two triples plus an optional typing triple).

Text | Indonesian (ind) | cc-by-4.0
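
The simple/complex split described for this dataset can be sketched as a small classifier over an entry's triple patterns. The field names, the `ex:` placeholder predicates, and the representation of a query as a list of triples are assumptions made for illustration, not the dataset's actual schema. The only SPARQL fact relied on is that `a` is shorthand for `rdf:type`.

```python
from dataclasses import dataclass

# "a" is SPARQL shorthand for rdf:type; both are treated as typing predicates.
TYPING_PREDICATES = {"a", "rdf:type"}


@dataclass
class QAEntry:
    """Hypothetical entry layout: field names are illustrative only."""
    context: str
    question: str
    answer: str
    triples: list  # (subject, predicate, object) patterns of the SPARQL query


def subset(entry):
    """Classify an entry as 'simple' or 'complex' from its triple patterns."""
    if len(entry.triples) == 1:
        return "simple"
    typing = [t for t in entry.triples if t[1] in TYPING_PREDICATES]
    core = [t for t in entry.triples if t[1] not in TYPING_PREDICATES]
    # Complex: exactly two core triples plus at most one typing triple.
    if len(core) == 2 and len(typing) <= 1:
        return "complex"
    raise ValueError("query shape matches neither subset")


one_hop = QAEntry("...", "...", "...", [("ex:s", "ex:p", "?x")])
two_hop = QAEntry("...", "...", "...", [
    ("ex:s", "ex:p", "?y"),
    ("?y", "ex:q", "?x"),
    ("?x", "a", "ex:T"),
])
print(subset(one_hop), subset(two_hop))  # simple complex
```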

Ai2 Arc Hi

This is a Hindi translation of the "ai2_arc" dataset, produced with the IndicTrans2 model (Gala et al., 2023). For details, see the "ai2_arc" Hugging Face dataset card.

Text | Hindi (hi) | cc-by-sa-4.0

Aksharantar

Aksharantar is the largest publicly available transliteration dataset for 20 Indic languages. The corpus has 26M Indic language-English transliteration pairs. Languages: Assamese (asm), Hindi (hin), Maithili (mai), Marathi (mar), Punjabi (pan), Tamil (tam), Bengali (ben), Kannada (kan), Malayalam (mal), Nepali (nep), Sanskrit (san), Telugu… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/Aksharantar.

Text | Assamese (asm), Bangla (ben), Bodo (brx), +18 more | cc

Alice Thi

ALICE-THI is a Thai handwritten script dataset containing 24,045 character images, split into a Thai handwritten character dataset (THI-C68) of 14,490 images and a Thai handwritten digit dataset (THI-D10) of 9,555 images. The data was collected from 150 native writers aged 20 to 23. Participants wrote only isolated Thai script on the form, with at least 100 samples per character. The character images in this dataset generally have no background noise.

Image | Thai (tha) | unknown

Alt Burmese Treebank

A 20,000-sentence Burmese (Myanmar) treebank of news articles with complete phrase structure annotation. As the final result of the Burmese component of the Asian Language Treebank Project, this is the first large-scale, open-access treebank for the Burmese language.

Text | Burmese (mya) | cc-by-nc-sa-4.0

Ara Close

The dataset contribution of this study is a compilation of short fictional stories written in Bikol for readability assessment. The data was combined with other collected Philippine language corpora, such as Tagalog and Cebuano. The data from these languages is distributed across the first three grade levels (L1, L2, L3) of the Philippine elementary system. We sourced this dataset from Let's Read Asia (LRA), Bloom Library, Department of Education, and Adarna House.

Text | Central Bikol (bcl), Cebuano (ceb) | cc-by-4.0

ASR Indocsc

This open-source dataset consists of 4.54 hours of transcribed Indonesian conversational speech on certain topics, comprising seven conversations between two pairs of speakers. To download the data, create an account and log in at https://magichub.com.

Audio, Text | Indonesian (ind) | cc-by-nc-nd-4.0

ASR Malcsc

This open-source dataset consists of 5 hours of transcribed Malay conversational speech on certain topics, comprising ten conversations between five pairs of speakers.

Audio, Text | Malay (individual language) (zlm) | cc-by-nc-nd-4.0

ASR Sindodusc

This open-source dataset consists of 3.5 hours of transcribed Indonesian scripted speech focusing on daily-use sentences, comprising 3,296 utterances contributed by ten speakers.

Audio, Text | Indonesian (ind) | cc-by-nc-nd-4.0