Description
CulturaX
Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages
Dataset Summary
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for large language model (LLM) development. The dataset is cleaned and deduplicated through a rigorous multi-stage pipeline to achieve the best quality for model training, including language… See the full description on the dataset page: https://huggingface.co/datasets/uonlp/CulturaX.
Tags
task_categories:text-generation, task_categories:fill-mask, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:found, multilinguality:multilingual, source_datasets:original, language:af, language:als, language:am, language:an, language:ar, language:arz, language:as, language:ast, language:av, language:az, language:azb, language:ba
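
The tags listed here follow a `key:value` pattern, so they can be grouped by key when filtering or summarizing the metadata. The snippet below is a minimal sketch of that grouping, not part of the dataset's own tooling: the function name `group_tags` is hypothetical, the sample list is only a subset of the tags shown here, and it assumes the first colon in each tag separates the key from the value.

```python
from collections import defaultdict


def group_tags(tags):
    """Group 'key:value' tag strings into {key: [values, ...]}.

    Hypothetical helper for illustration. Assumes the first colon
    separates key from value; tags without a colon are skipped.
    """
    grouped = defaultdict(list)
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:
            continue  # not a key:value tag, ignore it
        grouped[key].append(value)
    return dict(grouped)


# A subset of the tags from this listing, already split into separate strings.
sample_tags = [
    "task_categories:text-generation",
    "task_categories:fill-mask",
    "multilinguality:multilingual",
    "language:af",
    "language:als",
    "language:am",
]

grouped = group_tags(sample_tags)
print(grouped["task_categories"])
print(grouped["language"])
```

Grouping this way makes it straightforward to, for example, list only the `language` codes without scanning the full tag string.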