Text Generation & Chat

Unknown

CulturaX

by uonlp

Silver 56
15.3K downloads
604 likes
n<1K

Description

CulturaX

Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages

Dataset Summary

We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for large language model (LLM) development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language…

See the full description on the dataset page: https://huggingface.co/datasets/uonlp/CulturaX.
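As a rough illustration of how a corpus of this size might be sampled without downloading it in full, here is a minimal sketch. It assumes the Hugging Face `datasets` library is installed, that each language is exposed as a configuration name (for example `vi`), that the data lives in a `train` split, and that access to the repository has been granted to the logged-in account. None of these details are stated in the listing, so treat them as assumptions. The helper name `stream_samples` is hypothetical.

```python
from itertools import islice

# Repository id taken from the dataset page URL in the description.
REPO_ID = "uonlp/CulturaX"


def stream_samples(language: str, limit: int = 3) -> list:
    """Return up to `limit` records for one language subset.

    Assumptions (not confirmed by the listing): the language code is a
    valid configuration name, and the subset has a "train" split.
    """
    # Imported lazily so this module loads even without the library.
    from datasets import load_dataset

    # streaming=True iterates over remote shards instead of downloading
    # the whole subset to disk first.
    dataset = load_dataset(REPO_ID, language, split="train", streaming=True)
    return list(islice(dataset, limit))


if __name__ == "__main__":
    # Requires network access and, possibly, prior authentication.
    for record in stream_samples("vi", limit=2):
        print(record)
```

Streaming is used here because a full download of a multi-trillion-token corpus is rarely practical for a quick inspection.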

Tags

task_categories:text-generation
task_categories:fill-mask
task_ids:language-modeling
task_ids:masked-language-modeling
annotations_creators:no-annotation
language_creators:found
multilinguality:multilingual
source_datasets:original
language:af
language:als
language:am
language:an
language:ar
language:arz
language:as
language:ast
language:av
language:az
language:azb
language:ba
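The tags follow a `key:value` pattern, so they can be grouped by key for filtering or display. The sketch below is only an illustration. The helper name `group_tags` is hypothetical, and the tag list is copied from this listing, which appears to be cut off after `language:ba`.

```python
from collections import defaultdict


def group_tags(tags):
    """Group "key:value" tag strings into a dict of key -> list of values."""
    grouped = defaultdict(list)
    for tag in tags:
        # partition splits on the first colon only, so values that
        # themselves contain a colon would be kept intact.
        key, _, value = tag.partition(":")
        grouped[key].append(value)
    return dict(grouped)


# Tags as they appear in this listing (the language list is partial).
tags = [
    "task_categories:text-generation",
    "task_categories:fill-mask",
    "task_ids:language-modeling",
    "task_ids:masked-language-modeling",
    "annotations_creators:no-annotation",
    "language_creators:found",
    "multilinguality:multilingual",
    "source_datasets:original",
    "language:af", "language:als", "language:am", "language:an",
    "language:ar", "language:arz", "language:as", "language:ast",
    "language:av", "language:az", "language:azb", "language:ba",
]

grouped = group_tags(tags)
print(grouped["task_categories"])
print(len(grouped["language"]))
```

Grouping by key makes it straightforward to check, for instance, whether a given language code is among those listed.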