Text Generation & Chat
Commercial OK
MADLAD-400
by allenai
627.1K downloads
159 likes
n>1T
Description
MADLAD-400
Dataset and Introduction
MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is
a document-level multilingual dataset based on Common Crawl, covering 419
languages in total. It uses all Common Crawl snapshots available as of August
1, 2022. Its primary advantages over similar datasets are that it covers more
languages (419), is audited and more heavily filtered, and is document-level.
The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.
Tags
task_categories:text-generation
license:odc-by
size_categories:n>1T
arxiv:2309.04662
arxiv:2010.14571
arxiv:2103.12028
region:us