Math & Reasoning
Commercial OK
HPLT2.0_cleaned
by HPLT
28.4K downloads
38 likes
n>1T
Description
NB: HPLT2.0 is now superseded by a newer release, HPLT3.0.
We recommend switching to v3.0 unless you have a compelling reason to stay on 2.0.
This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project.
Most of the data comes from the Internet Archive, with some additions from Common Crawl.
For a detailed description of the dataset, please refer to our website and our pre-print.
The Cleaned variant of HPLT Datasets v2.0
This is the… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned.
Tags
task_categories:fill-mask, task_categories:text-generation, task_ids:language-modeling, multilinguality:multilingual, language:ace, language:af, language:als, language:am, language:ar, language:as, language:ast, language:awa, language:ayr, language:azb, language:azj, language:ba, language:bm, language:ban, language:be, language:bem