Text Generation & Chat
Commercial OK
C4
by allenai
621.1K downloads
539 likes
n<1K
Description
C4
Dataset Summary
A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org).
This is the processed version of Google's C4 dataset.
We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4).
For reference, these are the sizes of the variants:
en: 305GB
en.noclean: 2.3TB
en.noblocklist: 380GB
realnewslike: 15GB
multilingual (mC4): 9.7TB (108 subsets, one per…
See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.
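The variant sizes listed above make it worth choosing a variant before downloading anything. The following is a minimal sketch, not an official recipe: it records the listed sizes (assuming 1 TB = 1000 GB), compares them, and wraps the Hugging Face `datasets` call `load_dataset(..., streaming=True)` so that a variant is read lazily instead of being downloaded in full. The helper names `size_ratio` and `stream_variant` and the table `VARIANT_SIZES_GB` are hypothetical, introduced only for illustration. The loader is limited to the four English configurations, since the configuration names of the mC4 subsets are not given in this description.

```python
# Sizes as listed in the description, in GB (assumption: 1 TB = 1000 GB).
VARIANT_SIZES_GB = {
    "en": 305,
    "en.noclean": 2300,
    "en.noblocklist": 380,
    "realnewslike": 15,
    "multilingual (mC4)": 9700,
}

# Configurations this sketch is willing to load; mC4 subset names are
# not stated in the description, so they are deliberately left out.
ENGLISH_CONFIGS = ("en", "en.noclean", "en.noblocklist", "realnewslike")


def size_ratio(variant, reference):
    """Return size(variant) / size(reference) using the listed sizes."""
    return VARIANT_SIZES_GB[variant] / VARIANT_SIZES_GB[reference]


def stream_variant(name="realnewslike", split="train"):
    """Open one English variant lazily.

    Requires the `datasets` package and network access; nothing is
    downloaded until the returned object is iterated.
    """
    if name not in ENGLISH_CONFIGS:
        raise ValueError(f"unsupported variant for this sketch: {name!r}")
    # Imported here so the size helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("allenai/c4", name, split=split, streaming=True)
```

For example, `size_ratio("en", "en.noclean")` shows that the cleaned English variant is roughly 13% of the size of the uncleaned one, and `next(iter(stream_variant("realnewslike")))` would fetch a single record from the smallest variant without pulling the remaining data.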
Tags
task_categories:text-generation, task_categories:fill-mask, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:found, multilinguality:multilingual, source_datasets:original, language:af, language:am, language:ar, language:az, language:be, language:bg, language:bn, language:ca, language:ceb, language:co, language:cs, language:cy