Description
Common Corpus
Full paper - ICLR 2026 oral
Common Corpus is the largest open and permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners.
Common Corpus differs from existing open datasets in that it is:
Truly Open: contains only data that…
See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
Tags
language: en, fr, de, zh, it, es, ja, pl, la, nl, ru, ar, ko
size_categories: 10K<n<100K
format: parquet
modality: tabular, text
library: datasets, pandas, polars
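
Since the tags list the datasets library, one plausible way to inspect a few records without downloading the whole corpus is to stream it. The sketch below is not taken from the dataset page: the repository id comes from the URL above, while the split name "train" and the helper names first_records and stream_common_corpus are assumptions made for illustration.

```python
from itertools import islice

# Repository id taken from the dataset page URL.
REPO_ID = "PleIAs/common_corpus"


def first_records(records, n=3):
    """Return the first n items of any iterable of records."""
    return list(islice(records, n))


def stream_common_corpus(split="train"):
    """Open the dataset in streaming mode (needs network access).

    The split name "train" is an assumption; check the dataset page.
    """
    # Imported lazily so first_records works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)
```

Calling first_records(stream_common_corpus()) should fetch only the first few records over the network. Printing sorted(record.keys()) for each one is a quick way to discover the column names, which are not listed here.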