
common_corpus

by PleIAs

Silver 60
143.0K downloads
388 likes

Description

Common Corpus

Full paper - ICLR 2026 oral

Common Corpus is the largest open and permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners.

Common Corpus differs from existing open datasets in that it is:

Truly Open: contains only data that…

See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
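A minimal sketch of how one might peek at a few records from the dataset page linked above, assuming the Hugging Face `datasets` library is installed. The repository ID is derived from the URL in the description; the split name `train` and the helper names `repo_id_from_url` and `peek` are assumptions made for illustration, not something this page confirms. Streaming is used so that nothing close to the full 2.27 trillion tokens is downloaded.

```python
from itertools import islice
from urllib.parse import urlparse

# URL taken from the description above.
DATASET_URL = "https://huggingface.co/datasets/PleIAs/common_corpus"


def repo_id_from_url(url):
    """Extract 'owner/name' from a Hugging Face dataset page URL (hypothetical helper)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path shape: datasets/<owner>/<name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a dataset page URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def peek(n=3, split="train"):
    """Return the first n records without downloading the whole corpus.

    The split name is an assumption; check the dataset page for the real layout.
    """
    # Imported lazily so repo_id_from_url works even without the library installed.
    from datasets import load_dataset

    stream = load_dataset(repo_id_from_url(DATASET_URL), split=split, streaming=True)
    return list(islice(stream, n))


REPO_ID = repo_id_from_url(DATASET_URL)
```

Calling `peek(2)` needs network access and returns a list of dictionaries; inspecting `peek(1)[0].keys()` is a safe way to discover the column names, which this page does not list.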


Tags

language:en, language:fr, language:de, language:zh, language:it, language:es, language:ja, language:pl, language:la, language:nl, language:ru, language:ar, language:ko
size_categories:10K<n<100K
format:parquet
modality:tabular, modality:text
library:datasets, library:pandas, library:polars
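The tags above follow a `key:value` pattern, so they can be grouped by key for filtering or display. The sketch below is illustrative only: `group_tags` is a hypothetical helper, and the list is a subset of the tags shown on this page.

```python
from collections import defaultdict


def group_tags(tags):
    """Group 'key:value' tag strings into a dict of key -> list of values."""
    grouped = defaultdict(list)
    for tag in tags:
        # Split on the first colon only, so values such as '10K<n<100K' stay intact.
        key, _, value = tag.partition(":")
        grouped[key].append(value)
    return dict(grouped)


# A subset of the tags listed on this page.
tags = [
    "language:en",
    "language:fr",
    "size_categories:10K<n<100K",
    "format:parquet",
    "modality:tabular",
    "modality:text",
    "library:datasets",
    "library:pandas",
    "library:polars",
]

grouped = group_tags(tags)
```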