Math & Reasoning · KTO · Commercial OK

📄 FinePDFs

by HuggingFaceFW

Silver 59
35.4K downloads
833 likes
n>1T

Description

Liberating 3T of the finest tokens from PDFs

What is this?

As we run out of web pages to process, the natural question has always been: what to do next? Only a few knew about a data source that everyone avoided for ages, due to its incredible extraction cost and complexity: PDFs.

📄 FinePDFs is exactly that. It is the largest publicly available corpus sourced exclusively from PDFs, containing about 3 trillion tokens across 475 million documents in 1733 languages. Compared to HTML…

See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finepdfs.

Tags

task_categories:text-generation, language:aai, language:aak, language:aau, language:aaz, language:aba, language:abi, language:abk, language:abn, language:abq, language:abs, language:abt, language:abx, language:aby, language:abz, language:aca, language:acd, language:ace, language:acf, language:ach
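
The tags listed here follow a `key:value` pattern, so they can be grouped by key to see which task categories and language codes the dataset declares. The following is a minimal sketch of that grouping; the helper name `group_tags` and the `"other"` catch-all key are hypothetical and not part of any library, and the sample list is only a short excerpt of the tags shown on this page.

```python
from collections import defaultdict


def group_tags(tags):
    """Group 'key:value' tag strings by key, keeping the original order.

    Hypothetical helper for illustration; not part of any library.
    """
    grouped = defaultdict(list)
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:
            # No colon in the tag: file it under a catch-all key (an assumption).
            key, value = "other", tag
        grouped[key].append(value)
    return dict(grouped)


# A short excerpt of the tags shown on this page.
tags = [
    "task_categories:text-generation",
    "language:aai",
    "language:aak",
    "language:aau",
]

grouped = group_tags(tags)
print(grouped["task_categories"])
print(len(grouped["language"]))
```

Splitting on the first colon only keeps any value that itself contains a colon intact.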