Text Generation & Chat | Commercial OK

FineWeb

by HuggingFaceFW

Silver 66
206.0K downloads
2.7K likes
n>1T

Description

🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

What can I do with this?
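
One possible starting point is to stream a few records instead of downloading the full corpus. The sketch below assumes the Hugging Face `datasets` library is installed and network access is available. The repository ID comes from the dataset page URL above; the `sample-10BT` configuration name, the `train` split, and the `text` field are assumptions not stated on this page, so check them against the dataset page before relying on them. The helper name `preview_fineweb` is hypothetical.

```python
from itertools import islice

# Repository ID taken from the dataset page URL in the description.
DATASET_ID = "HuggingFaceFW/fineweb"


def preview_fineweb(n=3, config="sample-10BT", max_chars=200):
    """Return the first `max_chars` characters of `n` streamed documents.

    Assumptions (not confirmed by this page): a config named
    "sample-10BT", a "train" split, and a "text" column.
    """
    # Imported lazily so the module loads even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True reads records lazily instead of downloading everything.
    stream = load_dataset(DATASET_ID, name=config, split="train", streaming=True)
    return [row["text"][:max_chars] for row in islice(stream, n)]


if __name__ == "__main__":
    for snippet in preview_fineweb():
        print(snippet)
        print("-" * 40)
```

Streaming keeps disk and memory use small, which matters for a corpus of this size. For training runs, a full or sharded download may be more appropriate.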

Tags

task_categories:text-generation
language:en
license:odc-by
size_categories:10B<n<100B
modality:tabular
modality:text
arxiv:2306.01116
arxiv:2109.07445
arxiv:2406.17557
doi:10.57967/hf/2493
region:us