Text Generation & Chat
Commercial OK
FineWeb
by HuggingFaceFW
206.0K downloads
2.7K likes
n>1T
Description
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run with 🏭 datatrove, our large-scale data processing library.
🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
What can I do with this?
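One common starting point is to stream a few documents rather than download the full corpus. The sketch below assumes the Hugging Face `datasets` library, a subset configuration named `sample-10BT`, and a `text` column in each row. None of these are stated on this listing, so check them against the dataset page. The helper names `take_texts` and `stream_fineweb` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Mapping


def take_texts(rows: Iterable[Mapping[str, object]], limit: int) -> List[str]:
    """Return the "text" field of the first `limit` rows (column name is an assumption)."""
    return [str(row["text"]) for row in islice(rows, limit)]


def stream_fineweb(config: str = "sample-10BT") -> Iterator[Mapping[str, object]]:
    """Lazily iterate over rows; `config` is an assumed subset name."""
    # Imported here so take_texts() works even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True reads rows on demand instead of downloading everything first.
    dataset = load_dataset(
        "HuggingFaceFW/fineweb", name=config, split="train", streaming=True
    )
    return iter(dataset)


# Usage (needs network access and `pip install datasets`):
# for text in take_texts(stream_fineweb(), limit=3):
#     print(text[:200])
```

Streaming suits a corpus of this size because only the rows that are actually consumed get fetched.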
Tags
task_categories:text-generation
language:en
license:odc-by
size_categories:10B<n<100B
modality:tabular
modality:text
arxiv:2306.01116
arxiv:2109.07445
arxiv:2406.17557
doi:10.57967/hf/2493
region:us