Math & Reasoning | KTO, Pretraining | Commercial OK

🥂 FineWeb 2

by HuggingFaceFW

Silver 58
37.8K downloads
775 likes
n>1T

Description

🥂 FineWeb2: a sparkling update with 1000s of languages

What is it?

This is the second iteration of the popular 🍷 FineWeb dataset, bringing high-quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license, and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2…

See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

What can I do with this?
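
One likely use is sampling a single language's documents for pretraining experiments without downloading everything. The sketch below is an illustration under stated assumptions, not an official recipe: it assumes the Hugging Face `datasets` library is installed, that per-language subsets are named as language code plus script (for example `fra_Latn`), and that each record has a `text` field. `config_name` and `stream_texts` are hypothetical helper names introduced here.

```python
from typing import Iterator


def config_name(lang_code: str, script: str) -> str:
    """Build a subset name such as 'fra_Latn' (assumed naming scheme)."""
    return f"{lang_code.lower()}_{script.capitalize()}"


def stream_texts(lang_code: str, script: str, limit: int = 3) -> Iterator[str]:
    """Yield the text of the first `limit` documents from one language subset."""
    # Imported lazily so config_name works even without the library installed.
    from datasets import load_dataset

    ds = load_dataset(
        "HuggingFaceFW/fineweb-2",
        name=config_name(lang_code, script),
        split="train",
        streaming=True,  # iterate over remote shards instead of downloading them
    )
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row["text"]  # assumes a "text" column


if __name__ == "__main__":
    # Requires network access to the Hugging Face Hub.
    for text in stream_texts("fra", "Latn"):
        print(text[:200])
```

Streaming keeps the example cheap to run on a dataset of this size; check the dataset page for the actual subset names and fields before relying on the assumptions above.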

Tags

task_categories:text-generation, language:aai, language:aak, language:aau, language:aaz, language:aba, language:abi, language:abk, language:abn, language:abq, language:abs, language:abt, language:abx, language:aby, language:abz, language:aca, language:acd, language:ace, language:acf, language:ach