Instruction Following, Synthetic Data, Commercial OK

FineWeb-Edu

by HuggingFaceFW

Silver 64
310.3K downloads
1.0K likes
n>1T

Description

📚 FineWeb-Edu: 1.3 trillion tokens of the finest educational data the 🌐 web has to offer.

Paper: https://arxiv.org/abs/2406.17557

What is it? The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
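The classifier-based filtering described above can be illustrated with a minimal sketch. It assumes each row carries an integer `int_score` field holding the classifier's educational score; that field name, the helper names `filter_by_score` and `stream_fineweb_edu`, the thresholds, and the sample rows are assumptions for illustration, not details confirmed by this page. The streaming helper uses the Hugging Face `datasets` library and needs network access, so it is defined but not called here.

```python
from typing import Dict, Iterable, Iterator, List


def filter_by_score(records: Iterable[Dict], min_score: int) -> Iterator[Dict]:
    """Yield records whose classifier score is at least `min_score`.

    Assumes an integer `int_score` field (an assumption, see lead-in);
    records without that field are skipped.
    """
    for record in records:
        score = record.get("int_score")
        if score is not None and score >= min_score:
            yield record


def stream_fineweb_edu(limit: int = 3) -> List[Dict]:
    """Stream the first `limit` rows from the Hub without a full download."""
    # Imported lazily so the offline filter above works without `datasets`.
    from datasets import load_dataset

    ds = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
    return list(ds.take(limit))


# Hypothetical sample rows, not real dataset content.
sample = [
    {"text": "Photosynthesis converts light into chemical energy.", "int_score": 4},
    {"text": "Buy one, get one free this weekend!", "int_score": 1},
    {"text": "A short note on fractions.", "int_score": 2},
]

# Keep only the rows at or above a chosen (illustrative) threshold.
kept = list(filter_by_score(sample, min_score=3))
print(len(kept), "of", len(sample), "sample rows kept")
```

Streaming avoids materializing a dataset of this size on disk; calling `stream_fineweb_edu()` and passing its rows to `filter_by_score` would apply the same check to real data, provided the assumed field is present.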

Tags

task_categories:text-generation, language:en, license:odc-by, size_categories:1B<n<10B, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:polars, library:mlcroissant, arxiv:2406.17557, arxiv:2404.14219, arxiv:2401.10020, arxiv:2109.07445, doi:10.57967/hf/2497, region:us