Instruction Following, Synthetic Data, Commercial OK
FineWeb-Edu
by HuggingFaceFW
310.3K downloads
1.0K likes
n>1T
Description
FineWeb-Edu
1.3 trillion tokens of the finest educational data the web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The FineWeb-Edu dataset consists of 1.3T tokens, or 5.4T tokens in the FineWeb-Edu-score-2 variant, of educational web pages filtered from the FineWeb dataset. This is the 1.3 trillion token version.
To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
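The filtering described above can be pictured as a simple score cutoff. The sketch below is illustrative only: the record layout, the `score` field name, and both threshold values are assumptions of this example (the name FineWeb-Edu-score-2 suggests a cutoff of 2 for the larger variant), and made-up scores stand in for the classifier's output.

```python
from typing import Dict, Iterable, List

# Hypothetical thresholds: the "score-2" variant name suggests a cutoff of 2;
# the cutoff of 3 for the 1.3T version is an assumption of this sketch.
STRICT_THRESHOLD = 3.0
LENIENT_THRESHOLD = 2.0


def filter_by_score(records: Iterable[Dict], threshold: float) -> List[Dict]:
    """Keep records whose (assumed) 'score' field is at least `threshold`."""
    return [r for r in records if r["score"] >= threshold]


# Toy pages with made-up scores standing in for classifier output.
pages = [
    {"text": "Intro to photosynthesis", "score": 4.1},
    {"text": "Celebrity gossip roundup", "score": 0.7},
    {"text": "Basic fractions worksheet", "score": 2.4},
]

strict = filter_by_score(pages, STRICT_THRESHOLD)
lenient = filter_by_score(pages, LENIENT_THRESHOLD)
print(len(strict), len(lenient))  # prints: 1 2
```

A lower cutoff keeps more pages at the cost of admitting less clearly educational ones, which is consistent with the score-2 variant being several times larger.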
What can I do with this?
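One common starting point is to stream a few documents instead of downloading the whole dataset. This is a minimal sketch, assuming the Hugging Face `datasets` library is installed (the tags list `library:datasets`) and that the repository exposes a `train` split with a `text` column; neither detail is stated in the description here. `first_documents` and `open_stream` are hypothetical helper names.

```python
from itertools import islice
from typing import Dict, Iterable, List


def first_documents(stream: Iterable[Dict], n: int) -> List[str]:
    """Return the (assumed) 'text' field of the first n records of a stream."""
    return [record["text"] for record in islice(stream, n)]


def open_stream():
    """Open the dataset lazily; needs network access and `pip install datasets`."""
    # Imported here so first_documents can be used without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the data without downloading it all first.
    # The split name "train" is an assumption of this sketch.
    return load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)


# Offline stand-in for the real stream, so the helper can be tried without a download.
fake_stream = [{"text": "doc one"}, {"text": "doc two"}, {"text": "doc three"}]
print(first_documents(fake_stream, 2))  # prints: ['doc one', 'doc two']
```

With network access, `first_documents(open_stream(), 2)` would read real records in place of the stand-in list.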
Tags
task_categories:text-generation, language:en, license:odc-by, size_categories:1B<n<10B, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:polars, library:mlcroissant, arxiv:2406.17557, arxiv:2404.14219, arxiv:2401.10020, arxiv:2109.07445, doi:10.57967/hf/2497, region:us