Text Generation & Chat | Pretraining | Commercial OK

FineWeb-HQ

by epfml

Silver 51
300.4K downloads
7 likes
n>1T

Description

Dataset Summary: FineWeb-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb. It was created by selecting the top 10% of FineWeb documents, as ranked by a deep learning classifier trained to identify structured and knowledge-rich samples. The classifier scores documents using XLM-RoBERTa embeddings. To validate our approach, we pretrained 1B-parameter LLMs with a Llama-like architecture across multiple languages and… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb-HQ.
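The top-10% selection step described above can be illustrated with a small sketch. This is not the authors' pipeline: the function name `select_top_fraction` and the toy scores are hypothetical, and the real scores come from a classifier over XLM-RoBERTa embeddings, which is not reproduced here. The sketch only shows the idea of keeping the highest-scoring fraction of documents.

```python
from typing import List, Tuple

ScoredDoc = Tuple[str, float]  # (document id, classifier score); illustrative type


def select_top_fraction(scored_docs: List[ScoredDoc], fraction: float = 0.10) -> List[ScoredDoc]:
    """Return the highest-scoring `fraction` of documents, best first.

    Hypothetical helper: assumes each document already has a quality score.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not scored_docs:
        return []
    # Round down, but always keep at least one document.
    keep = max(1, int(len(scored_docs) * fraction))
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]


# Toy data: 20 documents with made-up scores 0.00, 0.05, ..., 0.95.
docs = [(f"doc{i}", i / 20) for i in range(20)]
top = select_top_fraction(docs, fraction=0.10)
print([doc_id for doc_id, _ in top])
```

On a corpus of FineWeb's size, sorting everything in memory would not be practical; a streaming pass against a precomputed score threshold would be a more plausible design, though the source does not say which approach was used.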


Tags

task_categories:text-generation, language:en, license:odc-by, size_categories:1B<n<10B, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2502.10361, region:us