Text Generation & Chat, Pretraining, Commercial OK
FineWeb-HQ
by epfml
300.4K downloads
7 likes
n>1T
Description
Dataset Summary
FineWeb-HQ is a high-quality, model-filtered pretraining dataset that is a subset of FineWeb. It was created by selecting the top 10% of FineWeb documents, as ranked by a deep learning classifier trained to identify structured, knowledge-rich samples. The classifier scores documents using XLM-RoBERTa embeddings.
To validate our approach, we pretrained 1B-parameter LLMs with a Llama-like architecture across multiple languages and… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb-HQ.
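The top-10% selection described in the summary can be illustrated with a minimal sketch. It assumes each document already has a scalar quality score from some classifier. The function name `select_top_fraction` and the random scores are hypothetical stand-ins, not the authors' pipeline, and the actual classifier and scoring procedure are not reproduced here.

```python
import random


def select_top_fraction(scores, fraction=0.10):
    """Return the sorted indices of the highest-scoring `fraction` of documents.

    Hypothetical helper for illustration only; `scores` stands in for
    per-document classifier outputs.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n = len(scores)
    if n == 0:
        return []
    # Keep at least one document so small inputs do not yield an empty set.
    k = max(1, round(n * fraction))
    # Rank document indices by score, highest first, then keep the first k.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Synthetic scores standing in for classifier outputs (assumption).
rng = random.Random(0)
scores = [rng.random() for _ in range(1000)]
kept = select_top_fraction(scores, fraction=0.10)
```

Here `kept` holds the indices of the 100 highest-scoring documents out of 1,000. At corpus scale, a streaming quantile estimate or a fixed score threshold would likely be more practical than sorting every score in memory.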
Tags
task_categories:text-generation
language:en
license:odc-by
size_categories:1B<n<10B
format:parquet
modality:tabular
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2502.10361
region:us