Code, Pretraining, Commercial OK

Ultra-FineWeb

by openbmb

Silver 55
25.6K downloads
333 likes
n>1T

Description

📜 Ultra-FineWeb Technical Report | 📄 MiniCPM4 Paper | 💻 GitHub Repository | 🌐 MiniCPM4 Project Page

📚 Introduction

Ultra-FineWeb is a large-scale, high-quality, and efficiently filtered dataset. We apply the proposed efficient verification-based high-quality filtering pipeline to the FineWeb and Chinese FineWeb datasets (source data from Chinese FineWeb-edu-v2, which includes IndustryCorpus2, MiChao, WuDao, SkyPile, WanJuan, ChineseWebText…

See the full description on the dataset page: https://huggingface.co/datasets/openbmb/Ultra-FineWeb.


Tags

task_categories:text-generation, language:en, language:zh, license:apache-2.0, size_categories:1B<n<10B, modality:text, arxiv:2505.05427, arxiv:2506.07900, arxiv:2412.04315, region:us, llm, pretraining, web-corpus, data-filtering, high-quality
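
The tags above mix `key:value` entries (such as `language:en`) with bare keywords (such as `llm`). As a minimal sketch of how they could be grouped for filtering or display, the snippet below splits each tag on its first colon. The helper name `group_tags` and the `other` bucket are illustrative assumptions, not part of any Hugging Face API, and the tag boundaries are inferred from the listing.

```python
from collections import defaultdict


def group_tags(tags):
    """Group 'key:value' tags by key; bare keywords go under 'other'.

    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(list)
    for tag in tags:
        # partition splits on the first colon only, so values keep any later colons
        key, sep, value = tag.partition(":")
        if sep:
            grouped[key].append(value)
        else:
            grouped["other"].append(tag)
    return dict(grouped)


# Tags as listed on this card (boundaries inferred from the listing).
TAGS = [
    "task_categories:text-generation",
    "language:en",
    "language:zh",
    "license:apache-2.0",
    "size_categories:1B<n<10B",
    "modality:text",
    "arxiv:2505.05427",
    "arxiv:2506.07900",
    "arxiv:2412.04315",
    "region:us",
    "llm",
    "pretraining",
    "web-corpus",
    "data-filtering",
    "high-quality",
]

grouped = group_tags(TAGS)
print(grouped["language"])
print(grouped["arxiv"])
```

Grouping by key makes it easy to check, for example, which languages or arXiv identifiers a card declares without relying on string matching over the raw tag list.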