Code, Pretraining, Commercial OK
Ultra-FineWeb
by openbmb
25.6K downloads
333 likes
n>1T
Description
Ultra-FineWeb
📜 Ultra-FineWeb Technical Report | 📄 MiniCPM4 Paper | 💻 GitHub Repository | 🌐 MiniCPM4 Project Page
📚 Introduction
Ultra-FineWeb is a large-scale, high-quality, efficiently filtered dataset. We apply the proposed efficient verification-based high-quality filtering pipeline to the FineWeb and Chinese FineWeb datasets (source data from Chinese FineWeb-edu-v2, which includes IndustryCorpus2, MiChao, WuDao, SkyPile, WanJuan, ChineseWebText… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/Ultra-FineWeb.
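Because the corpus is large, a first look is probably best done in streaming mode rather than with a full download. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed. The helper names `stream_dataset` and `peek` are hypothetical, and the split and configuration names are not stated in this description, so they are left as parameters to be checked against the dataset page.

```python
from itertools import islice

# Repository id taken from the dataset page URL above.
DATASET_ID = "openbmb/Ultra-FineWeb"


def stream_dataset(split, name=None):
    """Open the dataset lazily instead of downloading it in full.

    `split` and `name` are assumptions to verify on the dataset page;
    this description does not list the available values.
    Requires `pip install datasets` and network access.
    """
    # Imported here so that `peek` stays usable without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, name=name, split=split, streaming=True)


def peek(records, n=3):
    """Return the first `n` items of any iterable without consuming the rest."""
    return list(islice(records, n))
```

With a valid split name, `peek(stream_dataset(split=...))` would return a handful of records for inspection. The record field names are not given here and should be read from the returned dictionaries.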
Tags
task_categories:text-generation, language:en, language:zh, license:apache-2.0, size_categories:1B<n<10B, modality:text, arxiv:2505.05427, arxiv:2506.07900, arxiv:2412.04315, region:us, llm, pretraining, web-corpus, data-filtering, high-quality