Code · PPO · Pretraining · Non-Commercial
Nemotron-CC-v2
by nvidia
Description
Nemotron-Pre-Training-Dataset-v1 Release
Data Overview
This pretraining dataset for generative AI model training preserves high-value math and code while enriching it with diverse multilingual Q&A, fueling the next generation of intelligent, globally capable models.
This dataset supports NVIDIA Nemotron Nano 2, a family of large language models (LLMs) that consists of the NVIDIA-Nemotron-Nano-9B-v2, NVIDIA-Nemotron-Nano-9B-v2-Base, and NVIDIA-Nemotron-Nano-12B-v2-Base… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.
Tags
task_categories: text-generation · license: other · size_categories: 1B&lt;n&lt;10B · format: parquet · modality: text · library: datasets · library: dask · library: polars · library: mlcroissant · arxiv: 2508.14444 · region: us