Code, PPO, Pretraining, Non-Commercial

Nemotron-CC-v2

by nvidia

Silver56
109.4K downloads
111 likes

Description

Nemotron-Pre-Training-Dataset-v1 Release: Data Overview

This pretraining dataset, intended for generative AI model training, preserves high-value math and code while enriching it with diverse multilingual Q&A, fueling the next generation of intelligent, globally capable models. It supports NVIDIA Nemotron Nano 2, a family of large language models (LLMs) consisting of NVIDIA-Nemotron-Nano-9B-v2, NVIDIA-Nemotron-Nano-9B-v2-Base, and NVIDIA-Nemotron-Nano-12B-v2-Base… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.
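Because the corpus ships as parquet with between 1B and 10B rows, streaming a handful of records is usually preferable to a full download. A minimal sketch of that pattern, assuming the Hugging Face `datasets` library; the `train` split name and the `text` field are assumptions, not confirmed by this page:

```python
def take(iterable, n):
    """Collect at most the first n items from an iterator or stream."""
    out = []
    for item in iterable:
        if len(out) >= n:
            break
        out.append(item)
    return out

# Hypothetical usage against the hosted dataset (requires network access
# and the `datasets` package; split name and record fields are assumptions):
#
#   from datasets import load_dataset
#   stream = load_dataset("nvidia/Nemotron-CC-v2", split="train", streaming=True)
#   for rec in take(stream, 3):
#       print(rec.get("text", "")[:200])
```

With `streaming=True`, `load_dataset` returns an iterable that fetches shards lazily, so inspecting a few samples does not pull the whole multi-billion-row corpus to disk.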

Tags

task_categories: text-generation
license: other
size_categories: 1B<n<10B
format: parquet
modality: text
library: datasets
library: dask
library: polars
library: mlcroissant
arxiv: 2508.14444
region: us