Code · Commercial OK

Zyda-2

by Zyphra

Silver 58
246.4K downloads
90 likes
n>1T

Description

Zyda-2 is a 5-trillion-token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/Zyphra/Zyda-2.
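
The two processing steps named in the description, cross-deduplication and model-based quality filtering, can be illustrated with a toy sketch. This is not Zyphra's pipeline: it assumes exact-match deduplication on whitespace- and case-normalized text, and it replaces the quality model with a placeholder scoring function. The names `cross_deduplicate`, `quality_filter`, and `toy_score` are hypothetical and exist only for this example.

```python
import hashlib
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def cross_deduplicate(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop any document already seen in this or an earlier source.

    Assumption: exact matching on normalized text. A real pipeline may use
    fuzzy matching instead; this is only the simplest possible stand-in.
    """
    seen = set()
    result: Dict[str, List[str]] = {}
    for name, docs in sources.items():  # earlier sources take priority
        kept = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        result[name] = kept
    return result


def quality_filter(
    docs: List[str], score: Callable[[str], float], threshold: float
) -> List[str]:
    """Keep documents whose score meets the threshold.

    `score` stands in for a learned quality classifier (hypothetical here).
    """
    return [doc for doc in docs if score(doc) >= threshold]


def toy_score(doc: str) -> float:
    """Placeholder scorer: word count. A real filter would call a model."""
    return float(len(doc.split()))


if __name__ == "__main__":
    sources = {
        "source_a": ["Hello  world", "A proof of the lemma follows."],
        "source_b": ["hello world", "def add(a, b): return a + b"],
    }
    deduped = cross_deduplicate(sources)
    for name, docs in deduped.items():
        print(name, quality_filter(docs, toy_score, threshold=3))
```

Processing sources in a fixed order means that when two sources share a document, the copy in the earlier source is the one kept; the ordering itself is a choice made for this sketch.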

Tags

task_categories:text-generation
language:en
license:odc-by
size_categories:n>1T
region:us