Math & Reasoning | Synthetic Data | Commercial OK

SYNTH - generalist open data and environment

by PleIAs

Silver57
63.7K downloads
260 likes
10M<n<100M

Description

SYNTH is the first open generalist synthetic dataset for training small reasoning models end-to-end, jointly released by Pleias and the AI Alliance. SYNTH includes 79,648,272 individual text samples, comprising over 41 billion words (about 75 billion tokens with the Pleias tokenizer). It is based on the amplification of 58,698 articles from Wikipedia and was made possible by the Structured Wikipedia dataset from Wikimedia Enterprise. SYNTH differs… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/SYNTH.
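To put the figures above in proportion, here is a rough back-of-the-envelope calculation. It uses only the numbers quoted in the description; since the word and token counts are given as "over 41 billion" and "about 75 billion", the ratios are approximations, and the variable names are illustrative.

```python
# Figures quoted in the dataset description.
SAMPLES = 79_648_272          # individual text samples
SEED_ARTICLES = 58_698        # Wikipedia articles used as seeds
WORDS = 41e9                  # "over 41 billion words" (lower bound)
TOKENS = 75e9                 # "about 75 billion tokens" (Pleias tokenizer)

# Average amplification: synthetic samples generated per seed article.
samples_per_article = SAMPLES / SEED_ARTICLES

# Average sample length, in words and in tokens.
words_per_sample = WORDS / SAMPLES
tokens_per_sample = TOKENS / SAMPLES

# Tokenizer fertility: tokens produced per word on average.
tokens_per_word = TOKENS / WORDS

print(f"samples per seed article: {samples_per_article:,.0f}")
print(f"words per sample:         {words_per_sample:,.0f}")
print(f"tokens per sample:        {tokens_per_sample:,.0f}")
print(f"tokens per word:          {tokens_per_word:.2f}")
```

On these figures each seed article yields roughly 1,357 samples, the average sample runs to about 515 words, and the tokenizer produces about 1.83 tokens per word.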


Tags

task_categories:text-generation
task_categories:zero-shot-classification
task_categories:summarization
language:en
language:fr
language:it
language:es
language:de
language:pl
language:nl
language:la
license:cdla-permissive-2.0
size_categories:10M<n<100M
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
region:us
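The `library:datasets` and `format:parquet` tags suggest the data can be read with the Hugging Face `datasets` library. A minimal sketch follows, assuming the repository exposes a `train` split (the split name is not stated on this page) and that the `datasets` package is installed. The helper name `peek_synth` is hypothetical. Streaming is used so that inspecting a few records does not require downloading all 79.6 million samples.

```python
from itertools import islice


def peek_synth(n=3, repo_id="PleIAs/SYNTH", split="train"):
    """Return the first n records of the dataset as a list of dicts.

    Assumption: the repository has a split named `split` ("train" is a guess).
    """
    # Imported lazily so defining this helper does not require the package.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading
    # the full set of parquet files up front.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_synth(2)` needs network access and returns two records. Printing `sorted(record.keys())` for one of them is a quick way to discover the column names, which this page does not list.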