Math & Reasoning, Synthetic Data, Commercial OK
SYNTH - generalist open data and environment
by PleIAs
63.7K downloads
260 likes
10M<n<100M
Description
SYNTH
SYNTH is the first open generalist synthetic dataset for training small reasoning models end-to-end, jointly released by Pleias and the AI Alliance.
SYNTH includes 79,648,272 individual text samples, comprising over 41 billion words (about 75 billion tokens with the Pleias tokenizer). It is based on the amplification of 58,698 articles from Wikipedia and was made possible by the Structured Wikipedia dataset from Wikimedia Enterprise.
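To put these figures in proportion, the short sketch below derives rough averages from the numbers quoted above. It is only back-of-the-envelope arithmetic: the word count is stated as "over 41 billion" and the token count as "about 75 billion", so both are treated here as approximations, and the helper name `corpus_ratios` is purely illustrative.

```python
# Figures quoted in the description; the word and token totals are approximate.
NUM_SAMPLES = 79_648_272
NUM_WORDS = 41_000_000_000    # "over 41 billion words", used as a lower bound
NUM_TOKENS = 75_000_000_000   # "about 75 billion tokens" (Pleias tokenizer)
NUM_SEED_ARTICLES = 58_698


def corpus_ratios(samples: int, words: int, tokens: int, articles: int) -> dict:
    """Return rough averages derived from corpus-level totals (hypothetical helper)."""
    return {
        "words_per_sample": words / samples,
        "tokens_per_sample": tokens / samples,
        "tokens_per_word": tokens / words,
        "samples_per_seed_article": samples / articles,
    }


ratios = corpus_ratios(NUM_SAMPLES, NUM_WORDS, NUM_TOKENS, NUM_SEED_ARTICLES)
for name, value in ratios.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions, a sample averages roughly 515 words (about 940 tokens), and each seed Wikipedia article is amplified into roughly 1,357 samples.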
SYNTH differs… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/SYNTH.
Tags
task_categories: text-generation, zero-shot-classification, summarization
language: en, fr, it, es, de, pl, nl, la
license: cdla-permissive-2.0
size_categories: 10M<n<100M
format: parquet
modality: text
library: datasets, dask, polars, mlcroissant
region: us