Instruction FollowingSFT, Synthetic DataUnknown
SmolTalk
by HuggingFaceTB
13.5Kdownloads
395likes
1M<n<10MDescription
SmolTalk
Dataset description
This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build SmolLM2-Instruct family of models and contains 1M samples. More details in our paper https://arxiv.org/abs/2502.02737
During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. To address this gap, we created new synthetic datasets… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.
What can I do with this?
Tags
language:ensize_categories:1M<n<10Mformat:parquetmodality:tabularmodality:textlibrary:datasetslibrary:dasklibrary:mlcroissantlibrary:polarsarxiv:2502.02737region:ussynthetic