Instruction Following | SFT, Synthetic Data | Unknown

SmolTalk

by HuggingFaceTB

Silver 54
13.5K downloads
395 likes
1M<n<10M

Description

This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build the SmolLM2-Instruct family of models and contains 1M samples. More details are in our paper: https://arxiv.org/abs/2502.02737

During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed models trained on proprietary instruction datasets. To address this gap, we created new synthetic datasets…

See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk
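Since the tags list `library:datasets`, one plausible way to pull the data is through the Hugging Face `datasets` library. The sketch below is a minimal illustration, not the authors' own loading code. The repository ID comes from the dataset page URL; the config name `"all"`, the `messages` column, and the `role`/`content` keys are assumptions about the dataset layout and should be checked against the dataset page. `load_smoltalk` and `render_chat` are hypothetical helper names.

```python
from typing import Dict, List

# Repository ID taken from the dataset page URL.
DATASET_ID = "HuggingFaceTB/smoltalk"


def load_smoltalk(config: str = "all", split: str = "train", streaming: bool = True):
    """Load SmolTalk with the `datasets` library.

    Assumption: a config named "all" exists; verify on the dataset page.
    Streaming avoids downloading all ~1M samples up front.
    """
    # Imported lazily so the rest of this sketch works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, config, split=split, streaming=streaming)


def render_chat(messages: List[Dict[str, str]]) -> str:
    """Flatten a conversation into plain text, one "role: content" line per turn.

    Assumption: each turn is a dict with "role" and "content" keys.
    """
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```

With network access, `next(iter(load_smoltalk()))` would return the first record; if the layout assumption holds, passing its `messages` field to `render_chat` gives a quick way to eyeball a conversation before building an SFT pipeline around it.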


Tags

language:en, size_categories:1M<n<10M, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2502.02737, region:us, synthetic