Instruction Following | SFT, Synthetic Data | Unknown

SmolTalk

by HuggingFaceTB

Silver 54

13.5K downloads

395 likes

1M<n<10M

Description

SmolTalk is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build the SmolLM2-Instruct family of models and contains 1M samples. More details are in our paper (https://arxiv.org/abs/2502.02737). During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed models finetuned on proprietary instruction datasets. To address this gap, we created new synthetic datasets… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.
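As a rough illustration of how a dataset like this might be inspected before finetuning, here is a minimal Python sketch. It relies on assumptions the description above does not state: that the dataset is loaded with the `datasets` library, that a configuration named `all` and a split named `train` exist, and that each sample carries a `messages` list of `role`/`content` pairs. The helper names `render_conversation` and `load_smoltalk` are hypothetical.

```python
from typing import Dict, List

# Assumed schema (not stated in the description): each sample holds a
# "messages" list of {"role": ..., "content": ...} dictionaries.
Message = Dict[str, str]


def render_conversation(messages: List[Message]) -> str:
    """Flatten one chat sample into a plain-text transcript for inspection."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def load_smoltalk(config: str = "all", split: str = "train"):
    """Stream the dataset from the Hugging Face Hub.

    The config and split names are assumptions; check the dataset page
    for the names that are actually available.
    """
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceTB/smoltalk", config, split=split, streaming=True
    )
```

If those assumptions hold, `next(iter(load_smoltalk()))` would yield the first sample, and passing its `messages` field to `render_conversation` would print it as a readable transcript. Streaming avoids downloading all 1M samples just to look at a few.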
