Structured Data | Pretraining, Synthetic Data | Commercial OK
smollm-corpus
by HuggingFaceTB
39.0K downloads
447 likes
Description
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models.
You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
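A minimal sketch of how a few records from one subset might be inspected with the Hugging Face datasets library, using streaming so the full corpus is not downloaded. The repository ID comes from the dataset page URL above. The subset name "cosmopedia-v2" and the split name "train" are assumptions not stated here and should be checked against the dataset page. The helpers build_load_kwargs and peek are hypothetical names introduced only for this illustration.

```python
from itertools import islice
from typing import Any, Dict, List

# Repository ID taken from the dataset page URL.
REPO_ID = "HuggingFaceTB/smollm-corpus"


def build_load_kwargs(subset: str, split: str = "train",
                      streaming: bool = True) -> Dict[str, Any]:
    """Assemble keyword arguments for datasets.load_dataset.

    `subset` is the configuration name; the default split "train"
    is an assumption, not something this page confirms.
    """
    return {
        "path": REPO_ID,
        "name": subset,
        "split": split,
        "streaming": streaming,
    }


def peek(subset: str, n: int = 3) -> List[dict]:
    """Return the first n records of a subset without a full download."""
    # Imported lazily so the module loads even if `datasets` is not
    # installed; install it with `pip install datasets`.
    from datasets import load_dataset

    # With streaming=True and a split given, load_dataset returns an
    # iterable dataset that fetches records on demand.
    stream = load_dataset(**build_load_kwargs(subset))
    return list(islice(stream, n))
```

Calling peek("cosmopedia-v2") would then return a short list of records whose keys show which columns the subset provides, assuming that configuration name exists. Streaming is a sensible default for a corpus in the 100M<n<1B size category, since a full download is large.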
Tags
language:en, license:odc-by, size_categories:100M<n<1B, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us