Structured Data | Pretraining, Synthetic Data | Commercial OK

smollm-corpus

by HuggingFaceTB

Silver 57
39.0K downloads
447 likes

Description

SmolLM-Corpus

This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

Dataset subsets

Cosmopedia v2

Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by…

See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

What can I do with this?
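
One likely starting point, given the library:datasets and format:parquet tags, is to stream a subset and look at a few rows before committing to a full download. The sketch below is illustrative only: the repository ID comes from the dataset page URL, but the subset name "cosmopedia-v2" is an assumption about how the Cosmopedia v2 subset is exposed, and preview and stream_subset are hypothetical helper names, not part of any library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

REPO_ID = "HuggingFaceTB/smollm-corpus"  # taken from the dataset page URL
SUBSET = "cosmopedia-v2"  # assumption: config name of the Cosmopedia v2 subset


def preview(rows: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records from any iterable of row dictionaries."""
    return list(islice(rows, n))


def stream_subset(repo_id: str = REPO_ID, subset: str = SUBSET):
    """Open one subset in streaming mode so rows are fetched lazily."""
    # Imported lazily so that preview() works without the library installed.
    # Requires `pip install datasets` and network access when called.
    from datasets import load_dataset

    # Assumption: the subset has a "train" split.
    return load_dataset(repo_id, subset, split="train", streaming=True)
```

Calling preview(stream_subset()) would then fetch only the first three rows. The description does not list column names, so inspect the keys of each returned row rather than assuming a particular field exists.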

Tags

language:en, license:odc-by, size_categories:100M<n<1B, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us