Text Generation & ChatSynthetic DataOther

TinyStories

by roneneldan

Silver61
95.0Kdownloads
932likes

Description

Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.

What can I do with this?

Tags

task_categories:text-generationlanguage:enlicense:cdla-sharing-1.0size_categories:1M<n<10Mformat:parquetmodality:textlibrary:datasetslibrary:dasklibrary:polarslibrary:mlcroissantarxiv:2305.07759region:us