Text Generation & ChatSynthetic DataOther
TinyStories
by roneneldan
95.0Kdownloads
932likes
Description
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary.
Described in the following paper: https://arxiv.org/abs/2305.07759.
The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M.
Additional resources:
tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
What can I do with this?
Tags
task_categories:text-generationlanguage:enlicense:cdla-sharing-1.0size_categories:1M<n<10Mformat:parquetmodality:textlibrary:datasetslibrary:dasklibrary:polarslibrary:mlcroissantarxiv:2305.07759region:us