UltraChat 200k
by HuggingFaceH4
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model.
The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following steps:
Selection of a subset of the data for faster supervised fine-tuning.
Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
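As a minimal sketch of how such dialogue data is typically structured, the snippet below builds a single record in an assumed schema (a prompt plus a multi-turn `messages` list of role/content dicts, as commonly used for chat fine-tuning) and flattens it into a plain-text transcript. The field names and the helper `to_chat_text` are illustrative assumptions, not the dataset's documented API; check the dataset page for the actual columns and split names.

```python
# Illustrative sketch: an assumed record layout for a chat SFT dataset.
# Field names ("prompt", "messages", "role", "content") are assumptions
# based on common chat-dataset conventions, not a documented schema.

def to_chat_text(messages):
    """Flatten a messages list into a simple plain-text transcript."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

# A locally constructed example record in the assumed schema.
example = {
    "prompt": "What is supervised fine-tuning?",
    "messages": [
        {"role": "user", "content": "What is supervised fine-tuning?"},
        {"role": "assistant", "content": "Training a model on labeled prompt-response pairs."},
    ],
}

print(to_chat_text(example["messages"]))
```

To work with the real data, loading it via the `datasets` library with `load_dataset("HuggingFaceH4/ultrachat_200k", split=...)` should apply, with the split names listed on the dataset page.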
Tags
task_categories: text-generation · language: en · license: mit · size_categories: 100K<n<1M · format: parquet · modality: text · libraries: datasets, dask, mlcroissant, polars · arxiv: 2305.14233 · region: us