Preference & Alignment (DPO/RLHF) · Preference Learning, Reward Modeling, Evol-Instruct · Commercial OK

UltraFeedback

by openbmb

4.1K downloads · 410 likes · 100K<n<1M

Description

Links: Introduction · GitHub Repo · UltraRM-13b · UltraCM-13b

UltraFeedback is a large-scale, fine-grained, diverse preference dataset used for training powerful reward models and critic models. We collect about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN). We then use these prompts to query multiple LLMs (see the model list table on the dataset page) and generate 4 different responses for each prompt, resulting in a total of 256k samples. To… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/UltraFeedback.
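Given the structure described above (one prompt paired with four model responses), here is a minimal sketch of loading and inspecting the data with the Hugging Face datasets library (listed in this card's tags). The field names used below ("instruction", "completions", "model", "response") are assumptions based on the description, not confirmed by this card; check the dataset viewer before relying on them.

```python
from datasets import load_dataset

# Load the raw UltraFeedback dataset from the Hugging Face Hub.
ds = load_dataset("openbmb/UltraFeedback", split="train")

# Each row should pair one prompt with its four model responses.
# Field names here are assumptions; verify them in the dataset viewer.
row = ds[0]
print(row["instruction"])                 # the prompt
for completion in row["completions"]:
    print(completion["model"], completion["response"][:80])
```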

What can I do with this?
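Since the card positions UltraFeedback for reward modeling and DPO/RLHF, one common use is converting each prompt's four responses into chosen/rejected preference pairs. The sketch below assumes each completion carries an overall_score annotation; that field name is an assumption about the schema, so adjust it to whatever the actual annotation fields are.

```python
from datasets import load_dataset

ds = load_dataset("openbmb/UltraFeedback", split="train")

def to_preference_pair(row):
    # Rank the responses by their (assumed) overall_score annotation
    # and keep the best as "chosen", the worst as "rejected".
    ranked = sorted(
        row["completions"],
        key=lambda c: c["overall_score"],
        reverse=True,
    )
    return {
        "prompt": row["instruction"],
        "chosen": ranked[0]["response"],
        "rejected": ranked[-1]["response"],
    }

# Drop rows without at least two responses, then build DPO-style pairs.
pairs = (
    ds.filter(lambda r: len(r["completions"]) >= 2)
      .map(to_preference_pair, remove_columns=ds.column_names)
)
```

The resulting prompt/chosen/rejected triples match the format expected by common DPO trainers (for example, TRL's DPOTrainer), though you should confirm the score field against the real schema before training.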

Tags

task_categories:text-generation · language:en · license:mit · size_categories:10K<n<100K · format:json · modality:text · library:datasets · library:dask · library:mlcroissant · library:polars · arxiv:2310.01377 · region:us