Preference & Alignment (DPO/RLHF) · Preference Learning, Reward Modeling, Evol-Instruct · Commercial OK

UltraFeedback

by openbmb

4.1K downloads · 410 likes · 100K<n<1M

Description

Links: Introduction · GitHub Repo · UltraRM-13b · UltraCM-13b

UltraFeedback is a large-scale, fine-grained, diverse preference dataset used for training powerful reward models and critic models. We collect about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN). We then use these prompts to query multiple LLMs (see the model list table on the dataset page) and generate 4 different responses for each prompt, resulting in a total of 256k samples. To… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/UltraFeedback.
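Given the structure described above (one prompt paired with four model responses), here is a minimal sketch of loading and inspecting the data with the Hugging Face datasets library (listed in this card's tags). The field names used below ("instruction", "completions", "model", "response") are assumptions based on the description, not confirmed by this card; check the dataset viewer before relying on them.

```python
from datasets import load_dataset

# Load the raw UltraFeedback dataset from the Hugging Face Hub.
ds = load_dataset("openbmb/UltraFeedback", split="train")

# Each row should pair one prompt with its four model responses.
# Field names here are assumptions; verify them in the dataset viewer.
row = ds[0]
print(row["instruction"])                 # the prompt
for completion in row["completions"]:
    print(completion["model"], completion["response"][:80])
```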

What can I do with this?
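Since the card positions UltraFeedback for reward modeling and DPO/RLHF, one common use is converting each prompt's four responses into chosen/rejected preference pairs. The sketch below assumes each completion carries an overall_score annotation; that field name is an assumption about the schema, so adjust it to whatever the actual annotation fields are.

```python
from datasets import load_dataset

ds = load_dataset("openbmb/UltraFeedback", split="train")

def to_preference_pair(row):
    # Rank the responses by their (assumed) overall_score annotation
    # and keep the best as "chosen", the worst as "rejected".
    ranked = sorted(
        row["completions"],
        key=lambda c: c["overall_score"],
        reverse=True,
    )
    return {
        "prompt": row["instruction"],
        "chosen": ranked[0]["response"],
        "rejected": ranked[-1]["response"],
    }

# Drop rows without at least two responses, then build DPO-style pairs.
pairs = (
    ds.filter(lambda r: len(r["completions"]) >= 2)
      .map(to_preference_pair, remove_columns=ds.column_names)
)
```

The resulting prompt/chosen/rejected triples match the format expected by common DPO trainers (for example, TRL's DPOTrainer), though you should confirm the score field against the real schema before training.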

Tags

task_categories:text-generation · language:en · license:mit · size_categories:10K<n<100K · format:json · modality:text · library:datasets · library:dask · library:mlcroissant · library:polars · arxiv:2310.01377 · region:us