Instruction Following | SFT, Synthetic Data | Non-Commercial

PersonaHub

by proj-persona

Silver 55
8.7K downloads
722 likes
100M<n<1B

Description

Scaling Synthetic Data Creation with 1,000,000,000 Personas

This repo releases data introduced in our paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas". We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce PERSONA HUB, a collection of 1 billion diverse personas automatically curated from web data.… See the full description on the dataset page: https://huggingface.co/datasets/proj-persona/PersonaHub.
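
The persona-driven idea in the description can be illustrated with a minimal sketch: the same data-creation request is paired with different personas, so each prompt steers the LLM toward a different perspective. The template wording and the helper names `build_persona_prompt` and `build_prompts` are hypothetical, chosen here for illustration; they are not taken from the paper or the dataset, and the sample personas are invented rather than drawn from PERSONA HUB.

```python
from typing import Iterable, List

# Hypothetical template (an assumption, not the paper's actual prompt).
TEMPLATE = (
    "Create {task} that the following persona might pose or find relevant.\n\n"
    "Persona: {persona}"
)


def build_persona_prompt(persona: str, task: str) -> str:
    """Combine one persona description with a data-creation request."""
    persona = persona.strip()
    if not persona:
        raise ValueError("persona must be non-empty")
    return TEMPLATE.format(task=task, persona=persona)


def build_prompts(personas: Iterable[str], task: str) -> List[str]:
    """Build one prompt per persona for the same task."""
    return [build_persona_prompt(p, task) for p in personas]


if __name__ == "__main__":
    # Invented example personas, for illustration only.
    sample_personas = [
        "a pediatric nurse who tracks medication dosages",
        "a chemical kinetics researcher",
    ]
    for prompt in build_prompts(sample_personas, "a challenging math problem"):
        print(prompt)
        print("---")
```

Each resulting string would then be sent to an LLM of your choice; that call is left out because the description does not name a specific model or API.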

Tags

task_categories:text-generation, task_categories:text-classification, task_categories:token-classification, task_categories:fill-mask, task_categories:table-question-answering, language:en, language:zh, license:cc-by-nc-sa-4.0, size_categories:100K<n<1M, format:json, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2406.20094, region:us, synthetic, text, math