Code, Synthetic Data, Commercial OK

CoSyn-400K

by allenai

Bronze 44
2.0K downloads
47 likes

Description

CoSyn-400k is a collection of synthetic question-answer pairs about a very diverse range of computer-generated images. The data was created by using the Claude large language model to generate code that can be executed to render an image, and by using GPT-4o mini to generate Q/A pairs based on the code (without using the rendered image). The code used to generate this data is open source. Synthetic pointing data is available in a separate repo. Quick links: 📃 CoSyn… See the full description on the dataset page: https://huggingface.co/datasets/allenai/CoSyn-400K.
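Since the tags list the Hugging Face `datasets` library and the Parquet format, a first look at the data might go roughly as follows. This is only a sketch: the repository id is taken from the dataset page URL above, the helper name `peek_first_example` is made up for illustration, and the existence of a `train` split is an assumption that should be checked against the dataset page.

```python
REPO_ID = "allenai/CoSyn-400K"  # repository id from the dataset page URL


def peek_first_example(repo_id: str = REPO_ID, split: str = "train"):
    """Return the configuration names and one streamed example.

    Hypothetical helper for illustration; it is not part of the dataset
    or of the `datasets` library.
    """
    # Imported lazily so this snippet can be imported without `datasets`
    # installed; calling the function requires it and network access.
    from datasets import get_dataset_config_names, load_dataset

    # Ask the Hub which configurations (subsets) the repository defines
    # instead of hard-coding their names.
    configs = get_dataset_config_names(repo_id)

    # Assumption: the first configuration has a split called `split`.
    # streaming=True avoids downloading the whole subset up front.
    stream = load_dataset(repo_id, configs[0], split=split, streaming=True)

    # Pull a single record to inspect its fields.
    example = next(iter(stream))
    return configs, example
```

Calling `peek_first_example()` returns the list of configurations and one record as a dictionary; printing `sorted(example.keys())` shows which fields are actually present, which is safer than assuming column names.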


Tags

task_categories:visual-question-answering
license:odc-by
size_categories:100K<n<1M
format:parquet
modality:image
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2502.14846
arxiv:2409.17146
region:us