
colpali_train_set

by vidore

6.7K downloads
91 likes

Description

This dataset is the training set of ColPali. It includes 127,460 query-image pairs drawn from two sources: openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). The training set is fully English by design, which enables us to study zero-shot generalization to non-English languages.

The per-dataset breakdown (columns: Dataset, #examples (query-page pairs), Language) is truncated in this listing: DocVQA 39…

See the full description on the dataset page: https://huggingface.co/datasets/vidore/colpali_train_set.
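As a rough illustration of the composition described above, the sketch below turns the stated percentages into approximate pair counts and shows one way the data might be streamed with the Hugging Face `datasets` library. The helper names (`approx_counts`, `load_train_stream`) are hypothetical, the split name `"train"` is an assumption not stated in this listing, and the counts are only approximate because the published shares are rounded.

```python
TOTAL_PAIRS = 127_460  # total query-image pairs, from the description


def approx_counts(total, shares):
    """Return rounded pair counts per source, given percentage shares."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


def load_train_stream(repo_id="vidore/colpali_train_set", split="train"):
    """Stream the dataset instead of downloading it in full.

    Assumptions: the `datasets` package is installed, network access is
    available, and the repository exposes a split named "train".
    """
    # Imported lazily so the arithmetic above runs without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


if __name__ == "__main__":
    # Approximate sizes implied by the 63% / 37% split in the description.
    print(approx_counts(TOTAL_PAIRS, {"academic": 63, "synthetic": 37}))
```

Calling `load_train_stream()` returns an iterable dataset; inspecting the first record with `next(iter(...))` is a cheap way to check the actual column names before committing to a full download, since this listing does not show them.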


Tags

task_categories:document-question-answering, task_categories:visual-document-retrieval, size_categories:100K<n<1M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2407.01449, region:us