colpali_train_set

by vidore

6.7K downloads

91 likes

Description

This dataset is the training set of ColPali. It includes 127,460 query-image pairs drawn from two sources: openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages.

Per-source breakdown (Dataset, #examples (query-page pairs), Language): DocVQA 39…

See the full description on the dataset page: https://huggingface.co/datasets/vidore/colpali_train_set.
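As a rough illustration of the figures in the description, the sketch below converts the rounded 63% / 37% shares into approximate pair counts and shows one way the set might be loaded with the Hugging Face `datasets` library. The repository id is taken from the dataset page URL. The helper names (`approximate_counts`, `load_colpali_train`) are hypothetical, and the split name `"train"` is an assumption, not something the description confirms. Because the percentages are rounded, the counts are estimates only.

```python
# Figures quoted in the dataset description.
TOTAL_PAIRS = 127_460
ACADEMIC_SHARE = 0.63  # openly available academic datasets (rounded share)


def approximate_counts(total: int = TOTAL_PAIRS) -> dict[str, int]:
    """Turn the rounded percentage split into approximate pair counts."""
    academic = round(total * ACADEMIC_SHARE)
    # The synthetic share is the remainder, so the two counts sum to `total`.
    return {"academic": academic, "synthetic": total - academic}


def load_colpali_train(split: str = "train"):
    """Load the dataset from the Hugging Face Hub (requires network access).

    The split name "train" is an assumption and may need adjusting.
    """
    # Imported lazily so the arithmetic above works without the library.
    from datasets import load_dataset

    return load_dataset("vidore/colpali_train_set", split=split)


if __name__ == "__main__":
    print(approximate_counts())
```

Calling `approximate_counts()` gives roughly 80,300 academic and 47,160 synthetic pairs. The exact per-source numbers are in the table on the dataset page.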
