colpali_train_set
by vidore
6.7K downloads
91 likes
Dataset Description
This dataset is the training set of ColPali. It includes 127,460 query-image pairs drawn from openly available academic datasets (63%) and from a synthetic dataset (37%)
made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions.
Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages.
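To make the composition concrete, the stated shares can be turned into approximate pair counts. This is only a sketch: the percentages in the description are rounded, so the results are estimates rather than the exact per-source counts, and the names `TOTAL_PAIRS`, `SHARES`, and `approximate_counts` are illustrative.

```python
# Total number of query-image pairs, as stated in the description.
TOTAL_PAIRS = 127_460

# Rounded shares from the description; exact per-source counts are not given here.
SHARES = {"academic": 0.63, "synthetic": 0.37}


def approximate_counts(total, shares):
    """Estimate per-source pair counts from rounded percentage shares."""
    return {name: round(total * share) for name, share in shares.items()}


counts = approximate_counts(TOTAL_PAIRS, SHARES)
for name, n in counts.items():
    print(f"{name}: ~{n:,} pairs")
```

Under these rounded shares, roughly 80,300 pairs come from academic datasets and roughly 47,160 from the synthetic set.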
Dataset | #examples (query-page pairs) | Language
DocVQA | 39…

See the full description on the dataset page: https://huggingface.co/datasets/vidore/colpali_train_set.
Tags
task_categories:document-question-answering, task_categories:visual-document-retrieval, size_categories:100K<n<1M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2407.01449, region:us
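Since the tags list the `datasets` library and the Parquet format, one plausible way to inspect the data is to stream a single record instead of downloading everything. This is a minimal sketch under assumptions: the split name `"train"` is a guess not confirmed by this page, the helper `peek_first_example` is hypothetical, and no column names are assumed (the function simply reports whatever fields the first record has).

```python
# Repository id taken from the dataset page URL.
REPO_ID = "vidore/colpali_train_set"


def peek_first_example(repo_id=REPO_ID, split="train"):
    """Stream one record and return its field names.

    Assumption: a split called "train" exists; adjust if the dataset
    page lists different split names.
    """
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    # streaming=True iterates over the remote files without a full download.
    stream = load_dataset(repo_id, split=split, streaming=True)
    first = next(iter(stream))
    return sorted(first.keys())


# Example call (requires network access and the `datasets` package):
# print(peek_first_example())
```

Streaming is used here because the dataset falls in the 100K to 1M size category and contains images, so a full download may be large.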