Most Used Datasets

The training datasets referenced by the most fine-tuned models in our catalog. The backbone of the fine-tuning ecosystem.

Last updated April 3, 2026 · Updated daily

wty-release by daxida is listed at #1 with 0 models, level with arxiv-papers-by-subject, which also has 0.

The top 10 is led by daxida, permutans, and Trelis. This is the first snapshot; future updates will track position changes and emerging trends.

#1 and #193 both show 0 models, so this snapshot does not yet separate the field by model count.
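As a rough illustration of what "referenced by the most fine-tuned models" could mean in practice, the sketch below counts how many models list each dataset and assigns shared positions to ties. The input shape (`model_datasets`), the function name, and the dense-ranking rule are assumptions made for this example; the catalog's actual method is not described on this page.

```python
from collections import Counter


def rank_datasets(model_datasets):
    """Rank datasets by how many models list them as training data.

    model_datasets maps a model id to the dataset ids it references.
    This input shape is hypothetical, not the catalog's real schema.
    """
    counts = Counter()
    for datasets in model_datasets.values():
        # Count each dataset once per model, even if it is listed twice.
        counts.update(set(datasets))

    # Sort by descending count, then by name for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    ranked = []
    rank = 0
    previous = None
    for name, n in ordered:
        if n != previous:
            rank += 1  # dense ranking: tied datasets share a position
            previous = n
        ranked.append((rank, name, n))
    return ranked


# Made-up model ids, used only to exercise the function.
example = {
    "model-a": ["tiny-shakespeare", "alpaca_en"],
    "model-b": ["alpaca_en"],
    "model-c": ["proof-pile-2"],
}
ranking = rank_datasets(example)
```

Under this scheme, datasets with equal counts share a rank, and a dataset that no model references never enters the counter at all.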

🥇new

wty-release

Bronze41

daxida · Code

⚠️ This dataset is automatically uploaded. For source code and issue tracking, visit the GitHub repo at wty version: 2026-04-03 commit: bee556d logs: link

0

models

🥈new

arxiv-papers-by-subject

Bronze43

permutans · Code

arXiv Papers by Subject A reorganised version of the nick007x/arxiv-papers dataset, partitioned by subject code, year, and month for efficient selective access. Dataset Description This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset. Motivation The original nick007x/arxiv-papers… See the full description on the dataset page: https://huggingface.co/datasets/permutans/arxiv-papers-by-subject.

0

models

🥈new

tiny-shakespeare

Bronze42

Trelis · Text Generation & Chat

Data source Downloaded via Andrej Karpathy's nanogpt repo from this link Data Format The entire dataset is split into train (90%) and test (10%). All rows are at most 1024 tokens, using the Llama 2 tokenizer. All rows are split cleanly so that sentences are whole and unbroken.

0

models
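The tiny-shakespeare card above describes rows capped at 1024 tokens, cut on sentence boundaries, with a 90/10 train/test split. A minimal sketch of that kind of preparation follows. It is not the dataset author's script: the function names are hypothetical, and a whitespace word count stands in for the Llama 2 tokenizer so the example runs without extra dependencies.

```python
import re


def whitespace_token_count(text):
    """Stand-in token counter: whitespace words, NOT the Llama 2 tokenizer."""
    return len(text.split())


def pack_sentences(text, max_tokens=1024, count_tokens=whitespace_token_count):
    """Group whole sentences into rows of at most max_tokens tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rows, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and used + n > max_tokens:
            # Close the current row before it would exceed the cap.
            rows.append(" ".join(current))
            current, used = [], 0
        # A single sentence longer than max_tokens still gets its own row.
        current.append(sentence)
        used += n
    if current:
        rows.append(" ".join(current))
    return rows


def split_rows(rows, train_fraction=0.9):
    """Split rows into train and test by position."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


text = "To be or not to be. That is the question. " * 10
rows = pack_sentences(text, max_tokens=10)
train, test = split_rows(rows)
```

Swapping in a real tokenizer only requires passing a different `count_tokens` callable; the packing logic is unchanged.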

🥉new

Knesset (Israeli Parliament) Proceedings Corpus

Bronze41

GiliGold · Classification & Sentiment

For The Knesset Corpus: [The Knesset Corpus]

0

models

🥉new

trading-chart-patterns

New

diamond-in · Finance

0

models

4new

H-WBC

New

2Nitrogen · Uncategorized

0

models

4new

movie-v16

Bronze27

ducanhh55 · Uncategorized

0

models

5new

FineFineWeb

Silver64

m-a-p · Classification & Sentiment

FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus arXiv: Coming Soon Project Page: Coming Soon Blog: Coming Soon Data Statistics Domain (#tokens/#samples) Iteration 1 Tokens Iteration 2 Tokens Iteration 3 Tokens Total Tokens Iteration 1 Count Iteration 2 Count Iteration 3 Count Total Count aerospace 5.77B 261.63M 309.33M 6.34B 9100000 688505 611034 10399539 agronomy 13.08B 947.41M 229.04M 14.26B 15752828 2711790 649404 19114022 artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.

0

models

5new

scirepeval

Bronze27

allenai · Benchmarks & Evaluation

0

models

6new

nguyenvanthanh2004

Bronze25

nguyenvanthanh2004 · Uncategorized

0

models

6new

aguvis-stage2

Bronze44

xlangai · Code

AGUVIS Collection This is the AGUVIS collection stage 2 for computer/mobile/desktop trajectory training. Dataset Details Project Page: https://aguvis-project.github.io Repository: https://github.com/xlang-ai/aguvis Paper : https://huggingface.co/papers/2412.04454 Citation BibTeX: @article{xu2024aguvis, title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction}, author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/aguvis-stage2.

0

models

7new

b1fd4aab

New

dude-os · Uncategorized

0

models

7new

images

Bronze27

deepinv · Image Recognition

0

models

8new

gsm8k_sycophancy

New

praneethd7 · Math & Reasoning

0

models

8new

DataCompDR-1B

Silver50

apple · Code

Dataset Card for DataCompDR-1B This dataset contains synthetic captions, embeddings, and metadata for DataCompDR-1B. The metadata has been generated using pretrained image-text models on DataComp-1B. For details on how to use the metadata, please visit our github repository. Dataset Details Dataset Description DataCompDR is an image-text dataset and an enhancement to the DataComp dataset. We reinforce the DataComp dataset using our multi-modal dataset… See the full description on the dataset page: https://huggingface.co/datasets/apple/DataCompDR-1B.

0

models

9new

MELD

Bronze27

ymw-hnu · Uncategorized

0

models

9new

svg2_0.85

New

krishagarwal · Video

0

models

10new

ngothithao1984

Bronze25

ngothithao1984 · Uncategorized

0

models

10new

lerobot_rlbench_rvt_vcl_all_variations

New

vrlfdvla · Structured Data

0

models

11new

groot-robocasa-300

Bronze39

RoMALab · Code

This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v3.0", "robot_type": "robocasa", "total_episodes": 7200, "total_frames": 2066059, "total_tasks": 24, "chunks_size": 1000, "data_files_size_in_mb": 100, "video_files_size_in_mb": 500, "fps": 20, "splits": { "train": "0:7200"}, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/RoMALab/groot-robocasa-300.

0

models
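The `data_path` entry quoted from `meta/info.json` in the groot-robocasa-300 card is a Python format template. The sketch below expands it with the standard `str.format` call. The helper name `data_file` is made up for this example, and which chunk and file indices actually exist in the repository is not stated in the card.

```python
# Template copied from the data_path field quoted in the card.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"


def data_file(chunk_index, file_index):
    """Expand the data_path template into a relative file path."""
    return DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)


# The index values are illustrative only.
first = data_file(0, 0)
later = data_file(2, 15)
```

The `:03d` format spec zero-pads each index to three digits, so paths sort lexicographically in the same order as their indices.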

11new

TexVerse

Bronze49

YiboZhang2001 · Role-Play & Characters

TexVerse: A Universe of 3D Objects with High-Resolution Textures Yibo Zhang1,2, Li Zhang1,3, Rui Ma2 *, Nan Cao1,4 1Shanghai Innovation Institute 2Jilin University 3Fudan University 4Tongji University * Corresponding Author TexVerse is a large-scale 3D dataset featuring high-resolution textures. Its key characteristics include: Scale & Source: TexVerse dataset has 858,669 unique 3D models curated from Sketchfab, including 158,518… See the full description on the dataset page: https://huggingface.co/datasets/YiboZhang2001/TexVerse.

0

models

12new

endless-terminals

Bronze41

obiwan96 · Uncategorized

This dataset is released with the Endless Terminals paper. There are about 2500 synthetic terminal-use environments in this dataset. license: mit

0

models

12new

colpali_train_set

Bronze49

vidore · Benchmarks & Evaluation

Dataset Description This dataset is the training set of ColPali. It includes 127,460 query-image pairs from both openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. Dataset #examples (query-page pairs) Language DocVQA 39… See the full description on the dataset page: https://huggingface.co/datasets/vidore/colpali_train_set.

0

models

13new

ShareGPT_Vicuna_unfiltered

Silver62

anon8231489123 · Uncensored

Further cleaning done. Please look through the dataset and ensure that I didn't miss anything. Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c Two choices: Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.

0

models

13new

test_dataset

New

mosaicml · Uncategorized

0

models

14new

ForeHOI

Bronze40

YuantaoChen · Code

Paper: https://arxiv.org/abs/2602.06226 Project: https://tao-11-chen.github.io/project_pages/ForeHOI/ Github: https://github.com/Tao-11-chen/ForeHOI/ Please refer to example_loader.py for data usage

0

models

14new

PND_Adam-U_pick-simple_speed2x

Bronze37

BeingBeyond · Robotics

This data is part of the training data for Being-H0.5, produced by BeingBeyond. License This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Citation Being-H0.5 @article{beingbeyond2026beingh05, title={Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization}, author={Luo, Hao and Wang, Ye and Zhang, Wanpeng and Zheng, Sipeng and Xi, Ziheng and Xu, Chaoyi and Xu, Haiweng and Yuan, Haoqi and… See the full description on the dataset page: https://huggingface.co/datasets/BeingBeyond/PND_Adam-U_pick-simple_speed2x.

0

models

15new

SolArchive.org Solana Datasets

Bronze38

solarchive · Finance

solarchive.org: Solana Blockchain Datasets A clean, long-term, public archive of Solana blockchain data. This dataset contains a complete historical archive of Solana blockchain transactions, accounts, and tokens, sourced from Google BigQuery's public Solana dataset and optimized for analysis. 🎯 What is this? Solarchive is a free, public archive of the entire Solana blockchain, designed for: 🔬 Researchers analyzing blockchain behavior and patterns 📊 Data scientists… See the full description on the dataset page: https://huggingface.co/datasets/solarchive/solarchive.

0

models

16new

TextPecker-1.5M

Bronze41

CIawevy · Code

TextPecker-1.5M: A Dataset for Training and evaluating TextPecker This repository contains the TextPecker-1.5M dataset, a new benchmark proposed in the paper "TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering". Code and Project Page The official implementation and project details for the TextPecker and TextPecker-1.5M dataset can be found on the GitHub repository: https://github.com/CIawevy/TextPecker Sample Usage You… See the full description on the dataset page: https://huggingface.co/datasets/CIawevy/TextPecker-1.5M.

0

models

16new

ATM-Bench

Bronze39

Jingbiao · Creative Writing

ATM-Bench: Long-Term Personalized Referential Memory QA ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA over long time horizons (~4 years) with evidence-grounded retrieval and answering. Paper: According to Me: Long-Term Personalized Referential Memory QA Overview Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience.… See the full description on the dataset page: https://huggingface.co/datasets/Jingbiao/ATM-Bench.

0

models

17new

TongSIM-Asset

Bronze49

bigai · Code

TongSIM GitHub Visit Github : https://github.com/bigai-ai/tongsim What is TongSIM-Asset? As artificial intelligence (AI) rapidly advances, especially in multimodal large language models, research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action… See the full description on the dataset page: https://huggingface.co/datasets/bigai/TongSIM-Asset.

0

models

17new

InternData-A1

Silver53

InternRobotics · Robotics

InternData-A1 InternData-A1 is a hybrid synthetic-real manipulation dataset containing over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation.… See the full description on the dataset page: https://huggingface.co/datasets/InternRobotics/InternData-A1.

0

models

18new

rbm-1m-ood-full

Bronze33

aliangdw · Benchmarks & Evaluation

RBM-1M-OOD evaluation dataset used in Robometer. It contains over 1k trajectories used for evaluation of general-purpose reward models. Dataset Description Official evaluation in the paper uses only these 6 data sources: usc_trossen, mit_franka, utd_so101, usc_xarm, usc_franka, usc_koch. Reported benchmarks and metrics in the paper are computed on this subset. The repository may also include trajectories from additional data sources (e.g. utd_so101_wrist, usc_koch_paired… See the full description on the dataset page: https://huggingface.co/datasets/aliangdw/rbm-1m-ood-full.

0

models

19new

course-assets

Bronze34

hf-vision · Image Recognition

0

models

19new

G_CACHE_1

New

sumith2425 · Uncategorized

0

models

20new

OceanTACO

Bronze44

nilsleh · Structured Data

Dataset Card: OceanTACO Dataset Summary This dataset is a multi-source collection of global ocean sea surface measurements, integrating numerical model reanalysis, L4 gap-filled products, L3 satellite observations, and in-situ data. The collection includes sea surface height (SSH), temperature (SST), salinity (SSS), wind speed, and other variables. The L3 SWOT data has been processed onto a consistent regular grid through irreversible interpolation and coordinate… See the full description on the dataset page: https://huggingface.co/datasets/nilsleh/OceanTACO.

0

models

20new

audio-files

New

nader33 · Uncategorized

0

models

21new

sea-commoncrawl

New

sailor2 · Text - General

0

models

21new

Nemotron-CC-v2

Silver56

nvidia · Code

Nemotron-Pre-Training-Dataset-v1 Release Data Overview This pretraining dataset, for generative AI model training, preserves high-value math and code while enriching it with diverse multilingual Q&A, fueling the next generation of intelligent, globally-capable models. This dataset supports NVIDIA Nemotron Nano 2, a family of large language models (LLMs) that consists of the NVIDIA-Nemotron-Nano-9B-v2, NVIDIA-Nemotron-Nano-9B-v2-Base, and NVIDIA-Nemotron-Nano-12B-v2-Base… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.

0

models

22new

openvalidators

Bronze47

opentensor · Uncategorized

Dataset Card for Openvalidators dataset Dataset Summary The OpenValidators dataset, created by the OpenTensor Foundation, is a continuously growing collection of data generated by the OpenValidators project in W&B. It contains millions of records and serves researchers, data scientists, and miners in the Bittensor network. The dataset provides information on network performance, node behaviors, and wandb run details. Researchers can gain insights and detect patterns… See the full description on the dataset page: https://huggingface.co/datasets/opentensor/openvalidators.

0

models

22new

SpreadsheetBench

Bronze40

KAKA22 · Code

SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation | Paper | Github | Homepage | We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered… See the full description on the dataset page: https://huggingface.co/datasets/KAKA22/SpreadsheetBench.

0

models

23new

toxic_conversations_50k

Bronze43

mteb · Code

ToxicConversationsClassification An MTEB dataset Massive Text Embedding Benchmark Collection of comments from the Civil Comments platform together with annotations if the comment is toxic or not. Task category t2c Domains Social, Written Reference https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/overview How to evaluate on this task You can evaluate an embedding model on this dataset using the following code: import… See the full description on the dataset page: https://huggingface.co/datasets/mteb/toxic_conversations_50k.

0

models

23new

sdxl-models

Bronze43

Aisha-AI-Official · Role-Play & Characters

Aisha-AI.com 💜 A NSFW Social Network powered by AI Characters The models saved in this dataset are currently being used, or have been used at some point, to generate images and videos. The dataset is public and can be used as a backup or alternative to more unstable servers (like the unfortunate Civitai).

0

models

24new

graphrl-spatial-gym-3iter

New

yw12356 · Image Recognition

0

models

24new

PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams

Silver50

nvidia · Code

PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams Paper | Paper Website | GitHub Download We provide a download script to download our dataset. If you have enough space, you can use git to download a dataset from huggingface. usage: download.py [-h] --odir ODIR [--file_types {hdmap,lidar,synthetic}[,…]] [--workers N] [--clean_cache] required arguments: --odir ODIR Output… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams.

0

models

25new

apple_music_v5_10w_2

Bronze27

TUBGX · Uncategorized

0

models

25new

agieval-sat-en

Bronze35

hails · Code

Dataset Card for "agieval-sat-en" Dataset taken from https://github.com/microsoft/AGIEval and processed as in that repo, following dmayhem93/agieval-* datasets on the HF hub. This dataset contains the contents of the SAT-en subtask of AGIEval, as accessed in https://github.com/ruixiangcui/AGIEval/commit/5c77d073fda993f1652eaae3cf5d04cc5fd21d40 . Citation: @misc {zhong2023agieval, title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models}, author={Wanjun Zhong and… See the full description on the dataset page: https://huggingface.co/datasets/hails/agieval-sat-en.

0

models

26new

PolyCAT: Polygon-Aperture Eye-Tracking Dataset

Bronze37

laBran · Image Recognition

PolyCAT: Polygon-Aperture Eye-Tracking Dataset PolyCAT is a public eye-tracking dataset for studying visual attention on natural images viewed through irregular polygon apertures. The dataset is designed to support saliency prediction, gaze modeling, and research on how geometric viewing constraints affect visual exploration strategies. Recording Details Eye tracker: EyeLink 1000+ (SR Research), head-mounted, binocular, 500 Hz per eye Display: 27" 4K monitor (3840 x 2160… See the full description on the dataset page: https://huggingface.co/datasets/laBran/PolyCAT.

0

models

26new

Prophet's Mosque Library

Bronze47

ieasybooks-org · Vision-Language

Prophet's Mosque Library 📖 Overview Prophet’s Mosque Library is one of the primary resources for Islamic books. It hosts more than 48,000 PDF books across over 70 categories. In this dataset, we processed the original PDF files using Google Document AI APIs and extracted their contents into two additional formats: TXT and DOCX. 📊 Dataset Contents The dataset includes 70,884 PDF files (spanning 23,494,042 pages) representing 48,717 Islamic books. Each book is… See the full description on the dataset page: https://huggingface.co/datasets/ieasybooks-org/prophet-mosque-library.

0

models

27new

Kai0

Bronze42

ts-learn · Robotics

KAI0 TODO The advantage label will be coming soon. Contents About the Dataset Load the Dataset Download the Dataset Dataset Structure Folder hierarchy Details License and Citation About the Dataset ~134 hours real world scenarios Main Tasks FlattenFold Single task Initial state: T-shirts are randomly tossed onto the table, presenting random crumpled configurations Manipulation task: Operate the robotic arm to… See the full description on the dataset page: https://huggingface.co/datasets/ts-learn/Kai0.

0

models

27new

proof-pile-2

Silver52

EleutherAI · Math & Reasoning

A dataset of high quality mathematical text.

0

models

28new

Cantone

Bronze45

AlienKevin · Speech & Audio

Cantone A dataset of 34,489 recordings of Cantonese syllables by 10 speakers. Those syllables are generated through the Cantonese speech synthesis engines of Amazon, Apple, Google, and Microsoft. All recordings are stored as WAV files with the following format Channel: mono Sample rate: 16 kHz Bits per sample: 16 Here's a breakdown of the number of recordings under each speaker: Company Speaker # Syllables Amazon Hiujin 3,885 Apple Aasing 2,977 Apple Sinji 2,977… See the full description on the dataset page: https://huggingface.co/datasets/AlienKevin/cantone.

0

models

28new

RAVine-logs

Bronze42

sapphirex · Creative Writing

RAVine-logs This repository contains the running logs of the experiments conducted in the paper RAVine: Reality-Aligned Evaluation for Agentic Search. These logs can be used for result reproduction or detailed case analysis of agentic LLMs with search performance. RAVine is a comprehensive evaluation system for agentic search, encompassing the web environment, benchmark datasets, and a novel evaluation method, serving as a full-process, reproducible, and goal-aligned evaluation… See the full description on the dataset page: https://huggingface.co/datasets/sapphirex/RAVine-logs.

0

models

29new

Launcher

Bronze25

SVCFusion · Uncategorized

0

models

29new

CloudSEN12-scribble

Bronze34

csaybar · Instruction Following

🚨 New Dataset Version Released! We are excited to announce the release of Version [1.1] of our dataset! This update includes: [L2A & L1C support]. [Temporal support]. [Check the data without downloading (Cloud-optimized properties)]. 📥 Go to: https://huggingface.co/datasets/tacofoundation/cloudsen12 and follow the instructions in colab CloudSEN12 NOLABEL A Benchmark Dataset for Cloud Semantic Understanding CloudSEN12 SCRIBBLE A Benchmark Dataset for… See the full description on the dataset page: https://huggingface.co/datasets/csaybar/CloudSEN12-scribble.

0

models

30new

fineweb-edu-translated

Silver50

Helsinki-NLP · Translation & Multilingual

Helsinki-NLP/fineweb-edu-translated fineweb-edu-translated is a collection of automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models. The data covers 36,704,000 documents with over 28 billion space-separated tokens of English data translated into 36 languages. The total data set includes over 960 billion tokens and the translated documents are aligned across all languages. More information about how the data has been produced can… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/fineweb-edu-translated.

0

models

30new

aochekq

New

june94430 · Image Recognition

0

models

31new

VIDGEN-1M

Bronze33

AnXin69 · Benchmarks & Evaluation

Datasets Card We present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. We open source the VidGen-1M dataset so that scholars can train their own models and conduct fair model evaluation. Details Due to network and size limitations, we split the dataset into 2048 parts and upload them one by… See the full description on the dataset page: https://huggingface.co/datasets/AnXin69/VIDGEN-1M.

0

models

31new

MovieChat-1K_train

Bronze29

Lovelittlerain · Text Generation & Chat

0

models

32new

amazon_food_reviews

New

duongdono · Uncategorized

0

models

32new

DocVQA

Silver52

lmms-lab · Math & Reasoning

Large-scale Multi-modality Models Evaluation Suite Accelerating the development of large-scale multi-modality models (LMMs) with lmms-eval 🏠 Homepage | 📚 Documentation | 🤗 Huggingface Datasets This Dataset This is a formatted version of DocVQA. It is used in our lmms-eval pipeline to allow for one-click evaluations of large multi-modality models. @article{mathew2020docvqa, title={DocVQA: A Dataset for VQA on Document Images. CoRR abs/2007.00398 (2020)}… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/DocVQA.

0

models

33new

au30_tra

Bronze38

Sam04 · Uncategorized

Dataset Card for Dataset Name This dataset card aims to be a base template for new datasets. It has been generated using this raw template. Dataset Details Dataset Description Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed] Dataset Sources [optional] Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/Sam04/au30_tra.

0

models

33new

Chair

Bronze28

yuhuo03 · Uncategorized

0

models

34new

magnetograms

Bronze26

JD1361015 · Image Recognition

0

models

34new

NewsWire

Bronze48

dell-research-harvard · Science & Research

Dataset Card for NewsWire Dataset Summary NewsWire contains 2.7 million unique public domain U.S. news wire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model. Languages English (en) Dataset Structure Each year in the dataset is… See the full description on the dataset page: https://huggingface.co/datasets/dell-research-harvard/newswire.

0

models

35new

TxT360

Silver56

LLM360 · Text Generation & Chat

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend Changelog Version Details v1.1 Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended. Details of v1.1 Additions TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.

0

models

35new

webui-training-data

New

post-train · Image Recognition

0

models

36new

nguyenminhphuong1997

Bronze25

nguyenminhphuong1997 · Uncategorized

0

models

36new

fc-amf-ocr

Bronze47

lightonai · Role-Play & Characters

Dataset Card for Finance Commons AMF OCR dataset (FC-AMF-OCR) Dataset Summary The FC-AMF-OCR dataset is a comprehensive document collection derived from the AMF-PDF dataset, which is part of the Finance Commons collection. This extensive dataset comprises 9.3 million images, each processed through Optical Character Recognition (OCR) using the docTR library. While native text annotations are available in the AMF-Text dataset, these annotations suffer from imperfections and… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/fc-amf-ocr.

0

models

37new

AmazonReviewsClassification

Bronze33

mteb · Code

AmazonReviewsClassification An MTEB dataset Massive Text Embedding Benchmark A collection of Amazon reviews specifically designed to aid research in multilingual text classification. Task category t2c Domains Reviews, Written Reference https://arxiv.org/abs/2010.02573 How to evaluate on this task You can evaluate an embedding model on this dataset using the following code: import mteb task = mteb.get_tasks(["AmazonReviewsClassification"]) evaluator =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/AmazonReviewsClassification.

0

models

37new

movie-v3

Bronze27

LinhHuong11 · Uncategorized

0

models

38new

alpaca_en

Bronze47

llamafactory · Code

Borrowed from: https://github.com/tatsu-lab/stanford_alpaca Removed some erroneous examples. You can use it in LLaMA Factory by specifying dataset: alpaca_en.

0

models

38new

eval_venv

Bronze27

Matt300209 · Benchmarks & Evaluation

0

models

39new

Hyperheight Data Cube Denoising and Super-Resolution

Bronze46

anfera236 · Code

Hyperheight Data Cube Denoising and Super-Resolution Dataset Summary Generation code and pipeline: https://github.com/Anfera/HHDC-Creator (HHDC-Creator repo). 3-D photon-count waveforms (Hyperheight data cubes) built from NEON discrete-return LiDAR using the HHDC pipeline (hhdc/cube_generator.py). Each cube stores a high-resolution canopy volume (default: 0.5 m vertical bins over 64 m height, footprints every 2 m) across a 96 m × 96 m tile. In the HHDC-Creator pipeline… See the full description on the dataset page: https://huggingface.co/datasets/anfera236/HHDC.

0

models

39new

rare_share

Bronze37

RARE111 · Math & Reasoning

The supplementary materials for RARE: Retrieval-Augmented Reasoning Modeling (https://arxiv.org/abs/2503.23513) license: apache-2.0

0

models

40new

Amazon-Reviews-2023

Silver57

McAuley-Lab · Uncategorized

Amazon Review 2023 is an updated version of the Amazon Review 2018 dataset. This dataset mainly includes reviews (ratings, text) and item metadata (descriptions, category information, price, brand, and images). Compared to the previous versions, the 2023 version features larger size, newer reviews (up to Sep 2023), richer and cleaner metadata, and finer-grained timestamps (from day to millisecond).

0

models

40new

9552195U

Bronze35

ITI121-25S2 · Uncategorized

Idli & Dosai Object Detection Dataset This dataset contains annotated images of Idli and Dosai for custom object detection. Format YOLOv8 (Ultralytics) Bounding box annotations Classes idli dosai Source Images collected from real-world photographs and public online sources. Annotations created using Roboflow. Usage This dataset was created for ITI121 Assignment 2.

0

models

41new

dangquanghuy1985

New

dangquanghuy1985 · Uncategorized

0

models

41new

eurosat

Bronze40

tanganke · Code

Dataset Card for EuroSAT Dataset Source Paper with code Usage from datasets import load_dataset dataset = load_dataset('tanganke/eurosat') Data Fields The dataset contains the following fields: image: An image in RGB format. label: The label for the image, which is one of 10 classes: 0: annual crop land 1: forest 2: brushland or shrubland 3: highway or road 4: industrial buildings or commercial buildings 5: pasture land 6: permanent crop land… See the full description on the dataset page: https://huggingface.co/datasets/tanganke/eurosat.

0

models

42new

extrinsic_contact_estimation_real_datasets

Bronze29

serialexperimentsleon · Uncategorized

0

models

42new

wholebody-pose-estimation-fingerspelling

Bronze38

fhswf · Uncategorized

Whole-Body Pose Estimation Dataset for German Sign Language (DGS) Finger Alphabet Dataset Description This dataset contains 5,000 annotated images for fine-tuning whole-body pose estimation models. The images depict individuals performing signs from the German Sign Language (Deutsche Gebärdensprache) finger alphabet. The frames were extracted and annotated from the video dataset available at:[https://huggingface.co/datasets/fhswf/dgs-pose] Key Features… See the full description on the dataset page: https://huggingface.co/datasets/fhswf/wholebody-pose-estimation-fingerspelling.

0

models

43new

HPLT2.0_cleaned

Silver50

HPLT · Math & Reasoning

NB: HPLT2.0 is now superseded by a newer release: HPLT3.0 We recommend switching to v3.0, unless you have a compelling reason to stay on 2.0. This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly Internet Archive with some additions from Common Crawl. For a detailed description of the dataset, please refer to our website and our pre-print. The Cleaned variant of HPLT Datasets v2.0 This is the… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned.

0

models

43new

OPUS EUconst

Bronze42

Helsinki-NLP · Benchmarks & Evaluation

Dataset Card for OPUS EUconst Dataset Summary A parallel corpus collected from the European Constitution. EUconst's Numbers: Languages: 21 Bitexts: 210 Number of files: 986 Number of tokens: 3.01M Sentence fragments: 0.22M Supported Tasks and Leaderboards The underlying task is machine translation. Languages The languages in the dataset are: Czech (cs) Danish (da) German (de) Greek (el) English (en) Spanish (es) Estonian (et) Finnish (fi) French… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/euconst.

0

models

44new

pan_piper_mix_1

Bronze27

HitmanReborn · Uncategorized

0

models

44new

DRAGON

Bronze41

lesc-unifi · Image Recognition

Dataset Card for DRAGON 🧾 ArXiv Preprint DRAGON is a large-scale Dataset of Realistic imAges Generated by diffusiON models. The dataset includes a total of 2.5 million training images and 100,000 test images generated using 25 diffusion models, spanning both recent advancements and older, well-established architectures. Dataset Details Dataset Description The remarkable ease of use of diffusion models for image generation has led to a proliferation of… See the full description on the dataset page: https://huggingface.co/datasets/lesc-unifi/dragon.

0

models

45new

nguyenvanchienvn

New

nguyenvanchienvn · Uncategorized

0

models

46new

Tadabur: A Large-Scale Quran Audio Dataset

Bronze39

FaisaI · Speech & Audio

Tadabur: A Large-Scale Quran Audio Dataset The most comprehensive and richly annotated Qur'anic recitation corpus to date Faisal Alherran       ✦ Overview Tadabur is a large-scale, high-diversity Qur'anic speech dataset designed to advance research in Qur'anic Automatic Speech Recognition (ASR), reciter modeling, tajwīd-aware speech processing, and prosodic analysis. It is the most comprehensive publicly available collection of Qur'anic recitation… See the full description on the dataset page: https://huggingface.co/datasets/FaisaI/tadabur.

0

models

46new

MMEB_Test_Instruct

Bronze26

ziyjiang · Instruction Following

0

models

47new

reward-bench-results

Bronze43

allenai · Benchmarks & Evaluation

Results for Holistic Evaluation of Reward Models (HERM) Benchmark Here, you'll find the raw scores for the HERM project. The repository is structured as follows. ├── best-of-n/ <- Nested directory for different completions on Best of N challenge | ├── alpaca_eval/ └── results for each reward model | | ├── tulu-13b/{org}/{model}.json | | └── zephyr-7b/{org}/{model}.json | └── mt_bench/ |… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-results.

0

models

47new

GLIMPSE-processed-libero_10

Bronze28

zrgong · Image Recognition

0

models

48new

sts17-crosslingual-sts

Bronze40

mteb · Code

STS17 An MTEB dataset Massive Text Embedding Benchmark Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation Task category t2t Domains News, Web, Written Reference https://alt.qcri.org/semeval2017/task1/ How to evaluate on this task You can evaluate an embedding model on this dataset using the following code: import mteb task = mteb.get_tasks(["STS17"]) evaluator = mteb.MTEB(task) model =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/sts17-crosslingual-sts.

0

models

48new

sun397

Bronze43

tanganke · Image Recognition

SUN397 dataset The database contains 397 categories subset from the SUN dataset for Scene Recognition used in the following paper. The number of images varies across categories, but there are at least 100 images per category, and 108,754 images in total. All images are in jpg format. The images provided here are for research purposes only. The file ClassName.txt contains the name list for the 397 categories. Please cite the following paper if you use this dataset in your research.… See the full description on the dataset page: https://huggingface.co/datasets/tanganke/sun397.

0

models

49new

results

Bronze33

hallucinations-leaderboard · Benchmarks & Evaluation

0

models

49new

latent_v1_alpha_03

New

atokforps · Uncategorized

0

models

50new

latent_worker_early-a2_02

Bronze28

atokforps · Uncategorized

0

models

50new

EU Law Dataset - Category 15.10

Bronze37

G4KMU · Legal

EU Law Dataset – Category 15.10: Environment This dataset contains official legal documents from the European Union, collected from the EUR-Lex website, specifically under category 15.10: "Environment". The documents span the years 1961 to 2025 and are provided in multiple European languages. The original documents are in PDF format and have been converted into various text-based formats using OLMCR. The dataset splits represent the different languages available for each… See the full description on the dataset page: https://huggingface.co/datasets/G4KMU/LEMUR.

0

models

51new

buily2003

New

buily2003 · Uncategorized

0

models

51new

casestudy_openevolve_results

Bronze25

willychan21 · Uncategorized

0

models

52new

plinder_apo2mol_subset

New

linbc20 · Uncategorized

0

models

52new

movie-v10

Bronze26

LinhHuong11 · Uncategorized

0

models

53new

OpenImage_top1_final

Bronze26

Tungtom2004 · Image Recognition

0

models

53new

turkey-all-universities

Bronze39

h8st6ptv · Image Recognition

All Universities in Turkey Dataset Description This dataset contains detailed information about various universities. Each record represents a single university and includes attributes such as the university's name, type, city, website, address, logo URL, and a button for accessing additional details. This data is typically extracted from a web page listing universities. Fields 1. id… See the full description on the dataset page: https://huggingface.co/datasets/h8st6ptv/turkey-all-universities.

0

models

54new

FineTranslations-Edu

Bronze42

HuggingFaceFW · Text Generation & Chat

💬 FineTranslations The world's knowledge in 1+1T tokens of parallel text NOTE: this is the Edu version of the dataset, containing only the top 10% scoring data based on an educational classifier applied to the English translations. It has no splits. For the base dataset, see here. What is it? This dataset contains over 1 trillion tokens of parallel text in English and 500+ languages. It was obtained by translating data from 🥂 FineWeb2 into English using… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finetranslations-edu.

0

models

54new

RELLISUR

Bronze26

butters111 · Image Recognition

0

models

55new

librispeech_asr_dummy

Bronze35

hf-internal-testing · Speech & Audio

0

models

55new

AnyEdit

Bronze43

Bin1117 · Instruction Following

Celebrate! AnyEdit resolved the data alignment with the re-uploading process (but the view filter is not working:(, though it has 25 edit types). You can view the validation split for a quick look. You can also refer to anyedit-split dataset to view and download specific data for each editing type. Dataset Card for AnyEdit-Dataset Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often… See the full description on the dataset page: https://huggingface.co/datasets/Bin1117/AnyEdit.

0

models

56new

movie-v7

Bronze28

ducanhh55 · Uncategorized

0

models

56new

GAMMA (Glaucoma grading from Multi-Modality imAges) Challenge Dataset

Bronze41

Vincent08426 · Medical & Healthcare

GAMMA — Glaucoma grading from Multi-Modality imAges (Challenge dataset) Image: Dataset Samples. Short description GAMMA is the first public multi-modality glaucoma grading dataset that pairs 2D color fundus photographs with 3D OCT volumes for each sample. It was released as part of the GAMMA challenge (OMIA8 / MICCAI 2021) to encourage algorithms that combine fundus and OCT information for automatic… See the full description on the dataset page: https://huggingface.co/datasets/Vincent08426/GAMMA.

0

models

57new

pile-val-backup

Bronze49

mit-han-lab · Text - General

This is a backup for the pile val dataset downloaded from here: https://the-eye.eu/public/AI/pile/val.jsonl.zst Please respect the original license of the dataset.

0

models

57new

U

Bronze36

shiyiyoyo · Structured Data

UAV Trajectory Dataset Summary This dataset comprises over 5000 random UAV (Unmanned Aerial Vehicle) trajectories collected over 20 hours of flight time. It is intended for training AI models such as trajectory prediction applications. The dataset is generated through an automated pipeline for the creation and preprocessing of UAV synthetic trajectories, making it ready for direct AI model training. Data Description The dataset features parameterized… See the full description on the dataset page: https://huggingface.co/datasets/shiyiyoyo/Synthetic-UAV-Flight-Trajectories.

0

models

58new

Dance2Hesitate

Bronze25

brsrikrishna · Uncategorized

0

models

58new

otonariniginga

Bronze33

BangumiBase · Role-Play & Characters

Bangumi Image Base of Otonari Ni Ginga This is the image base of bangumi Otonari ni Ginga; we detected 32 characters and 5029 images in total. The full dataset is here. Please note that these image bases are not guaranteed to be 100% cleaned; they may actually be noisy. If you intend to manually train models using this dataset, we recommend performing necessary preprocessing on the downloaded dataset to eliminate potential noisy samples (approximately 1% probability). Here is the… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/otonariniginga.

0

models

59new

TroveLedger Financial Time Series Dataset

Bronze44

Traders-Lab · Finance

🗃️ TroveLedger — Financial Time Series Dataset A growing ledger of accumulated market history. ⚠️ Temporary Notice: Intraday Data Adjustments (January 2026) What happened: A discrepancy has been identified in the minute- and hourly-resolution data: these series are currently not fully adjusted for stock splits and dividends. Daily-resolution data remains correctly adjusted (as provided by the source). Why this matters: For accurate backtesting and model training –… See the full description on the dataset page: https://huggingface.co/datasets/Traders-Lab/TroveLedger.

0

models

59new

phamthihuong2003

Bronze26

phamthihuong2003 · Uncategorized

0

models

60new

humanoid-everyday

Silver50

USC-PSI-Lab · Robotics

Humanoid Everyday A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation Overview Humanoid Everyday is a large-scale, diverse humanoid manipulation dataset designed for open-world robotic learning and embodied intelligence. It contains over 260 tasks across 7 major categories, covering dexterous manipulation, human–humanoid interaction, and locomotion-integrated activities. All data were collected through a human-supervised teleoperation pipeline, recording… See the full description on the dataset page: https://huggingface.co/datasets/USC-PSI-Lab/humanoid-everyday.

0

models

60new

egoschema

Bronze29

lmms-lab · Text - General

0

models

61new

nguyenvana1990

New

nguyenvana1990 · Uncategorized

0

models

61new

SloMoBlur

Bronze43

Thomas880423 · Uncategorized

To cite this dataset in a publication, please use: @misc{mahmud2025deblurringwildrealworldimage, title={Deblurring in the Wild: A Real-World Image Deblurring Dataset from Smartphone High-Speed Videos}, author={Syed Mumtahin Mahmud and Mahdi Mohd Hossain Noki and Prothito Shovon Majumder and Abdul Mohaimen Al Radi and Sudipto Das Sukanto and Afia Lubaina and Md. Mosaddek Khan}, year={2025}, eprint={2506.19445}, archivePrefix={arXiv}, primaryClass={cs.CV}… See the full description on the dataset page: https://huggingface.co/datasets/Thomas880423/SloMoBlur.

0

models

62new

OPUS_Tatoeba

New

wecover · Text - General

0

models

63new

omni-refiner-kontext

Bronze43

lsmpp · Uncategorized

omni-refiner-kontext Uploaded via huggingface_hub API.

0

models

63new

vuducmanh1991

New

vuducmanh1991 · Uncategorized

0

models

64new

Parameter Golf FineWeb Export

Bronze46

willdepueoai · Uncategorized

Parameter Golf FineWeb Export This repository hosts tokenizer-matched export artifacts derived from HuggingFaceFW/fineweb, specifically a 30B subset pulled from the 100B FineWeb cut used for parameter-golf experiments. The repository contains: pretokenized training and validation shards under datasets/datasets/ tokenizer artifacts under datasets/tokenizers/ the export manifest at datasets/manifest.json selected-document metadata at datasets/docs_selected.jsonl License… See the full description on the dataset page: https://huggingface.co/datasets/willdepueoai/parameter-golf.

0

models

64new

GQA-35k

Bronze36

Voxel51 · Math & Reasoning

Dataset Card for GQA-35k The GQA (Visual Reasoning in the Real World) dataset is a large-scale visual question answering dataset that includes scene graph annotations for each image. This is a FiftyOne dataset with 35000 samples. Note: This is a 35,000 sample subset which does not contain questions, only the scene graph annotations as detection-level attributes. You can find the recipe notebook for creating the dataset here Installation If you haven't already, install… See the full description on the dataset page: https://huggingface.co/datasets/Voxel51/GQA-Scene-Graph.

0

models

65new

tatenoyuushanonariagariseason2

Bronze33

BangumiBase · Role-Play & Characters

Bangumi Image Base of Tate No Yuusha No Nariagari Season 2 This is the image base of bangumi Tate no Yuusha no Nariagari Season 2; we detected 81 characters and 5635 images in total. The full dataset is here. Please note that these image bases are not guaranteed to be 100% cleaned; they may actually be noisy. If you intend to manually train models using this dataset, we recommend performing necessary preprocessing on the downloaded dataset to eliminate potential noisy samples… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/tatenoyuushanonariagariseason2.

0

models

65new

Turmatle_pretrain_datasets

Bronze25

zi-hui · Image Recognition

0

models

66new

CocoChorales-E

Bronze42

ben2002chou · Benchmarks & Evaluation

Viewer note: default uses viewer_preview/ for responsive audio playback. Full training/evaluation files remain available in the original folder structure. CocoChorales-E CocoChorales-E subset used by the LadderSym training pipeline. Paired Inputs for Error Detection The model takes paired inputs: mistake: performance audio/MIDI containing musical errors score: paired reference score audio/MIDI (target/correct context) Error supervision is provided with labels:… See the full description on the dataset page: https://huggingface.co/datasets/ben2002chou/CocoChorales-E.

0

models

66new

hy

New

cryptodawn · Uncategorized

0

models

67new

Military Aircraft Detection Dataset

Bronze41

a2015003713 · Image Recognition

Military Aircraft Detection Dataset Military aircraft detection dataset in COCO and YOLO format. This dataset is synchronized from the original Kaggle dataset: https://www.kaggle.com/datasets/a2015003713/militaryaircraftdetectiondataset

0

models

67new

Openpdf-Analysis-Recognition

Bronze37

prithivMLmods · Role-Play & Characters

Openpdf-Analysis-Recognition The Openpdf-Analysis-Recognition dataset is curated for tasks related to image-to-text recognition, particularly for scanned document images and OCR (Optical Character Recognition) use cases. It contains over 6,900 images in a structured imagefolder format suitable for training models on document parsing, PDF image understanding, and layout/text extraction tasks. Attribute Value Task Image-to-Text Modality Image Format ImageFolder… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/Openpdf-Analysis-Recognition.

0

models

68new

hle

Silver59

cais · Code

[!NOTE] IMPORTANT: Please help us protect the integrity of this benchmark by not publicly sharing, re-uploading, or distributing the dataset. Humanity's Last Exam 🌐 Website | 📄 Paper | GitHub Center for AI Safety & Scale AI Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 2,500 questions across dozens of… See the full description on the dataset page: https://huggingface.co/datasets/cais/hle.

0

models

68new

DiTFake

Bronze33

lioooox · Code

Here is the released dataset (DiTFake) for Synthetic Image Detection (SID) proposed in our paper. Improving Synthetic Image Detection Towards Generalization: An Image Transformation Perspective This dataset contains 30,000 images in total, including synthetic images generated by three recent DiT-based models (Flux, PixArt, and SD3) and equal numbers of real images from COCO. More implementation details can be found in our GitHub repository.

0

models

69new

llm_pt_leaderboard_raw_results

Bronze26

eduagarcia-temp · Benchmarks & Evaluation

0

models

69new

20d6952c

New

dude-os · Uncategorized

0

models

70new

latent_worker_early-a2_04

Bronze28

atokforps · Uncategorized

0

models

70new

ver

New

zephyrglow · Uncategorized

0

models

71new

Typed Digital Signatures Dataset

Bronze44

Benjy · Role-Play & Characters

Typed Digital Signatures Dataset This comprehensive dataset contains synthetic digital signatures rendered across 30 different Google Fonts, specifically selected for their handwriting and signature-style characteristics. Each font contributes unique stylistic elements, making this dataset ideal for robust signature analysis and font recognition tasks. Dataset Overview Total Fonts: 30 different Google Fonts Images per Font: 3,000 signatures Total Dataset Size: ~90,000… See the full description on the dataset page: https://huggingface.co/datasets/Benjy/typed_digital_signatures.

0

models

71new

bio-mcp-data

Bronze34

longevity-genie · Math & Reasoning

Bio-MCP-Data A repository containing biological datasets that will be used by BIO-MCP MCP (Model Context Protocol) standard. About This repository hosts biological data assets formatted to be compatible with the Model Context Protocol, enabling AI models to efficiently access and process biological information. The data is managed using Git Large File Storage (LFS) to handle large biological datasets. Purpose Provide standardized biological datasets for AI… See the full description on the dataset page: https://huggingface.co/datasets/longevity-genie/bio-mcp-data.

0

models

72new

ahmedml

Bronze41

neashton · Uncategorized

AhmedML: High-Fidelity Computational Fluid Dynamics dataset for incompressible, low-speed bluff body aerodynamics Contact: Neil Ashton (NVIDIA) - contact@caemldatasets.org website: https://caemldatasets.org Summary: This dataset contains 500 different geometric variations of the Ahmed Car Body - a simplified car-like shape that exhibits many of the flow topologies that are present on bluff bodies such as road vehicles. The dataset contains a wide… See the full description on the dataset page: https://huggingface.co/datasets/neashton/ahmedml.

0

models

73new

DarijaMMLU

Bronze40

MBZUAI-Paris · Benchmarks & Evaluation

Dataset Card for DarijaMMLU Dataset Summary DarijaMMLU is an evaluation benchmark designed to assess large language models' (LLM) performance in Moroccan Darija, a variety of Arabic. It consists of 22,027 multiple-choice questions, translated from selected subsets of the Massive Multitask Language Understanding (MMLU) and ArabicMMLU benchmarks to measure model performance on 44 subjects in Darija. Supported Tasks Task Category: Multiple-choice question… See the full description on the dataset page: https://huggingface.co/datasets/MBZUAI-Paris/DarijaMMLU.

0

models

74new

reflect-r1-0309

New

guanys · Uncategorized

0

models

74new

aihub-wild-animal

Bronze26

im-wali · Uncategorized

0

models

75new

cc12m-wds

Silver50

pixparse · Vision-Language

Dataset Card for Conceptual Captions 12M (CC12M) Dataset Summary Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M). Usage This instance of Conceptual Captions is in webdataset .tar format. It can be used with the webdataset library or upcoming releases of Hugging Face datasets.… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/cc12m-wds.

0

models

75new

LNDb

New

Angelou0516 · Text - General

0

models

76new

dclm-baseline-filtered

Bronze28

KORMo-Team · Uncategorized

0

models

76new

phamngochieu1994

Bronze26

phamngochieu1994 · Uncategorized

0

models

77new

audiofolder_two_configs_in_metadata

Bronze25

hf-internal-testing · Speech & Audio

0

models

77new

GuanacoDataset

Silver53

JosephusCheung · Text Generation & Chat

Sorry, it's no longer available on Hugging Face. Please reach out to those who have already downloaded it. If you have a copy, please refrain from re-uploading it to Hugging Face. The people here don't deserve it. See also: https://twitter.com/RealJosephus/status/1779913520529707387 GuanacoDataset News: We're heading towards multimodal VQA, with blip2-flan-t5-xxl Alignment to Guannaco 7B LLM. Still under construction: GuanacoVQA weight & GuanacoVQA Dataset Notice: Effective… See the full description on the dataset page: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset.

0

models

78new

totz50k

Bronze26

ynhe · Video

0

models

78new

0xdesigner

New

gaianet · Text - General

0

models

79new

bias_in_bios

Bronze44

LabHC · Code

Bias in Bios Bias in Bios was created by (De-Arteaga et al., 2019) and published under the MIT license (https://github.com/microsoft/biosbias). The dataset is used to investigate bias in NLP models. It consists of textual biographies used to predict professional occupations; the sensitive attribute is the gender (binary). The version shared here is the version proposed by (Ravfogel et al., 2020), which is slightly smaller due to the unavailability of 5,557 biographies. The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/LabHC/bias_in_bios.

0

models

79new

img_upload

Bronze32

Maynor996 · Image Recognition

0

models

80new

English Characters Image Dataset

Bronze34

Mayank022 · Role-Play & Characters

English Characters Image Dataset (A-Z, a-z, 0-9) This dataset contains high-resolution (128x128 pixels) grayscale images of English characters, including uppercase letters (A-Z), lowercase letters (a-z), and digits (0-9). Each character is available in 80,000 to 100,000 unique font styles, making it one of the most comprehensive resources for character-level image modeling. Dataset Description The images in this dataset have been generated by rendering over 85,000… See the full description on the dataset page: https://huggingface.co/datasets/Mayank022/English_Characters_Images.

0

models

80new

alpaca_gpt4_zh

Silver50

llamafactory · Instruction Following

Borrowed from: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM Removed 6,103 mistruncated examples. You can use it in LLaMA Factory by specifying dataset: alpaca_gpt4_zh.

0

models

81new

e2d65cf4

New

dude-os · Uncategorized

0

models

81new

dinhthanhbinh1986

Bronze25

dinhthanhbinh1986 · Uncategorized

0

models

82new

CC-MAIN-2018-13

Bronze25

cc-clean · Text - General

0

models

82new

tldr

Bronze43

trl-lib · Text - General

TL;DR Dataset Summary The TL;DR dataset is a processed version of Reddit posts, specifically curated to train models using the TRL library for summarization tasks. It leverages the common practice on Reddit where users append "TL;DR" (Too Long; Didn't Read) summaries to lengthy posts, providing a rich source of paired text data for training summarization models. Data Structure Format: Standard Type: Prompt-completion Columns: "prompt": The unabridged Reddit… See the full description on the dataset page: https://huggingface.co/datasets/trl-lib/tldr.

0

models

83new

chords-billboard

New

lamooon · Text - General

0

models

83new

Multimodal-Dataset-Image_Text_Table_TimeSeries-for-Financial-Time-Series-Forecasting

Bronze44

Y123-wed · Finance

The sp500stock_data_description.csv file provides detailed information on the existence of four modalities (text, image, time series, and table) for 4,213 S&P 500 stocks. The hs300stock_data_description.csv file provides detailed information on the existence of four modalities (text, image, time series, and table) for 858 HS 300 stocks. If you find our research helpful, please cite our paper: @article{xu2025finmultitime, title={FinMultiTime: A Four-Modal Bilingual Dataset for… See the full description on the dataset page: https://huggingface.co/datasets/Y123-wed/Multimodal-Dataset-Image_Text_Table_TimeSeries-for-Financial-Time-Series-Forecasting.

0

models

84new

BoxFusion

New

Kevin1804 · Image Recognition

0

models

84new

assets

Bronze28

Genesis-Intelligence · Uncategorized

0

models

85new

PAWS: Paraphrase Adversaries from Word Scrambling

Silver51

google-research-datasets · Classification & Sentiment

Dataset Card for PAWS: Paraphrase Adversaries from Word Scrambling Dataset Summary PAWS: Paraphrase Adversaries from Word Scrambling This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset. For further… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/paws.

0

models

85new

CoSyn-400K

Bronze44

allenai · Code

CoSyn-400k CoSyn-400k is a collection of synthetic question-answer pairs about a very diverse range of computer-generated images. The data was created by using the Claude large language model to generate code that can be executed to render an image, and using GPT-4o mini to generate Q/A pairs based on the code (without using the rendered image). The code used to generate this data is open source. Synthetic pointing data is available in a separate repo. Quick links: 📃 CoSyn… See the full description on the dataset page: https://huggingface.co/datasets/allenai/CoSyn-400K.

0

models

86new

hf

New

codexdream · Code

0

models

86new

public

Bronze25

humosleo · Uncategorized

0

models

87new

nga2005

Bronze25

nga2005 · Uncategorized

0

models

87new

RottenTomatoes - MR Movie Review Data

Silver55

cornell-movie-review-data · Benchmarks & Evaluation

Dataset Card for "rotten_tomatoes" Dataset Summary Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.", Proceedings of the ACL, 2005. Supported Tasks and Leaderboards More Information Needed Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.

0

models

88new

MY_MARS

New

ilovehyperspectral01 · Uncategorized

0

models

88new

mast

Bronze26

hyf015 · Uncategorized

0

models

89new

movie-v3

Bronze29

Yen0606 · Uncategorized

0

models

89new

PLUS_Lab_GPUs_Data

Bronze26

pluslab · Uncategorized

0

models

90new

phamtrungkien1994

Bronze25

phamtrungkien1994 · Uncategorized

0

models

90new

MME

Bronze49

lmms-lab · Benchmarks & Evaluation

Evaluation Dataset for MME

0

models

91new

truongquocanh2003

New

truongquocanh2003 · Uncategorized

0

models

91new

XLEL-WD

Bronze42

adithya7 · Text - General

XLEL-WD is a multilingual event linking dataset. This dataset contains mention references from multilingual Wikipedia/Wikinews articles to event items in Wikidata. The text descriptions for Wikidata events are compiled from Wikipedia articles.

0

models

92new

jat-dataset-tokenized

Bronze43

jat-project · Uncategorized

Dataset Card for "jat-dataset-tokenized" More Information needed

0

models

92new

NIH-CXR14

Bronze44

alkzar90 · Medical & Healthcare

The NIH Chest X-ray dataset consists of 100,000 de-identified images of chest x-rays. The images are in PNG format. The data is provided by the NIH Clinical Center and is available through the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC

0

models

93new

Turkmen Speech Dataset

Bronze44

rozumov · Benchmarks & Evaluation

Turkmen Speech Dataset (ASR) This dataset contains 251 hours of Turkmen speech audio with transcriptions, intended for training and evaluating Automatic Speech Recognition (ASR) models. It is one of the largest publicly available Turkmen speech datasets. Dataset Overview Property Value Total clips 119,847 Total duration 251.86 hours Sampling rate 16,000 Hz Language Turkmen (tk) Split train Each item includes: audio: waveform + sampling rate text:… See the full description on the dataset page: https://huggingface.co/datasets/rozumov/TurkmenSpeech.

0

models

93new

MMLU-ProX

Bronze46

li-lab · Code

MMLU-ProX MMLU-ProX is a multilingual benchmark that builds upon MMLU-Pro, extending to 29 typologically diverse languages, designed to evaluate large language models' reasoning capabilities across linguistic and cultural boundaries. Github | Paper News [2025/08] 🎉 MMLU-ProX was accepted by EMNLP 2025 Main Conference! [2025/05] MMLU-ProX now contains 29 languages, all available on Huggingface. [2025/03] MMLU-ProX is now available on Huggingface. [2025/03] We are still… See the full description on the dataset page: https://huggingface.co/datasets/li-lab/MMLU-ProX.

0

models

94new

Graph-PanNuke

Bronze39

dszohib · Code

Graph-PanNuke: A Cell-Graph Dataset for Nucleus Classification from PanNuke Graph-PanNuke is a node-level classification dataset derived from the PanNuke pan-cancer histology dataset. We use all slides at 40× magnification. Each tissue patch is converted into a cell-graph where nodes represent detected cell nuclei and edges encode spatial proximity. The task is predicting the cell type of each nucleus across 5 classes. Note that node features describe cell morphology, texture… See the full description on the dataset page: https://huggingface.co/datasets/dszohib/graph-pannuke.

0

models

94new

fsc-180k

Bronze43

Hollow12334 · Uncategorized

FSC-180k We introduce our hybrid semantic change detection dataset, named FSC-180k. It consists of approximately 60,000 real aerial images sourced from the FLAIR dataset, along with 180,000 artificially modified images. These images were generated using our HySCDG pipeline applied (three times) to each real image. In total, the dataset provides 180,000 image pairs. Each pair is accompanied by a binary change map and semantic segmentation maps for both images (land use… See the full description on the dataset page: https://huggingface.co/datasets/Hollow12334/fsc-180k.

0

models

95new

ProObjaverse-300K

Bronze30

Stable-X · Image Generation

0

models

95new

libero_track_object_ee_relative

Bronze34

CRRaphael · Code

This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v2.1", "robot_type": "panda", "total_episodes": 1433, "total_frames": 43826, "total_tasks": 30, "total_videos": 0, "total_chunks": 2, "chunks_size": 1000, "fps": 10, "splits": { "train": "0:1433" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/CRRaphael/libero_track_object_ee_relative.
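The `data_path` template in the `meta/info.json` excerpt above can be resolved to concrete file names. The following is a minimal sketch, not LeRobot's own loader: the helper name `episode_parquet_path` is hypothetical, and it assumes that an episode's chunk number is its index divided by `chunks_size` (integer division), which is consistent with the listed `total_chunks` of 2 for 1433 episodes but is not stated in the excerpt.

```python
# Values copied from the meta/info.json excerpt in the description.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
CHUNKS_SIZE = 1000
TOTAL_EPISODES = 1433


def episode_parquet_path(episode_index: int) -> str:
    """Return the relative parquet path for one episode (hypothetical helper).

    Assumption: the chunk number is episode_index // CHUNKS_SIZE.
    """
    if not 0 <= episode_index < TOTAL_EPISODES:
        raise IndexError(f"episode_index out of range: {episode_index}")
    episode_chunk = episode_index // CHUNKS_SIZE
    return DATA_PATH.format(
        episode_chunk=episode_chunk, episode_index=episode_index
    )


if __name__ == "__main__":
    # First and last episode of the single "train" split (0:1433).
    print(episode_parquet_path(0))
    print(episode_parquet_path(TOTAL_EPISODES - 1))
```

Under that assumption, episodes 0 to 999 fall in `chunk-000` and episodes 1000 to 1432 in `chunk-001`.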

0

models

96new

GUI-360

Bronze49

vyokky · Code

GUI-360°: A Comprehensive Dataset And Benchmark For Computer-Using Agents Paper | Code GUI-360° is a large-scale, comprehensive dataset and benchmark suite designed to advance Computer-Using Agents (CUAs). 🎯 Key Features 🔢 1.2M+ executed action steps across thousands of trajectories 💼 Popular Windows office applications (Word, Excel, PowerPoint) 📸 Full-resolution screenshots with accessibility metadata 🎨 Multi-modal trajectories with reasoning traces ✅ Both… See the full description on the dataset page: https://huggingface.co/datasets/vyokky/GUI-360.

0

models

96new

Słownik Języka Polskiego

Bronze37

Apokryf · Legal

SJP (Słownik Języka Polskiego, the Polish Language Dictionary): a dictionary dataset transferred from official repositories, for working with the Polish language. https://sjp.pl/

0

models

97new

hh-rlhf

Silver60

Anthropic · Preference & Alignment (DPO/RLHF)

Dataset Card for HH-RLHF Dataset Summary This repository provides access to two different kinds of data: Human preference data about helpfulness and harmlessness from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. These data are meant to train preference (or reward) models for subsequent RLHF training. These data are not meant for supervised training of dialogue agents. Training dialogue agents on these data is likely to lead… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/hh-rlhf.

0

models

97new

diffusers-images-docs

New

diffusers · Image Recognition

0

models

98new

SA-Med3D-140K

Bronze44

blsmash044 · Code

SA-Med3D-140K [github] Dataset Summary SA-Med3D-140K is a large-scale, multi-modal, multi-anatomical volumetric medical image segmentation dataset. It was created to facilitate the development of general-purpose foundation models for 3D medical image segmentation. The dataset comprises 21,729 3D medical images and 143,518 corresponding masks. It was gathered from a combination of 70 public datasets and 8,128 privately licensed annotated cases from 24 hospitals.… See the full description on the dataset page: https://huggingface.co/datasets/blsmash044/SA-Med3D-140K.

0

models

98new

open-images

Bronze29

dalle-mini · Image Recognition

0

models

99new

Dl3DV-Dataset

Silver50

DL3DV · Creative Writing

DL3DV-Dataset This repo has all the 960P frames with camera poses of DL3DV-10K Dataset. We are working hard to review the entire dataset to avoid sensitive information. Thank you for your patience. Download If you have enough space, you can use git to download a dataset from huggingface. See this link. 480P/960P versions should satisfy most needs. If you do not have enough space, we further provide a download script here to download a subset. The usage: usage:… See the full description on the dataset page: https://huggingface.co/datasets/DL3DV/DL3DV-ALL-960P.

0

models

99new

image

New

pxnjack · Image Recognition

0

models

100new

roboverse_data

Bronze49

RoboVerseOrg · Benchmarks & Evaluation

This dataset is part of the RoboVerse project, as described in the paper RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning.

0

models

100new

vision-arena-bench-v0.1

Bronze36

lmarena-ai · Preference & Alignment (DPO/RLHF)

VisionArena-Bench: An automatic eval pipeline to estimate model preference rankings An automatic benchmark of 500 diverse user prompts that can be used to cheaply approximate Chatbot Arena model rankings via automatic benchmarking with VLM as a judge. Dataset Sources Repository: https://github.com/lm-sys/FastChat Paper: https://arxiv.org/abs/2412.08687 Automatic Evaluation Code: Coming Soon! Dataset Structure question_id: The unique hash representing the… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1.

0

models

About the Most Used Datasets Leaderboard

This leaderboard tracks the top 193 training datasets, ranked by the number of fine-tuned models in our catalog that reference each one. Daily snapshots show how the rankings evolve over time.

Every dataset on this leaderboard is sourced from HuggingFace and verified for relevance to fine-tuning workflows.

Methodology

Rankings are based on how many fine-tuned models in our catalog reference each dataset in their training data. This measures real-world influence — datasets that are actually being used to create new models.

Rankings are snapshotted daily at 6:00 AM UTC. Position changes shown on the leaderboard compare the current snapshot to the previous day's snapshot. All data is sourced directly from the HuggingFace Hub API and processed through our classification pipeline, which uses tag analysis, model card parsing, and naming pattern detection to identify genuine fine-tunes.
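The ranking and day-over-day comparison described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual pipeline: the record shape (a dict with a `datasets` list), the function names `rank_datasets` and `position_changes`, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import Counter


def rank_datasets(models):
    """Rank datasets by how many models reference them.

    `models` is an iterable of dicts with a hypothetical "datasets" key
    holding the dataset names parsed from each model card.
    Returns a list of (position, dataset_name, model_count) tuples.
    """
    counts = Counter()
    for model in models:
        # Count each dataset once per model, even if the card lists it twice.
        for name in set(model.get("datasets", [])):
            counts[name] += 1
    # Highest count first; ties broken alphabetically (an assumed rule).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(pos, name, n) for pos, (name, n) in enumerate(ordered, start=1)]


def position_changes(today, yesterday):
    """Compare two snapshots produced by rank_datasets.

    Returns {dataset_name: places gained}, where a positive number means
    the dataset moved up and None means it is new in today's snapshot.
    """
    previous = {name: pos for pos, name, _ in yesterday}
    return {
        name: (previous[name] - pos if name in previous else None)
        for pos, name, _ in today
    }
```

Calling `rank_datasets` on each day's parsed model records and passing two consecutive results to `position_changes` gives the kind of "new" and position-change markers shown next to each entry.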

Data Sources

  • HuggingFace Hub API — download counts, likes, trending scores, model metadata, and README/model cards
  • Model card parsing — training datasets, training method (LoRA, DPO, SFT, etc.), framework, hardware, and hyperparameters extracted from README files
  • Tag classification — fine-tune detection via `base_model:finetune:*` and `base_model:quantized:*` HuggingFace tags, plus naming pattern analysis
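As a rough illustration of the tag classification step, the sketch below checks a model's tag list for the `base_model:finetune:*` and `base_model:quantized:*` patterns named above. The function name `classify_tags` and its return shape are hypothetical, and the naming pattern analysis is left out because the source does not describe its rules.

```python
def classify_tags(tags):
    """Return (kind, base_model) for the first matching tag, else None.

    `kind` is "finetune" or "quantized", taken from tags shaped like
    "base_model:finetune:<repo>" or "base_model:quantized:<repo>".
    """
    for tag in tags:
        # Split into at most three parts so the repo id stays intact.
        parts = tag.split(":", 2)
        if (
            len(parts) == 3
            and parts[0] == "base_model"
            and parts[1] in ("finetune", "quantized")
        ):
            return parts[1], parts[2]
    return None
```

A model whose tags yield `("finetune", ...)` would count as a fine-tune, while one yielding `("quantized", ...)` could be set aside so that quantizations are not counted as fine-tunes.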

Who Is This For?

This leaderboard is designed for anyone fine-tuning their own model who needs high-quality training data, and for researchers studying which data produces the best results in the fine-tuning ecosystem.

Whether you're a beginner exploring what's possible with fine-tuned AI models or an experienced ML engineer looking for the best starting point for your next project, these rankings give you a data-driven way to find the most widely used datasets without wading through thousands of quantizations, format conversions, and abandoned repositories on HuggingFace.

Update Schedule

This leaderboard was last updated on April 3, 2026. Rankings are refreshed daily with the latest download counts, likes, and trending data from HuggingFace. Historical snapshots are preserved to track trends over time — you can see which datasets are growing in popularity and which are being superseded by newer alternatives.