Most Used Datasets
The training datasets referenced by the most fine-tuned models in our catalog. The backbone of the fine-tuning ecosystem.
Last updated April 3, 2026 · Updated daily
wty-release by daxida holds the #1 position, ahead of arxiv-papers-by-subject at #2.
The top 10 is dominated by daxida, permutans, and Trelis. This is the first snapshot — future updates will track position changes and emerging trends.
The list ranks 193 datasets in total, from #1 down to #193.
wty-release
Bronze41daxida · Code
⚠️ This dataset is automatically uploaded. For source code and issue tracking, visit the wty GitHub repo. Version: 2026-04-03 · Commit: bee556d · Logs: link
models
arxiv-papers-by-subject
Bronze43permutans · Code
arXiv Papers by Subject: a reorganised version of the nick007x/arxiv-papers dataset, partitioned by subject code, year, and month for efficient selective access. This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset. Motivation: the original nick007x/arxiv-papers… See the full description on the dataset page: https://huggingface.co/datasets/permutans/arxiv-papers-by-subject.
models
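The subject/year/month partitioning described above is what enables selective access. As a minimal sketch (the per-month Parquet shard layout shown here is an assumption — check the dataset page for the real file structure), the paths for one subject and time window can be built and passed to `datasets`:

```python
# Sketch of selective access to a subject/year/month-partitioned dataset.
# The shard naming scheme below is hypothetical, not the dataset's
# documented layout.

def partition_paths(subject: str, year: int, months: range) -> list[str]:
    """Build the per-month shard paths for one subject and year."""
    return [f"{subject}/{year}/{month:02d}.parquet" for month in months]

paths = partition_paths("cs.CL", 2024, range(1, 4))
# With Hugging Face datasets, such a subset could then be loaded via:
#   load_dataset("permutans/arxiv-papers-by-subject", data_files=paths)
print(paths)
```

Only the selected files are downloaded, which is the point of the reorganisation relative to fetching the entire 2.5M-paper dump.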
tiny-shakespeare
Bronze42Trelis · Text Generation & Chat
Data source Downloaded via Andrej Karpathy's nanogpt repo from this link Data Format The entire dataset is split into train (90%) and test (10%). All rows are at most 1024 tokens, using the Llama 2 tokenizer. All rows are split cleanly so that sentences are whole and unbroken.
models
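The card above describes a 90%/10% train/test split in which rows never break mid-sentence. A minimal sketch of that idea (this re-implements the splitting principle, not the exact Trelis preprocessing, which also caps rows at 1024 Llama 2 tokens):

```python
# Sketch: split text 90/10, cutting only at sentence boundaries so that
# every row contains whole, unbroken sentences.

def sentence_split(text: str, train_frac: float = 0.9) -> tuple[str, str]:
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    cut = int(len(sentences) * train_frac)
    return " ".join(sentences[:cut]), " ".join(sentences[cut:])

train, test = sentence_split(
    "One. Two. Three. Four. Five. Six. Seven. Eight. Nine. Ten."
)
print(test)  # the last 10% of sentences
```

A real pipeline would additionally tokenize each candidate row and start a new row before the 1024-token limit is exceeded.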
Knesset (Israeli Parliament) Proceedings Corpus
Bronze41GiliGold · Classification & Sentiment
For The Knesset Corpus: [The Knesset Corpus]
models
trading-chart-patterns
Newdiamond-in · Finance
models
H-WBC
New2Nitrogen · Uncategorized
models
movie-v16
Bronze27ducanhh55 · Uncategorized
models
FineFineWeb
Silver64m-a-p · Classification & Sentiment
FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus. arXiv: Coming Soon · Project Page: Coming Soon · Blog: Coming Soon. Data statistics per domain (#tokens / #samples):

| Domain | Iter 1 tokens | Iter 2 tokens | Iter 3 tokens | Total tokens | Iter 1 count | Iter 2 count | Iter 3 count | Total count |
|---|---|---|---|---|---|---|---|---|
| aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539 |
| agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022 |
| artistic… | | | | | | | | |

See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.
models
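The per-domain statistics quoted above are internally consistent: each total count is the sum of the three iteration counts, which is easy to sanity-check:

```python
# Sanity check on the FineFineWeb per-domain sample counts quoted in the
# card: iteration 1 + iteration 2 + iteration 3 should equal the total.
aerospace = (9_100_000, 688_505, 611_034, 10_399_539)
agronomy = (15_752_828, 2_711_790, 649_404, 19_114_022)

for *iteration_counts, total in (aerospace, agronomy):
    assert sum(iteration_counts) == total
print("counts consistent")
```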
scirepeval
Bronze27allenai · Benchmarks & Evaluation
models
nguyenvanthanh2004
Bronze25nguyenvanthanh2004 · Uncategorized
models
aguvis-stage2
Bronze44xlangai · Code
AGUVIS Collection This is the AGUVIS collection stage 2 for computer/mobile/desktop trajectory training. Dataset Details Project Page: https://aguvis-project.github.io Repository: https://github.com/xlang-ai/aguvis Paper : https://huggingface.co/papers/2412.04454 Citation BibTeX: @article{xu2024aguvis, title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction}, author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/aguvis-stage2.
models
b1fd4aab
Newdude-os · Uncategorized
models
images
Bronze27deepinv · Image Recognition
models
gsm8k_sycophancy
Newpraneethd7 · Math & Reasoning
models
DataCompDR-1B
Silver50apple · Code
Dataset Card for DataCompDR-1B This dataset contains synthetic captions, embeddings, and metadata for DataCompDR-1B. The metadata has been generated using pretrained image-text models on DataComp-1B. For details on how to use the metadata, please visit our github repository. Dataset Details Dataset Description DataCompDR is an image-text dataset and an enhancement to the DataComp dataset. We reinforce the DataComp dataset using our multi-modal dataset… See the full description on the dataset page: https://huggingface.co/datasets/apple/DataCompDR-1B.
models
MELD
Bronze27ymw-hnu · Uncategorized
models
svg2_0.85
Newkrishagarwal · Video
models
ngothithao1984
Bronze25ngothithao1984 · Uncategorized
models
lerobot_rlbench_rvt_vcl_all_variations
Newvrlfdvla · Structured Data
models
groot-robocasa-300
Bronze39RoMALab · Code
This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v3.0", "robot_type": "robocasa", "total_episodes": 7200, "total_frames": 2066059, "total_tasks": 24, "chunks_size": 1000, "data_files_size_in_mb": 100, "video_files_size_in_mb": 500, "fps": 20, "splits": { "train": "0:7200"}, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/RoMALab/groot-robocasa-300.
models
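The `data_path` template in the card's `meta/info.json` maps a chunk index and file index to a Parquet shard. A minimal sketch of resolving it, using the template verbatim from the card; how LeRobot assigns indices to episodes is an assumption here, with files numbered sequentially and grouped into chunks of `chunks_size`:

```python
# Resolve LeRobot v3.0 shard paths from the card's data_path template.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"

def shard_path(file_number: int, chunks_size: int = 1000) -> str:
    """Map a sequential file number to its chunked shard path."""
    return DATA_PATH.format(
        chunk_index=file_number // chunks_size,  # which chunk directory
        file_index=file_number % chunks_size,    # position within the chunk
    )

print(shard_path(0))     # data/chunk-000/file-000.parquet
print(shard_path(1003))  # data/chunk-001/file-003.parquet
```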
TexVerse
Bronze49YiboZhang2001 · Role-Play & Characters
TexVerse: A Universe of 3D Objects with High-Resolution Textures Yibo Zhang1,2, Li Zhang1,3, Rui Ma2 *, Nan Cao1,4 1Shanghai Innovation Institute 2Jilin University 3Fudan University 4Tongji University * Corresponding Author TexVerse is a large-scale 3D dataset featuring high-resolution textures. Its key characteristics include: Scale & Source: TexVerse dataset has 858,669 unique 3D models curated from Sketchfab, including 158,518… See the full description on the dataset page: https://huggingface.co/datasets/YiboZhang2001/TexVerse.
models
endless-terminals
Bronze41obiwan96 · Uncategorized
This dataset is released with the Endless Terminals paper. There are about 2500 synthetic terminal-use environments in this dataset. license: mit
models
colpali_train_set
Bronze49vidore · Benchmarks & Evaluation
Dataset Description This dataset is the training set of ColPali it includes 127,460 query-image pairs from both openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. Dataset #examples (query-page pairs) Language DocVQA 39… See the full description on the dataset page: https://huggingface.co/datasets/vidore/colpali_train_set.
models
ShareGPT_Vicuna_unfiltered
Silver62anon8231489123 · Uncensored
Further cleaning done. Please look through the dataset and ensure that I didn't miss anything. Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c Two choices: Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
models
test_dataset
Newmosaicml · Uncategorized
models
ForeHOI
Bronze40YuantaoChen · Code
Paper: https://arxiv.org/abs/2602.06226 Project: https://tao-11-chen.github.io/project_pages/ForeHOI/ Github: https://github.com/Tao-11-chen/ForeHOI/ Please refer to example_loader.py for data usage
models
PND_Adam-U_pick-simple_speed2x
Bronze37BeingBeyond · Robotics
This data is part of the training data for Being-H0.5, produced by BeingBeyond. License This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Citation Being-H0.5 @article{beingbeyond2026beingh05, title={Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization}, author={Luo, Hao and Wang, Ye and Zhang, Wanpeng and Zheng, Sipeng and Xi, Ziheng and Xu, Chaoyi and Xu, Haiweng and Yuan, Haoqi and… See the full description on the dataset page: https://huggingface.co/datasets/BeingBeyond/PND_Adam-U_pick-simple_speed2x.
models
SolArchive.org Solana Datasets
Bronze38solarchive · Finance
solarchive.org: Solana Blockchain Datasets A clean, long-term, public archive of Solana blockchain data. This dataset contains a complete historical archive of Solana blockchain transactions, accounts, and tokens, sourced from Google BigQuery's public Solana dataset and optimized for analysis. 🎯 What is this? Solarchive is a free, public archive of the entire Solana blockchain, designed for: 🔬 Researchers analyzing blockchain behavior and patterns 📊 Data scientists… See the full description on the dataset page: https://huggingface.co/datasets/solarchive/solarchive.
models
TextPecker-1.5M
Bronze41CIawevy · Code
TextPecker-1.5M: A Dataset for Training and evaluating TextPecker This repository contains the TextPecker-1.5M dataset, a new benchmark proposed in the paper "TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering". Code and Project Page The official implementation and project details for the TextPecker and TextPecker-1.5M dataset can be found on the GitHub repository: https://github.com/CIawevy/TextPecker Sample Usage You… See the full description on the dataset page: https://huggingface.co/datasets/CIawevy/TextPecker-1.5M.
models
ATM-Bench
Bronze39Jingbiao · Creative Writing
ATM-Bench: Long-Term Personalized Referential Memory QA ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA over long time horizons (~4 years) with evidence-grounded retrieval and answering. Paper: According to Me: Long-Term Personalized Referential Memory QA Overview Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience.… See the full description on the dataset page: https://huggingface.co/datasets/Jingbiao/ATM-Bench.
models
TongSIM-Asset
Bronze49bigai · Code
TongSIM GitHub Visit Github : https://github.com/bigai-ai/tongsim What is TongSIM-Asset? As artificial intelligence (AI) rapidly advances, especially in multimodal large language models, research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action… See the full description on the dataset page: https://huggingface.co/datasets/bigai/TongSIM-Asset.
models
InternData-A1
Silver53InternRobotics · Robotics
InternData-A1 InternData-A1 is a hybrid synthetic-real manipulation dataset containing over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. Your browser does not support the video tag. Your browser does not support the video tag.… See the full description on the dataset page: https://huggingface.co/datasets/InternRobotics/InternData-A1.
models
rbm-1m-ood-full
Bronze33aliangdw · Benchmarks & Evaluation
RBM-1M-OOD evaluation dataset used in Robometer. It contains over 1k trajectories used for evaluation of general-purpose reward models. Dataset Description Official evaluation in the paper uses only these 6 data sources: usc_trossen, mit_franka, utd_so101, usc_xarm, usc_franka, usc_koch. Reported benchmarks and metrics in the paper are computed on this subset. The repository may also include trajectories from additional data sources (e.g. utd_so101_wrist, usc_koch_paired… See the full description on the dataset page: https://huggingface.co/datasets/aliangdw/rbm-1m-ood-full.
models
course-assets
Bronze34hf-vision · Image Recognition
models
G_CACHE_1
Newsumith2425 · Uncategorized
models
OceanTACO
Bronze44nilsleh · Structured Data
Dataset Card: OceanTACO Dataset Summary This dataset is a multi-source collection of global ocean sea surface measurements, integrating numerical model reanalysis, L4 gap-filled products, L3 satellite observations, and in-situ data. The collection includes sea surface height (SSH), temperature (SST), salinity (SSS), wind speed, and other variables. The L3 SWOT data has been processed onto a consistent regular grid through irreversible interpolation and coordinate… See the full description on the dataset page: https://huggingface.co/datasets/nilsleh/OceanTACO.
models
audio-files
Newnader33 · Uncategorized
models
sea-commoncrawl
Newsailor2 · Text - General
models
Nemotron-CC-v2
Silver56nvidia · Code
Nemotron-Pre-Training-Dataset-v1 Release Data Overview This pretraining dataset, for generative AI model training, preserves high-value math and code while enriching it with diverse multilingual Q&A, fueling the next generation of intelligent, globally-capable models. This dataset supports NVIDIA Nemotron Nano 2, a family of large language models (LLMs) that consists of the NVIDIA-Nemotron-Nano-9B-v2, NVIDIA-Nemotron-Nano-9B-v2-Base, and NVIDIA-Nemotron-Nano-12B-v2-Base… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.
models
openvalidators
Bronze47opentensor · Uncategorized
Dataset Card for Openvalidators dataset Dataset Summary The OpenValidators dataset, created by the OpenTensor Foundation, is a continuously growing collection of data generated by the OpenValidators project in W&B. It contains millions of records and serves researchers, data scientists, and miners in the Bittensor network. The dataset provides information on network performance, node behaviors, and wandb run details. Researchers can gain insights and detect patterns… See the full description on the dataset page: https://huggingface.co/datasets/opentensor/openvalidators.
models
SpreadsheetBench
Bronze40KAKA22 · Code
SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation | Paper | Github | Homepage | We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered… See the full description on the dataset page: https://huggingface.co/datasets/KAKA22/SpreadsheetBench.
models
toxic_conversations_50k
Bronze43mteb · Code
ToxicConversationsClassification An MTEB dataset Massive Text Embedding Benchmark Collection of comments from the Civil Comments platform together with annotations if the comment is toxic or not. Task category t2c Domains Social, Written Reference https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/overview How to evaluate on this task You can evaluate an embedding model on this dataset using the following code: import… See the full description on the dataset page: https://huggingface.co/datasets/mteb/toxic_conversations_50k.
models
sdxl-models
Bronze43Aisha-AI-Official · Role-Play & Characters
Aisha-AI.com 💜 A NSFW Social Network powered by AI Characters The models saved in this dataset are currently being used, or have been used at some point, to generate images and videos. The dataset is public and can be used as a backup or alternative to more unstable servers (like the unfortunate Civitai).
models
graphrl-spatial-gym-3iter
Newyw12356 · Image Recognition
models
PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams
Silver50nvidia · Code
PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams Paper | Paper Website | GitHub Download We provide a download script to download our dataset. If you have enough space, you can use git to download a dataset from huggingface. usage: download.py [-h] --odir ODIR [--file_types {hdmap,lidar,synthetic}[,…]] [--workers N] [--clean_cache] required arguments: --odir ODIR Output… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams.
models
apple_music_v5_10w_2
Bronze27TUBGX · Uncategorized
models
agieval-sat-en
Bronze35hails · Code
Dataset Card for "agieval-sat-en" Dataset taken from https://github.com/microsoft/AGIEval and processed as in that repo, following dmayhem93/agieval-* datasets on the HF hub. This dataset contains the contents of the SAT-en subtask of AGIEval, as accessed in https://github.com/ruixiangcui/AGIEval/commit/5c77d073fda993f1652eaae3cf5d04cc5fd21d40 . Citation: @misc {zhong2023agieval, title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models}, author={Wanjun Zhong and… See the full description on the dataset page: https://huggingface.co/datasets/hails/agieval-sat-en.
models
PolyCAT: Polygon-Aperture Eye-Tracking Dataset
Bronze37laBran · Image Recognition
PolyCAT: Polygon-Aperture Eye-Tracking Dataset PolyCAT is a public eye-tracking dataset for studying visual attention on natural images viewed through irregular polygon apertures. The dataset is designed to support saliency prediction, gaze modeling, and research on how geometric viewing constraints affect visual exploration strategies. Recording Details Eye tracker: EyeLink 1000+ (SR Research), head-mounted, binocular, 500 Hz per eye Display: 27" 4K monitor (3840 x 2160… See the full description on the dataset page: https://huggingface.co/datasets/laBran/PolyCAT.
models
Prophet's Mosque Library
Bronze47ieasybooks-org · Vision-Language
Prophet's Mosque Library 📖 Overview Prophet’s Mosque Library is one of the primary resources for Islamic books. It hosts more than 48,000 PDF books across over 70 categories. In this dataset, we processed the original PDF files using Google Document AI APIs and extracted their contents into two additional formats: TXT and DOCX. 📊 Dataset Contents The dataset includes 70,884 PDF files (spanning 23,494,042 pages) representing 48,717 Islamic books. Each book is… See the full description on the dataset page: https://huggingface.co/datasets/ieasybooks-org/prophet-mosque-library.
models
Kai0
Bronze42ts-learn · Robotics
KAI0 TODO The advantage label will be coming soon. Contents About the Dataset Load the Dataset Download the Dataset Dataset Structure Folder hierarchy Details License and Citation About the Dataset ~134 hours real world scenarios Main Tasks FlattenFold Single task Initial state: T-shirts are randomly tossed onto the table, presenting random crumpled configurations Manipulation task: Operate the robotic arm to… See the full description on the dataset page: https://huggingface.co/datasets/ts-learn/Kai0.
models
proof-pile-2
Silver52EleutherAI · Math & Reasoning
A dataset of high quality mathematical text.
models
Cantone
Bronze45AlienKevin · Speech & Audio
Cantone A dataset of 34,489 recordings of Cantonese syllables by 10 speakers. Those syllables are generated through the Cantonese speech synthesis engines of Amazon, Apple, Google, and Microsoft. All recordings are stored as WAV files with the following format Channel: mono Sample rate: 16 kHz Bits per sample: 16 Here's a breakdown of the number of recordings under each speaker: Company Speaker # Syllables Amazon Hiujin 3,885 Apple Aasing 2,977 Apple Sinji 2,977… See the full description on the dataset page: https://huggingface.co/datasets/AlienKevin/cantone.
models
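The card pins down an exact audio format (mono, 16 kHz sample rate, 16 bits per sample). A downloaded recording can be checked against that spec with Python's standard-library `wave` module; the file path here is a hypothetical placeholder:

```python
# Verify a WAV file matches the Cantone format stated in the card:
# mono, 16 kHz, 16-bit (i.e. 2 bytes per sample).
import wave

def matches_cantone_format(path: str) -> bool:
    with wave.open(path, "rb") as w:
        return (w.getnchannels(), w.getframerate(), w.getsampwidth()) == (1, 16000, 2)

# Usage (hypothetical filename):
#   matches_cantone_format("hiujin/aa1.wav")
```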
RAVine-logs
Bronze42sapphirex · Creative Writing
RAVine-logs This repository contains the running logs of the experiments conducted in the paper RAVine: Reality-Aligned Evaluation for Agentic Search. These logs can be used for result reproduction or detailed case analysis of agentic LLMs with search performance. RAVine is a comprehensive evaluation system for agentic search, encompassing the web environment, benchmark datasets, and a novel evaluation method, serving as a full-process, reproducible, and goal-aligned evaluation… See the full description on the dataset page: https://huggingface.co/datasets/sapphirex/RAVine-logs.
models
Launcher
Bronze25SVCFusion · Uncategorized
models
CloudSEN12-scribble
Bronze34csaybar · Instruction Following
🚨 New Dataset Version Released! We are excited to announce the release of Version [1.1] of our dataset! This update includes: [L2A & L1C support]. [Temporal support]. [Check the data without downloading (Cloud-optimized properties)]. 📥 Go to: https://huggingface.co/datasets/tacofoundation/cloudsen12 and follow the instructions in colab CloudSEN12 NOLABEL A Benchmark Dataset for Cloud Semantic Understanding CloudSEN12 SCRIBBLE A Benchmark Dataset for… See the full description on the dataset page: https://huggingface.co/datasets/csaybar/CloudSEN12-scribble.
models
fineweb-edu-translated
Silver50Helsinki-NLP · Translation & Multilingual
Helsinki-NLP/fineweb-edu-translated fineweb-edu-translated is a collection of automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models. The data covers 36,704,000 documents with over 28 billion space-separated tokens of English data translated into 36 languages. The total dataset includes over 960 billion tokens, and the translated documents are aligned across all languages. More information about how the data has been produced can… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/fineweb-edu-translated.
models
aochekq
Newjune94430 · Image Recognition
models
VIDGEN-1M
Bronze33AnXin69 · Benchmarks & Evaluation
Datasets Card We present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. We open-source the VidGen-1M dataset so that scholars can train their own models and conduct fair model evaluation. Details: due to network and size limitations, we split the dataset into 2048 parts and upload them one by… See the full description on the dataset page: https://huggingface.co/datasets/AnXin69/VIDGEN-1M.
models
MovieChat-1K_train
Bronze29Lovelittlerain · Text Generation & Chat
models
amazon_food_reviews
Newduongdono · Uncategorized
models
DocVQA
Silver52lmms-lab · Math & Reasoning
Large-scale Multi-modality Models Evaluation Suite Accelerating the development of large-scale multi-modality models (LMMs) with lmms-eval 🏠 Homepage | 📚 Documentation | 🤗 Huggingface Datasets This Dataset This is a formatted version of DocVQA. It is used in our lmms-eval pipeline to allow for one-click evaluations of large multi-modality models. @article{mathew2020docvqa, title={DocVQA: A Dataset for VQA on Document Images. CoRR abs/2007.00398 (2020)}… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/DocVQA.
models
au30_tra
Bronze38Sam04 · Uncategorized
Dataset Card for Dataset Name This dataset card aims to be a base template for new datasets. It has been generated using this raw template. Dataset Details Dataset Description Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed] Dataset Sources [optional] Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/Sam04/au30_tra.
models
Chair
Bronze28yuhuo03 · Uncategorized
models
magnetograms
Bronze26JD1361015 · Image Recognition
models
NewsWire
Bronze48dell-research-harvard · Science & Research
Dataset Card for NewsWire Dataset Summary NewsWire contains 2.7 million unique public domain U.S. news wire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model. Languages English (en) Dataset Structure Each year in the dataset is… See the full description on the dataset page: https://huggingface.co/datasets/dell-research-harvard/newswire.
models
TxT360
Silver56LLM360 · Text Generation & Chat
TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend Changelog Version Details v1.1 Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended. Details of v1.1 Additions TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.
models
webui-training-data
Newpost-train · Image Recognition
models
nguyenminhphuong1997
Bronze25nguyenminhphuong1997 · Uncategorized
models
fc-amf-ocr
Bronze47lightonai · Role-Play & Characters
Dataset Card for Finance Commons AMF OCR dataset (FC-AMF-OCR) Dataset Summary The FC-AMF-OCR dataset is a comprehensive document collection derived from the AMF-PDF dataset, which is part of the Finance Commons collection. This extensive dataset comprises 9.3 million images, each processed through Optical Character Recognition (OCR) using the docTR library. While native text annotations are available in the AMF-Text dataset, these annotations suffer from imperfections and… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/fc-amf-ocr.
models
AmazonReviewsClassification
Bronze33mteb · Code
AmazonReviewsClassification An MTEB dataset Massive Text Embedding Benchmark A collection of Amazon reviews specifically designed to aid research in multilingual text classification. Task category t2c Domains Reviews, Written Reference https://arxiv.org/abs/2010.02573 How to evaluate on this task You can evaluate an embedding model on this dataset using the following code: import mteb task = mteb.get_tasks(["AmazonReviewsClassification"]) evaluator =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/AmazonReviewsClassification.
models
movie-v3
Bronze27LinhHuong11 · Uncategorized
models
alpaca_en
Bronze47llamafactory · Code
Borrowed from: https://github.com/tatsu-lab/stanford_alpaca Removed some erroneous examples. You can use it in LLaMA Factory by specifying dataset: alpaca_en.
models
eval_venv
Bronze27Matt300209 · Benchmarks & Evaluation
models
Hyperheight Data Cube Denoising and Super-Resolution
Bronze46anfera236 · Code
Hyperheight Data Cube Denoising and Super-Resolution Dataset Summary Generation code and pipeline: https://github.com/Anfera/HHDC-Creator (HHDC-Creator repo). 3-D photon-count waveforms (Hyperheight data cubes) built from NEON discrete-return LiDAR using the HHDC pipeline (hhdc/cube_generator.py). Each cube stores a high-resolution canopy volume (default: 0.5 m vertical bins over 64 m height, footprints every 2 m) across a 96 m × 96 m tile. In the HHDC-Creator pipeline… See the full description on the dataset page: https://huggingface.co/datasets/anfera236/HHDC.
models
rare_share
Bronze37RARE111 · Math & Reasoning
The supplementary materials for RARE: Retrieval-Augmented Reasoning Modeling (https://arxiv.org/abs/2503.23513) license: apache-2.0
models
Amazon-Reviews-2023
Silver57McAuley-Lab · Uncategorized
Amazon Review 2023 is an updated version of the Amazon Review 2018 dataset. This dataset mainly includes reviews (ratings, text) and item metadata (descriptions, category information, price, brand, and images). Compared to the previous versions, the 2023 version features larger size, newer reviews (up to Sep 2023), richer and cleaner metadata, and finer-grained timestamps (from day to millisecond).
models
9552195U
Bronze35ITI121-25S2 · Uncategorized
Idli & Dosai Object Detection Dataset. This dataset contains annotated images of Idli and Dosai for custom object detection. Format: YOLOv8 (Ultralytics), bounding box annotations. Classes: idli, dosai. Source: images collected from real-world photographs and public online sources; annotations created using Roboflow. Usage: this dataset was created for ITI121 Assignment 2.
models
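YOLO-format labels, as used by this dataset's Ultralytics annotations, store one line per box: a class id followed by center-x, center-y, width, and height, all normalised to [0, 1]. A sketch of converting one label line back to pixel corners; the class mapping (0 = idli, 1 = dosai) is an assumption matching the card's class order:

```python
# Convert one YOLO label line ("cls cx cy w h", normalised) to pixel
# coordinates (x_min, y_min, x_max, y_max) for a given image size.

def yolo_to_pixels(line: str, img_w: int, img_h: int):
    cls, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    return int(cls), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

cls, box = yolo_to_pixels("0 0.5 0.5 0.25 0.5", 640, 480)
print(cls, box)  # 0 (240.0, 120.0, 400.0, 360.0)
```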
dangquanghuy1985
Newdangquanghuy1985 · Uncategorized
models
eurosat
Bronze40tanganke · Code
Dataset Card for EuroSAT. Dataset Source: Papers with Code. Usage: from datasets import load_dataset dataset = load_dataset('tanganke/eurosat') Data Fields The dataset contains the following fields: image: an image in RGB format. label: the label for the image, which is one of 10 classes: 0: annual crop land, 1: forest, 2: brushland or shrubland, 3: highway or road, 4: industrial buildings or commercial buildings, 5: pasture land, 6: permanent crop land… See the full description on the dataset page: https://huggingface.co/datasets/tanganke/eurosat.
models
extrinsic_contact_estimation_real_datasets
Bronze29serialexperimentsleon · Uncategorized
models
wholebody-pose-estimation-fingerspelling
Bronze38fhswf · Uncategorized
Whole-Body Pose Estimation Dataset for German Sign Language (DGS) Finger Alphabet Dataset Description This dataset contains 5,000 annotated images for fine-tuning whole-body pose estimation models. The images depict individuals performing signs from the German Sign Language (Deutsche Gebärdensprache) finger alphabet. The frames were extracted and annotated from the video dataset available at:[https://huggingface.co/datasets/fhswf/dgs-pose] Key Features… See the full description on the dataset page: https://huggingface.co/datasets/fhswf/wholebody-pose-estimation-fingerspelling.
models
HPLT2.0_cleaned
Silver50HPLT · Math & Reasoning
NB: HPLT2.0 is now superseded by a newer release: HPLT3.0. We recommend switching to v3.0, unless you have a compelling reason to stay on 2.0. This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly Internet Archive with some additions from Common Crawl. For a detailed description of the dataset, please refer to our website and our pre-print. The Cleaned variant of HPLT Datasets v2.0 This is the… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned.
models
OPUS EUconst
Bronze42Helsinki-NLP · Benchmarks & Evaluation
Dataset Card for OPUS EUconst Dataset Summary A parallel corpus collected from the European Constitution. EUconst's Numbers: Languages: 21 Bitexts: 210 Number of files: 986 Number of tokens: 3.01M Sentence fragments: 0.22M Supported Tasks and Leaderboards The underlying task is machine translation. Languages The languages in the dataset are: Czech (cs) Danish (da) German (de) Greek (el) English (en) Spanish (es) Estonian (et) Finnish (fi) French… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/euconst.
models
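The EUconst numbers above are consistent with one bitext per unordered language pair: 21 languages yield 21 × 20 / 2 = 210 pairs, matching the stated 210 bitexts:

```python
# One parallel corpus per unordered pair of the 21 EUconst languages.
from math import comb

languages = 21
bitexts = comb(languages, 2)  # C(21, 2) = 21 * 20 / 2
print(bitexts)  # 210
```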
pan_piper_mix_1
Bronze27HitmanReborn · Uncategorized
models
DRAGON
Bronze41lesc-unifi · Image Recognition
Dataset Card for DRAGON 🧾 ArXiv Preprint DRAGON is a large-scale Dataset of Realistic imAges Generated by diffusiON models. The dataset includes a total of 2.5 million training images and 100,000 test images generated using 25 diffusion models, spanning both recent advancements and older, well-established architectures. Dataset Details Dataset Description The remarkable ease of use of diffusion models for image generation has led to a proliferation of… See the full description on the dataset page: https://huggingface.co/datasets/lesc-unifi/dragon.
models
nguyenvanchienvn
Newnguyenvanchienvn · Uncategorized
models
Tadabur: A Large-Scale Quran Audio Dataset
Bronze39FaisaI · Speech & Audio
Tadabur: A Large-Scale Quran Audio Dataset The most comprehensive and richly annotated Qur'anic recitation corpus to date Faisal Alherran ✦ Overview Tadabur is a large-scale, high-diversity Qur'anic speech dataset designed to advance research in Qur'anic Automatic Speech Recognition (ASR), reciter modeling, tajwīd-aware speech processing, and prosodic analysis. It is the most comprehensive publicly available collection of Qur'anic recitation… See the full description on the dataset page: https://huggingface.co/datasets/FaisaI/tadabur.
models
MMEB_Test_Instruct
Bronze26ziyjiang · Instruction Following
models
reward-bench-results
Bronze43allenai · Benchmarks & Evaluation
Results for Holistic Evaluation of Reward Models (HERM) Benchmark Here, you'll find the raw scores for the HERM project. The repository is structured as follows.
├── best-of-n/              <- Nested directory for different completions on Best of N challenge
│   ├── alpaca_eval/        <- results for each reward model
│   │   ├── tulu-13b/{org}/{model}.json
│   │   └── zephyr-7b/{org}/{model}.json
│   └── mt_bench/
│   … See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-results.
models
GLIMPSE-processed-libero_10
Bronze28zrgong · Image Recognition
models
sts17-crosslingual-sts
Bronze40mteb · Code
STS17 An MTEB dataset Massive Text Embedding Benchmark Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation Task category t2t Domains News, Web, Written Reference https://alt.qcri.org/semeval2017/task1/ How to evaluate on this task You can evaluate an embedding model on this dataset using the following code: import mteb task = mteb.get_tasks(["STS17"]) evaluator = mteb.MTEB(task) model =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/sts17-crosslingual-sts.
models
sun397
Bronze43tanganke · Image Recognition
SUN397 dataset The database contains a 397-category subset of the SUN dataset for Scene Recognition, used in the following paper. The number of images varies across categories, but there are at least 100 images per category, and 108,754 images in total. All images are in jpg format. The images provided here are for research purposes only. The file ClassName.txt contains the name list for the 397 categories. Please cite the following paper if you use this dataset in your research.… See the full description on the dataset page: https://huggingface.co/datasets/tanganke/sun397.
models
results
Bronze33hallucinations-leaderboard · Benchmarks & Evaluation
models
latent_v1_alpha_03
Newatokforps · Uncategorized
models
latent_worker_early-a2_02
Bronze28atokforps · Uncategorized
models
EU Law Dataset - Category 15.10
Bronze37G4KMU · Legal
EU Law Dataset – Category 15.10: Environment This dataset contains official legal documents from the European Union, collected from the EUR-Lex website, specifically under category 15.10: "Environment". The documents span from the year 1961 to 2025 and are provided in multiple European languages. The original documents are in PDF format and have been converted into various text-based formats using OLMCR. The dataset splits represent the different languages available for each… See the full description on the dataset page: https://huggingface.co/datasets/G4KMU/LEMUR.
models
buily2003
Newbuily2003 · Uncategorized
models
casestudy_openevolve_results
Bronze25willychan21 · Uncategorized
models
plinder_apo2mol_subset
Newlinbc20 · Uncategorized
models
movie-v10
Bronze26LinhHuong11 · Uncategorized
models
OpenImage_top1_final
Bronze26Tungtom2004 · Image Recognition
models
turkey-all-universities
Bronze39h8st6ptv · Image Recognition
All Universities in Turkey Dataset Description This dataset contains detailed information about various universities. Each record represents a single university and includes attributes such as the university's name, type, city, website, address, logo URL, and a button for accessing additional details. This data is typically extracted from a web page listing universities. Fields 1. id… See the full description on the dataset page: https://huggingface.co/datasets/h8st6ptv/turkey-all-universities.
models
FineTranslations-Edu
Bronze42HuggingFaceFW · Text Generation & Chat
💬 FineTranslations The world's knowledge in 1+1T tokens of parallel text NOTE: this is the Edu version of the dataset, containing only the top 10% scoring data based on an educational classifier applied to the English translations. It has no splits. For the base dataset, see here. What is it? This dataset contains over 1 trillion tokens of parallel text in English and 500+ languages. It was obtained by translating data from 🥂 FineWeb2 into English using… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finetranslations-edu.
models
RELLISUR
Bronze26butters111 · Image Recognition
models
librispeech_asr_dummy
Bronze35hf-internal-testing · Speech & Audio
models
AnyEdit
Bronze43Bin1117 · Instruction Following
Celebrate! AnyEdit resolved the data alignment with the re-uploading process (but the view filter is not working:(, though it has 25 edit types). You can view the validation split for a quick look. You can also refer to anyedit-split dataset to view and download specific data for each editing type. Dataset Card for AnyEdit-Dataset Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often… See the full description on the dataset page: https://huggingface.co/datasets/Bin1117/AnyEdit.
models
movie-v7
Bronze28ducanhh55 · Uncategorized
models
GAMMA (Glaucoma grading from Multi-Modality imAges) Challenge Dataset
Bronze41Vincent08426 · Medical & Healthcare
GAMMA — Glaucoma grading from Multi-Modality imAges (Challenge dataset) Image: Dataset Samples. Short description GAMMA is the first public multi-modality glaucoma grading dataset that pairs 2D color fundus photographs with 3D OCT volumes for each sample. It was released as part of the GAMMA challenge (OMIA8 / MICCAI 2021) to encourage algorithms that combine fundus and OCT information for automatic… See the full description on the dataset page: https://huggingface.co/datasets/Vincent08426/GAMMA.
models
pile-val-backup
Bronze49mit-han-lab · Text - General
This is a backup for the pile val dataset downloaded from here: https://the-eye.eu/public/AI/pile/val.jsonl.zst Please respect the original license of the dataset.
models
U
Bronze36shiyiyoyo · Structured Data
UAV Trajectory Dataset Summary This dataset comprises over 5000 random UAV (Unmanned Aerial Vehicle) trajectories collected over 20 hours of flight time. It is intended for training AI models for applications such as trajectory prediction. The dataset is generated through an automated pipeline for the creation and preprocessing of UAV synthetic trajectories, making it ready for direct AI model training. Data Description The dataset features parameterized… See the full description on the dataset page: https://huggingface.co/datasets/shiyiyoyo/Synthetic-UAV-Flight-Trajectories.
models
Dance2Hesitate
Bronze25brsrikrishna · Uncategorized
models
otonariniginga
Bronze33BangumiBase · Role-Play & Characters
Bangumi Image Base of Otonari Ni Ginga This is the image base of bangumi Otonari ni Ginga, we detected 32 characters, 5029 images in total. The full dataset is here. Please note that these image bases are not guaranteed to be 100% cleaned; they may actually be noisy. If you intend to manually train models using this dataset, we recommend performing necessary preprocessing on the downloaded dataset to eliminate potential noisy samples (approximately 1% probability). Here is the… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/otonariniginga.
models
TroveLedger Financial Time Series Dataset
Bronze44Traders-Lab · Finance
🗃️ TroveLedger — Financial Time Series Dataset A growing ledger of accumulated market history. ⚠️ Temporary Notice: Intraday Data Adjustments (January 2026) What happened: A discrepancy has been identified in the minute- and hourly-resolution data: these series are currently not fully adjusted for stock splits and dividends. Daily-resolution data remains correctly adjusted (as provided by the source). Why this matters: For accurate backtesting and model training –… See the full description on the dataset page: https://huggingface.co/datasets/Traders-Lab/TroveLedger.
models
phamthihuong2003
Bronze26phamthihuong2003 · Uncategorized
models
humanoid-everyday
Silver50USC-PSI-Lab · Robotics
Humanoid Everyday A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation Overview Humanoid Everyday is a large-scale, diverse humanoid manipulation dataset designed for open-world robotic learning and embodied intelligence. It contains over 260 tasks across 7 major categories, covering dexterous manipulation, human–humanoid interaction, and locomotion-integrated activities. All data were collected through a human-supervised teleoperation pipeline, recording… See the full description on the dataset page: https://huggingface.co/datasets/USC-PSI-Lab/humanoid-everyday.
models
egoschema
Bronze29lmms-lab · Text - General
models
nguyenvana1990
Newnguyenvana1990 · Uncategorized
models
SloMoBlur
Bronze43Thomas880423 · Uncategorized
To cite this dataset in a publication, please use:
@misc{mahmud2025deblurringwildrealworldimage,
  title={Deblurring in the Wild: A Real-World Image Deblurring Dataset from Smartphone High-Speed Videos},
  author={Syed Mumtahin Mahmud and Mahdi Mohd Hossain Noki and Prothito Shovon Majumder and Abdul Mohaimen Al Radi and Sudipto Das Sukanto and Afia Lubaina and Md. Mosaddek Khan},
  year={2025},
  eprint={2506.19445},
  archivePrefix={arXiv},
  primaryClass={cs.CV}…
See the full description on the dataset page: https://huggingface.co/datasets/Thomas880423/SloMoBlur.
models
OPUS_Tatoeba
Newwecover · Text - General
models
omni-refiner-kontext
Bronze43lsmpp · Uncategorized
omni-refiner-kontext Uploaded via huggingface_hub API.
models
vuducmanh1991
Newvuducmanh1991 · Uncategorized
models
Parameter Golf FineWeb Export
Bronze46willdepueoai · Uncategorized
Parameter Golf FineWeb Export This repository hosts tokenizer-matched export artifacts derived from HuggingFaceFW/fineweb, specifically a 30B subset pulled from the 100B FineWeb cut used for parameter-golf experiments. The repository contains: pretokenized training and validation shards under datasets/datasets/ tokenizer artifacts under datasets/tokenizers/ the export manifest at datasets/manifest.json selected-document metadata at datasets/docs_selected.jsonl License… See the full description on the dataset page: https://huggingface.co/datasets/willdepueoai/parameter-golf.
models
GQA-35k
Bronze36Voxel51 · Math & Reasoning
Dataset Card for GQA-35k The GQA (Visual Reasoning in the Real World) dataset is a large-scale visual question answering dataset that includes scene graph annotations for each image. This is a FiftyOne dataset with 35000 samples. Note: This is a 35,000 sample subset which does not contain questions, only the scene graph annotations as detection-level attributes. You can find the recipe notebook for creating the dataset here Installation If you haven't already, install… See the full description on the dataset page: https://huggingface.co/datasets/Voxel51/GQA-Scene-Graph.
models
tatenoyuushanonariagariseason2
Bronze33BangumiBase · Role-Play & Characters
Bangumi Image Base of Tate No Yuusha No Nariagari Season 2 This is the image base of bangumi Tate no Yuusha no Nariagari Season 2, we detected 81 characters, 5635 images in total. The full dataset is here. Please note that these image bases are not guaranteed to be 100% cleaned; they may actually be noisy. If you intend to manually train models using this dataset, we recommend performing necessary preprocessing on the downloaded dataset to eliminate potential noisy samples… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/tatenoyuushanonariagariseason2.
models
Turmatle_pretrain_datasets
Bronze25zi-hui · Image Recognition
models
CocoChorales-E
Bronze42ben2002chou · Benchmarks & Evaluation
Viewer note: default uses viewer_preview/ for responsive audio playback. Full training/evaluation files remain available in the original folder structure. CocoChorales-E CocoChorales-E subset used by the LadderSym training pipeline. Paired Inputs for Error Detection The model takes paired inputs: mistake: performance audio/MIDI containing musical errors score: paired reference score audio/MIDI (target/correct context) Error supervision is provided with labels:… See the full description on the dataset page: https://huggingface.co/datasets/ben2002chou/CocoChorales-E.
models
hy
Newcryptodawn · Uncategorized
models
Military Aircraft Detection Dataset
Bronze41a2015003713 · Image Recognition
Military Aircraft Detection Dataset Military aircraft detection dataset in COCO and YOLO format. This dataset is synchronized from the original Kaggle dataset: https://www.kaggle.com/datasets/a2015003713/militaryaircraftdetectiondataset
models
Openpdf-Analysis-Recognition
Bronze37prithivMLmods · Role-Play & Characters
Openpdf-Analysis-Recognition The Openpdf-Analysis-Recognition dataset is curated for tasks related to image-to-text recognition, particularly for scanned document images and OCR (Optical Character Recognition) use cases. It contains over 6,900 images in a structured imagefolder format suitable for training models on document parsing, PDF image understanding, and layout/text extraction tasks. Attribute Value Task Image-to-Text Modality Image Format ImageFolder… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/Openpdf-Analysis-Recognition.
models
hle
Silver59cais · Code
[!NOTE] IMPORTANT: Please help us protect the integrity of this benchmark by not publicly sharing, re-uploading, or distributing the dataset. Humanity's Last Exam 🌐 Website | 📄 Paper | GitHub Center for AI Safety & Scale AI Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 2,500 questions across dozens of… See the full description on the dataset page: https://huggingface.co/datasets/cais/hle.
models
DiTFake
Bronze33lioooox · Code
Here is the released dataset (DiTFake) for Synthetic Image Detection (SID) proposed in our paper. Improving Synthetic Image Detection Towards Generalization: An Image Transformation Perspective This dataset contains 30,000 images in total, including synthetic images generated by three recent DiT-based models (Flux, PixArt, and SD3) and equal numbers of real images from COCO. More implementation details can be found in our GitHub repository.
models
llm_pt_leaderboard_raw_results
Bronze26eduagarcia-temp · Benchmarks & Evaluation
models
20d6952c
Newdude-os · Uncategorized
models
latent_worker_early-a2_04
Bronze28atokforps · Uncategorized
models
ver
Newzephyrglow · Uncategorized
models
Typed Digital Signatures Dataset
Bronze44Benjy · Role-Play & Characters
Typed Digital Signatures Dataset This comprehensive dataset contains synthetic digital signatures rendered across 30 different Google Fonts, specifically selected for their handwriting and signature-style characteristics. Each font contributes unique stylistic elements, making this dataset ideal for robust signature analysis and font recognition tasks. Dataset Overview Total Fonts: 30 different Google Fonts Images per Font: 3,000 signatures Total Dataset Size: ~90,000… See the full description on the dataset page: https://huggingface.co/datasets/Benjy/typed_digital_signatures.
models
bio-mcp-data
Bronze34longevity-genie · Math & Reasoning
Bio-MCP-Data A repository containing biological datasets to be used by BIO-MCP via the MCP (Model Context Protocol) standard. About This repository hosts biological data assets formatted to be compatible with the Model Context Protocol, enabling AI models to efficiently access and process biological information. The data is managed using Git Large File Storage (LFS) to handle large biological datasets. Purpose Provide standardized biological datasets for AI… See the full description on the dataset page: https://huggingface.co/datasets/longevity-genie/bio-mcp-data.
models
ahmedml
Bronze41neashton · Uncategorized
AhmedML: High-Fidelity Computational Fluid Dynamics dataset for incompressible, low-speed bluff body aerodynamics Contact: Neil Ashton (NVIDIA) - contact@caemldatasets.org website: https://caemldatasets.org Summary: This dataset contains 500 different geometric variations of the Ahmed Car Body - a simplified car-like shape that exhibits many of the flow topologies that are present on bluff bodies such as road vehicles. The dataset contains a wide… See the full description on the dataset page: https://huggingface.co/datasets/neashton/ahmedml.
models
DarijaMMLU
Bronze40MBZUAI-Paris · Benchmarks & Evaluation
Dataset Card for DarijaMMLU Dataset Summary DarijaMMLU is an evaluation benchmark designed to assess large language models' (LLM) performance in Moroccan Darija, a variety of Arabic. It consists of 22,027 multiple-choice questions, translated from selected subsets of the Massive Multitask Language Understanding (MMLU) and ArabicMMLU benchmarks to measure model performance on 44 subjects in Darija. Supported Tasks Task Category: Multiple-choice question… See the full description on the dataset page: https://huggingface.co/datasets/MBZUAI-Paris/DarijaMMLU.
models
reflect-r1-0309
Newguanys · Uncategorized
models
aihub-wild-animal
Bronze26im-wali · Uncategorized
models
cc12m-wds
Silver50pixparse · Vision-Language
Dataset Card for Conceptual Captions 12M (CC12M) Dataset Summary Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M). Usage This instance of Conceptual Captions is in webdataset .tar format. It can be used with the webdataset library or upcoming releases of Hugging Face datasets.… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/cc12m-wds.
models
LNDb
NewAngelou0516 · Text - General
models
dclm-baseline-filtered
Bronze28KORMo-Team · Uncategorized
models
phamngochieu1994
Bronze26phamngochieu1994 · Uncategorized
models
audiofolder_two_configs_in_metadata
Bronze25hf-internal-testing · Speech & Audio
models
GuanacoDataset
Silver53JosephusCheung · Text Generation & Chat
Sorry, it's no longer available on Hugging Face. Please reach out to those who have already downloaded it. If you have a copy, please refrain from re-uploading it to Hugging Face. The people here don't deserve it. See also: https://twitter.com/RealJosephus/status/1779913520529707387 GuanacoDataset News: We're heading towards multimodal VQA, with blip2-flan-t5-xxl Alignment to Guannaco 7B LLM. Still under construction: GuanacoVQA weight & GuanacoVQA Dataset Notice: Effective… See the full description on the dataset page: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset.
models
totz50k
Bronze26ynhe · Video
models
0xdesigner
Newgaianet · Text - General
models
bias_in_bios
Bronze44LabHC · Code
Bias in Bios Bias in Bios was created by (De-Arteaga et al., 2019) and published under the MIT license (https://github.com/microsoft/biosbias). The dataset is used to investigate bias in NLP models. It consists of textual biographies used to predict professional occupations; the sensitive attribute is gender (binary). The version shared here is the version proposed by (Ravfogel et al., 2020), which is slightly smaller due to the unavailability of 5,557 biographies. The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/LabHC/bias_in_bios.
models
img_upload
Bronze32Maynor996 · Image Recognition
models
English Characters Image Dataset
Bronze34Mayank022 · Role-Play & Characters
English Characters Image Dataset (A-Z, a-z, 0-9) This dataset contains high-resolution (128x128 pixels) grayscale images of English characters, including uppercase letters (A-Z), lowercase letters (a-z), and digits (0-9). Each character is available in 80,000 to 100,000 unique font styles, making it one of the most comprehensive resources for character-level image modeling. Dataset Description The images in this dataset have been generated by rendering over 85,000… See the full description on the dataset page: https://huggingface.co/datasets/Mayank022/English_Characters_Images.
models
alpaca_gpt4_zh
Silver50llamafactory · Instruction Following
Borrowed from: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM Removed 6,103 mistruncated examples. You can use it in LLaMA Factory by specifying dataset: alpaca_gpt4_zh.
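As a sketch, the card's dataset: alpaca_gpt4_zh line would sit inside a LLaMA Factory training config roughly like this (every key other than dataset is an illustrative assumption, not taken from the card):

```yaml
### hypothetical LLaMA Factory SFT config fragment — only `dataset` comes from the card
stage: sft
dataset: alpaca_gpt4_zh
template: default          # assumed; use the template matching your base model
output_dir: saves/alpaca_gpt4_zh-sft
```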
models
e2d65cf4
Newdude-os · Uncategorized
models
dinhthanhbinh1986
Bronze25dinhthanhbinh1986 · Uncategorized
models
CC-MAIN-2018-13
Bronze25cc-clean · Text - General
models
tldr
Bronze43trl-lib · Text - General
TL;DR Dataset Summary The TL;DR dataset is a processed version of Reddit posts, specifically curated to train models using the TRL library for summarization tasks. It leverages the common practice on Reddit where users append "TL;DR" (Too Long; Didn't Read) summaries to lengthy posts, providing a rich source of paired text data for training summarization models. Data Structure Format: Standard Type: Prompt-completion Columns: "prompt": The unabridged Reddit… See the full description on the dataset page: https://huggingface.co/datasets/trl-lib/tldr.
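The prompt-completion layout described above can be sketched with a toy row (the field values are hypothetical; only the column names follow the card):

```python
# Hypothetical TL;DR-style row in prompt-completion format.
row = {
    "prompt": "SUBREDDIT: r/AskReddit\nPOST: A very long post...\nTL;DR:",
    "completion": " A long post, summarized.",
}

def to_training_text(example: dict) -> str:
    """Join prompt and completion into a single training string."""
    return example["prompt"] + example["completion"]

print(to_training_text(row))
```

Libraries that consume this format (e.g. TRL trainers) typically do this concatenation internally, masking the prompt tokens from the loss.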
models
chords-billboard
Newlamooon · Text - General
models
Multimodal-Dataset-Image_Text_Table_TimeSeries-for-Financial-Time-Series-Forecasting
Bronze44Y123-wed · Finance
The sp500stock_data_description.csv file provides detailed information on the existence of four modalities (text, image, time series, and table) for 4,213 S&P 500 stocks. The hs300stock_data_description.csv file provides detailed information on the existence of four modalities (text, image, time series, and table) for 858 HS 300 stocks. If you find our research helpful, please cite our paper: @article{xu2025finmultitime, title={FinMultiTime: A Four-Modal Bilingual Dataset for… See the full description on the dataset page: https://huggingface.co/datasets/Y123-wed/Multimodal-Dataset-Image_Text_Table_TimeSeries-for-Financial-Time-Series-Forecasting.
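A sketch of how such a modality-presence file might be consumed; the column names and flag encoding here are hypothetical, since the card does not list the actual schema:

```python
import csv
import io

# Hypothetical rows mimicking sp500stock_data_description.csv:
# one row per stock, with a 0/1 flag per modality.
raw = io.StringIO(
    "ticker,text,image,time_series,table\n"
    "AAPL,1,1,1,1\n"
    "XYZ,1,0,1,0\n"
)

def stocks_with_all_modalities(f):
    """Return tickers for which all four modalities are present."""
    reader = csv.DictReader(f)
    return [
        row["ticker"]
        for row in reader
        if all(row[m] == "1" for m in ("text", "image", "time_series", "table"))
    ]

print(stocks_with_all_modalities(raw))  # ['AAPL']
```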
models
BoxFusion
NewKevin1804 · Image Recognition
models
assets
Bronze28Genesis-Intelligence · Uncategorized
models
PAWS: Paraphrase Adversaries from Word Scrambling
Silver51google-research-datasets · Classification & Sentiment
Dataset Card for PAWS: Paraphrase Adversaries from Word Scrambling Dataset Summary PAWS: Paraphrase Adversaries from Word Scrambling This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset. For further… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/paws.
models
CoSyn-400K
Bronze44allenai · Code
CoSyn-400k CoSyn-400k is a collection of synthetic question-answer pairs about a very diverse range of computer-generated images. The data was created by using the Claude large language model to generate code that can be executed to render an image, and using GPT-4o mini to generate Q/A pairs based on the code (without using the rendered image). The code used to generate this data is open source. Synthetic pointing data is available in a separate repo. Quick links: 📃 CoSyn… See the full description on the dataset page: https://huggingface.co/datasets/allenai/CoSyn-400K.
models
hf
Newcodexdream · Code
models
public
Bronze25humosleo · Uncategorized
models
nga2005
Bronze25nga2005 · Uncategorized
models
RottenTomatoes - MR Movie Review Data
Silver55cornell-movie-review-data · Benchmarks & Evaluation
Dataset Card for "rotten_tomatoes" Dataset Summary Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005. Supported Tasks and Leaderboards More Information Needed Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
models
MY_MARS
Newilovehyperspectral01 · Uncategorized
models
mast
Bronze26hyf015 · Uncategorized
models
movie-v3
Bronze29Yen0606 · Uncategorized
models
PLUS_Lab_GPUs_Data
Bronze26pluslab · Uncategorized
models
phamtrungkien1994
Bronze25phamtrungkien1994 · Uncategorized
models
MME
Bronze49lmms-lab · Benchmarks & Evaluation
Evaluation Dataset for MME
models
truongquocanh2003
Newtruongquocanh2003 · Uncategorized
models
XLEL-WD
Bronze42adithya7 · Text - General
XLEL-WD is a multilingual event linking dataset. This dataset contains mention references from multilingual Wikipedia/Wikinews articles to event items in Wikidata. The text descriptions for Wikidata events are compiled from Wikipedia articles.
models
jat-dataset-tokenized
Bronze43jat-project · Uncategorized
Dataset Card for "jat-dataset-tokenized" More Information needed
models
NIH-CXR14
Bronze44alkzar90 · Medical & Healthcare
The NIH Chest X-ray dataset consists of 100,000 de-identified images of chest x-rays. The images are in PNG format. The data is provided by the NIH Clinical Center and is available through the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC
models
Turkmen Speech Dataset
Bronze44rozumov · Benchmarks & Evaluation
Turkmen Speech Dataset (ASR) This dataset contains 251 hours of Turkmen speech audio with transcriptions, intended for training and evaluating Automatic Speech Recognition (ASR) models. It is one of the largest publicly available Turkmen speech datasets. Dataset Overview Property Value Total clips 119,847 Total duration 251.86 hours Sampling rate 16,000 Hz Language Turkmen (tk) Split train Each item includes: audio: waveform + sampling rate text:… See the full description on the dataset page: https://huggingface.co/datasets/rozumov/TurkmenSpeech.
models
MMLU-ProX
Bronze46li-lab · Code
MMLU-ProX MMLU-ProX is a multilingual benchmark that builds upon MMLU-Pro, extending to 29 typologically diverse languages, designed to evaluate large language models' reasoning capabilities across linguistic and cultural boundaries. Github | Paper News [2025/08] 🎉 MMLU-ProX was accepted by EMNLP 2025 Main Conference! [2025/05] MMLU-ProX now contains 29 languages, all available on Huggingface. [2025/03] MMLU-ProX is now available on Huggingface. [2025/03] We are still… See the full description on the dataset page: https://huggingface.co/datasets/li-lab/MMLU-ProX.
models
Graph-PanNuke
Bronze39dszohib · Code
Graph-PanNuke: A Cell-Graph Dataset for Nucleus Classification from PanNuke Graph-PanNuke is a node-level classification dataset derived from the PanNuke pan-cancer histology dataset. We use all slides at 40× magnification. Each tissue patch is converted into a cell-graph where nodes represent detected cell nuclei and edges encode spatial proximity. The task is predicting the cell type of each nucleus across 5 classes. Note that node features describe cell morphology, texture… See the full description on the dataset page: https://huggingface.co/datasets/dszohib/graph-pannuke.
models
fsc-180k
Bronze43Hollow12334 · Uncategorized
FSC-180k We introduce our hybrid semantic change detection dataset, named FSC-180k. It consists of approximately 60,000 real aerial images sourced from the FLAIR dataset, along with 180,000 artificially modified images. These images were generated using our HySCDG pipeline applied (three times) to each real image. In total, the dataset provides 180,000 image pairs. Each pair is accompanied by a binary change map and semantic segmentation maps for both images (land use… See the full description on the dataset page: https://huggingface.co/datasets/Hollow12334/fsc-180k.
models
ProObjaverse-300K
Bronze30Stable-X · Image Generation
models
libero_track_object_ee_relative
Bronze34CRRaphael · Code
This dataset was created using LeRobot. Dataset Structure meta/info.json:
{
    "codebase_version": "v2.1",
    "robot_type": "panda",
    "total_episodes": 1433,
    "total_frames": 43826,
    "total_tasks": 30,
    "total_videos": 0,
    "total_chunks": 2,
    "chunks_size": 1000,
    "fps": 10,
    "splits": { "train": "0:1433" },
    "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
    "video_path":…
See the full description on the dataset page: https://huggingface.co/datasets/CRRaphael/libero_track_object_ee_relative.
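The data_path value in the metadata above is a Python format string. Resolving it for a given episode can be sketched as follows; deriving the chunk as episode_index // chunks_size is an assumption based on the chunks_size field, not stated in the card:

```python
# Resolve a LeRobot-style data_path template for one episode.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
CHUNKS_SIZE = 1000  # "chunks_size" from meta/info.json

def episode_path(episode_index: int) -> str:
    # Assumed: episodes are grouped into chunks of CHUNKS_SIZE.
    chunk = episode_index // CHUNKS_SIZE
    return DATA_PATH.format(episode_chunk=chunk, episode_index=episode_index)

print(episode_path(1432))  # → data/chunk-001/episode_001432.parquet
```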
models
GUI-360
Bronze49vyokky · Code
GUI-360°: A Comprehensive Dataset And Benchmark For Computer-Using Agents Paper | Code GUI-360° is a large-scale, comprehensive dataset and benchmark suite designed to advance Computer-Using Agents (CUAs). 🎯 Key Features 🔢 1.2M+ executed action steps across thousands of trajectories 💼 Popular Windows office applications (Word, Excel, PowerPoint) 📸 Full-resolution screenshots with accessibility metadata 🎨 Multi-modal trajectories with reasoning traces ✅ Both… See the full description on the dataset page: https://huggingface.co/datasets/vyokky/GUI-360.
models
Słownik Języka Polskiego
Bronze37Apokryf · Legal
SJP Słownik Języka Polskiego (Dictionary of the Polish Language) — a dictionary resource, transferred from the official repositories, for working with the Polish language. https://sjp.pl/
models
hh-rlhf
Silver60Anthropic · Preference & Alignment (DPO/RLHF)
Dataset Card for HH-RLHF Dataset Summary This repository provides access to two different kinds of data: Human preference data about helpfulness and harmlessness from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. These data are meant to train preference (or reward) models for subsequent RLHF training. These data are not meant for supervised training of dialogue agents. Training dialogue agents on these data is likely to lead… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/hh-rlhf.
models
diffusers-images-docs
Newdiffusers · Image Recognition
models
SA-Med3D-140K
Bronze44blsmash044 · Code
SA-Med3D-140K [github] Dataset Summary SA-Med3D-140K is a large-scale, multi-modal, multi-anatomical volumetric medical image segmentation dataset. It was created to facilitate the development of general-purpose foundation models for 3D medical image segmentation. The dataset comprises 21,729 3D medical images and 143,518 corresponding masks. It was gathered from a combination of 70 public datasets and 8,128 privately licensed annotated cases from 24 hospitals.… See the full description on the dataset page: https://huggingface.co/datasets/blsmash044/SA-Med3D-140K.
models
open-images
Bronze29dalle-mini · Image Recognition
models
Dl3DV-Dataset
Silver50DL3DV · Creative Writing
DL3DV-Dataset This repo has all the 960P frames with camera poses of DL3DV-10K Dataset. We are working hard to review the entire dataset to avoid sensitive information. Thank you for your patience. Download If you have enough space, you can use git to download a dataset from huggingface. See this link. 480P/960P versions should satisfy most needs. If you do not have enough space, we further provide a download script here to download a subset. The usage: usage:… See the full description on the dataset page: https://huggingface.co/datasets/DL3DV/DL3DV-ALL-960P.
models
image
Newpxnjack · Image Recognition
models
roboverse_data
Bronze49RoboVerseOrg · Benchmarks & Evaluation
This dataset is part of the RoboVerse project, as described in the paper RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning.
models
vision-arena-bench-v0.1
Bronze36lmarena-ai · Preference & Alignment (DPO/RLHF)
VisionArena-Bench: An automatic eval pipeline to estimate model preference rankings A benchmark of 500 diverse user prompts that can be used to cheaply approximate Chatbot Arena model rankings via automatic benchmarking with a VLM as a judge. Dataset Sources Repository: https://github.com/lm-sys/FastChat Paper: https://arxiv.org/abs/2412.08687 Automatic Evaluation Code: Coming Soon! Dataset Structure question_id: The unique hash representing the… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1.
models