Most Downloaded Datasets

The most downloaded training datasets of all time. The data that powers the best fine-tuned models.

Last updated April 3, 2026 · Updated daily

FineFineWeb by m-a-p holds the #1 position with 2.6M downloads, ahead of ニコニコ実況 過去ログアーカイブ at 2.4M.

The top 10 is led by m-a-p, KakologArchives, and ropedia-ai. This is the first snapshot; future updates will track position changes and emerging trends.

The gap between #1 and #100 is 2.6M vs. 143.0K downloads, roughly an 18-fold difference, so downloads are heavily concentrated at the top.
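The size of that gap can be worked out directly from the abbreviated counts shown on this page. The following is a minimal sketch in Python, not part of the page's own tooling: `parse_count` and `download_ratio` are hypothetical helper names, and it assumes the `K`, `M`, and `B` suffixes mean thousand, million, and billion.

```python
# Assumed meaning of the suffixes used in the download counts on this page.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> int:
    """Turn an abbreviated count such as '2.6M' or '143.0K' into an integer."""
    text = text.strip()
    multiplier = SUFFIXES.get(text[-1].upper())
    if multiplier is None:
        # No recognised suffix: treat the text as a plain number.
        return int(round(float(text.replace(",", ""))))
    # round() guards against floating-point noise in e.g. 2.6 * 1_000_000.
    return int(round(float(text[:-1]) * multiplier))


def download_ratio(top: str, bottom: str) -> float:
    """How many times more downloads `top` has than `bottom`."""
    return parse_count(top) / parse_count(bottom)


if __name__ == "__main__":
    ratio = download_ratio("2.6M", "143.0K")
    print(f"#1 has about {ratio:.1f}x the downloads of #100")
```

Because the displayed counts are rounded, the resulting ratio is only approximate.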

🥇new

FineFineWeb

Silver64

m-a-p · Classification & Sentiment

FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus. arXiv: Coming Soon. Project Page: Coming Soon. Blog: Coming Soon.

Data Statistics

Domain (#tokens/#samples) | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count
aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539
agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022
artistic…

See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.
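In the two complete rows of the statistics excerpt above, the total sample count appears to be the sum of the three iteration counts. A small Python sketch to check that reading follows; the relationship is an assumption inferred from the numbers, not something the dataset card states in this excerpt, and `totals_match` is a hypothetical helper name.

```python
# Sample counts copied from the two complete rows of the excerpt above.
ROWS = {
    "aerospace": {"iterations": [9_100_000, 688_505, 611_034], "total": 10_399_539},
    "agronomy": {"iterations": [15_752_828, 2_711_790, 649_404], "total": 19_114_022},
}


def totals_match(rows: dict) -> dict:
    """Map each domain to whether its iteration counts sum to the listed total."""
    return {name: sum(row["iterations"]) == row["total"] for name, row in rows.items()}


if __name__ == "__main__":
    for domain, ok in totals_match(ROWS).items():
        print(domain, "consistent" if ok else "mismatch")
```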

2.6M

downloads

🥈new

ニコニコ実況 過去ログアーカイブ

Silver60

KakologArchives · Classification & Sentiment

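ニコニコ実況 過去ログアーカイブ (Niconico Jikkyo Past Log Archive) is a dataset that collects every past-log comment from Niconico Jikkyo, from the launch of the service to the present. In December 2020, Niconico Jikkyo was relaunched as an official channel within Niconico Live. As a result, the old system, in operation since November 2009, was discontinued (effectively ending the service). As support for home appliances such as torne and BRAVIA was dropped across the board, about 11 years of past logs, full of viewers' real-time reactions, were set to be lost as well. In response, residents of the DTV board on 5ch led a plan to archive 11 years of past logs for all channels before the old Niconico Jikkyo shut down. After many twists and turns, Nekopanda retrieved the complete past logs for all channels, including radio and BS, covering about 11 years, so those logs did not vanish into the digital sea. However, because the old API was retired, past logs through the API… See the full description on the dataset page: https://huggingface.co/datasets/KakologArchives/KakologArchives.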

2.4M

downloads

🥉new

Xperience-10M

Silver64

ropedia-ai · Video

⚠️ Important: If you have already submitted an access request but have not completed the required DocuSign agreement, your request will remain pending. Please complete signing and we will grant access once verified. Interactive Intelligence from Human Xperience Xperience-10M Dataset Summary Xperience-10M is a large-scale egocentric multimodal dataset of human experience for embodied AI, robotics, world models, and spatial… See the full description on the dataset page: https://huggingface.co/datasets/ropedia-ai/xperience-10m.

2.2M

downloads

4new

documentation-images

Silver64

huggingface · Image Recognition

This dataset contains images used in the documentation of HuggingFace's libraries. HF Team: Please make sure you optimize the assets before uploading them. My favorite tool for this is https://tinypng.com/.

2.0M

downloads

5new

banned-historical-archives

Silver58

banned-historical-archives · Code

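和谐历史档案馆数据集 - Banned Historical Archives Datasets. The Banned Historical Archives dataset contains original files that have been entered into https://banned-historical-archives.github.io as well as files not yet entered. Directory structure: banned-historical-archives.github.io # original data already entered into the website, synced periodically from the GitHub repository; raw # original files; config # configuration files; todo # files not yet entered into the website. Some newspaper and image materials are stored in separate repositories (name, URL, status): 参考消息 (Reference News), https://huggingface.co/datasets/banned-historical-archives/ckxx, not entered; 人民日报 (People's Daily), https://huggingface.co/datasets/banned-historical-archives/rmrb, selected important articles entered; 文汇报 (Wenhui Bao)… See the full description on the dataset page: https://huggingface.co/datasets/banned-historical-archives/banned-historical-archives.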

1.6M

downloads

6new

Generated Docs for HF

Silver59

hf-doc-build · Code

This repo contains all the docs published on https://huggingface.co/docs. The docs are generated with https://github.com/huggingface/doc-builder.

1.4M

downloads

7new

xCodeEval

Silver60

NTU-NLP-sg · Code

The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems or assist developers in writing programs can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way on only one or two specific tasks, in a few languages, at a partial granularity (e.g., function) level and in many cases without proper training data. Even more concerning is that in most cases the evaluation of generated codes has been done in terms of mere lexical overlap rather than actual execution whereas semantic similarity (or equivalence) of two code segments depends only on their "execution similarity", i.e., being able to get the same output for a given input.

1.1M

downloads

8new

WikiText

Silver66

Salesforce · Text Generation & Chat

Dataset Card for "wikitext" Dataset Summary The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.

1.1M

downloads

9new

MINT-1T

Silver58

mlfoundations · Vision-Language

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2024-18.

1.0M

downloads

10new

PhysicalAI-Autonomous-Vehicles

Silver67

nvidia · Uncategorized

PHYSICAL AI AUTONOMOUS VEHICLES The PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, geographically diverse collections of multi-sensor data empowering AV researchers to build the next generation of Physical AI based end-to-end driving systems. This dataset is ready for commercial/non-commercial AV use per the license agreement. Data Collection Method Automatic/Sensor Labeling Method Automatic/Sensor This dataset has a total of 1700 hours of driving… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles.

993.0K

downloads

11new

MINT-1T

Silver51

mlfoundations · Vision-Language

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-40.

854.1K

downloads

12new

SWE-bench_Pro

Silver60

ScaleAI · Code

Dataset Summary SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks. Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf See the related evaluation Github: https://github.com/scaleapi/SWE-bench_Pro-os Dataset Structure We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.

848.0K

downloads

13new

Grade School Math 8K

Silver67

openai · Math & Reasoning

Dataset Card for GSM8K Dataset Summary GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

761.9K

downloads

14new

MINT-1T

Silver52

mlfoundations · Vision-Language

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-50.

731.0K

downloads

15new

SWE-bench_Verified

Silver63

princeton-nlp · Code

Dataset Summary SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified.

694.3K

downloads

16new

uniocc

Silver53

tasl-lab · Code

UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving Paper | Project Page | Code Autonomous Driving researchers, have you ever been bothered by the fact that popular datasets all have their different formats, and standardizing them is a pain? Have you ever been frustrated by the difficulty of just understanding the file semantics? This challenge is even worse in the occupancy domain. But, UniOcc is here to help. UniOcc is a unified… See the full description on the dataset page: https://huggingface.co/datasets/tasl-lab/uniocc.

660.8K

downloads

17new

MINT-1T

Silver52

mlfoundations · Vision-Language

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-23.

628.0K

downloads

18new

MADLAD-400

Silver61

allenai · Text Generation & Chat

MADLAD-400 Dataset and Introduction MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

627.1K

downloads

19new

C4

Silver64

allenai · Text Generation & Chat

C4 Dataset Summary A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's C4 dataset. We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, these are the sizes of the variants: en: 305GB en.noclean: 2.3TB en.noblocklist: 380GB realnewslike: 15GB multilingual (mC4): 9.7TB (108 subsets, one per… See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.

621.1K

downloads

20new

MINT-1T

Silver50

mlfoundations · Vision-Language

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-14.

570.5K

downloads

21new

HuggingFaceFW/finephrase

Silver60

HuggingFaceFW · Instruction Following

Dataset Card for HuggingFaceFW/finephrase Dataset Summary Synthetic data generated by DataTrove: Model: HuggingFaceTB/SmolLM2-1.7B-Instruct (main) Source dataset: HuggingFaceFW/fineweb-edu, config sample-350BT, split train Generation config: temperature=1.0, top_p=1.0, top_k=50, max_tokens=2048, model_max_context=8192 Speculative decoding: {"method":"suffix","num_speculative_tokens":32} System prompt: None Input column: text Prompt families: faq prompt Rewrite the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finephrase.

546.2K

downloads

22new

pesoz

Bronze33

Kthera · Uncategorized

546.0K

downloads

23new

MINT-1T

Silver52

mlfoundations · Vision-Language

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-06.

541.6K

downloads

24new

dolma3_mix-6T-1025-7B

Silver57

allenai · Math & Reasoning

⚠️ WARNING: This dataset is intended ONLY for reproducing Olmo 3 7B ⚠️ For all other training use cases, including training from scratch, please utilize our primary dolma 3 data mix: https://huggingface.co/datasets/allenai/dolma3_mix-6T. Note: Some olmOCR science PDFs in the current dataset have been redacted following the training of Olmo 3 7B. These texts are indicated with [REMOVED] in the text field. This will affect reproducibility of Olmo 3 7B. For this reason, please use our… See the full description on the dataset page: https://huggingface.co/datasets/allenai/dolma3_mix-6T-1025-7B.

536.4K

downloads

25new

ubuntu_osworld_file_cache

Silver52

xlangai · Benchmarks & Evaluation

OSWorld File Cache This repository serves as a file cache for the OSWorld project, providing reliable and fast access to evaluation files that were previously hosted on Google Drive. Overview OSWorld is a scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems and applications. This cache repository ensures that all evaluation files are consistently accessible… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache.

532.8K

downloads

26new

LegalBench (Staging)

Silver61

nguha · Code

Dataset Card for Dataset Name Homepage: https://hazyresearch.stanford.edu/legalbench/ Repository: https://github.com/HazyResearch/legalbench/ Paper: https://arxiv.org/abs/2308.11462 Dataset Description Dataset Summary The LegalBench project is an ongoing open science effort to collaboratively curate tasks for evaluating legal reasoning in English large language models (LLMs). The benchmark currently consists of 162 tasks gathered from 40… See the full description on the dataset page: https://huggingface.co/datasets/nguha/legalbench.

528.2K

downloads

27new

upload2

Bronze33

Maynor996 · Image Recognition

520.2K

downloads

28new

medical-qa-shared-task-v1-toy

Silver56

lavita · Medical & Healthcare

Dataset Card for "medical-qa-shared-task-v1-toy" More Information needed

520.1K

downloads

29new

results

Bronze49

mteb · Uncategorized

Results on MTEB

518.2K

downloads

30new

HF Documentation (PRs)

Silver54

hf-doc-build · Code

This dataset contains the docs from all the PRs that update one of the docs from https://huggingface.co/docs. It is automatically updated by this GitHub action from the doc-builder repo.

467.7K

downloads

31new

Egocentric-100K

Silver60

builddotai · Benchmarks & Evaluation

Egocentric-100K is the largest dataset of manual labor. You can visualize the dataset here. Egocentric-100K is state-of-the-art in hand visibility and active manipulation density compared to previous in-the-wild egocentric datasets. The complete 30,000 frame evaluation set is available at Egocentric-100K-Evaluation. Dataset Statistics Attribute Value Total Hours 100,405 Total Frames 10.8 billion Video Clips 2,010,759 Median Clip Length 180.0 seconds Mean Hours… See the full description on the dataset page: https://huggingface.co/datasets/builddotai/Egocentric-100K.

467.5K

downloads

32new

genshin-voices-separated

Bronze39

AquaV · Uncategorized

462.2K

downloads

33new

sts22-crosslingual-sts

Silver53

mteb · Code

STS22.v2 An MTEB dataset Massive Text Embedding Benchmark SemEval 2022 Task 8: Multilingual News Article Similarity. Version 2 filters STS22 by removing pairs where one of the entries contains empty sentences. Task category t2t Domains News, Written Reference https://competitions.codalab.org/competitions/33835 How to evaluate on this task You can evaluate an embedding model on this dataset using the following code: import mteb task =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/sts22-crosslingual-sts.

453.5K

downloads

34new

GiftEvalPretrain

Silver56

Salesforce · Code

GIFT-Eval Pre-training Datasets Pretraining dataset aligned with GIFT-Eval that has 71 univariate and 17 multivariate datasets, spanning seven domains and 13 frequencies, totaling 4.5 million time series and 230 billion data points. Notably this collection of data has no leakage issue with the train/test split and can be used to pretrain foundation models that can be fairly evaluated on GIFT-Eval. 📄 Paper 🖥️ Code 📔 Blog Post 🏎️ Leader Board Ethical Considerations… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/GiftEvalPretrain.

427.7K

downloads

35new

img_upload

Bronze32

Maynor996 · Image Recognition

427.2K

downloads

36new

OpenThoughts-1k-sample

Bronze49

ryanmarten · Code

[!NOTE] We have released a paper for OpenThoughts! See our paper here. Open-Thoughts-1k-sample This is a 1k sample of the OpenThoughts-114k dataset. Open synthetic reasoning dataset with high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting with Curator Viewer. Available Subsets default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds =… See the full description on the dataset page: https://huggingface.co/datasets/ryanmarten/OpenThoughts-1k-sample.

425.9K

downloads

37new

AutoMathText-V2

Silver58

OpenSQZ · Code

🚀 AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset 🎉 AutoMathText-v2 has surpassed 1 million downloads! We'd love to know how you're using it. Please take 1 minute to fill out our use case survey. Your feedback will directly shape the future roadmap of this dataset. 👉 Share your use case here 📊 AutoMathText-V2 consists of 2.46 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual… See the full description on the dataset page: https://huggingface.co/datasets/OpenSQZ/AutoMathText-V2.

419.8K

downloads

38new

debug

Bronze49

rtrm · Text - General

test3

411.4K

downloads

39new

Measuring Massive Multitask Language Understanding

Silver64

cais · Science & Research

Dataset Card for MMLU Dataset Summary Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

400.9K

downloads

40new

GLUE (General Language Understanding Evaluation benchmark)

Silver63

nyu-mll · Benchmarks & Evaluation

Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.

379.2K

downloads

41new

Ai2Arc

Silver62

allenai · Science & Research

Dataset Card for "ai2_arc" Dataset Summary A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ai2_arc.

376.1K

downloads

42new

droid_1.0.1

Silver55

cadene · Code

This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v2.1", "robot_type": "Franka", "total_episodes": 95600, "total_frames": 27612581, "total_tasks": 0, "total_videos": 286800, "total_chunks": 95, "chunks_size": 1000, "fps": 15, "splits": { "train": "0:95600" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/cadene/droid_1.0.1.
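The `data_path` template in the excerpt above is an ordinary Python format string, so the file for a given episode can be derived with `str.format`. A minimal sketch follows; `episode_parquet_path` is a hypothetical helper, and mapping an episode to its chunk by integer division with `chunks_size` is an assumption rather than something stated in the excerpt.

```python
# Values copied from the meta/info.json excerpt above.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
CHUNKS_SIZE = 1000


def episode_parquet_path(episode_index: int, chunks_size: int = CHUNKS_SIZE) -> str:
    """Build the relative parquet path for one episode."""
    # Assumption: episodes are grouped into consecutive chunks of `chunks_size`.
    episode_chunk = episode_index // chunks_size
    return DATA_PATH.format(episode_chunk=episode_chunk, episode_index=episode_index)


if __name__ == "__main__":
    print(episode_parquet_path(12345))
```

The `:03d` and `:06d` format specs zero-pad the chunk and episode numbers, which keeps file names sortable in lexical order.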

373.1K

downloads

43new

CommitPackFT

Silver58

bigcode · Instruction Following

CommitPackFT is a 2GB filtered version of CommitPack that contains only high-quality commit messages resembling natural language instructions.

367.4K

downloads

44new

regions

Bronze32

world-igr-plum · Uncategorized

350.4K

downloads

45new

results

Bronze33

hallucinations-leaderboard · Benchmarks & Evaluation

326.3K

downloads

46new

Procgen Benchmark Dataset

Silver51

EpicPinkPenguin · Benchmarks & Evaluation

Procgen Benchmark This dataset contains expert trajectories generated by a PPO reinforcement learning agent trained on each of the 16 procedurally-generated gym environments from the Procgen Benchmark. The environments were created on distribution_mode=easy and with unlimited levels. Disclaimer: This is not an official repository from OpenAI. Dataset Usage Regular usage (for environment bigfish): from datasets import load_dataset train_dataset =… See the full description on the dataset page: https://huggingface.co/datasets/EpicPinkPenguin/procgen.

326.2K

downloads

47new

FineWeb-Edu

Silver64

HuggingFaceFW · Instruction Following

📚 FineWeb-Edu 1.3 trillion tokens of the finest educational data the 🌐 web has to offer Paper: https://arxiv.org/abs/2406.17557 What is it? 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by LLama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

310.3K

downloads

48new

objaverse

Silver62

allenai · Uncategorized

Objaverse Objaverse is a Massive Dataset with 800K+ Annotated 3D Objects. More documentation is coming soon. In the meantime, please see our paper and website for additional details. License The use of the dataset as a whole is licensed under the ODC-By v1.0 license. Individual objects in Objaverse are all licensed as creative commons distributable objects, and may be under the following licenses: CC-BY 4.0 - 721K objects CC-BY-NC 4.0 - 25K objects CC-BY-NC-SA 4.0 - 52K… See the full description on the dataset page: https://huggingface.co/datasets/allenai/objaverse.

309.9K

downloads

49new

gaia

Bronze46

siril-spcc · Science & Research

This catalog is developed for use with the Siril 1.4 series as a public reference database. Hugging Face is one of several mirrors used to distribute the data. This database is provided for both offline download and also for online access. This dataset is provided for scientific and reproducibility purposes. This is an extract of the Gaia DR3 catalog optimized for spectrophotometric color calibration. The catalog is indexed at HEALpix level 8 and selects up to the 127 brightest sources in each… See the full description on the dataset page: https://huggingface.co/datasets/siril-spcc/gaia.

304.4K

downloads

50new

fineweb-edu-translated

Silver50

Helsinki-NLP · Translation & Multilingual

Helsinki-NLP/fineweb-edu-translated fineweb-edu-translated is a collection of automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models. The data covers 36,704,000 documents with over 28 billion space-separated tokens of English data translated into 36 languages. The total data set includes over 960 billion tokens, and the translated documents are aligned across all languages. More information about how the data has been produced can… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/fineweb-edu-translated.

300.6K

downloads

51new

FineWeb-HQ

Silver51

epfml · Text Generation & Chat

FineWeb-HQ Dataset Summary FineWeb-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb. FineWeb-HQ was created by selecting the top 10% of FineWeb documents based on a deep learning classifier trained to identify structured and knowledge-rich samples. This classifier uses XLM-RoBERTa embeddings to score documents. To validate our approach, we pretrained 1B-parameter LLM models with a Llama-like architecture across multiple languages and… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb-HQ.

300.4K

downloads

52new

AIWD6

Bronze46

Kondapally · Image Recognition

AIWD16 Multi-task weather dataset containing annotations for the following: Image Classification (weather transitions), Object Detection, Semantic Segmentation, Instance Segmentation, VQA. Classification Labels: Cloudy_to_Rainy, Rainy_to_Cloudy, Rainy_to_Sunny, Sunny_to_Foggy, Foggy_to_Sunny, Sunny_to_Rainy. Directory Structure: images/: image data; metadata.csv: classification labels; Det_annotations/: detection annotations; SS_annotations/: semantic segmentation… See the full description on the dataset page: https://huggingface.co/datasets/Kondapally/AIWD6.

293.2K

downloads

53new

LLaVA-OneVision-1.5-Mid-Training-85M

Silver57

mvp-lab · Image Recognition

🚀 LLaVA-One-Vision-1.5-Mid-Training-85M Dataset is being uploaded 🚀 Upload Status All Completed: ImageNet-21k, LAIONCN, DataComp-1B, Zero250M, COYO700M, SA-1B, MINT, Obelics 📜 Cite If you find LLaVA-One-Vision-1.5-Mid-Training-85M useful in your research, please consider citing the following related papers: @misc{an2025llavaonevision15fullyopenframework, title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training}… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M.

290.6K

downloads

54new

course-images

Bronze39

agents-course · Image Recognition

284.4K

downloads

55new

HellaSwag

Silver59

Rowan · Benchmarks & Evaluation

Dataset Card for "hellaswag" Dataset Summary HellaSwag: Can a Machine Really Finish Your Sentence? is a new dataset for commonsense NLI. A paper was published at ACL2019. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 71.49 MB Size of the generated dataset: 65.32 MB Total amount of disk used: 136.81… See the full description on the dataset page: https://huggingface.co/datasets/Rowan/hellaswag.

272.4K

downloads

56new

documentation-images

Bronze31

huggingface-course · Image Recognition

267.8K

downloads

57new

DreamZero-DROID-Data

Bronze33

GEAR-Dreams · Video

253.4K

downloads

58new

OpenAI HumanEval

Silver61

openai · Code

Dataset Card for OpenAI HumanEval Dataset Summary The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure not to be included in the training set of code generation models. Supported Tasks and Leaderboards Languages The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

250.9K

downloads

59new

Hyperheight Data Cube Denoising and Super-Resolution

Bronze46

anfera236 · Code

Hyperheight Data Cube Denoising and Super-Resolution Dataset Summary Generation code and pipeline: https://github.com/Anfera/HHDC-Creator (HHDC-Creator repo). 3-D photon-count waveforms (Hyperheight data cubes) built from NEON discrete-return LiDAR using the HHDC pipeline (hhdc/cube_generator.py). Each cube stores a high-resolution canopy volume (default: 0.5 m vertical bins over 64 m height, footprints every 2 m) across a 96 m × 96 m tile. In the HHDC-Creator pipeline… See the full description on the dataset page: https://huggingface.co/datasets/anfera236/HHDC.

248.6K

downloads

60new

Zyda-2

Silver58

Zyphra · Code

Zyda-2 Zyda-2 is a 5 trillion token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/Zyphra/Zyda-2.

246.4K

downloads

61new

SuperGLUE

Silver59

aps · Benchmarks & Evaluation

Dataset Card for "super_glue" Dataset Summary SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances axb Size of downloaded dataset files: 0.03 MB Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.

235.2K

downloads

62new

psp

Bronze31

Emmyc2 · Uncategorized

235.1K

downloads

63new

arxiv_ocr

Bronze31

Chelsea707 · Uncategorized

232.6K

downloads

64new

Mostly Basic Python Problems

Silver60

google-research-datasets · Code

Dataset Card for Mostly Basic Python Problems (mbpp) Dataset Summary The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

221.9K

downloads

65new

magvits_data

Bronze31

arcadiaaaaa · Uncategorized

218.7K

downloads

66new

Eval Awareness Dataset

Bronze45

dsinghvi · Code

Eval Awareness Dataset: contrastive pairs with and without eval cues, showing behavioural changes across various misaligned situations. We also provide automated scripts to create these scenarios, along with other code related to suppressing eval awareness, at https://github.com/divyanshsinghvi/evalawareness_techniques/ Authors: @divyanshsinghvi, @Riteshbhalerao11

216.1K

downloads

67new

P3

Silver60

bigscience · Science & Research

Dataset Card for P3 Dataset Summary P3 (Public Pool of Prompts) is a collection of prompted English datasets covering a diverse set of NLP tasks. A prompt is the combination of an input template and a target template. The templates are functions mapping a data example into natural language for the input and target sequences. For example, in the case of an NLI dataset, the data example would include fields for Premise, Hypothesis, Label. An input template would be If… See the full description on the dataset page: https://huggingface.co/datasets/bigscience/P3.

215.5K

downloads

68new

LLaVA-OneVision-1.5-Instruct-Data

Silver56

mvp-lab · Instruction Following

LLaVA-OneVision-1.5 Instruction Data Paper | Code 📌 Introduction This dataset, LLaVA-OneVision-1.5-Instruct, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data.

211.1K

downloads

69new

SAGE-10k

Silver56

nvidia · Uncategorized

SAGE-10k SAGE-10k is a large-scale interactive indoor scene dataset featuring realistic layouts, generated by the agentic-driven pipeline introduced in "SAGE: Scalable Agentic 3D Scene Generation for Embodied AI". The dataset contains 10,000 diverse scenes spanning 50 room types and styles, along with 565K uniquely generated 3D objects. 🔑 Key Features SAGE-10k integrates a wide variety of scenes, and particularly, preserves small items for… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/SAGE-10k.

209.4K

downloads

70new

WinoGrande

Silver57

allenai · Math & Reasoning

Dataset Card for "winogrande" Dataset Summary WinoGrande is a new collection of 44k problems, inspired by Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and robustness against the dataset-specific bias. Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. Supported Tasks and Leaderboards More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.

208.5K

downloads

71new

nbchr_pdfs

Bronze30

daniilakk · Uncategorized

208.0K

downloads

72new

FineWeb

Silver66

HuggingFaceFW · Text Generation & Chat

🍷 FineWeb 15 trillion tokens of the finest data the 🌐 web has to offer What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

206.0K

downloads

73new

badges

Silver55

huggingface · Code

Badges A set of badges you can use anywhere. Just update the anchor URL to point to the correct action for your Space. Light or dark background with 4 sizes available: small, medium, large, and extra large. How to use? With markdown, just copy the badge from: https://huggingface.co/datasets/huggingface/badges/blob/main/README.md?code=true With HTML, inspect this page with your web browser and copy the outer html. Available sizes Small Medium Large Extra… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/badges.

197.3K

downloads

74new

snodas-snowmelt-cache

Bronze30

Jsinowitz · Uncategorized

195.3K

downloads

75new

LolData

Bronze30

rhmnhsim · Uncategorized

193.9K

downloads

76new

IMDB

Silver61

stanfordnlp · Benchmarks & Evaluation

Dataset Card for "imdb" Dataset Summary Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.

190.9K

downloads

77new

image_dummy

Bronze45

Narsil · Speech & Audio

190.7K

downloads

78new

climbmix-400b-shuffle

Bronze39

karpathy · Text - General

185.4K

downloads

79new

brand-assets

Bronze36

huggingface · Image Recognition

185.0K

downloads

80new

hf_hub_cache

Bronze30

hf-internal-testing · Uncategorized

181.4K

downloads

81new

JAT-dataset

Silver55

jat-project · Reinforcement Learning

JAT Dataset Dataset Description The Jack of All Trades (JAT) dataset combines a wide range of individual datasets. It includes expert demonstrations by expert RL agents, image and caption pairs, textual data and more. The JAT dataset is part of the JAT project, which aims to build a multimodal generalist agent. Paper: https://huggingface.co/papers/2402.09844 Usage >>> from datasets import load_dataset >>> dataset = load_dataset("jat-project/jat-dataset"… See the full description on the dataset page: https://huggingface.co/datasets/jat-project/jat-dataset.

178.5K

downloads

82new

bridgev2

Bronze45

Saberlve · Code

This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v2.1", "robot_type": "WidowX", "total_episodes": 53192, "total_frames": 1999410, "total_tasks": 19974, "total_videos": 212768, "total_chunks": 54, "chunks_size": 1000, "fps": 5, "splits": { "train": "0:53192" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/Saberlve/bridgev2.

178.4K

downloads

83new

MINT-1T

Silver57

mlfoundations · Vision-Language

🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-HTML.

174.5K

downloads

84new

sound-benchmark

Bronze30

AE-W · Benchmarks & Evaluation

171.7K

downloads

85new

OpenThoughts-114k

Silver62

open-thoughts · Code

[!NOTE] We have released a paper for OpenThoughts! See our paper here. Open-Thoughts-114k Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting with Curator Viewer. Available Subsets default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.

169.4K

downloads

86new

CORAAL Dataset

Bronze45

liamstone707 · Speech & Audio

Dataset Card for CORAAL Dataset Summary This dataset comprises audio files, text files, and audio segments sourced from the Corpus of Regional African American Language (CORAAL). CORAAL is a subset of the Online Resources for African American Language (ORAAL) project, initiated by a team of linguistics researchers at the University of Oregon. The original CORAAL dataset encompasses over 220 sociolinguistic interviews featuring African American Language (AAL) speakers born… See the full description on the dataset page: https://huggingface.co/datasets/liamstone707/CORAAL.

166.6K

downloads

87new

MNBVC

Silver61

liwu · Text Generation & Chat

MNBVC: Massive Never-ending BT Vast Chinese corpus

165.2K

downloads

88new

ETCI 2021 Flood Detection Dataset

Bronze45

luisrH · Image Recognition

ETCI 2021 Flood Detection Dataset Description The ETCI 2021 Flood Detection Dataset is a comprehensive flood detection segmentation dataset that focuses on SAR (Synthetic Aperture Radar) images taken by the ESA Sentinel-1 satellite. This dataset provides pairs of VV (Vertical Transmit, Vertical Receive) and VH (Vertical Transmit, Horizontal Receive) polarization images, which have been processed by the Hybrid Pluggable Processing Pipeline (hyp3). Additionally, the… See the full description on the dataset page: https://huggingface.co/datasets/luisrH/ETCI-2021-Flood-Detection.

159.6K

downloads

89new

CADS-dataset

Bronze45

sunghong · Medical & Healthcare

CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography Overview CADS is a robust, fully automated framework for segmenting 167 anatomical structures in Computed Tomography (CT), spanning from head to knee regions across diverse anatomical systems. The framework consists of two main components: CADS-dataset: 22,022 CT volumes with complete annotations for 167 anatomical structures. Most extensive whole-body CT dataset… See the full description on the dataset page: https://huggingface.co/datasets/sunghong/CADS-dataset.

157.9K

downloads

90new

uitars-task-111-v2

Bronze30

Anish13 · Uncategorized

155.1K

downloads

91new

dronescapes2

Bronze45

Meehai · Benchmarks & Evaluation

Dronescapes Experts dataset This dataset is an extension of the original dronescapes dataset with new modalities generated using VRE 100% from scratch (aka pretrained experts). The only data that is not generable by VRE is the Ground Truth: semantic (human annotated), depth & normals (SfM) that is inherited from the original dataset for evaluation purposes only. 1. Downloading the data Option 1. Download the pre-processed dataset from HuggingFace… See the full description on the dataset page: https://huggingface.co/datasets/Meehai/dronescapes2.

154.4K

downloads

92new

PhysicalAI-Robotics-GR00T-X-Embodiment-Sim

Silver59

nvidia · Code

PhysicalAI-Robotics-GR00T-X-Embodiment-Sim Github Repo: Isaac GR00T N1 We provide a set of datasets used for post-training of GR00T N1. Each dataset is a collection of trajectories from different robot embodiments and tasks. Cross-embodied bimanual manipulation: 9k trajectories Dataset Name #trajectories bimanual_panda_gripper.Threading 1000 bimanual_panda_hand.LiftTray 1000 bimanual_panda_gripper.ThreePieceAssembly 1000… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim.

153.1K

downloads

93new

DPO-dataset

Bronze30

JiaHuang01 · Preference & Alignment (DPO/RLHF)

152.2K

downloads

94new

pair_touch_13m

Bronze46

BorisGuo · Image Recognition

PairTouch 13M Dataset Multi-modal tactile dataset with pose, force, and tactile sensor data. Configs Config Description Sensors pose_data Pose estimation data tac02/xela + camera force_data Force measurement data tac02/xela + gelsight tacniq_gsmini TacNIQ + GSMini data tacniq + gsmini xela_9dtact XELA + 9DTact data xela + 9dtact Usage from datasets import load_dataset # Load specific config ds = load_dataset("BorisGuo/pair_touch_13m"… See the full description on the dataset page: https://huggingface.co/datasets/BorisGuo/pair_touch_13m.

150.0K

downloads

95new

FineVision

Silver61

HuggingFaceM4 · Image Recognition

Fine Vision FineVision is a massive collection of datasets with 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens, designed for training state-of-the-art open Vision-Language-Models. More detail can be found in the blog post: https://huggingface.co/spaces/HuggingFaceM4/FineVision Load the data from datasets import load_dataset, get_dataset_config_names # Get all subset names and load the first one available_subsets =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/FineVision.

149.6K

downloads

96new

ReActor

Silver59

Gourieff · Code

ReActor Assets The Fast and Simple Face Swap Extension ComfyUI-ReActor (ex. comfyui-reactor-node) sd-webui-reactor Models file source license buffalo_l.zip DeepInsight codeformer-v0.1.0.pth sczhou GFPGANv1.3.pth TencentARC GFPGANv1.4.pth TencentARC GPEN-BFR-512.onnx harisreedhar RestoreFormer_PP.onnx netrunner.exe inswapper_128.onnx DeepInsight inswapper_128_fp16.onnx Hillobar

149.3K

downloads

97new

oneformer_demo

Bronze30

shi-labs · Uncategorized

148.3K

downloads

98new

zhongyangribao

Bronze31

banned-historical-archives · Uncategorized

145.4K

downloads

99new

pretraining_v1-omega_books

Bronze33

applied-ai-018 · Structured Data

144.1K

downloads

100new

common_corpus

Silver60

PleIAs · Code

Common Corpus Full paper - ICLR 2026 oral Common Corpus is the largest open and permissible licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

143.0K

downloads

About the Most Downloaded Datasets Leaderboard

This leaderboard tracks the top 100 training datasets ranked by downloads, with daily snapshots to monitor how the rankings evolve over time.

Every dataset on this leaderboard is sourced from HuggingFace and verified for relevance to fine-tuning workflows.

Methodology

Rankings are based on total all-time download counts from HuggingFace. Downloads reflect real-world adoption — models and datasets that people actually use in production and research, not just stars or hype.
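As a rough illustration of ranking by download count, the sketch below turns the abbreviated figures shown on this page (for example 250.9K) into integers and sorts entries from most to least downloaded. The function names `parse_count` and `rank_by_downloads` are hypothetical, the K/M/B suffix handling is an assumption about how the display values are abbreviated, and the ranks are relative to the three-entry sample only. It is not the leaderboard's actual pipeline.

```python
from decimal import Decimal

# Multipliers for abbreviated display counts (assumed K/M/B suffixes).
_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> int:
    """Convert a display string such as '250.9K' into an integer count."""
    text = text.strip().replace(",", "")
    suffix = text[-1].upper()
    if suffix in _SUFFIXES:
        # Decimal avoids float rounding surprises on values like 250.9.
        return int(Decimal(text[:-1]) * _SUFFIXES[suffix])
    return int(Decimal(text))


def rank_by_downloads(entries: dict[str, str]) -> list[tuple[int, str, int]]:
    """Return (rank, name, downloads) tuples, highest download count first."""
    parsed = [(name, parse_count(count)) for name, count in entries.items()]
    # Ties are broken alphabetically so the ordering is deterministic.
    parsed.sort(key=lambda item: (-item[1], item[0]))
    return [(rank, name, downloads)
            for rank, (name, downloads) in enumerate(parsed, start=1)]


# Three entries taken from the listing above, deliberately out of order.
sample = {
    "Zyphra/Zyda-2": "246.4K",
    "openai/openai_humaneval": "250.9K",
    "anfera236/HHDC": "248.6K",
}
ranking = rank_by_downloads(sample)
for rank, name, downloads in ranking:
    print(rank, name, downloads)
```

Because the displayed values are rounded to one decimal place, counts recovered this way are approximate; exact ordering would need the raw counts from the Hub.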

Rankings are snapshotted daily at 6:00 AM UTC. Position changes shown on the leaderboard compare the current snapshot to the previous day's snapshot. All data is sourced directly from the HuggingFace Hub API and processed through our classification pipeline, which uses tag analysis, model card parsing, and naming pattern detection to identify genuine fine-tunes.
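The day-over-day comparison described above can be sketched as a diff between two ordered lists of dataset IDs. This is a minimal sketch, not the site's implementation: `position_changes` is a hypothetical helper, and the two snapshot orderings are invented for illustration (the page itself currently marks every entry as new because it is the first snapshot).

```python
from typing import Optional


def position_changes(previous: list[str], current: list[str]) -> dict[str, Optional[int]]:
    """Map each dataset in `current` to its rank change since `previous`.

    Positive means it moved up, negative means it moved down,
    and None marks an entry absent from the previous snapshot ("new").
    """
    previous_rank = {name: rank for rank, name in enumerate(previous, start=1)}
    changes: dict[str, Optional[int]] = {}
    for rank, name in enumerate(current, start=1):
        old = previous_rank.get(name)
        changes[name] = None if old is None else old - rank
    return changes


# Hypothetical snapshots: these orderings are made up for the example.
yesterday = ["allenai/winogrande", "nvidia/SAGE-10k", "HuggingFaceFW/fineweb"]
today = ["nvidia/SAGE-10k", "allenai/winogrande", "daniilakk/nbchr_pdfs"]

changes = position_changes(yesterday, today)
for name, delta in changes.items():
    label = "new" if delta is None else f"{delta:+d}"
    print(f"{name}: {label}")
```

Entries that dropped out of the current snapshot are not reported here; a fuller version might list them separately.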

Data Sources

  • HuggingFace Hub API — download counts, likes, trending scores, model metadata, and README/model cards
  • Model card parsing — training datasets, training method (LoRA, DPO, SFT, etc.), framework, hardware, and hyperparameters extracted from README files
  • Tag classification — fine-tune detection via `base_model:finetune:*` and `base_model:quantized:*` HuggingFace tags, plus naming pattern analysis
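The tag-based part of the classification above can be approximated with a simple prefix check. The sketch below covers only the two tag prefixes named in the list; the naming pattern analysis is not shown, `classify_tags` is a hypothetical helper, and the example tag values are invented.

```python
from typing import Optional

# Tag prefixes named in the data sources list.
PREFIXES = {
    "finetune": "base_model:finetune:",
    "quantized": "base_model:quantized:",
}


def classify_tags(tags: list[str]) -> Optional[tuple[str, str]]:
    """Return (kind, base_model) for the first matching tag, else None."""
    for tag in tags:
        for kind, prefix in PREFIXES.items():
            if tag.startswith(prefix):
                # Everything after the prefix is the referenced base model ID.
                return kind, tag[len(prefix):]
    return None


# Hypothetical tag lists for illustration.
print(classify_tags(["license:mit", "base_model:finetune:org/base-model"]))
print(classify_tags(["base_model:quantized:org/base-model"]))
print(classify_tags(["license:apache-2.0"]))
```

Returning the first match keeps the sketch simple; a repository carrying both kinds of tag would need an explicit precedence rule.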

Who Is This For?

This leaderboard is designed for anyone fine-tuning their own model who needs high-quality training data, or researchers studying what data produces the best results in the fine-tuning ecosystem.

Whether you're a beginner exploring what's possible with fine-tuned AI models or an experienced ML engineer looking for the best starting point for your next project, these rankings give you a data-driven way to find the highest quality datasets without having to wade through thousands of quantizations, format conversions, and abandoned repositories on HuggingFace.

Update Schedule

This leaderboard was last updated on April 3, 2026. Rankings are refreshed daily with the latest download counts, likes, and trending data from HuggingFace. Historical snapshots are preserved to track trends over time — you can see which datasets are growing in popularity and which are being superseded by newer alternatives.