Most Downloaded Datasets
The most downloaded training datasets of all time. The data that powers the best fine-tuned models.
Last updated April 3, 2026 · Updated daily
FineFineWeb by m-a-p holds the #1 position with 2.6M downloads, ahead of ニコニコ実況 過去ログアーカイブ (Niconico Jikkyo Past Log Archive) by KakologArchives at 2.4M.
The top three publishers are m-a-p, KakologArchives, and ropedia-ai. This is the first snapshot; future updates will track position changes and emerging trends.
The gap between #1 and #100 is 2.6M vs 143.0K downloads, roughly an 18-fold difference, showing significant concentration at the top.
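The size of that gap is easier to judge as a ratio. The sketch below is illustrative only: it assumes the abbreviated figures quoted above (2.6M and 143.0K) use K, M, and B as decimal multipliers, and `parse_count` is a hypothetical helper, not part of any site or library API. Because the published figures are rounded to one decimal place, the result is approximate.

```python
# Decimal multipliers for abbreviated counts (an assumption about the page's notation).
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '2.6M' or '143.0K' to an integer."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]])
    return round(float(text))


top = parse_count("2.6M")          # downloads at position #1
hundredth = parse_count("143.0K")  # downloads at position #100
ratio = top / hundredth

print(f"#1 has about {ratio:.1f}x the downloads of #100")
```

With the rounded figures from the summary, the first entry has roughly 18 times the downloads of the hundredth.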
FineFineWeb
Silver 64 · m-a-p · Classification & Sentiment
FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus. arXiv: Coming Soon. Project Page: Coming Soon. Blog: Coming Soon.
Data Statistics by domain (tokens and sample counts for iterations 1, 2, and 3, then the total):
aerospace: tokens 5.77B, 261.63M, 309.33M, total 6.34B; samples 9100000, 688505, 611034, total 10399539
agronomy: tokens 13.08B, 947.41M, 229.04M, total 14.26B; samples 15752828, 2711790, 649404, total 19114022
artistic…
See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.
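ニコニコ実況 過去ログアーカイブ (Niconico Jikkyo Past Log Archive)
Silver 60 · KakologArchives · Classification & Sentiment
The Niconico Jikkyo Past Log Archive is a dataset that collects every past-log comment posted to Niconico Jikkyo, from the launch of the service to the present. In December 2020, Niconico Jikkyo was relaunched as an official channel within Niconico Live (ニコニコ生放送). As a result, the old system, which had been running since November 2009, was discontinued (in effect, the service ended). Support for consumer devices such as torne and BRAVIA was dropped across the board, and roughly 11 years of past logs, full of viewers' unfiltered reactions at the time, were set to be lost along with it. In response, a plan took shape, led mainly by regulars of the DTV board on 5ch, to archive 11 years of past logs for every channel before the old Niconico Jikkyo shut down. After some twists and turns, Nekopanda retrieved a complete set of roughly 11 years of past logs for all channels, including radio and BS, which kept those logs from vanishing into the digital sea. However, because the old API has been discontinued, past logs via the API… See the full description on the dataset page: https://huggingface.co/datasets/KakologArchives/KakologArchives.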
Xperience-10M
Silver 64 · ropedia-ai · Video
⚠️ Important: If you have already submitted an access request but have not completed the required DocuSign agreement, your request will remain pending. Please complete signing and we will grant access once verified. Interactive Intelligence from Human Xperience Xperience-10M Dataset Summary Xperience-10M is a large-scale egocentric multimodal dataset of human experience for embodied AI, robotics, world models, and spatial… See the full description on the dataset page: https://huggingface.co/datasets/ropedia-ai/xperience-10m.
documentation-images
Silver 64 · huggingface · Image Recognition
This dataset contains images used in the documentation of HuggingFace's libraries. HF Team: Please make sure you optimize the assets before uploading them. My favorite tool for this is https://tinypng.com/.
banned-historical-archives
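Silver 58 · banned-historical-archives · Code
和谐历史档案馆数据集 - Banned Historical Archives Datasets. The Banned Historical Archives datasets contain the original files that have already been entered into https://banned-historical-archives.github.io as well as those not yet entered.
Directory structure:
banned-historical-archives.github.io # raw data already entered into the site, synced from the GitHub repository from time to time
raw # original files
config # configuration files
todo # files not yet entered into the site
Some newspaper and image materials are stored in separate repositories (name, URL, status):
参考消息 (Reference News), https://huggingface.co/datasets/banned-historical-archives/ckxx, not entered
人民日报 (People's Daily), https://huggingface.co/datasets/banned-historical-archives/rmrb, selected important articles entered
文汇报 (Wenhui Bao)…
See the full description on the dataset page: https://huggingface.co/datasets/banned-historical-archives/banned-historical-archives.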
Generated Docs for HF
Silver 59 · hf-doc-build · Code
This repo contains all the docs published on https://huggingface.co/docs. The docs are generated with https://github.com/huggingface/doc-builder.
xCodeEval
Silver 60 · NTU-NLP-sg · Code
The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems or assist developers in writing programs can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way on only one or two specific tasks, in a few languages, at a partial granularity (e.g., function) level and in many cases without proper training data. Even more concerning is that in most cases the evaluation of generated codes has been done in terms of mere lexical overlap rather than actual execution whereas semantic similarity (or equivalence) of two code segments depends only on their "execution similarity", i.e., being able to get the same output for a given input.
WikiText
Silver 66 · Salesforce · Text Generation & Chat
Dataset Card for "wikitext" Dataset Summary The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
MINT-1T
Silver 58 · mlfoundations · Vision-Language
🍃 MINT-1T:Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2024-18.
PhysicalAI-Autonomous-Vehicles
Silver 67 · nvidia · Uncategorized
PHYSICAL AI AUTONOMOUS VEHICLES The PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, geographically diverse collections of multi-sensor data empowering AV researchers to build the next generation of Physical AI based end-to-end driving systems. This dataset is ready for commercial/non-commercial AV use per the license agreement. Data Collection Method Automatic/Sensor Labeling Method Automatic/Sensor This dataset has a total of 1700 hours of driving… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles.
MINT-1T
Silver 51 · mlfoundations · Vision-Language
🍃 MINT-1T:Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-40.
SWE-bench_Pro
Silver 60 · ScaleAI · Code
Dataset Summary SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks. Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf See the related evaluation Github: https://github.com/scaleapi/SWE-bench_Pro-os Dataset Structure We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.
Grade School Math 8K
Silver 67 · openai · Math & Reasoning
Dataset Card for GSM8K Dataset Summary GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
MINT-1T
Silver 52 · mlfoundations · Vision-Language
🍃 MINT-1T:Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-50.
SWE-bench_Verified
Silver 63 · princeton-nlp · Code
Dataset Summary SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified.
uniocc
Silver 53 · tasl-lab · Code
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving Paper | Project Page | Code Autonomous Driving researchers, have you ever been bothered by the fact that popular datasets all have their different formats, and standardizing them is a pain? Have you ever been frustrated by the difficulty of just understanding the file semantics? This challenge is even worse in the occupancy domain. But, UniOcc is here to help. UniOcc is a unified… See the full description on the dataset page: https://huggingface.co/datasets/tasl-lab/uniocc.
MINT-1T
Silver 52 · mlfoundations · Vision-Language
🍃 MINT-1T:Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-23.
MADLAD-400
Silver 61 · allenai · Text Generation & Chat
MADLAD-400 Dataset and Introduction MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.
C4
Silver 64 · allenai · Text Generation & Chat
C4 Dataset Summary A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's C4 dataset We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, these are the sizes of the variants: en: 305GB en.noclean: 2.3TB en.noblocklist: 380GB realnewslike: 15GB multilingual (mC4): 9.7TB (108 subsets, one per… See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.
MINT-1T
Silver 50 · mlfoundations · Vision-Language
🍃 MINT-1T:Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-14.
HuggingFaceFW/finephrase
Silver 60 · HuggingFaceFW · Instruction Following
Dataset Card for HuggingFaceFW/finephrase Dataset Summary Synthetic data generated by DataTrove: Model: HuggingFaceTB/SmolLM2-1.7B-Instruct (main) Source dataset: HuggingFaceFW/fineweb-edu, config sample-350BT, split train Generation config: temperature=1.0, top_p=1.0, top_k=50, max_tokens=2048, model_max_context=8192 Speculative decoding: {"method":"suffix","num_speculative_tokens":32} System prompt: None Input column: text Prompt families: faq prompt Rewrite the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finephrase.
pesoz
Bronze 33 · Kthera · Uncategorized
MINT-1T
Silver 52 · mlfoundations · Vision-Language
🍃 MINT-1T:Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-06.
dolma3_mix-6T-1025-7B
Silver 57 · allenai · Math & Reasoning
⚠️ WARNING: This dataset is intended ONLY for reproducing Olmo 3 7B ⚠️ For all other training use cases, including training from scratch, please utilize our primary dolma 3 data mix: https://huggingface.co/datasets/allenai/dolma3_mix-6T. Note: Some olmOCR science PDFs in the current dataset have been redacted following the training of Olmo 3 7B. These texts are indicated with [REMOVED] in the text field. This will affect reproducibility of Olmo 3 7B. For this reason, please use our… See the full description on the dataset page: https://huggingface.co/datasets/allenai/dolma3_mix-6T-1025-7B.
ubuntu_osworld_file_cache
Silver 52 · xlangai · Benchmarks & Evaluation
OSWorld File Cache This repository serves as a file cache for the OSWorld project, providing reliable and fast access to evaluation files that were previously hosted on Google Drive. Overview OSWorld is a scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems and applications. This cache repository ensures that all evaluation files are consistently accessible… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache.
LegalBench (Staging)
Silver 61 · nguha · Code
Dataset Card for Dataset Name Homepage: https://hazyresearch.stanford.edu/legalbench/ Repository: https://github.com/HazyResearch/legalbench/ Paper: https://arxiv.org/abs/2308.11462 Dataset Description Dataset Summary The LegalBench project is an ongoing open science effort to collaboratively curate tasks for evaluating legal reasoning in English large language models (LLMs). The benchmark currently consists of 162 tasks gathered from 40… See the full description on the dataset page: https://huggingface.co/datasets/nguha/legalbench.
upload2
Bronze 33 · Maynor996 · Image Recognition
medical-qa-shared-task-v1-toy
Silver 56 · lavita · Medical & Healthcare
Dataset Card for "medical-qa-shared-task-v1-toy" More Information needed
results
Bronze 49 · mteb · Uncategorized
Results on MTEB
HF Documentation (PRs)
Silver 54 · hf-doc-build · Code
This dataset contains the docs from all the PRs that update one of the docs from https://huggingface.co/docs. It is automatically updated by this GitHub Action from the doc-builder repo.
Egocentric-100K
Silver 60 · builddotai · Benchmarks & Evaluation
Egocentric-100K is the largest dataset of manual labor. You can visualize the dataset here. Egocentric-100K is state-of-the-art in hand visibility and active manipulation density compared to previous in-the-wild egocentric datasets. The complete 30,000 frame evaluation set is available at Egocentric-100K-Evaluation. Dataset Statistics Attribute Value Total Hours 100,405 Total Frames 10.8 billion Video Clips 2,010,759 Median Clip Length 180.0 seconds Mean Hours… See the full description on the dataset page: https://huggingface.co/datasets/builddotai/Egocentric-100K.
genshin-voices-separated
Bronze 39 · AquaV · Uncategorized
sts22-crosslingual-sts
Silver 53 · mteb · Code
STS22.v2 An MTEB dataset Massive Text Embedding Benchmark SemEval 2022 Task 8: Multilingual News Article Similarity. Version 2 filters updated on STS22 by removing pairs where one of entries contain empty sentences. Task category t2t Domains News, Written Reference https://competitions.codalab.org/competitions/33835 How to evaluate on this task You can evaluate an embedding model on this dataset using the following code: import mteb task =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/sts22-crosslingual-sts.
GiftEvalPretrain
Silver 56 · Salesforce · Code
GIFT-Eval Pre-training Datasets Pretraining dataset aligned with GIFT-Eval that has 71 univariate and 17 multivariate datasets, spanning seven domains and 13 frequencies, totaling 4.5 million time series and 230 billion data points. Notably this collection of data has no leakage issue with the train/test split and can be used to pretrain foundation models that can be fairly evaluated on GIFT-Eval. 📄 Paper 🖥️ Code 📔 Blog Post 🏎️ Leader Board Ethical Considerations… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/GiftEvalPretrain.
img_upload
Bronze 32 · Maynor996 · Image Recognition
OpenThoughts-1k-sample
Bronze 49 · ryanmarten · Code
[!NOTE] We have released a paper for OpenThoughts! See our paper here. Open-Thoughts-1k-sample This is a 1k sample of the OpenThoughts-114k dataset. Open synthetic reasoning dataset with high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting with Curator Viewer. Available Subsets default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds =… See the full description on the dataset page: https://huggingface.co/datasets/ryanmarten/OpenThoughts-1k-sample.
AutoMathText-V2
Silver 58 · OpenSQZ · Code
🚀 AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset 🎉 AutoMathText-v2 has surpassed 1 million downloads! We'd love to know how you're using it. Please take 1 minute to fill out our use case survey. Your feedback will directly shape the future roadmap of this dataset. 👉 Share your use case here 📊 AutoMathText-V2 consists of 2.46 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual… See the full description on the dataset page: https://huggingface.co/datasets/OpenSQZ/AutoMathText-V2.
debug
Bronze 49 · rtrm · Text - General
test3
Measuring Massive Multitask Language Understanding
Silver 64 · cais · Science & Research
Dataset Card for MMLU Dataset Summary Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
GLUE (General Language Understanding Evaluation benchmark)
Silver 63 · nyu-mll · Benchmarks & Evaluation
Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
Ai2Arc
Silver 62 · allenai · Science & Research
Dataset Card for "ai2_arc" Dataset Summary A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ai2_arc.
droid_1.0.1
Silver 55 · cadene · Code
This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v2.1", "robot_type": "Franka", "total_episodes": 95600, "total_frames": 27612581, "total_tasks": 0, "total_videos": 286800, "total_chunks": 95, "chunks_size": 1000, "fps": 15, "splits": { "train": "0:95600" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/cadene/droid_1.0.1.
CommitPackFT
Silver 58 · bigcode · Instruction Following
CommitPackFT is a 2GB filtered version of CommitPack that contains only high-quality commit messages that resemble natural language instructions.
regions
Bronze 32 · world-igr-plum · Uncategorized
results
Bronze 33 · hallucinations-leaderboard · Benchmarks & Evaluation
Procgen Benchmark Dataset
Silver 51 · EpicPinkPenguin · Benchmarks & Evaluation
Procgen Benchmark This dataset contains expert trajectories generated by a PPO reinforcement learning agent trained on each of the 16 procedurally-generated gym environments from the Procgen Benchmark. The environments were created on distribution_mode=easy and with unlimited levels. Disclaimer: This is not an official repository from OpenAI. Dataset Usage Regular usage (for environment bigfish): from datasets import load_dataset train_dataset =… See the full description on the dataset page: https://huggingface.co/datasets/EpicPinkPenguin/procgen.
FineWeb-Edu
Silver 64 · HuggingFaceFW · Instruction Following
📚 FineWeb-Edu 1.3 trillion tokens of the finest educational data the 🌐 web has to offer Paper: https://arxiv.org/abs/2406.17557 What is it? 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by LLama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
objaverse
Silver 62 · allenai · Uncategorized
Objaverse Objaverse is a Massive Dataset with 800K+ Annotated 3D Objects. More documentation is coming soon. In the meantime, please see our paper and website for additional details. License The use of the dataset as a whole is licensed under the ODC-By v1.0 license. Individual objects in Objaverse are all licensed as creative commons distributable objects, and may be under the following licenses: CC-BY 4.0 - 721K objects CC-BY-NC 4.0 - 25K objects CC-BY-NC-SA 4.0 - 52K… See the full description on the dataset page: https://huggingface.co/datasets/allenai/objaverse.
gaia
Bronze 46 · siril-spcc · Science & Research
This catalog is developed for use with the Siril 1.4 series as a public reference database. Hugging Face is one of several mirrors used to distribute the data. This database is provided for both offline download and also for online access. This dataset is provided for scientific and reproducibility purposes. This is an extract of the Gaia DR3 catalog optimized for spectrophotometric color calibration. The catalog is indexed at HEALpix level 8 and selects up to the 127 brightest sources in each… See the full description on the dataset page: https://huggingface.co/datasets/siril-spcc/gaia.
fineweb-edu-translated
Silver 50 · Helsinki-NLP · Translation & Multilingual
Helsinki-NLP/fineweb-edu-translated fineweb-edu-translated is a collection of automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models. The data covers 36,704,000 documents with over 28 billion space-separated tokens of English data translated into 36 languages. The total data set includes over 960 billion tokens, and the translated documents are aligned across all languages. More information about how the data has been produced can… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/fineweb-edu-translated.
FineWeb-HQ
Silver 51 · epfml · Text Generation & Chat
FineWeb-HQ Dataset Summary FineWeb-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb. FineWeb-HQ was created by selecting the top 10% of FineWeb documents based on a deep learning classifier trained to identify structured and knowledge-rich samples. This classifier uses XLM-RoBERTa embeddings to score documents. To validate our approach, we pretrained 1B-parameter LLM models with a Llama-like architecture across multiple languages and… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb-HQ.
AIWD6
Bronze 46 · Kondapally · Image Recognition
AIWD16 Multi-task weather dataset containing annotations for the following: Image Classification (weather transitions), Object Detection, Semantic Segmentation, Instance Segmentation, VQA. Classification Labels: Cloudy_to_Rainy, Rainy_to_Cloudy, Rainy_to_Sunny, Sunny_to_Foggy, Foggy_to_Sunny, Sunny_to_Rainy. Directory Structure: images/: image data; metadata.csv: classification labels; Det_annotations/: detection annotations; SS_annotations/: semantic segmentation… See the full description on the dataset page: https://huggingface.co/datasets/Kondapally/AIWD6.
LLaVA-OneVision-1.5-Mid-Training-85M
Silver 57 · mvp-lab · Image Recognition
🚀 LLaVA-One-Vision-1.5-Mid-Training-85M Dataset is being uploaded 🚀 Upload Status All Completed: ImageNet-21k、LAIONCN、DataComp-1B、Zero250M、COYO700M、SA-1B、MINT、Obelics 📜 Cite If you find LLaVA-One-Vision-1.5-Mid-Training-85M useful in your research, please consider to cite the following related papers: @misc{an2025llavaonevision15fullyopenframework, title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training}… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M.
course-images
Bronze 39 · agents-course · Image Recognition
HellaSwag
Silver 59 · Rowan · Benchmarks & Evaluation
Dataset Card for "hellaswag" Dataset Summary HellaSwag: Can a Machine Really Finish Your Sentence? is a new dataset for commonsense NLI. A paper was published at ACL2019. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 71.49 MB Size of the generated dataset: 65.32 MB Total amount of disk used: 136.81… See the full description on the dataset page: https://huggingface.co/datasets/Rowan/hellaswag.
documentation-images
Bronze 31 · huggingface-course · Image Recognition
DreamZero-DROID-Data
Bronze 33 · GEAR-Dreams · Video
OpenAI HumanEval
Silver 61 · openai · Code
Dataset Card for OpenAI HumanEval Dataset Summary The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure not to be included in the training set of code generation models. Supported Tasks and Leaderboards Languages The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
Hyperheight Data Cube Denoising and Super-Resolution
Bronze 46 · anfera236 · Code
Hyperheight Data Cube Denoising and Super-Resolution Dataset Summary Generation code and pipeline: https://github.com/Anfera/HHDC-Creator (HHDC-Creator repo). 3-D photon-count waveforms (Hyperheight data cubes) built from NEON discrete-return LiDAR using the HHDC pipeline (hhdc/cube_generator.py). Each cube stores a high-resolution canopy volume (default: 0.5 m vertical bins over 64 m height, footprints every 2 m) across a 96 m × 96 m tile. In the HHDC-Creator pipeline… See the full description on the dataset page: https://huggingface.co/datasets/anfera236/HHDC.
Zyda-2
Silver 58 · Zyphra · Code
Zyda-2 Zyda-2 is a 5 trillion token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. Zyda-2 comprises diverse sources of web data, highly educational content, math, code, and scientific papers. To construct Zyda-2, we took the best open-source datasets available: Zyda, FineWeb, DCLM, and Dolma. Models trained on Zyda-2 significantly outperform identical models trained on the… See the full description on the dataset page: https://huggingface.co/datasets/Zyphra/Zyda-2.
SuperGLUE
Silver 59 · aps · Benchmarks & Evaluation
Dataset Card for "super_glue" Dataset Summary SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances axb Size of downloaded dataset files: 0.03 MB Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.
psp
Bronze 31 · Emmyc2 · Uncategorized
arxiv_ocr
Bronze 31 · Chelsea707 · Uncategorized
Mostly Basic Python Problems
Silver 60 · google-research-datasets · Code
Dataset Card for Mostly Basic Python Problems (mbpp) Dataset Summary The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.
downloads
magvits_data
Bronze31arcadiaaaaa · Uncategorized
downloads
Eval Awareness Dataset
Bronze45dsinghvi · Code
Eval Awareness Dataset: contrastive pairs with and without eval cues, showing behavioural changes across various misaligned situations. We also provide automated scripts to create these scenarios, along with other code related to suppressing eval awareness, at https://github.com/divyanshsinghvi/evalawareness_techniques/ Authors: @divyanshsinghvi, @Riteshbhalerao11
downloads
P3
Silver60bigscience · Science & Research
Dataset Card for P3 Dataset Summary P3 (Public Pool of Prompts) is a collection of prompted English datasets covering a diverse set of NLP tasks. A prompt is the combination of an input template and a target template. The templates are functions mapping a data example into natural language for the input and target sequences. For example, in the case of an NLI dataset, the data example would include fields for Premise, Hypothesis, Label. An input template would be If… See the full description on the dataset page: https://huggingface.co/datasets/bigscience/P3.
downloads
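The P3 summary describes a prompt as a pair of functions that map a data example to an input string and a target string. A minimal sketch of that idea for the NLI case follows. The template wording below is hypothetical and is not taken from P3 (the card's own example is truncated above); only the field names Premise, Hypothesis, and Label come from the summary, and storing the label as a string is an assumption.

```python
from typing import Dict, Tuple

Example = Dict[str, str]


def input_template(example: Example) -> str:
    """Hypothetical input template: renders the example as a question."""
    return (
        f"Premise: {example['Premise']}\n"
        f"Hypothesis: {example['Hypothesis']}\n"
        "Does the premise entail the hypothesis?"
    )


def target_template(example: Example) -> str:
    """Hypothetical target template: the label rendered as text."""
    return example["Label"]


def apply_prompt(example: Example) -> Tuple[str, str]:
    """A prompt is the combination of an input and a target template."""
    return input_template(example), target_template(example)


# Made-up example, for illustration only.
nli_example = {
    "Premise": "A dog is running in the park.",
    "Hypothesis": "An animal is outside.",
    "Label": "yes",
}
prompt_input, prompt_target = apply_prompt(nli_example)
print(prompt_input)
print(prompt_target)
```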
LLaVA-OneVision-1.5-Instruct-Data
Silver56mvp-lab · Instruction Following
LLaVA-OneVision-1.5 Instruction Data Paper | Code 📌 Introduction This dataset, LLaVA-OneVision-1.5-Instruct, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data.
downloads
SAGE-10k
Silver56nvidia · Uncategorized
SAGE-10k SAGE-10k is a large-scale interactive indoor scene dataset featuring realistic layouts, generated by the agentic-driven pipeline introduced in "SAGE: Scalable Agentic 3D Scene Generation for Embodied AI". The dataset contains 10,000 diverse scenes spanning 50 room types and styles, along with 565K uniquely generated 3D objects. 🔑 Key Features SAGE-10k integrates a wide variety of scenes, and particularly, preserves small items for… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/SAGE-10k.
downloads
WinoGrande
Silver57allenai · Math & Reasoning
Dataset Card for "winogrande" Dataset Summary WinoGrande is a new collection of 44k problems, inspired by Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and robustness against the dataset-specific bias. Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. Supported Tasks and Leaderboards More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.
downloads
nbchr_pdfs
Bronze30daniilakk · Uncategorized
downloads
FineWeb
Silver66HuggingFaceFW · Text Generation & Chat
🍷 FineWeb 15 trillion tokens of the finest data the 🌐 web has to offer What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
downloads
badges
Silver55huggingface · Code
Badges A set of badges you can use anywhere. Just update the anchor URL to point to the correct action for your Space. Light or dark background with 4 sizes available: small, medium, large, and extra large. How to use? With markdown, just copy the badge from: https://huggingface.co/datasets/huggingface/badges/blob/main/README.md?code=true With HTML, inspect this page with your web browser and copy the outer html. Available sizes Small Medium Large Extra… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/badges.
downloads
snodas-snowmelt-cache
Bronze30Jsinowitz · Uncategorized
downloads
LolData
Bronze30rhmnhsim · Uncategorized
downloads
IMDB
Silver61stanfordnlp · Benchmarks & Evaluation
Dataset Card for "imdb" Dataset Summary Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
downloads
image_dummy
Bronze45Narsil · Speech & Audio
downloads
climbmix-400b-shuffle
Bronze39karpathy · Text - General
downloads
brand-assets
Bronze36huggingface · Image Recognition
downloads
hf_hub_cache
Bronze30hf-internal-testing · Uncategorized
downloads
JAT-dataset
Silver55jat-project · Reinforcement Learning
JAT Dataset Dataset Description The Jack of All Trades (JAT) dataset combines a wide range of individual datasets. It includes expert demonstrations by expert RL agents, image and caption pairs, textual data and more. The JAT dataset is part of the JAT project, which aims to build a multimodal generalist agent. Paper: https://huggingface.co/papers/2402.09844 Usage >>> from datasets import load_dataset >>> dataset = load_dataset("jat-project/jat-dataset"… See the full description on the dataset page: https://huggingface.co/datasets/jat-project/jat-dataset.
downloads
bridgev2
Bronze45Saberlve · Code
This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v2.1", "robot_type": "WidowX", "total_episodes": 53192, "total_frames": 1999410, "total_tasks": 19974, "total_videos": 212768, "total_chunks": 54, "chunks_size": 1000, "fps": 5, "splits": { "train": "0:53192" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/Saberlve/bridgev2.
downloads
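The `data_path` entry in the bridgev2 `meta/info.json` excerpt is a Python format string, so an episode's parquet path can be derived from its index. The sketch below assumes the chunk number is the episode index integer-divided by `chunks_size`; the excerpt does not state this rule, although it is consistent with 53192 episodes falling into 54 chunks of 1000. The function name is hypothetical.

```python
# Values copied from the meta/info.json excerpt above.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
CHUNKS_SIZE = 1000


def episode_parquet_path(episode_index: int, chunks_size: int = CHUNKS_SIZE) -> str:
    """Build the relative parquet path for one episode.

    Assumption: episode_chunk = episode_index // chunks_size.
    """
    if episode_index < 0:
        raise ValueError("episode_index must be non-negative")
    episode_chunk = episode_index // chunks_size
    return DATA_PATH.format(episode_chunk=episode_chunk, episode_index=episode_index)


print(episode_parquet_path(0))
print(episode_parquet_path(53191))
```

With the stated assumption, the last episode in the train split (index 53191) lands in chunk 053.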
MINT-1T
Silver57mlfoundations · Vision-Language
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-HTML.
downloads
sound-benchmark
Bronze30AE-W · Benchmarks & Evaluation
downloads
OpenThoughts-114k
Silver62open-thoughts · Code
Note: We have released a paper for OpenThoughts! See our paper here. Open-Thoughts-114k Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting with Curator Viewer. Available Subsets default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.
downloads
CORAAL Dataset
Bronze45liamstone707 · Speech & Audio
Dataset Card for CORAAL Dataset Summary This dataset comprises audio files, text files, and audio segments sourced from the Corpus of Regional African American Language (CORAAL). CORAAL is a subset of the Online Resources for African American Language (ORAAL) project, initiated by a team of linguistics researchers at the University of Oregon. The original CORAAL dataset encompasses over 220 sociolinguistic interviews featuring African American Language (AAL) speakers born… See the full description on the dataset page: https://huggingface.co/datasets/liamstone707/CORAAL.
downloads
MNBVC
Silver61liwu · Text Generation & Chat
MNBVC: Massive Never-ending BT Vast Chinese corpus
downloads
ETCI 2021 Flood Detection Dataset
Bronze45luisrH · Image Recognition
ETCI 2021 Flood Detection Dataset Description The ETCI 2021 Flood Detection Dataset is a comprehensive flood detection segmentation dataset that focuses on SAR (Synthetic Aperture Radar) images taken by the ESA Sentinel-1 satellite. This dataset provides pairs of VV (Vertical Transmit, Vertical Receive) and VH (Vertical Transmit, Horizontal Receive) polarization images, which have been processed by the Hybrid Pluggable Processing Pipeline (hyp3). Additionally, the… See the full description on the dataset page: https://huggingface.co/datasets/luisrH/ETCI-2021-Flood-Detection.
downloads
CADS-dataset
Bronze45sunghong · Medical & Healthcare
CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography Overview CADS is a robust, fully automated framework for segmenting 167 anatomical structures in Computed Tomography (CT), spanning from head to knee regions across diverse anatomical systems. The framework consists of two main components: CADS-dataset: 22,022 CT volumes with complete annotations for 167 anatomical structures. Most extensive whole-body CT dataset… See the full description on the dataset page: https://huggingface.co/datasets/sunghong/CADS-dataset.
downloads
uitars-task-111-v2
Bronze30Anish13 · Uncategorized
downloads
dronescapes2
Bronze45Meehai · Benchmarks & Evaluation
Dronescapes Experts dataset This dataset is an extension of the original dronescapes dataset with new modalities generated using VRE 100% from scratch (aka pretrained experts). The only data that is not generable by VRE is the Ground Truth: semantic (human annotated), depth & normals (SfM) that is inherited from the original dataset for evaluation purposes only. 1. Downloading the data Option 1. Download the pre-processed dataset from HuggingFace… See the full description on the dataset page: https://huggingface.co/datasets/Meehai/dronescapes2.
downloads
PhysicalAI-Robotics-GR00T-X-Embodiment-Sim
Silver59nvidia · Code
PhysicalAI-Robotics-GR00T-X-Embodiment-Sim Github Repo: Isaac GR00T N1 We provide a set of datasets used for post-training of GR00T N1. Each dataset is a collection of trajectories from different robot embodiments and tasks. Cross-embodied bimanual manipulation: 9k trajectories Dataset Name #trajectories bimanual_panda_gripper.Threading 1000 bimanual_panda_hand.LiftTray 1000 bimanual_panda_gripper.ThreePieceAssembly 1000… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim.
downloads
DPO-dataset
Bronze30JiaHuang01 · Preference & Alignment (DPO/RLHF)
downloads
pair_touch_13m
Bronze46BorisGuo · Image Recognition
PairTouch 13M Dataset Multi-modal tactile dataset with pose, force, and tactile sensor data. Configs Config Description Sensors pose_data Pose estimation data tac02/xela + camera force_data Force measurement data tac02/xela + gelsight tacniq_gsmini TacNIQ + GSMini data tacniq + gsmini xela_9dtact XELA + 9DTact data xela + 9dtact Usage from datasets import load_dataset # Load specific config ds = load_dataset("BorisGuo/pair_touch_13m"… See the full description on the dataset page: https://huggingface.co/datasets/BorisGuo/pair_touch_13m.
downloads
FineVision
Silver61HuggingFaceM4 · Image Recognition
Fine Vision FineVision is a massive collection of datasets with 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens, designed for training state-of-the-art open Vision-Language-Models. More detail can be found in the blog post: https://huggingface.co/spaces/HuggingFaceM4/FineVision Load the data from datasets import load_dataset, get_dataset_config_names # Get all subset names and load the first one available_subsets =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/FineVision.
downloads
ReActor
Silver59Gourieff · Code
ReActor Assets: The Fast and Simple Face Swap Extension. For ComfyUI-ReActor (ex. comfyui-reactor-node) and sd-webui-reactor. Models (file, source): buffalo_l.zip (DeepInsight), codeformer-v0.1.0.pth (sczhou), GFPGANv1.3.pth (TencentARC), GFPGANv1.4.pth (TencentARC), GPEN-BFR-512.onnx (harisreedhar), RestoreFormer_PP.onnx (netrunner.exe), inswapper_128.onnx (DeepInsight), inswapper_128_fp16.onnx (Hillobar).
downloads
oneformer_demo
Bronze30shi-labs · Uncategorized
downloads
zhongyangribao
Bronze31banned-historical-archives · Uncategorized
downloads
pretraining_v1-omega_books
Bronze33applied-ai-018 · Structured Data
downloads
common_corpus
Silver60PleIAs · Code
Common Corpus Full paper - ICLR 2026 oral Common Corpus is the largest open and permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
downloads