Training Datasets
Curated datasets for fine-tuning, organized by use case, training method, and task type. Find the right data for SFT, DPO, RLHF, and more.

m-a-p
FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus. arXiv, project page, and blog: coming soon. Data statistics (tokens and sample counts per domain across three filtering iterations): aerospace — 5.77B / 261.63M / 309.33M tokens per iteration (6.34B total) over 9,100,000 / 688,505 / 611,034 samples (10,399,539 total); agronomy — 13.08B / 947.41M / 229.04M tokens (14.26B total) over 15,752,828 / 2,711,790 / 649,404 samples (19,114,022 total); artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.
KakologArchives
Nico Nico Jikkyo Past Log Archive. The Nico Nico Jikkyo Past Log Archive is a dataset collecting every past log comment from the launch of Nico Nico Jikkyo (ニコニコ実況) up to the present. In December 2020, Nico Nico Jikkyo was relaunched as an official channel within Nico Nico Live (ニコニコ生放送). With that change, the old system in operation since November 2009 was discontinued (effectively a service shutdown), and as support for devices such as torne and BRAVIA ended across the board, roughly 11 years of past logs filled with the raw voices of their time were about to be lost as well. Users of 5ch's DTV board therefore organized a plan to archive the past logs of every channel for those 11 years before the old Nico Nico Jikkyo shut down. After various twists and turns, Nekopanda managed to retrieve roughly 11 years of past logs for every channel, including radio and BS broadcasts, in full, so the loss of 11 years of logs into the digital sea was averted. However, because the old API was discontinued, the past logs… See the full description on the dataset page: https://huggingface.co/datasets/KakologArchives/KakologArchives.
ropedia-ai
⚠️ Important: If you have already submitted an access request but have not completed the required DocuSign agreement, your request will remain pending. Please complete signing and we will grant access once verified. Interactive Intelligence from Human Xperience. Xperience-10M Dataset Summary: Xperience-10M is a large-scale egocentric multimodal dataset of human experience for embodied AI, robotics, world models, and spatial… See the full description on the dataset page: https://huggingface.co/datasets/ropedia-ai/xperience-10m.
huggingface
This dataset contains images used in the documentation of Hugging Face's libraries. HF Team: Please make sure you optimize the assets before uploading them. My favorite tool for this is https://tinypng.com/.
banned-historical-archives
Banned Historical Archives Datasets (和谐历史档案馆数据集). The Banned Historical Archives dataset contains the original files already entered into https://banned-historical-archives.github.io as well as files not yet entered. Directory structure: banned-historical-archives.github.io — original data already entered into the site, synced from the GitHub repository at irregular intervals; raw — original files; config — configuration files; todo — files not yet entered into the site. Some newspapers and image materials are stored in separate repositories: Reference News (参考消息), https://huggingface.co/datasets/banned-historical-archives/ckxx, not yet entered; People's Daily (人民日报), https://huggingface.co/datasets/banned-historical-archives/rmrb, selected important articles entered; Wenhui Bao (文汇报)… See the full description on the dataset page: https://huggingface.co/datasets/banned-historical-archives/banned-historical-archives.
hf-doc-build
This repo contains all the docs published on https://huggingface.co/docs. The docs are generated with https://github.com/huggingface/doc-builder.
NTU-NLP-sg
The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems, or assist developers in writing them, can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, these models have typically been evaluated in a scattered way: on only one or two specific tasks, in a few languages, at a partial granularity (e.g., function level), and in many cases without proper training data. Even more concerning, generated code has mostly been evaluated by mere lexical overlap rather than actual execution, even though the semantic similarity (or equivalence) of two code segments ultimately rests on their "execution similarity", i.e., producing the same output for a given input.
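A minimal sketch of the execution-based view of equivalence that this card argues for (illustrative only, not the benchmark's official harness; the function names and test inputs are made up): two candidates count as equivalent if they agree on every test input, however different their source text looks.

```python
# Execution-based equivalence check: judge two solutions by their outputs,
# not by lexical overlap of their source code.

def reference(xs):
    return sorted(xs)

def candidate(xs):
    # Lexically very different from `reference`, yet semantically equivalent.
    out = list(xs)
    for i in range(len(out)):
        for j in range(i + 1, len(out)):
            if out[j] < out[i]:
                out[i], out[j] = out[j], out[i]
    return out

def execution_similar(f, g, test_inputs):
    """Return True if f and g produce the same output on every test input."""
    return all(f(x) == g(x) for x in test_inputs)

tests = [[3, 1, 2], [], [5, 5, 1], [-1, 0, 7]]
print(execution_similar(reference, candidate, tests))  # True
```

A lexical-overlap metric would score `candidate` poorly against `reference` despite their identical behavior, which is exactly the evaluation gap the card describes.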
Salesforce
Dataset Card for "wikitext". Dataset Summary: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
mlfoundations
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2024-18.
nvidia
Physical AI Autonomous Vehicles. The PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, most geographically diverse collections of multi-sensor data, empowering AV researchers to build the next generation of Physical AI-based end-to-end driving systems. This dataset is ready for commercial/non-commercial AV use per the license agreement. Data Collection Method: Automatic/Sensor. Labeling Method: Automatic/Sensor. This dataset has a total of 1700 hours of driving… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles.
mlfoundations
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-40.
ScaleAI
Dataset Summary: SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks. Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf. See the related evaluation GitHub repository: https://github.com/scaleapi/SWE-bench_Pro-os. Dataset Structure: We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.
openai
Dataset Card for GSM8K. Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
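As an example of how entries in this catalog are typically pulled into a fine-tuning pipeline, here is a minimal sketch that loads GSM8K with the 🤗 datasets library and inspects one question/answer pair; the "main" config name and the question/answer fields come from the dataset page.

```python
# Minimal sketch: load GSM8K from the Hub and inspect one example.
from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main")  # configs: "main", "socratic"
example = ds["train"][0]
print(example["question"])  # the word problem
print(example["answer"])    # step-by-step solution ending in "#### <answer>"
```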
mlfoundations
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-50.
princeton-nlp
Dataset Summary: SWE-bench Verified is a subset of 500 samples from the SWE-bench test set that have been human-validated for quality. SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test issue/pull-request pairs from popular Python repositories. Evaluation is performed by unit-test verification, using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified.
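The unit-test verification protocol the card describes can be pictured with a short sketch. This is not the official SWE-bench harness, and the repo path, patch file, and test IDs are hypothetical: apply the model's patch to the repository, run the designated tests, and count the task as resolved only if they pass.

```python
# Illustrative sketch of unit-test verification (not the official harness):
# apply a model-generated patch, then run the tests that encode post-PR behavior.
import subprocess

def resolved(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
    """Return True if the patch applies cleanly and all reference tests pass."""
    apply = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if apply.returncode != 0:
        return False  # patch does not apply: task counts as unresolved
    tests = subprocess.run(["pytest", *test_ids], cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```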
tasl-lab
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving. Paper | Project Page | Code. Autonomous driving researchers: have you ever been bothered by the fact that popular datasets all use different formats, and that standardizing them is a pain? Have you ever been frustrated by the difficulty of just understanding the file semantics? This challenge is even worse in the occupancy domain. But UniOcc is here to help. UniOcc is a unified… See the full description on the dataset page: https://huggingface.co/datasets/tasl-lab/uniocc.
mlfoundations
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-23.
allenai
MADLAD-400: Dataset and Introduction. MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. It uses all snapshots of Common Crawl available as of August 1, 2022. The primary advantages of this dataset over similar datasets are that it is more multilingual (419 languages), that it is audited and more heavily filtered, and that it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.
allenai
C4. Dataset Summary: A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset. We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, these are the sizes of the variants: en: 305 GB; en.noclean: 2.3 TB; en.noblocklist: 380 GB; realnewslike: 15 GB; multilingual (mC4): 9.7 TB (108 subsets, one per… See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.
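Given these sizes, a corpus like C4 is usually consumed in streaming mode rather than downloaded outright. A minimal sketch with the 🤗 datasets library; the "en" config name comes from the variant list above:

```python
# Minimal sketch: stream the ~305 GB "en" variant of C4 without downloading it.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, doc in enumerate(c4):
    print(doc["text"][:80])  # each record has "text", "timestamp", and "url"
    if i == 2:
        break
```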
mlfoundations
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-14.
HuggingFaceFW
Dataset Card for HuggingFaceFW/finephrase. Dataset Summary: Synthetic data generated by DataTrove. Model: HuggingFaceTB/SmolLM2-1.7B-Instruct (main). Source dataset: HuggingFaceFW/fineweb-edu, config sample-350BT, split train. Generation config: temperature=1.0, top_p=1.0, top_k=50, max_tokens=2048, model_max_context=8192. Speculative decoding: {"method":"suffix","num_speculative_tokens":32}. System prompt: None. Input column: text. Prompt families: faq prompt Rewrite the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finephrase.
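For readers who want to approximate the listed sampling configuration outside DataTrove, a minimal sketch with transformers is below. The prompt is a placeholder, and the suffix-based speculative decoding setting is omitted, so this only mirrors the card's basic sampling parameters.

```python
# Minimal sketch: apply the card's sampling parameters with transformers.
# The prompt is a placeholder; the actual pipeline used DataTrove and
# suffix-based speculative decoding, which this sketch does not reproduce.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Rewrite the following passage as an FAQ: ..."  # placeholder input
inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,   # from the card's generation config
    top_p=1.0,
    top_k=50,
    max_new_tokens=2048,
)
print(tok.decode(out[0], skip_special_tokens=True))
```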
Kthera
mlfoundations
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. 🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-06.
allenai
⚠️ WARNING: This dataset is intended ONLY for reproducing Olmo 3 7B. ⚠️ For all other training use cases, including training from scratch, please use our primary Dolma 3 data mix: https://huggingface.co/datasets/allenai/dolma3_mix-6T. Note: Some olmOCR science PDFs in the current dataset have been redacted since the training of Olmo 3 7B; these texts are marked with [REMOVED] in the text field, which will affect reproducibility of Olmo 3 7B. For this reason, please use our… See the full description on the dataset page: https://huggingface.co/datasets/allenai/dolma3_mix-6T-1025-7B.