Live catalog — syncing from HuggingFace

Find the perfect fine-tuned model for your project

The curated directory of fine-tuned AI models and training datasets. No ML expertise required — browse by use case, modality, or training method.

3,452 Fine-tuned models
10,000 Training datasets
100 Uncensored models
47 Base model families

Top datasets
Math & Reasoning

Grade School Math 8K

openai

Silver · 67

Dataset Card for GSM8K. Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade-school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

PPO · Human Annotated · 1K<n<10K
761.9K downloads · 1.2K likes · en
Commercial OK
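The multi-step structure the GSM8K summary describes can be illustrated with a made-up problem in the same style. The problem and numbers below are invented for illustration and are not an actual dataset entry; each solution step is one elementary calculation, as in the dataset:

```python
# A hypothetical GSM8K-style word problem (invented, not from the dataset),
# solved with the kind of 2-to-8-step elementary arithmetic the dataset targets.
problem = (
    "A bakery sells 24 muffins in the morning and half as many in the "
    "afternoon. Each muffin costs $3. How much money does the bakery "
    "make in total?"
)

# Step 1: muffins sold in the afternoon (half of the morning's 24)
afternoon = 24 // 2             # 12
# Step 2: total muffins sold over the day
total_muffins = 24 + afternoon  # 36
# Step 3: total revenue at $3 per muffin
revenue = total_muffins * 3     # 108

print(revenue)  # 108
```

GSM8K's reference solutions are written in exactly this step-by-step form, with the final numeric answer at the end, which is what makes the dataset useful for training and evaluating multi-step reasoning.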
Uncategorized

PhysicalAI-Autonomous-Vehicles

nvidia

Silver · 67

PHYSICAL AI AUTONOMOUS VEHICLES. The PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, most geographically diverse collections of multi-sensor data, empowering AV researchers to build the next generation of Physical AI-based end-to-end driving systems. This dataset is ready for commercial/non-commercial AV use per the license agreement. Data Collection Method: Automatic/Sensor. Labeling Method: Automatic/Sensor. This dataset has a total of 1700 hours of driving… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles.

993.0K downloads · 805 likes
Non-Commercial
Text Generation & Chat

FineWeb

HuggingFaceFW

Silver · 66

🍷 FineWeb: 15 trillion tokens of the finest data the 🌐 web has to offer. What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

n>1T
206.0K downloads · 2.7K likes · en
Attrib. Required
Text Generation & Chat

WikiText

Salesforce

Silver · 66

Dataset Card for "wikitext". Dataset Summary: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.

Human Annotated · 1M<n<10M
1.1M downloads · 653 likes · en
Copyleft
Code

prompts.chat

fka

Silver · 65

a.k.a. Awesome ChatGPT Prompts. This is a Dataset Repository mirror of prompts.chat, a social platform for AI prompts. 📢 Notice: this Hugging Face dataset is a mirror. For the latest prompts, features, and community contributions, please visit the 🌐 website (prompts.chat) or 📦 GitHub (github.com/f/awesome-chatgpt-prompts). About: prompts.chat is an open-source platform where users can share, discover, and collect AI prompts from the community. The project can be… See the full description on the dataset page: https://huggingface.co/datasets/fka/prompts.chat.

Synthetic Data · 100K<n<1M
32.0K downloads · 9.6K likes
Commercial OK
Video

Xperience-10M

ropedia-ai

Silver · 64

⚠️ Important: If you have already submitted an access request but have not completed the required DocuSign agreement, your request will remain pending. Please complete signing and we will grant access once verified. Interactive Intelligence from Human Xperience. Xperience-10M Dataset Summary: Xperience-10M is a large-scale egocentric multimodal dataset of human experience for embodied AI, robotics, world models, and spatial… See the full description on the dataset page: https://huggingface.co/datasets/ropedia-ai/xperience-10m.

1M<n<10M
2.2M downloads · 154 likes · en
Non-Commercial

What is fine-tuning?

Fine-tuning takes a pre-trained AI model and trains it further on specialized data, teaching a generalist to become an expert in your field. Instead of training from scratch, which can cost millions of dollars, you start from an existing model and adapt it with your own dataset.
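The core idea can be sketched with a toy one-parameter model: "pre-train" on a general dataset, then continue training from the learned weight on a small specialized dataset rather than starting from zero. Everything below (the data, learning rate, and step counts) is invented for illustration; real fine-tuning adapts large neural networks, but the principle of resuming from pre-trained weights is the same:

```python
# Toy illustration of fine-tuning with a one-parameter linear model
# y = w * x, trained by plain gradient descent on mean squared error.
# All data and hyperparameters here are invented for illustration.

def train(w, data, lr=0.01, steps=200):
    """Gradient descent on MSE, starting from an initial weight w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# "Pre-training": fit a broad, general dataset whose true slope is 2.0.
general_data = [(x, 2.0 * x) for x in range(1, 6)]
w_pretrained = train(0.0, general_data)          # converges near 2.0

# "Fine-tuning": continue from the pre-trained weight on a tiny
# specialized dataset whose true slope is 2.5.
special_data = [(1.0, 2.5), (2.0, 5.0)]
w_finetuned = train(w_pretrained, special_data)  # converges near 2.5

print(round(w_pretrained, 2), round(w_finetuned, 2))
```

Because fine-tuning starts close to a good solution, the specialized dataset can be far smaller than what pre-training required, which is exactly why adapting an existing model is so much cheaper than training one from scratch.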

This catalog helps you find both the fine-tuned models others have created and the training datasets you need for your own fine-tuning projects. We filter out pure quantizations and format conversions, so every entry is a genuine fine-tune that involved real training.

Built on generosity

Every model and dataset in this catalog exists because someone chose to share their work with the world. Behind each entry is real human expertise — researchers, engineers, and hobbyists who invested their knowledge, their time, and often significant compute resources to create something valuable, then gave it away freely.

Fine-tuning a model can take days of GPU time. Curating a training dataset can take months of careful annotation. These contributions represent a quiet, extraordinary act of generosity — people sharing the fruits of their labor so that others can build on them, learn from them, and push the boundaries of what's possible.

To every open-source contributor in this catalog: thank you.