dclm-baseline-1.0
by mlfoundations
139.1K downloads · 262 likes
Description
DCLM-baseline
DCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.
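At this scale, the dataset is most practically consumed in streaming mode. Below is a minimal sketch using the Hugging Face `datasets` library; the `split="train"` and `"text"` field names are assumptions based on common pretraining-corpus layouts, so check the dataset viewer for the actual schema.

```python
# Minimal sketch: stream DCLM-baseline without downloading the full
# ~4T-token corpus to disk. Field names ("text") and the "train" split
# are assumptions; verify against the dataset page before relying on them.
from datasets import load_dataset

# streaming=True iterates over shards lazily instead of materializing
# the whole dataset locally.
ds = load_dataset("mlfoundations/dclm-baseline-1.0", streaming=True, split="train")

for i, example in enumerate(ds):
    print(example["text"][:200])  # preview the first 200 characters of each document
    if i >= 2:  # stop after a few documents
        break
```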
Below are comparisons of models trained on DCLM-baseline with other models in the 7B regime.
| Model | Params | Tokens | Open dataset? | CORE | MMLU | EXTENDED |
|---|---|---|---|---|---|---|
| **Open weights, closed datasets** | | | | | | |
| Llama2 | 7B | 2T | ✗ | 49.2 | 45.8 | 34.1 |
| DeepSeek | 7B | 2T | ✗ | 50.7 | 48.5 | 35.3 |
| Mistral-0.3 | 7B | ? | ✗ | 57.0 | 62.7 | 45.1 |
| QWEN-2 | 7B | ? | ✗ | 57.5 | 71.9 | 50.5 |
| Llama3 | 8B | 15T | ✗ | 57.6 | … | … |

See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0.
Tags
license:cc-by-4.0 · arxiv:2406.11794 · region:us