Code, Pretraining, Unknown

AutoMathText-V2

by OpenSQZ

Silver 58
419.8K downloads
68 likes
10B<n<100B

Description

🚀 AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset

🎉 AutoMathText-v2 has surpassed 1 million downloads! We'd love to know how you're using it. Please take 1 minute to fill out our use case survey. Your feedback will directly shape the future roadmap of this dataset. 👉 Share your use case here

📊 AutoMathText-V2 consists of 2.46 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual… See the full description on the dataset page: https://huggingface.co/datasets/OpenSQZ/AutoMathText-V2.

Tags

task_categories:text-generation, task_categories:question-answering, language:en, language:zh, size_categories:1B<n<10B, modality:tabular, modality:text, arxiv:2402.07625, region:us, LLM, pretraining, finetuning, midtraining, reasoning, STEM, math