Code, Pretraining, Unknown
AutoMathText-V2
by OpenSQZ
419.8K downloads
68 likes
10B<n<100B
Description
AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset
AutoMathText-V2 has surpassed 1 million downloads! We'd love to know how you're using it. Please take 1 minute to fill out our use case survey. Your feedback will directly shape the future roadmap of this dataset.
AutoMathText-V2 consists of 2.46 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual… See the full description on the dataset page: https://huggingface.co/datasets/OpenSQZ/AutoMathText-V2.
Tags
task_categories:text-generation, task_categories:question-answering, language:en, language:zh, size_categories:1B<n<10B, modality:tabular, modality:text, arxiv:2402.07625, region:us, LLM, pretraining, finetuning, midtraining, reasoning, STEM, math
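
The tags above mix namespaced entries of the form key:value with free-form keywords. As a minimal sketch of how they could be grouped for filtering, the Python snippet below splits each tag on its first colon. The function name group_tags and the "other" bucket are hypothetical names chosen for illustration; they are not part of any Hugging Face API.

```python
from collections import defaultdict

# Tags copied from the listing above.
TAGS = [
    "task_categories:text-generation",
    "task_categories:question-answering",
    "language:en",
    "language:zh",
    "size_categories:1B<n<10B",
    "modality:tabular",
    "modality:text",
    "arxiv:2402.07625",
    "region:us",
    "LLM", "pretraining", "finetuning", "midtraining",
    "reasoning", "STEM", "math",
]


def group_tags(tags):
    """Group 'key:value' tags by key; free-form tags go under 'other' (hypothetical bucket name)."""
    grouped = defaultdict(list)
    for tag in tags:
        # partition splits on the first colon only, so values may contain further colons.
        key, sep, value = tag.partition(":")
        if sep:
            grouped[key].append(value)
        else:
            grouped["other"].append(tag)
    return dict(grouped)


grouped = group_tags(TAGS)
print(grouped["language"])
print(grouped["task_categories"])
```

Grouping this way makes it easy to check, for example, which languages or task categories a listing declares without scanning the whole tag string.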