Text Generation & Chat, Synthetic Data, Commercial OK

FineTranslations

by HuggingFaceFW

Silver 56
36.6K downloads
279 likes
n>1T

Description

💬 FineTranslations

The world's knowledge in 1+1T tokens of parallel text

What is it?

This dataset contains over 1 trillion tokens of parallel text in English and 500+ languages. It was obtained by translating data from 🥂 FineWeb2 into English using Gemma3 27B.

We relied on datatrove's inference runner to deploy a synthetic data pipeline at scale. Its checkpointing and vLLM lifecycle management features allowed us to use leftover compute from the HF cluster…

See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finetranslations

What can I do with this?
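
A natural first step is to stream a few rows and inspect them before downloading anything at this scale. The sketch below is not taken from the dataset card: it assumes the Hugging Face `datasets` library is installed, that the repository exposes named subsets (the first one listed is picked arbitrarily), and that each subset has a `train` split. The helper names `take` and `peek` are hypothetical.

```python
from itertools import islice
from typing import Any, Iterable, List, Optional

# Repository id taken from the dataset page URL in the description.
REPO_ID = "HuggingFaceFW/finetranslations"


def take(rows: Iterable[Any], n: int) -> List[Any]:
    """Return at most the first n items of an iterable, reading no further."""
    return list(islice(rows, n))


def peek(n: int = 3, config: Optional[str] = None, split: str = "train") -> List[Any]:
    """Stream the first n rows of one subset (needs network access)."""
    # Imported lazily so `take` stays usable without the library installed.
    from datasets import get_dataset_config_names, load_dataset

    if config is None:
        # Assumption: the repo defines named subsets; use the first one listed.
        config = get_dataset_config_names(REPO_ID)[0]

    # Assumption: the chosen subset has a split with this name ("train" by default).
    # streaming=True iterates over remote shards instead of downloading them all.
    rows = load_dataset(REPO_ID, name=config, split=split, streaming=True)
    return take(rows, n)
```

Calling `peek()` and looking at the keys of the first returned row shows the actual column names, which this summary does not list.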

Tags

task_categories:text-generation, task_categories:translation, language:abk, language:abq, language:abs, language:acm, language:adh, language:adi, language:ady, language:aeb, language:afr, language:agx, language:aii, language:aim, language:ain, language:ajz, language:akb, language:aln, language:als, language:alt