Text Generation & Chat | Synthetic Data | Commercial OK
FineTranslations
by HuggingFaceFW
36.6K downloads
279 likes
n>1T
Description
💬 FineTranslations
The world's knowledge in 1+1T tokens of parallel text
What is it?
This dataset contains over 1 trillion tokens of parallel text in English and 500+ languages. It was obtained by translating data from 🥂 FineWeb2 into English using Gemma3 27B.
We relied on datatrove's inference runner to deploy a synthetic data pipeline at scale. Its checkpointing and vLLM lifecycle management features allowed us to use leftover compute from the HF cluster… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finetranslations.
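Since the dataset is far too large to download casually, a quick way to inspect it is to stream a few rows. The following is a minimal sketch, not an official recipe: it assumes the Hugging Face `datasets` library is installed, that the repository exposes a `train` split, and that a config name has to be chosen (the helper names `list_configs` and `peek_rows` are hypothetical). The dataset page remains the authority on the actual configs and columns.

```python
from itertools import islice

# Repository ID taken from the dataset page URL above.
REPO_ID = "HuggingFaceFW/finetranslations"


def list_configs():
    """Return the config (subset) names the repository exposes."""
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(REPO_ID)


def peek_rows(config=None, n=3, split="train"):
    """Stream the first `n` rows without downloading the whole dataset.

    ASSUMPTION: the split is called "train"; check the dataset page.
    """
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, name=config, split=split, streaming=True)
    return list(islice(ds, n))


if __name__ == "__main__":
    configs = list_configs()
    print(f"{len(configs)} configs, e.g. {configs[:5]}")
    for row in peek_rows(config=configs[0]):
        # Column names are not assumed here; print whatever keys exist.
        print(sorted(row.keys()))
```

Streaming keeps memory and disk use small, which matters for a corpus in the n>1T size category; printing the row keys first avoids guessing the column layout.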
What can I do with this?
Tags
task_categories:text-generation, task_categories:translation, language:abk, language:abq, language:abs, language:acm, language:adh, language:adi, language:ady, language:aeb, language:afr, language:agx, language:aii, language:aim, language:ain, language:ajz, language:akb, language:aln, language:als, language:alt