Text Generation & Chat, Pretraining, Commercial OK

TxT360

by LLM360

Silver56
44.5K downloads
248 likes
n>1T

Description

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend

Changelog

Version v1.1: Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended.

Details of v1.1 Additions

TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that…

See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.
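The idea of filtering documents on two classifier scores, as described for TxT360_BestOfWeb, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the field names `edu_score` and `format_score`, the threshold values, and the helpers `keep_document` and `filter_docs` are hypothetical and are not taken from the dataset card or from the ProX model. The actual score scale and selection rule are given on the dataset page.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ScoredDoc:
    """A document with two quality scores (both field names are hypothetical)."""
    text: str
    edu_score: float     # assumed: educational-quality score from a classifier
    format_score: float  # assumed: additional formatting-quality score


def keep_document(doc: ScoredDoc, min_edu: float = 3.0, min_format: float = 3.0) -> bool:
    """Keep a document only if BOTH scores meet their (illustrative) thresholds."""
    return doc.edu_score >= min_edu and doc.format_score >= min_format


def filter_docs(docs: Iterable[ScoredDoc], min_edu: float = 3.0,
                min_format: float = 3.0) -> List[ScoredDoc]:
    """Return the documents that pass the two-score filter, in input order."""
    return [d for d in docs if keep_document(d, min_edu, min_format)]


# Made-up example records, not real TxT360 data.
docs = [
    ScoredDoc("well-written tutorial", edu_score=4.2, format_score=4.0),
    ScoredDoc("useful content, broken markup", edu_score=4.5, format_score=1.0),
    ScoredDoc("clean layout, little substance", edu_score=1.5, format_score=4.8),
]
kept = filter_docs(docs)
print([d.text for d in kept])
```

Requiring both scores to pass means a document with strong content but poor formatting is dropped, which is the practical difference from filtering on a single educational-quality score.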

Tags

task_categories:text-generation, language:en, license:odc-by, size_categories:n>1T, region:us