Text Generation & Chat | Pretraining | Commercial OK
TxT360
by LLM360
44.5K downloads
248 likes
n>1T
Description
TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend
Changelog
Version | Details
v1.1 | Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended.
Details of v1.1 Additions
TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.
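The description above is truncated, so it does not say how the format score is used. As a rough illustration only, the sketch below assumes each document carries two classifier scores and is kept when both clear a threshold. The field names `edu_score` and `format_score`, the `ScoredDoc` type, and the threshold values are hypothetical and are not taken from TxT360 or ProX.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    """A document with two quality scores (hypothetical field names)."""
    text: str
    edu_score: float     # assumed: content-quality score, higher is better
    format_score: float  # assumed: formatting-quality score, higher is better


def keep(doc: ScoredDoc, min_edu: float = 3.0, min_format: float = 3.0) -> bool:
    """Keep a document only if both scores meet their thresholds.

    The thresholds are illustrative defaults, not values used by TxT360.
    """
    return doc.edu_score >= min_edu and doc.format_score >= min_format


def filter_docs(docs, min_edu: float = 3.0, min_format: float = 3.0):
    """Return the documents that pass both score thresholds, in order."""
    return [d for d in docs if keep(d, min_edu, min_format)]


docs = [
    ScoredDoc("Well-written tutorial", edu_score=4.2, format_score=4.0),
    ScoredDoc("Good content, broken markup", edu_score=4.5, format_score=1.0),
    ScoredDoc("Clean layout, little substance", edu_score=1.5, format_score=4.8),
]
kept = filter_docs(docs)
```

Requiring both scores to pass is one possible design; a weighted sum or a single combined classifier would be equally plausible readings of the truncated description.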
Tags
task_categories:text-generation
language:en
license:odc-by
size_categories:n>1T
region:us