Text Generation & Chat, Pretraining, Unknown

SkyPile-150B

by Skywork

Silver 56
33.1K downloads
403 likes
100B<n<1T

Description

SkyPile-150B Dataset Summary

SkyPile-150B is a comprehensive, large-scale Chinese dataset specifically designed for the pre-training of large language models. It is derived from a broad array of publicly accessible Chinese Internet web pages. Rigorous filtering, extensive deduplication, and thorough sensitive data filtering have been employed to ensure its quality. Furthermore, we have utilized advanced tools such as fastText and BERT to filter out low-quality data. The… See the full description on the dataset page: https://huggingface.co/datasets/Skywork/SkyPile-150B.

Tags

task_categories:text-generation, language:zh, size_categories:1M<n<10M, format:json, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2310.19341, region:us, llm, casual-lm, language-modeling
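
Since the tags list the format as JSON and the modality as text, reading records from a locally downloaded shard might look roughly like the sketch below. It assumes the shard is stored as JSON Lines (one JSON object per line) with the document body under a `text` field; neither detail is stated in this listing, so check the dataset page before relying on it. The helper name `iter_texts` and the in-memory sample are hypothetical and used only for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_texts(stream: TextIO, field: str = "text") -> Iterator[str]:
    """Yield the value of `field` from each JSON Lines record in `stream`."""
    for line in stream:
        line = line.strip()
        if not line:
            # Blank lines carry no record, so skip them.
            continue
        record = json.loads(line)
        # ASSUMPTION: each record is a JSON object with a "text" field.
        # Records lacking the field are skipped rather than raising.
        if field in record:
            yield record[field]


# Hypothetical two-record sample standing in for a downloaded shard.
sample = io.StringIO('{"text": "第一篇文档"}\n\n{"text": "第二篇文档"}\n')
texts = list(iter_texts(sample))
print(len(texts))
```

For a real shard, the same function can be given a file handle opened with `open(path, encoding="utf-8")`, so the file is read one line at a time instead of being loaded into memory at once.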