Text Generation & Chat | Pretraining | Unknown
SkyPile-150B
by Skywork
33.1K downloads
403 likes
100B<n<1T
Description
SkyPile-150B
Dataset Summary
SkyPile-150B is a large-scale Chinese dataset designed for pre-training large language models. It is derived from a broad range of publicly accessible Chinese web pages. Rigorous filtering, extensive deduplication, and thorough sensitive-data filtering have been applied to ensure its quality. We have also used tools such as fastText and BERT to filter out low-quality data.
The… See the full description on the dataset page: https://huggingface.co/datasets/Skywork/SkyPile-150B.
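Since the tags list the `datasets` library and a JSON format, one plausible way to sample the corpus is to stream it rather than download it in full. The sketch below is a minimal illustration and not an official loading recipe: the repository ID comes from the URL above, while the `train` split name and the `text` field name are assumptions that should be checked against the dataset page. `take_texts` and `stream_skypile` are hypothetical helper names introduced here.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take_texts(records: Iterable[Dict[str, Any]], n: int, field: str = "text") -> List[str]:
    """Return the `field` value of the first `n` records that contain it.

    The default field name "text" is an assumption about the record schema.
    """
    texts = (rec[field] for rec in records if field in rec)
    return list(islice(texts, n))


def stream_skypile(split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over the dataset without downloading it in full.

    The split name "train" is an assumption; adjust it if the dataset differs.
    """
    # Imported here so that take_texts works without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of writing the corpus to disk.
    return iter(load_dataset("Skywork/SkyPile-150B", split=split, streaming=True))


if __name__ == "__main__":
    # Print the first 80 characters of three documents (requires network access).
    for text in take_texts(stream_skypile(), n=3):
        print(text[:80])
```

Streaming keeps memory and disk use small, which matters for a corpus in the 100B<n<1T size range; for full pre-training runs a local download is likely more practical.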
Tags
task_categories:text-generation, language:zh, size_categories:1M<n<10M, format:json, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2310.19341, region:us, llm, casual-lm, language-modeling