Code, Human Annotated, Non-Commercial
The-Stack
by bigcode
22.2K downloads
490 likes
unknown
Description
StarCoder Training Dataset
Dataset description
This is the dataset used for training StarCoder and StarCoderBase. It contains 783GB of code in 86 programming languages, and includes 54GB of GitHub issues, 13GB of Jupyter notebooks (as scripts and text-code pairs), and 32GB of GitHub commits, for approximately 250 billion tokens in total.
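As a rough sanity check on the figures in the description, the sketch below works out what they imply. It rests on two assumptions that the description does not state outright: that the 783GB total already includes the issues, notebooks, and commits, and that 1 GB means 10^9 bytes. The variable names are illustrative only.

```python
# Sizes in GB, copied from the dataset description.
TOTAL_GB = 783
COMPONENTS_GB = {
    "GitHub issues": 54,
    "Jupyter notebooks": 13,
    "GitHub commits": 32,
}
TOTAL_TOKENS = 250e9  # "approximately 250 billion tokens"

# Assumption: the 783GB total includes the three components above,
# so the remainder is plain source code files.
source_code_gb = TOTAL_GB - sum(COMPONENTS_GB.values())

# Assumption: 1 GB = 10**9 bytes.
bytes_per_token = TOTAL_GB * 1e9 / TOTAL_TOKENS

print(f"Remaining source code: {source_code_gb} GB")
for name, gb in COMPONENTS_GB.items():
    print(f"{name}: {gb} GB ({gb / TOTAL_GB:.1%} of total)")
print(f"Average bytes per token: {bytes_per_token:.2f}")
```

Under those assumptions, roughly 684GB is source code proper, and the corpus averages a little over 3 bytes per token. Both numbers are estimates derived from rounded figures, not values reported by the dataset authors.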
Dataset creation
The creation and filtering of The Stack are explained in the original dataset; we additionally decontaminate and… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/starcoderdata.
Tags
task_categories:text-generation, language_creators:crowdsourced, language_creators:expert-generated, multilinguality:multilingual, language:code, license:other, size_categories:100M<n<1B, format:parquet, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us