CodeHuman AnnotatedNon-Commercial

The-Stack

by bigcode

Silver56
22.2Kdownloads
490likes
unknown

Description

StarCoder Training Dataset Dataset description This is the dataset used for training StarCoder and StarCoderBase. It contains 783GB of code in 86 programming languages, and includes 54GB GitHub Issues + 13GB Jupyter notebooks in scripts and text-code pairs, and 32GB of GitHub commits, which is approximately 250 Billion tokens. Dataset creation The creation and filtering of The Stack is explained in the original dataset, we additionally decontaminate and… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/starcoderdata.

What can I do with this?

Tags

task_categories:text-generationlanguage_creators:crowdsourcedlanguage_creators:expert-generatedmultilinguality:multilinguallanguage:codelicense:othersize_categories:100M<n<1Bformat:parquetmodality:textlibrary:datasetslibrary:dasklibrary:mlcroissantlibrary:polarsregion:us