Code, Human Annotated, Non-Commercial

The-Stack

by bigcode

Silver56

22.2K downloads

490 likes

unknown

Description

StarCoder Training Dataset

Dataset description

This is the dataset used for training StarCoder and StarCoderBase. It contains 783GB of code in 86 programming languages, and includes 54GB of GitHub issues, 13GB of Jupyter notebooks (in scripts and text-code pairs), and 32GB of GitHub commits, for a total of approximately 250 billion tokens.

Dataset creation

The creation and filtering of The Stack is explained in the original dataset; we additionally decontaminate and… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/starcoderdata.
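The size figures quoted in the description can be put in proportion with a little arithmetic. The Python sketch below is only illustrative: it assumes that the 783GB total already includes the issues, notebooks, and commits (the wording is ambiguous on this point) and that 1GB means 10^9 bytes. The function names are hypothetical and not part of any dataset tooling.

```python
# Figures as quoted in the dataset description.
TOTAL_GB = 783
COMPONENTS_GB = {
    "GitHub issues": 54,
    "Jupyter notebooks": 13,
    "GitHub commits": 32,
}
TOTAL_TOKENS = 250e9  # "approximately 250 billion tokens"


def breakdown(total_gb, components_gb):
    """Return the GB left after the named components, and each share of the total.

    Assumption: the named components are counted inside total_gb.
    """
    remainder_gb = total_gb - sum(components_gb.values())
    shares = {name: size / total_gb for name, size in components_gb.items()}
    shares["remaining source code (assumed)"] = remainder_gb / total_gb
    return remainder_gb, shares


def bytes_per_token(total_gb, total_tokens, bytes_per_gb=1e9):
    """Average number of bytes per token, assuming decimal gigabytes."""
    return total_gb * bytes_per_gb / total_tokens


remainder_gb, shares = breakdown(TOTAL_GB, COMPONENTS_GB)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Remaining source code: {remainder_gb} GB")
print(f"Average bytes per token: {bytes_per_token(TOTAL_GB, TOTAL_TOKENS):.2f}")
```

Under these assumptions, the three named components account for 99GB of the total, and the average works out to roughly three bytes per token.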
