Code, Human Annotated, Non-Commercial
The-Stack-v2
by bigcode
10.0K downloads
525 likes
Description
The Stack v2
The dataset consists of 4 versions:
bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2 but further near-deduplicated
bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
bigcode/the-stack-v2-train-smol-ids: based on the…
See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.
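As a rough illustration of how the variants listed above might be accessed, here is a minimal sketch using the `datasets` library, which the tags on this page list as supported. The helper name `stream_variant` and the `"train"` split name are assumptions made for this example, not something this page states. It also assumes the dataset's access terms have been accepted and that you are authenticated with the Hugging Face Hub.

```python
# Repository IDs copied from the list of variants above.
VARIANTS = (
    "bigcode/the-stack-v2",
    "bigcode/the-stack-v2-dedup",
    "bigcode/the-stack-v2-train-full-ids",
    "bigcode/the-stack-v2-train-smol-ids",
)


def stream_variant(repo_id, split="train"):
    """Return a streaming dataset for one variant (hypothetical helper).

    The split name "train" is an assumption; check the dataset page.
    """
    if repo_id not in VARIANTS:
        raise ValueError(f"unknown variant: {repo_id!r}")
    # Imported lazily so the constants above can be used without
    # having the `datasets` package installed.
    from datasets import load_dataset

    # streaming=True iterates over records without downloading
    # the whole dataset up front.
    return load_dataset(repo_id, split=split, streaming=True)
```

With access in place, something like `next(iter(stream_variant("bigcode/the-stack-v2-dedup")))` should return the first record as a dictionary. The exact fields depend on the variant, so inspect the keys rather than relying on a fixed schema.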
Tags
task_categories:text-generation
language_creators:crowdsourced
language_creators:expert-generated
multilinguality:multilingual
language:code
license:other
size_categories:1B<n<10B
format:parquet
modality:tabular
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2402.19173
arxiv:2107.03374
arxiv:2207.14157
region:us