Code, Human Annotated, Non-Commercial
The-Stack
by bigcode
12.6K downloads
392 likes
Description
Dataset Card for The Stack
Changelog
Release | Description
v1.0 | Initial release of The Stack. Included 30 programming languages and 18 permissive licenses. Note: three of the included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 1.5TB in size.
v1.1 | The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming…
See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-dedup.
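Since the dataset is listed as compatible with the `datasets` library, a minimal sketch of peeking at a few rows in streaming mode might look like the following. The repository ID is taken from the dataset page URL above; the `train` split name and the helper `stream_rows` are assumptions made for illustration, and access may require logging in to the Hub and accepting the dataset's terms.

```python
from itertools import islice

# Repository ID taken from the dataset page URL in the listing.
REPO_ID = "bigcode/the-stack-dedup"


def stream_rows(limit=3, repo_id=REPO_ID):
    """Return the first `limit` rows without downloading the full dataset.

    Hypothetical helper for illustration; it is not part of the dataset card.
    """
    # Imported lazily so this module still loads when `datasets` is absent.
    from datasets import load_dataset

    # Assumption: the default configuration exposes a "train" split.
    # streaming=True iterates over remote files instead of fetching them all.
    stream = load_dataset(repo_id, split="train", streaming=True)
    return list(islice(stream, limit))
```

Calling `stream_rows(3)` would return a list of up to three row dictionaries; streaming is used here because the card reports a dataset size on the order of terabytes.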
Tags
task_categories:text-generation
language_creators:crowdsourced
language_creators:expert-generated
multilinguality:multilingual
language:code
license:other
size_categories:100M<n<1B
format:parquet
modality:tabular
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2211.15533
arxiv:2107.03374
arxiv:2207.14157
region:us
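The tags above follow a `namespace:value` pattern, so they can be grouped by namespace for filtering or display. The sketch below is illustrative only: `group_tags` is a hypothetical helper, not a Hub API, and the tag list is copied from this listing.

```python
from collections import defaultdict

# Tags copied from the listing above.
TAGS = [
    "task_categories:text-generation",
    "language_creators:crowdsourced",
    "language_creators:expert-generated",
    "multilinguality:multilingual",
    "language:code",
    "license:other",
    "size_categories:100M<n<1B",
    "format:parquet",
    "modality:tabular",
    "modality:text",
    "library:datasets",
    "library:dask",
    "library:mlcroissant",
    "library:polars",
    "arxiv:2211.15533",
    "arxiv:2107.03374",
    "arxiv:2207.14157",
    "region:us",
]


def group_tags(tags):
    """Group 'namespace:value' tags into {namespace: [values, ...]}."""
    grouped = defaultdict(list)
    for tag in tags:
        # Split on the first colon only, so values keep any later colons.
        namespace, _, value = tag.partition(":")
        grouped[namespace].append(value)
    return dict(grouped)


grouped = group_tags(TAGS)
print(grouped["library"])  # ['datasets', 'dask', 'mlcroissant', 'polars']
```

Insertion order is preserved within each namespace, so the grouped values appear in the same order as in the listing.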