CodeHuman AnnotatedNon-Commercial

The-Stack

by bigcode

Silver54
12.6Kdownloads
392likes
unknown

Description

Dataset Card for The Stack Changelog Release Description v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 1.5TB in size. v1.1 The three copyleft licenses ((MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-dedup.

What can I do with this?

Tags

task_categories:text-generationlanguage_creators:crowdsourcedlanguage_creators:expert-generatedmultilinguality:multilinguallanguage:codelicense:othersize_categories:100M<n<1Bformat:parquetmodality:tabularmodality:textlibrary:datasetslibrary:dasklibrary:mlcroissantlibrary:polarsarxiv:2211.15533arxiv:2107.03374arxiv:2207.14157region:us