Description
The GitHub Code clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1 TB of text data.
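At almost 1 TB, the dataset is usually easier to stream and filter than to download in full. The sketch below is one possible way to do that, not an official recipe. It assumes the Hub repo id is `codeparrot/github-code-clean`, that the split is named `train`, and that each record carries `path`, `language`, and `code` fields as in the parent dataset; none of these are stated in this card. The helper names `filter_by_language` and `stream_dataset` are hypothetical.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def filter_by_language(records: Iterable[Dict], language: str) -> Iterator[Dict]:
    """Yield only the records whose (assumed) 'language' field matches."""
    for record in records:
        if record.get("language") == language:
            yield record


def stream_dataset(repo_id: str = "codeparrot/github-code-clean"):
    """Open the dataset lazily; needs the `datasets` package and network access.

    The repo id and the 'train' split name are assumptions, not taken from the card.
    """
    from datasets import load_dataset  # imported here so the sketch runs without it

    return load_dataset(repo_id, split="train", streaming=True)


# Tiny in-memory stand-in so the filter can be tried offline.
sample = [
    {"path": "a.py", "language": "Python", "code": "print('hi')"},
    {"path": "b.c", "language": "C", "code": "int main(void) { return 0; }"},
    {"path": "c.py", "language": "Python", "code": "x = 1"},
]

# Take at most two matching records without materializing the whole iterable.
first_python = list(islice(filter_by_language(sample, "Python"), 2))
print([r["path"] for r in first_python])
```

To try it on the real data, pass `stream_dataset()` in place of `sample`. Because both the stream and the filter are lazy, only as many records as `islice` requests are fetched.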
What can I do with this?
Tags
license:apache-2.0, size_categories:10M<n<100M, modality:text, library:datasets, library:mlcroissant, region:us