Text Generation & Chat
Commercial OK

Falcon RefinedWeb

by tiiuae

Silver 57
20.6K downloads
901 likes
100B<n<1T

Description

Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found that models trained on RefinedWeb achieve performance in line with or better than models trained on curated datasets, while relying only on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt…

See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.

What can I do with this?
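
Since the tags list `library:datasets` and `format:parquet`, one option is to stream a few rows with the Hugging Face `datasets` library rather than downloading the whole corpus. The following is a minimal sketch, not an official recipe: the split name `"train"` and the text column name `"content"` are assumptions not stated on this page, and the helper names `stream_refinedweb` and `preview` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Repository id taken from the dataset page URL in the description.
DATASET_ID = "tiiuae/falcon-refinedweb"


def stream_refinedweb(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Return a lazy iterator over dataset rows.

    Requires `pip install datasets` and network access.
    The default split name "train" is an assumption.
    """
    # Imported lazily so that `preview` below can be used without the library.
    from datasets import load_dataset

    # streaming=True avoids downloading the full dataset up front.
    return load_dataset(DATASET_ID, split=split, streaming=True)


def preview(
    rows: Iterable[Dict[str, Any]],
    n: int = 3,
    text_field: str = "content",  # assumed column name, check the dataset page
    width: int = 80,
) -> List[str]:
    """Return the first `width` characters of `text_field` for the first `n` rows."""
    snippets = []
    for row in islice(rows, n):
        # Rows missing the field yield an empty string instead of raising.
        snippets.append(str(row.get(text_field, ""))[:width])
    return snippets
```

With network access, `preview(stream_refinedweb(), n=3)` would return short snippets from the first three rows; `islice` stops the stream after `n` rows, so only a small part of the data is fetched.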

Tags

task_categories:text-generation, language:en, license:odc-by, size_categories:100M<n<1B, format:parquet, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2306.01116, arxiv:2203.15556, arxiv:2107.06499, arxiv:2104.08758, arxiv:2109.07445, arxiv:1911.00359, arxiv:2112.11446, doi:10.57967/hf/0737, region:us