Text Generation & Chat
Commercial OK
Falcon RefinedWeb
by tiiuae
20.6K downloads
901 likes
100B<n<1T
Description
Falcon RefinedWeb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license.
See the paper on arXiv for more details.
RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found that models trained on RefinedWeb achieve performance in line with or better than models trained on curated datasets, while relying only on web data.
RefinedWeb is also "multimodal-friendly": it contains links and alt…
See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
What can I do with this?
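One common starting point is to stream a few records rather than download the full corpus. The sketch below assumes the Hugging Face `datasets` library is installed, that the dataset exposes a `train` split, and that the document text lives in a `content` field; these details are assumptions not stated on this page, so check the dataset page before relying on them. The helper names `stream_refinedweb` and `first_texts` are hypothetical.

```python
from itertools import islice
from typing import Iterable, List, Mapping


def stream_refinedweb() -> Iterable[Mapping[str, object]]:
    """Return a lazily streamed view of the dataset (requires network access)."""
    # Imported here so that first_texts() works without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over records without downloading every shard.
    # The split name "train" is an assumption.
    return load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)


def first_texts(
    records: Iterable[Mapping[str, object]],
    n: int,
    text_field: str = "content",  # assumed field name for the page text
) -> List[str]:
    """Collect the text of the first n records from any iterable of dict-like rows."""
    return [str(record[text_field]) for record in islice(records, n)]
```

With network access, `first_texts(stream_refinedweb(), 3)` would fetch the text of three documents. Keeping the download inside a function means the helper can also be tried on any local list of dictionaries.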
Tags
task_categories:text-generation
language:en
license:odc-by
size_categories:100M<n<1B
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2306.01116
arxiv:2203.15556
arxiv:2107.06499
arxiv:2104.08758
arxiv:2109.07445
arxiv:1911.00359
arxiv:2112.11446
doi:10.57967/hf/0737
region:us