Code | Commercial OK

essential-web-v1.0

by EssentialAI

Silver 56
50.4K downloads
219 likes
10B<n<100B

Description

🌐 Essential-Web: Complete 24-Trillion Token Dataset

🏆 Website | 🖥️ Code | 📖 Paper | ☁️ AWS

📋 Dataset Description

Essential-Web is a 24-trillion-token web dataset with document-level metadata designed for flexible dataset curation. For each of its 23.6 billion documents, the dataset provides metadata including subject matter classification, web page type, content complexity, and document quality scores. Researchers can filter and curate specialized datasets using…

See the full description on the dataset page: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0.
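The metadata-driven curation described above can be illustrated with a minimal sketch. The field names used here (`subject`, `page_type`, `quality_score`) and the toy records are hypothetical placeholders, not the dataset's actual schema or contents; the real column names are documented on the dataset page. In practice the records would likely be streamed, for example with the Hugging Face `datasets` library (`load_dataset(..., streaming=True)`), rather than held in memory.

```python
from typing import Any, Dict, Iterable, Iterator

Document = Dict[str, Any]


def filter_documents(
    docs: Iterable[Document],
    subject: str,
    min_quality: float,
) -> Iterator[Document]:
    """Yield documents whose metadata matches a subject and a quality floor.

    ASSUMPTION: each record carries a "metadata" dict with hypothetical
    keys "subject" and "quality_score". Adapt these to the real schema.
    """
    for doc in docs:
        meta = doc.get("metadata", {})
        if (
            meta.get("subject") == subject
            and meta.get("quality_score", 0.0) >= min_quality
        ):
            yield doc


# Toy records invented for illustration only; they are not taken from the dataset.
SAMPLE_DOCS = [
    {"id": "doc-1", "text": "Proof of a lemma...",
     "metadata": {"subject": "mathematics", "page_type": "article", "quality_score": 0.9}},
    {"id": "doc-2", "text": "Buy now!",
     "metadata": {"subject": "mathematics", "page_type": "listing", "quality_score": 0.2}},
    {"id": "doc-3", "text": "Cell biology notes...",
     "metadata": {"subject": "biology", "page_type": "article", "quality_score": 0.8}},
]

# Keep only mathematics documents at or above the quality threshold.
selected = list(filter_documents(SAMPLE_DOCS, subject="mathematics", min_quality=0.5))
selected_ids = [doc["id"] for doc in selected]
```

Because `filter_documents` is a generator over any iterable, the same predicate could be applied lazily to a streamed dataset without loading all 23.6 billion documents at once.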

Tags

license:odc-by | size_categories:10B<n<100B | arxiv:2506.14111 | region:us