Code, Commercial OK
essential-web-v1.0
by EssentialAI
50.4K downloads
219 likes
10B<n<100B
Description
Essential-Web: Complete 24-Trillion Token Dataset
Website | Code | Paper | AWS
Dataset Description
Essential-Web is a 24-trillion-token web dataset with document-level metadata designed for flexible dataset curation. The dataset provides metadata including subject matter classification, web page type, content complexity, and document quality scores for each of the 23.6 billion documents.
Researchers can filter and curate specialized datasets using… See the full description on the dataset page: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0.
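As a rough illustration of metadata-based curation, the sketch below filters an iterable of documents by subject and quality score. The field names `metadata`, `subject`, and `quality_score`, the `id` key, and the helper `curate` are hypothetical placeholders chosen for this example; the real column names are documented on the dataset page and will likely differ.

```python
from typing import Any, Dict, Iterable, Iterator

Doc = Dict[str, Any]


def curate(docs: Iterable[Doc], *, subject: str, min_quality: float) -> Iterator[Doc]:
    """Yield documents whose metadata matches a subject and a quality floor.

    ASSUMPTION: each document carries a "metadata" dict with "subject" and
    "quality_score" keys. These names are illustrative, not the real schema.
    """
    for doc in docs:
        meta = doc.get("metadata") or {}
        score = meta.get("quality_score")
        if score is None:
            continue  # skip documents without a quality score
        if meta.get("subject") == subject and score >= min_quality:
            yield doc


# Tiny in-memory stand-in for the real corpus (made-up records).
sample_docs = [
    {"id": 1, "text": "Proof of a lemma", "metadata": {"subject": "math", "quality_score": 0.9}},
    {"id": 2, "text": "Shopping page", "metadata": {"subject": "retail", "quality_score": 0.8}},
    {"id": 3, "text": "Low-quality math spam", "metadata": {"subject": "math", "quality_score": 0.1}},
    {"id": 4, "text": "No metadata at all"},
]

math_subset = list(curate(sample_docs, subject="math", min_quality=0.5))
```

Because `curate` is a generator, it processes one document at a time. That matters at this scale: the same function could in principle wrap a streamed iterator, for example one returned by the Hugging Face `datasets` library's `load_dataset(..., streaming=True)`, rather than a fully downloaded copy.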
Tags
license:odc-by
size_categories:10B<n<100B
arxiv:2506.14111
region:us