Text - General, RAG, Unknown
wikipedia-2023-11-embed-multilingual-v3
by CohereLabs
52.3K downloads
245 likes
Description
Multilingual Embeddings for Wikipedia in 300+ Languages
This dataset contains the 2023-11-01 dump of the wikimedia/wikipedia dataset, covering Wikipedia in all 300+ languages.
The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This makes it easy to search semantically across all of Wikipedia, or to use it as a knowledge source for your RAG application. In total, it contains close to 250M paragraphs / embeddings.
You… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/wikipedia-2023-11-embed-multilingual-v3.
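To make the semantic-search use case concrete, here is a minimal sketch of ranking paragraphs against a query by embedding similarity. It assumes the paragraphs and their embeddings have already been loaded into memory as (text, vector) pairs, and that the query has been embedded with the same model. The tiny 3-dimensional vectors, the `top_k` helper, and the choice of dot product as the score are illustrative assumptions, not taken from the dataset page.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_emb: Sequence[float],
    docs: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, text) pairs for the query.

    Hypothetical helper: scores every paragraph by dot product with the
    query embedding and keeps the k best.
    """
    scored = ((dot(query_emb, emb), text) for text, emb in docs)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy stand-ins for real paragraphs and embeddings (values are made up).
docs = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The mitochondrion produces ATP.", [0.0, 0.2, 0.9]),
    ("Berlin is the capital of Germany.", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.0, 0.0]  # stand-in for an embedded query

results = top_k(query, docs, k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

A brute-force scan like this is only practical for a small subset. At the scale of roughly 250M embeddings, an approximate nearest-neighbor index or a vector database would normally take its place.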
Tags
size_categories:100M<n<1B modality:text region:us