Scenario 5: Text Embeddings Benchmarks in Generative AI

The text embedding scenario mimics embedding generation as part of the data ingestion pipeline of a vector database.

The text embedding scenario applies only to embedding models. In this scenario, all requests are the same size: 96 documents, each with 512 tokens. An example is a collection of large PDF files, each containing 30,000+ words, that a user wants to ingest into a vector database.
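As a rough illustration of how such a workload could be shaped, the sketch below chunks a large document into 512-token pieces and groups them into requests of 96 documents each. It is a minimal sketch, not the published benchmark harness: the whitespace split is only an approximation of tokenization, and a real pipeline would use the model's own tokenizer and the SDK call for your deployment.

```python
# Sketch: shaping documents into the request size used by this scenario
# (96 documents of 512 tokens per request). The whitespace "tokenizer"
# is an approximation; substitute the embedding model's real tokenizer.

MAX_DOCS_PER_REQUEST = 96   # documents per embedding request
MAX_TOKENS_PER_DOC = 512    # tokens per document

def chunk_document(text: str) -> list[str]:
    """Split one large document (for example, a 30,000+ word PDF)
    into chunks of at most 512 whitespace-delimited words."""
    words = text.split()
    return [
        " ".join(words[i:i + MAX_TOKENS_PER_DOC])
        for i in range(0, len(words), MAX_TOKENS_PER_DOC)
    ]

def batch_chunks(chunks: list[str]) -> list[list[str]]:
    """Group chunks into requests of at most 96 documents each."""
    return [
        chunks[i:i + MAX_DOCS_PER_REQUEST]
        for i in range(0, len(chunks), MAX_DOCS_PER_REQUEST)
    ]
```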

Review the terms used in the dedicated AI cluster hosting benchmarks. For a list of scenarios and their descriptions, see Text Embedding Scenarios. The text embedding benchmarks were run in the following region.

US Midwest (Chicago)

Model: cohere.embed-english-v3.0 hosted on one Embed Cohere unit of a dedicated AI cluster
Concurrency    Request-level latency (seconds)    Request-level throughput (RPM)
1              2.53                               24
8              4.35                               108
32             14.93                              120
128            47.66                              150
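The two measured columns are related: in steady state, request-level throughput is bounded by roughly concurrency × 60 / mean request latency. The snippet below checks that relation against the first table; this is an observation about the numbers, not part of the published methodology.

```python
# Sanity check: estimated RPM = concurrency * 60 / request latency.
# Values are copied from the cohere.embed-english-v3.0 table above.
for concurrency, latency_s, reported_rpm in [
    (1, 2.53, 24), (8, 4.35, 108), (32, 14.93, 120), (128, 47.66, 150),
]:
    estimate = concurrency * 60 / latency_s
    print(f"concurrency {concurrency:>3}: ~{estimate:.0f} RPM "
          f"(reported {reported_rpm})")
```

At concurrency 1 the estimate (~24 RPM) matches the table exactly; at higher concurrency the reported throughput falls somewhat below the estimate (for example, ~161 versus 150 RPM at concurrency 128), which is consistent with some queueing or scheduling overhead, though the source does not state the cause.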
Model: cohere.embed-english-light-v3.0 hosted on one Embed Cohere unit of a dedicated AI cluster
Concurrency    Request-level latency (seconds)    Request-level throughput (RPM)
1              1.75                               30
8              3.93                               108
32             14.44                              113
128            48.00                              120
Model: cohere.embed-multilingual-v3.0 hosted on one Embed Cohere unit of a dedicated AI cluster
Concurrency    Request-level latency (seconds)    Request-level throughput (RPM)
1              2.25                               24
8              4.33                               120
32             14.94                              144
128            49.21                              198
Model: cohere.embed-multilingual-light-v3.0 hosted on one Embed Cohere unit of a dedicated AI cluster
Concurrency    Request-level latency (seconds)    Request-level throughput (RPM)
1              1.69                               42
8              3.80                               118
32             14.26                              126
128            37.17                              138
Model: cohere.embed-english-v2.0 hosted on one Embed Cohere unit of a dedicated AI cluster
Concurrency    Request-level latency (seconds)    Request-level throughput (RPM)
1              2.51                               18
8              4.31                               84
32             15                                 132
128            49.15                              150
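For readers who want to run this style of measurement against their own endpoint, the sketch below shows one common pattern: N worker threads each issue identical requests in a closed loop, and request-level latency and throughput are computed from wall-clock timings. The send_embedding_request function is a placeholder for whatever client call your deployment exposes; the exact harness behind the published numbers is not specified in this document.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_embedding_request() -> None:
    """Placeholder for a real client call that embeds 96 documents
    of 512 tokens each; replace with your SDK invocation."""
    time.sleep(2.5)  # stand-in for network plus model time

def benchmark(concurrency: int, requests_per_worker: int = 5):
    latencies: list[float] = []

    def worker() -> None:
        # Closed loop: each worker sends its next request as soon as
        # the previous one completes.
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            send_embedding_request()
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(concurrency):
            pool.submit(worker)
    wall = time.perf_counter() - wall_start

    total_requests = concurrency * requests_per_worker
    mean_latency = sum(latencies) / len(latencies)
    rpm = total_requests / wall * 60
    return mean_latency, rpm

if __name__ == "__main__":
    for c in (1, 8, 32, 128):
        latency, rpm = benchmark(c)
        print(f"concurrency {c:>3}: {latency:.2f} s mean latency, "
              f"{rpm:.0f} RPM")
```

With the 2.5-second placeholder, concurrency 1 yields roughly 24 RPM, similar in shape to the tables above, but real results depend entirely on the deployed model, request size, and cluster capacity.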