How We Cut RAG Latency by 50% with Voyage 3.5 Lite Embeddings
Learn how MyClone migrated from OpenAI text-embedding-3-small (1536d) to Voyage-3.5-lite (512d) to achieve 3× storage savings, 2× faster retrieval, and 15-20% reduction in voice latency—without sacrificing quality.
At MyClone.is, our mission is to build truly personal digital personas. We achieve this by creating a rich, interactive clone of a user's knowledge base, powered by Retrieval-Augmented Generation (RAG). We build a knowledge base for each user by encoding their uploaded documents, notes, and expertise into a vector database that powers their chat and voice assistants.
Digital Personas Need Fast, Trustworthy Retrieval :
Every time a user interacts with their persona via voice or chat, the system runs RAG over those embeddings to pinpoint the most relevant pieces of knowledge from their unique knowledge base—often in milliseconds—and deliver a response that sounds just like them. In this architecture, the embedding model is central: it determines how well the system understands user content, how much vector storage is required, and how quickly relevant information can be retrieved and ranked. In the end, latency is the enemy of natural conversation.
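To make that retrieval path concrete, here is a minimal sketch of the query-time step, using NumPy and a hypothetical `embed_fn` in place of the embedding API; in production the brute-force matrix product is replaced by an approximate-nearest-neighbor search inside the vector DB.

```python
import numpy as np

def retrieve_context(query: str, embed_fn, kb_vectors: np.ndarray,
                     kb_chunks: list[str], top_k: int = 5) -> list[str]:
    """Embed the query, rank the user's knowledge chunks by cosine similarity,
    and return the top-k chunks used to ground the persona's answer."""
    q = np.asarray(embed_fn(query), dtype=np.float32)
    q /= np.linalg.norm(q)
    # kb_vectors is pre-normalized at index time, so a dot product equals cosine similarity.
    scores = kb_vectors @ q
    top = np.argsort(-scores)[:top_k]
    return [kb_chunks[i] for i in top]
```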
Previously, MyClone used OpenAI’s text-embedding-3-small, which produces 1536‑dimensional float vectors optimized for general-purpose semantic similarity. This model is known for strong quality across common retrieval benchmarks at a relatively low price point, but its default 1536‑dim size implies higher storage and bandwidth than lower‑dim alternatives.
In high‑throughput RAG systems, 1536‑dim vectors increase memory footprint, disk usage, and I/O per query, which can become a bottleneck for both latency and cost as the number of users and knowledge items grows.
We recently identified this bottleneck in our RAG pipeline and took a bold step: we replaced OpenAI's text-embedding-3-small (1536 dims) with Voyage-3.5-lite (512 dims). The switch cuts storage and latency substantially while maintaining, and often improving, retrieval quality for each user's persona. This kind of infrastructure change translates directly into faster, cheaper, and more natural-feeling AI assistants for our users.
Let's dive deeper.
Why 512-dim Voyage-3.5-lite Can Match or Surpass 1536-dim OpenAI :
On the surface, going from 1536 dimensions down to 512 seems like a compromise. Fewer dimensions should mean less information and poorer retrieval quality. However, the landscape of embedding models is evolving rapidly, driven by innovations like Matryoshka Representation Learning (MRL), which Voyage AI utilizes.
Voyage‑3.5‑lite leverages Matryoshka training and quantization‑aware techniques so that the first 256 or 512 dimensions capture the majority of the semantic signal instead of being a naive truncation of a larger vector. Public benchmarks and vendor claims indicate that Voyage‑3.5‑lite at reduced dimensions maintains retrieval performance very close to full‑dimension variants and competitive with leading commercial models.
By contrast, OpenAI's text-embedding-3-small is built around a 1536-dim default; the API does allow shorter outputs via its dimensions parameter, but quality at reduced dimensions still has to be validated per domain rather than being the model's primary design target. That makes Voyage-3.5-lite, which is optimized for low-dimensional retrieval from the start, more attractive for applications where vector cost and latency are critical but quality cannot be sacrificed.
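As an illustration, here is roughly how 512-dim document embeddings can be requested from voyage-3.5-lite with the official Python client; treat the exact argument names (in particular `output_dimension`) as assumptions to verify against the current Voyage AI API reference.

```python
# pip install voyageai
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

chunks = [
    "Meeting notes: Q3 roadmap focuses on reducing voice latency.",
    "FAQ: How do I upload documents to my persona?",
]

result = vo.embed(
    chunks,
    model="voyage-3.5-lite",
    input_type="document",   # use "query" when embedding user questions
    output_dimension=512,    # Matryoshka-trained, so 512 dims retain most of the signal
)
vectors = result.embeddings  # list of 512-element float lists, ready for the vector DB
```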
Quantitative Impact at MyClone :
1. Vector Database Efficiency: Saving Space & Money
The most immediate gain was in our storage layer. By reducing the dimensionality from 1536 to 512, we achieved a ~66% reduction in the storage footprint required for our entire user knowledge base in the Vector DB.
- Impact: This translates directly to lower infrastructure costs and a smaller overall system footprint, allowing us to scale more efficiently for our growing user base.
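The arithmetic behind that number is simple; here is a back-of-envelope sketch with a hypothetical chunk count, counting raw float32 vectors only and ignoring index and metadata overhead.

```python
BYTES_PER_FLOAT32 = 4

def raw_vector_storage_gb(num_chunks: int, dims: int) -> float:
    """Raw float32 vector storage, excluding index and metadata overhead."""
    return num_chunks * dims * BYTES_PER_FLOAT32 / 1e9

chunks = 10_000_000  # hypothetical total knowledge chunks across all personas
print(f"1536-dim: {raw_vector_storage_gb(chunks, 1536):.1f} GB")  # ~61.4 GB
print(f" 512-dim: {raw_vector_storage_gb(chunks, 512):.1f} GB")   # ~20.5 GB (~66% less)
```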
2. Retrieval Speed: Unlocking RAG Performance
Vector databases rely on calculating the similarity (usually cosine similarity) between the query vector and millions of stored document vectors. The computational cost of this search is heavily dependent on the vector size.
- Faster Calculation: With vectors that are $\frac{512}{1536} \approx 1/3$ the size, the core mathematical operations in the search index become significantly faster.
- Lighter Payloads: Moving smaller vectors across the network from the Vector DB to the RAG service also reduces latency.
This optimization resulted in retrieval latency being slashed by 50% (2x faster).
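The effect is easy to reproduce with a brute-force NumPy benchmark on synthetic data; the absolute numbers are illustrative only (a production vector DB uses an ANN index rather than a full scan), but the cost clearly scales with dimensionality.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_vectors = 100_000  # hypothetical corpus size

for dims in (1536, 512):
    db = rng.standard_normal((n_vectors, dims), dtype=np.float32)
    db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize once at index time
    q = rng.standard_normal(dims).astype(np.float32)
    q /= np.linalg.norm(q)

    start = time.perf_counter()
    scores = db @ q                           # cosine similarity via dot product
    top10 = np.argpartition(-scores, 10)[:10]  # shortlist the 10 best matches
    print(f"{dims:>4}-dim full scan: {(time.perf_counter() - start) * 1000:.1f} ms")
```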
3. The User Experience Win: Natural Conversation
For a Digital Persona designed for voice interaction, every millisecond counts. A long pause after a user asks a question breaks the illusion of a real conversation.
The massive reduction in retrieval latency directly fed into our overall system speed:
- End-to-End Voice Latency: We saw a 15% to 20% reduction in the total time from when a user finishes speaking to when the persona begins its response.
- First Token Latency: Crucially, the initial response time for both chat and voice interfaces dropped by 15%. This metric is vital, as it dictates how quickly the user sees or hears that the system is processing their request.
Here is a side-by-side comparison:
| Feature | OpenAI text-embedding-3-small | Voyage-3.5-lite (512d float) |
|---|---|---|
| Default dimensions | 1536 | 1024 (supports 256/512/1024/2048) |
| Dimensions used at MyClone | 1536 | 512 |
| Vector size (at dims used) | Baseline (1536 × float32) | ~3× smaller (512 × float32) |
| Retrieval quality | Strong general-purpose | Competitive / improved on retrieval |
| Storage cost | High (per vector) | ~3× lower at same precision |
| Vector DB latency | Baseline | 2–2.5× faster at MyClone |
| E2E voice latency impact | Baseline | 15–20% reduction at MyClone |
| First-token latency | Baseline | ~15% faster at MyClone |
| Dimensional flexibility | 1536 default (shorter via API dimensions parameter) | Flexible via Matryoshka (256–2048) |
Why This Matters for Digital Personas :
For a digital persona platform, user satisfaction is tightly linked to how responsive and on‑point the assistant feels in both chat and voice. Lower vector dimensions reduce tail latency for retrieval, which directly shortens the time to first token and makes voice conversations feel more natural and less “robotic pause” heavy.
At the same time, users expect the persona to recall their uploaded knowledge accurately, which means any optimization that saves cost must not degrade retrieval quality or introduce hallucinations. Voyage‑3.5‑lite’s retrieval‑focused design allows MyClone to hit this balance: high‑fidelity grounding with a much lighter retrieval stack.
Product-Level Benefits for MyClone :
From a product and business perspective, the embedding migration unlocks several advantages:
- Better UX at scale: Faster responses improve perceived intelligence and trust, especially in voice interactions where humans are highly sensitive to delay.
- Lower infra cost per persona: 3× storage savings and faster queries mean cheaper vector DB and compute, allowing MyClone to host more user knowledge for the same budget.
- Headroom for richer features: Freed-up latency and cost can be reinvested into deeper RAG pipelines, more reranking, or multi-step reasoning without exceeding user latency budgets.
- Future flexibility: Voyage-3.5-lite supports multiple dimensions and quantization schemes (e.g., int8, binary), opening the door to further optimizations like ultra-cheap archival memory or hybrid binary-plus-float retrieval strategies (a minimal sketch of that idea follows below).
For MyClone, these gains compound: each user’s digital persona can reference more documents, answer faster, and operate more cheaply—while staying faithful to the user’s own voice, style, and knowledge.
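For instance, one pattern the quantization support enables (shown here only as a sketch, not something MyClone has shipped) is to keep a 1-bit-per-dimension copy of each vector for a cheap first pass and rerank the shortlist with the full-precision floats.

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Quantize float embeddings to 1 bit per dimension (sign) and pack into bytes.
    A 512-dim float32 vector (2,048 bytes) shrinks to 64 bytes."""
    return np.packbits(vectors > 0, axis=-1)

def hamming_distances(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    """Lower Hamming distance ≈ higher similarity; cheap enough for first-pass
    candidate selection before reranking the shortlist with float vectors."""
    return np.unpackbits(np.bitwise_xor(db_bits, query_bits), axis=-1).sum(axis=-1)
```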
Strategic Takeaways :
The shift from OpenAI’s 1536‑dim embeddings to Voyage‑3.5‑lite 512‑dim embeddings shows how embedding choice is a product decision, not just an infra detail. By aligning the embedding model with the needs of high‑scale RAG—fast, cost‑efficient retrieval with strong semantic quality—MyClone improved both user experience and unit economics in one move.
As RAG systems mature, embedding models like Voyage‑3.5‑lite that are explicitly optimized for flexible dimensions, quantization, and retrieval quality will increasingly become the default for latency‑sensitive, knowledge‑heavy products like digital personas.