Think about your favorite apps – the ones that deliver instant results from massive amounts of data. They're likely powered by vector search, the same technology that fuels generative AI.
Vector search is crucial for developers who need to build applications that are lightning-fast, handle massive datasets, and remain cost-effective, even with huge spikes in traffic. But building and deploying this technology can be a real challenge, especially for gen AI applications that demand incredible flexibility, scale, and speed. In a previous blog post, we showed you how to create production-ready AI applications with features like easy filtering, automatic scaling, and seamless updates.
Today, we'll share how Vertex AI’s vector search is tackling these challenges head-on. We'll explore real-world performance benchmarks demonstrating incredible speed and scalability – all in a cost-effective way.
How does Vertex AI vector search work?
Imagine you own a popular online store: to keep shoppers happy, your search engine needs to instantly sift through millions of products and deliver relevant results, even during peak shopping seasons. Vector search is a technique for finding similar items within massive datasets. It works by converting data, like text or images, into numerical representations called embeddings. These embeddings capture the semantic meaning of the data, allowing for more accurate and relevant search results.
For example, imagine your customers are searching for a "navy blue dress shirt." A keyword search might miss products labeled "midnight blue button-down," even though they're essentially the same. Vector search does a better job of surfacing the right products because it uses embeddings to understand the relationships between words and concepts.
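To make that concrete, here is a minimal sketch of the core mechanic: each catalog item and the query are turned into vectors, and results are ranked by similarity. The tiny hand-written three-dimensional vectors below are illustrative stand-ins for the output of a real embedding model (such as the Vertex AI text-embeddings API), not values from any actual model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a real model would produce vectors
# with hundreds of dimensions. Semantically similar items sit close together.
catalog = {
    "navy blue dress shirt":     np.array([0.90, 0.40, 0.10]),
    "midnight blue button-down": np.array([0.88, 0.43, 0.12]),
    "red cotton t-shirt":        np.array([0.20, 0.95, 0.05]),
    "stainless steel watch":     np.array([0.05, 0.10, 0.99]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, near 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed the query with the same (hypothetical) model, then rank the catalog.
# This brute-force scan is what an ANN index approximates efficiently at
# billion-vector scale.
query = np.array([0.89, 0.41, 0.11])  # embedding of "navy blue dress shirt"
for name, vec in sorted(catalog.items(), key=lambda kv: cosine(query, kv[1]), reverse=True):
    print(f"{cosine(query, vec):.3f}  {name}")
```

In this toy example, the "midnight blue button-down" ranks right behind the exact match because its vector is nearby, which is exactly the kind of result a keyword search would miss.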
A smooth, crisp, and responsive semantic search experience is a must-have for e-commerce, media, and other consumer-facing web services, and it is only possible with highly performant vector search. See this blog post for details on the Infinite Nature demo, which offers a glimpse into the future of how we'll interact with information.
You can use it for a wide range of applications, like the e-commerce example shared above, or as a retrieval-augmented generation (RAG) system for generative AI agents, where it grounds responses in your data or recommendation systems that deliver personalized suggestions based on user preferences.
As Xun Wang, Chief Technology Officer of Bloomreach, recently said, "Bloomreach has made the strategic decision to replace OpenAI with Google Vertex AI Embeddings and Vertex AI vector search. Google's platform delivers clear advantages in performance, scalability, reliability and cost optimization. We're confident this move will drive significant benefits and we're thrilled to embark on this new partnership."
Real-world impact of Vertex AI’s vector search
Our customers are achieving remarkable results with vector search. Here are four standout ways this technology is helping them build high-performance gen AI apps.
#1: The fastest vector search for highly responsive applications
To meet customer expectations, fast response times are critical across search, recommendation systems, and gen AI applications. Studies have consistently found that faster response times directly contribute to an increase in revenue, conversion, and retention.
Vector search is engineered for incredibly low latency at high quality, while maintaining cost-effectiveness. In our testing, vector search was able to maintain ultra-low latency (9.6ms at P95) and high recall (0.99) while scaling up to 5K queries per second on a dataset of one billion vectors. By achieving such low latencies, Vertex AI vector search ensures that users receive fast, relevant responses, no matter how large the dataset or how many parallel requests hit the system.
As Yuri M. Brovman from eBay wrote in a recent blog post, “[eBay’s vector search] hit a real-time read latency of less than 4ms at 95%, as measured server-side on the Google Cloud dashboard for vector search.”
#2: Massively scalable for all application sizes
Another important consideration for production-ready applications is their ability to support growing data sizes and user bases.
Vertex AI vector search easily accommodates sudden spikes in demand, making it massively scalable for applications of any size: it can scale to support billions of embeddings and hundreds of thousands of queries per second while maintaining ultra-low latency.
#3: Up to 4X more cost effective
Vertex AI vector search not only maintains performance at scale, it is also up to 4x more cost-effective than competing solutions, especially for high-performance applications. With Vertex AI vector search’s ANN index, you need significantly less compute to deliver fast, relevant results at scale.
| Dataset | QPS | Recall | Latency (P95) |
| --- | --- | --- | --- |
| Glove 1M / 100 dim | 44,876 | 0.96 | 3 ms |
| OpenAI 5M / 1536 dim | 2,981 | 0.96 | 9 ms |
| Cohere 10M / 768 dim | 3,144 | 0.96 | 7 ms |
| LAION 100M / 768 dim | 2,997 | 0.96 | 9 ms |
| BigANN 10M / 128 dim | 33,921 | 0.97 | 3.5 ms |
| BigANN 100M / 128 dim | 9,871 | 0.97 | 7.2 ms |
| BigANN 1B / 128 dim | 4,967 | 0.99 | 9.6 ms |
Real-world benchmarks of Vertex AI vector search on public datasets, using 2 replicas of n2d machines. Latency was measured at the listed QPS; vector search can scale beyond this throughput by adding replicas.
#4: It’s highly configurable for all application types
In some scenarios, developers might want to trade off latency for higher recall (or vice versa). For example, an e-commerce website might prioritize speed for quick product suggestions, while a research database might prioritize comprehensive results even if they take slightly longer to return. Vector search lets you tune these parameters to achieve higher recall or lower latency, matching your business needs.
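As an illustrative sketch, the tree-AH index options exposed by the Vertex AI SDK for Python include knobs such as approximate_neighbors_count and leaf_nodes_to_search_percent: scanning a larger fraction of leaf nodes generally raises recall at the cost of latency, and vice versa. The project, bucket path, and parameter values below are placeholders to adapt to your own setup; check the current SDK documentation for exact names and defaults.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region

# Sketch: create a tree-AH index with explicit recall/latency knobs.
# A higher leaf_nodes_to_search_percent scans more candidates per query,
# which typically raises recall but also raises latency.
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="products-index",
    contents_delta_uri="gs://my-bucket/embeddings/",  # placeholder GCS path
    dimensions=768,
    approximate_neighbors_count=150,
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)
```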
Additionally, vector search supports auto-scaling – and when load on the deployment increases, it scales to maintain performance. We measured auto-scaling and found that vector search was able to maintain consistent latency with high recall, as QPS increased from 1K to 5K.
Developers can also increase the number of replicas in order to handle higher throughput, as well as pick different machine types to balance cost and performance. This flexibility makes vector search suitable for a wide range of applications beyond semantic search, including recommendation systems, chatbots, multimodal search, anomaly detection, and image similarity matching.
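Here is a minimal deployment sketch along those lines, assuming the index and aiplatform.init() call from the previous sketch and using placeholder names: min_replica_count sets the always-on baseline, max_replica_count caps auto-scaling, and machine_type trades cost against per-replica throughput.

```python
# Sketch: deploy the index behind an endpoint with an autoscaling replica range.
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="products-endpoint",
    public_endpoint_enabled=True,
)
endpoint.deploy_index(
    index=index,                     # the MatchingEngineIndex created in the sketch above
    deployed_index_id="products_deployed",
    machine_type="n2d-standard-32",  # placeholder; choose per cost/performance needs
    min_replica_count=2,             # baseline capacity
    max_replica_count=10,            # ceiling for auto-scaling during traffic spikes
)
```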
Going further with hybrid search
Dense embedding-based semantic search, while excellent at understanding meaning and context, has a weak point: it cannot find items that the embedding model can't make sense of. Items like product numbers, a company's internal codenames, or newly coined terms aren't found by semantic search, because the embedding model doesn't understand their meanings.
With Vertex AI vector search's hybrid search, building this type of sophisticated search engine is no longer a daunting task. Developers can easily create a single index that incorporates both dense and sparse embeddings, representing semantic meaning and keyword relevance respectively. This streamlined approach allows for rapid development and deployment of high-performance search applications, fully customized to meet specific business needs.
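At the time of writing, the Vertex AI SDK for Python exposes a HybridQuery type for this; the sketch below shows the shape of such a query, with the dense vector carrying semantic meaning and the sparse vector carrying keyword weights. Treat the field names, the rrf_ranking_alpha blending parameter, and all values as assumptions to verify against the current documentation.

```python
from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    HybridQuery,
)

# Sketch of a hybrid query: the dense vector captures semantic meaning, the
# sparse vector captures exact keywords (e.g., a product number), and
# rrf_ranking_alpha blends the two rankings (1.0 = dense only, 0.0 = sparse only).
query = HybridQuery(
    dense_embedding=[0.12, 0.34, 0.56],       # from a text-embedding model
    sparse_embedding_dimensions=[101, 2048],  # token/feature ids from a sparse encoder
    sparse_embedding_values=[0.8, 0.5],       # e.g., TF-IDF-style weights
    rrf_ranking_alpha=0.5,
)

# `endpoint` is a deployed hybrid index endpoint, as in the deployment sketch above.
neighbors = endpoint.find_neighbors(
    deployed_index_id="products_deployed",
    queries=[query],
    num_neighbors=10,
)
```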
As Nicolas Presta, Sr. Engineering Manager at Mercado Libre, wrote, “Most of our successful sales start with a search, so it is important that we give precise results that best match a user's query. These complex searches are getting better with the addition of the items retrieved from vector search, which will ultimately increase our conversion rate. Hybrid search will unlock more opportunities to uplevel our search engine so that we can create the best customer experience while improving our bottom line."