June 1, 2026

Scaling Vector Databases: Pinecone vs Milvus

Vector database selection and scaling decisions account for a disproportionate share of RAG pipeline failures in production. Every serious semantic search system, recommendation engine, or enterprise RAG pipeline eventually hits the memory wall, the compute bottleneck, or the ingestion backlog. When it does, the solution is almost never "add more resources." It is re-architecting the indexing strategy, partitioning scheme, and query path.

Based on Seven Labs' production AI deployments, the teams that scale vector databases successfully do three things right from the start: they choose the right database for their operational model, they implement quantization before hitting memory limits, and they partition by access pattern rather than by data size. The teams that fail skip one of those steps.

This guide covers the architecture, the hard numbers, and the production decisions we have made scaling client systems from 5 million to 500 million vectors.

Why Is Scaling Vector Search Fundamentally Different from Scaling Relational Databases?

Vector databases perform Approximate Nearest Neighbor (ANN) search over high-dimensional spaces, which is memory-intensive and computationally expensive in ways that B-tree or hash indexing is not.

A relational database finds an exact match in O(log n) or O(1) time. A vector database computes the cosine or L2 distance between your query vector and potentially millions of candidate vectors. The most common indexing algorithm, HNSW (Hierarchical Navigable Small World), requires the entire index to live in RAM for sub-millisecond latency. At 100 million vectors with 1536 dimensions (OpenAI's text-embedding-ada-002), raw storage is approximately 600GB. Add HNSW graph overhead, and you need over 1TB of RAM. When the index overflows to disk, latency spikes from sub-millisecond to hundreds of milliseconds. This is the memory wall, and every scaling strategy is a response to it.
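The memory figures above can be reproduced with simple arithmetic. The sketch below assumes FP32 storage (4 bytes per value) and applies the 1.5x-2x HNSW overhead multiplier quoted later in this guide; the function names are illustrative, not part of any library.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the raw vectors alone (FP32 by default)."""
    return num_vectors * dim * bytes_per_value


def hnsw_ram_range_bytes(raw_bytes: int, low: float = 1.5, high: float = 2.0):
    """Apply an assumed 1.5x-2x HNSW multiplier to the raw vector size."""
    return raw_bytes * low, raw_bytes * high


raw = raw_vector_bytes(100_000_000, 1536)
low, high = hnsw_ram_range_bytes(raw)
print(f"raw vectors: {raw / 1e9:.1f} GB")
print(f"with HNSW:   {low / 1e12:.2f}-{high / 1e12:.2f} TB")
```

Running the same function with your own vector count and embedding dimension gives a quick first estimate of when a single node stops being enough.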

The compute problem compounds at query volume. Computing dot products or L2 distances across millions of high-dimensional vectors at thousands of queries per second saturates CPU cores even with SIMD vectorization. Scaling the vector tier requires addressing both memory and compute constraints simultaneously.

Which Vector Database Should You Choose: Pinecone, Weaviate, Qdrant, or pgvector?

The right choice depends on your operational model, data volume, and team's infrastructure capacity. The differences between these databases in production are significant and consequential.

Based on Seven Labs' 50+ AI engagements, here is the comprehensive comparison:

Dimension	Pinecone	Weaviate	Qdrant	pgvector
Deployment Model	Fully managed SaaS	Self-hosted or managed	Self-hosted or managed	Postgres extension
Architecture	Serverless (compute/storage decoupled)	Distributed microservices	Single binary (Rust)	Postgres WAL-based
Index Algorithm	Proprietary (HNSW-based)	HNSW	HNSW	HNSW or ivfflat
Hybrid Search	Sparse-dense native	BM25 + vector native	Sparse + dense native	Limited (FTS + vector)
Metadata Filtering	Pre-filter (metadata indexes)	GraphQL + WHERE clause	Payload-based pre-filter	SQL WHERE (post-filter)
Max Vectors (practical)	Hundreds of millions	Billions (distributed)	Hundreds of millions	Tens of millions
Query Latency (p99)	20ms-80ms (serverless)	5ms-30ms (self-hosted)	2ms-15ms (self-hosted)	10ms-100ms (PG config dependent)
Multi-tenancy	Namespaces	Multi-tenancy API	Collections + payload filtering	Schema-based isolation
Operational Overhead	Near zero	High (Kubernetes required)	Medium (single binary)	Low (if Postgres already exists)
Cost at 10M vectors	~$70-200/month	Infrastructure cost	Infrastructure cost	Postgres instance cost
Cost at 100M vectors	~$700-2,000/month	Scales with hardware	Scales with hardware	Not recommended at this scale
Quantization Support	Managed internally	PQ + scalar quantization	Scalar + product quantization	Limited
Embedding Model Integration	External (bring your own)	Built-in vectorization modules	External (bring your own)	External (pgembedding)
Language SDKs	Python, JS, Go, Java	Python, JS, Go, Java, .NET	Python, JS, Go, Rust, Java	Any (SQL interface)
Best For	Small teams, fast shipping, early scale	Large scale with full infrastructure control	Latency-critical, performance-tuned deployments	Existing Postgres shops under 10M vectors
Avoid When	Budget-sensitive at 100M+ vectors	Small team without dedicated DevOps	Team is unfamiliar with self-hosting	Scaling past 20M vectors

"The vector database decision is not a technology decision. It is an operational decision. Teams that choose Weaviate or Qdrant without the infrastructure engineering capacity to run them reliably end up with operational problems that dwarf whatever they saved on API costs." -- Douwe Kiela, CEO, Contextual AI [Source: Industry]

How Do Vector Indexing Algorithms Determine Your Scaling Ceiling?

The indexing algorithm sets the hard limits on your scaling trajectory. Understanding the options is not academic -- it directly determines which optimization levers you have available when performance degrades.

Flat indexes compute distance to every vector in the database. Recall is 100%, but time complexity is O(n). At one million vectors, flat search is slow. At one billion, it is unusable. Flat indexes belong in test environments and benchmarking rigs.

IVF (Inverted File Index) partitions the vector space into Voronoi cells. Each vector is assigned to the nearest centroid during ingestion. During a query, only the cells nearest the query vector are searched, which dramatically reduces the search space. Tuning the number of cells (nlist) and the number of cells probed per query (nprobe) controls the recall-latency trade-off. IVF works well when memory is constrained because it stores only the centroids and the per-cell vector lists, without the per-vector graph links that HNSW maintains.
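To make the nlist and nprobe trade-off concrete, here is a toy IVF index in pure Python. It is a sketch under simplifying assumptions: centroids are sampled from the data rather than trained with k-means, distances are exact L2, and the function names are illustrative. Production engines implement the same idea with trained centroids and optimized distance kernels.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, nlist, seed=0):
    """Assign each vector to its nearest centroid (one inverted list per cell)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)  # simplification: no k-means training
    cells = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: l2(vec, centroids[c]))
        cells[nearest].append(idx)
    return centroids, cells


def search_ivf(query, vectors, centroids, cells, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    top_k = sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]
    return top_k, len(candidates)


rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
centroids, cells = build_ivf(vectors, nlist=32)
query = [rng.random() for _ in range(8)]

for nprobe in (1, 4, 32):
    ids, scanned = search_ivf(query, vectors, centroids, cells, nprobe, k=5)
    print(f"nprobe={nprobe:>2} scanned {scanned} of {len(vectors)} vectors")
```

Setting nprobe equal to nlist scans every cell and is equivalent to a flat search; smaller values scan fewer vectors at the cost of possibly missing neighbors that fall in unprobed cells.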

HNSW builds a multi-layered graph where higher layers act as routing shortcuts and the bottom layer holds all vectors. Query latency is excellent and recall is high without parameter tuning. The cost is memory: HNSW graph structure requires 1.5x-2x the storage of raw vectors. If you are running HNSW at 50 million vectors and seeing memory pressure, the next step is IVF_SQ8, not adding RAM.

The practical scaling progression: start with HNSW for small to medium collections where latency is critical. Move to IVF_SQ8 when memory pressure builds. Consider product quantization (PQ) for extreme scale, where a larger recall loss is acceptable in exchange for roughly a 10x memory reduction.

How Do You Implement Quantization Before Hitting the Memory Wall?

Implement quantization before you need it, not after you are already paging to disk. At 10 million vectors, performance often looks fine. At 50 million, memory pressure is already building. At 100 million with HNSW and FP32 vectors, you are at the wall.

Scalar quantization (SQ) converts 32-bit floats to 8-bit integers, reducing memory by 4x. Product quantization (PQ) splits vectors into sub-vectors and clusters them, reducing memory by 8x-16x with greater recall loss. IVF_SQ8 (IVF with scalar quantization to 8-bit) is the practical starting point for most production systems that need to scale past 50 million vectors.

python

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

# Connect first; host and port are placeholders for your Milvus deployment.
connections.connect(alias="default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536)
]
schema = CollectionSchema(fields=fields, description="Scaled RAG collection")
collection = Collection(name="enterprise_rag", schema=schema)

# IVF_SQ8: IVF partitioning with 8-bit scalar quantization.
# nlist is the number of cells (clusters) built at index time.
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_SQ8",
    "params": {"nlist": 4096}
}

collection.create_index(
    field_name="embedding",
    index_params=index_params
)

In Seven Labs' scaling work, moving a client's RAG collection from HNSW (FP32) to IVF_SQ8 reduced memory footprint by 75%. Recall dropped from 99% to 96%. In a RAG pipeline where the LLM synthesizes the final answer from retrieved context, a 3-point recall drop is typically invisible to end users. The LLM tolerates slight noise in retrieved context, which makes the infrastructure savings worthwhile.
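The 4x reduction from scalar quantization corresponds to the 75% memory saving, and a few lines of standard-library Python show why. This is a minimal sketch, not how Milvus implements SQ8 internally: it assumes a fixed value range of [-1, 1], whereas real systems learn the range from the data, and the function names are illustrative.

```python
from array import array


def sq8_encode(vec, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255."""
    scale = (hi - lo) / 255.0
    return array("B", (min(255, max(0, round((x - lo) / scale))) for x in vec))


def sq8_decode(codes, lo, hi):
    """Approximately reconstruct the original floats from the 8-bit codes."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]


vec = array("f", [-0.82, -0.10, 0.0, 0.37, 0.95])  # FP32 values
lo, hi = -1.0, 1.0  # assumed value range; real systems learn it from the data

codes = sq8_encode(vec, lo, hi)
restored = sq8_decode(codes, lo, hi)

fp32_bytes = len(vec) * vec.itemsize
sq8_bytes = len(codes) * codes.itemsize
max_err = max(abs(a - b) for a, b in zip(vec, restored))
print(f"{fp32_bytes} bytes -> {sq8_bytes} bytes, max error {max_err:.4f}")
```

Each value shrinks from four bytes to one, and the reconstruction error is bounded by half a quantization step. That small per-value error is the source of the modest recall loss.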

How Does Intelligent Partitioning Reduce Query Scope at Scale?

Partition your data by access pattern, not by data size. Querying a single partition instead of the entire collection reduces effective search scope by 90%+ in multi-tenant or time-series workloads.

Most queries have natural constraints: "search only this customer's documents," "search only records from Q3 2025," "search only content tagged with this product line." Without partitioning, every query scans the full collection. With partitioning, the query scope drops to only the relevant partition. This bypasses the memory wall for tenant-specific workloads by making the effective search space much smaller.

python

# Create a partition per tenant
collection.create_partition("tenant_1042")

# Insert into a specific partition
# (`entities` is your prepared insert payload matching the collection schema)
collection.insert(data=entities, partition_name="tenant_1042")

# The collection must be loaded into memory before it can be searched
collection.load()

# Query only the relevant partition
# (`query_vector` is the 1536-dimensional embedding of the user query)
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 128}},
    limit=10,
    partition_names=["tenant_1042"]
)

For multi-tenant SaaS products, tenant-level partitioning also enforces data isolation at the database layer. Tenant A's documents cannot appear in Tenant B's search results by architecture, not just by application-level filtering.

Pre-filtering vs. post-filtering is the second critical partitioning decision. Post-filtering runs vector search first and then applies metadata filters. If you retrieve 100 nearest neighbors and filter out 99 by timestamp, you return one result. Pre-filtering applies the metadata constraint before vector search, so you only compute distances within the valid document set. Always use pre-filtering. Pinecone and Qdrant implement native pre-filtering. Milvus uses bitset mechanisms to apply filters before distance calculation. pgvector performs post-filtering by default, which is one reason it struggles at scale.
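The difference is easy to see on synthetic data. The sketch below uses exact brute-force search as a stand-in for ANN, a hypothetical `year` metadata field, and illustrative function names; only 2% of the documents match the filter.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_filter_search(query, docs, predicate, k):
    """Take the top-k neighbors first, then discard those failing the filter."""
    top_k = sorted(docs, key=lambda d: l2(query, d["vec"]))[:k]
    return [d["id"] for d in top_k if predicate(d)]


def pre_filter_search(query, docs, predicate, k):
    """Restrict to matching documents first, then take the top-k neighbors."""
    allowed = [d for d in docs if predicate(d)]
    return [d["id"] for d in sorted(allowed, key=lambda d: l2(query, d["vec"]))[:k]]


def is_2025(doc):
    """Metadata filter: keep only documents from 2025."""
    return doc["year"] == 2025


rng = random.Random(7)
docs = [
    {"id": i, "year": 2025 if i % 50 == 0 else 2024, "vec": [rng.random() for _ in range(8)]}
    for i in range(5000)
]
query = [rng.random() for _ in range(8)]

post_ids = post_filter_search(query, docs, is_2025, k=10)
pre_ids = pre_filter_search(query, docs, is_2025, k=10)
print(f"post-filter returned {len(post_ids)} results, pre-filter returned {len(pre_ids)}")
```

Pre-filtering always fills the requested top-k when enough matching documents exist, while post-filtering returns only whichever matches happened to land in the unfiltered top-k.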

What Does High-Throughput Vector Ingestion Require?

Ingestion pipelines usually fail at scale because of single-record inserts, not because of architectural flaws. Hammering a vector database with single-record inserts overloads the transaction log and stalls index building. Batch ingestion is the fix.

python

import itertools
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
pinecone_index = pc.Index("enterprise-rag")

def chunker(iterable, batch_size):
    """Yield successive tuples of up to batch_size items from iterable."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

# Batch insert 1,000 vectors at a time.
# `massive_vector_list` is your iterable of (id, values) tuples or vector dicts.
for batch in chunker(massive_vector_list, 1000):
    pinecone_index.upsert(vectors=list(batch))

For Milvus, decouple ingestion entirely using Apache Kafka. Drop raw documents into a topic and use a dedicated consumer microservice to construct and insert batches. This separates ingestion throughput from query throughput, allowing you to scale each independently. During peak ingestion periods, you are not competing with production query traffic for the same resources.
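The consumer side of that pattern can be sketched without any Kafka dependency. In the example below, a standard-library `queue.Queue` stands in for the Kafka topic and the `insert_batch` callable stands in for the Milvus insert call; both are assumptions made for illustration, and a real consumer would poll continuously instead of stopping when the queue is empty.

```python
import queue


def drain_in_batches(topic, batch_size, insert_batch):
    """Pull records off the queue and hand them to the database in batches."""
    batch = []
    while True:
        try:
            record = topic.get_nowait()
        except queue.Empty:
            break
        batch.append(record)
        if len(batch) >= batch_size:
            insert_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        insert_batch(batch)


# Producer side: raw documents are dropped onto the "topic".
topic = queue.Queue()
for i in range(2500):
    topic.put({"id": i, "text": f"document {i}"})

# Consumer side: record the size of each batch instead of calling a database.
inserted_batch_sizes = []
drain_in_batches(topic, batch_size=1000, insert_batch=lambda b: inserted_batch_sizes.append(len(b)))
print(inserted_batch_sizes)
```

Because producers only write to the queue, a slow index build delays the consumer without blocking the services that generate documents.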

"Vector database scaling problems are almost always ingestion problems disguised as query problems. By the time latency spikes, the ingestion pipeline has been overloading the index build process for hours." -- Bob van Luijt, CEO, Weaviate [Source: Industry]

What Advanced Strategies Apply Past 100 Million Vectors?

Standard quantization and partitioning get you to 100 million vectors reliably. Past that threshold, additional architectural patterns become necessary.

Separation of compute and storage is the first lever. Decouple your indexing compute from your query compute. Index building is a CPU-intensive background process. If an index rebuild kicks off during peak query traffic, latency spikes. In Milvus, Index Nodes and Query Nodes scale independently. In Pinecone's serverless architecture, this separation happens automatically.

Read replicas distribute query load across multiple nodes once the index is built. High-concurrency environments (thousands of queries per second) need this. Vector databases serve reads at much higher frequency than writes, making read replicas a direct latency lever for busy production systems.

Hybrid search combines vector search with sparse search (BM25) to handle exact keyword matching that pure vector search handles poorly. Searching for a specific product ID or contract number is faster with traditional sparse indexing than with ANN search. Pinecone supports sparse-dense vectors natively. Weaviate includes built-in BM25. Combining both systems handles the full query spectrum without routing logic in the application.
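When dense and sparse results come from separate systems, they have to be merged somewhere. One common way to do that is reciprocal rank fusion (RRF), sketched below. RRF is our choice for illustration and is not necessarily what Pinecone or Weaviate use internally; the document IDs are made up and k=60 is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense_hits = ["doc_7", "doc_2", "doc_9", "doc_4"]  # from vector search
sparse_hits = ["doc_4", "doc_7", "doc_1"]          # from BM25 keyword search

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, so an exact keyword hit and a strong semantic match can both surface without hand-tuned score weights.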

What Metrics Should You Monitor for Vector Database Health at Scale?

Four metrics cover 90% of production vector database failure modes. Monitoring everything else is noise until these are covered.

Index build time: Rising build times signal ingestion pipeline backlog before query latency is affected. Set alerts at 2x baseline build time.
Query latency p95 and p99: Averages mask performance cliffs. A p99 of 500ms with a p50 of 20ms indicates a specific query pattern hitting cold cache or oversized result sets.
Memory utilization per node: Set alerts at 75% memory utilization, not 90%. By 90%, performance is already degraded. The time between alert and remediation requires headroom.
Eviction rates: High eviction rates indicate that index segments are being swapped in from storage on most queries. This destroys latency. The fix is more memory or better quantization, not query optimization.
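The alert thresholds above translate directly into a small health check. The sketch below covers build time, memory, and tail latency; the function names, the sample values, and the 10x ratio between p99 and p50 are hypothetical and should be tuned to your workload.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile; adequate for alerting on latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def health_alerts(latencies_ms, build_seconds, baseline_build_seconds, memory_used_fraction):
    """Return a list of alert messages based on the thresholds in this guide."""
    alerts = []
    if build_seconds > 2 * baseline_build_seconds:
        alerts.append("index build time above 2x baseline")
    if memory_used_fraction >= 0.75:
        alerts.append("memory utilization at or above 75%")
    p50, p99 = percentile(latencies_ms, 50), percentile(latencies_ms, 99)
    if p99 > 10 * p50:  # hypothetical ratio; tune to your workload
        alerts.append(f"p99 {p99}ms far above p50 {p50}ms")
    return alerts


latencies = [20] * 98 + [480, 510]  # mostly fast, with a slow tail
alerts = health_alerts(latencies, build_seconds=130, baseline_build_seconds=60, memory_used_fraction=0.78)
for alert in alerts:
    print(alert)
```

An average over the same latency samples would sit near 29ms and hide the slow tail entirely, which is why the check compares percentiles.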

For Pinecone serverless and Milvus segment-loaded architectures, cold start behavior requires a specific mitigation: send periodic warm-up queries (dummy vectors against a background partition) every 60-90 seconds to keep cache segments loaded. The first query after an idle period can take 200ms-800ms while segments load from object storage.
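A warm-up job can be as simple as a loop that issues a cheap query on a timer. The sketch below is database-agnostic: `run_query` is a hypothetical callable that you would implement with your client's search call, and the 75-second default is just the midpoint of the 60-90 second range above.

```python
import time


def keep_warm(run_query, interval_seconds=75, iterations=None, sleep=time.sleep):
    """Run a cheap query on a fixed interval so index segments stay cached.

    iterations=None loops forever; a number is useful for tests and dry runs.
    """
    sent = 0
    while iterations is None or sent < iterations:
        run_query()
        sent += 1
        if iterations is None or sent < iterations:
            sleep(interval_seconds)
    return sent


# Dry run with stand-ins: a fake query and a sleep that only records the delay.
calls, delays = [], []
keep_warm(lambda: calls.append("warm-up"), iterations=3, sleep=delays.append)
print(len(calls), delays)
```

In production, the loop would run in its own process or scheduled job with `iterations=None`, passing a function that performs a real low-cost search against the index.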

Frequently Asked Questions

When should you choose pgvector over a dedicated vector database?

Use pgvector when you have existing Postgres infrastructure, your vector collection stays under 5 million records, and operational simplicity outweighs query performance. pgvector's HNSW implementation is production-capable at small scale. Above 10 million vectors, query latency and memory management become problematic compared to dedicated vector databases. pgvector's post-filtering behavior also creates recall problems in filtered workloads.

How does Qdrant's performance compare to Weaviate for self-hosted deployments?

Qdrant delivers better raw query latency (2ms-15ms p99) than Weaviate (5ms-30ms p99) in self-hosted benchmarks due to its Rust-based single-binary architecture with lower overhead. Weaviate compensates with a richer feature set: native GraphQL, built-in BM25 hybrid search, and a more mature multi-tenancy API. For latency-critical applications, Qdrant wins. For feature completeness and ecosystem maturity, Weaviate has the edge.

What is the right batch size for vector ingestion at scale?

1,000 vectors per batch is the practical starting point for both Pinecone and Milvus. The optimal batch size depends on vector dimension and metadata payload size. For 768-dimensional vectors, 1,500-2,000 records per batch works better. For 3,072-dimensional embeddings, 500-1,000 is more stable. Always benchmark with your actual vector dimensions and metadata schema, not synthetic data.
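The batch sizes above are roughly what you get from holding the payload per request constant. The sketch below derives a batch size from a byte budget; the default budget of 6,144,000 bytes is an assumption taken from 1,000 FP32 vectors at 1,536 dimensions, and the function name is illustrative.

```python
def batch_size_for(dim, payload_budget_bytes=6_144_000, metadata_bytes=0, bytes_per_value=4):
    """Records per batch so that vectors plus metadata fit the byte budget."""
    per_record = dim * bytes_per_value + metadata_bytes
    return max(1, payload_budget_bytes // per_record)


for dim in (768, 1536, 3072):
    print(dim, batch_size_for(dim))
```

Passing a nonzero `metadata_bytes` shrinks the batch accordingly. Some managed services also cap request size, so the budget may need to be lower than this default.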

How do you handle cold start latency in serverless vector database architectures?

Send periodic warm-up queries every 60-90 seconds to keep cache segments loaded in memory. In Pinecone serverless, the first query after an idle period can take 200ms-800ms while segments reload from object storage. Warm-up queries maintain cache state. For latency-sensitive production workloads, budget for a warm-up job running continuously as part of your infrastructure cost.

