June 1, 2026

Scaling Vector Databases: Pinecone vs Milvus

SYS_ENG

Vector databases are the engine behind any serious AI application. When you're building a semantic search system, a recommendation engine, or an advanced RAG (Retrieval-Augmented Generation) pipeline, your vector database is the component most likely to bottleneck your entire system. The reality of scaling vector databases is brutal. You start with a few hundred thousand embeddings, and everything feels lightning-fast. You hit ten million, and suddenly your latency spikes, memory consumption goes through the roof, and your cloud bill looks like a phone number.

In this post, we are going to break down exactly what happens when you scale vector databases, focusing specifically on Pinecone and Milvus. We will look at the architecture, the implementation details, the hard truths of performance pitfalls, and how you can architect a system that scales to billions of vectors without collapsing under its own weight.

The Problem: Why Scaling Vector Search is Hard

The core problem with vector databases is that high-dimensional similarity search is computationally expensive and extremely memory-intensive. Traditional relational databases use B-trees or hash indexes to find exact matches in $O(\log n)$ or $O(1)$ time. Vector databases don't look for exact matches; they perform Approximate Nearest Neighbor (ANN) search. Every query has to compute a distance (cosine, dot product, or L2) between your query vector and a large number of candidate vectors, and the index exists only to keep that candidate set as small as possible.
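To make the three metrics concrete, here is a minimal pure-Python sketch. The function names are illustrative and not taken from either database; production engines compute the same quantities with SIMD kernels over packed float arrays.

```python
import math

def dot(a, b):
    """Inner product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
doc = [3.0, 4.0]
print(dot(query, doc))                # 1*3 + 0*4
print(l2_distance(query, doc))        # sqrt(2^2 + 4^2)
print(cosine_similarity(query, doc))  # 3 / (1 * 5)
```

Each call touches every dimension once, which is why the per-query cost grows with both the number of candidates and the embedding size.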

The Memory Wall

The most common indexing algorithm used in vector databases is HNSW (Hierarchical Navigable Small World). HNSW is incredibly fast, but it requires the entire index to reside in RAM for optimal performance. If you have 100 million vectors, each with 1536 dimensions (like OpenAI's text-embedding-ada-002), and you use 32-bit floats, the raw data alone is about 614 GB (100 million × 1536 × 4 bytes). Add the HNSW graph links, runtime buffers, and headroom for growth, and you can easily end up provisioning close to 1 TB of RAM.

When your index outgrows available memory, the system starts swapping to disk, and your sub-millisecond latency immediately degrades to hundreds of milliseconds or worse. This is the memory wall. You cannot simply throw more RAM at the problem indefinitely; eventually, the hardware limits dictate that you must distribute the load.
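The arithmetic behind the memory wall is easy to check. The sketch below covers only the raw vectors, with no index overhead, and the 256 GB of usable RAM per node is an assumed figure for illustration, not a recommendation.

```python
import math

def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed for the vectors alone, before any index overhead."""
    return num_vectors * dim * bytes_per_value

def nodes_needed(total_bytes, usable_ram_per_node_bytes):
    """Minimum node count if the data must fit entirely in RAM."""
    return math.ceil(total_bytes / usable_ram_per_node_bytes)

raw_float32 = raw_vector_bytes(100_000_000, 1536)                  # 32-bit floats
raw_int8 = raw_vector_bytes(100_000_000, 1536, bytes_per_value=1)  # 8-bit codes

NODE_RAM = 256e9  # assumed usable RAM per node, in bytes
print(f"float32: {raw_float32 / 1e9:.1f} GB, nodes: {nodes_needed(raw_float32, NODE_RAM)}")
print(f"int8:    {raw_int8 / 1e9:.1f} GB, nodes: {nodes_needed(raw_int8, NODE_RAM)}")
```

Changing `bytes_per_value` from 4 to 1 is the whole effect of 8-bit scalar quantization on the raw footprint.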

The Compute Bottleneck

Even if you can fit everything into memory, computing distances between 1536-dimensional vectors is CPU-intensive. SIMD (Single Instruction, Multiple Data) instructions help, but as your query volume scales, you will saturate your CPU cores. The mathematics of calculating dot products or L2 distances across millions of high-dimensional vectors at a rate of thousands of queries per second requires serious compute horsepower.

Deep Dive: Vector Indexing Algorithms and Their Limits

To understand why scaling vector databases is so challenging, we must analyze the underlying indexing algorithms. You cannot scale what you do not understand.

Flat Indexes

The naive approach is a flat index. It takes the query vector and computes the distance to every single vector in the database. This guarantees 100% recall, but the cost of each query scales linearly, $O(n)$, with the number of stored vectors. At a million vectors, it's slow. At a billion vectors, it's unusable. Flat indexes are only suitable for tiny datasets or when perfect recall is absolutely mandatory.
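A flat index fits in a few lines, which is part of its appeal. This is a minimal sketch with illustrative names and toy two-dimensional vectors.

```python
import heapq
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(vectors, query, k):
    """Exact k-NN: score every stored vector and keep the k closest."""
    scored = ((l2(vec, query), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)  # list of (distance, id), closest first

vectors = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
for distance, vec_id in flat_search(vectors, [0.9, 0.9], k=2):
    print(vec_id, round(distance, 3))
```

Nothing is skipped, so the result is exact, and the work per query grows with every vector you add.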

Inverted File Index (IVF)

IVF partitions the vector space into Voronoi cells. During ingestion, each vector is assigned to the nearest centroid. During a query, the system identifies the closest centroids to the query vector and only searches within those specific cells. This restricts the search space massively. However, IVF still requires significant memory and compute to maintain the centroids and assign vectors accurately.
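The assign-then-probe flow can be sketched as follows. The centroids are hard-coded here as an assumption; real systems learn them with k-means on a sample of the data, and `TinyIVF` is an illustrative name, not a library class.

```python
import math
from collections import defaultdict

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class TinyIVF:
    def __init__(self, centroids):
        self.centroids = centroids      # assumed given; normally trained with k-means
        self.cells = defaultdict(list)  # centroid index -> [(id, vector), ...]

    def _nearest_centroids(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: l2(vec, self.centroids[i]))
        return order[:n]

    def add(self, vec_id, vec):
        cell = self._nearest_centroids(vec, 1)[0]
        self.cells[cell].append((vec_id, vec))

    def search(self, query, k, nprobe=1):
        # Only the nprobe closest cells are scanned; everything else is skipped.
        candidates = []
        for cell in self._nearest_centroids(query, nprobe):
            candidates.extend(self.cells[cell])
        candidates.sort(key=lambda item: l2(item[1], query))
        return [vec_id for vec_id, _ in candidates[:k]]

index = TinyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
for vec_id, vec in [("a", [1.0, 1.0]), ("b", [4.0, 4.0]),
                    ("c", [9.0, 9.0]), ("d", [6.0, 6.0])]:
    index.add(vec_id, vec)

print(index.search([5.2, 5.2], k=2, nprobe=1))  # scans one cell
print(index.search([5.2, 5.2], k=2, nprobe=2))  # scans both cells
```

With `nprobe=1` the query misses a true neighbor that sits just across the cell boundary; raising `nprobe` recovers it at the cost of scanning more vectors. That is the recall versus latency dial IVF gives you.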

Hierarchical Navigable Small World (HNSW)

HNSW builds a multi-layered graph. The bottom layer contains all vectors, and higher layers contain progressively fewer vectors, acting as "expressways" for search. When querying, the algorithm enters at the top layer, greedily navigates to the node closest to the query, and drops down a layer, repeating until it finds the nearest neighbors in the bottom layer. HNSW provides exceptional latency and recall, but its memory overhead is significant. Every vector stores links to its neighbors on top of the raw data, and the whole structure has to stay in RAM; with low-dimensional vectors and a high connectivity setting, the links can rival the size of the vectors themselves.
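The layer-by-layer descent can be illustrated with a deliberately simplified sketch. It is not a full HNSW implementation: the graph is hand-built over one-dimensional points, and the search keeps a single best candidate, whereas real HNSW tracks a candidate list and returns k neighbors. All names are illustrative.

```python
def l2sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_descent(layers, vectors, entry, query):
    """layers: adjacency dicts ordered from the top (sparse) to the bottom (all nodes)."""
    current = entry
    for graph in layers:
        while True:
            neighbors = graph.get(current, [])
            best = min(neighbors, key=lambda n: l2sq(vectors[n], query), default=current)
            if l2sq(vectors[best], query) >= l2sq(vectors[current], query):
                break           # no neighbor is closer: drop to the next layer
            current = best      # move along the edge that gets closer to the query
    return current

vectors = {i: [float(i)] for i in range(8)}                # points 0.0 .. 7.0
top = {0: [4], 4: [0]}                                     # long "expressway" hops
middle = {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]}            # medium hops
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

nearest = greedy_descent([top, middle, bottom], vectors, entry=0, query=[5.2])
print(nearest)  # id of the node the descent ends on
```

The search takes three hops (0 to 4, 4 to 6, 6 to 5) instead of walking the bottom chain node by node, which is the whole point of the upper layers.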

Architecture Showdown: Pinecone vs Milvus

When tackling the scaling problem, you generally have two paths: a fully managed SaaS solution like Pinecone, or an open-source, self-hosted (or managed) distributed system like Milvus.

Pinecone: The Serverless Approach

Pinecone abstracts away the infrastructure. You don't provision nodes, configure shards, or manage memory. You define an index, send it vectors, and query it.

Pinecone's architecture historically relied on pod-based instances, but their newer serverless architecture is a game-changer. In the serverless architecture, Pinecone decouples compute and storage. Storage lives in an object store (like AWS S3), and stateless compute nodes pull subsets of the index into memory on demand using a sophisticated caching layer.

The Pros of Pinecone:

  • Zero operational overhead. Your engineering team focuses on application logic, not infrastructure.
  • True serverless scaling. You pay for what you use, and the system scales transparently.
  • Excellent developer experience. The SDKs are clean, and the API is intuitive.

The Cons of Pinecone:

  • Expensive at massive scale. As your vector count enters the hundreds of millions, the SaaS markup becomes noticeable.
  • Opaque internals. When things get slow, you can't easily debug the underlying index structure or fine-tune the hardware parameters. You are at the mercy of their managed environment.

Milvus: The Distributed Engine

Milvus is an open-source vector database built explicitly for massive scale. Its architecture is heavily distributed and microservices-based, resembling a modern cloud-native platform rather than a traditional monolithic database.

Milvus separates its architecture into four distinct layers:

  1. Access Layer: A stateless proxy that handles client requests, authentication, and routing.
  2. Coordinator Service: The brain of the operation, managing cluster topology, metadata, and assigning tasks to worker nodes.
  3. Worker Nodes: The workhorses. Query nodes handle ANN search, Data nodes handle data ingestion, and Index nodes build the HNSW or IVF indexes in the background.
  4. Storage: Relies on MinIO or S3 for object storage and etcd for metadata, ensuring high durability.

The Pros of Milvus:

  • Horizontally scalable, if you have the engineering chops. You can scale query nodes independently of ingestion nodes.
  • Open-source, so you control the infrastructure and costs. You can deploy it on bare metal or custom Kubernetes clusters to optimize hardware utilization.
  • Granular control. It supports advanced indexing algorithms (IVF_FLAT, IVF_SQ8, IVF_PQ) that allow you to trade off recall for memory efficiency exactly how you want.

The Cons of Milvus:

  • High operational complexity. Running Milvus in production requires managing Kubernetes, etcd, Apache Pulsar or Kafka, and MinIO. It requires a dedicated DevOps or platform engineering team.

Implementation: Exactly How We Scaled

When we had to scale a client's RAG system from 5 million to 500 million vectors, we hit the limits of naive implementations. Here is exactly how we tackled it and the code we used to make it happen.

Step 1: Quantization is Mandatory

If you are running 32-bit float vectors at scale, you are burning money. The first step in scaling vector databases is implementing quantization.

Quantization reduces the precision of your vectors. Scalar Quantization (SQ) reduces 32-bit floats to 8-bit integers, cutting memory usage by 4x. Product Quantization (PQ) compresses the vectors even further by splitting them into sub-vectors and clustering them, reducing memory footprint by up to 10x or more.
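The 4x figure for scalar quantization falls straight out of the encoding. The sketch below uses a per-vector min/max range for brevity, which is an assumption made for illustration; production indexes such as IVF_SQ8 choose their ranges differently, so treat this as the idea rather than Milvus's implementation.

```python
import struct

def sq8_encode(vec):
    """Map each float onto one of 256 evenly spaced levels between min and max."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = bytes(round((x - lo) / scale) for x in vec)
    return codes, lo, scale

def sq8_decode(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]

vec = [0.12, -0.53, 0.98, 0.0, -1.0, 0.4]
codes, lo, scale = sq8_encode(vec)
restored = sq8_decode(codes, lo, scale)

float32_bytes = len(struct.pack(f"{len(vec)}f", *vec))
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(float32_bytes, len(codes))  # bytes as float32 vs bytes as 8-bit codes
print(max_error <= scale / 2 + 1e-9)
```

Each value is off by at most half a quantization step, which is where the small recall loss comes from.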

In Milvus, switching to an IVF_SQ8 index changed everything for us.

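from pymilvus import Collection, CollectionSchema, FieldSchema, DataType, connections

# Connect to Milvus first (host and port are placeholders for your deployment)
connections.connect(alias="default", host="localhost", port="19530")

# Define the schema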
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536)
]
schema = CollectionSchema(fields=fields, description="Scaled RAG collection")
collection = Collection(name="enterprise_rag", schema=schema)

# Milvus Index Configuration
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_SQ8",
    "params": {"nlist": 4096}
}

# Create the index
collection.create_index(
    field_name="embedding", 
    index_params=index_params
)

With IVF_SQ8, we reduced our memory footprint by 75%. The recall dropped slightly (from 99% to 96%), but in a RAG pipeline where the LLM does the final synthesis, that drop is entirely acceptable. The LLM can handle slight noise in the retrieved context, making the massive cost savings worthwhile.

Step 2: Intelligent Partitioning and Metadata Filtering

You rarely need to search the entire vector space. Most queries have logical constraints (e.g., "search only documents from 2023" or "search only user X's data").

Both Pinecone and Milvus support metadata filtering. However, post-filtering (applying the filter after the vector search) is a recipe for disaster. You might find 100 nearest neighbors, only to filter out 99 of them based on a timestamp, leaving the user with terrible results.

You must use pre-filtering or single-stage filtering. Milvus uses a bitset mechanism to apply filters before calculating distances. Pinecone handles this natively with its metadata indexes.
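The difference between the two strategies shows up even on four records. This is a toy brute-force sketch with illustrative names; it demonstrates the ordering of the filter and search steps, not how either database implements them internally.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def post_filter_search(records, query, k, predicate):
    """Take the k nearest overall, then drop the ones that fail the filter."""
    nearest = sorted(records, key=lambda r: l2(r["vec"], query))[:k]
    return [r["id"] for r in nearest if predicate(r)]

def pre_filter_search(records, query, k, predicate):
    """Drop non-matching records first, then take the k nearest of what is left."""
    allowed = [r for r in records if predicate(r)]
    return [r["id"] for r in sorted(allowed, key=lambda r: l2(r["vec"], query))[:k]]

def is_2023(record):
    return record["year"] == 2023

records = [
    {"id": "a", "vec": [0.1], "year": 2021},
    {"id": "b", "vec": [0.2], "year": 2022},
    {"id": "c", "vec": [0.3], "year": 2023},
    {"id": "d", "vec": [0.9], "year": 2023},
]

print(post_filter_search(records, [0.0], 2, is_2023))  # []
print(pre_filter_search(records, [0.0], 2, is_2023))   # ['c', 'd']
```

Post-filtering returns nothing because both of the nearest neighbors are from the wrong year, even though two valid matches exist.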

To scale further, partition your data. In Milvus, we heavily partitioned by tenant ID to ensure multi-tenant isolation and performance:

# Creating a partition
collection.create_partition("tenant_1042")

# Inserting into a specific partition in Milvus
collection.insert(
    data=entities,
    partition_name="tenant_1042"
)

# Load the partition into memory, then query only that partition
collection.load(partition_names=["tenant_1042"])
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 128}},
    limit=10,
    partition_names=["tenant_1042"]
)

By querying specific partitions, you drastically reduce the search space. If you also load only the partitions a workload needs, tenant-specific queries stay well clear of the memory wall.

Step 3: Handling High-Throughput Ingestion

Scaling isn't just about reading; it's about writing. Ingestion pipelines break at scale. If you hammer a vector database with individual inserts, you will overload the transaction log and stall the index building process.

Batching is critical. We built a data pipeline that batches vectors into optimal chunks before hitting the vector database.

# Pinecone Batch Ingestion Pipeline
import itertools
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
pinecone_index = pc.Index("enterprise-rag")

def chunker(iterable, batch_size):
    """Yield successive chunks of up to batch_size items from iterable."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

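# Batch upsert 200 vectors at a time. Pinecone caps the size of each upsert request,
# so 1536-dimensional vectors need batches well below 1,000; tune this to your payload.
# massive_vector_list is a list of dicts: {'id': 'vec1', 'values': [...], 'metadata': {...}}
for batch in chunker(massive_vector_list, 200):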
    pinecone_index.upsert(vectors=batch)

For Milvus, we decoupled ingestion entirely using Apache Kafka, dropping raw data into a topic and using a dedicated consumer microservice to construct and insert batches.
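The batching logic inside such a consumer might look like the sketch below. It is not our production service: `BatchBuffer` is an illustrative name, `insert_fn` is a hypothetical stand-in for the real insert call (for example a wrapper around `collection.insert`), and a plain loop stands in for the Kafka consumer.

```python
import time

class BatchBuffer:
    """Collects records and flushes them in batches by size or by age."""

    def __init__(self, insert_fn, max_size=1000, max_wait_s=2.0, clock=time.monotonic):
        self.insert_fn = insert_fn    # hypothetical: whatever writes a batch to the database
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.clock = clock
        self._items = []
        self._first_at = None

    def add(self, record):
        if not self._items:
            self._first_at = self.clock()  # age is measured from the first buffered record
        self._items.append(record)
        too_big = len(self._items) >= self.max_size
        too_old = self.clock() - self._first_at >= self.max_wait_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._items:
            self.insert_fn(self._items)
            self._items = []

batches = []
buffer = BatchBuffer(insert_fn=batches.append, max_size=1000, max_wait_s=60.0)
for i in range(2500):        # stand-in for messages read from a topic
    buffer.add({"id": i})
buffer.flush()               # flush the remainder on shutdown
print([len(b) for b in batches])
```

Note that the age check only runs when a new record arrives, so a real consumer would also call `flush()` on a timer or on poll timeouts.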

Advanced Scaling Strategies

When you push past 100 million vectors, standard tuning isn't enough. You need advanced strategies.

Separation of Indexing and Query Compute

If you are managing your own infrastructure, decouple your indexing compute from your query compute. Indexing is a CPU-intensive background process. If an index rebuild kicks off while you are handling peak query traffic, your latency will spike. In Milvus, you achieve this by scaling Index Nodes independently of Query Nodes.

Read Replicas

Just like relational databases, vector databases benefit from read replicas. Once your index is built, you can replicate it across multiple nodes to distribute the read traffic. This is crucial for high-concurrency environments.

Hybrid Search

Vector search is great for semantic matching, but terrible for exact keyword matching (e.g., searching for a specific product ID). Implementing hybrid search, which combines vector search with traditional sparse search (like BM25), allows you to offload exact matches to a more efficient system (like Elasticsearch) and reserve the vector database for pure semantic queries. Pinecone supports sparse-dense vectors natively, which simplifies this architecture.
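If you run the two searches in separate systems, you still need a way to merge their result lists. One common choice is reciprocal rank fusion (RRF), sketched below. This is an illustrative option rather than something either database requires; the document IDs are made up, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_hits = ["doc3", "doc1", "doc7"]   # from the vector database
sparse_hits = ["doc1", "doc9", "doc3"]  # from the keyword engine
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF only needs ranks, so you never have to reconcile BM25 scores with cosine similarities.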

Observability and Monitoring at Scale

You cannot operate a large-scale vector database blind. You need rigorous observability.

Monitor these specific metrics:

  1. Index Build Time: If this starts creeping up, your ingestion pipeline is going to back up.
  2. Query Latency (p95 and p99): Averages lie. Look at your p99 latency to identify performance cliffs.
  3. Memory Utilization per Node: Crucial for managing the memory wall. Set alerts well before you hit 90%.
  4. Eviction Rates: If you are using a system that swaps to disk, track how often it evicts vectors from RAM. High eviction rates indicate you need more memory or better quantization.
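To see why averages lie, compare the mean with the tail on a made-up latency sample. The numbers are invented for illustration, and the percentile uses the simple nearest-rank definition; your monitoring stack may interpolate differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# 97 fast queries and 3 slow ones, in milliseconds (illustrative data)
latencies_ms = [12] * 97 + [15, 450, 900]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.2f} p50={percentile(latencies_ms, 50)} "
      f"p95={percentile(latencies_ms, 95)} p99={percentile(latencies_ms, 99)}")
```

The mean and even the p95 look healthy here; only the p99 exposes the queries that took hundreds of milliseconds.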

Pitfalls to Avoid

Scaling vector databases reveals several brutal truths. Ignore these at your own peril.

  1. Ignoring Cold Starts: In serverless architectures like Pinecone's, or when Milvus loads a segment from disk to memory, the first query is slow. If your application requires consistently low latency, you must implement dummy "warm-up" queries to keep the caches hot.
  2. Over-Indexing: Not every field needs an index. Vector indexes are expensive to build. If your ingestion throughput is dropping, check if you are rebuilding HNSW graphs too frequently or indexing metadata fields that are rarely used in queries.
  3. Chasing 100% Recall: Engineers obsess over getting 100% recall on ANN searches. It's a trap. Dropping to 95% recall can yield 10x performance improvements with almost no perceptible difference to the end user. Use PQ or SQ, and tune your ef_search or nprobe parameters aggressively.
  4. Failing to Benchmark with Real Data: Synthetic vectors behave differently than real-world embeddings. Always benchmark your vector database with the exact embeddings generated by your specific model.
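Recall is cheap to measure if you keep exact (flat) results for a sample of real queries as ground truth. The sketch below shows the calculation; the ID lists are placeholders for what your ANN index and a brute-force search would return.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def mean_recall(pairs, k):
    """Average recall@k over (approximate, exact) result pairs, one pair per query."""
    return sum(recall_at_k(approx, exact, k) for approx, exact in pairs) / len(pairs)

exact = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
approx = ["a", "b", "c", "d", "e", "f", "g", "h", "x", "j"]  # one true neighbor missed

print(recall_at_k(approx, exact, k=10))
print(mean_recall([(approx, exact), (exact, exact)], k=10))
```

Run this against embeddings from your own model while sweeping nprobe or ef, and you get the recall versus latency curve you actually need to tune against.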

The Outcome

Choosing between Pinecone and Milvus comes down to a fundamental build vs. buy calculation.

If you have a small engineering team, tight deadlines, and your vectors number in the tens of millions, Pinecone is the obvious choice. The time saved on infrastructure management vastly outweighs the SaaS premium. You can deploy a production-ready system in hours.

If you are dealing with hundreds of millions or billions of vectors, and you have a dedicated DevOps or platform engineering team, Milvus is the path forward. Its microservices architecture allows you to scale ingestion and querying independently, and the savings from quantization and efficient hardware utilization on your own infrastructure are massive.

Scaling vector databases is a hard engineering problem, but by understanding the memory constraints, utilizing quantization, and architecting intelligent data partitions, you can build systems that serve billions of vectors in milliseconds. Stop guessing, start benchmarking, and build for scale from day one.
