mdcms/sample-sites/neuraldb-docs/pages/ops-scaling.md

---
title: Scaling
sort: 120
section-id: operations
keywords: scaling, sharding, read replicas, horizontal scaling, capacity planning, performance
description: Scaling NeuralDB horizontally with sharding, read replicas, and capacity planning
language: en
---

# Scaling

NeuralDB is designed to scale horizontally. This page covers single-node tuning, read replicas for query throughput, sharding for data volume, and capacity planning to avoid resource exhaustion.

## Vertical Scaling (Scale Up)

Before adding nodes, make sure a single node is tuned as far as it will go.

### Memory

The biggest lever for NeuralDB performance is memory. Ensure:
- `vector_buffer` is large enough to hold all active HNSW graphs
- `shared_buffers` is set to 25% of RAM
- `work_mem` is appropriate for your query patterns

```sql
-- Check if vectors are being served from disk (slow) vs memory (fast)
SELECT index_name, hnsw_graph_size_bytes, hnsw_in_memory
FROM neuraldb_stat_vector_indexes
ORDER BY hnsw_graph_size_bytes DESC;
```

If `hnsw_in_memory = false`, increase `vector_buffer`.
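
To turn the query result into a sizing number, the rows can be summarised client-side. The following is a minimal sketch, assuming the rows have already been fetched as `(index_name, hnsw_graph_size_bytes, hnsw_in_memory)` tuples. The names `IndexStat` and `vector_buffer_report` are hypothetical helpers for illustration, not part of NeuralDB.

```python
from typing import Iterable, List, NamedTuple


class IndexStat(NamedTuple):
    """One row of neuraldb_stat_vector_indexes (columns from the query above)."""
    index_name: str
    hnsw_graph_size_bytes: int
    hnsw_in_memory: bool


class BufferReport(NamedTuple):
    on_disk: List[str]        # indexes currently served from disk
    total_graph_bytes: int    # combined size of all HNSW graphs


def vector_buffer_report(rows: Iterable[IndexStat]) -> BufferReport:
    """Summarise which HNSW graphs spill to disk and how large all graphs are."""
    rows = list(rows)
    on_disk = [r.index_name for r in rows if not r.hnsw_in_memory]
    total = sum(r.hnsw_graph_size_bytes for r in rows)
    return BufferReport(on_disk=on_disk, total_graph_bytes=total)
```

If `on_disk` is non-empty, `total_graph_bytes` gives a lower bound for a `vector_buffer` value that holds every graph at once.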

### CPU

Vector ANN searches are CPU-bound. Enable parallel query:

```ini
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
```

```sql
-- Allow parallel ANN queries for large tables
SET max_parallel_workers_per_gather = 8;
SELECT * FROM large_table ORDER BY embedding <=> $1 LIMIT 10;
```

### Storage I/O

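Use NVMe SSDs with high IOPS, then tune the OS. The settings below do not survive a reboot, so persist them with a udev rule or a boot-time script: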

```bash
# Increase read-ahead for sequential I/O (1024 × 512-byte sectors = 512 KiB)
sudo blockdev --setra 1024 /dev/nvme0n1

# Use the mq-deadline I/O scheduler
echo "mq-deadline" | sudo tee /sys/block/nvme0n1/queue/scheduler

# Disable transparent huge pages (reduces latency variability)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

## Read Replicas

Add read replicas to distribute query load.

### Setting Up Read Replicas

Follow the [Replication guide](config-replication.md) to add replicas. Each replica can independently serve `SELECT` queries, including vector similarity searches.

### Client-Side Read Splitting

Configure your application to route reads to replicas:

**Python:**
```python
from neuraldb import NeuralDB

primary = NeuralDB("postgresql://neuraldb:pass@primary:5432/mydb")
replica = NeuralDB("postgresql://neuraldb:pass@replica:5432/mydb")

def search(query_vector):
    # Read goes to a replica; results can trail the primary by the replication lag
    return replica.query("SELECT * FROM docs ORDER BY embedding <=> %s LIMIT 10", [query_vector])

def insert(content, embedding):
    # Write goes to primary
    return primary.execute("INSERT INTO docs (content, embedding) VALUES (%s, %s)", [content, embedding])
```

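**Connection string with `target_session_attrs`** (for read-only connections; `prefer-standby` picks a standby when one is available and falls back to the primary otherwise, so writes still need a separate connection to the primary):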
```
postgresql://neuraldb:pass@primary:5432,replica:5432/mydb?target_session_attrs=prefer-standby
```
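
With more than one replica, the application also has to decide which replica serves each read. The sketch below illustrates one simple option, round-robin routing, under stated assumptions: `ReadRouter` is a hypothetical helper (not a NeuralDB API), it only picks a connection string, and it treats any statement starting with `SELECT` as a read. That is a simplification: statements such as `SELECT ... FOR UPDATE`, or reads that must see the latest writes, still need to go to the primary.

```python
import itertools
from typing import Sequence


class ReadRouter:
    """Round-robin reads over replica DSNs; everything else goes to the primary."""

    def __init__(self, primary_dsn: str, replica_dsns: Sequence[str]):
        self.primary_dsn = primary_dsn
        # Fall back to the primary for reads when no replicas are configured.
        self._read_cycle = itertools.cycle(list(replica_dsns) or [primary_dsn])

    def dsn_for(self, sql: str) -> str:
        """Return the DSN that a statement should be sent to."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._read_cycle) if is_read else self.primary_dsn
```

Round-robin keeps the router stateless apart from a cursor. A production router would usually also skip replicas whose replication lag exceeds a threshold.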

### Read Replica Scaling Targets

| Topology | Approximate peak QPS (1536-dim, 10M vectors) |
|----------|----------------------------------------------|
| 1 primary | 8,000 |
| 1 primary + 2 replicas | 24,000 |
| 1 primary + 4 replicas | 48,000 |
| 1 primary + 8 replicas | 96,000 |

## Horizontal Sharding

For datasets exceeding single-node capacity (>50M vectors or >5 TB), shard across multiple primary nodes.

### Shard Configuration

```sql
-- Create a sharded cluster (requires NeuralDB Cluster Edition)
SELECT neuraldb_cluster.init_cluster(
  shards => 8,
  replication_factor => 2
);

-- Create a sharded table
CREATE TABLE documents (
  id UUID NOT NULL DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  content TEXT,
  embedding VECTOR(1536)
) SHARD BY tenant_id;

-- Each shard holds ~1/8 of the data when tenants are of similar size
-- All rows with the same tenant_id are colocated on the same shard
```
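
The colocation property follows from mapping the shard key to a shard deterministically. The sketch below shows the general idea with a stable hash. It is an illustration only: how NeuralDB actually assigns `tenant_id` values to shards is not described here, and `shard_for` is a hypothetical function.

```python
import hashlib
import uuid

NUM_SHARDS = 8  # matches shards => 8 in the example above


def shard_for(tenant_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Map a tenant to a shard number using a stable hash of its UUID bytes."""
    digest = hashlib.sha256(tenant_id.bytes).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping depends only on `tenant_id`, every row for a tenant lands on the same shard, and a query filtered by `tenant_id` can be answered by a single shard. A plain modulo mapping like this one moves most keys when `num_shards` changes, which is why adding shards requires a rebalance.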

### Cross-Shard Queries

Cross-shard queries (where the filter doesn't align with the shard key) are automatically parallelised across shards:

```sql
-- This query executes on all 8 shards in parallel
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
-- Results are merged and re-ranked by the coordinator
```
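
The merge step can be pictured as a k-way merge of per-shard results. The sketch below assumes each shard returns its own top-`k` hits already sorted by distance. `merge_top_k` is a hypothetical name for illustration and says nothing about how the NeuralDB coordinator is implemented.

```python
import heapq
import itertools
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (distance, row id); a smaller distance is a closer match


def merge_top_k(per_shard_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard hit lists (each sorted by distance) into a global top-k."""
    merged = heapq.merge(*per_shard_hits)  # lazy k-way merge of sorted inputs
    return list(itertools.islice(merged, k))


shard_a = [(0.10, "a1"), (0.40, "a2")]
shard_b = [(0.05, "b1"), (0.30, "b2")]
shard_c = [(0.20, "c1")]
top3 = merge_top_k([shard_a, shard_b, shard_c], k=3)
```

Each shard has to return a full `k` candidates, not `k / shards`, because all of the global top-`k` rows may live on one shard.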

Throughput scales near-linearly with shard count. An 8-shard cluster has ~8× the raw capacity of a single node for cross-shard searches; after ~20% coordination overhead, expect roughly 6.4× the single-node QPS.

### Shard Rebalancing

When adding new shard nodes, rebalance data:

```sql
-- Rebalance shards (online, non-blocking)
SELECT neuraldb_cluster.rebalance_shards();

-- Monitor progress
SELECT * FROM neuraldb_cluster.rebalance_status;
```

## Capacity Planning

### Storage Capacity

Estimate required storage:

```
Row data ≈ avg_row_size_bytes × num_rows × 1.3 (index overhead)
Vector data ≈ dimensions × 4 bytes × num_vectors
HNSW graph ≈ dimensions × 4 bytes × num_vectors × 1.3
WAL ≈ daily_write_volume × wal_retention_days

Total ≈ row_data + vector_data + HNSW_graph + WAL + 20% buffer
```

Example: 100M rows with one 1536-dimension vector each, 500 bytes average row size, and 10 GB/day of WAL retained for 7 days:
- Row data: 500 B × 100M × 1.3 ≈ **65 GB**
- Vector data: 1536 × 4 B × 100M ≈ **614 GB**
- HNSW graph: 614 GB × 1.3 ≈ **800 GB** (must fit in `vector_buffer`)
- WAL (7 days): 10 GB/day × 7 = **70 GB**
- Subtotal: 65 + 614 + 800 + 70 ≈ **1.55 TB**
- **Total with 20% buffer: ~1.9 TB storage, 800 GB RAM for HNSW**
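
The estimate above can be scripted so it is easy to rerun as the dataset grows. This is a minimal sketch of the formulas in this section, assuming one vector per row and decimal units (1 GB = 10^9 bytes). The function name `estimate_storage_bytes` and its parameter names are illustrative.

```python
def estimate_storage_bytes(num_rows: int, avg_row_size_bytes: int, dimensions: int,
                           daily_write_bytes: float, wal_retention_days: int) -> dict:
    """Apply the storage formulas from this section; all results are in bytes."""
    row_data = avg_row_size_bytes * num_rows * 1.3   # 1.3 = index overhead
    vector_data = dimensions * 4 * num_rows          # 4-byte floats, one vector per row
    hnsw_graph = vector_data * 1.3                   # must fit in vector_buffer
    wal = daily_write_bytes * wal_retention_days
    subtotal = row_data + vector_data + hnsw_graph + wal
    return {
        "row_data": row_data,
        "vector_data": vector_data,
        "hnsw_graph": hnsw_graph,
        "wal": wal,
        "total": subtotal * 1.2,                     # add the 20% buffer
    }


estimate = estimate_storage_bytes(
    num_rows=100_000_000, avg_row_size_bytes=500, dimensions=1536,
    daily_write_bytes=10e9, wal_retention_days=7,
)
```

The `hnsw_graph` entry is the figure to compare against available RAM, and `total` is the disk figure including the buffer.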

### Connection Capacity

```
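max_connections = direct_app_connections + pgbouncer_pool_size + replication_connections + 3 (superuser)
```

Only connections that reach the database count towards the limit. For 500 app connections routed entirely through PgBouncer with pool size 20, plus 10 replica connections:
```
max_connections = 0 (direct) + 20 (PgBouncer pool) + 10 (replicas) + 3 (superuser) = 33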
```

PgBouncer multiplexes 500 app connections → 20 database connections.
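
The same arithmetic as a small helper, in case the calculation is repeated for several environments. This is a sketch under the assumption that app connections routed through PgBouncer are already covered by the pool size, so only connections that bypass the pooler are added separately. `required_max_connections` and its parameter names are hypothetical.

```python
def required_max_connections(pgbouncer_pool_size: int,
                             replication_connections: int,
                             direct_app_connections: int = 0,
                             superuser_reserved: int = 3) -> int:
    """Sum every connection that actually reaches the database."""
    return (direct_app_connections + pgbouncer_pool_size
            + replication_connections + superuser_reserved)


# 500 app connections all go through PgBouncer, so none are direct.
needed = required_max_connections(pgbouncer_pool_size=20, replication_connections=10)
```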

### Alert Thresholds

| Resource | Warning | Critical |
|---------|---------|---------|
| Connections | 80% of max | 95% of max |
| Storage | 70% full | 85% full |
| `vector_buffer` utilisation | 80% | 90% |
| Replication lag | 30s | 120s |
| Query p99 latency | 500ms | 2000ms |
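
The thresholds in the table can be encoded directly in a monitoring script. The sketch below is illustrative only: the metric keys and the `classify` function are hypothetical names, and how the metric values are collected is left out.

```python
# (warning, critical) thresholds copied from the table above.
THRESHOLDS = {
    "connections_pct": (80, 95),       # % of max_connections in use
    "storage_pct": (70, 85),           # % of disk used
    "vector_buffer_pct": (80, 90),     # % of vector_buffer in use
    "replication_lag_s": (30, 120),    # seconds behind the primary
    "query_p99_ms": (500, 2000),       # p99 query latency in milliseconds
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"
```

For example, `classify("storage_pct", 72)` falls in the warning band, while a `replication_lag_s` reading of 120 is treated as critical because the comparison is inclusive.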