mdcms/sample-sites/neuraldb-docs/pages/ops-migration.md
2026-05-18 14:30:49 +07:00

---
title: Migration
sort: 130
section-id: operations
keywords: migration, import, Postgres, Pinecone, Weaviate, data migration, ETL
description: Migrating data to NeuralDB from PostgreSQL, Pinecone, Weaviate, and other sources
language: en
---
# Migration
This guide covers migrating data into NeuralDB from common sources: PostgreSQL (with or without pgvector), Pinecone, and Weaviate.
## From PostgreSQL (without vectors)
If you are migrating a standard PostgreSQL database to NeuralDB, the simplest path is a logical dump and restore:
```bash
# 1. Dump from source Postgres
pg_dump \
  -h source-host \
  -U source-user \
  -d source-database \
  --format=custom \
  --compress=9 \
  > source-backup.dump

# 2. Create the target database in NeuralDB
psql -h neuraldb-host -U neuraldb -c "CREATE DATABASE myapp;"

# 3. Restore into NeuralDB
pg_restore \
  -h neuraldb-host \
  -U neuraldb \
  -d myapp \
  --jobs=8 \
  --no-owner \
  source-backup.dump
```
### Adding Vector Columns Post-Migration
After restoring the schema and data, add vector columns and generate embeddings:
```sql
-- Add the vector column
ALTER TABLE documents ADD COLUMN embedding VECTOR(1536);

-- Create the index (do this before backfilling on large tables)
CREATE INDEX CONCURRENTLY documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
```
Then backfill embeddings in batches:
```python
import os

import openai
from neuraldb import NeuralDB

# Connection string is read from the environment, as in the Pinecone and
# Weaviate examples in this guide.
client = NeuralDB(os.environ["NEURALDB_URL"])
openai_client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
BATCH_SIZE = 100

while True:
    # Fetch the next batch of rows that still lack an embedding
    rows = client.query("""
        SELECT id, content FROM documents
        WHERE embedding IS NULL
        LIMIT %s
    """, [BATCH_SIZE])
    if not rows:
        break

    texts = [row['content'] for row in rows]
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    # Embeddings are returned in input order, so pair each with its row id
    updates = [
        (item.embedding, row['id'])
        for item, row in zip(response.data, rows)
    ]
    client.executemany(
        "UPDATE documents SET embedding = %s WHERE id = %s",
        updates
    )
    print(f"Backfilled {len(rows)} rows")
```
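Before writing a batch, it can help to confirm that every returned embedding has the dimension declared on the column (1536 above). The following is a minimal sketch only: `validate_batch` is a hypothetical helper, not part of the NeuralDB or OpenAI clients, and it assumes the `(embedding, id)` tuple order used by the `UPDATE` statement above.

```python
from typing import Sequence

EXPECTED_DIM = 1536  # must match the VECTOR(1536) column definition


def validate_batch(ids: Sequence, embeddings: Sequence[Sequence[float]],
                   expected_dim: int = EXPECTED_DIM) -> list[tuple]:
    """Pair each embedding with its row id, rejecting malformed batches.

    Hypothetical helper: returns (embedding, id) tuples in the parameter
    order of "UPDATE documents SET embedding = %s WHERE id = %s".
    """
    if len(ids) != len(embeddings):
        raise ValueError(
            f"got {len(embeddings)} embeddings for {len(ids)} rows"
        )
    updates = []
    for row_id, vector in zip(ids, embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"row {row_id}: expected {expected_dim} dimensions, "
                f"got {len(vector)}"
            )
        updates.append((list(vector), row_id))
    return updates
```

Failing fast here keeps a misconfigured embedding model from producing a batch that the database would reject partway through the backfill.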
## From PostgreSQL + pgvector
pgvector uses the same `VECTOR` type as NeuralDB. Migration is a direct dump and restore with minimal adjustments.
```bash
# Dump: exclude the pgvector extension (NeuralDB has native vector support).
# Note: --exclude-extension requires pg_dump from PostgreSQL 17 or later.
pg_dump \
  -h source-host -U source-user -d source-db \
  --format=custom \
  --exclude-extension=vector \
  > pgvector-backup.dump

# Restore into NeuralDB
pg_restore \
  -h neuraldb-host -U neuraldb -d myapp \
  --jobs=8 \
  pgvector-backup.dump
```
### Re-create HNSW Indexes
pgvector HNSW indexes are not transferred. Recreate them in NeuralDB:
```sql
-- Drop pgvector-created indexes
DROP INDEX IF EXISTS documents_embedding_idx;

-- Create NeuralDB HNSW index (same syntax, better performance)
CREATE INDEX CONCURRENTLY documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```
## From Pinecone
Pinecone stores vectors with metadata. Export using the Pinecone SDK and ingest into NeuralDB:
```python
import os

import pinecone
from neuraldb import NeuralDB, BulkIngestor

# Source: Pinecone
pc = pinecone.Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("my-index")

# Target: NeuralDB
client = NeuralDB(os.environ["NEURALDB_URL"])

# Create target table
client.execute("""
    CREATE TABLE IF NOT EXISTS pinecone_migration (
        id TEXT PRIMARY KEY,
        embedding VECTOR(1536),
        metadata JSONB,
        migrated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    )
""")
client.execute("""
    CREATE INDEX IF NOT EXISTS pinecone_migration_emb_idx
    ON pinecone_migration USING hnsw (embedding vector_cosine_ops)
""")

# Paginate through all Pinecone vectors.
# paginate_pinecone_ids is a user-supplied helper (not part of the Pinecone
# SDK) that yields lists of vector IDs.
ingestor = BulkIngestor(client, table="pinecone_migration", batch_size=500)
with ingestor as ing:
    for ids_batch in paginate_pinecone_ids(index, batch_size=1000):
        fetch_response = index.fetch(ids=ids_batch)
        for vector_id, vector_data in fetch_response.vectors.items():
            ing.add({
                "id": vector_id,
                "embedding": vector_data.values,
                "metadata": vector_data.metadata or {}
            })
print(f"Migrated {ingestor.total_inserted} vectors")
```
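The example calls `paginate_pinecone_ids`, which this guide does not define. One possible sketch is shown below. It assumes the index exposes a `list()` method that yields pages (lists) of IDs, which the Pinecone Python SDK provides for serverless indexes; the `namespace` parameter and the re-chunking logic are illustrative choices, not part of the original example.

```python
from typing import Iterator, List, Optional


def paginate_pinecone_ids(index, batch_size: int = 1000,
                          namespace: Optional[str] = None) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` vector IDs from `index`.

    Assumption: `index.list()` is a generator of ID pages. Pages are
    re-chunked so callers get evenly sized batches regardless of the
    page size the service returns.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batch: List[str] = []
    for page in index.list(namespace=namespace):
        for vector_id in page:
            batch.append(vector_id)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # emit the final partial batch
        yield batch
```

If the service limits how many IDs a single `fetch` call accepts, lower `batch_size` accordingly.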
### Mapping Pinecone Metadata to Columns
Flatten commonly queried metadata fields into dedicated columns for better query performance:
```python
# Rather than filtering on the metadata JSONB column directly, derive typed
# columns from it for common filter fields:
client.execute("""
    ALTER TABLE pinecone_migration
        ADD COLUMN IF NOT EXISTS category TEXT
            GENERATED ALWAYS AS (metadata->>'category') STORED,
        ADD COLUMN IF NOT EXISTS created_date DATE
            GENERATED ALWAYS AS ((metadata->>'date')::DATE) STORED;
    CREATE INDEX ON pinecone_migration (category);
    CREATE INDEX ON pinecone_migration (created_date);
""")
```
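To decide which fields to flatten, one rough starting point is to count how often each metadata key appears in a sample of records. The sketch below is an illustration under that assumption: `common_metadata_keys` and the `min_share` threshold are hypothetical, and key frequency is only a proxy, since how often a field is actually used in filters matters more.

```python
from collections import Counter
from typing import Iterable, Mapping


def common_metadata_keys(records: Iterable[Mapping],
                         min_share: float = 0.8) -> list[str]:
    """Return metadata keys present in at least `min_share` of the records."""
    counts: Counter = Counter()
    total = 0
    for metadata in records:
        total += 1
        counts.update(metadata.keys())  # each key counted once per record
    if total == 0:
        return []
    return sorted(key for key, n in counts.items() if n / total >= min_share)


# Example with three sampled metadata dicts
sample = [
    {"category": "news", "date": "2024-01-01"},
    {"category": "blog"},
    {"category": "news", "date": "2024-02-01", "author": "x"},
]
candidates = common_metadata_keys(sample, min_share=0.6)
```

Here `candidates` contains `category` and `date`, because `author` appears in only one of the three records.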
## From Weaviate
Export Weaviate data using the Weaviate client SDK:
```python
import os

import weaviate
from neuraldb import NeuralDB, BulkIngestor

weaviate_client = weaviate.connect_to_local()
neuraldb_client = NeuralDB(os.environ["NEURALDB_URL"])
collection = weaviate_client.collections.get("Document")

# Create target schema
neuraldb_client.execute("""
    CREATE TABLE weaviate_documents (
        id UUID PRIMARY KEY,
        content TEXT,
        category TEXT,
        source TEXT,
        embedding VECTOR(1536)
    );
    CREATE INDEX ON weaviate_documents USING hnsw (embedding vector_cosine_ops);
""")

ingestor = BulkIngestor(neuraldb_client, table="weaviate_documents", batch_size=500)
try:
    with ingestor as ing:
        # include_vector=True is required; vectors are omitted by default
        for item in collection.iterator(include_vector=True):
            ing.add({
                "id": str(item.uuid),
                "content": item.properties.get("content", ""),
                "category": item.properties.get("category"),
                "source": item.properties.get("source"),
                "embedding": item.vector.get("default"),
            })
finally:
    # Close the Weaviate connection even if ingestion fails
    weaviate_client.close()
print(f"Migrated {ingestor.total_inserted} objects")
```
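The loop above passes `item.vector.get("default")` straight to the ingestor, so an object stored without a vector would be inserted with a null embedding. A small mapping function can make that case explicit. This is a sketch only: `weaviate_object_to_row` is a hypothetical helper, and the property names (`content`, `category`, `source`) are the ones assumed by the example schema.

```python
from types import SimpleNamespace
from typing import Any, Optional


def weaviate_object_to_row(item: Any,
                           vector_name: str = "default") -> Optional[dict]:
    """Convert one Weaviate object to a row dict, or None if it has no vector."""
    vector = (item.vector or {}).get(vector_name)
    if vector is None:
        return None  # caller decides whether to skip or log the object
    props = item.properties or {}
    return {
        "id": str(item.uuid),
        "content": props.get("content", ""),
        "category": props.get("category"),
        "source": props.get("source"),
        "embedding": list(vector),
    }


# Stand-in objects for illustration; real items come from collection.iterator()
with_vector = SimpleNamespace(
    uuid="00000000-0000-0000-0000-000000000001",
    properties={"content": "hello", "category": "demo"},
    vector={"default": [0.1, 0.2]},
)
without_vector = SimpleNamespace(
    uuid="00000000-0000-0000-0000-000000000002",
    properties={"content": "no vector"},
    vector={},
)
row = weaviate_object_to_row(with_vector)
skipped = weaviate_object_to_row(without_vector)
```

In the migration loop, a `None` result could be counted and reported instead of being passed to `ing.add`.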
## Verifying Migration
After any migration, verify data integrity:
```sql
-- Row count comparison (compare against the row or vector count in the source)
SELECT COUNT(*) FROM documents;

-- Sample vector similarity (should match source).
-- ORDER BY id makes the probe row deterministic, so both subqueries
-- pick the same row on every run.
SELECT id, content,
       1 - (embedding <=> (SELECT embedding FROM documents ORDER BY id LIMIT 1)) AS sim
FROM documents
ORDER BY embedding <=> (SELECT embedding FROM documents ORDER BY id LIMIT 1)
LIMIT 5;

-- Check for null embeddings
SELECT COUNT(*) FROM documents WHERE embedding IS NULL;

-- Index health
SELECT index_name, hnsw_in_memory, estimated_recall
FROM neuraldb_stat_vector_indexes;
```
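When the source is not SQL-accessible (Pinecone or Weaviate, for example), a spot check can be done client-side by fetching a sample of vectors from both systems and comparing them by ID. The following is a minimal sketch, assuming both samples have already been loaded into dictionaries keyed by ID; `find_mismatches` and the `min_similarity` threshold are hypothetical. Cosine similarity is used instead of exact equality because stored floats may be rounded.

```python
import math
from typing import Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors differ in length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector")
    return dot / norm


def find_mismatches(source: Mapping[str, Sequence[float]],
                    target: Mapping[str, Sequence[float]],
                    min_similarity: float = 0.9999) -> list[str]:
    """Return source IDs that are missing from target or whose vectors differ."""
    mismatched = []
    for vector_id, src_vec in source.items():
        tgt_vec = target.get(vector_id)
        if (
            tgt_vec is None
            or len(tgt_vec) != len(src_vec)
            or cosine_similarity(src_vec, tgt_vec) < min_similarity
        ):
            mismatched.append(vector_id)
    return mismatched


# Example: "b" changed direction and "c" was never migrated
source_sample = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
target_sample = {"a": [1.0, 0.0], "b": [1.0, 0.0]}
problems = find_mismatches(source_sample, target_sample)
```

An empty result on a reasonably sized random sample is a good sign, though it does not replace the full row count comparison.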