---
title: Migration
sort: 130
section-id: operations
keywords: migration, import, Postgres, Pinecone, Weaviate, data migration, ETL
description: Migrating data to NeuralDB from PostgreSQL, Pinecone, Weaviate, and other sources
language: en
---

# Migration

This guide covers migrating data into NeuralDB from common sources: PostgreSQL (with or without pgvector), Pinecone, and Weaviate.

## From PostgreSQL (without vectors)

If you are migrating a standard PostgreSQL database to NeuralDB, the simplest path is a logical dump and restore:

```bash
# 1. Dump from source Postgres
pg_dump \
  -h source-host \
  -U source-user \
  -d source-database \
  --format=custom \
  --compress=9 \
  > source-backup.dump

# 2. Create the target database in NeuralDB
psql -h neuraldb-host -U neuraldb -c "CREATE DATABASE myapp;"

# 3. Restore into NeuralDB
pg_restore \
  -h neuraldb-host \
  -U neuraldb \
  -d myapp \
  --jobs=8 \
  --no-owner \
  source-backup.dump
```

### Adding Vector Columns Post-Migration

After restoring the schema and data, add vector columns and generate embeddings:

```sql
-- Add the vector column
ALTER TABLE documents ADD COLUMN embedding VECTOR(1536);

-- Create the index (do this before backfilling on large tables)
CREATE INDEX CONCURRENTLY documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
```

Then backfill embeddings in batches:

```python
import os

import openai
from neuraldb import NeuralDB

# Connection string read from the environment, as in the other examples in this guide
client = NeuralDB(os.environ["NEURALDB_URL"])
openai_client = openai.OpenAI()

BATCH_SIZE = 100

while True:
    # Fetch the next batch of rows that still lack an embedding
    rows = client.query("""
        SELECT id, content FROM documents
        WHERE embedding IS NULL
        LIMIT %s
    """, [BATCH_SIZE])

    if not rows:
        break

    texts = [row['content'] for row in rows]
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    # Pair each embedding with its row by position
    updates = [
        (item.embedding, row['id'])
        for item, row in zip(response.data, rows)
    ]

    client.executemany(
        "UPDATE documents SET embedding = %s WHERE id = %s",
        updates
    )
    print(f"Backfilled {len(rows)} rows")
```

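The backfill loop trusts that every returned embedding fits the `VECTOR(1536)` column. A minimal sketch of a guard that could sit between the embedding call and `executemany` follows. `build_updates` and `EXPECTED_DIM` are hypothetical names introduced here for illustration; they are not part of NeuralDB or the OpenAI SDK.

```python
EXPECTED_DIM = 1536  # assumption: must match the VECTOR(1536) column definition


def build_updates(rows, embeddings, expected_dim=EXPECTED_DIM):
    """Return (embedding, id) tuples, rejecting mismatched batches or dimensions."""
    if len(rows) != len(embeddings):
        raise ValueError(
            f"expected {len(rows)} embeddings, got {len(embeddings)}"
        )

    updates = []
    for row, embedding in zip(rows, embeddings):
        if len(embedding) != expected_dim:
            raise ValueError(
                f"row {row['id']}: embedding has {len(embedding)} dimensions, "
                f"expected {expected_dim}"
            )
        updates.append((list(embedding), row["id"]))
    return updates
```

In the loop this would stand in for the list comprehension, for example `updates = build_updates(rows, [item.embedding for item in response.data])`. Failing fast on a wrong dimension avoids writing a partial batch that the database would reject later.
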
## From PostgreSQL + pgvector

pgvector uses the same `VECTOR` type as NeuralDB. Migration is a direct dump and restore with minimal adjustments.

```bash
# Dump, excluding the pgvector extension (NeuralDB has native vector support).
# Note: --exclude-extension requires pg_dump from PostgreSQL 17 or later.
pg_dump \
  -h source-host -U source-user -d source-db \
  --format=custom \
  --exclude-extension=vector \
  > pgvector-backup.dump

# Restore into NeuralDB
pg_restore \
  -h neuraldb-host -U neuraldb -d myapp \
  --jobs=8 \
  pgvector-backup.dump
```

### Re-create HNSW Indexes

pgvector HNSW indexes are not transferred. Recreate them in NeuralDB:

```sql
-- Drop pgvector-created indexes
DROP INDEX IF EXISTS documents_embedding_idx;

-- Create NeuralDB HNSW index (same syntax, better performance)
CREATE INDEX CONCURRENTLY documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```

## From Pinecone

Pinecone stores vectors with metadata. Export using the Pinecone SDK and ingest into NeuralDB:

```python
import os

import pinecone
from neuraldb import NeuralDB, BulkIngestor

# Source: Pinecone
pc = pinecone.Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("my-index")

# Target: NeuralDB
client = NeuralDB(os.environ["NEURALDB_URL"])

# Create target table
client.execute("""
    CREATE TABLE IF NOT EXISTS pinecone_migration (
        id TEXT PRIMARY KEY,
        embedding VECTOR(1536),
        metadata JSONB,
        migrated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    )
""")

client.execute("""
    CREATE INDEX IF NOT EXISTS pinecone_migration_emb_idx
    ON pinecone_migration USING hnsw (embedding vector_cosine_ops)
""")

# Paginate through all Pinecone vectors.
# paginate_pinecone_ids is a helper you must supply (it is not part of the
# Pinecone SDK); it should yield lists of vector IDs.
ingestor = BulkIngestor(client, table="pinecone_migration", batch_size=500)

with ingestor as ing:
    for ids_batch in paginate_pinecone_ids(index, batch_size=1000):
        fetch_response = index.fetch(ids=ids_batch)

        for vector_id, vector_data in fetch_response.vectors.items():
            ing.add({
                "id": vector_id,
                "embedding": vector_data.values,
                "metadata": vector_data.metadata or {}
            })

print(f"Migrated {ingestor.total_inserted} vectors")
```

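The export loop calls `paginate_pinecone_ids`, which the guide does not define. One possible sketch is shown here. It assumes `index.list()` yields pages (lists) of vector IDs, which is how the Pinecone Python SDK behaves for serverless indexes; the `namespace` parameter and the regrouping logic are illustrative choices, not something the original guide specifies.

```python
from typing import Iterator, List, Optional


def paginate_pinecone_ids(
    index, batch_size: int = 1000, namespace: Optional[str] = None
) -> Iterator[List[str]]:
    """Yield lists of at most batch_size vector IDs from a Pinecone index.

    Assumes index.list(**kwargs) is a generator of ID pages. Pages are
    regrouped so that callers receive batches of a predictable size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    # Only pass a namespace when one was requested (default namespace otherwise)
    kwargs = {"namespace": namespace} if namespace else {}

    batch: List[str] = []
    for page in index.list(**kwargs):
        for vector_id in page:
            batch.append(vector_id)
            if len(batch) == batch_size:
                yield batch
                batch = []

    # Flush the final, possibly shorter, batch
    if batch:
        yield batch
```

If `index.fetch` rejects large requests, lowering `batch_size` is the first thing to try; check Pinecone's documented per-request limits for your index type.
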
### Mapping Pinecone Metadata to Columns

Flatten commonly queried metadata fields into dedicated columns for better query performance:

```python
# Keep the metadata JSONB column and derive typed, generated columns
# from it for common filter fields:
client.execute("""
    ALTER TABLE pinecone_migration
        ADD COLUMN IF NOT EXISTS category TEXT GENERATED ALWAYS AS (metadata->>'category') STORED,
        ADD COLUMN IF NOT EXISTS created_date DATE GENERATED ALWAYS AS ((metadata->>'date')::DATE) STORED;

    CREATE INDEX ON pinecone_migration (category);
    CREATE INDEX ON pinecone_migration (created_date);
""")
```

## From Weaviate

Export Weaviate data using the Weaviate client SDK:

```python
import os

import weaviate
from neuraldb import NeuralDB, BulkIngestor

weaviate_client = weaviate.connect_to_local()
neuraldb_client = NeuralDB(os.environ["NEURALDB_URL"])

collection = weaviate_client.collections.get("Document")

# Create target schema
neuraldb_client.execute("""
    CREATE TABLE weaviate_documents (
        id UUID PRIMARY KEY,
        content TEXT,
        category TEXT,
        source TEXT,
        embedding VECTOR(1536)
    );
    CREATE INDEX ON weaviate_documents USING hnsw (embedding vector_cosine_ops);
""")

ingestor = BulkIngestor(neuraldb_client, table="weaviate_documents", batch_size=500)

with ingestor as ing:
    # Stream every object in the collection, including its vector
    for item in collection.iterator(include_vector=True):
        ing.add({
            "id": str(item.uuid),
            "content": item.properties.get("content", ""),
            "category": item.properties.get("category"),
            "source": item.properties.get("source"),
            "embedding": item.vector.get("default"),
        })

weaviate_client.close()
print(f"Migrated {ingestor.total_inserted} objects")
```

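In the Weaviate export, `item.vector.get("default")` returns `None` when an object has no vector under that name, which would insert a NULL embedding. A small sketch of a mapping function that makes this case explicit follows. `to_row` is a hypothetical helper name, and skipping vectorless objects is an assumption about the desired behavior rather than something the guide prescribes.

```python
def to_row(uuid, properties, vectors, vector_name="default"):
    """Map one Weaviate object to a weaviate_documents row.

    Returns None when the object has no vector under vector_name, so the
    caller can decide whether to skip or log it.
    """
    embedding = (vectors or {}).get(vector_name)
    if not embedding:
        return None

    return {
        "id": str(uuid),
        "content": properties.get("content", ""),
        "category": properties.get("category"),
        "source": properties.get("source"),
        "embedding": list(embedding),
    }
```

Inside the iterator loop this could be used as `row = to_row(item.uuid, item.properties, item.vector)`, calling `ing.add(row)` only when `row` is not `None`. Keeping the mapping in a plain function also makes it testable without a running Weaviate instance.
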
## Verifying Migration

After any migration, verify data integrity:

```sql
-- Row count comparison
SELECT COUNT(*) FROM documents;

-- Sample vector similarity (should match source)
SELECT id, content, 1 - (embedding <=> (SELECT embedding FROM documents LIMIT 1)) AS sim
FROM documents
ORDER BY embedding <=> (SELECT embedding FROM documents LIMIT 1)
LIMIT 5;

-- Check for null embeddings
SELECT COUNT(*) FROM documents WHERE embedding IS NULL;

-- Index health
SELECT index_name, hnsw_in_memory, estimated_recall
FROM neuraldb_stat_vector_indexes;
```

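To check that sampled similarities "match source", it can help to recompute them outside the database and compare nearest-neighbor IDs between the two systems. The sketch below assumes `<=>` returns cosine distance, as the `1 - (...)` expression in the similarity query suggests. `cosine_similarity` and `top_k_overlap` are illustrative helper names, not NeuralDB APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k_overlap(source_ids, target_ids):
    """Fraction of the source system's top-k IDs also returned by the target."""
    if not source_ids:
        raise ValueError("source_ids must not be empty")
    return len(set(source_ids) & set(target_ids)) / len(source_ids)
```

A value of `top_k_overlap` well below 1.0 for the same query vector points at either missing rows or an approximate index returning different neighbors; comparing `cosine_similarity` on a few fetched vectors against the `sim` column helps tell the two apart.
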