mdcms/sample-sites/neuraldb-docs/pages/ops-monitoring.md
2026-05-18 14:30:49 +07:00

title: Monitoring
sort: 100
section-id: operations
keywords: monitoring, Prometheus, Grafana, metrics, alerts, observability, dashboards
description: Monitoring NeuralDB with Prometheus metrics, Grafana dashboards, and alert configuration
language: en

Monitoring

[Figure: NeuralDB Dashboard]

Observability is critical for database operations. NeuralDB exposes Prometheus-compatible metrics and provides an official Grafana dashboard for real-time monitoring.

Prometheus Metrics

Once the exporter is enabled, NeuralDB serves metrics at http://localhost:9187/metrics through the bundled exporter.

Enable the metrics exporter:

# neuraldb.conf
metrics.enabled = true
metrics.port = 9187
metrics.path = /metrics

Or run the standalone exporter:

neuraldb_exporter \
  --web.listen-address=:9187 \
  --db.uri="postgresql://monitor:password@localhost:5432/neuraldb?sslmode=disable"
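
To confirm that the exporter is serving data, the text exposition format can be inspected with a few lines of Python. The following is a minimal sketch, not part of NeuralDB: `parse_metrics` is a hypothetical helper, the sample payload is invented for illustration, and the parser only handles simple `name{labels} value` lines (no escaped quotes, no timestamps).

```python
import re

# Matches lines such as: neuraldb_connections_total{state="active"} 42
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into {(name, labels): value}.

    labels is a sorted tuple of (key, value) pairs. Comment lines
    (# HELP, # TYPE) and blank lines are skipped.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser cannot read
        name, label_text, value = match.groups()
        labels = tuple(sorted(LABEL_RE.findall(label_text or "")))
        samples[(name, labels)] = float(value)
    return samples


# Illustrative payload; real output would come from http://localhost:9187/metrics
SAMPLE = """\
# HELP neuraldb_connections_total Current connections by state
# TYPE neuraldb_connections_total gauge
neuraldb_connections_total{state="active"} 42
neuraldb_connections_total{state="idle"} 8
neuraldb_connections_max 100
"""

metrics = parse_metrics(SAMPLE)
active = metrics[("neuraldb_connections_total", (("state", "active"),))]
limit = metrics[("neuraldb_connections_max", ())]
print(f"{active / limit:.0%} of max connections in use")
```

In practice the payload would be fetched from the exporter endpoint rather than hard-coded; a maintained parser such as the one in the `prometheus_client` package is a better choice for production tooling.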

Key Metrics

Connection Metrics

| Metric | Type | Description |
|---|---|---|
| neuraldb_connections_total | Gauge | Current connections by state |
| neuraldb_connections_max | Gauge | max_connections setting |
| neuraldb_connection_pool_waiting | Gauge | Queries waiting for a connection |

Query Metrics

| Metric | Type | Description |
|---|---|---|
| neuraldb_queries_total | Counter | Total queries by database and status |
| neuraldb_query_duration_seconds | Histogram | Query duration buckets, used to derive percentiles (p50, p95, p99) |
| neuraldb_slow_queries_total | Counter | Queries exceeding log_min_duration_statement |
| neuraldb_deadlocks_total | Counter | Deadlocks detected |
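
A histogram metric does not store percentiles directly. It exports cumulative bucket counts, and a percentile is estimated from them at query time. The sketch below mirrors the linear interpolation that PromQL's `histogram_quantile()` applies inside the matching bucket. The bucket boundaries and counts are made up for illustration; the buckets NeuralDB actually exports may differ.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound
    and ending with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            span = count - lower_count
            if span == 0:
                return upper_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / span
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical buckets: (le, cumulative count of observations <= le seconds)
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(f"p{round(q * 100)}: {histogram_quantile(q, buckets):.4f}s")
```

With these sample buckets the p99 rank lands in the `+Inf` bucket, so the estimate is capped at the highest finite bound (1.0 s). An alert threshold at or above the largest finite bucket boundary can therefore never be exceeded, which is worth checking against the exporter's real bucket layout.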

Vector Metrics

| Metric | Type | Description |
|---|---|---|
| neuraldb_vector_queries_total | Counter | Vector similarity queries by index |
| neuraldb_vector_query_duration_seconds | Histogram | ANN query latency |
| neuraldb_hnsw_index_size_bytes | Gauge | In-memory size of HNSW graphs |
| neuraldb_hnsw_build_duration_seconds | Histogram | Time to build HNSW indexes |
| neuraldb_vector_recall_ratio | Gauge | Estimated recall for ANN queries |

Replication Metrics

| Metric | Type | Description |
|---|---|---|
| neuraldb_replication_lag_bytes | Gauge | WAL lag per replica |
| neuraldb_replication_lag_seconds | Gauge | Time lag per replica |
| neuraldb_wal_size_bytes | Gauge | Current WAL on-disk size |

Storage Metrics

| Metric | Type | Description |
|---|---|---|
| neuraldb_database_size_bytes | Gauge | Total database size |
| neuraldb_table_size_bytes | Gauge | Size per table |
| neuraldb_bloat_ratio | Gauge | Estimated dead row ratio |
| neuraldb_checkpoint_duration_seconds | Histogram | Checkpoint write time |

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'neuraldb'
    static_configs:
      - targets: ['localhost:9187']
    scrape_interval: 15s
    metrics_path: /metrics

Grafana Dashboard

Import the official NeuralDB dashboard from Grafana.com (Dashboard ID: 18921):

# Import via Grafana API
curl -X POST \
  http://admin:password@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d '{ "gnetId": 18921, "overwrite": true, "inputs": [{"name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus"}] }'

The dashboard includes panels for:

  • Query rate and error rate
  • Query latency percentiles (p50, p95, p99)
  • Active connections vs max connections
  • Vector index memory usage
  • Replication lag
  • Database and table sizes
  • Cache hit ratio
  • Checkpoint frequency

Alerting Rules

Create Prometheus alerting rules for critical conditions:

# neuraldb-alerts.yml
groups:
  - name: neuraldb
    rules:

      - alert: NeuralDBConnectionsHigh
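        # ignoring(state) lets the labelled gauge match neuraldb_connections_max,
        # assuming that metric carries no state label.
        expr: neuraldb_connections_total{state="active"} / ignoring(state) neuraldb_connections_max > 0.85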
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "NeuralDB connections above 85%"
          description: "{{ $value | humanizePercentage }} of max connections in use"

      - alert: NeuralDBConnectionsExhausted
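        # ignoring(state) lets the labelled gauge match neuraldb_connections_max,
        # assuming that metric carries no state label.
        expr: neuraldb_connections_total{state="active"} / ignoring(state) neuraldb_connections_max > 0.98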
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "NeuralDB connections nearly exhausted"

      - alert: NeuralDBHighQueryLatency
        expr: histogram_quantile(0.99, rate(neuraldb_query_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 query latency above 1 second"

      - alert: NeuralDBReplicationLagHigh
        expr: neuraldb_replication_lag_seconds > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag above 30 seconds"

      - alert: NeuralDBDiskSpaceHigh
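        # disk_total_bytes is not one of the NeuralDB metrics listed above. Substitute
        # your host-level capacity metric (for example node_exporter's
        # node_filesystem_size_bytes) and make sure the labels on both sides match.
        expr: (neuraldb_database_size_bytes / disk_total_bytes) > 0.80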
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database storage above 80% capacity"

      - alert: NeuralDBVectorBufferExhausted
        expr: neuraldb_hnsw_index_size_bytes > (neuraldb_vector_buffer_size_bytes * 0.90)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HNSW indexes using >90% of vector_buffer"
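
The `for:` clause in the rules above delays firing until the expression has stayed true continuously for the given duration; a single evaluation below the threshold resets the timer. The sketch below simulates that behaviour for a series of connection-usage ratios. It is an illustration only: `first_firing_index` is a hypothetical helper, and the 15-second evaluation interval is an assumption (set by `evaluation_interval` in your Prometheus configuration).

```python
def first_firing_index(ratios, threshold, for_seconds, eval_interval=15):
    """Return the index of the evaluation at which the alert fires, or None.

    ratios: one value per rule evaluation, oldest first.
    The alert becomes pending on the first evaluation above the threshold
    and fires once it has been pending for at least for_seconds.
    """
    pending_since = None
    for i, ratio in enumerate(ratios):
        if ratio > threshold:
            if pending_since is None:
                pending_since = i
            if (i - pending_since) * eval_interval >= for_seconds:
                return i
        else:
            pending_since = None  # condition cleared, timer resets
    return None


# Hypothetical usage ratios sampled every 15 seconds
ratios = [0.80, 0.99, 0.99, 0.70, 0.99, 0.99, 0.99]

# NeuralDBConnectionsExhausted: > 0.98 for 30s
print(first_firing_index(ratios, threshold=0.98, for_seconds=30))

# NeuralDBConnectionsHigh: > 0.85 for 2m
print(first_firing_index(ratios, threshold=0.85, for_seconds=120))
```

In this sample the dip to 0.70 resets the pending state, so the critical alert only fires on the last evaluation, and the warning alert never accumulates the two minutes it needs.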

Built-In Query Statistics

-- Top 10 slowest queries by mean execution time
SELECT query,
       calls,
       round(mean_exec_time::numeric, 2) AS avg_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(stddev_exec_time::numeric, 2) AS stddev_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

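-- Cache hit ratio as a percentage (should be >99%)
-- NULLIF avoids a division-by-zero error when no blocks have been read yet
SELECT
  sum(blks_hit) * 100.0 / NULLIF(sum(blks_hit + blks_read), 0) AS cache_hit_ratio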
FROM pg_stat_database
WHERE datname != 'template0';

-- Lock waits
SELECT pid, query, state, wait_event_type, wait_event, query_start
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
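
The cache hit ratio query reduces to simple arithmetic over two counters. As a worked example, assuming illustrative counter values rather than real measurements:

```python
def cache_hit_ratio(blks_hit, blks_read):
    """Return the cache hit ratio as a percentage, or None with no activity."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # nothing read yet, the ratio is undefined
    return blks_hit * 100.0 / total


# 995,000 blocks served from cache, 5,000 read from disk
print(cache_hit_ratio(995_000, 5_000))
```

A result below the 99% target means a larger share of reads is going to disk.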

Log-Based Alerting

Forward slow query logs to your SIEM or log aggregation system:

# neuraldb.conf
log_destination = 'jsonlog'
log_min_duration_statement = 500   # log queries slower than 500ms
log_line_prefix = '%t [%p] %u@%d '

Parse JSON logs in Loki or Elasticsearch and alert when the rate of slow queries exceeds a threshold.
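
The same idea can be prototyped locally before wiring up Loki or Elasticsearch. The sketch below counts slow queries per minute from JSON log lines. The field names (`timestamp`, `message`) and the `duration: <ms> ms` message format are assumptions modelled on PostgreSQL-style JSON logs; check them against the log output your NeuralDB version actually produces.

```python
import json
import re
from collections import Counter

DURATION_RE = re.compile(r"duration: ([0-9.]+) ms")


def slow_queries_per_minute(lines, threshold_ms=500.0):
    """Count log entries per minute whose duration exceeds threshold_ms.

    Assumes each line is a JSON object with a "timestamp" string that starts
    with "YYYY-MM-DD HH:MM" and a "message" containing "duration: <ms> ms".
    """
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        match = DURATION_RE.search(entry.get("message", ""))
        if match and float(match.group(1)) > threshold_ms:
            minute = entry.get("timestamp", "")[:16]  # truncate to the minute
            counts[minute] += 1
    return counts


# Hypothetical log lines for illustration
SAMPLE_LOG = [
    '{"timestamp": "2026-01-01 10:00:05.000 UTC", "message": "duration: 612.4 ms  statement: SELECT 1"}',
    '{"timestamp": "2026-01-01 10:00:40.000 UTC", "message": "checkpoint starting: time"}',
    '{"timestamp": "2026-01-01 10:00:59.000 UTC", "message": "duration: 1501.9 ms  statement: SELECT 2"}',
    '{"timestamp": "2026-01-01 10:01:10.000 UTC", "message": "duration: 750.0 ms  statement: SELECT 3"}',
    "not json",
]

counts = slow_queries_per_minute(SAMPLE_LOG)
# Flag minutes with more than one slow query (threshold chosen arbitrarily)
noisy_minutes = sorted(minute for minute, n in counts.items() if n > 1)
print(dict(counts), noisy_minutes)
```

The per-minute threshold here is arbitrary; a production rule would normally be expressed in the log platform's own query language and tuned to the workload's baseline.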