mirror of https://github.com/kbenestad/mdcms.git synced 2026-06-18 15:24:32 +00:00

kbenestad 59efc20dde Updated sample-sites.

2026-05-18 14:30:49 +07:00

6.3 KiB

title	sort	section-id	keywords	description	language
Monitoring	100	operations	monitoring, Prometheus, Grafana, metrics, alerts, observability, dashboards	Monitoring NeuralDB with Prometheus metrics, Grafana dashboards, and alert configuration	en

Monitoring

Observability is critical for database operations. NeuralDB exposes Prometheus-compatible metrics and provides an official Grafana dashboard for real-time monitoring.

Prometheus Metrics

The bundled exporter serves NeuralDB metrics at http://localhost:9187/metrics by default.

Enable the metrics exporter:

# neuraldb.conf
metrics.enabled = true
metrics.port = 9187
metrics.path = /metrics

Or run the standalone exporter:

neuraldb_exporter \
  --web.listen-address=:9187 \
  --db.uri="postgresql://monitor:password@localhost:5432/neuraldb?sslmode=disable"
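Once the exporter is running, a quick way to confirm it works is to parse what the endpoint returns. The following is a minimal sketch, not an official client: it handles only simple sample lines of the Prometheus text exposition format, and the scrape output in `SAMPLE_SCRAPE` is invented for illustration.

```python
import re

# Matches one sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 123.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical scrape output; the values are invented for illustration.
SAMPLE_SCRAPE = """\
# HELP neuraldb_connections_total Current connections by state
# TYPE neuraldb_connections_total gauge
neuraldb_connections_total{state="active"} 42
neuraldb_connections_total{state="idle"} 8
neuraldb_connections_max 100
"""

samples = parse_metrics(SAMPLE_SCRAPE)
names = {name for name, _, _ in samples}
```

To check a live exporter, the text could be fetched with `urllib.request.urlopen("http://localhost:9187/metrics").read().decode()` and passed to `parse_metrics`.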

Key Metrics

Connection Metrics

Metric	Type	Description
`neuraldb_connections_total`	Gauge	Current connections by state
`neuraldb_connections_max`	Gauge	`max_connections` setting
`neuraldb_connection_pool_waiting`	Gauge	Queries waiting for a connection

Query Metrics

Metric	Type	Description
`neuraldb_queries_total`	Counter	Total queries by database and status
`neuraldb_query_duration_seconds`	Histogram	Query duration in seconds; derive p50, p95, and p99 with `histogram_quantile()`
`neuraldb_slow_queries_total`	Counter	Queries exceeding `log_min_duration_statement`
`neuraldb_deadlocks_total`	Counter	Deadlocks detected
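Histogram metrics such as `neuraldb_query_duration_seconds` expose cumulative bucket counts, and percentiles are estimated from those buckets at query time. The sketch below shows the linear-interpolation idea behind PromQL's `histogram_quantile()`. It is a simplified illustration, and the bucket boundaries and counts in `BUCKETS` are hypothetical, not NeuralDB's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets: (upper bound in seconds, observations <= bound).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
p999 = histogram_quantile(0.999, BUCKETS)
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets cover the latency range of interest. A rank that lands in the `+Inf` bucket can only be reported as the highest finite bound.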

Vector Metrics

Metric	Type	Description
`neuraldb_vector_queries_total`	Counter	Vector similarity queries by index
`neuraldb_vector_query_duration_seconds`	Histogram	ANN query latency
`neuraldb_hnsw_index_size_bytes`	Gauge	In-memory size of HNSW graphs
`neuraldb_hnsw_build_duration_seconds`	Histogram	Time to build HNSW indexes
`neuraldb_vector_recall_ratio`	Gauge	Estimated recall for ANN queries

Replication Metrics

Metric	Type	Description
`neuraldb_replication_lag_bytes`	Gauge	WAL lag per replica
`neuraldb_replication_lag_seconds`	Gauge	Time lag per replica
`neuraldb_wal_size_bytes`	Gauge	Current WAL on-disk size

Storage Metrics

Metric	Type	Description
`neuraldb_database_size_bytes`	Gauge	Total database size
`neuraldb_table_size_bytes`	Gauge	Size per table
`neuraldb_bloat_ratio`	Gauge	Estimated dead row ratio
`neuraldb_checkpoint_duration_seconds`	Histogram	Checkpoint write time

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'neuraldb'
    static_configs:
      - targets: ['localhost:9187']
    scrape_interval: 15s
    metrics_path: /metrics

Grafana Dashboard

Import the official NeuralDB dashboard from Grafana.com (Dashboard ID: 18921):

# Import via Grafana API
curl -X POST \
  http://admin:password@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d '{ "gnetId": 18921, "overwrite": true, "inputs": [{"name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus"}] }'

The dashboard includes panels for:

Query rate and error rate
Query latency percentiles (p50, p95, p99)
Active connections vs max connections
Vector index memory usage
Replication lag
Database and table sizes
Cache hit ratio
Checkpoint frequency

Alerting Rules

Create Prometheus alerting rules for critical conditions:

# neuraldb-alerts.yml
groups:
  - name: neuraldb
    rules:

      - alert: NeuralDBConnectionsHigh
        expr: neuraldb_connections_total{state="active"} / ignoring(state) neuraldb_connections_max > 0.85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "NeuralDB connections above 85%"
          description: "{{ $value | humanizePercentage }} of max connections in use"

      - alert: NeuralDBConnectionsExhausted
        expr: neuraldb_connections_total{state="active"} / ignoring(state) neuraldb_connections_max > 0.98
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "NeuralDB connections nearly exhausted"

      - alert: NeuralDBHighQueryLatency
        expr: histogram_quantile(0.99, rate(neuraldb_query_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 query latency above 1 second"

      - alert: NeuralDBReplicationLagHigh
        expr: neuraldb_replication_lag_seconds > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag above 30 seconds"

      - alert: NeuralDBDiskSpaceHigh
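        # disk_total_bytes is not among the NeuralDB metrics listed above; supply it from a host-level exporter.
        expr: (neuraldb_database_size_bytes / disk_total_bytes) > 0.80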
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database storage above 80% capacity"

      - alert: NeuralDBVectorBufferExhausted
        expr: neuraldb_hnsw_index_size_bytes > (neuraldb_vector_buffer_size_bytes * 0.90)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HNSW indexes using >90% of vector_buffer"

Built-In Query Statistics

-- Top 10 slowest queries by mean execution time (ms)
SELECT query,
       calls,
       round(mean_exec_time::numeric, 2) AS avg_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(stddev_exec_time::numeric, 2) AS stddev_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Cache hit ratio (should be >99%)
SELECT
  sum(blks_hit) * 100.0 / NULLIF(sum(blks_hit + blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname != 'template0';

-- Lock waits
SELECT pid, query, state, wait_event_type, wait_event, query_start
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
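The cache hit ratio query reduces to simple arithmetic over the `blks_hit` and `blks_read` counters. The sketch below reproduces it outside the database, including a guard for a server that has not read any blocks yet. The per-database counter values in `ROWS` are hypothetical.

```python
def cache_hit_ratio(rows):
    """Return the cache hit ratio in percent, or None when no blocks were accessed.

    rows: iterable of (blks_hit, blks_read) pairs, one per database.
    """
    rows = list(rows)
    hits = sum(hit for hit, _ in rows)
    reads = sum(read for _, read in rows)
    total = hits + reads
    if total == 0:
        return None  # avoid dividing by zero on a freshly started server
    return hits * 100.0 / total


# Hypothetical counters for two databases.
ROWS = [(990_000, 5_000), (9_000, 0)]

ratio = cache_hit_ratio(ROWS)
```

With these numbers the ratio is roughly 99.5%, just above the 99% target mentioned in the query comment.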

Log-Based Alerting

Write slow queries to a structured log that your SIEM or log aggregation system can ingest:

# neuraldb.conf
log_destination = 'jsonlog'
log_min_duration_statement = 500   # log queries slower than 500ms
log_line_prefix = '%t [%p] %u@%d '

Parse JSON logs in Loki or Elasticsearch and alert when the rate of slow queries exceeds a threshold.
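The rate-based check described above can be prototyped before wiring it into a log pipeline. The following is a minimal sketch under stated assumptions: each log line is a JSON object with an ISO 8601 `timestamp` field and a `message` field containing `duration: <n> ms`. Those field names and formats, and the log lines in `SAMPLE_LOGS`, are assumptions for illustration and should be checked against the actual NeuralDB log output.

```python
import json
import re
from collections import Counter
from datetime import datetime

DURATION_RE = re.compile(r"duration: ([0-9.]+) ms")


def slow_queries_per_minute(lines, min_ms=500.0):
    """Count log entries whose reported duration is at least min_ms, per minute."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        match = DURATION_RE.search(entry.get("message", ""))
        if match is None or float(match.group(1)) < min_ms:
            continue  # not a duration entry, or faster than the cutoff
        ts = datetime.fromisoformat(entry["timestamp"])
        counts[ts.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


def minutes_over_threshold(counts, max_per_minute):
    """Return the minutes whose slow-query count exceeds max_per_minute."""
    return sorted(minute for minute, n in counts.items() if n > max_per_minute)


# Hypothetical log lines; field names and message format are assumptions.
SAMPLE_LOGS = [
    '{"timestamp": "2026-05-18T14:30:05+00:00", "message": "duration: 612.4 ms  statement: SELECT 1"}',
    '{"timestamp": "2026-05-18T14:30:40+00:00", "message": "duration: 1540.0 ms  statement: SELECT 2"}',
    '{"timestamp": "2026-05-18T14:31:02+00:00", "message": "duration: 503.9 ms  statement: SELECT 3"}',
    'not json',
    '{"timestamp": "2026-05-18T14:31:30+00:00", "message": "connection authorized"}',
]

counts = slow_queries_per_minute(SAMPLE_LOGS)
alerts = minutes_over_threshold(counts, max_per_minute=1)
```

In Loki or Elasticsearch the same logic would normally be expressed as a query plus an alert rule. The sketch is mainly useful for validating the parsing and the threshold against a sample of real logs.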
