mdcms/sample-sites/neuraldb-docs/pages/ops-monitoring.md
2026-05-18 14:30:49 +07:00

---
title: Monitoring
sort: 100
section-id: operations
keywords: monitoring, Prometheus, Grafana, metrics, alerts, observability, dashboards
description: Monitoring NeuralDB with Prometheus metrics, Grafana dashboards, and alert configuration
language: en
---
# Monitoring
![NeuralDB Dashboard](assets/images/dashboard.jpg)
Observability is critical for database operations. NeuralDB exposes Prometheus-compatible metrics and provides an official Grafana dashboard for real-time monitoring.
## Prometheus Metrics
The bundled exporter serves metrics at `http://localhost:9187/metrics`.
Enable it in `neuraldb.conf`:
```ini
# neuraldb.conf
metrics.enabled = true
metrics.port = 9187
metrics.path = /metrics
```
Or run the standalone exporter:
```bash
neuraldb_exporter \
  --web.listen-address=:9187 \
  --db.uri="postgresql://monitor:password@localhost:5432/neuraldb?sslmode=disable"
```
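To confirm that the exporter is serving data, it can help to read the endpoint programmatically. The following is a minimal sketch, not an official client: it parses the Prometheus text exposition format with the standard library only. The helper names `parse_metrics` and `fetch_metrics` and the sample values are hypothetical, and the parser ignores less common features such as exemplars.
```python
import re
from typing import Dict, List, Tuple
from urllib.request import urlopen

# A sample line looks like: metric_name{label="value",...} 12.5
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Return (name, labels, value) for every sample line in the payload."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a sample line
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url: str = "http://localhost:9187/metrics") -> List[Sample]:
    """Fetch and parse the exporter endpoint (requires a running exporter)."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Made-up payload for illustration; real output will differ.
SAMPLE_PAYLOAD = """\
# HELP neuraldb_connections_max Configured max_connections
# TYPE neuraldb_connections_max gauge
neuraldb_connections_max 200
neuraldb_connections_total{state="active"} 42
neuraldb_connections_total{state="idle"} 17
"""

for name, labels, value in parse_metrics(SAMPLE_PAYLOAD):
    print(name, labels, value)
```
Calling `fetch_metrics()` against a live exporter should return the same structure, which makes it easy to check that a given metric name is present before wiring up dashboards.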
### Key Metrics
#### Connection Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `neuraldb_connections_total` | Gauge | Current connections by state |
| `neuraldb_connections_max` | Gauge | `max_connections` setting |
| `neuraldb_connection_pool_waiting` | Gauge | Queries waiting for a connection |
#### Query Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `neuraldb_queries_total` | Counter | Total queries by database and status |
| `neuraldb_query_duration_seconds` | Histogram | Query duration buckets; derive p50, p95, and p99 with `histogram_quantile` |
| `neuraldb_slow_queries_total` | Counter | Queries exceeding `log_min_duration_statement` |
| `neuraldb_deadlocks_total` | Counter | Deadlocks detected |
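Counters such as `neuraldb_queries_total` only ever increase (until a restart resets them), so they are normally read as a rate rather than as a raw value. The sketch below illustrates the idea with hypothetical helper names and made-up samples. It is a simplification: Prometheus's own `rate()` also extrapolates to the edges of the query window, which this version does not.
```python
from typing import List, Tuple

# Each sample is (unix_timestamp_seconds, counter_value).
Samples = List[Tuple[float, float]]


def counter_increase(samples: Samples) -> float:
    """Total increase across the samples, tolerating counter resets."""
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The value dropped, so the counter restarted from zero.
            increase += current
    return increase


def per_second_rate(samples: Samples) -> float:
    """Average per-second rate between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed


# Made-up scrapes at a 15s interval, with a restart before the third scrape.
scrapes = [(0.0, 100.0), (15.0, 130.0), (30.0, 10.0), (45.0, 40.0)]
print(counter_increase(scrapes))
print(per_second_rate(scrapes))
```
The reset handling is why alerting on the raw counter value is rarely useful: a restart would look like a sudden drop, whereas the computed rate stays meaningful.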
#### Vector Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `neuraldb_vector_queries_total` | Counter | Vector similarity queries by index |
| `neuraldb_vector_query_duration_seconds` | Histogram | ANN query latency |
| `neuraldb_hnsw_index_size_bytes` | Gauge | In-memory size of HNSW graphs |
| `neuraldb_hnsw_build_duration_seconds` | Histogram | Time to build HNSW indexes |
| `neuraldb_vector_recall_ratio` | Gauge | Estimated recall for ANN queries |
#### Replication Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `neuraldb_replication_lag_bytes` | Gauge | WAL lag per replica |
| `neuraldb_replication_lag_seconds` | Gauge | Time lag per replica |
| `neuraldb_wal_size_bytes` | Gauge | Current WAL on-disk size |
#### Storage Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `neuraldb_database_size_bytes` | Gauge | Total database size |
| `neuraldb_table_size_bytes` | Gauge | Size per table |
| `neuraldb_bloat_ratio` | Gauge | Estimated dead row ratio |
| `neuraldb_checkpoint_duration_seconds` | Histogram | Checkpoint write time |
## Prometheus Configuration
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'neuraldb'
    static_configs:
      - targets: ['localhost:9187']
    scrape_interval: 15s
    metrics_path: /metrics
```
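Once Prometheus has reloaded this configuration, the built-in `up` metric shows whether the scrape is succeeding. The sketch below queries the Prometheus HTTP API (`/api/v1/query`) for the `neuraldb` job. The Prometheus address `http://localhost:9090` and the helper names are assumptions, and the canned response only illustrates the expected shape.
```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(prometheus_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def down_instances(payload: Dict[str, Any]) -> List[str]:
    """Return the instances whose `up` sample is 0 in an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [
        series["metric"].get("instance", "")
        for series in payload["data"]["result"]
        if float(series["value"][1]) == 0.0  # value is [timestamp, "string"]
    ]


def check_scrape(prometheus_url: str = "http://localhost:9090") -> List[str]:
    """Query a live Prometheus server (assumed address) for failed scrapes."""
    url = build_query_url(prometheus_url, 'up{job="neuraldb"}')
    with urlopen(url, timeout=5) as response:
        return down_instances(json.load(response))


# Canned response for illustration only.
EXAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "neuraldb", "instance": "localhost:9187"},
             "value": [1779089449.0, "1"]},
        ],
    },
}
print(down_instances(EXAMPLE_RESPONSE))
```
An empty list means every target in the job answered its last scrape. An empty `result` array, by contrast, usually means the job name does not match the configuration.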
## Grafana Dashboard
Import the official NeuralDB dashboard from Grafana.com (Dashboard ID: **18921**):
```bash
# Import via Grafana API
curl -X POST \
  http://admin:password@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d '{ "gnetId": 18921, "overwrite": true, "inputs": [{"name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus"}] }'
```
The dashboard includes panels for:
- Query rate and error rate
- Query latency percentiles (p50, p95, p99)
- Active connections vs max connections
- Vector index memory usage
- Replication lag
- Database and table sizes
- Cache hit ratio
- Checkpoint frequency
## Alerting Rules
Create Prometheus alerting rules for critical conditions:
```yaml
# neuraldb-alerts.yml
groups:
  - name: neuraldb
    rules:
      - alert: NeuralDBConnectionsHigh
        # ignoring(state) lets the per-state series match the max gauge,
        # which is assumed to carry no `state` label
        expr: neuraldb_connections_total{state="active"} / ignoring(state) neuraldb_connections_max > 0.85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "NeuralDB connections above 85%"
          description: "{{ $value | humanizePercentage }} of max connections in use"
      - alert: NeuralDBConnectionsExhausted
        expr: neuraldb_connections_total{state="active"} / ignoring(state) neuraldb_connections_max > 0.98
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "NeuralDB connections nearly exhausted"
      - alert: NeuralDBHighQueryLatency
        expr: histogram_quantile(0.99, rate(neuraldb_query_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 query latency above 1 second"
      - alert: NeuralDBReplicationLagHigh
        expr: neuraldb_replication_lag_seconds > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag above 30 seconds"
      - alert: NeuralDBDiskSpaceHigh
        # disk_total_bytes is not a NeuralDB metric; substitute the
        # filesystem-size metric exposed by your host monitoring
        expr: (neuraldb_database_size_bytes / disk_total_bytes) > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database storage above 80% capacity"
      - alert: NeuralDBVectorBufferExhausted
        expr: neuraldb_hnsw_index_size_bytes > (neuraldb_vector_buffer_size_bytes * 0.90)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HNSW indexes using >90% of vector_buffer"
```
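The `NeuralDBHighQueryLatency` rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from individual query timings. The sketch below reproduces the core of that estimate (linear interpolation inside the bucket that contains the target rank) so the alert threshold is easier to reason about. It is an illustration, not Prometheus's implementation, and the bucket boundaries and counts are made up.
```python
import math
from typing import List, Tuple

# Each bucket is (upper_bound_seconds, cumulative_count); the last is +Inf.
Buckets = List[Tuple[float, float]]


def histogram_quantile(q: float, buckets: Buckets) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan  # need at least one finite bucket plus +Inf
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the highest finite boundary.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Made-up distribution: 1000 queries, most of them under 100 ms.
example = [(0.1, 900.0), (0.5, 980.0), (1.0, 995.0), (math.inf, 1000.0)]
print(histogram_quantile(0.99, example))
```
Because the estimate can never exceed the highest finite bucket boundary, a threshold such as `> 1.0` only fires reliably if the histogram defines buckets above one second.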
## Built-In Query Statistics
For ad hoc diagnosis, query the statistics views directly:
```sql
-- Top 10 slowest queries by mean execution time
SELECT query,
       calls,
       round(mean_exec_time::numeric, 2)   AS avg_ms,
       round(total_exec_time::numeric, 2)  AS total_ms,
       round(stddev_exec_time::numeric, 2) AS stddev_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Cache hit ratio (should be >99%); NULLIF avoids division by zero
-- when no blocks have been read yet
SELECT
    sum(blks_hit) * 100.0 / NULLIF(sum(blks_hit + blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname != 'template0';

-- Sessions currently waiting on a lock, oldest first
SELECT pid, query, state, wait_event_type, wait_event, query_start
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
```
## Log-Based Alerting
Write slow-query logs as JSON so that your SIEM or log aggregation system can ingest them:
```ini
# neuraldb.conf
log_destination = 'jsonlog'
log_min_duration_statement = 500 # log queries slower than 500ms
log_line_prefix = '%t [%p] %u@%d '
```
Parse JSON logs in Loki or Elasticsearch and alert when the rate of slow queries exceeds a threshold.
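The same rate check can be prototyped outside a log platform. The sketch below counts slow queries per minute from JSON log lines. It assumes PostgreSQL-style `jsonlog` entries, with a `timestamp` field and a `message` of the form `duration: <n> ms  statement: ...`; verify both against your actual NeuralDB log output. The function names, sample lines, and threshold are hypothetical.
```python
import json
import re
from collections import Counter
from typing import Iterable, List

# Assumed message format: "duration: 812.4 ms  statement: SELECT ..."
_DURATION_RE = re.compile(r"duration: ([0-9]+(?:\.[0-9]+)?) ms")


def slow_queries_per_minute(lines: Iterable[str], min_ms: float = 500.0) -> Counter:
    """Count log entries at or above min_ms, keyed by 'YYYY-MM-DD HH:MM'."""
    per_minute: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        match = _DURATION_RE.search(str(entry.get("message", "")))
        if match is None or float(match.group(1)) < min_ms:
            continue
        # The first 16 characters of the timestamp identify the minute.
        per_minute[str(entry.get("timestamp", ""))[:16]] += 1
    return per_minute


def minutes_over_threshold(per_minute: Counter, threshold: int) -> List[str]:
    """Return the minutes whose slow-query count exceeds the threshold."""
    return sorted(minute for minute, count in per_minute.items() if count > threshold)


# Made-up log lines for illustration.
LOG_LINES = [
    '{"timestamp": "2026-05-18 14:30:01.120 UTC", "message": "duration: 812.4 ms  statement: SELECT 1"}',
    '{"timestamp": "2026-05-18 14:30:40.000 UTC", "message": "duration: 650.0 ms  statement: SELECT 2"}',
    '{"timestamp": "2026-05-18 14:31:05.000 UTC", "message": "duration: 120.0 ms  statement: SELECT 3"}',
    'not json',
    '{"timestamp": "2026-05-18 14:31:09.000 UTC", "message": "connection authorized"}',
]

counts = slow_queries_per_minute(LOG_LINES)
print(minutes_over_threshold(counts, threshold=1))
```
In production the equivalent query would run inside Loki or Elasticsearch; the script is mainly useful for picking a sensible threshold from historical logs.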