mdcms/sample-sites/neuraldb-docs/pages/nql-aggregations.md
2026-05-18 14:30:49 +07:00

title: Aggregations
sort: 130
section-id: query-language
keywords: aggregations, GROUP BY, COUNT, SUM, vectors, AVG, centroid, analytics
description: Aggregating data in NQL including GROUP BY, COUNT, SUM, and vector-specific aggregation functions
language: en

Aggregations

NQL supports the full SQL aggregation toolkit, extended with vector-specific aggregate functions for centroid computation, clustering, and semantic analytics.

Standard Aggregations

All standard SQL aggregate functions work as expected:

-- Count documents by category
SELECT category, COUNT(*) AS doc_count
FROM documents
GROUP BY category
ORDER BY doc_count DESC;

-- Price statistics and inventory value by category (available products only)
SELECT category,
       COUNT(*) AS products,
       AVG(price) AS avg_price,
       MIN(price) AS min_price,
       MAX(price) AS max_price,
       SUM(stock * price) AS inventory_value
FROM products
WHERE available = true
GROUP BY category
ORDER BY inventory_value DESC;

Vector Aggregations

AVG(embedding) — Centroid Computation

Compute the centroid (average vector) of a group:

-- Centroid of all "technology" documents
SELECT AVG(embedding) AS centroid
FROM documents
WHERE category = 'technology';
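
For readers who want to see the arithmetic, the centroid of a group is the element-wise mean of its vectors. The following is a minimal Python sketch of that definition, not NQL's implementation; the function name `centroid` is hypothetical, and returning `None` for an empty group mirrors standard SQL `AVG` behaviour, which is an assumption about NQL.

```python
def centroid(vectors):
    """Element-wise mean of equal-length vectors; None for an empty group."""
    if not vectors:
        return None  # assumption: like SQL AVG over zero rows
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    # Average each coordinate independently across the group.
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


print(centroid([[1.0, 0.0], [3.0, 4.0]]))  # [2.0, 2.0]
```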

Use centroids to find documents representative of a cluster:

WITH centroid AS (
  SELECT AVG(embedding) AS c FROM documents WHERE category = 'technology'
)
SELECT id, title, 1 - (embedding <=> centroid.c) AS similarity_to_centroid
FROM documents, centroid
WHERE category = 'technology'
ORDER BY embedding <=> centroid.c
LIMIT 10;

vector_centroid(embedding) — Weighted Centroid

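Compute a weighted centroid by passing a score column as the second argument, as in vector_centroid(embedding, weight); rows with larger weights pull the centroid toward them: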

-- Weighted centroid by rating (higher-rated items pull more)
SELECT vector_centroid(embedding, rating) AS weighted_centroid
FROM products
WHERE category = 'electronics';
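
One common definition of a weighted centroid is the weighted sum of the vectors divided by the sum of the weights. The Python sketch below assumes that definition; the source does not spell out how `vector_centroid` normalises weights or treats zero or negative weights, so the function name `weighted_centroid` and its edge-case handling are illustrative assumptions.

```python
def weighted_centroid(vectors, weights):
    """Sum of w_i * v_i divided by sum of w_i (assumed definition)."""
    if not vectors or len(vectors) != len(weights):
        return None
    total = sum(weights)
    if total == 0:
        return None  # assumption: undefined when the weights sum to zero
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# The item rated 3.0 pulls the result three times as hard as the item rated 1.0.
print(weighted_centroid([[0.0, 0.0], [4.0, 8.0]], [1.0, 3.0]))  # [3.0, 6.0]
```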

vector_agg_concat(embedding) — Vector Array

Collect vectors into an array for downstream processing:

SELECT category, vector_agg_concat(embedding) AS all_embeddings
FROM documents
GROUP BY category;

To find the best-matching document in each category for a query vector $1, use DISTINCT ON, which keeps the first row of each category in ORDER BY order:

SELECT DISTINCT ON (category)
  id, category, title, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE embedding IS NOT NULL
ORDER BY category, embedding <=> $1;

Or use a lateral join for more control, for example to raise LIMIT and return the top N per category:

SELECT cat.category, top_doc.id, top_doc.title, top_doc.similarity
FROM (SELECT DISTINCT category FROM documents) cat,
LATERAL (
  SELECT id, title, 1 - (embedding <=> $1) AS similarity
  FROM documents
  WHERE category = cat.category
  ORDER BY embedding <=> $1
  LIMIT 1
) top_doc;
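
Both queries above reduce to "keep the row with the highest similarity in each group". A small Python sketch of that reduction, assuming the similarity scores have already been computed (the function name and the tuple layout are hypothetical):

```python
def best_per_category(rows):
    """rows: iterable of (doc_id, category, similarity).

    Returns {category: (doc_id, similarity)} for the top row in each category.
    """
    best = {}
    for doc_id, category, similarity in rows:
        current = best.get(category)
        # Keep the first row seen for a category, then replace it only
        # when a strictly better match turns up.
        if current is None or similarity > current[1]:
            best[category] = (doc_id, similarity)
    return best


rows = [(1, "tech", 0.90), (2, "tech", 0.95), (3, "news", 0.40)]
print(best_per_category(rows))  # {'tech': (2, 0.95), 'news': (3, 0.4)}
```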

Window Functions

Use window functions to rank results within partitions:

-- Rank documents by similarity within each category
SELECT
  id, title, category,
  1 - (embedding <=> $1) AS similarity,
  RANK() OVER (
    PARTITION BY category
    ORDER BY embedding <=> $1
  ) AS rank_in_category
FROM documents
WHERE 1 - (embedding <=> $1) > 0.5
ORDER BY category, rank_in_category;
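
RANK() gives tied rows the same rank and then skips ahead, so two documents at equal distance both get rank 1 and the next one gets rank 3. The Python sketch below reproduces that standard SQL behaviour on precomputed distances; the function name and row layout are illustrative only.

```python
from collections import defaultdict


def rank_within_category(rows):
    """rows: list of (doc_id, category, distance). Returns {doc_id: rank}."""
    by_category = defaultdict(list)
    for doc_id, category, distance in rows:
        by_category[category].append((distance, doc_id))

    ranks = {}
    for members in by_category.values():
        members.sort(key=lambda m: m[0])  # smallest distance first
        prev_distance, prev_rank = None, 0
        for position, (distance, doc_id) in enumerate(members, start=1):
            # Ties share a rank; the next distinct value takes its row position.
            rank = prev_rank if distance == prev_distance else position
            ranks[doc_id] = rank
            prev_distance, prev_rank = distance, rank
    return ranks


rows = [("a", "x", 0.1), ("b", "x", 0.1), ("c", "x", 0.3), ("d", "y", 0.2)]
print(rank_within_category(rows))  # {'a': 1, 'b': 1, 'c': 3, 'd': 1}
```

If gaps after ties are not wanted, DENSE_RANK() or ROW_NUMBER() are the usual SQL alternatives.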

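Rolling 7-day average of daily similarity. The frame counts rows rather than calendar days, so days with no documents are skipped instead of being counted as empty: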

SELECT
  date_trunc('day', created_at) AS day,
  AVG(1 - (embedding <=> $1)) AS avg_daily_similarity,
  AVG(AVG(1 - (embedding <=> $1))) OVER (
    ORDER BY date_trunc('day', created_at)
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7d_avg
FROM documents
GROUP BY day
ORDER BY day;
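
The frame ROWS BETWEEN 6 PRECEDING AND CURRENT ROW covers the current row plus up to six earlier rows, so the first few rows are averaged over fewer than seven values. A minimal Python sketch of that frame, applied to a list of daily averages already in date order (the function name is hypothetical):

```python
def rolling_mean(daily_values, window=7):
    """Mean of each value and up to window - 1 preceding values."""
    result = []
    for i in range(len(daily_values)):
        # The frame is truncated at the start of the series.
        frame = daily_values[max(0, i - window + 1): i + 1]
        result.append(sum(frame) / len(frame))
    return result


print(rolling_mean([1.0, 2.0, 3.0, 4.0], window=2))  # [1.0, 1.5, 2.5, 3.5]
```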

Clustering with GROUP BY

Perform the assignment step of k-means-style clustering by mapping each document to its nearest centroid:

-- Given pre-computed centroids in a centroids table:
SELECT d.id, d.content,
       c.cluster_id,
       (d.embedding <=> c.centroid) AS distance_to_centroid
FROM documents d
CROSS JOIN LATERAL (
  SELECT cluster_id, centroid
  FROM centroids
  ORDER BY d.embedding <=> centroid
  LIMIT 1
) c;
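
The query above is the assignment half of a k-means iteration; the update half would recompute each centroid from its assigned documents. The Python sketch below shows only the assignment step and assumes that `<=>` is cosine distance, which the `1 - (a <=> b)` similarity expressions on this page suggest but do not state. All names are hypothetical, and zero-length vectors are not handled.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def assign_clusters(documents, centroids):
    """documents: {doc_id: embedding}; centroids: {cluster_id: centroid}.

    Returns {doc_id: (cluster_id, distance_to_centroid)}.
    """
    assignments = {}
    for doc_id, embedding in documents.items():
        # Equivalent of ORDER BY d.embedding <=> centroid LIMIT 1.
        cluster_id, centroid = min(
            centroids.items(),
            key=lambda item: cosine_distance(embedding, item[1]),
        )
        assignments[doc_id] = (cluster_id, cosine_distance(embedding, centroid))
    return assignments


docs = {"d1": [1.0, 0.1], "d2": [0.1, 1.0]}
cents = {0: [1.0, 0.0], 1: [0.0, 1.0]}
print({k: v[0] for k, v in assign_clusters(docs, cents).items()})  # {'d1': 0, 'd2': 1}
```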

HAVING with Vector Conditions

-- Cohesion (mean similarity to the category centroid) for categories with
-- more than 10 documents; the tightest clusters sort first
SELECT category,
       COUNT(*) AS doc_count,
       1 - AVG(embedding <=> (SELECT AVG(e2.embedding) FROM documents e2 WHERE e2.category = e.category)) AS cohesion
FROM documents e
GROUP BY category
HAVING COUNT(*) > 10
ORDER BY cohesion DESC;

Time-Series Analytics

Analyse how semantic content shifts over time:

-- Weekly semantic drift: how similar is each week's centroid to the previous week's?
-- The first week has no predecessor, so its similarity is NULL.
WITH weekly_centroids AS (
  SELECT
    date_trunc('week', created_at) AS week,
    AVG(embedding) AS centroid
  FROM documents
  GROUP BY week
)
SELECT
  w1.week,
  1 - (w1.centroid <=> w2.centroid) AS similarity_to_prev_week
FROM weekly_centroids w1
LEFT JOIN weekly_centroids w2
  ON w2.week = w1.week - INTERVAL '1 week'
ORDER BY w1.week;

JSON Aggregation with Vectors

Combine JSON aggregation with vector results:

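-- Rank rows in a subquery first: standard SQL does not allow a window
-- function inside an aggregate FILTER clause.
SELECT
  category,
  COUNT(*) AS total,
  AVG(price) AS avg_price,
  JSON_AGG(
    JSON_BUILD_OBJECT('id', id, 'name', name, 'similarity', similarity)
    ORDER BY rn
  ) FILTER (WHERE rn <= 3) AS top_3_per_category
FROM (
  SELECT
    id, name, category, price,
    1 - (embedding <=> $1) AS similarity,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY embedding <=> $1) AS rn
  FROM products
  WHERE available = true
) ranked
GROUP BY category;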

ROLLUP and CUBE

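Standard SQL ROLLUP and CUBE work for hierarchical aggregation. ROLLUP(region, category) returns one row per (region, category) pair, a subtotal row per region (category is NULL), and a grand-total row (both NULL):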

SELECT
  region,
  category,
  COUNT(*) AS count,
  AVG(price) AS avg_price
FROM products
GROUP BY ROLLUP(region, category)
ORDER BY region NULLS LAST, category NULLS LAST;
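
To make the grouping levels concrete, the Python sketch below counts rows the way ROLLUP(region, category) does in standard SQL, with `None` standing in for the NULL that marks a subtotal level. The function name and sample data are illustrative. CUBE would additionally produce per-category totals across all regions, which this sketch omits.

```python
from collections import defaultdict


def rollup_counts(rows):
    """rows: iterable of (region, category). Returns {(region, category): count}."""
    counts = defaultdict(int)
    for region, category in rows:
        counts[(region, category)] += 1  # detail level
        counts[(region, None)] += 1      # per-region subtotal
        counts[(None, None)] += 1        # grand total
    return dict(counts)


rows = [("EU", "books"), ("EU", "toys"), ("US", "books")]
for key, count in rollup_counts(rows).items():
    print(key, count)
```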