fix(grafana): RC6 demo QA pass — Dashboard 3 query text, Dashboard 6 title TODO, ASH legend dedup, default time range, version banner

Summary

Five Grafana QA bugs from the live rc.6 demo, fixed via strict red/green TDD. Each bug has its own RED test commit followed by a GREEN fix commit so the test history shows the intent.

All tests live under tests/grafana_dashboards/ and run on every MR via the new quality:grafana-dashboards-lint job (wired into quality/gitlab-ci-quality.yml, which is already included by .gitlab-ci.yml). They only parse files in the repository (mostly dashboard JSON) and complete in ~0.1 s.

Run locally:

python3 -m pytest tests/grafana_dashboards/ -v

(184 tests, all green on this branch.)

CI wire-up assertion: grep "tests/grafana_dashboards" quality/gitlab-ci-quality.yml returns the python -m pytest tests/grafana_dashboards -v --tb=short line.

Bugs fixed

Bug 1 — Dashboard 3 first panel "No data" silent failure

The query-text panel at the top of Dashboard_3_Single_query_analysis.json (panel id=18) ran raw SQL against PGWatch-PostgreSQL without selecting a database. When the provisioned data source has no default database, Grafana renders "No data" with only a tiny pink/magenta warning triangle in the panel header and the tooltip "You do not currently have a default database configured for this data source...".

Fix: pin the database via ${db_name} and match the data->>'real_dbname' = %s predicate the monitoring-flask-backend API uses (keeps the lookup partition-local through pgss_queryid_queries_real_dbname_time_idx).

Test: tests/grafana_dashboards/test_postgres_panels_pin_database.py walks every panel target with datasource.type == "postgres" and asserts the rawSql references ${db_name} (or pins an explicit database).
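
A minimal sketch of the kind of walk this lint performs, not the actual test file. The helper names (iter_panels, unpinned_postgres_targets) are hypothetical, the fallback to the panel-level datasource is an assumption, and the "explicit database" escape hatch mentioned above is left out for brevity.

```python
from typing import Iterator


def iter_panels(panels: list) -> Iterator[dict]:
    """Yield every panel, descending into row panels that nest child panels."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def unpinned_postgres_targets(dashboard: dict) -> list:
    """Return (panel id, refId) for postgres targets whose rawSql lacks ${db_name}."""
    offenders = []
    for panel in iter_panels(dashboard.get("panels", [])):
        for target in panel.get("targets", []):
            # Assumption: fall back to the panel datasource when the target has none.
            ds = target.get("datasource") or panel.get("datasource") or {}
            if not isinstance(ds, dict) or ds.get("type") != "postgres":
                continue
            if "${db_name}" not in (target.get("rawSql") or ""):
                offenders.append((panel.get("id"), target.get("refId")))
    return offenders
```

A test could load each dashboard with json.load and assert that unpinned_postgres_targets returns an empty list.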

Bug 2 — Dashboard 6 title leaked a TODO note

Title was 06. Replication and HA -- "Metrics are collected (part of health check); dashboard – TODO". Cleaned to 06. Replication and HA.

Test: tests/grafana_dashboards/test_titles_no_todo_markers.py rejects titles containing TODO, WIP, XXX, FIXME, or an inline -- aside.
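
The title rule can be sketched roughly as below. This is an illustration only: title_problems is a hypothetical helper, and matching the markers case-sensitively as whole words is an assumption about how the real test behaves.

```python
import re

# Markers named in the rule above; whole-word, case-sensitive matching is an assumption.
MARKER_RE = re.compile(r"\b(TODO|WIP|XXX|FIXME)\b")
# An inline aside: a double hyphen with whitespace on both sides.
ASIDE_RE = re.compile(r"\s--\s")


def title_problems(title: str) -> list:
    """Return a list of human-readable problems found in a dashboard title."""
    problems = []
    marker = MARKER_RE.search(title)
    if marker:
        problems.append(f"marker {marker.group(1)}")
    if ASIDE_RE.search(title):
        problems.append("inline -- aside")
    return problems
```

Under these assumptions the cleaned title "06. Replication and HA" yields no problems, while the original rc.6 title trips both checks.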

Bug 3 — Dashboard 1 ASH legend duplicates (and same anti-pattern in Dashboards 3 + 4)

The "Active session history" panel carried a fourth target

sum by (wait_event_type) (pgwatch_wait_events_total)>0

with no label selector, overlaying three filtered targets on the same metric. The unfiltered target produced a superset, doubling every series in the legend (matching colour swatches, identical min/max/mean).

The JSON lint immediately surfaced the same anti-pattern in Dashboard 3 panel id=19 and Dashboard 4 panels id=1 and id=2 — identical root cause, so all four were removed in one GREEN commit rather than leaving the test red on three dashboards.

Test: tests/grafana_dashboards/test_no_duplicate_targets_in_panel.py enforces three rules:

  1. No duplicate refId within a panel.
  2. No two non-hidden targets share the same normalised expression/rawSql.
  3. If a Prometheus metric appears in multiple non-hidden targets, every reference must include a non-empty label selector.
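
The three rules could be checked per panel along these lines. This is a rough sketch rather than the test's real implementation: panel_violations is a hypothetical name, metrics are recognised only by an assumed pgwatch_ prefix, and the regex does not handle label values that contain a closing brace.

```python
import re
from collections import defaultdict

# Assumption: metrics of interest start with "pgwatch_"; the selector is optional.
METRIC_RE = re.compile(r"\b(pgwatch_\w+)\s*(?:\{([^}]*)\})?")


def normalise(text: str) -> str:
    """Collapse all whitespace runs so formatting differences do not hide duplicates."""
    return " ".join(text.split())


def panel_violations(panel: dict) -> list:
    targets = panel.get("targets", [])
    violations = []

    # Rule 1: no duplicate refId within a panel.
    seen_refs = set()
    for target in targets:
        ref = target.get("refId")
        if ref in seen_refs:
            violations.append(f"duplicate refId {ref}")
        seen_refs.add(ref)

    visible = [t for t in targets if not t.get("hide")]

    # Rule 2: no two visible targets share the same normalised query.
    seen_queries = set()
    for target in visible:
        query = normalise(target.get("expr") or target.get("rawSql") or "")
        if query and query in seen_queries:
            violations.append(f"duplicate query in target {target.get('refId')}")
        seen_queries.add(query)

    # Rule 3: a metric used by several visible targets must always carry a selector.
    uses = defaultdict(list)  # metric -> [(target index, refId, selector)]
    for index, target in enumerate(visible):
        for metric, selector in METRIC_RE.findall(target.get("expr") or ""):
            uses[metric].append((index, target.get("refId"), selector.strip()))
    for metric, refs in uses.items():
        if len({index for index, _, _ in refs}) > 1:
            for _, ref, selector in refs:
                if not selector:
                    violations.append(f"{metric} unfiltered in target {ref}")
    return violations
```

With one filtered target and the unfiltered target D from Bug 3 in the same panel, only rule 3 fires, and it names target D.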

Bug 4 — Dashboard 2 "Detailed table view" investigation — not a bug

The "Detailed table view (pg_stat_statements)" entry is a Grafana row panel with collapsed: true. The table inside uses the Infinity data source (http://flask-pgss-api:8000/pgss_metrics/csv) and already passes ${db_name} as a URL parameter — it is NOT the same root cause as Bug 1, and the Bug 1 lint correctly does not flag it (different data source type).

The "blank during load" symptom on the demo is just the row being collapsed by default. Recommend tracking as a UX follow-up (auto-expand on landing) rather than a code bug. No fix applied in this MR.

Bug 5 — Default time window too wide on demo

Nine dashboards defaulted to now-6h, now-12h, or now-24h. On a fresh deployment those windows render as thin bars at the far right of every chart. All set to now-1h.

Test: tests/grafana_dashboards/test_default_time_range.py parses each dashboard's time.from relative spec and asserts ≤ 1 hour.
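
For illustration, parsing the relative spec might look like the sketch below. The helper names are hypothetical, and only simple now-<number><unit> forms with the units s, m, h, d and w are handled; anything else raises instead of being guessed at.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
RELATIVE_RE = re.compile(r"^now-(\d+)([smhdw])$")


def relative_seconds(spec: str) -> int:
    """Convert a spec such as "now-6h" into a number of seconds."""
    match = RELATIVE_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"unsupported time spec: {spec!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def default_window_ok(dashboard: dict, limit_seconds: int = 3600) -> bool:
    """True when the dashboard's default time.from is at most limit_seconds wide."""
    return relative_seconds(dashboard["time"]["from"]) <= limit_seconds
```

Raising on unknown specs is a deliberate choice in this sketch: a dashboard with an unexpected time.from should fail loudly rather than pass silently.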

Bug 6 — Grafana "New version available" banner

Set [analytics] check_for_updates = false in config/grafana/provisioning/grafana.ini (docker-compose) and grafana.grafana.ini.analytics.check_for_updates: false in postgres_ai_helm/values.yaml (Helm).

Test: tests/grafana_dashboards/test_grafana_check_for_updates_disabled.py parses both files and asserts the setting on both deployment paths.
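
A hedged sketch of checking both deployment paths. The function names are hypothetical. The Helm check takes an already-parsed values mapping (for example from yaml.safe_load) and assumes the dotted path above means the nested keys grafana, then grafana.ini, then analytics.

```python
import configparser


def ini_updates_disabled(ini_text: str) -> bool:
    """True when [analytics] check_for_updates is explicitly false in grafana.ini text."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    value = parser.get("analytics", "check_for_updates", fallback="true")
    return value.strip().lower() == "false"


def helm_updates_disabled(values: dict) -> bool:
    """True when the parsed Helm values disable the update check."""
    node = values
    # Assumed key layout: grafana -> "grafana.ini" -> analytics.
    for key in ("grafana", "grafana.ini", "analytics"):
        node = node.get(key, {}) if isinstance(node, dict) else {}
    return isinstance(node, dict) and node.get("check_for_updates") is False
```

Treating a missing key as "not disabled" matches the intent described above: the setting has to be present and false on both paths.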

Bug 7 — Dashboard 1 slug inconsistency — flagged, not fixed

config/grafana/provisioning/grafana.ini line 2 sets

home_page = /d/f90500a0-a12e-4081-a2f0-07ed96f27915/1-postgres-node-performance-overview-high-level/

but Grafana derives the slug from the dashboard title 01. Single node performance overview (high-level), which yields 01-single-node-performance-overview-high-level. Grafana resolves dashboards by UID and ignores the slug in the URL, so this currently works — but the literal mismatch will confuse anyone reading the config. Suggest a separate small fix to rename the home_page slug to match the title. Not fixed here per the one-bug-at-a-time scope rule.

Notes for reviewers

  • The postgres_ai_helm/config/grafana/dashboards/ files are symlinks into config/grafana/dashboards/ — every fix lands once.
  • Commit structure: 6 RED → GREEN pairs (12 commits total). The 6th pair was added during the 2026-05-29 rebase onto main to lock the composed rawSql invariant for Dashboard 3 panel id=18: the ${db_name} pin and the "query text not yet collected" graceful fallback must both be present (see tests/grafana_dashboards/test_d3_query_text_panel_compose.py).
  • The CI job is quality:grafana-dashboards-lint in quality/gitlab-ci-quality.yml.

Visual changes

BEFORE captured on the live rc.6 demo (http://167.233.29.47:3000); AFTER captured on a local Grafana 12.3.2 instance loaded with this branch's dashboard JSONs. The "New version available" banner fix has no screenshot pair and is verified via test-as-proof (see tests/grafana_dashboards/test_grafana_check_for_updates_disabled.py): the banner only renders for users with the Grafana admin role, via the bottom-left version popover, and the rc.6 demo monitor login does not see it. /api/frontend/settings confirms buildInfo.hasUpdate=true on rc.6 vs false locally.

Bug 1 — Dashboard 3 first panel "No data" + pink/magenta triangle

BEFORE: the top "Query text" panel shows a pink/magenta warning triangle. Tooltip reads "You do not currently have a default database configured for this data source...".

AFTER: the panel renders the actual query text, which shows the rawSql fix works end-to-end. A local Grafana 12.3.2 was loaded with this branch's Dashboard_3_Single_query_analysis.json, the PGWatch-PostgreSQL datasource was seeded with one pgss_queryid_queries row (data->>'queryid' = 8765432109876543210, data->>'real_dbname' = demodb, data->>'query' = SELECT u.id, u.email, count(o.id) ...), and the panel was opened with var-query_id=8765432109876543210 and var-db_name=demodb. The Bug 1 fix (rawSql ${db_name} pin against data->>'real_dbname') is also verified by tests/grafana_dashboards/test_postgres_panels_pin_database.py.

Before: before-1-d3-first-panel | After: after-1-d3-first-panel-v2

Bug 2 — Dashboard 6 title leaked a TODO note

BEFORE (rc.6 demo dashboard list): 06. Replication and HA -- "Metrics are collected (part of health check); dashboard – T...

AFTER (local with this branch): 06. Replication and HA

Before: before-2-d6-title | After: after-2-d6-title

Bug 3 — Dashboard 1 ASH legend duplicates

BEFORE (live rc.6 demo, Dashboard 1, Last 24 hours): legend shows seven rows — CPU*, Client, LWLock, Lock, Postgres - Activity, Activity, and a second CPU* at the bottom whose Min/Max/Mean column values (1 / 4 / 1.52) are identical to row 1. The duplicate CPU* is the visible symptom of the unfiltered catch-all target overlaying the filtered ones; the JSON-lint test enforces the underlying rule on Dashboards 1, 3 and 4 even where a category like IO happens to have no active samples in the BEFORE capture and therefore does not surface a visible second row.

AFTER: deduped — legend shows the same category set as BEFORE minus the duplicate CPU*. Six rows: CPU*, Client, LWLock, Lock, Postgres - Activity, Idle Internal - Idle - Activity (captured against a local Grafana 12.3.2 with this branch's Dashboard 1 JSON; the pgwatch_wait_events_total series were seeded into the local Prometheus/VictoriaMetrics sink across the dashboard's now-1h window for all six BEFORE categories, so the panel can render the post-fix output end-to-end). The unfiltered target D from the buggy version is gone, so no category appears twice.

Before: before-3-d1-ash-legend | After: after-3-d1-ash-legend-v2

Bug 5 — Default time window too wide

Top-right time picker on Dashboard 1.

BEFORE: Last 24 hours (also Last 6 hours / Last 12 hours on 8 other dashboards).

AFTER: Last 1 hour.

Before: before-4-time-range-d1 | After: after-4-time-range-d1-v2

Bug 6 — Grafana "New version available" banner — test-as-proof

The banner only renders for users with Grafana admin role; the rc.6 demo monitor login can't see it. The underlying signal is /api/frontend/settings → buildInfo.hasUpdate:

  • rc.6 demo (BEFORE): hasUpdate: true, latestVersion: "13.0.1+security-01"
  • local with check_for_updates = false (AFTER): hasUpdate: false, latestVersion: "", hideVersion: true

Test: tests/grafana_dashboards/test_grafana_check_for_updates_disabled.py asserts the setting on both deployment paths (docker-compose grafana.ini and Helm values).

Test plan

  • python3 -m pytest tests/grafana_dashboards/ -v — 182 pass locally
  • All five RED tests were verified to fail on the initial codebase before applying the GREEN fix
  • All target JSON files re-validated as valid JSON after edits
  • python3 -c "import yaml; yaml.safe_load(open('postgres_ai_helm/values.yaml'))" — valid YAML after edits
  • CI quality:grafana-dashboards-lint job passes on this branch
  • Manual spot-check on rc.7 demo:
    • Dashboard 3 first panel renders query text
    • Dashboard 6 title shows "06. Replication and HA"
    • Dashboard 1 ASH legend has no duplicate CPU*/IO rows
    • Dashboards default to "Last 1 hour"
    • No "New version available" banner

🤖 Generated with Claude Code

Closes #217 (closed)

Edited by Maya P
