[deployed] Create HTTP Server Monitoring Dashboard in Grafana

Problem to Solve

We need visibility into the performance and health of our HTTP server. Without a dedicated dashboard, it is difficult to answer fundamental questions such as:

  • How much traffic is the server handling?
  • What is the error rate for user-facing requests?
  • Are specific endpoints performing slowly?

This operational blindness makes it challenging to proactively identify bottlenecks, diagnose production issues, and understand the real-time health of the service.

Proposed Solution

Create a comprehensive Grafana dashboard to provide at-a-glance insights into the HTTP server's behaviour using Prometheus metrics. The dashboard should be designed to offer both a high-level overview and the ability to drill down into specific issues.

Metrics:

  • Replace the gkg_http_requests_total metric with gkg_http_responses_total, which now includes method, path, and status labels for more granular monitoring.
  • Keep gkg_http_request_duration_seconds to measure latency.

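For context, the relabelled counter exposes one time series per label combination. The series below are illustrative only (the label values and sample counts are made up, not taken from production):

```promql
# Illustrative series for the relabelled counter (values are hypothetical)
gkg_http_responses_total{method="GET",  path="/api/v1/search", status="200"}  15234
gkg_http_responses_total{method="POST", path="/api/v1/index",  status="500"}     12

# The latency histogram is kept unchanged
gkg_http_request_duration_seconds_bucket{le="0.5"}  14890
```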
Dashboard creation:

Create dashboards as code in https://gitlab.com/gitlab-com/runbooks with the following pages:

  1. Overview: Create an overview page with a high-level http-server metrics section:
    • HTTP error rate
    • HTTP request rate
    • P99 latency
  2. HTTP Server: Create an http-server page with fine-grained metrics:
    • HTTP error rate per method/path (all by default)
    • P99 latency per method/path (all by default)
    • P95 latency per method/path (all by default)
    • Response status per method/path (all by default)
  3. HTTP Server: Create filters for method and path that allow developers to isolate the metrics for a single endpoint.

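As a sketch, the overview panels could be backed by queries along these lines. The 5m rate window is an assumption; adjust it to the conventions used in the runbooks repository:

```promql
# HTTP request rate (requests per second, 5m window)
sum(rate(gkg_http_responses_total[5m]))

# HTTP error rate: share of responses that are 5xx
sum(rate(gkg_http_responses_total{status=~"5.."}[5m]))
  /
sum(rate(gkg_http_responses_total[5m]))

# P99 latency from the duration histogram
histogram_quantile(0.99,
  sum by (le) (rate(gkg_http_request_duration_seconds_bucket[5m])))
```

The method/path filters on the http-server page can be implemented as Grafana template variables, e.g. `label_values(gkg_http_responses_total, method)` and `label_values(gkg_http_responses_total{method=~"$method"}, path)`, with "All" as the default selection.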
Pro tip: Use the example dashboards from feat(observability): add http-server metrics an... (!362 - merged) to create the production dashboards.

Dashboard Design Choices:

  1. High-Level KPIs: The top-level stats for Request Rate, Error Rate (5xx), and P99 Latency serve as the primary health indicators. They provide an immediate answer to "Is the service healthy right now?" P99 latency is chosen over average latency because it better reflects the worst-case user experience, which is often hidden by simple averages.
  2. Detailed Time-Series Graphs:
    • Response Status Codes: This graph breaks down traffic by response type (2xx, 4xx, 5xx), helping to quickly differentiate between server-side failures and client-side errors.
    • Latency & Error Rate by Endpoint: These are critical for debugging. By breaking down metrics per-endpoint, we can instantly pinpoint which specific API calls are slow or failing, dramatically reducing the time required to resolve an incident.
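The per-endpoint panels described above follow the same pattern, grouped by the new labels. This sketch assumes the duration histogram carries the same method/path labels as the response counter, and that `$method`/`$path` are the dashboard's filter variables:

```promql
# P99 latency per method/path, honouring the dashboard filters
histogram_quantile(0.99,
  sum by (le, method, path) (
    rate(gkg_http_request_duration_seconds_bucket{method=~"$method", path=~"$path"}[5m])))

# 5xx error rate per method/path
sum by (method, path) (
    rate(gkg_http_responses_total{status=~"5..", method=~"$method", path=~"$path"}[5m]))
  /
sum by (method, path) (
    rate(gkg_http_responses_total{method=~"$method", path=~"$path"}[5m]))
```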