Collect environment metrics from the CPT environment and store them as artifacts

Summary

Collect environment metrics from the CNG GitLab instance during CPT test runs and store them as artifacts. These metrics will provide crucial context for performance analysis and will be included in the Duo-generated performance reports to help identify whether performance issues stem from resource constraints, application bottlenecks, or environment configuration problems.

Problem

CPT currently collects only k6 test metrics (RPS, TTFB, request success rates) and has no visibility into the health and resource utilization of the underlying CNG environment during test execution. This makes it difficult to:

  • Determine whether performance degradation is caused by resource constraints or by application issues
  • Identify which GitLab components (webservice, gitaly, postgresql, redis) are bottlenecks
  • Understand if the test environment is healthy and stable during testing
  • Provide sufficient context to Duo for accurate performance report generation

Proposed Solution

Create a script to collect comprehensive environment metrics from the CNG Kubernetes cluster and store them as CI artifacts alongside k6 test results.

Metrics to Collect

1. Kubernetes Pod-Level Resource Metrics

Collect CPU and memory usage for all GitLab component pods:

kubectl top pods -n gitlab

Key components to track:

  • Webservice - Handles HTTP requests
  • Sidekiq - Background job processing
  • Gitaly - Git operations
  • PostgreSQL - Database
  • Redis - Cache/sessions
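  • Workhorse - Reverse proxy in front of Webservice (in the CNG chart it runs as a container inside the Webservice pod, so per-container figures need kubectl top pods -n gitlab --containers)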
  • Registry - Container registry
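As a rough illustration of how the pod-level collection could work, the sketch below parses the text output of kubectl top pods into a dictionary. The function names are hypothetical, and the collector assumes kubectl is on the PATH and that metrics-server is available in the cluster.

```python
import subprocess


def parse_top_pods(output: str) -> dict:
    """Parse `kubectl top pods --no-headers` text into {pod: {cpu, memory}}."""
    usage = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        name, cpu, memory = parts[0], parts[1], parts[2]
        usage[name] = {"cpu_usage": cpu, "memory_usage": memory}
    return usage


def collect_top_pods(namespace: str = "gitlab") -> dict:
    """Run kubectl and parse its output (assumes kubectl and metrics-server)."""
    result = subprocess.run(
        ["kubectl", "top", "pods", "-n", namespace, "--no-headers"],
        capture_output=True, text=True, check=True,
    )
    return parse_top_pods(result.stdout)
```

Keeping the parser separate from the kubectl call makes it possible to test the parsing against captured output without a live cluster.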

2. Kubernetes Node-Level Metrics

kubectl top nodes
kubectl describe nodes

Metrics:

  • Total CPU capacity vs usage
  • Total Memory capacity vs usage
  • Pod count and density
  • Resource pressure indicators
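Comparing capacity with usage means converting Kubernetes quantity strings such as 250m or 16Gi into plain numbers first. The sketch below is one possible way to do that; the helper names are hypothetical and only the common suffixes are handled, not the full Kubernetes quantity grammar.

```python
# Binary suffixes are checked first so that "Mi" is not mistaken for "M".
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"n": 1e-9, "u": 1e-6, "m": 1e-3, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}


def parse_quantity(quantity: str) -> float:
    """Convert a quantity like '250m' (cores) or '16Gi' (bytes) to a float."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return float(quantity[:-2]) * factor
    for suffix, factor in _DECIMAL.items():
        if quantity.endswith(suffix):
            return float(quantity[:-1]) * factor
    return float(quantity)  # plain number, e.g. "4" cores


def utilization(usage: str, capacity: str) -> float:
    """Return usage as a fraction of capacity, rounded to three decimals."""
    return round(parse_quantity(usage) / parse_quantity(capacity), 3)
```

For example, utilization("8Gi", "16Gi") gives the fraction of node memory in use, which is easier to compare across runs than the raw strings.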

3. GitLab Application Metrics (Prometheus)

Scrape Prometheus metrics from GitLab components:

Webservice metrics (port 8083):

  • http_request_duration_seconds - Request latency histograms
  • gitlab_transaction_duration_seconds - Transaction timing
  • gitlab_cache_operations_total - Cache hit/miss rates
  • gitlab_database_connection_pool_size - DB connection pool
  • gitlab_database_connection_pool_busy - Active DB connections
  • puma_workers - Puma worker count

Gitaly metrics (port 9236):

  • gitaly_service_client_requests_total - RPC request count
  • gitaly_service_client_request_duration_seconds - RPC latency

PostgreSQL metrics:

  • pg_stat_database_* - Database statistics
  • pg_stat_activity_count - Active connections
  • pg_locks_count - Lock contention

Redis metrics:

  • redis_connected_clients - Client connections
  • redis_used_memory_bytes - Memory usage
  • redis_commands_processed_total - Command throughput
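Each of these endpoints serves the Prometheus text exposition format, so a small parser is enough to pull out a few headline values. The following is a minimal sketch using only the standard library: it sums samples per metric name across all label sets and derives the DB connection pool utilization. The function names and the fetch URL are assumptions, and reaching the pod (for example through a port-forward) is left out.

```python
import re
import urllib.request
from collections import defaultdict

# metric name, optional {labels}, then the sample value
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def sum_by_metric(text: str) -> dict:
    """Sum sample values per metric name, ignoring comments and labels."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE lines and blanks
        match = _SAMPLE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(3))
    return dict(totals)


def pool_utilization(totals: dict):
    """Return busy/size for the DB connection pool, or None if size is 0."""
    size = totals.get("gitlab_database_connection_pool_size", 0.0)
    busy = totals.get("gitlab_database_connection_pool_busy", 0.0)
    return busy / size if size else None


def fetch_metrics(host: str, port: int) -> str:
    """Fetch /metrics from a reachable endpoint (host and port are assumed)."""
    with urllib.request.urlopen(f"http://{host}:{port}/metrics", timeout=10) as resp:
        return resp.read().decode("utf-8")
```

Summing across label sets is a simplification; histogram metrics such as http_request_duration_seconds would need their buckets handled separately to produce percentiles.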

4. Pod Health and Status

kubectl get pods -n gitlab -o json
kubectl get events -n gitlab --sort-by='.lastTimestamp'

Metrics:

  • Pod restart counts (indicates crashes/OOM)
  • Pod ready/not ready status
  • OOMKilled events
  • Resource requests vs limits
  • Recent error events
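The health indicators can be derived from the JSON that kubectl get pods -n gitlab -o json returns. The sketch below shows one way to do that, reading the restartCount, lastState.terminated.reason and phase fields of the Pod status; the function name is hypothetical.

```python
import json


def summarize_pod_health(pods_json: str) -> dict:
    """Aggregate restarts, OOM kills and failed pods from a pod list JSON."""
    pods = json.loads(pods_json).get("items", [])
    restarts = oom_kills = failed = 0
    for pod in pods:
        status = pod.get("status", {})
        if status.get("phase") == "Failed":
            failed += 1
        for container in status.get("containerStatuses", []):
            restarts += container.get("restartCount", 0)
            # lastState only keeps the most recent termination, so this
            # counts containers whose last exit was an OOM kill.
            terminated = container.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                oom_kills += 1
    return {"pod_restarts": restarts, "oom_kills": oom_kills, "failed_pods": failed}
```

Because only the last termination is recorded per container, the oom_kills value is a lower bound; the events listing may reveal earlier kills.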

5. Persistent Volume Metrics

kubectl get pvc -n gitlab

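Metrics (kubectl get pvc reports requested and provisioned capacity only; actual usage needs another source, such as the kubelet_volume_stats_used_bytes metric or df inside the pod):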

  • Gitaly storage usage
  • PostgreSQL storage usage
  • Redis persistence storage
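A possible starting point for the volume data is the JSON form of the same command, kubectl get pvc -n gitlab -o json. The sketch below extracts the name, phase, requested size and provisioned capacity of each claim; the function name is hypothetical, and the result describes capacity, not how much of it is used.

```python
import json


def summarize_pvcs(pvc_json: str) -> list:
    """Return name, phase, requested and provisioned size for each PVC."""
    rows = []
    for item in json.loads(pvc_json).get("items", []):
        spec = item.get("spec", {})
        status = item.get("status", {})
        rows.append({
            "name": item.get("metadata", {}).get("name"),
            "phase": status.get("phase"),
            "requested": spec.get("resources", {}).get("requests", {}).get("storage"),
            "capacity": status.get("capacity", {}).get("storage"),
        })
    return rows
```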

6. Environment Configuration Context

Capture from existing environment variables and pipeline configuration:

  • GITLAB_HELM_CHART_REF - Chart version
  • Component image tags (webservice, sidekiq, gitaly, workhorse, etc.)
  • --resource-preset performance - Resource allocation profile
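Capturing this context could be as simple as reading the relevant variables into a dictionary. In the sketch below only GITLAB_HELM_CHART_REF comes from this issue; the function name is hypothetical, the names of the image-tag variables are passed in by the caller because they are not specified here, and the resource preset is passed explicitly because it is a command-line flag and not an environment variable.

```python
import os


def capture_config(env=None, image_tag_vars=(), resource_preset=None) -> dict:
    """Collect chart ref, resource preset and image tags for the report."""
    env = os.environ if env is None else env
    return {
        "helm_chart_ref": env.get("GITLAB_HELM_CHART_REF"),
        "resource_preset": resource_preset,
        # Only variables that are actually set are included.
        "image_tags": {name: env[name] for name in image_tag_vars if name in env},
    }
```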

Expected Output Format

The collected metrics should be structured as:

{
  "test_results": {
    "rps": 150.5,
    "ttfb_p90_ms": 450,
    "success_rate": 99.8
  },
  "environment_metrics": {
    "gitlab_version": "17.7.0",
    "helm_chart_ref": "abc123",
    "resource_preset": "performance",
    "pod_resources": {
      "webservice": {
        "cpu_usage": "250m",
        "memory_usage": "1.2Gi",
        "restarts": 0,
        "cpu_request": "500m",
        "cpu_limit": "2",
        "memory_request": "2Gi",
        "memory_limit": "4Gi"
      },
      "gitaly": { },
      "postgresql": { },
      "redis": { }
    },
    "node_resources": {
      "cpu_capacity": "4",
      "cpu_usage": "2.5",
      "memory_capacity": "16Gi",
      "memory_usage": "8Gi"
    },
    "application_metrics": {
      "http_request_p95_ms": 450,
      "db_connection_pool_busy": 15,
      "db_connection_pool_size": 20,
      "cache_hit_rate": 0.85,
      "gitaly_rpc_p95_ms": 120
    },
    "health_indicators": {
      "pod_restarts": 0,
      "oom_kills": 0,
      "failed_pods": 0
    }
  }
}
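To show how a document of this shape could be produced, here is a minimal sketch that combines the k6 results with the environment metrics and writes them to a file for upload as a CI artifact. The function names are hypothetical, and the output path is chosen by the caller.

```python
import json
from pathlib import Path


def build_report(test_results: dict, environment_metrics: dict) -> dict:
    """Combine k6 results and environment metrics into one document."""
    return {
        "test_results": test_results,
        "environment_metrics": environment_metrics,
    }


def write_report(report: dict, path: str) -> Path:
    """Write the report as indented JSON, creating parent directories."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(report, indent=2) + "\n", encoding="utf-8")
    return out
```

The CI job would then list the written file under its artifacts paths so that it is stored alongside the k6 results.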
Edited by Vishal Patel