Get environment metrics from CPT environment and store them as artifacts
Summary
Collect environment metrics from the CNG GitLab instance during CPT test runs and store them as artifacts. These metrics will provide crucial context for performance analysis and will be included in the Duo-generated performance reports to help identify whether performance issues stem from resource constraints, application bottlenecks, or environment configuration problems.
Problem
Currently, CPT only collects k6 test metrics (RPS, TTFB, request success rates) but lacks visibility into the underlying CNG environment's health and resource utilization during test execution. This makes it difficult to:
- Determine if performance degradation is due to resource constraints vs application issues
- Identify which GitLab components (webservice, gitaly, postgresql, redis) are bottlenecks
- Understand if the test environment is healthy and stable during testing
- Provide sufficient context to Duo for accurate performance report generation
Proposed Solution
Create a script to collect comprehensive environment metrics from the CNG Kubernetes cluster and store them as CI artifacts alongside k6 test results.
Metrics to Collect
1. Kubernetes Pod-Level Resource Metrics
Collect CPU and memory usage for all GitLab component pods:
```shell
kubectl top pods -n gitlab
```
Key components to track:
- Webservice - Handles HTTP requests
- Sidekiq - Background job processing
- Gitaly - Git operations
- PostgreSQL - Database
- Redis - Cache/sessions
- Workhorse - Request routing
- Registry - Container registry
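As a sketch of how the collection script could handle this step (assuming the script is written in Python — the language is not fixed by this proposal), the plain-text `kubectl top pods` output can be parsed into a per-pod map. The pod names and sample output below are hypothetical, following the default GitLab Helm chart naming (`gitlab-<component>-...`):

```python
# Sketch: turn `kubectl top pods -n gitlab` output into a dict keyed by pod name.
# In the real script, `sample` would be the captured stdout of the kubectl call.
def parse_top_pods(output: str) -> dict:
    """Map each pod to its CPU and memory usage columns."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the NAME/CPU/MEMORY header
        name, cpu, memory = line.split()
        usage[name] = {"cpu_usage": cpu, "memory_usage": memory}
    return usage

# Hypothetical sample output:
sample = """\
NAME                                  CPU(cores)   MEMORY(bytes)
gitlab-webservice-default-abc123      250m         1200Mi
gitlab-gitaly-0                       180m         900Mi
"""
pods = parse_top_pods(sample)
```

The string values (`250m`, `1200Mi`) are kept as-is here, matching the expected output format below; normalizing them to numeric millicores/bytes is a design choice left to the implementation.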
2. Kubernetes Node-Level Metrics
```shell
kubectl top nodes
kubectl describe nodes
```
Metrics:
- Total CPU capacity vs usage
- Total Memory capacity vs usage
- Pod count and density
- Resource pressure indicators
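The node-level step could be handled the same way; a minimal sketch, assuming the standard five-column `kubectl top nodes` output (the node name and numbers below are hypothetical):

```python
# Sketch: parse `kubectl top nodes` output (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).
def parse_top_nodes(output: str) -> list:
    """Return one dict per node with usage and percentage columns."""
    nodes = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, cpu_pct, mem, mem_pct = line.split()
        nodes.append({
            "node": name,
            "cpu_usage": cpu,
            "cpu_percent": cpu_pct,
            "memory_usage": mem,
            "memory_percent": mem_pct,
        })
    return nodes

# Hypothetical sample output:
sample_nodes = """\
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   2500m        62%    8192Mi          51%
"""
nodes = parse_top_nodes(sample_nodes)
```

Capacity figures (for the `cpu_capacity` / `memory_capacity` fields below) are not in `kubectl top nodes`; they would come from `kubectl describe nodes` or `kubectl get nodes -o json`.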
3. GitLab Application Metrics (Prometheus)
Scrape Prometheus metrics from GitLab components:
Webservice metrics (port 8083):
- `http_request_duration_seconds` - Request latency histograms
- `gitlab_transaction_duration_seconds` - Transaction timing
- `gitlab_cache_operations_total` - Cache hit/miss rates
- `gitlab_database_connection_pool_size` - DB connection pool size
- `gitlab_database_connection_pool_busy` - Active DB connections
- `puma_workers` - Puma worker count
Gitaly metrics (port 9236):
- `gitaly_service_client_requests_total` - RPC request count
- `gitaly_service_client_request_duration_seconds` - RPC latency
PostgreSQL metrics:
- `pg_stat_database_*` - Database statistics
- `pg_stat_activity_count` - Active connections
- `pg_locks_count` - Lock contention
Redis metrics:
- `redis_connected_clients` - Client connections
- `redis_used_memory_bytes` - Memory usage
- `redis_commands_processed_total` - Command throughput
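For the scraping step, a minimal sketch of pulling a single sample value out of Prometheus text-format exposition (the metric endpoint URL and the sample exposition below are assumptions; in the real script the text would be fetched from each component's metrics port, e.g. via `urllib.request.urlopen`):

```python
import re

def scrape_value(exposition: str, metric: str):
    """Return the first sample value for `metric` from Prometheus text format,
    with or without labels, or None if the metric is absent."""
    m = re.search(
        rf"^{re.escape(metric)}(?:\{{[^\}}]*\}})?\s+([-+0-9.eE]+)\s*$",
        exposition,
        re.M,
    )
    return float(m.group(1)) if m else None

# Hypothetical exposition text as returned by a /metrics endpoint:
sample_exposition = """\
puma_workers 4
gitlab_database_connection_pool_busy{pool="main"} 15
"""
workers = scrape_value(sample_exposition, "puma_workers")
```

Histogram metrics like `http_request_duration_seconds` need quantile computation over their buckets rather than a single-sample read; a full implementation would likely query a Prometheus server instead of parsing raw exposition text.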
4. Pod Health and Status
```shell
kubectl get pods -n gitlab -o json
kubectl get events -n gitlab --sort-by='.lastTimestamp'
```
Metrics:
- Pod restart counts (indicates crashes/OOM)
- Pod ready/not ready status
- OOMKilled events
- Resource requests vs limits
- Recent error events
5. Persistent Volume Metrics
```shell
kubectl get pvc -n gitlab
```
Metrics:
- Gitaly storage capacity/usage
- PostgreSQL storage capacity/usage
- Redis persistence storage

Note: PVC objects only report the provisioned capacity; actual disk usage has to come from kubelet volume stats (e.g. `kubelet_volume_stats_used_bytes` via Prometheus).
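A sketch of reading the provisioned capacities, assuming the script captures `kubectl get pvc -n gitlab -o json` (the PVC name below is hypothetical, following the Helm chart's `repo-data-gitlab-gitaly-0` pattern):

```python
import json

def pvc_capacities(pvc_json: str) -> dict:
    """Map PVC name -> provisioned capacity from `kubectl get pvc -o json`."""
    return {
        item["metadata"]["name"]: item["status"]["capacity"]["storage"]
        for item in json.loads(pvc_json)["items"]
    }

# Hypothetical payload with a single Gitaly repository volume:
sample_pvcs = json.dumps({"items": [
    {"metadata": {"name": "repo-data-gitlab-gitaly-0"},
     "status": {"capacity": {"storage": "50Gi"}}},
]})
capacities = pvc_capacities(sample_pvcs)
```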
6. Environment Configuration Context
Capture from existing environment variables:
- `GITLAB_HELM_CHART_REF` - Chart version
- Component image tags (webservice, sidekiq, gitaly, workhorse, etc.)
- `--resource-preset performance` - Resource allocation profile
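Capturing this context is a straightforward environment read; a sketch, where `RESOURCE_PRESET` is a hypothetical variable name standing in for however the `--resource-preset` value is exposed to the job:

```python
import os

def environment_context(env=os.environ) -> dict:
    """Snapshot configuration context from the CI job's environment variables."""
    return {
        "helm_chart_ref": env.get("GITLAB_HELM_CHART_REF", "unknown"),
        # RESOURCE_PRESET is an assumed variable name for the --resource-preset value.
        "resource_preset": env.get("RESOURCE_PRESET", "unknown"),
    }

# Usage with an explicit dict instead of the real process environment:
ctx = environment_context({"GITLAB_HELM_CHART_REF": "abc123",
                           "RESOURCE_PRESET": "performance"})
```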
Expected Output Format
The collected metrics should be structured as:
```json
{
  "test_results": {
    "rps": 150.5,
    "ttfb_p90_ms": 450,
    "success_rate": 99.8
  },
  "environment_metrics": {
    "gitlab_version": "17.7.0",
    "helm_chart_ref": "abc123",
    "resource_preset": "performance",
    "pod_resources": {
      "webservice": {
        "cpu_usage": "250m",
        "memory_usage": "1.2Gi",
        "restarts": 0,
        "cpu_request": "500m",
        "cpu_limit": "2",
        "memory_request": "2Gi",
        "memory_limit": "4Gi"
      },
      "gitaly": { },
      "postgresql": { },
      "redis": { }
    },
    "node_resources": {
      "cpu_capacity": "4",
      "cpu_usage": "2.5",
      "memory_capacity": "16Gi",
      "memory_usage": "8Gi"
    },
    "application_metrics": {
      "http_request_p95_ms": 450,
      "db_connection_pool_busy": 15,
      "db_connection_pool_size": 20,
      "cache_hit_rate": 0.85,
      "gitaly_rpc_p95_ms": 120
    },
    "health_indicators": {
      "pod_restarts": 0,
      "oom_kills": 0,
      "failed_pods": 0
    }
  }
}
```
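Finally, a sketch of assembling the sections into the artifact file (the file name `environment_metrics.json` is an assumption; the example writes to a temp directory standing in for the CI artifact path):

```python
import json
import os
import tempfile

def write_artifact(path: str, test_results: dict, environment_metrics: dict) -> dict:
    """Combine k6 results and environment metrics into one JSON artifact."""
    payload = {
        "test_results": test_results,
        "environment_metrics": environment_metrics,
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload

# Usage: write to a temp file standing in for the CI artifacts directory.
path = os.path.join(tempfile.mkdtemp(), "environment_metrics.json")
payload = write_artifact(path,
                         {"rps": 150.5},
                         {"resource_preset": "performance"})
```

The CI job would then list this path under `artifacts:paths` so it is stored alongside the k6 results and available to the Duo report generation step.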