Get environment metrics from CPT environment and store them as artifacts
Summary
Collect environment metrics from the CNG GitLab instance during CPT test runs and store them as artifacts. These metrics will provide crucial context for performance analysis and will be included in the Duo-generated performance reports to help identify whether performance issues stem from resource constraints, application bottlenecks, or environment configuration problems.
Problem
Currently, CPT only collects k6 test metrics (RPS, TTFB, request success rates) but lacks visibility into the underlying CNG environment's health and resource utilization during test execution. This makes it difficult to:
- Determine if performance degradation is due to resource constraints vs application issues
- Identify which GitLab components (webservice, gitaly, postgresql, redis) are bottlenecks
- Understand if the test environment is healthy and stable during testing
- Provide sufficient context to Duo for accurate performance report generation
Proposed Solution
Create a script to collect comprehensive environment metrics from the CNG Kubernetes cluster and store them as CI artifacts alongside k6 test results.
Metrics to Collect
1. Kubernetes Pod-Level Resource Metrics
Collect CPU and memory usage for all GitLab component pods:
```shell
kubectl top pods -n gitlab
```
Key components to track:
- Webservice - Handles HTTP requests
- Sidekiq - Background job processing
- Gitaly - Git operations
- PostgreSQL - Database
- Redis - Cache/sessions
- Workhorse - Request routing
- Registry - Container registry
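As a sketch of how the collection script could handle this step (assuming the script is written in Python — the language is not fixed by this proposal), the plain-text `kubectl top pods` output can be parsed into a per-pod map. The pod names and sample output below are hypothetical, following the default GitLab Helm chart naming (`gitlab-<component>-...`):

```python
# Sketch: turn `kubectl top pods -n gitlab` output into a dict keyed by pod name.
# In the real script, `sample` would be the captured stdout of the kubectl call.
def parse_top_pods(output: str) -> dict:
    """Map each pod to its CPU and memory usage columns."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the NAME/CPU/MEMORY header
        name, cpu, memory = line.split()
        usage[name] = {"cpu_usage": cpu, "memory_usage": memory}
    return usage

# Hypothetical sample output:
sample = """\
NAME                                  CPU(cores)   MEMORY(bytes)
gitlab-webservice-default-abc123      250m         1200Mi
gitlab-gitaly-0                       180m         900Mi
"""
pods = parse_top_pods(sample)
```

The string values (`250m`, `1200Mi`) are kept as-is here, matching the expected output format below; normalizing them to numeric millicores/bytes is a design choice left to the implementation.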
2. Kubernetes Node-Level Metrics
```shell
kubectl top nodes
kubectl describe nodes
```
Metrics:
- Total CPU capacity vs usage
- Total Memory capacity vs usage
- Pod count and density
- Resource pressure indicators
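The node-level step could be handled the same way; a minimal sketch, assuming the standard five-column `kubectl top nodes` output (the node name and numbers below are hypothetical):

```python
# Sketch: parse `kubectl top nodes` output (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).
def parse_top_nodes(output: str) -> list:
    """Return one dict per node with usage and percentage columns."""
    nodes = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, cpu_pct, mem, mem_pct = line.split()
        nodes.append({
            "node": name,
            "cpu_usage": cpu,
            "cpu_percent": cpu_pct,
            "memory_usage": mem,
            "memory_percent": mem_pct,
        })
    return nodes

# Hypothetical sample output:
sample_nodes = """\
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   2500m        62%    8192Mi          51%
"""
nodes = parse_top_nodes(sample_nodes)
```

Capacity figures (for the `cpu_capacity` / `memory_capacity` fields below) are not in `kubectl top nodes`; they would come from `kubectl describe nodes` or `kubectl get nodes -o json`.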
3. GitLab Application Metrics (Prometheus)
Scrape Prometheus metrics from GitLab components:
Webservice metrics (port 8083):
- `http_request_duration_seconds` - Request latency histograms
- `gitlab_transaction_duration_seconds` - Transaction timing
- `gitlab_cache_operations_total` - Cache hit/miss rates
- `gitlab_database_connection_pool_size` - DB connection pool size
- `gitlab_database_connection_pool_busy` - Active DB connections
- `puma_workers` - Puma worker count
Gitaly metrics (port 9236):
- `gitaly_service_client_requests_total` - RPC request count
- `gitaly_service_client_request_duration_seconds` - RPC latency
PostgreSQL metrics:
- `pg_stat_database_*` - Database statistics
- `pg_stat_activity_count` - Active connections
- `pg_locks_count` - Lock contention
Redis metrics:
- `redis_connected_clients` - Client connections
- `redis_used_memory_bytes` - Memory usage
- `redis_commands_processed_total` - Command throughput
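For the scraping step, a minimal sketch of pulling a single sample value out of Prometheus text-format exposition (the metric endpoint URL and the sample exposition below are assumptions; in the real script the text would be fetched from each component's metrics port, e.g. via `urllib.request.urlopen`):

```python
import re

def scrape_value(exposition: str, metric: str):
    """Return the first sample value for `metric` from Prometheus text format,
    with or without labels, or None if the metric is absent."""
    m = re.search(
        rf"^{re.escape(metric)}(?:\{{[^\}}]*\}})?\s+([-+0-9.eE]+)\s*$",
        exposition,
        re.M,
    )
    return float(m.group(1)) if m else None

# Hypothetical exposition text as returned by a /metrics endpoint:
sample_exposition = """\
puma_workers 4
gitlab_database_connection_pool_busy{pool="main"} 15
"""
workers = scrape_value(sample_exposition, "puma_workers")
```

Histogram metrics like `http_request_duration_seconds` need quantile computation over their buckets rather than a single-sample read; a full implementation would likely query a Prometheus server instead of parsing raw exposition text.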
4. Pod Health and Status
```shell
kubectl get pods -n gitlab -o json
kubectl get events -n gitlab --sort-by='.lastTimestamp'
```
Metrics:
- Pod restart counts (indicates crashes/OOM)
- Pod ready/not ready status
- OOMKilled events
- Resource requests vs limits
- Recent error events
5. Persistent Volume Metrics
```shell
kubectl get pvc -n gitlab
```
Metrics:
- Gitaly storage capacity/usage
- PostgreSQL storage capacity/usage
- Redis persistence storage

Note: PVC objects only report the provisioned capacity; actual disk usage has to come from kubelet volume stats (e.g. `kubelet_volume_stats_used_bytes` via Prometheus).
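A sketch of reading the provisioned capacities, assuming the script captures `kubectl get pvc -n gitlab -o json` (the PVC name below is hypothetical, following the Helm chart's `repo-data-gitlab-gitaly-0` pattern):

```python
import json

def pvc_capacities(pvc_json: str) -> dict:
    """Map PVC name -> provisioned capacity from `kubectl get pvc -o json`."""
    return {
        item["metadata"]["name"]: item["status"]["capacity"]["storage"]
        for item in json.loads(pvc_json)["items"]
    }

# Hypothetical payload with a single Gitaly repository volume:
sample_pvcs = json.dumps({"items": [
    {"metadata": {"name": "repo-data-gitlab-gitaly-0"},
     "status": {"capacity": {"storage": "50Gi"}}},
]})
capacities = pvc_capacities(sample_pvcs)
```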
6. Environment Configuration Context
Capture from existing environment variables:
- `GITLAB_HELM_CHART_REF` - Chart version
- Component image tags (webservice, sidekiq, gitaly, workhorse, etc.)
- `--resource-preset performance` - Resource allocation profile
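Capturing this context is a straightforward environment read; a sketch, where `RESOURCE_PRESET` is a hypothetical variable name standing in for however the `--resource-preset` value is exposed to the job:

```python
import os

def environment_context(env=os.environ) -> dict:
    """Snapshot configuration context from the CI job's environment variables."""
    return {
        "helm_chart_ref": env.get("GITLAB_HELM_CHART_REF", "unknown"),
        # RESOURCE_PRESET is an assumed variable name for the --resource-preset value.
        "resource_preset": env.get("RESOURCE_PRESET", "unknown"),
    }

# Usage with an explicit dict instead of the real process environment:
ctx = environment_context({"GITLAB_HELM_CHART_REF": "abc123",
                           "RESOURCE_PRESET": "performance"})
```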
Expected Output Format
The collected metrics should be structured as:
```json
{
  "test_results": {
    "rps": 150.5,
    "ttfb_p90_ms": 450,
    "success_rate": 99.8
  },
  "environment_metrics": {
    "gitlab_version": "17.7.0",
    "helm_chart_ref": "abc123",
    "resource_preset": "performance",
    "pod_resources": {
      "webservice": {
        "cpu_usage": "250m",
        "memory_usage": "1.2Gi",
        "restarts": 0,
        "cpu_request": "500m",
        "cpu_limit": "2",
        "memory_request": "2Gi",
        "memory_limit": "4Gi"
      },
      "gitaly": { },
      "postgresql": { },
      "redis": { }
    },
    "node_resources": {
      "cpu_capacity": "4",
      "cpu_usage": "2.5",
      "memory_capacity": "16Gi",
      "memory_usage": "8Gi"
    },
    "application_metrics": {
      "http_request_p95_ms": 450,
      "db_connection_pool_busy": 15,
      "db_connection_pool_size": 20,
      "cache_hit_rate": 0.85,
      "gitaly_rpc_p95_ms": 120
    },
    "health_indicators": {
      "pod_restarts": 0,
      "oom_kills": 0,
      "failed_pods": 0
    }
  }
}
```
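Finally, a sketch of assembling the sections into the artifact file (the file name `environment_metrics.json` is an assumption; the example writes to a temp directory standing in for the CI artifact path):

```python
import json
import os
import tempfile

def write_artifact(path: str, test_results: dict, environment_metrics: dict) -> dict:
    """Combine k6 results and environment metrics into one JSON artifact."""
    payload = {
        "test_results": test_results,
        "environment_metrics": environment_metrics,
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload

# Usage: write to a temp file standing in for the CI artifacts directory.
path = os.path.join(tempfile.mkdtemp(), "environment_metrics.json")
payload = write_artifact(path,
                         {"rps": 150.5},
                         {"resource_preset": "performance"})
```

The CI job would then list this path under `artifacts:paths` so it is stored alongside the k6 results and available to the Duo report generation step.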