Integrate environment metrics into reporting

Summary

Integrate environment metrics (collected in #116) into CPT's Duo-generated performance reports to provide comprehensive analysis that correlates application performance with infrastructure health.

Problem

Currently, CPT reports only analyze k6 test metrics (RPS, TTFB, success rates) in isolation. When performance degradation occurs, it's unclear whether the root cause is:

  • Application-level issues (code changes, inefficient queries)
  • Infrastructure constraints (CPU/memory saturation, pod restarts)
  • Environment instability (OOMKills, resource pressure)

This incomplete context leads to:

  • Misdiagnosed performance issues - Blaming code changes when the environment was resource-constrained
  • Manual investigation overhead - Engineers must separately check Kubernetes metrics
  • Reduced report value - Duo can't provide actionable insights without full context

Goal

Enhance CPT reports to include environment metrics analysis, enabling Duo to:

  • Identify if performance variations correlate with resource saturation
  • Detect environment health issues (pod restarts, OOMKills) during test runs
  • Provide context-aware recommendations (e.g., "TTFB increased 40% while webservice CPU hit 95% - likely resource constraint, not code regression")

Implementation Strategy

1. Data Integration & Preparation

Load and merge metrics:

  • Read environment metrics JSON artifact from #116
  • Combine with k6 test results into unified payload
  • Validate completeness and handle missing data gracefully
  • Ensure temporal alignment (metrics from same test window)

Merged data structure:

{ "test_metadata": { "mr_iid": "12345", "commit_sha": "abc123", "duration": 300 }, "k6_metrics": { "rps": 150.5, "ttfb_p90_ms": 450, "success_rate": 99.8 }, "environment_metrics": { "pod_resources": { "webservice": { "cpu_usage": "85%", "memory": "1.2Gi/4Gi", "restarts": 0 } }, "node_resources": { "cpu_usage": "2.5/4", "memory": "8Gi/16Gi" }, "application_metrics": { "db_pool_busy": 15, "db_pool_size": 20, "cache_hit_rate": 0.85 }, "health_indicators": { "pod_restarts": 0, "oom_kills": 0 } } }


2. Enhance Duo Prompt

Add environment analysis instructions:

```
CRITICAL ANALYSIS RULES:

1. Correlate performance changes with resource utilization
2. Flag resource constraint if CPU >80% or Memory >85%
3. Flag instability if pod restarts >0 during test
4. Flag DB bottleneck if connection pool busy/size ratio >0.8
5. Distinguish code-related vs infrastructure-related issues

REQUIRED REPORT SECTIONS:

1. Performance Summary (k6 metrics)
2. Environment Health (pod status, restarts, OOMKills)
3. Resource Utilization (CPU/memory per component)
4. Correlation Analysis (performance vs resource changes)
5. Root Cause Assessment (code vs infrastructure)
6. Actionable Recommendations
```
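
A sketch of how the enhanced prompt could be assembled around the merged payload. The rule text mirrors the block above; `build_prompt` and the surrounding wording are illustrative assumptions, not CPT's actual prompt code:

```python
import json

ANALYSIS_RULES = """\
CRITICAL ANALYSIS RULES:
1. Correlate performance changes with resource utilization
2. Flag resource constraint if CPU >80% or Memory >85%
3. Flag instability if pod restarts >0 during test
4. Flag DB bottleneck if connection pool busy/size ratio >0.8
5. Distinguish code-related vs infrastructure-related issues

REQUIRED REPORT SECTIONS:
1. Performance Summary (k6 metrics)
2. Environment Health (pod status, restarts, OOMKills)
3. Resource Utilization (CPU/memory per component)
4. Correlation Analysis (performance vs resource changes)
5. Root Cause Assessment (code vs infrastructure)
6. Actionable Recommendations
"""


def build_prompt(payload: dict) -> str:
    """Combine the analysis instructions with the merged metrics payload."""
    return (
        f"{ANALYSIS_RULES}\n"
        "Analyse the following test run and produce the required sections:\n"
        f"{json.dumps(payload, indent=2)}"
    )
```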

3. Update Report Generator

Modify the existing report flow (see the sketch after this list):

  • Pass merged payload to Duo API with enhanced prompt
  • Apply retry logic (#113 (closed)) for complete reports
  • Maintain outlier filtering rules
  • Format output with new environment sections
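
A sketch of the updated generation step, assuming a `duo_client.generate()` helper already exists in CPT and that the retry behaviour from #113 is a simple bounded retry with completeness checks (both are assumptions, not confirmed implementation details):

```python
import logging
import time

log = logging.getLogger(__name__)

REQUIRED_SECTIONS = (
    "Performance Summary",
    "Environment Health",
    "Resource Utilization",
    "Correlation Analysis",
    "Root Cause Assessment",
    "Actionable Recommendations",
)


def generate_report(duo_client, payload, max_attempts=3):
    """Ask Duo for a report and retry until all required sections are present."""
    prompt = build_prompt(payload)  # from the prompt-assembly sketch above
    for attempt in range(1, max_attempts + 1):
        report = duo_client.generate(prompt)  # hypothetical CPT helper
        missing = [s for s in REQUIRED_SECTIONS if s not in report]
        if not missing:
            return report
        log.warning("Attempt %d: report missing sections %s", attempt, missing)
        time.sleep(2 ** attempt)  # simple backoff between retries
    raise RuntimeError("Duo did not return a complete report")
```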

Expected report sections:

```markdown
## Environment Health

- No pod restarts | ⚠️ Webservice CPU 85% (threshold: 80%)

## Resource Utilization

| Component  | CPU | Memory    | Status  |
|------------|-----|-----------|---------|
| Webservice | 85% | 1.2Gi/4Gi | ⚠️ High |
| PostgreSQL | 60% | 1.5Gi/4Gi | OK      |

## Correlation Analysis

TTFB P95 ↑40% (350ms → 490ms) + Webservice CPU ↑25% (60% → 85%)
Assessment: Performance degradation correlates with CPU saturation
Recommendation: Likely resource constraint. Increase webservice CPU or scale replicas.
```


4. Testing & Validation

Test scenarios:

  • Healthy environment + good performance → Report shows "no issues"
  • Resource-constrained + degraded performance → Duo identifies resource constraint
  • Healthy environment + code regression → Duo identifies code issue
  • Pod restarts during test → Duo flags environment instability

Validation criteria:

  • Environment metrics in all reports
  • Correct correlation between performance and resources
  • Distinguishes code vs infrastructure issues
  • Provides actionable recommendations
  • Graceful fallback if environment metrics missing
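
A pytest-style sketch of the graceful-fallback scenario above, reusing the hypothetical `build_payload()` helper from the data-integration sketch:

```python
def test_missing_environment_metrics_falls_back_to_k6_only(tmp_path):
    """Report input should still be produced when the #116 artifact is absent."""
    k6_file = tmp_path / "k6_metrics.json"
    k6_file.write_text(
        '{"k6_metrics": {"rps": 150.5, "ttfb_p90_ms": 450, "success_rate": 99.8}}'
    )

    payload = build_payload(k6_path=k6_file, env_path=tmp_path / "missing.json")

    assert "k6_metrics" in payload
    assert "environment_metrics" not in payload  # graceful fallback, no crash
```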

5. Error Handling & Rollout

Error handling:

  • Fall back to k6-only analysis if environment data unavailable
  • Log warnings for incomplete metrics
  • Ensure payload doesn't exceed Duo token limits
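
A sketch of a payload-size guard, assuming a rough character budget as a stand-in for Duo's token limit and a priority order for which environment sections to drop first (the limit value and drop order are placeholders, not documented numbers):

```python
import json
import logging

log = logging.getLogger(__name__)

MAX_PAYLOAD_CHARS = 40_000  # placeholder budget; tune against the real Duo token limit


def shrink_payload(payload: dict) -> dict:
    """Drop lower-priority environment sections until the payload fits the budget."""
    droppable = ["application_metrics", "node_resources"]  # keep health indicators and pod data
    while len(json.dumps(payload)) > MAX_PAYLOAD_CHARS and droppable:
        section = droppable.pop()
        removed = payload.get("environment_metrics", {}).pop(section, None)
        if removed is not None:
            log.warning("Dropped environment section %r to fit Duo payload budget", section)
    return payload
```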

Rollout:

  1. Enable for internal PE testing
  2. Monitor report quality and accuracy
  3. Gather feedback and iterate
  4. Roll out to all CPT users
  5. Update documentation with examples

Success Metrics

  • 100% of reports include environment metrics
  • 90% accuracy identifying resource constraints
  • Reduced manual investigation time
  • Positive user feedback on actionability