Integrate environment metrics into reporting

Summary

Integrate environment metrics (collected in #116) into CPT's Duo-generated performance reports to provide comprehensive analysis that correlates application performance with infrastructure health.

Problem

Currently, CPT reports only analyze k6 test metrics (RPS, TTFB, success rates) in isolation. When performance degradation occurs, it's unclear whether the root cause is:

  • Application-level issues (code changes, inefficient queries)
  • Infrastructure constraints (CPU/memory saturation, pod restarts)
  • Environment instability (OOMKills, resource pressure)

This incomplete context leads to:

  • Misdiagnosed performance issues - Blaming code changes when the environment was resource-constrained
  • Manual investigation overhead - Engineers must separately check Kubernetes metrics
  • Reduced report value - Duo can't provide actionable insights without full context

Goal

Enhance CPT reports to include environment metrics analysis, enabling Duo to:

  • Identify if performance variations correlate with resource saturation
  • Detect environment health issues (pod restarts, OOMKills) during test runs
  • Provide context-aware recommendations (e.g., "TTFB increased 40% while webservice CPU hit 95% - likely resource constraint, not code regression")

Implementation Strategy

1. Data Integration & Preparation

Load and merge metrics:

  • Read environment metrics JSON artifact from #116
  • Combine with k6 test results into unified payload
  • Validate completeness and handle missing data gracefully
  • Ensure temporal alignment (metrics from same test window)

Merged data structure:

{ "test_metadata": { "mr_iid": "12345", "commit_sha": "abc123", "duration": 300 }, "k6_metrics": { "rps": 150.5, "ttfb_p90_ms": 450, "success_rate": 99.8 }, "environment_metrics": { "pod_resources": { "webservice": { "cpu_usage": "85%", "memory": "1.2Gi/4Gi", "restarts": 0 } }, "node_resources": { "cpu_usage": "2.5/4", "memory": "8Gi/16Gi" }, "application_metrics": { "db_pool_busy": 15, "db_pool_size": 20, "cache_hit_rate": 0.85 }, "health_indicators": { "pod_restarts": 0, "oom_kills": 0 } } }


2. Enhance Duo Prompt

Add environment analysis instructions:

```
CRITICAL ANALYSIS RULES:

1. Correlate performance changes with resource utilization
2. Flag resource constraint if CPU >80% or Memory >85%
3. Flag instability if pod restarts >0 during test
4. Flag DB bottleneck if connection pool busy/size ratio >0.8
5. Distinguish code-related vs infrastructure-related issues

REQUIRED REPORT SECTIONS:

1. Performance Summary (k6 metrics)
2. Environment Health (pod status, restarts, OOMKills)
3. Resource Utilization (CPU/memory per component)
4. Correlation Analysis (performance vs resource changes)
5. Root Cause Assessment (code vs infrastructure)
6. Actionable Recommendations
```
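
A sketch of how the enhanced prompt could be assembled around the merged payload. The rule text mirrors the block above; `build_prompt` and the surrounding wording are illustrative assumptions, not CPT's actual prompt code:

```python
import json

ANALYSIS_RULES = """\
CRITICAL ANALYSIS RULES:
1. Correlate performance changes with resource utilization
2. Flag resource constraint if CPU >80% or Memory >85%
3. Flag instability if pod restarts >0 during test
4. Flag DB bottleneck if connection pool busy/size ratio >0.8
5. Distinguish code-related vs infrastructure-related issues

REQUIRED REPORT SECTIONS:
1. Performance Summary (k6 metrics)
2. Environment Health (pod status, restarts, OOMKills)
3. Resource Utilization (CPU/memory per component)
4. Correlation Analysis (performance vs resource changes)
5. Root Cause Assessment (code vs infrastructure)
6. Actionable Recommendations
"""


def build_prompt(payload: dict) -> str:
    """Combine the analysis instructions with the merged metrics payload."""
    return (
        f"{ANALYSIS_RULES}\n"
        "Analyse the following test run and produce the required sections:\n"
        f"{json.dumps(payload, indent=2)}"
    )
```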

3. Update Report Generator

Modify the existing report flow (see the sketch after this list):

  • Pass merged payload to Duo API with enhanced prompt
  • Apply retry logic (#113 (closed)) for complete reports
  • Maintain outlier filtering rules
  • Format output with new environment sections
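
A sketch of the updated generation step, assuming a `duo_client.generate()` helper already exists in CPT and that the retry behaviour from #113 is a simple bounded retry with completeness checks (both are assumptions, not confirmed implementation details):

```python
import logging
import time

log = logging.getLogger(__name__)

REQUIRED_SECTIONS = (
    "Performance Summary",
    "Environment Health",
    "Resource Utilization",
    "Correlation Analysis",
    "Root Cause Assessment",
    "Actionable Recommendations",
)


def generate_report(duo_client, payload, max_attempts=3):
    """Ask Duo for a report and retry until all required sections are present."""
    prompt = build_prompt(payload)  # from the prompt-assembly sketch above
    for attempt in range(1, max_attempts + 1):
        report = duo_client.generate(prompt)  # hypothetical CPT helper
        missing = [s for s in REQUIRED_SECTIONS if s not in report]
        if not missing:
            return report
        log.warning("Attempt %d: report missing sections %s", attempt, missing)
        time.sleep(2 ** attempt)  # simple backoff between retries
    raise RuntimeError("Duo did not return a complete report")
```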

Expected report sections:

```markdown
## Environment Health

- No pod restarts | ⚠️ Webservice CPU 85% (threshold: 80%)

## Resource Utilization

| Component  | CPU | Memory    | Status  |
|------------|-----|-----------|---------|
| Webservice | 85% | 1.2Gi/4Gi | ⚠️ High |
| PostgreSQL | 60% | 1.5Gi/4Gi | OK      |

## Correlation Analysis

TTFB P95 ↑40% (350ms → 490ms) + Webservice CPU ↑25% (60% → 85%)
Assessment: Performance degradation correlates with CPU saturation
Recommendation: Likely resource constraint. Increase webservice CPU or scale replicas.
```


4. Testing & Validation

Test scenarios:

  • Healthy environment + good performance → Report shows "no issues"
  • Resource-constrained + degraded performance → Duo identifies resource constraint
  • Healthy environment + code regression → Duo identifies code issue
  • Pod restarts during test → Duo flags environment instability

Validation criteria:

  • Environment metrics in all reports
  • Correct correlation between performance and resources
  • Distinguishes code vs infrastructure issues
  • Provides actionable recommendations
  • Graceful fallback if environment metrics missing
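
A pytest-style sketch of the graceful-fallback scenario above, reusing the hypothetical `build_payload()` helper from the data-integration sketch:

```python
def test_missing_environment_metrics_falls_back_to_k6_only(tmp_path):
    """Report input should still be produced when the #116 artifact is absent."""
    k6_file = tmp_path / "k6_metrics.json"
    k6_file.write_text(
        '{"k6_metrics": {"rps": 150.5, "ttfb_p90_ms": 450, "success_rate": 99.8}}'
    )

    payload = build_payload(k6_path=k6_file, env_path=tmp_path / "missing.json")

    assert "k6_metrics" in payload
    assert "environment_metrics" not in payload  # graceful fallback, no crash
```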

5. Error Handling & Rollout

Error handling:

  • Fall back to k6-only analysis if environment data unavailable
  • Log warnings for incomplete metrics
  • Ensure payload doesn't exceed Duo token limits
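
A sketch of a payload-size guard, assuming a rough character budget as a stand-in for Duo's token limit and a priority order for which environment sections to drop first (the limit value and drop order are placeholders, not documented numbers):

```python
import json
import logging

log = logging.getLogger(__name__)

MAX_PAYLOAD_CHARS = 40_000  # placeholder budget; tune against the real Duo token limit


def shrink_payload(payload: dict) -> dict:
    """Drop lower-priority environment sections until the payload fits the budget."""
    droppable = ["application_metrics", "node_resources"]  # keep health indicators and pod data
    while len(json.dumps(payload)) > MAX_PAYLOAD_CHARS and droppable:
        section = droppable.pop()
        removed = payload.get("environment_metrics", {}).pop(section, None)
        if removed is not None:
            log.warning("Dropped environment section %r to fit Duo payload budget", section)
    return payload
```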

Rollout:

  1. Enable for internal PE testing
  2. Monitor report quality and accuracy
  3. Gather feedback and iterate
  4. Roll out to all CPT users
  5. Update documentation with examples

Success Metrics

  • 100% of reports include environment metrics
  • 90% accuracy identifying resource constraints
  • Reduced manual investigation time
  • Positive user feedback on actionability