
Operate Secret Detection service in passive mode until observations are recorded

Context

As mentioned in the weekly discussion, we don't have solid metrics on SD scan request traffic, and the service sits on a critical path (the git push event), so there's a high chance that a failure of the service could disrupt customers' business. We need sufficient metrics to show that the service can handle the traffic without disrupting customers.

Plan

We will run the Secret Detection (SD) service passively in the production environment: we will not use it for Secret Detection scanning, and will only observe its system and traffic metrics. We will continue using the secret-detection gem to run the actual secret detection scans. For every scan request received, a copy of the request will be sent to the SD service in a fire-and-forget manner so that SPP's overall latency is not affected.

This approach would give us a fair idea of the currently unknown load and system metrics in the production environment without affecting customers' business. After monitoring for a milestone or two post-GA release, we should have enough data to decide on switching to the SD service.
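
The fire-and-forget mirroring described in the plan can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual integration: the names `scan_push`, `scan_with_gem`, and `send_to_sd_service` are hypothetical, and Python is used purely for illustration. The point it shows is that the copy sent to the SD service runs off the request path, its result is discarded, and any failure is logged rather than propagated.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

logger = logging.getLogger("sd_passive_mirror")

# Fixed number of worker threads for mirrored calls.
# Note: the pending-task queue is unbounded in this sketch.
_mirror_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="sd-mirror")


def _mirror_safely(send_to_sd_service: Callable[[bytes], object], payload: bytes) -> None:
    """Send the copy to the SD service; discard the result and swallow failures."""
    try:
        send_to_sd_service(payload)
    except Exception:
        # A failing or slow SD service must never affect the push itself.
        logger.warning("passive SD service call failed", exc_info=True)


def scan_push(
    payload: bytes,
    scan_with_gem: Callable[[bytes], List[str]],    # hypothetical stand-in for the gem scan
    send_to_sd_service: Callable[[bytes], object],  # hypothetical stand-in for the SD client
) -> List[str]:
    # Fire and forget: submit() returns immediately and the future is ignored.
    _mirror_pool.submit(_mirror_safely, send_to_sd_service, payload)
    # The gem remains the only source of scan results in passive mode.
    return scan_with_gem(payload)
```

Because the caller never waits on the mirrored call, an outage of the SD service shows up only in logs and metrics, which is the behavior passive mode relies on.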

Observations to record from SD service

  • Average and peak number of requests received.
  • Average and peak resource (CPU/RAM/IO) consumption.
  • Min, average, and max number of Runway instances scaled.
  • Crashes that occurred, along with their reasons.
  • Feel free to add any other metrics here
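
As a rough sketch of how the min/average/peak figures above could be derived from raw samples over an observation window: the `Summary` type and `summarize` function are hypothetical names, and the sample values in the usage note are made up for illustration, not taken from any dashboard.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class Summary:
    minimum: float
    average: float
    peak: float


def summarize(samples: Sequence[float]) -> Summary:
    """Reduce one metric's samples (e.g. requests/s, CPU %) to min/avg/peak."""
    if not samples:
        raise ValueError("no samples in the observation window")
    return Summary(minimum=min(samples), average=fmean(samples), peak=max(samples))
```

For example, `summarize([0.0, 0.5, 1.0, 0.5])` yields a minimum of 0.0, an average of 0.5, and a peak of 1.0. The same reduction applies to request rate, resource consumption, and instance counts.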

Observation Sources

Success Criteria to switch to active mode

  • No errors or warnings
  • Average latency is below 100 ms (proposed)
  • Test suite for SDS has full coverage (proposed)
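
The criteria above could be checked mechanically along these lines. This is a sketch only: `PassiveModeSummary`, `unmet_criteria`, and `ready_for_active_mode` are hypothetical names, the 100 ms threshold is the proposed value from the list, and how "full coverage" is measured is left as a boolean input because the issue does not define it.

```python
from dataclasses import dataclass
from typing import List

MAX_AVG_LATENCY_MS = 100.0  # proposed threshold, not yet agreed


@dataclass(frozen=True)
class PassiveModeSummary:
    error_count: int
    warning_count: int
    avg_latency_ms: float
    test_coverage_complete: bool  # assumption: decided outside this check


def unmet_criteria(summary: PassiveModeSummary) -> List[str]:
    """Return a human-readable list of the criteria that are not yet met."""
    unmet = []
    if summary.error_count > 0 or summary.warning_count > 0:
        unmet.append("errors or warnings were recorded")
    if not summary.avg_latency_ms < MAX_AVG_LATENCY_MS:
        unmet.append("average latency is not below 100 ms")
    if not summary.test_coverage_complete:
        unmet.append("SDS test suite coverage is incomplete")
    return unmet


def ready_for_active_mode(summary: PassiveModeSummary) -> bool:
    return not unmet_criteria(summary)
```

Returning the list of unmet criteria, rather than only a boolean, makes it easier to record in the issue why a switch to active mode was deferred.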

Events

| Event | Milestone | Comment | Status |
|---|---|---|---|
| Release to Production in Passive mode | %17.9 | Code is on prod, CR for settings is complete, FF will be turned on 2024-12-19 | workflow complete |
| Release to Production in Active mode (Beta) | %17.10 | | |
| Release to Production in Active mode (GA) | %17.11 | | |

Observation logs

| Date | Avg/Peak # requests/s | Avg/Peak CPU | Avg/Peak RAM | Min/Avg/Max Instances | Request latency min/avg/max (ms) | # Exceptions | Dashboard link | Notes |
|---|---|---|---|---|---|---|---|---|
| 2025-02-05 | 0.07 / 0.47 | 0.99% / 0.99% | 1.40% / 1.99% | 2 / 3 / 5 | 9.9 / - / 229.4 | 0 | link | Time range starting 2025-01-31 because normal traffic started then |
| 2025-02-12 | 0.05 / 1.0 | 1.00% / 3.97% | 1.23% / 1.98% | 2 / 3 / 4 | 9.9 / - / 339.42 | 0 | link | |
| 2025-02-18 | 0.03 / 0.27 | 0.99% / 0.99% | 0.99% / 0.99% | 2 / 3 / 6 | 9.9 / - / 435 | 0 | link | |
| 2025-02-26 | 0.04 / 0.23 | 1.0% / 1.99% | 0.99% / 0.99% | 2 / 3 / 6 | 9.9 / 12.2 / 210 | 0 | link | |
| 2025-03-05 | 0.08 / 0.45 | 0.99% / 12.95% | 1.59% / 1.99% | 2 / 3 / 7 | 9.9 / 11.59 / 221.48 | 0 | link | Some data was from different instances to illustrate extremes |
| 2025-03-12 | 0.08 / 0.48 | 0.99% / 0.99% | 1.25% / 1.99% | 2 / 3 / 6 | 9.9 / 14.35 / 336.24 | 0 | link | 26470 SPP enabled projects |
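
To read the weekly rows as a whole, the worst-case values across the observation window can be pulled out with a small script such as the one below. The numbers are copied from the log above (peak requests/s, max instances, max latency, and exception count per row); the `WeeklyObservation` type and `worst_case` function are hypothetical names used only for this sketch.

```python
from typing import Dict, List, NamedTuple


class WeeklyObservation(NamedTuple):
    date: str
    peak_requests_per_s: float
    max_instances: int
    max_latency_ms: float
    exceptions: int


# Values transcribed from the observation log rows.
OBSERVATIONS: List[WeeklyObservation] = [
    WeeklyObservation("2025-02-05", 0.47, 5, 229.4, 0),
    WeeklyObservation("2025-02-12", 1.0, 4, 339.42, 0),
    WeeklyObservation("2025-02-18", 0.27, 6, 435.0, 0),
    WeeklyObservation("2025-02-26", 0.23, 6, 210.0, 0),
    WeeklyObservation("2025-03-05", 0.45, 7, 221.48, 0),
    WeeklyObservation("2025-03-12", 0.48, 6, 336.24, 0),
]


def worst_case(observations: List[WeeklyObservation]) -> Dict[str, float]:
    """Collapse the weekly rows into the most demanding value seen per metric."""
    return {
        "peak_requests_per_s": max(o.peak_requests_per_s for o in observations),
        "max_instances": max(o.max_instances for o in observations),
        "max_latency_ms": max(o.max_latency_ms for o in observations),
        "total_exceptions": sum(o.exceptions for o in observations),
    }
```

Over the six logged rows this gives a highest peak of 1.0 requests/s, at most 7 instances, a highest max latency of 435 ms, and 0 exceptions in total.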
Edited by Ethan Urie