Operate Secret Detection service in passive mode until observations are recorded
Context
As discussed in the weekly meeting, we do not have solid metrics on Secret Detection (SD) scan request traffic, and the service sits on a critical path (the git push event). A failure of the service could therefore easily disrupt customers' business. Before relying on it, we need enough metrics to show that the service can handle the traffic without causing such disruption.
Plan
We will run the Secret Detection service passively in the production environment: it will not be used for actual Secret Detection scanning, and we will only observe its system and traffic metrics. Scans will continue to run through the secret-detection gem. For every scan request received, a copy of the request will be sent to the SD service in a fire-and-forget manner so that the overall latency of Secret Push Protection (SPP) is not affected.
This approach should give us a fair picture of the currently unknown load and system metrics in the production environment without affecting customers' business. After monitoring for a milestone or two post-GA release, we should have enough data to switch to the SD service.
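The fire-and-forget mirroring described above can be sketched roughly as follows. This is an illustration only, written in Python rather than in the language of the real integration. The names `scan_passively`, `scan_with_gem`, and `send_to_sd_service`, as well as the pool size, are hypothetical and do not come from the actual codebase. The point it shows is that the caller's result comes only from the existing scan path, and a failing SD service can never fail the push.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("sd_mirror")

# Small worker pool (size is an assumption). Note that the pending-task
# queue of ThreadPoolExecutor is unbounded; a real setup may want a cap.
_mirror_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="sd-mirror")


def _send_copy(send_to_sd_service, payload):
    """Send one mirrored request; log and swallow every failure."""
    try:
        send_to_sd_service(payload)
    except Exception:
        logger.warning("mirrored SD request failed", exc_info=True)


def scan_passively(payload, scan_with_gem, send_to_sd_service):
    """Run the authoritative scan and mirror the request without waiting.

    Both callables are hypothetical stand-ins: `scan_with_gem` for the
    existing gem-based scan, `send_to_sd_service` for the SD service client.
    """
    try:
        # submit() returns immediately; the returned future is ignored.
        _mirror_pool.submit(_send_copy, send_to_sd_service, payload)
    except RuntimeError:
        # Pool already shut down: skip mirroring rather than fail the push.
        logger.warning("mirror pool unavailable; skipping copy")
    # The result the caller sees comes only from the existing scan path.
    return scan_with_gem(payload)
```

Because the mirrored call runs on a separate worker and its outcome is never awaited, its latency and errors show up only in the SD service metrics and logs, which is what passive mode is meant to collect.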
Observations to record from SD service
- Average and peak number of requests received.
- Average and peak resource (CPU/RAM/IO) consumption.
- Min, average, and max number of Runway instances scaled.
- Crashes that occurred, along with their causes.
- Feel free to add any other metrics here
Observation Sources
- We can refer to the Secret Detection Service dashboard which includes all the above-mentioned resource metrics
- We can refer to service logs to monitor application crashes
Success Criteria to switch to active mode
- No errors or warnings
- Average latency is below 100 ms (proposed)
- Test suite for SDS has full coverage (proposed)
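As a rough sketch, the runtime criteria above could be checked against a single observation row like this. The `Observation` type and `meets_runtime_criteria` function are hypothetical names introduced for illustration. The sketch assumes that "no errors or warnings" maps to an exception count of zero (warnings are not tracked separately here) and treats a missing average latency as unverified. The test coverage criterion is not a runtime metric and is left out.

```python
from dataclasses import dataclass
from typing import Optional

MAX_AVG_LATENCY_MS = 100.0  # proposed threshold from the criteria above


@dataclass
class Observation:
    date: str
    avg_latency_ms: Optional[float]  # None when no average was recorded
    exceptions: int


def meets_runtime_criteria(obs: Observation) -> bool:
    """Return True when one observation satisfies the error and latency criteria."""
    if obs.exceptions != 0:
        return False
    if obs.avg_latency_ms is None:
        # No average recorded, so the latency criterion cannot be confirmed.
        return False
    return obs.avg_latency_ms < MAX_AVG_LATENCY_MS


# Two rows taken from the observation logs in this issue.
rows = [
    Observation("2025-02-18", None, 0),
    Observation("2025-03-12", 14.35, 0),
]
for row in rows:
    print(row.date, meets_runtime_criteria(row))
```

Treating a missing average as a failure is a deliberately conservative choice; a looser check could skip such rows instead.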
Events
Event | Milestone | Comment | Status |
---|---|---|---|
Release to Production in Passive mode | %17.9 | Code is on prod, CR for settings is complete, FF will be turned on on 2024-12-19 | workflow::complete |
Release to Production in Active mode (Beta) | %17.10 | | |
Release to Production in Active mode (GA) | %17.11 | | |
Observation logs
Date | Avg/Peak # requests/s | Avg/Peak CPU | Avg/Peak RAM | Min/Avg/Max Instances | Request latency min/avg/max (ms) | # Exceptions | Dashboard link | Notes |
---|---|---|---|---|---|---|---|---|
2025-02-05 | 0.07 / 0.47 | 0.99% / 0.99% | 1.40% / 1.99% | 2 / 3 / 5 | 9.9 / - / 229.4 | 0 | link | Time range starting 2025-01-31 because normal traffic started then |
2025-02-12 | 0.05 / 1.0 | 1.00% / 3.97% | 1.23% / 1.98% | 2 / 3 / 4 | 9.9 / - / 339.42 | 0 | link | |
2025-02-18 | 0.03 / 0.27 | 0.99% / 0.99% | 0.99% / 0.99% | 2 / 3 / 6 | 9.9 / - / 435 | 0 | link | |
2025-02-26 | 0.04 / 0.23 | 1.0% / 1.99% | 0.99% / 0.99% | 2 / 3 / 6 | 9.9 / 12.2 / 210 | 0 | link | |
2025-03-05 | 0.08 / 0.45 | 0.99% / 12.95% | 1.59% / 1.99% | 2 / 3 / 7 | 9.9 / 11.59 / 221.48 | 0 | link | Some data was from different instances to illustrate extremes |
2025-03-12 | 0.08 / 0.48 | 0.99% / 0.99% | 1.25% / 1.99% | 2 / 3 / 6 | 9.9 / 14.35 / 336.24 | 0 | link | 26,470 SPP-enabled projects |