Operate Secret Detection service in passive mode until observations are recorded

Context

As mentioned in the weekly discussion, we don't have solid metrics on SD scan request traffic, and the service sits on a critical path (the git push event), so a service failure could well disrupt customers' business. Before relying on the service, we need sufficient metrics to show that it can handle the traffic without disrupting customers.

Plan

We will run the Secret Detection (SD) service passively in the production environment: we will not use its results for secret detection scanning, but will observe its system and traffic metrics. We will continue using the secret-detection gem to run the actual scans. For every scan request received, a copy of the request will be sent to the SD service in a fire-and-forget manner to avoid affecting SPP's overall latency.
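The fire-and-forget mirroring described above could look roughly like the following. This is a minimal Python sketch for illustration only, not the actual implementation: `scan_with_passive_mirror`, `scan_with_gem`, and `send_to_service` are hypothetical names, and the thread pool is just one assumed way to keep the mirrored call off the request path.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

logger = logging.getLogger("sd_passive_mirror")

# Small, bounded pool (assumed size) so mirroring cannot spawn unbounded threads.
_mirror_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="sd-mirror")


def _mirror_safely(send_to_service: Callable[[dict], object], payload: dict) -> None:
    """Send the copied request to the SD service; log and swallow any failure."""
    try:
        send_to_service(payload)  # the response is intentionally discarded
    except Exception:
        logger.warning("passive SD service call failed", exc_info=True)


def scan_with_passive_mirror(
    payload: dict,
    scan_with_gem: Callable[[dict], dict],
    send_to_service: Callable[[dict], object],
) -> dict:
    """Run the real scan with the gem and mirror a copy to the SD service.

    Both callables are hypothetical stand-ins: `scan_with_gem` for the existing
    gem-based scan, `send_to_service` for the client call to the SD service.
    """
    # Hand a shallow copy to the background pool and return immediately;
    # the caller never waits on, or sees errors from, the mirrored request.
    _mirror_pool.submit(_mirror_safely, send_to_service, dict(payload))
    # The gem result remains the only result used for the push decision.
    return scan_with_gem(payload)
```

With this shape, a slow or failing SD service only produces a log entry; the latency and outcome of the push are determined by the gem-based scan alone.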

This approach should give us a fair picture of the currently unknown load and system metrics in the production environment without affecting customers' business. After monitoring for a milestone or two post-GA release, we should have enough data to switch to the SD service.

Observations to record from SD service

Average and peak number of requests received.
Average and peak resource (CPU/RAM/IO) consumption.
Minimum, average, and maximum number of Runway instances scaled.
Crashes that occurred, along with their reasons.
Feel free to add any other metrics here

Observation Sources

We can refer to the Secret Detection Service dashboard, which includes all the above-mentioned resource metrics.
We can refer to the service logs to monitor application crashes.

Success Criteria to switch to active mode

No errors or warnings
Average latency is below 100 ms (proposed)
Test suite for SDS has full coverage (proposed)
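As an illustration, the criteria above could be checked mechanically once the numbers are collected. The sketch below is hedged: the `PassiveModeSummary` fields and the `unmet_criteria` helper are hypothetical names, and the 100 ms threshold is the proposed value from the list, not an agreed one.

```python
from dataclasses import dataclass
from typing import List

# Proposed threshold taken from the criteria list; still subject to change.
PROPOSED_MAX_AVG_LATENCY_MS = 100.0


@dataclass
class PassiveModeSummary:
    """Hypothetical roll-up of one observation period."""
    error_count: int
    warning_count: int
    avg_latency_ms: float
    test_coverage_complete: bool


def unmet_criteria(summary: PassiveModeSummary) -> List[str]:
    """Return the criteria that are not yet satisfied (empty list means ready)."""
    unmet = []
    if summary.error_count > 0 or summary.warning_count > 0:
        unmet.append("errors or warnings were observed")
    if summary.avg_latency_ms >= PROPOSED_MAX_AVG_LATENCY_MS:
        unmet.append("average latency is not below the proposed threshold")
    if not summary.test_coverage_complete:
        unmet.append("SDS test suite does not have full coverage")
    return unmet
```

An empty result would mean all three criteria hold for that period; the switch to active mode would still be a human decision.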

Events

Event	Milestone	Comment	Status
Release to Production in Passive mode	%17.9	Code is on prod, CR for settings is complete, FF will be turned on 2024-12-19	workflow::complete
Release to Production in Active mode (Beta)	%17.10
Release to Production in Active mode (GA)	%17.11

Observation logs

Date	Avg/Peak # requests/s	Avg/Peak CPU	Avg/Peak RAM	Min/Avg/Max Instances	Request latency min/avg/max (ms)	Dashboard link	Notes
2025-02-05	0.07 / 0.47	0.99% / 0.99%	1.40% / 1.99%	2 / 3 / 5	9.9 / - / 229.4	link	Time range starting 2025-01-31 because normal traffic started then
2025-02-12	0.05 / 1.0	1.00% / 3.97%	1.23% / 1.98%	2 / 3 / 4	9.9 / - / 339.42	link
2025-02-18	0.03 / 0.27	0.99% / 0.99%	0.99% / 0.99%	2 / 3 / 6	9.9 / - / 435	link
2025-02-26	0.04 / 0.23	1.0% / 1.99%	0.99% / 0.99%	2 / 3 / 6	9.9 / 12.2 / 210	link
2025-03-05	0.08 / 0.45	0.99% / 12.95%	1.59% / 1.99%	2 / 3 / 7	9.9 / 11.59 / 221.48	link	Some data was from different instances to illustrate extremes
2025-03-12	0.08 / 0.48	0.99% / 0.99%	1.25% / 1.99%	2 / 3 / 6	9.9 / 14.35 / 336.24	link	26470 SPP enabled projects
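To compare the logged latencies against the proposed 100 ms average-latency criterion, the "Request latency" column can be parsed and summarized. The following is a small sketch under assumptions: the rows are copied by hand from the table above, `parse_cell` and `summarize` are hypothetical helpers, and "-" is treated as "not recorded".

```python
from typing import Dict, List, Optional, Tuple

# (date, "min / avg / max" request latency in ms), copied from the table above.
LATENCY_ROWS = [
    ("2025-02-05", "9.9 / - / 229.4"),
    ("2025-02-12", "9.9 / - / 339.42"),
    ("2025-02-18", "9.9 / - / 435"),
    ("2025-02-26", "9.9 / 12.2 / 210"),
    ("2025-03-05", "9.9 / 11.59 / 221.48"),
    ("2025-03-12", "9.9 / 14.35 / 336.24"),
]

# Proposed (not yet agreed) threshold from the success criteria.
PROPOSED_MAX_AVG_LATENCY_MS = 100.0


def parse_cell(cell: str) -> List[Optional[float]]:
    """Split a cell such as '9.9 / - / 229.4' or '0.99% / 12.95%' into numbers."""
    values: List[Optional[float]] = []
    for part in cell.split("/"):
        text = part.strip().rstrip("%")
        values.append(None if text == "-" else float(text))
    return values


def summarize(
    rows: List[Tuple[str, str]],
) -> Tuple[Dict[str, float], Optional[Tuple[str, float]]]:
    """Return the recorded average latency per date and the worst maximum seen."""
    recorded_avgs: Dict[str, float] = {}
    worst_max: Optional[Tuple[str, float]] = None
    for date, cell in rows:
        _minimum, average, maximum = parse_cell(cell)
        if average is not None:
            recorded_avgs[date] = average
        if maximum is not None and (worst_max is None or maximum > worst_max[1]):
            worst_max = (date, maximum)
    return recorded_avgs, worst_max


if __name__ == "__main__":
    averages, worst = summarize(LATENCY_ROWS)
    for date, value in averages.items():
        status = "below" if value < PROPOSED_MAX_AVG_LATENCY_MS else "not below"
        print(f"{date}: avg {value} ms is {status} the proposed threshold")
    print(f"worst maximum latency: {worst}")
```

On the rows logged so far, every recorded average is below the proposed threshold, while the maximum latency exceeds 100 ms in every entry; the criterion as written applies only to the average.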

Edited Mar 12, 2025 by Ethan Urie