Add per-destination circuit breaker for audit event streaming
What does this MR do and why?
Introduces a circuit breaker that skips destinations after 5 external-service or user-config failures within a 5-minute sliding window. Tripped destinations are skipped for 15 minutes. State is stored in Redis (SharedState), matching the pattern used by Gitlab::CircuitBreaker.
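The trip and recovery rules above can be sketched in a few lines. This is a minimal in-memory illustration, not the code in this MR: the real state lives in Redis (SharedState), and the class name, method names, and injectable clock below are hypothetical.

```ruby
# Illustrative in-memory sketch. The MR stores this state in Redis (SharedState);
# SketchCircuitBreaker, its methods, and the clock argument are hypothetical.
class SketchCircuitBreaker
  FAILURE_THRESHOLD = 5     # failures needed to trip the breaker
  FAILURE_WINDOW = 5 * 60   # sliding window, in seconds
  OPEN_DURATION = 15 * 60   # how long a tripped destination is skipped, in seconds

  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @failures = Hash.new { |hash, key| hash[key] = [] } # destination id => failure timestamps
    @open_until = {}                                    # destination id => expiry timestamp
  end

  # True while the destination should be skipped.
  def open?(destination_id)
    expiry = @open_until[destination_id]
    !expiry.nil? && @clock.call < expiry
  end

  # Records one failure and returns true if this failure tripped the breaker.
  def record_failure(destination_id)
    now = @clock.call
    window = @failures[destination_id]
    window << now
    # Drop failures that have fallen out of the sliding window.
    window.reject! { |timestamp| timestamp <= now - FAILURE_WINDOW }
    return false if window.size < FAILURE_THRESHOLD

    @open_until[destination_id] = now + OPEN_DURATION
    window.clear
    true
  end
end
```

State is keyed per destination, so one failing destination does not affect the others. The open state expires on its own, which mirrors the 15-minute skip period without needing an explicit reset.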
Only external errors trip the breaker; internal GitLab errors are reported to Sentry but do not disable customer destinations. Observability includes a structured Gitlab::AppLogger.warn on trip and a new Prometheus counter gitlab_audit_event_streaming_circuit_breaker_total.
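The split between errors that count toward the breaker and errors that are only reported could look roughly like this. The error classes and the helper name are hypothetical placeholders; the actual classes used in this MR are not shown here.

```ruby
# Hypothetical error classes, for illustration only.
class ExternalServiceError < StandardError; end # e.g. the destination timed out
class UserConfigError < StandardError; end      # e.g. the configured URL is invalid
class InternalError < StandardError; end        # a failure on the GitLab side

TRIPPING_ERRORS = [ExternalServiceError, UserConfigError].freeze

# Returns :record_failure for errors that should count toward the breaker,
# and :report_only for errors that should be reported without counting.
def breaker_action_for(error)
  TRIPPING_ERRORS.any? { |klass| error.is_a?(klass) } ? :record_failure : :report_only
end
```

Defaulting unknown errors to `:report_only` keeps the sketch conservative: an unexpected internal failure never disables a customer destination.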
Gated by the audit_event_streaming_circuit_breaker feature flag (default off, gitlab_com_derisk type) for safe rollout, because an incorrect breaker trip would silently drop audit events, which is a compliance-sensitive failure mode.
Changelog: added
References
Screenshots or screen recordings
| Before | After |
|---|---|
How to set up and validate locally
- Enable the feature flag from the Rails console:
  Feature.enable(:audit_event_streaming_circuit_breaker, group)
- In the UI, go to Group → Secure → Audit events → Streams and add two HTTP destinations:
  - Failing destination: URL http://10.255.255.1 (non-routable, will time out)
  - Working destination: URL from https://requestcatcher.com (create a free catcher and copy the URL)

  If destination creation fails for the non-routable IP, set the URL directly on an existing destination from the Rails console:
  dest = group.external_audit_event_streaming_destinations.find_by(name: '<failing-dest-name>')
  dest.update_column(:config, { 'url' => 'http://10.255.255.1' })
- Trigger at least 5 audit events for the group. Any audit-generating action works, for example:
  - Change a group setting (Settings → General → Permissions and group features).
  - Update a merge request approval rule.
  - Add or remove a group member.
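- After 5 failed deliveries within 5 minutes, verify the breaker tripped for the failing destination only:
  # Assumes dests holds the group's two destinations, failing one first.
  dests = group.external_audit_event_streaming_destinations.to_a
  dests.map { |d| [d.name, AuditEvents::Streaming::CircuitBreaker.open?(d)] }
  # => [["failing", true], ["working", false]]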
- Verify the breaker auto-recovers. Either wait 15 minutes for the open key to expire, or clear it manually:
  # dests.first is assumed to be the failing destination.
  Gitlab::Redis::SharedState.with do |r|
    r.del("audit_events:streaming:circuit:{group}:open:#{dests.first.id}")
  end
  AuditEvents::Streaming::CircuitBreaker.open?(dests.first)
  # => false

MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.