Add per-destination circuit breaker for audit event streaming

What does this MR do and why?

Introduces a per-destination circuit breaker that trips after 5 external-service or user-config failures within a 5-minute sliding window. A tripped destination is skipped for 15 minutes. State is stored in Redis (SharedState), matching the pattern used by Gitlab::CircuitBreaker.
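The thresholds above can be illustrated with a minimal in-memory sketch. This is not the code in this MR: the class name SketchCircuitBreaker and its methods are hypothetical, and state lives in plain Ruby hashes rather than Redis so the example runs on its own.

```ruby
# Illustrative in-memory sketch only; the MR keeps this state in Redis.
class SketchCircuitBreaker
  FAILURE_THRESHOLD = 5        # failures needed to trip the breaker
  FAILURE_WINDOW    = 5 * 60   # sliding window length, in seconds
  OPEN_DURATION     = 15 * 60  # how long a tripped destination is skipped, in seconds

  def initialize
    # destination id => timestamps of recent failures
    @failures = Hash.new { |hash, key| hash[key] = [] }
    # destination id => time at which the breaker closes again
    @open_until = {}
  end

  # Records one failure. Returns true if this failure tripped the breaker.
  def record_failure(destination_id, now: Time.now)
    return false if open?(destination_id, now: now)

    timestamps = @failures[destination_id]
    timestamps << now
    # Keep only failures that are still inside the sliding window.
    timestamps.reject! { |t| t <= now - FAILURE_WINDOW }
    return false if timestamps.size < FAILURE_THRESHOLD

    @open_until[destination_id] = now + OPEN_DURATION
    timestamps.clear
    true
  end

  # True while the destination should be skipped.
  def open?(destination_id, now: Time.now)
    closes_at = @open_until[destination_id]
    !closes_at.nil? && now < closes_at
  end
end

# Usage: five failures one second apart trip only the failing destination.
breaker = SketchCircuitBreaker.new
start = Time.at(0)
5.times { |i| breaker.record_failure(:failing, now: start + i) }
breaker.open?(:failing, now: start + 10) # => true
breaker.open?(:working, now: start + 10) # => false
```

Passing the clock in as `now:` keeps the sketch deterministic. A Redis-backed version would more likely rely on key expiry for both the window and the open period.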

Only external errors trip the breaker; internal GitLab errors are reported to Sentry but do not disable customer destinations. Observability includes a structured Gitlab::AppLogger.warn on trip and a new Prometheus counter gitlab_audit_event_streaming_circuit_breaker_total.
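The external-versus-internal split can be sketched as a small classifier. The error list and the classify_streaming_error helper below are assumptions for illustration. The MR's actual set of breaker-tripping errors is not listed in this description.

```ruby
require 'net/http' # defines Net::OpenTimeout and Net::ReadTimeout, and loads socket

# Hypothetical list of errors attributed to the customer's destination.
EXTERNAL_ERRORS = [
  Net::OpenTimeout,
  Net::ReadTimeout,
  Errno::ECONNREFUSED,
  SocketError
].freeze

# Returns :external for failures that should count toward the breaker,
# and :internal for everything else (reported, but never counted).
def classify_streaming_error(error)
  EXTERNAL_ERRORS.any? { |klass| error.is_a?(klass) } ? :external : :internal
end

classify_streaming_error(Net::OpenTimeout.new)      # => :external
classify_streaming_error(NoMethodError.new('oops')) # => :internal
```

Under this split, a bug in GitLab itself (the :internal case) can never disable a customer's destination.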

Gated by the audit_event_streaming_circuit_breaker feature flag (default off, gitlab_com_derisk type) for safe rollout, since incorrect breaker trips would silently drop audit events, a compliance-sensitive failure mode.

Changelog: added

How to set up and validate locally

  1. Enable the feature flag for the group in a Rails console:

         Feature.enable(:audit_event_streaming_circuit_breaker, group)

  2. In the UI, go to Group → Secure → Audit events → Streams and add two HTTP destinations:

     - Failing destination: URL http://10.255.255.1 (non-routable, will time out)
     - Working destination: URL from https://requestcatcher.com (create a free catcher and copy the URL)

     If destination creation fails for the non-routable IP, set the URL directly:

         dest = group.external_audit_event_streaming_destinations.find_by(name: '<failing-dest-name>')
         dest.update_column(:config, { 'url' => 'http://10.255.255.1' })

  3. Trigger at least 5 audit events for the group. Any audit-generating action works, for example:

     - Change a group setting (Settings → General → Permissions and group features).
     - Update a merge request approval rule.
     - Add or remove a group member.

  4. After 5 failed deliveries within 5 minutes, verify the breaker tripped for the failing destination only:

         dests = group.external_audit_event_streaming_destinations
         dest = dests.find_by(name: '<failing-dest-name>')
         dests.map { |d| [d.name, AuditEvents::Streaming::CircuitBreaker.open?(d)] }
         # => [["failing", true], ["working", false]]

  5. Verify the breaker auto-recovers. Either wait 15 minutes for the open key to expire, or clear it manually:

         Gitlab::Redis::SharedState.with do |r|
           r.del("audit_events:streaming:circuit:{group}:open:#{dest.id}")
         end

         AuditEvents::Streaming::CircuitBreaker.open?(dest)
         # => false

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Harsimar Sandhu
