Investigate better incident response to Duo Workflow downtime

Incident to be discussed: gitlab-org/duo-workflow/duo-workflow-service#333 (closed)

Areas for Discussion

This incident highlights several areas that require team discussion:

Process Concerns

  • The change appeared innocuous but had significant production impact

Questions for Team Discussion

  • How can we better identify cross-service dependencies before deploying changes?
  • What testing improvements could help catch similar issues in the future?
  • Should we establish consistent standards for header handling across services?
  • What monitoring could help detect similar issues earlier?
  • How can we improve our documentation around service requirements?
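
As a starting point for the testing and header-handling questions above, here is a minimal sketch of a shared check that each service could run in its test suite. It assumes the requirements can be expressed as a list of required header names. The names below are hypothetical placeholders, since the actual headers involved are not stated in this issue.

```python
from __future__ import annotations

from typing import Iterable, Mapping

# Hypothetical header names, for illustration only; the real cross-service
# requirements are not stated in this issue.
REQUIRED_HEADERS = ("x-example-realm", "x-example-instance-id")


def missing_headers(
    headers: Mapping[str, str],
    required: Iterable[str] = REQUIRED_HEADERS,
) -> list[str]:
    """Return required header names that are absent or blank in `headers`.

    Header names are compared case-insensitively, as HTTP header names are.
    """
    # Collect the lower-cased names of headers that carry a non-blank value.
    present = {name.lower() for name, value in headers.items() if value.strip()}
    return [name for name in required if name.lower() not in present]
```

A contract test on the sending side could then assert that `missing_headers(outgoing_headers)` is empty, so that a change which drops or renames a header fails in CI rather than in production. Whether such a shared list is the right standard is one of the open questions for the team.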

Next Steps

The team needs to:

  1. Discuss and determine the appropriate long-term solution
  2. Decide on preventive measures to avoid similar incidents
  3. Plan implementation priorities based on team consensus

Related Links

  • Revert MR
  • Original MR
Edited Apr 01, 2025 by Sebastian Rehm