Investigate better incident response to Duo Workflow downtime
Incident to be discussed: gitlab-org/duo-workflow/duo-workflow-service#333 (closed)
Areas for Discussion
This incident highlights several areas that require team discussion:
Process Concerns
- The change appeared innocuous in review but caused Duo Workflow downtime in production
Questions for Team Discussion
- How can we better identify cross-service dependencies before deploying changes?
- What testing improvements could help catch similar issues in the future?
- Should we establish consistent standards for header handling across services?
- What monitoring could help detect similar issues earlier?
- How can we improve our documentation around service requirements?
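As a starting point for the header-handling and testing questions above, here is a minimal sketch of a shared check that each service could run in its tests. Everything in it is an assumption for illustration: the header names, `REQUIRED_HEADERS`, and `missing_headers` are hypothetical and are not taken from the incident or from any existing service code. The only fact relied on is that HTTP header names are case-insensitive.

```python
# Hypothetical sketch: the header names and the helper below are illustrative
# assumptions, not the actual requirements of any Duo Workflow service.
REQUIRED_HEADERS = {"authorization", "x-request-id"}


def missing_headers(headers: dict[str, str]) -> list[str]:
    """Return the required header names absent from a request.

    Names are compared case-insensitively, because HTTP header names
    are case-insensitive.
    """
    present = {name.lower() for name in headers}
    return sorted(REQUIRED_HEADERS - present)
```

A test in each calling service could assert that `missing_headers(...)` returns an empty list for the requests it sends, so a change that drops or renames a header fails before deployment rather than in production.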
Next Steps
The team needs to:
- Discuss and determine the appropriate long-term solution
- Decide on preventive measures to avoid similar incidents
- Plan implementation priorities based on team consensus
Edited by Sebastian Rehm