Add Contributing Factors to incident.io
Overview
This issue tracks the implementation of standardized Contributing Factors custom fields in incident.io to enable comprehensive root cause analysis and improve our incident management maturity. By establishing a structured taxonomy of contributing factors, we'll enhance our ability to identify patterns, prevent future incidents, and maintain consistency across incident reviews.
DRI
TBD
Participants
TBD
Why Contributing Factors Standardization?
- Discovery & Analysis: A structured taxonomy makes it easier to identify recurring patterns and systemic issues across incidents
- Consistency: Standardized options ensure all incident responders categorize issues using the same language and framework
- Reporting: Enables automated reporting on common failure modes to inform infrastructure investments and process improvements
- Blameless Culture: Comprehensive categories covering technical, human, and process factors reinforce our commitment to blameless incident reviews
- Integration Readiness: Well-structured data enables future automation and integration with other tools in our incident management ecosystem
List of Contributing Factors
We are implementing Contributing Factors with the following categories:
Technical Issues
- Bug in application code
- Configuration error
- Feature Flag enabled
- Feature Flag missing
- Infrastructure or hardware failure
- Database/Data store failure
- Data/Schema Changes
- Network or connectivity issue
- Capacity overload / Performance issue
- Release or deployment problem
- Architectural/design limitation
Human Factors
- Information or feedback gaps
- Miscommunication or coordination gap
- System design knowledge gaps
- Necessary workarounds
Process / Policy Shortcomings
- Inadequate testing or QA
- Change management gap
- Lack of documentation/runbooks
- Ownership or escalation gap
External Factors
- Third-party service/API outage
- Cloud/Infrastructure provider issue
- Upstream dependency change
- Security attack or breach
Monitoring / Alerting Gaps
- Delayed detection (missing alert)
- Incomplete observability
- Automation/Tooling issue
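As a rough illustration of how the grouped options above could be handled in our own tooling, here is a minimal Python sketch. It is not an incident.io API object: the names `CONTRIBUTING_FACTORS`, `category_of`, and `validate_selection` are hypothetical, and the rule that a selection needs at least one known factor is an assumption drawn from the multi-select and required-for-closure items in the implementation steps.

```python
# Hypothetical in-memory copy of the taxonomy in this issue (not an incident.io object).
CONTRIBUTING_FACTORS = {
    "Technical Issues": [
        "Bug in application code",
        "Configuration error",
        "Feature Flag enabled",
        "Feature Flag missing",
        "Infrastructure or hardware failure",
        "Database/Data store failure",
        "Data/Schema Changes",
        "Network or connectivity issue",
        "Capacity overload / Performance issue",
        "Release or deployment problem",
        "Architectural/design limitation",
    ],
    "Human Factors": [
        "Information or feedback gaps",
        "Miscommunication or coordination gap",
        "System design knowledge gaps",
        "Necessary workarounds",
    ],
    "Process / Policy Shortcomings": [
        "Inadequate testing or QA",
        "Change management gap",
        "Lack of documentation/runbooks",
        "Ownership or escalation gap",
    ],
    "External Factors": [
        "Third-party service/API outage",
        "Cloud/Infrastructure provider issue",
        "Upstream dependency change",
        "Security attack or breach",
    ],
    "Monitoring / Alerting Gaps": [
        "Delayed detection (missing alert)",
        "Incomplete observability",
        "Automation/Tooling issue",
    ],
}


def category_of(factor):
    """Return the category a factor belongs to, or raise ValueError if unknown."""
    for category, options in CONTRIBUTING_FACTORS.items():
        if factor in options:
            return category
    raise ValueError(f"Unknown contributing factor: {factor!r}")


def validate_selection(selected):
    """Check a multi-select value: at least one factor, all from the taxonomy.

    Returns a list of (category, factor) pairs in the order selected.
    """
    if not selected:
        raise ValueError("At least one contributing factor must be selected")
    return [(category_of(factor), factor) for factor in selected]
```

Calling `validate_selection(["Incomplete observability", "Inadequate testing or QA"])` would pair each factor with its category, which is the shape a report grouped by category would need.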
Benefits
- Root Cause Identification: Comprehensive categorization reduces the chance that a contributing factor is overlooked during incident analysis
- Trend Analysis: Enables quarterly/annual reporting on most common incident causes to drive preventive measures
- Resource Allocation: Data-driven insights on failure patterns help prioritize engineering efforts and infrastructure investments
- Knowledge Sharing: Consistent tagging improves searchability and learning from past incidents
- Compliance & Audit: Structured data supports regulatory reporting and demonstrates mature incident management practices
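To make the trend analysis benefit concrete, the sketch below counts how often each factor appears across a set of incidents. It assumes incidents have already been exported into plain dictionaries with a `contributing_factors` list; that record shape, the `factor_counts` helper, and the sample incidents are all hypothetical and exist only for illustration.

```python
from collections import Counter


def factor_counts(incidents):
    """Count how many incidents each contributing factor appears on."""
    counts = Counter()
    for incident in incidents:
        # set() so a factor listed twice on one incident is counted once
        counts.update(set(incident.get("contributing_factors", [])))
    return counts


# Made-up sample records, not real incident data.
sample_incidents = [
    {"id": "INC-1", "contributing_factors": ["Configuration error", "Inadequate testing or QA"]},
    {"id": "INC-2", "contributing_factors": ["Configuration error"]},
    {"id": "INC-3", "contributing_factors": []},
]

counts = factor_counts(sample_incidents)
top = counts.most_common(1)  # [("Configuration error", 2)]
```

Restricting the input list to a quarter or a year before calling `factor_counts` would give the quarterly or annual view mentioned above.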
Implementation Steps
- Custom Field Configuration
  - Create custom field in incident.io
  - Ensure that more than one contributing factor can be selected
  - Configure field options with proper grouping/categorization
  - Set field as required for incident closure
  - Test field functionality in staging environment
- Documentation & Training
  - Update training materials for incident responders
  - Update incident response procedures to include factor selection
- Rollout & Adoption
  - Gather feedback and refine options if needed
  - Monitor adoption rates and field usage
- Reporting & Analytics
  - Integrate data with existing SRE metrics and reports
Success Metrics
- 100% of incidents have at least one contributing factor selected within 30 days of rollout
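One possible way to track this metric is sketched below: it computes the percentage of incidents that have at least one contributing factor selected. The `factor_coverage` helper and the dictionary record shape are hypothetical, and selecting which incidents fall inside the measurement window is assumed to happen before the call.

```python
def factor_coverage(incidents):
    """Return the percentage of incidents with at least one contributing factor.

    Returns None when there are no incidents, since the percentage is undefined.
    """
    if not incidents:
        return None
    tagged = sum(1 for incident in incidents if incident.get("contributing_factors"))
    return 100.0 * tagged / len(incidents)
```

A result of `100.0` for the chosen window would mean the target is met; anything lower points at incidents that were closed without a factor selected.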
Edited by Alex Hanselka