Add Contributing Factors to incident.io

Overview

This issue tracks the implementation of a standardized Contributing Factors custom field in incident.io to support thorough root cause analysis and improve our incident management maturity. A structured taxonomy of contributing factors will make it easier to identify patterns, prevent future incidents, and keep incident reviews consistent.

DRI

TBD

Participants

TBD

Why Contributing Factors Standardization?

  • Discovery & Analysis: A structured taxonomy makes it easier to identify recurring patterns and systemic issues across incidents
  • Consistency: Standardized options ensure all incident responders categorize issues using the same language and framework
  • Reporting: Enables automated reporting on common failure modes to inform infrastructure investments and process improvements
  • Blameless Culture: Comprehensive categories covering technical, human, and process factors reinforce our commitment to blameless incident reviews
  • Integration Readiness: Well-structured data enables future automation and integration with other tools in our incident management ecosystem

List of Contributing Factors

We are implementing Contributing Factors with the following categories:

Technical Issues

  • Bug in application code
  • Configuration error
  • Feature Flag enabled
  • Feature Flag missing
  • Infrastructure or hardware failure
  • Database/Data store failure
  • Data/Schema Changes
  • Network or connectivity issue
  • Capacity overload / Performance issue
  • Release or deployment problem
  • Architectural/design limitation

Human Factors

  • Information or feedback gaps
  • Miscommunication or coordination gap
  • System design knowledge gaps
  • Necessary workarounds

Process / Policy Shortcomings

  • Inadequate testing or QA
  • Change management gap
  • Lack of documentation/runbooks
  • Ownership or escalation gap

External Factors

  • Third-party service/API outage
  • Cloud/Infrastructure provider issue
  • Upstream dependency change
  • Security attack or breach

Monitoring / Alerting Gaps

  • Delayed detection (missing alert)
  • Incomplete observability
  • Automation/Tooling issue
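
The taxonomy above could be kept as a single reviewable data structure before it is entered into incident.io. The following is a minimal sketch in Python. The names `CONTRIBUTING_FACTORS` and `flatten_options`, the "Category: factor" label format, and the `sort_key` spacing are illustrative assumptions and are not part of incident.io's API.

```python
# Hypothetical in-repo representation of the taxonomy listed above.
CONTRIBUTING_FACTORS = {
    "Technical Issues": [
        "Bug in application code",
        "Configuration error",
        "Feature Flag enabled",
        "Feature Flag missing",
        "Infrastructure or hardware failure",
        "Database/Data store failure",
        "Data/Schema Changes",
        "Network or connectivity issue",
        "Capacity overload / Performance issue",
        "Release or deployment problem",
        "Architectural/design limitation",
    ],
    "Human Factors": [
        "Information or feedback gaps",
        "Miscommunication or coordination gap",
        "System design knowledge gaps",
        "Necessary workarounds",
    ],
    "Process / Policy Shortcomings": [
        "Inadequate testing or QA",
        "Change management gap",
        "Lack of documentation/runbooks",
        "Ownership or escalation gap",
    ],
    "External Factors": [
        "Third-party service/API outage",
        "Cloud/Infrastructure provider issue",
        "Upstream dependency change",
        "Security attack or breach",
    ],
    "Monitoring / Alerting Gaps": [
        "Delayed detection (missing alert)",
        "Incomplete observability",
        "Automation/Tooling issue",
    ],
}


def flatten_options(taxonomy):
    """Return one option per factor, labelled with its category, in listed order."""
    options = []
    for category, factors in taxonomy.items():
        for factor in factors:
            options.append(
                {
                    "category": category,
                    # Assumed label format so grouping survives in a flat option list.
                    "value": f"{category}: {factor}",
                    # Gaps of 10 leave room to insert new options later.
                    "sort_key": len(options) * 10,
                }
            )
    return options


if __name__ == "__main__":
    for option in flatten_options(CONTRIBUTING_FACTORS):
        print(option["sort_key"], option["value"])
```

Prefixing each option with its category is one way to keep the grouping visible if the field only supports a flat list of options. Whether that is needed depends on how grouping is configured in incident.io.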

Benefits

  • Root Cause Identification: Comprehensive categorization reduces the chance that a contributing factor is overlooked during incident analysis
  • Trend Analysis: Enables quarterly/annual reporting on most common incident causes to drive preventive measures
  • Resource Allocation: Data-driven insights on failure patterns help prioritize engineering efforts and infrastructure investments
  • Knowledge Sharing: Consistent tagging improves searchability and learning from past incidents
  • Compliance & Audit: Structured data supports regulatory reporting and demonstrates mature incident management practices

Implementation Steps

  1. Custom Field Configuration

    • Create the custom field in incident.io
    • Ensure that more than one contributing factor can be selected
    • Configure field options with proper grouping/categorization
    • Set the field as required for incident closure
    • Test field functionality in a staging environment
  2. Documentation & Training

    • Update training materials for incident responders
    • Update incident response procedures to include factor selection
  3. Rollout & Adoption

    • Gather feedback and refine options if needed
    • Monitor adoption rates and field usage
  4. Reporting & Analytics

    • Integrate data with existing SRE metrics and reports
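
For the reporting step, one possible shape of the analysis is a simple count of how often each factor and category appears across incidents. The sketch below assumes incident data has already been exported into plain Python dictionaries. The record layout (`id`, `contributing_factors`), the `FACTOR_CATEGORY` lookup, and the `summarize` function are hypothetical and do not reflect incident.io's export format.

```python
from collections import Counter

# Hypothetical lookup from factor to category (a subset of the taxonomy, for brevity).
FACTOR_CATEGORY = {
    "Configuration error": "Technical Issues",
    "Release or deployment problem": "Technical Issues",
    "Inadequate testing or QA": "Process / Policy Shortcomings",
    "Incomplete observability": "Monitoring / Alerting Gaps",
}


def summarize(incidents):
    """Count how many times each factor, and each category, appears across incidents."""
    factor_counts = Counter()
    category_counts = Counter()
    for incident in incidents:
        factors = incident.get("contributing_factors", [])
        factor_counts.update(factors)
        # Count each category at most once per incident, even if several
        # factors from that category were selected.
        categories = {FACTOR_CATEGORY.get(f, "Uncategorized") for f in factors}
        category_counts.update(categories)
    return factor_counts, category_counts


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"id": "INC-1", "contributing_factors": ["Configuration error", "Inadequate testing or QA"]},
        {"id": "INC-2", "contributing_factors": ["Configuration error", "Release or deployment problem"]},
        {"id": "INC-3", "contributing_factors": ["Incomplete observability"]},
    ]
    factors, categories = summarize(sample)
    for name, count in factors.most_common():
        print(f"{count:>3}  {name}")
    for name, count in categories.most_common():
        print(f"{count:>3}  {name}")
```

Counting categories once per incident keeps an incident with several technical factors from inflating the "Technical Issues" total. Counting per selection instead is an equally valid choice, depending on what the report is meant to show.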

Success Metrics

  • Within 30 days of rollout, 100% of incidents have at least one contributing factor selected
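
The success metric could be checked mechanically once incident data is available. The following is a minimal sketch, assuming each incident is a dictionary with a hypothetical `id` and a `contributing_factors` list. The function name `factor_coverage` and the record layout are illustrative and are not incident.io's actual export format.

```python
def factor_coverage(incidents):
    """Return (coverage_percent, ids_missing_factors) for a list of incident records."""
    # An incident counts as missing if the field is absent or empty.
    missing = [i["id"] for i in incidents if not i.get("contributing_factors")]
    if not incidents:
        # No incidents in the window: treat the target as trivially met.
        return 100.0, missing
    covered = len(incidents) - len(missing)
    return 100.0 * covered / len(incidents), missing


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"id": "INC-1", "contributing_factors": ["Configuration error"]},
        {"id": "INC-2", "contributing_factors": []},
    ]
    percent, missing_ids = factor_coverage(sample)
    print(f"coverage: {percent:.1f}%  missing: {missing_ids}")
```

Returning the IDs of incidents without a factor, rather than only the percentage, gives reviewers a concrete follow-up list.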
Edited by Alex Hanselka