Clarify Data Team incident resolution and closure criteria to align with outcome ownership (08dbfdce) · Commits · GitLab.com / Content Sites / handbook

content/handbook/enterprise-data/data-governance/incident-management.md

+19 −1

		@@ -103,13 +103,31 @@ Resolution Approach
		* Prioritize implementing a fix, even a temporary workaround, before pursuing long-term solutions.
		* Managers review severity assessments (determined by the detection DRI) and prioritize work accordingly.

		For data incidents, we distinguish clearly between mitigation (a fix or temporary workaround is in place) and resolution (the data is accurate and reliable again).

		An incident should be treated as resolved only when all of the following are true:

		* The underlying cause has been fixed and validated (for example, pipelines are stable, infrastructure is no longer degraded).
		* Impacted data has been fully backfilled, re‑processed, or otherwise corrected so that downstream models, dashboards, and reports are accurate and meet our timeliness SLOs.
		* Affected stakeholders are unblocked and can rely on the data again. Do not close an incident merely because part of the work is complete if the downstream data remains stale or incorrect. The incident stays open until the data is fully refreshed and accurate (owning the outcome).

		This ensures our “time to resolve” matches the period during which users experienced degraded or unreliable data, not just the time until code or infrastructure changes were deployed.
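The resolution criteria and the "time to resolve" definition above can be sketched as a small check. This is an illustrative sketch only: the `Incident` class, its field names, and the example timestamps are hypothetical and do not describe any existing Data Team tooling. It also assumes that the clock starts at detection and stops when the data is confirmed accurate again.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record; all field names are assumptions for illustration.
    detected_at: datetime
    fix_deployed_at: Optional[datetime] = None   # mitigation: code or infra change deployed
    data_restored_at: Optional[datetime] = None  # data refreshed and confirmed accurate
    cause_fixed_and_validated: bool = False
    data_corrected: bool = False
    stakeholders_unblocked: bool = False


def is_resolved(incident: Incident) -> bool:
    """Resolved only when all three criteria hold, not merely when a fix is deployed."""
    return (
        incident.cause_fixed_and_validated
        and incident.data_corrected
        and incident.stakeholders_unblocked
    )


def time_to_resolve(incident: Incident) -> Optional[timedelta]:
    """Period during which users saw degraded data; None while the incident is open."""
    if not is_resolved(incident) or incident.data_restored_at is None:
        return None
    return incident.data_restored_at - incident.detected_at


# The fix is deployed at 10:00, but the backfill is still running.
incident = Incident(
    detected_at=datetime(2024, 1, 1, 8, 0),
    fix_deployed_at=datetime(2024, 1, 1, 10, 0),
    cause_fixed_and_validated=True,
)
print(is_resolved(incident))  # False

# The backfill finishes and stakeholders confirm the data at 17:00.
incident.data_corrected = True
incident.stakeholders_unblocked = True
incident.data_restored_at = datetime(2024, 1, 1, 17, 0)
print(time_to_resolve(incident))  # 9:00:00
```

In this sketch the measured duration is nine hours (detection to restored data), not the two hours it took to deploy the fix.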

		If multiple teams are involved, separate incidents may be opened. For example, when a data pipeline failure (resolved by the Data Platform Team) also impacts downstream dbt models (for which the Analytics Engineering team has to perform backfills), each team may open its own incident. In this case you _can_ create an overarching incident that links to the two child incidents. The child incidents are closed independently, while the overarching incident stays open until both are resolved, so that it reflects the overall time to resolve.
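The relationship between an overarching incident and its child incidents can be sketched as follows. The class names, fields, and timestamps are hypothetical, and the sketch assumes the overall time to resolve runs from the earliest child opening to the latest child resolution; the handbook text does not prescribe a specific calculation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ChildIncident:
    # Hypothetical record for one team's incident.
    team: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None  # None while still open


@dataclass
class OverarchingIncident:
    # Hypothetical parent that only links to child incidents.
    children: List[ChildIncident] = field(default_factory=list)

    def is_open(self) -> bool:
        """The parent stays open while any child incident is unresolved."""
        return any(child.resolved_at is None for child in self.children)

    def time_to_resolve(self) -> Optional[timedelta]:
        """Earliest child opening to latest child resolution (an assumption)."""
        if not self.children or self.is_open():
            return None
        start = min(child.opened_at for child in self.children)
        end = max(child.resolved_at for child in self.children)
        return end - start


platform = ChildIncident(
    team="Data Platform",
    opened_at=datetime(2024, 1, 1, 8, 0),
    resolved_at=datetime(2024, 1, 1, 10, 0),  # pipeline outage fixed
)
analytics = ChildIncident(
    team="Analytics Engineering",
    opened_at=datetime(2024, 1, 1, 9, 0),  # backfills still running
)
parent = OverarchingIncident(children=[platform, analytics])
print(parent.is_open())  # True

analytics.resolved_at = datetime(2024, 1, 1, 15, 0)  # backfills validated
print(parent.is_open())          # False
print(parent.time_to_resolve())  # 7:00:00
```

Each child keeps its own resolution time for its team, while the parent reports the full period during which data consumers were affected.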

		Communication

		* DRIs must provide regular status updates in the incident channel, including expected resolution timelines once available.

		Closure
		Close incidents promptly after verifying that:

		* the fix is working as intended, and
		* the affected data has been fully refreshed and validated against our SLOs, and
		* downstream consumers (for example, key dashboards or recurring reports) are no longer relying on temporary workarounds.

		If part of the work is completed but data backfills or re‑computations are still in progress, keep the incident open and document this explicitly in the timeline and status updates. The incident is closed only when both the technical fix and the data state are back to normal.

		* Close incidents promptly after verifying the fix is working as intended.
		* Conduct a retrospective for each incident to identify preventive measures and avoid future occurrences.

		Incident SLOs