@@ -8,8 +8,8 @@ This page documents the process for managing follow-up items from incidents.
- **Follow Up Item**: Action items, bugs, or improvements identified during an incident that need to be addressed after the incident is resolved.
- **Incident Lead**: The person responsible for coordinating the incident response and ensuring proper follow-up after resolution.
- **Corrective Action**: A follow-up item aimed at preventing similar incidents in the future (synonymous with InfraDev for labeling purposes).
- **InfraDev**: [Infrastructure development issues](/handbook/engineering/workflow/#infradev) that affect the production platform.
- **Corrective Action**: Any follow-up item from an incident (bugs, improvements, or process changes) aimed at preventing recurrence or reducing impact. All incident follow-up issues receive this label automatically.
- **InfraDev**: A priority label applied to follow-up issues that require urgently prioritized attention to support SaaS availability and reliability. Not all corrective actions are infradev issues; the label indicates the issue must be tracked and resolved per the [InfraDev process](/handbook/engineering/workflow/#infradev).
## Default Issue Locations
@@ -17,7 +17,7 @@ We use different default projects for incident follow-up issues based on privacy
@@ -767,7 +767,19 @@ The infradev process is established to identify issues requiring priority attent
### Scope
The [infradev issue board](https://gitlab.com/groups/gitlab-org/-/boards/1193197?label_name[]=infradev) is the primary focus of this process.
The [SaaS Health Dashboard](https://saas-health-83948d.gitlab.io/infradev/) is the primary focus of this process for the GitLab application. `infradev` issues also live in the following supported projects, all of which are monitored for SLO compliance by [triage-ops](https://gitlab.com/gitlab-org/quality/triage-ops) automation:
Additional projects with open `infradev` issues are visible in the [SaaS Health Dashboard](https://saas-health-83948d.gitlab.io/).
`infradev` issues currently also exist in projects that are not yet covered by triage-ops SLO monitoring. Any new location that will host `infradev` issues must have triage-ops configured to enforce SLO monitoring before issues are added.
### Relationship to corrective actions
Every follow-up item from an incident receives the `corrective action` label automatically. The `infradev` label is applied *in addition* when the item requires urgently prioritized attention to support SaaS availability and reliability. Not every corrective action is an `infradev` issue, but every `infradev` issue created from an incident is also a corrective action. See [Incident Follow Up Issues](/handbook/engineering/infrastructure-platforms/incident-management/incident-follow-ups/) for the full definitions and the process for triaging items from an incident.
### Roles and Responsibilities
@@ -807,10 +819,20 @@ Issues are nominated to the board through the inclusion of the label `infradev`
During triage, teams may request that the infrastructure platforms team remove the `infradev` label from `~severity::3` and `~severity::4` issues by commenting with their reasoning and pinging the incident lead, if they determine the issue does not meet the bar for infradev prioritization. The `infradev` label **must not** be removed from `~severity::1` or `~severity::2` issues.
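The removal rule above can be written as a small check. This is a minimal sketch, not part of triage-ops: the function name `may_request_infradev_removal` and the constants are hypothetical, and returning `False` for an issue with no severity label is an assumption, since the text only covers severities 1 through 4.

```python
# Hypothetical helper illustrating the label-removal rule; not triage-ops code.
PROTECTED_SEVERITIES = {"severity::1", "severity::2"}  # infradev must not be removed
REMOVABLE_SEVERITIES = {"severity::3", "severity::4"}  # removal may be requested


def may_request_infradev_removal(labels):
    """Return True if a team may ask for the `infradev` label to be removed."""
    label_set = set(labels)
    if "infradev" not in label_set:
        # Nothing to remove.
        return False
    if label_set & PROTECTED_SEVERITIES:
        # severity::1 and severity::2 issues always keep the label.
        return False
    # Assumption: an issue without a severity label is not eligible.
    return bool(label_set & REMOVABLE_SEVERITIES)
```

Even when the check passes, the request still needs a comment with the reasoning and a ping to the incident lead; the sketch only covers the severity condition.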
`~infradev` issues requiring a ~"breaking change" should not exist. If a current `~infradev` issue requires a breaking change, it should be split into two issues. The first issue should be the immediate `~infradev` work that can be done under current SLOs. The second issue should be the ~"breaking change" work that needs to be completed at the next major release in accordance with [deprecation guidance](https://docs.gitlab.com/ee/development/deprecation_guidelines/). Agreement from the development DRI as well as the infrastructure DRI should be documented on the issue.
Infradev issues are also shown in the monthly [Error Budget Report](/handbook/engineering/error-budgets/#budget-reporting).
### Reporting
The state of `infradev` issues is shared at the [Operational Excellence Meeting](https://docs.google.com/document/d/1gSTe2gKNRha-PknIzoBYuwmp73Ipc2blJehje2s7Obc/edit?usp=sharing).
### Historical metrics
Historical trends and distribution of `infradev` issues over time are available in the [Tableau Infradev Dashboard](https://10az.online.tableau.com/#/site/gitlab/views/DraftInfrastructureEmbeddedDashboard/InfradevDashboard?:iid=1).
### A Guide to Creating Effective Infradev Issues
Triage of infradev issues should ideally occur asynchronously. The points below will ensure that your infradev issues gain maximum traction.
@@ -830,7 +852,7 @@ Triage of infradev Issues is desired to occur asynchronously. These points below
1. **Always include a permalink to the source of the screenshot so that others can investigate further**.
1. **Provide a clear, unambiguous, self-contained solution to the problem**. Do not add the `infradev` label to architectural problems, vague solutions, or requests to investigate an unknown root-cause.
1. **Ensure scope is limited**. Each issue should be able to be owned by a single stage group team and should not need to be broken down further. Single task solutions are best.
1. **Ensure a realistic severity is applied**: review the [availability severity label guidelines](/handbook/product-development/how-we-work/issue-triage/#availability) and ensure that applied severity matches. Always ensure all issues have a severity, even if you are unsure.
1. **Ensure a realistic severity is applied**: review the [availability severity label guidelines](/handbook/product-development/how-we-work/issue-triage/#availability) and ensure that applied severity matches. Always ensure all issues have a severity, even if you are unsure. The severity of the infradev issue should reflect the ongoing risk if the issue goes unresolved; it is independent of the severity of the incident that generated it.
1. **If possible, include ownership labels** for more effective triage. The [product categories](/handbook/product/categories/) can help determine the appropriate stage group to assign the issue to.
1. **Cross-reference links to Production Incidents, PagerDuty Alerts, Slack Alerts and Slack Discussions**. This helps ensure that the team performing the triage has all the available data.
1. By adding "Related" links on the infradev issue, the [Infradev Status Report](https://gitlab.com/gitlab-org/infradev-reports/-/issues) will display a count of the number of production incidents related to each infradev issue, for easier and clearer prioritization.
@@ -250,7 +250,11 @@ In order to define an issue as a "transient bug," use the `~"bug::transient"` la
### Infradev Issues
An issue may have an `infradev` label attached to it, which means it subscribes to a dedicated process to related to SaaS availability and reliability, as detailed in the [Infradev Engineering Workflow](/handbook/engineering/workflow/#infradev). These issues follow the established [severity SLOs for bugs](/handbook/product-development/how-we-work/issue-triage/#severity-slos).
`infradev` issues are created when an incident occurs on GitLab.com or Dedicated, and the responders determine that this issue could prevent future incidents of this type from occurring.
An issue may have an `infradev` label attached to it, which means it subscribes to a dedicated process related to SaaS availability and reliability, as detailed in the [Infradev Engineering Workflow](/handbook/engineering/workflow/#infradev). These issues follow the established [severity SLOs for bugs](/handbook/product-development/how-we-work/issue-triage/#severity-slos).
`infradev` issues can exist across several projects, including `gitlab-org/gitlab`, `gitlab-com/gl-infra/production-engineering`, `gitlab-org/gitaly`, and `gitlab-com/gl-infra/platform/runway/team`. All open `infradev` issues are visible in the [SaaS Health Dashboard](https://saas-health-83948d.gitlab.io/) regardless of which project they live in. Historical trends and distribution are available in the [Tableau Infradev Dashboard](https://10az.online.tableau.com/#/site/gitlab/views/DraftInfrastructureEmbeddedDashboard/InfradevDashboard?:iid=1).