@@ -289,7 +289,7 @@ All exceptions must be approved by a VP of Engineering. Reasons for an FCL excep
The [team](/handbook/company/structure/#organizational-structure) involved is the owner of the service or feature. The team is responsible for both the coordination and completion of the FCL. The manager of the team is responsible for:
- Form the group of engineers working under the FCL. By default, it will be the whole team, but it could be a reduced group if there is not enough work for everyone.
- Form the group of engineers working under the FCL. By default, it will be the owning team, but it could be a reduced group if there is not enough work for everyone.
- Plan and execute the FCL.
- Inform their manager (e.g. Senior Manager / Director) and Product counterpart that the team will focus efforts towards an FCL which may impact capacity planning.
- Provide updates at the [SaaS Health Review](/handbook/engineering/infrastructure-platforms/saas-health-review).
@@ -312,7 +312,18 @@ The following bulleted list provides a suggested timeline starting from incident
#### Activities
During the FCL, the team(s) exclusive focus is around [reliability work](#scope-of-work-during-fcl), and any feature type of work in-flight has to be paused or re-assigned. Maintainer duties can still be done during this period and should keep other teams moving forward. Explicitly higher priority work such as security and data loss prevention should continue as well. The team(s) must:
During the FCL, all in-flight feature work is paused on the impacted service or feature category. Team members involved in the FCL are exclusively focused on [reliability work](#scope-of-work-during-fcl). Maintainer duties can still be done during this period and should keep other teams moving forward. Explicitly higher priority work such as security and data loss prevention should continue as well.
While an FCL generally will include the team that owns the feature category or service, other team members who contribute to the development of the feature or service may be included. As part of FCL setup, the team should:
1. Identify all services and feature categories the team is responsible for that are under FCL
2. Identify closely coupled or dependent services and teams who may also make changes to those services or feature categories
3. Notify those teams about the FCL and coordinate to ensure changes are appropriate within the [scope](#determining-fcl-scope)
4. Consider applying a [Change Lock](https://gitlab.com/gitlab-com/gl-infra/change-lock/-/blob/master/README.md) to the team's services to prevent unintended deployments.
Teams making changes to services or feature categories owned by teams in an FCL should coordinate with the FCL team and should be included in the FCL issue for visibility. The [Feature Change Locks project](https://gitlab.com/gitlab-com/feature-change-locks/-/work_items) tracks all open FCLs.
The team(s) must:
- Create a public Slack channel called `#fcl-incident-[number]`, with members
- The Team's Manager
@@ -338,6 +349,32 @@ During the FCL, the team(s) exclusive focus is around [reliability work](#scope-
- All FCL stakeholders and participants shall participate async. Managers of the groups participating in the FCL, including Sr. EMs and Directors should be invited.
- Outcome includes [handbook](/handbook/) and [GitLab Docs](https://docs.gitlab.com/ee/) updates where applicable.
##### Determining FCL Scope
**What's In Scope**
The **team** that owns the service or feature category identified as causal or contributing to the incident goes into FCL. The team pauses all in-flight feature work on the services and feature categories they are responsible for.
For teams that maintain shared service infrastructure (e.g., the team that maintains Sidekiq infrastructure), if they go into FCL, they may not make changes to that infrastructure. Other teams may continue to use the service normally - for example, adding new Sidekiq jobs or running database migrations. As part of FCL setup, the team should notify other teams who may make changes to the infrastructure about the FCL.
**Side-Effects vs Related Causes**
When determining FCL scope, it's important to distinguish between side-effects and related causes:
**Side-Effects (Team NOT included in FCL scope):**
These are incidents where a change to one feature or service impacts another feature or service unexpectedly:
- _Example_: A Topology Service configuration change causes a 404 error in the Repository tree page, but the repository code itself did not contribute to causing the 404. The repository team would not be subject to an FCL since their code was not a contributing factor. Both teams should contribute to the post-incident review to better understand and improve the dependency or coupling that caused the incident.
**Related Causes (Team included in FCL scope):**
These are incidents where external changes occur, but the team's code, configuration, or service compounds or contributes to the effect:
- _Example_: A shared service configuration change occurs, and a Sidekiq job from feature category X compounds the effect due to a slow query, contributing to the incident. The team owning feature category X would be subject to an FCL because their code contributed to the incident's impact.
The key distinction is whether the team's code, service, or configuration actively contributed to the incident beyond being a passive recipient of external changes.
##### Scope of work during FCL
After the Incident Review is completed, the focus of the team(s) is on preventing similar problems from recurring and improving detection. This should include, but is not limited to: