Infrastructure’s highest priority is to ensure availability and performance of GitLab.com. In order to achieve this goal, the chaos that is introduced whenever changes take place in the production environment must be managed. These changes include both those that are purposely made to the environment through service and maintenance work, including deployments (which are handled through Change Management), and those that happen as a result of failures in infrastructure components (which are handled through Incident Management).
Changes are defined as significant modifications to the operational environment and can be classified into three types:
- Service changes are regular, routine changes. When service changes are executed through well-defined, tested procedures or are performed as automated tasks that do not involve human interaction (agents, scheduled tasks), they do not require review or approval. Otherwise, they must go through both review and approval (essentially in the first successful iteration). The driving goal is to turn all changes into service changes.
- Maintenance changes are possibly complex changes performed inside scheduled maintenance windows that require careful planning, stakeholder review, and approval for execution.
- Vendor changes are changes performed by external service providers that may affect the environment in terms of availability and performance and are generally outside our control.
Deployments are a special change metatype depending on their scope and the effect they may have on the environment. When deployments are significant and affect a large portion of the environment, they are treated as maintenance changes. When they are small and their effect is limited to specific portions of the environment, they can be treated as service changes. As we make progress towards continuous CI/CD, we aim to turn all deployments into simple service changes.
Any change can be performed on an emergency basis in response to an incident (in which case, Incident Management takes over oversight of said change).
Change Management's primary goal is to safeguard the integrity of the environment through increased predictability. Managing chaos is both feasible and attainable. As with any other endeavor we undertake, it entails structure and discipline in our approach to managing the environment.
- Change Severities
- Change Plans
- Change Reviews
Change severities encapsulate risk associated with a change in the environment. Said risk entails the potential effects if the change fails and becomes an incident.
| Severity | Definition and Examples |
|----------|-------------------------|
| S1 | Change can have a significant impact on all of GitLab.com. Examples of S1 changes include changes that require site-wide downtime, changes that affect critical infrastructure whose availability affects all of GitLab.com, or changes that affect a large portion of the environment. |
| S2 | Change can have a significant impact on large portions of GitLab.com. Examples of S2 changes include ... |
| S3 | Change can have a limited impact on important portions of GitLab.com. Examples of S3 changes include ... |
| S4 | Change can have a minor impact on GitLab.com... Examples of S4 changes include ... |
- So as to minimize the number of variables at play, no changes are executed during an active incident, even if they are scheduled and approved. They must be rescheduled.
- S1 and S2 changes are always serialized and executed exclusively (i.e., never concurrently).
- S3 and S4 changes are allowed to take place concurrently as long as there is awareness of said concurrency.
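The three scheduling rules above can be sketched as a small predicate. This is a minimal illustration, not an existing tool: the `Change` class and `can_start` function are hypothetical names, and the sketch assumes the strict reading of "executed exclusively", meaning an S1 or S2 change never overlaps with any other change. The "awareness of concurrency" requirement for S3 and S4 changes is a human process and is not modeled here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Change:
    """Hypothetical record for a scheduled change."""
    name: str
    severity: int  # 1..4, corresponding to S1..S4


def can_start(candidate: Change, in_progress: Sequence[Change], active_incident: bool) -> bool:
    """Return True if `candidate` may start under the scheduling rules."""
    # Rule 1: no change runs during an active incident; it must be rescheduled.
    if active_incident:
        return False
    # Rule 2: S1 and S2 changes are serialized and exclusive (assumed to mean
    # exclusive of every other change, whatever its severity).
    if candidate.severity <= 2:
        return len(in_progress) == 0
    if any(c.severity <= 2 for c in in_progress):
        return False
    # Rule 3: S3 and S4 changes may run concurrently with each other.
    return True
```

For example, an S3 change may start while an S4 change is running, but an S1 change has to wait until nothing else is in progress.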
| Role | Definition and Examples |
|------|-------------------------|
| EMOC | The Event Manager is the tactical leader of the change team. For service changes, the EMOC is the person executing the change. For maintenance changes, the EMOC is the person in the IMOC rotation. S1 and S2 changes require an EMOC. |
| CMOC | The Communications Manager is the communications leader of the change team. The focus of the Change Team is executing the change as safely and quickly as possible. For S1 and S2 maintenance changes, a CMOC communicates with the appropriate stakeholders. Otherwise, the EMOC can handle communication. |
| Change Team | The Change Team is primarily composed of technical staff performing the change. |
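The role assignments described in the table can be expressed as two lookups. The sketch below is illustrative only: `ChangeType`, `event_manager`, and `communications_owner` are hypothetical names, and vendor changes are left out because the text does not assign roles to them.

```python
from enum import Enum


class ChangeType(Enum):
    SERVICE = "service"
    MAINTENANCE = "maintenance"


def event_manager(change_type: ChangeType, executor: str, imoc: str) -> str:
    """Pick the EMOC: the executor for service changes,
    the person in the IMOC rotation for maintenance changes."""
    return executor if change_type is ChangeType.SERVICE else imoc


def communications_owner(severity: int, change_type: ChangeType) -> str:
    """Return which role handles stakeholder communication."""
    # A CMOC is used only for S1 and S2 maintenance changes.
    if severity in (1, 2) and change_type is ChangeType.MAINTENANCE:
        return "CMOC"
    # In every other case the EMOC can handle communication.
    return "EMOC"
```

For an S2 service change, for instance, `communications_owner(2, ChangeType.SERVICE)` returns `"EMOC"`, since only maintenance changes at S1 or S2 bring in a CMOC.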
Our long-term goal is to achieve a running state where changes can be performed asynchronously as service changes. This is aligned with our CI/CD objectives. Various factors currently prevent this, primarily centered around Infrastructure staffing, high-priority projects such as the GCP migration, the implementation of high levels of defensive automation, and the development of procedures to achieve continuous CI/CD.
As a bridge, we are implementing change windows, which afford us speed, order, and predictability for all stakeholders. Change windows are daily slots of time that have Infrastructure resources assigned to them. We aim to provide as many of these change windows as available Infrastructure resources allow (which implies that, as new people join the team, the available slots will be expanded).
- The on-call person is not assigned to any slot while on call.
- Infrastructure team members sign up for slots about a week in advance.
- Anyone wishing to make changes signs up for a slot.
We will develop a regular schedule as soon as Change Management is implemented.
There is a preference to slot riskier changes towards the beginning of the week, during off-peak hours, so that more people are available to handle incidents. Significant changes should be avoided on Fridays.
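As a rough illustration of this slotting preference, the hypothetical helper below rates a proposed day for a change. Two assumptions are not stated in the text: "significant" or "riskier" is taken to mean S1 or S2, and "the beginning of the week" is taken to mean Monday or Tuesday. Peak hours are not modeled.

```python
from datetime import date


def slot_advice(severity: int, day: date) -> str:
    """Return 'avoid', 'preferred', or 'ok' for scheduling a change on `day`."""
    significant = severity <= 2  # assumption: S1 and S2 count as significant
    weekday = day.weekday()      # Monday is 0, so Friday is 4
    if significant and weekday == 4:
        return "avoid"           # significant changes should be avoided on Fridays
    if significant and weekday <= 1:
        return "preferred"       # assumption: Monday and Tuesday start the week
    return "ok"
```

A check like this could run when someone signs up for a slot, flagging the choice without blocking it.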
Information is a key asset during any change. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. The awareness that a change is happening is critical in helping stakeholders plan for said changes.
This flow is determined by:
- the type of information,
- its intended audience,
- and timing sensitivity.
For instance, a large end-user may choose to avoid doing a software release during a maintenance window to avoid any chance that issues may affect their release.
Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.
To that end, we should:
- Have a dedicated change bridge (Zoom call) for S1 and S2 changes.
- Have a dedicated #changes channel. #production contains sizeable amounts of information, and it takes effort to filter out non-relevant items. This is particularly important for the change team, which must be focused on technical information to perform the change. While #changes is an open channel, we will encourage people to use other channels to communicate with the IMOC.
- Have periodic updates intended for the various audiences (the CMOC handles this):
  - End-users (Twitter)
  - Support staff
  - Employees at large
- A dedicated repo for issues related to Production separate from the queue that holds Production Engineering’s work: namely, issues for incidents and changes. This is useful because there may be other teams, over time, that need to do work in the production environment.
Change plans provide detailed descriptions of proposed changes. Writing them has a number of benefits:
- Develops a library of change procedures
- Provides detailed designs that are prime targets for automation
- Allows team review of said plans, and provides checkpoints for team members to raise issues: the more eyes on a change, the more we leverage each team member's individual experiences as a collective body of "lessons learned"
Ideally, the planner and the executor should be different individuals. The on-call resource has veto power over any and all changes.
Change plans are useful in that they provide a path to execute a change in production. Their real value, however, lies in the ability to have the team evaluate the change and bring to bear their experience. Change reviews afford us the opportunity to discuss changes as a team, and provide the setting where issues can be raised and resolved. While most of Infrastructure should review changes, we will request a minimum quorum of three reviewers to approve a change.
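The approval rules described here (a quorum of three reviewers, an on-call veto, and a preference for separate planner and executor) can be sketched as a simple record. `ChangeReview` and its fields are hypothetical names used for illustration; the sketch assumes approvals are counted per distinct reviewer and does not decide whether the planner may count as one.

```python
from dataclasses import dataclass, field
from typing import List, Set

MIN_APPROVALS = 3  # the minimum quorum named in the text


@dataclass
class ChangeReview:
    """Hypothetical review state for one change plan."""
    planner: str
    executor: str
    approvals: Set[str] = field(default_factory=set)  # distinct reviewer names
    on_call_veto: bool = False

    def is_approved(self) -> bool:
        # The on-call person can veto any change, regardless of approvals.
        if self.on_call_veto:
            return False
        return len(self.approvals) >= MIN_APPROVALS

    def warnings(self) -> List[str]:
        # Separate planner and executor is an ideal, not a hard rule,
        # so it is reported as a warning instead of blocking approval.
        if self.planner == self.executor:
            return ["planner and executor are the same person"]
        return []
```

Treating the planner/executor overlap as a warning mirrors the "ideally" in the text, while the veto and quorum are enforced as hard conditions.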
Change Management Runbooks
With severities, roles and communication channels defined, we can start to develop runbooks to help us manage changes and expectations around them, which is the next step in developing solid Change Management.