During the livestream on 2018-09-20 we decided to suspend a network configuration change and I think it might be worth discussing a way to formalize this process.
What we need:
1. something to call it
2. something that is accessible for the team and company to view
3. something we can query from CICD to stop our automated changes from rolling out
For (1) we could call it Production SOC (suspension of changes) though it is not a standard term. We should avoid other popular terms like "blackout windows" or "blackdays". I did some googling around and couldn't find anything that is standard and would love to hear suggestions.
For (2) I would like to signal boost this by putting it on the shared meeting calendar; if that is a bit too much, we could create a new calendar for it.
For (3) would a google calendar work or should we consider other options?
The last internal company pipeline tool I used had a feature where you could force pipelines to pause during certain time intervals. @ayufan, is there anything like this on the roadmap for CICD?
Name suggestions (feel free to vote or add more if you have them)
I find the idea sensible; it's what we did at my previous company as well, in order to minimize the unexpected impact that infra/service changes might have on the platform in times of:
expected high load (the live stream @jarv mentioned is a good example, another example could be the mass migration from GitHub to GitLab)
expected unavailability of personnel (Summit days, Xmas to New Year)
complex efforts to mitigate a specific ongoing issue affecting GitLab.com (GCP migration, or similar)
"There are only two hard things in Computer Science: cache invalidation and naming things."
Let's make sure we pick a term that is both inclusive and rolls off the tongue easily.
You can pause your runners via Web or API. Does it help?
@ayufan that would work if it is possible to do over the api so we could have something driving it from a schedule.
What I have seen with other tools is the ability to set up "office hours" or other windows for pausing CICD pipelines. This is probably something other customers of CICD would like as well, if there isn't a feature request for it already.
It seems reasonable. Do you mind creating a feature proposal?
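The runner-pausing idea discussed above could be driven from a schedule with something like the following. This is only a minimal sketch: it assumes the Runners API endpoint `PUT /runners/:id` with an `active` attribute (the attribute name may differ between GitLab versions, so check the API docs for yours), and the instance URL, runner ID and token are placeholders.

```python
import json
import urllib.request

# Hypothetical instance URL; replace with the real API base.
GITLAB_API = "https://gitlab.example.com/api/v4"


def build_runner_pause_request(runner_id: int, token: str, paused: bool) -> urllib.request.Request:
    """Build (but do not send) a PUT /runners/:id request that toggles `active`."""
    body = json.dumps({"active": not paused}).encode("utf-8")
    return urllib.request.Request(
        url=f"{GITLAB_API}/runners/{runner_id}",
        data=body,
        method="PUT",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )


def set_runner_paused(runner_id: int, token: str, paused: bool) -> int:
    """Send the request and return the HTTP status code."""
    request = build_runner_pause_request(runner_id, token, paused)
    with urllib.request.urlopen(request) as response:
        return response.status
```

A scheduled job could call `set_runner_paused(runner_id, token, True)` at the start of a window and `set_runner_paused(runner_id, token, False)` at the end.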
Rather than a calendar event, perhaps we could use an issue tag? CI/CD doesn't run if there is an open issue tagged with "Change Lockout" or whatever we decide.
This has the advantage of allowing an arbitrary amount of discussion along the lines of "Are we finished with everything we are concerned about and can we close this ticket?"
It would prevent the case where the window ends automatically while we are still working on stuff. These things often take longer than we think they will when we schedule windows.
It would allow multiple overlapping change lockouts at once - all of which need to be independently cleared before CI/CD will run again.
It's fairly easy to query for open tickets with a specific tag
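As a rough illustration of the query mentioned above, the check could look like this. It assumes the project issues endpoint (`GET /projects/:id/issues`) filtered by `labels` and `state`; the instance URL, project ID and the label name "Change Lockout" are placeholders until a name is agreed on.

```python
import urllib.parse

# Hypothetical instance URL and label name.
GITLAB_API = "https://gitlab.example.com/api/v4"
LOCK_LABEL = "Change Lockout"


def open_lockout_issues_url(project_id: int, label: str = LOCK_LABEL) -> str:
    """URL that lists open issues carrying the lockout label."""
    query = urllib.parse.urlencode({"labels": label, "state": "opened"})
    return f"{GITLAB_API}/projects/{project_id}/issues?{query}"


def deployments_allowed(open_issues: list) -> bool:
    """CI/CD may proceed only when no lockout issue is open.

    Overlapping lockouts are handled naturally: every one of them
    has to be closed before this returns True.
    """
    return len(open_issues) == 0
```

A pipeline job would fetch the URL, parse the JSON list, and stop if `deployments_allowed` returns `False`.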
Oh that's a really nice idea, we could even use "due date" to set the ending and have something that automatically closes the issue when the date passes.
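A small sketch of the due-date idea, assuming each issue is a dict shaped like the issues API response, with an `iid` and a `due_date` that is either `None` or a `YYYY-MM-DD` string. The function only selects candidates; actually closing them (for example by updating the issue with a close state event) is left out and would need to be checked against the API docs.

```python
from datetime import date


def expired_lockouts(issues: list, today: date) -> list:
    """Return the iids of lockout issues whose due date has already passed."""
    expired = []
    for issue in issues:
        due = issue.get("due_date")
        # Issues without a due date stay open until someone closes them by hand.
        if due and date.fromisoformat(due) < today:
            expired.append(issue["iid"])
    return expired
```

Issues with no due date are deliberately ignored, which keeps the "still working on stuff" case from being closed automatically.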
Added another suggestion for naming: we used the term "Freeze" at Nordstrom (Production Freeze, Change Freeze, Pre-[event] Freeze, etc.). I am also in favor of @dsylva's idea of using issue tags.
This is a nice initiative and I also would like to join the discussion. :)
Based on what I have seen implemented in the past at various places, the following is what I can summarize:
Code Freeze
This somewhat overlaps with what Craig has mentioned above. A code/config change could be made, pushed, built and even integ/beta/alpha tested, but the change is not allowed to be promoted to any prod environment (onebox/canary/spanning prod environments). This could be controlled by calendar dates, down to hour/minute granularity. Basically, if a change is sitting and baking right before a prod environment stage, we would assume that it has passed all of the tests and is no different from any other change made outside code-freeze times.
Grey Days / Black Days
Simply having a "code freeze" and halting all production changes has its disadvantages, because teams may still need to ship mission-critical changes even during "peak" time frames (e.g., serious bug fixes, scaling and other changes that the business decides to introduce). To handle this scenario, a concept of "grey days" and "black days" could help. The former allows production changes but only with manager/director-level approval based on a change-management call, whereas the latter allows no production change at all (unless, of course, something really bad happens and a change has to be made, which would need to be approved by a higher management level).
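The grey/black distinction can be written down as a small approval table. This is only a sketch: the level names and their ordering are assumptions made for illustration, not an agreed policy.

```python
from enum import Enum


class DayType(Enum):
    NORMAL = "normal"
    GREY = "grey"
    BLACK = "black"


class Approval(Enum):
    # Hypothetical approval levels, ordered from lowest to highest.
    NONE = 0
    MANAGER = 1
    EXECUTIVE = 2


# Minimum approval needed to change production on each kind of day.
REQUIRED_APPROVAL = {
    DayType.NORMAL: Approval.NONE,
    DayType.GREY: Approval.MANAGER,
    DayType.BLACK: Approval.EXECUTIVE,
}


def change_allowed(day_type: DayType, approval: Approval) -> bool:
    """A change may go out if its approval meets the day's minimum level."""
    return approval.value >= REQUIRED_APPROVAL[day_type].value
```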
Automation
I will later provide my thoughts on what others have suggested here so far, but in this section I will think of "calendar date" as the way to control the code freeze. Ideally, what we could do is:
Have a way to choose a date/time range for code-freeze on each pipeline. Let's call it a "blocker".
A pipeline will not let a change deploy to any prod environment (canary/onebox/prod) as long as there is a "code freeze" blocker active at the current time.
Either the blocker itself or the pipeline will check periodically until the blocker expires and then remove it, which would allow the queued changes to be promoted to prod. As long as the changes have passed the various tests up to this point, we consider them as safe to deploy as any other change outside of the code-freeze window.
When engineers push code and/or raise an MR, we can also send them an automated message informing them of the existing code-freeze window.
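The blocker described in the list above might be modelled roughly as follows. The class and function names are made up for this sketch, and the set of prod environments simply mirrors the ones named in the list.

```python
from dataclasses import dataclass
from datetime import datetime

# Environments a blocker protects (taken from the list above).
PROD_ENVIRONMENTS = {"canary", "onebox", "prod"}


@dataclass(frozen=True)
class Blocker:
    """A code-freeze window attached to a pipeline (hypothetical model)."""
    name: str
    start: datetime
    end: datetime

    def is_active(self, now: datetime) -> bool:
        # The end of the window is exclusive, so promotion resumes exactly at `end`.
        return self.start <= now < self.end


def can_promote(environment: str, blockers: list, now: datetime) -> bool:
    """Non-prod stages always proceed; prod stages wait while any blocker is active."""
    if environment not in PROD_ENVIRONMENTS:
        return True
    return not any(blocker.is_active(now) for blocker in blockers)
```

A pipeline would call `can_promote` before each prod stage and retry later when it returns `False`, which gives the "queued until the blocker expires" behaviour.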
Expansion
We could then leverage such a calendar date/time driven blocker and expand it to other use cases such as regional deployments. An example of it could be:
Don't deploy to NA stack between 8AM-8PM on weekdays
Don't deploy to EU stack between 8AM-8PM on weekdays
...etc
We don't currently have a regional stack like this so it might not be immediately useful but something we can still think about.
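For the regional example, a sketch could look like this. It assumes fixed UTC offsets per region purely to stay self-contained (real time zones with DST would need proper zone data), and the region names and offsets are illustrative.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical fixed offsets per region; DST is ignored in this sketch.
REGION_OFFSETS = {
    "NA": timezone(timedelta(hours=-5)),
    "EU": timezone(timedelta(hours=1)),
}
WINDOW_START = time(8, 0)   # 8AM local
WINDOW_END = time(20, 0)    # 8PM local


def deploy_allowed(region: str, now_utc: datetime) -> bool:
    """Block deploys between 8AM and 8PM local time on weekdays."""
    local = now_utc.astimezone(REGION_OFFSETS[region])
    is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
    in_window = WINDOW_START <= local.time() < WINDOW_END
    return not (is_weekday and in_window)
```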
Feedback on suggestions provided by team members
Issue Tag - I think this is a very interesting idea! I also like the point about preventing accidental/unintentional changes rolling in once a calendar-driven blocker is lifted. But to dig deeper into this idea, how would we create this tag and how would we know which issues to apply the tag to?
John Jarvis changed title from Introduce suspension of changes during important events to Introduce suspension of changes and releases during important events
Do we want code changes to still go to production during "peak" times? The ICL name implies that we are only thinking about the infrastructure side of things and doesn't appear to be including application side of things.
Is our intention only to prevent changes from taking place in production (and canary, now that we have it), or would this apply to all of the pipeline stages (check-in > CI > staging > canary > prod)?
I definitely want to include deployments, perhaps ICL is a bad name then?
If we have automated deployments we should do our best to avoid peak times but I think we must more strongly favor times when engineers are available.
How about PCL instead of ICL - for Production Change Lock - since it's not really infrastructure that we're trying to lock - staging infrastructure changes should be fine.
@Finotto I am fine with anyone picking this up if they like, currently it looks like it was moved back into the backlog so I assume it is lower priority?
Personally I would like to have some proposals for
something that is accessible for the team and company to view
something we can query from CICD to stop our automated changes from rolling out
We should probably have at least some proposals before we start with a handbook MR, unless we want to move the discussion there.
By vote, ICL (Infrastructure Change Lock) won. However, given that we should still be able to make infrastructure changes up to staging during peak times, the name ICL would be a little confusing. Therefore, I believe we are sticking with PCL (Production Change Lock). To clarify, this name is only used within GitLab; in the feature request that @jarv cut to the Product team, it will be called "blackout".
Accessibility and Visibility
Since we don't have an off-the-shelf solution today that can conveniently take a date/time range and halt production change rollouts, apart from the feature request we filed, we need to look at how we can achieve this with our current capabilities. The following requirements are considered when thinking of options:
Users (not only engineering teams but other teams should also have access and visibility)
Date/time range should be easily applied and adjusted, as needed
Documentation explaining what each PCL is about and processes to override, when needed
Automation possibility
Google Calendar
Utilize Google Calendar (We have a Production Calendar) and block peak days.
Pros:
It is visible company wide and everyone (engineering and non-engineering teams) uses it
Date/time range can be easily applied and adjusted
Documentation would be handbook
We would have to look at Google Calendar API for automation
Cons:
Communicating to all the users about the google calendar part and adoption might be tricky
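If the Google Calendar route is taken, the automation side could be sketched as below. It assumes events have already been fetched and are shaped like Calendar API event resources (`summary`, plus `start`/`end` carrying either `dateTime` or an all-day `date`). Marking PCL events with "PCL" in the title is a hypothetical convention, and all-day events are treated as UTC days for simplicity.

```python
from datetime import datetime, timezone

# Hypothetical convention: PCL events carry this marker in their title.
PCL_MARKER = "PCL"


def _parse(point: dict) -> datetime:
    """Parse an event start/end, which holds either `dateTime` or an all-day `date`."""
    if "dateTime" in point:
        return datetime.fromisoformat(point["dateTime"].replace("Z", "+00:00"))
    # All-day events: assume the day boundary is midnight UTC.
    return datetime.fromisoformat(point["date"]).replace(tzinfo=timezone.utc)


def active_pcl_events(events: list, now: datetime) -> list:
    """Return the titles of PCL events that cover `now`."""
    return [
        event["summary"]
        for event in events
        if PCL_MARKER in event.get("summary", "")
        and _parse(event["start"]) <= now < _parse(event["end"])
    ]
```

A deploy job could refuse to continue whenever `active_pcl_events` returns a non-empty list.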
Custom Calendar in an Issue
Open an issue prior to upcoming peak days and create a custom calendar inside of it, and communicate the issue to users.
Pros
It is also viewable and accessible by everyone. Given that we might have different peaks around the year, it might be better to inform users every time.
Date/time range can be set but would just need to be written.
The same issue can be used as the "documentation"
Issue creation could be automated and notifications could also be sent automatically.
Cons
Integrating it into a bigger automation in the future (once the CICD Runner feature is implemented) might be tricky, and this process may then no longer be applicable or used.
git push
When git push is run, we can look at responding not only with the MR link but also with a message about the production change lock, if the change is pushed on a day that falls within a PCL date/time range.
Pros
Still covers all users
Date/time range could be defined on the backend side (this needs to be determined) that handles git push and returns MR link in the response.
A link to a documentation/issue could be sent along with the PCL information.
Need to check where the code is that handles git push and returns MR link.
Cons
Users won't know of the PCL until they push. The information therefore arrives after the fact and might impact teams' planning.
Based on the above considerations, I believe we can start with Google Calendar, iterate on it, and get feedback from teams on its effectiveness. (Action item: work on an MR to do it.)
Automated way to halt change rollout / deployment
Looked into a few different options:
1. Update Runner details and make it active=false
2. Logical test (isItBlackDay type) in the script section of gitlab-ci.yml
3. Control pipeline structure with only/except and variables
4. Turn the job to manual
Will explore the options a little further. However, options 2 through 4 would require changes to the gitlab-ci.yml file, and I doubt that we can ask for every such file to be changed to support the PCL; the adoption rate might not be good. Therefore, I will explore how we can introduce the PCL systematically, with as seamless a change as possible, and drive good adoption across teams.
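To make the "isItBlackDay" option concrete, a helper script called from a job's script section might look like the sketch below. The dates are illustrative placeholders (Thanksgiving and Christmas 2018), and the function names are made up for this example.

```python
from datetime import date

# Hypothetical black days; in practice these would come from a shared source.
BLACK_DAYS = {date(2018, 11, 22), date(2018, 12, 25)}


def is_it_black_day(today: date) -> bool:
    """True when production changes should not roll out today."""
    return today in BLACK_DAYS


def main(today=None) -> int:
    """Return a process exit code: 1 blocks the job, 0 lets it continue."""
    today = today or date.today()
    if is_it_black_day(today):
        print(f"{today} falls within a production change lock; stopping.")
        return 1
    return 0
```

A job would run the script and exit with the value returned by `main()`, so a non-zero code fails the deploy step. This illustrates the drawback noted above: every gitlab-ci.yml that deploys would have to call it.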
Considering the Google Calendar option: We have a GitLab Calendar calendar that we can use to block days. However, the next question naturally becomes what should fall under the Peak Days. The following are what I can think of:
GitLab Events (e.g., Livestream Day, Summit, etc.)
Public Events / Holidays
The former is somewhat easy because, regardless of how globally distributed we are, we should just try to avoid making production changes during those times. The latter is trickier because, as a truly remote and globally distributed company, we have engineers all over the world. The question then becomes which specific holidays we want to mark, so that if a production change occurs (somehow) and we run into an issue (e.g., an incident), we still have good team coverage to investigate, resolve the issue and communicate to customers/stakeholders at the same time. To make a data-driven decision, I pulled the engineering team members' locations; the result is below:
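| Engineer Count | Country |
|---|---|
| 61 | USA |
| 14 | United Kingdom |
| 11 | The Netherlands |
| 10 | Germany |
| 7 | Canada |
| 6 | Australia |
| 6 | |
| 5 | India |
| 4 | Spain |
| 4 | South Africa |
| 4 | France |
| 4 | Brazil |
| 3 | Ukraine |
| 3 | Portugal |
| 3 | Poland |
| 3 | Mexico |
| 2 | Taiwan |
| 2 | Slovenia |
| 2 | Nigeria |
| 2 | Ireland |
| 2 | Czech Republic |
| 2 | Chile |
| 2 | Belgium |
| 2 | Austria |
| 1 | Zimbabwe |
| 1 | Serbia |
| 1 | Russia |
| 1 | Philippines |
| 1 | Peru |
| 1 | Pakistan |
| 1 | Nicaragua |
| 1 | New Zealand |
| 1 | Mongolia |
| 1 | Malta |
| 1 | Malaysia |
| 1 | Luxembourg |
| 1 | Kenya |
| 1 | Japan |
| 1 | Italy |
| 1 | Hungary |
| 1 | Greece |
| 1 | Egypt |
| 1 | Denmark |
| 1 | Bosnia and Herzegovina |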
What this shows is that if we make a production change during a US public holiday (e.g., Thanksgiving or Christmas), when the largest group of team members is on holiday or vacation, our team coverage would be significantly reduced if an incident occurs. Therefore, without implying that any one country is more important than another, and based simply on the number of engineers we have in each country and the resulting coverage, I propose we at least block out public holidays in the USA, UK, Netherlands and Germany. We can start with public holidays in the USA and then iterate over the other 3 countries. Another data point we could look at is our customers' locations. (Will look into this and cross-check against the 4 countries listed above.)
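The coverage argument can be checked with a bit of arithmetic on the counts listed above. In this sketch the four proposed countries are listed explicitly and all remaining rows are folded into a single total of 88, so the overall headcount is 184.

```python
# Engineer counts for the four proposed countries, copied from the table above.
TOP_COUNTRIES = {
    "USA": 61,
    "United Kingdom": 14,
    "The Netherlands": 11,
    "Germany": 10,
}
# Sum of every other row in the table.
REMAINING_ENGINEERS = 88


def coverage_share(countries: list) -> float:
    """Percentage of all engineers located in the given countries."""
    total = sum(TOP_COUNTRIES.values()) + REMAINING_ENGINEERS
    covered = sum(TOP_COUNTRIES[name] for name in countries)
    return 100.0 * covered / total
```

By this count, the USA alone accounts for roughly a third of engineers (61 of 184), and the four countries together for a little over half (96 of 184).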