Continuous deployment should be easy and boring. One thing that makes it more comfortable is to have monitoring to measure service-level objectives and the impact on those SLOs of an individual deploy. When doing an automatic incremental deploy (https://gitlab.com/gitlab-org/gitlab-ee/issues/1660) or canary deploy (https://gitlab.com/gitlab-org/gitlab-ee/issues/1659), we should be able to use these measurements to automatically halt a deploy and even revert/rollback.
Use Cases
Scenario: During an incremental rollout, GitLab notices that the error rate exceeds the SLO of 0.1%, aborts the rollout at 1%, and reverts to the last-known-good version.
Using the existing Prometheus API
we will query the current error rate
If the error rate exceeds the defined threshold we will stop the deployment (we need to check if we can leverage the existing trigger, similar to the incident response issue creation)
If the rollout was stopped due to exceeding the threshold, there should be a notification on the deploy board: "Rollout stopped due to high error rate"
We will present the error rate only on the environment page (deploy board), and we will keep this very minimal - UX TBD
As for notifications: for the MVC we will use the issue creation/email notifications that exist for incident response. TODOs/assignments will not be part of the MVC
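To make the query-and-compare step above concrete, here is a minimal sketch that asks Prometheus for an error rate through its HTTP API (`/api/v1/query`) and compares it with the 0.1% SLO from the scenario. The PromQL expression, the metric name `http_requests_total`, and the function names are assumptions for illustration only. They are not how GitLab does it today.

```python
import json
import urllib.parse
import urllib.request

SLO_ERROR_RATE = 0.001  # 0.1%, taken from the scenario in this issue

# Hypothetical PromQL; real metric and label names depend on the user's setup.
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def fetch_instant_query(base_url, query, timeout=10):
    """Call the Prometheus instant-query endpoint and return the decoded JSON."""
    params = urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=timeout) as resp:
        return json.load(resp)


def error_rate_from_response(payload):
    """Extract the first sample of an instant-vector response; None if there is no data."""
    if payload.get("status") != "success":
        return None
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None
    # Each sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def should_stop_rollout(error_rate, threshold=SLO_ERROR_RATE):
    """Stop only when there is data and it exceeds the threshold."""
    return error_rate is not None and error_rate > threshold
```

Treating "no data" as "do not stop" is a design choice in this sketch. A stricter policy could halt the rollout when the metric is missing.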
Updating items more than 3 months out to align to the quarter instead of a specific release. We are still targeting within the same time period, but it's unnecessary to apply to a specific release this far out.
Would like the capability to incrementally roll out, e.g. to 10% or 50%, and have the ability to roll back that 10% or 50% released. (Not sure if it's all or none today)
Have the capability to either auto rollback when a certain metric threshold has been exceeded within the metrics gathered, or have the ability in the gitlab-ci.yml to roll out to 10% and then set a wait time (e.g. 30 mins or some other arbitrary time) before continuing the rollout. Since there is no automated rollback, this would allow users the time to monitor the rollout and determine if the change is stable enough to continue the rollout.
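The "roll out to 10%, wait, then continue" request can be sketched as a loop over stages with an observation window after each one. This is only an illustration of the idea. `set_percentage` and `is_healthy` are hypothetical callbacks, and the sketch does not describe how `.gitlab-ci.yml` timed rollouts are implemented.

```python
import time
from typing import Callable, Iterable, Tuple


def staged_rollout(
    stages: Iterable[int],
    set_percentage: Callable[[int], None],
    is_healthy: Callable[[], bool],
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Roll out stage by stage; return (completed, last_percentage_applied)."""
    applied = 0
    for pct in stages:
        set_percentage(pct)
        applied = pct
        sleep(wait_seconds)  # observation window before deciding to continue
        if not is_healthy():
            # Stop here; the caller decides whether to roll back `applied`.
            return False, applied
    return True, applied
```

For example, `staged_rollout([10, 50, 100], deploy_fn, check_fn, wait_seconds=1800)` would wait 30 minutes after each stage, where `deploy_fn` and `check_fn` are whatever the user supplies.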
We do have timed rollouts, so that should solve for 2, other than the ability to auto rollback when a certain metric threshold has been exceeded.
@ogolowinski For taking metrics (e.g. error rate), we can consider using our Prometheus integration. You can ask devopsmonitor folks about the details.
A basic idea to approach this issue is:
Users set up a Prometheus instance per environment. GitLab continuously polls the metrics from each Prometheus instance (background process) and persists the status summary in the database.
If one of the performance factors exceeds a threshold, GitLab hooks the event and takes a corrective action (e.g. rollback).
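A rough sketch of the two steps above: a monitor records the latest metrics per environment, keeps a status summary, and fires a hook when a threshold is exceeded. The class and field names are hypothetical, and the in-memory `status` dict only stands in for the database table mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class EnvironmentMonitor:
    thresholds: Dict[str, float]                   # metric name -> max allowed value
    on_breach: Callable[[str, str, float], None]   # (environment, metric, value)
    status: Dict[str, str] = field(default_factory=dict)  # stand-in for the DB

    def record(self, environment: str, metrics: Dict[str, float]) -> None:
        """Persist a status summary and fire the hook for each breached metric."""
        breached = [
            name for name, value in metrics.items()
            if name in self.thresholds and value > self.thresholds[name]
        ]
        self.status[environment] = "breached" if breached else "ok"
        for name in breached:
            self.on_breach(environment, name, metrics[name])
```

The background process would call `record` after each poll. The hook is where a corrective action such as halting or rolling back would be plugged in.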
The questions are:
Is this feature available on GKE integration only? Or configurable in any format?
Do we want to dogfood this feature in our auto deploy flow? If so, please loop in Delivery folks. Since our auto deploy uses external CD tools, we cannot control rollback on the GitLab side. We'd need a way to communicate with external CD tools.
2. Provide a way to view the metrics - either by hovering over a pod, on a separate page, or via a link to the Prometheus dashboard - UX TBD
Add a setting to allow user to halt deployment on reaching threshold. This will stop deployment to additional pods and notify the user - UX TBD
Add a setting to allow the user to roll back the deployment on reaching the threshold. The user must also configure the last known good version to roll back to. This will roll back the deployment to the last known good deployment and notify the user - UX TBD
Add audit log event for auto rollback, indicating the error and timestamp of the rollback
Allow user to configure custom metrics
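The halt/rollback settings and the audit event in the list above could fit together roughly as follows. This is a sketch under assumptions: the setting names, the audit event fields, and the rule "fall back to halt when rollback is enabled but no target version is configured" are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RolloutSettings:
    halt_on_threshold: bool = False
    rollback_on_threshold: bool = False
    last_known_good_version: Optional[str] = None


def decide_action(settings, error_rate, threshold, now=None):
    """Return (action, audit_event); action is 'continue', 'halt', or 'rollback'."""
    if error_rate <= threshold:
        return "continue", None
    if settings.rollback_on_threshold and settings.last_known_good_version:
        action = "rollback"
    elif settings.halt_on_threshold or settings.rollback_on_threshold:
        # Halt, including when rollback is enabled but has no target version.
        action = "halt"
    else:
        return "continue", None
    now = now or datetime.now(timezone.utc)
    audit_event = {
        "action": action,
        "error": f"error rate {error_rate:.4f} exceeded threshold {threshold:.4f}",
        "timestamp": now.isoformat(),
        "target_version": settings.last_known_good_version if action == "rollback" else None,
    }
    return action, audit_event
```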
I would start with Kubernetes first and then think about other deployments (with your help).
What do you think of this plan?
@ogolowinski, right now the way to set an alert on Prometheus metrics (e.g. error rate) is to go to the Metrics chart and manually set up the alert using the 3 dots button
A user can then go to the Settings -> Operations tab and enable incident creation,
which creates an issue for each triggered alert
@dosuken123 so it looks like indeed most of this already exists up until triggered alert.
If we can read this alert we could then halt/rollback. I would create an issue for halt and an issue for rollback. What else do you think we need here?
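As a sketch of what "reading the alert" could look like: Prometheus Alertmanager delivers webhook payloads with an `alerts` list, where each alert carries a `status` and `labels`. The function below collects the environments that have a firing alert. That the alert carries an `environment` label, and that GitLab would consume the payload this way, are assumptions.

```python
import json


def environments_to_halt(webhook_body, env_label="environment"):
    """Return the sorted environments that have at least one firing alert."""
    payload = json.loads(webhook_body)
    environments = set()
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts should not halt anything
        env = alert.get("labels", {}).get(env_label)
        if env:
            environments.add(env)
    return sorted(environments)
```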
@ogolowinski I feel rollback is a strong action and it's not always the best way to mitigate an incident. For example, if an abuser launches a DoS attack against the web server, a rollback doesn't help. On the other hand, halting the environment seems like a good and practical idea. We often do the same in our auto deploy: if there is an ongoing incident, we don't deploy on time and postpone it. So halting deployment (locking down or unlocking environments) seems like a good first step.
We'd probably add another checkbox under the "Incidents" section: "Prevent further deployments until manually unlocking the environments". Documentation-wise, we'll likely add another section under Taking action on incidents. We don't have the lock-down feature yet, so that would need to be added in a separate issue.
By the way, this issue targets GitLab Premium, however, the prometheus alert actions are under GitLab Ultimate. We would need to consider adjusting the tiers.
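The lock-down idea ("prevent further deployments until manually unlocking the environments") could behave like this minimal sketch. The class and method names are hypothetical, since the feature does not exist yet.

```python
class EnvironmentLockedError(Exception):
    """Raised when a deployment is attempted on a locked environment."""


class Environment:
    def __init__(self, name):
        self.name = name
        self.locked_reason = None    # None means deployments are allowed
        self.current_version = None

    def lock(self, reason):
        """Called when an incident is detected, e.g. by a triggered alert."""
        self.locked_reason = reason

    def unlock(self):
        """Manual action: in this sketch only a user clears the lock."""
        self.locked_reason = None

    def deploy(self, version):
        if self.locked_reason is not None:
            raise EnvironmentLockedError(f"{self.name} is locked: {self.locked_reason}")
        self.current_version = version
        return version
```

Note that a lock only blocks new deployments. It leaves the currently running version untouched, which matches the point that rollback is not always the right mitigation.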
@sarahwaldner Can we add a checkbox under the "Incidents" section named "Prevent further deployments until manually unlocking the environments"?
Also is there an API that catches the event besides the email? I understand that it also automatically creates an issue. @dosuken123 Can we leverage the issue in any way?
Can we add a checkbox under "Incidents" section named "Prevent further deployments until manually unlocking the environments"
That does not seem like the right place to add that setting. The Incidents section within Settings > Operations is for managing Incidents and notifications. I think that a better place for that setting would be in Environments or Repository. If you would like to discuss this further, let's get a call set up with us and our respective UX team members.
@dosuken123 based on @sarahwaldner's answer in #8295 (comment 291754703), can that be used as a trigger, or do we need to create an API that will trigger the pipeline?
@ogolowinski reading through this issue it looks like it's still too vague for potential inclusion in %12.7, in my opinion. It's missing some key decisions around what will be monitored (and maybe how), and what behaviors will be implemented and when based on what's seen, how it could be configured (if it could be configured), etc. This is also a pretty big/interesting topic and might be good to do a validation flow on.
b. If the threshold is exceeded, we need to indicate this on the deploy board on the problematic pod
c. Introduce a new setting "Prevent further deployments until manually unlocking the environments"? Documentation-wise, we'll likely add another section under Taking action on incidents. Following #8295 (comment 263372375) we need to think of the right place to put this
d. Add a setting to allow the user to halt deployment on reaching the threshold. This will stop deployment to additional pods and notify the user - UX TBD
e. Add a setting to allow the user to roll back the deployment on reaching the threshold. The user must also configure the last known good version to roll back to. This will roll back the deployment to the last known good deployment and notify the user - UX TBD
f. Add audit log event for auto rollback, indicating the error and timestamp of the rollback
g. Allow user to configure the threshold or other metrics (based on Prometheus)
Since this needs a lot of preparation, I am moving this to 12.9, but we should add the UX (at least for the first 2 iterations) during the 12.8 period
The basic idea is to leverage what is already supported in devopsmonitor: use the incident response configuration that already exists for exceeded thresholds to trigger the pipeline to stop the rollout.
MVC: Inform the user of manual intervention when rolling out is not possible. Stop rolling out the deployment.
Automatic rollback is not part of the MVC
How do we create a trigger to our pipeline so it stops rolling out?
Shinya: If we create an issue, it should be possible to create events.
Incremental rollout is handled in Kubernetes, a different platform. The pipeline just tells K8s to start a deployment. We probably need a way to tell K8s to stop rolling out - not sure if it's possible, need to dig into the K8s docs and get help from ~"devops::configure".
Orit: We can tell the user to stop it manually before K8s is told to stop it.
It's good to show the notification in the environments dashboard
Rayana: We need to think where else the users need to be notified from -- since it requires a manual action from them. Email notification. Does it create a todo for users?
Orit: Not familiar with it, it seems like it creates issues. Not sure if the issue gets assigned to someone. Once you have an alert, users can go to operation settings. We can add a todo, it should be possible.
Nadia: Can we do some investigation on how it's being used?
Future: Automatically create issues, initiate with incident response (use that to trigger a pipeline)
Orit: First phase we shouldn't need the configure team.
Shinya: Deployment board depends on k8, it's only available because of it.
Rayana: Need to know what those teams have planned for the UX of the feature.
Shinya: Need to investigate the feasibility.
Rayana: UI of the MR also needs to be altered.
Orit: Need to be using monitoring, need to go to incident settings and allow to create issue/todo, and then it goes back to the interface (error message).
Orit: Also worth talking to customers and showing some mockups. Start recruiting people cc @loriewhitaker
As a first step we want to alert the user that there is a problem that requires manual stop of rollout.
We are thinking of adding this to the deploy board.
At the moment, if a threshold is met and IR is enabled in the settings, an issue is created.
Can we add a notification - email and/or TODO? @sarahwaldner
We need to figure out a way to tell K8s to stop deployment - @dosuken123
In the next phase, automatically stopping the deployment will be handled
@rverissimo & @loriewhitaker UX Research with customers to understand what they are doing today, introduce new mockups and understand where we need to add notifications/actions
We need to figure out a way to tell K8s to stop deployment
@dosuken123 @tkuah is there any way to stop deployment? I think the deployment controller takes one instruction at a time so in theory one could deploy and roll back? Not sure if there's a way to "stop".
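For reference, `kubectl rollout` does offer `pause`, `resume`, and `undo` for a Deployment, so a thin wrapper could look like the sketch below. Whether pausing a Deployment maps onto how GitLab's incremental rollout is actually driven is an open question in this thread, and the deployment and namespace names are placeholders.

```python
import subprocess

ALLOWED_ACTIONS = {"pause", "resume", "undo", "status"}


def rollout_command(action, deployment, namespace="default"):
    """Build the kubectl argument list for a rollout action on a Deployment."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported rollout action: {action}")
    return ["kubectl", "rollout", action, f"deployment/{deployment}", "--namespace", namespace]


def run_rollout(action, deployment, namespace="default"):
    """Run the command; requires kubectl and cluster credentials on the host."""
    return subprocess.run(
        rollout_command(action, deployment, namespace),
        check=True, capture_output=True, text=True,
    )
```

`pause` stops the Deployment controller from progressing the rollout without touching pods that were already replaced, while `undo` returns to the previous revision. That corresponds to the halt versus rollback split discussed above.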