Investigate https git errors during K8s deploys
During the last canary deploy we noticed an increase in errors that fired the alert `component_error_ratio_burn_rate_slo_out_of_bounds_upper`. This specific alert was for the main stage, which is a bit strange because we are not taking any git https traffic on the main stage; we believe it was actually firing for canary.
Looking at the canary workhorse dashboard for git, it does look like we see small error spikes:
Given that git https traffic is new in K8s, we would like to do an analysis of the following:

- Why was this alert for the main stage? 🤔
- Are we regularly seeing git https errors during deploys? Look at the past few deploys to see if there is a correlation.
- Does the GKE TCP LB readiness check fail during a deploy? We can't really rely on the health check alone because it doesn't run frequently enough; we should try scraping the readiness endpoint (the same one that HAProxy is using) in a tight loop during a K8s deploy to make sure we aren't dropping out (see the probe sketch below).
- Are we waiting long enough to drain connections when recycling pods? This is probably the most important question to answer, because the way we drain VMs is very different from how we drain K8s pods: on VMs, connections can stay active right up until we HUP Puma (see the kubectl check below).
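For the readiness question, here is a minimal sketch of the tight-loop probe, assuming the check is a plain HTTP GET; the target host and endpoint path below are placeholders and should be pointed at whatever HAProxy's check actually hits on the GKE canary backend:

```bash
#!/usr/bin/env bash
# Poll the readiness endpoint in a tight loop while a K8s deploy is running,
# logging any non-200 responses so we can correlate them with pod recycling.
# TARGET and READINESS_PATH are assumptions -- substitute the real values.
TARGET="${1:-https://gke-cny-git.example.internal}"   # hypothetical canary backend address
READINESS_PATH="${2:-/-/readiness}"                   # assumed path; use HAProxy's actual check URL
INTERVAL="0.5"                                        # seconds between probes

while true; do
  ts="$(date --iso-8601=seconds)"
  # -w prints the HTTP status and total request time; the response body is discarded.
  result="$(curl -sk -o /dev/null --max-time 2 \
            -w '%{http_code} %{time_total}s' "${TARGET}${READINESS_PATH}")"
  status="${result%% *}"
  if [[ "${status}" != "200" ]]; then
    echo "${ts} NOT READY: ${result}"
  else
    echo "${ts} ok: ${result}"
  fi
  sleep "${INTERVAL}"
done
```

Running this against the GKE TCP LB (rather than hitting pods directly) during a deploy should show whether the LB itself drops out while pods are recycled.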
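For the draining question, one quick check is whether the canary git pods have a preStop hook and a long enough terminationGracePeriodSeconds; the namespace and deployment name below are guesses and need to be replaced with the real ones:

```bash
# Termination grace period for the canary git pods
# (namespace/deployment names are placeholders).
kubectl -n gitlab get deployment gitlab-cny-webservice-git \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'

# Per-container preStop hooks -- these determine how much time in-flight
# git-over-https connections get before SIGTERM is sent.
kubectl -n gitlab get deployment gitlab-cny-webservice-git \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}: {.lifecycle.preStop}{"\n"}{end}'
```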
Given these concerns, we discussed that we may want to either reduce the traffic to git https canary, or put the GKE canary backend into maintenance.
- Reducing traffic can be done by setting the weight here (illustrative HAProxy commands below).
- Alternatively, we can set just the gke canary backend into maintenance (after merging https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4251) in chef-repo:
  `./bin/set-server-state gprd drain "https_git/gke-cny-git"`
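For reference, this is roughly what checking or changing the canary server's weight/state looks like directly on an HAProxy node via the admin socket. Illustrative only: the persistent weight lives in chef, runtime changes will be converged away, and the socket path is an assumption:

```bash
# Show current state and weight of every server in the https_git backend
echo "show servers state https_git" | sudo socat stdio /run/haproxy/admin.sock

# Runtime-only weight change for the GKE canary server (chef will revert it)
echo "set weight https_git/gke-cny-git 0" | sudo socat stdio /run/haproxy/admin.sock

# Runtime-only drain of just that server
echo "set server https_git/gke-cny-git state drain" | sudo socat stdio /run/haproxy/admin.sock
```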
Maybe the first option (reducing the weight) is the better one for now, since it does essentially the same thing as putting the backend into maintenance.
cc @skarbek @cmcfarland @T4cC0re @amyphillips @AnthonySandoval