# Canary can disappear from some HAProxies

## Problem Statement
We have no visibility into when canary is placed into maintenance mode. Our tooling lacks the auditing and hooks needed to alert us that maintenance was explicitly enabled. As a result, if the maintenance work finishes but the re-enable command is missed, canary can be left in a maintained or drained state on some nodes and stop accepting requests there.
This was recently a problem with ONE of the two HAProxy nodes in staging: https://gitlab.slack.com/archives/CB3LSMEJV/p1670431914647439 (Internal Slack link from 2022-12-07)
## Evidence from the time of investigation

The leading number on each line appears to be the count of HAProxy nodes reporting that server in that state:
```
% ./bin/get-server-state gstg
Fetching server state...
3 fe api/api-gke-us-east1-b: UP
3 fe api/api-gke-us-east1-c: UP
3 fe api/api-gke-us-east1-d: UP
2 fe api/gke-cny-api: MAINT
1 fe api/gke-cny-api: UP
3 fe api_rate_limit/localhost:
3 fe asset_proxy/asset-bucket: UP
2 fe canary_api/gke-cny-api: MAINT
1 fe canary_api/gke-cny-api: UP
2 fe canary_https_git/gke-cny-git-https: MAINT
1 fe canary_https_git/gke-cny-git-https: UP
2 fe canary_kas_k8s_proxy/kas-cny-gke: MAINT
1 fe canary_kas_k8s_proxy/kas-cny-gke: UP
2 fe canary_kas/kas-cny-gke: MAINT
1 fe canary_kas/kas-cny-gke: UP
2 fe canary_web/gke-cny-web: MAINT
1 fe canary_web/gke-cny-web: UP
2 fe-ci api/api-gke-us-east1-b: UP
2 fe-ci api/api-gke-us-east1-c: UP
2 fe-ci api/api-gke-us-east1-d: UP
1 fe-ci api/gke-cny-api: MAINT
1 fe-ci api/gke-cny-api: UP
2 fe-ci api_rate_limit/localhost:
2 fe-ci asset_proxy/asset-bucket: UP
1 fe-ci canary_api/gke-cny-api: MAINT
1 fe-ci canary_api/gke-cny-api: UP
1 fe-ci canary_https_git/gke-cny-git-https: MAINT
1 fe-ci canary_https_git/gke-cny-git-https: UP
1 fe-ci canary_kas_k8s_proxy/kas-cny-gke: MAINT
1 fe-ci canary_kas_k8s_proxy/kas-cny-gke: UP
1 fe-ci canary_kas/kas-cny-gke: MAINT
1 fe-ci canary_kas/kas-cny-gke: UP
1 fe-ci canary_web/gke-cny-web: MAINT
1 fe-ci canary_web/gke-cny-web: UP
2 fe-ci https_git/git-https-gke-us-east1-b: UP
2 fe-ci https_git/git-https-gke-us-east1-c: UP
2 fe-ci https_git/git-https-gke-us-east1-d: UP
1 fe-ci https_git/gke-cny-git-https: MAINT
1 fe-ci https_git/gke-cny-git-https: UP
2 fe-ci kas_k8s_proxy/kas-gke: UP
2 fe-ci kas/kas-gke: UP
2 fe-ci main_api/api-gke-us-east1-b: UP
2 fe-ci main_api/api-gke-us-east1-c: UP
2 fe-ci main_api/api-gke-us-east1-d: UP
2 fe-ci main_web/web-gke-us-east1-b-8181: UP
2 fe-ci main_web/web-gke-us-east1-c-8181: UP
2 fe-ci main_web/web-gke-us-east1-d-8181: UP
1 fe-ci ssh/gke-cny-ssh: MAINT
1 fe-ci ssh/gke-cny-ssh: UP
2 fe-ci ssh/shell-gke-us-east1-b: UP
2 fe-ci ssh/shell-gke-us-east1-c: UP
2 fe-ci ssh/shell-gke-us-east1-d: UP
1 fe-ci web/gke-cny-web: MAINT
1 fe-ci web/gke-cny-web: UP
1 fe-ci websockets/gke-cny-ws: MAINT
1 fe-ci websockets/gke-cny-ws: UP
2 fe-ci websockets/ws-gke-us-east1-b: UP
2 fe-ci websockets/ws-gke-us-east1-c: UP
2 fe-ci websockets/ws-gke-us-east1-d: UP
2 fe-ci web/web-gke-us-east1-b-8181: UP
2 fe-ci web/web-gke-us-east1-c-8181: UP
2 fe-ci web/web-gke-us-east1-d-8181: UP
3 fe https_git/git-https-gke-us-east1-b: UP
3 fe https_git/git-https-gke-us-east1-c: UP
3 fe https_git/git-https-gke-us-east1-d: UP
2 fe https_git/gke-cny-git-https: MAINT
1 fe https_git/gke-cny-git-https: UP
3 fe kas_k8s_proxy/kas-gke: UP
3 fe kas/kas-gke: UP
3 fe main_api/api-gke-us-east1-b: UP
3 fe main_api/api-gke-us-east1-c: UP
3 fe main_api/api-gke-us-east1-d: UP
3 fe main_web/web-gke-us-east1-b-8181: UP
3 fe main_web/web-gke-us-east1-c-8181: UP
3 fe main_web/web-gke-us-east1-d-8181: UP
1 fe-pages pages_http/gke-cny-pages: MAINT
1 fe-pages pages_http/gke-cny-pages: UP
2 fe-pages pages_http/pages-us-east1-b: UP
2 fe-pages pages_http/pages-us-east1-c: UP
2 fe-pages pages_http/pages-us-east1-d: UP
1 fe-pages pages_https/gke-cny-pages-proxyv2: MAINT
1 fe-pages pages_https/gke-cny-pages-proxyv2: UP
2 fe-pages pages_https/pages-us-east1-b-proxyv2: UP
2 fe-pages pages_https/pages-us-east1-c-proxyv2: UP
2 fe-pages pages_https/pages-us-east1-d-proxyv2: UP
2 fe-registry registry/registry-us-east1-b: UP
2 fe-registry registry/registry-us-east1-c: UP
2 fe-registry registry/registry-us-east1-d: UP
2 fe ssh/gke-cny-ssh: MAINT
1 fe ssh/gke-cny-ssh: UP
3 fe ssh/shell-gke-us-east1-b: UP
3 fe ssh/shell-gke-us-east1-c: UP
3 fe ssh/shell-gke-us-east1-d: UP
2 fe web/gke-cny-web: MAINT
1 fe web/gke-cny-web: UP
2 fe websockets/gke-cny-ws: MAINT
1 fe websockets/gke-cny-ws: UP
3 fe websockets/ws-gke-us-east1-b: UP
3 fe websockets/ws-gke-us-east1-c: UP
3 fe websockets/ws-gke-us-east1-d: UP
3 fe web/web-gke-us-east1-b-8181: UP
3 fe web/web-gke-us-east1-c-8181: UP
3 fe web/web-gke-us-east1-d-8181: UP
```
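A pre/post-deploy check could parse this output directly. A minimal sketch, assuming the `<count> <frontend> <backend>/<server>: <STATE>` line format shown above; the `offline_canaries` helper is hypothetical, while `./bin/get-server-state gstg` is the real invocation from the evidence:

```python
import re
import subprocess
import sys

# Matches lines like "2 fe canary_web/gke-cny-web: MAINT".
# The trailing state is optional: some lines (e.g. api_rate_limit/localhost:)
# are printed with no state at all.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)/(\S+):\s*(\S+)?$")

def offline_canaries(output: str) -> list[str]:
    """Return frontend/backend/server entries for canary servers not reported UP."""
    offline = []
    for line in output.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        _count, frontend, backend, server, state = m.groups()
        # Canary servers carry "cny" in their name (gke-cny-*, kas-cny-gke).
        if "cny" in server and state != "UP":
            offline.append(f"{frontend} {backend}/{server}: {state}")
    return offline

if __name__ == "__main__":
    result = subprocess.run(
        ["./bin/get-server-state", "gstg"],
        capture_output=True, text=True, check=True,
    )
    bad = offline_canaries(result.stdout)
    if bad:
        print("Canary servers not fully online:", *bad, sep="\n  ")
        sys.exit(1)
```

Running this before and after a deploy would turn the silent drift above into a hard failure.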
Why was a single HAProxy in maintenance? No idea...

## What Should We Do?
Minimally, we should create an alert that fires whenever the canary servers are not 100% online.
Whether we do this via a check before and after each deploy, or via a simple Prometheus alert routed to our channel, is an implementation detail we should discuss.
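For the Prometheus option, the rule could be as simple as the following sketch. The metric and label names (`haproxy_server_up`, `server`) are assumptions based on the common HAProxy exporter and would need to match whatever our exporter actually emits:

```yaml
groups:
  - name: canary-availability
    rules:
      - alert: CanaryServerNotOnline
        # Placeholder metric/labels; adjust to the exporter in use.
        # Fires when any server matching the canary naming scheme is not UP.
        expr: count(haproxy_server_up{server=~".*cny.*"} == 0) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "One or more canary servers are not UP in HAProxy"
```

The `for: 15m` window is only a starting point to avoid paging on routine deploy-time drains.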
## Milestones

- Evaluate how we want to expose a problem with our HAProxy configuration
- Implement