Post-migration analysis of Git HTTPS VMs to the zonal Kubernetes clusters
As a final step in the Git HTTPS / WebSocket migration to Kubernetes, we should perform a post-migration analysis to assess performance improvements and cost reductions, similar to what we did for the urgent-other Sidekiq shard.
This will allow us to track the success of our migration in terms of infrastructure size and cost.
Timeline
Configuration changes were made over 8 working days:
- 2020-10-15: Started to move Git HTTPS traffic to the 3 zonal GKE clusters, at 50% of traffic
- 2020-10-16: Noticed that during deploys we were seeing an apdex drop that occasionally fell below our SLO targets for workhorse
- 2020-10-16: While investigating, we realized we were not using the same queueing arguments for workhorse; fixed in gitlab-com/gl-infra/k8s-workloads/gitlab-com!459 (merged), and opened gitlab-org/charts/gitlab#2365 to move these defaults into the charts (see the values sketch after this timeline)
- 2020-10-16: Found an issue where, if workhorse argument parsing fails, clusters can become unavailable while still passing the health check. Opened gitlab-org/charts/gitlab#2360 to track the health check issue, and we fixed the upstream issue in workhorse
- 2020-10-20: We had a theory that the readiness check was passing too early, causing a slow-down on new pods. Opened gitlab-com/gl-infra/k8s-workloads/gitlab-com!467 (merged) to increase the readiness check delay (see the deployment spec sketch after this timeline)
- 2020-10-20: Spun off the investigation into poor performance during rolling updates into #1294 (closed)
- 2020-10-22: To be more cautious about deployments, we set the maximum number of unavailable pods to 0 for rolling deployments in gitlab-com/gl-infra/k8s-workloads/gitlab-com!471 (merged), instead of the K8s default of 25% (also covered in the deployment spec sketch after this timeline)
- 2020-10-26: We finally found the root cause of the slow-down: a bug in the older version of nginx-ingress that ships with the chart; see #1294 (comment 436049480). Opened gitlab-org/charts/gitlab#2377 (closed) to upgrade the version, or to add one of the workarounds as a default
- 2020-10-26: 100% of traffic to the zonal clusters.
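
For context on what "queueing arguments" refers to above: gitlab-workhorse exposes request-queueing flags (`-apiLimit`, `-apiQueueLimit`, `-apiQueueDuration`) that throttle and queue API requests. A minimal sketch of passing them through Helm values is below; the values path and the numbers are illustrative assumptions, not the actual settings applied in !459.

```yaml
# Sketch only: the exact values key for passing workhorse flags may differ by
# chart version; the flags themselves are standard gitlab-workhorse options.
gitlab:
  webservice:
    workhorse:
      # Placeholder numbers, not our production values.
      extraArgs: "-apiLimit 50 -apiQueueLimit 250 -apiQueueDuration 30s"
```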
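
The readiness-delay change (2020-10-20) and the max-unavailable change (2020-10-22) are both standard Kubernetes Deployment settings. The sketch below shows where they live in a Deployment spec; the resource name, image, probe endpoint, and delay values are illustrative assumptions, not the exact values from the merged MRs.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-webservice          # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gitlab-webservice
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Keep every existing pod in rotation until its replacement is Ready,
      # instead of the Kubernetes default of up to 25% unavailable.
      maxUnavailable: 0
      maxSurge: 25%
  template:
    metadata:
      labels:
        app: gitlab-webservice
    spec:
      containers:
        - name: gitlab-workhorse
          image: registry.gitlab.com/example/gitlab-workhorse:latest   # illustrative image
          ports:
            - containerPort: 8181
          readinessProbe:
            httpGet:
              path: /-/readiness   # illustrative endpoint
              port: 8181
            # Delay the first readiness check so a new pod is not marked Ready
            # (and sent traffic) before it can actually serve requests.
            initialDelaySeconds: 10
            periodSeconds: 5
```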
Performance - A dramatic improvement
Performance increased after migrating to the K8s clusters, mostly because the Git VM fleet was saturated and slightly under-provisioned. This was a known issue going into the migration, and we opted not to expand the Git fleet before the migration completed.
Before the Migration: Tuesday, Oct 13th from 10:00 to 12:00 (Virtual Machines)
After the Migration: Tuesday, Oct 27th from 10:00 to 12:00 (Kubernetes)
Note that we are now often at 100% apdex for the Git service, and even more striking, the 50th percentile of workhorse latencies dropped by more than half. The spikiness of the workhorse latencies before the migration shows that we were often hitting saturation points on the VMs.
Cost
- We increased the number of VMs from 25 to 30, while we continue to observe auto-scaling behavior and adjust resource requests to better optimize the K8s nodes. See also the additional cost overhead of running the clusters, which we estimate at an additional ~$1,500/month: #1175 (closed)
Deploy times
Deploying to all 25 virtual machines in the Git fleet takes ~50 minutes; deploying to the entire K8s cluster, for all services running there, takes ~17 minutes. Though we don't see overall deployment times decreasing yet, once we finish moving all front-end services to the cluster we expect at least a 50% reduction in overall duration.