Post-migration analysis of Git HTTPS VMs to the zonal Kubernetes clusters
As a final step in the Git HTTPS / WebSocket migration to Kubernetes, we should perform a post-migration analysis to assess performance improvements and cost reductions, similar to what we did for the urgent-other Sidekiq shard.
This will allow us to track the success of our migration in terms of infrastructure size and cost.
Timeline
Configuration changes were made over 8 working days:
- 2020-10-15: Started to move Git HTTPS traffic to the 3 zonal GKE clusters, at 50% of traffic
- 2020-10-16: Noticed that during deploys we were seeing an apdex drop that occasionally fell below our SLO targets for workhorse
- 2020-10-16: While investigating, we realized we were not using the same queueing arguments for workhorse; fixed in gitlab-com/gl-infra/k8s-workloads/gitlab-com!459 (merged), and opened gitlab-org/charts/gitlab#2365 to move these defaults into the charts (see the values sketch after this timeline)
- 2020-10-16: Found an issue where, if workhorse argument parsing fails, clusters can become unavailable while still passing the health check. Opened gitlab-org/charts/gitlab#2360 to track the health check issue, and we fixed the upstream issue in workhorse
- 2020-10-20: We had a theory that the readiness check was passing too early, causing a slow-down on new pods. Opened gitlab-com/gl-infra/k8s-workloads/gitlab-com!467 (merged) to increase the readiness check delay (see the deployment spec sketch after this timeline)
- 2020-10-20: Spun off the investigation into poor performance during rolling updates into #1294 (closed)
- 2020-10-22: To be more cautious about deployments, we set the maximum number of unavailable pods to 0 for rolling deployments in gitlab-com/gl-infra/k8s-workloads/gitlab-com!471 (merged), instead of the K8s default of 25% (also covered in the deployment spec sketch after this timeline)
- 2020-10-26: We finally found the root cause of the slow-down: a bug in the older version of nginx-ingress that ships with the chart; see #1294 (comment 436049480). Opened gitlab-org/charts/gitlab#2377 (closed) to upgrade the version, or to add one of the workarounds as a default
- 2020-10-26: 100% of traffic to the zonal clusters.
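
For context on what "queueing arguments" refers to above: gitlab-workhorse exposes request-queueing flags (`-apiLimit`, `-apiQueueLimit`, `-apiQueueDuration`) that throttle and queue API requests. A minimal sketch of passing them through Helm values is below; the values path and the numbers are illustrative assumptions, not the actual settings applied in !459.

```yaml
# Sketch only: the exact values key for passing workhorse flags may differ by
# chart version; the flags themselves are standard gitlab-workhorse options.
gitlab:
  webservice:
    workhorse:
      # Placeholder numbers, not our production values.
      extraArgs: "-apiLimit 50 -apiQueueLimit 250 -apiQueueDuration 30s"
```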
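
The readiness-delay change (2020-10-20) and the max-unavailable change (2020-10-22) are both standard Kubernetes Deployment settings. The sketch below shows where they live in a Deployment spec; the resource name, image, probe endpoint, and delay values are illustrative assumptions, not the exact values from the merged MRs.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-webservice          # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gitlab-webservice
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Keep every existing pod in rotation until its replacement is Ready,
      # instead of the Kubernetes default of up to 25% unavailable.
      maxUnavailable: 0
      maxSurge: 25%
  template:
    metadata:
      labels:
        app: gitlab-webservice
    spec:
      containers:
        - name: gitlab-workhorse
          image: registry.gitlab.com/example/gitlab-workhorse:latest   # illustrative image
          ports:
            - containerPort: 8181
          readinessProbe:
            httpGet:
              path: /-/readiness   # illustrative endpoint
              port: 8181
            # Delay the first readiness check so a new pod is not marked Ready
            # (and sent traffic) before it can actually serve requests.
            initialDelaySeconds: 10
            periodSeconds: 5
```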
Performance - A dramatic improvement
Performance increased after migrating to the K8s clusters, mostly because the Git VM fleet was saturated and slightly under-provisioned. This was a known issue going into the migration, and we opted not to expand the Git fleet before the migration completed.
Before the Migration: Tuesday, Oct 13th from 10:00 to 12:00 (Virtual Machines)
After the Migration: Tuesday, Oct 27th from 10:00 to 12:00 (Kubernetes)
Note that we are now often at 100% apdex for the Git service, and even more striking, the 50th percentile of workhorse latencies dropped by more than half. The spikiness of the workhorse latencies before the migration shows that we were often hitting saturation points on the VMs.
Cost
- We increased the number of VMs from 25 to 30, while we continue to observe auto-scaling behavior and adjust resource requests to better optimize the K8s nodes. See also the additional cost overhead of running the clusters, which we estimate at an additional ~$1,500/month: #1175 (closed)
Deploy times
Deploying to all 25 virtual machines in the Git fleet takes ~50 minutes; deploying to the entire K8s cluster, for all services running there, takes ~17 minutes. Though we don't see overall deployment times decreasing yet, once we finish moving all front-end services to the cluster we expect at least a 50% reduction in overall duration.