Vet deployments ahead of moving API traffic over to Kubernetes

Run a thorough test of the API service during deployment scenarios. Use this issue to determine what errors users will see, what our metrics will show, and what log data we'll have. We currently handle upgrades cleanly by draining a node from haproxy, performing the upgrade, and then putting that server back into rotation; Kubernetes will handle this differently. Since we'll have nginx in front of the service to start, but want to remove nginx in the future, test both configurations. Because this service is directly user facing, we need to minimize 502/503 errors as much as possible before running this in production.
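For reference, the current drain flow looks roughly like the following, via the haproxy runtime socket. This is a sketch only; the backend/server names and socket path are illustrative, not our actual config:

```shell
# Take the node out of rotation before upgrading (names are illustrative)
echo "set server api_backend/fe-01 state drain" | socat stdio /run/haproxy/admin.sock

# ... perform the upgrade on fe-01 ...

# Put the server back into rotation
echo "set server api_backend/fe-01 state ready" | socat stdio /run/haproxy/admin.sock
```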

Use the pre environment to perform load testing of the API service

Perform the following tests:

  • Perform a deploy
  • Kill off a random pod

Test twice:

  • Using the nginx ingress service endpoint
  • Using the API service endpoint

Capture as much information as possible to determine what tweaks we may need to make, or blockers we may need to resolve (a sketch of the deploy and pod-kill steps follows below).
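Roughly, each test run could look like this. The deployment name and pod label are placeholders, not our actual manifests; this assumes kubectl access to the target cluster:

```shell
# Mock deployment: roll the pods and wait for the rollout to finish
kubectl rollout restart deployment/webservice-api
kubectl rollout status deployment/webservice-api

# Pod kill: delete one randomly chosen pod from the API workload
kubectl get pods -l app=webservice-api -o name | shuf -n 1 | xargs kubectl delete
```

Run the load test continuously while each disruption happens, once against the nginx ingress endpoint and once against the API service endpoint.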

This issue can be closed when a thorough analysis is complete, and we are satisfied that we can push this service into production without hindering the API's current capabilities during application disruptions.


While we are here, validate the Puma worker count. Can we observe differing behavior with different Puma worker counts? If possible, also see whether we can tune the Puma worker count per workload, so that the API's Puma workers can be set differently from the rest of the webservice deployments.
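As a sketch of what per-workload tuning could look like, assuming the worker count is exposed to the pods through an environment variable (the variable name and deployment names here are assumptions, not the chart's actual interface):

```shell
# Hypothetical: give the API deployment its own Puma worker count...
kubectl set env deployment/webservice-api PUMA_WORKERS=4

# ...while the rest of the webservice pods keep a different value
kubectl set env deployment/webservice PUMA_WORKERS=2
```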

/cc @jarv

Summary of Work

Testing was done with bombardier from one of the fe nodes in staging. The endpoint used was the nginx ingress, and we hit the projects API endpoint, as I wanted to both exercise the API service and ensure we were receiving legitimate responses.
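A representative bombardier invocation (the host and token are placeholders, not the actual test parameters):

```shell
# 50 concurrent connections for 2 minutes against the projects API,
# printing the latency distribution (-l) alongside the status-code counts
bombardier -c 50 -d 2m -l \
  -H "PRIVATE-TOKEN: ${API_TOKEN}" \
  "https://gitlab.example.com/api/v4/projects"
```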

When killing pods to mimic pod failure, and during mock deployments (rolling a deployment), we never received 5xx-class errors, so we started off well.

We did notice that mock deployments were taking an excruciatingly long time, due to a lengthy terminationGracePeriod and blackout period, along with a delayed readiness probe. We made tweaks to reduce the time spent in each.
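The tweaks were along these lines (a sketch only; the actual values and manifest paths live in the deployment repo, and the numbers below are illustrative, not the ones we shipped):

```shell
# Shorten the grace period and bring the readiness probe forward
kubectl patch deployment webservice-api --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/terminationGracePeriodSeconds", "value": 30},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 5}
]'
```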

With these changes, deployments now complete much faster, and we've confirmed that we still receive no 5xx-class errors.

We did not test using the Service endpoint of the API service (bypassing nginx). This is because nginx carries a custom configuration specific to the API, where we leverage proxy request buffering to protect our Rails applications: #1557 (comment 561998359)
