Consider new api deployment for route /api/v4/jobs/request

added DeliveryP4 ServiceAPI kubernetes workflow-infraTriage + 1 deleted label

mentioned in issue production#4666 (closed)

Copying a comment from @ggillies production#4666 (comment 583169631)

Hmmmm, I personally am not sure about this. Simply from my perspective as an EOC, I already find what we have a bit tricky to follow when trying to track from an alert to a set of pods (and run ES, BigQuery, and Thanos queries over the right set of pods). And I work on this stuff day to day! I would instead like to change it so that when the pod receives a signal it immediately returns a 204 to any long-polling connections so that they disconnect straight away, and we can just shorten blackout periods overall (that's how cloud native workloads should work). I guess that's being tracked in gitlab-org/gitlab#325114 (closed)

I guess if we keep the same labels it shouldn't be too bad, but I am just against introducing more deployment complexity when the real solution is in the application/pod itself.
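To make the "return a 204 on signal" idea concrete, here is a minimal sketch of a long-poll handler that stops waiting as soon as shutdown begins. Everything in it is hypothetical (the `LongPollHandler` class, the `tick` interval, the 201 status used for a delivered job); it is not how workhorse or the Rails API actually implements this, only an illustration of the behaviour described above.

```python
import queue
import threading
import time


class LongPollHandler:
    """Hypothetical long-poll endpoint.

    A request is held open until a job arrives, the poll times out,
    or the process starts shutting down.
    """

    def __init__(self, poll_timeout=50.0, tick=0.05):
        self.jobs = queue.Queue()
        self.shutting_down = threading.Event()
        self.poll_timeout = poll_timeout  # assumed long-poll duration, in seconds
        self.tick = tick                  # how often the shutdown flag is re-checked

    def begin_shutdown(self):
        # In a real server this would be called from a SIGTERM handler.
        self.shutting_down.set()

    def handle_request(self):
        deadline = time.monotonic() + self.poll_timeout
        while not self.shutting_down.is_set():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                job = self.jobs.get(timeout=min(self.tick, remaining))
            except queue.Empty:
                continue
            return 201, job  # assumed status code for "here is a job"
        # Timed out or shutting down: release the client right away.
        return 204, None
```

With this shape, a pod that receives a termination signal releases its waiting clients within one `tick` instead of holding them for the full poll timeout, which is what would allow the blackout period to be shortened.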

There is another reason why this may be useful, and why we may want to expedite at least part of this.

We recently bought ourselves a bit more time on redis-persistent scalability by setting workhorse subscribe commands to only run on the api hosts / webservice: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12812.

However, this gain was quickly offset by migrating API to Kubernetes, which increased the overall number of workhorses. The mechanism scales with the number of workhorses, and is only in use for the /api/v4/jobs/request endpoint.

So if we split that endpoint out into a separate webservice, we can reduce the number of workhorses that need to subscribe to updates in redis, thus freeing up some more redis CPU.
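A rough sketch of the reasoning above, assuming one workhorse per pod and entirely made-up pod counts and deployment names (`webservice-api`, `webservice-jobs-request`): the path is routed to a dedicated deployment, and only that deployment's workhorses need to subscribe.

```python
JOBS_REQUEST_PATH = "/api/v4/jobs/request"


def pick_backend(path):
    """Route the runner long-poll endpoint to its own deployment.

    Backend names are hypothetical, for illustration only.
    """
    if path == JOBS_REQUEST_PATH:
        return "webservice-jobs-request"
    return "webservice-api"


def redis_subscribers(pods_per_backend, subscribing_backends):
    """Count workhorses subscribing to Redis, assuming one workhorse per pod."""
    return sum(pods_per_backend[name] for name in subscribing_backends)


# Made-up pod counts: before the split every API pod subscribes,
# after the split only the dedicated deployment does.
before = {"webservice-api": 60}
after = {"webservice-api": 54, "webservice-jobs-request": 6}

subscribers_before = redis_subscribers(before, ["webservice-api"])
subscribers_after = redis_subscribers(after, ["webservice-jobs-request"])
```

The real numbers depend on how many pods the dedicated deployment needs to serve runner traffic, so the actual saving on redis CPU would have to be measured.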

mentioned in merge request gitlab-com/gl-infra/k8s-workloads/gitlab-com!910 (closed)

A fix for graceful shutdown is incoming, which should address some of the original problems: gitlab-org/gitlab!62701 (merged).

However, I think isolating these hosts is still a good idea for the reasons mentioned above.

As such, I've opened a draft MR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!910 (closed). It may still need some work to get into a shippable state, as this is my first time touching that part of our k8s setup.

mentioned in issue scalability#1392

added 1 deleted label

added groupdelivery label

added 1 deleted label and removed teamDelivery label

added teamDelivery-Deployments label and removed 1 deleted label
