hpa downscaling webservice causing 502 errors with AWS Load Balancer
Problem to solve
Advice on configuring the AWS Load Balancer when it is being used as a replacement for nginx-ingress.
The current general recommendation is to keep 75% of webservice pods as the minimum available; the customer would like to see this lowered for improved cost savings.
Configuration investigation for nginx-ingress is being done in hpa downscaling webservice causing 502 errors with nginx-ingress
This issue is to investigate how that advice translates to AWS Load Balancer usage instead.
This documentation could perhaps be placed in https://docs.gitlab.com/charts/charts/gitlab/webservice/
Further details
Proposal
Who can address the issue
Ingress configuration when using AWS Load Balancer
Other links/references
hpa downscaling webservice causing 502 errors with nginx-ingress
Customer request via ZD - internal to GitLab team members
Root cause and resolution
When Kubernetes terminates a Webservice Pod, the AWS Load Balancer appears to briefly continue sending new traffic to the Pod, even after the endpoint has been marked as "Draining" in the target group by the AWS Load Balancer Controller.
AWS recommends using a preStop hook to catch the SIGTERM from Kubernetes and have the application return an "unhealthy" status and sleep for long enough to have the AWS Load Balancer recognize that the endpoint is unhealthy and remove it from the target group.
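For illustration only, a minimal sketch of the kind of preStop hook AWS describes, as it would appear in a generic Kubernetes container spec. The sleep duration here is a placeholder; in the GitLab chart this behaviour is driven by shutdown.blackoutSeconds rather than a hand-written hook:

```yaml
# Illustrative generic container lifecycle hook (not the chart's actual hook).
# The sleep keeps the container alive while the AWS Load Balancer marks the
# target unhealthy and drains it from the target group.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]  # placeholder duration
```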
While the webservice container makes use of a preStop hook to send a SIGINT to the puma master process, and sets an environment variable used by the Rails application to sleep for the period set by the shutdown.blackoutSeconds value of the Webservice chart, the gitlab-workhorse container responds to the SIGTERM immediately and shuts down.
We've opened !3084 (merged) to add a preStop sleep for the gitlab-workhorse container, using the same shutdown.blackoutSeconds value. When used with an appropriate value for shutdown.blackoutSeconds and implementing the correct AWS Load Balancer Controller healthcheck annotations for the webservice Ingress, this should mitigate most, if not all, of the 502 errors returned by the AWS Load Balancer when webservice pods terminate due to scale down events.
We'd recommend the following settings and annotations once !3084 (merged) is merged and released:
- shutdown.blackoutSeconds: 30 (or longer, to let long-running requests finish)
- deployment.terminationGracePeriodSeconds: 40 (must be longer than shutdown.blackoutSeconds)
The default for shutdown.blackoutSeconds is 10s, which is too low given the minimum healthcheck interval for the AWS Load Balancer (5s minimum, 15s by default).
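As a sketch, these two settings would land in values.yaml roughly as follows, assuming the standard gitlab.webservice key layout of the umbrella chart:

```yaml
gitlab:
  webservice:
    shutdown:
      # Blackout period: webservice reports itself unhealthy and sleeps
      # before shutting down, so the load balancer can drain the target.
      blackoutSeconds: 30
    deployment:
      # Must be longer than shutdown.blackoutSeconds so Kubernetes does not
      # kill the pod before the blackout sleep completes.
      terminationGracePeriodSeconds: 40
```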
Set the following for gitlab.webservice.ingress.annotations (Note: these should be set specifically for webservice and not as part of global.ingress.annotations)
- alb.ingress.kubernetes.io/healthcheck-path: "/-/readiness"
- alb.ingress.kubernetes.io/healthcheck-interval-seconds: '10' (or any value less than shutdown.blackoutSeconds / 2; the default to mark an endpoint unhealthy is 2 consecutive "unhealthy" responses)
- alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5' (must be less than healthcheck-interval-seconds)
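Put together, a hedged values.yaml sketch of the Ingress annotations might look like the following; the annotation keys and values are those listed above, and the surrounding structure assumes the umbrella chart's gitlab.webservice.ingress.annotations path:

```yaml
gitlab:
  webservice:
    ingress:
      annotations:
        # Use the Rails readiness endpoint so a pod in its blackout period
        # reports unhealthy and is removed from the target group.
        alb.ingress.kubernetes.io/healthcheck-path: "/-/readiness"
        # Keep this below shutdown.blackoutSeconds / 2, so two consecutive
        # failed checks fit inside the blackout window.
        alb.ingress.kubernetes.io/healthcheck-interval-seconds: '10'
        # Must be less than healthcheck-interval-seconds.
        alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'
```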
We recommend testing these values against your usage patterns and traffic, and tuning them as appropriate for your environment.