Support lifecycle hook for workhorse container in webservice deployment

Summary

We observe 502 errors when webservice pods terminate (HPA scale-down, rolling upgrades) because the workhorse container exits immediately. Please add lifecycle hook support for the workhorse container as a quick fix until workhorse graceful shutdown is fixed.

More details

Hi,
the problem is that when the webservice deployment scales down, we observe 502 errors on the ingress controller (Traefik). We have shutdown.blackoutSeconds and deployment.terminationGracePeriodSeconds configured, but this doesn’t help.

What we have found so far:

The webservice pod has two containers: workhorse and the puma webservice.

When the pod gets a termination signal, we see that the puma container correctly waits for the blackout period, but the workhorse container exits immediately after it gets SIGTERM from Kubernetes, giving the ingress controller no time to finish in-flight requests or remove the pod IP from its upstream list.

Since the ingress points to workhorse port 8181, it was clear that the root cause of the 502 errors is the workhorse container.

As an experiment, we added a preStop hook with sleep 60 directly to the workhorse container in the webservice deployment manifest, and the 502 errors went away.
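
For reference, the manual patch looked roughly like the fragment below. This is a minimal sketch rather than the exact manifest: the container name `gitlab-workhorse` and the presence of `/bin/sh` in the workhorse image are assumptions, and all unrelated fields are omitted.

```yaml
# Fragment of the webservice Deployment (unrelated fields omitted).
spec:
  template:
    spec:
      # Must be longer than the preStop sleep, otherwise the kubelet
      # kills the container before the hook finishes.
      terminationGracePeriodSeconds: 70
      containers:
        - name: gitlab-workhorse   # assumed container name
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so the ingress controller can drain
                # in-flight requests and drop the pod IP from its upstreams.
                command: ["/bin/sh", "-c", "sleep 60"]
```

The sleep of 60 matches our blackoutSeconds value, and the 70-second grace period leaves a small margin for the container to exit after the hook completes.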

The issue is that the Helm chart, for whatever reason, doesn’t provide a way to specify container lifecycle hooks.
Ideally workhorse should handle graceful shutdown itself, but judging by the workhorse code, it does not handle graceful shutdown at the moment.

Steps to reproduce

Delete a webservice pod and observe 502 errors on the ingress controller.

Configuration used

          webservice:
            enabled: true
            hpa:
              minReplicas: 2
              maxReplicas: 20
            ingress:
              proxyBodySize: "2048m"
              tls:
                secretName: <redacted>
            metrics:
              enabled: true
              serviceMonitor:
                enabled: true
                additionalLabels:
                  prometheus: main
            resources:
              requests:
                cpu: 2
                memory: 2400Mi
            shutdown:
              blackoutSeconds: 60
            deployment:
              terminationGracePeriodSeconds: 70

Current behavior

502 errors on webservice pod termination

Expected behavior

No 502 errors on webservice pod termination

Versions

Chart: 6.3.1

Edited Nov 30, 2022 by Andrey Golev