Support lifecycle hook for workhorse container in webservice deployment
Summary
We observe 502 errors on webservice pod termination events (HPA scale-down, rolling upgrades) because the workhorse container exits immediately. Please add lifecycle hook support for the workhorse container as a quick fix until workhorse graceful shutdown is implemented.
More details
Hi,
the problem is that when the webservice deployment scales down, we observe 502 errors on the ingress controller (Traefik).
We have blackoutSeconds and terminationGracePeriodSeconds configured, but this doesn't help.
What we have found so far:
The webservice pod has two containers: workhorse and the Puma webservice.
When the pod receives the termination signal, the Puma container waits for the blackout period as expected, but the workhorse container exits immediately after it gets SIGTERM from Kubernetes. This leaves the ingress controller no time to finish in-flight requests or to remove the pod IP from its upstream list.
Since the ingress points to workhorse on port 8181, it was clear that the workhorse container is the root cause of the 502 errors.
As an experiment, we added a preStop hook with sleep 60 directly to the workhorse container in the webservice deployment manifest, and the 502 errors disappeared.
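A minimal sketch of that experiment, as a fragment of the webservice Deployment manifest. The container name `gitlab-workhorse` and the presence of `/bin/sh` and `sleep` in the workhorse image are assumptions here; adjust them to match the rendered manifest. The 60 and 70 second values mirror the configuration listed further down.

```yaml
# Fragment of the webservice Deployment (edited by hand, not via Helm values).
spec:
  template:
    spec:
      # Must be longer than the preStop sleep, otherwise the kubelet
      # kills the container before the hook finishes.
      terminationGracePeriodSeconds: 70
      containers:
        - name: gitlab-workhorse   # assumed container name
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so the ingress controller can drain
                # in-flight requests and drop the pod IP from upstreams.
                command: ["/bin/sh", "-c", "sleep 60"]
```

Kubernetes runs the preStop hook before sending SIGTERM to the container, so workhorse keeps serving traffic for the duration of the sleep.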
The issue is that the Helm chart currently has no way to specify container lifecycle hooks for workhorse.
Ideally workhorse should handle graceful shutdown itself, but judging by the workhorse code, there is no graceful shutdown handling at the moment.
Steps to reproduce
Delete a webservice pod and observe 502 errors on the ingress controller.
Configuration used
webservice:
  enabled: true
  hpa:
    minReplicas: 2
    maxReplicas: 20
  ingress:
    proxyBodySize: "2048m"
    tls:
      secretName: <redacted>
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
      additionalLabels:
        prometheus: main
  resources:
    requests:
      cpu: 2
      memory: 2400Mi
  shutdown:
    blackoutSeconds: 60
  deployment:
    terminationGracePeriodSeconds: 70
Current behavior
502 errors on webservice pod termination
Expected behavior
No 502 errors on webservice pod termination
Versions
- Chart: 6.3.1