nginx-ingress-controller scale-up events result in 502s for clients
production#5561 (comment 681529229) shows that scale-up events, which create new nginx ingress replicas and reload configurations, cause 502s to be returned to clients (KAS was just the most noticeable victim). We see on the order of 1K failed requests out of roughly 100K, but only over a span of 5-6 seconds, so the total effect on apdex is negligible given we aggregate over at least 1 minute.
However, those 1K requests have a bad time, and it seems like something we should try to fix.
A 502 is Bad Gateway, meaning nginx could not talk to the backend it is proxying to (workhorse?). This looks more like an early failure to establish the TCP connection than a timeout, since a timeout should produce a 504 Gateway Timeout instead.
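The 502-vs-504 distinction can be sketched as follows. This is illustrative only, not nginx's actual code: it just shows that a backend actively refusing the TCP connection is a fast failure (the 502 path), while an unresponsive backend burns the full timeout (the 504 path).

```python
import socket

def upstream_status(host: str, port: int, timeout: float = 1.0) -> int:
    """Map the two upstream failure modes to the status codes a proxy
    like nginx would emit. Illustrative sketch only, not nginx's logic."""
    try:
        # A real proxy would forward the request over this connection.
        with socket.create_connection((host, port), timeout=timeout):
            return 200
    except ConnectionRefusedError:
        # Nothing accepting connections: fast TCP-level failure -> 502
        return 502
    except socket.timeout:
        # Backend never responded within the deadline -> 504
        return 504
```

A port with nothing bound to it is refused immediately (the 502 path); a black-holed address that drops packets would instead exhaust the timeout (the 504 path). That immediacy is consistent with the 502s clustering into a 5-6 second window around the reload.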
After a quick review of the nginx ingress discussions in various places in gl-infra, I think this is all new information. However, it has also been mooted (delivery#1974 (moved)) that we remove the nginx ingress from in front of the API fleet, which would likely eliminate this problem (or replace it with new and exciting bugs).
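If the ingress stays in place, one possible mitigation (a sketch only, assuming the ingress-nginx `proxy-next-upstream` annotation family is available in the controller version we run) would be to have nginx retry against the next upstream when the first connection attempt fails, masking transient 502s during reloads:

```yaml
# Hypothetical Ingress annotation fragment; verify these annotations are
# supported by our ingress-nginx version before relying on them.
metadata:
  annotations:
    # Retry the next upstream on connection errors, timeouts, and 502s.
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
```

Note this only helps for idempotent requests; retrying non-idempotent API calls could be worse than the 502.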
Related issues:
- delivery#1937 (comment 669732237) - in particular it links to a bug in kubernetes, although that one is about scale-downs (I think).