How will runners behave during CI decomposition failover?
Since the runner fleet generates high traffic and needs to write to the CI tables, it is important that we understand exactly what will happen during the failover period in phase 7 of &6160. We can also validate this during the "phase 5" dry run on staging.
If we find that this could generate cascading adverse load on our systems, we may want a mitigation plan such as:
- Block runner API traffic at a proxy
- Refactor the code to be slightly more graceful
- Figure out whether runners can implement a backoff strategy
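
To make the backoff option concrete, here is a minimal sketch of exponential backoff with full jitter. It is not taken from the runner codebase: the function names `backoff_delay` and `next_attempt`, the 1 second base, and the 60 second cap are all assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Return how many seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative values, not
    settings taken from the runner.
    """
    # Clamp the exponent so very long outages cannot overflow the float math.
    exponent = min(attempt, 30)
    # The ceiling doubles with every consecutive failure, up to `cap`.
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: pick a random delay in [0, ceiling) so that runners that
    # failed together do not all retry at the same instant.
    return rng() * ceiling


def next_attempt(attempt, status_code):
    """Increment the failure counter on a 5xx response, reset it otherwise."""
    return attempt + 1 if status_code >= 500 else 0
```

A polling loop would call `next_attempt` after each response and sleep for `backoff_delay(attempt)` whenever the counter is above zero. The jitter matters here because the whole fleet would start failing at the same moment when the failover begins.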
My hope is that runners receiving a 500 response to every request for a few seconds will not generate any more load on the system than runners operating normally, and that they will simply pick up where they left off, asking for jobs again once the failover finishes. In fact, this should be no different from the regular Patroni failovers that already happen periodically.
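
The hope above depends on how a runner reacts to a 500. The toy model below counts the requests one runner sends under two hypothetical policies: keep the normal poll interval after an error, or retry sooner. The function name `count_requests` and every number used with it are assumptions for illustration. Actual runner behaviour still needs to be confirmed, for example during the staging dry run.

```python
def count_requests(duration_s, poll_interval_s, outage, retry_delay_s=None):
    """Count the requests a single runner sends in `duration_s` seconds.

    outage:        (start, end) in seconds; requests sent in this window
                   are treated as failing with a 500.
    retry_delay_s: wait after a failed request; None means the runner keeps
                   its normal poll interval even after an error.
    """
    start, end = outage
    t = 0.0
    sent = 0
    while t < duration_s:
        sent += 1
        failed = start <= t < end
        if failed and retry_delay_s is not None:
            t += retry_delay_s  # eager retry after an error
        else:
            t += poll_interval_s  # normal polling cadence
    return sent
```

With a fixed poll interval the request count is the same whether or not an outage occurs, which matches the hope stated above. If a runner retries faster than it normally polls, the count rises during the outage, and that is the case where one of the mitigations would be needed.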