Adjustments around bot long-polling behavior (!864) · Merge requests · BuildGrid / buildgrid

Description

This PR addresses a few issues relating to the long-polling of bots.

Lower MAX_WORKER_TTL from 3600 seconds to 300 seconds.
Wait for 80% of the given deadline, instead of the full deadline minus NETWORK_TIMEOUT (previously 1s).
Increase NETWORK_TIMEOUT from 1 second to 3 seconds.

MAX_WORKER_TTL is lowered to 300 seconds to handle more gracefully the case where no request-timeout is specified on the client side. We already have logic for when the deadline is None, but depending on the client language an unset request-timeout might instead arrive as an arbitrarily large uint64 value. 300 was chosen because it was the previous default of MAX_JOB_BLOCK_TIME, which MAX_WORKER_TTL replaced.

The adjustments around NETWORK_TIMEOUT give BuildGrid more time to respond to requests when there is no work available. We've seen cases where, with a large threadpool, it would often take longer than the 1s we previously allocated to stop waiting for work and finish the request. This would result in buildbox-worker crashing and potentially restarting, making the issue worse. By using a percentage of the given deadline and increasing the minimum value we allow (NETWORK_TIMEOUT), we ensure that BuildGrid has enough time to finish requests, barring more serious issues.
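
As a rough illustration of how the three changes interact, here is a minimal Python sketch. It is not the code from this MR: the function name long_poll_wait and the constant DEADLINE_FRACTION are hypothetical, and both the order of operations (cap first, then take 80%) and the handling of very short deadlines are assumptions. Only the three numeric values come from the description above.

```python
from typing import Optional

MAX_WORKER_TTL = 300.0    # seconds; value taken from this MR
NETWORK_TIMEOUT = 3.0     # seconds; value taken from this MR
DEADLINE_FRACTION = 0.8   # 80% from this MR; the constant name is hypothetical


def long_poll_wait(deadline: Optional[float]) -> float:
    """Return how many seconds to wait for work before answering the bot.

    Hypothetical helper, for illustration only.
    """
    if deadline is None:
        # Assumption: with no client deadline, fall back to the cap.
        deadline = MAX_WORKER_TTL

    # An unset client timeout may arrive as a huge uint64 value, so cap it.
    capped = min(float(deadline), MAX_WORKER_TTL)

    if capped < NETWORK_TIMEOUT:
        # Assumption: deadlines shorter than NETWORK_TIMEOUT leave no room
        # to long-poll, so respond immediately.
        return 0.0

    # Wait for only a fraction of the deadline, leaving the rest as
    # headroom to stop waiting and send the response.
    return capped * DEADLINE_FRACTION
```

Under these assumptions, a client that sends no timeout, or an effectively infinite one, is held for at most 80% of 300 seconds, and the remaining 20% is the margin BuildGrid has to finish the request.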
