Adjustments around bot long-polling behavior
Description
This PR addresses a few issues relating to the long-polling of bots.
- Update MAX_WORKER_TTL to be 300 seconds instead of 3600.
- Use a percentage of the given deadline (80%) instead of the entire value - NETWORK_TIMEOUT (previously 1s).
- Increase NETWORK_TIMEOUT to 3 seconds from 1 second.
MAX_WORKER_TTL is reduced to 300 seconds to handle more gracefully the case where no request-timeout is specified on the client side. We have logic for the case where the deadline is None, but depending on the client language, an unset request-timeout might instead arrive as an arbitrarily large uint64 value. 300 was chosen because it was the previous default of MAX_JOB_BLOCK_TIME, which MAX_WORKER_TTL replaced.
The adjustments around NETWORK_TIMEOUT give BuildGrid more time to respond to requests when there is no work available. With a large threadpool, we've seen it often take longer than the 1s we previously allocated to stop waiting for work and finish the request. This would result in buildbox-worker crashing and potentially restarting, making the issue worse. By using a percentage of the given deadline and increasing the minimum value we allow (NETWORK_TIMEOUT), we ensure that BuildGrid has enough time to finish requests, barring more serious issues.
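As a rough illustration of the behavior described above, the wait-time calculation might look something like the sketch below. This is not the code in this PR: the function name `poll_wait_seconds` and the constant `DEADLINE_FRACTION` are hypothetical, and treating NETWORK_TIMEOUT as the smallest deadline worth long-polling on is one assumed reading of "the minimum value we allow".

```python
from typing import Optional

MAX_WORKER_TTL = 300      # seconds; upper bound on how long a bot may long-poll
NETWORK_TIMEOUT = 3       # seconds; smallest usable deadline (assumed interpretation)
DEADLINE_FRACTION = 0.8   # hypothetical name for the 80% figure


def poll_wait_seconds(deadline: Optional[float]) -> float:
    """Return how long to wait for work before replying (hypothetical helper)."""
    if deadline is None:
        # No deadline supplied: fall back to the maximum TTL.
        capped = float(MAX_WORKER_TTL)
    else:
        # An unset client timeout may arrive as a huge uint64, so clamp it.
        capped = min(float(deadline), float(MAX_WORKER_TTL))

    if capped < NETWORK_TIMEOUT:
        # Assumption: too little time to long-poll at all, so reply immediately.
        return 0.0

    # Wait for only part of the deadline; the rest is headroom for the response.
    return capped * DEADLINE_FRACTION
```

Under these assumptions, a missing deadline and a deadline of `2**64 - 1` both produce a wait of about 240 seconds, leaving about 60 seconds of headroom. A 10-second deadline produces a wait of about 8 seconds.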