Reduce the number of database queries when assigning leases
Context
Currently, when a bot sends an UpdateBotSession
request with a long timeout the server keeps the connection open and repeatedly looks for available jobs matching the worker's capabilities (see #239 (closed) for some thoughts on improving that query itself). If we have many workers connecting, that means making large numbers of relatively complex and similar queries to the database every second.
This is wasteful, using up available database connections and also causing unnecessary load on the the database. The current implementation also has the limitation that new jobs can only be assigned at the rate at which the query is repeated. We can solve this limitation (if using PostgreSQL) by using a long-lived connection which uses LISTEN/NOTIFY to detect new jobs being queued, and using that to decide when to make the job assignment query. However, this ties up a database connection for the duration of the bot connection, and with many bots will quickly starve the other parts of BuildGrid of connections.
Proposed solution
We solved a similar issue of many connections needing to get regular updates of job status by adding the job watcher thread. This is a single thread which (using PostgreSQL) connects to the database and uses LISTEN/NOTIFY to detect changes made to jobs that connections are interested in, and hand updates to those connections when the changes are detected.
We can solve the requesting work problem in a similar way. A thread can be started which listens for new jobs entering the queue. This thread will track the capabilities of workers who are waiting for work (and didn't find a job with an initial query), and match new jobs to relevant workers as they are queued. This gives the low-latency assignment that is preferred by some clients, and also allows many workers to utilise a single database connection for finding jobs.
Even the initial query can be optimised away by using the same thread to also maintain a cached view of the job queue. The thread can additionally listen for jobs being assigned, and use that in conjunction with notifications of jobs being enqueued to curate a cache of all the queued jobs. This reduces the number of database connections used by workers trying to find leases to just a long-lived connection listening for notifications, and a short-lived connection to attempt to claim a specific job when one becomes available