Fix timeout errors
(UPDATE 2023-08-17: wrote a new description for this issue, the old one wasn't that useful.)
BPO's architecture needs to be reworked to fix various timeout errors (from sourcehut to bpo, sometimes also from gitlab to bpo).
Right now the API works like this:
- API call comes in
- BPO runs a bunch of related code, such as figuring out which package to build next, or publishing packages from the WIP repository to the final repository. This takes time.
- Only after that does BPO respond to the API call with 200; if it takes too long, we get a timeout error from either the client (sourcehut, gitlab) or the reverse proxy in front of the API
- BPO processes the next API call
Note that the public website (https://build.postmarketos.org/) is just a static HTML page that gets regenerated often. It is served by nginx directly without going through bpo, so accessing it is not related to the timeout issues.
What we want instead:
- API call comes in
- BPO puts it into a queue
- BPO processes the next API call
Maybe even multiple API calls in parallel, but even if we keep it at one API call at a time, it would already fix the timeout issues as long as each API call is handled quickly.
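
A minimal sketch of the "enqueue and answer right away" idea, using Python's standard `queue` module. The names `api_queue` and `handle_api_call` and the example payload are made up for illustration and are not taken from bpo's code:

```python
import queue

# Hypothetical names for illustration; not taken from bpo's code.
api_queue = queue.Queue()


def handle_api_call(name, payload):
    """Accept an API call: enqueue it and return a status code right away."""
    api_queue.put((name, payload))
    # The caller gets its answer before any heavy work starts.
    return 200


status = handle_api_call("push-hook", {"branch": "master"})
```

Since the handler only does a `put()`, the time it takes no longer depends on how long the follow-up work (choosing the next package, publishing) takes.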
Meanwhile a second thread should operate on the queue:
- Take one item from the queue
- Do the heavy lifting (figure out which package to build next, publish packages, ...)
- Take next item from the queue
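
The worker loop above could look roughly like this. It is only a sketch under assumptions: `work_queue`, `heavy_lifting` and the `None` shutdown sentinel are hypothetical, and the real bpo code would do the actual package logic where `heavy_lifting` just records the item:

```python
import queue
import threading

# Hypothetical names for illustration; not taken from bpo's code.
work_queue = queue.Queue()
processed = []  # stands in for the real side effects


def heavy_lifting(item):
    """Placeholder for figuring out the next package, publishing, etc."""
    processed.append(item)


def worker():
    """Take items from the queue one at a time until the sentinel arrives."""
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: stop the worker
            work_queue.task_done()
            break
        try:
            heavy_lifting(item)
        except Exception as exc:
            # One failing item should not kill the worker thread.
            print(f"job {item!r} failed: {exc}")
        finally:
            work_queue.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

for item in ["job-1", "job-2", "job-3"]:
    work_queue.put(item)
work_queue.put(None)  # ask the worker to stop
thread.join()
```

With a single worker thread, items are still processed strictly in order, so the heavy code keeps running one job at a time, just as it does today.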
It will be some effort, but it should be possible to rewrite parts of bpo so it works like this. The good news is that there's a testsuite with ~93% test coverage, so at least we can verify that everything still works as it should. I'll need some time to implement this though, and right now I have a lot on my plate.
old description:
Timeout example: https://builds.sr.ht/~postmarketOS/job/96747
EDIT: a related test seems to run slower too, so we can probably profile it: test/test_push_hook_gitlab.py::test_push_hook_gitlab_to_repo_missing_to_nop