AI Gateway becomes unresponsive due to CPU-bound tasks
Problem to solve
Over the past few months, we have seen numerous timeout errors in Cloud Run. They turned out to be caused by CPU-bound tasks, which completely block the server process so that it cannot handle incoming requests.
Here are the findings so far:
- duration_s is misleading. It starts counting when the server accepts the request, but the request actually arrived earlier, while the server process was occupied by another CPU-bound task.
- inference_duration_s is misleading. The response from the third-party model provider has already arrived at the server, but the server does not start processing it because it is occupied by another CPU-bound task.
- The AI Gateway server runs as a single process on a single core. Python asyncio coroutines allow asynchronous programming, but a CPU-bound task never yields to the event loop, so every other request waits until it finishes.
See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17366#note_1721304738 and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17366#note_1722616976 for more details.
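The first and third findings can be reproduced with a minimal sketch. All names here (cpu_bound_task, blocking_handler, fast_handler, queue_delay_s) are hypothetical stand-ins and not AI Gateway code; the busy loop only imitates CPU-heavy work. The point is that a timer started when the handler begins running, as duration_s is described above, does not see the time the request spent waiting behind the blocked event loop.

```python
import asyncio
import time


def cpu_bound_task(seconds: float) -> None:
    """Hypothetical stand-in for CPU-heavy work: spins without ever yielding."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


async def blocking_handler() -> None:
    # The CPU-bound call runs directly on the event loop thread,
    # so no other coroutine can make progress until it returns.
    cpu_bound_task(0.2)


async def fast_handler(arrived_at: float) -> dict:
    # "accepted" is the moment the handler actually starts running,
    # which is where a duration_s-style timer would begin.
    accepted_at = time.perf_counter()
    await asyncio.sleep(0)  # a trivial amount of real work
    finished_at = time.perf_counter()
    return {
        "queue_delay_s": accepted_at - arrived_at,  # invisible to the timer
        "duration_s": finished_at - accepted_at,    # what the timer reports
    }


async def main() -> dict:
    blocker = asyncio.create_task(blocking_handler())
    arrived_at = time.perf_counter()  # the second request "arrives" now
    fast = asyncio.create_task(fast_handler(arrived_at))
    await blocker
    return await fast


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch the reported duration stays tiny while the queue delay is at least as long as the blocking task, which matches the mismatch between the logged durations and the timeouts seen by clients.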
Proposal
The main suspect is CodeParser.from_language_id. There are some open questions:
- Can we run from_language_id in a separate thread?
- Can we memoize from_language_id per request?
- Can TreeSitter's Parser#parse run concurrently? Would executing tree-sitter-languages.so cause lock contention?
PreTrainedTokenizer could be CPU-bound as well. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17366#note_1724350681
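As a rough sketch of the first two open questions, the following combines a worker thread with a per-request memo. It assumes Python 3.9 or later for asyncio.to_thread. parse_code and RequestScopedParserCache are hypothetical names, and parse_code is only a busy-loop stand-in for CodeParser.from_language_id, whose real signature is not shown here. Whether the real tree-sitter call releases the GIL has not been verified; if it does not, a thread only lets the event loop interleave with the parser rather than run in parallel, and a process pool might be worth comparing.

```python
import asyncio
import time
from typing import Dict, Tuple


def parse_code(language_id: str, content: str) -> str:
    """Hypothetical stand-in for the CPU-heavy parser call."""
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass
    return f"parsed:{language_id}:{len(content)}"


class RequestScopedParserCache:
    """Memoizes parse results for the lifetime of one request (hypothetical)."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], str] = {}
        self.misses = 0

    async def get(self, language_id: str, content: str) -> str:
        key = (language_id, content)
        if key not in self._results:
            self.misses += 1
            # Offload to a worker thread so the event loop keeps serving
            # other coroutines while the parser runs.
            self._results[key] = await asyncio.to_thread(
                parse_code, language_id, content
            )
        return self._results[key]


async def heartbeat(stop: asyncio.Event) -> int:
    # Counts how often the event loop gets a turn while parsing is in flight.
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks


async def handle_request() -> dict:
    cache = RequestScopedParserCache()
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    first = await cache.get("python", "print('hi')")   # miss: runs in a thread
    second = await cache.get("python", "print('hi')")  # hit: no parsing at all
    stop.set()
    ticks = await beat
    return {"first": first, "second": second,
            "misses": cache.misses, "ticks": ticks}


if __name__ == "__main__":
    print(asyncio.run(handle_request()))
```

The heartbeat keeps ticking while the parse is in flight, and the second lookup is served from the memo. Two concurrent lookups for the same key would both miss in this sketch; storing the pending task instead of the result would close that gap.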
Further details
Links / references
Edited by Shinya Maeda