AI Gateway becomes unresponsive due to CPU-bound tasks

Problem to solve

Over the past few months, we've seen numerous timeout errors in Cloud Run. They turned out to be caused by CPU-bound tasks that completely halt the server process and stall incoming requests.

Here are the findings so far:

duration_s is misleading. It starts counting when the server accepts the request, but the request actually arrived earlier, while the server process was occupied by another CPU-bound task.
inference_duration_s is misleading. The response from the third-party model provider has already arrived at the server, but the server doesn't start processing it because it is occupied by another CPU-bound task.
AI Gateway server runs as a single process on a single core. Python asyncio coroutines let us program asynchronously, but a CPU-bound task blocks the event loop, so no other coroutine can run until it finishes.
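
The effect on both metrics can be reproduced in isolation. The following is a minimal sketch, not AI Gateway code: `cpu_bound`, `handle_request`, and `scenario` are hypothetical names, a busy loop stands in for the CPU-bound task, and `asyncio.sleep(0.01)` stands in for the model provider call.

```python
import asyncio
import time


def cpu_bound(seconds: float) -> None:
    """Busy-loop standing in for a CPU-bound call (hypothetical workload)."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass


async def handle_request(arrived_at: float) -> tuple[float, float]:
    """Return (wait before the handler started, duration of the awaited call)."""
    started_at = time.perf_counter()
    await asyncio.sleep(0.01)  # stand-in for a 10 ms model provider call
    finished_at = time.perf_counter()
    return started_at - arrived_at, finished_at - started_at


async def scenario(block_after_start: bool) -> tuple[float, float]:
    arrived_at = time.perf_counter()
    task = asyncio.create_task(handle_request(arrived_at))
    if block_after_start:
        await asyncio.sleep(0)  # let the handler run up to its await
    cpu_bound(0.2)  # blocks the event loop for 200 ms
    return await task


# Loop blocked before the handler starts: the wait is invisible to the handler.
hidden_wait, short_duration = asyncio.run(scenario(block_after_start=False))
# Loop blocked while the handler awaits: the 10 ms call appears to take 200 ms.
_, inflated_duration = asyncio.run(scenario(block_after_start=True))

print(f"hidden wait: {hidden_wait:.3f}s, handler-measured: {short_duration:.3f}s")
print(f"measured duration of the 10 ms call: {inflated_duration:.3f}s")
```

In the first scenario a timer started inside the handler misses the time the request spent waiting, which mirrors duration_s. In the second, the timer around the awaited call includes the time the loop was blocked, which mirrors inference_duration_s.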

See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17366#note_1721304738 and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17366#note_1722616976 for more details.

Proposal

The main suspect is CodeParser.from_language_id. There are some open questions:

Can we run from_language_id in a separate thread?
Can we memoize the result of from_language_id per request?
Can TreeSitter's Parser#parse run concurrently? Would executing tree-sitter-languages.so cause lock contention?
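
As a starting point for the first two questions, here is a minimal sketch that combines a request-scoped cache with `asyncio.to_thread`. Everything in it is an assumption for illustration: `expensive_parse` is a hypothetical stand-in for CodeParser.from_language_id, `RequestParseCache` is not an existing class, and `time.sleep` simulates a native call that releases the GIL. If the real parser holds the GIL while it runs, a thread would help much less, which is what the third question needs to settle.

```python
import asyncio
import time

CALLS = 0  # counts how often the expensive function actually runs


def expensive_parse(language_id: str, source: str) -> int:
    """Hypothetical stand-in for an expensive, blocking parse."""
    global CALLS
    CALLS += 1
    time.sleep(0.05)  # assumption: the real call releases the GIL like this does
    return len(source.split())


class RequestParseCache:
    """Memoizes parse results for the lifetime of one request (hypothetical)."""

    def __init__(self) -> None:
        self._results: dict[tuple[str, str], int] = {}

    async def parse(self, language_id: str, source: str) -> int:
        key = (language_id, source)
        if key not in self._results:
            # Run the blocking call in a worker thread so the event loop stays free.
            self._results[key] = await asyncio.to_thread(
                expensive_parse, language_id, source
            )
        return self._results[key]


async def main() -> tuple[int, int, int]:
    cache = RequestParseCache()  # one instance per request
    ticks = 0

    async def ticker() -> None:
        # Proves the event loop keeps running while the parse is in progress.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    ticker_task = asyncio.create_task(ticker())
    first = await cache.parse("python", "def f(): pass")
    second = await cache.parse("python", "def f(): pass")  # served from the cache
    ticker_task.cancel()
    return first, second, ticks


first, second, ticks = asyncio.run(main())
print(f"results: {first}, {second}; parse calls: {CALLS}; loop ticks: {ticks}")
```

The cache is keyed on the language and source text and lives only as long as the request object that owns it, so it cannot grow across requests. Whether a thread is enough, or a separate process is needed, depends on whether the native parser releases the GIL and on the lock contention question above.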

PreTrainedTokenizer could be CPU-bound as well. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17366#note_1724350681

Further details

Links / references

Edited Jan 15, 2024 by Shinya Maeda