fix: FastAPI runs new threads for dependency resolution
## What does this merge request do and why?
This MR fixes an issue where many threads are spun up when multiple requests are received concurrently, which could be a root cause of https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/409+ and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17366+.
Currently, we're using FastAPI's `Depends` together with Dependency Injector's wiring (a.k.a. `Provide`/`@inject`) in the following way:

```python
async def chat(
    request: Request,
    chat_request: ChatRequest,
    anthropic_claude_factory: Factory[AnthropicModel] = Depends(
        Provide[ContainerApplication.chat.anthropic_claude_factory.provider]
    ),
```
This is actually an example usage provided by Dependency Injector; however, since the `Provider` object is not an async/coroutine-compatible callable, FastAPI runs a new thread from the pool in order to resolve the dependency. See the `solve_dependencies` function in FastAPI:

```python
elif is_coroutine_callable(call):
    solved = await call(**sub_values)
else:
    solved = await run_in_threadpool(call, **sub_values)
```

https://github.com/tiangolo/fastapi/blob/0.108.0/fastapi/dependencies/utils.py#L600
This is not a good practice because most of the dependencies inside the Dependency Injector container are not thread-safe.
We fix this issue by passing an `async def` function to `Depends`, so that the Dependency Injector provider is resolved in the main thread on the event loop.
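A minimal sketch of why this works (the function names here are illustrative, not the actual gateway code): FastAPI checks whether the dependency callable is a coroutine function, and only coroutine functions are awaited on the event loop; plain callables go to the thread pool. Wrapping the provider call in an `async def` flips that check:

```python
import asyncio
import inspect

# Illustrative stand-in for a Dependency Injector provider: calling it
# builds the dependency synchronously (and is not thread-safe).
def anthropic_claude_provider():
    return "AnthropicModel instance"

# Before: passing the provider itself to Depends() hands FastAPI a plain
# callable, which it resolves via run_in_threadpool (a new pool thread).

# After: an async wrapper is a coroutine function, so FastAPI awaits it
# directly on the event loop -- no extra thread is spawned.
async def get_anthropic_claude():
    return anthropic_claude_provider()

# FastAPI's dispatch decision boils down to a check like this:
assert not inspect.iscoroutinefunction(anthropic_claude_provider)
assert inspect.iscoroutinefunction(get_anthropic_claude)

print(asyncio.run(get_anthropic_claude()))  # resolved without a pool thread
```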
Related to https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/409+
## How to set up and validate locally
- Enable thread monitoring and start the server:

  ```shell
  # Instrumentators
  AIGW_INSTRUMENTATOR__THREAD_MONITORING_ENABLED=True
  AIGW_INSTRUMENTATOR__THREAD_MONITORING_INTERVAL=1
  ```

  ```shell
  poetry run ai_gateway
  ```
- Simulate concurrent requests:

  ```shell
  for i in {1..10}
  do
    curl -X 'POST' \
      'http://0.0.0.0:5052/v1/chat/agent' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "prompt_components": [
        {
          "type": "string",
          "metadata": {
            "source": "string",
            "version": "string"
          },
          "payload": {
            "content": "\n\nHuman: Hi, How are you?\n\nAssistant:",
            "provider": "anthropic",
            "model": "claude-2.0",
            "params": {
              "stop_sequences": [
                "\n\nHuman",
                "Observation:"
              ],
              "temperature": 0.2,
              "max_tokens_to_sample": 2048
            }
          }
        }
      ],
      "stream": false
    }' &
  done
  ```
  (FYI, the trailing `&` runs each request in a background subprocess so they execute concurrently.)
- Make sure that the `threads_count` in the `modelgateway_debug.log` doesn't increase.
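One way to eyeball this is to extract the reported counts from the log. The field name `threads_count` comes from this MR, but the surrounding JSON layout below is an assumption; the snippet fabricates two sample log lines purely to demonstrate the extraction, so adjust the pattern to your actual log format:

```shell
# Write two fabricated structured-log lines (the real file is produced by
# the gateway; the JSON shape here is assumed for illustration only).
printf '%s\n' \
  '{"event": "threads", "threads_count": 12}' \
  '{"event": "threads", "threads_count": 12}' > /tmp/modelgateway_debug.log

# Extract the distinct thread counts; with the fix applied, the real log
# should show a flat count across concurrent requests.
grep -o '"threads_count": [0-9]*' /tmp/modelgateway_debug.log | sort -u
```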
## Merge request checklist

- [ ] Tests added for new functionality. If not, please raise an issue to follow up.
- [ ] Documentation added/updated, if needed.