Split model inference and post-processing timing measurement
Problem to solve
As noted by @shinya.maeda,
> Anyways, looking at `inference_duration_s`, it seems to also take post-processing into account, e.g. https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/blob/main/ai_gateway/code_suggestions/generations.py#L146-151. Would it make sense to measure the duration of `model.generate` exactly? That way we can isolate problems between the actual model inference and the other processes.
Proposal
Split the model inference and post-processing timing measurements so that request latency can be attributed to the correct phase.
Further details
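A minimal sketch of what the split could look like, assuming a generation wrapper along the lines of `generations.py`. Only `inference_duration_s` comes from the issue; the class name, the `_post_process` helper, and the `post_processing_duration_s` metric name are illustrative assumptions, not the gateway's actual instrumentation.

```python
import time
from typing import Any


class ModelEngineGenerations:
    """Hypothetical wrapper; the real code lives in ai_gateway/code_suggestions/generations.py."""

    def __init__(self, model: Any):
        self.model = model

    def generate(self, prefix: str, **kwargs: Any) -> str:
        # Time the model call alone, so inference latency is isolated.
        start = time.perf_counter()
        raw_output = self.model.generate(prefix, **kwargs)
        inference_duration_s = time.perf_counter() - start

        # Time post-processing separately (trimming, stop-token handling, etc.).
        start = time.perf_counter()
        completion = self._post_process(raw_output)
        post_processing_duration_s = time.perf_counter() - start

        # Emit both durations so dashboards can distinguish the two phases.
        print(
            f"inference_duration_s={inference_duration_s:.4f} "
            f"post_processing_duration_s={post_processing_duration_s:.4f}"
        )
        return completion

    def _post_process(self, raw_output: str) -> str:
        # Stand-in for whatever cleanup currently happens after inference.
        return raw_output.strip()
```

With the two durations recorded separately, a latency spike can be traced to either the `model.generate` call or the post-processing step, instead of both being folded into a single `inference_duration_s` value.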
Links / references