Split model inference and post-processing timing measurement
Problem to solve
As noted by @shinya.maeda,
> Anyways, looking at `inference_duration_s`, it seems to also take post-processing into account, e.g. https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/blob/main/ai_gateway/code_suggestions/generations.py#L146-151. Would it make sense to measure the duration of `model.generate` exactly? That way we can isolate problems between the actual model inference and the other processes.
Proposal
Split the model inference and post-processing timing measurements so that request latency can be attributed to the correct phase.
Further details
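A minimal sketch of what the split could look like, assuming a generation wrapper along the lines of `generations.py`. Only `inference_duration_s` comes from the issue; the class name, the `_post_process` helper, and the `post_processing_duration_s` metric name are illustrative assumptions, not the gateway's actual instrumentation.

```python
import time
from typing import Any


class ModelEngineGenerations:
    """Hypothetical wrapper; the real code lives in ai_gateway/code_suggestions/generations.py."""

    def __init__(self, model: Any):
        self.model = model

    def generate(self, prefix: str, **kwargs: Any) -> str:
        # Time the model call alone, so inference latency is isolated.
        start = time.perf_counter()
        raw_output = self.model.generate(prefix, **kwargs)
        inference_duration_s = time.perf_counter() - start

        # Time post-processing separately (trimming, stop-token handling, etc.).
        start = time.perf_counter()
        completion = self._post_process(raw_output)
        post_processing_duration_s = time.perf_counter() - start

        # Emit both durations so dashboards can distinguish the two phases.
        print(
            f"inference_duration_s={inference_duration_s:.4f} "
            f"post_processing_duration_s={post_processing_duration_s:.4f}"
        )
        return completion

    def _post_process(self, raw_output: str) -> str:
        # Stand-in for whatever cleanup currently happens after inference.
        return raw_output.strip()
```

With the two durations recorded separately, a latency spike can be traced to either the `model.generate` call or the post-processing step, instead of both being folded into a single `inference_duration_s` value.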
Links / references