Get completions on the Triton backend side
The current way of obtaining the completion sometimes triggers the bug described in #28 (closed). It works in two steps:
- decode the entire model output (including the prompt) on the Triton backend side and return it to model-gateway
- use the prompt length to cut the generated completion out of the decoded output on the model-gateway side
With these changes, the completion is extracted on the Triton backend side from the sequence of output IDs, before those IDs are decoded into the corresponding tokens.
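The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the code from this merge request: `ToyTokenizer`, `completion_by_text_offset`, and `completion_by_token_offset` are hypothetical names, and the toy tokenizer only mimics one plausible failure mode (decoding does not reproduce the prompt character for character). The actual cause described in #28 may differ.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for the real one.

    It collapses repeated spaces, so decode(encode(text)) may differ from text.
    """

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}
        self.inverse: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab[word] = len(self.inverse)
                self.inverse.append(word)
            ids.append(self.vocab[word])
        return ids

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.inverse[i] for i in ids)


def completion_by_text_offset(tok: ToyTokenizer, prompt: str, output_ids: List[int]) -> str:
    # Previous approach: decode everything, then cut by prompt length in characters.
    return tok.decode(output_ids)[len(prompt):]


def completion_by_token_offset(tok: ToyTokenizer, prompt_ids: List[int], output_ids: List[int]) -> str:
    # Approach in this change: drop the prompt IDs first, then decode the rest.
    return tok.decode(output_ids[len(prompt_ids):])


tok = ToyTokenizer()
prompt = "def   add(a, b):"  # note the repeated spaces after "def"
prompt_ids = tok.encode(prompt)
# Simulated model output: prompt IDs followed by the generated IDs.
output_ids = prompt_ids + tok.encode("return a + b")

by_text = completion_by_text_offset(tok, prompt, output_ids)
by_ids = completion_by_token_offset(tok, prompt_ids, output_ids)
print(repr(by_text))
print(repr(by_ids))
```

In this sketch the character-based cut loses the start of the completion because the decoded prompt is shorter than the original prompt string, while the ID-based cut does not depend on how the prompt decodes.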
Ref: #28 (closed)
Edited by Alexander Chueshev