Support streaming for completions and generations with Anthropic models
## Problem to solve
Large prediction outputs mean users have to wait longer for a response. They may eventually give up and re-trigger a new request, which puts more load on our systems as a whole. Additionally, low latency is an important feature for code completions and generations as well as for chat.
## Proposal
We would like to enable streaming for both code completions and generations. Streaming returns tokens to the user as soon as the model generates them, so output starts appearing with very low latency instead of only after the full prediction is complete.
Except for `code-gecko`, most Vertex models and all Anthropic models support streaming.
| Models | Use Case | Streaming |
|---|---|---|
| `code-gecko` | Code Completions | No |
| `claude-instant-1` | Code Completions | Yes |
| `code-bison` | Code Generations | Yes |
| `claude-2` | Code Generations, Chat | Yes |
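To make the latency benefit concrete, here is a minimal sketch of consuming a streamed completion with the Anthropic Python SDK's Text Completions API. The prompt and `max_tokens_to_sample` value are illustrative, not taken from our actual integration:

```python
from anthropic import AI_PROMPT, HUMAN_PROMPT, Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# With stream=True the SDK returns an iterator of partial completions
# instead of blocking until the whole prediction is finished.
stream = client.completions.create(
    model="claude-instant-1",
    prompt=f"{HUMAN_PROMPT} Complete this Python function:\ndef fib(n):{AI_PROMPT}",
    max_tokens_to_sample=256,
    stream=True,
)

for chunk in stream:
    # Each event carries only the newly generated text.
    print(chunk.completion, end="", flush=True)
```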
## Further details
High-level diagram of the complete flow
This issue focuses on implementing the part highlighted in `#e9967a`.
## Plan of attack
- Support streaming for code completions with Anthropic (see the endpoint sketch after this list)
- Support streaming for code generations with Anthropic
- Implement error handling (see the error-handling sketch after this list)
- Implement post-processing
- Add API documentation
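For the first two items, a sketch of how the gateway could forward Anthropic's stream to the client. This assumes a FastAPI service with an async Anthropic client; the route path and request model are hypothetical:

```python
from anthropic import AI_PROMPT, HUMAN_PROMPT, AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = AsyncAnthropic()


class CompletionRequest(BaseModel):
    """Hypothetical request payload; the real schema lives in the gateway."""

    prompt: str


@app.post("/v2/completions")  # hypothetical route
async def completions(req: CompletionRequest) -> StreamingResponse:
    async def generate():
        stream = await client.completions.create(
            model="claude-instant-1",
            prompt=f"{HUMAN_PROMPT} {req.prompt}{AI_PROMPT}",
            max_tokens_to_sample=256,
            stream=True,
        )
        async for chunk in stream:
            # Forward each piece of text as soon as Anthropic sends it;
            # per-chunk post-processing could be applied here.
            yield chunk.completion

    # Plain chunked text keeps the sketch simple; the real endpoint might
    # frame chunks as server-sent events instead.
    return StreamingResponse(generate(), media_type="text/plain")
```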
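For the error-handling item, one subtlety with streaming is that the HTTP status has already been sent by the time a mid-stream failure occurs, so errors must be handled inside the generator. A sketch using the exception types the Anthropic SDK exposes:

```python
import logging

from anthropic import APIConnectionError, APIStatusError, AsyncAnthropic

log = logging.getLogger(__name__)
client = AsyncAnthropic()


async def stream_completion(prompt: str):
    """Yield completion chunks, ending the stream early on provider errors.

    `prompt` is assumed to already contain the Human/Assistant turns that
    the Text Completions API expects.
    """
    try:
        stream = await client.completions.create(
            model="claude-instant-1",
            prompt=prompt,
            max_tokens_to_sample=256,
            stream=True,
        )
        async for chunk in stream:
            yield chunk.completion
    except APIStatusError as exc:
        # Anthropic returned a non-2xx response (rate limit, overloaded, ...).
        log.error("Anthropic API error: %s", exc.status_code)
    except APIConnectionError:
        # Network failure before or during the stream.
        log.exception("Could not reach the Anthropic API")
```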
## Links / references