Support streaming for completions and generations with Anthropic models
## Problem to solve
Large prediction outputs mean users have to wait longer for a response. They may eventually give up and re-trigger a new request, which puts more load on our systems as a whole. Additionally, low latency is an important feature for code completions and generations as well as for chat.
## Proposal
We would like to enable streaming for both code completions and generations. Streaming returns tokens to the user as soon as the model generates them, so output starts appearing with very low latency instead of only after the full prediction is complete.
Except for `code-gecko`, most Vertex models and all Anthropic models support streaming.
| Models | Use Case | Streaming |
|---|---|---|
| `code-gecko` | Code Completions | No |
| `claude-instant-1` | Code Completions | Yes |
| `code-bison` | Code Generations | Yes |
| `claude-2` | Code Generations, Chat | Yes |
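To make the latency benefit concrete, here is a minimal sketch of consuming a streamed completion with the Anthropic Python SDK's Text Completions API. The prompt and `max_tokens_to_sample` value are illustrative, not taken from our actual integration:

```python
from anthropic import AI_PROMPT, HUMAN_PROMPT, Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# With stream=True the SDK returns an iterator of partial completions
# instead of blocking until the whole prediction is finished.
stream = client.completions.create(
    model="claude-instant-1",
    prompt=f"{HUMAN_PROMPT} Complete this Python function:\ndef fib(n):{AI_PROMPT}",
    max_tokens_to_sample=256,
    stream=True,
)

for chunk in stream:
    # Each event carries only the newly generated text.
    print(chunk.completion, end="", flush=True)
```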
## Further details
High-level diagram of the complete flow
This issue focuses on implementing the part highlighted in `#e9967a`.
## Plan of attack
- Support streaming for code completions with Anthropic (see the endpoint sketch after this list)
- Support streaming for code generations with Anthropic
- Implement error handling (see the error-handling sketch after this list)
- Implement post-processing
- Add API documentation
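For the first two items, a sketch of how the gateway could forward Anthropic's stream to the client. This assumes a FastAPI service with an async Anthropic client; the route path and request model are hypothetical:

```python
from anthropic import AI_PROMPT, HUMAN_PROMPT, AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = AsyncAnthropic()


class CompletionRequest(BaseModel):
    """Hypothetical request payload; the real schema lives in the gateway."""

    prompt: str


@app.post("/v2/completions")  # hypothetical route
async def completions(req: CompletionRequest) -> StreamingResponse:
    async def generate():
        stream = await client.completions.create(
            model="claude-instant-1",
            prompt=f"{HUMAN_PROMPT} {req.prompt}{AI_PROMPT}",
            max_tokens_to_sample=256,
            stream=True,
        )
        async for chunk in stream:
            # Forward each piece of text as soon as Anthropic sends it;
            # per-chunk post-processing could be applied here.
            yield chunk.completion

    # Plain chunked text keeps the sketch simple; the real endpoint might
    # frame chunks as server-sent events instead.
    return StreamingResponse(generate(), media_type="text/plain")
```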
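For the error-handling item, one subtlety with streaming is that the HTTP status has already been sent by the time a mid-stream failure occurs, so errors must be handled inside the generator. A sketch using the exception types the Anthropic SDK exposes:

```python
import logging

from anthropic import APIConnectionError, APIStatusError, AsyncAnthropic

log = logging.getLogger(__name__)
client = AsyncAnthropic()


async def stream_completion(prompt: str):
    """Yield completion chunks, ending the stream early on provider errors.

    `prompt` is assumed to already contain the Human/Assistant turns that
    the Text Completions API expects.
    """
    try:
        stream = await client.completions.create(
            model="claude-instant-1",
            prompt=prompt,
            max_tokens_to_sample=256,
            stream=True,
        )
        async for chunk in stream:
            yield chunk.completion
    except APIStatusError as exc:
        # Anthropic returned a non-2xx response (rate limit, overloaded, ...).
        log.error("Anthropic API error: %s", exc.status_code)
    except APIConnectionError:
        # Network failure before or during the stream.
        log.exception("Could not reach the Anthropic API")
```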
## Links / references