Metric: Performance Metrics
Overview
We have added quality metrics to Prompt Library. This issue is to explore performance metrics.
Performance Metrics for LLM Evaluation
Requests per Second (Throughput)
- Definition: Number of requests processed by the LLM per second.
- Purpose: Measures the capability of the LLM to handle multiple requests simultaneously, indicating its scalability and performance under load.
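As a minimal sketch of how this could be computed, assuming we have collected request timestamps (in seconds) client-side; the function name and sample data are illustrative, not part of Prompt Library:

```python
def requests_per_second(timestamps):
    """Total requests divided by the observed time window in seconds."""
    if len(timestamps) < 2:
        # Degenerate window: report the raw request count.
        return float(len(timestamps))
    window = max(timestamps) - min(timestamps)
    return len(timestamps) / window

# 4 requests observed over a 2-second window -> 2.0 requests/second
print(requests_per_second([0.0, 0.5, 1.5, 2.0]))
```

In practice this would be computed over fixed buckets (e.g. per minute) rather than the whole sample, so spikes under load remain visible.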
Tokens per Second
- Definition: Counts the tokens rendered per second during LLM response streaming.
- Purpose: Assesses the speed and efficiency of the LLM in generating and streaming responses, which is critical for user experience in real-time applications.
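A minimal sketch of the calculation, assuming we know the token count and the wall-clock start and end of the stream; the function and values are hypothetical:

```python
def tokens_per_second(token_count, stream_start, stream_end):
    """Tokens generated divided by the streaming duration in seconds."""
    duration = stream_end - stream_start
    return token_count / duration

# 120 tokens streamed over 4 seconds -> 30.0 tokens/second
print(tokens_per_second(120, 0.0, 4.0))
```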
Time to First Token Render
- Definition: Time to first token render from submission of the user prompt, measured at multiple percentiles.
- Purpose: Evaluates the responsiveness of the LLM, providing insights into how quickly it can begin delivering a response after receiving a user prompt.
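A sketch of reporting time-to-first-token at multiple percentiles, using a simple nearest-rank percentile over collected samples; the sample latencies are invented for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a list of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical time-to-first-token samples, in seconds
ttft = [0.2, 0.3, 0.25, 1.1, 0.4, 0.35, 0.5, 0.28, 0.33, 2.0]
for p in (50, 90, 99):
    print(f"p{p}: {percentile(ttft, p)}")
```

Reporting p50 alongside p90/p99 matters here: tail percentiles expose slow outliers that an average would hide.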
Error Rate
- Definition: Error rate broken down by error type, such as 401 (unauthorized) and 429 (rate limited).
- Purpose: Tracks the frequency and types of errors encountered, which helps in diagnosing issues and improving the reliability and stability of the LLM.
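A sketch of a per-type error-rate breakdown, assuming HTTP status codes are logged per request; the sample codes are made up:

```python
from collections import Counter

def error_rates(status_codes):
    """Share of all requests that returned each 4xx/5xx status code."""
    total = len(status_codes)
    errors = Counter(code for code in status_codes if code >= 400)
    return {code: count / total for code, count in errors.items()}

# 8 requests: one 401 and two 429s
print(error_rates([200, 200, 401, 429, 200, 429, 200, 200]))
```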
Reliability
- Definition: The percentage of successful requests out of total requests, where the total includes requests that ended in errors or failures.
- Purpose: Measures the overall reliability of the LLM, indicating its robustness and dependability in handling user requests without failures.
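The reliability definition above reduces to a simple ratio; a sketch with hypothetical counts:

```python
def reliability(successful, total):
    """Percentage of successful requests out of all requests."""
    return 100.0 * successful / total

# 980 successes out of 1000 requests -> 98.0%
print(reliability(980, 1000))
```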
Latency
- Definition: The average time between submission of a request and receipt of the response.
- Purpose: Reflects the efficiency of the LLM in processing and responding to queries, which is crucial for maintaining a smooth and responsive user interaction.
- Status: Done
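A sketch of the average-latency calculation, assuming paired request/response timestamps (in seconds) have been recorded; the names and values are illustrative:

```python
def average_latency(request_times, response_times):
    """Mean of per-request (response - request) durations in seconds."""
    durations = [end - start for start, end in zip(request_times, response_times)]
    return sum(durations) / len(durations)

# Three requests, each answered 0.5 seconds after submission -> 0.5
print(average_latency([0.0, 1.0, 2.0], [0.5, 1.5, 2.5]))
```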
These metrics collectively provide a comprehensive overview of the performance, reliability, and efficiency of the LLM, guiding improvements and ensuring optimal user experience.