Monitoring for AI Assist
Problems to solve
As part of the Closed Beta release of AI Assist, we will build basic monitoring that covers:
- Cluster Health / GKE Dashboards
- GPU Utilisation
- Elastic Logs
Context
A high-level diagram of AI Assist service components
(adapted from AI Assist API README)
A high-level diagram of AI Assist monitoring infrastructure
- GitLab
  - Logs are forwarded to Logstash and are searchable via Elasticsearch/Kibana
  - Leverage log-based metrics (last 7 days) to build a dashboard
  - Metrics are scraped and stored in Prometheus and can be visualised via Grafana
- AI Assist
  - Logs are stored in GCP Logging
  - Leverage log-based metrics (last 7 days) to build a dashboard
  - Metrics are scraped and stored in Prometheus and can be visualised via Grafana
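
As a rough illustration of how the scraped metrics could feed a GPU utilisation check, the sketch below builds a Prometheus instant-query URL and parses the JSON response into a per-GPU mapping. It assumes that Triton's `nv_gpu_utilization` gauge (with its `gpu_uuid` label) is among the scraped metrics; the Prometheus address and the 5-minute averaging window are hypothetical placeholders, not values taken from the AI Assist setup.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with the real endpoint.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"

# Assumed query: average GPU utilisation over the last 5 minutes,
# based on Triton's nv_gpu_utilization gauge (0.0 to 1.0).
GPU_UTILISATION_QUERY = "avg_over_time(nv_gpu_utilization[5m])"


def build_query_url(base_url: str, promql: str) -> str:
    """Return the Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str, label: str = "gpu_uuid") -> Dict[str, float]:
    """Map each series' label value to its sample value from an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    result = {}
    for series in data["result"]:
        key = series["metric"].get(label, "unknown")
        _timestamp, value = series["value"]  # value is encoded as a string
        result[key] = float(value)
    return result
```

The URL from `build_query_url(PROMETHEUS_URL, GPU_UTILISATION_QUERY)` can be fetched with any HTTP client and the response body passed to `parse_instant_vector`. The same PromQL expression can also be pasted directly into a Grafana panel.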
References
- Triton Inference server metrics
Edited by Tan Le