Evaluate managed Prometheus hosting
Overview
GCP announced Managed Service for Prometheus, which could take over the work of running the multiple Prometheus and Thanos components that we currently manage ourselves.
Resources
- Introducing Google Cloud's new managed service for Prometheus
- Deep dive: Managed Service for Prometheus
- Monitor your applications on Google Managed Prometheus
- GCP TAMs are available
Acceptance criteria (WIP)
- Able to set custom recording rules.
- Able to set custom alerting rules.
- Send alerts to PagerDuty.
- Unified view between environments like gstg and gprd (maybe soft requirement?).
- Metric retention is at least 1 year.
- Cost comparison with our current setup.
Open questions
- Does this deploy tools like kube-state-metrics?
- Can we upload our own recording rules?
- Where do we upload alerts?
Things that will change from our current setup
- Remove Thanos, since we'll get long-term storage out of the box
- We can only query per environment, like GitLab-production and GitLab-staging-1, whereas before we had a unified view. However, you can use MetricsScope to have multiple projects in view.
Current Workload
Checked last on 2022-05-02
- Retention Period: 365 days (no downsampling)
- Total amount of data in object storage: 280 TB
- Cached storage: 89.4
- Sample Ingestion/sec: 4,875,198
- Query Requests Per Second (using Thanos): ~55
- Request duration per handler
  - p90
    - Best: 0.2s
    - Worst: 2.2s
    - Per Handler
      - Best:
  - p95
    - Best: 1s
    - Worst: 4.5s
    - Per Handler
      - Best:
  - p99
    - Best: 3s
    - Worst: 19.6s
    - Per Handler
      - Best:
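For the cost comparison, the workload figures above can be turned into rough sample volumes. The sketch below is only back-of-the-envelope arithmetic. It assumes the ingestion rate is constant over time, that "280 TB" means decimal terabytes, and that object storage holds exactly the 365-day retention window. The `ingestion_cost` helper is hypothetical: it assumes billing per million samples ingested and takes the price as an argument, which would need to come from the current pricing page.

```python
"""Rough sample-volume arithmetic from the Current Workload figures."""

SECONDS_PER_DAY = 24 * 60 * 60

# Figures copied from the "Current Workload" list (checked 2022-05-02).
SAMPLES_PER_SECOND = 4_875_198
OBJECT_STORAGE_BYTES = 280 * 10**12  # assumption: "280 TB" is decimal terabytes
RETENTION_DAYS = 365


def samples_over(samples_per_second: int, days: int) -> int:
    """Total samples ingested over `days`, assuming a constant ingestion rate."""
    return samples_per_second * SECONDS_PER_DAY * days


def ingestion_cost(total_samples: int, price_per_million_samples: float) -> float:
    """Hypothetical helper: cost if billing is per million samples ingested."""
    return total_samples / 1_000_000 * price_per_million_samples


if __name__ == "__main__":
    per_month = samples_over(SAMPLES_PER_SECOND, 30)
    per_year = samples_over(SAMPLES_PER_SECOND, RETENTION_DAYS)
    print(f"samples per 30-day month: {per_month:.3e}")
    print(f"samples per {RETENTION_DAYS} days: {per_year:.3e}")
    # Implied average footprint of one sample in object storage.
    print(f"implied stored bytes per sample: {OBJECT_STORAGE_BYTES / per_year:.2f}")
```

Under these assumptions the current setup ingests on the order of 1.26e13 samples per 30-day month, and the stored data works out to a bit under 2 bytes per sample. Both numbers are estimates and should be re-derived from measured ingestion before being used in a real cost comparison.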
Edited by Steve Xuereb