Migrate alertmanager to kubernetes
This issue is here to discuss and agree on a migration plan for alertmanager to kubernetes. Steps in the migration plan below can then be extracted into their own issues.
Desired state
- One alertmanager cluster running in GKE.
- No prometheus pairs in GCE
- Remaining non-GKE jobs are scraped by GKE prometheus instances
Draft Migration plan
- Get back to single-alertmanager sanity
  - Ensure GKE & GCE alertmanagers have config parity, so that the next steps don't produce any surprises.
  - Point GKE prometheus shards at GCE alertmanager
  - Turn down GKE alertmanager
- Migrate prometheus jobs from GCE pairs to GKE pairs
  - It's simpler to migrate alertmanager to GKE if there are no GCE prometheus instances that need to push alerts to it. Therefore, we should complete the prometheus job migrations before continuing. It's relatively simple for GKE prometheus instances to scrape GCE workloads, so this is not coupled to workload migrations.
  - We probably want to avoid continuing to use prometheus file service discovery, as this makes k8s deployments clunkier than they need to be.
    - Option 1: Switch prometheus over to consul SD
    - Option 2: Add richer labels to our GCE instances and use GCE SD.
  - When the job lists in GCE prometheus shards are empty, they can be turned down
- Migrate alertmanager to GKE
  - Stand up a GKE AM again with config sourced from a common base with the GCE one, to minimise risk of drift.
  - It must be hooked up to all our integrations (e.g. PD).
  - Repoint all prometheus pairs at the new GKE AM.
  - Turn down GCE AM.
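For the config parity step above, here is a rough sketch of how drift between the two alertmanager configs could be surfaced before anything is repointed. It assumes both configs have already been parsed into plain Python dicts and lists (for example with a YAML loader). The function name `config_diff` and the sample configs are hypothetical and are not taken from our actual setup.

```python
def config_diff(gce, gke, path=""):
    """Return a list of human-readable differences between two parsed configs.

    `gce` and `gke` are nested dicts/lists/scalars, e.g. parsed alertmanager
    configs. `path` tracks the location of the value being compared.
    """
    diffs = []
    if isinstance(gce, dict) and isinstance(gke, dict):
        # Walk the union of keys so one-sided settings are reported too.
        for key in sorted(set(gce) | set(gke), key=str):
            sub = f"{path}.{key}" if path else str(key)
            if key not in gce:
                diffs.append(f"{sub}: only in GKE")
            elif key not in gke:
                diffs.append(f"{sub}: only in GCE")
            else:
                diffs.extend(config_diff(gce[key], gke[key], sub))
    elif isinstance(gce, list) and isinstance(gke, list):
        if len(gce) != len(gke):
            diffs.append(f"{path}: list length {len(gce)} (GCE) vs {len(gke)} (GKE)")
        # Compare the overlapping prefix element by element.
        for i, (a, b) in enumerate(zip(gce, gke)):
            diffs.extend(config_diff(a, b, f"{path}[{i}]"))
    elif gce != gke:
        diffs.append(f"{path}: {gce!r} (GCE) vs {gke!r} (GKE)")
    return diffs


# Hypothetical sample configs, for illustration only.
gce_cfg = {
    "route": {"receiver": "pagerduty", "group_wait": "30s"},
    "receivers": [{"name": "pagerduty"}, {"name": "slack"}],
}
gke_cfg = {
    "route": {"receiver": "pagerduty", "group_wait": "10s"},
    "receivers": [{"name": "pagerduty"}],
}

for line in config_diff(gce_cfg, gke_cfg):
    print(line)
```

An empty result would suggest the two configs are at parity. Lists are compared by position, so reordered receivers would show up as differences. That is probably acceptable for a first pass but worth keeping in mind.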
RFC @bjk-gitlab @AnthonySandoval @andrewn (and anyone else)
Edited by Craig Furman