CustomersDot monitoring GKE clusters
For https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible-poc/-/issues/17
Context
We're setting up the new monitoring infrastructure for CustomersDot. This monitoring infrastructure will run in GKE, discover CustomersDot nodes via gce_sd, and ultimately be scraped by the main GitLab Thanos infrastructure.
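As a rough illustration of the gce_sd discovery mentioned above, the sketch below builds a Prometheus scrape job with one `gce_sd_configs` entry per zone. The project ID, zones, port, and job name are hypothetical placeholders, not values from this issue; the actual configuration lives in the linked merge requests.

```python
import json


def gce_scrape_config(project, zones, port=9100, job_name="customersdot-nodes"):
    """Build one Prometheus scrape job that discovers GCE instances.

    gce_sd_configs takes a single zone per entry, so one entry is
    emitted for each zone. All argument values are illustrative.
    """
    return {
        "job_name": job_name,
        "gce_sd_configs": [
            {"project": project, "zone": zone, "port": port} for zone in zones
        ],
        "relabel_configs": [
            {
                # Use the GCE instance name instead of ip:port as the label.
                "source_labels": ["__meta_gce_instance_name"],
                "target_label": "instance",
            }
        ],
    }


# Hypothetical project and zones, for illustration only.
config = gce_scrape_config("example-project", ["us-east1-b", "us-east1-c"])

# JSON is a subset of YAML, so this output can be pasted into prometheus.yml.
print(json.dumps({"scrape_configs": [config]}, indent=2))
```

Prometheus running in GKE also needs a service account with read access to the Compute API of the target project for this discovery to return any instances.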
High-level overview
- Create GKE clusters on the GCP projects.
- Deploy Prometheus on those GKE clusters.
- Have Prometheus on GKE autodiscover machines in GCE with one of the following methods
- Deploy thanos store on those GKE clusters.
- Update thanos query to query the new thanos-store and thanos-sidecar.
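The last step above amounts to registering the new gRPC endpoints with thanos query. A minimal sketch of the arguments involved follows, assuming a Thanos version that supports the repeatable `--endpoint` flag (older releases used `--store` for the same purpose). The hostnames are hypothetical; 10901 and 10902 are the default Thanos gRPC and HTTP ports.

```python
def thanos_query_args(endpoints, http_port=10902, grpc_port=10901):
    """Return a thanos query argument list that fans out to the given endpoints.

    Each endpoint is the host:port of a gRPC StoreAPI server, such as a
    thanos-sidecar next to Prometheus or a thanos-store gateway.
    """
    args = [
        "query",
        f"--http-address=0.0.0.0:{http_port}",
        f"--grpc-address=0.0.0.0:{grpc_port}",
    ]
    # --endpoint may be repeated, once per StoreAPI server.
    args += [f"--endpoint={endpoint}" for endpoint in endpoints]
    return args


# Hypothetical addresses for the new sidecar and store gateway.
args = thanos_query_args(
    [
        "thanos-sidecar.stgsub.example.internal:10901",
        "thanos-store.stgsub.example.internal:10901",
    ]
)
print(" ".join(["thanos"] + args))
```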
Staging
- Create cluster 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3217
  - Set up SRE access to the cluster with the glsh kube command 👉 gitlab-com/runbooks!4197 (merged)
  - Fix stgsub bastion FQDN to match naming convention 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3271
- Create service accounts for CI 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3294
- Create static IP address and DNS records for Prometheus
  - Refactor all monitoring resources into their own file 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3295
  - Google Address 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3296
  - DNS record 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3296
- Create IAP brand for Prometheus
  - Create IAP brand via terraform 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3279
  - Update permissions for stgsub and prdsub terraform-ci service account 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3288
  - Open access request to add terraform-ci service account to ops-contact 👉 https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/13168
- Bootstrap cluster
  - Allow internet access to GKE 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3303
  - Allow gitlab.com in IAP 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3306
  - Give k8s-workloads service accounts access to GKMS, and buckets 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3308/diffs
  - Give access to the runner to the cluster and MR
  - Fix missing permissions for the k8s-workloads accounts:
- Deploy Prometheus inside of stgsub
  - Set up CI/CD pipeline
  - Deploy Prometheus 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!529 (merged)
- Runbook deploy to new Prometheus cluster
  - Add k8s-workload service account to ops
  - Update matrix_non_production
  - Validate that the deployment is successful.
Production
- Refactor the DNS resources created for Prometheus inside of a module 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3296#note_120187
  - We should take into consideration the other environments like gstg, gprd monitoring.tf as well.
  - Take a look at how the grouprunner abstracted this part as well.
Improvements made to cluster creation
- Documentation updates
- Service Monitor refactoring
- Refactor CI service accounts into their own module
Edited by Michal Wasilewski