move pubsubbeats to k8s
-
one of the tf MRs: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1837 , it adds the binding required for Workload Identity -
switch the gke module to for_each: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/gke/-/merge_requests/32 -
switch tf repo to new version of the module: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1839 -
switch all other clusters to for_each for node_pools: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1845 -
clean the vertical_pod_autoscaling in gprd, gstg and pre: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1845#note_86016 -
MR going back to our previous default: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/gke/-/merge_requests/34 -
switch to new version of the module: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1846
-
-
bump version of the google-beta terraform provider so that we can update node_pools without recreating them: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/gke/-/merge_requests/35 -
apply the Workload Identity binding
-
- some ideas related to how we handle node updates:
- first successful deployment to gstg: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!81 (comment 365595960)
-
enable pubsubbeat pods for all other nonprod indices: -
gitlab-helmfiles change: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!84 (merged) -
split deployment file: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!88 (merged) -
missing GCP binding in ops: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1858 -
remove pods for topics that don't exist: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!90 (merged)
-
-
finish nonprod -
we don't have observability of the pods in nonprod -
monitoring: - we should already have pod metrics
- Running a beat exporter will probably mean adding a sidecar with the exporter to the deployment and adding the endpoint to Prometheus config for scraping. This might be blocked on: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10334
- we spent some time troubleshooting #10334 (closed) but couldn't find anything. It might be fixed by upgrading to the latest version of the Prometheus Operator, but to do that we need to wait for the cluster to be upgraded to 1.16, which should happen in the next few days: delivery#889 (closed)
-
beat exporter + Prometheus scraping config: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!113 (merged) -
fix ServiceMonitor reference to a port: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!116 (merged) -
disable readiness probes on exporters for now: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!117 (diffs) -
resize the default pool in pre: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1901 -
resize the node pool: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1902 - some observations we made when working on the pre node pool: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10751
-
lower requests: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!118 (diffs)
- a number of issues came up while working on Prometheus in ops:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10727
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10728
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10729
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10730
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10741
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10755
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10772
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10804
- problem with BackendConfig CRDs: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10731
- still trying to deploy Prometheus to ops k8s: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!109 (merged)
- don't install gcp crds: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!111 (merged)
- allow version of CRD to be specified: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!110 (merged)
- change request for removing CRDs created with helm and applying CRDs from GCP: production#2375 (closed)
- switch to using gcs secrets: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!112 (merged)
-
confirm the metrics are available in prometheus in gke -
add pubsubbeat to the Operator whitelist: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!119 (merged) -
add a headless service so that Prometheus can discover pods for scraping: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!121 (merged) -
fix exporter port: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!122 (merged)
-
-
logging - we could use the same approach as we currently utilize for the application: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/blob/master/releases/fluentd-elasticsearch/values.yaml.gotmpl#L11
-
ES logging cluster config update: gitlab-com/runbooks!2431 (merged) -
fluentd-elasticsearch config change to start sending pubsubbeat logs: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!93 (merged) -
move es-diagnostics to another namespace: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!96 (merged) and https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1864 -
deploy gitlab-monitoring in the ops cluster: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!99 (merged) -
static IP and DNS for prom k8s ops ingress: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1887
-
-
deploy fluentd-elasticsearch in the ops cluster: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!97 (merged)
-
-
alerting - some of these alerts: https://gitlab.com/gitlab-com/runbooks/-/blob/master/rules/logging.yml will need to be translated to k8s metrics
- It turns out that the alerts that are using beat metrics should already be working. The alerts that won't work are the ones that are based on mtail metrics since we are not running mtail in kubernetes yet:
-
update mtail image: docker-mtail!1 (merged) -
set up mtail image mirror on ops (adjust labels of tokens in 1pass accordingly) -
add mtail daemonSet and a program, issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10777 , MR: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!189 (merged)
- add mtail program for pubsubbeat: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!184 (closed)
-
-
Grafana dashboards -
logging: Overview: https://dashboards.gitlab.net/d/logging-main/logging-overview?orgId=1&from=now-3h&to=now - doesn't seem to contain any beat metrics (grepped the JSON source of the dashboard in the Grafana UI; I don't think we're keeping the source of the dashboard anywhere)
-
Logging: https://dashboards.gitlab.net/d/USVj3qHmk/logging?orgId=1&from=now-7d&to=now&refresh=30s - doesn't contain any beat metrics; the PubSubbeat graphs are based on Stackdriver metrics for PubSub, not the beat itself
-
-
tracing - at the moment there's only a single component, we don't have Jaeger set up and GCE instances are not using tracing
-
-
deprovision GCE pubsubbeat VMs in nonprod -
switch to using sets for topics in all envs, this should be a noop: -
default to true for use_new_node_name -
feature flag the parts of the tf module that won't be needed in k8s; this should be a noop but might require some state changes. The only bits that we'll need are: topics, gke sinks, service accounts in the main.tf, bindings for service accounts -
stop pubsubbeat processes on VMs -
flip the feature flag so that VMs are removed - https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1863
-
-
-
data from Prom in the ops k8s cluster doesn't end up in Thanos: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10775 -
roll over node pools in production -
explore GKE's VM rotation using a custom cluster, Slack discussion: https://gitlab.slack.com/archives/CCFV016SV/p1592839497207600 - the main reason for relying on TF rather than manual steps is that we don't have to hand-hold the cluster and can rely on automation instead
-
-
enable pubsubbeat in production: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!181 (merged) -
disable pubsubbeat in gprd: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!212 (merged) -
investigate failures in gprd: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11383 -
create IAP objects: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2062 -
enable beats in gprd again: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!213 (merged) -
beats created another set of subscriptions: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11398 -
enable beats in gprd:
-
-
deprovision GCE pubsubbeat VMs in prod: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2024 -
add node selector: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!218 (diffs) -
remove pubsubbeat VMs jobs from Prometheus: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4256 -
limit CPU usage using Go runtime settings in an env var: - https://golang.org/pkg/runtime/#GOMAXPROCS
-
MR with updates: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!221 (merged) -
cpu limits in kubernetes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11424
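
To illustrate the idea of capping CPU usage through the Go runtime: the runtime reads the `GOMAXPROCS` environment variable at startup, so one option is to derive its value from the container's CPU quantity. The sketch below is only an assumption about how that mapping could look, not what the MR above does. The helper names `gomaxprocs_for_cpu_limit` and `container_env` are hypothetical, and rounding up to a whole core is a choice made here for illustration.

```python
import math


def gomaxprocs_for_cpu_limit(cpu_limit: str) -> int:
    """Map a Kubernetes CPU quantity ("500m", "2", "1.5") to a GOMAXPROCS value.

    Hypothetical helper: rounds up to a whole core and never returns less than 1.
    """
    if cpu_limit.endswith("m"):
        # Millicore notation: 1000m equals one core.
        cores = int(cpu_limit[:-1]) / 1000
    else:
        cores = float(cpu_limit)
    return max(1, math.ceil(cores))


def container_env(cpu_limit: str) -> list:
    """Build the env entry for a container spec; env values must be strings."""
    return [{"name": "GOMAXPROCS", "value": str(gomaxprocs_for_cpu_limit(cpu_limit))}]


if __name__ == "__main__":
    for limit in ("250m", "1500m", "2"):
        print(limit, container_env(limit))
```

Rounding down instead would keep the process strictly under the limit at the cost of leaving some of the allowance unused; which is preferable depends on the workload.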
-
switch the name of subscriptions so that there's no duplicate: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!219 (merged) -
reenable readiness checks on exporters: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!220 (merged) -
make sure pubsubbeat can be deployed to minikube -
add es cluster for pubsubbeat dev: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!188 (merged)
-
-
adjust resource requests in all envs
Edited by Michal Wasilewski