Summary of Fulfillment Infra situation for SRE support
This issue is an attempt to list all items related to the CustomersDot infrastructure and its related processes.

DRI: @ahmadsherif

### Runbook documentation for troubleshooting

The `docs/` folder of the Runbooks project is the SSOT for SREs when an app needs troubleshooting. Much of the information about our VMs, their provisioning, app deployments, log locations on the VMs, etc. can be found in the [CustomersDot overview](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/customersdot/overview.md) and the [README](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/customersdot/README.md) documents in that project.

The `overview.md` document linked above is a good starting point. Some of the information in it is also listed below.

### Our Ansible project

We use [this Ansible project](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible) to provision the CustomersDot application on the VMs and to deploy the app on them. It consists of two playbooks, one for each of these tasks:

- [the "provision.yml" playbook](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/blob/master/doc/readme.md#the-provisionyml-playbook)
  - [How to manually trigger a provisioning in Staging and Production](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/blob/master/doc/readme.md#manual-provisioning)
- [the "deploy.yml" playbook](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/blob/master/doc/readme.md#the-deployyml-playbook)
  - [How to manually deploy the app to Staging or Production](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/blob/master/doc/readme.md#manual-deployment-to-production)

### Prometheus

- [Prometheus instance for CustomersDot Production](https://prometheus-gke.prdsub.gitlab.net/graph?g0.expr=&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h)
- [Prometheus instance for CustomersDot Staging](https://prometheus-gke.stgsub.gitlab.net/graph?g0.expr=&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h)
### GCP

- [`gitlab-subscriptions-staging` project](https://console.cloud.google.com/home/dashboard?project=gitlab-subscriptions-staging)
- [`gitlab-subscriptions-prod` project](https://console.cloud.google.com/home/dashboard?project=gitlab-subscriptions-prod)

### Grafana

- [CustomersDot main page on Grafana](https://dashboards.gitlab.net/d/customersdot-main/customersdot-overview?orgId=1)
- [Error budgets for all Fulfillment groups](https://dashboards.gitlab.net/d/product-fulfillment/product-error-budgets-fulfillment?orgId=1)

### Other monitoring/observability tools

- [CustomersDot Production availability](https://customersdot.cloudwatch.net/status/customersdot-production) (via [Uptime Kuma](https://github.com/louislam/uptime-kuma)).

### TODOs

- [ ] Make sure to have access to the `Subscription portal` 1Password vault. This is where the password to [decrypt the Ansible vaults](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/blob/master/doc/readme.md#ansible-vaults-for-customersdot-secrets) is stored.

### Currently open SRE-related issues

- [Add uptime graph to Grafana](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/148)
- [Migrate from postgreSQL to Cloud SQL](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/152)
- [Add logs to Kibana](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/42)
- [GCP Logging - review Sidekiq log configuration](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/158)

## Transition checklist

Starting this week (Oct 24 2022), @ebaque's priorities shift to [the development of CustomersDot BillingAccount](https://gitlab.com/groups/gitlab-org/-/epics/8331), which means that he is leaving active infra development. That said, he remains available for any infra-related code reviews.
The following is a hand-over and a summary of what currently needs attention:

- [Add Postgres exporter to Production](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/155) (issue)

  Even though the MRs related to this issue have been merged, an alert came up a couple of hours ago. In addition to that (or maybe in relation to it), there was [some discussion](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/merge_requests/1068#note_1148956481) about possibly needing to update the `pg_hba` Postgres file in Staging and Production to allow the exporter to connect remotely and read the database. With the new CloudSQL migration, this may no longer be relevant.

- [Create new playbook for CloudSQL](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/merge_requests/323) (MR)

  This may be a blocker for the migration to CloudSQL in Production. We need to be able to provision CloudSQL in addition to the CustomersDot VM, and that may require a new playbook, which is what this MR is trying to add.

- [Add uptime graph to Grafana](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/148) (issue)

  Prometheus exporters have been set up to allow the creation of uptime graphs. This issue has not been addressed yet because the upfront work of setting up the various exporters took a long time.

- [Define SLO and iterate on it](https://gitlab.com/gitlab-com/gl-infra/customersdot-ansible/-/issues/144) (issue)

  The intent is to create a process for setting Apdex thresholds, i.e. the duration under which a request or job counts positively toward the Apdex score.

  - For Rails requests, it's currently set to [400 ms](https://gitlab.com/gitlab-org/customers-gitlab-com/blob/b66226ab6982ac6815549cf3da6b4913d336889c/lib/metrics/request_collector.rb#L31).
  - For Sidekiq jobs, it's currently set to [5 seconds](https://gitlab.com/gitlab-org/customers-gitlab-com/blob/b66226ab6982ac6815549cf3da6b4913d336889c/lib/metrics/job_execution_collector.rb#L31).

  This issue might be closed if the thresholds cannot be refined any further.
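
To make the effect of these thresholds concrete, here is a minimal Python sketch of a threshold-based Apdex ratio. It is not the code from the linked collectors. It assumes the score is simply the fraction of operations that finish at or under the threshold, and the function name, constant names, and sample durations are hypothetical. Only the 400 ms and 5 second values come from this issue.

```python
from typing import Iterable

# Thresholds quoted in this issue: 400 ms for Rails requests, 5 s for Sidekiq jobs.
# The constant names are illustrative, not taken from the CustomersDot codebase.
RAILS_REQUEST_THRESHOLD_S = 0.4
SIDEKIQ_JOB_THRESHOLD_S = 5.0


def apdex_ratio(durations_s: Iterable[float], threshold_s: float) -> float:
    """Return the fraction of operations that finished within threshold_s.

    Assumption: an operation is satisfactory when its duration is less than
    or equal to the threshold. An empty sample returns 1.0 so that a period
    without traffic is not reported as a failure.
    """
    durations = list(durations_s)
    if not durations:
        return 1.0
    satisfied = sum(1 for d in durations if d <= threshold_s)
    return satisfied / len(durations)


if __name__ == "__main__":
    # Made-up request durations in seconds, for illustration only.
    sample_request_durations = [0.12, 0.35, 0.41, 0.90]
    print(apdex_ratio(sample_request_durations, RAILS_REQUEST_THRESHOLD_S))
```

Under this assumption, lowering a threshold can only keep the ratio the same or reduce it, so replaying recent durations against a candidate threshold is one way to judge whether it can still be refined.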