2021-11-16: Switchover dashboards.gitlab.net to the new instance in Kubernetes
Production Change
Change Summary
Epic: &146 (closed)
This change will move the internal dashboards at dashboards.gitlab.net
to the new instance deployed in Kubernetes by:
- importing the database - infrastructure#14544
- switching over the domain name - infrastructure#14545
- updating the monitoring (alerts, dashboard) - infrastructure#14546
Downtime warning: The dashboards will be inaccessible for about 30 minutes after changing the DNS record and deleting the old load-balancer.
Note about datasources: initially for this migration, only the currently functional and in-use datasources are configured:
- most Prometheus datasources are present, except those that are no longer present or accessible;
- all Elasticsearch datasources are removed because of incorrect credentials and/or URLs, but can be re-added easily as they are simply commented out in the configuration;
- one Sitespeed datasource is removed because it is timing out and not referenced in any dashboard; the other one is renamed from `sitespeed new` to `sitespeed`, and the few dashboards using it will need to be updated accordingly;
- the Google BigQuery, PagerDuty, Prometheus AlertManager and Simple Annotations datasources are removed because they are not referenced in any dashboard.
Change Details
- Services Impacted - Service::Grafana
- Change Technician - @pguinoiseau
- Change Reviewer - @f_santos, @steveazz
- Time tracking - 60 minutes
- Downtime Component - Service::Grafana
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 minutes
- Get approval for the following MRs:
- Check out and build grafana-sqlite-to-postgres:

  ```shell
  git clone https://github.com/wbh1/grafana-sqlite-to-postgres.git
  make -C grafana-sqlite-to-postgres
  ln -s grafana-sqlite-to-postgres/dist/grafana-migrate_darwin_amd64-v2.0.1-3-g737d5f8 grafana-migrate
  ```
- Prepare a simple local Grafana + PostgreSQL docker-compose configuration in `docker-compose.yml`:

  ```yaml
  ---
  version: '3.8'
  services:
    grafana:
      image: grafana/grafana:7.2.0
      ports:
        - 3000:3000
      links:
        - postgres
      environment:
        GF_DATABASE_TYPE: postgres
        GF_DATABASE_HOST: postgres
        GF_DATABASE_NAME: grafana
        GF_DATABASE_USER: grafana
        GF_DATABASE_PASSWORD: grafana
    postgres:
      image: postgres:13
      environment:
        POSTGRES_USER: grafana
        POSTGRES_PASSWORD: grafana
        POSTGRES_DB: grafana
      ports:
        - 5432:5432
  ```
- Start PostgreSQL and prepare an empty PostgreSQL database for Grafana 7.2.0 locally:

  ```shell
  docker-compose up --no-start
  docker-compose start postgres
  # Wait for the database to be created, then run the Grafana migrations
  docker-compose run grafana
  # Delete all initial data except the migration logs:
  psql -h 127.0.0.1 -U grafana -W grafana <<EOF
  DELETE FROM "dashboard_acl";
  DELETE FROM "org";
  DELETE FROM "org_user";
  DELETE FROM "server_lock";
  DELETE FROM "user";
  EOF
  # Dump the database to be safe
  pg_dump --host 127.0.0.1 --user grafana --password --format custom --file grafana-pgsql-7.2.0-empty.pgdump grafana
  ```
- Update the Grafana image version to v8.2.4 in `docker-compose.yml` for later:

  ```yaml
  services:
    grafana:
      image: grafana/grafana:8.2.4
  ```

  ```shell
  docker-compose up --no-start
  ```
- Set label change::in-progress on this issue
- Inform the EOC about the incoming Grafana downtime, and that in the meantime the old instance can still be accessed with the admin credentials (as Google OAuth will be broken) at http://localhost:8080 after opening an SSH tunnel to the VM:

  ```shell
  ssh -N -L 8080:localhost:80 dashboards-01-inf-ops.c.gitlab-ops.internal
  ```
- Check for users active in the last 30 minutes and inform them of the downtime, asking them to save any pending changes:

  ```shell
  ssh dashboards-01-inf-ops.c.gitlab-ops.internal sudo sqlite3 /var/lib/grafana/grafana.db \
    "SELECT login FROM user WHERE last_seen_at > datetime('now', '-1800 seconds')"
  ```
- Check in Cloudflare that the TTL of the DNS record for dashboards.gitlab.net is still 5 minutes
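One way to double-check the live TTL from the command line (rather than only in the Cloudflare UI) is to read the second field of a `dig +noall +answer` response. The helper below is a small sketch of that, assuming the record is served with a 300-second (5-minute) TTL:

```shell
# Extract the TTL (in seconds) from the first line of a `dig +noall +answer` response.
# Real usage: dig +noall +answer dashboards.gitlab.net | extract_ttl
extract_ttl() {
  awk '{ print $2; exit }'
}

# Example with a canned answer line (a real check would pipe dig output instead):
printf 'dashboards.gitlab.net.\t300\tIN\tA\t203.0.113.10\n' | extract_ttl
# -> 300
```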
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 40 minutes
1. Update the DNS record and ingress first (it will take some time in the background)
- Merge the following MRs to update the DNS records and ingress:
- Go to the OAuth Client ID `dashboards-next.gitlab.net` and update its name, Authorized Origins and Authorized Redirect URIs to use `dashboards.gitlab.net` instead
2. Migrate the database
- Stop the Grafana pods:

  ```shell
  kubectl --context ops-gitlab-gke --namespace monitoring scale --replicas=0 deployment/grafana
  kubectl --context ops-gitlab-gke --namespace monitoring wait --for=delete --selector app.kubernetes.io/name=grafana pod
  ```

- Delete and recreate the new PostgreSQL database:

  ```shell
  gcloud sql databases delete --project gitlab-ops --instance grafana-internal-f534 grafana
  gcloud sql databases create --project gitlab-ops --instance grafana-internal-f534 grafana
  ```

- Copy the SQLite database locally:

  ```shell
  rsync --progress --rsync-path='sudo rsync' dashboards-01-inf-ops.c.gitlab-ops.internal:/var/lib/grafana/grafana.db .
  ```

- Delete the datasources configuration (which will be reprovisioned later):

  ```shell
  sqlite3 grafana.db 'DELETE FROM data_source'
  ```

- Migrate the database to PostgreSQL:

  ```shell
  ./grafana-migrate ./grafana.db postgres://grafana:grafana@127.0.0.1:5432/grafana\?sslmode=disable
  ```

- Run the migrations for Grafana v8.2.4:

  ```shell
  docker-compose run grafana
  ```

- Dump the PostgreSQL database in SQL format:

  ```shell
  pg_dump --host 127.0.0.1 --user grafana --password --format plain --file dashboards-grafana-8.2.4.sql grafana
  ```

- Upload the PostgreSQL dump to a GCS bucket:

  ```shell
  gsutil cp dashboards-grafana-8.2.4.sql gs://gitlab-ops-cloudsql-import/grafana/
  ```

- Import the PostgreSQL dump into the new database:

  ```shell
  gcloud sql import sql --project gitlab-ops --database grafana --user grafana grafana-internal-f534 gs://gitlab-ops-cloudsql-import/grafana/dashboards-grafana-8.2.4.sql
  ```

- Restart the Grafana pods:

  ```shell
  kubectl --context ops-gitlab-gke --namespace monitoring scale --replicas=2 deployment/grafana
  ```
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 10 minutes
- Visit https://dashboards.gitlab.net/ and verify that it is working. If it is still offline, check the status of the load balancer, and wait longer if the certificate is still being provisioned
- Verify that you can log in with the admin credentials (stored in 1Password). If needed, the admin password can be reset with the following:

  ```shell
  echo $password | kubectl --context ops-gitlab-gke --namespace monitoring exec --stdin --container grafana grafana-XXX -- \
    grafana-cli --config /etc/grafana-config/grafana.ini admin reset-admin-password --password-from-stdin
  ```
- Verify that you can log in with Google authentication
- Verify that the dashboards are present
- Verify that the datasources are provisioned
- Merge gitlab-com/runbooks!4062 (merged) to update the monitoring alerts and dashboard
- Inform the EOC that the change is complete and that they can stop using their SSH tunnel to the old instance
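The first verification step above can also be scripted rather than checked by hand. This is a hedged sketch using Grafana's `/api/health` endpoint, with a small helper to classify the HTTP status code (the polling function is meant to be run manually during the post-change checks):

```shell
# Return success only for 2xx HTTP status codes.
ok_status() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Poll the new instance once; repeat until it reports healthy.
check_grafana() {
  code=$(curl --silent --output /dev/null --write-out '%{http_code}' \
    "https://dashboards.gitlab.net/api/health")
  if ok_status "$code"; then
    echo "Grafana is up (HTTP $code)"
  else
    echo "Grafana not ready yet (HTTP $code)" >&2
    return 1
  fi
}
```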
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5 minutes
- Open an MR reverting config-mgmt!3147, get it approved, and merge it
Monitoring
Key metrics to observe
After updating the monitoring dashboard...
- Metric: grafana_google_lb
- Location: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview
- What changes to this metric should prompt a rollback: missing metrics or an error ratio above 5%
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed! cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.
Edited by Pierre Guinoiseau