2021-11-16: Switchover dashboards.gitlab.net to the new instance in Kubernetes
Production Change
Change Summary
Epic: &146 (closed)
This change will move the internal dashboards at dashboards.gitlab.net
to the new instance deployed in Kubernetes by:
- importing the database - infrastructure#14544
- switching over the domain name - infrastructure#14545
- updating the monitoring (alerts, dashboard) - infrastructure#14546
Downtime warning: The dashboards will be inaccessible for about 30 minutes after changing the DNS record and deleting the old load-balancer.
Note about datasources: initially for this migration, only the currently functional and in-use datasources are configured:
- most Prometheus datasources are present, except those that are no longer present or accessible;
- all Elasticsearch datasources are removed because of incorrect credentials and/or URLs, but can be re-added easily as they are simply commented out in the configuration;
- one Sitespeed datasource is removed because it is timing out and not referenced in any dashboard; the other one is renamed from `sitespeed new` to `sitespeed`, and the few dashboards using it will need to be updated accordingly;
- the Google BigQuery, PagerDuty, Prometheus AlertManager and Simple Annotations datasources are removed because they are not referenced in any dashboard.
Change Details
- Services Impacted - Service::Grafana
- Change Technician - @pguinoiseau
- Change Reviewer - @f_santos, @steveazz
- Time tracking - 60 minutes
- Downtime Component - Service::Grafana
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 minutes
- Get approval for the following MRs:
- Check out and build grafana-sqlite-to-postgres:

  ```shell
  git clone https://github.com/wbh1/grafana-sqlite-to-postgres.git
  make -C grafana-sqlite-to-postgres
  ln -s grafana-sqlite-to-postgres/dist/grafana-migrate_darwin_amd64-v2.0.1-3-g737d5f8 grafana-migrate
  ```
- Prepare a simple local Grafana + PostgreSQL docker-compose configuration in `docker-compose.yml`:

  ```yaml
  ---
  version: '3.8'
  services:
    grafana:
      image: grafana/grafana:7.2.0
      ports:
        - 3000:3000
      links:
        - postgres
      environment:
        GF_DATABASE_TYPE: postgres
        GF_DATABASE_HOST: postgres
        GF_DATABASE_NAME: grafana
        GF_DATABASE_USER: grafana
        GF_DATABASE_PASSWORD: grafana
    postgres:
      image: postgres:13
      environment:
        POSTGRES_USER: grafana
        POSTGRES_PASSWORD: grafana
        POSTGRES_DB: grafana
      ports:
        - 5432:5432
  ```
- Start PostgreSQL and prepare an empty PostgreSQL database for Grafana 7.2.0 locally:

  ```shell
  docker-compose up --no-start
  docker-compose start postgres
  # Wait for the database to be created, then run the Grafana migrations
  docker-compose run grafana
  # Delete all initial data except the migration logs:
  psql -h 127.0.0.1 -U grafana -W grafana <<EOF
  DELETE FROM "dashboard_acl";
  DELETE FROM "org";
  DELETE FROM "org_user";
  DELETE FROM "server_lock";
  DELETE FROM "user";
  EOF
  # Dump the database to be safe
  pg_dump --host 127.0.0.1 --user grafana --password --format custom --file grafana-pgsql-7.2.0-empty.pgdump grafana
  ```
- Update the Grafana image version to v8.2.4 in `docker-compose.yml` for later:

  ```yaml
  services:
    grafana:
      image: grafana/grafana:8.2.4
  ```

  ```shell
  docker-compose up --no-start
  ```
- Set label change::in-progress on this issue
- Inform the EOC about the incoming Grafana downtime, and that in the meantime the old instance can still be accessed with the admin credentials (as Google OAuth will be broken) at http://localhost:8080 after opening an SSH tunnel to the VM:

  ```shell
  ssh -N -L 8080:localhost:80 dashboards-01-inf-ops.c.gitlab-ops.internal
  ```
- Check for users active in the last 30 minutes and inform them of the downtime, asking them to save any pending changes:

  ```shell
  ssh dashboards-01-inf-ops.c.gitlab-ops.internal sudo sqlite3 /var/lib/grafana/grafana.db \
    "SELECT login FROM user WHERE last_seen_at > datetime('now', '-1800 seconds')"
  ```
- Check in Cloudflare that the TTL of the DNS record for dashboards.gitlab.net is still 5 minutes
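One way to double-check the live TTL from the command line (rather than only in the Cloudflare UI) is to read the second field of a `dig +noall +answer` response. The helper below is a small sketch of that, assuming the record is served with a 300-second (5-minute) TTL:

```shell
# Extract the TTL (in seconds) from the first line of a `dig +noall +answer` response.
# Real usage: dig +noall +answer dashboards.gitlab.net | extract_ttl
extract_ttl() {
  awk '{ print $2; exit }'
}

# Example with a canned answer line (a real check would pipe dig output instead):
printf 'dashboards.gitlab.net.\t300\tIN\tA\t203.0.113.10\n' | extract_ttl
# -> 300
```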
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 40 minutes
1. Update the DNS record and ingress first (it will take some time in the background)
- Merge the following MRs to update the DNS records and ingress:
- Go to the OAuth Client ID `dashboards-next.gitlab.net` and update its name, Authorized Origins and Authorized Redirect URIs to use `dashboards.gitlab.net` instead
2. Migrate the database
- Stop the Grafana pods:

  ```shell
  kubectl --context ops-gitlab-gke --namespace monitoring scale --replicas=0 deployment/grafana
  kubectl --context ops-gitlab-gke --namespace monitoring wait --for=delete --selector app.kubernetes.io/name=grafana pod
  ```

- Delete and recreate the new PostgreSQL database:

  ```shell
  gcloud sql databases delete --project gitlab-ops --instance grafana-internal-f534 grafana
  gcloud sql databases create --project gitlab-ops --instance grafana-internal-f534 grafana
  ```

- Copy the SQLite database locally:

  ```shell
  rsync --progress --rsync-path='sudo rsync' dashboards-01-inf-ops.c.gitlab-ops.internal:/var/lib/grafana/grafana.db .
  ```

- Delete the datasources configuration (which will be reprovisioned later):

  ```shell
  sqlite3 grafana.db 'DELETE FROM data_source'
  ```

- Migrate the database to PostgreSQL:

  ```shell
  ./grafana-migrate ./grafana.db postgres://grafana:grafana@127.0.0.1:5432/grafana\?sslmode=disable
  ```

- Run the migrations for Grafana v8.2.4:

  ```shell
  docker-compose run grafana
  ```

- Dump the PostgreSQL database in SQL format:

  ```shell
  pg_dump --host 127.0.0.1 --user grafana --password --format plain --file dashboards-grafana-8.2.4.sql grafana
  ```

- Upload the PostgreSQL dump to a GCS bucket:

  ```shell
  gsutil cp dashboards-grafana-8.2.4.sql gs://gitlab-ops-cloudsql-import/grafana/
  ```

- Import the PostgreSQL dump into the new database:

  ```shell
  gcloud sql import sql --project gitlab-ops --database grafana --user grafana grafana-internal-f534 gs://gitlab-ops-cloudsql-import/grafana/dashboards-grafana-8.2.4.sql
  ```

- Restart the Grafana pods:

  ```shell
  kubectl --context ops-gitlab-gke --namespace monitoring scale --replicas=2 deployment/grafana
  ```
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 10 minutes
- Visit https://dashboards.gitlab.net/ and verify that it is working. If it is still offline, check the status of the load balancer, and wait longer if the certificate is still being provisioned
- Verify that you can log in with the admin credentials (stored in 1Password). If needed, the admin password can be reset with the following:

  ```shell
  echo $password | kubectl --context ops-gitlab-gke --namespace monitoring exec --stdin --container grafana grafana-XXX -- \
    grafana-cli --config /etc/grafana-config/grafana.ini admin reset-admin-password --password-from-stdin
  ```
- Verify that you can log in with Google authentication
- Verify that the dashboards are present
- Verify that the datasources are provisioned
- Merge gitlab-com/runbooks!4062 (merged) to update the monitoring alerts and dashboard
- Inform the EOC that the change is complete and that they can stop using their SSH tunnel to the old instance
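The first verification step above can also be scripted rather than checked by hand. This is a hedged sketch using Grafana's `/api/health` endpoint, with a small helper to classify the HTTP status code (the polling function is meant to be run manually during the post-change checks):

```shell
# Return success only for 2xx HTTP status codes.
ok_status() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Poll the new instance once; repeat until it reports healthy.
check_grafana() {
  code=$(curl --silent --output /dev/null --write-out '%{http_code}' \
    "https://dashboards.gitlab.net/api/health")
  if ok_status "$code"; then
    echo "Grafana is up (HTTP $code)"
  else
    echo "Grafana not ready yet (HTTP $code)" >&2
    return 1
  fi
}
```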
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5 minutes
- Open an MR reverting config-mgmt!3147, get it approved, and merge it
Monitoring
Key metrics to observe
After updating the monitoring dashboard...
- Metric: grafana_google_lb
- Location: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview
- What changes to this metric should prompt a rollback: missing metrics or an error ratio above 5%
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed! cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.
Edited by Pierre Guinoiseau