[Meta] Infrastructure stability + scalability
We can’t fix what we can’t see.
Continue to improve monitoring and logging.
- Monitoring: track not only per-host metrics but also an overall service response time metric (git/api/web).
- Availability monitoring for the DB.
- Logging: centralized logging for all services and hosts, including chef/rails consoles, to remove the need to log into production/staging environments.
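To illustrate the difference between per-host metrics and an overall service response time metric, here is a minimal sketch that rolls timing samples from many hosts up into one summary per service. The function name, the sample tuple layout, and the choice of p50/p95 are assumptions made for illustration, not a description of the existing monitoring stack.

```python
from collections import defaultdict
from statistics import quantiles


def service_latency_summary(samples):
    """Aggregate (service, host, seconds) samples into per-service percentiles.

    The host is deliberately ignored: the goal is one number per service
    (git/api/web), regardless of which host served the request.
    """
    by_service = defaultdict(list)
    for service, _host, seconds in samples:
        by_service[service].append(seconds)

    summary = {}
    for service, values in by_service.items():
        if len(values) < 2:
            # statistics.quantiles needs at least two data points
            p50 = p95 = values[0]
        else:
            cuts = quantiles(values, n=100, method="inclusive")
            p50, p95 = cuts[49], cuts[94]  # 50th and 95th percentiles
        summary[service] = {"count": len(values), "p50": p50, "p95": p95}
    return summary
```

Fed with samples from every host, the result has one entry per service, which is the number an alert or dashboard for "git is slow" would watch.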
Stability: Immediate things to fix:
- Redis:
  - Splitting Redis into separate clusters: https://gitlab.com/gitlab-com/infrastructure/issues/2448
- NFS servers:
  - Circuit breaker: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/11449
  - Migration of endpoints to Gitaly
  - Gitaly migration boards: https://gitlab.com/gitlab-org/gitaly/boards/331341
- Database: unable to fail over
  - GitLab Server HA: gitlab-org/omnibus-gitlab#2452 (closed)
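For readers unfamiliar with the circuit breaker idea mentioned for the NFS servers, the general pattern can be sketched as below. This is a generic illustration and not the implementation in the linked merge request; the class name, thresholds, and injectable clock are all assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped entirely."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of blocking on an unresponsive backend.
                raise CircuitOpenError("backend unavailable, failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the pattern is that once a storage backend is known to be failing, callers get an immediate error rather than tying up workers waiting on it.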
Prepping for Scale: Need to have
- Staging: https://gitlab.com/gitlab-com/infrastructure/issues/2751
  - Staging environment that mirrors production, except with anonymized database data.
  - Quick spin-up/tear-down of staging environments. Allow multiple staging environments to run at once.
- Canary deployments:
  - Meta: getting to canary deployment (https://gitlab.com/gitlab-com/infrastructure/issues/1504)
- Feature flags
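Canary deployments and feature flags both come down to exposing a change to a controlled fraction of traffic. One common way to do that is deterministic percentage bucketing, sketched below. The function names and the hashing scheme are assumptions for illustration and do not describe how GitLab implements either feature.

```python
import hashlib


def bucket(flag, actor_id):
    """Map a (flag, actor) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{actor_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag, actor_id, rollout_percent):
    """Return True if this actor falls inside the rollout percentage.

    The same actor always lands in the same bucket, so raising the
    percentage only ever adds actors and never removes them.
    """
    return bucket(flag, actor_id) < rollout_percent
```

Because the bucket depends only on the flag name and the actor, a rollout can be widened from 1% to 10% to 100% without anyone flipping back and forth between the old and new behavior.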
To scale
- Move CI artifacts to object storage: https://gitlab.com/gitlab-com/infrastructure/issues/2387
  - Move the rest of the non-repo storage to object storage.
- Services separated: https://gitlab.com/gitlab-com/infrastructure/issues/2458
- Geo: multiple regions.