[Meta] Let's Make Staging Great Again
Our high-level plan to make Staging Great Again is as follows:
- Our current staging, with the production topology, gets re-branded to pre-prod.
- Pre-prod gets an SLA similar to production. A production outage takes priority over a pre-prod one, but that is the only difference.
- Pre-prod is only accessible by people that have production access. The rules are the same.
- We build a new staging with a topology similar to production.
- The database in this environment will be pseudonymized using https://gitlab.com/gitlab-org/gitlab-ce/issues/37390
- Additional pseudonymization is needed for the DB: usernames and projects should be obfuscated (possible?).
- All developers will get access to this environment.
- After sanitizing the database we will snapshot the database drives so we can have a "rebuild staging" button that will regenerate it quickly.
- Using https://gitlab.com/gitlab-com/infrastructure/issues/2590 we will have a way of building an on-demand single-host staging environment to test migrations. (We still have to figure out the details of how to spin these environments up and down.)
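To make the "usernames and projects should be obfuscated" point concrete, here is a minimal sketch of one possible approach: keyed hashing with HMAC-SHA256, so the same input always maps to the same opaque token. This is not necessarily how the pseudonymizer in the linked issue works. The function name, the key handling, and the `user_`/`project_` prefixes are all assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes, prefix: str = "user") -> str:
    """Map a real identifier to a stable, opaque token (hypothetical helper).

    The same value and key always give the same token, so references between
    rows stay consistent after sanitizing. Without the key, the token cannot
    be recomputed from a guessed username.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate to keep the token short enough for username-like columns.
    return f"{prefix}_{digest[:16]}"
```

The key would have to live outside the sanitized environment; otherwise anyone with staging access could recompute tokens for known usernames.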
Changes to deployments:
A deployment to production should start in pre-prod. The steps should be executed in this order:
- We run the migrations in pre-prod.
- We execute GitLab-QA to ensure that the old application supports running with the new model.
- We switch the application to the new version.
- We execute GitLab-QA again to ensure that the new version works.
If the application fails at any step, we fail the whole process and block the deployment. The pre-prod database then gets rebuilt from the latest DB disk snapshot from production.
This deployment process should be enforced by takeoff.
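The gating logic described above could be sketched roughly like this. It is not takeoff's actual implementation; `run_preprod_deployment`, the step names, and the `restore_snapshot` callback are hypothetical and only illustrate the "fail the whole process, then rebuild pre-prod" rule.

```python
from typing import Callable, List, Tuple

# A step is a (name, callable) pair; the callable returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_preprod_deployment(steps: List[Step], restore_snapshot: Callable[[], None]) -> bool:
    """Run the pre-prod steps in order and stop at the first failure.

    On failure the pre-prod database is rebuilt via restore_snapshot and the
    function returns False, meaning the production deployment is blocked.
    """
    for name, step in steps:
        try:
            ok = step()
        except Exception:
            # Treat a crashing step the same as a failing one.
            ok = False
        if not ok:
            print(f"step failed: {name}; blocking deployment")
            restore_snapshot()
            return False
    return True
```

A caller would pass the four steps in order (run migrations, GitLab-QA against the old version, switch to the new version, GitLab-QA against the new version) and only proceed to production when the function returns `True`.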
Changes for cookbooks development:
We will start using environments (or similar) to separate versions of cookbooks so we can test cookbook changes in isolation: https://gitlab.com/gitlab-com/infrastructure/issues/2763
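For illustration, a Chef environment pins cookbooks through its `cookbook_versions` map. The sketch below builds such an environment document as JSON. The helper name, the environment names, and the cookbook names and versions are made up; the real pins would come out of the issue linked above.

```python
import json
from typing import Dict


def build_chef_environment(name: str, cookbook_pins: Dict[str, str], description: str = "") -> dict:
    """Build a Chef environment document with exact cookbook version pins."""
    return {
        "name": name,
        "description": description,
        "json_class": "Chef::Environment",
        "chef_type": "environment",
        # "= x.y.z" is Chef's exact-version constraint.
        "cookbook_versions": {
            cookbook: f"= {version}" for cookbook, version in sorted(cookbook_pins.items())
        },
    }


# Hypothetical pins: staging tries a newer cookbook while production stays put.
production = build_chef_environment("prd", {"example-cookbook": "1.4.0"})
staging = build_chef_environment("stg", {"example-cookbook": "1.5.0"})
print(json.dumps(staging, indent=2))
```

With separate environments like these, a cookbook bump can be tried in staging without the production nodes picking it up.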
How?
- Wire up Chef environments so each one has its own pinned cookbook versions https://gitlab.com/gitlab-com/infrastructure/issues/2763
- Create a new pre-prod environment that uses the same Chef environment as production. https://gitlab.com/gitlab-com/infrastructure/issues/2801
- Set up monitoring and alarming for pre-prod https://gitlab.com/gitlab-com/infrastructure/issues/2800
- Devise a way to refresh the pre-prod database using snapshots so it can be updated regularly. https://gitlab.com/gitlab-com/infrastructure/issues/2801
- Create or use the existing staging with a separate chef environment. Confirm that we can make isolated changes. https://gitlab.com/gitlab-com/infrastructure/issues/2812
- Sanitize staging DB https://gitlab.com/gitlab-com/infrastructure/issues/2811
- Set up monitoring and alarming for staging
- Devise a way to refresh the staging database from production using snapshots so that it can be updated and sanitized regularly.
- Update takeoff to point to the new pre-prod.
- Announce the deployment changes for full awareness.
- Create an access group for developers https://gitlab.com/gitlab-com/infrastructure/issues/2784
- Grant access to developers to this sanitized environment (and to the VPN... this is gonna be a lot of toil)
OLD ISSUE:
I've seen several issues around staging and some discussions of what to fix. I'm making this as a single point to track changes being worked on for staging.
Fixing staging will take higher priority, as we need one place to test changes before they reach production.
Issues I see so far:
Automate the staging environment creation https://gitlab.com/gitlab-com/infrastructure/issues/1815
Discover/POC for multiple copies of staging using terraform workspaces https://gitlab.com/gitlab-com/infrastructure/issues/2590
Provision a Nightly Staging Environment https://gitlab.com/gitlab-com/infrastructure/issues/2674
Issues that might be missing
- Develop a way to use chef in staging without affecting production. (testing chef changes)
- Automated testing for any configuration change.
- Staging environments with truncated DB for testing that doesn't require a full db? (is this feasible?)