[Meta] - Getting to canary deployment, review apps, autoscale, life, the universe and everything else
Reasoning
We need to move faster every day. To do that, we need to enable development to test things in smaller batches and at a higher frequency, so that when we reach production we can enable new features in a safe way, increasing the quality of the software we release.
QA in production can be achieved through canary deployments or feature toggles. Either one can increase the velocity of our development and, in turn, our deployments. The path to either of them can be used to deliver multiple intermediate steps that will shorten the feedback cycle: first during development, then in staging, and finally in production. This will let us discover that we are delivering the wrong thing, or that something is not performant, way sooner in the process: while it is still in development.
To get there we will need to provide a dynamic platform where people can deploy in a self-service way, so that we (production) are not the bottleneck. This is a fantastic opportunity to fix some long-standing issues that we have been dealing with in infrastructure for quite some time, and also to future-proof our scaling ability by creating the building blocks we need to scale the infrastructure without having to scale the team.
Steps required to get to canary deployment
Bear in mind that this is a high-level overview and that we may be missing intermediate steps. Still, I'll try to describe the added value of each step.
Setting up the building blocks of the underlying infrastructure
- Use this to improve our current deployment procedure (#1739 (closed)): instead of installing the package on the current hosts, create a new fleet and switch the traffic over.
- Deploy Consul and use it for service discovery and DNS
- #1859 (closed) Move our consul instance to the new VNET and make it secure
- #1860 (closed) Install consul agent in all the fleet nodes (using chef)
- #1861 (closed) Use it with Prometheus to ditch our Chef-based service discovery
- #1485 (closed) Use it for discovering the database primary and secondaries through DNS
- #1723 (closed) Use it generally as an asset for internal DNS resolution; also used by Gitaly
- Use consul-template to dynamically configure and reconfigure the fleet instead of using chef-client with crontab.
- #1863 (closed) The first use I can think of is reconfiguring HAProxy dynamically (see the consul-template sketch after this list)
- Use vault to store secrets instead of chef-vault
- #1864 (closed) Install hashicorp vault and secure it
- #1212 (closed) Deprecate chef-vault: it has proven not to scale, being cumbersome, mysterious and error prone, stealing multiple working hours just waiting for the vault to re-encrypt whenever we need to add someone to it (like onboarding a release manager)
- #1865 (closed) Wire credentials to consul-template so we can rotate them in real time (see the sketch after this list)
- gitlab-cog/postgres#1 We would like to also use this to store TOTP seeds for given users to have strong authentication for specific operations with Marvin
- We will then roll credentials, document the process, and schedule rolling them much more often than we do now. Bonus points if we automate this credential rolling.
- Use chef-solo to detach a VM configuration from the chef server lifecycle
- #1211 (closed) Generate and configure machines with chef-solo
- This will enable us to move to branch-based development of Chef recipes and configurations while still being able to apply them to VMs, detaching the configuration of the application itself from the configuration of the hosts
- Detach the application deployment lifecycle from chef-client itself to prevent upgrading secondary hosts while we are running migrations on the blessed node (this has caused confusion and downtime multiple times already)
- #1816 (closed) Automate staging database snapshotting to allow on-demand generation of production-size staging environments.
- Staging will become a secondary database of production's primary; we will manually refresh this staging secondary on a regular schedule.
- Automate refreshing and snapshotting staging storage to keep it in sync with the database snapshots as part of the refresh process
- Use Packer to build VM and Docker images with chef-solo as a provisioner.
- This will allow us to remove multiple snowflake VMs that are just being expensive right now.
- Also, this will allow us to keep different already-configured, containerized versions of the application, which will enable us to later roll deployments of configurations forward and back.
- Use the previously Packer-built images in conjunction with Terraform
- This will allow us to provision statically configured VMs so we can replace them live by provisioning new hosts and then changing the load balancer configuration in real time, effectively reaching zero-downtime deploys (see the Terraform sketch after this list).
- Using this, if we need to roll back, we just invert the LB configuration and keep using the old hosts.
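To make the HAProxy reconfiguration above a bit more concrete, here is a minimal consul-template sketch. It assumes a local Consul agent on 127.0.0.1:8500 and a hypothetical service named `gitlab-web`; the web backend is re-rendered from Consul's catalog and HAProxy is reloaded whenever the fleet changes.

```hcl
# consul-template configuration sketch (hypothetical paths and service name)
consul {
  address = "127.0.0.1:8500"
}

template {
  # Inline template: the backend server list comes straight from Consul's catalog
  contents = <<EOT
defaults
  mode http
  timeout connect 5s
  timeout client  30s
  timeout server  30s

frontend http-in
  bind *:80
  default_backend gitlab_web

backend gitlab_web
  balance roundrobin{{ range service "gitlab-web" }}
  server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}
EOT

  destination = "/etc/haproxy/haproxy.cfg"
  command     = "systemctl reload haproxy"
}
```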
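For the credential wiring, the idea would be along these lines: consul-template reads a secret from Vault and re-renders the application's database credentials whenever the secret changes. The Vault address, secret path, keys and reload command below are assumptions for the sake of the example.

```hcl
# Sketch: render database credentials from Vault
# (hypothetical secret path, keys and reload command)
vault {
  address = "https://vault.service.consul:8200"
}

template {
  contents = <<EOT
{{ with secret "secret/gitlab/production/database" }}
production:
  username: "{{ .Data.username }}"
  password: "{{ .Data.password }}"
{{ end }}
EOT

  destination = "/opt/gitlab/etc/database-credentials.yml"
  command     = "gitlab-ctl restart unicorn" # placeholder for whatever reload hook fits
}
```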
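And for the Terraform part, a heavily simplified sketch of the fleet switch: both fleets are described in code and built from Packer images, and a single variable decides which one the load balancer points at. The module, variables and outputs are invented for illustration; the real code would use whatever provider we end up being hosted on.

```hcl
# Sketch: blue/green fleets built from Packer images. Flipping `active_fleet`
# and re-applying moves the traffic; flipping it back is the rollback.
variable "active_fleet" {
  default = "blue" # or "green"
}

variable "image_blue" {}
variable "image_green" {}

module "fleet_blue" {
  source     = "./modules/gitlab-fleet" # hypothetical module wrapping the VM resources
  image_id   = var.image_blue
  fleet_name = "web-blue"
  node_count = 20
}

module "fleet_green" {
  source     = "./modules/gitlab-fleet"
  image_id   = var.image_green
  fleet_name = "web-green"
  node_count = 20
}

# Only the active fleet ends up in the load balancer's backend pool.
locals {
  lb_backends = var.active_fleet == "blue" ? module.fleet_blue.private_ips : module.fleet_green.private_ips
}
```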
Using the newly built infrastructure to accelerate development
- Build a container orchestration infrastructure for front-end staging, probably using Kubernetes for this.
- It would be better if we could use a managed service to reduce complexity, but this depends on where we are being hosted.
- Use the images previously built with Packer, but this time as containers, allowing fast deployments in staging and also allowing us to start building a release process around a container orchestration tool.
- Using the staging snapshot plus the container orchestration infrastructure, and once we have a strict lifecycle for review apps, wire it all together to generate multiple staging environments, building images for each branch that will run as containers against a full-sized production database from GitLab.com itself.
- This will allow us to test database migrations per branch before even merging into master.
- Register the staging environment containers in Consul when they start running.
- Build a specific staging Prometheus server that will be connected to Consul to scrape these staging instances.
- Add black box monitoring to this Prometheus server so that when a staging instance is created we automatically start scraping it to detect performance regressions.
- We should keep one staging instance running the current production image to use as a baseline; this way we can directly compare for regressions in a single dashboard.
- Adding new URLs to this black box monitoring has to be an extremely simple process, so that developers can just add new URLs to probe before deploying a review app.
- Add alerting based on this black box monitoring so developers get notified when their review-app contains a performance regression.
At this stage, developers will have a specific dashboard where they can compare live how the feature they are working on will impact production, while it is still a branch. Additionally, on each deploy of a new review app based on a branch, we will be executing the migrations on a production-sized database.
Staging Deployment Process
With an automated staging deployment procedure we will reshape our current deployment to be like this:
- Starting from the Docker image we are provided with, we will use Packer to append everything we need to move faster on deployment. For example: OpenSSH-specific configurations, consul-template for live changes, etc. We need to own image changes because this is something we already do in production, so we will need a way to keep doing it in a containerized world (see the Packer sketch after this list).
- We will depend on having a build infrastructure that can be triggered to build from a given branch of GitLab.com, so that we can reuse our omnibus-generated images. If this is not possible, we will need to find an alternative with the build team.
- When a developer requests a review app deploy, we will pick the latest database snapshot, clone it, attach it to a staging server, mount it, and then start a new Postgres instance on a different port.
- We will then spawn a Kubernetes job to run the migrations on this production-sized database instance. If the job fails, we gather the log output and tear down the instance as it is not necessary anymore (solving the current migration testing issue).
- Once the migrations are finished (we should measure how long they took), we will push a pod with all the required containers and register this service in the service discovery system (Consul).
- With the registration in Consul, consul-template will be automatically triggered on a front-end LB (HAProxy), where the branch name will become a new subdomain pointing at this instance of GitLab staging, making the service reachable through a public URL (a sketch of such a registration follows this list).
- Additionally, with the Consul registration we will start scraping Prometheus metrics from the pod and probing it with black box monitoring, adding a label to these metrics so the developer can find them.
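To make the registration step more concrete, this is roughly what the Consul service definition for one review app could look like. The service name, id, tags, port and health check endpoint are assumptions; the branch tag is what consul-template would use to build the subdomain, and what Prometheus relabeling could turn into the per-branch label.

```hcl
# Sketch: Consul service definition registered when a review-app pod starts
# (hypothetical names, tags and ports)
service {
  name = "review-app"
  id   = "review-app-1234-fix-login" # unique per branch/pod
  tags = ["branch-1234-fix-login"]   # consumed by consul-template and Prometheus relabeling
  port = 8080

  check {
    http     = "http://localhost:8080/-/readiness"
    interval = "15s"
    timeout  = "5s"
  }
}
```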
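And for the image preparation mentioned in the first point, a Packer sketch of how we could extend the provided image with our own tweaks before pushing it to staging. The base image, file paths, consul-template version and registry are all assumptions for illustration.

```hcl
# Sketch: extend the omnibus-built Docker image with our production tweaks
# (base image, paths, versions and registry are placeholders)
source "docker" "gitlab" {
  image  = "gitlab/gitlab-ee:latest" # the omnibus-built base image
  commit = true
}

build {
  sources = ["source.docker.gitlab"]

  # Drop in our own configuration, e.g. OpenSSH hardening
  provisioner "file" {
    source      = "files/sshd_config"
    destination = "/etc/ssh/sshd_config"
  }

  # Install consul-template for live reconfiguration inside the container
  provisioner "shell" {
    inline = [
      "apt-get update && apt-get install -y unzip curl",
      "curl -sLo /tmp/ct.zip https://releases.hashicorp.com/consul-template/0.19.5/consul-template_0.19.5_linux_amd64.zip",
      "unzip /tmp/ct.zip -d /usr/local/bin",
    ]
  }

  post-processor "docker-tag" {
    repository = "registry.example.com/gitlab-staging"
    tags       = ["branch-1234-fix-login"]
  }
}
```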
This can also be used as a preliminary step to run GitLab-QA.
Production Deployment Process
We will use the previous process, except that instead of registering a pod we will just push a deployment file.
This way we will simply run the migrations in a job and then perform a rolling upgrade, removing all the current complexity from the process. If the migrations fail (they should have failed way earlier, in the staging environment), we will simply stop the deployment or run a new job to revert them.
We will have multiple opportunities to test migrations, as we will no longer have a single staging environment like we do today.
When we reach this point, we can easily add autoscale capabilities to our infrastructure.
Canary Deployment
At this stage, I have multiple concerns regarding canary deployment, and I think that we should move in the direction of feature toggles instead.
If we want to go down the path of canary deployment, we just need to reuse our staging deployment model: spawn a pod with a specific image version and direct traffic to it by using an HTTP header with a token, for example (see the routing sketch below).
But given how our development process works right now, I think that something like this would increase complexity. That's why I think we need to move to a continuous delivery model rather than canary deploys: we just deploy things to production and activate them once we know they are stable, which is when we release them in the package turned on by default.
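For reference, the header-based routing could be expressed as another consul-template-rendered HAProxy fragment along these lines (the backend names, service names and header are made up for the example):

```hcl
# Sketch: route requests carrying a canary token header to the canary pod,
# everything else to the regular fleet (hypothetical names)
template {
  contents = <<EOT
frontend http-in
  bind *:80
  acl is_canary req.hdr(X-GitLab-Canary-Token) -m found
  use_backend gitlab_canary if is_canary
  default_backend gitlab_web

backend gitlab_canary
  balance roundrobin{{ range service "gitlab-canary" }}
  server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}

backend gitlab_web
  balance roundrobin{{ range service "gitlab-web" }}
  server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}
EOT

  destination = "/etc/haproxy/haproxy.cfg"
  command     = "systemctl reload haproxy"
}
```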
Questions
- How does omnibus-gitlab fit into this picture?
We will need to move away from the omnibus-gitlab Debian package and start using our Docker image for whatever is cattle. Pets will keep working with the Debian package for now.
- For the first iteration, are we just going to take the giant Docker image and connect Redis/PostgreSQL separately as we do now?
Yes, we will start using our giant image, but we will use it as a base image that we tweak to make it production-ready, the same way we do with Chef today by applying changes to the deployed package for things like OpenSSH. We can't lose this ability, which is why we will work with Packer to control the images we push, to staging first and then to production.
Redis and PostgreSQL will remain on virtual hosts for now, mainly because of the size of these services. If we can reduce the size of Redis by splitting it into multiple instances, then we may start running Redis as a specific pod.
I don't think that Postgres will be containerized anytime soon, at least not during the initial iterations.