Risks of rolling out different database cluster automation and administration
Scenario:
We currently need to provision new database clusters to meet our sharding needs.
The epic &492 (closed) gives us the possibility to provision new database clusters. The new clusters are configured with Ansible, which internally uses Omnibus/Chef. To keep this document readable, I will refer to this provisioning process as Ansible provisioning.
The database ecosystem in production today is provisioned by Chef. To keep this document readable, I will refer to this provisioning process as Chef provisioning.
Identified Risks:
Creating a new cluster with the Ansible provisioning, e.g., as required for sharding the new CI database, would generate technical debt and the following risks:
- Different troubleshooting depending on the cluster. This will increase the complexity for any SRE or DBRE in a troubleshooting situation, because the paths and configuration files differ depending on how the cluster was provisioned.
- Different behavior in production, since one cluster will be administered by Chef and another by Ansible, each with a different source of truth (repos) and setup.
- Different administrative routines and documentation depending on the cluster. If we need to block traffic on a node, we will operate differently in each cluster (e.g., adding the tags nobalance and nofailover).
- Different processes, e.g., we are changing how to roll out a new instance.
- Increased complexity of the Terraform code and troubleshooting. The Ansible provisioning will add a new repo that mixes Terraform, Ansible, and Docker code/configuration.
- Increased alert/troubleshooting fatigue due to having different cluster provisioning. The same alert would require different troubleshooting depending on the provisioning.
- Different package distribution and monitoring setup depending on the provisioning.
- Lack of testing for the Ansible provisioning. Currently, we do not have unit or integration tests. The current Chef configuration has been in production for years, giving us some guarantee of stability.
- Lack of updated documentation for the new cluster. We would need to duplicate all our runbooks and documentation to support both clusters.
- Lack of training for the whole SRE team to operate the Ansible-provisioned cluster comfortably.
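To illustrate the operational branching these risks describe, here is a minimal sketch of how a single routine task (blocking traffic on a node) would need a per-cluster lookup once two provisioning methods coexist. Every path, repo name, runbook name, and cluster name below is a hypothetical placeholder, not the real production layout.

```python
from dataclasses import dataclass


# All values in this sketch are hypothetical placeholders used only to
# illustrate the branching; they do not describe the real environment.
@dataclass(frozen=True)
class ProvisioningProfile:
    config_path: str      # where the rendered configuration would live on a node
    source_of_truth: str  # repo an operator would have to change
    runbook: str          # documentation an operator would have to follow


PROFILES = {
    "chef": ProvisioningProfile(
        config_path="/placeholder/chef/config.yml",
        source_of_truth="placeholder-chef-repo",
        runbook="placeholder-chef-runbook",
    ),
    "ansible": ProvisioningProfile(
        config_path="/placeholder/ansible/config.yml",
        source_of_truth="placeholder-ansible-repo",
        runbook="placeholder-ansible-runbook",
    ),
}

# Hypothetical mapping of clusters to the method that provisioned them.
CLUSTER_PROVISIONING = {"main": "chef", "ci": "ansible"}


def block_traffic_steps(cluster: str) -> list[str]:
    """Return the steps an operator would follow for the given cluster."""
    profile = PROFILES[CLUSTER_PROVISIONING[cluster]]
    return [
        f"Follow {profile.runbook}",
        f"Set the tags nobalance and nofailover via {profile.source_of_truth}",
        f"Verify the rendered configuration at {profile.config_path}",
    ]


if __name__ == "__main__":
    for name in CLUSTER_PROVISIONING:
        print(name, block_traffic_steps(name))
```

The point of the sketch is that the same task yields a different step list per cluster, so every runbook and alert response would need this kind of lookup for as long as both provisioning methods are in production.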
Possible solutions to release a consistent new database cluster:
1 - Provisioning a new Chef cluster.
Initially, roll out the new database cluster using the Chef provisioning. Afterward, plan the migration to the Ansible provisioning once all the risks are mitigated.
ETA: With SRE support, we should be able to provision a new cluster in two weeks (10 working days).
2 - Migrating staging and production to the Ansible provisioning
Migrate the current staging and production database environments to the Ansible provisioning process. Then create the new database cluster for the CI database under the Ansible provisioning.
ETA: TBD