Improve observability and logging of CustomersDot
Progress tracking: https://gitlab.com/gitlab-org/fulfillment-meta/-/issues/309
---
This epic tracks work related to the [Fulfillment Engineering Allocation: "Improve availability of CustomersDot by migrating from Azure to GCP"](https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/82955)
---
The goal of this epic is to organize a series of issues that will:
1. Remove infrastructure from Azure.
2. Solve the problems listed in [this epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/334).
3. Make our infrastructure more consistent, secure, and robust.
The following roadmap is scheduled for Q3 (2021):
| Work | Guesstimate | Status | Theme |
|-----------------------------------------------------------------------|-------------|---------|----------------------|
| Create a new Google Cloud Project for staging | M | Done | Infrastructure Setup |
| Use Puma as default web server | L | WIP, Q2 | Infrastructure Setup |
| Configure basic Ansible playbook | L | WIP, Q2 | Infrastructure Setup |
| Configure Ansible CustomersDot environment | M | WIP, Q2 | Infrastructure Setup |
| Deploy to staging using Ansible | L | Q2-Q3 | Infrastructure Setup |
| Configure Ansible CustomersDot deployment | L | Q2-Q3 | Infrastructure Setup |
| Add CustomersDot deployment to GitLab CI | L | Q3 | Infrastructure Setup |
| Configure fail2ban, Cloudflare | L | Q3 | Infrastructure Setup |
| Create a new Google Cloud Project for production | L | Q3 | Infrastructure Setup |
| Replicate current deployment blocking (auto-block on staging failure) | L | Q3 | Infrastructure Setup |
| Migrate to new production fleet and testing | XXL | Q3 | Infrastructure Setup |
| Clean up old customers Infra | M | Q3 | Infrastructure Setup |
| Static IP address requirements | L | Q3 | Infrastructure Setup |
| Use Ansible/Ansistrano auto-rollback feature | L | Q3 | Infrastructure Setup |
| Add Prometheus monitoring | M | Q3 | Basic Observability |
| Add minimum uptime/health checks and alerts | L | Q3 | Basic Observability |
| Add logs to Kibana | L | Q3 | Basic Observability |
| Add current API health and alerts (including Zuora, SFDC…) | L | Q3 | Basic Observability |
| Configure all alerts with Slack notifications | M | Q3 | Basic Observability |
| Define SLA and make dashboards for CustomersDot | L | Q3 | Basic Observability |
| Configure deployment notifications and alerts | L | Q3 | Basic Observability |
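
The "minimum uptime/health checks" and "auto-block on staging failure" rows could be prototyped roughly as follows. This is only a sketch under assumptions: the health endpoint URL, the function names, and the retry count are hypothetical and not part of the current CustomersDot setup; the real checks would more likely live in Prometheus alerting and the GitLab CI pipeline.

```python
from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`, or 0 if no response was received."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def is_healthy(url: str,
               probe: Callable[[str], int] = http_status,
               attempts: int = 3) -> bool:
    """Treat the service as healthy if any of `attempts` probes returns 2xx."""
    return any(200 <= probe(url) < 300 for _ in range(attempts))


def may_deploy_to_production(staging_url: str,
                             probe: Callable[[str], int] = http_status) -> bool:
    """Block the production deployment while staging is unhealthy."""
    return is_healthy(staging_url, probe=probe)
```

A CI job could call `may_deploy_to_production` with the staging health URL (hypothetical here) and fail the pipeline when it returns `False`. The `probe` parameter exists so the logic can be exercised without network access.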
This is an outline of the initial steps:
1. Create a new Google Cloud Project for staging `gitlab-subscriptions-staging`
2. Create a new Chef role for the new staging fleet `stg-subscriptions`
   - Keep this role as basic as possible at first
   - No customers cookbook to start with
3. Create a new subscriptions-staging Terraform environment
   - TCP load balancer with health checks and a static IP
   - Start with a single node
   - Bastion host
   - Bastion host Chef role
   - Restrict outgoing as well as incoming traffic
4. Add the customers Chef role to this staging environment
5. Create a new Google Cloud Project for production `gitlab-subscriptions-production`
6. Create a new Chef role for the new production fleet `prd-subscriptions`
7. Create a new subscriptions-production Terraform environment
8. Migrate to the new production fleet
   - Migrate data (database?)
   - Switch DNS
9. Clean up old customers infrastructure
   - Remove old nodes
   - Remove old Chef roles
   - Remove old recipes from cookbook