Improve observability and logging of CustomersDot (#390) · Epics · GitLab Infrastructure Team

Improve observability and logging of CustomersDot

Progress tracking: https://gitlab.com/gitlab-org/fulfillment-meta/-/issues/309

---

This epic tracks work related to the [Fulfillment Engineering Allocation: "Improve availability of CustomersDot by migrating from Azure to GCP"](https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/82955)

---

The goal of this Epic is to organize a series of issues that will solve several problems:

1. Remove infrastructure from Azure.
2. Solve several problems listed in [this epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/334).
3. Make our infrastructure more consistent, secure, and robust.

The following roadmap is scheduled for Q3 (2021):

| Work | Guesstimate | Status | Theme |
|------|-------------|--------|-------|
| Create a new Google Cloud Project for staging | M | Done | Infrastructure Setup |
| Use Puma as default web server | L | WIP, Q2 | Infrastructure Setup |
| Configure basic Ansible playbook | L | WIP, Q2 | Infrastructure Setup |
| Configure Ansible CustomersDot environment | M | WIP, Q2 | Infrastructure Setup |
| Deploy to staging using Ansible | L | Q2-Q3 | Infrastructure Setup |
| Configure Ansible CustomersDot deployment | L | Q2-Q3 | Infrastructure Setup |
| Add CustomersDot deployment to GitLab CI | L | Q3 | Infrastructure Setup |
| Configure fail2ban, Cloudflare | L | Q3 | Infrastructure Setup |
| Create a new Google Cloud Project for production | L | Q3 | Infrastructure Setup |
| Replicate current deployment blocking (auto-block on staging failure) | L | Q3 | Infrastructure Setup |
| Migrate to new production fleet and testing | XXL | Q3 | Infrastructure Setup |
| Clean up old customers infra | M | Q3 | Infrastructure Setup |
| Static IP address requirements | L | Q3 | Infrastructure Setup |
| Use Ansible/Ansistrano auto-rollback feature | L | Q3 | Infrastructure Setup |
| Add Prometheus monitoring | M | Q3 | Basic Observability |
| Add minimum uptime/health checks and alerts | L | Q3 | Basic Observability |
| Add logs to Kibana | L | Q3 | Basic Observability |
| Add current API health and alerts (including Zuora, SFDC…) | L | Q3 | Basic Observability |
| Configure all alerts with Slack notifications | M | Q3 | Basic Observability |
| Define SLA and make dashboards for CustomersDot | L | Q3 | Basic Observability |
| Configure deployment notifications and alerts | L | Q3 | Basic Observability |

This is an outline of the initial steps:

1. Create a new Google Cloud Project for staging `gitlab-subscriptions-staging`
2. Create a new Chef role for the new staging fleet `stg-subscriptions`
   - Start this role as basic as possible
   - No customers cookbook to start with
3. Create a new subscriptions-staging Terraform environment
   - TCP load balancer with health checks and a static IP
   - Start with a single node
   - Bastion host
   - Bastion host Chef role
   - Restrict outgoing as well as incoming traffic
4. Add the customers Chef role to this staging environment
5. Create a new Google Cloud Project for production `gitlab-subscriptions-production`
6. Create a new Chef role for the new production fleet `prd-subscriptions`
7. Create a new subscriptions-production Terraform environment
8. Migrate to the new production fleet
   - Migrate data (database?)
   - Switch DNS
9. Clean up old customers infrastructure
   - Remove old nodes
   - Remove old Chef roles
   - Remove old recipes from cookbook
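
The outline calls for a TCP load balancer with health checks in front of the staging nodes. As a rough illustration of what a TCP-level health check does, here is a minimal Python sketch. It is not the load balancer's actual implementation: the function names `tcp_health_check` and `healthy_nodes`, the timeout value, and the node addresses in the usage note are all hypothetical and chosen only for this example.

```python
from __future__ import annotations

import socket


def tcp_health_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    This only proves the port accepts connections; it says nothing about
    whether the application behind it is answering requests correctly.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def healthy_nodes(
    nodes: list[tuple[str, int]], timeout: float = 2.0
) -> list[tuple[str, int]]:
    """Return the subset of (host, port) nodes that pass the TCP check."""
    return [node for node in nodes if tcp_health_check(*node, timeout=timeout)]
```

For example, `healthy_nodes([("10.0.0.5", 443), ("10.0.0.6", 443)])` would return only the nodes that currently accept connections (the addresses are placeholders). A TCP check like this is the coarsest signal available, which is presumably why the roadmap also lists separate uptime/health checks and API health alerts under Basic Observability.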
