Revise SRE onboarding

We have a number of new hires joining soon, so we want to put together onboarding materials and gather thoughts on how to structure the process.

This issue tracks the topics and materials as we discover them. They can then be fed back into the runbooks and other resources.

Proposed onboarding sessions

Application architecture
- Architecture diagram
- GCP overview of VMs
- Life of a request: web (tutorial)
- Life of a request: git (tutorial)
- Exploration of hosts over SSH
Kubernetes at GitLab
- Which workloads are actually running on k8s?
- Image building
- Deployment pipeline to k8s
- Regional vs Zonal clusters, node pools, taints
- Resources, requests, limits
- Low-level: How do resource limits translate to kernel concepts like cgroups and namespaces?
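
For the low-level session, a small sketch of the arithmetic may help. It assumes cgroup v1 file names (`cpu.cfs_quota_us`, `cpu.shares`) and the default 100 ms CFS period; on cgroup v2 the equivalent files are `cpu.max` and `cpu.weight`. The function names are hypothetical, and the conversion follows my understanding of what the kubelet does rather than code taken from it. A container without a CPU limit gets no quota at all, which this sketch does not model.

```python
# Sketch: how Kubernetes CPU requests/limits map onto cgroup v1 CPU settings.
CFS_PERIOD_US = 100_000   # default CFS scheduling period (100 ms)
MIN_QUOTA_US = 1_000      # smallest quota assumed to be set for a limit
SHARES_PER_CPU = 1024     # cpu.shares granted per full CPU
MIN_SHARES = 2            # lower bound for cpu.shares


def parse_millicpu(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "1.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def cpu_limit_to_cfs_quota_us(limit: str, period_us: int = CFS_PERIOD_US) -> int:
    """CPU limit -> cpu.cfs_quota_us: CPU time allowed per CFS period."""
    return max(parse_millicpu(limit) * period_us // 1000, MIN_QUOTA_US)


def cpu_request_to_shares(request: str) -> int:
    """CPU request -> cpu.shares: relative weight when CPUs are contended."""
    return max(parse_millicpu(request) * SHARES_PER_CPU // 1000, MIN_SHARES)
```

For example, a limit of `500m` allows 50 ms of CPU time in every 100 ms period, which is why a bursty container can be throttled even when the node has idle CPUs.
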
Diagnosis with Kibana
- Logging pipeline architecture
- Which indices correspond to which services?
- Finding things in logs, filtering on fields, looking at field distributions
- Correlation across indices via correlation_id (tracing)
- Cross-links from Grafana service dashboards => Kibana
- Visualization, and time-series top-k queries
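
As a starting point for the correlation exercise, here is a minimal sketch of an Elasticsearch query body that filters on a correlation ID within a time window. The field name `json.correlation_id.keyword` and the one-hour window are assumptions for illustration; check the actual field mapping in the index before relying on it. The same body could be sent to a search across several indices using a wildcard index pattern.

```python
from typing import Any, Dict


def correlation_query(
    correlation_id: str,
    field: str = "json.correlation_id.keyword",  # assumed field name
    window: str = "now-1h",                      # assumed time window
) -> Dict[str, Any]:
    """Build a query body matching every log line with one correlation ID."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Exact match on the correlation ID.
                    {"term": {field: correlation_id}},
                    # Restrict the search to recent documents.
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        },
        # Oldest first, so the request can be read in order across services.
        "sort": [{"@timestamp": {"order": "asc"}}],
    }
```
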
Prometheus
- How metrics get made
- Grafana service dashboards
- SLO metrics
- Thanos architecture
- PromQL
- Alertmanager
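
For the SLO metrics topic, a worked example of the underlying arithmetic may be useful. This is a generic sketch of error ratio and error-budget burn rate, not the definition used by our dashboards or alerts; the function names and the 99.9% target are assumptions.

```python
def error_ratio(errors: float, total: float) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def burn_rate(ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A value of 1.0 means errors arrive at exactly the rate the SLO allows;
    higher values exhaust the budget proportionally faster.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    return ratio / budget


# 50 failures out of 10,000 requests against an assumed 99.9% SLO
# burns the error budget roughly five times faster than allowed.
example = burn_rate(error_ratio(50, 10_000), slo_target=0.999)
```
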
Patroni and Postgres at GitLab
- What are locks?
- What does it mean when tuples are dead?
- How does ANALYZE end up causing incidents so often?
- Can we run ANALYZE per table/column?
- True or false: the more a table changes, the more aggressively ANALYZE runs?
- If true, in which cases would abuse cause ANALYZE to consume high IO?
- Which table does each kind of abuse affect?
- Example: CI abuse vs issues/comments/discussions abuse.
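
As background for the ANALYZE questions: per the PostgreSQL documentation, autovacuum triggers an automatic ANALYZE on a table once the number of tuples inserted, updated, or deleted since the last ANALYZE exceeds `autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples`. The sketch below uses the upstream defaults (50 and 0.1); our production settings, including any per-table overrides, may differ, and the function names are hypothetical.

```python
def autoanalyze_threshold(
    reltuples: float,
    threshold: int = 50,        # autovacuum_analyze_threshold (upstream default)
    scale_factor: float = 0.1,  # autovacuum_analyze_scale_factor (upstream default)
) -> float:
    """Number of changed tuples after which autovacuum schedules an ANALYZE."""
    return threshold + scale_factor * reltuples


def needs_autoanalyze(
    changes_since_analyze: int,
    reltuples: float,
    threshold: int = 50,
    scale_factor: float = 0.1,
) -> bool:
    """True when the table has changed enough to trigger an automatic ANALYZE."""
    return changes_since_analyze > autoanalyze_threshold(
        reltuples, threshold, scale_factor
    )
```

With these defaults, a table of one million rows is re-analyzed after roughly 100,050 row changes, so a burst of writes against a single table can trigger ANALYZE repeatedly. A manual `ANALYZE` can also be limited to one table, or to specific columns of a table.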

More areas to cover

Sidekiq
Gitaly
Redis
Pages
CI
Canary
Deployer
Chef & Terraform
Being on-call
Blameless incident reviews

Edited Apr 22, 2021 by Igor