Revise SRE onboarding
We have a number of new hires joining soon, so we want to put together onboarding materials and gather thoughts on how to structure the process.
This issue tracks the topics and materials as we discover them. They can then be fed back into the runbooks and other resources.
Proposed onboarding sessions
- Application architecture
  - Architecture diagram
  - GCP overview of VMs
  - Life of a request: web (tutorial)
  - Life of a request: git (tutorial)
  - Exploration of hosts over SSH
- Kubernetes at GitLab
  - Which workloads are actually running on k8s?
  - Image building
  - Deployment pipeline to k8s
  - Regional vs zonal clusters, node pools, taints
  - Resources, requests, limits
    - Low-level: How do resource limits translate to kernel concepts like cgroups and namespaces?
- Diagnosis with Kibana
  - Logging pipeline architecture
  - Which indices correspond to which services?
  - Finding things in logs, filtering on fields, looking at field distributions
  - Correlation across indices via correlation_id (tracing)
  - Cross-links from Grafana service dashboards => Kibana
  - Visualization, and time-series top-k queries
- Prometheus
  - How metrics get made
  - Grafana service dashboards
  - SLO metrics
  - Thanos architecture
  - PromQL
  - Alertmanager
- Patroni and Postgres at GitLab
  - What are locks?
  - What does it mean when tuples are dead?
  - How does ANALYZE cause incidents so frequently?
    - Can we run ANALYZE per table/column?
    - The more a table changes, the more aggressive ANALYZE would be: true or false?
      - If true, in what cases would abuse cause ANALYZE to consume high IO?
    - Which table does each kind of abuse affect?
      - Example: CI abuse vs issues/comments/discussions abuse.
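
As a possible starting point for the Analyze questions above, here is a minimal sketch of how the autovacuum daemon decides when to auto-analyze a table. It assumes the stock PostgreSQL defaults (`autovacuum_analyze_threshold = 50`, `autovacuum_analyze_scale_factor = 0.1`); our production settings and any per-table overrides may differ, and the function names are made up for illustration.

```python
# Sketch of the autoanalyze trigger rule described in the PostgreSQL docs:
#   analyze threshold = base threshold + scale factor * number of tuples
# Defaults below are the stock PostgreSQL values, NOT necessarily ours.

def autoanalyze_threshold(reltuples: float,
                          base_threshold: int = 50,
                          scale_factor: float = 0.1) -> float:
    """Number of changed tuples after which autovacuum schedules an ANALYZE."""
    return base_threshold + scale_factor * reltuples


def needs_autoanalyze(changes_since_analyze: int,
                      reltuples: float,
                      base_threshold: int = 50,
                      scale_factor: float = 0.1) -> bool:
    """True once inserts/updates/deletes since the last ANALYZE exceed the threshold."""
    threshold = autoanalyze_threshold(reltuples, base_threshold, scale_factor)
    return changes_since_analyze > threshold


if __name__ == "__main__":
    # Hypothetical table sizes, purely for illustration.
    for rows in (1_000, 1_000_000, 100_000_000):
        print(f"{rows:>11} rows -> analyze after "
              f"{autoanalyze_threshold(rows):,.0f} changed tuples")
```

Under this rule, a table that changes faster crosses its threshold sooner and so gets analyzed more often, which is one angle to explore when discussing which kinds of abuse hit which tables.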
More areas to cover
- Sidekiq
- Gitaly
- Redis
- Pages
- CI
- Canary
- Deployer
- Chef & Terraform
- Being on-call
- Blameless incident reviews
Edited by Igor