[Operationalization] Build Runbooks for OpenBao
Overview
As part of the GitLab Secrets Manager Beta, we need runbooks for the OpenBao service.
The runbooks should give a high-level overview of the service, common troubleshooting steps, access information, infrastructure components, references to the documentation we've built, etc.
Links
- Vault runbooks: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/vault/vault.md
- OpenBao production readiness: https://gitlab.com/gitlab-com/gl-infra/readiness/-/tree/master/openbao
- Suggested structure: https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs#gitlab-runbooks
- Directory structure: https://gitlab.com/gitlab-com/gl-infra/readiness#directory-structure
- Template: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/template-service-overview.md (same as the MR template)
- https://runbooks.gitlab.com/runway/cloudsql_backup_restore/
There's also a simplified template: just a list of sections with no descriptions. It is used in many runbooks, for example:
- https://runbooks.gitlab.com/sast-service/
- https://runbooks.gitlab.com/secret-revocation/
- https://runbooks.gitlab.com/ai-gateway/
Implementation plan
Cover all sections of the service template.
We might leverage the existing Vault runbooks, though most runbooks tend to be simpler than the Vault ones, especially for services deployed using Runway.
Example: https://runbooks.gitlab.com/sast-service/ (a stateless service).
Service template
Summary
- Service description
- Architecture summary (with optional system diagram)
- Upstream and downstream dependencies
- The current location of the service in all environments (VM name / k8s cluster and namespace / external provider)
- Upstream dependencies
- Downstream dependencies
Observability
- Prominent link to primary dashboard
- SLO Dashboard if applicable
- Embedded Grafana metric if applicable
- Additional dashboard links or queries
- Links to logging queries
- Location of local or pod logs
Troubleshooting
- Basic Troubleshooting order
- Common problems
- Link to query for recent changes
- Access instructions (or link to standard procedure)
- CLI commands for checking status
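As an illustration of what a status check in this section could look like, here is a minimal Python sketch that queries the `sys/health` HTTP endpoint and maps the response code to a state. It assumes OpenBao keeps the default health status codes it inherited from Vault (200, 429, 501, 503) and that the endpoint is reachable without a token. The base URL in the usage note is hypothetical, and the helper names `describe_health` and `check_health` are made up for this example. Where the CLI is available, `bao status` should report similar information.

```python
import urllib.error
import urllib.request

# Assumed default status codes for GET /v1/sys/health (no query parameters).
HEALTH_STATES = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    501: "not initialized",
    503: "sealed",
}


def describe_health(status_code: int) -> str:
    """Translate a sys/health HTTP status code into a readable state."""
    return HEALTH_STATES.get(status_code, f"unexpected status {status_code}")


def check_health(base_url: str, timeout: float = 5.0) -> str:
    """Query the health endpoint of the given server and describe its state."""
    url = base_url.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return describe_health(resp.status)
    except urllib.error.HTTPError as err:
        # Non-2xx responses (standby, sealed, ...) arrive as HTTPError.
        return describe_health(err.code)
    except urllib.error.URLError as err:
        # Connection-level failure: DNS, refused connection, timeout.
        return f"unreachable: {err.reason}"
```

Usage, with a placeholder address: `print(check_health("https://openbao.example.internal:8200"))`. The actual runbook would replace this with the real access path once the access instructions are documented.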
Common Operations
- CLI commands for common procedures
Alerts
- A short note for each alert that links to this page (with anchor tags)
Service Changes
- Location of configuration repo
- Location of helm repo or chef cookbooks
Backups and recovery
- Location of persistent data
- Location of persistent data backups
- Short instructions for restore, or link to full document
References
- Additional links to documentation
- Link to readiness review
Other ideas
- Owner or group
- Slack channel?
Edited by Fabien Catteau