Complete Readiness review for API service running on Kubernetes clusters
Use this issue to track the work needed to create the readiness review for running our API fleet on Kubernetes.
The following section is based on the readiness review issue template, adapted where necessary for the move of the API service to K8s.
Operational Readiness Guide for Infrastructure Services
The goal of the readiness review is to identify gaps, create issues for them, link them in the review issue, and drive them to resolution.
The readiness review should mostly link to design docs or runbooks for reference. We recommend writing most of the information that comes out of the readiness review into the README.md in the service's runbook directory. Runbooks should be the main source of truth. Design docs and readiness reviews tend to be point-in-time snapshots and should not duplicate information in runbooks.
Summary
-
Short overview mentioning purpose of the service, dependencies and owners - The API service enables internal and external clients to interact with GitLab.com via HTTP(S) endpoints without using the web UI. It is mainly used to drive automation with GitLab and is critical to the functioning of GitLab.com.
-
Explain the scope of this review and what is explicitly out of scope. - This review is scoped to the migration of the existing API service to K8s, so it focuses on the migration procedure. Because the API service has never had a readiness review, we should also add some review items for the service itself.
Architecture
-
Runbook README.md contains an architecture overview (provide link: MR) -
Runbook README.md contains a logical architecture diagram -
Runbook README.md contains a physical architecture diagram (optional) -
Runbook README.md provides enough information for a reviewer to get an understanding of the service and its components, dependencies and interactions
-
Documentation
-
is there a blueprint/design doc? (provide link) - Henri: There is no design doc for the API service. There are design docs for K8s, which I will link in the API runbook (MR). For the migration of the API service to K8s, the Epic can serve as a kind of design doc. I will also link this in a runbook.
-
do we have runbooks? (provide links) - Henri: There are no runbooks dedicated to the API so far. There is an MR for updating the API README.md. We should have runbooks for basics like dialing up capacity, tuning autoscaling or health checks, and troubleshooting pointers. Issue: #1568 (closed)
-
are runbooks up-to-date? N/A -
where else is documentation for this service located? -
is there a service catalog entry? Yes, here. -
does the service catalog list all dependencies? -
does the service catalog link to all existing documentation? -
does the service catalog link to the readiness review?
-
Performance
-
is there a runbook section with performance characteristics? (it should cover the following considerations, provide link) - Yes, the points below are covered in this MR and this issue. -
current requests/s (min, max, average), latency characteristics, saturation, ... -
throttling/limits -
bottlenecks (cpu-bound, memory-bound, ...) -
is there documentation on how/why we set certain config options that affect performance?
-
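As a rough illustration of the "requests/s (min, max, average)" item above, the following sketch summarizes a series of request-rate samples. The function name `summarize_rps` and the sample values are hypothetical and are not taken from the API service's real metrics; in practice these figures would come from the monitoring stack.

```python
from statistics import mean


def summarize_rps(samples):
    """Return (min, max, average) for a list of requests-per-second samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return min(samples), max(samples), mean(samples)


# Hypothetical per-minute samples, not real API traffic figures.
example_samples = [1200.0, 1850.0, 1500.0, 2100.0]
low, high, avg = summarize_rps(example_samples)
print(f"min={low} max={high} avg={avg}")
```

Recording these three numbers in the runbook, together with the time window they cover, gives reviewers a baseline to compare against after the migration.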
Scalability
-
is there a runbook section with scalability information? (it should cover the following considerations, provide link) - Yes, the points below are covered in this MR. -
expected load in the future -
how can we scale to the expected load? -
can it be scaled across availability zones or regions? -
are there scalability limitations? -
are we doing performance tests? Covered in #1592 (closed)
-
Availability
-
is there a runbook section covering availability considerations? (it should cover the following topics, provide link) -
failure modes of this service, blast radius, how long does it take to recover? -
what happens on outage of services we are depending on? -
Availability Zone (AZ) outage -
split brain between AZs -
region outage -
other external dependencies that could affect availability -
what other services are affected by an outage of this service? -
is there an existing Recovery Time Objective (RTO) documented? How do we plan to achieve it? -
do we have an error budget? -
are we doing disaster recovery tests? -
is there a failover procedure? Do we have runbook instructions?
-
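To make the error budget question above concrete, here is a minimal sketch of the usual arithmetic: the budget is the share of a time window in which the service is allowed to miss its availability target. The 99.95% target and the 30-day window are assumptions chosen for illustration, not the documented SLO of the API service, and `error_budget_minutes` is a hypothetical helper.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Return the allowed minutes of unavailability for an availability SLO.

    slo_target is a fraction between 0 and 1 (for example 0.9995).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


# Assumed target of 99.95% over 30 days: roughly 21.6 minutes of budget.
budget = error_budget_minutes(0.9995)
print(f"error budget: {budget:.1f} minutes per 30 days")
```

Comparing this budget with the expected recovery time for each failure mode shows quickly whether a single incident would exhaust it.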
Durability
-
is there a runbook section covering durability considerations? (it should cover the following topics, provide link) -
possible failure modes and how to recover from them -
deletion by accident -
disk failure -
data corruption -
GCP outage - ...
-
is there an existing Recovery Point Objective (RPO) documented? How do we plan to achieve it? - Backups
-
are we testing backup replay? -
are we monitoring backups? -
what is the backup retention policy? -
are backups in a different logical and physical environment?
-
-
Security/Compliance
-
is there a runbook section covering security considerations? (it should cover the following topics, provide link) -
list of access roles -
Who has which role? -
How do we protect access? -
Auditability of access -
Which entrypoints need protection? -
How are we applying security updates? (OS and service) -
Regulations/Policies applying? (PII, SOX, ...) -
how do we protect customer data? -
encryption at rest? -
could customer data leak in logs? -
how long do we keep logs?
-
-
-
is someone from security included for the readiness review?
Monitoring
-
is there a runbook section covering monitoring? (it should cover the following topics, provide link) -
list key SLIs. Are we monitoring them? -
list SLOs. Are we monitoring/alerting on them? -
list of relevant alerts -
are alerts actionable and linking to a runbook? -
do we have a metrics catalog entry for the service? (provide link) -
list of relevant dashboards -
list of relevant logs
-