Refactor `readiness` and `liveness` probes and add them to `sidekiq|web_exporter`
## What does this MR do?
This MR:

- adds `/readiness` and `/liveness` probes to the Sidekiq and Web exporters,
- changes the logic of the `/readiness` probe,
- changes the structure of the `/readiness` payload.
## Logic of `/readiness`

Following @jarv !17955 (comment 224971063):
> Kubernetes uses liveness probes to know when to restart a container. If a container is unresponsive—perhaps the application is deadlocked due to a multi-threading defect—restarting the container can make the application more available, despite the defect. It certainly beats paging someone in the middle of the night to restart a container. [1]
>
> Kubernetes uses readiness probes to decide when the container is available for accepting traffic. The readiness probe is used to control which pods are used as the backends for a service. A pod is considered ready when all of its containers are ready. If a pod is not ready, it is removed from service load balancers.
This MR aligns the behavior of the `/readiness` probe; the `/liveness` probe already behaves as described.

For the `/readiness` probe to be considered successful, all of the following checks need to pass:
- `Gitlab::HealthChecks::DbCheck`,
- `Gitlab::HealthChecks::Redis::RedisCheck`,
- `Gitlab::HealthChecks::Redis::CacheCheck`,
- `Gitlab::HealthChecks::Redis::QueuesCheck`,
- `Gitlab::HealthChecks::Redis::SharedStateCheck`,
- any of `Gitlab::HealthChecks::GitalyCheck` (as it checks the state of each shard).
This means the service is considered operational as long as the database, all Redis instances, and at least one of the Gitaly endpoints work.

Each of the above defines a single group of checks: within a group, at least one of the services needs to be operational for the node to be considered ready.
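The group rule above can be sketched as follows (a minimal illustration with hypothetical names and data, not the actual GitLab implementation):

```ruby
# Readiness rule sketch: a node is ready when every check group has at
# least one passing check. Group names and results below are hypothetical.
def ready?(groups)
  groups.values.all? do |checks|
    checks.any? { |check| check[:status] == 'ok' }
  end
end

# All groups pass: the db is up, and one of the two Gitaly shards is up.
results = {
  db_check: [{ status: 'ok' }],
  gitaly_check: [{ status: 'failed' }, { status: 'ok' }]
}

puts ready?(results) # => true
```

A single group with all checks failing (e.g. the database down) makes the whole node not ready, while one healthy Gitaly shard keeps the Gitaly group passing.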
## Considerations
### `/liveness`

Note that `/liveness`, run on a separate endpoint, checks the status of the master Unicorn/Puma process. In other words, it checks whether the master process is operational and able to scale workers up/down as needed.

This behaviour is different to that of `/-/liveness`, which checks the behaviour of a single worker.
### `/readiness`

The `/readiness` probe checks the state of all related services, but does not yet check the status of the Puma/Unicorn master process.

This behaviour is different to that of `/-/readiness`, which checks the behaviour of a single worker.
## Fix for `/readiness` payload

### Old payload for `/readiness`

Currently, the `/readiness` payload does not properly support the Gitaly check:
```json
{
  "db_check": {
    "status": "ok"
  },
  "gitaly_check": {
    "status": "ok",
    "labels": {
      "shard": "nfs-file40"
    }
  }
}
```
Because all Gitaly shards share the `gitaly_check` group name, the entries overwrite each other and the payload shows the status of a random shard instead of the status of each shard.
### New payload for `/readiness`

This MR changes the payload to properly support groups:
```json
{
  "status": "ok",
  "db_check": [
    {
      "status": "ok"
    }
  ],
  "gitaly_check": [
    {
      "status": "ok",
      "labels": {
        "shard": "nfs-file01"
      }
    },
    {
      "status": "ok",
      "labels": {
        "shard": "nfs-file02"
      }
    }
  ]
}
```
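The grouped payload could be built along these lines (a hedged sketch with hypothetical method and data names, not the MR's actual serializer):

```ruby
# Sketch of grouped payload serialization (hypothetical, not the MR's code):
# every check result is collected into an array under its group name, so
# multiple Gitaly shards no longer overwrite one another.
def readiness_payload(results)
  groups = results.each_with_object({}) do |(name, checks), hash|
    hash[name] = checks.map do |check|
      entry = { 'status' => check[:status] }
      entry['labels'] = check[:labels] if check[:labels]
      entry
    end
  end

  # Overall status: every group needs at least one passing check.
  overall = groups.values.all? { |g| g.any? { |c| c['status'] == 'ok' } }
  { 'status' => overall ? 'ok' : 'failed' }.merge(groups)
end
```

With two healthy shards this yields a `gitaly_check` array with one entry per shard, matching the new payload shape above.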
## Next step

An additional check for `/readiness` will be added in !17962 (merged). That check will verify that there is at least a single worker able to process requests, making `/readiness` behave as a check that can tell whether the web server can accept and process web traffic immediately.
## References

- Based on: !17953 (merged) and omnibus-gitlab!3650 (merged)
- Part of: #30201 (closed)
- Related to: #30037 (closed)