Refactor `readiness` and `liveness` probes and add them to `sidekiq|web_exporter`
What does this MR do?
This MR:
- adds `/readiness` and `/liveness` probes to the `Sidekiq` and `Web` exporters,
- changes the logic of the `/readiness` probe,
- changes the structure of the `/readiness` payload.
Logic of /readiness
Following @jarv !17955 (comment 224971063):
Kubernetes uses liveness probes to know when to restart a container. If a container is unresponsive—perhaps the application is deadlocked due to a multi-threading defect—restarting the container can make the application more available, despite the defect. It certainly beats paging someone in the middle of the night to restart a container.[1]
Kubernetes uses readiness probes to decide when the container is available for accepting traffic. The readiness probe is used to control which pods are used as the backends for a service. A pod is considered ready when all of its containers are ready. If a pod is not ready, it is removed from service load balancers.
This aligns the behavior of the `/readiness` probe with the description above. The `/liveness` probe already behaves as described.
For the `/readiness` endpoint to report success, all of the following checks need to pass:
- `Gitlab::HealthChecks::DbCheck`,
- `Gitlab::HealthChecks::Redis::RedisCheck`,
- `Gitlab::HealthChecks::Redis::CacheCheck`,
- `Gitlab::HealthChecks::Redis::QueuesCheck`,
- `Gitlab::HealthChecks::Redis::SharedStateCheck`,
- any of `Gitlab::HealthChecks::GitalyCheck` (as it checks the state of each shard).

This means the service is considered operational as long as the database, all Redis instances, and at least one of the Gitaly shards are working.

Each of the above defines a single group of checks: within a group, at least one of the checked services needs to be operational for the node to be considered ready.
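The grouping rule above can be sketched as follows. This is an illustrative helper with hypothetical names, not the actual `Gitlab::HealthChecks` code: each key is a group, a group passes when at least one of its checks passes, and readiness requires every group to pass.

```ruby
# Hypothetical check groups; each lambda stands in for a real health check.
CHECK_GROUPS = {
  'db_check'     => [-> { true }],              # Gitlab::HealthChecks::DbCheck
  'redis_check'  => [-> { true }],              # Gitlab::HealthChecks::Redis::RedisCheck
  'gitaly_check' => [-> { true }, -> { false }] # one check per Gitaly shard
}.freeze

def ready?(groups)
  # Every group must contain at least one passing check.
  groups.values.all? { |checks| checks.any?(&:call) }
end

ready?(CHECK_GROUPS) # => true: gitaly_check has one passing shard
```

Note that a group with several members (like `gitaly_check`) passes even if some of its checks fail, while a single-member group (like `db_check`) must pass outright.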
Considerations
/liveness
Note that `/liveness`, served on a separate endpoint, checks the status of the master Unicorn/Puma process.
This means it checks whether the master process is operational and able to scale workers up or down as needed.
This behaviour is different from that of `/-/liveness`, which checks the behaviour of a single worker.
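The master-process liveness check can be sketched as a bare Rack-style handler. This is a hypothetical illustration, not the actual exporter code: because the handler runs in the master process, merely producing a response demonstrates that the master is alive and able to manage its workers.

```ruby
# Minimal Rack-style liveness handler (illustrative sketch).
liveness = lambda do |_env|
  # Responding at all proves the master process is responsive.
  [200, { 'Content-Type' => 'application/json' }, ['{"status":"ok"}']]
end

status, _headers, body = liveness.call({})
# status => 200, body => ['{"status":"ok"}']
```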
/readiness
The `/readiness` probe checks the state of all related services, but does not yet check the status of the Puma/Unicorn master process.
This behaviour is different from that of `/-/readiness`, which checks the behaviour of a single worker.
Fix for /readiness payload
Old payload for /readiness
Currently, the `/readiness` payload does not properly represent the Gitaly checks:
```json
{
  "db_check": {
    "status": "ok"
  },
  "gitaly_check": {
    "status": "ok",
    "labels": {
      "shard": "nfs-file40"
    }
  }
}
```
Because all shards share the `gitaly_check` group name, the entries overwrite each other and the payload shows the status of a random shard instead of the status of each shard.
New payload for /readiness
This MR changes this payload to properly support groups:
```json
{
  "status": "ok",
  "db_check": [
    {
      "status": "ok"
    }
  ],
  "gitaly_check": [
    {
      "status": "ok",
      "labels": {
        "shard": "nfs-file01"
      }
    },
    {
      "status": "ok",
      "labels": {
        "shard": "nfs-file02"
      }
    }
  ]
}
```
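A minimal sketch of how per-check results could be grouped into this array-based payload. The data and helper below are hypothetical illustrations of the grouping idea, not the MR's actual implementation:

```ruby
require 'json'

# Hypothetical per-check results, one entry per probe run
# (multiple entries for gitaly_check, one per shard).
results = [
  { name: 'db_check',     status: 'ok' },
  { name: 'gitaly_check', status: 'ok', labels: { shard: 'nfs-file01' } },
  { name: 'gitaly_check', status: 'ok', labels: { shard: 'nfs-file02' } }
]

# Group results by check name so every shard keeps its own entry.
grouped = results.group_by { |r| r[:name] }.transform_values do |rs|
  rs.map { |r| r.reject { |k, _| k == :name } }
end

# Overall status is "ok" only if every individual check passed.
overall = results.all? { |r| r[:status] == 'ok' } ? 'ok' : 'failed'
payload = { status: overall }.merge(grouped)

puts JSON.pretty_generate(payload)
```

Grouping into arrays keyed by check name is what prevents one shard's result from overwriting another's, which was the defect in the old payload.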
Next step
The additional check for /readiness will be added in this MR: !17962 (merged).
The check will verify that there is at least one worker able to process requests. This will make `/readiness` behave as a check that tells whether the web server can accept and process web traffic immediately.
References
Based on: !17953 (merged) and omnibus-gitlab!3650 (merged)
Part of: #30201 (closed)
Related to: #30037 (closed)