Refactor `readiness` and `liveness` probes and add them to `sidekiq|web_exporter`

Kamil Trzciński requested to merge add-health-checks-exporter into master

What does this MR do?

This MR:

  1. adds /readiness and /liveness probes to the Sidekiq and Web exporters,
  2. changes the logic of the /readiness probe,
  3. changes the structure of the /readiness payload.

Logic of /readiness

Following @jarv !17955 (comment 224971063):

Kubernetes uses liveness probes to know when to restart a container. If a container is unresponsive—perhaps the application is deadlocked due to a multi-threading defect—restarting the container can make the application more available, despite the defect. It certainly beats paging someone in the middle of the night to restart a container.[1]

Kubernetes uses readiness probes to decide when the container is available for accepting traffic. The readiness probe is used to control which pods are used as the backends for a service. A pod is considered ready when all of its containers are ready. If a pod is not ready, it is removed from service load balancers.

This MR aligns the behaviour of the /readiness probe with that description. The /liveness probe already behaves as described.

For the /readiness probe to be considered successful, all of the following checks need to pass:

  1. Gitlab::HealthChecks::DbCheck,
  2. Gitlab::HealthChecks::Redis::RedisCheck,
  3. Gitlab::HealthChecks::Redis::CacheCheck,
  4. Gitlab::HealthChecks::Redis::QueuesCheck,
  5. Gitlab::HealthChecks::Redis::SharedStateCheck,
  6. any of the Gitlab::HealthChecks::GitalyCheck results (as it checks the state of each shard).

It means that the service is considered operational as long as the database, all Redis instances, and at least one of the Gitaly shards are working.

Each of the above defines a single group of checks. Within a group, at least one of the services needs to be operational for the node to be considered ready.
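
A minimal Ruby sketch of that grouping logic (the Result struct and the sample data are illustrative, not the actual GitLab implementation):

# Hypothetical result type: one entry per probed service; several Gitaly
# shards share the same check name and therefore form one group.
Result = Struct.new(:name, :success, :labels)

results = [
  Result.new('db_check', true, nil),
  Result.new('redis_check', true, nil),
  Result.new('gitaly_check', false, { shard: 'nfs-file01' }),
  Result.new('gitaly_check', true,  { shard: 'nfs-file02' })
]

# A group is healthy when at least one of its results succeeded;
# the node is ready only when every group is healthy.
groups = results.group_by(&:name)
ready = groups.all? { |_name, checks| checks.any?(&:success) }

puts ready # => true: one working shard is enough for the gitaly_check group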

Considerations

/liveness

Consider that /liveness, exposed on a separate exporter endpoint, checks the status of the Unicorn/Puma master process. It means that this checks whether the master process is operational and able to scale workers up/down as needed.

This behaviour differs from that of /-/liveness, which checks the behaviour of a single worker.

/readiness

The /readiness probe checks the state of all related services, but does not yet check the status of the Puma/Unicorn master process.

This behaviour differs from that of /-/readiness, which checks the behaviour of a single worker.
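
For illustration, a minimal Rack sketch of how /liveness and /readiness could be served from a standalone exporter process on its own port; the app structure below is an assumption for this example, not the actual exporter code:

require 'rack'
require 'json'

health_app = proc do |env|
  case env['PATH_INFO']
  when '/liveness'
    # If this endpoint responds at all, the master process is alive;
    # it says nothing about the state of individual workers.
    [200, { 'Content-Type' => 'application/json' }, [{ status: 'ok' }.to_json]]
  when '/readiness'
    # Here the exporter would run the grouped service checks described above.
    [200, { 'Content-Type' => 'application/json' }, [{ status: 'ok' }.to_json]]
  else
    [404, { 'Content-Type' => 'text/plain' }, ['not found']]
  end
end

# Served on a separate port from the web workers, e.g.:
# Rack::Handler::WEBrick.run(health_app, Port: 8083)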

Fix for /readiness payload

Old payload for /readiness

Currently, the /readiness payload does not properly support the Gitaly check:

{ 
   "db_check":{ 
      "status":"ok"
   },
   "gitaly_check":{ 
      "status":"ok",
      "labels":{ 
         "shard":"nfs-file40"
      }
   }
}

Because the group name gitaly_check is used as a single key, the shard results overwrite each other and the payload ends up showing the status of a random shard instead of the status of each shard.
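
A small sketch of the failure mode, with made-up shard data: when the results are merged into a flat hash keyed by check name, later entries silently overwrite earlier ones:

require 'json'

results = [
  { name: 'gitaly_check', payload: { status: 'ok', labels: { shard: 'nfs-file01' } } },
  { name: 'gitaly_check', payload: { status: 'ok', labels: { shard: 'nfs-file02' } } }
]

# Flat hash: both results map to the same 'gitaly_check' key.
flat = results.map { |r| [r[:name], r[:payload]] }.to_h
puts JSON.pretty_generate(flat)
# Only nfs-file02 survives; nfs-file01 has been silently dropped.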

New payload for /readiness

This MR changes this payload to properly support groups:

{ 
   "status":"ok",
   "db_check":[ 
      { 
         "status":"ok"
      }
   ],
   "gitaly_check":[ 
      { 
         "status":"ok",
         "labels":{ 
            "shard":"nfs-file01"
         }
      },
      { 
         "status":"ok",
         "labels":{ 
            "shard":"nfs-file02"
         }
      }
   ]
}
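
A sketch of producing the group-aware shape, reusing the illustrative result hashes from above: grouping results by check name and emitting an array per group preserves every shard:

require 'json'

results = [
  { name: 'db_check',     payload: { status: 'ok' } },
  { name: 'gitaly_check', payload: { status: 'ok', labels: { shard: 'nfs-file01' } } },
  { name: 'gitaly_check', payload: { status: 'ok', labels: { shard: 'nfs-file02' } } }
]

grouped = results.group_by { |r| r[:name] }
                 .transform_values { |rs| rs.map { |r| r[:payload] } }

# Top-level status: every group needs at least one successful check.
ready = grouped.values.all? { |checks| checks.any? { |c| c[:status] == 'ok' } }

puts JSON.pretty_generate({ status: ready ? 'ok' : 'failed' }.merge(grouped))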

Next step

An additional check for /readiness will be added in a follow-up MR: !17962 (merged).

That check will verify that there is at least a single worker able to process requests. This will make /readiness behave as a check that can tell whether the web server can accept and process web traffic immediately.

References

Based on: !17953 (merged) and omnibus-gitlab!3650 (merged)

Part of: #30201 (closed)

Related to: #30037 (closed)

Does this MR meet the acceptance criteria?

Conformity
