Provide details of failed container health checks in the logs.
Goal
The logs lack information about container startup problems when the health check does not pass and the container ends up in the "unhealthy" state. For example:
2022/07/02 17:56:22 restore.go:217: [INFO] Running container: dblab_lr_c7kbqjci58nabq0ml8dg. ID: 32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113
2022/07/02 17:56:23 restore.go:225: [INFO] Waiting for container readiness
2022/07/02 17:56:23 tools.go:285: [INFO] Check container readiness: 32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113
2022/07/02 17:56:49 tools.go:343: [INFO] Container logs:
2022/07/02 17:56:49 tools.go:371: [INFO] Removing container ID: 32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113
2022/07/02 17:57:19 tools.go:377: [INFO] Container "32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113" has been stopped
2022/07/02 17:57:19 tools.go:388: [INFO] Container "32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113" has been removed
2022/07/02 17:57:19 telemetry.go:26: [DEBUG] Send telemetry event {c7kbqjci58nabq0ml8dg alert {refresh_failed Failed to run full-refresh}}
2022/07/02 17:57:20 retrieval.go:410: [ERROR] Failed to run full-refresh failed to readiness check: container health check failed
TODO / How to implement
Add details when the container state is "unhealthy" (https://gitlab.com/postgres-ai/database-lab/-/blob/v3.1.0/engine/internal/retrieval/engine/postgres/tools/tools.go#L304).
We should probably collect the health check log entries: resp.State.Health.Log. A minimal sketch of that idea follows, assuming the Docker Go SDK (github.com/docker/docker/client); the function name and message format are illustrative, not the actual DLE code.
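
package tools

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/docker/docker/client"
)

// logUnhealthyDetails inspects the container and, if the health check has
// failed, writes each recorded health check attempt to the log.
// Hypothetical helper; not the existing DLE code.
func logUnhealthyDetails(ctx context.Context, cli *client.Client, containerID string) error {
	resp, err := cli.ContainerInspect(ctx, containerID)
	if err != nil {
		return fmt.Errorf("failed to inspect container %s: %w", containerID, err)
	}

	if resp.State == nil || resp.State.Health == nil {
		return nil // no health check configured for this container
	}

	health := resp.State.Health
	if health.Status != "unhealthy" {
		return nil
	}

	log.Printf("[ERROR] Container %q is unhealthy, failing streak: %d. Last health check attempts:",
		containerID, health.FailingStreak)

	// resp.State.Health.Log keeps the most recent health check results:
	// start/end time, exit code, and command output.
	for _, attempt := range health.Log {
		log.Printf("[ERROR] %s: exit code %d, output: %s",
			attempt.End.Format(time.RFC3339), attempt.ExitCode, attempt.Output)
	}

	return nil
}

Wired into the readiness check, something like this would turn the bare "container health check failed" error into messages that include the exit code and output of each recent attempt (e.g. "Health check exceeded timeout (2s)").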
Reproducing a failed data refresh
The logicalDump container and the Postgres instance are running; however, the health check fails because of timeouts on an overloaded machine. The State section of the docker inspect output shows:
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 3482131,
"ExitCode": 0,
"Error": "",
"StartedAt": "2022-07-14T10:37:48.346832942Z",
"FinishedAt": "0001-01-01T00:00:00Z",
"Health": {
"Status": "unhealthy",
"FailingStreak": 17,
"Log": [
{
"Start": "2022-07-14T10:39:39.301160665Z",
"End": "2022-07-14T10:39:41.30155236Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
},
{
"Start": "2022-07-14T10:39:46.318675768Z",
"End": "2022-07-14T10:39:48.319050017Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
},
{
"Start": "2022-07-14T10:39:53.338113403Z",
"End": "2022-07-14T10:39:55.338639142Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
},
{
"Start": "2022-07-14T10:40:00.359663126Z",
"End": "2022-07-14T10:40:02.360002152Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
},
{
"Start": "2022-07-14T10:40:07.368458484Z",
"End": "2022-07-14T10:40:09.36892479Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
}
]
}
},
"Healthcheck": {
"Test": [
"CMD-SHELL",
"pg_isready -U postgres -d test_small"
],
"Interval": 5000000000,
"Timeout": 2000000000,
"StartPeriod": 3000000000,
"Retries": 15
},
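
For reference, the same configuration expressed with the Docker Go SDK (a hypothetical sketch; the helper name is illustrative, not the code that actually creates the container). It also explains the large inspected values: 5000000000 ns = 5s, 2000000000 ns = 2s, 3000000000 ns = 3s. Swapping the Test command for something slow (e.g. "sleep 10") would reproduce the timeout deterministically without overloading the machine.

package tools

import (
	"time"

	"github.com/docker/docker/api/types/container"
)

// dumpHealthConfig mirrors the inspected Healthcheck above.
// Hypothetical helper for illustration only.
func dumpHealthConfig() *container.HealthConfig {
	return &container.HealthConfig{
		Test:        []string{"CMD-SHELL", "pg_isready -U postgres -d test_small"},
		Interval:    5 * time.Second,
		Timeout:     2 * time.Second,
		StartPeriod: 3 * time.Second,
		Retries:     15,
	}
}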
Acceptance criteria
The DLE logs provide complete and clear information about a failed health check.