Provide details of failed container health checks in the logs.
Goal
The logs lack information about container startup problems when the health check does not pass and the container ends up in the "unhealthy" state. For example:
2022/07/02 17:56:22 restore.go:217: [INFO] Running container: dblab_lr_c7kbqjci58nabq0ml8dg. ID: 32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113
2022/07/02 17:56:23 restore.go:225: [INFO] Waiting for container readiness
2022/07/02 17:56:23 tools.go:285: [INFO] Check container readiness: 32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113
2022/07/02 17:56:49 tools.go:343: [INFO] Container logs:
2022/07/02 17:56:49 tools.go:371: [INFO] Removing container ID: 32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113
2022/07/02 17:57:19 tools.go:377: [INFO] Container "32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113" has been stopped
2022/07/02 17:57:19 tools.go:388: [INFO] Container "32872d4436e20e2c75760cc6217479db338ce8ee9e57211a0a59a1f79eeea113" has been removed
2022/07/02 17:57:19 telemetry.go:26: [DEBUG] Send telemetry event {c7kbqjci58nabq0ml8dg alert {refresh_failed Failed to run full-refresh}}
2022/07/02 17:57:20 retrieval.go:410: [ERROR] Failed to run full-refresh failed to readiness check: container health check failed
TODO / How to implement
Add details when the container state is "unhealthy" (https://gitlab.com/postgres-ai/database-lab/-/blob/v3.1.0/engine/internal/retrieval/engine/postgres/tools/tools.go#L304).
We should probably collect the health check log entries: resp.State.Health.Log. A minimal sketch of that idea follows, assuming the Docker Go SDK (github.com/docker/docker/client); the function name and message format are illustrative, not the actual DLE code.
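
package tools

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/docker/docker/client"
)

// logUnhealthyDetails inspects the container and, if the health check has
// failed, writes each recorded health check attempt to the log.
// Hypothetical helper; not the existing DLE code.
func logUnhealthyDetails(ctx context.Context, cli *client.Client, containerID string) error {
	resp, err := cli.ContainerInspect(ctx, containerID)
	if err != nil {
		return fmt.Errorf("failed to inspect container %s: %w", containerID, err)
	}

	if resp.State == nil || resp.State.Health == nil {
		return nil // no health check configured for this container
	}

	health := resp.State.Health
	if health.Status != "unhealthy" {
		return nil
	}

	log.Printf("[ERROR] Container %q is unhealthy, failing streak: %d. Last health check attempts:",
		containerID, health.FailingStreak)

	// resp.State.Health.Log keeps the most recent health check results:
	// start/end time, exit code, and command output.
	for _, attempt := range health.Log {
		log.Printf("[ERROR] %s: exit code %d, output: %s",
			attempt.End.Format(time.RFC3339), attempt.ExitCode, attempt.Output)
	}

	return nil
}

Wired into the readiness check, something like this would turn the bare "container health check failed" error into messages that include the exit code and output of each recent attempt (e.g. "Health check exceeded timeout (2s)").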
Reproducing a failed data refresh
The logicalDump container and the Postgres instance are running; however, the health check fails because of timeouts on an overloaded machine. The State section of the docker inspect output shows:
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 3482131,
"ExitCode": 0,
"Error": "",
"StartedAt": "2022-07-14T10:37:48.346832942Z",
"FinishedAt": "0001-01-01T00:00:00Z",
"Health": {
"Status": "unhealthy",
"FailingStreak": 17,
"Log": [
{
"Start": "2022-07-14T10:39:39.301160665Z",
"End": "2022-07-14T10:39:41.30155236Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
},
{
"Start": "2022-07-14T10:39:46.318675768Z",
"End": "2022-07-14T10:39:48.319050017Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
},
{
"Start": "2022-07-14T10:39:53.338113403Z",
"End": "2022-07-14T10:39:55.338639142Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
},
{
"Start": "2022-07-14T10:40:00.359663126Z",
"End": "2022-07-14T10:40:02.360002152Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
},
{
"Start": "2022-07-14T10:40:07.368458484Z",
"End": "2022-07-14T10:40:09.36892479Z",
"ExitCode": -1,
"Output": "Health check exceeded timeout (2s)"
}
]
}
},
"Healthcheck": {
"Test": [
"CMD-SHELL",
"pg_isready -U postgres -d test_small"
],
"Interval": 5000000000,
"Timeout": 2000000000,
"StartPeriod": 3000000000,
"Retries": 15
},
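
For reference, the same configuration expressed with the Docker Go SDK (a hypothetical sketch; the helper name is illustrative, not the code that actually creates the container). It also explains the large inspected values: 5000000000 ns = 5s, 2000000000 ns = 2s, 3000000000 ns = 3s. Swapping the Test command for something slow (e.g. "sleep 10") would reproduce the timeout deterministically without overloading the machine.

package tools

import (
	"time"

	"github.com/docker/docker/api/types/container"
)

// dumpHealthConfig mirrors the inspected Healthcheck above.
// Hypothetical helper for illustration only.
func dumpHealthConfig() *container.HealthConfig {
	return &container.HealthConfig{
		Test:        []string{"CMD-SHELL", "pg_isready -U postgres -d test_small"},
		Interval:    5 * time.Second,
		Timeout:     2 * time.Second,
		StartPeriod: 3 * time.Second,
		Retries:     15,
	}
}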
Acceptance criteria
The DLE logs provide complete and clear information about a failed health check.