Praefect: Crash after DB disappeared, but state was HEALTHY in Kubernetes
Hi,
we are running GitLab 14.7.4 with
git@gitlab-praefect-0:/$ praefect -version
Praefect, version 14.7.4
in a Kubernetes cluster, installed with the official GitLab chart. Today our database had problems, and the following appeared in the logs (newest entries first):
2022-03-17 15:00:00
/root/go/pkg/mod/github.com/prometheus/client_golang@v1.10.0/prometheus/registry.go:538 +0xe4d
2022-03-17 15:00:00
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
2022-03-17 15:00:00
/root/go/pkg/mod/github.com/prometheus/client_golang@v1.10.0/prometheus/registry.go:446 +0x12b
2022-03-17 15:00:00
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
2022-03-17 15:00:00
/tmp/build/internal/praefect/datastore/collector.go:156 +0x608
2022-03-17 15:00:00
gitlab.com/gitlab-org/gitaly/v14/internal/praefect/datastore.(*QueueDepthCollector).Collect(0xc000bfb740, 0xc004a27c20)
2022-03-17 15:00:00
goroutine 452562 [running]:
2022-03-17 15:00:00
2022-03-17 15:00:00
[signal SIGSEGV: segmentation violation code=0x1 addr=0xd8 pc=0xbacba8]
2022-03-17 15:00:00
panic: runtime error: invalid memory address or nil pointer dereference
2022-03-17 14:59:57
time="2022-03-17T13:59:57.093Z" level=error msg="checking health failed" component=HealthManager error="update checks: failed to connect to `host=WW.XX.YY.ZZ user=praefect database=praefect`: dial error (timeout: dial tcp WW.XX.YY.ZZ:54004: i/o timeout)" pid=52
2022-03-17 14:59:51
time="2022-03-17T13:59:51.090Z" level=error msg="checking health failed" component=HealthManager error="update checks: timeout: context deadline exceeded" pid=52
As I wrote, the database had problems, so the errors themselves are understandable. However, after the panic the Praefect pod was still reported as healthy in Kubernetes and was NOT restarted. When the database came back a few seconds later, Praefect was still dead and had not been restarted. I had to restart Praefect manually, and after that everything was OK.
I think the Kubernetes state of the pod should switch to failed in this situation, so that Kubernetes restarts the pod automatically.
Edited by Ulrich Schreiner