Skip to content

testserver: Capture Praefect logs if dialing fails

Will Chandler requested to merge wc-waithealthy-timeout into master

In !4297 (merged) we removed the 15 second timeout in waitHealthy() to avoid spurious failures in slow CI environments. Since this change we have seen occasional instances of multi-minute waits for Praefect, causing the test process to panic on timeout.

To better understand what's going wrong, let's increase Praefect's log verbosity to info and print stderr (where all logging is written) when the context deadline is reached in waitHealthy(). If we're lucky this will have an error, but even which events were logged should give us clues as to where Praefect got stuck.

Here's example test output:

✖  cmd/praefect (22.244s)

=== Failed
=== FAIL: cmd/praefect TestRemoveRepository_Exec (21.74s)
    gitaly.go:138:
        	Error Trace:	/Users/wchandler/devel/gdk/gitaly/cmd/praefect/gitaly.go:138
        	           				/Users/wchandler/devel/gdk/gitaly/cmd/praefect/praefect.go:115
        	           				/Users/wchandler/devel/gdk/gitaly/cmd/praefect/subcmd_remove_repository_test.go:99
        	Error:      	Received unexpected error:
        	           	failed to dial "unix:///private/tmp/gitaly-2411705248/gitaly.socket.1141245043" connection: context deadline exceeded
        	Test:       	TestRemoveRepository_Exec
    praefect.go:109: Praefect failed to start:
        time="2022-07-12T21:21:15.459Z" level=info msg="Starting praefect" pid=18408 version="Praefect, version "
        time="2022-07-12T21:21:15.459Z" level=info msg="establishing database connection to /Users/wchandler/devel/gdk/postgresql:5432 ..." pid=18408
        time="2022-07-12T21:21:15.462Z" level=info msg="database connection established" pid=18408
        time="2022-07-12T21:21:15.462Z" level=info msg="reads distribution caching is enabled by configuration" pid=18408
        time="2022-07-12T21:21:15.463Z" level=info msg="health manager started" component=HealthManager pid=18408
        time="2022-07-12T21:21:16.472Z" level=info msg="background verifier started" config="{168h0m0s false}" pid=18408
        time="2022-07-12T21:21:16.472Z" level=info msg="election strategy: \"per_repository\"" pid=18408
        time="2022-07-12T21:21:16.472Z" level=info msg="background started: gitaly nodes health monitoring" pid=18408
        time="2022-07-12T21:21:16.473Z" level=info msg="listening at unix address" address=/private/tmp/gitaly-2411705248/gitaly.socket.1141245043
        time="2022-07-12T21:21:16.473Z" level=info msg=listening address=/private/tmp/gitaly-2411705248/gitaly.socket.1141245043 pid=18408 schema=unix
        time="2022-07-12T21:21:16.473Z" level=info msg="background started: processing of the replication events" pid=18408
        time="2022-07-12T21:21:16.473Z" level=info msg="background started: processing of the stale replication events" pid=18408
        time="2022-07-12T21:21:16.474Z" level=info msg="automatic reconciler started" component=reconciler pid=18408 storages="map[praefect:[gitaly-1 gitaly-2]]"
        time="2022-07-12T21:21:16.474Z" level=info msg="processing started" component=replication_manager pid=18408 virtual_storage=praefect
        time="2022-07-12T21:21:16.474Z" level=info msg=started component=repocleaner.repository_existence pid=18408
0
DONE 6 tests, 1 failure in 25.428s
make: *** [test-go] Error 1
Edited by Will Chandler

Merge request reports