Fix race condition in server create

Tomasz Maczukin requested to merge fix-race-condition-in-server-create into main

We've got a data race in nesting on our SaaS macOS staging runners.

When testing our new runners, we executed two jobs that failed while connecting to nesting on the AWS macOS instance, with the following failure:

Dialing instance i-04011106dd61f0b2e...
Instance i-04011106dd61f0b2e connected
Enforcing VM Isolation
Creating nesting VM tunnel
Creating nesting VM macos-12-xcode-14
Dialing nesting daemon
ERROR: Preparation failed: creating instance environment: creating vm tunnel: creating nesting vm: rpc error: code = Unavailable desc = error reading from server: EOF

The error means that the nesting daemon, which should be running and awaiting connections on that host, is not responding. After a quick check we noticed that both jobs were using the same i-04011106dd61f0b2e instance.

After logging in to the instance and looking into the nesting logs, we found a panic caused by a data race:

fatal error: concurrent map writes

goroutine 184 [running]:
runtime.throw({0x100a9d29b?, 0x10?})
	/usr/local/go/src/runtime/panic.go:992 +0x50 fp=0x140006b5980 sp=0x140006b5950 pc=0x1007366e0
runtime.mapassign_fast32(0x100ba4a40, 0x1400008e270, 0x0)
	/usr/local/go/src/runtime/map_fast32.go:177 +0x2d0 fp=0x140006b59c0 sp=0x140006b5980 pc=0x100713ef0
gitlab.com/gitlab-org/fleeting/nesting/api.(*server).Create(0x1400008e2a0, {0x100c2acb0, 0x1400021e660}, 0x140000921c0)
	/builds/gitlab-org/fleeting/nesting/api/server.go:88 +0xe0 fp=0x140006b5a50 sp=0x140006b59c0 pc=0x100a86ba0
gitlab.com/gitlab-org/fleeting/nesting/api/internal/proto._Nesting_Create_Handler({0x100bef240?, 0x1400008e2a0}, {0x100c2acb0, 0x1400021e660}, 0x140001d2310, 0x0)
	/builds/gitlab-org/fleeting/nesting/api/internal/proto/nesting_grpc.pb.go:149 +0x174 fp=0x140006b5ab0 sp=0x140006b5a50 pc=0x100a84de4
google.golang.org/grpc.(*Server).processUnaryRPC(0x140000b0000, {0x100c2cab0, 0x14000482000}, 0x14000000120, 0x1400008e330, 0x100edb478, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.50.0/server.go:1318 +0x9d8 fp=0x140006b5e20 sp=0x140006b5ab0 pc=0x100a70628
google.golang.org/grpc.(*Server).handleStream(0x140000b0000, {0x100c2cab0, 0x14000482000}, 0x14000000120, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.50.0/server.go:1659 +0x840 fp=0x140006b5f50 sp=0x140006b5e20 pc=0x100a748c0
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	/go/pkg/mod/google.golang.org/grpc@v1.50.0/server.go:955 +0x88 fp=0x140006b5fd0 sp=0x140006b5f50 pc=0x100a6e578
runtime.goexit()
	/usr/local/go/src/runtime/asm_arm64.s:1270 +0x4 fp=0x140006b5fd0 sp=0x140006b5fd0 pc=0x100766d04
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.50.0/server.go:953 +0x298
(...)

Looking into the runner logs, we found that the two jobs were started at exactly the same time and were trying to connect to nesting in exactly the same second:

# journalctl -u gitlab-runner.service --since '2023-03-15 19:30' | grep -e 3950468231 -e 3950468229 | grep failed
Mar 16 19:33:07 runners-manager-saas-macos-staging-blue-1 gitlab-runner[511059]: {"job":3950468229,"level":"warning","msg":"Preparation failed: creating instance environment: creating vm tunnel: creating nesting vm: rpc error: code = Unavailable desc = error reading from server: EOF","project":44183866,"runner":"ha4G5WUF","time":"2023-03-16T19:33:07Z"}
Mar 16 19:33:07 runners-manager-saas-macos-staging-blue-1 gitlab-runner[511059]: {"job":3950468231,"level":"warning","msg":"Preparation failed: creating instance environment: creating vm tunnel: creating nesting vm: rpc error: code = Unavailable desc = error reading from server: EOF","project":44183866,"runner":"ha4G5WUF","time":"2023-03-16T19:33:07Z"}

Following the panic stack trace, we were pointed to this line - https://gitlab.com/gitlab-org/fleeting/nesting/-/blob/f2211a1d15b02f1b0c572a6b9b6284b31e7176f4/api/server.go#L88 - where an entry in the s.slots map is written:

if slotsInUse {
	s.slots[*req.Slot] = vm.GetId()
}

A call to Create() may also read from s.slots at https://gitlab.com/gitlab-org/fleeting/nesting/-/blob/f2211a1d15b02f1b0c572a6b9b6284b31e7176f4/api/server.go#L147 and https://gitlab.com/gitlab-org/fleeting/nesting/-/blob/f2211a1d15b02f1b0c572a6b9b6284b31e7176f4/api/server.go#L111, and write to it at https://gitlab.com/gitlab-org/fleeting/nesting/-/blob/f2211a1d15b02f1b0c572a6b9b6284b31e7176f4/api/server.go#L113.

Two concurrent calls made at exactly the same moment caused concurrent read/write access to the map, which ended in a panic. Unlike a few other methods on this struct, Create(), Delete() and clearSlot() do not use any locks despite operating on the shared map.
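
For illustration, here is a minimal standalone Go program (not taken from the nesting codebase) that reproduces the same failure mode: two goroutines write to one map with no synchronization, and the Go runtime's built-in check typically aborts with the same "fatal error: concurrent map writes" seen in the nesting logs above.

package main

// Standalone reproduction sketch: unsynchronized concurrent map writes.
func main() {
	slots := map[int32]string{}
	done := make(chan struct{}, 2)

	for i := 0; i < 2; i++ {
		go func() {
			for j := 0; j < 10000; j++ {
				slots[0] = "vm-id" // racy write, no lock held
			}
			done <- struct{}{}
		}()
	}

	<-done
	<-done
}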

This MR adds a test that tries to simulate that condition and that, at least across my multiple local executions, properly detected the data race when run with the -race flag. Because of that, a dedicated test race job is also added in this MR.
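
For reference, such a test has roughly the shape sketched below; newTestServer and newCreateRequest are hypothetical stand-ins for the setup done by the actual test, which drives the real api.server. When run with go test -race, the detector reports the unsynchronized access to s.slots.

package api_test

import (
	"context"
	"sync"
	"testing"
)

// Sketch only: newTestServer and newCreateRequest are hypothetical helpers
// standing in for the real test setup added in this MR.
func TestCreateConcurrently(t *testing.T) {
	srv := newTestServer(t)

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(slot int32) {
			defer wg.Done()
			// Concurrent Create() calls all touch the shared s.slots map.
			if _, err := srv.Create(context.Background(), newCreateRequest(slot)); err != nil {
				t.Error(err)
			}
		}(int32(i))
	}
	wg.Wait()
}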

Finally, this MR adds locks in the needed places so that concurrent calls to Create() and/or Delete() no longer cause data races.
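
A minimal sketch of that approach (not the exact MR diff): a sync.Mutex on the server struct is held around every access to s.slots. The field names and helper functions here are assumptions for illustration; the int32 key type is inferred from the mapassign_fast32 frame in the panic.

// Sketch with assumed names; the real change guards the existing s.slots
// accesses in api/server.go directly.
type server struct {
	slotsMu sync.Mutex       // guards slots
	slots   map[int32]string // key type inferred from mapassign_fast32
	// ... other fields omitted
}

// The write from Create() that triggered the panic, now serialized.
func (s *server) rememberSlot(slot int32, vmID string) {
	s.slotsMu.Lock()
	defer s.slotsMu.Unlock()
	s.slots[slot] = vmID
}

// Reads in Create()/Delete() and the deletes done by clearSlot() take the
// same lock, so concurrent RPCs no longer race on the shared map.
func (s *server) lookupSlot(slot int32) (string, bool) {
	s.slotsMu.Lock()
	defer s.slotsMu.Unlock()
	id, ok := s.slots[slot]
	return id, ok
}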
