Resolve "Data Race in function common.(*Trace).Write() - common/trace.go:18"
## What does this MR do?
We are using `JobTrace` as shared access to the job output, so we must lock `Write` calls.
## Are there points in the code the reviewer needs to double check?
This is tricky. This specific data race was triggered by this function:
```go
func (s *executor) waitForServices() {
	waitForServicesTimeout := s.Config.Docker.WaitForServicesTimeout
	if waitForServicesTimeout == 0 {
		waitForServicesTimeout = common.DefaultWaitForServicesTimeout
	}

	// wait for all services to came up
	if waitForServicesTimeout > 0 && len(s.services) > 0 {
		s.Println("Waiting for services to be up and running...")
		wg := sync.WaitGroup{}
		for _, service := range s.services {
			wg.Add(1)
			go func(service *types.Container) {
				s.waitForServiceContainer(service, time.Duration(waitForServicesTimeout)*time.Second)
				wg.Done()
			}(service)
		}
		wg.Wait()
	}
}
```
```go
func (s *executor) waitForServiceContainer(service *types.Container, timeout time.Duration) error {
	// cut ...

	var buffer bytes.Buffer

	// cut ...

	buffer.WriteString(helpers.ANSI_YELLOW + "*********" + helpers.ANSI_RESET + "\n")
	buffer.WriteString("\n")
	io.Copy(s.Trace, &buffer)

	return err
}
```
We are writing to `s.Trace` concurrently: one goroutine per service.
## Preventing starvation on `io.Pipe`
In my first attempt to fix this data race I went too far with the locking and introduced a starvation problem: https://gitlab.com/gitlab-org/gitlab-runner/-/jobs/44257015

It was caused by this innocent-looking piece of code:
```diff
diff --git a/network/trace.go b/network/trace.go
index f2a502437..1d1e29cad 100644
--- a/network/trace.go
+++ b/network/trace.go
@@ -82,6 +82,13 @@ type clientJobTrace struct {
 	sentState      common.JobState
 }
 
+func (c *clientJobTrace) Write(p []byte) (n int, err error) {
+	c.lock.Lock()
+	defer c.lock.Unlock()
+
+	return c.PipeWriter.Write(p)
+}
+
 func (c *clientJobTrace) Success() {
 	c.Fail(nil)
 }
```
At first look this is normal locking, and `c.lock` was already used in `clientJobTrace` as a `sync.RWMutex` to protect data.
So where was the problem?
We use `c.lock` to protect the other fields of `clientJobTrace`, not the `io.PipeWriter`, because, as you can see in Go's io/pipe.go (which is an amazing piece of code), reading and writing are already protected by their own `sync.Mutex` and `sync.Cond`.
How to detect what was going on: here is an extract of the `go test` failure:
```
goroutine 380 [semacquire, 9 minutes]:
sync.runtime_notifyListWait(0xc42037c7f8, 0xc400000001)
	/usr/local/go/src/runtime/sema.go:298 +0x10b
sync.(*Cond).Wait(0xc42037c7e8)
	/usr/local/go/src/sync/cond.go:57 +0x89
io.(*pipe).write(0xc42037c780, 0xc42015b000, 0x5, 0x100, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/pipe.go:90 +0x189
io.(*PipeWriter).Write(0xc42000c900, 0xc42015b000, 0x5, 0x100, 0xc42044e830, 0xc420275bc0, 0x8c2c80)
	/usr/local/go/src/io/pipe.go:157 +0x4c
gitlab.com/gitlab-org/gitlab-runner/network.(*clientJobTrace).Write(0xc42023c780, 0xc42015b000, 0x5, 0x100, 0x0, 0x0, 0x0)
	gitlab.com/gitlab-org/gitlab-runner/network/_test/_obj_test/trace.go:100 +0xcd
fmt.Fprint(0xbea380, 0xc42023c780, 0xc420275bb0, 0x1, 0x1, 0x5, 0x0, 0x0)
	/usr/local/go/src/fmt/print.go:216 +0x8f
gitlab.com/gitlab-org/gitlab-runner/network.TestJobOutputLimit(0xc4203745b0)
	/go/src/gitlab.com/gitlab-org/gitlab-runner/network/trace_test.go:151 +0x28e
testing.tRunner(0xc4203745b0, 0x9afdc0)
	/usr/local/go/src/testing/testing.go:657 +0x96
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:697 +0x2ca

[... cut ...]

goroutine 381 [semacquire, 9 minutes]:
sync.runtime_SemacquireMutex(0xc42023c96c)
	/usr/local/go/src/runtime/sema.go:62 +0x34
sync.(*Mutex).Lock(0xc42023c968)
	/usr/local/go/src/sync/mutex.go:87 +0x9d
sync.(*RWMutex).Lock(0xc42023c968)
	/usr/local/go/src/sync/rwmutex.go:86 +0x2d
gitlab.com/gitlab-org/gitlab-runner/network.(*clientJobTrace).writeRune(0xc42023c780, 0x61, 0x400, 0x0, 0x0, 0x0)
	gitlab.com/gitlab-org/gitlab-runner/network/_test/_obj_test/trace.go:168 +0x88
gitlab.com/gitlab-org/gitlab-runner/network.(*clientJobTrace).process(0xc42023c780, 0xc42000c8f8)
	gitlab.com/gitlab-org/gitlab-runner/network/_test/_obj_test/trace.go:218 +0x1b4
created by gitlab.com/gitlab-org/gitlab-runner/network.(*clientJobTrace).start
	gitlab.com/gitlab-org/gitlab-runner/network/_test/_obj_test/trace.go:146 +0x191
```
- `goroutine 380` was in `network.(*clientJobTrace).Write`, holding `clientJobTrace.lock` and waiting on a condition variable in `io.Pipe`. It was waiting for the reader to consume data from the pipe so it could append its new information.
- Meanwhile, `goroutine 381` was in `network.(*clientJobTrace).process` and then `network.(*clientJobTrace).writeRune`, waiting for `clientJobTrace.lock`, which was held by `goroutine 380`. `network.(*clientJobTrace).process` is supposed to read from the shared `io.Pipe` and signal `goroutine 380`'s condition variable, but that can never happen because of the nested lock.
## Does this MR meet the acceptance criteria?

- Documentation created/updated
- Tests
  - Added for this feature/bug
  - All builds are passing
- Branch has no merge conflicts with `master` (if you do - rebase it please)
## What are the relevant issue numbers?
Closes #2911. Related to #2932.