
Resolve "Data Race in function common.(*Trace).Write() - common/trace.go:18"

What does this MR do?

We use JobTrace as shared access to the output, so we must lock its Write calls.
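
The actual change is in common/trace.go; purely as a hedged illustration of the idea (the type layout and field names below are made up for this sketch, not copied from that file), serializing Write calls with a mutex looks roughly like this:

package common // illustrative sketch only, not the real file

import (
	"io"
	"sync"
)

// Trace stands in for the shared trace writer; the lock serializes
// concurrent Write calls coming from different goroutines.
type Trace struct {
	Writer io.Writer
	lock   sync.Mutex
}

func (t *Trace) Write(p []byte) (int, error) {
	t.lock.Lock()
	defer t.lock.Unlock()
	return t.Writer.Write(p)
}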

Are there points in the code the reviewer needs to double check?

This is tricky:

This specific data race was triggered by this function:

func (s *executor) waitForServices() {
	waitForServicesTimeout := s.Config.Docker.WaitForServicesTimeout
	if waitForServicesTimeout == 0 {
		waitForServicesTimeout = common.DefaultWaitForServicesTimeout
	}

	// wait for all services to come up
	if waitForServicesTimeout > 0 && len(s.services) > 0 {
		s.Println("Waiting for services to be up and running...")
		wg := sync.WaitGroup{}
		for _, service := range s.services {
			wg.Add(1)
			go func(service *types.Container) {
				s.waitForServiceContainer(service, time.Duration(waitForServicesTimeout)*time.Second)
				wg.Done()
			}(service)
		}
		wg.Wait()
	}
}


func (s *executor) waitForServiceContainer(service *types.Container, timeout time.Duration) error {
	// cut ... 
	var buffer bytes.Buffer
	// cut ... 
	buffer.WriteString(helpers.ANSI_YELLOW + "*********" + helpers.ANSI_RESET + "\n")
	buffer.WriteString("\n")
	io.Copy(s.Trace, &buffer)
	return err
}

We are writing to s.Trace concurrently from these goroutines.
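
To see the race in isolation, here is a minimal standalone sketch (not runner code): several goroutines copy into one unsynchronized io.Writer, the same shape as the loop above, and go run -race reports the conflicting writes.

package main

import (
	"bytes"
	"io"
	"strings"
	"sync"
)

func main() {
	var trace bytes.Buffer // stands in for the shared, unguarded trace writer
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// mirrors io.Copy(s.Trace, &buffer) in waitForServiceContainer
			io.Copy(&trace, strings.NewReader("waiting for service...\n"))
		}()
	}
	wg.Wait()
}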

Preventing starvation on io.Pipe

In my first attempt to fix this data race I went too far with the locking and introduced a starvation problem: https://gitlab.com/gitlab-org/gitlab-runner/-/jobs/44257015

It was caused by this innocent piece of code:

diff --git a/network/trace.go b/network/trace.go
index f2a502437..1d1e29cad 100644
--- a/network/trace.go
+++ b/network/trace.go
@@ -82,6 +82,13 @@ type clientJobTrace struct {
        sentState common.JobState
 }

+func (c *clientJobTrace) Write(p []byte) (n int, err error) {
+       c.lock.Lock()
+       defer c.lock.Unlock()
+
+       return c.PipeWriter.Write(p)
+}
+
 func (c *clientJobTrace) Success() {
        c.Fail(nil)
 }

At first glance this looks like normal locking, and c.lock was already used in clientJobTrace as a sync.RWMutex to protect its data.

So where was the problem?

We use c.lock to protect the other fields of clientJobTrace, not the io.PipeWriter: as you can see in Go's io/pipe.go (which is an amazing piece of code), reading and writing are already protected by their own sync.Mutex and sync.Cond.
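
To make that blocking behaviour concrete, a tiny standalone sketch (not runner code): PipeWriter.Write does not return until a reader drains the pipe, so any outer lock held around the call stays held for as long as the reader takes.

package main

import (
	"fmt"
	"io"
	"time"
)

func main() {
	pr, pw := io.Pipe()

	go func() {
		time.Sleep(100 * time.Millisecond) // the reader is deliberately slow
		buf := make([]byte, 64)
		n, _ := pr.Read(buf)
		fmt.Printf("read %q\n", buf[:n])
	}()

	start := time.Now()
	pw.Write([]byte("hello")) // parks on the pipe's internal synchronization until the read above
	fmt.Println("Write returned after", time.Since(start))
}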

How to detect what was going on:

Here is an extract of the go test failure:

goroutine 380 [semacquire, 9 minutes]:
sync.runtime_notifyListWait(0xc42037c7f8, 0xc400000001)
	/usr/local/go/src/runtime/sema.go:298 +0x10b
sync.(*Cond).Wait(0xc42037c7e8)
	/usr/local/go/src/sync/cond.go:57 +0x89
io.(*pipe).write(0xc42037c780, 0xc42015b000, 0x5, 0x100, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/pipe.go:90 +0x189
io.(*PipeWriter).Write(0xc42000c900, 0xc42015b000, 0x5, 0x100, 0xc42044e830, 0xc420275bc0, 0x8c2c80)
	/usr/local/go/src/io/pipe.go:157 +0x4c
gitlab.com/gitlab-org/gitlab-runner/network.(*clientJobTrace).Write(0xc42023c780, 0xc42015b000, 0x5, 0x100, 0x0, 0x0, 0x0)
	gitlab.com/gitlab-org/gitlab-runner/network/_test/_obj_test/trace.go:100 +0xcd
fmt.Fprint(0xbea380, 0xc42023c780, 0xc420275bb0, 0x1, 0x1, 0x5, 0x0, 0x0)
	/usr/local/go/src/fmt/print.go:216 +0x8f
gitlab.com/gitlab-org/gitlab-runner/network.TestJobOutputLimit(0xc4203745b0)
	/go/src/gitlab.com/gitlab-org/gitlab-runner/network/trace_test.go:151 +0x28e
testing.tRunner(0xc4203745b0, 0x9afdc0)
	/usr/local/go/src/testing/testing.go:657 +0x96
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:697 +0x2ca

[... cut ...]

goroutine 381 [semacquire, 9 minutes]:
sync.runtime_SemacquireMutex(0xc42023c96c)
	/usr/local/go/src/runtime/sema.go:62 +0x34
sync.(*Mutex).Lock(0xc42023c968)
	/usr/local/go/src/sync/mutex.go:87 +0x9d
sync.(*RWMutex).Lock(0xc42023c968)
	/usr/local/go/src/sync/rwmutex.go:86 +0x2d
gitlab.com/gitlab-org/gitlab-runner/network.(*clientJobTrace).writeRune(0xc42023c780, 0x61, 0x400, 0x0, 0x0, 0x0)
	gitlab.com/gitlab-org/gitlab-runner/network/_test/_obj_test/trace.go:168 +0x88
gitlab.com/gitlab-org/gitlab-runner/network.(*clientJobTrace).process(0xc42023c780, 0xc42000c8f8)
	gitlab.com/gitlab-org/gitlab-runner/network/_test/_obj_test/trace.go:218 +0x1b4
created by gitlab.com/gitlab-org/gitlab-runner/network.(*clientJobTrace).start
	gitlab.com/gitlab-org/gitlab-runner/network/_test/_obj_test/trace.go:146 +0x191
  • goroutine 380 was in network.(*clientJobTrace).Write, holding clientJobTrace.lock and waiting on a condition variable inside io.Pipe: it could not append its data until the reader consumed what was already in the pipe.
  • meanwhile goroutine 381 was in network.(*clientJobTrace).process and then network.(*clientJobTrace).writeRune, waiting for clientJobTrace.lock, which was held by goroutine 380. network.(*clientJobTrace).process is supposed to read from the shared io.Pipe and signal the condition variable goroutine 380 was waiting on, but that can never happen because of the nested lock (boiled down in the sketch after this list).
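
Boiled down to a standalone sketch (illustrative only, with a plain sync.Mutex standing in for clientJobTrace.lock): the writer holds the lock while parked inside the pipe write, the reader needs the same lock before it can read, and the program never makes progress (the Go runtime eventually reports that all goroutines are asleep).

package main

import (
	"io"
	"sync"
	"time"
)

func main() {
	pr, pw := io.Pipe()
	var lock sync.Mutex // plays the role of clientJobTrace.lock

	// "goroutine 381": wants the lock before it reads from the pipe
	go func() {
		time.Sleep(50 * time.Millisecond) // let the writer grab the lock first
		lock.Lock()
		defer lock.Unlock()
		buf := make([]byte, 16)
		pr.Read(buf)
	}()

	// "goroutine 380": holds the lock across the pipe write
	lock.Lock()
	pw.Write([]byte("data")) // never returns: the reader is stuck on lock.Lock above
	lock.Unlock()
}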

Does this MR meet the acceptance criteria?

  • Documentation created/updated
  • Tests
    • Added for this feature/bug
    • All builds are passing
  • Branch has no merge conflicts with master (if you do - rebase it please)

What are the relevant issue numbers?

Closes #2911. Related to #2932.
