Lazily initiate a Redis publish/subscribe channel

What does this MR do and why?

Previously, a Redis PubSub channel was established at startup and maintained constantly. If a Redis TIMEOUT value was configured, we could get into this repeated sequence:

  1. Redis tears down the channel.
  2. Workhorse reports an EOF error.
  3. Workhorse reconnects after a backoff time.

We can avoid this by lazily initiating the publish/subscribe channel only when we need it. Note that since we don't block on establishing the channel, the first CI request might not be held in a long poll, but subsequent requests should be.

References

Relates to #426006 (closed)

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.


Reproducing the problem

  1. In master, run gdk tail gitlab-workhorse.
  2. Stop any runners running on your system.
  3. From your GDK root, set your Redis timeout to 10 seconds: redis-cli -s redis/redis.socket config set timeout 10
  4. You should see these messages repeating every 10 seconds:
2025-01-07_23:02:39.31165 gitlab-workhorse        : redis: 2025/01/07 15:02:39 pubsub.go:168: redis: discarding bad PubSub connection: EOF
2025-01-07_23:02:39.31235 gitlab-workhorse        : {"error":"keywatcher: pubsub receive: EOF","level":"error","msg":"","time":"2025-01-07T15:02:39-08:00"}
2025-01-07_23:02:50.21762 gitlab-workhorse        : redis: 2025/01/07 15:02:50 pubsub.go:168: redis: discarding bad PubSub connection: EOF
2025-01-07_23:02:50.21940 gitlab-workhorse        : {"error":"keywatcher: pubsub receive: EOF","level":"error","msg":"","time":"2025-01-07T15:02:50-08:00"}
2025-01-07_23:03:01.27910 gitlab-workhorse        : redis: 2025/01/07 15:03:01 pubsub.go:168: redis: discarding bad PubSub connection: EOF
2025-01-07_23:03:01.28500 gitlab-workhorse        : {"error":"keywatcher: pubsub receive: EOF","level":"error","msg":"","time":"2025-01-07T15:03:01-08:00"}

How to set up and validate locally

Without long polling enabled (default)

By default, Workhorse is configured with a 50 ns (nanosecond) -apiCiLongPollingDuration value, which effectively disables long polling since the hold duration is close to 0.

  1. Check out this branch.
  2. In the workhorse directory, run make.
  3. gdk restart workhorse
  4. gdk tail gitlab-workhorse
  5. The EOF messages should be gone.
  6. Start a runner. You should see the keywatcher: listening for subscriptions message:
2025-01-07_23:07:05.19833 gitlab-workhorse        : {"level":"info","msg":"keywatcher: listening for subscriptions","time":"2025-01-07T15:07:05-08:00"}
  7. By default, the runner should request a job every 3 seconds, which causes repeated SUBSCRIBE and UNSUBSCRIBE messages in Redis. There should be no EOF messages.

With long polling enabled

  1. Stop the runner.
  2. Check out this GDK branch: gitlab-development-kit!4343 (merged).
  3. gdk config set workhorse.ci_long_polling_seconds 30
  4. gdk restart workhorse
  5. gdk tail gitlab-workhorse
  6. Start a runner. You should see the keywatcher: listening for subscriptions message:
2025-01-07_23:24:36.54390 gitlab-workhorse        : {"level":"info","msg":"keywatcher: listening for subscriptions","time":"2025-01-07T15:24:36-08:00"}
  7. The logs should show the runner attempting to get a job every 30 seconds. No EOF keywatcher errors should show up.
  8. Kick off a CI job to ensure the runner picks up a job quickly in a long poll.
Edited by Stan Hu
