fix(handlers): fix Broadcaster goroutine leak and DBLoadBalancer close race

What does this MR do?

Fixes goroutine leaks that cause the database-api-load-balancing-discovery-pg_curr_version-1_25-valkey::TestMain CI job to time out after 35 minutes.

There are two root causes:

  1. Leaked Broadcaster goroutines. Every call to handlers.NewApp() runs configureEvents(), which creates a notifications.Broadcaster and starts a background run() goroutine. Tests that call NewApp without eventually calling GracefulShutdown() leak these goroutines. They accumulate (each blocked on a select{} in Broadcaster.run()) until the test binary hits the 35-minute timeout.

  2. DBLoadBalancer close race. The background replica-refresh goroutine in DBLoadBalancer can call ResolveReplicas after Close, reopening connections and leaking connectionOpener goroutines. These hold locks that block TruncateAllTables.

Production code

app.go:

  • NewApp defers GracefulShutdown on the error path (at the top, covering all resource registrations) so callers never receive a nil App with leaked goroutines.
  • GracefulShutdown is idempotent: an atomic flag ensures only the first call performs cleanup, so double-calls from overlapping cleanup paths are safe.

loadbalancing.go:

  • DBLoadBalancer.Close sets a closed atomic flag.
  • ResolveReplicas returns immediately when closed is set, preventing the refresh goroutine from reopening connections after Close.

Test infrastructure

integration_helpers_test.go:

  • newTestEnvWithConfig auto-registers cleanup via t.Cleanup, so callers no longer need an explicit env.Cleanup(t) call.
  • testEnv.Shutdown cancels the app context before closing DB connections, preventing the pool-refresh goroutine from racing with Close.
  • Shutdown is idempotent (returns nil on double-call) to avoid noisy failures from overlapping cleanup paths.

Test fixes

app_test.go:

  • Added GracefulShutdown cleanup (with 5s timeout) to all NewApp call sites: TestNewApp, GitLab API tests, CF-Ray-ID logging tests, lock error tests, lock no-manifests tests, and ApplyStorageMiddlewareTestSuite.
  • Added missing Close() expectation to the mock LoadBalancer.

app_integration_test.go:

  • All lockfile test paths register GracefulShutdown in t.Cleanup.
  • Manual REGISTRY_DATABASE_ENABLED env var checks replaced with skipDatabaseNotEnabled.

api_integration_test.go:

  • Removed now-redundant explicit env.Cleanup(t) calls (auto-registered).

api_conformance_test.go, api_gitlab_integration_test.go, api_db_fault_tolerance_integration_test.go, api_online_gc_integration_test.go:

  • Removed explicit env.Cleanup(t) calls (auto-registered).

repositories_integration_test.go:

  • newTestEnv registers cleanup for both the HTTP server and the App.

Why this differs from the auto-generated implementation plan

The implementation plan on the issue diagnoses the root cause as "Database Connection Pool Exhaustion and Lock Contention" and proposes that TRUNCATE ... CASCADE blocks because active DB connections hold locks. The stack traces tell a different story: the stuck goroutines are all Broadcaster.run() in notifications/sinks.go:121, blocked in select for 34-35 minutes with no events and no doneCh signal. The goroutine count (456+) comes from leaked Broadcaster goroutines accumulating across test cases, not from connection pool saturation. The TruncateAllTables goroutine visible in some traces is a symptom (blocked by leaked connectionOpener goroutines from the DBLoadBalancer close race), not the primary cause.

Related to #1858 (closed)

Author checklist

  • CODEOWNERS Review: This MR requires approval from at least one CODEOWNER per category/file.
    • If codeowners are absent or the change is urgent, any registry maintainer can temporarily disable CODEOWNERS reviews in the project settings. When doing so:
      • Add a comment on the MR justifying the bypass
      • Mention/CC the designated codeowners in the MR for async review
      • Re-enable CODEOWNERS reviews in project settings once the MR is merged
    • If you lack permissions to disable CODEOWNERS reviews, reach out to the Engineering Manager (@jaime) or Senior Engineering Manager (@crystalpoole) for assistance.
  • Assign one of conventional-commit prefixes to the MR.
    • fix: Indicates a bug fix, triggers a patch release.
    • feat
    • perf
    • docs
    • style
    • refactor
    • test
    • chore
    • build
    • ci
    • revert
  • MR contains database changes including schema/background migrations: N/A
  • Change contains a breaking change - apply the breaking change label. N/A
  • Change is considered high risk - apply the label high-risk-change N/A
  • I created or linked to an existing issue for every added or updated TODO, BUG, FIXME or OPTIMIZE prefixed comment: N/A
  • Changes cannot be rolled back: N/A
Edited by Hayley Swimelar

Merge request reports

Loading