fix(handlers): fix Broadcaster goroutine leak and DBLoadBalancer close race
What does this MR do?
Fixes goroutine leaks that cause the
database-api-load-balancing-discovery-pg_curr_version-1_25-valkey::TestMain
CI job to time out after 35 minutes.
There are two root causes:
-
Leaked Broadcaster goroutines. Every call to
handlers.NewApp()runsconfigureEvents(), which creates anotifications.Broadcasterand starts a backgroundrun()goroutine. Tests that callNewAppwithout eventually callingGracefulShutdown()leak these goroutines. They accumulate (each blocked on aselect{}inBroadcaster.run()) until the test binary hits the 35-minute timeout. -
DBLoadBalancer close race. The background replica-refresh goroutine in
DBLoadBalancercan callResolveReplicasafterClose, reopening connections and leakingconnectionOpenergoroutines. These hold locks that blockTruncateAllTables.
Production code
app.go:
-
NewAppdefersGracefulShutdownon the error path (at the top, covering all resource registrations) so callers never receive anilAppwith leaked goroutines. -
GracefulShutdownis idempotent: an atomic flag ensures only the first call performs cleanup, so double-calls from overlapping cleanup paths are safe.
loadbalancing.go:
-
DBLoadBalancer.Closesets aclosedatomic flag. -
ResolveReplicasreturns immediately whenclosedis set, preventing the refresh goroutine from reopening connections afterClose.
Test infrastructure
integration_helpers_test.go:
-
newTestEnvWithConfigauto-registers cleanup viat.Cleanup, so callers no longer need an explicitenv.Cleanup(t)call. -
testEnv.Shutdowncancels the app context before closing DB connections, preventing the pool-refresh goroutine from racing withClose. -
Shutdownis idempotent (returns nil on double-call) to avoid noisy failures from overlapping cleanup paths.
Test fixes
app_test.go:
- Added
GracefulShutdowncleanup (with 5s timeout) to allNewAppcall sites:TestNewApp, GitLab API tests, CF-Ray-ID logging tests, lock error tests, lock no-manifests tests, andApplyStorageMiddlewareTestSuite. - Added missing
Close()expectation to the mockLoadBalancer.
app_integration_test.go:
- All lockfile test paths register
GracefulShutdownint.Cleanup. - Manual
REGISTRY_DATABASE_ENABLEDenv var checks replaced withskipDatabaseNotEnabled.
api_integration_test.go:
- Removed now-redundant explicit
env.Cleanup(t)calls (auto-registered).
api_conformance_test.go, api_gitlab_integration_test.go,
api_db_fault_tolerance_integration_test.go, api_online_gc_integration_test.go:
- Removed explicit
env.Cleanup(t)calls (auto-registered).
repositories_integration_test.go:
-
newTestEnvregisters cleanup for both the HTTP server and theApp.
Why this differs from the auto-generated implementation plan
The implementation plan on the issue
diagnoses the root cause as "Database Connection Pool Exhaustion and Lock
Contention" and proposes that TRUNCATE ... CASCADE blocks because active DB
connections hold locks. The stack traces tell a different story: the stuck
goroutines are all Broadcaster.run() in notifications/sinks.go:121,
blocked in select for 34-35 minutes with no events and no doneCh signal.
The goroutine count (456+) comes from leaked Broadcaster goroutines
accumulating across test cases, not from connection pool saturation. The
TruncateAllTables goroutine visible in some traces is a symptom (blocked by
leaked connectionOpener goroutines from the DBLoadBalancer close race), not
the primary cause.
Related to #1858 (closed)
Author checklist
-
CODEOWNERS Review: This MR requires approval from at least one CODEOWNER per category/file.
- If codeowners are absent or the change is urgent, any registry maintainer can temporarily disable CODEOWNERS reviews in the project settings. When doing so:
- Add a comment on the MR justifying the bypass
- Mention/CC the designated codeowners in the MR for async review
- Re-enable CODEOWNERS reviews in project settings once the MR is merged
- If you lack permissions to disable CODEOWNERS reviews, reach out to the Engineering Manager (
@jaime) or Senior Engineering Manager (@crystalpoole) for assistance.
- If codeowners are absent or the change is urgent, any registry maintainer can temporarily disable CODEOWNERS reviews in the project settings. When doing so:
- Assign one of conventional-commit prefixes to the MR.
-
fix: Indicates a bug fix, triggers a patch release. -
feat -
perf -
docs -
style -
refactor -
test -
chore -
build -
ci -
revert
-
- MR contains database changes including schema/background migrations: N/A
- Change contains a breaking change - apply the breaking change label. N/A
- Change is considered high risk - apply the label high-risk-change N/A
-
I created or linked to an existing issue for every added or updated
TODO,BUG,FIXMEorOPTIMIZEprefixed comment: N/A - Changes cannot be rolled back: N/A