- Oct 07, 2021
-
-
Patrick Steinhardt authored
The supervisor's health monitor checks health every 15 seconds, or alternatively exits if its shutdown channel has been closed. While we do check this channel correctly when trying to send the event, we do not check for shutdown while waiting 15 seconds for the next event. Fix this by returning early if we got the signal to shut down.
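A minimal sketch of the fixed loop; the type and identifier names (`monitor`, `shutdown`, `check`) are illustrative, not Gitaly's actual ones:

```go
package supervisor

import (
	"context"
	"time"
)

// monitor is a stand-in for the supervisor's health monitor.
type monitor struct {
	shutdown chan struct{} // closed when the supervisor shuts down
}

// run checks health every 15 seconds. The select now also watches the
// shutdown channel while waiting, so the goroutine returns early
// instead of sleeping through one more interval.
func (m *monitor) run(ctx context.Context, check func() error) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			_ = check() // the real code forwards the result as an event
		case <-m.shutdown:
			return
		case <-ctx.Done():
			return
		}
	}
}
```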
-
Patrick Steinhardt authored
The bootstrapper is currently leaking multiple Goroutines in the upgrader because we never make sure to clean it up after tests have finished. Add its existing `Stop()` function to the upgrader interface and call it as required to plug these leaks.
-
Patrick Steinhardt authored
In many tests where the catfile cache is in use, we create it but don't ever clean it up. As a result, the monitoring Goroutine and any potentially cached processes may not get cleaned up as expected. Fix this by calling `Stop()` as required.
-
Patrick Steinhardt authored
The gitaly-backup manager currently creates an ad-hoc instance of the connection pool. This pool never gets cleaned up and thus results in leakage of its connections. Inject the connection pool into the manager and clean it up as required.
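A hedged sketch of the injection, assuming Gitaly's `client.Pool` (the `Manager` shape is illustrative):

```go
package backup

import (
	"gitlab.com/gitlab-org/gitaly/v14/client"
)

// Manager is a stand-in for the gitaly-backup manager; the injection
// pattern is the point here.
type Manager struct {
	pool *client.Pool
}

// NewManager takes the pool from the caller, who stays responsible for
// closing it once all backups are done, e.g. via defer pool.Close().
func NewManager(pool *client.Pool) *Manager {
	return &Manager{pool: pool}
}
```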
-
Patrick Steinhardt authored
Now that the "testhelper" package doesn't depend on "command" anymore, convert the "command" package to use it.
-
Patrick Steinhardt authored
The "testhelper" package currently depends on the "command" package, which is why we cannot use the "testhelper"'s setup functions in the latter package. The only remaining dependency is the wait group which is used to count in-flight commands though such that we can assert that there are no leaking processes after tests have concluded. Move this wait group into a separate "commandcounter" package to remove this dependency. While this also allows other packages to modify the counter due to its interface now being public, this really shouldn't be much of a problem at all.
-
Patrick Steinhardt authored
While the "command" package provides a nice helper to check for process's error codes, use of the "command" package in the testhelper keeps us from using leak checkers in the "command" package itself. Open-code this function in the testhelper package to reduce dependencies to the "command" package.
-
Patrick Steinhardt authored
Refactor checks for leaked processes such that there are fewer calls to `panic()`. This prepares for the inlining of the "command" package in order to break a cyclic dependency.
-
Patrick Steinhardt authored
The function `MustHaveNoChildProcess()` is not a general test helper that should be used by tests as they see fit, but is instead intended to be run at the end of tests only. Move it into a separate package and make it internal to the package to make it less discoverable by callers.
-
Patrick Steinhardt authored
Setup of a test package's main function is quite repetitive: we call a separate `testMain()` function such that we can use `defer` calls; this function calls `MustHaveNoChildProcess()` and `Configure()`, and then we return the result of `m.Run()`. This is almost always exactly the same, with some exceptions where we need additional setup code. Unify this code via a single `testhelper.Run()` function which does all of this. It allows us greater flexibility in the setup code such that we can easily perform additional validation after the tests without having to modify all call sites. Note that this change removes calls to the goleak package in some places. These will resurface in a later commit in this patch series, where we instead move this call into `testhelper.Run()` such that it is executed by default.
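A sketch of the unified entry point; the hooks are stand-ins based on the description above:

```go
package testhelper

import "testing"

// Stand-ins for the real helpers described above.
func Configure()              {}
func mustHaveNoChildProcess() {}

// Run unifies the repetitive testMain pattern: shared setup, the test
// run itself, and shared teardown checks in one place.
func Run(m *testing.M) int {
	defer mustHaveNoChildProcess() // runs after the tests, before os.Exit in the caller
	Configure()
	return m.Run()
}
```

Each package's entry point then collapses to `func TestMain(m *testing.M) { os.Exit(testhelper.Run(m)) }`.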
-
Patrick Steinhardt authored
When configuration of the Ruby sidecar fails, we call `log.Fatalf()` and thus directly abort the complete test suite. Given that we call this function on-demand in tests nowadays, we should instead return the error and validate at the test site that the server was set up correctly, such that other tests continue to run as expected. Fix this issue by returning the error instead of aborting.
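A minimal sketch of the shape of the change; all types and names here are illustrative stand-ins, not Gitaly's real ones:

```go
package rubyserver

import "errors"

// Illustrative stand-ins for the real configuration and server types.
type Config struct{ RubyDir string }
type Server struct{}

func configure(cfg Config) (*Server, error) {
	if cfg.RubyDir == "" {
		return nil, errors.New("ruby directory not configured")
	}
	return &Server{}, nil
}

// setupRubyServer used to call log.Fatalf() on error, aborting the
// entire test binary. Returning the error instead lets each test
// assert on it while unrelated tests keep running.
func setupRubyServer(cfg Config) (*Server, error) {
	return configure(cfg)
}
```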
-
James Fargher authored
Praefect: add sidechannel support See merge request !3862
-
- Oct 06, 2021
-
-
Jacob Vosmaer authored
This commit adds backchannel support to the main gRPC listener of Praefect. If clients make gRPC calls with sidechannels, Praefect will now proxy these to the Gitaly backend. Changelog: added
-
Pavlo Strokov authored
remote: Drop FetchInternalRemote RPC Closes #3667 See merge request !3898
-
Pavlo Strokov authored
datastore: Fix storage cleanup's handling of timezones See merge request !3930
-
Patrick Steinhardt authored
The FetchInternalRemote RPC had been used internally to replicate repositories across different Gitaly nodes. At some point in time, this was converted to do a direct fetch without an additional RPC call because, as it turned out, doing inter-Gitaly mutating RPC calls is problematic in combination with transactions, given that the remote side would then try to cast votes against another Gitaly. Nowadays, there are no callers left which use this RPC. Remove the deprecated `FetchInternalRemote()` call. The backing logic, which is still called internally, remains for use in other parts of Gitaly. We may eventually want to find a better place for it to live. Changelog: removed
-
Patrick Steinhardt authored
Refactor `FetchInternalRemote()` to use the new function to get the default branch provided by the remoterepo package.
-
Patrick Steinhardt authored
We have an ad-hoc implementation of `GetDefaultBranch()` for remote repositories in the `FetchInternalRemote()` RPC implementation. Given that this RPC call simply maps to `localrepo.GetDefaultBranch()`, it makes sense to pull up the implementation into the remoterepo interface to make it generally available. Implement `remoterepo.GetDefaultBranch()` and move tests of this function from localrepo-specific tests into the shared test suite.
-
Patrick Steinhardt authored
The tests for `FetchInternalRemote` are a bit dated and hard to follow. Refactor them to reflect a more modern style. This also prepares for the upcoming change to drop the RPC call, where we only retain its internal non-RPC backing logic.
-
Patrick Steinhardt authored
While the testcase for ReplicateRepository which exercises failing fetches does fail as expected, the reason for the failure is not the failing fetch, but that we do not have the required set of binaries installed. Fixing this surfaces another issue: because fetches nowadays happen internally instead of calling the `FetchInternalRemote()` RPC, the mock remote server which stubs out the RPC with an error isn't invoked at all. Fix the test setup by setting up a "real" gRPC server, where the fetch failure is provoked by writing garbage into the source repo's HEAD.
-
Patrick Steinhardt authored
Timestamps in the `storage_cleanups` table are stored without timezones, but our timezone handling in `AcquireNextStorage()` is inconsistent: we typically just use `NOW()` to get the current timestamp, but in one case we use `NOW() AT TIME ZONE 'UTC'`. In some setups, this leads to timestamps which cannot be compared correctly with each other. Fix the bug by consistently storing timestamps without time zones. This bug is similar to 4a2ac0ed (datastore: Fix acknowledgement of stale jobs considering timezones, 2021-08-20). Changelog: fixed
-
Pavlo Strokov authored
replication: Process replication events for storages in parallel See merge request !3894
-
Patrick Steinhardt authored
catfile: Reintroduce refactoring and plug Goroutine leak in the cache Closes gitlab-com/gl-infra/production#5566 See merge request !3905
-
Patrick Steinhardt authored
A recent refactoring of the catfile cache has caused an incident via a new Goroutine leak. Now that the leak has been plugged, assert that we do not leak any Goroutines via `MustHaveNoGoroutines()` such that we cannot have new regressions there.
-
Patrick Steinhardt authored
When constructing processes as part of the catfile cache, we must create a separate context in case the process is cacheable, such that it does not get killed when the original request context gets cancelled. Starting with commit 36c166a1 (catfile: Explicitly close processes, 2021-09-09), we have converted the code to always explicitly close processes instead of doing so via the context. As a result though, this separate detached context will never get cancelled at all, causing us to leak Goroutines which listen on context cancellation to perform various cleanup operations. While we could simply revert this commit, the original implementation was quite indirect given that all process cancellation happened via this context. The context was thus passed to the `BatchProcess`, which knew to cancel it as required. This put the burden on children of the cache, which is an inversion of responsibilities: it should be the cache which is managing lifetimes, not the processes. Fix the Goroutine leak by instead stowing away the separate context's cancellation function as part of the cache entries. Like this, the cache can retrieve and cancel them whenever it evicts entries from the cache, and it can cancel them when a process cannot be returned to the cache because it is full. In the other case of uncacheable processes, we do not replace the surrounding context's cancellation function at all and can thus continue to rely on the caller to eventually cancel the context. Fixes: gitlab-com/gl-infra/production#5566 Changelog: fixed
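A condensed sketch of the ownership model described above; the types are stand-ins, not the real catfile ones:

```go
package catfile

import "context"

type process struct{}

func (p *process) close() {}

// entry owns the detached context's cancellation function, so the
// cache, not the process, decides when the goroutines tied to that
// context stop.
type entry struct {
	process *process
	cancel  context.CancelFunc
}

type cache struct {
	entries []*entry
}

// evict removes an entry, closes its process, and cancels the detached
// context so cleanup goroutines listening on it terminate.
func (c *cache) evict(i int) {
	e := c.entries[i]
	c.entries = append(c.entries[:i], c.entries[i+1:]...)
	e.process.close()
	e.cancel()
}
```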
-
Patrick Steinhardt authored
Explicitly stop the catfile cache whenever we construct it in our test suite such that we shut down the monitoring Goroutine and evict any processes hosted by it.
-
Patrick Steinhardt authored
In some packages, we have started to assert that we do not leak Goroutines via the goleak package. One exception we need to always have present is for the opencensus package given that it starts a Goroutine in its `init()` function which we have no way to stop. Provide a central function `MustHaveNoGoroutines()` in the testhelper package such that we do not have to duplicate this exception whenever we use the goleak package.
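Gitaly uses the goleak package for these assertions; a sketch of what the central helper plausibly looks like, with the well-known opencensus exemption (the signature is an assumption):

```go
package testhelper

import (
	"testing"

	"go.uber.org/goleak"
)

// MustHaveNoGoroutines asserts that no Goroutines leaked. The
// opencensus worker is started from an init() function and can never
// be stopped, so it is the one standing exception.
func MustHaveNoGoroutines(t *testing.T) {
	goleak.VerifyNone(t,
		goleak.IgnoreTopFunction("go.opencensus.io/stats/view.(*worker).start"),
	)
}
```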
-
Patrick Steinhardt authored
This reverts commit f99dae3b, reversing changes made to 15c4a53f.
-
James Fargher authored
Since gitlab-labkit also depends on activesupport, this needed updating too. These gems have been updated to match gitlab-rails. Changelog: changed
-
- Oct 05, 2021
-
-
Toon Claes authored
Add add-repository praefect subcmd Closes #3773 See merge request !3918
-
Jacob Vosmaer authored
This change adds publicly exported code to the gitaly/client package that allows Gitaly clients to accept sidechannel connections coming back from a Gitaly server they have connected to. This is done via a new dial function DialSidechannel. Changelog: added
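A rough usage sketch; the registry constructor and the exact `DialSidechannel` signature are assumptions beyond what the message states, so consult the client package for the real API:

```go
package main

import (
	"context"
	"log"

	"github.com/sirupsen/logrus"
	"gitlab.com/gitlab-org/gitaly/v14/client"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()

	// Assumed shape: a registry tracks sidechannel connections coming
	// back from the server the client has dialled.
	registry := client.NewSidechannelRegistry(logrus.NewEntry(logrus.New()))

	conn, err := client.DialSidechannel(ctx, "tcp://gitaly.internal:8075", registry,
		[]grpc.DialOption{grpc.WithInsecure()})
	if err != nil {
		log.Fatalf("dial with sidechannel support: %v", err)
	}
	defer conn.Close()
}
```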
-
John Cai authored
Adds a track-repository subcommand that allows an admin to add a repository that exists on one of the Gitaly nodes to the Praefect database. Does some safety checks before inserting the records. Changelog: added
-
Pavlo Strokov authored
Given that we are able to configure the number of workers used to process replication events, we add that capability to the ReplMgr. It now starts the configured number of workers per virtual storage to process replication events. If the number of workers is higher than the minimum number of Gitaly storages, it will be aligned with the latter and a log message about it will be issued. If the number is not set, the default value of 1 is used (the old behaviour). Part of: #2915
-
Pavlo Strokov authored
A new configuration parameter sets the number of workers used to process replication jobs. It is a safe, incremental move towards rolling out parallel storage processing in the replication manager. The default value of 1 behaves the same as if no changes had been done to the replication manager (the old behaviour). Part of: #2915
-
Pavlo Strokov authored
The current implementation iterates over Gitaly storages and executes replication events on each sequentially. As some replication events can take a significant amount of time to complete, this results in a long replication lag for the other storages, as their replication is blocked. With this change we start to process replication events for each Gitaly storage in parallel, so there is no replication lag between them. If one storage is blocked on the processing of a replication event, all others are still able to run their replication events in the meantime. Part of: #2915 Changelog: changed
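The shape of the change, as a hedged sketch (the manager's real fields and methods differ):

```go
package replication

import (
	"context"
	"sync"
)

// processStorages runs one event-processing loop per Gitaly storage,
// so a slow event on one storage no longer blocks replication for the
// others.
func processStorages(ctx context.Context, storages []string, process func(context.Context, string)) {
	var wg sync.WaitGroup
	for _, storage := range storages {
		wg.Add(1)
		go func(storage string) {
			defer wg.Done()
			process(ctx, storage) // loops over replication events until ctx is done
		}(storage)
	}
	wg.Wait()
}
```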
-
Pavlo Strokov authored
To support parallel processing of replication events, we need to refactor usage of the backoff function. It is now a factory that creates backoff functions on demand. With that we can create a backoff function for each goroutine, so they are independent of one another.
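A sketch of the factory idea; the type names and exponential parameters are illustrative:

```go
package replication

import "time"

// backoff returns the next delay to wait before retrying.
type backoff func() time.Duration

// backoffFactory creates an independent backoff function per caller,
// so parallel worker goroutines don't share exponentially growing
// state.
type backoffFactory func() backoff

func expBackoffFactory(start, max time.Duration) backoffFactory {
	return func() backoff {
		next := start // each factory call gets its own closure state
		return func() time.Duration {
			d := next
			next *= 2
			if next > max {
				next = max
			}
			return d
		}
	}
}
```

Each worker calls the factory once at startup and then owns the resulting closure outright.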
-
- Oct 04, 2021
-
-
James Fargher authored
In order to gracefully handle incremental backups where nothing has changed, we need to determine when `git bundle create` failed because the bundle was empty. This isn't an error for incremental backups; it simply means there were no updates. The FailedPrecondition gRPC code was chosen here since it best matches the litmus test: > (c) Use FailedPrecondition if the client should not retry until > the system state has been explicitly fixed. E.g., if an "rmdir" > fails because the directory is non-empty, FailedPrecondition > should be returned since the client should not retry unless > they have first fixed up the directory by deleting files from it. Changelog: changed
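In Go terms, the server side could classify the failure like this; a sketch in which the detection via git's stderr message is a simplification of whatever the real code does:

```go
package backup

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// wrapBundleError maps git's "empty bundle" failure to
// FailedPrecondition so that incremental backups can treat "nothing
// changed" as a non-error instead of retrying.
func wrapBundleError(err error, stderr string) error {
	if err == nil {
		return nil
	}
	if strings.Contains(stderr, "Refusing to create empty bundle") {
		return status.Error(codes.FailedPrecondition, "cannot create bundle: bundle is empty")
	}
	return err
}
```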
-