  1. Nov 19, 2021
    • Merge branch 'pks-praefect-datastore-collector-metrics-endpoint-v14.4' into '14-4-stable' · 2ceeecb9
      Patrick Steinhardt authored
      praefect: Backport separate endpoint for datastore collector (v14.4)
      
      See merge request !4094
      2ceeecb9
    • praefect: Do not collect repository store metrics on startup · 3cde9b5e
      John Cai authored and Patrick Steinhardt committed
      Our current code path will trigger the RepositoryStoreCollector to query
      the database on startup, even if the prometheus listener is not
      listening. This is because we call DescribeByCollect in the Describe
      method. The Prometheus client will call Describe on Register, which ends
      up triggering the Collect method and hence runs the queries. Instead, we
      can provide the descriptions separately from the Collect method.
      
      Changelog: fixed
      (cherry picked from commit 90cb7fb7)
      3cde9b5e
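      A minimal sketch of the idea, using the Prometheus Go client with hypothetical type and metric names rather than the actual Gitaly code:

      ```go
      package praefect

      import "github.com/prometheus/client_golang/prometheus"

      // repositoryStoreCollector is a hypothetical stand-in for Praefect's
      // RepositoryStoreCollector; the metric name below is illustrative.
      type repositoryStoreCollector struct{}

      var readOnlyDesc = prometheus.NewDesc(
          "praefect_read_only_repositories",
          "Number of repositories in read-only mode per virtual storage.",
          []string{"virtual_storage"}, nil,
      )

      // Describe sends static descriptors only. Crucially, it does not call
      // prometheus.DescribeByCollect, which would invoke Collect (and hence the
      // database queries) as soon as the collector is registered.
      func (c *repositoryStoreCollector) Describe(ch chan<- *prometheus.Desc) {
          ch <- readOnlyDesc
      }

      // Collect performs the expensive database work; it now only runs when the
      // metrics endpoint is actually scraped.
      func (c *repositoryStoreCollector) Collect(ch chan<- prometheus.Metric) {
          // In reality the value would come from the database query.
          ch <- prometheus.MustNewConstMetric(readOnlyDesc, prometheus.GaugeValue, 0, "default")
      }
      ```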
    • praefect: Add ability to have separate database metrics endpoint · ebaade4a
      John Cai authored and Patrick Steinhardt committed
      By default, when metrics are enabled, each Praefect will expose
      information about how many read-only repositories there are, which
      requires Praefect to query the database. First, this results in the
      same metrics being exposed by every Praefect, given that the database is
      shared between all of them. And second, this causes one query per
      Praefect per scraping run. This cost adds up and generates quite some
      load on the database, especially if there are a lot of repositories in
      that database, to the point where it may overload the database
      completely.
      
      Fix this issue by splitting metrics which hit the database into a
      separate endpoint "/db_metrics". This allows admins to set up a separate
      scraper with a different scraping interval for this metric, and
      furthermore it gives the ability to only scrape this metric for one of
      the Praefect instances so the work isn't unnecessarily duplicated.
      
      Given that this is a breaking change which will get backported, we must
      make this behaviour opt-in for now. We thus introduce a new configuration
      key "prometheus_use_database_endpoint" which enables the new behaviour,
      such that existing installations' metrics won't break on a simple point
      release. The intent is to eventually remove this configuration and
      enable the new behaviour for all setups in a major release.
      
      Changelog: added
      (cherry picked from commit 7e74b733)
      ebaade4a
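      A rough sketch of the split, with illustrative names; the real listen address and the "prometheus_use_database_endpoint" wiring live in Praefect's configuration handling:

      ```go
      package main

      import (
          "log"
          "net/http"

          "github.com/prometheus/client_golang/prometheus"
          "github.com/prometheus/client_golang/prometheus/promhttp"
      )

      func main() {
          // Database-backed metrics get their own registry so that scraping the
          // regular /metrics endpoint never queries the database.
          dbRegistry := prometheus.NewRegistry()
          dbRegistry.MustRegister(prometheus.NewGaugeFunc(
              prometheus.GaugeOpts{
                  Name: "praefect_read_only_repositories", // illustrative name
                  Help: "Number of read-only repositories, computed from the database.",
              },
              func() float64 {
                  // ... run the expensive database query here ...
                  return 0
              },
          ))

          mux := http.NewServeMux()
          // Default registry: cheap per-process metrics, scraped by every Praefect.
          mux.Handle("/metrics", promhttp.Handler())
          // Separate endpoint: can be scraped from a single Praefect at its own,
          // longer interval.
          mux.Handle("/db_metrics", promhttp.HandlerFor(dbRegistry, promhttp.HandlerOpts{}))

          log.Fatal(http.ListenAndServe(":9236", mux))
      }
      ```

      A Prometheus scrape job can then target /db_metrics on a single Praefect with its own scrape interval, which is exactly the flexibility the commit describes.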
    • prometheus: Avoid duplicated metrics registration · aac5d5e5
      Pavlo Strokov authored and Patrick Steinhardt committed
      Praefect uses Prometheus to export its internal metrics.
      It relies on the defaults from the Prometheus library
      to gather the set of metrics and to register new ones.
      Because of this, new metrics get registered on the
      DefaultRegisterer, a global pre-configured registerer.
      As a result we can't call the 'run' function multiple
      times (for testing purposes), as doing so results in a
      metrics registration error. To avoid that problem the
      'run' function is extended with a prometheus.Registerer
      parameter that is used to register Praefect's custom
      metrics. The production code still uses the same
      DefaultRegisterer as before, while the test code creates
      a new registerer instance for each 'run' invocation, so
      there are no more duplicate registrations.
      
      (cherry picked from commit 81368d46)
      aac5d5e5
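      The pattern might look roughly like this (hypothetical names):

      ```go
      package praefect

      import "github.com/prometheus/client_golang/prometheus"

      // run registers Praefect's custom metrics on whatever registerer it is
      // given instead of implicitly using the global DefaultRegisterer.
      func run(reg prometheus.Registerer) error {
          requests := prometheus.NewCounter(prometheus.CounterOpts{
              Name: "praefect_example_requests_total", // illustrative metric
              Help: "Example counter registered once per run invocation.",
          })
          if err := reg.Register(requests); err != nil {
              return err
          }
          // ... start listeners and servers, use the metrics ...
          return nil
      }

      // Production code: run(prometheus.DefaultRegisterer)
      // Test code:       run(prometheus.NewRegistry()), a fresh registry per
      //                  invocation, so repeated runs no longer collide.
      ```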
    • bootstrap: Abstract bootstrapper for testing · 43031e2d
      Pavlo Strokov authored and Patrick Steinhardt committed
      The old implementation of the bootstrapper initialization
      does not allow calling the 'run' function to start a service,
      because the tableflip library doesn't support creating
      multiple instances in one process.
      Starting the Praefect service is required in tests to verify
      sub-command execution. The bootstrapper initialization is
      therefore extracted out of the 'run' function. This allows
      using a new Noop bootstrapper to run the service without
      tableflip support.
      
      (cherry picked from commit 18ff3676)
      43031e2d
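      One way to picture the abstraction, with a hypothetical interface rather than the actual Gitaly types:

      ```go
      package bootstrap

      import "net"

      // Bootstrapper abstracts how 'run' obtains its listeners so the
      // tableflip-based implementation can be swapped out; names are illustrative.
      type Bootstrapper interface {
          Listen(network, addr string) (net.Listener, error)
          Wait() error
          Stop()
      }

      // Noop implements Bootstrapper with plain listeners and no zero-downtime
      // upgrade machinery, so tests can instantiate it many times in one process.
      type Noop struct{}

      func (Noop) Listen(network, addr string) (net.Listener, error) {
          return net.Listen(network, addr)
      }

      func (Noop) Wait() error { return nil }
      func (Noop) Stop()       {}
      ```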
  2. Nov 18, 2021
    • Merge branch 'smh-optimize-dataloss-query-14-4' into '14-4-stable' · c1cf3752
      Toon Claes authored
      Materialize valid_primaries view (14.4)
      
      See merge request !4090
      c1cf3752
    • Materialize valid_primaries view in RepositoryStoreCollector · df6b165f
      Sami Hiltunen authored
      RepositoryStoreCollector gathers metrics on repositories which don't
      have any valid primary candidates available. This indicates the repository
      is unavailable, as the current primary is not valid and there are no valid
      candidates to fail over to. The query is currently extremely inefficient
      on some versions of Postgres, as it ends up computing the full valid_primaries
      view for each of the rows it checks. This doesn't seem to occur on all versions
      of Postgres; 12.6 at least manages to push down the search criteria
      inside the view. This commit fixes the situation by materializing the
      valid_primaries view prior to querying it. This ensures the full view isn't
      computed for all of the rows; instead Postgres just uses the pre-computed
      result.
      
      Changelog: performance
      df6b165f
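      The general shape of the fix, as a simplified illustration rather than the actual query (column names are assumed):

      ```go
      package datastore

      import (
          "context"
          "database/sql"
      )

      // countUnavailableRepositories shows the materialization trick: the
      // valid_primaries view is evaluated once inside a MATERIALIZED CTE
      // (Postgres 12+) and the outer query joins against that pre-computed
      // result instead of re-expanding the view per row.
      func countUnavailableRepositories(ctx context.Context, db *sql.DB) (int64, error) {
          const query = `
      WITH valid_primaries AS MATERIALIZED (
          SELECT * FROM valid_primaries
      )
      SELECT COUNT(*)
      FROM repositories AS r
      WHERE NOT EXISTS (
          SELECT FROM valid_primaries AS vp
          WHERE vp.virtual_storage = r.virtual_storage
          AND   vp.relative_path   = r.relative_path
      )`

          var count int64
          err := db.QueryRowContext(ctx, query).Scan(&count)
          return count, err
      }
      ```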
    • Get the latest generation from repositories instead of a view · 57bef779
      Sami Hiltunen authored
      The dataloss query currently gets the latest generation of a repository
      from a view that takes the max generation from storage_repositories. This
      is unnecessary, as the repositories table already contains the latest generation
      and we can take it from there. This commit reads it from the repositories
      table instead.
      
      Changelog: performance
      57bef779
  3. Nov 17, 2021
    • Materialize valid_primaries view in dataloss query · 6d569bb6
      Sami Hiltunen authored
      The dataloss query is extremely slow for bigger datasets. The problem
      is that for each row that the dataloss query is returning,
      Postgres computes the full result of the valid_primaries view only to
      filter down to the correct record. This results in O(n²) complexity,
      which kills the performance as soon as the dataset size increases. It's
      not clear why the join parameters are not pushed down into the view in
      the query.
      
      This commit optimizes the query by materializing the valid_primaries view.
      This ensures Postgres computes the full view only once and joins with the
      pre-computed result.
      
      Changelog: performance
      6d569bb6
  4. Oct 07, 2021
    • Merge branch 'ps-fix-flaky-cache-clean' into 'master' · 4f0a07ba
      John Cai authored
      streamcache: Fix flaky TestCache_diskCleanup
      
      Closes #3822
      
      See merge request !3942
      4f0a07ba
    • streamcache: Fix flaky TestCache_diskCleanup · f8434b2d
      Pavlo Strokov authored
      Because the implementation used a single sleep function for
      both the filestore cleanup and the clean function invocation,
      closing the channel to trigger a "single" invocation of the
      clean was broken: once closed, the channel returns the default
      value immediately and repeatedly, but for the test we need to
      run the clean only once. We also can't send two elements on
      the single channel, as we can't be sure both won't be consumed
      by one of the two functions. That is why we add a separate sleep
      function for each of them. By sending on each channel
      we are sure each function is triggered exactly once.
      f8434b2d
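      A simplified illustration of the pattern with hypothetical names: each background task gets its own injected sleep function, and the test sends exactly one value on each channel:

      ```go
      package streamcache

      import "time"

      // sleepFunc lets tests replace the real sleep between cleanup passes.
      type sleepFunc func(time.Duration)

      // startCleanups runs each cleanup task in its own goroutine with its own
      // injected sleep function; production would pass time.Sleep for both.
      func startCleanups(filestoreSleep, cleanSleep sleepFunc, filestoreCleanup, clean func()) {
          go func() {
              filestoreSleep(time.Minute)
              filestoreCleanup()
          }()
          go func() {
              cleanSleep(time.Minute)
              clean()
          }()
      }

      // In the test, each sleep function blocks on its own channel, and the test
      // sends exactly one value per channel:
      //
      //  wakeFilestore := make(chan struct{})
      //  wakeClean := make(chan struct{})
      //  startCleanups(
      //      func(time.Duration) { <-wakeFilestore },
      //      func(time.Duration) { <-wakeClean },
      //      cleanupFilestore, cleanCache, // hypothetical cleanup functions
      //  )
      //  wakeFilestore <- struct{}{} // filestore cleanup runs exactly once
      //  wakeClean <- struct{}{}     // clean runs exactly once
      ```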
    • Merge branch 'pks-testhelper-goroutine-leakage' into 'master' · 043bb6e6
      Pavlo Strokov authored
      testhelper: Enable Goroutine leak checks by default
      
      Closes #3790
      
      See merge request !3909
      043bb6e6
    • Merge branch 'ps-fix-migration' into 'master' · efb14ad0
      Toon Claes authored
      sql-migrate: Update storage_repositories table
      
      Closes #3806
      
      See merge request !3927
      efb14ad0
    • testhelper: Enable Goroutine leak checks by default · 9ef46dc0
      Patrick Steinhardt authored
      A recent refactoring of the catfile cache caused an incident due to
      leaking Goroutines. While we did already check for leaking Goroutines
      in a subset of our test suite before via the goleak check, this was
      opt-in the same way as `mustHaveNoChildProcess()` was. It also was
      used a lot less often, given that we did have hundreds of actual
      Goroutine leaks in our test suite.

      Introduce leak checking by default now that almost all of the Goroutine
      leaks have been fixed. There are only two packages remaining which still
      leak Goroutines: the catfile package, which will get fixed via an MR
      which is currently in flight, and the backup package, which is leaking
      Goroutines in a package external to us. To the best of my knowledge,
      there is no way to plug those.
      
      In any case, having on-by-default leak checking should put us in a much
      better position to detect such regressions in the future.
      9ef46dc0
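      With the goleak library, on-by-default leak checking typically boils down to a TestMain like the following sketch (the actual testhelper wiring may differ):

      ```go
      package example_test

      import (
          "testing"

          "go.uber.org/goleak"
      )

      // TestMain runs the package's tests and then fails the run if any
      // Goroutines that were not present at startup are still alive.
      func TestMain(m *testing.M) {
          goleak.VerifyTestMain(m)
      }
      ```

      Known, unavoidable leaks (for example in an external package) can be excluded by passing goleak.IgnoreTopFunction options to VerifyTestMain.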
    • global: Close gRPC connections · 5981dc52
      Patrick Steinhardt authored
      When creating gRPC connections, then we spawn a set of Goroutines which
      listen on these connections. As a result, if they are never closed,
      those Goroutines are leaked.
      
      Fix this by closing connections.
      5981dc52
    • praefect: Stop gRPC servers on exit · 401a62b5
      Patrick Steinhardt authored
      Among other things, the Praefect server's `run()` function
      also creates the gRPC servers. These servers are never
      stopped though, which means that we leak the listeners set up
      for these servers in our tests.

      Fix this by calling `Stop()` on these servers.
      401a62b5
    • nodes: Fix leaking connections and Goroutines due to missing cleanup · f501a33d
      Patrick Steinhardt authored
      The node manager creates connections to all of its known nodes and
      starts monitoring routines which check the respective nodes' health
      status. We never clean up either of them, which leads to lots of
      Goroutine leakages in our tests.

      Fix this by providing a new `Stop()` function for the manager, which both
      stops the electors' monitoring Goroutines and closes the node
      connections, and by calling this function as required.
      f501a33d
    • glsql: Close database if open fails on ping · 44bab7ad
      Patrick Steinhardt authored
      When opening the database, we first create the connection and then ping
      it to assert that we can correctly establish the connection. If the ping
      fails though, then we leak the connection because we don't clean up the
      already-opened database struct.
      
      Fix this issue by closing the database if the ping failed.
      44bab7ad
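      The shape of the fix, sketched with database/sql (illustrative, not the exact glsql code):

      ```go
      package glsql

      import (
          "context"
          "database/sql"
      )

      // openDB pings the freshly opened handle and, if the ping fails, closes it
      // again so the already-established connections are not leaked.
      func openDB(ctx context.Context, driver, dsn string) (*sql.DB, error) {
          db, err := sql.Open(driver, dsn)
          if err != nil {
              return nil, err
          }

          if err := db.PingContext(ctx); err != nil {
              _ = db.Close() // plug the leak: release the already-opened handle
              return nil, err
          }

          return db, nil
      }
      ```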
    • streamcache: Make cleanup Goroutines stoppable · 45d81b07
      Patrick Steinhardt authored
      The streamcache has two Goroutines which make sure to clean up stale
      cache entries, either on disk or in-memory ones. These walkers currently
      cannot be stopped at all, leading to Goroutine leakages in our tests.
      
      Fix this by making the `Stop()` functions also signal the Goroutines to
      stop and by calling this function in the testserver setup code.
      45d81b07
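      The stop-channel pattern described here, in reduced form with illustrative names:

      ```go
      package streamcache

      import "time"

      // walker periodically removes stale cache entries until Stop is called.
      type walker struct {
          stop chan struct{}
      }

      func newWalker(interval time.Duration, cleanup func()) *walker {
          w := &walker{stop: make(chan struct{})}

          go func() {
              ticker := time.NewTicker(interval)
              defer ticker.Stop()

              for {
                  select {
                  case <-ticker.C:
                      cleanup()
                  case <-w.stop:
                      // Signalled by Stop(): exit instead of leaking the Goroutine.
                      return
                  }
              }
          }()

          return w
      }

      // Stop terminates the cleanup Goroutine; test server teardown calls this.
      func (w *walker) Stop() { close(w.stop) }
      ```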
    • cache: Add ability to stop disk walkers · a256b01c
      Patrick Steinhardt authored
      The disk cache spawns a set of walkers which regularly walk the cache to
      check for cache members which are older than a given threshold. These
      walkers currently cannot be stopped at all, leading to a Goroutine
      leakage in our tests.
      
      Implement a new `StopWalkers()` function which stops them and use it in
      our tests.
      a256b01c
    • dontpanic: Refactor code to allow stopping of `GoForever()` · fc7ed4ea
      Patrick Steinhardt authored
      The dontpanic package provides a `GoForever()` function which allows the
      caller to run a function forever even if it panics. While this is a
      useful thing to have, the current design also causes leakage of
      Goroutines because we do not have a way to stop these functions even in
      the context of tests.
      
      Introduce a new `Forever` structure which now hosts the `Go()` function.
      Like this, we can also easily introduce a `Cancel()` function which
      causes us to abort execution of the function.
      
      Callers are not yet adjusted to make use of the `Cancel()` function.
      fc7ed4ea
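      A rough sketch of such a `Forever` type with `Go()` and `Cancel()`, mirroring the description above rather than the exact dontpanic API:

      ```go
      package dontpanic

      import (
          "log"
          "time"
      )

      // Forever repeatedly runs a function, recovering from panics, until
      // Cancel is called.
      type Forever struct {
          done chan struct{}
      }

      func NewForever() *Forever { return &Forever{done: make(chan struct{})} }

      // Go runs fn in a loop; a panic is logged and the loop resumes after the
      // given backoff, while closing the done channel stops the loop entirely.
      func (f *Forever) Go(backoff time.Duration, fn func()) {
          go func() {
              for {
                  func() {
                      defer func() {
                          if r := recover(); r != nil {
                              log.Printf("recovered from panic: %v", r)
                          }
                      }()
                      fn()
                  }()

                  select {
                  case <-f.done:
                      return
                  case <-time.After(backoff):
                  }
              }
          }()
      }

      // Cancel aborts execution of the function so that tests don't leak the
      // Goroutine.
      func (f *Forever) Cancel() { close(f.done) }
      ```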
    • service: Plug leakages of connection pools · 125f313f
      Patrick Steinhardt authored
      While some of our service servers get the connection pool injected as a
      dependency, others construct them ad-hoc. These ad-hoc creations are
      never cleaned up though and thus lead to Goroutine leakages.

      Fix the leakage by always using the same injected client pool. Luckily,
      all callsites do in fact use the same options to construct the pool and
      thus we shouldn't see any change in behaviour, except for fewer
      connections being created.
      125f313f
    • supervisor: Stop monitor if shutdown was signalled · 1641ec51
      Patrick Steinhardt authored
      The supervisor's health monitor will check health every 15 seconds or
      alternatively exit if its shutdown channel was closed. While we do
      verify this channel correctly when trying to send the event, we do not
      check for shutdown when waiting 15 seconds for the next event.
      
      Fix this by returning early if we got the signal to shut down.
      1641ec51
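      In essence the fix replaces the plain 15-second wait with a select that also honours the shutdown channel (a sketch, not the exact supervisor code):

      ```go
      package supervisor

      import "time"

      // monitorHealth checks health every 15 seconds until shutdown is closed.
      func monitorHealth(check func(), shutdown <-chan struct{}) {
          for {
              select {
              case <-time.After(15 * time.Second):
                  check()
              case <-shutdown:
                  // Previously the Goroutine kept waiting out the 15 seconds
                  // even after shutdown was signalled; now it returns early.
                  return
              }
          }
      }
      ```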
    • bootstrap: Make bootstrapper stoppable to plug Goroutine leaks · 22351f97
      Patrick Steinhardt authored
      The bootstrapper is currently leaking multiple Goroutines in the
      upgrader because we never make sure to clean it up after tests have
      finished. Add its existing `Stop()` function to the upgrader interface
      and call it as required to plug these leaks.
      22351f97
    • catfile: Stop cache to plug Goroutine leaks · 6cb7e554
      Patrick Steinhardt authored
      In many tests where the catfile cache is in use, we create it but don't
      ever clean up the cache. As a result, the monitoring Goroutine and any
      potentially cached processes may not get cleaned up as expected.
      
      Fix this by calling `Stop()` as required.
      6cb7e554
    • gitaly-backup: Inject connection pool to unleak its Goroutines · ff726e13
      Patrick Steinhardt authored
      The gitaly-backup manager currently creates an ad-hoc instance of the
      connection pool. This pool never gets cleaned up and thus results in
      leakage of its connections.
      
      Inject the connection pool into the manager and clean it up as required.
      ff726e13
    • Merge branch 'pks-catfile-cache-split' into 'master' · df7dadcc
      Patrick Steinhardt authored
      Split up the catfile cache
      
      Closes #3763
      
      See merge request !3886
      df7dadcc
    • command: Start using the testhelper package · a980fb11
      Patrick Steinhardt authored
      Now that the "testhelper" package doesn't depend on "command" anymore,
      convert the "command" package to use it.
      a980fb11
    • command: Solve cyclic dependency with testhelper package · 36c43abf
      Patrick Steinhardt authored
      The "testhelper" package currently depends on the "command" package,
      which is why we cannot use the "testhelper"'s setup functions in the
      latter package. The only remaining dependency is the wait group which is
      used to count in-flight commands though such that we can assert that
      there are no leaking processes after tests have concluded.
      
      Move this wait group into a separate "commandcounter" package to remove
      this dependency. While this also allows other packages to modify the
      counter due to its interface now being public, this really shouldn't be
      much of a problem at all.
      36c43abf
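      Conceptually, the extracted package is just a process-wide wait group behind a public interface; a hypothetical sketch:

      ```go
      // Package commandcounter tracks in-flight commands so that tests can
      // assert that no processes are left over once they have concluded.
      package commandcounter

      import "sync"

      var inFlight sync.WaitGroup

      // Increment is called by the "command" package whenever a command is spawned.
      func Increment() { inFlight.Add(1) }

      // Decrement is called once the spawned command has been reaped.
      func Decrement() { inFlight.Done() }

      // WaitForAll blocks until every spawned command has finished.
      func WaitForAll() { inFlight.Wait() }
      ```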