- Nov 19, 2021
-
-
Patrick Steinhardt authored
praefect: Backport separate endpoint for datastore collector (v14.4) See merge request !4094
-
Our current code path will trigger the RepositoryStoreCollector to query the database on startup, even if the Prometheus listener is not listening. This is because we call DescribeByCollect in the Describe method. The Prometheus client will call Describe on Register, which ends up triggering the Collect method and hence runs the queries. Instead, we can just provide the descriptions separately from the Collect method. Changelog: fixed (cherry picked from commit 90cb7fb7)
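A minimal sketch of that pattern with the Prometheus Go client is shown below; the collector type, its database hook and the metric name are illustrative stand-ins, not the actual Praefect code:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// readOnlyDesc is declared up front so Describe can emit it without running
// any database query.
var readOnlyDesc = prometheus.NewDesc(
	"praefect_read_only_repositories", // illustrative metric name
	"Number of repositories in read-only mode.",
	nil, nil,
)

// repositoryStoreCollector is an illustrative stand-in for Praefect's
// RepositoryStoreCollector.
type repositoryStoreCollector struct {
	countReadOnly func() (int, error) // hypothetical database query
}

// Describe sends the static description instead of calling
// prometheus.DescribeByCollect, so registering the collector no longer
// triggers Collect and therefore no longer hits the database on startup.
func (c *repositoryStoreCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- readOnlyDesc
}

// Collect still queries the database, but only when metrics are scraped.
func (c *repositoryStoreCollector) Collect(ch chan<- prometheus.Metric) {
	count, err := c.countReadOnly()
	if err != nil {
		ch <- prometheus.NewInvalidMetric(readOnlyDesc, err)
		return
	}
	ch <- prometheus.MustNewConstMetric(readOnlyDesc, prometheus.GaugeValue, float64(count))
}
```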
-
By default, when metrics are enabled, each Praefect will expose information about how many read-only repositories there are, which requires Praefect to query the database. First, this will result in the same metrics being exposed by every Praefect, given that the database is shared between all of them. And second, this will cause one query per Praefect per scraping run. This cost adds up and generates quite some load on the database, especially so if there are many repositories in that database, up to a point where it may overload the database completely. Fix this issue by splitting metrics which hit the database into a separate endpoint "/db_metrics". This allows admins to set up a separate scraper with a different scraping interval for this metric, and furthermore it gives the ability to only scrape this metric for one of the Praefect instances so the work isn't unnecessarily duplicated. Given that this is a breaking change which will get backported, we must make this behaviour opt-in for now. We thus include a new configuration key "prometheus_use_database_endpoint" which enables the new behaviour such that existing installations' metrics won't break on a simple point release. The intent is to eventually remove this configuration and enable it for all setups on a major release. Changelog: added (cherry picked from commit 7e74b733)
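The wiring for such a split could look roughly like the sketch below. Only the "/db_metrics" path and the opt-in behaviour come from the change itself; the handler setup, the listen address and the commented-out collector registration are assumptions for illustration:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Regular metrics keep using the default registry and "/metrics".
	http.Handle("/metrics", promhttp.Handler())

	// Database-backed collectors go on their own registry so that only
	// scrapes of "/db_metrics" actually hit the database.
	dbRegistry := prometheus.NewRegistry()
	// dbRegistry.MustRegister(repositoryStoreCollector) // hypothetical collector
	http.Handle("/db_metrics", promhttp.HandlerFor(dbRegistry, promhttp.HandlerOpts{}))

	// Illustrative listen address, not Praefect's actual default.
	log.Fatal(http.ListenAndServe(":9236", nil))
}
```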
-
Praefect uses Prometheus to export its internal metrics. It relies on the defaults from the Prometheus library to gather the set of metrics and to register new ones. Because of that, new metrics get registered on the DefaultRegisterer, a global pre-configured registerer, which means we can't call the 'run' function multiple times (for testing purposes), as repeated registration results in an error. To avoid that problem, the 'run' function is extended with a prometheus.Registerer parameter that is used to register Praefect's custom metrics. The production code still uses the same DefaultRegisterer as before, while the test code creates a new instance of the registerer for each 'run' invocation, so there are no more duplicates. (cherry picked from commit 81368d46)
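A simplified sketch of this injection pattern (the real run function takes more parameters and registers more metrics):

```go
package praefect

import "github.com/prometheus/client_golang/prometheus"

// run is a simplified stand-in for Praefect's run function: it accepts the
// registerer as a parameter instead of reaching for the global default.
func run(registerer prometheus.Registerer) error {
	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "praefect_example_requests_total", // illustrative metric
		Help: "Example counter registered on the injected registerer.",
	})
	if err := registerer.Register(requests); err != nil {
		return err
	}

	// ... set up servers and serve traffic ...
	return nil
}
```

Production code keeps passing prometheus.DefaultRegisterer, while each test invocation passes a fresh prometheus.NewRegistry(), so repeated runs no longer collide on registration.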
-
The old implementation of the bootstrapper initialization does not allow calling the 'run' function to start a service, because the tableflip library doesn't support creating multiple instances in one process. Starting the Praefect service is required in tests to verify sub-command execution. The bootstrapper initialization is therefore extracted out of the 'run' function. This allows using a new Noop bootstrapper to run the service without tableflip support. (cherry picked from commit 18ff3676)
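The idea can be sketched as below; the interface and the Noop type are purely illustrative, since the real bootstrapper in Praefect has a different shape:

```go
package bootstrap

// Listener is the minimal behaviour the run function needs from the
// bootstrapper in this sketch; the real interface differs.
type Listener interface {
	RegisterStarter(starter func() error)
	Start() error
	Wait() error
}

// Noop satisfies the interface without tableflip, so tests can invoke the
// run function repeatedly within a single process.
type Noop struct {
	starters []func() error
}

func (n *Noop) RegisterStarter(starter func() error) {
	n.starters = append(n.starters, starter)
}

func (n *Noop) Start() error {
	for _, start := range n.starters {
		if err := start(); err != nil {
			return err
		}
	}
	return nil
}

func (n *Noop) Wait() error { return nil }
```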
-
- Nov 18, 2021
-
-
Toon Claes authored
Materialize valid_primaries view (14.4) See merge request !4090
-
Sami Hiltunen authored
RepositoryStoreCollector gathers metrics on repositories which don't have valid primary candidates available. This indicates the repository is unavailable, as the current primary is not valid and there are no valid candidates to fail over to. The query is currently extremely inefficient on some versions of Postgres, as it ends up computing the full valid_primaries view for each of the rows it checks. This doesn't seem to occur on all versions of Postgres; 12.6, at least, manages to push the search criteria down into the view. This commit fixes the situation by materializing the valid_primaries view prior to querying it. This ensures the full view isn't computed for all of the rows; instead Postgres just uses the pre-computed result. Changelog: performance
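The general technique relies on the MATERIALIZED keyword for common table expressions available since Postgres 12. The query below is illustrative only (table and column names are assumptions, not Praefect's actual schema), but it shows how materializing forces the view to be computed once rather than once per outer row:

```go
package datastore

import (
	"context"
	"database/sql"
)

// unavailableRepositoriesQuery is illustrative only. The MATERIALIZED keyword
// (Postgres 12+) forces the CTE to be computed once and reused, instead of
// the valid_primaries view being re-evaluated for every outer row.
const unavailableRepositoriesQuery = `
WITH materialized_valid_primaries AS MATERIALIZED (
    SELECT repository_id FROM valid_primaries
)
SELECT COUNT(*)
FROM repositories
WHERE NOT EXISTS (
    SELECT 1
    FROM materialized_valid_primaries
    WHERE materialized_valid_primaries.repository_id = repositories.repository_id
)
`

// countUnavailableRepositories runs the illustrative query above.
func countUnavailableRepositories(ctx context.Context, db *sql.DB) (int, error) {
	var count int
	err := db.QueryRowContext(ctx, unavailableRepositoriesQuery).Scan(&count)
	return count, err
}
```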
-
Sami Hiltunen authored
The dataloss query currently gets the latest generation of a repository from a view that takes the max generation from storage_repositories. This is unnecessary, as the repositories table already contains the latest generation and we can take it from there. This commit reads it from the repositories table instead. Changelog: performance
-
- Nov 17, 2021
-
-
Sami Hiltunen authored
The dataloss query is extremely slow for bigger datasets. The problem is that for each row the data loss query returns, Postgres computes the full result of the valid_primaries view only to filter it down to the correct record. This results in O(n²) complexity, which kills the performance as soon as the dataset size increases. It's not clear why the join parameters are not pushed down into the view in the query. This commit optimizes the query by materializing the valid_primaries view. This ensures Postgres computes the full view only once and joins with the pre-computed result. Changelog: performance
-
- Nov 08, 2021
-
-
GitLab Release Tools Bot authored
[ci skip]
-
GitLab Release Tools Bot authored
[ci skip]
-
- Oct 28, 2021
-
-
GitLab Release Tools Bot authored
-
GitLab Release Tools Bot authored
[ci skip]
-
GitLab Release Tools Bot authored
[ci skip]
-
- Oct 21, 2021
-
-
GitLab Release Tools Bot authored
[ci skip]
-
GitLab Release Tools Bot authored
[ci skip]
-
GitLab Release Tools Bot authored
[ci skip]
-
GitLab Release Tools Bot authored
[ci skip]
-
GitLab Release Tools Bot authored
[ci skip]
-
- Oct 20, 2021
-
-
GitLab Release Tools Bot authored
[ci skip]
-
- Oct 07, 2021
-
-
Pavlo Strokov authored
Because the implementation used a single sleep function for both the filestore cleanup and the clean function invocation, closing the channel to trigger a "single" invocation of the clean was broken: closing a channel makes it return the default value immediately, but for the test we need the clean to run only once. We can't send two elements on a single channel either, as we can't be sure both won't be consumed by one of the two functions. That is why we add a separate sleep function for each of them. By sending on each channel, we are sure each function is triggered exactly once.
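Roughly, the pattern looks like the sketch below; the function names and the test are illustrative rather than the actual Gitaly cache code:

```go
package cache

import (
	"testing"
	"time"
)

// cleanLoop is an illustrative stand-in: it runs clean every time sleep
// returns. Production would pass something like time.Sleep; the test passes
// a channel-backed sleep so each iteration is triggered explicitly.
func cleanLoop(sleep func(time.Duration), clean func()) {
	go func() {
		for {
			sleep(time.Minute)
			clean()
		}
	}()
}

func TestEachLoopRunsOnce(t *testing.T) {
	cleanTrigger := make(chan struct{})
	filestoreTrigger := make(chan struct{})
	ran := make(chan string, 2)

	// Each loop gets its own trigger channel: one send wakes exactly one
	// loop exactly once. A shared channel can't guarantee which loop
	// consumes a send, and closing a channel would wake both loops forever.
	cleanLoop(func(time.Duration) { <-cleanTrigger }, func() { ran <- "clean" })
	cleanLoop(func(time.Duration) { <-filestoreTrigger }, func() { ran <- "filestore cleanup" })

	cleanTrigger <- struct{}{}
	filestoreTrigger <- struct{}{}

	for i := 0; i < 2; i++ {
		t.Log("triggered:", <-ran)
	}
}
```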
-
Pavlo Strokov authored
testhelper: Enable Goroutine leak checks by default Closes #3790 See merge request !3909
-
Toon Claes authored
sql-migrate: Update storage_repositories table Closes #3806 See merge request !3927
-
Patrick Steinhardt authored
A recent refactoring of the catfile cache caused an incident due to leaking Goroutines. While we did already check for leaking Goroutines in a subset of our test suite via the goleak check, this was opt-in the same way as `mustHaveNoChildProcess()` was. It also was used a lot less often, given that we did have hundreds of actual Goroutine leaks in our test suite. Introduce leak checking by default now that almost all of the Goroutine leaks have been fixed. There are only two packages remaining which still leak Goroutines: the catfile package, which will get fixed via an MR which is currently in flight, and the backup package, which is leaking Goroutines in a package external to us. To the best of my knowledge, there is no way to plug those. In any case, having on-by-default leak checking should put us in a much better position to detect such regressions in the future.
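Gitaly wires this up through its testhelper setup; as a standalone sketch, on-by-default leak checking with go.uber.org/goleak boils down to a per-package TestMain:

```go
package example_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain turns the leak check on for the whole package: once all tests
// have run, goleak fails the test binary if unexpected Goroutines are
// still alive.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}
```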
-
Patrick Steinhardt authored
When creating gRPC connections, we spawn a set of Goroutines which listen on these connections. As a result, if the connections are never closed, those Goroutines are leaked. Fix this by closing the connections.
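In isolation, the fix is the usual pairing of Dial with Close, roughly as below; the address and dial options are illustrative, and grpc.Dial connects lazily, so this succeeds even with nothing listening:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// grpc.Dial spawns background Goroutines that manage the connection.
	conn, err := grpc.Dial("localhost:9999", grpc.WithInsecure()) // illustrative address
	if err != nil {
		log.Fatal(err)
	}
	// Without this Close, the connection's Goroutines outlive the caller
	// and show up as leaks in the tests.
	defer conn.Close()

	// ... use conn ...
}
```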
-
Patrick Steinhardt authored
Next to a bunch of other things, the Praefect server's `run()` function also has the task to create the gRPC servers. These servers are never stopped though, which as a result means that we leak listeners set up for this server in our tests. Fix this by calling `Stop()` on these servers.
-
Patrick Steinhardt authored
The node manager creates connections to all of its known nodes and starts monitoring routines which check the respective nodes' health status. We never clean up either of them, which thus leads to lots of Goroutine leakages in our tests. Fix this by providing a new `Stop()` function for the manager which both stops the electors' monitoring Goroutines and closes the node connections and call this function as required.
-
Patrick Steinhardt authored
When opening the database, we first create the connection and then ping it to assert that we can correctly establish the connection. If the ping fails though, then we leak the connection because we don't clean up the already-opened database struct. Fix this issue by closing the database if the ping failed.
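As a standalone sketch (the driver name is an assumption; Praefect's actual database setup differs), the fix amounts to:

```go
package datastore

import (
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver; the driver actually used is an assumption here
)

// openDB illustrates the fix: if the ping fails, the already-opened handle
// must be closed, otherwise its connection Goroutines leak.
func openDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}

	if err := db.Ping(); err != nil {
		_ = db.Close() // the fix: clean up the handle on ping failure
		return nil, err
	}

	return db, nil
}
```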
-
Patrick Steinhardt authored
The streamcache has two Goroutines which make sure to clean up stale cache entries, either on disk or in-memory ones. These walkers currently cannot be stopped at all, leading to Goroutine leakages in our tests. Fix this by making the `Stop()` functions also signal the Goroutines to stop and by calling this function in the testserver setup code.
-
Patrick Steinhardt authored
The disk cache spawns a set of walkers which regularly walk the cache to check for cache members which are older than a given threshold. These walkers currently cannot be stopped at all, leading to a Goroutine leakage in our tests. Implement a new `StopWalkers()` function which stops them and use it in our tests.
-
Patrick Steinhardt authored
The dontpanic package provides a `GoForever()` function which allows the caller to run a function forever even if it panics. While this is a useful thing to have, the current design also causes leakage of Goroutines because we do not have a way to stop these functions even in the context of tests. Introduce a new `Forever` structure which now hosts the `Go()` function. Like this, we can also easily introduce a `Cancel()` function which causes us to abort execution of the function. Callers are not yet adjusted to make use of the `Cancel()` function.
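A rough sketch of such a structure is below; the constructor, the back-off and the logging are assumptions, not the actual dontpanic implementation:

```go
package dontpanic

import (
	"log"
	"time"
)

// Forever keeps re-running a function even if it panics, until Cancel is
// called. This is a sketch; the real dontpanic package differs in detail.
type Forever struct {
	done chan struct{}
}

func NewForever() *Forever {
	return &Forever{done: make(chan struct{})}
}

// Go runs fn in a loop, recovering from panics, until Cancel is called.
func (f *Forever) Go(fn func()) {
	go func() {
		for {
			func() {
				defer func() {
					if r := recover(); r != nil {
						log.Printf("recovered from panic: %v", r)
					}
				}()
				fn()
			}()

			select {
			case <-f.done:
				return
			case <-time.After(time.Second): // illustrative back-off between runs
			}
		}
	}()
}

// Cancel stops the loop, which is what lets tests plug the Goroutine leak.
func (f *Forever) Cancel() {
	close(f.done)
}
```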
-
Patrick Steinhardt authored
While some of our service servers get the connection pool injected as a dependency, others construct them ad-hoc. These ad-hoc creations are never cleaned up though and thus lead to Goroutine leakages. Fix the leakage by always using the same injected client pool. Luckily, all callsites do in fact use the same options to construct the pool, and thus we shouldn't see any change in behaviour, except for fewer connections being created.
-
Patrick Steinhardt authored
The supervisor's health monitor will check health every 15 seconds or alternatively exit if its shutdown channel was closed. While we do verify this channel correctly when trying to send the event, we do not check for shutdown when waiting 15 seconds for the next event. Fix this by returning early if we got the signal to shut down.
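The shape of the fix is a second shutdown case in the select that does the waiting, roughly as in this sketch (names and the event type are illustrative):

```go
package supervisor

import "time"

// monitorHealth sketches the pattern in the fix: both the wait between
// checks and the event send respect the shutdown channel, so the Goroutine
// exits promptly once shutdown is closed.
func monitorHealth(check func() error, events chan<- error, shutdown <-chan struct{}) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			select {
			case events <- check():
			case <-shutdown:
				return
			}
		case <-shutdown:
			// Previously only the send path checked shutdown; now the wait
			// does too, so we return early instead of sleeping out the
			// remaining interval.
			return
		}
	}
}
```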
-
Patrick Steinhardt authored
The bootstrapper is currently leaking multiple Goroutines in the upgrader because we never make sure to clean it up after tests have finished. Add its existing `Stop()` function to the upgrader interface and call it as required to plug these leaks.
-
Patrick Steinhardt authored
In many tests where the catfile cache is in use, we create it but don't ever clean up the cache. As a result, the monitoring Goroutine and any potentially cached processes may not get cleaned up as expected. Fix this by calling `Stop()` as required.
-
Patrick Steinhardt authored
The gitaly-backup manager currently creates an ad-hoc instance of the connection pool. This pool never gets cleaned up and thus results in leakage of its connections. Inject the connection pool into the manager and clean it up as required.
-
Patrick Steinhardt authored
Split up the catfile cache Closes #3763 See merge request !3886
-
Patrick Steinhardt authored
Now that the "testhelper" package doesn't depend on "command" anymore, convert the "command" package to use it.
-
Patrick Steinhardt authored
The "testhelper" package currently depends on the "command" package, which is why we cannot use the "testhelper"'s setup functions in the latter package. The only remaining dependency is the wait group which is used to count in-flight commands though such that we can assert that there are no leaking processes after tests have concluded. Move this wait group into a separate "commandcounter" package to remove this dependency. While this also allows other packages to modify the counter due to its interface now being public, this really shouldn't be much of a problem at all.