  1. Nov 19, 2021
    • Merge branch 'pks-praefect-datastore-collector-metrics-endpoint-v14.4' into '14-4-stable' · 2ceeecb9
      Patrick Steinhardt authored
      praefect: Backport separate endpoint for datastore collector (v14.4)
      
      See merge request !4094
      2ceeecb9
    • praefect: Do not collect repository store metrics on startup · 3cde9b5e
      John Cai authored and Patrick Steinhardt committed
      Our current code path will trigger the RepositoryStoreCollector to query
      the database on startup, even if the prometheus listener is not
      listening. This is because we call DescribeByCollect in the Describe
      method. The Prometheus client will call Describe on Register, which ends
      up triggering the Collect method and hence runs the queries. Instead, we
      can provide the descriptions separately from the Collect method.
      
      Changelog: fixed
      (cherry picked from commit 90cb7fb7)
      3cde9b5e
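      A minimal sketch of the idea, using the Prometheus Go client with hypothetical type and metric names rather than the actual Gitaly code:

      ```go
      package praefect

      import "github.com/prometheus/client_golang/prometheus"

      // repositoryStoreCollector is a hypothetical stand-in for Praefect's
      // RepositoryStoreCollector; the metric name below is illustrative.
      type repositoryStoreCollector struct{}

      var readOnlyDesc = prometheus.NewDesc(
          "praefect_read_only_repositories",
          "Number of repositories in read-only mode per virtual storage.",
          []string{"virtual_storage"}, nil,
      )

      // Describe sends static descriptors only. Crucially, it does not call
      // prometheus.DescribeByCollect, which would invoke Collect (and hence the
      // database queries) as soon as the collector is registered.
      func (c *repositoryStoreCollector) Describe(ch chan<- *prometheus.Desc) {
          ch <- readOnlyDesc
      }

      // Collect performs the expensive database work; it now only runs when the
      // metrics endpoint is actually scraped.
      func (c *repositoryStoreCollector) Collect(ch chan<- prometheus.Metric) {
          // In reality the value would come from the database query.
          ch <- prometheus.MustNewConstMetric(readOnlyDesc, prometheus.GaugeValue, 0, "default")
      }
      ```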
    • praefect: Add ability to have separate database metrics endpoint · ebaade4a
      John Cai authored and Patrick Steinhardt committed
      By default, when metrics are enabled, each Praefect will expose
      information about how many read-only repositories there are, which
      requires Praefect to query the database. First, this results in the
      same metrics being exposed by every Praefect, given that the database is
      shared between all of them. And second, this causes one query per
      Praefect per scraping run. This cost adds up and generates quite some
      load on the database, especially if there are a lot of repositories in
      that database, to the point where it may overload the database
      completely.
      
      Fix this issue by splitting metrics which hit the database into a
      separate endpoint "/db_metrics". This allows admins to set up a separate
      scraper with a different scraping interval for this metric, and
      furthermore it gives the ability to only scrape this metric for one of
      the Praefect instances so the work isn't unnecessarily duplicated.
      
      Given that this is a breaking change which will get backported, we must
      make this behaviour opt-in for now. We thus introduce a new configuration
      key "prometheus_use_database_endpoint" which enables the new behaviour,
      such that existing installations' metrics won't break on a simple point
      release. The intent is to eventually remove this configuration and
      enable the new behaviour for all setups in a major release.
      
      Changelog: added
      (cherry picked from commit 7e74b733)
      ebaade4a
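      A rough sketch of the split, with illustrative names; the real listen address and the "prometheus_use_database_endpoint" wiring live in Praefect's configuration handling:

      ```go
      package main

      import (
          "log"
          "net/http"

          "github.com/prometheus/client_golang/prometheus"
          "github.com/prometheus/client_golang/prometheus/promhttp"
      )

      func main() {
          // Database-backed metrics get their own registry so that scraping the
          // regular /metrics endpoint never queries the database.
          dbRegistry := prometheus.NewRegistry()
          dbRegistry.MustRegister(prometheus.NewGaugeFunc(
              prometheus.GaugeOpts{
                  Name: "praefect_read_only_repositories", // illustrative name
                  Help: "Number of read-only repositories, computed from the database.",
              },
              func() float64 {
                  // ... run the expensive database query here ...
                  return 0
              },
          ))

          mux := http.NewServeMux()
          // Default registry: cheap per-process metrics, scraped by every Praefect.
          mux.Handle("/metrics", promhttp.Handler())
          // Separate endpoint: can be scraped from a single Praefect at its own,
          // longer interval.
          mux.Handle("/db_metrics", promhttp.HandlerFor(dbRegistry, promhttp.HandlerOpts{}))

          log.Fatal(http.ListenAndServe(":9236", mux))
      }
      ```

      A Prometheus scrape job can then target /db_metrics on a single Praefect with its own scrape interval, which is exactly the flexibility the commit describes.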
    • prometheus: Avoid duplicated metrics registration · aac5d5e5
      Pavlo Strokov authored and Patrick Steinhardt committed
      Praefect uses Prometheus to export its internal metrics.
      It relies on the defaults from the Prometheus library
      to gather the set of metrics and to register new ones.
      Because of this, new metrics get registered on the
      DefaultRegisterer, a global pre-configured registerer.
      As a result we can't call the 'run' function multiple
      times (for testing purposes), as doing so results in a
      metrics registration error. To avoid that problem the
      'run' function is extended with a prometheus.Registerer
      parameter that is used to register Praefect's custom
      metrics. The production code still uses the same
      DefaultRegisterer as before, while the test code creates
      a new registerer instance for each 'run' invocation, so
      there are no more duplicate registrations.
      
      (cherry picked from commit 81368d46)
      aac5d5e5
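      The pattern might look roughly like this (hypothetical names):

      ```go
      package praefect

      import "github.com/prometheus/client_golang/prometheus"

      // run registers Praefect's custom metrics on whatever registerer it is
      // given instead of implicitly using the global DefaultRegisterer.
      func run(reg prometheus.Registerer) error {
          requests := prometheus.NewCounter(prometheus.CounterOpts{
              Name: "praefect_example_requests_total", // illustrative metric
              Help: "Example counter registered once per run invocation.",
          })
          if err := reg.Register(requests); err != nil {
              return err
          }
          // ... start listeners and servers, use the metrics ...
          return nil
      }

      // Production code: run(prometheus.DefaultRegisterer)
      // Test code:       run(prometheus.NewRegistry()), a fresh registry per
      //                  invocation, so repeated runs no longer collide.
      ```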
    • bootstrap: Abstract bootstrapper for testing · 43031e2d
      Pavlo Strokov authored and Patrick Steinhardt committed
      The old implementation of the bootstrapper initialization
      does not allow calling the 'run' function to start a service,
      because the tableflip library doesn't support creating
      multiple instances in one process.
      Starting the Praefect service is required in tests to verify
      sub-command execution. The bootstrapper initialization is
      therefore extracted out of the 'run' function. This allows
      using a new Noop bootstrapper to run the service without
      tableflip support.
      
      (cherry picked from commit 18ff3676)
      43031e2d
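      One way to picture the abstraction, with a hypothetical interface rather than the actual Gitaly types:

      ```go
      package bootstrap

      import "net"

      // Bootstrapper abstracts how 'run' obtains its listeners so the
      // tableflip-based implementation can be swapped out; names are illustrative.
      type Bootstrapper interface {
          Listen(network, addr string) (net.Listener, error)
          Wait() error
          Stop()
      }

      // Noop implements Bootstrapper with plain listeners and no zero-downtime
      // upgrade machinery, so tests can instantiate it many times in one process.
      type Noop struct{}

      func (Noop) Listen(network, addr string) (net.Listener, error) {
          return net.Listen(network, addr)
      }

      func (Noop) Wait() error { return nil }
      func (Noop) Stop()       {}
      ```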
  2. Nov 18, 2021
    • Merge branch 'smh-optimize-dataloss-query-14-4' into '14-4-stable' · c1cf3752
      Toon Claes authored
      Materialize valid_primaries view (14.4)
      
      See merge request !4090
      c1cf3752
    • Materialize valid_primaries view in RepositoryStoreCollector · df6b165f
      Sami Hiltunen authored
      RepositoryStoreCollector gathers metrics on repositories which don't
      have any valid primary candidates available. This indicates the repository
      is unavailable, as the current primary is not valid and there are no valid
      candidates to fail over to. The query is currently extremely inefficient
      on some versions of Postgres, as it ends up computing the full valid_primaries
      view for each of the rows it checks. This doesn't seem to occur on all versions
      of Postgres; 12.6 at least manages to push down the search criteria
      inside the view. This commit fixes the situation by materializing the
      valid_primaries view prior to querying it. This ensures the full view isn't
      computed for all of the rows; instead Postgres just uses the pre-computed
      result.
      
      Changelog: performance
      df6b165f
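      The general shape of the fix, as a simplified illustration rather than the actual query (column names are assumed):

      ```go
      package datastore

      import (
          "context"
          "database/sql"
      )

      // countUnavailableRepositories shows the materialization trick: the
      // valid_primaries view is evaluated once inside a MATERIALIZED CTE
      // (Postgres 12+) and the outer query joins against that pre-computed
      // result instead of re-expanding the view per row.
      func countUnavailableRepositories(ctx context.Context, db *sql.DB) (int64, error) {
          const query = `
      WITH valid_primaries AS MATERIALIZED (
          SELECT * FROM valid_primaries
      )
      SELECT COUNT(*)
      FROM repositories AS r
      WHERE NOT EXISTS (
          SELECT FROM valid_primaries AS vp
          WHERE vp.virtual_storage = r.virtual_storage
          AND   vp.relative_path   = r.relative_path
      )`

          var count int64
          err := db.QueryRowContext(ctx, query).Scan(&count)
          return count, err
      }
      ```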
    • Get the latest generation from repositories instead of a view · 57bef779
      Sami Hiltunen authored
      The dataloss query currently gets the latest generation of a repository
      from a view that takes the max generation from storage_repositories. This
      is unnecessary, as the repositories table already contains the latest generation
      and we can take it from there. This commit reads it from the repositories
      table instead.
      
      Changelog: performance
      57bef779
  3. Nov 17, 2021
    • Materialize valid_primaries view in dataloss query · 6d569bb6
      Sami Hiltunen authored
      The dataloss query is extremely slow for bigger datasets. The problem
      is that for each row that the dataloss query is returning,
      Postgres computes the full result of the valid_primaries view only to
      filter down to the correct record. This results in O(n²) complexity,
      which kills the performance as soon as the dataset size increases. It's
      not clear why the join parameters are not pushed down into the view in
      the query.
      
      This commit optimizes the query by materializing the valid_primaries view.
      This ensures Postgres computes the full view only once and joins with the
      pre-computed result.
      
      Changelog: performance
      6d569bb6
  4. Oct 07, 2021
    • Merge branch 'ps-fix-flaky-cache-clean' into 'master' · 4f0a07ba
      John Cai authored
      streamcache: Fix flaky TestCache_diskCleanup
      
      Closes #3822
      
      See merge request !3942
      4f0a07ba
    • streamcache: Fix flaky TestCache_diskCleanup · f8434b2d
      Pavlo Strokov authored
      Because the implementation used a single sleep function for
      both the filestore cleanup and the clean function invocation,
      closing the channel to trigger a "single" invocation of the
      clean was broken: once closed, the channel returns the default
      value immediately and repeatedly, but for the test we need to
      run the clean only once. We also can't send two elements on
      the single channel, as we can't be sure both won't be consumed
      by one of the two functions. That is why we add a separate sleep
      function for each of them. By sending on each channel
      we are sure each function is triggered exactly once.
      f8434b2d
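      A simplified illustration of the pattern with hypothetical names: each background task gets its own injected sleep function, and the test sends exactly one value on each channel:

      ```go
      package streamcache

      import "time"

      // sleepFunc lets tests replace the real sleep between cleanup passes.
      type sleepFunc func(time.Duration)

      // startCleanups runs each cleanup task in its own goroutine with its own
      // injected sleep function; production would pass time.Sleep for both.
      func startCleanups(filestoreSleep, cleanSleep sleepFunc, filestoreCleanup, clean func()) {
          go func() {
              filestoreSleep(time.Minute)
              filestoreCleanup()
          }()
          go func() {
              cleanSleep(time.Minute)
              clean()
          }()
      }

      // In the test, each sleep function blocks on its own channel, and the test
      // sends exactly one value per channel:
      //
      //  wakeFilestore := make(chan struct{})
      //  wakeClean := make(chan struct{})
      //  startCleanups(
      //      func(time.Duration) { <-wakeFilestore },
      //      func(time.Duration) { <-wakeClean },
      //      cleanupFilestore, cleanCache, // hypothetical cleanup functions
      //  )
      //  wakeFilestore <- struct{}{} // filestore cleanup runs exactly once
      //  wakeClean <- struct{}{}     // clean runs exactly once
      ```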
    • Merge branch 'pks-testhelper-goroutine-leakage' into 'master' · 043bb6e6
      Pavlo Strokov authored
      testhelper: Enable Goroutine leak checks by default
      
      Closes #3790
      
      See merge request !3909
      043bb6e6
    • Merge branch 'ps-fix-migration' into 'master' · efb14ad0
      Toon Claes authored
      sql-migrate: Update storage_repositories table
      
      Closes #3806
      
      See merge request !3927
      efb14ad0
    • testhelper: Enable Goroutine leak checks by default · 9ef46dc0
      Patrick Steinhardt authored
      A recent refactoring of the catfile cache caused an incident due to
      leaking Goroutines. While we did already check for leaking Goroutines
      in a subset of our test suite before via the goleak check, this was
      opt-in the same way as `mustHaveNoChildProcess()` was. It also was
      used a lot less often, given that we did have hundreds of actual
      Goroutine leaks in our test suite.

      Introduce leak checking by default now that almost all of the Goroutine
      leaks have been fixed. There are only two packages remaining which still
      leak Goroutines: the catfile package, which will get fixed via an MR
      which is currently in flight, and the backup package, which is leaking
      Goroutines in a package external to us. To the best of my knowledge,
      there is no way to plug those.
      
      In any case, having on-by-default leak checking should put us in a much
      better position to detect such regressions in the future.
      9ef46dc0
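      With the goleak library, on-by-default leak checking typically boils down to a TestMain like the following sketch (the actual testhelper wiring may differ):

      ```go
      package example_test

      import (
          "testing"

          "go.uber.org/goleak"
      )

      // TestMain runs the package's tests and then fails the run if any
      // Goroutines that were not present at startup are still alive.
      func TestMain(m *testing.M) {
          goleak.VerifyTestMain(m)
      }
      ```

      Known, unavoidable leaks (for example in an external package) can be excluded by passing goleak.IgnoreTopFunction options to VerifyTestMain.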
    • global: Close gRPC connections · 5981dc52
      Patrick Steinhardt authored
      When creating gRPC connections, then we spawn a set of Goroutines which
      listen on these connections. As a result, if they are never closed,
      those Goroutines are leaked.
      
      Fix this by closing connections.
      5981dc52
    • praefect: Stop gRPC servers on exit · 401a62b5
      Patrick Steinhardt authored
      Among other things, the Praefect server's `run()` function
      also creates the gRPC servers. These servers are never
      stopped though, which means that we leak the listeners set up
      for these servers in our tests.

      Fix this by calling `Stop()` on these servers.
      401a62b5
    • nodes: Fix leaking connections and Goroutines due to missing cleanup · f501a33d
      Patrick Steinhardt authored
      The node manager creates connections to all of its known nodes and
      starts monitoring routines which check the respective nodes' health
      status. We never clean up either of them, which leads to lots of
      Goroutine leakages in our tests.

      Fix this by providing a new `Stop()` function for the manager, which both
      stops the electors' monitoring Goroutines and closes the node
      connections, and by calling this function as required.
      f501a33d
    • glsql: Close database if open fails on ping · 44bab7ad
      Patrick Steinhardt authored
      When opening the database, we first create the connection and then ping
      it to assert that we can correctly establish the connection. If the ping
      fails though, then we leak the connection because we don't clean up the
      already-opened database struct.
      
      Fix this issue by closing the database if the ping failed.
      44bab7ad
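      The shape of the fix, sketched with database/sql (illustrative, not the exact glsql code):

      ```go
      package glsql

      import (
          "context"
          "database/sql"
      )

      // openDB pings the freshly opened handle and, if the ping fails, closes it
      // again so the already-established connections are not leaked.
      func openDB(ctx context.Context, driver, dsn string) (*sql.DB, error) {
          db, err := sql.Open(driver, dsn)
          if err != nil {
              return nil, err
          }

          if err := db.PingContext(ctx); err != nil {
              _ = db.Close() // plug the leak: release the already-opened handle
              return nil, err
          }

          return db, nil
      }
      ```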
    • streamcache: Make cleanup Goroutines stoppable · 45d81b07
      Patrick Steinhardt authored
      The streamcache has two Goroutines which make sure to clean up stale
      cache entries, either on disk or in-memory ones. These walkers currently
      cannot be stopped at all, leading to Goroutine leakages in our tests.
      
      Fix this by making the `Stop()` functions also signal the Goroutines to
      stop and by calling this function in the testserver setup code.
      45d81b07
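      The stop-channel pattern described here, in reduced form with illustrative names:

      ```go
      package streamcache

      import "time"

      // walker periodically removes stale cache entries until Stop is called.
      type walker struct {
          stop chan struct{}
      }

      func newWalker(interval time.Duration, cleanup func()) *walker {
          w := &walker{stop: make(chan struct{})}

          go func() {
              ticker := time.NewTicker(interval)
              defer ticker.Stop()

              for {
                  select {
                  case <-ticker.C:
                      cleanup()
                  case <-w.stop:
                      // Signalled by Stop(): exit instead of leaking the Goroutine.
                      return
                  }
              }
          }()

          return w
      }

      // Stop terminates the cleanup Goroutine; test server teardown calls this.
      func (w *walker) Stop() { close(w.stop) }
      ```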
    • cache: Add ability to stop disk walkers · a256b01c
      Patrick Steinhardt authored
      The disk cache spawns a set of walkers which regularly walk the cache to
      check for cache members which are older than a given threshold. These
      walkers currently cannot be stopped at all, leading to a Goroutine
      leakage in our tests.
      
      Implement a new `StopWalkers()` function which stops them and use it in
      our tests.
      a256b01c
    • dontpanic: Refactor code to allow stopping of `GoForever()` · fc7ed4ea
      Patrick Steinhardt authored
      The dontpanic package provides a `GoForever()` function which allows the
      caller to run a function forever even if it panics. While this is a
      useful thing to have, the current design also causes leakage of
      Goroutines because we do not have a way to stop these functions even in
      the context of tests.
      
      Introduce a new `Forever` structure which now hosts the `Go()` function.
      Like this, we can also easily introduce a `Cancel()` function which
      causes us to abort execution of the function.
      
      Callers are not yet adjusted to make use of the `Cancel()` function.
      fc7ed4ea
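      A rough sketch of such a `Forever` type with `Go()` and `Cancel()`, mirroring the description above rather than the exact dontpanic API:

      ```go
      package dontpanic

      import (
          "log"
          "time"
      )

      // Forever repeatedly runs a function, recovering from panics, until
      // Cancel is called.
      type Forever struct {
          done chan struct{}
      }

      func NewForever() *Forever { return &Forever{done: make(chan struct{})} }

      // Go runs fn in a loop; a panic is logged and the loop resumes after the
      // given backoff, while closing the done channel stops the loop entirely.
      func (f *Forever) Go(backoff time.Duration, fn func()) {
          go func() {
              for {
                  func() {
                      defer func() {
                          if r := recover(); r != nil {
                              log.Printf("recovered from panic: %v", r)
                          }
                      }()
                      fn()
                  }()

                  select {
                  case <-f.done:
                      return
                  case <-time.After(backoff):
                  }
              }
          }()
      }

      // Cancel aborts execution of the function so that tests don't leak the
      // Goroutine.
      func (f *Forever) Cancel() { close(f.done) }
      ```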
    • service: Plug leakages of connection pools · 125f313f
      Patrick Steinhardt authored
      While some of our service servers get the connection pool injected as a
      dependency, others construct them ad-hoc. These ad-hoc creations are
      never cleaned up though and thus lead to Goroutine leakages.

      Fix the leakage by always using the same injected client pool. Luckily,
      all callsites do in fact use the same options to construct the pool and
      thus we shouldn't see any change in behaviour, except for fewer
      connections being created.
      125f313f
    • supervisor: Stop monitor if shutdown was signalled · 1641ec51
      Patrick Steinhardt authored
      The supervisor's health monitor will check health every 15 seconds or
      alternatively exit if its shutdown channel was closed. While we do
      verify this channel correctly when trying to send the event, we do not
      check for shutdown when waiting 15 seconds for the next event.
      
      Fix this by returning early if we got the signal to shut down.
      1641ec51
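      In essence the fix replaces the plain 15-second wait with a select that also honours the shutdown channel (a sketch, not the exact supervisor code):

      ```go
      package supervisor

      import "time"

      // monitorHealth checks health every 15 seconds until shutdown is closed.
      func monitorHealth(check func(), shutdown <-chan struct{}) {
          for {
              select {
              case <-time.After(15 * time.Second):
                  check()
              case <-shutdown:
                  // Previously the Goroutine kept waiting out the 15 seconds
                  // even after shutdown was signalled; now it returns early.
                  return
              }
          }
      }
      ```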
    • bootstrap: Make bootstrapper stoppable to plug Goroutine leaks · 22351f97
      Patrick Steinhardt authored
      The bootstrapper is currently leaking multiple Goroutines in the
      upgrader because we never make sure to clean it up after tests have
      finished. Add its existing `Stop()` function to the upgrader interface
      and call it as required to plug these leaks.
      22351f97
    • catfile: Stop cache to plug Goroutine leaks · 6cb7e554
      Patrick Steinhardt authored
      In many tests where the catfile cache is in use, we create it but don't
      ever clean up the cache. As a result, the monitoring Goroutine and any
      potentially cached processes may not get cleaned up as expected.
      
      Fix this by calling `Stop()` as required.
      6cb7e554
    • gitaly-backup: Inject connection pool to unleak its Goroutines · ff726e13
      Patrick Steinhardt authored
      The gitaly-backup manager currently creates an ad-hoc instance of the
      connection pool. This pool never gets cleaned up and thus results in
      leakage of its connections.
      
      Inject the connection pool into the manager and clean it up as required.
      ff726e13
    • Merge branch 'pks-catfile-cache-split' into 'master' · df7dadcc
      Patrick Steinhardt authored
      Split up the catfile cache
      
      Closes #3763
      
      See merge request !3886
      df7dadcc
    • command: Start using the testhelper package · a980fb11
      Patrick Steinhardt authored
      Now that the "testhelper" package doesn't depend on "command" anymore,
      convert the "command" package to use it.
      a980fb11
    • command: Solve cyclic dependency with testhelper package · 36c43abf
      Patrick Steinhardt authored
      The "testhelper" package currently depends on the "command" package,
      which is why we cannot use the "testhelper"'s setup functions in the
      latter package. The only remaining dependency is the wait group which is
      used to count in-flight commands though such that we can assert that
      there are no leaking processes after tests have concluded.
      
      Move this wait group into a separate "commandcounter" package to remove
      this dependency. While this also allows other packages to modify the
      counter due to its interface now being public, this really shouldn't be
      much of a problem at all.
      36c43abf
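      Conceptually, the extracted package is just a process-wide wait group behind a public interface; a hypothetical sketch:

      ```go
      // Package commandcounter tracks in-flight commands so that tests can
      // assert that no processes are left over once they have concluded.
      package commandcounter

      import "sync"

      var inFlight sync.WaitGroup

      // Increment is called by the "command" package whenever a command is spawned.
      func Increment() { inFlight.Add(1) }

      // Decrement is called once the spawned command has been reaped.
      func Decrement() { inFlight.Done() }

      // WaitForAll blocks until every spawned command has finished.
      func WaitForAll() { inFlight.Wait() }
      ```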