  1. Jul 16, 2021
    • coordinator: Only schedule replication for differing error states · 73839029
      Patrick Steinhardt authored
      When finalizing a transaction, we always schedule replication jobs in
      case the primary has returned an error. Given that there are many RPCs
      which are expected to return errors in a controlled way, e.g. if a
      commit is missing, this causes us to create replication jobs in many contexts
      where it's not necessary at all.
      
      Thinking about the issue, what we really care about is not whether an RPC
      failed or not. It's that primary and secondary nodes behaved the same.
      If both primary and secondaries succeeded, we're good. But if both
      failed with the same error, then we're good too, as long as all
      transactions have been committed: quorum was reached on all votes and
      nodes failed in the same way, so we can assume that nodes did indeed
      perform the same changes.
      
      This commit thus relaxes the error condition to not schedule replication
      jobs anymore in case the primary failed, but to only schedule
      replication jobs to any node which has a different error than the
      primary. This has both the advantage that we only need to selectively
      schedule jobs for disagreeing nodes instead of targeting all
      secondaries and it avoids scheduling jobs in many cases where we do hit
      errors.
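The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not Praefect's coordinator code: `nodeResult`, `sameOutcome`, `replicationTargets` and `errMissingCommit` are hypothetical names, and comparing error strings is an assumption made only to keep the sketch short.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingCommit stands in for an error an RPC returns in a controlled way.
var errMissingCommit = errors.New("commit not found")

// nodeResult is a hypothetical record of how one node finished a transaction.
type nodeResult struct {
	storage   string
	committed bool  // whether the node's transaction was committed
	err       error // error returned by the RPC on that node, nil on success
}

// sameOutcome reports whether two RPC errors describe the same outcome.
// Comparing error strings is a simplification made for this sketch.
func sameOutcome(a, b error) bool {
	if a == nil || b == nil {
		return a == nil && b == nil
	}
	return a.Error() == b.Error()
}

// replicationTargets returns the secondaries that need a replication job:
// those that did not commit or whose outcome differs from the primary's.
func replicationTargets(primary nodeResult, secondaries []nodeResult) []string {
	var targets []string
	for _, s := range secondaries {
		if !s.committed || !sameOutcome(primary.err, s.err) {
			targets = append(targets, s.storage)
		}
	}
	return targets
}

func main() {
	primary := nodeResult{storage: "gitaly-1", committed: true, err: errMissingCommit}
	secondaries := []nodeResult{
		{storage: "gitaly-2", committed: true, err: errMissingCommit},
		{storage: "gitaly-3", committed: true, err: nil},
	}
	// Only the secondary that disagrees with the primary is selected.
	fmt.Println(replicationTargets(primary, secondaries))
}
```

In this sketch a primary failure alone never selects a node; only disagreement with the primary, or a transaction that was not committed, does.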
      
      Changelog: performance
    • Merge branch 'ps-relocate-gitlab-test-server' into 'master' · acd3f8e4
      Patrick Steinhardt authored
      Relocate GitLab HTTP test server from testhelper package
      
      See merge request !3665
  2. Jul 15, 2021
  3. Jul 14, 2021
  4. Jul 13, 2021
  5. Jul 12, 2021
    • Remove feature gitaly_fetch_internal_remote_errors · ffbd9a31
      James Fargher authored
      Since FetchInternalRemote has been inlined into ReplicateRepository we
      no longer need to make this RPC's errors more verbose.
    • Merge branch 'ps-code-style-fix' into 'master' · a8d42fb6
      Zeger-Jan van de Weg authored
      Fix various static lint issues
      
      See merge request gitlab-org/gitaly!3666
    • Merge branch 'smh-perform-lazy-failovers' into 'master' · 87104617
      Zeger-Jan van de Weg authored
      Perform failovers lazily
      
      Closes #3207
      
      See merge request gitlab-org/gitaly!3543
    • Add StreamRPC library code · 8a925b40
      Jacob Vosmaer authored
      Changelog: other
    • Standardise package aliases · f6be3b55
      Pavlo Strokov authored
      It is not common to use snake case names for packages
      and package aliases in Go.
      This change renames aliases to the preferred single-word form.
      We also use a 'gitaly' prefix for project-defined packages
      that clash with standard or third-party package names.
    • Remove unused declarations · 74514560
      Pavlo Strokov authored
      Some functions, types, fields and other variables are not
      used. There is no reason to keep and maintain them.
      Some were redundant from the moment they were declared,
      others became so after later code changes.
    • Redundant condition check in the for loop · a6a4d567
      Pavlo Strokov authored
      The done variable is never assigned a value, so
      the condition always evaluates to true.
    • Fix instantiation of the structs with fields assignment · a20f8bc8
      Pavlo Strokov authored
      If a struct declares a list of fields, values for those fields
      can be assigned when the struct instance is created, either by
      providing values in the same order as the fields are declared
      or in any order if field names are used to assign the values.
      The preferred way is to assign by field name: it is less error
      prone if a new field is later defined in the middle of the struct,
      and more readable because the initialized fields are listed
      explicitly.
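The two forms of struct literal can be compared in a short sketch. The `server` type and both constructor functions are hypothetical and exist only for this illustration; they are not taken from the Gitaly codebase.

```go
package main

import "fmt"

// server is a hypothetical struct used only for this illustration.
type server struct {
	name string
	port int
}

// newPositional relies on the declaration order of the fields. Inserting a
// new field between name and port would break it or silently change meaning.
func newPositional() server {
	return server{"gitaly", 8075}
}

// newKeyed names each field, so the order is free and the literal keeps
// working when new fields are added to the struct.
func newKeyed() server {
	return server{port: 8075, name: "gitaly"}
}

func main() {
	// Both literals build the same value; only the keyed form documents
	// which field receives which value.
	fmt.Println(newPositional() == newKeyed())
}
```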
    • Relocate WriteTemporaryGitalyConfigFile · 3a16378c
      Pavlo Strokov authored
      WriteTemporaryGitalyConfigFile is used only in
      gitaly-hooks, so there is no reason to keep it in the
      shared testhelper package. It is moved closer to the
      place where it is used and is no longer exported.
    • Relocate GitLab HTTP test server from testhelper package · b932dab3
      Pavlo Strokov authored
      The testhelper package contains too much unrelated code that
      needs to be moved elsewhere, closer to its actual usage or
      real implementation. In some cases this can help to avoid
      circular dependencies between packages in the tests.
      It will also give us a more consistent and structured codebase.
      
      The types and functions moved into the gitlab package were
      renamed to drop the Gitlab prefix, because it causes a lint issue
      about double naming (the prefix repeats the package name).
      
      A couple of structs such as postReceiveResponse that are copies
      of the unexported gitlab package structs were removed as they
      are not used anymore.
    • Export unavailable repositories metric · 12061b1c
      Sami Hiltunen authored
      The current read-only repository count metric describes unavailable
      repositories rather than read-only repositories. We have to keep the
      name for backwards compatibility as some alerting rules and dashboards
      depend on it. To make it possible to migrate to a more accurate metric
      later, this commit adds another metric on the side with a more accurate
      name and description.
    • Update read-only repository count metric to account for lazy failover · d8f63097
      Sami Hiltunen authored
      Read-only repository count metric previously reported the number of
      repositories that were outdated on the primary. As Praefect no longer
      promotes outdated replicas as primaries, this metric is not really useful
      anymore. With lazy failover in place, Praefect will failover to an up to
      date replica as long as there is a healthy one available. The purpose
      of this metric was to alert when a repository's availability was degraded,
      mainly the writes being blocked. With lazy failover, we no longer would
      block the writes as we'd simply promote the up to date node. Praefect hasn't
      served reads from outdated replicas since 7af9c950. Having no fully up to
      date healthy replicas means the repository is fully unavailable. There's
      effectively no more read-only mode. This commit updates the metric to count
      repositories which are unavailable according to the new failover logic.
      The old metric name is kept in place though as some alerting depends on it.
    • Remove support for virtual storage scoped primaries in read-only metrics · 77c84dd5
      Sami Hiltunen authored
      This commit removes the support for virtual storage scoped primaries in the
      read-only repository count metric to make future changes easier. Virtual
      storage scoped primaries were deprecated in 13.12 and removed in 14.0.
      
      Changelog: removed
    • Support lazy failovers in `praefect dataloss` · e900df09
      Sami Hiltunen authored
      With the recent failover changes, the output of `praefect dataloss`
      is no longer accurate. Previously a repository would have been in
      read-only mode if the primary of the repository was outdated. With
      lazy failovers in place, it's no longer sufficient to check only
      whether the current primary is outdated or not. If the current primary
      is outdated, Praefect would immediately switch the repository's primary
      on the next request if there is an up to date replica available. This
      also means that there is no 'read-only mode' anymore, as we'd simply
      failover to an up to date node rather than wait for the current primary
      to be brought up to speed. This commit updates the dataloss sub-command
      to take the new changes into account:
      
      1. If there is an up to date, available replica for the repository, it's
         considered to be available for both reads and writes.
      2. If there are no up to date replicas available, the repository is considered
         unavailable. As it is, Praefect does not distribute writes to outdated
         replicas.
      3. To make it easier to determine why a repository is unavailable, 'unavailable'
         is printed next to the storages which are considered to be unavailable by the
         consensus of the Praefect nodes.
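The three rules above can be sketched as a small classification. This is an illustration under stated assumptions: the `replica` type, `repositoryAvailable` and `storageLabel` are hypothetical names, and the label format is invented for the example and is not the real output of `praefect dataloss`.

```go
package main

import "fmt"

// replica is a hypothetical view of one assigned copy of a repository.
type replica struct {
	storage  string
	upToDate bool // replica is on the latest generation
	healthy  bool // node is considered healthy by the Praefect nodes
}

// repositoryAvailable applies rules 1 and 2: the repository is available
// if at least one replica is both up to date and healthy.
func repositoryAvailable(replicas []replica) bool {
	for _, r := range replicas {
		if r.upToDate && r.healthy {
			return true
		}
	}
	return false
}

// storageLabel applies rule 3: storages on unhealthy nodes are annotated
// with 'unavailable'. The exact format is an assumption of this sketch.
func storageLabel(r replica) string {
	if !r.healthy {
		return r.storage + ", unavailable"
	}
	return r.storage
}

func main() {
	replicas := []replica{
		{storage: "gitaly-1", upToDate: true, healthy: false},
		{storage: "gitaly-2", upToDate: false, healthy: true},
	}
	fmt.Println("available:", repositoryAvailable(replicas))
	for _, r := range replicas {
		fmt.Println(storageLabel(r))
	}
}
```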
      
      Changelog: changed
    • Replace GetPartiallyReplicatedRepositories with GetPartiallyAvailableRepositories · 7704c707
      Sami Hiltunen authored
      `praefect dataloss` is using GetPartiallyReplicatedRepositories to get
      repositories which have assigned replicas that are outdated. Inferring from the
      returned generations it was also reporting whether the repository was in read-only
      mode or not. This is not sufficient anymore to determine whether a repository is
      unavailable or not due to recent changes:
      
      1. Since 7af9c950, Praefect has no longer served reads from outdated replicas.
      
      2. Praefect no longer elects outdated replicas as primaries. Electing an outdated
         primary does not improve the availability of a repository as it still couldn't
         accept writes nor reads.
      
      3. With introduction of lazy failovers, there is effectively no read-only mode
         anymore as Praefect would simply failover to the up to date node immediately
         if one exists.
      
      With those in mind, the behavior of `praefect dataloss` is not accurate anymore.
      By default, it attempts to print out repositories which have reduced availability.
      To reflect the current failover logic, we should instead print out repositories
      which do not have any up to date, healthy nodes available. This commit replaces
      the GetPartiallyReplicatedRepositories with GetPartiallyAvailableRepositories.
      A repository is considered available by the current logic if there exists a replica
      that could serve as the primary. A replica can serve as the primary if it is fully
      up to date and healthy. If such a replica exists, the repository is not in read-only
      mode as we'd simply use the replica as the primary. If no such replicas exist, the
      repository is unavailable.
      
      The dataloss sub-command also has the `-partially-replicated` flag that prints out
      repositories which have some assigned replicas that are not fully up to date. That
      flag is going to be replaced by the `partially-available` flag, which returns
      repositories which have assigned replicas that are not able to serve requests at
      the moment. This effectively does the same as the flag did previously but it also
      considers whether the replicas are healthy. This behavior fits better with variable
      replication factor: it could be that we have one up to date copy of the replica on
      an unhealthy node. The previous check would only see that there are no outdated
      replicas and not return the repository. The repository would be unavailable though,
      as the only replica is on a node that is unhealthy. To better facilitate debugging
      these scenarios, the flag is changed to cover replicas on unavailable nodes as well.
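The difference between the two checks can be illustrated with a sketch of the scenario described above. The type and function names (`assignedReplica`, `partiallyReplicated`, `partiallyAvailable`) are hypothetical, and the real implementation is a datastore query rather than in-memory Go.

```go
package main

import "fmt"

// assignedReplica is a hypothetical view of one assigned copy of a repository.
type assignedReplica struct {
	storage  string
	behindBy int  // number of generations the replica lags behind
	healthy  bool // whether the node holding the replica is healthy
}

// partiallyReplicated mirrors the previous check: it only looks for
// assigned replicas that are outdated.
func partiallyReplicated(replicas []assignedReplica) bool {
	for _, r := range replicas {
		if r.behindBy > 0 {
			return true
		}
	}
	return false
}

// partiallyAvailable mirrors the new check: a replica that is outdated or
// sits on an unhealthy node cannot serve requests at the moment.
func partiallyAvailable(replicas []assignedReplica) bool {
	for _, r := range replicas {
		if r.behindBy > 0 || !r.healthy {
			return true
		}
	}
	return false
}

func main() {
	// Replication factor of one: a single up to date copy on an unhealthy node.
	replicas := []assignedReplica{{storage: "gitaly-1", behindBy: 0, healthy: false}}
	fmt.Println("previous check reports it:", partiallyReplicated(replicas))
	fmt.Println("new check reports it:", partiallyAvailable(replicas))
}
```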
      
      This commit covers only the datastore changes. The user facing changes in dataloss
      are to be done in a follow up commit.
    • Return more information from GetPartiallyReplicatedRepositories · e7cd0922
      Sami Hiltunen authored
      GetPartiallyReplicatedRepositories returns information about repositories
      which have outdated replicas on assigned hosts. The generations returned
      are used in `praefect dataloss` to determine whether a repository is in
      read-only mode or not. With lazy failover, there is no read-only mode
      anymore as Praefect can immediately failover to another valid primary.
      Praefect doesn't serve reads from outdated replicas, so the repository
      would effectively be unavailable if there are no up to date and healthy
      replicas. To prepare for updating `praefect dataloss` to account for lazy
      failovers, let's return the health status and whether the replica can act
      as the primary with each of the replicas. We can later use the ValidPrimary
      field to determine if the repository is available and the health status to
      ease with debugging why a repository may be unavailable. Other than returning
      the additional fields, this commit makes no other behavior changes yet.
    • Use repository_generations view in GetPartiallyReplicatedRepositories · 9ad66b4c
      Sami Hiltunen authored
      GetPartiallyReplicatedRepositories is currently using a window function
      to get the highest generation from all of the replicas. We've since
      introduced the repository_generations view which also gets the highest
      generation across the replicas. Let's simplify the query by reusing the
      view rather than performing the logic again using the window function.
    • Remove support for virtual storage primaries in `praefect dataloss` · 4d07d9b0
      Sami Hiltunen authored
      Starting from 14.0, Praefect only supports repository-specific primaries.
      This commit removes support for virtual storage scoped primaries in
      `praefect dataloss` to make future changes easier.
      
      Changelog: removed
    • Extract a testhelper for setting healthy nodes in the database · fe6d5257
      Sami Hiltunen authored
      This commit extracts the setHealthyNodes helper from the tests of
      PerRepositoryElector so it can be reused in other packages. The helper
      is used for setting healthy nodes in the database during tests.
    • Use request scoped logger in PerRepositoryElector · 1baa997e
      Sami Hiltunen authored
      PerRepositoryElector uses its own logger as a remnant from the time
      it was performing elections in the background. As the elections now
      happen in the request context, let's switch to using the request
      context logger. This allows for correlating the primary changes with
      the request that triggered that failover.
    • Perform failovers lazily · 3f09e462
      Sami Hiltunen authored
      Praefect's PerRepositoryElector runs elections globally when Praefect
      launches and when a Gitaly node's health status changes. This approach was
      originally taken to match global elections done by the sqlElector as well.
      While the sqlElector runs elections after every health check, by default
      every 3s, the event driven approach was implemented for the PerRepositoryElector
      as it has to perform a lot more work every election run compared to the
      sqlElector. The sqlElector has a single primary for each virtual storage
      whereas the PerRepositoryElector has a primary record for every repository.
      While both electors check every repository's generations to pick the best new
      primary, only the PerRepositoryElector has to write potentially a large number
      of records as well. We can do a lot better though:
      
      1. If the primary is unavailable only temporarily, there's a high chance that
         the repository is not even accessed during the outage. If so, there's no need
         to eagerly failover as no one would even see the failure.
      
      2. Most of the operations on the repositories are reads. Reads can be served from
         any up to date replica without needing to have a primary. Only once an RPC that
         requires the primary arrives do we care about having a healthy primary.
      
      Given the above, this commit implements a lazy approach to failovers. This removes
      the background election loop entirely and elects a primary if needed when an RPC
      requires a primary. This happens transparently when getting the primary from the
      database. This brings multiple benefits:
      
      1. Performance improves as we don't have to perform failovers for repositories which
         are not written to during the primary's outage. This reduces the time to perform
         failovers as we are working on records of a single repository as opposed to all
         of the repositories.
      
      2. Failover code is responsive without having to feed it more and more events. This
         becomes more relevant as we implement rebalancing features. When moving a repository
         with a single replica, we may have to demote the primary temporarily and we want it
         to be re-elected as soon as a request needs it and it's possible. The previous approach
         would require us to hook more code into the events, whereas this lazy approach just
         works.
      
      3. It's easier to reason about synchronous code rather than asynchronous elections.
      
      4. We can log all the individual changes, as opposed to logging the aggregate stats
         of demotions and promotions.
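The lazy approach can be sketched in memory as follows. This is only an illustration of the idea: the commit describes the election happening when the primary is read from the database, whereas the `repository` and `node` types, `getPrimary` and `errNoValidPrimary` below are hypothetical and keep all state in a Go struct.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoValidPrimary is returned when no replica can act as the primary.
var errNoValidPrimary = errors.New("no healthy, up to date replica available")

// node is a hypothetical view of one replica of a repository.
type node struct {
	storage  string
	healthy  bool
	upToDate bool
}

// repository holds the current primary and the replicas of one repository.
type repository struct {
	primary  string
	replicas []node
}

// getPrimary returns the current primary if it is still healthy and up to
// date. Otherwise it elects the first valid replica on the spot, so the
// failover only happens when a request actually needs a primary.
func (r *repository) getPrimary() (string, error) {
	candidate := ""
	for _, n := range r.replicas {
		if !n.healthy || !n.upToDate {
			continue
		}
		if n.storage == r.primary {
			return r.primary, nil // current primary is still valid
		}
		if candidate == "" {
			candidate = n.storage
		}
	}
	if candidate == "" {
		return "", errNoValidPrimary
	}
	r.primary = candidate // lazy failover, scoped to this one repository
	return candidate, nil
}

func main() {
	repo := &repository{
		primary: "gitaly-1",
		replicas: []node{
			{storage: "gitaly-1", healthy: false, upToDate: true},
			{storage: "gitaly-2", healthy: true, upToDate: true},
		},
	}
	primary, err := repo.getPrimary()
	fmt.Println(primary, err)
}
```

Because the election only touches the records of the repository being accessed, repositories that are never written to during an outage are never failed over in this sketch.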
      
      Changelog: performance
    • Merge branch 'pks-tx-coordinator-replication-error-handling' into 'master' · c8a29dc9
      Sami Hiltunen authored
      coordinator: Only schedule replication for differing error states
      
      See merge request !3642
    • Merge branch 'pks-ff-receiver' into 'master' · fb267fb9
      Sami Hiltunen authored
      featureflag: Implement receiver functions on FeatureFlag struct
      
      See merge request !3662
  6. Jul 11, 2021
  7. Jul 09, 2021