Commits · 73839029f79d4ebdbc8d96475cf9bd0e2a599b2b · GitLab.org / Gitaly

Jul 16, 2021

coordinator: Only schedule replication for differing error states · 73839029

When finalizing a transaction, we always schedule replication jobs in
case the primary has returned an error. Given that there are many RPCs
which are expected to return errors in a controlled way, e.g. if a
commit is missing, this causes us to create replication jobs in many contexts
where they're not necessary at all.

Thinking about the issue, what we really care about is not whether an RPC
failed or not. It's that primary and secondary nodes behaved the same.
If both primary and secondaries succeeded, we're good. But if both
failed with the same error, then we're good too, as long as all
transactions have been committed: quorum was reached on all votes and
nodes failed in the same way, so we can assume that nodes did indeed
perform the same changes.

This commit thus relaxes the error condition to not schedule replication
jobs anymore in case the primary failed, but to only schedule
replication jobs to any node which has a different error than the
primary. This has two advantages: we only need to schedule jobs
selectively for disagreeing nodes instead of targeting all
secondaries, and we avoid scheduling jobs in many cases where we do hit
errors.

Changelog: performance

73839029

Merge branch 'ps-relocate-gitlab-test-server' into 'master' · acd3f8e4
Patrick Steinhardt authored 3 years ago
```
Relocate GitLab HTTP test server from testhelper package

See merge request !3665
```
acd3f8e4

Jul 15, 2021

Merge branch 'ps-backup' into 'master' · 7b95c340

Toon Claes authored 3 years ago

backup: Refactoring of the backup/restore task to support remote storages

See merge request !3569

7b95c340

StorageServiceSink to support AWS, Azure, Google storage for the backups · 643d6a08

Pavlo Strokov authored 3 years ago and

Toon Claes committed 3 years ago

StorageServiceSink allows using the desired storage engine based on the
URL provided at its creation. The set of storage engines now
includes AWS S3, Google Cloud Storage and Azure Blob Storage.
The initialization of a particular client uses a set of env vars that
must be set prior to the StorageServiceSink creation. Each storage provider
has its own set of required and optional parameters.
We use a memory storage provider in tests to verify the implementation
works as expected.
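
As a rough illustration of choosing a storage engine from a URL, the sketch below dispatches on the URL scheme. The function name `providerForURL` and the schemes `s3`, `gs`, `azblob` and `mem` are assumptions made for this example; the commit message does not state which schemes StorageServiceSink accepts.

```go
package main

import (
	"fmt"
	"net/url"
)

// providerForURL maps the scheme of a sink URL to a storage provider name.
// The schemes listed here are assumptions for this sketch.
func providerForURL(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", fmt.Errorf("parse sink URL: %w", err)
	}
	switch u.Scheme {
	case "s3":
		return "AWS S3", nil
	case "gs":
		return "Google Cloud Storage", nil
	case "azblob":
		return "Azure Blob Storage", nil
	case "mem":
		return "in-memory (tests only)", nil
	default:
		return "", fmt.Errorf("unsupported sink URL scheme: %q", u.Scheme)
	}
}

func main() {
	for _, raw := range []string{"s3://backups", "gs://backups", "azblob://backups", "mem://test", "ftp://backups"} {
		provider, err := providerForURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", provider)
	}
}
```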

Part of: #3633

643d6a08

Refactor Filesystem into Manager and FilesystemSink · de4b0319

Pavlo Strokov authored 3 years ago and

Toon Claes committed 3 years ago

Since we are required to support more than just filesystem storage
for backups, the current implementation needs to be refactored.
The core logic of backup creation and restoration is untouched,
except for the operations that work with data streams. The new
Sink interface is used to store and retrieve backup data.
The existing functionality for storing and retrieving repositories
to and from the local filesystem is now part of the FilesystemSink
type, which is used as the Sink implementation for the Manager.
The Manager doesn't know anything about the underlying storage
and just uses the methods defined on the interface.

Part of: #3633

de4b0319

Extend Filesystem with abstract read and write operations · 49a73028

Pavlo Strokov authored 3 years ago and

Toon Claes committed 3 years ago

To abstract the actual storage of the backup data, the
Write and GetReader methods are added to the Filesystem type.
This is a preparatory step towards supporting other types of
sinks, such as S3 or Google File Storage services.

Part of: #3633

49a73028

Merge branch 'sh-update-ffi-gem' into 'master' · b56e0680
Toon Claes authored 3 years ago
```
Update ffi gem to 1.15.3

See merge request gitlab-org/gitaly!3664
```
b56e0680
Merge branch 'smh-set-default-grpc-buckets' into 'master' · ebabe439
James Fargher authored 3 years ago
```
Set default Prometheus buckets for Gitaly's RPC instrumentation

Closes #3431

See merge request !3669
```
ebabe439

Jul 14, 2021

Set default Prometheus buckets for Gitaly's RPC instrumentation · 7c2c4253

Sami Hiltunen authored 3 years ago

Gitaly doesn't set default buckets for RPC latency instrumentation,
which leads to the instrumentation being disabled by default. This
commit adds default buckets to the configuration, which are used if
the buckets are not explicitly configured.
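
The fallback behavior can be sketched as below. The type `prometheusConfig`, its field name and the bucket values are placeholders chosen for this example; the commit message does not list the defaults Gitaly actually uses.

```go
package main

import "fmt"

// defaultBuckets holds placeholder latency buckets in seconds. The real
// defaults chosen by the commit are not stated in its message.
var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// prometheusConfig is a hypothetical stand-in for the relevant config section.
type prometheusConfig struct {
	GRPCLatencyBuckets []float64
}

// buckets returns the configured buckets, or the defaults if none are set,
// so the latency instrumentation is never silently disabled.
func (c prometheusConfig) buckets() []float64 {
	if len(c.GRPCLatencyBuckets) == 0 {
		return defaultBuckets
	}
	return c.GRPCLatencyBuckets
}

func main() {
	unset := prometheusConfig{}
	custom := prometheusConfig{GRPCLatencyBuckets: []float64{0.1, 1, 10}}
	fmt.Println(len(unset.buckets()))  // prints "11"
	fmt.Println(len(custom.buckets())) // prints "3"
}
```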

Changelog: changed

7c2c4253

Merge branch 'jv-add-streamrpc' into 'master' · 9cde7b7d
Sami Hiltunen authored 3 years ago
```
Add StreamRPC library code

See merge request !3601
```
9cde7b7d
Merge branch 'smh-dataloss-lazy-failovers' into 'master' · 47164700
Zeger-Jan van de Weg authored 3 years ago
```
Support lazy failovers in `praefect dataloss`

See merge request !3549
```
47164700

Jul 13, 2021
- Merge branch 'smh-unavailable-repos-metric' into 'master' · 40511f7a
  Zeger-Jan van de Weg authored 3 years ago
  
  Update read-only repository count metric to account for lazy failover
  
  See merge request gitlab-org/gitaly!3548
  40511f7a
- Merge branch 'remove_gitaly_fetch_internal_remote_errors' into 'master' · d4ea957f
  James Fargher authored 3 years ago
  
  Remove feature gitaly_fetch_internal_remote_errors
  
  Closes #3588
  
  See merge request gitlab-org/gitaly!3647
  d4ea957f
Jul 12, 2021

Remove feature gitaly_fetch_internal_remote_errors · ffbd9a31

James Fargher authored 3 years ago

Since FetchInternalRemote has been inlined into ReplicateRepository, we
no longer need to make this RPC's errors more verbose.

ffbd9a31

Merge branch 'ps-code-style-fix' into 'master' · a8d42fb6
Zeger-Jan van de Weg authored 3 years ago
```
Fix various static lint issues

See merge request gitlab-org/gitaly!3666
```
a8d42fb6
Merge branch 'smh-perform-lazy-failovers' into 'master' · 87104617
Zeger-Jan van de Weg authored 3 years ago
```
Perform failovers lazily

Closes #3207

See merge request gitlab-org/gitaly!3543
```
87104617
Add StreamRPC library code · 8a925b40
Jacob Vosmaer authored 3 years ago
```
Changelog: other
```
8a925b40

Standardise package aliases · f6be3b55

Pavlo Strokov authored 3 years ago

Snake case names are uncommon for packages
and package aliases in Go.
This change renames aliases to the preferred single-word form.
We also use a 'gitaly' prefix for project-defined packages
that clash with standard or third-party package names.

f6be3b55

Remove unused declarations · 74514560

Pavlo Strokov authored 3 years ago

Some functions, types, fields and variables are not
used. There is no reason to keep and maintain them.
Some of them were redundant from the moment they were declared,
and others became redundant after later code changes.

74514560

Redundant condition check in the for loop · a6a4d567
Pavlo Strokov authored 3 years ago
```
The done variable is never assigned any value, so
the condition always evaluates to true.
```
a6a4d567

Fix instantiation of the structs with fields assignment · a20f8bc8

Pavlo Strokov authored 3 years ago

If a struct declares a list of fields, values for
those fields can be assigned when creating a struct instance,
either by providing values in the same order as the fields are declared
or in any order if field names are used to assign the values.
The preferred way is to assign by field name, as it is less
error-prone if you add a new field in the middle of the struct
and also more readable, since you can see the list of initialized
fields.
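
A small illustration of the two forms, using a made-up `replica` struct that is not taken from the Gitaly codebase:

```go
package main

import "fmt"

// replica is a made-up struct used only for this illustration.
type replica struct {
	storage    string
	generation int
	healthy    bool
}

func main() {
	// Positional form: values must follow the declaration order exactly, and
	// inserting a new field in the middle of the struct breaks or silently
	// shifts every such literal.
	positional := replica{"gitaly-1", 3, true}

	// Keyed form: order is free, omitted fields take their zero value, and
	// the literal documents which fields are initialized.
	keyed := replica{healthy: true, storage: "gitaly-1", generation: 3}

	fmt.Println(positional == keyed) // prints "true"
}
```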

a20f8bc8

Relocate WriteTemporaryGitalyConfigFile · 3a16378c

Pavlo Strokov authored 3 years ago

WriteTemporaryGitalyConfigFile is used only in
gitaly-hooks, so there is no reason to keep it in the
shared testhelper package. It is moved closer to the
place where it is used and is now unexported.

3a16378c

Relocate GitLab HTTP test server from testhelper package · b932dab3

Pavlo Strokov authored 3 years ago

The testhelper package contains too much unrelated stuff that
needs to be moved elsewhere, closer to the actual usage or
real implementation. In some cases this can help to avoid
circular dependencies between packages in the tests.
It will also give us a more consistent and structured codebase.

The types and functions moved into the gitlab package were
renamed by removing the Gitlab prefix, because it causes a lint issue
about double naming (the prefix repeats the package name).

A couple of structs such as postReceiveResponse that are copies
of the unexported gitlab package structs were removed as they
are not used anymore.

b932dab3

Export unavailable repositories metric · 12061b1c

Sami Hiltunen authored 3 years ago

The current read-only repository count metric describes unavailable
repositories rather than read-only repositories. We have to keep the
name for backwards compatibility as some alerting rules and dashboards
depend on it. To make it possible to migrate to a more accurate metric
later, this commit adds another metric on the side with a more accurate
name and description.

12061b1c

Update read-only repository count metric to account for lazy failover · d8f63097

Sami Hiltunen authored 3 years ago

Read-only repository count metric previously reported the number of
repositories that were outdated on the primary. As Praefect no longer
promotes outdated replicas as primaries, this metric is not really useful
anymore. With lazy failover in place, Praefect will fail over to an up to
date replica as long as there is a healthy one available. The purpose
of this metric was to alert when a repository's availability was degraded,
mainly the writes being blocked. With lazy failover, we no longer would
block the writes as we'd simply promote the up to date node. Praefect hasn't
served reads from outdated replicas since 7af9c950. Having no fully up to
date healthy replicas means the repository is fully unavailable. There's
effectively no more read-only mode. This commit updates the metric to count
repositories which are unavailable according to the new failover logic.
The old metric name is kept in place though as some alerting depends on it.

d8f63097

Remove support for virtual storage scoped primaries in read-only metrics · 77c84dd5

Sami Hiltunen authored 3 years ago

This commit removes the support for virtual storage scoped primaries in the
read-only repository count metric to make future changes easier. Virtual
storage scoped primaries were deprecated in 13.12 and removed in 14.0.

Changelog: removed

77c84dd5

Support lazy failovers in `praefect dataloss` · e900df09

Sami Hiltunen authored 3 years ago

With the recent failover changes, the output of `praefect dataloss`
is no longer accurate. Previously a repository would have been in
read-only mode if the primary of the repository was outdated. With
lazy failovers in place, it's no longer sufficient to check only
whether the current primary is outdated or not. If the current primary
is outdated, Praefect would immediately switch the repository's primary
on the next request if there is an up to date replica available. This
also means that there is no 'read-only mode' anymore, as we'd simply
fail over to an up to date node rather than wait for the current primary
to be brought up to speed. This commit updates the dataloss sub-command
to take the new changes into account:

1. If there is an up to date, available replica for the repository, it's
considered to be available for both reads and writes.
2. If there are no up to date replicas available, the repository is considered
unavailable. As it is, Praefect does not distribute writes to outdated
replicas.
3. To make it easier to determine why a repository is unavailable, 'unavailable'
is printed next to the storages which are considered to be unavailable by the
consensus of the Praefect nodes.
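
The three rules above can be sketched as follows. The `replica` struct, the function names and the output format are hypothetical and do not reproduce the real `praefect dataloss` output.

```go
package main

import "fmt"

// replica is a hypothetical view of one copy of a repository.
type replica struct {
	storage  string
	upToDate bool
	healthy  bool // whether the storage is considered available by Praefect
}

// repositoryAvailable applies rules 1 and 2: the repository is available if
// at least one replica is both up to date and on a healthy storage.
func repositoryAvailable(replicas []replica) bool {
	for _, r := range replicas {
		if r.upToDate && r.healthy {
			return true
		}
	}
	return false
}

// describe applies rule 3: it tags storages that are unavailable so the
// reason for an unavailable repository is easier to spot.
func describe(r replica) string {
	line := r.storage
	if !r.upToDate {
		line += ", outdated"
	}
	if !r.healthy {
		line += ", unavailable"
	}
	return line
}

func main() {
	replicas := []replica{
		{storage: "gitaly-1", upToDate: true, healthy: false},
		{storage: "gitaly-2", upToDate: false, healthy: true},
	}
	fmt.Println("available:", repositoryAvailable(replicas)) // prints "available: false"
	for _, r := range replicas {
		fmt.Println(describe(r))
	}
}
```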

Changelog: changed

e900df09

Replace GetPartiallyReplicatedRepositories with GetPartiallyAvailableRepositories · 7704c707

Sami Hiltunen authored 3 years ago

`praefect dataloss` is using GetPartiallyReplicatedRepositories to get
repositories which have assigned replicas that are outdated. Inferring from the
returned generations it was also reporting whether the repository was in read-only
mode or not. This is not sufficient anymore to determine whether a repository is
unavailable or not due to recent changes:

1. Since 7af9c950, Praefect no longer serves reads from outdated replicas.

2. Praefect no longer elects outdated replicas as primaries. Electing an outdated
primary does not improve the availability of a repository as it still could
accept neither writes nor reads.

3. With introduction of lazy failovers, there is effectively no read-only mode
anymore as Praefect would simply fail over to the up to date node immediately
if one exists.

With those in mind, the behavior of `praefect dataloss` is not accurate anymore.
By default, it attempts to print out repositories which have reduced availability.
To reflect the current failover logic, we should instead print out repositories
which do not have any up to date, healthy nodes available. This commit replaces
the GetPartiallyReplicatedRepositories with GetPartiallyAvailableRepositories.
A repository is considered available by the current logic if there exists a replica
that could serve as the primary. A replica can serve as the primary if it is fully
up to date and healthy. If such a replica exists, the repository is not in read-only
mode as we'd simply use the replica as the primary. If no such replicas exist, the
repository is unavailable.

The dataloss sub-command also has the `-partially-replicated` flag that prints out
repositories which have some assigned replicas that are not fully up to date. That
flag is going to be replaced by the `-partially-available` flag, which returns
repositories which have assigned replicas that are not able to serve requests at
the moment. This effectively does the same as the flag did previously but it also
considers whether the replicas are healthy. This behavior fits better with variable
replication factor: it could be that a repository's only replica is up to date but sits on
an unhealthy node. The previous check would only see that there are no outdated
replicas and not return the repository. The repository would be unavailable though,
as the only replica is on a node that is unhealthy. To better facilitate debugging
these scenarios, the flag is changed to cover replicas on unavailable nodes as well.
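
The difference between the old and the new check can be sketched like this. The struct and both function names are hypothetical; the real logic lives in the datastore query, which is not shown in this commit message.

```go
package main

import "fmt"

// replica is a hypothetical view of one assigned copy of a repository.
type replica struct {
	storage  string
	assigned bool
	upToDate bool
	healthy  bool
}

// partiallyReplicated mimics the old check: it only looks for assigned
// replicas that are outdated.
func partiallyReplicated(replicas []replica) bool {
	for _, r := range replicas {
		if r.assigned && !r.upToDate {
			return true
		}
	}
	return false
}

// partiallyAvailable mimics the new check: an assigned replica also counts
// if it is up to date but sits on an unhealthy node.
func partiallyAvailable(replicas []replica) bool {
	for _, r := range replicas {
		if r.assigned && (!r.upToDate || !r.healthy) {
			return true
		}
	}
	return false
}

func main() {
	// Replication factor of one: the single copy is up to date, but its
	// node is unhealthy.
	replicas := []replica{{storage: "gitaly-1", assigned: true, upToDate: true, healthy: false}}
	fmt.Println(partiallyReplicated(replicas)) // prints "false": the old check misses it
	fmt.Println(partiallyAvailable(replicas))  // prints "true": the new check reports it
}
```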

This commit covers only the datastore changes. The user facing changes in dataloss
are to be done in a follow up commit.

7704c707

Return more information from GetPartiallyReplicatedRepositories · e7cd0922

Sami Hiltunen authored 3 years ago

GetPartiallyReplicatedRepositories returns information about repositories
which have outdated replicas on assigned hosts. The generations returned
are used in `praefect dataloss` to determine whether a repository is in
read-only mode or not. With lazy failover, there is no read-only mode
anymore as Praefect can immediately fail over to another valid primary.
Praefect doesn't serve reads from outdated replicas, so the repository
would effectively be unavailable if there are no up to date and healthy
replicas. To prepare for updating `praefect dataloss` to account for lazy
failovers, let's return the health status and whether the replica can act
as the primary with each of the replicas. We can later use the ValidPrimary
field to determine if the repository is available and the health status to
ease with debugging why a repository may be unavailable. Other than returning
the additional fields, this commit makes no other behavior changes yet.

e7cd0922

Use repository_generations view in GetPartiallyReplicatedRepositories · 9ad66b4c

Sami Hiltunen authored 3 years ago

GetPartiallyReplicatedRepositories is currently using a window function
to get the highest generation from all of the replicas. We've since
introduced the repository_generations view which also gets the highest
generation across the replicas. Let's simplify the query by reusing the
view rather than performing the logic again using the window function.

9ad66b4c

Remove support for virtual storage primaries in `praefect dataloss` · 4d07d9b0

Sami Hiltunen authored 3 years ago

Starting from 14.0, Praefect only supports repository-specific primaries.
This commit removes support for virtual storage scoped primaries in
`praefect dataloss` to make future changes easier.

Changelog: removed

4d07d9b0

Extract a testhelper for setting healthy nodes in the database · fe6d5257

Sami Hiltunen authored 3 years ago

This commit extracts the setHealthyNodes helper from the tests of
PerRepositoryElector so it can be reused in other packages. The helper
is used for setting healthy nodes in the database during tests.

fe6d5257

Use request scoped logger in PerRepositoryElector · 1baa997e

Sami Hiltunen authored 3 years ago

PerRepositoryElector uses its own logger as a remnant from the time
it was performing elections in the background. As the elections now
happen in the request context, let's switch to using the request
context logger. This allows for correlating the primary changes with
the request that triggered that failover.

1baa997e

Perform failovers lazily · 3f09e462

Sami Hiltunen authored 3 years ago

Praefect's PerRepositoryElector runs elections globally when Praefect
launches and when a Gitaly node's health status changes. This approach was
originally taken to match global elections done by the sqlElector as well.
While the sqlElector runs elections after every health check, by default
every 3s, the event driven approach was implemented for the PerRepositoryElector
as it has to perform a lot more work every election run compared to the
sqlElector. The sqlElector has a single primary for each virtual storage
whereas the PerRepositoryElector has a primary record for every repository.
While both electors check every repository's generations to pick the best new
primary, only the PerRepositoryElector has to write potentially a large number
of records as well. We can do a lot better though:

1. If the primary is unavailable only temporarily, there's a high chance that
the repository is not even accessed during the outage. If so, there's no need
to eagerly failover as no one would even see the failure.

2. Most of the operations on the repositories are reads. Reads can be served from
any up to date replica without needing to have a primary. Only once an RPC that
requires the primary arrives do we care about having a healthy primary.

Given the above, this commit implements a lazy approach to failovers. This removes
the background election loop entirely and elects a primary if needed when an RPC
requires a primary. This happens transparently when getting the primary from the
database. This brings multiple benefits:

1. Performance improves as we don't have to perform failovers for repositories which
are not written to during the primary's outage. This reduces the time to perform
failovers as we are working on records of a single repository as opposed to all
of the repositories.

2. Failover code is responsive without having to feed it more and more events. This
becomes more relevant as we implement rebalancing features. When moving a repository
with a single replica, we may have to demote the primary temporarily and we want it
to be re-elected as soon as a request needs it and it's possible. The previous approach
would require us to hook more code into the events, whereas this lazy approach just
works.

3. It's easier to reason about synchronous code rather than asynchronous elections.

4. We can log all the individual changes, as opposed to logging the aggregate stats
of demotions and promotions.

Changelog: performance

3f09e462

Merge branch 'pks-tx-coordinator-replication-error-handling' into 'master' · c8a29dc9
Sami Hiltunen authored 3 years ago
```
coordinator: Only schedule replication for differing error states

See merge request !3642
```
c8a29dc9
Merge branch 'pks-ff-receiver' into 'master' · fb267fb9
Sami Hiltunen authored 3 years ago
```
featureflag: Implement receiver functions on FeatureFlag struct

See merge request !3662
```
fb267fb9

Jul 11, 2021

Update ffi gem to 1.15.3 · 0ef432a6

Stan Hu authored 3 years ago

We're shipping three different versions of this gem in Omnibus.
Update to the latest to avoid wasting space.

https://my.diffend.io/gems/ffi/1.13.1/1.15.3

Changelog: changed

0ef432a6

Jul 09, 2021
- Merge branch 'pks-gitpipe-cancellation' into 'master' · 5fdd1ba6
  Patrick Steinhardt authored 3 years ago
  
  gitpipe: Prioritize context cancellation
  
  Closes #3693 and #3697
  
  See merge request !3658
  5fdd1ba6
- Merge branch 'mk-activesupport-6.1' into 'master' · 1df990f9
  Patrick Steinhardt authored 3 years ago
  
  Bump actionpack, actionview, activesupport to 6.1
  
  See merge request !3661
  1df990f9
- Merge branch 'pks-ff-lfs-pointer-pipeline-default-enabled' into 'master' · 01189840
  Toon Claes authored 3 years ago
  
  featureflag: Default-enable LFS pointers pipeline
  
  See merge request !3653
  01189840