- Jul 16, 2021
-
-
Patrick Steinhardt authored
When finalizing a transaction, we always schedule replication jobs in case the primary has returned an error. Given that there are many RPCs which are expected to return errors in a controlled way, e.g. if a commit is missing, this causes us to create replication jobs in many contexts where they are not necessary at all. Thinking about the issue, what we really care about is not whether an RPC failed or not. It's that primary and secondary nodes behaved the same. If both primary and secondaries succeeded, we're good. But if both failed with the same error, then we're good too as long as all transactions have been committed: quorum was reached on all votes and nodes failed in the same way, so we can assume that nodes did indeed perform the same changes. This commit thus relaxes the error condition: we no longer schedule replication jobs just because the primary failed, but only for nodes which returned a different error than the primary. This has two advantages: we only need to schedule jobs selectively for disagreeing nodes instead of targeting all secondaries, and we avoid scheduling jobs in many cases where we do hit errors. Changelog: performance
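The relaxed condition can be sketched in Go as follows. This is a minimal illustration, not the coordinator's actual code: the `nodeResult` type, the `sameOutcome` and `replicationTargets` helpers, the sentinel errors, and the choice to compare errors by message are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors used only for this illustration.
var (
	errCommitMissing = errors.New("commit not found")
	errDiskFull      = errors.New("disk full")
)

// nodeResult is a hypothetical record of how one node finished an RPC.
type nodeResult struct {
	storage   string
	err       error // nil on success
	committed bool  // whether the node committed all of its transactions
}

// sameOutcome reports whether two nodes finished with the same result.
// Comparing error messages is an assumption of this sketch.
func sameOutcome(a, b error) bool {
	if a == nil || b == nil {
		return a == nil && b == nil
	}
	return a.Error() == b.Error()
}

// replicationTargets returns the secondaries that disagree with the primary,
// either because they did not commit or because their error differs.
func replicationTargets(primary nodeResult, secondaries []nodeResult) []string {
	var targets []string
	for _, s := range secondaries {
		if !s.committed || !sameOutcome(primary.err, s.err) {
			targets = append(targets, s.storage)
		}
	}
	return targets
}

func main() {
	primary := nodeResult{storage: "gitaly-1", err: errCommitMissing, committed: true}
	secondaries := []nodeResult{
		{storage: "gitaly-2", err: errCommitMissing, committed: true},
		{storage: "gitaly-3", err: errDiskFull, committed: true},
	}
	// Only gitaly-3 failed differently from the primary.
	fmt.Println(replicationTargets(primary, secondaries))
}
```

Under these assumptions, a primary failure alone schedules nothing; a job is created only for a node whose outcome diverges.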
-
Patrick Steinhardt authored
Relocate GitLab HTTP test server from testhelper package See merge request !3665
-
- Jul 15, 2021
-
-
Toon Claes authored
backup: Refactoring of the backup/restore task to support remote storages See merge request !3569
-
StorageServiceSink selects the storage engine based on the URL provided at its creation. The set of supported storage engines now includes AWS S3, Google Cloud Storage and Azure Blob Storage. The initialization of a particular client relies on a set of environment variables that must be set prior to creating the StorageServiceSink. Each storage provider has its own set of required and optional parameters. We use an in-memory storage provider in tests to verify that the implementation works as expected. Part of: #3633
-
As we are required to support more than filesystem storage for backups, the current implementation needs to be refactored. The core logic of backup creation and restoration is untouched except for the operations working with data streams. The new Sink interface is used to store and retrieve backup data. The existing functionality that stores and retrieves repositories to and from the local filesystem is now part of the FilesystemSink type, which is used as the Sink implementation for the Manager. The Manager doesn't know anything about the underlying storage and just uses the methods defined on the interface. Part of: #3633
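A minimal sketch of this split is shown below. The names Sink, FilesystemSink, Write and GetReader come from the commit messages; the method signatures, the `root` field, and the `roundTrip` and `demo` helpers are assumptions made for illustration and may differ from the real code.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// Sink abstracts where backup data is stored. The signatures are assumed.
type Sink interface {
	Write(ctx context.Context, relativePath string, r io.Reader) error
	GetReader(ctx context.Context, relativePath string) (io.ReadCloser, error)
}

// FilesystemSink stores backup data under a root directory on local disk.
type FilesystemSink struct {
	root string
}

// Write copies the stream into a file below the root, creating directories.
func (fs FilesystemSink) Write(ctx context.Context, relativePath string, r io.Reader) error {
	path := filepath.Join(fs.root, relativePath)
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return fmt.Errorf("create directory: %w", err)
	}
	f, err := os.Create(path)
	if err != nil {
		return fmt.Errorf("create file: %w", err)
	}
	if _, err := io.Copy(f, r); err != nil {
		f.Close()
		return fmt.Errorf("write file: %w", err)
	}
	return f.Close()
}

// GetReader opens the stored file for reading; the caller must close it.
func (fs FilesystemSink) GetReader(ctx context.Context, relativePath string) (io.ReadCloser, error) {
	f, err := os.Open(filepath.Join(fs.root, relativePath))
	if err != nil {
		return nil, fmt.Errorf("open file: %w", err)
	}
	return f, nil
}

// roundTrip stands in for the Manager: it only sees the Sink interface.
func roundTrip(ctx context.Context, sink Sink, path string, data []byte) ([]byte, error) {
	if err := sink.Write(ctx, path, bytes.NewReader(data)); err != nil {
		return nil, err
	}
	r, err := sink.GetReader(ctx, path)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// demo writes and reads back a payload using a temporary directory.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "backup-sink")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	out, err := roundTrip(context.Background(), FilesystemSink{root: dir}, "repo/main.bundle", []byte("bundle data"))
	return string(out), err
}

func main() {
	out, err := demo()
	fmt.Println(out, err)
}
```

Because `roundTrip` depends only on the interface, a sink backed by an object store could be passed in without changing the calling code.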
-
To abstract the actual storage of the backup data the Write and GetReader methods are added to the Filesystem type. This is a preparation step towards having other types of sinks to use, like S3 or Google File Storage services. Part of: #3633
-
Toon Claes authored
Update ffi gem to 1.15.3 See merge request gitlab-org/gitaly!3664
-
James Fargher authored
Set default Prometheus buckets for Gitaly's RPC instrumentation Closes #3431 See merge request !3669
-
- Jul 14, 2021
-
-
Sami Hiltunen authored
Gitaly doesn't set default buckets for RPC latency instrumentation, which leads to the instrumentation being disabled by default. This commit adds default buckets to the configuration, which are used if the buckets are not explicitly configured. Changelog: changed
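The fallback behavior can be sketched like this. The `promConfig` type, its field name, and the bucket values are assumptions for illustration only and are not taken from Gitaly's configuration code.

```go
package main

import "fmt"

// defaultBuckets are illustrative latency boundaries in seconds.
// They are not claimed to be the defaults Gitaly ships with.
var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// promConfig is a hypothetical slice of the Prometheus configuration.
type promConfig struct {
	GRPCLatencyBuckets []float64
}

// latencyBuckets returns the configured buckets, or the defaults when the
// operator has not configured any.
func (c promConfig) latencyBuckets() []float64 {
	if len(c.GRPCLatencyBuckets) == 0 {
		return defaultBuckets
	}
	return c.GRPCLatencyBuckets
}

func main() {
	unset := promConfig{}
	custom := promConfig{GRPCLatencyBuckets: []float64{0.1, 1, 10}}
	fmt.Println(len(unset.latencyBuckets()), custom.latencyBuckets())
}
```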
-
Sami Hiltunen authored
Add StreamRPC library code See merge request !3601
-
Zeger-Jan van de Weg authored
Support lazy failovers in `praefect dataloss` See merge request !3549
-
- Jul 13, 2021
-
-
Zeger-Jan van de Weg authored
Update read-only repository count metric to account for lazy failover See merge request gitlab-org/gitaly!3548
-
James Fargher authored
Remove feature gitaly_fetch_internal_remote_errors Closes #3588 See merge request gitlab-org/gitaly!3647
-
- Jul 12, 2021
-
-
James Fargher authored
Since FetchInternalRemote has been inlined into ReplicateRepository, we no longer need to make this RPC's errors more verbose.
-
Zeger-Jan van de Weg authored
Fix various static lint issues See merge request gitlab-org/gitaly!3666
-
Zeger-Jan van de Weg authored
Perform failovers lazily Closes #3207 See merge request gitlab-org/gitaly!3543
-
Jacob Vosmaer authored
Changelog: other
-
Pavlo Strokov authored
Snake case names are uncommon for packages and package aliases in Go. This change renames the aliases to the preferred single-word names. We also use a 'gitaly' prefix for project-defined packages that clash with standard library or third-party package names.
-
Pavlo Strokov authored
Some functions, types, fields and variables are unused. There is no reason to keep and maintain them. Some were redundant from the moment they were declared, others became so after later code changes.
-
Pavlo Strokov authored
The done variable is never assigned any value, so the condition always evaluates to true.
-
Pavlo Strokov authored
If a struct declares a list of fields, values can be assigned to them at instantiation either positionally, in the order the fields are declared, or in any order by using field names. Assigning by field name is preferred: it is less error prone when a new field is added in the middle of the struct, and it is more readable because the initialized fields are listed explicitly.
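The difference between the two forms is easy to see in a small example. The `node` type and its values below are made up for illustration.

```go
package main

import "fmt"

// node is a hypothetical struct used only to compare the two literal forms.
type node struct {
	Storage string
	Address string
	Token   string
}

func main() {
	// Positional form: values must follow the declaration order exactly.
	// Inserting a new field between Storage and Address would silently shift
	// the values or break the build.
	positional := node{"gitaly-1", "tcp://gitaly-1:8075", "secret"}

	// Keyed form: order is free, and omitted fields keep their zero value.
	keyed := node{
		Token:   "secret",
		Storage: "gitaly-1",
		Address: "tcp://gitaly-1:8075",
	}

	fmt.Println(positional == keyed) // prints true
}
```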
-
Pavlo Strokov authored
WriteTemporaryGitalyConfigFile is used only in gitaly-hooks, so there is no reason to keep it in the shared testhelper package. It is moved closer to where it is used and is now unexported.
-
Pavlo Strokov authored
The testhelper package contains too much unrelated code that should be moved closer to its actual usage or to the real implementation. In some cases this helps to avoid circular dependencies between packages in the tests. It also gives us a more consistent and structured codebase. The types and functions moved into the gitlab package were renamed to drop the Gitlab prefix, because it causes a lint issue about stuttering names (the prefix repeats the package name). A couple of structs such as postReceiveResponse, which are copies of unexported gitlab package structs, were removed as they are not used anymore.
-
Sami Hiltunen authored
The current read-only repository count metric describes unavailable repositories rather than read-only repositories. We have to keep the name for backwards compatibility as some alerting rules and dashboards depend on it. To make it possible to migrate to a more accurate metric later, this commit adds a second metric alongside it with a more accurate name and description.
-
Sami Hiltunen authored
Read-only repository count metric previously reported the number of repositories that were outdated on the primary. As Praefect no longer promotes outdated replicas as primaries, this metric is not really useful anymore. With lazy failover in place, Praefect will failover to an up to date replica as long as there is a healthy one available. The purpose of this metric was to alert when a repository's availability was degraded, mainly the writes being blocked. With lazy failover, we no longer would block the writes as we'd simply promote the up to date node. Praefect hasn't served reads from outdated replicas since 7af9c950. Having no fully up to date healthy replicas means the repository is fully unavailable. There's effectively no more read-only mode. This commit updates the metric to count repositories which are unavailable according to the new failover logic. The old metric name is kept in place though as some alerting depends on it.
-
Sami Hiltunen authored
This commit removes the support for virtual storage scoped primaries in the read-only repository count metric to make future changes easier. Virtual storage scoped primaries were deprecated in 13.12 and removed in 14.0. Changelog: removed
-
Sami Hiltunen authored
With the recent failover changes, the output of `praefect dataloss` is no longer accurate. Previously a repository would have been in read-only mode if the primary of the repository was outdated. With lazy failovers in place, it's no longer sufficient to check only whether the current primary is outdated or not. If the current primary is outdated, Praefect would immediately switch the repository's primary on the next request if there is an up to date replica available. This also means that there is no 'read-only mode' anymore, as we'd simply failover to an up to date node rather than wait for the current primary to be brought up to speed. This commit updates the dataloss sub-command to take the new changes into account: 1. If there is an up to date, available replica for the repository, it's considered to be available for both reads and writes. 2. If there are no up to date replicas available, the repository is considered unavailable. As it is, Praefect does not distribute writes to outdated replicas. 3. To make it easier to determine why a repository is unavailable, 'unavailable' is printed next to the storages which are considered to be unavailable by the consensus of the Praefect nodes. Changelog: changed
-
Sami Hiltunen authored
`praefect dataloss` is using GetPartiallyReplicatedRepositories to get repositories which have assigned replicas that are outdated. Inferring from the returned generations it was also reporting whether the repository was in read-only mode or not. This is not sufficient anymore to determine whether a repository is unavailable or not due to recent changes: 1. Since 7af9c950, Praefect has no longer served reads from outdated replicas. 2. Praefect no longer elects outdated replicas as primaries. Electing an outdated primary does not improve the availability of a repository as it still couldn't accept writes nor reads. 3. With the introduction of lazy failovers, there is effectively no read-only mode anymore as Praefect would simply failover to the up to date node immediately if one exists. With those in mind, the behavior of `praefect dataloss` is not accurate anymore. By default, it attempts to print out repositories which have reduced availability. To reflect the current failover logic, we should instead print out repositories which do not have any up to date, healthy nodes available. This commit replaces the GetPartiallyReplicatedRepositories with GetPartiallyAvailableRepositories. A repository is considered available by the current logic if there exists a replica that could serve as the primary. A replica can serve as the primary if it is fully up to date and healthy. If such a replica exists, the repository is not in read-only mode as we'd simply use the replica as the primary. If no such replicas exist, the repository is unavailable. The dataloss sub-command also has the `-partially-replicated` flag that prints out repositories which have some assigned replicas that are not fully up to date. That flag is going to be replaced by the `partially-available` flag, which returns repositories which have assigned replicas that are not able to serve requests at the moment.
This effectively does the same as the flag did previously but it also considers whether the replicas are healthy. This behavior fits better with variable replication factor: it could be that we have one up to date copy of the repository on an unhealthy node. The previous check would only see that there are no outdated replicas and not return the repository. The repository would be unavailable though, as the only replica is on a node that is unhealthy. To better facilitate debugging these scenarios, the flag is changed to cover replicas on unavailable nodes as well. This commit covers only the datastore changes. The user facing changes in dataloss are to be done in a follow up commit.
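The availability rules described here can be sketched as follows. The real logic lives in the datastore queries; the `replica` type and the `available` and `partiallyAvailable` helpers are hypothetical names used only to illustrate the conditions.

```go
package main

import "fmt"

// replica is a hypothetical view of one copy of a repository.
type replica struct {
	Storage  string
	Assigned bool
	UpToDate bool
	Healthy  bool
}

// validPrimary reports whether the replica could serve as the primary:
// it must be fully up to date and on a healthy node.
func (r replica) validPrimary() bool {
	return r.UpToDate && r.Healthy
}

// available reports whether any replica could serve as the primary.
func available(replicas []replica) bool {
	for _, r := range replicas {
		if r.validPrimary() {
			return true
		}
	}
	return false
}

// partiallyAvailable reports whether any assigned replica cannot serve
// requests right now, whether it is outdated or on an unhealthy node.
func partiallyAvailable(replicas []replica) bool {
	for _, r := range replicas {
		if r.Assigned && !r.validPrimary() {
			return true
		}
	}
	return false
}

func main() {
	// The scenario from the commit message: the only copy is up to date
	// but lives on an unhealthy node.
	repo := []replica{{Storage: "gitaly-1", Assigned: true, UpToDate: true, Healthy: false}}
	fmt.Println(available(repo), partiallyAvailable(repo))
}
```

A check that only looked at `UpToDate` would not report this repository, even though no node can serve it.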
-
Sami Hiltunen authored
GetPartiallyReplicatedRepositories returns information about repositories which have outdated replicas on assigned hosts. The generations returned are used in `praefect dataloss` to determine whether a repository is in read-only mode or not. With lazy failover, there is no read-only mode anymore as Praefect can immediately failover to another valid primary. Praefect doesn't serve reads from outdated replicas, so the repository would effectively be unavailable if there are no up to date and healthy replicas. To prepare for updating `praefect dataloss` to account for lazy failovers, let's return the health status and whether the replica can act as the primary with each of the replicas. We can later use the ValidPrimary field to determine if the repository is available and the health status to ease debugging why a repository may be unavailable. Other than returning the additional fields, this commit makes no other behavior changes yet.
-
Sami Hiltunen authored
GetPartiallyReplicatedRepositories is currently using a window function to get the highest generation from all of the replicas. We've since introduced the repository_generations view which also gets the highest generation across the replicas. Let's simplify the query by reusing the view rather than performing the logic again using the window function.
-
Sami Hiltunen authored
Starting from 14.0, Praefect only supports repository-specific primaries. This commit removes support for virtual storage scoped primaries in `praefect dataloss` to make future changes easier. Changelog: removed
-
Sami Hiltunen authored
This commit extracts the setHealthyNodes helper from the tests of PerRepositoryElector so it can be reused in other packages. The helper is used for setting healthy nodes in the database during tests.
-
Sami Hiltunen authored
PerRepositoryElector uses its own logger as a remnant from the time it was performing elections in the background. As the elections now happen in the request context, let's switch to using the request context logger. This allows for correlating the primary changes with the request that triggered that failover.
-
Sami Hiltunen authored
Praefect's PerRepositoryElector runs elections globally when Praefect launches and when a Gitaly node's health status changes. This approach was originally taken to match global elections done by the sqlElector as well. While the sqlElector runs elections after every health check, by default every 3s, the event driven approach was implemented for the PerRepositoryElector as it has to perform a lot more work every election run compared to the sqlElector. The sqlElector has a single primary for each virtual storage whereas the PerRepositoryElector has a primary record for every repository. While both electors check every repository's generations to pick the best new primary, only the PerRepositoryElector has to write potentially a large number of records as well. We can do a lot better though: 1. If the primary is unavailable only temporarily, there's a high chance that the repository is not even accessed during the outage. If so, there's no need to eagerly failover as no one would even see the failure. 2. Most of the operations on the repositories are reads. Reads can be served from any up to date replica without needing to have a primary. Only once an RPC that requires the primary arrives do we care about having a healthy primary. Given the above, this commit implements a lazy approach to failovers. This removes the background election loop entirely and elects a primary if needed when an RPC requires a primary. This happens transparently when getting the primary from the database. This brings multiple benefits: 1. Performance improves as we don't have to perform failovers for repositories which are not written to during the primary's outage. This reduces the time to perform failovers as we are working on records of a single repository as opposed to all of the repositories. 2. Failover code is responsive without having to feed it more and more events. This becomes more relevant as we implement rebalancing features.
When moving a repository with a single replica, we may have to demote the primary temporarily and we want it to be re-elected as soon as a request needs it and it's possible. The previous approach would require us to hook more code into the events, whereas this lazy approach just works. 3. It's easier to reason about synchronous code rather than asynchronous elections. 4. We can log all the individual changes, as opposed to logging the aggregate stats of demotions and promotions. Changelog: performance
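The lazy approach can be illustrated with a small in-memory sketch. In Praefect the election happens in the database when the primary is fetched; the `repository` and `replica` types, the `getPrimary` method, and the `errNoValidPrimary` error below are hypothetical stand-ins for that logic.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoValidPrimary is a hypothetical error for a repository without any
// healthy, up to date replica.
var errNoValidPrimary = errors.New("no valid primary available")

// replica is a hypothetical record of one copy of a repository.
type replica struct {
	Storage    string
	Generation int
	Healthy    bool
}

// repository is an in-memory stand-in for the per-repository records.
type repository struct {
	Primary  string
	Replicas []replica
}

// getPrimary returns the current primary if it is still valid. Otherwise it
// elects a healthy replica on the latest generation, in the request path.
func (r *repository) getPrimary() (string, error) {
	latest := -1
	for _, rep := range r.Replicas {
		if rep.Generation > latest {
			latest = rep.Generation
		}
	}
	valid := func(rep replica) bool {
		return rep.Healthy && rep.Generation == latest
	}
	// Keep the current primary when it can still serve the request.
	for _, rep := range r.Replicas {
		if rep.Storage == r.Primary && valid(rep) {
			return r.Primary, nil
		}
	}
	// Otherwise fail over to the first valid candidate, only now that a
	// request actually needs a primary.
	for _, rep := range r.Replicas {
		if valid(rep) {
			r.Primary = rep.Storage
			return r.Primary, nil
		}
	}
	return "", errNoValidPrimary
}

func main() {
	repo := &repository{
		Primary: "gitaly-1",
		Replicas: []replica{
			{Storage: "gitaly-1", Generation: 3, Healthy: false},
			{Storage: "gitaly-2", Generation: 3, Healthy: true},
			{Storage: "gitaly-3", Generation: 2, Healthy: true},
		},
	}
	primary, err := repo.getPrimary()
	fmt.Println(primary, err)
}
```

No background loop is involved in this sketch: a repository that is never written to during an outage never triggers an election.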
-
Sami Hiltunen authored
coordinator: Only schedule replication for differing error states See merge request !3642
-
Sami Hiltunen authored
featureflag: Implement receiver functions on FeatureFlag struct See merge request !3662
-
- Jul 11, 2021
-
-
Stan Hu authored
We're shipping three different versions of this gem in Omnibus. Update to the latest to avoid wasting space. https://my.diffend.io/gems/ffi/1.13.1/1.15.3 Changelog: changed
-
- Jul 09, 2021
-
-
Patrick Steinhardt authored
gitpipe: Prioritize context cancellation Closes #3693 and #3697 See merge request !3658
-
Patrick Steinhardt authored
Bump actionpack, actionview, activesupport to 6.1 See merge request !3661
-
Toon Claes authored
featureflag: Default-enable LFS pointers pipeline See merge request !3653
-