# Snippet creation not resilient to Postgres replication lag
## Problem
Creating a snippet fails intermittently when there is replication lag in Postgres.
## How to reproduce
- Set up your GDK with a replica as described in https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md
- Configure 1 minute of replication lag with https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md#simulating-replication-delay
- Update your `config/database.yml` so that you only have the replica database in the replica pools (i.e. remove the primary), so that your `load_balancing` sections for `main` and `ci` look like:

  ```yaml
  load_balancing:
    hosts:
      - /Users/<myuser>/workspace/gitlab-development-kit/postgresql-replica
  ```

- Run `gdk restart`
- Open the Gitaly logs with `gdk tail gitaly`
- Create a personal snippet at http://127.0.0.1:3000/-/snippets/new
- You'll see the snippet creation fail in the UI
- In the Gitaly logs you'll see:

  ```
  "error":"GitLab: The snippet you were looking for could not be found.
  ```
## Technical Explanation
- Creating a snippet in GitLab calls Gitaly to create the repository, and GitLab then saves the storage location in its DB
- GitLab then tries to push to the repository in Gitaly
- Gitaly calls back to GitLab to check if the user has permission to push to the repository
- GitLab checks the repository size to see if the push is allowed
- This needs to look up the storage location, which has not yet replicated to the delayed replica, so it calls `Repository.pick_storage_shard` (sketched below)
- This then calls out to Gitaly with the randomly picked storage, and Gitaly fails to find the repository
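A simplified paraphrase of that fallback (based on `Snippet#repository_storage`; not the exact current code):

```ruby
# Simplified paraphrase of Snippet#repository_storage, not the exact code.
def repository_storage
  # On the lagging replica the snippet_repository row is not visible yet,
  # so the shard name comes back nil...
  snippet_repository&.shard_name ||
    # ...and we fall back to a random healthy shard, which is almost
    # certainly not the shard where Gitaly just created the repository.
    Repository.pick_storage_shard
end
```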
## Possible Solutions
There are multiple problems here that combine to make this fail:
- Gitaly is asking GitLab whether a user has permission to access a repository that is just being created by GitLab itself. It's a redundant permission check, presumably done for consistency with other Gitaly workflows
- GitLab is checking the size of a snippet repository that was just created, before anything has even been pushed to it. The repository will be empty, so this check is pointless
- Gitaly -> GitLab calls use the internal API and therefore don't appear to come from a specific user. This means our ordinary load-balancing primary sticking is not activated and GitLab assumes it can use a replica for the request (illustrated after this list)
- The code in `Snippet#repository_storage` picks a random storage when it doesn't find one in the DB. This is probably sensible at snippet creation time, but it makes no sense when we are trying to check the size of the repository. That path should probably 404 (or similar) instead of picking a random storage and then trying to check the size of something that hasn't been created.
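A rough illustration of the load-balancing problem (hypothetical and simplified; the real routing logic lives in `Gitlab::Database::LoadBalancing`, and `read_from_primary`/`read_from_replica` are stand-ins):

```ruby
# Hypothetical, simplified routing decision. Primary "stickiness" is
# tracked per user session; Gitaly's internal API request carries no
# user session, so nothing marks it as needing the primary.
session = ::Gitlab::Database::LoadBalancing::Session.current

if session.use_primary?
  read_from_primary # never taken for the internal request
else
  read_from_replica # the lagging replica, where the row doesn't exist yet
end
```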
Given the above, this could be fixed in any of the following ways:

- Force GitLab to use the primary database when looking up the storage location. This can be done by wrapping specific code in `::Gitlab::Database::LoadBalancing::Session.current.use_primary do ... end` (see the sketch after this list). This should be done sparingly, and ideally targeted to this one workflow, because extra primary traffic is costly
- Somehow update this internal API request from Gitaly -> GitLab to pick the current session based on the user that originally initiated the request. Maybe this could be done using the correlation_id, or we could assume that internal API requests "about" a user should also be scoped to that user's sticky session
- Update GitLab to skip this repository size check for this initial workflow of creating a snippet
- Update Gitaly to skip this permission check when the request to push to the repository is coming from GitLab, because presumably GitLab has already checked permissions
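A minimal sketch of the first option; the call site here is hypothetical, and the real fix would wrap whichever lookup the internal access check performs:

```ruby
# Hypothetical call site: force the storage lookup onto the primary so
# the freshly written snippet_repository row is visible despite the lag.
::Gitlab::Database::LoadBalancing::Session.current.use_primary do
  storage = snippet.repository_storage
  # ... continue the access check using the correct storage ...
end
```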
## Additional Technical notes
This was discovered as part of our PG14 upgrade: https://ops.gitlab.net/gitlab-com/gl-infra/db-migration/-/issues/60#note_161757. When there is a small amount of replication lag, the QA tests for creating snippets fail regularly. The logs for these failures revealed the problem. There was more discussion in Slack.
The exception from GitLab's perspective was:
```
5:accessor call: route repository accessor: get repository id: repository "nfs-file22"/"@snippets/4f/c5/4fc5aa8deae05ee2f5167da0b81034ac7a3911b48a0bbfec1f1890f160817980.git" not found. debug_error_string:{UNKNOWN:Error received from peer {created_time:"2023-05-31T15:35:52.657361111+00:00", grpc_status:5, grpc_message:"accessor call: route repository accessor: get repository id: repository \"nfs-file22\"/\"@snippets/4f/c5/4fc5aa8deae05ee2f5167da0b81034ac7a3911b48a0bbfec1f1890f160817980.git\" not found"}}
lib/gitlab/gitaly_client.rb:185:in `execute'
lib/gitlab/gitaly_client/call.rb:18:in `block in call'
lib/gitlab/gitaly_client/call.rb:55:in `recording_request'
lib/gitlab/gitaly_client/call.rb:17:in `call'
lib/gitlab/gitaly_client.rb:175:in `call'
lib/gitlab/gitaly_client/with_feature_flag_actors.rb:31:in `block in gitaly_client_call'
lib/gitlab/gitaly_client.rb:541:in `with_feature_flag_actors'
lib/gitlab/gitaly_client/with_feature_flag_actors.rb:25:in `gitaly_client_call'
lib/gitlab/gitaly_client/repository_service.rb:51:in `repository_size'
lib/gitlab/git/repository.rb:342:in `size'
app/models/repository.rb:578:in `size'
lib/gitlab/repository_cache_adapter.rb:21:in `block (2 levels) in cache_method'
lib/gitlab/repository_cache.rb:25:in `fetch'
lib/gitlab/repository_cache_adapter.rb:163:in `block in cache_method_output'
lib/gitlab/utils/strong_memoize.rb:34:in `strong_memoize'
lib/gitlab/repository_cache_adapter.rb:203:in `block in memoize_method_output'
lib/gitlab/repository_cache_adapter.rb:212:in `no_repository_fallback'
lib/gitlab/repository_cache_adapter.rb:202:in `memoize_method_output'
lib/gitlab/repository_cache_adapter.rb:162:in `cache_method_output'
lib/gitlab/repository_cache_adapter.rb:20:in `block in cache_method'
ee/app/models/ee/snippet.rb:19:in `block (2 levels) in repository_size_checker'
lib/gitlab/repository_size_checker.rb:18:in `current_size'
lib/gitlab/repository_size_checker.rb:28:in `above_size_limit?'
ee/lib/ee/gitlab/repository_size_checker.rb:14:in `above_size_limit?'
lib/gitlab/git_access.rb:473:in `check_size_before_push!'
lib/gitlab/git_access_snippet.rb:109:in `check_change_access!'
lib/gitlab/git_access_snippet.rb:77:in `check_push_access!'
ee/lib/ee/gitlab/git_access_snippet.rb:21:in `check_push_access!'
lib/gitlab/git_access.rb:92:in `check'
ee/lib/ee/gitlab/git_access.rb:18:in `check'
lib/gitlab/git_access_snippet.rb:33:in `check'
lib/api/internal/base.rb:117:in `access_check!'
lib/api/internal/base.rb:63:in `block in check_allowed'
lib/api/internal/base.rb:141:in `with_admin_mode_bypass!'
lib/api/internal/base.rb:62:in `check_allowed'
ee/lib/ee/api/internal/base.rb:22:in `block in check_allowed'
ee/lib/gitlab/ip_address_state.rb:10:in `with'
ee/lib/ee/api/internal/base.rb:21:in `check_allowed'
lib/api/internal/base.rb:165:in `block (2 levels) in <class:Base>'
ee/lib/gitlab/middleware/ip_restrictor.rb:11:in `call'
lib/api/api_guard.rb:222:in `call'
ee/lib/omni_auth/strategies/group_saml.rb:41:in `other_phase'
lib/gitlab/metrics/elasticsearch_rack_middleware.rb:16:in `call'
lib/gitlab/middleware/memory_report.rb:13:in `call'
lib/gitlab/middleware/speedscope.rb:13:in `call'
lib/gitlab/database/load_balancing/rack_middleware.rb:23:in `call'
lib/gitlab/middleware/rails_queue_duration.rb:33:in `call'
lib/gitlab/metrics/rack_middleware.rb:16:in `block in call'
lib/gitlab/metrics/web_transaction.rb:46:in `run'
lib/gitlab/metrics/rack_middleware.rb:16:in `call'
lib/gitlab/jira/middleware.rb:19:in `call'
lib/gitlab/middleware/go.rb:20:in `call'
lib/gitlab/etag_caching/middleware.rb:21:in `call'
lib/gitlab/middleware/query_analyzer.rb:11:in `block in call'
lib/gitlab/database/query_analyzer.rb:37:in `within'
lib/gitlab/middleware/query_analyzer.rb:11:in `call'
lib/gitlab/middleware/multipart.rb:173:in `call'
lib/gitlab/middleware/read_only/controller.rb:50:in `call'
lib/gitlab/middleware/read_only.rb:18:in `call'
lib/gitlab/middleware/same_site_cookies.rb:27:in `call'
lib/gitlab/middleware/basic_health_check.rb:25:in `call'
lib/gitlab/middleware/handle_malformed_strings.rb:21:in `call'
lib/gitlab/middleware/handle_ip_spoof_attack_error.rb:25:in `call'
lib/gitlab/middleware/request_context.rb:15:in `call'
lib/gitlab/middleware/webhook_recursion_detection.rb:15:in `call'
config/initializers/fix_local_cache_middleware.rb:11:in `call'
lib/gitlab/middleware/compressed_json.rb:45:in `call'
lib/gitlab/middleware/rack_multipart_tempfile_factory.rb:19:in `call'
lib/gitlab/middleware/sidekiq_web_static.rb:20:in `call'
lib/gitlab/metrics/requests_rack_middleware.rb:79:in `call'
lib/gitlab/middleware/release_env.rb:13:in `call'
```
The problem is likely to arise during our production upgrade to PG14 as well. We can probably retry these QA failures if they happen, but the same failure will also hit users trying to create snippets during the PG14 upgrade process. It may already be happening to some users periodically today, though less frequently, because our replication lag is normally quite low.