Geo: JWT token expiration too short
Problem
System clocks are synced, however a customer is seeing Git repositories fail to sync, with last_sync_failure
:
Error syncing repository: 13:fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled
Essentially could not read Username
occurs because Geo's JWT authentication failed, and the primary site returned a 401, which would normally cause a Git client to ask for credentials, but the Git client in this case is running in the secondary site's Gitaly node, non-interactively.
More details
The JWT expiration is 60s, plus 60s of leeway => 120s.
If the JWT is decoded after 120s, then you will see errors like: fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled
.
Copied and redacted from confidential issue https://gitlab.com/gitlab-com/geo-customers/-/issues/207#note_1921672521:
Click here to expand
indicates this is an authentication issue. In Gitaly, heres a test that produces that same error. @brodock how does the http auth token get propagated to Gitaly?
Oh... RepoSyncRequest uses an authorization expiration of 10 minutes. Maybe these 3 repos in particular are affected because this expiration is reached?
The default 1 minute expiration was overridden with 10 minutes many years ago to fix the same error: #4881 (closed)
We do an info-refs call and then a receive-pack / upload-pack (I always forget which is which). both use the same token, which will expire if info-refs takes too long.
Though, that
expiration_time
method doesn't appear to be called. We will look into this.
I found in
primary_site_sos_20240424141441/var/log/gitlab/gitlab-rails/geo.log
many of these messages:
{"severity":"ERROR","time":"2024-04-24T07:03:58.904Z","correlation_id":"01HW7EP5N58X5ER4HHZMTFF58Y","class":"Gitlab::Geo::SignedData","gitlab_host":"REDACTED","message":"Signature not within leeway of 60 seconds. Check your system clocks!"}
Which is logged here on both
JWT::ImmatureSignature
andJWT::ExpiredSignature
errors. So those leeway messages could be due to expiration.But I didn't find in the primary site logs:
Geo JWT authentication failed
, which I expected to see along withInvalid signature time
.
It appears
expiration_time
stopped being used accidentally during a refactor in 14.10.I confirmed that the JWT token must be decoded within 120s (60s default validity period + 60s leeway) or else an expiration error will be raised:
[88] pry(main)> signed_data = Gitlab::Geo::SignedData.new(geo_node: GeoNode.secondary_nodes.first, validity_period: 10.seconds).sign_and_encode_data({ 'foo' => 'bar' }); sleep 71; data = Gitlab::Geo::SignedData.new(include_disabled_nodes: @include_disabled).decode_data(signed_data); puts '2'; signed_data = Gitlab::Geo::SignedData.new(geo_node: GeoNode.secondary_nodes.first, validity_period: 10.seconds).sign_and_encode_data({ 'foo' => 'bar' }); sleep 68; data = Gitlab::Geo::SignedData.new(include_disabled_nodes: @include_disabled).decode_data(signed_data) GeoNode Load (0.4ms) SELECT "geo_nodes".* FROM "geo_nodes" WHERE "geo_nodes"."primary" = FALSE ORDER BY "geo_nodes"."id" ASC LIMIT 1 /*application:console,db_config_name:main,console_hostname:MikesGitLabMBP.localdomain,console_username:mkozonogitlab,line:(pry):62:in `__pry__'*/ GeoNode Load (1.1ms) SELECT "geo_nodes".* FROM "geo_nodes" WHERE "geo_nodes"."enabled" = TRUE AND "geo_nodes"."access_key" = 'JQmgyI_TKG9An1wBHIDg' LIMIT 1 /*application:console,db_config_name:main,console_hostname:MikesGitLabMBP.localdomain,console_username:mkozonogitlab,line:/ee/lib/gitlab/geo/signed_data.rb:83:in `hmac_secret'*/ Gitlab::Geo::InvalidSignatureTimeError: Signature not within leeway of 60 seconds. Check your system clocks! from /Users/mkozonogitlab/Developer/gdk2/gitlab/ee/lib/gitlab/geo/signed_data.rb:50:in `rescue in decode_data' Caused by JWT::ExpiredSignature: Signature has expired from /Users/mkozonogitlab/.asdf/installs/ruby/3.2.3/lib/ruby/gems/3.2.0/gems/jwt-2.5.0/lib/jwt/verify.rb:42:in `verify_expiration' [89] pry(main)> signed_data = Gitlab::Geo::SignedData.new(geo_node: GeoNode.secondary_nodes.first, validity_period: 10.seconds).sign_and_encode_data({ 'foo' => 'bar' }); sleep 68; data = Gitlab::Geo::SignedData.new(include_disabled_nodes: @include_disabled).decode_data(signed_data) GeoNode Load (0.4ms) SELECT "geo_nodes".* FROM "geo_nodes" WHERE "geo_nodes"."primary" = FALSE ORDER BY "geo_nodes"."id" ASC LIMIT 1 /*application:console,db_config_name:main,console_hostname:MikesGitLabMBP.localdomain,console_username:mkozonogitlab,line:(pry):63:in `__pry__'*/ GeoNode Load (0.8ms) SELECT "geo_nodes".* FROM "geo_nodes" WHERE "geo_nodes"."enabled" = TRUE AND "geo_nodes"."access_key" = 'JQmgyI_TKG9An1wBHIDg' LIMIT 1 /*application:console,db_config_name:main,console_hostname:MikesGitLabMBP.localdomain,console_username:mkozonogitlab,line:/ee/lib/gitlab/geo/signed_data.rb:83:in `hmac_secret'*/ => {:foo=>"bar"}
We know that Git fetch can take far over 1 or even 10 minutes. And #4881 (closed) implies that the
upload-pack
request only needs to reach the primary Geo site before expiration.upload-pack
comes afterinfo-refs
. So I tried to reproduce the error by making the primary site sleep for 61s during theinfo/refs
request.I ended up seeing a another error which also has been an occasional mystery:
Error syncing repository: 13:Received RST_STREAM with error code 2.
.So I reduced the info refs sleep a little bit, in case a different timeout was being tripped. And saw yet another error that we've seen with
RST_STREAM
before:Error syncing repository: 4:Deadline Exceeded.
.At 30s, as well as at 31s, I got:
Error syncing repository: 13:fetch remote: \"fatal: unable to access 'https://gdk.test:3443/testmkozono/fork-test-2018-08-11/gitaly.git/': Failure when receiving data from the peer\\n\": exit status 128.
At 20s and 28s, the sync succeeded.
Did Gitaly timeouts get increased at some point? Perhaps JWT expiration can only be reached in that way.
Ah ha, I was able to reproduce it by monkeypatching the validity period to a value far below Gitaly timeouts:
module Gitlab; module Geo; class BaseRequest private def geo_auth_token(message) signed_data = Gitlab::Geo::SignedData.new(geo_node: requesting_node, validity_period: -61.seconds).sign_and_encode_data(message) "#{GITLAB_GEO_AUTH_TOKEN_TYPE} #{signed_data}" end end;end;end
With a long info/refs sleep, this yielded
Error syncing repository: 13:fetch remote: \"fatal: could not read Username for 'https://gdk.test:3443': terminal prompts disabled\\n\": exit status 128.
.With info/refs sleep of 5s, and validity period of -56s or -54s:
Error syncing repository: 13:fetch remote: \"fatal: could not read Username for 'https://gdk.test:3443': terminal prompts disabled\\nfatal: expected flush after ref listing\\n\": exit status 128.
With info/refs sleep of 5s, and validity period of -52s, sync succeeded.
Unfortunately I haven't been able to reproduce the exact error
fatal: unable to write request to remote: Broken pipe
, but we can be pretty confident that this would be resolved by increasing the JWT expiration time.Please have the customer validate this assertion by running this in Rails console on the secondary site:
module Gitlab; module Geo; class BaseRequest private def geo_auth_token(message) signed_data = Gitlab::Geo::SignedData.new(geo_node: requesting_node, validity_period: 10.minutes).sign_and_encode_data(message) "#{GITLAB_GEO_AUTH_TOKEN_TYPE} #{signed_data}" end end;end;end
and then in console, resync an affected project:
Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.resync
and run this to look at the sync state:
Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.registry
Note that if they are seeing other errors with the repo then this may not immediately work. This is only expected to resolve JWT authorization expiration:
"error": "fetch remote: \"fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled\\nfatal: unable to write request to remote: Broken pipe\\n\": exit status 128",
Note that you can also see this same error if your primary site's Puma node's system clock is lagging behind the secondary site's Sidekiq node's clock. See https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/common.html#message-machine-clock-is-synchronized--exception
Validate this is your issue
Monkey patch the code in Rails console to increase the validity period of the token from 1 minute to 10 minutes. Run this in Rails console on the secondary site:
module Gitlab; module Geo; class BaseRequest
private
def geo_auth_token(message)
signed_data = Gitlab::Geo::SignedData.new(geo_node: requesting_node, validity_period: 10.minutes).sign_and_encode_data(message)
"#{GITLAB_GEO_AUTH_TOKEN_TYPE} #{signed_data}"
end
end;end;end
In the same Rails console, resync an affected project:
Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.resync
Look at the sync state:
Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.registry
If this issue is your issue, then last_sync_failure
should no longer include fatal: could not read Username
. The state
should now be 2
, meaning synced.
Possible workarounds
Workaround 1
- Hot patch Sidekiq nodes in the secondary site. Meaning, edit
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/gitlab/geo/signed_data.rb
to add, validity_period: 10.minutes
like in the monkey patch above. Then restart Sidekiq. You would have to do this after every upgrade.
Workaround 2
If there are only a few affected projects, and if the sync is long due to "too many refs", then subsequent syncs are expected to go much faster after the first successful one. In that case:
-
SSH into a Puma/Sidekiq/Geo Log Cursor node in the secondary site
-
Open Rails console:
sudo gitlab-rails console
-
Paste this into Rails console:
module Gitlab; module Geo; class BaseRequest private def geo_auth_token(message) signed_data = Gitlab::Geo::SignedData.new(geo_node: requesting_node, validity_period: 180.minutes).sign_and_encode_data(message) "#{GITLAB_GEO_AUTH_TOKEN_TYPE} #{signed_data}" end end;end;end
-
In the same Rails console, resync an affected project:
Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.resync
-
Run this to look at the sync state:
Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.registry
-
If successful, then its
state
should now be2
, meaning synced. -
Repeat for other affected projects.
-
If the sync is long due to "too many refs" (which causes Git in the secondary site's Gitaly to take a long time to figure out which refs to ask for), then the affected projects should now sync much more quickly.
Proposal
- Weight 1: Increase the expiration length from 1 minute to 10 minutes. This was the case from GitLab 10.5 until it was accidentally reverted in 14.10.
Also, we should document the workarounds. In that documentation, we should link to the medium term proposal - re-introduce redownload
. So customers can upvote or comment on it.