Geo: JWT token expiration too short

Problem

System clocks are synced, however a customer is seeing Git repositories fail to sync, with last_sync_failure:

Error syncing repository: 13:fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled

Essentially could not read Username occurs because Geo's JWT authentication failed, and the primary site returned a 401, which would normally cause a Git client to ask for credentials, but the Git client in this case is running in the secondary site's Gitaly node, non-interactively.

More details

The JWT expiration is 60s, plus 60s of leeway => 120s.

If the JWT is decoded after 120s, then you will see errors like: fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled.

Copied and redacted from confidential issue https://gitlab.com/gitlab-com/geo-customers/-/issues/207#note_1921672521:

Click here to expand

indicates this is an authentication issue. In Gitaly, heres a test that produces that same error. @brodock how does the http auth token get propagated to Gitaly?

Oh... RepoSyncRequest uses an authorization expiration of 10 minutes. Maybe these 3 repos in particular are affected because this expiration is reached?

The default 1 minute expiration was overridden with 10 minutes many years ago to fix the same error: #4881 (closed)

We do an info-refs call and then a receive-pack / upload-pack (I always forget which is which). both use the same token, which will expire if info-refs takes too long.

Though, that expiration_time method doesn't appear to be called. We will look into this.

I found in primary_site_sos_20240424141441/var/log/gitlab/gitlab-rails/geo.log many of these messages:

{"severity":"ERROR","time":"2024-04-24T07:03:58.904Z","correlation_id":"01HW7EP5N58X5ER4HHZMTFF58Y","class":"Gitlab::Geo::SignedData","gitlab_host":"REDACTED","message":"Signature not within leeway of 60 seconds. Check your system clocks!"}

Which is logged here on both JWT::ImmatureSignature and JWT::ExpiredSignature errors. So those leeway messages could be due to expiration.

But I didn't find in the primary site logs: Geo JWT authentication failed, which I expected to see along with Invalid signature time.

It appears expiration_time stopped being used accidentally during a refactor in 14.10.

I confirmed that the JWT token must be decoded within 120s (60s default validity period + 60s leeway) or else an expiration error will be raised:
[88] pry(main)> signed_data = Gitlab::Geo::SignedData.new(geo_node: GeoNode.secondary_nodes.first, validity_period: 10.seconds).sign_and_encode_data({ 'foo' => 'bar' }); sleep 71; data = Gitlab::Geo::SignedData.new(include_disabled_nodes: @include_disabled).decode_data(signed_data); puts '2'; signed_data = Gitlab::Geo::SignedData.new(geo_node: GeoNode.secondary_nodes.first, validity_period: 10.seconds).sign_and_encode_data({ 'foo' => 'bar' }); sleep 68; data = Gitlab::Geo::SignedData.new(include_disabled_nodes: @include_disabled).decode_data(signed_data)
  GeoNode Load (0.4ms)  SELECT "geo_nodes".* FROM "geo_nodes" WHERE "geo_nodes"."primary" = FALSE ORDER BY "geo_nodes"."id" ASC LIMIT 1 /*application:console,db_config_name:main,console_hostname:MikesGitLabMBP.localdomain,console_username:mkozonogitlab,line:(pry):62:in `__pry__'*/
  GeoNode Load (1.1ms)  SELECT "geo_nodes".* FROM "geo_nodes" WHERE "geo_nodes"."enabled" = TRUE AND "geo_nodes"."access_key" = 'JQmgyI_TKG9An1wBHIDg' LIMIT 1 /*application:console,db_config_name:main,console_hostname:MikesGitLabMBP.localdomain,console_username:mkozonogitlab,line:/ee/lib/gitlab/geo/signed_data.rb:83:in `hmac_secret'*/
Gitlab::Geo::InvalidSignatureTimeError: Signature not within leeway of 60 seconds. Check your system clocks!
from /Users/mkozonogitlab/Developer/gdk2/gitlab/ee/lib/gitlab/geo/signed_data.rb:50:in `rescue in decode_data'
Caused by JWT::ExpiredSignature: Signature has expired
from /Users/mkozonogitlab/.asdf/installs/ruby/3.2.3/lib/ruby/gems/3.2.0/gems/jwt-2.5.0/lib/jwt/verify.rb:42:in `verify_expiration'
[89] pry(main)> signed_data = Gitlab::Geo::SignedData.new(geo_node: GeoNode.secondary_nodes.first, validity_period: 10.seconds).sign_and_encode_data({ 'foo' => 'bar' }); sleep 68; data = Gitlab::Geo::SignedData.new(include_disabled_nodes: @include_disabled).decode_data(signed_data)
  GeoNode Load (0.4ms)  SELECT "geo_nodes".* FROM "geo_nodes" WHERE "geo_nodes"."primary" = FALSE ORDER BY "geo_nodes"."id" ASC LIMIT 1 /*application:console,db_config_name:main,console_hostname:MikesGitLabMBP.localdomain,console_username:mkozonogitlab,line:(pry):63:in `__pry__'*/
  GeoNode Load (0.8ms)  SELECT "geo_nodes".* FROM "geo_nodes" WHERE "geo_nodes"."enabled" = TRUE AND "geo_nodes"."access_key" = 'JQmgyI_TKG9An1wBHIDg' LIMIT 1 /*application:console,db_config_name:main,console_hostname:MikesGitLabMBP.localdomain,console_username:mkozonogitlab,line:/ee/lib/gitlab/geo/signed_data.rb:83:in `hmac_secret'*/
=> {:foo=>"bar"}
We know that Git fetch can take far over 1 or even 10 minutes. And #4881 (closed) implies that the upload-pack request only needs to reach the primary Geo site before expiration. upload-pack comes after info-refs. So I tried to reproduce the error by making the primary site sleep for 61s during the info/refs request.

I ended up seeing a another error which also has been an occasional mystery: Error syncing repository: 13:Received RST_STREAM with error code 2..

So I reduced the info refs sleep a little bit, in case a different timeout was being tripped. And saw yet another error that we've seen with RST_STREAM before: Error syncing repository: 4:Deadline Exceeded..

At 30s, as well as at 31s, I got: Error syncing repository: 13:fetch remote: \"fatal: unable to access 'https://gdk.test:3443/testmkozono/fork-test-2018-08-11/gitaly.git/': Failure when receiving data from the peer\\n\": exit status 128.

At 20s and 28s, the sync succeeded.

Did Gitaly timeouts get increased at some point? Perhaps JWT expiration can only be reached in that way.

Ah ha, I was able to reproduce it by monkeypatching the validity period to a value far below Gitaly timeouts:
module Gitlab; module Geo; class BaseRequest
  private
  def geo_auth_token(message)
    signed_data = Gitlab::Geo::SignedData.new(geo_node: requesting_node, validity_period: -61.seconds).sign_and_encode_data(message)

    "#{GITLAB_GEO_AUTH_TOKEN_TYPE} #{signed_data}"
  end
end;end;end
With a long info/refs sleep, this yielded Error syncing repository: 13:fetch remote: \"fatal: could not read Username for 'https://gdk.test:3443': terminal prompts disabled\\n\": exit status 128..

With info/refs sleep of 5s, and validity period of -56s or -54s: Error syncing repository: 13:fetch remote: \"fatal: could not read Username for 'https://gdk.test:3443': terminal prompts disabled\\nfatal: expected flush after ref listing\\n\": exit status 128.

With info/refs sleep of 5s, and validity period of -52s, sync succeeded.

Unfortunately I haven't been able to reproduce the exact error fatal: unable to write request to remote: Broken pipe, but we can be pretty confident that this would be resolved by increasing the JWT expiration time.

Please have the customer validate this assertion by running this in Rails console on the secondary site:
module Gitlab; module Geo; class BaseRequest
  private
  def geo_auth_token(message)
    signed_data = Gitlab::Geo::SignedData.new(geo_node: requesting_node, validity_period: 10.minutes).sign_and_encode_data(message)

    "#{GITLAB_GEO_AUTH_TOKEN_TYPE} #{signed_data}"
  end
end;end;end
and then in console, resync an affected project:
Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.resync
and run this to look at the sync state:
Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.registry
Note that if they are seeing other errors with the repo then this may not immediately work. This is only expected to resolve JWT authorization expiration: "error": "fetch remote: \"fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled\\nfatal: unable to write request to remote: Broken pipe\\n\": exit status 128",

Note that you can also see this same error if your primary site's Puma node's system clock is lagging behind the secondary site's Sidekiq node's clock. See https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/common.html#message-machine-clock-is-synchronized--exception

Validate this is your issue

Monkey patch the code in Rails console to increase the validity period of the token from 1 minute to 10 minutes. Run this in Rails console on the secondary site:

module Gitlab; module Geo; class BaseRequest
  private
  def geo_auth_token(message)
    signed_data = Gitlab::Geo::SignedData.new(geo_node: requesting_node, validity_period: 10.minutes).sign_and_encode_data(message)

    "#{GITLAB_GEO_AUTH_TOKEN_TYPE} #{signed_data}"
  end
end;end;end

In the same Rails console, resync an affected project:

Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.resync

Look at the sync state:

Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.registry

If this issue is your issue, then last_sync_failure should no longer include fatal: could not read Username. The state should now be 2, meaning synced.

Possible workarounds

Workaround 1

Hot patch Sidekiq nodes in the secondary site. Meaning, edit /opt/gitlab/embedded/service/gitlab-rails/ee/lib/gitlab/geo/signed_data.rb to add , validity_period: 10.minutes like in the monkey patch above. Then restart Sidekiq. You would have to do this after every upgrade.

Workaround 2

If there are only a few affected projects, and if the sync is long due to "too many refs", then subsequent syncs are expected to go much faster after the first successful one. In that case:

SSH into a Puma/Sidekiq/Geo Log Cursor node in the secondary site
Open Rails console: sudo gitlab-rails console

Paste this into Rails console:

module Gitlab; module Geo; class BaseRequest
  private
  def geo_auth_token(message)
    signed_data = Gitlab::Geo::SignedData.new(geo_node: requesting_node, validity_period: 180.minutes).sign_and_encode_data(message)

    "#{GITLAB_GEO_AUTH_TOKEN_TYPE} #{signed_data}"
  end
end;end;end

In the same Rails console, resync an affected project:

Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.resync

Run this to look at the sync state:

Project.find_by_full_path('mygroup/mysubgroup/myproject').replicator.registry

If successful, then its state should now be 2, meaning synced.
Repeat for other affected projects.
If the sync is long due to "too many refs" (which causes Git in the secondary site's Gitaly to take a long time to figure out which refs to ask for), then the affected projects should now sync much more quickly.

Proposal

Weight 1: Increase the expiration length from 1 minute to 10 minutes. This was the case from GitLab 10.5 until it was accidentally reverted in 14.10.

Also, we should document the workarounds. In that documentation, we should link to the medium term proposal - re-introduce redownload. So customers can upvote or comment on it.

Edited Jun 13, 2024 by Michael Kozono