Add `last_used_at` to the agent token API mitigating Redis N+1 issues

mentioned in merge request !85704 (merged)

added to epic &7204

changed the description

Thanks for opening this issue, @tuxtimo.

I've updated the description to capture the discussion details in this issue, so we can keep the conversation going here.

Next, we should decide between proposals 1 and 2, or any other candidate that comes up.

I've set the priorities for the team for refinement in the configure board.

(@tuxtimo we use this board to plan and to track ongoing work at a high level. The ~"workflow::refinement" list shows the issues waiting to be refined, in priority order.)

added ~"workflow::refinement" label

added ~"group::configure [DEPRECATED]" label

mentioned in issue #357320

added ~"workflow::ready for design" label and removed ~"workflow::refinement" label

added ~"workflow::refinement" label and removed ~"workflow::ready for design" label

Setting label(s) ~"devops::configure" ~"section::ops" based on ~"group::configure".

added ~"devops::configure [DEPRECATED]" ~"section::ops" labels

Hi configure backend team - @Alexand, @hfyngvason, @tiger

Could you please take a look at the proposals so that we can move forward with this issue? Thanks!

Out of the two proposals, I would prefer to start with:

Trying to use a combination of the BatchLoader and Redis mget, for instance to preload all Redis values to the presenter with just one Redis command.

That being said, how does the GraphQL API used in the frontend work around this problem?
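To make the preferred proposal concrete, here is a minimal sketch of the difference between one `GET` per token and a single `MGET` for all tokens. It is written in Python for brevity and is not GitLab's Ruby code: `FakeRedis` is an in-memory stand-in that only counts commands, and the `cache_key` layout is a hypothetical name chosen for illustration.

```python
from typing import Dict, Iterable, List, Optional


class FakeRedis:
    """In-memory stand-in for a Redis client that counts commands sent."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = dict(data)
        self.commands = 0

    def get(self, key: str) -> Optional[str]:
        self.commands += 1
        return self._data.get(key)

    def mget(self, keys: Iterable[str]) -> List[Optional[str]]:
        self.commands += 1  # one command, however many keys are requested
        return [self._data.get(key) for key in keys]


def cache_key(token_id: int) -> str:
    # Hypothetical key layout, for illustration only.
    return f"agent_token:{token_id}:last_used_at"


def last_used_n_plus_one(redis: FakeRedis, token_ids: List[int]) -> Dict[int, Optional[str]]:
    # One GET per token: N commands for N tokens.
    return {tid: redis.get(cache_key(tid)) for tid in token_ids}


def last_used_batched(redis: FakeRedis, token_ids: List[int]) -> Dict[int, Optional[str]]:
    # One MGET for all tokens; values come back in the same order as the keys.
    values = redis.mget([cache_key(tid) for tid in token_ids])
    return dict(zip(token_ids, values))
```

Both functions return the same mapping; only the number of commands differs, which is the property a BatchLoader-style preload would rely on.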

Alternative proposal

A combination of the following:

An upper bound of active (not revoked) tokens at a time. There is no reason to allow more than two. Having more active tokens than that is a security liability, and probably means the user is misunderstanding the purpose of multiple tokens.
When reading last_used_at, only read from Redis if the token is not revoked. Otherwise, use the database.
When revoking a token, make sure to persist the value of last_used_at. That way, reading revoked tokens accurately shows the last time the token successfully authenticated an agent.
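The read and revoke rules in this proposal can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: `AgentToken`, `read_last_used_at`, and `revoke` are hypothetical names, a plain dict stands in for Redis, and the `last_used_at` attribute stands in for the database column.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AgentToken:
    id: int
    revoked: bool = False
    last_used_at: Optional[str] = None  # value persisted in the database


def read_last_used_at(token: AgentToken, cache: Dict[int, str]) -> Optional[str]:
    # Revoked tokens never hit the cache; their database value is final.
    if token.revoked:
        return token.last_used_at
    # Active tokens prefer the fresher cached value, falling back to the database.
    return cache.get(token.id, token.last_used_at)


def revoke(token: AgentToken, cache: Dict[int, str]) -> None:
    # Flush the cached timestamp into the database column before revoking,
    # so the revoked token still shows its last successful authentication.
    cached = cache.pop(token.id, None)
    if cached is not None:
        token.last_used_at = cached
    token.revoked = True
```

With an upper bound on active tokens, only the non-revoked branch touches the cache, so the number of cache reads per listing is bounded by that limit.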

How does the GraphQL API used in the frontend work around this problem?

It doesn't. This will also be an N+1.

An upper bound of active (not revoked) tokens at a time. There is no reason to allow more than two

This is a great point, and we should do this anyway. Even a generous limit of, say, 10 would mean that even with an N+1 to Redis the load wouldn't be terrible.

IMO serving the potentially out of date database value isn't a problem, as long as we document the limitation.

@nagyv-gitlab do you have any thoughts?

@hfyngvason @tigerwnz Restricting active tokens to max 2

If combining the two sources is a minor issue, I like the alternative proposal.

IMO serving the potentially out of date database value isn't a problem, as long as we document the limitation.

The main risk I see is that we might leave it that way for a long time and end up responding to a lot of inquiries from confused users (especially internally on Slack). Such interruptions could, over time, end up costing us more than a solution that serves the up-to-date value, but in a hard-to-measure way.

Since we're ok with restricting the number of active tokens, we should be able to go with the alternative proposal in full:

Add a limit of two active tokens
Only use the cache for active tokens
Flush the cached value to the database when a token is revoked

And then, finally:

Add the field to the API response

This will still technically be an N+1 (to Redis), just with a low value for N.

There are also some agents that already have more than two active tokens (the most on GitLab.com currently is 10). This is OK: these tokens will continue to work, and users will just need to revoke some or all of them before registering a new token for that agent.
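As a rough sketch of how a creation-time limit leaves existing agents untouched, the check below runs only when a new token is registered. The names `MAX_ACTIVE_TOKENS`, `TooManyActiveTokens`, and `create_token` are hypothetical, and a list of token names stands in for the agent's active tokens.

```python
from typing import List

MAX_ACTIVE_TOKENS = 2  # limit proposed in this thread


class TooManyActiveTokens(Exception):
    """Raised when an agent already has the maximum number of active tokens."""


def create_token(active_token_names: List[str], new_name: str) -> List[str]:
    # The check runs only at creation time, so agents that already exceed
    # the limit keep working until they try to register another token.
    if len(active_token_names) >= MAX_ACTIVE_TOKENS:
        raise TooManyActiveTokens(
            f"An agent can have at most {MAX_ACTIVE_TOKENS} active tokens; "
            "revoke one before creating another."
        )
    return active_token_names + [new_name]
```

An agent with 10 active tokens is never invalidated by this check; it simply cannot add an eleventh until enough tokens are revoked to bring it below the limit.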

I think we have enough info for this to be picked up now. I'll update the description and move it to ~"workflow::ready for development". @hfyngvason feel free to update or move back to refinement if you have any concerns.

In the sync meeting today we've decided to:

Not send revoked tokens in the response.
Use the compliance framework to aid auditing revoked tokens.
Removing the revoked tokens from the response can only be done in %16.0. But once they're removed, we won't need to flush last_used_at anymore, so this feature reduces to filtering the results to active tokens and limiting them to two.

That said, we have 2 options:

Implement the first version before %16.0, flushing last_used_at to the database column when a token is revoked.
Or, implement the simpler approach once revoked tokens are no longer in the response.

If there's no urgent need for last_used_at to be present, I think option 2 makes more sense.

mentioned in issue gitlab-org/quality/triage-reports#7522 (closed)

added ~"workflow::ready for development" label and removed ~"workflow::refinement" label

set weight to 3

changed the description

mentioned in issue #361029 (closed)

mentioned in merge request !87623 (merged)

mentioned in issue #363119 (closed)

added ~"feature" ~"feature::enhancement" labels

added ~"type::feature" label

changed milestone to %Backlog

added [deprecated] Accepting merge requests label

changed the description

mentioned in epic &7924 (closed)

changed title from Consider a way for how to return last_used_at when listing agent tokens in REST API without N+1 problems to Add last_used_at to the agent token API mitigating Redis N+1 issues

changed the description

changed milestone to %16.0

removed [deprecated] Accepting merge requests label

added breaking change label

mentioned in issue #387309

added ~"group::environments" label and removed ~"group::configure [DEPRECATED]" label

added ~"devops::deploy" label and removed ~"devops::configure [DEPRECATED]" label

assigned to @partiaga

mentioned in issue gitlab-org/cluster-integration/gitlab-agent#327 (moved)

unassigned @partiaga

assigned to @partiaga

changed milestone to %16.1

marked the checklist item Only fetch active tokens (non-revoked). as completed

marked the checklist item Add last_used_at to the agent tokens API response. as completed

I'm starting work on this issue now.

Noting that the last_used_at field is actually already returned in both the REST and GraphQL API when fetching the tokens of an agent (see comment below). So what we need to do is:

Add a limit of two active tokens per agent during creation, while allowing any existing agents with >2 tokens to continue uninterrupted.

Agent Tokens API calls showing last_used_at is included

REST API

$ curl -k -X GET \
--header "Authorization: Bearer $PERSONAL_ACCESS_TOKEN" \
"https://gdk.test:3443/api/v4/projects/20/cluster_agents/26/tokens" \
| json_pp -json_opt pretty,canonical

[
   {
      "agent_id" : 26,
      "created_at" : "2023-05-16T03:11:22.374Z",
      "created_by_user_id" : 1,
      "description" : null,
      "id" : 28,
      "name" : "agentk-test-1-connected",
      "status" : "active"
   },
   {
      "agent_id" : 26,
      "created_at" : "2023-04-26T04:47:56.904Z",
      "created_by_user_id" : 1,
      "description" : null,
      "id" : 21,
      "name" : "agentk-test-1",
      "status" : "active"
   }
]

GraphQL API

$ curl "https://gdk.test:3443/api/graphql" \
-k -X POST \
--header "Authorization: Bearer $PERSONAL_ACCESS_TOKEN" \
--header "Content-Type: application/json" \
--data "{\"query\": \"query {project(fullPath: \\\"pam-test-group/agentk-setup\\\") {name clusterAgent(name: \\\"agentk-test-1\\\") {name tokens {nodes {id name status lastUsedAt}}}}}\"}" \
| json_pp -json_opt pretty,canonical

{
   "data" : {
      "project" : {
         "clusterAgent" : {
            "name" : "agentk-test-1",
            "tokens" : {
               "nodes" : [
                  {
                     "id" : "gid://gitlab/Clusters::AgentToken/28",
                     "lastUsedAt" : "2023-05-16T03:16:48Z",
                     "name" : "agentk-test-1-connected",
                     "status" : "ACTIVE"
                  },
                  {
                     "id" : "gid://gitlab/Clusters::AgentToken/23",
                     "lastUsedAt" : null,
                     "name" : "agentk-test-1-token-2",
                     "status" : "ACTIVE"
                  }
               ]
            }
         },
         "name" : "AgentK Setup"
      }
   }
}
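The hand-escaped `--data` string in the GraphQL call above is easy to get wrong. As a hedged alternative, the sketch below builds the same request body with `json.dumps` and pulls `lastUsedAt` out of a response shaped like the one shown. `build_payload` and `last_used_by_token` are hypothetical helper names, and using `json.dumps` to quote the GraphQL string arguments is an assumption that holds for ordinary paths and names like these.

```python
import json
from typing import Any, Dict, Optional


def build_payload(project_path: str, agent_name: str) -> str:
    # json.dumps quotes the arguments, so no manual backslash escaping is needed.
    query = (
        "query {project(fullPath: %s) {name clusterAgent(name: %s) "
        "{name tokens {nodes {id name status lastUsedAt}}}}}"
        % (json.dumps(project_path), json.dumps(agent_name))
    )
    # The outer dumps produces the JSON body that curl sends with --data.
    return json.dumps({"query": query})


def last_used_by_token(response: Dict[str, Any]) -> Dict[str, Optional[str]]:
    # Map each token name to its lastUsedAt value (None if never used).
    nodes = response["data"]["project"]["clusterAgent"]["tokens"]["nodes"]
    return {node["name"]: node["lastUsedAt"] for node in nodes}
```

The string returned by `build_payload("pam-test-group/agentk-setup", "agentk-test-1")` can be written to a file and passed to curl with `--data @file`.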

changed the description

mentioned in merge request !120825 (merged)

changed the description

mentioned in issue #412399 (closed)

marked this issue as related to #412399 (closed)

Async Issue Update

Status

Complete: 70%
Confidence: 90%

Notes

The change for this is added behind a Feature Flag. It needs to be enabled in production through ChatOps, then enabled by default a few days later if users do not complain.

Merge Requests

!120825 (merged)

added ~"workflow::in dev" label and removed ~"workflow::ready for development" label

Feature Flag rollout thread

This has been enabled in staging and dev instances:

This is going to be enabled in gprd any time after 24 May 2023, 8 AM UTC

This has been enabled in gprd:

mentioned in merge request !122848 (merged)

Async Issue Update

Status

Complete: 100%
Confidence: 100%

Notes

The change for this is added behind a Feature Flag. The change has been enabled by default for all GitLab instances.

Merge Requests

!120825 (merged) !122848 (merged)

Feature Flag Rollout Issue

#412399 (closed)

Closing this issue now, as removal of the feature flag will be handled in #412399 (closed).

closed

added ~"workflow::complete" label and removed ~"workflow::in dev" label

mentioned in issue gitlab-org/ci-cd/deploy-stage/environments-group/info#64 (closed)

mentioned in merge request !125833 (merged)
