Decide where to store scoped_user_id

Problem

We need to choose where to persist scoped_user_id. Today it is immutable data, but it is not a good candidate for p_ci_job_definitions because it would negatively impact deduplication: the same job definition can be triggered by many different users.

We therefore need to store scoped_user_id outside p_ci_job_definitions.
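
The deduplication concern can be illustrated with a minimal sketch. It assumes job definitions are deduplicated by a checksum over their configuration; the checksum scheme, the function names, and the sample config are hypothetical and not taken from the actual implementation.

```python
import hashlib
import json


def definition_checksum(config: dict) -> str:
    """Return a stable checksum for a job definition payload (assumed scheme)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def count_definition_rows(jobs: list, include_scoped_user: bool) -> int:
    """Count the distinct definition rows needed to store the given jobs."""
    checksums = set()
    for job in jobs:
        payload = dict(job["config"])
        if include_scoped_user:
            # Folding per-user data into the definition changes its checksum.
            payload["scoped_user_id"] = job["scoped_user_id"]
        checksums.add(definition_checksum(payload))
    return len(checksums)


# The same job config triggered on behalf of three different human users.
config = {"script": ["bundle exec rspec"], "image": "ruby:3.3"}
jobs = [{"config": config, "scoped_user_id": uid} for uid in (101, 102, 103)]

rows_without = count_definition_rows(jobs, include_scoped_user=False)
rows_with = count_definition_rows(jobs, include_scoped_user=True)
```

In this sketch, keeping scoped_user_id out of the definition lets all three jobs share one row, while including it produces one row per user.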

Proposal

Create a dedicated table, p_ci_job_identities, that contains the scoped_user_id <-> job_id association together with the partitioning and sharding keys.

Backwards compatibility

Since scoped_user_id is currently persisted in options, we need to evaluate the current behavior and what should happen when persisting this data into a dedicated table:

  1. Current: scoped_user_id is persisted during pipeline creation.
  2. Current: when a user is deleted, we keep scoped_user_id as is. We could probably use a loose foreign key (LFK) between scoped_user_id and users.id and delete the record. Behavior should remain the same.
  3. Current: when a job is retried, we propagate (during cloning) the scoped_user_id to the new job.
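
The three behaviors above can be sketched against an in-memory stand-in for the dedicated table. This is only an illustration: the column names partition_id and project_id, the JobIdentityStore class, and all method names are assumptions, not the actual schema or application code.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class JobIdentity:
    """Hypothetical row shape for p_ci_job_identities."""
    job_id: int
    partition_id: int  # partitioning key (assumed name)
    project_id: int    # sharding key (assumed name)
    scoped_user_id: int


class JobIdentityStore:
    """In-memory stand-in for the dedicated table, keyed by job_id."""

    def __init__(self) -> None:
        self._rows: Dict[int, JobIdentity] = {}

    def record_on_pipeline_creation(self, identity: JobIdentity) -> None:
        # Behavior 1: persisted once, when the pipeline creates the job.
        self._rows[identity.job_id] = identity

    def find(self, job_id: int) -> Optional[JobIdentity]:
        return self._rows.get(job_id)

    def clone_for_retry(self, old_job_id: int, new_job_id: int) -> Optional[JobIdentity]:
        # Behavior 3: the retried job inherits the original scoped_user_id.
        original = self._rows.get(old_job_id)
        if original is None:
            return None
        cloned = replace(original, job_id=new_job_id)
        self._rows[new_job_id] = cloned
        return cloned

    def delete_for_user(self, user_id: int) -> int:
        # Option raised in behavior 2: LFK-style cleanup after user deletion.
        # Returns the number of rows removed.
        doomed = [jid for jid, row in self._rows.items() if row.scoped_user_id == user_id]
        for jid in doomed:
            del self._rows[jid]
        return len(doomed)


store = JobIdentityStore()
store.record_on_pipeline_creation(
    JobIdentity(job_id=1, partition_id=100, project_id=7, scoped_user_id=42)
)
retried = store.clone_for_retry(old_job_id=1, new_job_id=2)
```

The delete_for_user method models only the cleanup option under discussion; keeping the record as is, which is the current behavior, would simply mean never calling it.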
Old issue - Decide where to store `scoped_user_id`

Proposals

2 options so far:

  1. Introduce a new table, p_ci_job_processing, that would mainly persist this column but could also hold other similar data in the future.
    1. Adding a new table comes with the overhead of the standard columns: job_id, project_id, partition_id, created_at, updated_at. This would be highly inefficient for a single integer column, scoped_user_id.
    2. If in the future we have similar types of data (processing data, or immutable data that is not a good candidate for deduplication), we could store it here.
  2. Persist it in p_ci_builds. This option could be preferable if we want to display which human user triggered the job through a service account/agent.
    1. scoped_user_id is used when service accounts (e.g. Duo Workflow or AmazonQ mapped to a user_id) trigger actions on behalf of a human user (in this case tracked via scoped_user_id). Arguably this data could also be considered intrinsic.
    2. Although scoped_user_id is used today as processing data for authorization, it tends toward being intrinsic data. For example, as with user_id, we may want to audit or display which human user (scoped_user_id) triggered the action through the service account. This information may at some point be displayed in the UI or exposed via the API.
    3. It's an extra column to add to ci_builds - but as of today it doesn't need to be indexed or linked via FK because it's currently stored in options.
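
The overhead argument against a dedicated table in option 1 can be made concrete with a rough estimate. The column widths below are assumptions (8-byte integers and timestamps); the issue does not specify column types, and the sketch ignores row headers, alignment, and indexes.

```python
# Assumed column widths in bytes (hypothetical: types are not given in the issue).
STANDARD_COLUMNS = {
    "job_id": 8,
    "project_id": 8,
    "partition_id": 8,
    "created_at": 8,
    "updated_at": 8,
}
PAYLOAD_COLUMNS = {"scoped_user_id": 8}


def payload_share(standard: dict, payload: dict) -> float:
    """Fraction of each row's column bytes that carries the actual payload."""
    payload_bytes = sum(payload.values())
    total_bytes = sum(standard.values()) + payload_bytes
    return payload_bytes / total_bytes


share = payload_share(STANDARD_COLUMNS, PAYLOAD_COLUMNS)
```

Under these assumptions only one sixth of each row's column data would be scoped_user_id itself; adding the same column to an existing table avoids repeating the standard columns.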
Edited by 🤖 GitLab Bot 🤖