Log and monitor error events and secrets usage

Why are we doing this work

We want to track error events and secrets usage during the Closed Beta so we can determine how frequent and how severe the errors are, and what actions we can take to improve the feature or mitigate errors before GA.

What we want to track (see #571236 (comment 2811425725)):

  1. Are people using this? How much are they using it? Ideally we see at least one secret per project being created and used, or something to that effect, to show minimum adoption, but we are curious what the actual usage is for a secret in its lifecycle.
  2. Is that adoption and usage concentrated? Are there specific pools of users/customers who are heavily using secrets and some who have light or no usage?
  3. In terms of error messages: are we able to categorize and understand what types of errors users are receiving? We've been discussing a lot of different "what if" failures, and before we go and fix everything, it would be good to know which "ifs" are actually impacting our users.

Relevant links

See related discussions in:

Implementation plan

Frontend work will be implemented in #577449 (closed).

The implementation requires adding event tracking and logging across all secret operations (create, read, update, delete) using Snowplow/Internal Events in the backend services. Each operation should track both successful completions and failures with categorized error labels such as 'openbao_connection', 'validation_error', 'stale_secret', and 'permission_denied'. Application logging should capture structured error details including operation type, error messages, project/user IDs, and secret status for debugging purposes. Usage metrics need to be collected to measure adoption (secrets per project, operations over time) and concentration patterns (distribution across users and projects). Frontend tracking in #577449 (closed) should capture UI interactions like page views, filter usage, and stale secret remediation actions.
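The success/failure tracking with categorized error labels described above could be sketched roughly as follows. This is a minimal illustration in plain Ruby, not the actual Internal Events API: the exception classes, the `track_secret_operation` helper, the event name format, and the `'unknown'` fallback label are all assumptions made for the example. Only the four label strings come from the plan.

```ruby
# Hypothetical exception classes, defined here only for illustration.
class OpenbaoConnectionError < StandardError; end
class SecretValidationError < StandardError; end
class StaleSecretError < StandardError; end
class SecretPermissionError < StandardError; end

# Labels from the implementation plan, keyed by the assumed exception classes.
ERROR_LABELS = {
  OpenbaoConnectionError => 'openbao_connection',
  SecretValidationError => 'validation_error',
  StaleSecretError => 'stale_secret',
  SecretPermissionError => 'permission_denied'
}.freeze

# Maps an exception to its label; 'unknown' is an assumed fallback.
def error_label(error)
  ERROR_LABELS.each { |klass, label| return label if error.is_a?(klass) }
  'unknown'
end

# Runs one secret operation (the block) and returns [result, event].
# The event hash stands in for whatever would be sent to the tracking backend.
def track_secret_operation(operation, project_id:, user_id:)
  base = { event: "#{operation}_secret", project_id: project_id, user_id: user_id }
  result = yield
  [result, base.merge(status: 'success')]
rescue StandardError => e
  [nil, base.merge(status: 'failure', label: error_label(e), error_message: e.message)]
end
```

A caller would wrap each create, read, update, or delete in `track_secret_operation(:create, project_id: ..., user_id: ...) { ... }` and forward the returned event hash to the real tracking call.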

Backend Tracking Services

  1. Usage & Adoption Tracking Service

    Tracks all user and CI interactions (create, update, read, delete, usage in pipelines) through Internal Events (Snowplow) as the single source of truth, with aggregated metrics derived downstream in Product Analytics dashboards (Snowflake/Sisense/Tableau).

  2. Reliability & Error Monitoring Service

    Captures operational failures and system health via structured Rails logs (Kibana), runtime exception tracking (Sentry with feature:secrets_management tags), and infrastructure-level signals (Prometheus/k8s) to ensure full visibility into errors, failures, and background job stability.
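The aggregated adoption and concentration metrics mentioned in item 1 would be computed downstream in the analytics tooling, but the arithmetic can be illustrated with a small Ruby sketch. The event shape (`:project_id`, `:operation`), the `secrets_usage_summary` helper, and the metric names are assumptions for illustration only.

```ruby
# Summarizes raw usage events into adoption and concentration figures.
# Each event is assumed to be a hash with :project_id and :operation keys.
def secrets_usage_summary(events, total_projects:)
  ops_per_project = events.group_by { |e| e[:project_id] }.transform_values(&:size)
  total_ops = events.size
  top_ops = ops_per_project.values.max || 0

  {
    # Share of all projects with at least one secret operation.
    adoption_rate: total_projects.zero? ? 0.0 : ops_per_project.size.to_f / total_projects,
    # Operation count per project, to spot heavy and light users.
    operations_per_project: ops_per_project,
    # Share of all operations that come from the single busiest project.
    top_project_share: total_ops.zero? ? 0.0 : top_ops.to_f / total_ops
  }
end
```

A `top_project_share` close to 1.0 would suggest usage is concentrated in one project, while a low `adoption_rate` would suggest few projects have touched secrets at all.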

Frontend Tracking Services

  1. UI Interaction Tracking Service – Log page views and user actions and send them to Internal Events (Snowplow).
  2. Frontend Error Handling Service – Capture surfaced GraphQL/UI errors with categorized messages and report them to Sentry (frontend).

Verification steps

  1. Internal Events (Snowplow)

    What it tracks: User actions and product usage through GitLab's internal event system using the gitlab_standard schema.

    Where to validate:

    • Local: To be checked
    • Production: follow Internal Event Tracking (Snowflake database or Sisense dashboards under Product Analytics)

  2. Application Logs

    What it tracks: Structured JSON logs for operations, errors, and warnings with full context for debugging.

    Where to validate:

  3. Error Tracking (Sentry)

    What it tracks: Exceptions and errors with stack traces, user context, and error categorization.

    Where to validate:

    • Local: Console output when exceptions occur
    • Production: Sentry dashboard at https://sentry.gitlab.net with tag feature:secrets_management
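For the structured JSON logs in item 2, a local check could confirm that each error entry carries the context needed for debugging. The sketch below is an assumption-laden illustration in plain Ruby: the field names, the `build_secret_error_log` and `missing_log_fields` helpers, and the reuse of `secrets_management` as a `feature` field in logs are hypothetical, not the actual logger format.

```ruby
require 'json'
require 'time'

# Fields the plan asks application logs to capture; exact key names are assumed.
REQUIRED_LOG_FIELDS = %w[operation error_message project_id user_id secret_status].freeze

# Builds one JSON log line for a failed secret operation.
def build_secret_error_log(operation:, error:, project_id:, user_id:, secret_status:)
  entry = {
    severity: 'ERROR',
    time: Time.now.utc.iso8601,
    feature: 'secrets_management', # assumed to mirror the Sentry tag
    operation: operation,
    error_class: error.class.name,
    error_message: error.message,
    project_id: project_id,
    user_id: user_id,
    secret_status: secret_status
  }
  JSON.generate(entry)
end

# Returns the required fields that are absent from a JSON log line.
def missing_log_fields(line)
  REQUIRED_LOG_FIELDS - JSON.parse(line).keys
end
```

Running `missing_log_fields` over sampled log lines returns an empty array when every required field is present, which gives a quick pass/fail signal before checking the same fields in Kibana.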
Edited by Dmytro Biryukov