Split audit_events table
Background 🏙
audit_events table is used to store all audit information of user-driven actions on GitLab application. "audit_events is the 4th largest table on GitLab.com" source. To provide more comprehensive audit events for compliance-minded organisations, we need to progressive extend our audit coverage while still maintaining reasonably performant and capable search functionalities on large audit events data.
Demands 📈
There is a growing needs to provide complex views around audit event level
- Expose sub-groups and projects events under Group level view (#39139 (closed))
As well as more filtering/searching capabilities on the number of audit events
- Filtering Audit Events (#32388 (closed))
State of the world 🌏
They are visible under 3 levels: Project, Group and Instance as depicted below
We currently provide very limited filtering capabilities on these vast number of audit events, namely:
-
created_atdate range -
entity_typeandentity_id
The table contains a polymorphic type - Entity (entity_id and entity_type) which can be either User, Project or Group.
The audit_events table schema currently look like the following:
| Column | Type | Nullable | Default | Storage |
|---|---|---|---|---|
| id | integer | not null | nextval('audit_events_id_seq'::regclass) | plain |
| author_id | integer | not null | plain | |
| type | character varying | not null | extended | |
| entity_id | integer | not null | plain | |
| entity_type | character varying | not null | extended | |
| details | text | extended | ||
| created_at | timestamp without time zone | plain | ||
| updated_at | timestamp without time zone | plain |
| Index |
|---|
| "audit_events_pkey" PRIMARY KEY, btree (id) |
| "analytics_index_audit_events_on_created_at_and_author_id" btree (created_at, author_id) |
| "index_audit_events_on_entity_id_and_entity_type_and_id_desc" btree (entity_id, entity_type, id DESC) |
Challenges 🚂
The current table schema poses a number of challenges:
- Slow query time
- Difficult to query information in
detailsserialized hash - Inefficient utilization of indexes (i.e.
analytics_index_audit_events_on_created_at_and_author_idis taking up spaces and not used by any queries)
"My main issue with our current schema is that we use a lot of serialized properties in the details section. This makes it very hard to query for certain types of events and index on them. My suggestion to break the tables out is so we can add specific fields for certain operations (e.g. user changed this setting from X to Y)." source
Proposal
Extending on the original @stanhu's thoughts (#7865 (comment 110902945)), we would like to distinguish two types of audit events:
- Settings changes (very much like Change-Data Capture record with author)
- Activities (that not always result in changes of our persisted stores)
instance_setting_audit_events
| Column | Type | Nullable | Default | Storage |
|---|---|---|---|---|
| id | integer | not null | nextval('instance_setting_audit_events_id_seq'::regclass) | plain |
| author_id | integer | not null | plain | |
| attribute_name | character varying(255) | not null | extended | |
| old_value | character varying(255) | not null | extended | |
| new_value | character varying(255) | not null | extended | |
| ip_address | character varying(255) | not null | extended | |
| created_at | timestamp without time zone | not null | plain |
group_setting_audit_events
| Column | Type | Nullable | Default | Storage |
|---|---|---|---|---|
| id | integer | not null | nextval('group_setting_audit_events_id_seq'::regclass) | plain |
| author_id | integer | not null | plain | |
| group_id | integer | not null | plain | |
| attribute_name | character varying(255) | not null | extended | |
| old_value | character varying(255) | not null | extended | |
| new_value | character varying(255) | not null | extended | |
| created_at | timestamp without time zone | not null | plain |
group_activity_audit_events
| Column | Type | Nullable | Default | Storage |
|---|---|---|---|---|
| id | integer | not null | nextval('group_activity_audit_events_id_seq'::regclass) | plain |
| author_id | integer | not null | plain | |
| group_id | integer | not null | plain | |
| created_at | timestamp without time zone | not null | plain |
project_setting_audit_events
| Column | Type | Nullable | Default | Storage |
|---|---|---|---|---|
| id | integer | not null | nextval('project_setting_audit_events_id_seq'::regclass) | plain |
| author_id | integer | not null | plain | |
| project_id | integer | not null | plain | |
| attribute_name | character varying(255) | not null | extended | |
| old_value | character varying(255) | not null | extended | |
| new_value | character varying(255) | not null | extended | |
| created_at | timestamp without time zone | not null | plain |
project_activity_audit_events
| Column | Type | Nullable | Default | Storage |
|---|---|---|---|---|
| id | integer | not null | nextval('project_activity_audit_events_id_seq'::regclass) | plain |
| author_id | integer | not null | plain | |
| project_id | integer | not null | plain | |
| created_at | timestamp without time zone | not null | plain |
user_activity_audit_events
| Column | Type | Nullable | Default | Storage |
|---|---|---|---|---|
| id | integer | not null | nextval('user_activity_audit_events_id_seq'::regclass) | plain |
| author_id | integer | not null | plain | |
| user_id | integer | not null | plain | |
| created_at | timestamp without time zone | not null | plain |
Migration plan-of-attack ⚔
-
How to migrate from existing audit_eventstable to new schema
Tasks to Evaluate
-
Determine feasibility of the feature -
Create issue for implementation or update existing implementation issue description with implementation proposal
Risks and Implementation Considerations
-
ip_addressis currently available onadmin_audit_logfeature only. Is there any privacy concern to always capture this piece of information? - Generate aggregated project level and group level view across new set of tables might require further optimization (i.e. maybe leveraging
ElasticSearch?) - Transactional grouping of audit events, esp. around settings changes. For example, an update on a project settings page could change both the name and description and end up with 2 separate audit events. Should the schema on the table reflect that they are part of a same action?
- Might require frontend changes to show a view for each table
Team
-
Add workflowplanning breakdown feature and the corresponding ~devops::<stage>and~group::<group>labels. -
Ping the PM and EM.
Related issues
- Partition the audit events table #7865 (closed)
List of audit events
References
| Scope | Name |
|---|---|
| GroupDestroyed | Group |
| GroupMarkedForDeletion | Group |
| GroupSettingsUpdated | Group |
| GroupRestored | Groups |
| ExportFileDownloadStarted | Project |
| Impersonation | Project |
| KeysCreated | Project |
| MemberAccessApproved | Project |
| MemberCreated | Project |
| MemberDestroyed | Project |
| MemberUpdated | Project |
| ProjectArchived | Project |
| ProjectCreated | Project |
| ProjectDestroyed | Project |
| ProjectFeaturesUpdated | Project |
| ProjectGroupLinksCreated | Project |
| ProjectGroupLinksUpdated | Project |
| ProjectSettingsUpdated | Project |
| ProjectUnarchived | Project |
| ProtectedBranch | Project |
| ReleaseArtifactsDownloaded | Project |
| ReleaseAssociateMilestoneUpdated | Project |
| ReleaseCreated | Project |
| RepositoryDownloadStarted | Project |
| RepositoryPush | Project |
| DeployKeyCreated | User |
| DeployKeyDisabled | User |
| DeployKeyEnabled | User |
| EmailUpdated | User |
| LoginFailed | User |
| LoginSuccessful | User |
| OauthAccessInitiated | User |
| SmartcardLoginSuccessful | User |
| UserBlocked | User |
| UserCreated | User |
| UserDestroyed | User |
| UserUpdated | User |