Split audit_events table
🏙
Background audit_events
table is used to store all audit information of user-driven actions on GitLab application. "audit_events
is the 4th largest table on GitLab.com" source. To provide more comprehensive audit events for compliance-minded organisations, we need to progressive extend our audit coverage while still maintaining reasonably performant and capable search functionalities on large audit events data.
📈
Demands There is a growing needs to provide complex views around audit event level
- Expose sub-groups and projects events under Group level view (#39139)
As well as more filtering/searching capabilities on the number of audit events
- Filtering Audit Events (#32388 (closed))
🌏
State of the world They are visible under 3 levels: Project
, Group
and Instance
as depicted below
We currently provide very limited filtering capabilities on these vast number of audit events, namely:
-
created_at
date range -
entity_type
andentity_id
The table contains a polymorphic type - Entity
(entity_id
and entity_type
) which can be either User
, Project
or Group
.
The audit_events
table schema currently look like the following:
Column | Type | Nullable | Default | Storage |
---|---|---|---|---|
id | integer | not null | nextval('audit_events_id_seq'::regclass) | plain |
author_id | integer | not null | plain | |
type | character varying | not null | extended | |
entity_id | integer | not null | plain | |
entity_type | character varying | not null | extended | |
details | text | extended | ||
created_at | timestamp without time zone | plain | ||
updated_at | timestamp without time zone | plain |
Index |
---|
"audit_events_pkey" PRIMARY KEY, btree (id) |
"analytics_index_audit_events_on_created_at_and_author_id" btree (created_at, author_id) |
"index_audit_events_on_entity_id_and_entity_type_and_id_desc" btree (entity_id, entity_type, id DESC) |
🚂
Challenges The current table schema poses a number of challenges:
- Slow query time
- Difficult to query information in
details
serialized hash - Inefficient utilization of indexes (i.e.
analytics_index_audit_events_on_created_at_and_author_id
is taking up spaces and not used by any queries)
"My main issue with our current schema is that we use a lot of serialized properties in the details
section. This makes it very hard to query for certain types of events and index on them. My suggestion to break the tables out is so we can add specific fields for certain operations (e.g. user changed this setting from X to Y)." source
Proposal
Extending on the original @stanhu's thoughts (#7865 (comment 110902945)), we would like to distinguish two types of audit events:
- Settings changes (very much like Change-Data Capture record with author)
- Activities (that not always result in changes of our persisted stores)
instance_setting_audit_events
Column | Type | Nullable | Default | Storage |
---|---|---|---|---|
id | integer | not null | nextval('instance_setting_audit_events_id_seq'::regclass) | plain |
author_id | integer | not null | plain | |
attribute_name | character varying(255) | not null | extended | |
old_value | character varying(255) | not null | extended | |
new_value | character varying(255) | not null | extended | |
ip_address | character varying(255) | not null | extended | |
created_at | timestamp without time zone | not null | plain |
group_setting_audit_events
Column | Type | Nullable | Default | Storage |
---|---|---|---|---|
id | integer | not null | nextval('group_setting_audit_events_id_seq'::regclass) | plain |
author_id | integer | not null | plain | |
group_id | integer | not null | plain | |
attribute_name | character varying(255) | not null | extended | |
old_value | character varying(255) | not null | extended | |
new_value | character varying(255) | not null | extended | |
created_at | timestamp without time zone | not null | plain |
group_activity_audit_events
Column | Type | Nullable | Default | Storage |
---|---|---|---|---|
id | integer | not null | nextval('group_activity_audit_events_id_seq'::regclass) | plain |
author_id | integer | not null | plain | |
group_id | integer | not null | plain | |
created_at | timestamp without time zone | not null | plain |
project_setting_audit_events
Column | Type | Nullable | Default | Storage |
---|---|---|---|---|
id | integer | not null | nextval('project_setting_audit_events_id_seq'::regclass) | plain |
author_id | integer | not null | plain | |
project_id | integer | not null | plain | |
attribute_name | character varying(255) | not null | extended | |
old_value | character varying(255) | not null | extended | |
new_value | character varying(255) | not null | extended | |
created_at | timestamp without time zone | not null | plain |
project_activity_audit_events
Column | Type | Nullable | Default | Storage |
---|---|---|---|---|
id | integer | not null | nextval('project_activity_audit_events_id_seq'::regclass) | plain |
author_id | integer | not null | plain | |
project_id | integer | not null | plain | |
created_at | timestamp without time zone | not null | plain |
user_activity_audit_events
Column | Type | Nullable | Default | Storage |
---|---|---|---|---|
id | integer | not null | nextval('user_activity_audit_events_id_seq'::regclass) | plain |
author_id | integer | not null | plain | |
user_id | integer | not null | plain | |
created_at | timestamp without time zone | not null | plain |
⚔
Migration plan-of-attack -
How to migrate from existing audit_events
table to new schema
Tasks to Evaluate
-
Determine feasibility of the feature -
Create issue for implementation or update existing implementation issue description with implementation proposal
Risks and Implementation Considerations
-
ip_address
is currently available onadmin_audit_log
feature only. Is there any privacy concern to always capture this piece of information? - Generate aggregated project level and group level view across new set of tables might require further optimization (i.e. maybe leveraging
ElasticSearch
?) - Transactional grouping of audit events, esp. around settings changes. For example, an update on a project settings page could change both the name and description and end up with 2 separate audit events. Should the schema on the table reflect that they are part of a same action?
- Might require frontend changes to show a view for each table
Team
-
Add workflowplanning breakdown feature and the corresponding ~devops::<stage>
and~group::<group>
labels. -
Ping the PM and EM.
Related issues
- Partition the audit events table #7865 (closed)
List of audit events
References
Scope | Name |
---|---|
GroupDestroyed | Group |
GroupMarkedForDeletion | Group |
GroupSettingsUpdated | Group |
GroupRestored | Groups |
ExportFileDownloadStarted | Project |
Impersonation | Project |
KeysCreated | Project |
MemberAccessApproved | Project |
MemberCreated | Project |
MemberDestroyed | Project |
MemberUpdated | Project |
ProjectArchived | Project |
ProjectCreated | Project |
ProjectDestroyed | Project |
ProjectFeaturesUpdated | Project |
ProjectGroupLinksCreated | Project |
ProjectGroupLinksUpdated | Project |
ProjectSettingsUpdated | Project |
ProjectUnarchived | Project |
ProtectedBranch | Project |
ReleaseArtifactsDownloaded | Project |
ReleaseAssociateMilestoneUpdated | Project |
ReleaseCreated | Project |
RepositoryDownloadStarted | Project |
RepositoryPush | Project |
DeployKeyCreated | User |
DeployKeyDisabled | User |
DeployKeyEnabled | User |
EmailUpdated | User |
LoginFailed | User |
LoginSuccessful | User |
OauthAccessInitiated | User |
SmartcardLoginSuccessful | User |
UserBlocked | User |
UserCreated | User |
UserDestroyed | User |
UserUpdated | User |