Use a separate table for storing push events
This MR adds a table called push_event_payloads
and migrates push events data to this new table.
Events Checklist

- Index (project_id, created_at), then remove the standalone index on push_events.project_id
- Add a background migration to migrate a single row to the new format
- Add a post-deployment migration that schedules migrations for all existing push events
- Adjust the code so that creating push events uses PushEventPayload
- Adjust Atom feeds to show push events using the new format, and make Event compatible with this so we can show old data at the same time
- Adjust the UI to show push events using the new format
- Adjust the event queries to use a LATERAL JOIN on PostgreSQL to more efficiently get the data
- Remove Event#commits / PushEvent#commits once it's no longer in use (after taking care of the above)
  - Event#commits has to stay until all existing events have been processed
- Check if we can simply drop events.data in the next release since it seems to only be used by push events. This would remove the need for updating events, thus reducing table bloat (and subsequent DB load)
  - Confirmed as of July 25th, 2017: events.data is only used by push events
- Verify the API output for push events
Database Checklist

When adding migrations:

- Updated db/schema.rb
- Added a down method so the migration can be reverted
- Added the output of the migration(s) to the MR body
- Added the execution time of the migration(s) to the MR body
- Added tests for the migration in spec/migrations if necessary (e.g. when migrating data)
- Made sure the migration won't interfere with a running GitLab cluster, for example by disabling transactions for long running migrations

When adding or modifying queries:

- Included the raw SQL queries of the relevant queries
- Included the output of EXPLAIN ANALYZE and execution timings of the relevant queries
- Added tests for the relevant changes

When adding tables:

- Ordered columns based on their type sizes in descending order
- Added foreign keys if necessary
- Added indexes if necessary

General Checklist

- Changelog entry added, if necessary
- Documentation created/updated
- Tests added for this feature/bug
- Conform by the merge request performance guides
- Conform by the style guides
- Squashed related commits together
Migration Output
From my personal staging environment, which uses the same setup as production:
== 20170608152747 PrepareEventsTableForPushEventsMigration: migrating =========
-- create_table(:events_for_migration)
-> 0.2035s
-- transaction_open?()
-> 0.0000s
-- execute("SET statement_timeout TO 0")
-> 0.0007s
-- execute("ALTER TABLE events_for_migration\nADD CONSTRAINT fk_edfd187b6f\nFOREIGN KEY (author_id)\nREFERENCES users (id)\nON DELETE cascade\nNOT VALID;\n")
-> 0.0089s
-- execute("ALTER TABLE events_for_migration VALIDATE CONSTRAINT fk_edfd187b6f;")
-> 0.0037s
== 20170608152747 PrepareEventsTableForPushEventsMigration: migrated (0.2174s) ========
== 20170608152748 CreatePushEventPayloadsTables: migrating ====================
-- create_table(:push_event_payloads, {:id=>false})
-> 0.0227s
-- transaction_open?()
-> 0.0000s
-- execute("SET statement_timeout TO 0")
-> 0.0005s
-- execute("ALTER TABLE push_event_payloads\nADD CONSTRAINT fk_36c74129da\nFOREIGN KEY (event_id)\nREFERENCES events_for_migration (id)\nON DELETE cascade\nNOT VALID;\n")
-> 0.0031s
-- execute("ALTER TABLE push_event_payloads VALIDATE CONSTRAINT fk_36c74129da;")
-> 0.0032s
== 20170608152748 CreatePushEventPayloadsTables: migrated (0.0298s) ===========
== 20170627101016 ScheduleEventMigrations: migrating ==========================
== 20170627101016 ScheduleEventMigrations: migrated (278.5996s) ===============
== 20170727123534 AddIndexOnEventsProjectIdId: migrating ======================
-- index_exists?(:events, [:project_id, :id])
-> 0.0068s
-- transaction_open?()
-> 0.0000s
-- execute("SET statement_timeout TO 0")
-> 0.0005s
-- add_index(:events, [:project_id, :id], {:algorithm=>:concurrently})
-> 589.0104s
-- index_exists?(:events, :project_id)
-> 0.0077s
-- transaction_open?()
-> 0.0000s
-- select_one("SELECT current_setting('server_version_num') AS v")
-> 0.0010s
-- execute("SET statement_timeout TO 0")
-> 0.0005s
-- remove_index(:events, {:algorithm=>:concurrently, :column=>:project_id})
-> 0.2620s
-- index_exists?(:events_for_migration, [:project_id, :id])
-> 0.0047s
-- transaction_open?()
-> 0.0000s
-- execute("SET statement_timeout TO 0")
-> 0.0005s
-- add_index(:events_for_migration, [:project_id, :id], {:algorithm=>:concurrently})
-> 0.0164s
-- index_exists?(:events_for_migration, :project_id)
-> 0.0054s
-- transaction_open?()
-> 0.0000s
-- select_one("SELECT current_setting('server_version_num') AS v")
-> 0.0005s
-- execute("SET statement_timeout TO 0")
-> 0.0005s
-- remove_index(:events_for_migration, {:algorithm=>:concurrently, :column=>:project_id})
-> 0.1220s
== 20170727123534 AddIndexOnEventsProjectIdId: migrated (589.4401s) ===========
Migration Timings
Migration | Duration |
---|---|
PrepareEventsTableForPushEventsMigration | 0.2 sec |
CreatePushEventPayloadsTables | 0.2 sec |
ScheduleEventMigrations | 278.5 seconds (4.6 minutes) |
AddIndexOnEventsProjectIdId | 589.4 sec (9.8 minutes) |
Query Plans
This query is used to get events using a JOIN LATERAL
when supported (PostgreSQL 9.3 or newer):
SELECT "events".*
FROM (
SELECT "projects"."id"
FROM "projects"
INNER JOIN "project_authorizations" ON "projects"."id" = "project_authorizations"."project_id"
WHERE "project_authorizations"."user_id" = 1
) projects_for_lateral
JOIN LATERAL (
SELECT "events".*
FROM "events"
WHERE (events.project_id = projects_for_lateral.id)
ORDER BY "events"."id" DESC
LIMIT 20
) AS events ON true
ORDER BY "events"."id" DESC
LIMIT 20
OFFSET 0;
The plan for this query is:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2825.33..2825.38 rows=20 width=1875) (actual time=12.069..12.079 rows=20 loops=1)
-> Sort (cost=2825.33..2828.93 rows=1440 width=1875) (actual time=12.068..12.074 rows=20 loops=1)
Sort Key: events.id DESC
Sort Method: top-N heapsort Memory: 36kB
-> Nested Loop (cost=1.43..2787.02 rows=1440 width=1875) (actual time=0.041..10.669 rows=3631 loops=1)
-> Nested Loop (cost=0.86..186.63 rows=72 width=4) (actual time=0.028..1.677 rows=299 loops=1)
-> Index Only Scan using index_project_authorizations_on_user_id_project_id_access_level on project_authorizations (cost=0.43..5.69 rows=72 width=4) (actual time=0.020..0.113 rows=299 loops=1)
Index Cond: (user_id = 1)
Heap Fetches: 15
-> Index Only Scan using projects_pkey on projects (cost=0.43..2.50 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=299)
Index Cond: (id = project_authorizations.project_id)
Heap Fetches: 3
-> Limit (cost=0.57..35.72 rows=20 width=1875) (actual time=0.008..0.025 rows=12 loops=299)
-> Index Scan Backward using index_events_on_project_id_and_id on events (cost=0.57..2165.74 rows=1232 width=1875) (actual time=0.007..0.022 rows=12 loops=299)
Index Cond: (project_id = projects.id)
Planning time: 0.436 ms
Execution time: 12.122 ms
This is much better compared to the fallback query used for MySQL / PostgreSQL 9.2 (and is basically what we currently use):
SELECT "events".*
FROM "events"
WHERE (
EXISTS (
SELECT 1
FROM "projects"
INNER JOIN "project_authorizations" ON "projects"."id" = "project_authorizations"."project_id"
WHERE "project_authorizations"."user_id" = 1
AND (projects.id = events.project_id)
)
)
ORDER BY "events"."id" DESC
LIMIT 20
OFFSET 0
This produces the following plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5455.33..5455.38 rows=20 width=1875) (actual time=1587.420..1587.433 rows=20 loops=1)
-> Sort (cost=5455.33..5677.15 rows=88726 width=1875) (actual time=1587.419..1587.426 rows=20 loops=1)
Sort Key: events.id DESC
Sort Method: top-N heapsort Memory: 36kB
-> Nested Loop (cost=187.38..3094.37 rows=88726 width=1875) (actual time=1.541..1403.954 rows=583589 loops=1)
-> HashAggregate (cost=186.81..187.53 rows=72 width=8) (actual time=1.518..1.678 rows=299 loops=1)
Group Key: projects.id
-> Nested Loop (cost=0.86..186.63 rows=72 width=8) (actual time=0.024..1.433 rows=299 loops=1)
-> Index Only Scan using index_project_authorizations_on_user_id_project_id_access_level on project_authorizations (cost=0.43..5.69 rows=72 width=4) (actual time=0.017..0.101 rows=299 loops=1)
Index Cond: (user_id = 1)
Heap Fetches: 15
-> Index Only Scan using projects_pkey on projects (cost=0.43..2.50 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=299)
Index Cond: (id = project_authorizations.project_id)
Heap Fetches: 3
-> Index Scan using index_events_on_project_id_and_id on events (cost=0.57..28.05 rows=1232 width=1875) (actual time=0.011..3.901 rows=1952 loops=299)
Index Cond: (project_id = projects.id)
Planning time: 0.664 ms
Execution time: 1587.500 ms
So by using JOIN LATERAL we can query the data 66x faster.
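For reference, a rough sketch of how the version-dependent choice between these two queries could be wired up in Ruby. The class name, scopes, and helper method are illustrative assumptions, not the MR's actual implementation:

# Illustrative only: use the LATERAL strategy when the database supports it
# (PostgreSQL 9.3+), otherwise fall back to the EXISTS-based query.
class EventFinder
  LIMIT = 20

  def initialize(projects)
    # `projects` is assumed to be an ActiveRecord::Relation of the projects
    # the user is authorized to see.
    @projects = projects
  end

  def execute
    lateral_join_supported? ? lateral_query : exists_query
  end

  private

  def lateral_join_supported?
    conn = ActiveRecord::Base.connection
    conn.adapter_name == 'PostgreSQL' &&
      conn.select_value("SHOW server_version_num").to_i >= 90_300
  end

  def lateral_query
    projects_sql = @projects.select(:id).to_sql

    Event.find_by_sql(<<-SQL)
      SELECT events.*
      FROM (#{projects_sql}) projects_for_lateral
      JOIN LATERAL (
        SELECT * FROM events
        WHERE events.project_id = projects_for_lateral.id
        ORDER BY events.id DESC
        LIMIT #{LIMIT}
      ) AS events ON true
      ORDER BY events.id DESC
      LIMIT #{LIMIT}
    SQL
  end

  def exists_query
    projects_sql = @projects.select(:id).to_sql

    Event.where("EXISTS (SELECT 1 FROM (#{projects_sql}) p WHERE p.id = events.project_id)")
         .order(id: :desc)
         .limit(LIMIT)
  end
end

With something like this in place, the calling code can stay the same and the faster plan is picked automatically where the database supports it.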
Merge request reports
Activity
- Resolved by Yorick Peterse
mentioned in issue #31806 (closed)
mentioned in merge request !12355 (merged)
added 1 commit
- 76133960 - Use a separate table for storing push events
added 1 commit
- 20a60e85 - Use a separate table for storing push events
added 1 commit
- 2bda45eb - Use a separate table for storing push events
mentioned in merge request !12527 (merged)
mentioned in merge request !12555 (merged)
added 810 commits
- 2bda45eb...674f05d2 - 809 commits from branch master
- 8560f19b - Use a separate table for storing push events
changed milestone to %9.5
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
I'm contemplating taking a slightly different direction. While using two entirely different tables is a good idea at its core it's a bit tricky to deal with when it comes to pagination. Using a UNION proved too problematic because of the vastly different schemas of the tables involved, and because I couldn't get Rails to use the right classes for the returned rows. Paginating data manually is a pain because:
- Either we take the limit N, divide it by the table count, then use that for the actual limit. This however means we may display less data (e.g. if there are no push events).
- We use the limit N for all tables, then paginate the merged set again in Ruby. This however means we're paginating data at both the database and Ruby level, and getting all this to work is really painful (a rough sketch of this follows below).
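To make the pain of that second option concrete, a minimal sketch of what merging and re-paginating in Ruby would look like under the two-table design (the for_user scopes and the standalone PushEvent table are hypothetical; this is the approach being argued against, not something in the MR):

# Option 2 sketched: over-fetch from both tables, merge, re-sort and re-limit
# in Ruby. Every page loads up to 2 * (limit + offset) rows just to display
# `limit` of them, and OFFSET handling has to be redone in Ruby each time.
def merged_events(user, limit: 20, offset: 0)
  regular = Event.for_user(user).order(id: :desc).limit(limit + offset).to_a
  pushes  = PushEvent.for_user(user).order(id: :desc).limit(limit + offset).to_a

  (regular + pushes)
    .sort_by(&:id)
    .reverse
    .drop(offset)
    .first(limit)
end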
So I'm thinking of doing the following: we keep events as the main events table and turn push_events into event_push_payloads (or something along those lines). We then add event_push_payload_id to events and migrate the push events data from events.data into this separate table. In this setup pagination is as trivial as it normally is, and we still move most of the data from events (the data column) to a separate table. However, there are also some downsides:

- The number of rows in events will stay the same, but then again they could still grow to a very large number even when using separate tables.
- As we do the same for multiple event types we may end up with multiple event_bla_payload_id columns, each needing an index. This is essentially STI but using separate columns.
- Writing to events will still be expensive because of all the indexes.
- We'd need validations to ensure only one of these payload columns is set.
However, there are also benefits to this approach:

- Pagination is trivial.
- We need to change less of the codebase, meaning other code such as the contributions calendar keeps working.
- We can still reduce the table size by a lot, though perhaps slightly less than 100GB (depending on how much space is necessary per row in this particular setup).
- We only have a single events table, making it easier to query for features such as the contributions calendar instead of having to aggregate over multiple tables and merge the results somehow.
- We can adjust Rails' STI system to use a different class based on the value of events.action. Basically for Event::PUSHED it would use PushEvent, for everything else it would use Event. This removes the need for stuffing all logic into the Event model, and without the need for a type column.
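A rough sketch of that last bullet, purely as an illustration of the idea (the discriminate_class_for_record override is one hook recent Rails versions offer for picking the record class per row; it is not necessarily what this MR ends up using):

class Event < ActiveRecord::Base
  PUSHED = 5

  # Choose the model class per row based on events.action instead of a
  # dedicated `type` column.
  def self.discriminate_class_for_record(record)
    record['action'].to_i == PUSHED ? PushEvent : Event
  end
end

class PushEvent < Event
  # Push-specific behaviour (commit titles, refs, and so on) lives here
  # instead of being stuffed into Event.
end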
@yorickpeterse we have a system_note_metadata table, which sounds like a similar scheme. I think it's pragmatic in cases like this, where we want to reduce the size of an existing table (or prevent it growing much more) without completely changing that part of the application.

Come to think of it: technically we don't actually need an additional column on the events table. As long as the payloads table has an event_id column (which it needs anyway for foreign keys) we can just use that. This means the events table would not get any wider. To then get the data you would just do something like PushEventPayload.where(event_id: self.id).take
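For example (association options assumed here, not taken from the MR), that lookup could also be expressed as a plain Rails association so the payload is loaded the usual way:

class Event < ActiveRecord::Base
  has_one :push_event_payload, foreign_key: :event_id
end

class PushEventPayload < ActiveRecord::Base
  belongs_to :event
end

# For a given event, `event.push_event_payload` is then equivalent to
# PushEventPayload.where(event_id: event.id).take.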
@smcgivern Yeah, I'm leaning towards this approach being the best. I was already scratching my head over the amount of code that would have to be changed to support two separate tables.
added 413 commits
- 8560f19b...b0cb64a1 - 412 commits from branch master
- 393039e4 - Migrate push event payloads to a separate table
added 1 commit
- 6fcf2431 - Store push event payloads in a separate table
@smcgivern I ended up going with the separate payloads table approach, and this is a much better solution thus far. In particular it makes the migration process much easier using some STI hackery. Having said that, we do need to change the output of our Atom feeds since we no longer expose all commits (Atom feeds display the 15 most recent commits per event), I'll see if I can come up with a sane new format.
Code wise the MR currently supports push events in both the old and new format (hence the setup of PushEvent). The data returned is compatible, but not guaranteed to be equal. For example, PushEvent only ever returns a single commit and its author details are set to the details of the person who pushed the commits, and not the creator of the commits. The next step is to adjust the UI so it doesn't rely directly on the raw commit data any more. Fortunately the changes here should be relatively minor.

- Resolved by Yorick Peterse
added 1 commit
- f9adcadf - Store push event payloads in a separate table
added 1 commit
- 858ee420 - Store push event payloads in a separate table
With this MR slowly nearing completion I want to discuss the production impact/procedure. We currently have around 51 959 306 push events in our events table, all of which need to be migrated. This is done using a background migration that processes a range of N events. These jobs are scheduled in db/post_migrate/20170627101016_schedule_event_migrations.rb, which essentially does the following:

- Get a slice of 10 000 events
- Break this slice up into ranges of 100
- For every range prepare a job, then push the 100 jobs (10 000 / 100) into Sidekiq at once
- Do this for all push events
The background migration then essentially does this (a sketch follows below):

- Get the events in the range using SELECT id, data FROM events WHERE action = 5 AND data IS NOT NULL AND id >= START AND id <= END
- For every row, create a row in push_event_payloads. Upon a foreign key error (meaning the associated event was just removed) we'll just skip the event
- Once the rows for this slice are created we'll run UPDATE events SET data = NULL WHERE action = 5 AND id >= START AND id <= END

The migration is built so that it automatically skips already migrated data. This means that in the event of an error we can just re-run the scheduling process without having to make any adjustments.
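A hypothetical sketch of one of these per-range jobs, following the steps above. The payload column names come from the push_event_payloads sample shown later in this MR; the keys read from the serialized data hash (:before, :after, :ref, :commits) are assumptions, and several columns are omitted:

class MigratePushEventPayload
  PUSHED = 5

  def perform(start_id, end_id)
    # Rows whose data is already NULL were migrated earlier, so re-running a
    # job (or the whole scheduling process) is safe.
    Event.where(action: PUSHED)
         .where.not(data: nil)
         .where(id: start_id..end_id)
         .find_each do |event|
      data = event.data

      begin
        PushEventPayload.create!(
          event_id:     event.id,
          ref:          data[:ref],
          commit_from:  data[:before],
          commit_to:    data[:after],
          commit_count: Array(data[:commits]).length
          # action, ref_type and commit_title omitted for brevity
        )
      rescue ActiveRecord::InvalidForeignKey
        # The event was removed while this job sat in the queue; skip it.
        next
      end
    end

    # Nullify the migrated payloads so the data only lives in one place.
    Event.where(action: PUSHED, id: start_id..end_id).update_all(data: nil)
  end
end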
The database impact of this is:

- 51 959 306 rows will have their data field set to NULL
- 519 593 UPDATE statements
- Depending on our vacuuming settings we can end up with millions of dead tuples

The number of UPDATE statements can be reduced by increasing the range of events we migrate per job, but this also means we can process less in parallel (which could lead to the work taking longer). As such I'd rather not increase this beyond 100 events per job.
Because of the large amount of data that needs to be migrated I don't think it's a good idea to schedule batches with some time in between (e.g. 2 minutes between every batch). Doing so would lead to the last batch not starting until roughly 2 months after the first batch was scheduled. My goal is to have the work completed in less than a week, which I can only see reasonably happening if we just schedule all the jobs at once.
For this to work however we'll need to do a few things:

- We will need a few (e.g. 10 with e.g. 8 cores) dedicated Sidekiq workers for the background_migration queue
- We will need to increase the number of connections pgbouncer allows on the primary to support this additional fleet of hosts
- We should adjust the vacuuming settings of events (before deploying this migration) so the cost limit is much greater. I'm thinking of 10 000 instead of 3 000 (so a 3.3x increase in throughput). This should greatly reduce the time necessary to vacuum events, at the cost of it maybe interfering with other queries. A sketch of this follows after the list.
- Once the migration is done we'll need to run pg_repack on the events table so we can reclaim space (this will most likely be at least 100 GB of space). We can't use VACUUM FULL because this process can easily take hours while locking the entire table.
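Purely as an illustration of the vacuuming tweak mentioned above (this is not a migration in this MR, and whether it would be done via a migration or at the infrastructure level is an open question), raising the per-table cost limit on PostgreSQL could look like this:

class TuneEventsAutovacuum < ActiveRecord::Migration
  def up
    # 10 000 mirrors the value discussed above; 3 000 is the current limit.
    execute "ALTER TABLE events SET (autovacuum_vacuum_cost_limit = 10000);"
  end

  def down
    execute "ALTER TABLE events RESET (autovacuum_vacuum_cost_limit);"
  end
end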
cc @pcarranza
added 1 commit
- 1cf19a01 - Store push event payloads in a separate table
added 1 commit
- 8c62b118 - Store push event payloads in a separate table
added 1 commit
- 0fbf8273 - Store push event payloads in a separate table
added 1 commit
- 3cdce97e - Store push event payloads in a separate table
added 1 commit
- 13b70b26 - Store push event payloads in a separate table
added 1 commit
- 14f3fc7c - Store push event payloads in a separate table
added 1 commit
- 449b3d24 - Store push event payloads in a separate table
added 304 commits
- 449b3d24...c64fd519 - 303 commits from branch master
- 3303f8e1 - Store push event payloads in a separate table
added 59 commits
- 3303f8e1...11f9ac0a - 58 commits from branch master
- 390b1068 - Store push event payloads in a separate table
added 1 commit
- ded0c696 - Store push event payloads in a separate table
added 1 commit
- 2760ce75 - Store push event payloads in a separate table
added 1 commit
- 7dfbd229 - Store push event payloads in a separate table
added 253 commits
- 7dfbd229...b7d372d9 - 252 commits from branch master
- 15ecf591 - Store push event payloads in a separate table
added 1 commit
- a6a15bfb - Store push event payloads in a separate table
added 1 commit
- 8da116c1 - Store push event payloads in a separate table
To get rid of the table bloat I'm thinking of the following solution:

Release 1:

1. Create a table called "events_for_migration" using the same structure as "events" except:
   - It won't have any data
   - The "data" and "title" columns are not included
   - The primary key starts at the same value as "events.id"
   - It will have a foreign key on push_events_for_migration.project_id pointing to projects.id with a cascading delete
   - It will have all indexes in place
2. Set up a trigger to replicate INSERT statements on "events" to "events_for_migration", using the exact same values (including the "id" value)
3. Migrate push events, copying over migrated "events" rows into the "events_for_migration" table (including the "id" value, excluding the "data" and "title" columns). In this setup we'll leave events.data as-is to reduce table bloat and pressure.
Release 2:

1. Steal any remaining background migrations for push events. At this point "events_for_migration" contains newly inserted events, and all migrated push events (about 80% of the current size of "events")
2. Copy over remaining rows from "events" where "events.id" is smaller than the smallest value of "events_for_migration.id". This ensures we only copy over data that hasn't already been moved to this table. An example query would be INSERT INTO events_for_migration (column1, column2, ...) SELECT column1, column2, ... FROM events WHERE id < MIN_ID, with MIN_ID being queried separately and passed as an argument (since MySQL doesn't do CTEs this is the easiest approach). This is currently about 20-21% of the "events" table
3. Lock both "events" and "events_for_migration" in access exclusive mode, then:
   - Rename the tables
   - Reset the primary key sequence of the new events table so it starts at the right value
   - Unlock tables
4. Adjust push_event_payloads so the foreign key on event_id points to the new events table (instead of the old one), then validate it
5. Drop the old events table

This setup should have minimal impact as the time the tables are locked is very brief. Further, it removes the need for VACUUM FULL or pg_repack to reclaim space. Both of these are not ideal since VACUUM FULL requires downtime (since it locks the entire table for the duration) and pg_repack is an extension that we don't ship (plus we can't really control it from a Rails migration). Since we're scanning over 80% of the "events" rows we might as well embed this work into the migration.

There are some details here that are very important:
- The foreign key on events_for_migration.project_id is crucial as it prevents orphaned data from being produced. This in turn means we don't need to remove any orphans before adjusting the foreign key in step 4, reducing the amount of time necessary to lock things (without this we'd have to remove orphans and add the foreign key inside the lock).
- Both tables must absolutely use the same primary key sequence as this makes it easier to adjust the foreign key on push_event_payloads
- Until this entire process is complete we'll end up using a bit more disk space. The exact amount is hard to estimate, but it shouldn't be more than 10-20 GB (towards the end of the migration just before we drop the old table)
- The background migration in this MR will have to not only insert data into push_event_payloads, but also copy over rows from events to events_for_migration. We can skip this but then copying over existing data in step 2 becomes more time consuming

The end result would be an "events" table with basically no bloat by 9.6, and this would work for all GitLab users (instead of only for GitLab.com).
The distribution of event types is currently as follows (to explain the 20/80% values mentioned above):

gitlabhq_production=# select action, count(*) as amount from events group by action order by amount desc;
 action |  amount
--------+----------
      5 | 53979887
      1 |  6177726
      6 |  3718787
      3 |  1824720
      7 |  1807557
      8 |   738592
      9 |   582628
      4 |   108801
     11 |    22892
     10 |      9708
(10 rows)

gitlabhq_production=# select case when action = 5 then 'pushed' else 'other' end AS label, count(*) as amount from events group by label order by amount desc;
 label  |  amount
--------+----------
 pushed | 53980383
 other  | 14991599

Here we can see that all non-pushed events take up about 20% of the total row count (= pushed + other).
Coding wise pg_repack would be easier. In this setup we'd basically do the following:

- Migrate push events as done in the current MR state (= inserting rows into push_event_payloads then nullifying events.data for the migrated rows)
- Wait for all of this to complete
- Drop events.data
- Run pg_repack on events

The downside of this approach is that it requires us to:

- Ship pg_repack so everybody can benefit from it (or fall back to VACUUM FULL if pg_repack is not available, but this is really dangerous)
- Be able to start pg_repack from a database migration, preferably in a non blocking manner so the deploy isn't blocked by it

I do see us doing work like this in the future, so perhaps pg_repack may not be a bad tool to ship. However, if pg_repack requires special permissions or commands to run we can't use it since the DB migration user may not have these permissions.
Hm, as far as I can find pg_repack is executed as a shell command. This means we can't really call it from a migration because pg_repack may not be available on the host that is running the migrations. This means the only way to get rid of bloat using pg_repack is to tell users to manually run it, which I really don't like.
@yorickpeterse pg_repack is not even installed in our infrastructure at this point. I wouldn't rely on it at all.
Also, @yorickpeterse IIRC that required us building pg_repack manually against the right PG version... a lot of mess.

Do you think that this is the only (reasonable) way to do this?
@pcarranza The options that I can think of are:

1. The migration procedure listed above. This requires some additional coding but allows us to reclaim bloat the fastest.
2. pg_repack, which has various distribution difficulties
3. VACUUM FULL, which requires downtime and could easily take hours
4. Logical replication to replicate to a new cluster (bloat won't be replicated), then fail over to this new cluster. This however is something that would take months to set up, plus it would be GitLab.com specific; it would however also reclaim bloat in other tables
added 309 commits
- 8da116c1...4d05e853 - 307 commits from branch master
- 7191a0ce - Store push event payloads in a separate table
- cc173367 - Use a specialized class for querying events
added 218 commits
- cc173367...8626cdc3 - 216 commits from branch master
- db57e9f6 - Store push event payloads in a separate table
- 069b3db5 - Use a specialized class for querying events
I ended up going with the approach of migrating data from within GitLab. This is done in such a way that in 10.0 all we need to do is:

- Steal any remaining background jobs
- Swap the tables:
  - Start a transaction
  - Lock both tables in exclusive mode
  - Drop "events"
  - Rename "events_for_migration" to "events", including any indexes
  - Commit
- Remove the migration related code from the Event model, and remove the EventForMigration model
The procedure in question takes care of copying over all existing data as well as newly inserted data. The latter is done by the application and not a trigger since triggers can't be dumped to db/schema.rb. This means if one were to set up a new DB they would not have said trigger, thus losing newly inserted events.
One thing to take into account is that the background migration does 1 INSERT for every event. Batching this won't work very well as a single row with an invalid foreign key value would lead to the entire batch being cancelled, whereas 1 INSERT per event means we can simply skip the event.
The procedure of swapping tables will not take more than maybe a second or so.
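A hypothetical sketch of what that 10.0 swap migration could look like. The class name and the exact sequence/index handling are assumptions; it only illustrates the drop, rename, and sequence-reset steps listed above:

class SwapEventsTables < ActiveRecord::Migration
  def up
    transaction do
      execute 'LOCK TABLE events, events_for_migration IN ACCESS EXCLUSIVE MODE'

      drop_table :events
      rename_table :events_for_migration, :events
      # Depending on the Rails version, secondary indexes may need to be
      # renamed separately here.

      # Continue the primary key sequence from the highest existing id so
      # newly inserted events don't collide.
      execute <<-SQL
        SELECT setval(pg_get_serial_sequence('events', 'id'),
                      (SELECT COALESCE(MAX(id), 1) FROM events));
      SQL
    end
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end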
added 2 commits
@smcgivern @DouweM Since this is going to be a pretty big set of changes I'd like to have some preliminary feedback (when convenient) while I probably spend a few days getting all the tests to pass.
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Douwe Maan
- Resolved by Douwe Maan
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Douwe Maan
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
assigned to @smcgivern
- Resolved by Yorick Peterse
> This is done in such a way that in 10.0 all we need to do is:
>
> - Steal any remaining background jobs

This is going to really suck for anyone who skips 9.5. Could we wait a little longer before stealing the remaining background jobs?
- Resolved by Yorick Peterse
- Resolved by Sean McGivern
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Sean McGivern
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
> pg_repack, which has various distribution difficulties

This is a digression, but given that the fourth option (logical replication to a new cluster) would have been specific to GitLab.com, couldn't we do this in a GitLab.com-specific way initially?
@yorickpeterse thanks! I mostly have minor comments; Douwe's are more substantial, but I think we both agree this approach makes sense. I'm concerned about the timeline, but that's easy to change / discuss.
assigned to @yorickpeterse
added 183 commits
- 0c8938cb...90cb2aab - 181 commits from branch master
- 4d188df7 - Migrate events into a new format
- 5d742aa5 - Use a specialized class for querying events
added 2 commits
added 2 commits
mentioned in merge request !12584 (merged)
added 52 commits
- 788d07ef...1b117e7f - 50 commits from branch master
- a0f396dd - Migrate events into a new format
- 5278a7de - Use a specialized class for querying events
added 1 commit
- 54d46aa2 - Update the API to support push event payloads
added 2 commits
added 2 commits
added 2 commits
added 2 commits
mentioned in issue #35939 (closed)
mentioned in issue #27375 (closed)
mentioned in issue #35990 (closed)
marked the checklist item Documentation created/updated as completed
marked the checklist item Changelog entry added, if necessary as completed
marked the checklist item Squashed related commits together as completed
marked the checklist item Conform by the style guides as completed
marked the checklist item Conform by the merge request performance guides as completed
added 813 commits
- dfd7e2b4...fafd6a0a - 811 commits from branch master
- d025cc30 - Migrate events into a new format
- 109c71f1 - Use a specialized class for querying events
added 2 commits
added 2 commits
The regular migrations have been performed in a personal staging environment and their details are added to the MR. The background migrations are running, though I had to slightly adjust how many jobs we buffer and how many events we process in a single job. The initial amount of 100 events per job (and buffering 100 jobs) would take far too long to schedule, whereas with 1000/1000 it takes about 5 minutes.
added 2 commits
Some preliminary statistics:
- 8 threads on a 4 core CPU (4 threads only lead to 30% utilisation of all cores) based on the best-effort Sidekiq setup
- about 0.2% of all events could be processed in roughly 15 minutes, though I did end up restarting Sidekiq 2-3 times to adjust its concurrency levels
Row counts thus far:
gitlabhq_production=# select count(*) from push_event_payloads;
 count
--------
 115669
(1 row)

gitlabhq_production=# select count(*) from events_for_migration;
 count
--------
 151731
(1 row)
Table sizes thus far:
gitlabhq_production=# select pg_size_pretty(pg_total_relation_size('push_event_payloads'));
 pg_size_pretty
----------------
 19 MB
(1 row)

gitlabhq_production=# select pg_size_pretty(pg_total_relation_size('events_for_migration'));
 pg_size_pretty
----------------
 35 MB
(1 row)
Sample output to show push_event_payloads is doing what it should be doing:

 commit_count | event_id | action | ref_type |                commit_from                 |                 commit_to                  |  ref   |             commit_title
--------------+----------+--------+----------+--------------------------------------------+--------------------------------------------+--------+--------------------------------------
            1 | 22630704 |      2 |        0 | \xc318d8b9cb51e65856df389cda1c9895ce17f9c2 | \xe01451a1a99508e75c10bf6e32f209cac2a43b08 | master | ribbon
            1 | 22630701 |      2 |        0 | \xeef261b320282054213cf09f5da25bc63cc7b48c | \xaf9d1f5362006ccef59e679be38dfb7f1effcf8f | master | added high forest block
            2 | 22630700 |      2 |        0 | \x9e58f85a6c809b7ba69c6d7f87f530afcd519a24 | \x3f3070fe536e4ead75dc2c2bd043a5f18932b768 | master | Merge branch 'address' into 'master'
added 49 commits
- 56dffd4c...142403ac - 47 commits from branch master
- 84d84b79 - Migrate events into a new format
- bee8ae17 - Use a specialized class for querying events
Some further statistics thus far:
I started the migration procedure roughly around 2017-08-08 16:30:00 UTC. On 2017-08-09 11:00:00 UTC the progress was 13.92%. This produces a rate of 0.75% per hour, which in turn would lead to the entire process taking roughly 5.5 days for a single 4 core host running 8 threads (assuming progress per hour stays the same).
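As a quick sanity check of that extrapolation (plain arithmetic, no additional measurements):

elapsed_hours = 18.5                      # 2017-08-08 16:30 -> 2017-08-09 11:00 UTC
progress      = 13.92                     # percent migrated so far
rate          = progress / elapsed_hours  # ~0.75% per hour
total_days    = 100 / rate / 24           # ~5.5 days for the whole run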
The table sizes are currently as follows (including indexes):
- events: 176 GB
- events_for_migration: 2.21 GB (9 935 595 rows)
- push_event_payloads: 1.2 GB (7 698 681 rows)
Assuming linear growth this would produce an end size of the following per table (including indexes):
- events_for_migration: 15.8 GB
- push_event_payloads: 8.62 GB
If we round this up a bit to take into account variance in sizes we'd get, say:
- events_for_migration: 20 GB
- push_event_payloads: 10 GB
This means in the new setup our total data size is around 30 GB (give or take a few GB). This in turn means we end up saving around 140 GB of space, giving us an 80% reduction.
This of course all assumes the data grows linearly, which it probably won't.
mentioned in issue #36219 (moved)
assigned to @smcgivern
On 2017-08-10 11:20:42 progress was at 24.9%, 40.8 hours after starting. This gives us a rate of 0.6% per hour, slightly lower than before. At this rate it would take 6.9 days to complete the migration process.
Table statistics are currently as follows:
- events_for_migration: 17 689 866 rows, 3.94 GB
- push_event_payloads: 13 941 202 rows, 2.2 GB
With linear growth this would make us end up with roughly the same statistics as calculated when progress was at 13.92%.
@ilyaf @victorcete The restore hosts we used for this process (the DB, Redis, and the Sidekiq node) can now be terminated since I no longer need them.
@yorickpeterse thanks
I'll take care of deprovisioning the hosts

- Resolved by Sean McGivern
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Yorick Peterse
- Resolved by Sean McGivern
@yorickpeterse thanks! I don't have any major comments, really. This is another one we need to get approvals for, because it's past the freeze date - but we also really need this.
assigned to @yorickpeterse
added 61 commits
- bee8ae17...afd6e517 - 59 commits from branch master
- 70e781b8 - Migrate events into a new format
- 2e776f2c - Use a specialized class for querying events
added 2 commits
mentioned in issue #36303 (closed)
mentioned in commit e80a893f
mentioned in commit 75dffc45
I think https://gitlab.com/gitlab-org/gitlab-ee/-/jobs/29079656 is just a test setup issue, but to be safe we just reverted this from RC2 and can target this for RC3.
@yorickpeterse There isn't a MR to fix this yet. I think it is indeed an outdated EE spec.
mentioned in merge request !13466 (merged)
Picked into https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/13604, will merge into 9-5-stable, ready for 9.5 RC5
mentioned in commit f67256ba
mentioned in commit b562e64f
mentioned in issue #36722 (closed)
mentioned in issue #36685 (closed)
mentioned in issue #36876 (closed)
mentioned in merge request !13932 (merged)
mentioned in issue #36880 (moved)
mentioned in issue #38004 (moved)
mentioned in issue #37814 (moved)
mentioned in issue #38509 (closed)
mentioned in issue #38507 (closed)
MAX_INDEX = 69
PUSHED = 5

def push_event?
  action == PUSHED && data.present?
end

def commit_title
  commit = commits.last

  return nil unless commit && commit[:message]

  index = commit[:message].index("\n")
  message = index ? commit[:message][0..index] : commit[:message]

  message.strip.truncate(70)
end

@yorickpeterse @smcgivern Are you sure that enforces the varying(70) when UTF-8 characters are involved? I had to add .mb_chars.limit(70).to_s in order to successfully migrate an instance with commit messages in French. cf https://stackoverflow.com/questions/12536080/ruby-limiting-a-utf-8-string-by-byte-length.

@francois Both the database and Ruby limits operate on characters, not bytes. As such all of this should work fine, perhaps unless you're using some weird database encoding.
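To illustrate the characters-versus-bytes distinction being discussed here, a small console example (assuming ActiveSupport is loaded; this is not code from the MR):

title = 'é' * 70                  # 70 characters, 140 bytes in UTF-8

title.truncate(70).length         # => 70  (String#truncate counts characters)
title.truncate(70).bytesize       # => 140

limited = title.mb_chars.limit(70).to_s
limited.length                    # => 35  (limit counts bytes, never splitting a character)
limited.bytesize                  # => 70

With a character-counting column such as character varying(70) under a UTF-8 database encoding, truncate(70) is enough; the byte-based limit only becomes necessary when the column or encoding effectively restricts bytes, as with SQL_ASCII.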
@yorickpeterse Indeed you are right, my database seems to use SQL_ASCII encoding. Perhaps gitlab:check should also check the encoding. But it seems a pain to change the encoding of a database. Thank you.
mentioned in issue #43394 (closed)
mentioned in issue gitlab#353266 (closed)