Use a separate table for storing push events

Merged Yorick Peterse requested to merge split-events-into-push-events into master
All threads resolved!

This MR adds a table called push_event_payloads and migrates push event data to this new table.

Events Checklist

  • Index (project_id, created_at), then remove the standalone index on push_events.project_id
  • Add a background migration to migrate a single row to the new format
  • Add a post-deployment migration that schedules migrations for all existing push events
  • Adjust the code so that creating push events uses PushEventPayload
  • Adjust Atom feeds to show push events using the new format, and make Event compatible with this so we can show old data at the same time
  • Adjust the UI to show push events using the new format
  • Adjust the event queries to use a LATERAL JOIN on PostgreSQL to more efficiently get the data
  • Remove Event#commits / PushEvent#commits once it's no longer in use (after taking care of the above)
    • Event#commits has to stay until all existing events have been processed
  • Check if we can simply drop events.data in the next release since it seems to only be used by push events. This would remove the need for updating events, thus reducing table bloat (and subsequent DB load)
    • Confirmed as of July 25th, 2017: events.data is only used by push events
  • Verify the API output for push events

Database Checklist

When adding migrations:

  • Updated db/schema.rb
  • Added a down method so the migration can be reverted
  • Added the output of the migration(s) to the MR body
  • Added the execution time of the migration(s) to the MR body
  • Added tests for the migration in spec/migrations if necessary (e.g. when migrating data)
  • Made sure the migration won't interfere with a running GitLab cluster, for example by disabling transactions for long running migrations

When adding or modifying queries:

  • Included the raw SQL queries of the relevant queries
  • Included the output of EXPLAIN ANALYZE and execution timings of the relevant queries
  • Added tests for the relevant changes

When adding tables:

  • Ordered columns based on their type sizes in descending order
  • Added foreign keys if necessary
  • Added indexes if necessary

General Checklist

  • Documentation created/updated
  • Tests added for this feature/bug
  • Changelog entry added, if necessary
  • Squashed related commits together
  • Conform by the style guides
  • Conform by the merge request performance guides

Migration Output

From my personal staging environment, which uses the same setup as production:

== 20170608152747 PrepareEventsTableForPushEventsMigration: migrating =========                                                                               
-- create_table(:events_for_migration)                                                                                                                        
   -> 0.2035s                                                                  
-- transaction_open?()                                                         
   -> 0.0000s                                                                  
-- execute("SET statement_timeout TO 0")                                       
   -> 0.0007s                                                                  
-- execute("ALTER TABLE events_for_migration\nADD CONSTRAINT fk_edfd187b6f\nFOREIGN KEY (author_id)\nREFERENCES users (id)\nON DELETE cascade\nNOT VALID;\n") 
   -> 0.0089s                                                                  
-- execute("ALTER TABLE events_for_migration VALIDATE CONSTRAINT fk_edfd187b6f;")                                                                             
   -> 0.0037s                                                                  
== 20170608152747 PrepareEventsTableForPushEventsMigration: migrated (0.2174s) ==========

== 20170608152748 CreatePushEventPayloadsTables: migrating ====================
-- create_table(:push_event_payloads, {:id=>false})
   -> 0.0227s
-- transaction_open?()
   -> 0.0000s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- execute("ALTER TABLE push_event_payloads\nADD CONSTRAINT fk_36c74129da\nFOREIGN KEY (event_id)\nREFERENCES events_for_migration (id)\nON DELETE cascade\nNOT VALID;\n")
   -> 0.0031s
-- execute("ALTER TABLE push_event_payloads VALIDATE CONSTRAINT fk_36c74129da;")
   -> 0.0032s
== 20170608152748 CreatePushEventPayloadsTables: migrated (0.0298s) ===========

== 20170627101016 ScheduleEventMigrations: migrating ==========================
== 20170627101016 ScheduleEventMigrations: migrated (278.5996s) ===============

== 20170727123534 AddIndexOnEventsProjectIdId: migrating ======================                                                                               
-- index_exists?(:events, [:project_id, :id])                                  
   -> 0.0068s                                                                  
-- transaction_open?()                                                         
   -> 0.0000s                                                                  
-- execute("SET statement_timeout TO 0")                                       
   -> 0.0005s                                                                  
-- add_index(:events, [:project_id, :id], {:algorithm=>:concurrently})         
   -> 589.0104s                                                                
-- index_exists?(:events, :project_id)                                         
   -> 0.0077s                                                                  
-- transaction_open?()                                                         
   -> 0.0000s                                                                  
-- select_one("SELECT current_setting('server_version_num') AS v")             
   -> 0.0010s                                                                  
-- execute("SET statement_timeout TO 0")                                       
   -> 0.0005s                                                                  
-- remove_index(:events, {:algorithm=>:concurrently, :column=>:project_id})    
   -> 0.2620s                                                                  
-- index_exists?(:events_for_migration, [:project_id, :id])                    
   -> 0.0047s                                                                  
-- transaction_open?()                                                         
   -> 0.0000s                                                                  
-- execute("SET statement_timeout TO 0")                                       
   -> 0.0005s                                                                  
-- add_index(:events_for_migration, [:project_id, :id], {:algorithm=>:concurrently})                                                                          
   -> 0.0164s                                                                  
-- index_exists?(:events_for_migration, :project_id)                           
   -> 0.0054s                                                                  
-- transaction_open?()                                                         
   -> 0.0000s                                                                  
-- select_one("SELECT current_setting('server_version_num') AS v")             
   -> 0.0005s                                                                  
-- execute("SET statement_timeout TO 0")                                       
   -> 0.0005s                                                                  
-- remove_index(:events_for_migration, {:algorithm=>:concurrently, :column=>:project_id})                                                                     
   -> 0.1220s                                                                  
== 20170727123534 AddIndexOnEventsProjectIdId: migrated (589.4401s) ===========                            

Migration Timings

Migration                                   Duration
PrepareEventsTableForPushEventsMigration    0.2 sec
CreatePushEventPayloadsTables               0.2 sec
ScheduleEventMigrations                     278.5 sec (4.6 minutes)
AddIndexOnEventsProjectIdId                 589.4 sec (9.8 minutes)

Query Plans

This query is used to get events using a JOIN LATERAL when supported (PostgreSQL 9.3 or newer):

SELECT  "events".* 
FROM (
  SELECT "projects"."id" 
  FROM "projects" 
  INNER JOIN "project_authorizations" ON "projects"."id" = "project_authorizations"."project_id" 
  WHERE "project_authorizations"."user_id" = 1
) projects_for_lateral 
JOIN LATERAL (
  SELECT  "events".* 
  FROM "events" 
  WHERE (events.project_id = projects_for_lateral.id)  
  ORDER BY "events"."id" DESC 
  LIMIT 20
) AS events ON true 
ORDER BY "events"."id" DESC 
LIMIT 20 
OFFSET 0;

The plan for this query is:

                                                                                                       QUERY PLAN                                                                                                       
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2825.33..2825.38 rows=20 width=1875) (actual time=12.069..12.079 rows=20 loops=1)
   ->  Sort  (cost=2825.33..2828.93 rows=1440 width=1875) (actual time=12.068..12.074 rows=20 loops=1)
         Sort Key: events.id DESC
         Sort Method: top-N heapsort  Memory: 36kB
         ->  Nested Loop  (cost=1.43..2787.02 rows=1440 width=1875) (actual time=0.041..10.669 rows=3631 loops=1)
               ->  Nested Loop  (cost=0.86..186.63 rows=72 width=4) (actual time=0.028..1.677 rows=299 loops=1)
                     ->  Index Only Scan using index_project_authorizations_on_user_id_project_id_access_level on project_authorizations  (cost=0.43..5.69 rows=72 width=4) (actual time=0.020..0.113 rows=299 loops=1)
                           Index Cond: (user_id = 1)
                           Heap Fetches: 15
                     ->  Index Only Scan using projects_pkey on projects  (cost=0.43..2.50 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=299)
                           Index Cond: (id = project_authorizations.project_id)
                           Heap Fetches: 3
               ->  Limit  (cost=0.57..35.72 rows=20 width=1875) (actual time=0.008..0.025 rows=12 loops=299)
                     ->  Index Scan Backward using index_events_on_project_id_and_id on events  (cost=0.57..2165.74 rows=1232 width=1875) (actual time=0.007..0.022 rows=12 loops=299)
                           Index Cond: (project_id = projects.id)
 Planning time: 0.436 ms
 Execution time: 12.122 ms

This is much better than the fallback query used for MySQL / PostgreSQL 9.2 (which is basically what we currently use):

SELECT  "events".* 
FROM "events" 
WHERE (
  EXISTS (
    SELECT 1 
    FROM "projects" 
    INNER JOIN "project_authorizations" ON "projects"."id" = "project_authorizations"."project_id" 
    WHERE "project_authorizations"."user_id" = 1 
    AND (projects.id = events.project_id)
  )
) 
ORDER BY "events"."id" DESC 
LIMIT 20 
OFFSET 0

This produces the following plan:

                                                                                                          QUERY PLAN                                                                                                          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5455.33..5455.38 rows=20 width=1875) (actual time=1587.420..1587.433 rows=20 loops=1)
   ->  Sort  (cost=5455.33..5677.15 rows=88726 width=1875) (actual time=1587.419..1587.426 rows=20 loops=1)
         Sort Key: events.id DESC
         Sort Method: top-N heapsort  Memory: 36kB
         ->  Nested Loop  (cost=187.38..3094.37 rows=88726 width=1875) (actual time=1.541..1403.954 rows=583589 loops=1)
               ->  HashAggregate  (cost=186.81..187.53 rows=72 width=8) (actual time=1.518..1.678 rows=299 loops=1)
                     Group Key: projects.id
                     ->  Nested Loop  (cost=0.86..186.63 rows=72 width=8) (actual time=0.024..1.433 rows=299 loops=1)
                           ->  Index Only Scan using index_project_authorizations_on_user_id_project_id_access_level on project_authorizations  (cost=0.43..5.69 rows=72 width=4) (actual time=0.017..0.101 rows=299 loops=1)
                                 Index Cond: (user_id = 1)
                                 Heap Fetches: 15
                           ->  Index Only Scan using projects_pkey on projects  (cost=0.43..2.50 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=299)
                                 Index Cond: (id = project_authorizations.project_id)
                                 Heap Fetches: 3
               ->  Index Scan using index_events_on_project_id_and_id on events  (cost=0.57..28.05 rows=1232 width=1875) (actual time=0.011..3.901 rows=1952 loops=299)
                     Index Cond: (project_id = projects.id)
 Planning time: 0.664 ms
 Execution time: 1587.500 ms

So by using JOIN LATERAL we can query the data 66x faster.
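
For reference, a minimal sketch of how the two strategies could be selected at runtime. The EventCollection class name and the query-building helpers used here are illustrative assumptions rather than the exact code in this MR:

# Builds the "20 most recent events for the projects a user can see" query,
# preferring JOIN LATERAL on PostgreSQL and falling back to EXISTS elsewhere.
class EventCollection
  LIMIT = 20

  # projects is an ActiveRecord::Relation of the authorized projects.
  def initialize(projects)
    @projects = projects
  end

  def to_relation
    # Assumption: a helper tells us whether we're on PostgreSQL; a version
    # check for 9.3+ would be needed as well.
    Gitlab::Database.postgresql? ? lateral_query : exists_query
  end

  private

  # Top 20 events per project via LATERAL, then re-sorted and limited again.
  def lateral_query
    lateral = <<~SQL
      (#{@projects.select(:id).to_sql}) projects_for_lateral
      JOIN LATERAL (
        SELECT events.*
        FROM events
        WHERE events.project_id = projects_for_lateral.id
        ORDER BY events.id DESC
        LIMIT #{LIMIT}
      ) AS events ON true
    SQL

    Event.from(lateral).order(id: :desc).limit(LIMIT)
  end

  # Portable fallback matching the EXISTS query shown above.
  def exists_query
    projects_sql = @projects.select(:id).to_sql

    Event
      .where("EXISTS (SELECT 1 FROM (#{projects_sql}) p WHERE p.id = events.project_id)")
      .order(id: :desc)
      .limit(LIMIT)
  end
end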

Edited by Yorick Peterse

Activity

  • mentioned in issue #31806 (closed)

  • Yorick Peterse mentioned in merge request !12355 (merged)

    mentioned in merge request !12355 (merged)

  • Yorick Peterse changed the description

    changed the description

  • Yorick Peterse marked the checklist item Index (project_id, created_at), then remove the standalone index on push_events.project_id as completed

    marked the checklist item Index (project_id, created_at), then remove the standalone index on push_events.project_id as completed

  • Yorick Peterse marked the checklist item Add a background migration to migrate a single row to the new format as completed

    marked the checklist item Add a background migration to migrate a single row to the new format as completed

  • Yorick Peterse changed the description

    changed the description

  • Yorick Peterse marked the checklist item Add a post-deployment migration that schedules migrations for all existing push events as completed

    marked the checklist item Add a post-deployment migration that schedules migrations for all existing push events as completed

  • Yorick Peterse added 1 commit

    added 1 commit

    • 76133960 - Use a separate table for storing push events

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 20a60e85 - Use a separate table for storing push events

    Compare with previous version

  • Yorick Peterse resolved all discussions

    resolved all discussions

  • Yorick Peterse marked the checklist item Updated db/schema.rb as completed

    marked the checklist item Updated db/schema.rb as completed

  • Yorick Peterse marked the checklist item Added a down method so the migration can be reverted as completed

    marked the checklist item Added a down method so the migration can be reverted as completed

  • Yorick Peterse marked the checklist item Added the output of the migration(s) to the MR body as completed

    marked the checklist item Added the output of the migration(s) to the MR body as completed

  • Yorick Peterse marked the checklist item Added the execution time of the migration(s) to the MR body as completed

    marked the checklist item Added the execution time of the migration(s) to the MR body as completed

  • Yorick Peterse added 1 commit

    added 1 commit

    • 2bda45eb - Use a separate table for storing push events

    Compare with previous version

  • Yorick Peterse marked the checklist item Ordered columns based on their type sizes in descending order as completed

    marked the checklist item Ordered columns based on their type sizes in descending order as completed

  • Yorick Peterse marked the checklist item Added foreign keys if necessary as completed

    marked the checklist item Added foreign keys if necessary as completed

  • Yorick Peterse marked the checklist item Added indexes if necessary as completed

    marked the checklist item Added indexes if necessary as completed

  • Yorick Peterse changed the description

    changed the description

  • Yorick Peterse mentioned in merge request !12527 (merged)

    mentioned in merge request !12527 (merged)

  • Yorick Peterse mentioned in merge request !12555 (merged)

    mentioned in merge request !12555 (merged)

  • Yorick Peterse added 810 commits

    added 810 commits

    Compare with previous version

  • Yorick Peterse changed the description

    changed the description

  • Yorick Peterse changed milestone to %9.5

    changed milestone to %9.5

  • Yorick Peterse changed the description

    changed the description

  • I'm contemplating taking a slightly different direction. While using two entirely different tables is a good idea at its core, it's a bit tricky to deal with when it comes to pagination. Using a UNION proved too problematic because of the vastly different schemas of the tables involved, and because I couldn't get Rails to use the right classes for the returned rows. Paginating data manually is a pain because:

    1. Either we take the limit N, divide it by the table count, then use that for the actual limit. This however means we may display less data (e.g. if there are no push events).
    2. We use the limit N for all tables, then paginate the merged set again in Ruby. This however means we're paginating data at both the database and Ruby level, and getting all this to work is really painful.

    So I'm thinking of doing the following: we keep events as the main events table and turn push_events into event_push_payloads (or something along those lines). We then add event_push_payload_id to events and migrate the push events data from events.data into this separate table. In this setup pagination is as trivial as it normally is, and we still move most of the data from events (the data column) to a separate table. However, there are also some downsides:

    1. The number of rows in events will stay the same, but then again they could still grow to a very large number even when using separate tables.
    2. As we do the same for multiple event types we may end up with multiple event_bla_payload_id columns, each needing an index. This is essentially STI but using separate columns
    3. Writing to events will still be expensive because of all the indexes
    4. We'd need validations to ensure only one of these payload columns is set

    However, there are also benefits to this approach:

    1. Pagination is trivial.
    2. We need to change less of the codebase, meaning other code such as the contributions calendar keeps working.
    3. We can still reduce the table size by a lot, though perhaps slightly less than 100GB (depending on how much space is necessary per row in this particular setup).
    4. We only have a single events table, making it easier to query for features such as the contributions calendar instead of having to aggregate over multiple tables and merge the results somehow.
    5. We can adjust Rails' STI system to use a different class based on the value of events.action. Basically for Event::PUSHED it would use PushEvent, for everything else it would use Event. This removes the need for stuffing all logic into the Event model, and without the need for a type column.
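
    A rough sketch of how that last point could be wired up in Rails; the hook and class layout shown here are assumptions for illustration, not necessarily what ends up in this MR:

    class Event < ActiveRecord::Base
      PUSHED = 5

      # Rails consults this hook when instantiating rows, so push events can be
      # materialised as PushEvent without adding a `type` column to events.
      def self.discriminate_class_for_record(record)
        record['action'] == PUSHED ? PushEvent : Event
      end
    end

    class PushEvent < Event
      # Push-specific behaviour (payload access, commit title, ...) lives here
      # instead of being stuffed into Event.
    end
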
  • @yorickpeterse we have a system_note_metadata table, which sounds like a similar scheme. I think it's pragmatic in cases like this, where we want to reduce the size of an existing table (or prevent it growing much more) without completely changing that part of the application.

  • Come to think of it: technically we don't actually need an additional column on the events table. As long as the payloads table has an event_id column (which it needs anyway for foreign keys) we can just use that. This means the events table would not get any wider. To then get the data you would just do something like PushEventPayload.where(event_id: self.id).take
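
    Sketched out with a plain association (names assumed for illustration), that lookup could look like:

    class Event < ActiveRecord::Base
      # Resolved purely through push_event_payloads.event_id; no extra column
      # on events is needed.
      has_one :push_event_payload, foreign_key: :event_id
    end

    class PushEventPayload < ActiveRecord::Base
      # The table is created with :id => false, so event_id acts as the key.
      self.primary_key = :event_id

      belongs_to :event
    end

    # event.push_event_payload then issues roughly:
    #   SELECT * FROM push_event_payloads WHERE event_id = <event.id> LIMIT 1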

    Edited by Yorick Peterse
  • @smcgivern Yeah, I'm leaning towards this approach being the best. I was already scratching my head over the amount of code that would have to be changed to support two separate tables.

  • Yorick Peterse added 413 commits

    added 413 commits

    Compare with previous version

  • Yorick Peterse added 413 commits

    added 413 commits

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 6fcf2431 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse resolved all discussions

    resolved all discussions

  • @smcgivern I ended up going with the separate payloads table approach, and this is a much better solution thus far. In particular it makes the migration process much easier using some STI hackery. Having said that, we do need to change the output of our Atom feeds since we no longer expose all commits (Atom feeds display the 15 most recent commits per event); I'll see if I can come up with a sane new format.

    Code-wise, the MR currently supports push events in both the old and new format (hence the setup of PushEvent). The data returned is compatible, but not guaranteed to be equal. For example, PushEvent only ever returns a single commit and its author details are set to the details of the person who pushed the commits, and not the creator of the commits. The next step is to adjust the UI so it doesn't rely directly on the raw commit data any more. Fortunately the changes here should be relatively minor.

  • Yorick Peterse changed the description

    changed the description

  • Code-wise, the MR currently supports push events in both the old and new format (hence the setup of PushEvent). The data returned is compatible, but not guaranteed to be equal

    :thumbsup:

  • Yorick Peterse added 1 commit

    added 1 commit

    • f9adcadf - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse marked the checklist item Adjust Atom feeds to show push events using the new format, and make Event compatible with this so we can show old data at the same time as completed

    marked the checklist item Adjust Atom feeds to show push events using the new format, and make Event compatible with this so we can show old data at the same time as completed

  • Yorick Peterse marked the checklist item Adjust the UI to show push events using the new format as completed

    marked the checklist item Adjust the UI to show push events using the new format as completed

  • Yorick Peterse changed the description

    changed the description

  • Yorick Peterse added 1 commit

    added 1 commit

    • 858ee420 - Store push event payloads in a separate table

    Compare with previous version

  • With this MR slowly nearing completion I want to discuss the production impact/procedure. We currently have around 51 959 306 push events in our events table, all of which need to be migrated. This is done using a background migration that processes a range of N events. These jobs are scheduled in db/post_migrate/20170627101016_schedule_event_migrations.rb which essentially does the following:

    1. Get a slice of 10 000 events
    2. Break this slice up into ranges of 100
    3. For every range prepare a job, then push the 100 jobs (10 000 / 100) into Sidekiq at once
    4. Do this for all push events

    The background migration then essentially does this:

    1. Get the events in the range using SELECT id, data FROM events WHERE action = 5 AND data IS NOT NULL AND id >= START AND id <= END
    2. For every row, create a row in push_event_payloads. Upon a foreign key error (meaning the associated event was just removed) we'll just skip the event
    3. Once the rows for this slice are created we'll run UPDATE events SET data = NULL WHERE action = 5 AND id >= START AND id <= END

    The migration is built so that it automatically skips already migrated data. This means that in the event of an error we can just re-run the scheduling process without having to make any adjustments.
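
    A rough sketch of what a single background migration job does, based on the steps above (the class name and the payload field mapping are assumptions, not the exact code):

    class MigratePushEventPayloads # illustrative name
      PUSHED = 5

      # start_id..end_id is the range of ~100 event ids scheduled for this job.
      def perform(start_id, end_id)
        rows = ActiveRecord::Base.connection.select_all(<<~SQL)
          SELECT id, data
          FROM events
          WHERE action = #{PUSHED}
            AND data IS NOT NULL
            AND id BETWEEN #{start_id} AND #{end_id}
        SQL

        rows.each do |row|
          begin
            data = YAML.load(row['data']) # events.data is a serialized hash

            PushEventPayload.create!(
              event_id:     row['id'],
              commit_count: data[:total_commits_count],
              ref:          data[:ref],
              commit_from:  data[:before], # real code would also convert SHAs to binary
              commit_to:    data[:after]
            )
          rescue ActiveRecord::InvalidForeignKey
            # The event was removed after we selected it; just skip it.
          end
        end

        # Nulling data afterwards makes re-running the scheduler safe: already
        # migrated ranges contain no rows with data left to process.
        ActiveRecord::Base.connection.execute(<<~SQL)
          UPDATE events
          SET data = NULL
          WHERE action = #{PUSHED} AND id BETWEEN #{start_id} AND #{end_id}
        SQL
      end
    end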

    The database impact of this is:

    • 51 959 306 rows will have their data field set to NULL
    • 519 593 UPDATE statements
    • Depending on our vacuuming settings we can end up with millions of dead tuples

    The number of UPDATE statements can be reduced by increasing the range of events we migrate per job, but this also means we can process less in parallel (which could lead to work taking longer). As such I'd rather not increase this beyond 100 events per job.

    Because of the large amount of data that needs to be migrated I don't think it's a good idea to schedule batches with some time in between (e.g. 2 minutes between every batch). Doing so would lead to the last batch not starting until roughly 2 months after the first batch was scheduled. My goal is to have the work completed in less than a week, which I can only see reasonably happening if we just schedule all the jobs at once.

    For this to work however we'll need to do a few things:

    • We will need a few dedicated Sidekiq workers (e.g. 10 hosts with 8 cores each) for the background_migration queue
    • We will need to increase the number of connections pgbouncer allows on the primary to support this additional fleet of hosts
    • We should adjust the vacuuming settings of events (before deploying this migration) so the cost limit is much greater. I'm thinking of 10 000 instead of 3 000 (so a 3.3x increase in throughput). This should greatly reduce the time necessary to vacuum events, at the cost of it maybe interfering with other queries (see the sketch after this list)
    • Once the migration is done we'll need to run pg_repack on the events table so we can reclaim space (this will most likely be at least 100 GB of space). We can't use VACUUM FULL because this process can easily take hours while locking the entire table.
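
    For reference, the per-table cost limit mentioned above could be raised with a storage parameter roughly as follows (whether this is done as a migration or as a one-off statement is an open question; the migration name is illustrative):

    class IncreaseEventsAutovacuumCostLimit < ActiveRecord::Migration
      def up
        execute 'ALTER TABLE events SET (autovacuum_vacuum_cost_limit = 10000)'
      end

      def down
        execute 'ALTER TABLE events RESET (autovacuum_vacuum_cost_limit)'
      end
    end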

    cc @pcarranza

  • Yorick Peterse added 1 commit

    added 1 commit

    • 1cf19a01 - Store push event payloads in a separate table

    Compare with previous version

  • Worth mentioning: we can't just drop the data column and then vacuum since the column is still in use by other event types.

  • Yorick Peterse changed the description

    changed the description

  • Yorick Peterse added 1 commit

    added 1 commit

    • 8c62b118 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse marked the checklist item Adjust the code so that creating push events uses PushEventPayload as completed

    marked the checklist item Adjust the code so that creating push events uses PushEventPayload as completed

  • Yorick Peterse added 1 commit

    added 1 commit

    • 0fbf8273 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 3cdce97e - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 13b70b26 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 14f3fc7c - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 449b3d24 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 304 commits

    added 304 commits

    Compare with previous version

  • Yorick Peterse added 304 commits

    added 304 commits

    Compare with previous version

  • Yorick Peterse added 59 commits

    added 59 commits

    Compare with previous version

  • Yorick Peterse marked the checklist item Added tests for the migration in spec/migrations if necessary (e.g. when migrating data) as completed

    marked the checklist item Added tests for the migration in spec/migrations if necessary (e.g. when migrating data) as completed

  • Yorick Peterse marked the checklist item Made sure the migration won't interfere with a running GitLab cluster, for example by disabling transactions for long running migrations as completed

    marked the checklist item Made sure the migration won't interfere with a running GitLab cluster, for example by disabling transactions for long running migrations as completed

  • Yorick Peterse added 1 commit

    added 1 commit

    • ded0c696 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 2760ce75 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 7dfbd229 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 253 commits

    added 253 commits

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • a6a15bfb - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 8da116c1 - Store push event payloads in a separate table

    Compare with previous version

  • Yorick Peterse changed the description

    changed the description

  • To get rid of the table bloat I'm thinking of the following solution:

    Release 1:

    1. Create a table called "events_for_migration" using the same structure as "events", except that:
       • it won't have any data
       • the "data" and "title" columns are not included
       • the primary key starts at the same value as "events.id"
       • it will have a foreign key on events_for_migration.project_id pointing to projects.id with a cascading delete
       • it will have all indexes in place
    2. Set up a trigger to replicate INSERT statements on "events" to "events_for_migration", using the exact same values (including the "id" value)
    3. Migrate push events, copying over migrated "events" rows into the "events_for_migration" table (including the "id" value, excluding the "data" and "title" columns). In this setup we'll leave events.data as-is to reduce table bloat and pressure.

    Release 2:

    1. Steal any remaining background migrations for push events. At this point "events_for_migration" contains newly inserted events, and all migrated push events (about 80% of the current size of "events")
    2. Copy over remaining rows from "events" where "events.id" is smaller than the smallest value of "events_for_migration.id". This ensures we only copy over data that hasn't already been moved to this table. An example query would be INSERT INTO events_for_migration (column1, column2, ...) SELECT column1, column2, ... FROM events WHERE id < MIN_ID with MIN_ID being queried separately and passed as an argument (since MySQL doesn't support CTEs, this is the easiest approach). This is currently about 20-21% of the "events" table
    3. Lock both "events" and "events_for_migration" in access exclusive mode, then:
    4. Rename the tables
    5. Reset the primary key sequence of the new events table so it starts at the right value
    6. Unlock tables
    7. Adjust push_event_payloads so the foreign key on event_id points to the new events table (instead of the old one), then validate it
    8. Drop the old events table

    This setup should have minimal impact as the time the tables are locked is very brief. Further, it removes the need for VACUUM FULL or pg_repack to reclaim space. Neither of these is ideal: VACUUM FULL requires downtime (it locks the entire table for the duration), and pg_repack is an extension that we don't ship (plus we can't really control it from a Rails migration). Since we're scanning over 80% of the "events" rows we might as well embed this work into the migration.

    There are some details here that are very important:

    • The foreign key on events_for_migration.project_id is crucial as it prevents orphaned data from being produced. This in turn means we don't need to remove any orphans before adjusting the foreign key in step 4, reducing the amount of time necessary to lock things (without this we'd have to remove orphans and add the foreign key inside the lock).
    • Both tables must absolutely use the same primary key sequence as this makes it easier to adjust the foreign key on push_event_payloads
    • Until this entire process is complete we'll end up using a bit more disk space. The exact amount is hard to estimate, but it shouldn't be more than 10-20 GB (towards the end of the migration just before we drop the old table)
    • The background migration in this MR will have to not only insert data into push_event_payloads, but also copy over rows from events to events_for_migration. We can skip this but then copying over existing data in step 2 becomes more time-consuming

    The end result would be an "events" table with basically no bloat by 9.6, and this would work for all GitLab users (instead of only for GitLab.com).
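
    Sketched as a migration, the swap in release 2 could look roughly like this (index/sequence renames and the push_event_payloads foreign key adjustment are glossed over; this is an assumption, not the final code):

    class SwapEventsTables < ActiveRecord::Migration
      def up
        # The migration runs inside a transaction on PostgreSQL, so this lock
        # is held until commit.
        execute 'LOCK TABLE events, events_for_migration IN ACCESS EXCLUSIVE MODE'

        rename_table :events, :events_old
        rename_table :events_for_migration, :events

        # The id sequence, indexes, and the push_event_payloads foreign key
        # keep their old names and targets; they would be renamed/re-pointed
        # and the sequence reset here as well (omitted).

        drop_table :events_old
      end
    end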

    Edited by Yorick Peterse
  • The distribution of event types is currently as follows (to explain the 20/80% values mentioned above):

    gitlabhq_production=# select action, count(*) as amount from events group by action order by amount desc;
     action |  amount                                                              
    --------+----------                                                            
          5 | 53979887                                                             
          1 |  6177726                                                                    
          6 |  3718787                                                             
          3 |  1824720                                                             
          7 |  1807557                                                             
          8 |   738592                                                             
          9 |   582628                                                             
          4 |   108801                                                             
         11 |    22892                                                             
         10 |     9708                                                             
    (10 rows)                                                                      
                                                                                   
    gitlabhq_production=# select case when action = 5 then 'pushed' else 'other' end AS label, count(*) as amount from events group by label order by amount desc;
     label  |  amount                        
    --------+----------                                                            
     pushed | 53980383                                                             
     other  | 14991599             

    Here we can see that all non-push events make up about 20% of the total row count (= pushed + other).

  • Coding-wise, pg_repack would be easier. In this setup we'd basically do the following:

    1. Migrate push events as done in the current MR state (= inserting rows into push_event_payloads then nullifying events.data for the migrated rows)
    2. Wait for all of this to complete
    3. Drop events.data
    4. Run pg_repack on events

    The downside of this approach is that it requires us to:

    1. Ship pg_repack so everybody can benefit from it (or fall back to VACUUM FULL if pg_repack is not available, but this is really dangerous)
    2. Be able to start pg_repack from a database migration, preferably in a non-blocking manner so the deploy isn't blocked by it

    I do see us doing work like this in the future, so perhaps pg_repack may not be a bad tool to ship. However, if pg_repack requires special permissions or commands to run we can't use it since the DB migration user may not have these permissions.

  • Hm, as far as I can find pg_repack is executed as a shell command. This means we can't really call it from a migration because pg_repack may not be available on the host that is running the migrations. This means the only way to get rid of bloat using pg_repack is to tell users to manually run it, which I really don't like.

  • @yorickpeterse pg_repack is not even installed in our infrastructure at this point. I wouldn't rely on it at all.

  • Also, @yorickpeterse IIRC that required us building pg_repack manually against the right PG version... a lot of mess.

    Do you think that this is the only (reasonable) way to do this?

  • @pcarranza The options that I can think of are:

    1. The migration procedure listed above. This requires some additional coding but allows us to reclaim bloat the fastest.
    2. pg_repack, which has various distribution difficulties
    3. VACUUM FULL, this requires downtime and could easily take hours
    4. logical replication to replicate to a new cluster (bloat won't be replicated), then fail over to this new cluster. This, however, is something that would take months to set up, plus it would be GitLab.com specific; on the other hand it would also reclaim bloat in other tables
  • Yorick Peterse marked the checklist item Check if we can simply drop events.data in the next release since it seems to only be used by push events. This would remove the need for updating events, thus reducing table bloat (and subsequent DB load) as completed

    marked the checklist item Check if we can simply drop events.data in the next release since it seems to only be used by push events. This would remove the need for updating events, thus reducing table bloat (and subsequent DB load) as completed

  • Yorick Peterse added 309 commits

    added 309 commits

    Compare with previous version

  • Yorick Peterse added 218 commits

    added 218 commits

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • da73ce20 - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse marked the checklist item Adjust the event queries to use a LATERAL JOIN on PostgreSQL to more efficiently get the data as completed

    marked the checklist item Adjust the event queries to use a LATERAL JOIN on PostgreSQL to more efficiently get the data as completed

  • Yorick Peterse marked the checklist item Added the output of the migration(s) to the MR body as incomplete

    marked the checklist item Added the output of the migration(s) to the MR body as incomplete

  • Yorick Peterse marked the checklist item Added the execution time of the migration(s) to the MR body as incomplete

    marked the checklist item Added the execution time of the migration(s) to the MR body as incomplete

  • Yorick Peterse changed the description

    changed the description

  • I ended up going with the approach of migrating data from within GitLab. This is done in such a way that in 10.0 all we need to do is:

    1. Steal any remaining background jobs
    2. Swap the tables
    3. Start a transaction
    4. Lock both tables in exclusive mode
    5. Drop "events"
    6. Rename "events_for_migration" to "events", including any indexes
    7. Commit
    8. Remove the migration related code from the Event model, and remove the EventForMigration model

    The procedure in question takes care of copying over all existing data as well as newly inserted data. The latter is done by the application and not a trigger since triggers can't be dumped to db/schema.rb. This means if one were to set up a new DB they would not have said trigger, thus losing newly inserted events.

    One thing to take into account is that the background migration does 1 INSERT for every event. Batching this won't work very well as a single row with an invalid foreign key value would lead to the entire batch being cancelled, whereas 1 INSERT per event means we can simply skip the event.

    The procedure of swapping tables will not take more than maybe a second or so.

  • Yorick Peterse added 2 commits

    added 2 commits

    • a15e8969 - Migrate events into a new format
    • 0c8938cb - Use a specialized class for querying events

    Compare with previous version

  • @smcgivern @DouweM Since this is going to be a pretty big set of changes I'd like to have some preliminary feedback (when convenient) while I probably spend a few days getting all the tests to pass.

  • Yorick Peterse changed the description

    changed the description

  • assigned to @smcgivern

    • Resolved by Yorick Peterse

      This is done in such a way that in 10.0 all we need to do is:

      1. Steal any remaining background jobs

      This is going to really suck for anyone who skips 9.5. Could we wait a little longer before stealing the remaining background jobs?

  • Sean McGivern
    • Resolved by Yorick Peterse
      • pg_repack, which has various distribution difficulties

      This is a digression, but given that the fourth option (logical replication to a new cluster) would have been specific to GitLab.com, couldn't we do this in a GitLab.com-specific way initially?

  • @yorickpeterse thanks! I mostly have minor comments; Douwe's are more substantial, but I think we both agree this approach makes sense. I'm concerned about the timeline, but that's easy to change / discuss.

  • Yorick Peterse added 183 commits

    added 183 commits

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • dbe02f7e - Migrate events into a new format
    • 89b6f72d - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • a4bf52cc - Migrate events into a new format
    • 788d07ef - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse mentioned in merge request !12584 (merged)

    mentioned in merge request !12584 (merged)

  • Yorick Peterse added 52 commits

    added 52 commits

    Compare with previous version

  • Yorick Peterse marked the checklist item Verify the API output for push events as completed

    marked the checklist item Verify the API output for push events as completed

  • Yorick Peterse added 1 commit

    added 1 commit

    • 54d46aa2 - Update the API to support push event payloads

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • 25ea069e - Migrate events into a new format
    • a067992c - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • 7f603dc4 - Migrate events into a new format
    • 8a4e7102 - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse added 1 commit

    added 1 commit

    • 7dded9d4 - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • 9c4f7004 - Migrate events into a new format
    • e05338e1 - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • 636d1762 - Migrate events into a new format
    • dfd7e2b4 - Use a specialized class for querying events

    Compare with previous version

  • mentioned in issue #35939 (closed)

  • Yorick Peterse marked the checklist item Added tests for the relevant changes as completed

    marked the checklist item Added tests for the relevant changes as completed

  • mentioned in issue #27375 (closed)

  • mentioned in issue #35990 (closed)

  • Yorick Peterse resolved all discussions

    resolved all discussions

  • Yorick Peterse marked the checklist item Documentation created/updated as completed

    marked the checklist item Documentation created/updated as completed

  • Yorick Peterse marked the checklist item Tests added for this feature/bug as completed

    marked the checklist item Tests added for this feature/bug as completed

  • Yorick Peterse marked the checklist item Changelog entry added, if necessary as completed

    marked the checklist item Changelog entry added, if necessary as completed

  • Yorick Peterse marked the checklist item Squashed related commits together as completed

    marked the checklist item Squashed related commits together as completed

  • Yorick Peterse marked the checklist item Conform by the style guides as completed

    marked the checklist item Conform by the style guides as completed

  • Yorick Peterse marked the checklist item Conform by the merge request performance guides as completed

    marked the checklist item Conform by the merge request performance guides as completed

  • Yorick Peterse added 813 commits

    added 813 commits

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • 7804a641 - Migrate events into a new format
    • 95a23b10 - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • 78688edd - Migrate events into a new format
    • e1162aa6 - Use a specialized class for querying events

    Compare with previous version

  • Yorick Peterse changed the description

    changed the description

  • Yorick Peterse marked the checklist item Added the output of the migration(s) to the MR body as completed

    marked the checklist item Added the output of the migration(s) to the MR body as completed

  • Yorick Peterse marked the checklist item Added the execution time of the migration(s) to the MR body as completed

    marked the checklist item Added the execution time of the migration(s) to the MR body as completed

  • The regular migrations have been performed in a personal staging environment and their details are added to the MR. The background migrations are running, though I had to slightly adjust how many jobs we buffer and how many events we process in a single job. The initial amount of 100 events per job (and buffering 100 jobs) would take far too long to schedule, whereas with 1000/1000 it takes about 5 minutes.

  • Yorick Peterse added 2 commits

    added 2 commits

    • b0ae8ba5 - Migrate events into a new format
    • 56dffd4c - Use a specialized class for querying events

    Compare with previous version

  • Some preliminary statistics:

    • 8 threads on a 4-core CPU (4 threads only led to 30% utilisation of all cores) based on the best-effort Sidekiq setup
    • about 0.2% of all events could be processed in roughly 15 minutes, though I did end up restarting Sidekiq 2-3 times to adjust its concurrency levels

    Row counts thus far:

    gitlabhq_production=# select count(*) from push_event_payloads;
     count  
    --------
     115669
    (1 row)
    
    gitlabhq_production=# select count(*) from events_for_migration;
     count  
    --------
     151731
    (1 row)

    Table sizes thus far:

    gitlabhq_production=# select pg_size_pretty(pg_total_relation_size('push_event_payloads'));
     pg_size_pretty 
    ----------------
     19 MB
    (1 row)
    
    gitlabhq_production=# select pg_size_pretty(pg_total_relation_size('events_for_migration'));
     pg_size_pretty 
    ----------------
     35 MB
    (1 row)
  • Sample output to show push_event_payloads is doing what it should be doing:

     commit_count | event_id | action | ref_type |                commit_from                 |                 commit_to                  |           ref            |                        commit_title                        
    --------------+----------+--------+----------+--------------------------------------------+--------------------------------------------+--------------------------+------------------------------------------------------------
                1 | 22630704 |      2 |        0 | \xc318d8b9cb51e65856df389cda1c9895ce17f9c2 | \xe01451a1a99508e75c10bf6e32f209cac2a43b08 | master                   | ribbon
                1 | 22630701 |      2 |        0 | \xeef261b320282054213cf09f5da25bc63cc7b48c | \xaf9d1f5362006ccef59e679be38dfb7f1effcf8f | master                   | added high forest block
                2 | 22630700 |      2 |        0 | \x9e58f85a6c809b7ba69c6d7f87f530afcd519a24 | \x3f3070fe536e4ead75dc2c2bd043a5f18932b768 | master                   | Merge branch 'address' into 'master'
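
    The commit_from and commit_to values above come from bytea columns: the 40-character SHAs are stored as 20 bytes of binary instead of strings, halving their storage size. A minimal conversion sketch (the actual attribute handling in the model may differ):

    sha = 'c318d8b9cb51e65856df389cda1c9895ce17f9c2'

    binary = [sha].pack('H*')          # 40 hex characters -> 20 bytes for storage
    sha    = binary.unpack('H*').first # and back to the usual hex form when reading
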
  • Yorick Peterse added 49 commits

    added 49 commits

    Compare with previous version

  • Some further statistics thus far:

    I started the migration procedure roughly around 2017-08-08 16:30:00 UTC. On 2017-08-09 11:00:00 UTC the progress was 13.92%. This produces a rate of 0.75% per hour, which in turn would lead to the entire process taking roughly 5.5 days for a single 4-core host running 8 threads (assuming progress per hour stays the same).

    The table sizes are currently as follows (including indexes):

    • events: 176 GB
    • events_for_migration: 2.21 GB (9 935 595 rows)
    • push_event_payloads: 1.2 GB (7 698 681 rows)

    Assuming linear growth this would produce an end size of roughly the following per table (including indexes):

    • events_for_migration: 15.8 GB
    • push_event_payloads: 8.62 GB

    If we round this up a bit to take into account variance in sizes we'd get, say:

    • events_for_migration: 20 GB
    • push_event_payloads: 10 GB

    This means in the new setup our total data size is around 30 GB (give or take a few GB). This in turn means we end up saving around 140 GB of space, giving us an 80% reduction.

    This of course all assumes the data grows linearly, which it probably won't.
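
    Spelled out, the extrapolation above is simply (numbers taken from the measurements in this comment):

    progress = 0.1392            # 13.92% of push events migrated so far

    puts 2.21 / progress         # ≈ 15.9 GB projected final events_for_migration size
    puts 1.2  / progress         # ≈ 8.6 GB projected final push_event_payloads size

    hours_elapsed = 18.5         # 2017-08-08 16:30 UTC to 2017-08-09 11:00 UTC
    rate = 13.92 / hours_elapsed # ≈ 0.75% of all push events per hour
    puts 100 / rate / 24.0       # ≈ 5.5 days until the migration completes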

  • Yorick Peterse marked the checklist item Included the raw SQL queries of the relevant queries as completed

    marked the checklist item Included the raw SQL queries of the relevant queries as completed

  • Yorick Peterse unmarked as a Work In Progress

    unmarked as a Work In Progress

  • mentioned in issue #36219 (moved)

  • On 2017-08-10 11:20:42 progress was at 24.9%, 40.8 hours after starting. This gives us a rate of 0.6% per hour, slightly lower than before. At this rate it would take 6.9 days to complete the migration process.

    Table statistics are currently as follows:

    • events_for_migration: 17 689 866 rows, 3.94 GB
    • push_event_payloads: 13 941 202 rows, 2.2 GB

    With linear growth this would make us end up with roughly the same statistics as calculated when progress was at 13.92%.

    @ilyaf @victorcete The restore hosts we used for this process (the DB, Redis, and the Sidekiq node) can now be terminated since I no longer need them.

  • @yorickpeterse thanks :thumbsup_tone1: I'll take care of deprovisioning the hosts

  • Sean McGivern
  • @yorickpeterse thanks! I don't have any major comments, really. This is another one we need to get approvals for, because it's past the freeze date - but we also really need this.

    Edited by Jose Ivan Vargas
  • Yorick Peterse added 61 commits

    added 61 commits

    Compare with previous version

  • Yorick Peterse added 2 commits

    added 2 commits

    • 0395c471 - Migrate events into a new format
    • aac1de46 - Use a specialized class for querying events

    Compare with previous version

  • mentioned in issue #36303 (closed)

  • Yorick Peterse resolved all discussions

    resolved all discussions

  • Sean McGivern approved this merge request

    approved this merge request

  • merged

  • Sean McGivern mentioned in commit e80a893f

    mentioned in commit e80a893f

  • This is quite the impressive merge request. Nice work. Looks good to go into the next RC.

  • Picked into 9-5-stable, will go into 9.5.0 RC2.

  • Sean McGivern mentioned in commit 75dffc45

    mentioned in commit 75dffc45

  • Reverting from 9.5.0 RC2, it's causing the CE to EE merge to fail

  • I think https://gitlab.com/gitlab-org/gitlab-ee/-/jobs/29079656 is just a test setup issue, but to be safe we just reverted this from RC2 and can target this for RC3.

  • @jivanvl @stanhu Is that the only problem we ran into with RC2, and is there an MR for this? I'm happy to look into it, and it probably is just an outdated EE spec that still uses the :event factory instead of :push_event

  • @yorickpeterse There isn't an MR to fix this yet. I think it is indeed an outdated EE spec.

  • Rémy Coutable mentioned in merge request !13466 (merged)

    mentioned in merge request !13466 (merged)

  • Picked into https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/13604, will merge into 9-5-stable ready for 9.5 RC5

  • Sean McGivern mentioned in commit f67256ba

    mentioned in commit f67256ba

  • Sean McGivern mentioned in commit b562e64f

    mentioned in commit b562e64f

  • mentioned in issue #36722 (closed)

  • mentioned in issue #36685 (closed)

  • mentioned in issue #36876 (closed)

  • Yorick Peterse mentioned in merge request !13932 (merged)

    mentioned in merge request !13932 (merged)

  • mentioned in issue #36880 (moved)

  • mentioned in issue #38004 (moved)

  • mentioned in issue #37814 (moved)

  • mentioned in issue #38509 (closed)

  • mentioned in issue #38507 (closed)

  • François Bobot @francois started a thread on commit 0395c471

    MAX_INDEX = 69
    PUSHED = 5

    def push_event?
      action == PUSHED && data.present?
    end

    def commit_title
      commit = commits.last

      return nil unless commit && commit[:message]

      index = commit[:message].index("\n")
      message = index ? commit[:message][0..index] : commit[:message]

      message.strip.truncate(70)
    end
  • mentioned in issue #43394 (closed)
