The migration 20160603180330_remove_duplicated_notification_settings.rb is supposed to remove duplicate notification settings. On a database with a lot of rows in notification_settings (467,849 in my case) it can take a very long time to run: I had to abort the migration after 45 minutes (during which it used up an entire CPU core).
On GitLab.com (or any other installation with more rows) this migration may end up taking several hours. We should try to improve it so it can finish in, let's say, at most 30 minutes.
The query for getting the grouped notification rows took around a second to run locally, so it seems the actual DELETE is the problem. Perhaps deleting rows in batches would speed things up.
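A minimal sketch of what batched deletes could look like, assuming the PostgreSQL adapter and keeping the row with the smallest id per (user_id, source_type, source_id) group; the 10,000 batch size and the exact SQL are illustrative, not the migration's actual code:

```ruby
# Illustrative sketch only: delete duplicates in small batches so each DELETE
# statement stays short, instead of removing everything in one huge statement.
loop do
  result = ActiveRecord::Base.connection.execute(<<-SQL)
    DELETE FROM notification_settings
    WHERE id IN (
      SELECT id
      FROM notification_settings
      WHERE id NOT IN (
        SELECT MIN(id)
        FROM notification_settings
        GROUP BY user_id, source_type, source_id
      )
      LIMIT 10000
    )
  SQL

  # cmd_tuples is the number of rows the DELETE affected (PostgreSQL adapter).
  break if result.cmd_tuples.zero?
end
```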
The commit log for 8450fe30 also says, "Add index to notification settings". It would have helped either to split this migration into its own commit (so that we can revert it) or to add a comment noting that this commit also removed duplicate settings.
What happens if SELECT min_id FROM (SELECT MIN(id) AS min_id FROM notification_settings GROUP BY user_id, source_type, source_id) runs, somebody adds a notification setting, and then the DELETE is triggered? Now that I'm taking another look at the migration, it seems it would end up deleting the newly created notification_settings row.
To make the migration faster we could probably start a number of threads and have each send a query that updates a chunk of the rows. MRI's GIL prevents multiple Ruby threads from running Ruby code in parallel, but that doesn't matter here: we just need to send multiple queries to PostgreSQL, which can run them in parallel, and then have each thread wait for its response.
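A rough sketch of that idea, assuming we split the work by id ranges; the ranges and the duplicate condition here are illustrative:

```ruby
# Illustrative sketch only: each thread checks out its own connection and runs
# a DELETE for one chunk of ids. The threads mostly sit waiting on PostgreSQL,
# so MRI's GIL doesn't get in the way.
ranges = [1..200_000, 200_001..400_000, 400_001..600_000]

threads = ranges.map do |range|
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do |connection|
      connection.execute(<<-SQL)
        DELETE FROM notification_settings
        WHERE id BETWEEN #{range.first} AND #{range.last}
          AND id NOT IN (
            SELECT MIN(id)
            FROM notification_settings
            GROUP BY user_id, source_type, source_id
          )
      SQL
    end
  end
end

threads.each(&:join)
```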
To prevent this migration from removing newly created data we'd probably have to acquire an INSERT/UPDATE lock (EXCLUSIVE mode on PostgreSQL, if I'm not mistaken) on the entire table for the duration of the migration. This would allow concurrent SELECTs while blocking concurrent INSERTs/UPDATEs until the migration is complete.
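If we do go the locking route, it could look something like this (a sketch, assuming PostgreSQL; LOCK TABLE only holds until the end of the surrounding transaction):

```ruby
# Illustrative sketch only: EXCLUSIVE mode blocks concurrent INSERT/UPDATE/
# DELETE but still allows plain SELECTs while the clean-up runs.
ActiveRecord::Base.transaction do
  ActiveRecord::Base.connection.execute(
    'LOCK TABLE notification_settings IN EXCLUSIVE MODE'
  )

  # ... remove the duplicated rows here ...
end
```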
Alternatively (and ideally) we'd write the migration in a way that we don't need any explicit locking if that's possible.
Another solution (and I'm just thinking out loud here): create a Sidekiq worker that just takes a SQL query and runs it, then have this migration generate SQL queries and schedule them for said worker. This does mean the migration will finish before the actual data has been processed. On the other hand, it allows us to use 20-something servers to update the data.
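Roughly what such a worker could look like (the class name and queue are made up for the sake of the example):

```ruby
# Illustrative sketch only: a worker that just executes whatever SQL it is
# given, so a migration can generate the queries and push them onto Sidekiq.
class RunSqlWorker
  include Sidekiq::Worker

  sidekiq_options queue: :migrations

  def perform(sql)
    ActiveRecord::Base.connection.execute(sql)
  end
end

# The migration would then do something like:
#   RunSqlWorker.perform_async("DELETE FROM notification_settings WHERE ...")
```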
@stanhu I think there is no problem reverting the entire commit. We remove the duplicated records so that we can create the index.
But yes, sorry, I will write a better explanation next time.
@yorickpeterse These are all good ideas, especially your last one; it would be great to have something that we can re-use for all big queries, especially because I may have another big query on the way. Maybe we could try to get the IDs of the duplicates instead of the non-duplicated rows; I think it's the NOT IN statement that is taking too long.
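For example, something along these lines might get the duplicate IDs directly (just a sketch, assuming PostgreSQL's ROW_NUMBER(); the 1,000-row slicing is illustrative):

```ruby
# Illustrative sketch only: number the rows in every (user_id, source_type,
# source_id) group, treat everything past the first row as a duplicate, and
# delete those ids in slices instead of using NOT IN.
duplicate_ids = ActiveRecord::Base.connection.select_values(<<-SQL)
  SELECT id
  FROM (
    SELECT id,
           ROW_NUMBER() OVER (
             PARTITION BY user_id, source_type, source_id
             ORDER BY id
           ) AS row_number
    FROM notification_settings
  ) numbered
  WHERE numbered.row_number > 1
SQL

duplicate_ids.each_slice(1_000) do |ids|
  ActiveRecord::Base.connection.execute(
    "DELETE FROM notification_settings WHERE id IN (#{ids.join(', ')})"
  )
end
```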
Now that we know which rows are duplicated, the next step is to figure out which rows to remove. To do so we could do the following (see the sketch after the list):
1. Iterate over the rows returned by the above query.
2. For every row, run the query DELETE FROM notification_settings WHERE user_id = X AND source_type = Y AND source_id = Z AND id != (SELECT MIN(id) FROM notification_settings WHERE user_id = X AND source_type = Y AND source_id = Z)
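Put together, that could look roughly like this (a sketch only; the grouped query is my guess at the "above query", and X/Y/Z come from each returned row):

```ruby
# Illustrative sketch only: find every duplicated (user_id, source_type,
# source_id) group, then delete everything in that group except the oldest row.
connection = ActiveRecord::Base.connection

duplicated_groups = connection.select_all(<<-SQL)
  SELECT user_id, source_type, source_id
  FROM notification_settings
  GROUP BY user_id, source_type, source_id
  HAVING COUNT(*) > 1
SQL

duplicated_groups.each do |row|
  user_id     = row['user_id'].to_i
  source_type = connection.quote(row['source_type'])
  source_id   = row['source_id'].to_i

  connection.execute(<<-SQL)
    DELETE FROM notification_settings
    WHERE user_id = #{user_id}
      AND source_type = #{source_type}
      AND source_id = #{source_id}
      AND id != (
        SELECT MIN(id)
        FROM notification_settings
        WHERE user_id = #{user_id}
          AND source_type = #{source_type}
          AND source_id = #{source_id}
      )
  SQL
end
```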