The default batching strategy for batched background migrations assumes data is evenly distributed, because it was originally designed to perform whole-table migrations. In a few recent MRs (!78393 (merged) for one), we've intended to use primary key batching to do partial updates based on the value of a type field. These partial updates could confuse the batch optimizer, because a batch may be mostly or entirely empty of rows to update. In the previous MR, we found a gap of over 1.5M records in the table where no matching records were found, which could cause the optimizer to aggressively scale up the batch size to an unsafe level.
A custom batching strategy can be added as a workaround, but we should try to solve this generically in the framework. Rather than have the batch optimizer key off previous batch size, it should key off the number of rows affected in the previous batch. This way, the optimizer can scale more accurately based on the amount of work done by the migration.
As an additional optimization, if a batch is entirely empty, we could immediately begin the next batch, to prevent waiting for the job delay when no work is being done.
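Roughly, the idea is for the optimizer to derive its signal from work actually done rather than from the nominal batch size. A minimal sketch, assuming each completed job records the rows it actually updated (names like `affected_rows` and `next_batch_size` are hypothetical, not the current `BatchOptimizer` API):

```ruby
# Hypothetical sketch only: scale the next batch from rows actually updated,
# not from the nominal batch size of the previous job.
def next_batch_size(previous_job, interval:, target_utilization: 0.9)
  # An empty batch carries no useful timing signal, so don't scale up from it.
  return previous_job.batch_size if previous_job.affected_rows.zero?

  # The time spent per affected row tells us how many rows fit in the interval.
  seconds_per_row = previous_job.duration.to_f / previous_job.affected_rows
  (interval * target_utilization / seconds_per_row).round
end
```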
@alexives I discussed this with @iroussos based on a few MRs where we're running into this problem. I'll leave it up to you both to decide if it makes sense to keep in this epic or as a future iteration.
Thank you @pbair, this looks to me like a really valuable addition to the batched background migrations framework: there are numerous data migrations that have to go through all IDs (they cannot use a filter to set the batches due to performance constraints, or the batching query even timing out), where batches may be empty.
As this is a feature required by a group actively working with batched background migrations on a pretty large data migration, I think we should prioritize addressing it. There are various alternatives to unblock ~group::workspace on their work, but I am in favor of addressing any issues that arise from real-world usage of the framework by other groups; those cases are a great signal of what we will really need in order to fully replace the current way of doing data migrations and reach GA.
Setting to ~database ~active and %14.8, and we can update if there are any objections
@krasio Sure. In one of the MRs that did this, we batched over routes where source_type = 'Namespace'. Since that data is not evenly distributed, we had large pockets in the table where the update query would update 0 rows.
Those batches and sub-batches completed very quickly since no writes occurred. Since the batch optimizer uses the previous batch size and duration to determine the next batch size, it would continually increase the batch size while batching over these pockets. Once it hit a segment where there were rows to actually update, the batch size could be much higher than it should be to complete within the interval.
Instead, it would be better if the batch size were based on the number of rows actually changed by the previous queries, to handle these situations. For example, if the previous batch updates 0 rows, the batch optimizer shouldn't attempt to increase the batch size.
While we could use a custom batching strategy to fix this problem, it also requires additional indexes and creates a fair bit of overhead for migration authors.
@pbair Thanks! I think I get the problem, was curious about any details on how to solve it.
For example, if the previous batch updates 0 rows, the batch optimizer shouldn't attempt to increase the batch size
It should not be just the immediate last batch, but any batch in the last number_of_jobs that affected 0 records, right? Also, we should consider not only no-op batches, but also batches that affected significantly fewer records than the batch size. So the optimization should somehow be adjusted to the number of records updated.
Meanwhile, do you think it's worth working towards making it possible to disable automatic batch optimization per migration? This way we can opt out of it for migrations that we know will have this problem.
It should not be just the immediate last batch, but any batch in the last number_of_jobs that affected 0 records, right? Also, we should consider not only no-op batches, but also batches that affected significantly fewer records than the batch size. So the optimization should somehow be adjusted to the number of records updated.
Yep, that sounds right. We need to capture and store the rows affected so we can use that instead.
Meanwhile, do you think it's worth working towards making it possible to disable automatic batch optimization per migration? This way we can opt out of it for migrations that we know will have this problem.
Yeah, that's a good idea, and we have an issue to do that already: #349196 (closed). I think really this is basically ready now, because !79848 (merged) implemented a max_batch_size. If the max_batch_size is equal to batch_size the optimizer will never go any higher (it can still go lower, but that doesn't seem to be a problem).
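For reference, opting out that way is just a matter of pinning the two values together when enqueueing (job class and table below are placeholders):

```ruby
# Placeholder job class/table; with batch_size == max_batch_size the optimizer
# can never scale the batch size up, only down.
queue_batched_background_migration(
  'ExampleBackfill',
  :routes,
  :id,
  job_interval: 2.minutes,
  batch_size: 1_000,
  max_batch_size: 1_000,
  sub_batch_size: 100
)
```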
I may be wrong but I see this as a vertical consideration to time efficiency. I would approach this by not including any calculation on time_efficiency but instead calculating:
batch_efficiency = work_done / batched_migration.batch_size.to_f (I would welcome a better name but I'll use that for this comment :-) )
and use it as a threshold check to not update the batch size:
if batch_efficiency is below a very low threshold (e.g. 10-20%), then we should exclude that batch from any calculations, as it is not representative.
if batch_efficiency is below 100% and above the very low threshold, I would just divide the duration by it, to adjust it to what we would expect if 100% of the batch had been executed (rough sketch below).
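Roughly, as a sketch (the threshold value and method names are placeholders, not existing code):

```ruby
# Placeholder sketch of the two rules above.
LOW_EFFICIENCY_THRESHOLD = 0.2

def adjusted_duration(job)
  batch_efficiency = job.affected_rows.to_f / job.batch_size

  # Not representative: exclude this batch from the optimizer's input.
  return nil if batch_efficiency < LOW_EFFICIENCY_THRESHOLD

  # Extrapolate the duration to what a 100% "full" batch would have cost.
  job.duration / batch_efficiency
end
```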
Which brought me to the proposed formula above in the trivial case, so I guess I agree with @krasio
But maybe we could keep the idea of excluding batches with batch_efficiency < low percentage if you find it useful
optional thought: @krasio could we do an additional optimization when finalizing the job?
Add another column (batch_efficiency)
Update it when finalizing the job by using the batch_metrics.cmd_tuples value
Additionally filter by batch_efficiency > X when fetching the past 10 jobs in smoothed_time_efficiency
That would cover small bumps with low-coverage batches, but could cause problems if all (or most) batches are below X, so I am not sure if this is a good idea
Nice idea (finally feel like I'm nearly caught up here :D)!
batch_efficiency < X would be an indicator that our core assumption has been violated (uniform distribution, and that max - min equals the number of records to update for a batch).
We should start documenting the overall thinking for the tuning somewhere concisely.
I think that we should move this and other issues related to our auto tuning layer under a new epic under &6751, so that we can also discuss and summarize our high level decisions in there. And also bring &7594 there? What do you think?
@iroussos Sure, I've created a new epic and rearranged as you proposed. I think it will be useful to keep all conversations related to auto-tuning under one parent epic.
As for the continued work on batch_efficiency, I think we're at a bit of a crossroads. I think at least some points from the conversation in #353395 (comment 884002463) are difficult to implement within the current limitation of using sidekiq-cron, but at the same time I'm not sure we have another option.
Optimizing for batch_size still likely makes sense when using sidekiq, since it's something we can easily control within that framework.
I think that your proposal makes sense, let's continue with the optimization @krasio mentions as the other options are yet to be proven and we may never manage to find a proper solution.
Additionally filter by batch_efficiency > X when fetching the past 10 jobs in smoothed_time_efficiency
Doesn't this keep the current situation though? We'll be adjusting the batch size based on the assumption that work was done, while this will not be the case (because jobs with no work done, or not enough work done, will be filtered out).
I ran some tests. Start with a table that has 2M rows to iterate over:
Use the following migration to simulate the case mentioned in the issue description where we had a number of batches not updating any records:
```ruby
# frozen_string_literal: true

# lib/gitlab/background_migration/dummy_bbm.rb
module Gitlab
  module BackgroundMigration
    class DummyBbm < BaseJob
      include Gitlab::Database::DynamicModelHelpers

      def perform(start_id, end_id, batch_table, batch_column, sub_batch_size, pause_ms, rest)
        model = define_batchable_model(batch_table, connection: connection)

        batch_metrics.instrument_operation(:update_all) do
          case start_id
          when 200_000..700_000
            sleep(0.01)
            0
          else
            sleep(2.5) # This is half of the interval
            end_id - start_id
          end
        end
      end

      def batch_metrics
        @batch_metrics ||= Gitlab::Database::BackgroundMigration::BatchMetrics.new
      end
    end
  end
end
```
It's not ideal but I think it's realistic enough.
Enqueue the migration with the following parameters:
For the current code we get the expected results - batch size is increased because "no work" batches are considered not efficient:
Next tried the approach described above - store batch_efficiency and additionally filter by batch_efficiency > X when fetching the past 10 jobs in smoothed_time_efficiency:
```diff
 # lib/gitlab/database/background_migration/batched_job.rb
+      before_transition :running => :succeeded do |job, transition|
+        affected_rows = job.metrics['affected_rows'] && job.metrics['affected_rows'].values.flatten.reduce(0) { |a, r| a += r }
+        job.batch_efficiency = affected_rows.to_f / job.batch_size if affected_rows && job.batch_size
+      end

 # lib/gitlab/database/background_migration/batched_migration.rb
       def smoothed_time_efficiency(number_of_jobs: 10, alpha: 0.2)
-        jobs = batched_jobs.successful_in_execution_order.reverse_order.limit(number_of_jobs).with_preloads
+        jobs = batched_jobs.successful_in_execution_order.where('batch_efficiency > 0.5').reverse_order.limit(number_of_jobs).with_preloads
```
We get the same results as what we have now:
And last, try adjusting the job efficiency based on the actual work done, capping it at 0.95. The capping is so we do not decrease the batch size because of an adjusted efficiency > 1.0.
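The adjustment was roughly along these lines (a sketch, assuming affected rows are recorded per job; not the exact code used in the test):

```ruby
# Sketch: weight each job's time efficiency by the share of the batch that was
# actually updated, and cap it so low-work batches never force the batch size down.
def job_efficiency(job, interval, cap: 0.95)
  time_efficiency  = job.duration.to_f / interval
  batch_efficiency = job.affected_rows.to_f / job.batch_size

  # No work done: treat the job as "perfectly sized" so it neither grows nor
  # shrinks the batch size.
  return cap if batch_efficiency.zero?

  [time_efficiency / batch_efficiency, cap].min
end
```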
Great way to showcase that with the second approach (batch_efficiency > X), the batch size will keep on increasing at the same rate it was increasing before the cutoff point.
Isn't the last approach a smoother equivalent to just checking for batch_efficiency < X and not updating the batch in that case? (just return 1 or set it to 1, did not check the related code)
we found a gap of over 1.5M records in the table where no matching records were found, which could cause the optimizer to aggressively scale up the batch size to an unsafe level.
With "unsafe level", we mean a batch size that is way too large (due to the described gaps) and we have a job that actually hits a lot of records suddenly. This job is going to take a lot longer than the anticipated interval between jobs. This is no immediate "threat" to overloading the system, but we can only adjust/pause job parameters after this long job has finished (which may take a long time).
I wonder if we can do this separately from batch optimization and rather add a safety on the execution side: We know when the job started, so we can easily stop after $interval has passed (and update job's batch size to reflect that retroactively?).
I realize this is an entirely different approach, so just leaving this here as a thought and request for comments. Also see #353395 (comment 884002463) with related notes.
This job is going to take a lot longer than the anticipated interval between jobs. This is no immediate "threat" to overloading the system, but we can only adjust/pause job parameters after this long job has finished (which may take a long time).
Right. The only danger with a job taking much longer than estimated is that it can outlast the timeout of the exclusive lease, and we begin running overlapping jobs.
I wonder if we can do this separately from batch optimization and rather add a safety on the execution side: We know when the job started, so we can easily stop after $interval has passed (and update job's batch size to reflect that retroactively?)
That makes sense, we already have #336672 for this as well. We can also leverage the work @dfrazao-gitlab has done to automatically split_and_retry! failed jobs with a too-large batch_size.
Also, regarding stopping the job after $interval, I don't think we have yet identified an easy way to do that.
We can move the "outside" batching into the wrapper, and then the job is called for each sub-batch. After each iteration we can check the elapsed time and exit if we've passed the interval.
But we still have to handle long-running queries, which means either controlling those timeouts or forcibly killing jobs after too much time has elapsed.
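As a sketch of that wrapper-level check (names are hypothetical; long-running individual queries still need statement timeouts or a kill switch, as noted above):

```ruby
# Hypothetical sketch: the wrapper owns the sub-batch loop and bails out once
# the job has used up its interval; remaining rows are left for the next job.
def process_batch(job, interval:)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  job.each_sub_batch do |sub_batch|
    perform_sub_batch(sub_batch)

    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
    break if elapsed >= interval
  end
end
```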
Also, regarding stopping the job after $interval, I don't think we have yet identified an easy way to do that.
Yeah, this is the tricky part. Currently the background migration class (job_class_name) is a black box for the runner and the wrapper. There is not even a guarantee it will split the work into sub-batches.
Provocative question: What if we can re-establish our former assumption of a uniform data distribution and therefore we don't have to worry about this at all?
@dfrazao-gitlab and I were just discussing this - can we support batching a "table with a filter" or even a "relation" (= any SQL query)?
The uniform data distribution assumption is really a nice thing to work with and reason about - wonder if we can keep this.
@abrandl I would love it if we could achieve that goal!
But aren't we forced to go away from that assumption once we have filtered batch queries that are too expensive to execute on a large production system?
We are even observing queries that try to find the min and max boundaries of such uniformly distributed batches timing out on GitLab.com. So, how would we solve this differently in this case? (it would be so great to find a way to do so)
@iroussos In previous cases I know of, we made sure we have the appropriate index for this situation, i.e. when filtering on abc = ?, we'd want an index on (id) WHERE abc = ? or (abc, id) or similar. This scopes the view down into parts of the relation you care about and supports batching.
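For example, something like this (a hypothetical migration; the index name and filter are illustrative):

```ruby
# Hypothetical example: a partial index supporting batching over routes
# filtered by source_type.
class TmpIndexRoutesOnIdWhereNamespace < Gitlab::Database::Migration[1.0]
  INDEX_NAME = 'tmp_index_routes_on_id_where_namespace'

  disable_ddl_transaction!

  def up
    add_concurrent_index :routes, :id, where: "source_type = 'Namespace'", name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :routes, INDEX_NAME
  end
end
```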
Can you think of an example where this wouldn't work?
Of course, if the filter involves a join, this may not work.
@abrandl This is a good idea too, but the biggest blocker is probably that many of these types of migrations require an additional index for batching to perform well.
In cases where we have overall small percentage of rows to migrate, adding an index probably makes the most sense to make the migration efficient.
But in cases where we have a large percentage of rows (but less than the entire table), we sometimes already recommend "slow batching" to avoid adding an index, because of the extra overhead of that approach. If we want to disallow that pattern, that's fine too, just stating my reasoning for not making that a strict requirement when I created the issue.
@abrandl unfortunately there are all those times when we need to perform a slow iteration; joins, or a multi-column ordering that cannot have a proper temporary index defined, are some of the cases. But then there are those cases where even iterating through the index is too slow, or adding an index is too expensive. From my perspective, those last ones were among the reasons why we had to move to a new way of performing background migrations.
Not to mention how much more complex, involved, and error-prone migrations sometimes are when they require us to find the perfect index for the job.
I know that those may be the minority of migrations we have to ship, but those are the migrations that cause us the most problems and where we need a way to address the problem more holistically. I am afraid that basing our work on a uniform data distribution assumption will lead us to have to do the same tricks we are doing today.
But if we were able to figure out a solution for this while keeping the batched background migrations, that would be a huge win.
I also agree with the comment by @pbair, I am just not sure if we can always find a way around slow batching.
Didn't we even have some recent attempts at migrations that could not work in any other way, where authors tried the infinite-loop iteration approach (keep skimming from the top) or the job-rescheduling approach until the dataset is completed, because they could not find a way to make their migrations work even with an index added? I can't find an example right now, so I would love to be corrected if my memory fails me.
@iroussos @pbair @abrandl I've submitted !85152 (closed) with a proposal on how to resolve this. I can't think of anything else/better we can do at the moment. I also like this approach because it's just a code change, and easier to revert if we find it's not working as expected. I've asked @dfrazao-gitlab for an initial review; he'll pass it to one of you once ready.
BTW I am OOO in the next couple of weeks, feel free to make any changes in order to get it merged. Or just close the MR if we decide this is not the approach we want to go with.
@iroussos I am a bit stuck on this one. There is an MR with a proposed solution, but it will work only if a specific set of assumptions is correct, see !85152 (comment 933255931).
Not sure what to do from here. One option would be to just do nothing for now; at least after !79848 (merged) we now have a max_batch_size limit that should not be exceeded.
Batching on one relation while updating another (I'll call it child sub-batching further down this comment) can get us into trouble in general with the way we optimize batched background migrations right now, especially in the context of 1:N relationships (or N:M with an extra hop) as the sub-batch can be unpredictable and not directly correlated to the size of the batch.
time_efficiency semi-works right now in this case. But if we follow the update in !85152 (closed), we risk breaking our optimization for those cases.
I cannot personally think of a quick way to mitigate this. If we could somehow label those child sub-batching migrations, we could use that information inside time_efficiency to skip the adjustment. It's trivial to do so manually at creation time, but could we do that automatically as well?
If there are no other ideas on how to deal with this, my opinion is that we should rethink how we treat child sub-batching in general and then revisit this implementation.
For example, what happens if we batch on one relation and update another?
@krasio @iroussos What if we added a batching strategy that joins multiple tables? I've only ever seen this case in a migration that wants to paginate parent records to update child records, so we could have the batching strategy do something like the following:
```sql
SELECT parent.id, child.id
FROM parent
INNER JOIN child ON child.parent_id = parent.id
WHERE (parent.id, child.id) > ?
  AND <possible other where conditions to narrow down rows we want to update>
ORDER BY parent.id, child.id
LIMIT 1 OFFSET 1000
```
And in that case we'd be batching by the number of rows that we're actually updating.
Recommending this approach (and adding tooling to improve custom batching strategies) could also improve our batched background migration runtimes. Right now we often scan entire tables when we don't need to.
I couldn't convince postgres to use appropriate indexes in a 3-table scenario when testing this idea, but I think the 2-table scenario is significantly more common and that the 3-table scenario is possible with enough playing with the generated query.
When we started this project, the main focus was PK migration. We built a few features around that use case, but now the batched background migrations framework is available to all GitLab developers, and they want to apply filters, create weird relations, execute raw SQL, and move data between tables without any relation.
I agree with @stomlinson. A good first step would be to create a new batching strategy that supports relations. I will also suggest that, instead of accepting a table as an argument, we switch to a relation for queue_batched_background_migration. Then we can rethink our batch optimizer. If we see that we will have more scenarios, we can have multiple batch optimizers and connect them to the respective batching strategy.
One more thing that I feel is important too:
We should also rethink how we collect data used for the batch optimizer. To collect the number of affected rows, we have an assumption that the method used returns the number of updated rows. Sometimes this is not true.
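For example (a sketch; `raw_update_sql` is a placeholder), the count is only available when the write path reports it:

```ruby
# update_all returns the number of rows it updated, so it can feed the optimizer:
affected_rows = sub_batch.update_all(updated_at: Time.current) # => Integer

# Raw SQL through the PostgreSQL adapter returns a PG::Result; the count is cmd_tuples:
result        = connection.execute(raw_update_sql)
affected_rows = result.cmd_tuples

# Other write paths (e.g. insert_all for upserts) don't report updated rows
# in the same way, so the instrumentation can't assume a uniform return value.
```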
Starting with a generic batching strategy that supports relations (so we can cover both filters and joins) could be a nice first step if it's not complicating things too much. We can then build on top of that and properly implement the approach by @krasio (!85152 (closed)) while covering all cases. We can also revisit our batch optimizer in general.
One quick idea: we can bypass needing to store complex relation data in the database by just requiring a specific function or constant in the batched migration class. Something like:
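A hypothetical sketch of that shape (the method name and filter are illustrative, not an existing API):

```ruby
# Hypothetical: the job class itself exposes the relation to batch over, so
# nothing beyond the class name needs to be stored in the database.
module Gitlab
  module BackgroundMigration
    class ExampleBackfill < BaseJob
      # The batching strategy could call this when defined and fall back to
      # plain primary-key batching otherwise.
      def self.batching_scope(relation)
        relation.where(source_type: 'Namespace')
      end
    end
  end
end
```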
I agree it makes sense to add one or more new batching migrations that support common use cases. Then we can update batch size optimization to take the strategy into account, and eventually come back to the problem in this issue.
I will also suggest that, instead of accepting a table as an argument, we switch to a relation for queue_batched_background_migration
But how do we store this, as an SQL fragment or something else? Simon's suggestion seems like a way to avoid this: we can ask the migration class to provide this relation, and if there is none, default to what we do now.
One more thing that I feel is important too: We should also rethink how we collect data used for the batch optimizer. To collect the number of affected rows, we have an assumption that the method used returns the number of updated rows. Sometimes this is not true.
Yeah, this is not ideal. Not sure how we can improve it though.
I'll close !85152 (closed) for now, we can revisit it later. @iroussos what should we do with this issue? Close and create new issues for the suggestions above?
Simon's suggestion seems like a way to avoid this: we can ask the migration class to provide this relation, and if there is none, default to what we do now.
Yep I agree with that, sounds like a nice way to address this in a simple way.
[..] what should we do with this issue? Close and create new issues for the suggestions above?
@krasio From my point of view it seems like a valid issue to revisit after we are done with everything else, so I would keep it in the backlog until we are ready to come back to it. But you know better, so your call!