This issue tracks the effort to create a cronjob with omnibus to trigger database index rebuilds automatically.
This should trigger a rake task (gitlab:db:reindex_auto or similar) at a set schedule. We might want to make the schedule configurable from the user perspective.
Note the rake task does not exist yet; we'd create it as a stub (no-op) if that helps to set up the cronjob (so we can add the code to it later, but the hook is already in place).
On GitLab.com, this cronjob should be deployed to an instance that has direct database access (no pgbouncers involved) - e.g. the same host we run database migrations on.
@mayra-cabrera Do we currently have windows of time throughout the week where we don't deploy anything?
This is about recreating indexes in the background, which has conflict potential with concurrent deployments (because of their database migrations). Now there is mitigation for this in #246497 (closed), but I wonder about its priority.
If we have fixed times throughout the week where we don't deploy, we might just start with scheduling the reindexing for that.
I wonder if this index recreation is something we can schedule in a post-migration so it's executed at the end of the deployment. How do we plan to start this? With some specific bloated indexes?
Thanks @mayra-cabrera ! We might just start with scheduling for the weekends and see if that's already enough to maintain a good state.
> I wonder if this index recreation is something we can schedule in a post-migration so it's executed at the end of the deployment
That crossed my mind, too - I had discarded it because it would delay deployments, perhaps by a long time (some indexes are huge and take a long time to rebuild). Ultimately this is unrelated to deploys, so I would like to keep it separate. Also, for self-hosted installations, this should run more regularly than the upgrade process.
> How do we plan to start this? With some specific bloated indexes?
Yeah, there is #246498 (closed) which would implement a heuristic to select a good candidate (based on a bloat estimate).
We might have a couple of steps here:
1. Rake task requires an index name as a parameter (this is where we are right now)
2. Rake task automatically selects an index at random (perhaps preferring larger tables)
3. Rake task becomes smarter and selects based on a bloat estimate or the history of reindexing
Regarding (3): it might be easiest to just cycle through all indexes one at a time, keep track of the history (and perhaps even the impact each reindexing had), and make smarter choices over time... Perhaps this is also slightly overengineered.
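A minimal sketch of what the selection in (2)/(3) could look like, assuming the size, bloat estimate, and reindexing history are already available (all names here are illustrative, not existing GitLab code):

```ruby
# Illustrative only: in reality size and bloat figures would come from
# Postgres catalog queries, and the history from a tracking table.
IndexInfo = Struct.new(:name, :size_bytes, :bloat_ratio, :last_reindexed_at)

# Pick the most bloated (then largest) index that was not rebuilt recently.
def pick_reindex_candidate(indexes, now: Time.now, cooldown: 7 * 24 * 3600)
  indexes
    .reject { |i| i.last_reindexed_at && now - i.last_reindexed_at < cooldown }
    .max_by { |i| [i.bloat_ratio, i.size_bytes] }
end
```

The cooldown keeps the task from cycling on the same index; tracking the observed size reduction per rebuild would be the "smarter over time" part.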
@ibaum I would like to add a cronjob that executes a Rake task on a set schedule. Would you mind pointing me in a direction how to do this best with omnibus, please?
A cron sidekiq job won't work: We can't go through pgbouncer here, so the job would have to execute on a host that has a direct database connection. For GitLab.com, that's deploy-01, for example.
Is crond_job what I'm looking for in omnibus? Would it work to create a gitlab::database_tasks cookbook or similar and have that included for the runlist for the deploy host?
Ideally, there would only be a single host where this is being executed from - but we can add a guard against concurrent execution to the Rake task, too.
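The guard against concurrent execution could use a Postgres session-level advisory lock; a rough sketch, assuming an ActiveRecord-style connection object (the lock key and helper name are made up for illustration):

```ruby
# Guard against concurrent reindexing runs via a session-level advisory
# lock. REINDEX_LOCK_KEY is an arbitrary constant; it only needs to be
# the same on every host that might run the task.
REINDEX_LOCK_KEY = 2_000_001

def with_reindexing_lock(connection)
  acquired = connection.select_value("SELECT pg_try_advisory_lock(#{REINDEX_LOCK_KEY})")
  unless acquired
    puts 'Another reindexing run is in progress - skipping.'
    return false
  end

  begin
    yield
    true
  ensure
    # Release the lock even if the rebuild raised.
    connection.select_value("SELECT pg_advisory_unlock(#{REINDEX_LOCK_KEY})")
  end
end
```

`pg_try_advisory_lock` returns immediately instead of blocking, so an overlapping cron run just exits cleanly.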
The alternative is to just configure the cronjob on the deploy host on GitLab.com for now and integrate it into omnibus later. I'd prefer to use omnibus from the start, if we can.
@abrandl `crond_job` is definitely the way to go. It uses the go-crond daemon we bundle with omnibus. You can look at the `crond_job` definition to see how it works.
I think a separate recipe would be great. omnibus does have a postgresql cookbook, so perhaps that is the appropriate place? Something like `postgresql::tasks` or `postgresql::maintenance_tasks`?
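Such a recipe could look roughly like this - a sketch only, modeled on how other `crond_job` definitions in omnibus look; the recipe path, user, and schedule are all assumptions to be confirmed against the actual resource definition:

```ruby
# Hypothetical recipe, e.g. postgresql/recipes/maintenance_tasks.rb.
crond_job 'database-reindexing' do
  user 'git'
  # Example schedule only: Saturdays at 02:00.
  minute '0'
  hour '2'
  day_of_week '6'
  command '/opt/gitlab/bin/gitlab-rake gitlab:db:reindex_auto'
end
```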
Where to include it does get tricky. Does the rake task connect directly to the db? Or through rails?
If direct, we could include it in the postgresql::enable recipe. That would require the task to behave appropriately when run on a read only db instance.
If rails, it gets trickier. I would need to double check the reference architecture and see where it would work.
So the rake task would use the standard Rails connection configuration - whatever is configured on the instance. However, it cannot go through pgbouncer (this is where it gets tricky).
We're only starting to implement the reindexing, and right now I'd just like to set this up for .com for starters - which, to my knowledge, also doesn't use the postgresql cookbooks.
> If rails, it gets trickier. I would need to double check the reference architecture and see where it would work.
I could see this working for .com if we had another cookbook, `gitlab::database_tasks`, and included it in the runlist of the desired hosts (through our own chef setup), but nowhere else for starters.
As a second step, we can see how this fits the reference architecture. I don't have enough insight here, so I would probably create an issue for the Distribution group at some point, if that's ok.
@abrandl One thought on the order in which indexes are rebuilt: if we have a list of good candidates for reindexing, we should perhaps go from the smallest to the biggest index to avoid running out of disk space - creating the new index requires additional disk space first, before we free up space by dropping the old, bloated one.
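That ordering idea could be as simple as the following sketch (made-up names; the point is that a concurrent rebuild temporarily needs room for both the old and the new index):

```ruby
# candidates: array of [index_name, size_bytes] pairs.
# Rebuild smallest-first and stop at the first index that would not fit
# into the currently free disk space - since the list is sorted
# ascending, nothing after it would fit either.
def reindex_order(candidates, free_bytes:)
  candidates
    .sort_by { |_name, size| size }
    .take_while { |_name, size| size < free_bytes }
    .map { |name, _size| name }
end
```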
@abrandl I think for gprd and gstg we are fine disk-space-wise for now, and this can wait for a later iteration. We alert at 90% disk space usage, so we might get paged one day because of index rebuilds, but we are not really at risk of running out of space. That may be different for other environments or customers, though - so thanks for looking up the stats and opening that issue!