Sometime around the week of April 26, 2020, our import/export performance pipeline caught what appears to be a degradation in project export performance of as much as 25 minutes (28%) on average. Here is a breakdown of the timing from before and after April 26:
The data is currently aggregated and reported through Slack.
Our history doesn't extend much further back, but we do get fresher numbers daily. Unfortunately, we don't have any insight into these runs beyond the raw timing data, so we don't know what happened. However, this is cause for concern, especially since we were looking to improve performance with the ndjson work.
Thanks @mkaeppler - do we think this should be something gitlab-org/memory-team/team-tasks10690783 looks into, perhaps given our knowledge here, or should we assign it over to gitlab-org/memory-team/team-tasks11353826?
Maybe we can use this issue to do some knowledge sharing with gitlab-org/memory-team/team-tasks11353826. We could pick it up, but (assuming we find the cause) document every step we took to get there so that in the future, gitlab-org/memory-team/team-tasks11353826 can do it independently.
The main challenge here will be that we don't have specific performance metrics from past runs to compare against. This work is still in progress, so currently we have nothing to go on but hunches. Perhaps the first step should actually be to find out what happened around that time (did any larger changes get merged? I already checked that with gitlab-org/memory-team/team-tasks~11353826 and it sounded like no; did any infrastructure changes occur that would have an impact, e.g. on DB performance?).
I would love to get to a point where in the long run, it's not just gitlab-org/memory-team/team-tasks~10690783 that looks into performance regressions. Just like maintaining test coverage is not the TestEng team's responsibility but instead everyone's, maintaining performance should be everyone's responsibility and not just ours. Perhaps we should even hand over the import-export-performance project. I think it shouldn't be our group's focus to end up monitoring every feature under the sun for performance regressions, but we can help show the way and set up the tooling & infra to make it easier to do so. This is probably a larger directional discussion to have for our group and not specific to this issue here, but that's how I feel about it anyway.
> I would love to get to a point where in the long run, it's not just gitlab-org/memory-team/team-tasks~10690783 that looks into performance regressions.
💯
> Maybe we can use this issue to do some knowledge sharing with gitlab-org/memory-team/team-tasks11353826. We could pick it up, but (assuming we find the cause) document every step we took to get there so that in the future, gitlab-org/memory-team/team-tasks11353826 can do it independently.
Maybe gitlab-org/memory-team/team-tasks~11353826 can pick it up, but we could offer to pair with or shadow them?
Yep, for sure -- I'm already doing a little bit of prep. For instance, I was curious whether this is reproducible locally, so I'm looking into setting up that large project again, but I'm running into some unexpected problems. Once I've set this up, I can pair up with someone from import 👍
I'll close this issue since the problem has not persisted. It's highly unsatisfying that we never got to the bottom of this, but if we continue to work towards better monitoring tools (such as the ability to grab measurements from import/export service executions, which is underway), it should be easier to assess if this happens again.
Pull two export profiles, one from before that date and one from after it (by checking out the respective commits from those days), and compare. We will only be able to do this locally, but using the rake task we get more metrics than just execution time.
If the previous step shows the expected difference on a dev machine, we know it's code and not infra that caused the problem. Try to git bisect until the offending commit is found (see the sketch after this list). This may take a long time, since exports of this size are slow...
If we cannot reproduce this locally, i.e. if the execution times are more or less equal regardless of HEAD (before or after the given date), then it might be related to infrastructure on .com. In that case we might just have to reach out and see what has changed in that area.
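A minimal sketch of that bisect loop, assuming the gitlab:import_export:export rake task is used to time exports; the task name and argument order follow the documented import/export rake tasks, but should be verified against the local checkout, and the commit placeholders need to be filled in by hand:

```shell
# Replace the placeholders with commits deployed just before/after April 26.
git bisect start
git bisect bad  <commit-after-april-26>    # export was slow here
git bisect good <commit-before-april-26>   # export was fast here

# At each step git checks out a candidate commit; reinstall gems, time a full
# export of the reference project, then mark the commit good or bad by hand.
bundle install
time bundle exec rake "gitlab:import_export:export[root, gitlab-org, gitlabhq, /tmp/gitlabhq_export.tar.gz]"
git bisect good   # or: git bisect bad
```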
@nmilojevic1 while looking at this, I noticed that something isn't right with the way we now pull metrics via the measurement_enabled flag. I had to reimport gitlabhq first in order to then measure exports. As soon as I run the import task, it dumps the metrics even though the job has barely started:
```
I, [2020-05-04T14:46:06.722474 #107] INFO -- : Importing GitLab export: /gck/i@work/gitlabhq_export.tar.gz into GitLab gitlab-org/gitlabhq as Administrator
I, [2020-05-04T14:46:08.891928 #107] INFO -- : ---
:class: Projects::GitlabProjectsImportService
:current_user: Administrator
:project_full_path: gitlab-org/gitlabhq
:import_type: gitlab_project
:file_path: "/gck/i@work/gitlabhq_export.tar.gz"
:gc_stats:
  :count:
    :before: 61
    :after: 62
    :diff: 1
...
I, [2020-05-04T14:57:35.814375 #107] INFO -- : Done!
```
i.e. it printed gc_stats at 14:46, when I started the import, not at 14:57, when it finished. Any ideas?
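For context, the log above came from running the import rake task; a hedged sketch of the kind of invocation used (the argument order follows the documented gitlab:import_export:import task, and the username and paths here are assumptions to adjust locally):

```shell
# Re-import the reference project, which is what triggers the metrics dump above.
bundle exec rake "gitlab:import_export:import[root, gitlab-org, gitlabhq, /gck/i@work/gitlabhq_export.tar.gz]"
```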
So far I have not been able to reproduce this locally. The same project exports way too quickly on a dev machine (< 1m), and I checked out commits as far back as April 20th with no notable difference in performance.
This is explained by a postgres upgrade we were running over the weekend, during which sidekiq workers were disabled and later started back up:
> After the postgres upgrade test in staging, sidekiq wasn’t restarted and stayed down over the weekend. This explains the low CPU from 4/24 to 4/27.
(via Slack)
CPU and memory utilization seem to have fallen back to where they were before, so it's probably not related.
I pulled the commit range from what was deployed at the time, but it's massive and probably impossible to comprehend: 57d990db...3a9a322a
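One way to make that range more tractable might be to restrict it to import/export code; a hedged sketch, where the paths are assumptions about the repository layout:

```shell
# List only commits in the deploy range that touch import/export related code.
git log --oneline 57d990db...3a9a322a -- lib/gitlab/import_export app/services/projects/import_export
```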
We might have to wait until we can pull more exhaustive performance metrics from this pipeline to start investigating this. Another option could be to repeat the local exports with a different, much larger project, since maybe the difference in timing only becomes apparent at a certain scale, but it might be comparing apples and oranges so it's a long shot.
I don't think this issue should be a blocker for the postgres rollout though. We have no evidence yet that this is the cause, and I would imagine that if there was a performance regression caused by the upgrade, it would have surfaced elsewhere on the site, too?
I was looking at Thanos as well, and honestly I'm not sure this query even makes sense, but I was trying to plot how much time was spent, on average, processing exports in staging-sidekiq (and also time spent in DB) during that period, and it does not look like it started to plateau after the 26th:
I also repeated these queries for the cpu and gitaly seconds buckets with similar results.
@andrewn Could you confirm if that is an OK way of looking at it? I wasn't sure if averaging over the Inf bucket (which should contain all samples?) makes sense.
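For what it's worth, the +Inf bucket should indeed contain all samples (its count matches _count), but the conventional way to chart an average from a Prometheus histogram is rate(_sum) divided by rate(_count). A hedged sketch against the Thanos HTTP query API; the endpoint URL, metric name, worker label, and range window are all assumptions:

```shell
# Average export job duration over 1h windows, via the Prometheus/Thanos query API.
THANOS_URL="https://thanos.example.com"   # placeholder endpoint
QUERY='sum(rate(sidekiq_jobs_completion_seconds_sum{worker="ProjectExportWorker"}[1h]))
       /
       sum(rate(sidekiq_jobs_completion_seconds_count{worker="ProjectExportWorker"}[1h]))'

curl -sG "${THANOS_URL}/api/v1/query" --data-urlencode "query=${QUERY}"
```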
Side note: probably not even related, but I did notice that the priority for these jobs changed from export to just missing sometime after April 29.