Professional Services is working to migrate a customer under a time constraint. So far, the import for this customer has been unsuccessful. The timeline is as follows:
May 14th - migration canceled
May 21st - target migration due date
If May 21st is missed, next opportunity is late August
They are seeing inconsistent behavior when migrating merge request data:
General discussion comments are hit or miss, currently not importing at all from our test repo for merge requests, and barely importing from our test repo for issues
Git diffs on outdated diffs
Diff comment threads are imported as individual comments instead of threads
Issues and Merge Requests are successfully migrated to GitLab from GitHub, but not all comments within them.
Here is that same MR on the GitLab side after the latest finished import test (2021-05-14):
What we are seeing is the DiffNotes are mostly making it over, but some of the regular comments are not showing up at all.
For example,
On GitHub:
From one of the attempts with comments:
The DiffNotes are at least showing up somewhat. The git diff itself is not consistent, and threads do not appear to be retained, but there is still some record of it which is a great improvement.
What is the current bug behavior?
Not all general discussion comments are being migrated, and comments that happen on diffs show up as individual comments rather than inside threads.
What is the expected correct behavior?
All general discussion comments should be migrated over, and comments on diffs should be nested as a thread in order to preserve context.
Relevant logs and/or screenshots
Customer migration with massive data size:
17GB repo [Base size, increases with 3k+ branches; final size with git pack data files is over 40GB for one repo/project]
70,000+ MRs
Previous/Possibly unrelated fixes
Fix #1) Adding cache to resume migration after time-out
Fix #2) Review importer when the author doesn't exist anymore
All of these fixes have been applied and tested in a GitLab "sandbox" which replicates AVI dev (10-15%). None of these fixes address the issue of MRs (and MR content) being left behind.
@mross1 Thanks for creating this issue to track all the changes. I have looked at your list above to get the status for each line item and this is what I see at this time:
Adding cache to resume migration after time-out. MR: !60668 (merged)
Review importer when the author doesn't exist anymore. MR: !61257 (merged)
Github importer failing with undefined method 'id' for nil:NilClass. Issue: #330294 (closed)
The first thing I noted was that the MR link from #1 is associated with the issue in #4, and the MR in #2 is related to the issue in #5. Is that accurate? That would mean we are really only tracking 3 issues and 3 associated MRs.
The other thing I noted was that 4 out of 5 are already closed and the last one (#3) is in review.
Does this mean that Engineering has completed all the identified issues (pending the review for #3)?
@hdelalic Thank you so much for following up. I've asked @jmarquez2 & @leopardm to weigh in here, as they have been doing the testing that product/engineering has not been able to do: a small-scale "simulation" of the customer's unique, large data situation.
@jmarquez2 are we trying to fix all the errors in the error log you posted before? Is that what is driving the description of this issue? My understanding is that the real problem that is left is:
None of these fixes address the issue of MRs (and MR content) being left behind.
Could you speak to that, and also elaborate on what you know about MRs/MR content being left behind? If it is only some content being left, what content specifically? General discussion comments, comments on diffs (or specific diffs), labels, etc? Are you thinking that fix #3 will resolve this?
If it is only some content being left, what content specifically? General discussion comments, comments on diffs (or specific diffs), labels, etc?
General discussion comments are hit or miss, currently not importing at all from our test repo for merge requests, and barely importing from our test repo for issues
Git diffs on outdated diffs
Diff comment threads are individual comments instead
We are seeing issues and merge requests make it over to GitLab, but without all the comments. Kassio's rate limit fix improved the quality of our GitHub import immensely, but we are still not seeing all comments make it over to GitLab.
Here is that same MR on the GitLab side from multiple import attempts:
After the rate limit fix
After another patch
Latest finished import test with all of the fixes as of a day ago
What we are seeing is the DiffNotes are mostly making it over, but some of the regular comments are not showing up at all.
For example,
On GitHub:
From one of the attempts with comments:
The DiffNotes are at least showing up somewhat. The git diff itself is not consistent, and threads do not appear to be retained, but there is still some record of it which is a great improvement.
@leopardm this is incredibly helpful and exactly what I was looking for. If this is the problem that is left, and there are no others, I would suggest that the focus of this issue is on exactly what you have posted here, as opposed to the current issue description. I will update to reflect that once you confirm for me. I believe there may also be existing issues for other groups that relate to this.
It is also curious that the author comments have not migrated. I wonder if that's coincidence, or if there is something happening specific to the author. I am not the best person to speak to that though.
@lmcandrew + @m_gill Do we have someone in Manage who can take a look at this as soon as possible to fix a blocking issue with the customer? Otherwise I would also look outside Manage.
I noticed that the title of this issue doesn't match the description. While the title is "Importing from GitHub does not migrate all MR comments", the description says "The git diff itself is not consistent, and threads do not appear to be retained, but there is still some record", which in my understanding is:
bug: "The git diff itself is not consistent"
feature request: "threads do not appear to be retained, but there is still some record"
Given how the focus of this issue already changed multiple times and the number of threads discussing multiple different subjects, I suggest we close it and open one specific issue for each bug or feature request where we can keep more focused discussions.
Following the bias for action I already created a feature request issue (#336596 (closed)) to import diff notes in a threaded manner.
Thank you for the new issue @kassio. The most important aspect of this topic is that there should be 285 of the GitHub comments migrated and there are not. Do you feel like #332630 (closed) is the specific issue for that bug? That is where we need the most focus.
My understanding is that we have been creating specific issues for each bug/feature which should be added to the Solution section, and that this issue should remain open to group those together.
The most important aspect of this topic is that there should be 285 of the GitHub comments migrated and there are not. Do you feel like #332630 (closed) is the specific issue for that bug?
Yes, that's my understanding: #332630 (closed) is specific to the inconsistent number of comments migrated. This issue, I'm assuming, is fixed.
Michelle Gill changed title from GitHub Import tool limitations - Customer blocked to Importing from GitHub does not migrate all MR comments (GitHub Import tool limitations - Customer blocked)
Michelle Gill changed the description
@haris @lmcandrew I have updated the issue to reflect what ~"group::import" might be more used to. Is this format easier to digest, prioritize, and schedule? If not, feel free to make changes.
FYI: Given the due date and the amount of back and forth of this one, I scheduled a technical sync up, Monday, with @leopardm, @jmarquez2 and @tpresa (Not sure if everyone is needed) with the intent to speed up the fixing process required here. I'm also assigning myself as dev DRI of this issue to facilitate the communication.
@hdelalic & @m_gill I'm hoping that we can keep Kassio's attention through next week on this. I'd like to have a sync call on the 27th before we head on holiday. cc: @michael_lutz @lmcandrew
2021-05-21: Customer stand-up meeting notes
Test 1 - issue with GitHub API lib. + more verbose logs. Performed better but not 1 to 1
Test 2 - Same import HA, large Sidekiq cluster - performed worse. Analyzing logs now
Still not looking good
Kassio is working on a pagination issue
Michael is making sense of logs from 2 tests - creating scripts to parse the GDK log output
Octokit test is wrapping up today
We have determined infrastructure for testing can be a simpler set-up
Rate-limit fix brought in ~50% more data but we still have work remaining
Michael will continue on this path into next week, as will Jo
TODO: Melani to send an update on Monday May 24th in the slack channel
TODO: Jo to cancel M-W-F meetings next week. Keep Tues & Thurs meetings May 25 & 27
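The log-parsing scripts mentioned in the stand-up notes could look something like the minimal sketch below. It assumes a hypothetical GDK/importer log format of one JSON object per line with `importer` and `message` fields; the real log schema may differ, so treat field names as placeholders:

```python
import json
from collections import Counter

def count_imported_objects(log_lines):
    """Count imported objects per importer class from JSON-per-line logs.

    Assumes a hypothetical schema: {"importer": "DiffNoteImporter",
    "message": "github importer finished"} - adjust to the real one.
    """
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines mixed into the output
        if entry.get("message") == "github importer finished":
            counts[entry.get("importer", "unknown")] += 1
    return counts

sample = [
    '{"importer": "DiffNoteImporter", "message": "github importer finished"}',
    '{"importer": "NoteImporter", "message": "github importer finished"}',
    'plain text noise',
    '{"importer": "NoteImporter", "message": "github importer finished"}',
]
print(count_imported_objects(sample))  # Counter({'NoteImporter': 2, 'DiffNoteImporter': 1})
```

Comparing these per-type counts against the counts reported by the GitHub API is one way to quantify exactly which objects are being left behind.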
@mross1 I have updated the list of issues we are tracking to get the current status. It seems that items 1.-5. are completed and item 6. is in the initial review. Please let me know if anything is missing here.
1. Adding cache to resume migration after time-out. MR: !60668 (merged)
2. Review importer when the author doesn't exist anymore. MR: !61257 (merged)
If I'm not mistaken (@kassio can correct me then), !62036 (merged) is the only one that isn't included in 13.12. AFAIK the order doesn't matter: if they have conflicting changes, you'll need to resolve the conflict manually.
2021-05-26: We still don't have a full fix for the customer.
Still missing comments.
Diffs are also missing.
@hdelalic What are the next steps? It seems we have exhausted the options in terms of the fixes merged by @kassio. And thank you, Kassio, for your dedication here.
@hdelalic The 2 issues listed above are directly related to the original issue. Nothing "new" has surfaced, only variations of the same issue. The customer is stuck and cannot migrate. This high-value deal is very much at risk, as we're still dead in the water here and unable to move forward. The 6 fixes above do not solve the problem.
The Problem = Massive data size: 17GB repo [Base size, increases with 3k+ branches; Final size with git pack data files is over 40GB for one repo/project], 70,000+ MRs
Impact: Import/migration fails due to size limit. See details on the Issue tab of the Customer Project Definition document. FYI @m_gill @lmcandrew @ogolowinski - @michael_lutz has set up a meeting for us all to discuss next steps in terms of how PS is to handle customer migrations going forward
@mross1 Thanks for the update. The document you referenced states None of these fixes address the issue of MRs (and MR content) being left behind. Based on this, I have some questions:
Are the MRs that aren't being imported always the same? i.e. is it a consistent, repeatable problem?
Do you have a rough idea of % of missing MRs from the most recent import?
@kassio have you been able to reproduce the problem locally?
@kassio in addition to the above, do you have any expectation that !62036 (merged) will make the process more reliable, thus solving the problem? My instinct says no.
Are the MRs that aren't being imported always the same? i.e. is it a consistent, repeatable problem?
The MRs usually get imported, but they lack a lot of metadata, specifically the comments and diffs. The comments and diffs are inconsistent. The example above is still occurring after the patches are applied.
Do you have a rough idea of % of missing MRs from the most recent import?
The last test I ran showed 155/285 comments for the large pull request I am using to validate the changes.
Next steps: (I might be missing some, as I dropped off partway through)
Professional Services needs "guard rails" from Engineering to understand the limitations of the system and make decisions. DRI for this is Engineering, noting that the "bar" where we are confident saying the functionality will 100% work might be low
Kassio, Jo Marquez, and Michael Leopard will keep this issue up to date with test results and current status so that we are aligned on what is outstanding and who we are waiting on
We will need to enlist QA to help test on this environment, or one like it, in the near term. DRI for this is Product in terms of prioritizing the work for QA. Currently we are testing only on Jo Marquez's/Michael Leopard's testing environments and need to wait on their results. Michael Leopard already has this data.
Melani will be meeting with customer tomorrow with Haris attending in order to communicate that we do not have a solution
Haris will create a new issue for collaboration with Professional Services with test logs and documentation of the current state so that George can get up to speed quickly (Note: I was not on the call - why is this needed? I thought that was the purpose of this issue? If we are being transparent in this issue, does this need go away?)
Haris will create a new issue for collaboration with Professional Services with test logs and documentation of the current state so that George can get up to speed quickly (Note: I was not on the call - why is this needed? I thought that was the purpose of this issue?
We had broken out each undesirable behavior into a separate issue, listed in the Solution section above. This was done to keep track of each issue that we needed to fix, while this issue is used to track the overall progress and to group those individual issues in one place.
Given that we have identified the next issue we want to fix (inconsistent comments and diffs in the latest tests), a targeted issue around that incident would allow us to focus our efforts over the next several days and to continue iterating with smaller issues toward the overall solution.
Another engineer has been getting up to speed and has documented great details and nuances #332630 (comment 597856784)
We have validated that these problems are intermittent, possibly due to Infrastructure/networking/sidekiq errors, by successfully importing one of the problematic MRs locally and by investigating an updated set of logs provided by Professional Services
Infrastructure is involved to increase and dedicate resources to Import jobs in order to improve reliability #332616 (closed)
A solution has gone out to handle rate limits from GitHub more efficiently. This could make a considerable difference, but will need to be validated. !62036 (merged)
I have re-opened #332630 (closed) with a specific focus on understanding the intermittent problems we are seeing. The reason for doing so is described #332630 (comment 602765058). As far as we are aware we still haven't been able to successfully import all data, although there was some confusion around this #332630 (comment 602756582).
Haris Delalić marked the checklist item 8. Mark relation as imported after the importer runs, instead of when import is scheduled. Issue: #333246 (closed) as completed
Haris Delalić changed the description
Just to align on the work from last week related to this problem:
The team continued to investigate the missing comment issue in #332630 (closed) and found that one problem might be related to networking or infrastructure, and not import code itself. More logging is being added to the system (actually went out today) in order to understand how many objects are getting imported compared to how many objects should have been imported.
!64256 (merged) was merged, which improves logging for each imported object as well as showing a count of the total number of objects imported.
The improved logging helped us identify a couple of bugs, which there are MRs to resolve: !64558 (merged), !64645 (merged).
As per #332630 (comment 607121170), we are still investigating some intermittent problems caused by network or database timeouts - we are trying to identify the root cause of these issues, and we are also considering whether it is feasible to introduce robust retry logic #334270.
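The "robust retry logic" under consideration in #334270 could be sketched roughly as below. This is a hypothetical illustration, not the actual implementation: exponential backoff with jitter around a flaky operation, with the caveat (noted in the comment) that retried import steps would also need to be idempotent:

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=1.0,
                 retryable=(TimeoutError, ConnectionError)):
    """Retry a flaky operation with exponential backoff and jitter.

    Hypothetical sketch: a real importer would also need each retried
    step to be idempotent so re-running it cannot duplicate objects.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Example: an operation that times out twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient network timeout")
    return "ok"

print(with_retries(flaky_fetch, base_delay=0.01))  # ok
```

The jitter spreads out retries so many Sidekiq workers hitting the same transient failure do not all retry in lockstep.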
@lmcandrew & @m_gill Thank you for your continued support here. I've been following the other issues where the details are surfacing. Seems we still don't have a path forward. I'd like to loop back with you both regarding EOW (end of week) messaging to the customer. Thank you.
@mross1 For some reason I did not see this notification! Yes, I will be looping back with you. I am just waiting to see results from the latest import.
@m_gill I was in a call with Sakamoto today, and he suggested we all get on a call and come up with a plan forward that would then be brought to him for exec approval/escalation.
Please confirm as I've certainly brought this to others beyond this group, for instance to Marin. I'm unclear who the "working group" should be in terms of preparing a statement/plan for Sakamoto as to how we move forward.
@mross1 Per this comment, it seems that we have come across an undocumented limitation of the GitHub API which doesn't allow us to fetch issue comments and MR comments past a certain number.
We will look into how we could overcome this limitation, but I was wondering if we could also ask the customer to open a case with GitHub to fix this on their end by removing the limitation?
It may still be possible for GitHub to make changes to their API to accommodate their customer's needs, or maybe make a temporary increase in those limitations for their customers.
We don't seem to have any quick solutions to work around those limitations.
Per this comment, it seems that we have come across an undocumented limitation of the GitHub API which doesn't allow us to fetch issue comments and MR comments past a certain number when fetching a page at a time.
I have asked @gitlab-org/manage/import/backend Engineers to proceed with implementing a solution to fetch one comment at a time, as described here.
This solution will be much slower than our current importer but should end up fetching all the comments. We will put this solution behind a feature flag, so that customers can enable it, as needed.
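Conceptually, the fallback could look like the sketch below. The client, method names, and page cap are hypothetical stand-ins (the real change lives in the importer MRs); the point is the trade-off between fast paged fetching that silently stops at an API cap, and a slow one-request-per-comment path that is complete:

```python
PAGE_SIZE = 100
API_PAGE_CAP = 4  # hypothetical: the API silently stops after N pages

def fetch_comments(client, mr_id, single_mode=False):
    """Fetch all comments for a merge request.

    Paged mode is fast but (hypothetically) capped; single mode issues
    one request per comment - far slower, but complete. In GitLab the
    slow path was put behind a feature flag for opt-in use.
    """
    if not single_mode:
        comments, page = [], 1
        while True:
            batch = client.list_comments(mr_id, page=page, per_page=PAGE_SIZE)
            comments.extend(batch)
            if len(batch) < PAGE_SIZE:
                return comments  # a short page looks like the end of the data
            page += 1
    return [client.get_comment(mr_id, cid) for cid in client.list_comment_ids(mr_id)]

class FakeCappedClient:
    """Simulates an API that stops paginating after API_PAGE_CAP pages."""
    def __init__(self, total):
        self.ids = list(range(total))
    def list_comments(self, mr_id, page, per_page):
        if page > API_PAGE_CAP:
            return []  # silent cap: indistinguishable from "no more data"
        start = (page - 1) * per_page
        return self.ids[start:start + per_page]
    def list_comment_ids(self, mr_id):
        return self.ids
    def get_comment(self, mr_id, cid):
        return cid

client = FakeCappedClient(total=570)  # more comments than the cap allows
print(len(fetch_comments(client, mr_id=1)))                    # 400 (silently capped)
print(len(fetch_comments(client, mr_id=1, single_mode=True)))  # 570
```

This mirrors the observed symptom: the paged importer reports success while quietly returning only part of the data, which is why counts like 155/285 surface only when validated against the source.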
@david Depending on the number of each data element in the data set, it could take multiple times longer, e.g. from 8h to 24h (an illustration, not a real measurement).
This works because we spend most of the time during import waiting for the GitHub API to allow us to proceed, as we get rate-limited quickly when importing large projects.
This solution would be accessible through the API, so ProServ would be able to take advantage of it
In the meantime, the customer was provided with a patch that they can use to test this solution. They are expected to apply the patch and kick off a test import tomorrow (8/4).
Accelerating GitHub import by rotating multiple access tokens would be a fairly large effort and we are not prioritizing it at this time. The customer seems ok with waiting longer for the import if it means that all of their comments will be migrated.
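The rate-limit waiting described above, which dominates large imports, can be sketched as follows. It uses GitHub's documented `X-RateLimit-Remaining` / `X-RateLimit-Reset` response headers (the reset value is a Unix timestamp); the helper itself is an illustrative sketch, not GitLab's actual implementation:

```python
import time

def wait_for_rate_limit(headers, now=None, clock_sleep=time.sleep):
    """Sleep until GitHub's rate-limit window resets, if out of quota.

    Reads GitHub's documented X-RateLimit-Remaining / X-RateLimit-Reset
    headers. Returns the number of seconds waited.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0  # quota left: proceed immediately
    reset_at = int(headers.get("X-RateLimit-Reset", now))
    wait = max(0.0, reset_at - now) + 1  # +1s of slack past the reset
    clock_sleep(wait)
    return wait

# Example with an injected no-op sleep so the sketch runs instantly:
waited = wait_for_rate_limit(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1060"},
    now=1000,
    clock_sleep=lambda s: None,
)
print(waited)  # 61
```

Since this wall-clock waiting is the bottleneck, a slower one-comment-per-request mode mostly changes how the same rate-limit budget is spent rather than adding proportional wall-clock time, which is why the customer accepted the longer import in exchange for completeness.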
With the delivery of #332630 (closed), Engineering has completed all the identified issues related to this project. I am closing this issue at this time. Please reopen (or log a new issue), if we come across additional problems that need engineering help.