Recently a customer identified an issue when importing diffs from GitHub. The issue is that comments on older diffs (commits) and of highlight (although not confirmed causation) when they contain suggestions - the diffs aren’t visible:
Clicking try again in this context gives an error in sentry. The error is an AbstractController::DoubleRenderError, but that is not likely the actual problem.
Additional customer-specific information can be found in this (internal) document
Spike Issue Expectations
For this issue, we would like to understand the scope of the problem. Once we understand that, we can create new issues to implement the work. Some of the questions we are looking to answer are:
Is there a viable path forward on the import of the data to have diffs visible and associated with the correct commit?
What is the level of effort required to achieve this? What is the timeline? Which teams are involve? Code Review? Import?
What are the unknowns/concerns of this effort?
Would there be impacts to the length of time imports take?
Is it possible to have a script apply the fix/update the data after the import?
Can this be fixed by changing the diff presentation logic (frontend change)?
@mnohr@patrickbajao Just to make sure we're clear on expectation of timing - the next scheduled sync to discuss this is 2021-09-09 at 7am PT. If there's any updates prior to that, it'd be good to have some initial findings documented in this issue.
I created my own GH project for testing (https://github.com/patrickbajao/gl-import) as I don't have access to Michelle's project. If someone needs access to this repo, feel free to ping me with you GH handle and I'll invite you.
Imported that project to my GDK so I can investigate the bug a little bit further (e.g. access the DB records).
First issue: Expanding outdated thread can't load the diff
After importing, I was able to replicate the same issue as can be seen on m_gill/private-test!2: the discussion on old thread cannot load diff when expanded.
Further investigation shows that this is because whenever we expand a discussion, we try to find the note record so we can display the diff by discussion ID. For some reason, the LegacyDiffNote generated after import has no discussion_id in the DB (it's nil). Since it's nil, it can't be found so we see that error state. We need to find the note so we can display the diff stored in that record.
Looking at the Note model (parent of LegacyDiffNote), we require presence of discussion_id which makes me wonder how it was bypassed. Digging further, found out that the DiffNoteImporter code does not set it and we use bulk_insert to create the records. That explains why nildiscussion_id was allowed as bulk_insert bypasses validation and callbacks.
@georgekoltsov would you be able to answer if it's intentional that we aren't setting the discussion_id? Asking because I'm not familiar with the import logic. If it's not intentional, can we build a discussion_id for every LegacyDiffNote the importer creates?
Here's a PoC MR I have that does it: !69842 (closed). But I wonder if there are repercussions of doing this from import perspective.
Second issue: Commit shown on outdated thread isn't accurate
Another finding is that the commit shown on the "started a thread on an outdated change in commit" text is not the correct commit. The commit shown is the latest commit which made the diff outdated.
Looking at the same DiffNoteImporter code, we just pass the note.commit_id when creating it. The note in this context is not a LegacyDiffNote yet but a Gitlab::GithubImport::Representation::DiffNote.
Tried it out also in the same PoC MR (!69842 (closed)) and it's now showing the appropriate commit on import.
Another question @georgekoltsov, would you know why we're using commit_id instead of original_commit_id?
Based on my testing, the PoC MR fixes the issues described in this issue. It'll only fix the problem for succeeding imports. Old imports need to be fixed differently by backfilling data which may be more complex. But I do want to confirm with @georgekoltsov first because they may be already looked at before and may cause some issues with the importer.
@georgekoltsov would you be able to answer if it's intentional that we aren't setting the discussion_id? Asking because I'm not familiar with the import logic. If it's not intentional, can we build a discussion_id for every LegacyDiffNote the importer creates?
Great find. I don't think this is intentional. Right now imported diff notes are not threaded, not grouped by discussion id and this is something that we want to add to the github importer #336596 (closed) But we can still add a new discussion id per note for now just to get them displayed.
Another question @georgekoltsov, would you know why we're using commit_id instead of original_commit_id?
I don't know Looking at git blame I see that this code is at least 5 years old introduced by Yorick. Perhaps github API did not have original_commit_id at that time? That said, I think it makes sense to specify original commit id instead assuming it does not introduce any regressions.
Great find. I don't think this is intentional. Right now imported diff notes are not threaded, not grouped by discussion id and this is something that we want to add to the github importer #336596 (closed) But we can still add a new discussion id per note for now just to get them displayed.
Cool! Yup, I think it's a good first iteration to have them work first before we can start to group them in threads.
I don't know Looking at git blame I see that this code is at least 5 years old introduced by Yorick. Perhaps github API did not have original_commit_id at that time? That said, I think it makes sense to specify original commit id instead assuming it does not introduce any regressions.
That's a long time.
Yup, based on this discussion, we'll want the original_commit_id for this.
Would this scenario be covered by the changes you proposed?
Based on my testing, yup it does get fixed too. This is the PR that I use to test: https://github.com/patrickbajao/gl-import/pull/1. That includes a comment on a file added/deleted on the same PR.
It'll only fix the problem for succeeding imports. Old imports need to be fixed differently by backfilling data which may be more complex.
@patrickbajao Do you have any thoughts about if it's possible to create a rake task that could be targeted in some way to backfill this data and correct an existing import?
@patrickbajao in looking at the PoC MR - it looks like the timestamps are relatively recent. Is that because you made the Github project and then immediately imported it - or are timestamps not being respected with the changes?
Do you have any thoughts about if it's possible to create a rake task that could be targeted in some way to backfill this data and correct an existing import?
It is possible to do that too. While it's going to be less complex and time consuming than backfilling all projects affected, it'll still have the same complexity in logic. We still need to ensure it's performant as it'll be run on production.
I mentioned this in #340330 (comment 672152582) as well, but one of my worries is how we can backfill the commit_id, it may not be possible or time consuming because we have to get the original_commit_id from GH.
in looking at the PoC MR - it looks like the timestamps are relatively recent. Is that because you made the Github project and then immediately imported it - or are timestamps not being respected with the changes?
It's because at that time, I just created the GH project. This is how they look like now when they're imported again:
The importer gets the note creation time from GH and respects that.
@mnohr@phikai I also noticed that suggestions made on GH that are then imported on GitLab isn't formatted correctly. I assume this is a separate bug we want to look at but not on this one?
Thanks for raising this question. Would you be able to provide an example/screenshot of the incorrect formatting? I am trying to understand if it is just a small cosmetic thing or does it affect usability in a significant way.
The suggestion is just a box that contains the previous suggestion, but it's not anything that can be applied or used in GitLab. Nor does it provide any indication if it was applied on the Github side.
It's 100% a separate bug to what we're investigating here - but it is potentially another gap in the importer.
What is the level of effort required to achieve this? What is the timeline? Which teams are involve? Code Review? Import?
We'll need to create issues for these 2 issues and I believe they can be fixed in the same/separate milestones as they're not dependent on each other. Based on the PoC MR, the changes are done on the importer code. I feel that it's better for Import group to handle this (with Code Review assistance) given they will be the one maintaining it as well.
Right now, my estimate for them are weights of 2 each, total of 4. This doesn't include backfilling old imports.
What are the unknowns/concerns of this effort?
Currently, it's not tested on big imports. Since additional logic was added to DiffNoteImporter that adds some time to importing process.
Quick benchmarking of additional discussion ID generation logic shows that for 10000 notes, discussion ID generation will take around ~100ms.
Would there be impacts to the length of time imports take?
As mentioned above, since additional logic was added to DiffNoteImporter there's additional time for it of course but it should be minimal.
Is it possible to have a script apply the fix/update the data after the import?
As mentioned in #340330 (comment 671955596), the fix suggested will only fix the problems for new imports after the fix is applied. Old imports must be backfilled (e.g. discussion_id should be generated for each LegacyDiffNote without discussion_id). But one of my worries are how we can backfill the commit_id, it may not be possible or time consuming because we have to get the original_commit_id.
Can this be fixed by changing the diff presentation logic (frontend change)?
Given the issue is in the backend, the fix should be backend as well.
@patrickbajao excellent work and thank you for investigating this, especially so quickly. Would you do an example of a code suggestion as well? That is really the most similar comparison to what the customer is experiencing.
I updated the GH pull request to have a suggestion on an outdated diff and on a non-outdated diff. Here's how it looks like when imported locally (outdated diff discussion is expanded)
The suggestion doesn't look like a GitLab suggestion (which I also mentioned in #340330 (comment 671957280)) but the comments get displayed and thread expansion works.
@hdelalic at this point, we should only present solutions we are 100% sure they will accept. The customer is extremely frustrated with our 5 month-long attempt to find a solution for them. Let's discuss further during the Sep 9, 7amPST call.
@phikai Since we now have #340446 (closed) to track the actual work of implementing a fix, should we close this issue and continue to collaborate on the other issue?
I want to be careful to have a single source of truth so we do not lose information across multiple issues.