Skip to content

  • Projects
  • Groups
  • Snippets
  • Help
  • This project
    • Loading...
  • Sign in / Register
GitLab Community Edition
GitLab Community Edition
  • Overview
    • Overview
    • Details
    • Activity
    • Cycle Analytics
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
    • Charts
    • Locked Files
  • Issues 10,252
    • Issues 10,252
    • List
    • Boards
    • Labels
    • Service Desk
    • Milestones
  • Merge Requests 543
    • Merge Requests 543
  • CI / CD
    • CI / CD
    • Pipelines
    • Jobs
    • Schedules
    • Charts
  • Snippets
    • Snippets
  • Members
    • Members
  • Collapse sidebar
  • Activity
  • Graph
  • Charts
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
  • GitLab.org
  • GitLab Community EditionGitLab Community Edition
  • Merge Requests
  • !14731

Merged
Opened Oct 06, 2017 by Yorick Peterse@yorickpeterse 31 of 34 tasks completed31/34 tasks
  • Report abuse
Report abuse

Rewrite the GitHub importer to perform work in parallel and greatly improve performance

What does this MR do?

This MR rewrites the GitHub importer from scratch so it's much faster and performs work in parallel. This MR is still a WIP, I'll update the body properly once we get closer to a final state.

Fixes #33135 (closed), fixes #38621 (closed), fixes #39361 (closed)

TODO

  • Use a separate class for importing issue comments
  • Import issue comments using a worker (1 job per comment)
  • Issue and comment workers should reschedule themselves in the future if we hit a rate limit, the reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds that means we schedule jobs for 10 seconds in the future)
  • Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a sleep call
  • Make sure any stuck importer jobs don't mess with a running GitHub import
  • Import releases
  • Re-fetch if a PR was updated after the initial repository fetch
  • See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly)
  • Set a release's description to the tag name if none is provided (since a description is required)
  • Add tests for User.by_any_email
  • Add tests for the importer code
    • Base classes such as Client and UserFinder
    • Representation classes
    • Importer classes
    • Sidekiq workers (GithubImporter namespace)
    • Sidekiq worker concerns
    • Changes to Projects::ImportService to support async importers
    • Changes to RepositoryImportWorker
  • Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info.
  • Sometimes inserts into merge_requests produce duplicate key errors in the index index_merge_requests_on_target_project_id_and_iid. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info.
    • It's also possible this is caused by a job failing, then being retried. Wrapping all work in a transaction could solve this.
  • Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming
  • The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will require a lease however.
  • It seems that under certain conditions the page numbers of e.g. the pull requests data isn't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing.
    • Possible theory: page 1 is never written to the cache because of current_page defaulting to 1 and if page > current_page in set. This means that if we were to request page 2 (after having processed page 1) and get a rate limit error we'd process page 1 again. This could then lead to duplicate key errors if jobs from the same page run in parallel. This however doesn't explain why I didn't see the page number being cached at all.

Projects To Test

The following projects should be tested using this new importer on GitLab.com (one way or another):

  • https://github.com/yorickpeterse/oga: 103 seconds on staging, 47 seconds on production
  • https://github.com/kubernetes/kubernetes: 6.5 hours on production
  • https://github.com/rubinius/rubinius: 22.3 minutes on staging, 20 minutes on production

Does this MR meet the acceptance criteria?

  • Changelog entry added, if necessary
  • Documentation created/updated
  • API support added
  • Tests added for this feature/bug
  • Review
    • Has been reviewed by Backend
    • Has been reviewed by Database
  • Conform by the merge request performance guides
  • Conform by the style guides
  • Squashed related commits together
Edited Nov 07, 2017 by Yorick Peterse
  • Discussion 118
  • Commits 5
  • Pipelines 82
  • Changes 186
{{ resolvedDiscussionCount }}/{{ discussionCount }} {{ resolvedCountText }} resolved
  • Yorick Peterse @yorickpeterse

    mentioned in issue #38621 (closed)

    Oct 06, 2017

    mentioned in issue #38621 (closed)

    mentioned in issue #38621
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    restored source branch github-importer-refactor

    Oct 06, 2017

    restored source branch github-importer-refactor

    restored source branch `github-importer-refactor`
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 06, 2017
    Master

    For this MR my goal is to get https://github.com/kubernetes/kubernetes to import in roughly a minute or so, at least when using the parallel importer (the Rake task will still be sequential). This however requires a near total rewrite of the importer as I couldn't make this work in the old code.

    For this MR my goal is to get https://github.com/kubernetes/kubernetes to import in roughly a minute or so, at least when using the parallel importer (the Rake task will still be sequential). This however requires a near total rewrite of the importer as I couldn't make this work in the old code.
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 06, 2017
    Master

    @yorickpeterse As far as I know Gitlab::GithubImport::Importer is only used for Gitea import, GitHub import now uses Github::Import.

    @yorickpeterse As far as I know `Gitlab::GithubImport::Importer` is only used for Gitea import, GitHub import now uses `Github::Import`.
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • f6d6e37e - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 06, 2017

    added 1 commit

    • f6d6e37e - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * f6d6e37e - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7336248&start_sha=048f8a144cff25008c7ab298fb40f90465ba4d2b)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 06, 2017
    Master

    @rymai Yeah I renamed that one to LegacyGithubImport. I didn't rename GiteaImport it since it used Octokit and some other GitHub specific bits, but perhaps that's a better name.

    I also just pushed a new version which adds the ability to import issues and what not properly. Pull requests and releases are not yet handled. Issues are also imported in parallel, which results in https://github.com/yorickpeterse/oga being imported in 15-20 seconds instead of 120 seconds (excluding PRs and releases).

    @rymai Yeah I renamed that one to `LegacyGithubImport`. I didn't rename GiteaImport it since it used Octokit and some other GitHub specific bits, but perhaps that's a better name. I also just pushed a new version which adds the ability to import issues and what not properly. Pull requests and releases are not yet handled. Issues are also imported in parallel, which results in https://github.com/yorickpeterse/oga being imported in 15-20 seconds instead of 120 seconds (excluding PRs and releases).
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 5c01cbc4 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 06, 2017

    added 1 commit

    • 5c01cbc4 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 5c01cbc4 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7337481&start_sha=f6d6e37eaf36e05b39327564b089a022fd8cb7d3)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 06, 2017
    Master

    I spent a few more hours hacking on this and things are starting to come together. Labels and milestones are assigned properly, data for non existing users is handled, etc.

    Along the way I did get hit by the GitHub rate limit when trying to import https://github.com/kubernetes/kubernetes. This isn't entirely surprising because this project has 4626 issues and we need at least an equal amount of requests to get the comments for these issues. In the current setup this means we may (in the worst case) have to wait up to an hour to get the data.

    My plan for dealing with this is to break importing comments up into a separate worker as this will consume most requests. These jobs will then re-schedule themselves in the future when hitting the rate limit, based on when the rate limit will reset. This means importing Kubernetes can still take an hour or two because of this, but at least we won't have threads blocked in a sleep call.

    I spent a few more hours hacking on this and things are starting to come together. Labels and milestones are assigned properly, data for non existing users is handled, etc. Along the way I did get hit by the GitHub rate limit when trying to import https://github.com/kubernetes/kubernetes. This isn't entirely surprising because this project has 4626 issues and we need at least an equal amount of requests to get the comments for these issues. In the current setup this means we may (in the worst case) have to wait up to an hour to get the data. My plan for dealing with this is to break importing comments up into a separate worker as this will consume most requests. These jobs will then re-schedule themselves in the future when hitting the rate limit, based on when the rate limit will reset. This means importing Kubernetes can still take an hour or two because of this, but at least we won't have threads blocked in a `sleep` call.
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 06, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 06, 2017
    Master

    Potential option: there's an API endpoint for getting comments across a repository (e.g. https://api.github.com/repos/YorickPeterse/oga/issues/comments). While the returned comment payloads don't contain a direct association to their "parent" (issue or PR), we could parse this from the issue_url field. Assuming this were to work it means we'd only need 4626 / 100 (the maximum number of objects per page) = 47 requests.

    Potential option: there's an API endpoint for getting comments across a repository (e.g. https://api.github.com/repos/YorickPeterse/oga/issues/comments). While the returned comment payloads don't contain a direct association to their "parent" (issue or PR), we could parse this from the `issue_url` field. Assuming this were to work it means we'd only need 4626 / 100 (the maximum number of objects per page) = 47 requests.
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 09, 2017
    Master

    Yeah I renamed that one to LegacyGithubImport. I didn't rename GiteaImport it since it used Octokit and some other GitHub specific bits, but perhaps that's a better name.

    Oh, ok I see: you've rewritten Gitlab::GithubImport::Importer (the old importer) and kept the new Github::Import (which won't be used anymore after this refactor). That was what confused me! :P

    My plan for dealing with this is to break importing comments up into a separate worker as this will consume most requests. These jobs will then re-schedule themselves in the future when hitting the rate limit, based on when the rate limit will reset. This means importing Kubernetes can still take an hour or two because of this, but at least we won't have threads blocked in a sleep call.

    Sounds good!

    Potential option: there's an API endpoint for getting comments across a repository (e.g. https://api.github.com/repos/YorickPeterse/oga/issues/comments). While the returned comment payloads don't contain a direct association to their "parent" (issue or PR), we could parse this from the issue_url field. Assuming this were to work it means we'd only need 4626 / 100 (the maximum number of objects per page) = 47 requests.

    @yorickpeterse Yes, that's what I discovered and proposed too: #38200 :)

    > Yeah I renamed that one to `LegacyGithubImport`. I didn't rename GiteaImport it since it used Octokit and some other GitHub specific bits, but perhaps that's a better name. Oh, ok I see: you've rewritten `Gitlab::GithubImport::Importer` (the old importer) and kept the new `Github::Import` (which won't be used anymore after this refactor). That was what confused me! :P > My plan for dealing with this is to break importing comments up into a separate worker as this will consume most requests. These jobs will then re-schedule themselves in the future when hitting the rate limit, based on when the rate limit will reset. This means importing Kubernetes can still take an hour or two because of this, but at least we won't have threads blocked in a `sleep` call. Sounds good! > Potential option: there's an API endpoint for getting comments across a repository (e.g. https://api.github.com/repos/YorickPeterse/oga/issues/comments). While the returned comment payloads don't contain a direct association to their "parent" (issue or PR), we could parse this from the `issue_url` field. Assuming this were to work it means we'd only need 4626 / 100 (the maximum number of objects per page) = 47 requests. @yorickpeterse Yes, that's what I discovered and proposed too: https://gitlab.com/gitlab-org/gitlab-ce/issues/38200 :)
  • 🏝 Rémy Coutable 🏝 (back March 27th)
    @rymai started a discussion on an old version of the diff Oct 09, 2017
    Resolved by Yorick Peterse Oct 13, 2017
    lib/gitlab/github_import/milestones_importer.rb 0 → 100644
    22 client
    23 .milestones(project.import_source, state: 'all')
    24 .map { |milestone| build(milestone) }
    25 end
    26
    27 def build(milestone)
    28 time = Time.zone.now
    29
    30 {
    31 iid: milestone.number,
    32 title: milestone.title,
    33 description: milestone.description,
    34 project_id: project.id,
    35 state: state_for(milestone),
    36 created_at: time,
    37 updated_at: time
    • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 09, 2017
      Master

      Why not keep milestone.created_at/milestone.updated_at?

      Why not keep `milestone.created_at`/`milestone.updated_at`?
    • Yorick Peterse @yorickpeterse

      changed this line in version 8 of the diff

      Oct 10, 2017

      changed this line in version 8 of the diff

      changed this line in [version 8 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7405258&start_sha=269085aa26664138f87d3be23d001e3e52056c30#90912a84abc2b4123c683121ba25a4344a9350cf_37_0)
      Toggle commit list
    Please register or sign in to reply
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai

    mentioned in issue #38873 (closed)

    Oct 09, 2017

    mentioned in issue #38873 (closed)

    mentioned in issue #38873
    Toggle commit list
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 09, 2017
    Master

    @yorickpeterse Man that's fast, local benchmarks when importing guard/guard (531 issues):

    • master: 1228 seconds (~20 minutes) 🐌
    • this branch: 74 seconds (1 minutes, 14 seconds) 🚀
    @yorickpeterse Man that's fast, local benchmarks when importing `guard/guard` (531 issues): - `master`: 1228 seconds (~20 minutes) :snail: - `this branch`: 74 seconds (1 minutes, 14 seconds) :rocket:
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • adc9ed0a - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 09, 2017

    added 1 commit

    • adc9ed0a - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * adc9ed0a - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7371077&start_sha=5c01cbc47c95267e6df554d0d7a9b63b26812a72)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Import issue comments using a worker (1 job per comment) as completed

    Oct 09, 2017

    marked the task Import issue comments using a worker (1 job per comment) as completed

    marked the task **Import issue comments using a worker (1 job per comment)** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Use a separate class for importing issue comments as completed

    Oct 09, 2017

    marked the task Use a separate class for importing issue comments as completed

    marked the task **Use a separate class for importing issue comments** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 09, 2017
    Master

    Importing comments is now done using the "comments per repo" endpoint (https://developer.github.com/v3/issues/comments/#list-comments-in-a-repository), saving a lot of API calls.

    Pull requests will be a bit tricky. In the old approach we would restore some branches and (as far as I can tell) would generate diffs using our Git clone. This means we could hit NFS pretty hard if we were to run e.g. 100 PR importer jobs in parallel. Unfortunately the GitHub API does not appear to provide a single API endpoint to get the diffs in a merge request. I'll see what we can do about this.

    Importing comments is now done using the "comments per repo" endpoint (https://developer.github.com/v3/issues/comments/#list-comments-in-a-repository), saving a lot of API calls. Pull requests will be a bit tricky. In the old approach we would restore some branches and (as far as I can tell) would generate diffs using our Git clone. This means we could hit NFS pretty hard if we were to run e.g. 100 PR importer jobs in parallel. Unfortunately the GitHub API does not appear to provide a single API endpoint to get the diffs in a merge request. I'll see what we can do about this.
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 09, 2017
    Master

    Pull requests will be a bit tricky. In the old approach we would restore some branches and (as far as I can tell) would generate diffs using our Git clone. This means we could hit NFS pretty hard if we were to run e.g. 100 PR importer jobs in parallel. Unfortunately the GitHub API does not appear to provide a single API endpoint to get the diffs in a merge request. I'll see what we can do about this.

    @yorickpeterse If you take a look at the new importer Github::Import, we're not restoring branches anymore. :)

    > Pull requests will be a bit tricky. In the old approach we would restore some branches and (as far as I can tell) would generate diffs using our Git clone. This means we could hit NFS pretty hard if we were to run e.g. 100 PR importer jobs in parallel. Unfortunately the GitHub API does not appear to provide a single API endpoint to get the diffs in a merge request. I'll see what we can do about this. @yorickpeterse If you take a look at the new importer `Github::Import`, we're not restoring branches anymore. :)
  • Yorick Peterse @yorickpeterse commented Oct 09, 2017
    Master

    @rymai Aha, that explains why I couldn't find any calls to restore_branch! and the likes any more; I could have sworn they were there, but them being refactored away explains why I couldn't find them. What is the commit (or commits) that introduced this change? That might make it easier for me to understand how to best port that logic over to this parallel importer.

    Edited Oct 09, 2017 by Yorick Peterse
    @rymai Aha, that explains why I couldn't find any calls to `restore_branch!` and the likes any more; I could have sworn they were there, but them being refactored away explains why I couldn't find them. What is the commit (or commits) that introduced this change? That might make it easier for me to understand how to best port that logic over to this parallel importer.
  • Yorick Peterse @yorickpeterse commented Oct 09, 2017
    Master

    Ah I found it, those calls were removed in 43b75b38. I'll take a look at that and the following commits.

    Ah I found it, those calls were removed in 43b75b38086c88894fca4b6c1c9e68c111c4e546. I'll take a look at that and the following commits.
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 10, 2017
    Master

    Aha, that explains why I couldn't find any calls to restore_branch! and the likes any more; I could have sworn they were there, but them being refactored away explains why I couldn't find them. What is the commit (or commits) that introduced this change? That might make it easier for me to understand how to best port that logic over to this parallel importer.

    @yorickpeterse Yeah, the MR was !14445 (merged), the description of the MR and its commits can probably help you understand what was done and why! Also, feel free to checkout https://gitlab.com/gitlab-org/gitlab-ce/issues/36284 for more background and work on improving importing and performance in general.

    > Aha, that explains why I couldn't find any calls to `restore_branch!` and the likes any more; I could have sworn they were there, but them being refactored away explains why I couldn't find them. What is the commit (or commits) that introduced this change? That might make it easier for me to understand how to best port that logic over to this parallel importer. @yorickpeterse Yeah, the MR was !14445, the description of the MR and its commits can probably help you understand what was done and why! Also, feel free to checkout https://gitlab.com/gitlab-org/gitlab-ce/issues/36284 for more background and work on improving importing and performance in general.
  • Yorick Peterse @yorickpeterse commented Oct 10, 2017
    Master

    I poked around with the GitHub GraphQL API and this looks quite promising. For example, it would allow us to get issue/PR authors/assignees including their Emails, removing the need for API calls to get that data. This is quite important as looking up e.g. 1000 users could in the worst case lead to 1000 GitHub API calls.

    To give an example, here's a GraphQL query to get all issues of a repository along with the associated data we'd need:

    query($owner: String!, $repo: String!, $after: String) {
      repository(owner: $owner, name: $repo) {
        issues(states: [OPEN, CLOSED], first: 100, after: $after, orderBy: { field: CREATED_AT, direction: ASC }) {
          nodes {
            number
            title
            body
            author {
              ... on User {
                id: databaseId
                login
                email
              }
            }
            assignees(first: 10) {
              nodes {
                ... on User {
                  id: databaseId
                  login
                  email
                }
              }
            }
            milestone_iid: milestone {
              number
            }
            label_names: labels(first: 10) {
              nodes {
                name
              }
            }
          }
          pageInfo {
            has_next_page: hasNextPage
            end_cursor: endCursor
          }
        }
      }
      rate_limit: rateLimit {
        remaining
        reset_at: resetAt
      }
    }

    This query includes some info necessary for pagination as well as rate limiting details. The output will be something along the lines of:

    {
      "data": {
        "repository": {
          "issues": {
            "nodes": [
              {
                "number": 1,
                "title": "Lexing doctypes",
                "body": "Both the lexer and the HTML parser should support the processing of doctypes. Currently both already have basic support for doctypes but this should be improved so that it's easier to extract information such as the doctype URL.\n",
                "author": {
                  "id": 86065,
                  "login": "YorickPeterse",
                  "email": "yorickpeterse@gmail.com"
                },
                "assignees": {
                  "nodes": []
                },
                "milestone_iid": null,
                "label_names": {
                  "nodes": [
                    {
                      "name": "HTML"
                    },
                    {
                      "name": "Lexer"
                    }
                  ]
                }
              },
              ...
            ],
            "pageInfo": {
              "has_next_page": true,
              "end_cursor": "Y3Vyc29yOnYyOpK5MjAxNC0wMi0yOFQxMTozODo1NCswMTowMM4Bsr8v"
            }
          }
        },
        "rate_limit": {
          "remaining": 4966,
          "reset_at": "2017-10-10T15:09:30.000Z"
        }
      }
    }

    There's a Ruby client written by GitHub here: https://github.com/github/graphql-client. It seems to be able to send calls over HTTP out of the box, though the code seems to be more of a proof of concept: https://github.com/github/graphql-client/blob/master/lib/graphql/client/http.rb

    I poked around with the GitHub GraphQL API and this looks quite promising. For example, it would allow us to get issue/PR authors/assignees _including_ their Emails, removing the need for API calls to get that data. This is quite important as looking up e.g. 1000 users could in the worst case lead to 1000 GitHub API calls. To give an example, here's a GraphQL query to get all issues of a repository along with the associated data we'd need: ``` query($owner: String!, $repo: String!, $after: String) { repository(owner: $owner, name: $repo) { issues(states: [OPEN, CLOSED], first: 100, after: $after, orderBy: { field: CREATED_AT, direction: ASC }) { nodes { number title body author { ... on User { id: databaseId login email } } assignees(first: 10) { nodes { ... on User { id: databaseId login email } } } milestone_iid: milestone { number } label_names: labels(first: 10) { nodes { name } } } pageInfo { has_next_page: hasNextPage end_cursor: endCursor } } } rate_limit: rateLimit { remaining reset_at: resetAt } } ``` This query includes some info necessary for pagination as well as rate limiting details. The output will be something along the lines of: ```json { "data": { "repository": { "issues": { "nodes": [ { "number": 1, "title": "Lexing doctypes", "body": "Both the lexer and the HTML parser should support the processing of doctypes. Currently both already have basic support for doctypes but this should be improved so that it's easier to extract information such as the doctype URL.\n", "author": { "id": 86065, "login": "YorickPeterse", "email": "yorickpeterse@gmail.com" }, "assignees": { "nodes": [] }, "milestone_iid": null, "label_names": { "nodes": [ { "name": "HTML" }, { "name": "Lexer" } ] } }, ... ], "pageInfo": { "has_next_page": true, "end_cursor": "Y3Vyc29yOnYyOpK5MjAxNC0wMi0yOFQxMTozODo1NCswMTowMM4Bsr8v" } } }, "rate_limit": { "remaining": 4966, "reset_at": "2017-10-10T15:09:30.000Z" } } } ``` There's a Ruby client written by GitHub here: https://github.com/github/graphql-client. It seems to be able to send calls over HTTP out of the box, though the code seems to be more of a proof of concept: https://github.com/github/graphql-client/blob/master/lib/graphql/client/http.rb
  • Yorick Peterse @yorickpeterse commented Oct 10, 2017
    Master

    And here's a similar query for getting pull requests, including labels and what not:

    query($owner: String!, $repo: String!, $after: String) {
      repository(owner: $owner, name: $repo) {
        pullRequests(states: [OPEN, CLOSED, MERGED], first: 100, after: $after, orderBy: { field: CREATED_AT, direction: ASC }) {
          nodes {
            number
            title
            body
            author {
              ... on User {
                id: databaseId
                login
                email
              }
            }
            assignees(first: 10) {
              nodes {
                ... on User {
                  id: databaseId
                  login
                  email
                }
              }
            }
            milestone_iid: milestone {
              number
            }
            label_names: labels(first: 10) {
              nodes {
                name
              }
            }
          }
          pageInfo {
            has_next_page: hasNextPage
            end_cursor: endCursor
          }
        }
      }
      rate_limit: rateLimit {
        remaining
        reset_at: resetAt
      }
    }

    One minor nuisance in the GraphQL API is that paginating nested objects (e.g. labels) will be annoying. However I think we could just say we'll only import up to 10 assignees and labels (or something along those lines) per issue/pull request. There shouldn't be that many cases (if any) where an issue/PR has more than 10 labels.

    And here's a similar query for getting pull requests, including labels and what not: ``` query($owner: String!, $repo: String!, $after: String) { repository(owner: $owner, name: $repo) { pullRequests(states: [OPEN, CLOSED, MERGED], first: 100, after: $after, orderBy: { field: CREATED_AT, direction: ASC }) { nodes { number title body author { ... on User { id: databaseId login email } } assignees(first: 10) { nodes { ... on User { id: databaseId login email } } } milestone_iid: milestone { number } label_names: labels(first: 10) { nodes { name } } } pageInfo { has_next_page: hasNextPage end_cursor: endCursor } } } rate_limit: rateLimit { remaining reset_at: resetAt } } ``` One minor nuisance in the GraphQL API is that paginating nested objects (e.g. labels) will be annoying. However I think we could just say we'll only import up to 10 assignees and labels (or something along those lines) per issue/pull request. There shouldn't be that many cases (if any) where an issue/PR has more than 10 labels.
  • Yorick Peterse @yorickpeterse commented Oct 10, 2017
    Master

    Annoyingly the GraphQL API does not support a way of getting comments across an entire repository, instead you can only get them per issue/PR/etc. This means we'd need to either embed those in the above queries and then paginate them, which I can see being quite annoying. This is annoying because we'd have to run many API calls per issue/PR to get the data, so it might be more efficient to just use the REST API.

    Edited Oct 10, 2017 by Yorick Peterse
    Annoyingly the GraphQL API _does not_ support a way of getting comments across an entire repository, instead you can only get them per issue/PR/etc. This means we'd need to either embed those in the above queries and then paginate them, which I can see being quite annoying. This is annoying because we'd have to run many API calls per issue/PR to get the data, so it might be more efficient to just use the REST API.
  • Yorick Peterse @yorickpeterse commented Oct 10, 2017
    Master

    So pros/cons of GraphQL:

    Pros:

    • Fewer calls necessary to get data such as labels of issues and PRS
    • No additional calls necessary to get the Email address of users
    • Query language is decent

    Cons:

    • Need to implement pagination yourself using GraphQL cursors
    • Need API calls to traverse comment pages for every issue/PR
    • Very new
    • I'd have to rewrite most of the code I wrote thus far

    I'll stick with REST for now, unless I really start hating it.

    So pros/cons of GraphQL: Pros: * Fewer calls necessary to get data such as labels of issues and PRS * No additional calls necessary to get the Email address of users * Query language is decent Cons: * Need to implement pagination yourself using GraphQL cursors * Need API calls to traverse comment pages for every issue/PR * Very new * I'd have to rewrite most of the code I wrote thus far I'll stick with REST for now, unless I really start hating it.
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 17f487cf - Refactor User.find_by_any_email
    • ada8dff2 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 10, 2017

    added 2 commits

    • 17f487cf - Refactor User.find_by_any_email
    • ada8dff2 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 2 commits * 17f487cf - Refactor User.find_by_any_email * ada8dff2 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7404170&start_sha=adc9ed0aa489b3a8daa8600cc06b07351db5a3c2)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 269085aa - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 10, 2017

    added 1 commit

    • 269085aa - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 269085aa - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7404754&start_sha=ada8dff22fdb4f98785f7a4c5a2d289fde64c65f)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • e409bfcd - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 10, 2017

    added 1 commit

    • e409bfcd - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * e409bfcd - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7405258&start_sha=269085aa26664138f87d3be23d001e3e52056c30)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 55a5e4f5 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 10, 2017

    added 1 commit

    • 55a5e4f5 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 55a5e4f5 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7405620&start_sha=e409bfcd93b8824f1d092f250c18cd5c867a30ae)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 10, 2017
    Master

    @rymai @dbalexandre I noticed that creating notes will result in a call to Repository#keep_around to keep the commits. I'm wondering if this is necessary when importing? Getting rid of this would save us a lot of Git/NFS operations when importing large projects.

    @rymai @dbalexandre I noticed that creating notes will result in a call to `Repository#keep_around` to keep the commits. I'm wondering if this is necessary when importing? Getting rid of this would save us _a lot_ of Git/NFS operations when importing large projects.
  • Yorick Peterse @yorickpeterse commented Oct 10, 2017
    Master

    To add to that: the number of Gitaly calls that I see flying past in my console when importing PRs and diff notes is quite worrying. I have doubts about our setup being able to handle large imports if we keep hammering Git for every MR and note.

    To add to that: the number of Gitaly calls that I see flying past in my console when importing PRs and diff notes is quite worrying. I have doubts about our setup being able to handle large imports if we keep hammering Git for every MR and note.
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 05437693 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 10, 2017

    added 1 commit

    • 05437693 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 05437693 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7409318&start_sha=55a5e4f528d2a7df9780129cbbc87a5d075bbaed)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in issue gitlab-com/infrastructure#2922 (closed)

    Oct 11, 2017

    mentioned in issue gitlab-com/infrastructure#2922 (closed)

    mentioned in issue gitlab-com/infrastructure#2922
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in issue gitlab-com/infrastructure#2983 (closed)

    Oct 11, 2017

    mentioned in issue gitlab-com/infrastructure#2983 (closed)

    mentioned in issue gitlab-com/infrastructure#2983
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 15b3f15d - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 11, 2017

    added 1 commit

    • 15b3f15d - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 15b3f15d - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7427134&start_sha=05437693a2d1ff13f291d26abd88ded1da5ef7f2)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Import releases as completed

    Oct 11, 2017

    marked the task Import releases as completed

    marked the task **Import releases** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 9194783e - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 11, 2017

    added 1 commit

    • 9194783e - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 9194783e - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7432783&start_sha=15b3f15de9c023f9b126eac2b3ccaff596e59588)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 7d2368c0 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 11, 2017

    added 1 commit

    • 7d2368c0 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 7d2368c0 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7434384&start_sha=9194783e677a90a3146e19375b9bb9b1ee3ab6cd)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • d228e5e0 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 12, 2017

    added 1 commit

    • d228e5e0 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * d228e5e0 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7462917&start_sha=7d2368c08164211dc116bb2eff69a246ee81d1a2)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a sleep call as completed

    Oct 12, 2017

    marked the task Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a sleep call as completed

    marked the task **Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a `sleep` call** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Issue and comment workers should reschedule themselves in the future if we hit a rate limit, the reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds that means we schedule jobs for 10 seconds in the future) as completed

    Oct 12, 2017

    marked the task Issue and comment workers should reschedule themselves in the future if we hit a rate limit, the reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds that means we schedule jobs for 10 seconds in the future) as completed

    marked the task **Issue and comment workers should reschedule themselves in the future if we hit a rate limit, the reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds that means we schedule jobs for 10 seconds in the future)** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Make sure any stuck importer jobs don't mess with a running GitHub import as completed

    Oct 12, 2017

    marked the task Make sure any stuck importer jobs don't mess with a running GitHub import as completed

    marked the task **Make sure any stuck importer jobs don't mess with a running GitHub import** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 12, 2017
    Master

    I managed to break up the importer into a pipeline consisting out of the following stages:

    1. ImportRepository: this will import base data such as the repository, wiki, labels, milestones, and releases (this data can be imported quickly enough in a single thread)
    2. ImportPullRequests: this will import all pull requests from GitHub
    3. ImportIssuesAndDiffNotes: this will create regular issues, enrich merge requests with labels, and create merge request diff notes
    4. ImportNotes: this creates regular notes for both issues and merge requests, thus we have to schedule this after those have been imported
    5. FinishImport: this will simply mark the job as finished and clear any caches

    Between every stage we schedule a job for the AdvanceStageWorker. This worker will check if the work from the previous stage is done and if so it will advance the pipeline. If this isn't the case it will reschedule itself and check again later. Using this process we only very briefly block threads in sleep calls (when checking of jobs are completed in AdvanceStageWorker we wait for a little bit to reduce the number of reschedules), instead of doing so for hours on end.

    A downside of this approach is that there's a minimum import time of around 1 minute or so, though this depends largely on how fast jobs can be completed before the AdvanceStageWorker reschedules itself. This however is a trade-off that's worth it as the alternative is blocking an entire thread (or multiple depending on the number of import jobs running) for a long period of time.

    Using this setup I can import github.com/yorickpeterse/oga in about 100 seconds on my laptop, whereas it would take the old importer around 4.5 minutes.

    API call wise importing Oga requires 91 API calls without user information being cached. With this data being cached we only need 11 API calls.

    The code also takes care of refreshing the import JID so StuckImportJobsWorker doesn't mark the import process as failed after 15 minutes, though I have to still properly test this.

    My next plan is to write automated tests for all this code (and probably fix a bunch of bugs in the process), then run some tests on larger projects such as Kubernetes.

    Edited Oct 12, 2017 by Yorick Peterse
    I managed to break up the importer into a pipeline consisting out of the following stages: 1. `ImportRepository`: this will import base data such as the repository, wiki, labels, milestones, and releases (this data can be imported quickly enough in a single thread) 2. `ImportPullRequests`: this will import all pull requests from GitHub 3. `ImportIssuesAndDiffNotes`: this will create regular issues, enrich merge requests with labels, and create merge request diff notes 4. `ImportNotes`: this creates regular notes for both issues and merge requests, thus we have to schedule this after those have been imported 5. `FinishImport`: this will simply mark the job as finished and clear any caches Between every stage we schedule a job for the AdvanceStageWorker. This worker will check if the work from the previous stage is done and if so it will advance the pipeline. If this isn't the case it will reschedule itself and check again later. Using this process we only very briefly block threads in `sleep` calls (when checking of jobs are completed in AdvanceStageWorker we wait for a little bit to reduce the number of reschedules), instead of doing so for hours on end. A downside of this approach is that there's a minimum import time of around 1 minute or so, though this depends largely on how fast jobs can be completed before the AdvanceStageWorker reschedules itself. This however is a trade-off that's worth it as the alternative is blocking an entire thread (or multiple depending on the number of import jobs running) for a long period of time. Using this setup I can import github.com/yorickpeterse/oga in about 100 seconds on my laptop, whereas it would take the old importer around 4.5 minutes. API call wise importing Oga requires 91 API calls without user information being cached. With this data being cached we only need 11 API calls. The code also takes care of refreshing the import JID so StuckImportJobsWorker doesn't mark the import process as failed after 15 minutes, though I have to still properly test this. My next plan is to write automated tests for all this code (and probably fix a bunch of bugs in the process), then run some tests on larger projects such as Kubernetes.
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • b8adb73a - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 12, 2017

    added 1 commit

    • b8adb73a - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * b8adb73a - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7468168&start_sha=d228e5e06baa223882f46434f2a290649a94425a)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 1dacd871 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 12, 2017

    added 1 commit

    • 1dacd871 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 1dacd871 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7469332&start_sha=b8adb73ae5844c12670fa42ada1d89cadc5505f1)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 12, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 04e689fb - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 12, 2017

    added 1 commit

    • 04e689fb - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 04e689fb - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7471765&start_sha=1dacd87100421d752c46b7ab880f3d103c3fde82)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 493 commits

    • 04e689fb...34fd07f6 - 491 commits from branch master
    • 2f10fa3e - Refactor User.find_by_any_email
    • 214b70a6 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 12, 2017

    added 493 commits

    • 04e689fb...34fd07f6 - 491 commits from branch master
    • 2f10fa3e - Refactor User.find_by_any_email
    • 214b70a6 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 493 commits * 04e689fb...34fd07f6 - 491 commits from branch `master` * 2f10fa3e - Refactor User.find_by_any_email * 214b70a6 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7471993&start_sha=04e689fb8998c65adb4b44746422b52e66340b4f)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 13, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 41a94cae - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 13, 2017

    added 1 commit

    • 41a94cae - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 41a94cae - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7494893&start_sha=214b70a6d211435b733fd1c9232467d34fceb62e)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Add tests for User.by_any_email as completed

    Oct 13, 2017

    marked the task Add tests for User.by_any_email as completed

    marked the task **Add tests for `User.by_any_email`** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    resolved all discussions

    Oct 13, 2017

    resolved all discussions

    resolved all discussions
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 91c79675 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 13, 2017

    added 1 commit

    • 91c79675 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 91c79675 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7494948&start_sha=41a94cae42c0b0e1e0e0f33e173795f56412febb)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • e7cf1c7d - Refactor User.find_by_any_email
    • 06462743 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 13, 2017

    added 2 commits

    • e7cf1c7d - Refactor User.find_by_any_email
    • 06462743 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 2 commits * e7cf1c7d - Refactor User.find_by_any_email * 06462743 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7495068&start_sha=91c79675c3396bf471864f1d20c58ba69328d58c)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 8b4fe973 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 13, 2017

    added 1 commit

    • 8b4fe973 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 8b4fe973 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7495565&start_sha=064627436899fc2dd40ef2b28a2add2a9a5208a6)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 13, 2017

    changed the description

    changed the description
    Toggle commit list
  • Martin Rüegg @martin.rueegg commented Oct 15, 2017

    Hi Yorick,

    Thanks for this work. I've successfully migrated a GitHub repo, where the legacy importer failed to import the issues. Congrats for your work.

    Btw, I've just "cherry-picked" the changes of the two commits (8b4fe973 and e7cf1c7d) and patched them on my omnibus installation 10.0.3 8895150 and that worked fine!

    I've seen you've discussed the (re-)naming of the importer. I personally would suggest to leave the legacy importer un-renamed (and files not moved) and instead file the new code under e.g. github_import_ng (for next generation) or _v2.

    Also, would it not be great to have all importers named the same, i.e. Gitlab::GithubImportNg::Importer rather than Github::Import?

    I'm very new to this project and there might be good reasons for how you have chosen to name things.

    Cheers, Martin.

    Hi Yorick, Thanks for this work. I've successfully migrated a GitHub repo, where the legacy importer failed to import the issues. Congrats for your work. Btw, I've just "cherry-picked" the changes of the two commits (8b4fe973d153c700fe5c60002a165a984e7f5b4f and e7cf1c7d6a75bf3a526d8262e7d38e7177bb61ba) and patched them on my omnibus installation 10.0.3 [8895150](https://gitlab.com/gitlab-org/gitlab-ce/commits/8895150) and that worked fine! I've seen you've discussed the (re-)naming of the importer. I personally would suggest to leave the legacy importer un-renamed (and files not moved) and instead file the new code under e.g. `github_import_ng` (for next generation) or `_v2`. Also, would it not be great to have all importers named the same, i.e. `Gitlab::GithubImportNg::Importer` rather than `Github::Import`? I'm very new to this project and there might be good reasons for how you have chosen to name things. Cheers, Martin.
  • Martin Rüegg @martin.rueegg

    mentioned in issue #4055

    Oct 15, 2017

    mentioned in issue #4055

    mentioned in issue #4055
    Toggle commit list
  • Martin Rüegg @martin.rueegg commented Oct 15, 2017

    One observation I'd like to add:

    After importing a GitHub repo with the patched version of the CE, I exported the same repo again in order to import it on gitlab.com. However, that import failed because some releases' description field were empty (""). Once I've corrected the project.json in the export by replacing $.releases[*].description with $.releases[*].tag when description === '', the import went through smoothly.

    Now obviously, this is an issue of the ImportExport module, as it allows to export empty descriptions but does not import them again. However, it might be a follow-up error by the fact that this patch here does import empty descriptions where otherwise this would not be allowed?

    One observation I'd like to add: After importing a GitHub repo with the patched version of the CE, I exported the same repo again in order to import it on gitlab.com. However, that import failed because some releases' description field were empty (`""`). Once I've corrected the `project.json` in the export by replacing `$.releases[*].description` with `$.releases[*].tag` when `description === ''`, the import went through smoothly. Now obviously, this is an issue of the `ImportExport` module, as it allows to export empty descriptions but does not import them again. However, it might be a follow-up error by the fact that this patch here does import empty descriptions where otherwise this would not be allowed?
  • Martin Rüegg @martin.rueegg commented Oct 15, 2017

    See also #18235 Refactor GitHub importer

    See also https://gitlab.com/gitlab-org/gitlab-ce/issues/18235 **Refactor GitHub importer**
  • Martin Rüegg @martin.rueegg

    mentioned in issue #39133

    Oct 15, 2017

    mentioned in issue #39133

    mentioned in issue #39133
    Toggle commit list
  • Martin Rüegg @martin.rueegg commented Oct 16, 2017

    Also see !13788 (closed)

    Also see !13788
  • Martin Rüegg @martin.rueegg commented Oct 16, 2017

    One thing I have realised is, that labels are correctly generated. However, not assigned to MRs or issues.

    One thing I have realised is, that labels are correctly generated. However, not assigned to MRs or issues.
  • Yorick Peterse @yorickpeterse

    changed milestone to 10.2

    Oct 16, 2017

    changed milestone to 10.2

    changed milestone to %"10.2"
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 16, 2017
    Master

    @martin.rueegg

    I've seen you've discussed the (re-)naming of the importer. I personally would suggest to leave the legacy importer un-renamed (and files not moved) and instead file the new code under e.g. github_import_ng (for next generation) or _v2.

    Including versions/generations in namespaces in my opinion is a bad idea, I'd much rather just have a "legacy" importer vs "the current" importer. It's worth mentioning that I may just rename LegacyGithubImporter to GiteaImporter since it's only used for Gitea.

    Also, would it not be great to have all importers named the same, i.e. Gitlab::GithubImportNg::Importer rather than Github::Import?

    This separation is the result of Sidekiq workers residing in app/workers and not lib/gitlab/X. Having said that I think creating directories such as app/workers/gitlab/github_import/ and placing the workers there should also work. I'll take a look at this.

    Now obviously, this is an issue of the ImportExport module, as it allows to export empty descriptions but does not import them again. However, it might be a follow-up error by the fact that this patch here does import empty descriptions where otherwise this would not be allowed?

    It seems the old importer would validate releases, skipping those that are invalid. This validation process requires a description to be present. Validations are currently bypassed since we insert releases in bulk, though we can quite easily validate data without having to initialize a Release object for every release.

    One thing I have realised is, that labels are correctly generated. However, not assigned to MRs or issues.

    They should be, at least this was working last time I tried the code in this MR locally. Perhaps I broke it while refactoring the code in recent pushes. I'm currently in the process of writing tests for all the code, hopefully that will fix any remaining bugs once I'm done.

    Edited Oct 16, 2017 by Yorick Peterse
    @martin.rueegg > I've seen you've discussed the (re-)naming of the importer. I personally would > suggest to leave the legacy importer un-renamed (and files not moved) and > instead file the new code under e.g. github_import_ng (for next generation) or > \_v2. Including versions/generations in namespaces in my opinion is a bad idea, I'd much rather just have a "legacy" importer vs "the current" importer. It's worth mentioning that I may just rename LegacyGithubImporter to GiteaImporter since it's only used for Gitea. > Also, would it not be great to have all importers named the same, i.e. > Gitlab::GithubImportNg::Importer rather than Github::Import? This separation is the result of Sidekiq workers residing in app/workers and not lib/gitlab/X. Having said that I think creating directories such as app/workers/gitlab/github_import/ and placing the workers there should also work. I'll take a look at this. > Now obviously, this is an issue of the ImportExport module, as it allows to > export empty descriptions but does not import them again. However, it might be > a follow-up error by the fact that this patch here does import empty > descriptions where otherwise this would not be allowed? It seems the old importer would validate releases, skipping those that are invalid. This validation process requires a description to be present. Validations are currently bypassed since we insert releases in bulk, though we can quite easily validate data without having to initialize a Release object for every release. > One thing I have realised is, that labels are correctly generated. However, > not assigned to MRs or issues. They should be, at least this was working last time I tried the code in this MR locally. Perhaps I broke it while refactoring the code in recent pushes. I'm currently in the process of writing tests for all the code, hopefully that will fix any remaining bugs once I'm done.
  • Yorick Peterse @yorickpeterse

    marked the task Representation classes as completed

    Oct 16, 2017

    marked the task Representation classes as completed

    marked the task **Representation classes** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 16, 2017
    Master

    Regarding the milestone: I moved this to 10.2, at least for now. I initially hoped to be able to squeeze this into 10.1 behind a feature flag, but since the 22nd is this Sunday and the summit starts before that I don't think I'll make it.

    Maybe I can make it, but the more likely alternative will be hot patching GitLab.com so we can test this, then ship it in 10.2.0.

    Regarding the milestone: I moved this to 10.2, at least for now. I initially hoped to be able to squeeze this into 10.1 behind a feature flag, but since the 22nd is this Sunday _and_ the summit starts before that I don't think I'll make it. _Maybe_ I can make it, but the more likely alternative will be hot patching GitLab.com so we can test this, then ship it in 10.2.0.
  • Martin Rüegg @martin.rueegg commented Oct 16, 2017

    @yorickpeterse,

    Thanks for your response and all the explanations. Much appreciated.

    Re release description: I would suggest to not skip the import of a given release when no description is given. Rather using the tag name as a description.

    Re labels: I checked the code and I have not been able to identify the issue. However, they are not assigned (or not updated)

    @yorickpeterse, Thanks for your response and all the explanations. Much appreciated. **Re release description:** I would suggest to not skip the import of a given release when no description is given. Rather using the tag name as a description. **Re labels:** I checked the code and I have not been able to identify the issue. However, they are not assigned (or not [updated](https://gitlab.com/gitlab-org/gitlab-ce/blob/github-importer-refactor/lib/github/import.rb#L239))
  • Yorick Peterse @yorickpeterse commented Oct 16, 2017
    Master

    @martin.rueegg That link points to the old importer, not the one being added in this MR. The new code is in the LabelLinksImporter class in this MR. The old importer indeed had issues with not assigning labels properly.

    @martin.rueegg That link points to the old importer, not the one being added in this MR. The new code is in the `LabelLinksImporter` class in this MR. The old importer indeed had issues with not assigning labels properly.
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 3d78cd50 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 16, 2017

    added 1 commit

    • 3d78cd50 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 3d78cd50 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7542456&start_sha=8b4fe973d153c700fe5c60002a165a984e7f5b4f)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 16, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 16, 2017

    changed the description

    changed the description
    Toggle commit list
  • Martin Rüegg @martin.rueegg commented Oct 16, 2017

    @yorickpeterse, oh! :-D ok.

    Well, then I have not run your importer! And I don't know if I can patch my 10.0.3 omnibus installation to test it ...

    @yorickpeterse, oh! :-D ok. Well, then I have not run your importer! And I don't know if I can patch my 10.0.3 omnibus installation to test it ...
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 15c268a8 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 17, 2017

    added 1 commit

    • 15c268a8 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 15c268a8 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7578704&start_sha=3d78cd5055eb0977cd036160a3c904d9dde335a4)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Importer classes as completed

    Oct 17, 2017

    marked the task Importer classes as completed

    marked the task **Importer classes** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Set a release's description to the tag name if none is provided (since a description is required) as completed

    Oct 17, 2017

    marked the task Set a release's description to the tag name if none is provided (since a description is required) as completed

    marked the task **Set a release's description to the tag name if none is provided (since a description is required)** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • e26f3eba - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 17, 2017

    added 1 commit

    • e26f3eba - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * e26f3eba - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7581257&start_sha=15c268a82671aee48c9c20b68a9c443fe42c0ef5)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 17, 2017
    Master

    All code now resides in the Gitlab::GithubImport namespace, instead of the Sidekiq code residing in GithubImporter.

    All code now resides in the `Gitlab::GithubImport` namespace, instead of the Sidekiq code residing in `GithubImporter`.
  • Yorick Peterse @yorickpeterse commented Oct 17, 2017
    Master

    @martin.rueegg

    Well, then I have not run your importer! And I don't know if I can patch my 10.0.3 omnibus installation to test it ...

    If you want to test things you'd have to apply the following patch using patch -p1 -N < file.patch: importer.patch

    I haven't tested this with EE so it might not work there yet. You'll then have to muck around with the Rake task to get it to use the new importer, e.g. by applying this patch: rake.patch. You can then run the new importer using this:

    env IMPORTER=parallel bundle exec rake import:github[token,gitlab-user,gitlab-user/gitlab-repo,github-user/github-repo]

    For example:

    env IMPORTER=parallel bundle exec rake import:github[hunter2,yorickpeterse,yorickpeterse/kubernetes-community,kubernetes/community]

    For this to work you'll need to restart Sidekiq so it has workers available for the new queues. You can also use IMPORTER=sequential to use the new code in sequential mode, but this will be slower for large projects compared to IMPORTER=parallel.

    Edited Oct 17, 2017 by Yorick Peterse
    @martin.rueegg > Well, then I have not run your importer! And I don't know if I can patch my 10.0.3 omnibus installation to test it ... If you want to test things you'd have to apply the following patch using `patch -p1 -N < file.patch`: [importer.patch](/uploads/ae7046f0ed3c5ab2e8498fd63dd07b2d/importer.patch) I haven't tested this with EE so it might not work there yet. You'll then have to muck around with the Rake task to get it to use the new importer, e.g. by applying this patch: [rake.patch](/uploads/5c5c8b2b9ccebb302babd96e57143aeb/rake.patch). You can then run the new importer using this: ``` env IMPORTER=parallel bundle exec rake import:github[token,gitlab-user,gitlab-user/gitlab-repo,github-user/github-repo] ``` For example: ``` env IMPORTER=parallel bundle exec rake import:github[hunter2,yorickpeterse,yorickpeterse/kubernetes-community,kubernetes/community] ``` For this to work you'll need to restart Sidekiq so it has workers available for the new queues. You can also use `IMPORTER=sequential` to use the new code in sequential mode, but this will be slower for large projects compared to `IMPORTER=parallel`.
  • Yorick Peterse @yorickpeterse

    marked the task Changes to Projects::ImportService to support async importers as completed

    Oct 18, 2017

    marked the task Changes to Projects::ImportService to support async importers as completed

    marked the task **Changes to `Projects::ImportService` to support async importers** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Changes to RepositoryImportWorker as completed

    Oct 18, 2017

    marked the task Changes to RepositoryImportWorker as completed

    marked the task **Changes to `RepositoryImportWorker`** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Sidekiq workers (GithubImporter namespace) as completed

    Oct 18, 2017

    marked the task Sidekiq workers (GithubImporter namespace) as completed

    marked the task **Sidekiq workers (`GithubImporter` namespace)** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Sidekiq worker concerns as completed

    Oct 18, 2017

    marked the task Sidekiq worker concerns as completed

    marked the task **Sidekiq worker concerns** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 18, 2017
    Master

    @smcgivern @DouweM This MR is finally reaching a point where all code is tested automatically, meaning it's time to document the changes, write a decent commit message, and most importantly of all: test it on GitLab.com somehow. In particular I want to test the impact of importing projects such as the one the Edge team are working on, how long it takes, if we need more monitoring, etc.

    There are two approaches to doing this:

    1. Include this code in 10.1 but hide it behind a feature flag (that's disabled by default) and don't announce it yet (or announce it as being in a beta state of sorts)
    2. Don't include this in 10.1 and aim for 10.2, then hotpatch GitLab.com until we're done with testing

    The first approach has the benefit of being the easiest to test as we don't need any patching. It also means we can start importing larger projects (if desired) without having to wait another month. There are also two downsides of this approach:

    1. We're way past the merge window and the changes here are quite big (though a feature flag should prevent anything from breaking).
    2. It will require reviewing before this Sunday, which can be difficult with the summit.

    Option 2 doesn't suffer from these problems but instead suffers from the following:

    1. Testing this code will be a pain because we'd have to hot-patch all Sidekiq workers, instantly affecting everybody that imports a GitHub project (unless we also hide the code behind a feature flag).
    2. It will take at least a month before we can ship this to GitLab.com in a state that doesn't require us to run patches after every deploy.

    As such I'd like to know what you think would be best here.

    Edited Oct 19, 2017 by Sean McGivern
    @smcgivern @DouweM This MR is finally reaching a point where all code is tested automatically, meaning it's time to document the changes, write a decent commit message, and most importantly of all: test it on GitLab.com somehow. In particular I want to test the impact of importing projects such as the one the Edge team are working on, how long it takes, if we need more monitoring, etc. There are two approaches to doing this: 1. Include this code in 10.1 but hide it behind a feature flag (that's disabled by default) and don't announce it yet (or announce it as being in a beta state of sorts) 2. Don't include this in 10.1 and aim for 10.2, then hotpatch GitLab.com until we're done with testing The first approach has the benefit of being the easiest to test as we don't need any patching. It also means we can start importing larger projects (if desired) without having to wait another month. There are also two downsides of this approach: 1. We're way past the merge window and the changes here are quite big (though a feature flag should prevent anything from breaking). 2. It will require reviewing before this Sunday, which can be difficult with the summit. Option 2 doesn't suffer from these problems but instead suffers from the following: 1. Testing this code will be a pain because we'd have to hot-patch _all_ Sidekiq workers, instantly affecting everybody that imports a GitHub project (unless we also hide the code behind a feature flag). 2. It will take at least a month before we can ship this to GitLab.com in a state that doesn't require us to run patches after every deploy. As such I'd like to know what you think would be best here.
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 590622b8 - Rewrite the GitHub importer from scratch

    Compare with previous version

    Oct 18, 2017

    added 1 commit

    • 590622b8 - Rewrite the GitHub importer from scratch

    Compare with previous version

    added 1 commit * 590622b8 - Rewrite the GitHub importer from scratch [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7616496&start_sha=e26f3eba4ef324d3fd249f254b13af902315e136)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 3158030f - Rewrite the GitHub importer from scratch
    • 73632809 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 18, 2017

    added 2 commits

    • 3158030f - Rewrite the GitHub importer from scratch
    • 73632809 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 3158030f - Rewrite the GitHub importer from scratch * 73632809 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7617515&start_sha=590622b8b9c2b22d96a111acce7f5aeb40c774d2)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 18, 2017
    Master

    73632809 enables the new importer using a feature flag. It's not super pretty but it gives an idea of how we'd enable/test this until 10.2, if we were to include this in 10.1 that is.

    736328090e3a4276bccfc6de9a52da2a075baf8f enables the new importer using a feature flag. It's not super pretty but it gives an idea of how we'd enable/test this until 10.2, if we were to include this in 10.1 that is.
  • Yorick Peterse @yorickpeterse

    unmarked as a Work In Progress

    Oct 18, 2017

    unmarked as a Work In Progress

    unmarked as a **Work In Progress**
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 18, 2017

    changed the description

    changed the description
    Toggle commit list
  • Sean McGivern @smcgivern commented Oct 19, 2017
    Master

    @yorickpeterse thanks for laying out those options. I don't like the idea of merging and picking by Sunday. However, if it's behind a feature flag anyway, could we not just include it in a patch to 10.1? It is stretching the definition of 'patch' quite a lot, but if it has no impact without the feature flag, that seems like a reasonable compromise?

    @yorickpeterse thanks for laying out those options. I don't like the idea of merging and picking by Sunday. However, if it's behind a feature flag anyway, could we not just include it in a patch to 10.1? It is stretching the definition of 'patch' quite a lot, but if it has no impact without the feature flag, that seems like a reasonable compromise?
  • Douwe Maan @DouweM commented Oct 19, 2017
    Master

    @yorickpeterse @smcgivern I like the compromise option!

    @yorickpeterse @smcgivern I like the compromise option!
  • Yorick Peterse @yorickpeterse

    added 321 commits

    • 73632809...bd39b441 - 318 commits from branch master
    • f3e238c2 - Refactor User.find_by_any_email
    • 35538cd5 - Rewrite the GitHub importer from scratch
    • eeb18a5b - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 19, 2017

    added 321 commits

    • 73632809...bd39b441 - 318 commits from branch master
    • f3e238c2 - Refactor User.find_by_any_email
    • 35538cd5 - Rewrite the GitHub importer from scratch
    • eeb18a5b - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 321 commits * 73632809...bd39b441 - 318 commits from branch `master` * f3e238c2 - Refactor User.find_by_any_email * 35538cd5 - Rewrite the GitHub importer from scratch * eeb18a5b - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7644325&start_sha=736328090e3a4276bccfc6de9a52da2a075baf8f)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 20, 2017
    Master

    @smcgivern That seems like a better option, that way we don't have to rush things out now.

    Unrelated to this, it seems the build is failing due to static analysis tools failing on unrelated files. Not sure what's going on here.

    @smcgivern That seems like a better option, that way we don't have to rush things out now. Unrelated to this, it seems the build is failing due to static analysis tools failing on unrelated files. Not sure what's going on here.
  • Sean McGivern @smcgivern commented Oct 20, 2017
    Master

    @yorickpeterse hmm, that's weird. One of the errors seems slightly related, but the others are completely unrelated. Does bundle exec rake flay also fail locally?

    @yorickpeterse hmm, that's weird. One of the errors seems slightly related, but the others are completely unrelated. Does `bundle exec rake flay` also fail locally?
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • f7a2fa17 - Rewrite the GitHub importer from scratch
    • 0904a8ae - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 20, 2017

    added 2 commits

    • f7a2fa17 - Rewrite the GitHub importer from scratch
    • 0904a8ae - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * f7a2fa17 - Rewrite the GitHub importer from scratch * 0904a8ae - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7666867&start_sha=eeb18a5be73f38943ba8dd1f538058b543d4ebdb)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 20, 2017
    Master

    @smcgivern Locally it fails as well, but it seems to spit out similar output when I switch back to the master branch:

    app/models/user.rb:41 :: parse error on value ":" (tCOLON)
    skipping app/models/user.rb
    app/models/concerns/awardable.rb:5 :: parse error on value nil (error)
    skipping app/models/concerns/awardable.rb
    lib/gitlab/git/storage/health.rb:29 :: parse error on value ["do", 29] (kDO_BLOCK)
    skipping lib/gitlab/git/storage/health.rb
    lib/gitlab/git/storage/forked_storage_check.rb:46 :: parse error on value "<<" (tLSHFT)
    skipping lib/gitlab/git/storage/forked_storage_check.rb
    lib/gitlab/background_migration/delete_conflicting_redirect_routes_range.rb:28 :: parse error on value "<<" (tLSHFT)
    skipping lib/gitlab/background_migration/delete_conflicting_redirect_routes_range.rb
    lib/gitlab/background_migration/create_fork_network_memberships_range.rb:28 :: parse error on value "FROM" (tCONSTANT)
    skipping lib/gitlab/background_migration/create_fork_network_memberships_range.rb
    lib/gitlab/background_migration/populate_fork_networks_range.rb:15 :: parse error on value "FROM" (tCONSTANT)
    skipping lib/gitlab/background_migration/populate_fork_networks_range.rb
    lib/gitlab/sql/glob.rb:9 :: parse error on value "<<" (tLSHFT)
    skipping lib/gitlab/sql/glob.rb
    lib/gitlab/exclusive_lease.rb:13 :: parse error on value "<<" (tLSHFT)
    skipping lib/gitlab/exclusive_lease.rb
    @smcgivern Locally it fails as well, but it seems to spit out similar output when I switch back to the `master` branch: ``` app/models/user.rb:41 :: parse error on value ":" (tCOLON) skipping app/models/user.rb app/models/concerns/awardable.rb:5 :: parse error on value nil (error) skipping app/models/concerns/awardable.rb lib/gitlab/git/storage/health.rb:29 :: parse error on value ["do", 29] (kDO_BLOCK) skipping lib/gitlab/git/storage/health.rb lib/gitlab/git/storage/forked_storage_check.rb:46 :: parse error on value "<<" (tLSHFT) skipping lib/gitlab/git/storage/forked_storage_check.rb lib/gitlab/background_migration/delete_conflicting_redirect_routes_range.rb:28 :: parse error on value "<<" (tLSHFT) skipping lib/gitlab/background_migration/delete_conflicting_redirect_routes_range.rb lib/gitlab/background_migration/create_fork_network_memberships_range.rb:28 :: parse error on value "FROM" (tCONSTANT) skipping lib/gitlab/background_migration/create_fork_network_memberships_range.rb lib/gitlab/background_migration/populate_fork_networks_range.rb:15 :: parse error on value "FROM" (tCONSTANT) skipping lib/gitlab/background_migration/populate_fork_networks_range.rb lib/gitlab/sql/glob.rb:9 :: parse error on value "<<" (tLSHFT) skipping lib/gitlab/sql/glob.rb lib/gitlab/exclusive_lease.rb:13 :: parse error on value "<<" (tLSHFT) skipping lib/gitlab/exclusive_lease.rb ```
  • Sean McGivern @smcgivern commented Oct 20, 2017
    Master

    @yorickpeterse ah, it looks like that exits with zero, though. Our rake task is this:

    desc 'Code duplication analyze via flay'
    task :flay do
      output = `bundle exec flay --mass 35 app/ lib/gitlab/`
    
      if output.include? "Similar code found"
        puts output
        exit 1
      end
    end

    So I think that if we don't care about the similar code in the importers (haven't reviewed this), we could ignore that, or update the threshold. I have no idea why the untouched files also generate that message.

    @yorickpeterse ah, it looks like that exits with zero, though. Our rake task is this: ```ruby desc 'Code duplication analyze via flay' task :flay do output = `bundle exec flay --mass 35 app/ lib/gitlab/` if output.include? "Similar code found" puts output exit 1 end end ``` So I think that if we don't care about the similar code in the importers (haven't reviewed this), we could ignore that, or update the threshold. I have no idea why the untouched files also generate that message.
  • Yorick Peterse @yorickpeterse commented Oct 20, 2017
    Master

    @smcgivern Ah ok. The similar code issue is something I'm tackling now so that should solve the failure.

    @smcgivern Ah ok. The similar code issue is something I'm tackling now so that should solve the failure.
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 2a18b277 - Rewrite the GitHub importer from scratch
    • 84c1fa49 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 20, 2017

    added 2 commits

    • 2a18b277 - Rewrite the GitHub importer from scratch
    • 84c1fa49 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 2a18b277 - Rewrite the GitHub importer from scratch * 84c1fa49 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7668003&start_sha=0904a8ae14462a3838c56251f6387366f7827341)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 20, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • da87fe65 - Rewrite the GitHub importer from scratch
    • a64edfb4 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 20, 2017

    added 2 commits

    • da87fe65 - Rewrite the GitHub importer from scratch
    • a64edfb4 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * da87fe65 - Rewrite the GitHub importer from scratch * a64edfb4 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7672874&start_sha=84c1fa49f76084b0abf047de62eb55d994a9bacc)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 75428ea0 - Rewrite the GitHub importer from scratch
    • bcb7790a - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 20, 2017

    added 2 commits

    • 75428ea0 - Rewrite the GitHub importer from scratch
    • bcb7790a - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 75428ea0 - Rewrite the GitHub importer from scratch * bcb7790a - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7673821&start_sha=a64edfb4c58be5317c30d6bd3aec0e0e4cd51cf1)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • a3c1baa2 - Rewrite the GitHub importer from scratch
    • b534270d - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 21, 2017

    added 2 commits

    • a3c1baa2 - Rewrite the GitHub importer from scratch
    • b534270d - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * a3c1baa2 - Rewrite the GitHub importer from scratch * b534270d - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7678611&start_sha=bcb7790ad8e4f289e940863ec84ae83c2a080b60)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly) as completed

    Oct 21, 2017

    marked the task See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly) as completed

    marked the task **See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly)** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 21, 2017
    Master

    @rymai In the old importer we'd re-fetch the refs of a repository if a pull request was updated after the repository was cloned. In the parallel importer this is currently not done because:

    1. Pull requests are imported in separate threads/processes/hosts meaning we can't store a timestamp in a local variable and re-use it.
    2. Pull request jobs are not given any additional timestamps for this, and I prefer to keep it that way. Right now the Sidekiq interface is generic, making it easy to add more workers without having to duplicate a lot of code. Adding this fetch feature would require specialisation just for importing PRs.
    3. This would require some form of synchronisation (e.g. an ExclusiveLease) as otherwise many jobs could attempt to run the same git fetch. Adding this is something I prefer to avoid

    This leads me to this question: is this feature important enough that we want to include it?

    Edited Oct 21, 2017 by Yorick Peterse
    @rymai In the old importer we'd re-fetch the refs of a repository if a pull request was updated after the repository was cloned. In the parallel importer this is currently not done because: 1. Pull requests are imported in separate threads/processes/hosts meaning we can't store a timestamp in a local variable and re-use it. 2. Pull request jobs are not given any additional timestamps for this, and I prefer to keep it that way. Right now the Sidekiq interface is generic, making it easy to add more workers without having to duplicate a lot of code. Adding this fetch feature would require specialisation just for importing PRs. 3. This would require some form of synchronisation (e.g. an ExclusiveLease) as otherwise many jobs could attempt to run the same `git fetch`. Adding this is something I prefer to avoid This leads me to this question: is this feature important enough that we want to include it?
  • Yorick Peterse @yorickpeterse commented Oct 21, 2017
    Master

    To reduce the number of queries and Gitaly RPC calls I made some changes to the Note, DiffNote, and Issue importers. These changes result in the importers bypassing the callbacks and validations of models by using Gitlab::Database.bulk_insert instead of the models. Bypassing the models (and thus any callbacks and validations) saves us a lot of SQL queries and a handful of Gitaly RPC calls.

    To test this I imported https://github.com/yorickpeterse/oga into my local development environment. The before/after can be seen below.

    Before bypassing callbacks/validation

    Gitaly calls:

    RPC call Amount
    repository_exists 36
    has_local_branches 1
    fetch_remote 1
    find_commit 877
    commits_between 70
    commit_diff 70
    ref_exists 490
    • Total RPC calls: 1545
    • Total SQL queries: 10806
    • Total time: 165 seconds

    After bypassing callbacks/validation

    Gitaly calls:

    RPC call Amount
    repository_exists 36
    has_local_branches 1
    fetch_remote 1
    find_commit 770
    commits_between 70
    commit_diff 70
    ref_exists 420
    • Total RPC calls: 1368
    • Total SQL queries: 4621
    • Total time: 87 seconds

    Conclusion

    So by making these changes we save ourselves 177 Gitaly RPC calls, 6185 SQL queries, and about 78 seconds of execution time. Do keep in mind that this test was run over the summit's hotel wifi, thus the timings are likely to be influenced by how responsive the wifi was during the test.

    To reduce the number of queries and Gitaly RPC calls I made some changes to the Note, DiffNote, and Issue importers. These changes result in the importers bypassing the callbacks and validations of models by using `Gitlab::Database.bulk_insert` instead of the models. Bypassing the models (and thus any callbacks and validations) saves us a lot of SQL queries and a handful of Gitaly RPC calls. To test this I imported https://github.com/yorickpeterse/oga into my local development environment. The before/after can be seen below. ## Before bypassing callbacks/validation Gitaly calls: | RPC call | Amount | |--------------------|--------| | repository_exists | 36 | | has_local_branches | 1 | | fetch_remote | 1 | | find_commit | 877 | | commits_between | 70 | | commit_diff | 70 | | ref_exists | 490 | * Total RPC calls: 1545 * Total SQL queries: 10806 * Total time: 165 seconds ## After bypassing callbacks/validation Gitaly calls: | RPC call | Amount | |--------------------|--------| | repository_exists | 36 | | has_local_branches | 1 | | fetch_remote | 1 | | find_commit | 770 | | commits_between | 70 | | commit_diff | 70 | | ref_exists | 420 | * Total RPC calls: 1368 * Total SQL queries: 4621 * Total time: 87 seconds ## Conclusion So by making these changes we save ourselves 177 Gitaly RPC calls, 6185 SQL queries, and about 78 seconds of execution time. Do keep in mind that this test was run over the summit's hotel wifi, thus the timings are likely to be influenced by how responsive the wifi was during the test.
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 98786f21 - Rewrite the GitHub importer from scratch
    • b768f6a0 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 21, 2017

    added 2 commits

    • 98786f21 - Rewrite the GitHub importer from scratch
    • b768f6a0 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 98786f21 - Rewrite the GitHub importer from scratch * b768f6a0 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7682613&start_sha=b534270d4ad6c3449358c391cb6e8aaa1c7b1605)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 21, 2017
    Master

    With some further caching (of labels and milestones) we can reduce the number of SQL queries to 4470 for Oga. I'll see if I can further reduce the queries, since 4470 is still quite a lot for such a small project.

    With some further caching (of labels and milestones) we can reduce the number of SQL queries to 4470 for Oga. I'll see if I can further reduce the queries, since 4470 is still quite a lot for such a small project.
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 2254fe42 - Rewrite the GitHub importer from scratch
    • b7269bda - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 21, 2017

    added 2 commits

    • 2254fe42 - Rewrite the GitHub importer from scratch
    • b7269bda - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 2254fe42 - Rewrite the GitHub importer from scratch * b7269bda - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7683228&start_sha=b768f6a09a7c7e7cee400e4f80f6905f617b50cb)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 21, 2017
    Master

    As it turns out the caching of user IDs/Emails wasn't entirely working as expected. Fixing this and some other issues I ran into reduces the number of SQL queries to 4409. This is still a lot, but we're slowly getting there.

    As it turns out the caching of user IDs/Emails wasn't entirely working as expected. Fixing this and some other issues I ran into reduces the number of SQL queries to 4409. This is still a lot, but we're slowly getting there.
  • Yorick Peterse @yorickpeterse

    added 3 commits

    • 68a73f3b - Add returning IDs to Gitlab::Database.bulk_insert
    • 0fa4e34a - Rewrite the GitHub importer from scratch
    • d85c051f - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 21, 2017

    added 3 commits

    • 68a73f3b - Add returning IDs to Gitlab::Database.bulk_insert
    • 0fa4e34a - Rewrite the GitHub importer from scratch
    • d85c051f - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 3 commits * 68a73f3b - Add returning IDs to Gitlab::Database.bulk_insert * 0fa4e34a - Rewrite the GitHub importer from scratch * d85c051f - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7684200&start_sha=b7269bdab04ceab32e48842cd3e2ad7b50e9596d)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 701ad4c2 - Rewrite the GitHub importer from scratch
    • 7ca59490 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 22, 2017

    added 2 commits

    • 701ad4c2 - Rewrite the GitHub importer from scratch
    • 7ca59490 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 701ad4c2 - Rewrite the GitHub importer from scratch * 7ca59490 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7689537&start_sha=d85c051f149bd63f578ed69d1e9ccbbeb9703707)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in issue #39361 (closed)

    Oct 22, 2017

    mentioned in issue #39361 (closed)

    mentioned in issue #39361
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • f92dc038 - Rewrite the GitHub importer from scratch
    • ef3c2f23 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 22, 2017

    added 2 commits

    • f92dc038 - Rewrite the GitHub importer from scratch
    • ef3c2f23 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * f92dc038 - Rewrite the GitHub importer from scratch * ef3c2f23 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7690707&start_sha=7ca5949098d39af67bde49fdd16d3965301c5ff3)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 22, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 22, 2017
    Master

    With some further caching being added for issue/MR lookups (when importing notes, diff notes, and label links) the number of SQL queries can be reduced to 3590. About 1450 queries are for getting the feature flag names (see #39361 (closed) for more information), which isn't something I can really work around in this MR.

    The remaining queries are mostly SELECT queries to build the cache, and a lot of INSERTs for the data we're importing.

    With some further caching being added for issue/MR lookups (when importing notes, diff notes, and label links) the number of SQL queries can be reduced to 3590. About 1450 queries are for getting the feature flag names (see https://gitlab.com/gitlab-org/gitlab-ce/issues/39361 for more information), which isn't something I can really work around in this MR. The remaining queries are mostly SELECT queries to build the cache, and a lot of INSERTs for the data we're importing.
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 7444f768 - Rewrite the GitHub importer from scratch
    • 064d642d - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 22, 2017

    added 2 commits

    • 7444f768 - Rewrite the GitHub importer from scratch
    • 064d642d - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 7444f768 - Rewrite the GitHub importer from scratch * 064d642d - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7691088&start_sha=ef3c2f2312badd054c381c9dbfb25638cc303083)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 972aa37e - Rewrite the GitHub importer from scratch
    • 9a48ad41 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 22, 2017

    added 2 commits

    • 972aa37e - Rewrite the GitHub importer from scratch
    • 9a48ad41 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 972aa37e - Rewrite the GitHub importer from scratch * 9a48ad41 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7691095&start_sha=064d642da1d2c9bb029bfd50b979cf37f05c52cc)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 6ec1ad62 - Rewrite the GitHub importer from scratch
    • f6ee0b5d - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 23, 2017

    added 2 commits

    • 6ec1ad62 - Rewrite the GitHub importer from scratch
    • f6ee0b5d - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 6ec1ad62 - Rewrite the GitHub importer from scratch * f6ee0b5d - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7701530&start_sha=9a48ad412ca9e33abc33542754b3f7497a1670ec)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • c345b5af - Rewrite the GitHub importer from scratch
    • 61392125 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 23, 2017

    added 2 commits

    • c345b5af - Rewrite the GitHub importer from scratch
    • 61392125 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * c345b5af - Rewrite the GitHub importer from scratch * 61392125 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7704044&start_sha=f6ee0b5db27b19c6c5bd1f346e6ad090b35c0b9f)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • bf44cf60 - Rewrite the GitHub importer from scratch
    • 8be71784 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 23, 2017

    added 2 commits

    • bf44cf60 - Rewrite the GitHub importer from scratch
    • 8be71784 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * bf44cf60 - Rewrite the GitHub importer from scratch * 8be71784 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7706431&start_sha=61392125d14f0a170881f7c5dd22aa3e21017616)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 1a1eb38f - Rewrite the GitHub importer from scratch
    • 729cfb7f - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 23, 2017

    added 2 commits

    • 1a1eb38f - Rewrite the GitHub importer from scratch
    • 729cfb7f - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 1a1eb38f - Rewrite the GitHub importer from scratch * 729cfb7f - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7713662&start_sha=8be717845b6584dbab722e02a05607a41e0ac85a)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • d7715ec1 - Rewrite the GitHub importer from scratch
    • 8248fc21 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 23, 2017

    added 2 commits

    • d7715ec1 - Rewrite the GitHub importer from scratch
    • 8248fc21 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * d7715ec1 - Rewrite the GitHub importer from scratch * 8248fc21 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7713984&start_sha=729cfb7f10218ca422fe8767d9be8ed1bba5779b)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 7ad447dd - Rewrite the GitHub importer from scratch
    • 573a9eb4 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 23, 2017

    added 2 commits

    • 7ad447dd - Rewrite the GitHub importer from scratch
    • 573a9eb4 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 7ad447dd - Rewrite the GitHub importer from scratch * 573a9eb4 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7715256&start_sha=8248fc21964586771750532e73f959cf5aa705d1)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 23, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 4af5b7d9 - Rewrite the GitHub importer from scratch
    • e3cfb168 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 23, 2017

    added 2 commits

    • 4af5b7d9 - Rewrite the GitHub importer from scratch
    • e3cfb168 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 4af5b7d9 - Rewrite the GitHub importer from scratch * e3cfb168 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7721591&start_sha=573a9eb44c13a0c40dc2065365ec5f2edfe9b04d)
    Toggle commit list
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai

    mentioned in issue #33135 (closed)

    Oct 24, 2017

    mentioned in issue #33135 (closed)

    mentioned in issue #33135
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 133 commits

    • e3cfb168...e16add22 - 129 commits from branch master
    • cb58a673 - Refactor User.find_by_any_email
    • 87722696 - Add returning IDs to Gitlab::Database.bulk_insert
    • 371d16e4 - Rewrite the GitHub importer from scratch
    • 661489c0 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 24, 2017

    added 133 commits

    • e3cfb168...e16add22 - 129 commits from branch master
    • cb58a673 - Refactor User.find_by_any_email
    • 87722696 - Add returning IDs to Gitlab::Database.bulk_insert
    • 371d16e4 - Rewrite the GitHub importer from scratch
    • 661489c0 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 133 commits * e3cfb168...e16add22 - 129 commits from branch `master` * cb58a673 - Refactor User.find_by_any_email * 87722696 - Add returning IDs to Gitlab::Database.bulk_insert * 371d16e4 - Rewrite the GitHub importer from scratch * 661489c0 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7728200&start_sha=e3cfb168e81608e969ec9f73eb36614bda64fc29)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 4836cf3a - Rewrite the GitHub importer from scratch
    • 3bd8aa3d - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 24, 2017

    added 2 commits

    • 4836cf3a - Rewrite the GitHub importer from scratch
    • 3bd8aa3d - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 4836cf3a - Rewrite the GitHub importer from scratch * 3bd8aa3d - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7728552&start_sha=661489c0aee1ee478efabdc845462bda90b76b18)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 57ccaa84 - Rewrite the GitHub importer from scratch
    • f7e1239c - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 24, 2017

    added 2 commits

    • 57ccaa84 - Rewrite the GitHub importer from scratch
    • f7e1239c - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 57ccaa84 - Rewrite the GitHub importer from scratch * f7e1239c - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7729151&start_sha=3bd8aa3dc4347b1529f385b034501f7832c4e771)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 85749fb8 - Rewrite the GitHub importer from scratch
    • 62399e02 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 24, 2017

    added 2 commits

    • 85749fb8 - Rewrite the GitHub importer from scratch
    • 62399e02 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 85749fb8 - Rewrite the GitHub importer from scratch * 62399e02 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7729524&start_sha=f7e1239cd0a7b3e8307630fc5d43343611215759)
    Toggle commit list
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 24, 2017
    Master

    is this feature important enough that we want to include it?

    We could probably fail a PR import if the ref isn't fetched and restart a new import, which would do a fetch first so that PR that weren't imported correctly would be imported fine since their refs would have been fetched this time... I still need to look at this massive diff to see if that'd be possible, but you probably see the idea.

    > is this feature important enough that we want to include it? We could probably fail a PR import if the ref isn't fetched and restart a new import, which would do a `fetch` first so that PR that weren't imported correctly would be imported fine since their refs would have been fetched this time... I still need to look at this *massive* diff to see if that'd be possible, but you probably see the idea.
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai

    mentioned in issue #39427 (closed)

    Oct 24, 2017

    mentioned in issue #39427 (closed)

    mentioned in issue #39427
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 9b14070d - Rewrite the GitHub importer from scratch
    • 240d77c7 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 24, 2017

    added 2 commits

    • 9b14070d - Rewrite the GitHub importer from scratch
    • 240d77c7 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 9b14070d - Rewrite the GitHub importer from scratch * 240d77c7 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7729985&start_sha=62399e0254a1bf69524e2ee3a20bee189cf33703)
    Toggle commit list
  • Douglas Barbosa Alexandre
    @dbalexandre started a discussion on the diff Oct 24, 2017
    Resolved by Yorick Peterse Oct 25, 2017
    app/workers/gitlab/github_import/stage/import_repository_worker.rb 0 → 100644
    1 # frozen_string_literal: true
    2
    3 module Gitlab
    4 module GithubImport
    5 module Stage
    6 class ImportRepositoryWorker
    7 include Sidekiq::Worker
    8 include GithubImport::Queue
    9 include StageMethods
    10
    11 def import(client, project)
    12 importer = Importer::RepositoryImporter.new(project, client)
    • Douglas Barbosa Alexandre @dbalexandre commented Oct 24, 2017
      Master

      @yorickpeterse I tried to import a large repository today but fetching the repository data over the summit's hotel wifi take a while and the import failed with the following error message: Import timed out. Import took longer than 54000 seconds. Is there a way that we can refresh the JID expiration while fetching the repository data?

      @yorickpeterse I tried to import a large repository today but fetching the repository data over the summit's hotel wifi take a while and the import failed with the following error message: `Import timed out. Import took longer than 54000 seconds`. Is there a way that we can refresh the JID expiration while fetching the repository data?
    • Yorick Peterse @yorickpeterse commented Oct 24, 2017
      Master

      @dbalexandre I don't think in practise any repository would ever take more than 15 hours of cloning, at least not in production environments.

      We have a worker (AdvanceStageWorker) that takes care of refreshing the JID (this also happens in a few other places), but this is only done when the stage it waits for has been finished. This is necessary as always refreshing the JID can result in the jobs never getting finished. Basically what could happen in that case is:

      1. A job in the stage fails to update our Gitlab::JobWaiter status
      2. AdvanceStageWorker refreshes the JID, sees work remains, reschedules itself
      3. This keeps happening forever. StuckImportJobsWorker won't be able to mark the import as failed since the JID keeps getting refreshed.

      So what we'd have to do is schedule a separate job that runs e.g. every minute and refreshes the JID, until the cloning is done. I however wonder if in production environments we'd ever need this.

      @dbalexandre I don't think in practise any repository would ever take more than 15 hours of cloning, at least not in production environments. We have a worker (AdvanceStageWorker) that takes care of refreshing the JID (this also happens in a few other places), but this is only done when the stage it waits for has been finished. This is necessary as always refreshing the JID can result in the jobs never getting finished. Basically what could happen in that case is: 1. A job in the stage fails to update our `Gitlab::JobWaiter` status 2. AdvanceStageWorker refreshes the JID, sees work remains, reschedules itself 3. This keeps happening forever. StuckImportJobsWorker won't be able to mark the import as failed since the JID keeps getting refreshed. So what we'd have to do is schedule a separate job that runs e.g. every minute and refreshes the JID, until the cloning is done. I however wonder if in production environments we'd ever need this.
    • Yorick Peterse @yorickpeterse commented Oct 24, 2017
      Master

      Basically the problem is this:

      If we schedule a job every X minutes that refreshes the status it's possible for the import to get stuck forever, and StuckImportJobsWorker won't be able to mark the import as failed. If we don't do this we need to complete the cloning work in 15 hours.

      If 15 hours isn't enough we could set the TTL to 24 hours, but I honestly think 15 hours should be more than enough to clone a repository.

      Edited Oct 24, 2017 by Yorick Peterse
      Basically the problem is this: If we schedule a job every X minutes that refreshes the status it's possible for the import to get stuck forever, and `StuckImportJobsWorker` won't be able to mark the import as failed. If we don't do this we need to complete the cloning work in 15 hours. If 15 hours isn't enough we could set the TTL to 24 hours, but I honestly think 15 hours should be more than enough to clone a repository.
    • Yorick Peterse @yorickpeterse commented Oct 24, 2017
      Master

      What might work is this:

      Before cloning the repository we note down the JID, then we schedule job A. Job A will run every few minutes and check if the clone job is done. When job A runs it will also refresh the importer JID (a different JID than we use for checking if the clone job runs). Once the clone job has died or completed we stop this process.

      If I'm right this should prevent a job from getting stuck while also preventing it from getting marked as "failed" if a clone takes longer than 15 hours.

      What might work is this: Before cloning the repository we note down the JID, then we schedule job A. Job A will run every few minutes and check if the clone job is done. When job A runs it will also refresh the importer JID (a different JID than we use for checking if the clone job runs). Once the clone job has died or completed we stop this process. If I'm right this should prevent a job from getting stuck while also preventing it from getting marked as "failed" if a clone takes longer than 15 hours.
    • Yorick Peterse @yorickpeterse commented Oct 25, 2017
      Master

      This has been implemented.

      This has been implemented.
    Please register or sign in to reply
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 21b194d9 - Rewrite the GitHub importer from scratch
    • 448d1e4c - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 24, 2017

    added 2 commits

    • 21b194d9 - Rewrite the GitHub importer from scratch
    • 448d1e4c - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 21b194d9 - Rewrite the GitHub importer from scratch * 448d1e4c - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7737620&start_sha=240d77c79feaf14c1bb5c75f6a04bfa22b147b8c)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • c8728573 - Rewrite the GitHub importer from scratch
    • 2f2dbd72 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 24, 2017

    added 2 commits

    • c8728573 - Rewrite the GitHub importer from scratch
    • 2f2dbd72 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * c8728573 - Rewrite the GitHub importer from scratch * 2f2dbd72 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7738250&start_sha=448d1e4ce6ad2c4874cc0f257577e6b053d0e425)
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 24, 2017
    Master

    @rymai

    We could probably fail a PR import if the ref isn't fetched and restart a new import, which would do a fetch first so that PR that weren't imported correctly would be imported fine since their refs would have been fetched this time... I still need to look at this massive diff to see if that'd be possible, but you probably see the idea.

    All pull requests are imported in parallel, and I'd rather not have all of them also check the repository status using Git as this is likely to blow up the NFS shard.

    So if we want this we need to store a timestamp somewhere that marks the time we finished cloning the repository. We then need synchronisation across processes/threads to make this work, otherwise e.g. 100 threads will run git fetch in parallel.

    @rymai >>> We could probably fail a PR import if the ref isn't fetched and restart a new import, which would do a `fetch` first so that PR that weren't imported correctly would be imported fine since their refs would have been fetched this time... I still need to look at this _massive_ diff to see if that'd be possible, but you probably see the idea. >>> All pull requests are imported in parallel, and I'd rather not have all of them _also_ check the repository status using Git as this is likely to blow up the NFS shard. So _if_ we want this we need to store a timestamp somewhere that marks the time we finished cloning the repository. We then need synchronisation across processes/threads to make this work, otherwise e.g. 100 threads will run `git fetch` in parallel.
  • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 24, 2017
    Master

    All pull requests are imported in parallel, and I'd rather not have all of them also check the repository status using Git as this is likely to blow up the NFS shard.

    @yorickpeterse Sorry, it seems I wasn't clear in my answer above, I meant if a MR diff generation fails (most probably because ref isn't fetched), then we should retry the import after it's finished (in a failed state because of the MR diff that couldn't be generated). The import retry would start with a fresh git fetch, so the MR diff that couldn't be generated should now be generated properly...

    Anyways, I will need to go through the diff to understand better how this will (or won't) work.

    > All pull requests are imported in parallel, and I'd rather not have all of them _also_ check the repository status using Git as this is likely to blow up the NFS shard. @yorickpeterse Sorry, it seems I wasn't clear in my answer above, I meant if a MR diff generation fails (most probably because ref isn't fetched), then we should retry the import after it's finished (in a failed state because of the MR diff that couldn't be generated). The import retry would start with a fresh `git fetch`, so the MR diff that couldn't be generated should now be generated properly... Anyways, I will need to go through the diff to understand better how this will (or won't) work.
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 6f58e81d - Rewrite the GitHub importer from scratch
    • 6149b7aa - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 24, 2017

    added 2 commits

    • 6f58e81d - Rewrite the GitHub importer from scratch
    • 6149b7aa - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 6f58e81d - Rewrite the GitHub importer from scratch * 6149b7aa - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7740211&start_sha=2f2dbd72f441ca0316dfb4efe2601d17ca61441f)
    Toggle commit list
  • Yorick Peterse
    @yorickpeterse started a discussion Oct 24, 2017
    Resolved by Yorick Peterse Oct 30, 2017
    • Yorick Peterse @yorickpeterse commented Oct 24, 2017
      Master

      @rymai Let's discuss the PR in this discussion instead of separate comments, to keep things clear.

      What we could do is update projects.last_activity_at once done cloning. We can then compare this with the updated_at value of a pull request. This would still require synchronisation when fetching, but at least we don't need to pass along extra data. I however don't really like the idea of abusing last_activity_at for this.

      @rymai Let's discuss the PR in this discussion instead of separate comments, to keep things clear. What we _could_ do is update `projects.last_activity_at` once done cloning. We can then compare this with the `updated_at` value of a pull request. This would still require synchronisation when fetching, but at least we don't need to pass along extra data. I however don't really like the idea of abusing `last_activity_at` for this.
    • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 26, 2017
      Master

      I however don't really like the idea of abusing last_activity_at for this.

      Yeah I feel the same, but we could probably start with that...

      > I however don't really like the idea of abusing `last_activity_at` for this. Yeah I feel the same, but we could probably start with that...
    • Yorick Peterse @yorickpeterse commented Oct 30, 2017
      Master

      @rymai I think we can do the following:

      Instead of fetching when processing individual PRs (this would require synchronisation) we do this when scheduling the jobs (this runs in a single thread). This works because when scheduling PR jobs we include the commit IDs and what not, meaning we have a stable view of the PR (even after a push).

      The benefit of this approach is that we don't need to schedule any additional data with PR jobs, and we don't need synchronisation. We however still need to store a timestamp somewhere that tracks the last clone time.

      Apparently we have a timestamp called projects.last_repository_updated_at. I think using that makes more sense than last_activity_at.

      @rymai I think we can do the following: Instead of fetching when processing individual PRs (this would require synchronisation) we do this when scheduling the jobs (this runs in a single thread). This works because when scheduling PR jobs we include the commit IDs and what not, meaning we have a stable view of the PR (even after a push). The benefit of this approach is that we don't need to schedule any additional data with PR jobs, and we don't need synchronisation. We however still need to store a timestamp somewhere that tracks the last clone time. Apparently we have a timestamp called `projects.last_repository_updated_at`. I think using that makes more sense than `last_activity_at`.
    Please register or sign in to reply
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 9872c93b - Rewrite the GitHub importer from scratch
    • 7b1d922a - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 25, 2017

    added 2 commits

    • 9872c93b - Rewrite the GitHub importer from scratch
    • 7b1d922a - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 9872c93b - Rewrite the GitHub importer from scratch * 7b1d922a - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7782373&start_sha=6149b7aa56d2158b6347f0cd928518a0a6549bf4)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 141 commits

    • 7b1d922a...6919f636 - 136 commits from branch master
    • dad72cdd - Refactor User.find_by_any_email
    • 98d2aab8 - Add returning IDs to Gitlab::Database.bulk_insert
    • 03f8f3d3 - Cache feature names in RequestStore
    • 0cc94878 - Rewrite the GitHub importer from scratch
    • 023e5ae2 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 30, 2017

    added 141 commits

    • 7b1d922a...6919f636 - 136 commits from branch master
    • dad72cdd - Refactor User.find_by_any_email
    • 98d2aab8 - Add returning IDs to Gitlab::Database.bulk_insert
    • 03f8f3d3 - Cache feature names in RequestStore
    • 0cc94878 - Rewrite the GitHub importer from scratch
    • 023e5ae2 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 141 commits * 7b1d922a...6919f636 - 136 commits from branch `master` * dad72cdd - Refactor User.find_by_any_email * 98d2aab8 - Add returning IDs to Gitlab::Database.bulk_insert * 03f8f3d3 - Cache feature names in RequestStore * 0cc94878 - Rewrite the GitHub importer from scratch * 023e5ae2 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7885447&start_sha=7b1d922a391b83e7405e866182d46f969733fcc1)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Re-fetch if a PR was updated after the initial repository fetch as completed

    Oct 30, 2017

    marked the task Re-fetch if a PR was updated after the initial repository fetch as completed

    marked the task **Re-fetch if a PR was updated after the initial repository fetch** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 30, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    resolved all discussions

    Oct 30, 2017

    resolved all discussions

    resolved all discussions
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Oct 30, 2017
    Master

    @DouweM While I continue running tests on staging/locally, could you start reviewing this? There's quite a lot to review, so it may be easier to look at the individual commits instead of the diff in its entirety. We can also review this via a video call if that's easier.

    @DouweM While I continue running tests on staging/locally, could you start reviewing this? There's quite a lot to review, so it may be easier to look at the individual commits instead of the diff in its entirety. We can also review this via a video call if that's easier.
  • Douwe Maan
    @DouweM started a discussion on the diff Oct 30, 2017
    Resolved by Yorick Peterse Oct 31, 2017
    lib/feature.rb
    22 26 flipper.feature(key)
    23 27 end
    24 28
    29 def persisted_names
    • Douwe Maan @DouweM commented Oct 30, 2017
      Master

      Would it make sense to use Gitlab::Cache::RequestCache, here or on FlipperFeature?

      Would it make sense to use `Gitlab::Cache::RequestCache`, here or on `FlipperFeature`?
    • Yorick Peterse @yorickpeterse commented Oct 30, 2017
      Master

      Yes, I wasn't aware of this code; I'll add it.

      Yes, I wasn't aware of this code; I'll add it.
    • Yorick Peterse @yorickpeterse commented Oct 30, 2017
      Master

      Oh, actually we can't. Gitlab::Cache::RequestCache messes around with instance variables. This means we'd modify a shared global state, which could lead to all sorts of nasty issues.

      Oh, actually we can't. `Gitlab::Cache::RequestCache` messes around with instance variables. This means we'd modify a shared global state, which could lead to all sorts of nasty issues.
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Oct 30, 2017
    Resolved by Yorick Peterse Nov 01, 2017
    app/services/projects/import_service.rb
    14 26 error("Error importing repository #{project.import_url} into #{project.full_path} - #{e.message}")
    15 27 end
    16 28
    29 def importer_class
    30 if use_parallel_github_importer?
    • Douwe Maan @DouweM commented Oct 30, 2017
      Master

      Would it make more sense for this check to be in Gitlab::ImportSources? We could allow the ImportSource importer attribute to be a proc, so it can evaluate to either Github::Import (as it does now), or Gitlab::GithubImport::ParallelImporter.

      Would it make more sense for this check to be in `Gitlab::ImportSources`? We could allow the `ImportSource` `importer` attribute to be a proc, so it can evaluate to either `Github::Import` (as it does now), or `Gitlab::GithubImport::ParallelImporter`.
    • Douwe Maan @DouweM commented Oct 30, 2017
      Master

      Also, why do we have both Github::Import and Gitlab::GithubImport::Importer, with github using the former, and gitea using the latter?

      Edit: Ah, never mind, you already point that out in 0cc94878.

      Edited Oct 30, 2017 by Douwe Maan
      Also, why do we have both `Github::Import` and `Gitlab::GithubImport::Importer`, with `github` using the former, and `gitea` using the latter? Edit: Ah, never mind, you already point that out in https://gitlab.com/gitlab-org/gitlab-ce/commit/0cc94878438c0e7ddd98dfc2f130526613b1d28e.
    • Yorick Peterse @yorickpeterse commented Oct 30, 2017
      Master

      @DouweM Yeah the naming is stupid, maybe we should just rename it to Gitea importer instead of LegacyGithubImporter; though IIRC the code makes assumptions/references to GitHub, so that would also be odd.

      Github::Import is the old importer, which will be nuked once we're confident with these changes. Initially I was going to dump this in a patch release, but with the merge window closing in I'm leaning towards hot-patching GitLab.com and removing Github::Import in this MR once we're confident. That way this MR introduces a "final" state, instead of something that's still a WIP.

      @DouweM Yeah the naming is stupid, maybe we should just rename it to Gitea importer instead of LegacyGithubImporter; though IIRC the code makes assumptions/references to GitHub, so that would also be odd. `Github::Import` is the old importer, which will be nuked once we're confident with these changes. Initially I was going to dump this in a patch release, but with the merge window closing in I'm leaning towards hot-patching GitLab.com and removing `Github::Import` in this MR once we're confident. That way this MR introduces a "final" state, instead of something that's still a WIP.
    • Yorick Peterse @yorickpeterse commented Oct 30, 2017
      Master

      For example, there's code such as:

      • OmniAuth::Strategies::GitHub.default_options[:client_options].symbolize_keys
      • Gitlab.config.omniauth.providers.find { |provider| provider.name == "github" }
      • joins(:identities).where(identities[:provider].eq(:github)

      So naming this GiteaImporter would be quite weird.

      For example, there's code such as: * `OmniAuth::Strategies::GitHub.default_options[:client_options].symbolize_keys` * `Gitlab.config.omniauth.providers.find { |provider| provider.name == "github" }` * `joins(:identities).where(identities[:provider].eq(:github)` So naming this `GiteaImporter` would be quite weird.
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      @yorickpeterse If it's only used for Gitea, shouldn't those github mentions be gitea anyway? :/ Or does Gitea actually use the github OmniAuth provider?

      @yorickpeterse If it's only used for Gitea, shouldn't those `github` mentions be `gitea` anyway? :/ Or does Gitea actually use the `github` OmniAuth provider?
    • Yorick Peterse @yorickpeterse commented Oct 31, 2017
      Master

      @DouweM The importer used to be used for GitHub, hence the mentions. Gitea offers a GitHub compatible API as far as I can tell, hence it's now used for Gitea while Github::Import handles GitHub. For Gitea things such as the Omniauth provider are still "github", and I have no idea if we can replace that.

      @DouweM The importer used to be used for GitHub, hence the mentions. Gitea offers a GitHub compatible API as far as I can tell, hence it's now used for Gitea while `Github::Import` handles GitHub. For Gitea things such as the Omniauth provider are still "github", and I have no idea if we can replace that.
    Please register or sign in to reply
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • c3389cd7 - Rewrite the GitHub importer from scratch
    • fafeb06f - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 30, 2017

    added 2 commits

    • c3389cd7 - Rewrite the GitHub importer from scratch
    • fafeb06f - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * c3389cd7 - Rewrite the GitHub importer from scratch * fafeb06f - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7889292&start_sha=023e5ae285490b1e2c812fee3960b4456377d773)
    Toggle commit list
  • Douglas Barbosa Alexandre @dbalexandre

    mentioned in issue #27429 (closed)

    Oct 30, 2017

    mentioned in issue #27429 (closed)

    mentioned in issue #27429
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 30, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 30, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 30, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info. as completed

    Oct 30, 2017

    marked the task Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info. as completed

    marked the task **Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info.** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 507f3dba - Rewrite the GitHub importer from scratch
    • fc7eb290 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 31, 2017

    added 2 commits

    • 507f3dba - Rewrite the GitHub importer from scratch
    • fc7eb290 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 507f3dba - Rewrite the GitHub importer from scratch * fc7eb290 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7914830&start_sha=fafeb06f8c62cb43b92b6ac332543e131a90e11e)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 638b3060 - Rewrite the GitHub importer from scratch
    • c97b5efe - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 31, 2017

    added 2 commits

    • 638b3060 - Rewrite the GitHub importer from scratch
    • c97b5efe - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 638b3060 - Rewrite the GitHub importer from scratch * c97b5efe - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7916496&start_sha=fc7eb2907d8f0e5e7dc4f6e3505e0d0e7f33f1fd)
    Toggle commit list
  • 🏝 Rémy Coutable 🏝 (back March 27th)
    @rymai started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Nov 01, 2017
    doc/development/github_importer.md 0 → 100644
    61 ### 4. Stage::ImportPullRequestsWorker
    62
    63 This worker will import all pull requests. For every pull request a job for the
    64 `Gitlab::GithubImport::ImportPullRequestWorker` worker is scheduled.
    65
    66 ### 5. Stage::ImportIssuesAndDiffNotesWorker
    67
    68 This worker will import all issues and pull request comments. For every issue we
    69 schedule a job for the `Gitlab::GithubImport::ImportIssueWorker` worker. For
    70 pull request comments we instead schedule jobs for the
    71 `Gitlab::GithubImport::DiffNoteImporter` worker.
    72
    73 ### 6. Stage::ImportNotesWorker
    74
    75 This worker imports regular comments for both issues and pull requests. For
    76 every comment a job for the `Gitlab::GithubImport::ImportNoteWorker`.
    • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 31, 2017
      Master

      "For every comment we schedule a job for the Gitlab::GithubImport::ImportNoteWorker worker"

      "For every comment{+ we schedule+} a job for the `Gitlab::GithubImport::ImportNoteWorker`{+ worker+}"
    • Yorick Peterse @yorickpeterse

      changed this line in version 63 of the diff

      Oct 31, 2017

      changed this line in version 63 of the diff

      changed this line in [version 63 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7928071&start_sha=8189c5eaa7ff05ba89a0189760d278fa04517f6f#ee0e535e4319e9fd7456046da8c7b537b10abd84_76_76)
      Toggle commit list
    Please register or sign in to reply
  • 🏝 Rémy Coutable 🏝 (back March 27th)
    @rymai started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Oct 31, 2017
    doc/user/project/import/github.md
    24 24 - the milestones (GitLab 8.7+)
    25 25 - the labels (GitLab 8.7+)
    26 26 - the release note descriptions (GitLab 8.12+)
    27 - the pull request review comments (GitLab 10.2+)
    28 - the regulra issue and pull request comments
    • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 31, 2017
      Master

      regulra -> regular

      `regulra` -> `regular`
    • Yorick Peterse @yorickpeterse

      changed this line in version 63 of the diff

      Oct 31, 2017

      changed this line in version 63 of the diff

      changed this line in [version 63 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7928071&start_sha=8189c5eaa7ff05ba89a0189760d278fa04517f6f#f38837d54dc4ad950cc6bb90504803d2b315e468_28_28)
      Toggle commit list
    Please register or sign in to reply
  • 🏝 Rémy Coutable 🏝 (back March 27th)
    @rymai started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Oct 31, 2017
    doc/user/project/import/github.md
    121 126 You can also choose a different name for the project and a different namespace,
    122 127 if you have the privileges to do so.
    123 128
    129 ## Making the import process go faster
    130
    131 For large projects it may take a while to import all data. To reduce the time
    132 necessary you can increase the number of Sidekiq workers that process the
    133 following queues:
    134
    135 * `github_importer`
    136 * `github_importer_advance_stage`
    137
    138 For an optimal experience we recommend having at least 4 Sidekiq processes (each
    139 running a number of threads equal to the number of CPU cores) that _only_
    140 process these queues. We also recommend that these processes run on separate
    141 servers. Foror 4 servers with 8 cores this means you can import up to 32 objects
    • 🏝 Rémy Coutable 🏝 (back March 27th) @rymai commented Oct 31, 2017
      Master

      Foror -> For

      `Foror` -> `For`
    • Yorick Peterse @yorickpeterse

      changed this line in version 63 of the diff

      Oct 31, 2017

      changed this line in version 63 of the diff

      changed this line in [version 63 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7928071&start_sha=8189c5eaa7ff05ba89a0189760d278fa04517f6f#f38837d54dc4ad950cc6bb90504803d2b315e468_141_141)
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on the diff Oct 31, 2017
    Resolved by Yorick Peterse Nov 02, 2017
    lib/gitlab/github_import/importer/labels_importer.rb 0 → 100644
    1 # frozen_string_literal: true
    2
    3 module Gitlab
    4 module GithubImport
    5 module Importer
    6 class LabelsImporter
    7 include BulkImporting
    8
    9 attr_reader :project, :client, :existing_labels
    10
    11 # project - An instance of `Project`.
    12 # client - An instance of `Gitlab::GithubImport::Client`.
    13 def initialize(project, client)
    14 @project = project
    15 @client = client
    16 @existing_labels = project.labels.pluck(:title).to_set
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      Should we also take group labels into account here, so that we don't try to create a new label on the project when it already exists on the group?

      Should we also take group labels into account here, so that we don't try to create a new label on the project when it already exists on the group?
    • Yorick Peterse @yorickpeterse commented Oct 31, 2017
      Master

      @DouweM The same applies here as milestones. If we can get them easily (there doesn't seem to be something like Project#group_labels), I think so; otherwise it will probably become too complex/annoying.

      @DouweM The same applies here as milestones. If we can get them easily (there doesn't seem to be something like `Project#group_labels`), I think so; otherwise it will probably become too complex/annoying.
    • Douwe Maan @DouweM commented Nov 01, 2017
      Master

      @yorickpeterse I think we can do that using the LabelsFinder.

      @yorickpeterse I think we can do that using the `LabelsFinder`.
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on the diff Oct 31, 2017
    Resolved by Yorick Peterse Nov 02, 2017
    lib/gitlab/github_import/importer/milestones_importer.rb 0 → 100644
    1 # frozen_string_literal: true
    2
    3 module Gitlab
    4 module GithubImport
    5 module Importer
    6 class MilestonesImporter
    7 include BulkImporting
    8
    9 attr_reader :project, :client, :existing_milestones
    10
    11 # project - An instance of `Project`
    12 # client - An instance of `Gitlab::GithubImport::Client`
    13 def initialize(project, client)
    14 @project = project
    15 @client = client
    16 @existing_milestones = project.milestones.pluck(:iid).to_set
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      We also have group milestones now: https://gitlab.com/groups/gitlab-org/milestones/new

      Do we need to do anything special?

      We also have group milestones now: https://gitlab.com/groups/gitlab-org/milestones/new Do we need to do anything special?
    • Yorick Peterse @yorickpeterse commented Oct 31, 2017
      Master

      @DouweM Not sure. In the current state we import all repository milestones as project milestones, which could lead to duplicate milestones (one group milestone and a project milestone).

      Given a project, how would you obtain the group milestones available to it? Is this list simply limited to the milestones available to the group that owns the project? Or can you access all milestones from groups that the project is shared with? If it's based on ownership we could change this code to handle group milestones, but otherwise I think it will get too complex/slow too quickly; meaning I'd rather just skip it.

      @DouweM Not sure. In the current state we import all repository milestones as project milestones, which could lead to duplicate milestones (one group milestone and a project milestone). Given a project, how would you obtain the group milestones available to it? Is this list simply limited to the milestones available to the group that owns the project? Or can you access all milestones from groups that the project is shared with? If it's based on ownership we could change this code to handle group milestones, but otherwise I think it will get too complex/slow too quickly; meaning I'd rather just skip it.
    • Douwe Maan @DouweM commented Nov 01, 2017
      Master

      @yorickpeterse I think we can use the MilestonesFinder.

      @yorickpeterse I think we can use the `MilestonesFinder`.
    • Yorick Peterse @yorickpeterse commented Nov 01, 2017
      Master

      @DouweM Looking at the code I'm outright scared of using that. MilestonesFinder seems to operate on raw IDs, so does Milestone.for_projects_and_groups.

      Even then it's not clear how we'd pass the list of groups to this finder. Would we pass [project.namespace], or do we also have to get all ancestors and what not? What about the groups the project is shared with.

      @DouweM Looking at the code I'm outright scared of using that. `MilestonesFinder` seems to operate on raw IDs, so does `Milestone.for_projects_and_groups`. Even then it's not clear how we'd pass the list of groups to this finder. Would we pass `[project.namespace]`, or do we also have to get all ancestors and what not? What about the groups the project is shared with.
    • Douwe Maan @DouweM commented Nov 01, 2017
      Master

      @yorickpeterse Judging by https://gitlab.com/gitlab-org/gitlab-ee/blob/master/app/services/issuable_base_service.rb#L104 and http://gitlab.com/gitlab-org/gitlab-ee/blob/master/app/controllers/projects/milestones_controller.rb#L106, it only needs the project ID and its immediate parent group ID.

      Projects inherit milestones from their parent group, but nothing else.

      @yorickpeterse Judging by https://gitlab.com/gitlab-org/gitlab-ee/blob/master/app/services/issuable_base_service.rb#L104 and http://gitlab.com/gitlab-org/gitlab-ee/blob/master/app/controllers/projects/milestones_controller.rb#L106, it only needs the project ID and its immediate parent group ID. Projects inherit milestones from their parent group, but nothing else.
    • Yorick Peterse @yorickpeterse commented Nov 01, 2017
      Master

      @DouweM Ah, that makes things easier. I'll take a look to see how difficult it would be to add this, if it's too much we can do it in a separate MR.

      @DouweM Ah, that makes things easier. I'll take a look to see how difficult it would be to add this, if it's too much we can do it in a separate MR.
    • Yorick Peterse @yorickpeterse commented Nov 02, 2017
      Master

      I'm going to create issues for this so we can tackle this separately, that way we can at least get something out of the door (the old importer didn't handle this either so we don't lose anything).

      I'm going to create issues for this so we can tackle this separately, that way we can at least get something out of the door (the old importer didn't handle this either so we don't lose anything).
    • Yorick Peterse @yorickpeterse

      created #39724 to continue this discussion

      Nov 02, 2017

      created #39724 to continue this discussion

      created #39724 to continue this discussion
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on the diff Oct 31, 2017
    Resolved by Yorick Peterse Nov 01, 2017
    lib/gitlab/github_import/importer/note_importer.rb 0 → 100644
    18
    19 def execute
    20 return unless (noteable_id = find_noteable_id)
    21
    22 author_id, author_found = user_finder.author_id_for(note)
    23
    24 note_body =
    25 MarkdownText.format(note.note, note.user, author_found)
    26
    27 attributes = {
    28 noteable_type: note.noteable_type,
    29 noteable_id: noteable_id,
    30 project_id: project.id,
    31 author_id: author_id,
    32 note: note_body,
    33 system: false,
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      Did you happen to see if it would be feasible to also import system (event) notes? See also #15377.

      Did you happen to see if it would be feasible to also import system (event) notes? See also https://gitlab.com/gitlab-org/gitlab-ce/issues/15377.
    • Yorick Peterse @yorickpeterse commented Oct 31, 2017
      Master

      @DouweM We can add it, but it will probably require several extra days of development and testing to get going. As such I think it's better to do this separately, that way we can get the MR as it currently is out of the door quicker.

      @DouweM We can add it, but it will probably require several extra days of development and testing to get going. As such I think it's better to do this separately, that way we can get the MR as it currently is out of the door quicker.
    • Douwe Maan @DouweM commented Nov 01, 2017
      Master

      @yorickpeterse I agree, it was just a question that came to mind reading this code :)

      @yorickpeterse I agree, it was just a question that came to mind reading this code :)
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on the diff Oct 31, 2017
    Resolved by Yorick Peterse Nov 02, 2017
    app/workers/gitlab/github_import/refresh_import_jid_worker.rb 0 → 100644
    1 # frozen_string_literal: true
    2
    3 module Gitlab
    4 module GithubImport
    5 class RefreshImportJidWorker
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      I think we could/should reuse this for all importers that may take more than 15minutes, like the simple Git repo importer, and even the fork worker.

      I think we could/should reuse this for all importers that may take more than 15minutes, like the simple Git repo importer, and even the fork worker.
    • Yorick Peterse @yorickpeterse commented Oct 31, 2017
      Master

      @DouweM In the future yes, but for this MR I prefer not to mess too much with all the other (largely untested) importers. In particular the problem is that for non GitHub importers we handle the cloning in Projects::ImportService#add_repository_to_project, which doesn't have access to the job ID of the current Sidekiq job.

      @DouweM In the future yes, but for this MR I prefer not to mess too much with all the other (largely untested) importers. In particular the problem is that for non GitHub importers we handle the cloning in `Projects::ImportService#add_repository_to_project`, which doesn't have access to the job ID of the current Sidekiq job.
    • Yorick Peterse @yorickpeterse commented Oct 31, 2017
      Master

      Actually we can't do this unless we rewrite the other importers. RefreshImportJidWorker requires the cloning process to be done in a separate worker, whereas all other importers perform all their work in a single one. If we use RefreshImportJidWorker for other importers (in the current setup) this would lead to job IDs being refreshed until the import work is done. This in turn means an import job could never be marked as stuck, allowing them to run effectively indefinitely.

      Actually we can't do this unless we rewrite the other importers. `RefreshImportJidWorker` requires the cloning process to be done in a separate worker, whereas all other importers perform all their work in a single one. If we use `RefreshImportJidWorker` for other importers (in the current setup) this would lead to job IDs being refreshed until the import work is done. This in turn means an import job could never be marked as stuck, allowing them to run effectively indefinitely.
    • Douwe Maan @DouweM commented Nov 01, 2017
      Master

      @yorickpeterse I'm fine with doing it in a separate issue/MR.

      If we use RefreshImportJidWorker for other importers (in the current setup) this would lead to job IDs being refreshed until the import work is done. This in turn means an import job could never be marked as stuck, allowing them to run effectively indefinitely.

      Isn't that a good thing? I don't think we should be marking imports as stuck that are actually still running. As I see it, an import is only stuck if it started, but somehow the job got killed and wasn't requeued, like when Sidekiq gets a SIGKILL while the job is running.

      @yorickpeterse I'm fine with doing it in a separate issue/MR. > If we use `RefreshImportJidWorker` for other importers (in the current setup) this would lead to job IDs being refreshed until the import work is done. This in turn means an import job could never be marked as stuck, allowing them to run effectively indefinitely. Isn't that a good thing? I don't think we should be marking imports as stuck that are actually still running. As I see it, an import is only stuck if it started, but somehow the job got killed and wasn't requeued, like when Sidekiq gets a `SIGKILL` while the job is running.
    • Yorick Peterse @yorickpeterse commented Nov 01, 2017
      Master

      Isn't that a good thing? I don't think we should be marking imports as stuck that are actually still running.

      Yes and no. We'd have to change StuckImportJobsWorker to only mark dead import jobs as stuck (= job ID no longer exists but the state is not "finished"). Then we'd have to change all importers to perform their cloning separately as otherwise jobs will never get marked as stuck (since RefreshImportJidWorker would be refreshing the same JID that it checks, thus never letting it time out).

      tl;dr we'd have to change the other importers in a similar fashion, then make StuckImportJobsWorker handle all this better.

      > Isn't that a good thing? I don't think we should be marking imports as stuck that are actually still running. Yes and no. We'd have to change `StuckImportJobsWorker` to only mark dead import jobs as stuck (= job ID no longer exists but the state is not "finished"). Then we'd have to change all importers to perform their cloning separately as otherwise jobs will never get marked as stuck (since `RefreshImportJidWorker` would be refreshing the same JID that it checks, thus never letting it time out). tl;dr we'd have to change the other importers in a similar fashion, then make `StuckImportJobsWorker` handle all this better.
    • Douwe Maan @DouweM commented Nov 01, 2017
      Master

      Yes and no. We'd have to change StuckImportJobsWorker to only mark dead import jobs as stuck (= job ID no longer exists but the state is not "finished").

      @yorickpeterse Doesn't it already do that? The query is Project.with_import_status(:scheduled, :started).where.not(import_jid: nil), and then we use Gitlab::SidekiqStatus to check those JIDs.

      Then we'd have to change all importers to perform their cloning separately as otherwise jobs will never get marked as stuck (since RefreshImportJidWorker would be refreshing the same JID that it checks, thus never letting it time out).

      I don't think that's right, since Gitlab::SidekiqStatus::ServerMiddleware would still unset the JID when the job actually finishes, and RefreshImportJidWorker only refreshes if the JID is still set. So once it finishes, it won't be reset again.

      We'd "just" need to make sure that RefreshImportJidWorker doesn't set the status for the JID if it was actually unset by Gitlab::SidekiqStatus::ServerMiddleware between checking SidekiqStatus.running? and calling project.refresh_import_jid_expiration.


      Totally something that can go into a followup MR, but something I think we should definitely do :)

      Edited Nov 01, 2017 by Douwe Maan
      > Yes and no. We'd have to change `StuckImportJobsWorker` to only mark dead import jobs as stuck (= job ID no longer exists but the state is not "finished"). @yorickpeterse Doesn't it already do that? `The query is Project.with_import_status(:scheduled, :started).where.not(import_jid: nil)`, and then we use `Gitlab::SidekiqStatus` to check those JIDs. > Then we'd have to change all importers to perform their cloning separately as otherwise jobs will never get marked as stuck (since `RefreshImportJidWorker` would be refreshing the same JID that it checks, thus never letting it time out). I don't think that's right, since [`Gitlab::SidekiqStatus::ServerMiddleware`](http://gitlab.com/gitlab-org/gitlab-ee/blob/master/lib/gitlab/sidekiq_status/server_middleware.rb#L7) would still unset the JID when the job actually finishes, and `RefreshImportJidWorker` only refreshes if the JID is still set. So once it finishes, it won't be reset again. We'd "just" need to make sure that `RefreshImportJidWorker` doesn't `set` the status for the JID if it was actually `unset` by `Gitlab::SidekiqStatus::ServerMiddleware` between checking `SidekiqStatus.running?` and calling `project.refresh_import_jid_expiration`. --- Totally something that can go into a followup MR, but something I think we should definitely do :)
    • Yorick Peterse @yorickpeterse commented Nov 02, 2017
      Master

      Hm maybe this should be possible after all. Either way I don't want to do this in this MR as it will get even bigger than it already is.

      Hm maybe this should be possible after all. Either way I don't want to do this in this MR as it will get even bigger than it already is.
    • Yorick Peterse @yorickpeterse

      created #39725 to continue this discussion

      Nov 02, 2017

      created #39725 to continue this discussion

      created #39725 to continue this discussion
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Nov 01, 2017
    doc/development/github_importer.md 0 → 100644
    58 work is done in a single thread since it can be performed fast enough that we
    59 don't need to perform this work in parallel.
    60
    61 ### 4. Gitlab::GithubImport::Stage::ImportPullRequestsWorker
    62
    63 This worker will import all pull requests. For every pull request a job for the
    64 `Gitlab::GithubImport::ImportPullRequestWorker` worker is scheduled.
    65
    66 ### 5. Gitlab::GithubImport::Stage::ImportIssuesAndDiffNotesWorker
    67
    68 This worker will import all issues and pull request comments. For every issue we
    69 schedule a job for the `Gitlab::GithubImport::ImportIssueWorker` worker. For
    70 pull request comments we instead schedule jobs for the
    71 `Gitlab::GithubImport::DiffNoteImporter` worker.
    72
    73 ### 6. Gitlab::GithubImport::Stage::ImportNotesWorker
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      Would it make sense to import issues and PRs in parallel, and then label links, diff notes and regular notes in parallel too? It sounds like that would remove a stage.

      Or don't we want to do label links separately from issues, because we'd need to iterate over issues/PRs again?

      Could we do label links in-line with issues and PRs?

      Edited Oct 31, 2017 by Douwe Maan
      Would it make sense to import issues and PRs in parallel, and then label links, diff notes and regular notes in parallel too? It sounds like that would remove a stage. Or don't we want to do label links separately from issues, because we'd need to iterate over issues/PRs again? Could we do label links in-line with issues and PRs?
    • Yorick Peterse @yorickpeterse commented Oct 31, 2017
      Master

      @DouweM Issues are imported separately (and after PRs) since some issues are also PRs, allowing us to update the label links. The PRs endpoint for whatever reason doesn't include this information.

      For small projects we could do a separate crawl through issues just for the label links, but for larger projects (e.g. kubernetes) this could waste too many requests. For example, Kubernetes has 22 117 issues which would require 222 API calls (when showing 100 per page). Crawling the data separately would mean 444 API calls. A large number of API calls is already spent on getting user data, by merging issues and label links into a single crawl we can reduce the likelihood of hitting the rate limit (which would mean having to wait up to an hour).

      So long story short: I prefer not to because it will increase the number of API calls.

      @DouweM Issues are imported separately (and after PRs) since some issues are also PRs, allowing us to update the label links. The PRs endpoint for whatever reason doesn't include this information. For small projects we could do a separate crawl through issues just for the label links, but for larger projects (e.g. kubernetes) this could waste too many requests. For example, Kubernetes has 22 117 issues which would require 222 API calls (when showing 100 per page). Crawling the data separately would mean 444 API calls. A large number of API calls is already spent on getting user data, by merging issues and label links into a single crawl we can reduce the likelihood of hitting the rate limit (which would mean having to wait up to an hour). So long story short: I prefer not to because it will increase the number of API calls.
    • Douwe Maan @DouweM commented Nov 01, 2017
      Master

      @yorickpeterse Sounds reasonable. It may be worth explicitly explaining this in the code or the docs, since the order of the stages is a little counterintuitive.

      @yorickpeterse Sounds reasonable. It may be worth explicitly explaining this in the code or the docs, since the order of the stages is a little counterintuitive.
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on the diff Oct 31, 2017
    Resolved by Yorick Peterse Nov 01, 2017
    lib/gitlab/github_import/representation/note.rb 0 → 100644
    16 # Builds a note from a GitHub API response.
    17 #
    18 # note - An instance of `Sawyer::Resource` containing the note details.
    19 def self.from_api_response(note)
    20 matches = note.html_url.match(NOTEABLE_TYPE_REGEX)
    21
    22 if !matches || !matches[:type]
    23 raise(
    24 ArgumentError,
    25 "The note URL #{note.html_url.inspect} is not supported"
    26 )
    27 end
    28
    29 hash = {
    30 noteable_type: matches[:type].to_sym,
    31 noteable_id: matches[:iid].to_i,
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      I think it will cause confusion that noteable_type and noteable_id on a GitHub note representation and an actual note model represent something similar, but different. Could we store MergeRequest/Issue instead of pull/issues for noteable_type, and call noteable_id noteable_iid?

      Edited Oct 31, 2017 by Douwe Maan
      I think it will cause confusion that `noteable_type` and `noteable_id` on a GitHub note representation and an actual note model represent something similar, but different. Could we store `MergeRequest`/`Issue` instead of `pull`/`issues` for `noteable_type`, and call `noteable_id` `noteable_iid`?
    • Yorick Peterse @yorickpeterse commented Oct 31, 2017
      Master

      @DouweM The reason for this is mostly because I prefer to serialise (when scheduling via Sidekiq) simple (and small, so we don't waste a ton of RAM) data such as :pull instead of class names (e.g. MergeRequest). We can do the latter, but then we'd have to do something like 'MergeRequest'.constantize which I'm generally not a fan of.

      We can change the field names to something like note_target_type and note_target_id to make it more clear these are "raw" values.

      Edited Oct 31, 2017 by Yorick Peterse
      @DouweM The reason for this is mostly because I prefer to serialise (when scheduling via Sidekiq) simple (and small, so we don't waste a ton of RAM) data such as `:pull` instead of class names (e.g. `MergeRequest`). We can do the latter, but then we'd have to do something like `'MergeRequest'.constantize` which I'm generally not a fan of. We _can_ change the field names to something like `note_target_type` and `note_target_id` to make it more clear these are "raw" values.
    • Douwe Maan @DouweM commented Nov 01, 2017
      Master

      @yorickpeterse I would personally prefer MergeRequest/Issue because that's what we send in forms etc anyway, but I'm fine with anything that makes this less confusing :) Note that I don't think we would actually need to explicitly 'MergeRequest'.constantize, since AR will do that automatically if you do note.noteable_type = 'MergeRequest'.

      @yorickpeterse I would personally prefer `MergeRequest`/`Issue` because that's what we send in forms etc anyway, but I'm fine with anything that makes this less confusing :) Note that I don't think we would actually need to explicitly `'MergeRequest'.constantize`, since AR will do that automatically if you do `note.noteable_type = 'MergeRequest'`.
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Oct 31, 2017
    lib/gitlab/github_import/representation/diff_note.rb 0 → 100644
    17 # Builds a diff note from a GitHub API response.
    18 #
    19 # note - An instance of `Sawyer::Resource` containing the note details.
    20 def self.from_api_response(note)
    21 matches = note.html_url.match(NOTEABLE_ID_REGEX)
    22
    23 unless matches
    24 raise(
    25 ArgumentError,
    26 "The note URL #{note.html_url.inspect} is not supported"
    27 )
    28 end
    29
    30 hash = {
    31 noteable_id: matches[:iid].to_i,
    32 path: note.path,
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      Could this be file_path, since it's not the path to a note?

      Could this be `file_path`, since it's not the path to a note?
    • Yorick Peterse @yorickpeterse

      changed this line in version 64 of the diff

      Oct 31, 2017

      changed this line in version 64 of the diff

      changed this line in [version 64 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7931417&start_sha=0d287d716597806443aa9f2516e38cc29eec48d4#82c4a70db272d693e541f3f0f7790e2f9b8f3910_33_33)
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Oct 31, 2017
    lib/gitlab/github_import/representation/pull_request.rb 0 → 100644
    25 end
    26
    27 hash = {
    28 iid: pr.number,
    29 title: pr.title,
    30 description: pr.body,
    31 source_branch: pr.head.ref,
    32 target_branch: pr.base.ref,
    33 source_branch_sha: pr.head.sha,
    34 target_branch_sha: pr.base.sha,
    35 source_repository_id: pr.head&.repo&.id,
    36 target_repository_id: pr.base&.repo&.id,
    37 source_repository_owner: pr.head&.user&.login,
    38 state: pr.state == 'open' ? :opened : :closed,
    39 milestone_number: pr.milestone&.number,
    40 user: Representation::User.from_api_response(pr.user),
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      If this is the author, could we call it that?

      If this is the author, could we call it that?
    • Yorick Peterse @yorickpeterse

      changed this line in version 62 of the diff

      Oct 31, 2017

      changed this line in version 62 of the diff

      changed this line in [version 62 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7923848&start_sha=c97b5efe90a2d38b5bc826c1ffbea26dd7b7207b#c708ba573471bc3721b5f5c226ad68e76ea5c6d2_40_41)
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Oct 31, 2017
    lib/gitlab/github_import/representation/diff_note.rb 0 → 100644
    20 def self.from_api_response(note)
    21 matches = note.html_url.match(NOTEABLE_ID_REGEX)
    22
    23 unless matches
    24 raise(
    25 ArgumentError,
    26 "The note URL #{note.html_url.inspect} is not supported"
    27 )
    28 end
    29
    30 hash = {
    31 noteable_id: matches[:iid].to_i,
    32 path: note.path,
    33 commit_id: note.commit_id,
    34 diff_hunk: note.diff_hunk,
    35 user: Representation::User.from_api_response(note.user),
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      If this is the author, could we call it that?

      If this is the author, could we call it that?
    • Yorick Peterse @yorickpeterse

      changed this line in version 62 of the diff

      Oct 31, 2017

      changed this line in version 62 of the diff

      changed this line in [version 62 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7923848&start_sha=c97b5efe90a2d38b5bc826c1ffbea26dd7b7207b#82c4a70db272d693e541f3f0f7790e2f9b8f3910_35_36)
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Oct 31, 2017
    lib/gitlab/github_import/representation/issue.rb 0 → 100644
    16 # Builds an issue from a GitHub API response.
    17 #
    18 # issue - An instance of `Sawyer::Resource` containing the issue
    19 # details.
    20 def self.from_api_response(issue)
    21 hash = {
    22 iid: issue.number,
    23 title: issue.title,
    24 description: issue.body,
    25 milestone_number: issue.milestone&.number,
    26 state: issue.state == 'open' ? :opened : :closed,
    27 assignees: issue.assignees.map do |u|
    28 Representation::User.from_api_response(u)
    29 end,
    30 label_names: issue.labels.map(&:name),
    31 user: Representation::User.from_api_response(issue.user),
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      If this is the author, could we call it that?

      If this is the author, could we call it that?
    • Yorick Peterse @yorickpeterse

      changed this line in version 62 of the diff

      Oct 31, 2017

      changed this line in version 62 of the diff

      changed this line in [version 62 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7923848&start_sha=c97b5efe90a2d38b5bc826c1ffbea26dd7b7207b#dee45c624c58ca599bfee24721996b0144889f7d_31_36)
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Oct 31, 2017
    Resolved by Yorick Peterse Oct 31, 2017
    lib/gitlab/github_import/representation/note.rb 0 → 100644
    17 #
    18 # note - An instance of `Sawyer::Resource` containing the note details.
    19 def self.from_api_response(note)
    20 matches = note.html_url.match(NOTEABLE_TYPE_REGEX)
    21
    22 if !matches || !matches[:type]
    23 raise(
    24 ArgumentError,
    25 "The note URL #{note.html_url.inspect} is not supported"
    26 )
    27 end
    28
    29 hash = {
    30 noteable_type: matches[:type].to_sym,
    31 noteable_id: matches[:iid].to_i,
    32 user: Representation::User.from_api_response(note.user),
    • Douwe Maan @DouweM commented Oct 31, 2017
      Master

      If this is the author, could we call it that?

      If this is the author, could we call it that?
    • Yorick Peterse @yorickpeterse

      changed this line in version 62 of the diff

      Oct 31, 2017

      changed this line in version 62 of the diff

      changed this line in [version 62 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7923848&start_sha=c97b5efe90a2d38b5bc826c1ffbea26dd7b7207b#229abc674928b15c873729e2c3607ff30b8bbc7e_32_34)
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan @DouweM commented Oct 31, 2017
    Master

    @godfat @jameslopez You two may want to have a look too, since you've been working on the GitHub importer too for https://gitlab.com/gitlab-org/gitlab-ce/issues/36284.

    @godfat @jameslopez You two may want to have a look too, since you've been working on the GitHub importer too for https://gitlab.com/gitlab-org/gitlab-ce/issues/36284.
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • b192b2bd - Rewrite the GitHub importer from scratch
    • 8189c5ea - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 31, 2017

    added 2 commits

    • b192b2bd - Rewrite the GitHub importer from scratch
    • 8189c5ea - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * b192b2bd - Rewrite the GitHub importer from scratch * 8189c5ea - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7923848&start_sha=c97b5efe90a2d38b5bc826c1ffbea26dd7b7207b)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Sometimes inserts into merge_requests produce duplicate key errors in the index index_merge_requests_on_target_project_id_and_iid. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info. as completed

    Oct 31, 2017

    marked the task Sometimes inserts into merge_requests produce duplicate key errors in the index index_merge_requests_on_target_project_id_and_iid. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info. as completed

    marked the task **Sometimes inserts into `merge_requests` produce duplicate key errors in the index `index_merge_requests_on_target_project_id_and_iid`. This is probably due to GitHub _sometimes_ returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info.** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 115 commits

    • 8189c5ea...01340794 - 110 commits from branch master
    • 27c27955 - Refactor User.find_by_any_email
    • ab8dfa8e - Add returning IDs to Gitlab::Database.bulk_insert
    • 16166dfe - Cache feature names in RequestStore
    • e0b7dad5 - Rewrite the GitHub importer from scratch
    • 0d287d71 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 31, 2017

    added 115 commits

    • 8189c5ea...01340794 - 110 commits from branch master
    • 27c27955 - Refactor User.find_by_any_email
    • ab8dfa8e - Add returning IDs to Gitlab::Database.bulk_insert
    • 16166dfe - Cache feature names in RequestStore
    • e0b7dad5 - Rewrite the GitHub importer from scratch
    • 0d287d71 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 115 commits * 8189c5ea...01340794 - 110 commits from branch `master` * 27c27955 - Refactor User.find_by_any_email * ab8dfa8e - Add returning IDs to Gitlab::Database.bulk_insert * 16166dfe - Cache feature names in RequestStore * e0b7dad5 - Rewrite the GitHub importer from scratch * 0d287d71 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7928071&start_sha=8189c5eaa7ff05ba89a0189760d278fa04517f6f)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 325ddb00 - Rewrite the GitHub importer from scratch
    • f03588c3 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 31, 2017

    added 2 commits

    • 325ddb00 - Rewrite the GitHub importer from scratch
    • f03588c3 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 325ddb00 - Rewrite the GitHub importer from scratch * f03588c3 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7931417&start_sha=0d287d716597806443aa9f2516e38cc29eec48d4)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 0c465f18 - Rewrite the GitHub importer from scratch
    • 935910d1 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 31, 2017

    added 2 commits

    • 0c465f18 - Rewrite the GitHub importer from scratch
    • 935910d1 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 0c465f18 - Rewrite the GitHub importer from scratch * 935910d1 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7931685&start_sha=f03588c359d460b2e809c4bf06ff36b245a57e4d)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Oct 31, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 20e98b71 - Rewrite the GitHub importer from scratch
    • 578a8f17 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Oct 31, 2017

    added 2 commits

    • 20e98b71 - Rewrite the GitHub importer from scratch
    • 578a8f17 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 20e98b71 - Rewrite the GitHub importer from scratch * 578a8f17 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7932621&start_sha=935910d102564cee65b887c5b3283ce3d32d7051)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in issue gitlab-com/infrastructure#3124 (closed)

    Nov 01, 2017

    mentioned in issue gitlab-com/infrastructure#3124 (closed)

    mentioned in issue gitlab-com/infrastructure#3124
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 3547aa2a - Rewrite the GitHub importer from scratch
    • 492638ea - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Nov 01, 2017

    added 2 commits

    • 3547aa2a - Rewrite the GitHub importer from scratch
    • 492638ea - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 3547aa2a - Rewrite the GitHub importer from scratch * 492638ea - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7947311&start_sha=578a8f174559b24e709f9be1dee7074e117d99c0)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in issue gitlab-com/infrastructure#3128 (closed)

    Nov 01, 2017

    mentioned in issue gitlab-com/infrastructure#3128 (closed)

    mentioned in issue gitlab-com/infrastructure#3128
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 01, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 01, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 01, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 01, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 434086a6 - Rewrite the GitHub importer from scratch
    • 8feffb40 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Nov 01, 2017

    added 2 commits

    • 434086a6 - Rewrite the GitHub importer from scratch
    • 8feffb40 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * 434086a6 - Rewrite the GitHub importer from scratch * 8feffb40 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7959985&start_sha=492638ead5fe7e39fe05837fcdd035d3d99a927d)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming as completed

    Nov 02, 2017

    marked the task Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming as completed

    marked the task **Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in issue #39724

    Nov 02, 2017

    mentioned in issue #39724

    mentioned in issue #39724
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    resolved all discussions

    Nov 02, 2017

    resolved all discussions

    resolved all discussions
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in issue #39725

    Nov 02, 2017

    mentioned in issue #39725

    mentioned in issue #39725
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 02, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Nov 02, 2017
    Master

    For those seeing this and thinking "Gee, I want to try this out before 10.2.0 releases", here's how:

    1. SSH into the server(s) running GitLab
    2. Run curl -s https://gitlab.com/snippets/1681949/raw -o /tmp/importer.patch on every server
    3. Run sudo patch -p1 -N -s -d /opt/gitlab/embedded/service/gitlab-rails < /tmp/importer.patch on every server. This assumes GitLab is installed in /opt/gitlab/embedded/service/gitlab-rails (the omnibus default). You may not need sudo if it's installed in a directory your user can access.
    4. Restart GitLab on all servers (Unicorn, Sidekiq, etc)
    5. Since the code is hidden behind a feature flag (this will be taken out before getting merged0 you will need to enable it for the user you want to use for testing. You can do this by running the following on a server:
    sudo gitlab-rails r "Feature.enable(:parallel_github_importer, User.find_by(username: 'username of the user here'))"

    So for example if the user is "yorickpeterse" you'd run:

    sudo gitlab-rails r "Feature.enable(:parallel_github_importer, User.find_by(username: 'yorickpeterse'))"

    Then you can import the project as usual. If nothing is happening you should make sure that Sidekiq (our background processing code) processes the following two queues:

    • github_importer
    • github_importer_advance_stage

    Omnibus installations will process these queues automatically

    The requirement for this is that you have to run at least EE 10.1.0. The supplied patch will not work on CE.

    Edited Nov 02, 2017 by Yorick Peterse
    For those seeing this and thinking "Gee, I want to try this out before 10.2.0 releases", here's how: 1. SSH into the server(s) running GitLab 2. Run `curl -s https://gitlab.com/snippets/1681949/raw -o /tmp/importer.patch` on every server 3. Run `sudo patch -p1 -N -s -d /opt/gitlab/embedded/service/gitlab-rails < /tmp/importer.patch` on every server. This assumes GitLab is installed in `/opt/gitlab/embedded/service/gitlab-rails` (the omnibus default). You may not need `sudo` if it's installed in a directory your user can access. 4. Restart GitLab on all servers (Unicorn, Sidekiq, etc) 5. Since the code is hidden behind a feature flag (this will be taken out before getting merged0 you will need to enable it for the user you want to use for testing. You can do this by running the following on a server: ``` sudo gitlab-rails r "Feature.enable(:parallel_github_importer, User.find_by(username: 'username of the user here'))" ``` So for example if the user is "yorickpeterse" you'd run: sudo gitlab-rails r "Feature.enable(:parallel_github_importer, User.find_by(username: 'yorickpeterse'))" Then you can import the project as usual. If nothing is happening you should make sure that Sidekiq (our background processing code) processes the following two queues: * github_importer * github_importer_advance_stage Omnibus installations will process these queues automatically The requirement for this is that you _have_ to run at least EE 10.1.0. The supplied patch will _not_ work on CE.
  • Yorick Peterse @yorickpeterse

    added 159 commits

    • 8feffb40...64c9d780 - 154 commits from branch master
    • 90fb0741 - Refactor User.find_by_any_email
    • 2f2f2c2e - Add returning IDs to Gitlab::Database.bulk_insert
    • cccb2fe3 - Cache feature names in RequestStore
    • 11584bd8 - Rewrite the GitHub importer from scratch
    • 161cb5c4 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Nov 02, 2017

    added 159 commits

    • 8feffb40...64c9d780 - 154 commits from branch master
    • 90fb0741 - Refactor User.find_by_any_email
    • 2f2f2c2e - Add returning IDs to Gitlab::Database.bulk_insert
    • cccb2fe3 - Cache feature names in RequestStore
    • 11584bd8 - Rewrite the GitHub importer from scratch
    • 161cb5c4 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 159 commits * 8feffb40...64c9d780 - 154 commits from branch `master` * 90fb0741 - Refactor User.find_by_any_email * 2f2f2c2e - Add returning IDs to Gitlab::Database.bulk_insert * cccb2fe3 - Cache feature names in RequestStore * 11584bd8 - Rewrite the GitHub importer from scratch * 161cb5c4 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=7982287&start_sha=8feffb40442f69416d23317074672412696e2038)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in issue #37814

    Nov 03, 2017

    mentioned in issue #37814

    mentioned in issue #37814
    Toggle commit list
  • Douglas Barbosa Alexandre @dbalexandre

    mentioned in issue #38948 (closed)

    Nov 03, 2017

    mentioned in issue #38948 (closed)

    mentioned in issue #38948
    Toggle commit list
  • Grzegorz Bizon @grzesiek commented Nov 03, 2017
    Master

    Awesome work @yorickpeterse! Would it be possible to split this MR into few iterations? 🤔

    Awesome work @yorickpeterse! Would it be possible to split this MR into few iterations? :thinking:
  • Yorick Peterse @yorickpeterse commented Nov 03, 2017
    Master

    @grzesiek No. While there are a few separate commits that could be moved those changes are so small they'd have no impact on the reviewing process. The core changes themselves can't be split up because they don't work on their own / depend on each other.

    Edited Nov 03, 2017 by Yorick Peterse
    @grzesiek No. While there are a few separate commits that could be moved those changes are so small they'd have no impact on the reviewing process. The core changes themselves can't be split up because they don't work on their own / depend on each other.
  • Douglas Barbosa Alexandre @dbalexandre

    mentioned in issue #12450 (closed)

    Nov 03, 2017

    mentioned in issue #12450 (closed)

    mentioned in issue #12450
    Toggle commit list
  • Douglas Barbosa Alexandre @dbalexandre

    mentioned in issue #17995 (closed)

    Nov 03, 2017

    mentioned in issue #17995 (closed)

    mentioned in issue #17995
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 04, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 04, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 04, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 59 commits

    • 161cb5c4...4841c3cb - 54 commits from branch master
    • 4479845c - Refactor User.find_by_any_email
    • cdb671c4 - Add returning IDs to Gitlab::Database.bulk_insert
    • 71a34494 - Cache feature names in RequestStore
    • ada56efe - Rewrite the GitHub importer from scratch
    • 2a528131 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Nov 04, 2017

    added 59 commits

    • 161cb5c4...4841c3cb - 54 commits from branch master
    • 4479845c - Refactor User.find_by_any_email
    • cdb671c4 - Add returning IDs to Gitlab::Database.bulk_insert
    • 71a34494 - Cache feature names in RequestStore
    • ada56efe - Rewrite the GitHub importer from scratch
    • 2a528131 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 59 commits * 161cb5c4...4841c3cb - 54 commits from branch `master` * 4479845c - Refactor User.find_by_any_email * cdb671c4 - Add returning IDs to Gitlab::Database.bulk_insert * 71a34494 - Cache feature names in RequestStore * ada56efe - Rewrite the GitHub importer from scratch * 2a528131 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8057253&start_sha=161cb5c4effe22f53d09383687dad3fb6f8ce2ec)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will require a lease however. as completed

    Nov 04, 2017

    marked the task The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will require a lease however. as completed

    marked the task **The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will require a lease however.** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • eafe3e44 - Rewrite the GitHub importer from scratch
    • 1227b80e - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Nov 04, 2017

    added 2 commits

    • eafe3e44 - Rewrite the GitHub importer from scratch
    • 1227b80e - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * eafe3e44 - Rewrite the GitHub importer from scratch * 1227b80e - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8093262&start_sha=2a52813116c9e76d5d5b5ca628113f4aad914b3f)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • f7c421de - Rewrite the GitHub importer from scratch
    • 11bf14d1 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Nov 05, 2017

    added 2 commits

    • f7c421de - Rewrite the GitHub importer from scratch
    • 11bf14d1 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * f7c421de - Rewrite the GitHub importer from scratch * 11bf14d1 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8131486&start_sha=1227b80e63c43ae4de036dd3847a3bcaa3764492)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task It seems that under certain conditions the page numbers of e.g. the pull requests data isn't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing. as completed

    Nov 05, 2017

    marked the task It seems that under certain conditions the page numbers of e.g. the pull requests data isn't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing. as completed

    marked the task **It seems that under certain conditions the page numbers of e.g. the pull requests data isn't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing.** as completed
    Toggle commit list
  • James Lopez @jameslopez

    mentioned in merge request !13788 (closed)

    Nov 06, 2017

    mentioned in merge request !13788 (closed)

    mentioned in merge request !13788
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 247 commits

    • 11bf14d1...aaee69c4 - 242 commits from branch master
    • b9785387 - Refactor User.find_by_any_email
    • 79d1d32f - Add returning IDs to Gitlab::Database.bulk_insert
    • 47b39cc1 - Cache feature names in RequestStore
    • 335f3742 - Rewrite the GitHub importer from scratch
    • 287e54a1 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Nov 06, 2017

    added 247 commits

    • 11bf14d1...aaee69c4 - 242 commits from branch master
    • b9785387 - Refactor User.find_by_any_email
    • 79d1d32f - Add returning IDs to Gitlab::Database.bulk_insert
    • 47b39cc1 - Cache feature names in RequestStore
    • 335f3742 - Rewrite the GitHub importer from scratch
    • 287e54a1 - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 247 commits * 11bf14d1...aaee69c4 - 242 commits from branch `master` * b9785387 - Refactor User.find_by_any_email * 79d1d32f - Add returning IDs to Gitlab::Database.bulk_insert * 47b39cc1 - Cache feature names in RequestStore * 335f3742 - Rewrite the GitHub importer from scratch * 287e54a1 - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8184106&start_sha=11bf14d1bc2a0a4ddd87833a68219525a23e798c)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • daa7ae9d - Rewrite the GitHub importer from scratch
    • 0331e1ed - Enable the new GitHub importer with a feature flag

    Compare with previous version

    Nov 06, 2017

    added 2 commits

    • daa7ae9d - Rewrite the GitHub importer from scratch
    • 0331e1ed - Enable the new GitHub importer with a feature flag

    Compare with previous version

    added 2 commits * daa7ae9d - Rewrite the GitHub importer from scratch * 0331e1ed - Enable the new GitHub importer with a feature flag [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8188627&start_sha=287e54a12e410cf3317bb1b7f158166cf99320c9)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 47c2b926 - Rewrite the GitHub importer from scratch
    • 9a8f3f60 - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 06, 2017

    added 2 commits

    • 47c2b926 - Rewrite the GitHub importer from scratch
    • 9a8f3f60 - Replace old GH importer with the parallel importer

    Compare with previous version

    added 2 commits * 47c2b926 - Rewrite the GitHub importer from scratch * 9a8f3f60 - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8227678&start_sha=0331e1ed69bc9b351918c6408b9edab23d514f8a)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • 59a8df32 - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 06, 2017

    added 1 commit

    • 59a8df32 - Replace old GH importer with the parallel importer

    Compare with previous version

    added 1 commit * 59a8df32 - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8229136&start_sha=9a8f3f609fec6603e4f12c4a9cbf742564d62931)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 1 commit

    • b68e63e2 - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 06, 2017

    added 1 commit

    • b68e63e2 - Replace old GH importer with the parallel importer

    Compare with previous version

    added 1 commit * b68e63e2 - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8229209&start_sha=59a8df32198012eda1e572c06b5f892aa861b972)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 64 commits

    • b68e63e2...21d8ec15 - 59 commits from branch master
    • b433a499 - Refactor User.find_by_any_email
    • 1f72403a - Add returning IDs to Gitlab::Database.bulk_insert
    • b7216bfd - Cache feature names in RequestStore
    • 0960651c - Rewrite the GitHub importer from scratch
    • 2273d8a1 - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 06, 2017

    added 64 commits

    • b68e63e2...21d8ec15 - 59 commits from branch master
    • b433a499 - Refactor User.find_by_any_email
    • 1f72403a - Add returning IDs to Gitlab::Database.bulk_insert
    • b7216bfd - Cache feature names in RequestStore
    • 0960651c - Rewrite the GitHub importer from scratch
    • 2273d8a1 - Replace old GH importer with the parallel importer

    Compare with previous version

    added 64 commits * b68e63e2...21d8ec15 - 59 commits from branch `master` * b433a499 - Refactor User.find_by_any_email * 1f72403a - Add returning IDs to Gitlab::Database.bulk_insert * b7216bfd - Cache feature names in RequestStore * 0960651c - Rewrite the GitHub importer from scratch * 2273d8a1 - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8230645&start_sha=b68e63e29a94ecc6c4adae152b1e611a0cba4a42)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 06, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 85 commits

    • 2273d8a1...c71cf908 - 80 commits from branch master
    • 1ed36936 - Refactor User.find_by_any_email
    • 04c2b88d - Add returning IDs to Gitlab::Database.bulk_insert
    • 08878777 - Cache feature names in RequestStore
    • e7c215e6 - Rewrite the GitHub importer from scratch
    • becb4499 - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 06, 2017

    added 85 commits

    • 2273d8a1...c71cf908 - 80 commits from branch master
    • 1ed36936 - Refactor User.find_by_any_email
    • 04c2b88d - Add returning IDs to Gitlab::Database.bulk_insert
    • 08878777 - Cache feature names in RequestStore
    • e7c215e6 - Rewrite the GitHub importer from scratch
    • becb4499 - Replace old GH importer with the parallel importer

    Compare with previous version

    added 85 commits * 2273d8a1...c71cf908 - 80 commits from branch `master` * 1ed36936 - Refactor User.find_by_any_email * 04c2b88d - Add returning IDs to Gitlab::Database.bulk_insert * 08878777 - Cache feature names in RequestStore * e7c215e6 - Rewrite the GitHub importer from scratch * becb4499 - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8231961&start_sha=2273d8a1c6094a70c09757ddee3c5a7822d53746)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 07, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Changelog entry added, if necessary as completed

    Nov 07, 2017

    marked the task Changelog entry added, if necessary as completed

    marked the task **[Changelog entry](https://docs.gitlab.com/ee/development/changelog.html) added, if necessary** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Documentation created/updated as completed

    Nov 07, 2017

    marked the task Documentation created/updated as completed

    marked the task **[Documentation created/updated](https://docs.gitlab.com/ee/development/doc_styleguide.html)** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task API support added as completed

    Nov 07, 2017

    marked the task API support added as completed

    marked the task **API support added** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Tests added for this feature/bug as completed

    Nov 07, 2017

    marked the task Tests added for this feature/bug as completed

    marked the task **Tests added for this feature/bug** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Conform by the merge request performance guides as completed

    Nov 07, 2017

    marked the task Conform by the merge request performance guides as completed

    marked the task **Conform by the [merge request performance guides](https://docs.gitlab.com/ee/development/merge_request_performance_guidelines.html)** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    marked the task Squashed related commits together as completed

    Nov 07, 2017

    marked the task Squashed related commits together as completed

    marked the task **[Squashed related commits together](https://git-scm.com/book/en/Git-Tools-Rewriting-History#Squashing-Commits)** as completed
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Nov 07, 2017
    Master

    @DouweM Apart from the unrelated static-analysis/karma failures, this is ready for final review. Since the last review a few things have changed such as:

    • How we paginate data and resume where we left off (we track things in Redis in an atomic way)
    • A bunch of bugs have been fixed (e.g. the code used for handling dead messages was broken)
    • The HTTP client has been adjusted to better paginate data
    • Feature flag code has been removed and the importer has been made the default, this also means the Github::Import code has been removed as it is no longer needed

    Most of these however are fairly specific changes, the bigger picture still remains the same.

    Edited Nov 07, 2017 by Yorick Peterse
    @DouweM Apart from the unrelated static-analysis/karma failures, this is ready for final review. Since the last review a few things have changed such as: * How we paginate data and resume where we left off (we track things in Redis in an atomic way) * A bunch of bugs have been fixed (e.g. the code used for handling dead messages was broken) * The HTTP client has been adjusted to better paginate data * Feature flag code has been removed and the importer has been made the default, this also means the `Github::Import` code has been removed as it is no longer needed Most of these however are fairly specific changes, the bigger picture still remains the same.
  • Yorick Peterse @yorickpeterse

    added 75 commits

    • becb4499...666ab488 - 70 commits from branch master
    • 2432261a - Refactor User.find_by_any_email
    • c703a19e - Add returning IDs to Gitlab::Database.bulk_insert
    • af6a1b38 - Cache feature names in RequestStore
    • 59d618e5 - Rewrite the GitHub importer from scratch
    • 57db4c8a - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 07, 2017

    added 75 commits

    • becb4499...666ab488 - 70 commits from branch master
    • 2432261a - Refactor User.find_by_any_email
    • c703a19e - Add returning IDs to Gitlab::Database.bulk_insert
    • af6a1b38 - Cache feature names in RequestStore
    • 59d618e5 - Rewrite the GitHub importer from scratch
    • 57db4c8a - Replace old GH importer with the parallel importer

    Compare with previous version

    added 75 commits * becb4499...666ab488 - 70 commits from branch `master` * 2432261a - Refactor User.find_by_any_email * c703a19e - Add returning IDs to Gitlab::Database.bulk_insert * af6a1b38 - Cache feature names in RequestStore * 59d618e5 - Rewrite the GitHub importer from scratch * 57db4c8a - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8284223&start_sha=becb4499c363e23e8c9219989edb3feac15c71d0)
    Toggle commit list
  • Douwe Maan
    @DouweM started a discussion on the diff Nov 07, 2017
    Resolved by Yorick Peterse Nov 07, 2017
    app/controllers/import/github_controller.rb
    43 43 @target_namespace = find_or_create_namespace(namespace_path, current_user.namespace_path)
    44 44
    45 45 if can?(current_user, :create_projects, @target_namespace)
    46 @project = Gitlab::GithubImport::ProjectCreator.new(repo, @project_name, @target_namespace, current_user, access_params, type: provider).execute
    46 @project = Gitlab::LegacyGithubImport::ProjectCreator.new(repo, @project_name, @target_namespace, current_user, access_params, type: provider).execute
    • Douwe Maan @DouweM commented Nov 07, 2017
      Master

      Why do we keep LegacyGithubImport around?

      Why do we keep `LegacyGithubImport` around?
    • Douwe Maan @DouweM commented Nov 07, 2017
      Master

      Right, Gitea, I keep forgetting. So confusing.

      Right, Gitea, I keep forgetting. So confusing.
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Nov 07, 2017
    Resolved by Yorick Peterse Nov 07, 2017
    app/services/projects/import_service.rb
    4 4
    5 5 Error = Class.new(StandardError)
    6 6
    7 # Returns true if this importer is supposed to perform its work in the
    8 # background.
    9 #
    10 # This method will only return `true` if async importing is explicitly
    11 # supported by an importer class (`Gitlab::GithubImport::ParallelImporter`
    12 # for example).
    13 def async?
    14 return false unless has_importer?
    15
    16 importer_class.respond_to?(:async?) && importer_class.async?
    • Douwe Maan @DouweM commented Nov 07, 2017
      Master

      Is this the same thing as importer_class.try(:async?)?

      Is this the same thing as `importer_class.try(:async?)`?
    • Yorick Peterse @yorickpeterse

      changed this line in version 81 of the diff

      Nov 07, 2017

      changed this line in version 81 of the diff

      changed this line in [version 81 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8292323&start_sha=57db4c8a1dd9fbd7b64984c36ea8fc7e25de55ec#0f082d3e08008573b6ad29bfeb34ee272e353d78_16_16)
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on the diff Nov 07, 2017
    Resolved by Yorick Peterse Nov 08, 2017
    app/workers/gitlab/github_import/stage/import_repository_worker.rb 0 → 100644
    3 module Gitlab
    4 module GithubImport
    5 module Stage
    6 class ImportRepositoryWorker
    7 include Sidekiq::Worker
    8 include GithubImport::Queue
    9 include StageMethods
    10
    11 # client - An instance of Gitlab::GithubImport::Client.
    12 # project - An instance of Project.
    13 def import(client, project)
    14 # In extreme cases it's possible for a clone to take more than the
    15 # import job expiration time. To work around this we schedule a
    16 # separate job that will periodically run and refresh the import
    17 # expiration time.
    18 RefreshImportJidWorker.perform_in_the_future(project.id, jid)
    • Douwe Maan @DouweM commented Nov 07, 2017
      Master

      Why does RefreshImportJidWorker need a check_job_id? Could it use the actual project.import_jid? They are equal anyway, since we update the import_jid column in Gitlab::GithubImport::ParallelImporter after we. schedule this ImportRepositoryWorker.

      Why does `RefreshImportJidWorker` need a `check_job_id`? Could it use the actual `project.import_jid`? They are equal anyway, since we update the `import_jid` column in `Gitlab::GithubImport::ParallelImporter` after we. schedule this `ImportRepositoryWorker`.
    • Yorick Peterse @yorickpeterse commented Nov 07, 2017
      Master

      @DouweM Because otherwise it would keep running forever. If we were to check key A and refresh it while running (= meaning the key exists) we'd never stop since we'd be refreshing the same key (= it would never time out as a result). This could lead to the import process getting stuck if the clone job were to die unexpectedly and somehow not unset the JID key.

      The values also aren't equal, since we're checking the JID of Stage::ImportRepositoryWorker, not the one originally produced by RepositoryImportWorker.

      @DouweM Because otherwise it would keep running forever. If we were to check key A and refresh it while running (= meaning the key exists) we'd never stop since we'd be refreshing the same key (= it would never time out as a result). This could lead to the import process getting stuck if the clone job were to die unexpectedly and somehow not unset the JID key. The values also aren't equal, since we're checking the JID of `Stage::ImportRepositoryWorker`, not the one originally produced by `RepositoryImportWorker`.
    • Yorick Peterse @yorickpeterse commented Nov 07, 2017
      Master

      @DouweM Derp you're right, I don't even remember how my own code works. We set import_jid in the ParallelImporter.

      I'll see if I can get jid of the check_job_id since it's somewhat redundant, and if the keep-running-forever problem described above is something we can deal with.

      @DouweM Derp you're right, I don't even remember how my own code works. We set import_jid in the ParallelImporter. I'll see if I can get jid of the check_job_id since it's somewhat redundant, and if the keep-running-forever problem described above is something we can deal with.
    • Yorick Peterse @yorickpeterse commented Nov 07, 2017
      Master

      The more I think of it, the more I think RefreshImportJidWorker is not as useful as anticipated, at least not in the current implementation.

      First of all it can lead to an import getting stuck. This can happen when a ImportRepositoryWorker job gets terminated in such a way that its JID does not get removed (e.g. a SIGKILL). In this case RefreshImportJidWorker will keep refreshing the same JID forever, preventing the import from getting marked as stuck.

      Second there's a race condition. Because we use the JID of ImportRepositoryWorker the import can be marked as stuck after that worker finishes and before the next worker starts, this is because Sidekiq will unset the JID. To work around this we'd have to use a custom import_jid value, unrelated to any actual Sidekiq workers.

      The only way I can see us solving both problems is by having an additional thread in every Sidekiq process. This thread has a synchronised list of JIDs that is updated by Sidekiq threads as jobs come and go. Once a minute this thread wakes up, takes the list of current JIDs (= the ones the process is running), then refreshes their TTL. This however still requires a unique JID for GitHub imports since the work is spread across multiple JIDs.

      Edited Nov 07, 2017 by Yorick Peterse
      The more I think of it, the more I think RefreshImportJidWorker is not as useful as anticipated, at least not in the current implementation. First of all it can lead to an import getting stuck. This can happen when a ImportRepositoryWorker job gets terminated in such a way that its JID does not get removed (e.g. a SIGKILL). In this case RefreshImportJidWorker will keep refreshing the same JID forever, preventing the import from getting marked as stuck. Second there's a race condition. Because we use the JID of ImportRepositoryWorker the import can be marked as stuck after that worker finishes and before the next worker starts, this is because Sidekiq will unset the JID. To work around this we'd have to use a custom import_jid value, unrelated to any actual Sidekiq workers. The only way I can see us solving both problems is by having an additional thread in every Sidekiq process. This thread has a synchronised list of JIDs that is updated by Sidekiq threads as jobs come and go. Once a minute this thread wakes up, takes the list of current JIDs (= the ones the process is running), then refreshes their TTL. This however still requires a unique JID for GitHub imports since the work is spread across multiple JIDs.
    • Yorick Peterse @yorickpeterse commented Nov 07, 2017
      Master

      Basically we'd need a way to distinguish between a job that is still running, and a job that is no longer running but for which the key still exists.

      Basically we'd need a way to distinguish between a job that is still running, and a job that is no longer running but for which the key still exists.
    • Yorick Peterse @yorickpeterse commented Nov 07, 2017
      Master

      I changed the code to at least use custom values for projects.import_jid, solving the problem of refreshing the same JID. This means we basically have 30 hours to perform the clone (15 hours timeout in case the clone job crashes, then another 15 hours for the custom JID). This is good enough for now, but in the long term we need something better.

      I changed the code to at least use custom values for `projects.import_jid`, solving the problem of refreshing the same JID. This means we basically have 30 hours to perform the clone (15 hours timeout in case the clone job crashes, then another 15 hours for the custom JID). This is good enough for now, but in the long term we need something better.
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Nov 07, 2017
    Resolved by Yorick Peterse Nov 07, 2017
    doc/user/project/import/github.md
    43 45 namespace that started the import process.
    44 46
    45 47 The importer will also import branches on forks of projects related to open pull
    46 requests. These branches will be imported with a naming scheume similar to
    48 requests. These branches will be imported with a naming scheume similar to
    • Douwe Maan @DouweM commented Nov 07, 2017
      Master

      "scheume"

      "sche{-u-}me"
    • Yorick Peterse @yorickpeterse

      changed this line in version 81 of the diff

      Nov 07, 2017

      changed this line in version 81 of the diff

      changed this line in [version 81 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8292323&start_sha=57db4c8a1dd9fbd7b64984c36ea8fc7e25de55ec#f38837d54dc4ad950cc6bb90504803d2b315e468_48_48)
      Toggle commit list
    Please register or sign in to reply
  • Douwe Maan
    @DouweM started a discussion on an old version of the diff Nov 07, 2017
    Resolved by Yorick Peterse Nov 07, 2017
    lib/gitlab/github_import/user_finder.rb 0 → 100644
    127
    128 Caching.write(ID_CACHE_KEY % id, gitlab_id)
    129 end
    130
    131 # Queries and caches the GitLab user ID for a GitHub email, if one was
    132 # found.
    133 def id_for_github_email(email)
    134 gitlab_id = query_id_for_github_email(email) || nil
    135
    136 Caching.write(ID_FOR_EMAIL_CACHE_KEY % email, gitlab_id)
    137 end
    138
    139 def query_id_for_github_id(id)
    140 User
    141 .joins(:identities)
    142 .where(identities: { provider: :github, extern_uid: id.to_s })
    • Douwe Maan @DouweM commented Nov 07, 2017
      Master

      Can we create a scope on User for this? That will make it slightly easier to find and fix this when we get to #38822 (closed). I know that this ID is numeric, so it doesn't really matter, but scopes make it easier to be consistent :)

      Can we create a scope on `User` for this? That will make it slightly easier to find and fix this when we get to https://gitlab.com/gitlab-org/gitlab-ce/issues/38822. I know that this ID is numeric, so it doesn't really matter, but scopes make it easier to be consistent :)
    • Yorick Peterse @yorickpeterse

      changed this line in version 81 of the diff

      Nov 07, 2017

      changed this line in version 81 of the diff

      changed this line in [version 81 of the diff](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8292323&start_sha=57db4c8a1dd9fbd7b64984c36ea8fc7e25de55ec#596c9cc79f8f58fc6daa1e1065417b8073191df7_142_140)
      Toggle commit list
    Please register or sign in to reply
  • Yorick Peterse @yorickpeterse

    changed the description

    Nov 07, 2017

    changed the description

    changed the description
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 54 commits

    • 57db4c8a...83e96c7b - 51 commits from branch master
    • dea6aa55 - Refactor User.find_by_any_email
    • dec01b5c - Add returning IDs to Gitlab::Database.bulk_insert
    • 528ea70a - Cache feature names in RequestStore

    Compare with previous version

    Nov 07, 2017

    added 54 commits

    • 57db4c8a...83e96c7b - 51 commits from branch master
    • dea6aa55 - Refactor User.find_by_any_email
    • dec01b5c - Add returning IDs to Gitlab::Database.bulk_insert
    • 528ea70a - Cache feature names in RequestStore

    Compare with previous version

    added 54 commits * 57db4c8a...83e96c7b - 51 commits from branch `master` * dea6aa55 - Refactor User.find_by_any_email * dec01b5c - Add returning IDs to Gitlab::Database.bulk_insert * 528ea70a - Cache feature names in RequestStore [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8292323&start_sha=57db4c8a1dd9fbd7b64984c36ea8fc7e25de55ec)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 4b5afc81 - Rewrite the GitHub importer from scratch
    • c7ead95a - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 07, 2017

    added 2 commits

    • 4b5afc81 - Rewrite the GitHub importer from scratch
    • c7ead95a - Replace old GH importer with the parallel importer

    Compare with previous version

    added 2 commits * 4b5afc81 - Rewrite the GitHub importer from scratch * c7ead95a - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8292774&start_sha=83e96c7b9b89d4ff35aeac62ac36b0e3d9553ddc)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 09282d2a - Rewrite the GitHub importer from scratch
    • d04ad08f - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 07, 2017

    added 2 commits

    • 09282d2a - Rewrite the GitHub importer from scratch
    • d04ad08f - Replace old GH importer with the parallel importer

    Compare with previous version

    added 2 commits * 09282d2a - Rewrite the GitHub importer from scratch * d04ad08f - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8296291&start_sha=c7ead95a3cc0f0fe6bbc97e97c0ab744d77638b9)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 125 commits

    • d04ad08f...673b6be1 - 120 commits from branch master
    • 44be82dd - Refactor User.find_by_any_email
    • bda30182 - Add returning IDs to Gitlab::Database.bulk_insert
    • 90be53c5 - Cache feature names in RequestStore
    • 91c34e0d - Rewrite the GitHub importer from scratch
    • 6ce286f9 - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 07, 2017

    added 125 commits

    • d04ad08f...673b6be1 - 120 commits from branch master
    • 44be82dd - Refactor User.find_by_any_email
    • bda30182 - Add returning IDs to Gitlab::Database.bulk_insert
    • 90be53c5 - Cache feature names in RequestStore
    • 91c34e0d - Rewrite the GitHub importer from scratch
    • 6ce286f9 - Replace old GH importer with the parallel importer

    Compare with previous version

    added 125 commits * d04ad08f...673b6be1 - 120 commits from branch `master` * 44be82dd - Refactor User.find_by_any_email * bda30182 - Add returning IDs to Gitlab::Database.bulk_insert * 90be53c5 - Cache feature names in RequestStore * 91c34e0d - Rewrite the GitHub importer from scratch * 6ce286f9 - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8298482&start_sha=d04ad08fe1993cc39da3b9e18db269b90a2b7b27)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    added 2 commits

    • 4dfe26cd - Rewrite the GitHub importer from scratch
    • 6e242e82 - Replace old GH importer with the parallel importer

    Compare with previous version

    Nov 07, 2017

    added 2 commits

    • 4dfe26cd - Rewrite the GitHub importer from scratch
    • 6e242e82 - Replace old GH importer with the parallel importer

    Compare with previous version

    added 2 commits * 4dfe26cd - Rewrite the GitHub importer from scratch * 6e242e82 - Replace old GH importer with the parallel importer [Compare with previous version](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14731/diffs?diff_id=8302616&start_sha=6ce286f9d786cd6a45d6d2c75971b54604ccb2dc)
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    resolved all discussions

    Nov 08, 2017

    resolved all discussions

    resolved all discussions
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    assigned to @DouweM

    Nov 08, 2017

    assigned to @DouweM

    assigned to @DouweM
    Toggle commit list
  • Douwe Maan @DouweM

    approved this merge request

    Nov 08, 2017

    approved this merge request

    approved this merge request
    Toggle commit list
  • Douwe Maan @DouweM

    mentioned in commit 92249f1a

    Nov 08, 2017

    mentioned in commit 92249f1a

    mentioned in commit 92249f1ac883c2a861235ec49526cbafca73b362
    Toggle commit list
  • Douwe Maan @DouweM

    merged

    Nov 08, 2017

    merged

    merged
    Toggle commit list
  • Yorick Peterse @yorickpeterse

    mentioned in merge request gitlab-ee!3310 (merged)

    Nov 08, 2017

    mentioned in merge request gitlab-ee!3310 (merged)

    mentioned in merge request gitlab-ee!3310
    Toggle commit list
  • Yorick Peterse @yorickpeterse commented Nov 08, 2017
    Master

    EE MR: gitlab-ee!3310 (merged)

    EE MR: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/3310
  • Yorick Peterse @yorickpeterse

    mentioned in merge request gitlab-com/www-gitlab-com!7736 (merged)

    Nov 08, 2017

    mentioned in merge request gitlab-com/www-gitlab-com!7736 (merged)

    mentioned in merge request gitlab-com/www-gitlab-com!7736
    Toggle commit list
  • Douwe Maan @DouweM

    mentioned in commit gitlab-ee@d1825985

    Nov 09, 2017

    mentioned in commit gitlab-ee@d1825985

    mentioned in commit gitlab-ee@d1825985d4bd197ff58c7de6f1a64b7871a48365
    Toggle commit list
  • James Lopez @jameslopez

    mentioned in issue #31425 (closed)

    Dec 04, 2017

    mentioned in issue #31425 (closed)

    mentioned in issue #31425
    Toggle commit list
  • Douwe Maan @DouweM

    mentioned in issue #40107

    Dec 05, 2017

    mentioned in issue #40107

    mentioned in issue #40107
    Toggle commit list
  • Brendan O'Leary 🐢 @boleary

    mentioned in issue #37249

    Jan 09, 2018

    mentioned in issue #37249

    mentioned in issue #37249
    Toggle commit list
  • Write
  • Preview
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or sign in to comment
Douwe Maan
Assignee
Douwe Maan @DouweM
Assign to
10.2
Milestone
10.2
Assign milestone
Time tracking
4
Labels
Platform database import performance
Assign labels
  • View labels
Reference: gitlab-org/gitlab-ce!14731
×

Revert this merge request

This will create a new commit in order to revert the existing changes.

Switch branch
Cancel
A new branch will be created in your fork and a new merge request will be started.
×

Cherry-pick this merge request

Switch branch
Cancel
A new branch will be created in your fork and a new merge request will be started.