The Import group needs to focus on the Importers direction in order to deliver market-required features. To accomplish that, we would need to insulate the group from any customer-specific work unrelated to the direction. This issue shows a roadmap of what features we need to deliver with a general timeline for the delivery.
Goals
Why is this important? In order to enable large enterprises to migrate to GitLab, our importers must scale to support reliable imports of large data sets.
This is currently not the case. Our importers are not resilient and therefore not reliable. We also run into file size limitations imposed by our infrastructure. In order to overcome these challenges, a significant investment must be made into GitLab and GitHub importers. To achieve that, we must be able to focus the team on these long-term goals, even at the cost of some short-term pain. Without these investments, we will not be successful in the long run.
Why focus on scaling our importers, reliability, and overcoming import size limitations? Large enterprises have large data sets that need to be migrated into GitLab. These migrations are tied to ARR opportunities, which may be put at risk if we can't support those migrations.
Why focus on GitLab and GitHub importers? The latest data shows about 52% of imported projects coming from GitLab and 18% coming from GitHub. These are the 2 largest sources of our imports. Based on the usage and our SaaS First initiative, we will focus on delivering GitLab-to-GitLab Migration first.
Proposed Solution
Focus the group::import team on the long-term direction for 12 months in order to achieve reliable and scalable imports from GitHub and GitLab.
These are the highlights of the proposed roadmap (CY):
By the end of this quarter (Q3'21), we will have the new migration solution for Groups, with the goal of deprecating the current file-based Group Export/Import. We will also improve the reliability of the GitHub Importer.
By the end of this year (2021), we will have a replacement for the file-based Project Export/Imports, removing the project size limits imposed by the infrastructure file-size limits.
By the end of Q1'22, we will support self-serviced GitHub migrations of massive projects.
By the end of Q2'22, we will deliver a Complete GitHub importer that will satisfy requirements for complex organizations migrating into GitLab.
Assumptions
Roadmap length: 1 year
Team size/composition/allocation remains constant
Increasing the BE team size would result in acceleration of the timeline
Team only works on the roadmap items (and bugs)
Any customer-specific work would push the timeline
Proposed Roadmap
Final proposal
Focused delivery of the Scalable GitLab Importer, followed by a focused delivery of the Complete GitHub Importer
@ogolowinski @lmcandrew - Would you please provide your initial feedback on this issue: the content, the timeline, the granularity, and any other aspect?
I would like to iterate with a smaller crowd first, before opening the issue to everyone.
Thanks @hdelalic - I thought we were working in parallel on the GitHub importer and the GitLab importer. From this, it looks like we are working serially on one and then moving to the next.
Could you please share your reasoning? It is not obvious to me why this plan is better.
Sure! Working in parallel makes small iterations to each of the solutions and lets us get feedback on each of them faster than doing this serially.
GitHub importer is blocking some large deals so we need to make progress there while still allowing users to migrate from self hosted to SaaS. Having any of these wait on the sidelines without making any progress for several months does not seem like a good path forward to me.
I agree that we have the need to make progress on both large features. However, parallelizing the two large features will result in both features being completed at the end of the 12-month period, instead of getting one in 6 months and the other in 12 months.
I think that this ask is not feasible:
GitHub importer is blocking some large deals so we need to make progress there while still allowing users to migrate from self hosted to SaaS.
The problem here is that we can only make progress on one at the expense of the other. Focus on one delays the other. Focus on both delays both.
That said, in addition to sequential focus, we could try to optimize the delivery in these ways:
Select a primary feature and put the majority of our effort there, while working on small changes that would allow us to get feedback on the other feature. We could finish the higher priority feature earlier than in 12 months.
Create focus blocks of 1-2 milestones where the team would focus on one feature per milestone to reduce the cost of context switching. We would still deliver both features only at the end of the period.
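The trade-off discussed above can be illustrated with a toy calculation. This is only a sketch under loud assumptions: each feature is assumed to need 6 months of full-team effort (a hypothetical figure chosen to match the 12-month roadmap), the parallel case splits the team's time evenly, and `switch_overhead` is a made-up parameter for context-switching cost.

```python
def sequential(efforts):
    """Finish time (in months) of each feature when worked on one at a time, in order."""
    done, elapsed = {}, 0.0
    for name, effort in efforts:
        elapsed += effort
        done[name] = elapsed
    return done


def parallel(efforts, switch_overhead=0.0):
    """Finish time of each feature when capacity is split evenly across unfinished features.

    switch_overhead is the (hypothetical) fraction of capacity lost to context
    switching while more than one feature is in flight.
    """
    remaining = dict(efforts)
    done, elapsed = {}, 0.0
    while remaining:
        overhead = switch_overhead if len(remaining) > 1 else 0.0
        rate = (1.0 - overhead) / len(remaining)  # progress per feature per month
        name = min(remaining, key=remaining.get)  # next feature to finish
        worked = remaining[name]
        elapsed += worked / rate
        for key in remaining:
            remaining[key] -= worked
        done[name] = elapsed
        del remaining[name]
    return done


if __name__ == "__main__":
    # Assumed efforts: 6 months of full-team work per feature (illustrative only).
    features = [("GitLab Migration", 6), ("GitHub Importer", 6)]
    print(sequential(features))  # one feature at 6 months, the other at 12
    print(parallel(features))    # both features at 12 months
```

Under these assumptions the total duration is the same, but sequential focus delivers the first feature six months earlier; any non-zero `switch_overhead` pushes the parallel finish dates out further.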
What do you think about the idea to flip the order of sequence here?
Given the (more) firm date for the GitHub Importer, and no drop-off date for the GitLab Migration, we could stop working on the GitLab Migration and focus on the scalable GitHub Importer first.
I know this won't be popular with support or infradev, but it would align our delivery with the Extra-Large customer moving from GitHub.
Something like this:
The advantage of this plan over the one proposed in this comment above is that we would be ready for the large customer moving from GitHub by Q4'21, instead of Q3'22.
I have updated the Proposed Roadmap in the description above in order to achieve parallel delivery of the most important features for GitHub and GitLab importers sooner than the original proposal.
I have added a section with Issue Links for a quick reference to all of the epics/issues proposed in the roadmap.
I have added a Goals section to the description to provide the reasons why this focused roadmap is critically important to our success and serve as a motivation for the team.
Would you please review at your convenience and provide your feedback? I would like to share this issue with a wider audience, if we are on the same page.
This does raise another concern regarding the small team size. I completely appreciate the Product considerations above, but only having 1 backend engineer on each of these large initiatives does mean we are extremely stretched. Further, it means we have zero resilience in the team and any knowledge sharing considerations (e.g. code reviewing each other's MRs) means constant context switching.
There are also architectural considerations for us to think about. Ideally, the ETL framework that has been created for the new GitLab Group Migration tool will become a framework used by all importers in the future. Completing this first will allow us to utilize it for improvements to the GitHub importer. However, I understand we need to be pragmatic and there is customer demand meaning we will likely have to iterate on the current GitHub importer first.
@lmcandrew Thanks for your perspective on this issue. I fully understand the context switching concern that you have on a team of this size. We will need to find ways to mitigate that concern.
How do you feel about the feasibility of the 12-month timeline to complete the two features with this team?
@lmcandrew @hdelalic is there a win-win here? Meaning things that can be developed first that are common between the two importers (for example large repo)?
@ogolowinski Unfortunately, the two importers don't have common code that could be developed first. The new GitLab Migration, once completed, has been architected to become the new common platform for all importers, but that work would have to be done once the platform is delivered and hardened.
That said, I think that there is a win-win here. We could parallelize the work in each quarter by splicing the features into thin iterations which could be played in each milestone. So, the milestones could still be focused on one feature at a time, but each quarter would see advancements made to each importer. With our Kanban process, we don't even have to wait for the official end of any milestone to shift our focus.
To accomplish that, we would need to insulate the group from any customer-specific work unrelated to the direction.
I am concerned about this and wonder if it is realistic? Is there some sort of plan or buffer in this roadmap that will accommodate and/or provide workarounds to prevent customer-specific work from impacting direction?
I'm also curious about what data we have to support these initiatives. It is hard for me to understand why these are the most important items to tackle and in this order. I'm not sure if the intent here was to include that or if that data exists somewhere else already, but it would make a stronger case for the lack of bandwidth and pushback on customer-specific work. For example, after 4 quarters, will we be able to support massive customer migrations for GitHub? In 2 quarters, will we no longer have import limitations for GitLab? Can Professional Services stop using Congregate and create fewer escalations in 12 months?
@hdelalic In addition to this, can you map how many customers/prospects are using the GitLab importer vs. the GitHub importer?
When can we expect to support large repos? I see that in Q2 it says scale UX to 1000 projects. Is there an iteration plan here? Can we start with API/scripts before this reaches the UI? Can we create buffers to take 500 projects at a time instead of all at once, or something similar?
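As a rough illustration of the "500 projects at a time" idea, a script driving a migration could split the project list into fixed-size batches. This is a minimal sketch, not the importer's actual behavior: the `batches` helper and the project names are hypothetical, and no GitLab API call is shown.

```python
from itertools import islice


def batches(items, size=500):
    """Yield consecutive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Hypothetical project identifiers; a real script would read these from the source system.
    projects = [f"project-{n}" for n in range(1200)]
    for index, chunk in enumerate(batches(projects), start=1):
        # A real script would submit each chunk for import here and wait for it to finish.
        print(f"batch {index}: {len(chunk)} projects")
```

Processing one batch at a time would keep a failure contained to a single chunk, which can be retried without restarting the whole migration.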
To accomplish that, we would need to insulate the group from any customer-specific work unrelated to the direction.
I am concerned about this and wonder if it is realistic?
I share your concern, but that's kind of the purpose of this discussion - to show what can be done without such distractions and then to ask for support for this focused effort.
It is hard for me to understand why these are the most important items to tackle and in this order.
I have updated the Proposed Roadmap to provide more details, including links to epics/issues. I have also added a Goals section to the description to provide the reasons why this focused roadmap is critically important to our success. There are further details on the Importers Direction page.
Can Professional Services stop using Congregate and create fewer escalations in 12 months?
I expect much less need for PS import engagements, but Congregate will likely continue to be a tool that PS will use to script complex/custom migrations involving GitLab and other tools for large customers.
When can we expect to support large repos? I see that in Q2 it says scale UX to 1000 projects. Is there an iteration plan here? Can we start with API/scripts before this reaches the UI? Can we create buffers to take 500 projects at a time instead of all at once, or something similar?
According to the updated plan, support for extra-large projects for the GitHub importer would be delivered at the end of Q3'21. The UX for supporting a large number of projects would be delivered at the end of Q4'21. The API solution is already available here, so this would be mostly a frontend effort. This can be parallelized on the team, as we have a dedicated FE. All this is taken into account in the latest proposal for the roadmap.
The support for large projects for GitLab migrations would be delivered at the end of Q1'22.
@hdelalic your updates to the description are excellent. I agree with @ogolowinski in adding this data to the issue - that is the missing piece. You have done a great job with "Goals" in helping us understand what this roadmap hopes to achieve, but WHY are these the goals at all (as opposed to BitBucket migrations, for example)? This will help others understand why other topics are not the priority.
I am still concerned about unexpected problems not being accounted for. This is a barrier to success. I wonder if a solution to that is to over-estimate or provide padding for these? That way, instead of saying "Any customer-specific work would push the timeline," you could say "Any decrease in customer-specific work would expedite the timeline."
We do have alignment already on customer-specific work being inefficient to work on, and we will support you there, but what if this changes throughout the next 12 months? This may be a non-issue if the epics you've provided have strategic MVCs already in place. I have not gone through the epics, but that would allow portions of the goals to be fulfilled despite barriers arising.
@m_gill Thanks. In addition, I think we should address the WHY behind prospects and large customers coming in, which is why we made GitHub a higher priority. It may not be the larger percentage of imports, but there is a lot of ARR behind it. Also, I think we should explain that solving the larger issues (large repos, more logging, etc.) will help in the long run with customer-specific scenarios.
I expect much less need for PS import engagements, but Congregate will likely continue to be a tool that PS will use to script complex/custom migrations involving GitLab and other tools for large customers
@hdelalic - I'm not sure I agree with this. Large-scale customers who are moving their source code from another system to a GitLab destination tend to engage with Professional Services. The Congregate automation is coordinating API calls to the importers. So when the importers improve in quality, so does the customer experience (for either customers who decide to migrate themselves or the ones who engage with PS). I don't think there will be fewer migration engagements. In fact, if we continue to grow at the pace we have over the past few years, we will likely see more engagements.
@bryan-may I suspect "engagements" here refers to Professional Services needing to engage with the Import team directly rather than being able to do these seamlessly for the customer. When this happens, it's an escalation that derails the roadmap. Is this also what you were referring to, or are you speaking to the engagements from customer to Professional Services only? We want to prevent (long term) the need for group::import to be involved with the customer/Professional Services migrations.
No, sorry. When I said engagements, I meant customers engaging with GitLab PS to help them migrate their data to GitLab, not engagements between GitLab PS and group::import to unstick customer problems.
I am still concerned about unexpected problems not being accounted for. This is a barrier to success. I wonder if a solution to that is to over-estimate or provide padding for these? That way, instead of saying "Any customer-specific work would push the timeline," you could say "Any decrease in customer-specific work would expedite the timeline."
I agree with your concern here. The proposed timeline was my initial guess and is definitely subject to being adjusted with input from Engineering on the feasibility. In my estimate, I have assumed that there is still some padding there for unexpected problems. If Engineering feels that the timeline is too aggressive with too little wiggle room, I will adjust the roadmap.
That said, there is always wiggle room in the requirements and how deep we take each individual epic and issue. I will be continuously prioritizing all the work in order to stay on track.
@gitlab-org/manage/import/backend Would you please provide any input on the overall feasibility of the proposed timeline?
I'm not sure I agree with this. Large-scale customers who are moving their source code from another system to a GitLab destination tend to engage with Professional Services. The Congregate automation is coordinating API calls to the importers. So when the importers improve in quality, so does the customer experience (for either customers who decide to migrate themselves or the ones who engage with PS). I don't think there will be fewer migration engagements. In fact, if we continue to grow at the pace we have over the past few years, we will likely see more engagements.
I agree that there will be more PS engagements in the future due to increased business. What I meant was that more of those migrations would be able to be performed by the customer, because the importers would be able to handle larger data sets more reliably. Sorry for not being clear.
Would you please provide any input on the overall feasibility of the proposed timeline?
My take on the proposals:
By the end of this year (2021), we will have a replacement for the file-based Project Export/Imports, removing the project size limits imposed by the infrastructure file-size limits.
IMHO, this is a stretch. The Group Migration was achieved over almost 3 quarters (it started at the end of September 2020). Although we now have a good framework to introduce the Project Migration, the Project Migration is far more complex than the Group Migration. For this reason, I'm not comfortable with the estimate of 2 quarters to finish the Project Migration.
By the end of Q1'22, we will support self-serviced GitHub migrations of massive projects.
What is the "massive projects" definition? It's hard to assess the estimated timelines without more information on that.
By the end of Q2'22, we will deliver a Complete GitHub importer that will satisfy requirements for complex organizations migrating into GitLab.
Similarly, what are the "requirements for complex organizations" to migrate to GitLab? It's hard to assess the estimated timelines without more information on that.
I know we have some issues/epics linked in the description, but most of these epics are constantly growing in scope. That's why I think it's so important to have some base definitions, if possible with exact numbers, of what we're trying to achieve on each delivery. For example, with fictional numbers:
"Massive Projects"
5 GB repository with up to 1000 issues/pull requests (in total; issues + pull requests); OR
10 GB repository with up to 500 issues/pull requests (in total; issues + pull requests)
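A definition like this could be written down as a simple check, which makes the target testable. The sketch below uses the fictional numbers from the list above and assumes they are upper bounds on what a delivery must support; the function and constant names are hypothetical.

```python
# Fictional thresholds taken from the example above, read as upper bounds:
# (maximum repository size in GB, maximum issues + pull requests in total).
MASSIVE_PROJECT_TARGETS = [
    (5, 1000),
    (10, 500),
]


def within_massive_project_target(repo_size_gb, issue_and_pr_count):
    """Return True if a project fits at least one of the agreed size envelopes."""
    return any(
        repo_size_gb <= max_gb and issue_and_pr_count <= max_items
        for max_gb, max_items in MASSIVE_PROJECT_TARGETS
    )


if __name__ == "__main__":
    print(within_massive_project_target(4, 900))   # fits the first envelope
    print(within_massive_project_target(8, 400))   # fits the second envelope
    print(within_massive_project_target(8, 900))   # fits neither envelope
```

With agreed numbers in place of the fictional ones, the same check could be used to decide whether a given customer migration is inside the scope of a delivery.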
Also, as mentioned in threads above, I would like to highlight the "bus factor" of the team. Currently, group::import has only 2 backend engineers, which makes it hard to deliver a large number of features, even more so given the number of support issues we have to handle. Besides that, we try to alternate PTO and other time off, but to be able to do that effectively we try to avoid creating silos of knowledge about the importers. Therefore, although the "Alternate proposal" looks good, it sounds to me like each track would be maintained by a single engineer, which carries a high risk IMHO.
By the end of this year (2021), we will have a replacement for the file-based Project Export/Imports, removing the project size limits imposed by the infrastructure file-size limits.
IMHO, this is a stretch. The Group Migration was achieved over almost 3 quarters (it started at the end of September 2020). Although we now have a good framework to introduce the Project Migration, the Project Migration is far more complex than the Group Migration. For this reason, I'm not comfortable with the estimate of 2 quarters to finish the Project Migration.
Agreed. Having 2 engineers working on separate projects, plus distractions, makes this goal a stretch, especially because project migration is much more complex than group migration. In addition, groups and projects are going to be migrated within the same migration process, which wasn't happening before, and that can lead to unexpected issues.
By the end of Q1'22, we will support self-serviced GitHub migrations of massive projects.
What is the "massive projects" definition? It's hard to assess the estimated timelines without more information on that.
Agreed. It would be better to have concrete numbers, since what we consider a massive project may not be the same for somebody else. Kassio fixed a lot of bugs in the importer making data migration more consistent, but from what I understand there are issues with import process resiliency. Would fixing these problems mean that this goal is achieved?
The 'current state' says that GitHub importer is able to import simple projects. I don't think that's the case. It is able to import a lot of complex data already.
Generally speaking, I agree with the proposed goals, but I'm not sure about the timeline, as it looks like a guesstimate that puts 1 major thing per quarter.
Additionally, I see we put emphasis on 'insulating' the team from any customer-specific work. Does that mean we are going to simply ignore customer issues? This does sound unrealistic as there are a number of issues our team has to support, which primarily come from PS engagements, and not doing anything about those can be detrimental to the business.
Lastly, both Project/Group Migration and GitHub importer are going to require frontend work (what I read from the timeline) and we only have 1 frontend engineer, which can also impact the timeline.
The issues linked do not correlate to the naming on the slide - therefore it is hard to understand what relates to what - can you keep a consistent naming convention for the issues so that it is easily understood?
How does this consider IdP users? As we are currently prioritizing the GitHub importer because of Enterprise customers, how is this import considered? Do we use SAML sync (or similar) before importing repos, issues, etc.?
@ogolowinski Currently, our Importers do not create (import) users. Users are expected to exist in the destination instance and, if they do, they are matched to various objects (issues, epics, comments, MRs, etc.) during import.
The new GitLab Migration feature will revisit this and solve the user import where possible (respecting security and privacy constraints). So far, we have only completed an initial technical spike, but the feature will be moved to the back of the delivery timeline, since it is not considered a part of "parity" with the current feature.
The desire is, of course, to take advantage of IdP where possible.
@hdelalic I appreciate your efforts here and working to answer our questions - I know we have a lot of them. I understand that the real request from Import was to increase headcount. Since that is not an option in the near term, we want to make sure your roadmap is justified with data and makes sense for customer demands, so that this group is empowered to push back on requests in an informed way.
We have talked in the past about how increasing capacity can be done by increasing headcount or by being transparent with what is, and isn't, a priority. Our questions are meant to help you be transparent with priorities. I hope you understand!
@dennis @lmcandrew the roadmap and description here have been updated and are in a much closer and different place. Would you provide feedback on any concerns or barriers you foresee?
@m_gill I think the roadmap makes sense. This focus will be the best path forward given the capacity we have.
My only reservation is concerning the alternative proposal with parallelized efforts, as the context switching will likely cause a little friction, but I understand that it may be necessary to move the needle across both importers in the short-term.
@m_gill @ogolowinski - After further consideration, I think prioritizing a solution to the GitHub importer (scalability) makes sense if the GH importer is a smaller amount of work.
Problems with both GitHub and GitLab importers are affecting customers in getting their data onto GitLab. I think if the GitHub importer improvements are a smaller amount of work (as this approach is already a streaming one), it makes sense to prioritize the improvements to the GitHub importer to be able to handle projects with lots of metadata (PRs, comments, etc.). Also, focusing on the backend functionality/scalability and exposing it via the API is more important than the front end, as most customers who have large repos in terms of metadata also have lots of repos. This means they are likely scripting the migration and not using the UI.
Second, I would prioritize the GitLab streaming group and project import (parity, project metadata, project repo) work to help resolve some of the issues listed here. This will resolve the hard 5GB limit and hopefully resolve some of the large project metadata problems. Again, the backend functionality via API is more important than the front end.
I think bringing both importers to "completion" should be considered after these are completed. I understand there is a context switching "tax" to be paid, but it's likely the shortest path to unstick most customer engagements.
@lmcandrew in terms of effort, would you say that the GitHub tasks are a smaller effort? This is the assumption because it is fixing the existing importer and not creating a new one.
would you say that the GitHub tasks are a smaller effort?
It's a difficult question to answer. It's more likely there could be small bug fixes and improvements since, as you say, we already have the existing importer. However, there are more unknowns with GitHub, our observability over imports isn't comprehensive (gitlab-org&6270), and we can't influence the export process in the same way we can for GitLab.
@hdelalic Thanks for the transparency on the focused roadmap for importing. Speaking from field experience in Mid Market, self-managed to GitLab.com has been a weekly conversation with customers, and these migrations are the most common custom engagements we create with the PS team.
FWIW, I was also curious about the site analytics on doc usage for the importers. It appears to align with your data:
@hdelalic - Thank you. Am I right that in both your proposals we get done with the roadmap by end of Q2? If this is true - then I don't understand what optimized refers to. Optimized for us as GitLab or for our customers and prospects? We should aim to optimize for our customers and prospects but do it in the most efficient manner possible.
@adawar Sorry for not being clear there. The optimized proposal was optimized for delivery velocity, as it is truly focused on one large feature at a time. For a small team, context switching carries a lot of costs and that's what the alternate solution was based on.
The optimized solution is my preference, but the benefits of the alternate solution are faster delivery of the high priority issues for both features.
Got it. Thank you. I'm inclined to ask you to focus on the faster delivery of high priority issues as we have live customers in the middle of these situations. Additionally it also appears to be the best path to accelerate and spur adoption which is a key goal for the PM team.
Thanks for the feedback, @adawar. I have taken into consideration your suggestion to initially focus on the high priority issues and have combined the two proposals into one that initially prioritizes GitHub Importer's reliability as a high priority and then switches to sequential delivery of the two large features in order to optimize for velocity.
Please let me know if you have any additional concerns or suggestions.
@hdelalic - Thank you. One thing is extremely important given the expectations from our sales team as well as customers: I want us to proactively communicate our posture to the right people. Some ways to do that are:
Record a video of the direction page
Explicitly and clearly call out the relentless prioritization and its impact. Help people understand why the other approach is untenable.
Share it with key leaders and influencers in CS and Support so that everyone is on the same page.
Offer a standard response they can provide to our customers who will be unhappy with this roadmap stance.
Share whatever alternatives there are for those impacted customers -- although we understand that these aren't the most optimal alternatives.
Thanks @hdelalic for the work to define this roadmap. It makes sense to me. I realize the trade between GL to GL and GH to GL is a tough one, but based on usage and our SaaS First initiative, I like the higher prioritization of GL to GL.
In order to make this come true, it sounds like we need to materially reduce the number of customer-specific issues for the Import team. Do we have a plan to do that? I just want to make sure that CS/PS are aligned with this plan, and everyone is clear on how to handle customer-specific issues while we build out more scalable solutions.
@sfwgitlab Customer Success, Support, and Professional Services have been providing their feedback on this initiative from the beginning. I will be communicating this plan outward on multiple channels (Handbook, issues, Slack, YouTube) to ensure that everyone is aware of the plan and providing messaging and alternatives that the customers can use.
Additionally, we have been very focused on improving the reliability of our importers in the last several releases. However, with only 2 BE in the group, we are unable to deliver both the critically needed long-term updates and the short-term customer-specific issues. With the completion of this large GitHub Importer escalation, it will be the right time for us to switch the focus to the long-term goals.
@hdelalic I think we are still getting information together on how we are going to accomplish this. The goal is to accomplish the roadmap and the engineering allocation + possible customer escalations, and the task is to find the resources available to accomplish that. Otherwise, we should maybe huddle synchronously to evaluate the impact to the freshly updated roadmap. Would you make sure to cover setbacks in your Import weekly as an async way to monitor roadmap changes for now?