When groupId is given, the data returned should include projects using components for the requested group and all sub-groups.
The groupId argument is required for gitlab.com customers. It is not required for self-managed.
If groupId is not given for self-managed customers, the returned data should include all projects using components
Depending on how the data is structured in the database (I - @avielle - am not sure), this endpoint might be performance intensive
For now, mark all new fields as alpha and make an issue for us to remove alpha in a year. We might want to restructure the query in the future. For example, we might want to return a tree of groups with their projects nested inside. (For now I think it's okay to return all the projects in one list)
This issue was automatically tagged with the label ~"group::pipeline authoring" by TanukiStan, a machine learning classification model, with a probability of 0.87.
Dov Hershkovitch changed title from MVP - Backend: Create fields to make query below to MVP - Backend: Create fields to make query which returns a list of projects where components were used in a pipeline
Dov Hershkovitch changed title from MVP - Backend: Create fields to make query which returns a list of projects where components were used in a pipeline to MVC - Backend: Create fields to make query which returns a list of projects where components were used in a pipeline
Dov Hershkovitch changed the description
Laura Montemayor changed title from MVC - Backend: Create fields to make query which returns a list of projects where components were used in a pipeline to MVC - Backend: Create fields to make query which returns a list of projects where components were used in a pipeline - GraphQL
Laura Montemayor changed the description
Another implementation detail that was missing: it's OK to hardcode the query to return data for the last 30 days (added to the implementation details).
Hi @lauraXD,
do you already know roughly when you plan to provide the feature for tracking the usage of CI/CD components? Can we expect it to be available in weeks, or rather months?
Hi @avielle,
could you tell me roughly when you plan to provide the feature for tracking the usage of CI/CD components? Can we expect it to be available in weeks, or rather months?
@dhershkovitch @avielle It's great that we improved our proposal based on customer feedback.
As far as I understand, users are okay with using a GraphQL endpoint to fetch this data, for now. And, we don't know if we will use the same endpoint for a UI that we may implement in the future. So, this work can be stale after a while.
Also, I would still like to know how users will use this data. I think a filtering mechanism will be requested in the next iterations, because fetching projects in bulk may be too unwieldy for users.
Final note: in the proposed GraphQL query, we definitely need pagination. We can't return all projects belonging to a group, or all projects in the whole instance.
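To make the pagination point concrete, here is a minimal pure-Ruby sketch of keyset-style pagination over project ids. This is only an illustration of the idea; in the real GraphQL API, connection types handle cursors, and the names here (`next_page`, `after_id`) are hypothetical:

```ruby
# Keyset-style pagination sketch: fetch the next page of projects
# ordered by id, starting after a cursor id. Struct stands in for
# the real Project model.
Project = Struct.new(:id, :name)

def next_page(projects, after_id: nil, first: 2)
  sorted = projects.sort_by(&:id)
  sorted = sorted.select { |p| p.id > after_id } if after_id
  sorted.first(first)
end

projects = [Project.new(3, "c"), Project.new(1, "a"),
            Project.new(2, "b"), Project.new(4, "d")]

page1 = next_page(projects, first: 2)                          # ids 1, 2
page2 = next_page(projects, after_id: page1.last.id, first: 2) # ids 3, 4
```

The cursor (`after_id`) makes each page request independent, which is what keeps "return all projects in the instance" off the table.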
Thanks @rkadam3 for helping move this forward - Avielle/Laura if there is any additional context that he should be aware of, please pass along. I had mentioned to @rkadam3 this week that I thought it would be good to help here provided there wasn't anything that either of you had in-progress that he should be aware of.
Thanks @rkadam3! Laura and I have the async pipeline creation work covered, so it'd be nice to move this one forward. This issue is ready to be picked up, except the description is a bit out of date and confusing. I'll update it now, and you can let me know if you have any questions.
@avielle @marknuzzo @lauraXD - Do you all think we should add a feature flag (FF) as well when deploying? Just in case the query to fetch data is not performant and creates bottlenecks on the database.
@rkadam3 I don't think we need a feature flag. If we put this behind one, some customers might not be able to use it until we completely roll it out. Since it's not going to break any existing behavior, what we can do is mark the fields as alpha so we can remove them at any time if necessary. WDYT?
@rkadam3 - I totally understand. In this case if you feel it's the right thing to do, then go for it :) you can always do a quick rollout anyway. Perhaps we can even do the rollout first with specific users that Dov knows want to try this
@lauraXD - Yes, and in case we get some alarms too because of slowness or whatever, we can disable it quickly instead of doing a rollout that involves multiple parties.
Thanks @rkadam3 @lauraXD - the other thing to consider here is that when we roll this out, if there are any queries in Kibana that we can monitor, it would give us more assurance throughout the rollout that this is performant. WDYT?
I'll update this once I get to the queries for finding whether a project uses components, and then for fetching those projects from a group or at the instance level.
@dhershkovitch - One thing to note here: we should only check the CI file on the default branch (or main branch) to determine whether a project uses a component.
Let me know if you disagree. For v1, this sounds good to me.
Hi @rkadam3 - you can use the catalog_resource_component_usages table to determine whether a project uses a component, so we should be able to avoid having to check CI config files. WDYT?
@avielle - I thought about that, but I think it might not be accurate enough to rely on?
Also, that data would include some projects that might not be using any components anymore, I guess. As in, if a project uses a component once, then removes the usage from its CI file, the usage table would still have the record.
To make it easier, do you think we should persist this information as a boolean in the project table, or some other table that stores the project's metadata? That boolean field would be updated whenever the CI file is updated in the repository. Maybe we could add a trigger or something when the CI file is updated?
But then again, we would need to check all YAML files in the repository, since we support breaking the main CI file into multiple YAML files, and the component include could be in any of them.
Another option would be to run on diffs, in after_push_commit, to check whether a component usage was added or removed, and update the boolean field accordingly. Sounds like a long shot, but it would prevent excessive calls to read the CI config file.
One other approach would be, in lib/gitlab/ci/config.rb, to just update the project's boolean field to true or false based on whether there are included_components. This would avoid reading the files and their content altogether. WDYT?
With that approach, we might not need to backfill the projects table either, since projects get marked automatically when their CI config is loaded. But then what if a project is dormant and uses components, but its CI config is never loaded, so the boolean field is never set at all?
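As a rough illustration of the detection idea, here is a pure-Ruby sketch of checking a CI config for a component include. This is not the actual GitLab logic (which operates on the fully processed config, across all included files); `uses_components?` is a hypothetical helper:

```ruby
require 'yaml'

# Sketch: decide whether a parsed CI config includes any CI/CD
# components. Component includes use the `component:` keyword
# under `include:`. Illustration only, not GitLab's implementation.
def uses_components?(ci_yaml)
  config = YAML.safe_load(ci_yaml)
  includes = config['include']
  includes = [includes] unless includes.is_a?(Array)
  includes.compact.any? { |entry| entry.is_a?(Hash) && entry.key?('component') }
end

with_component = <<~YAML
  include:
    - component: gitlab.com/components/sast@1.0
  build:
    script: echo build
YAML

without_component = <<~YAML
  include:
    - local: .other-config.yml
YAML
```

Even this simplified check shows why the file-scanning approach is costly: it requires reading and parsing repository content per project, which is what the comments above are trying to avoid.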
@rkadam3 Reading from every single repository in every project will definitely be a performance drain and I believe it will be quite difficult to deal with these N+1s in GraphQL. Not sure it's worth it. I think we can try using the table as a first iteration and get some user feedback first. wdyt?
Also, that data would include some projects that might not be using any components anymore, I guess. As in, if a project uses a component once, then removes the usage from its CI file, the usage table would still have the record.
@rkadam3 the table is sufficiently accurate for this feature because the records get updated every time a pipeline runs, so we will always have the latest components that have actually been used. If you think it would be valuable to users, you can include a field in the GraphQL type with the timestamp of the latest usage record so they can triage accordingly.
@avielle - Sure, let's give it a try. I am also wondering: should we only consider the usage of components in a project within a 30-day window too? Reading all of the data may not make sense, WDYT?
Edit: Or maybe we don't have to do that, since we might not expose this information in the project resolver, right?
Thanks @avielle - Yes, we are using the 30 day filter. And -
I'm not sure what you mean here
I meant we won't be sending back information about the components used in a project in a response that returns other project details, like the project resolver or something.
@rkadam3 I see, we'll be using the new ProjectsUsingComponentsType instead of using ProjectType. Not using ProjectType is unusual so I chatted about this with @lauraXD and we're okay with using a new type to keep the complexity of possible queries small. What do you think about naming it something like ComponentConsumerType, ComponentConsumerProjectType, or even just Components::ConsumerType? ProjectsUsingComponentsType sounds to me like it should live in the Projects namespace
So I have !164731 (closed) ready now, in a reviewable state. It has all the information about testing, a demo video, etc. Can I ask you to add this review, and any other reviews you might have, on the MR?
Hi @dhershkovitch,
do you already know roughly when you plan to provide the feature for tracking the usage of CI/CD components? Can we expect it to be available in weeks, or rather months?
@rkadam3 how do we deal with multiple components that were used in the same project?
e.g. if component A was used 10 times in the last 30 days in the same project with the same version, are we going to return the latest usage or all usages of the component?
@dhershkovitch @marknuzzo @lauraXD - I see the scope of the MVC has been updated: we now need to specify projectID along with groupID. Is this the final version?
It would change some things drastically, like the DB query.
@rkadam3
I prefer to continue the work as planned and then follow up with a second iteration to update the query based on the new description with the ProjectID & GroupID (what we have right now)
This way the work you put into this feature doesn’t go to waste since it does provide value for our users
So, probably continue the review process
This issue might get pushed to %17.5, which starts tomorrow, since there are some complexities involved with the query. I will update the milestone accordingly.
Laura, Avielle, and I are scheduled to chat more about the needed query tomorrow. We will also discuss changing the usage table schema if that would help.
@dhershkovitch - I do not anticipate it spilling over to %17.6, but things might get a bit complicated for the first task, and subsequently for task 2, which covers the existing functionality.
As I move forward, I will be able to share an accurate timeline.
@dhershkovitch and I had a call today, and we discovered that by attaching the component usage information to the existing ProjectType used in the Groups and Projects GraphQL APIs, we can use the query below to get the component usage data.
This also works for all nested sub-group projects for the current user. In the case of self-managed, this returns all projects (paginated) of which currentUser is a member.
We also have -
```graphql
query {
  projects {
    nodes {
      id
      name
      fullPath
      ciComponentsUsed {
        nodes {
          name
        }
      }
    }
  }
}
```
that returns all projects in the instance. Dov confirmed that this is a good start for customers in the MVC. Of course, there are various ways one can fetch a project list using the GraphQL API; this works with all of them. There are checks in the component usage resolver for the relevant license and permissions before returning the data.
Hi @rkadam3 - for the 3 tasks you noted, I think we should track them as individual tasks in this issue so that we can see their progression through the remainder of %17.5. WDYT?
I'll create them for now but let me know your thoughts. I also think that if you can provide end-of-week updates, that would be helpful. Thank you.
Why interested: Visibility into CI/CD component usage (e.g., which projects are using them, which versions are being used, when they were last used, pass/fail rates and execution times for jobs using these components).
For that, we will need to associate the usage data with each project as an ActiveRecord association. But even after doing that, we need the following to get the data in the required format:
```ruby
def ci_components_usage
  ci_components_used
    .preload(component: :version)
    .group_by(&:component_id)
    .transform_values { |usages| usages.max_by(&:used_date) }
    .map do |_component_id, usage|
      {
        name: usage.component.name,
        version: usage.component.version.name,
        used_date: usage.used_date
      }
    end
end
```
Now, group_by is again inefficient and results in an N+1, per my Slack discussion with @ahegyi. That discussion is in the agenda doc of the meeting we had last week. Even if we want to use the existing table, there are some complex queries we would need, as mentioned in !166254 (comment 2114283040)
So I would go for the most boring, maintainable, easy-to-understand approach instead of trying to make the existing table work 'somehow'. In the process, we are not changing the outcome for the usage count, just making it easier to understand by taking a step back and reforming the existing usage table.
Ideally, we would want the data in a single query on the project record, instead of doing group_by or distinct, etc. I am working on the table reform; I will share the details with everyone before implementing it, and we can brainstorm if needed to make it better.
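To make the group_by concern concrete, here is the same latest-usage-per-component reduction on plain in-memory data (Struct stand-ins for the real models, with made-up dates):

```ruby
require 'date'

Usage = Struct.new(:component_id, :used_date)

# Old table shape: many rows per (project, component), so Ruby has to
# group the rows and keep only the most recent one per component.
usages = [
  Usage.new(1, Date.new(2024, 9, 1)),
  Usage.new(1, Date.new(2024, 9, 10)),
  Usage.new(2, Date.new(2024, 9, 5))
]

latest = usages
  .group_by(&:component_id)
  .transform_values { |rows| rows.max_by(&:used_date) }

# With a reformed table storing one row per project <> component pair,
# each row already holds the latest usage and this reduction disappears.
```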
@rkadam3 @dhershkovitch I am not against having a better/faster table to get what we want; I only want to highlight the possible implications of the plan.
p_catalog_resource_component_usages is stored uniquely by project & date.
AggregateLast30DayUsageWorker calculates and stores this data in the last_30_day_usage_count column of catalog_resource_components.
1: This system will change.
2: It seems like we don't need more than 30 days. However, I see this in old issues:
We will need to provide more detailed dashboard information when we'll conduct our design sprint, IMO, it will be probably a separate effort since we will need more depth in this data
It seems like there is already a proposal for this performance problem:
Aggregate the data on a monthly basis and then store it into a new table (also partitioned) for long term historical data storage.
Update the model logic for p_catalog_resource_component_usages so that it drops partitions containing old data ("old" meaning data we've already aggregated and stored in the long term table.)
@furkanayhan - The table reform I am working on would take care of the rolling window too, except for storing old data; the usage will be reset after 30 days. Regarding data older than 30 days, I would want to understand @lauraXD's and @avielle's thoughts too.
We would store the project <> component combination along with the used date, and that would be updated based on usage. I will share more details about the approach in https://gitlab.com/gitlab-org/gitlab/-/issues/491071
Regarding data older than 30 days, I would want to understand @lauraXD's and @avielle's thoughts too.
@rkadam3 My thoughts are: unless we need the data stored, I don't think we should store it prematurely, as we might not be storing the data we will actually need.
I know there have been conversations about component visibility but what is on the roadmap right now is:
implementing better visibility into where components are used, primarily to validate the version of a component
which we don't need this data for. I am not sure what the next focus will be after this.
Re: older issues
We will need to provide more detailed dashboard information when we'll conduct our design sprint, IMO, it will be probably a separate effort since we will need more depth in this data
As far as I can see in the epic for the dashboard, CI Catalog - Component visibility dashboard (&14027), most of these issues are still in the design phase with @sunjungp, so the data we need to support these features is subject to change. I don't really see historical data mentioned in any of these issues (although I could have missed something, so please let me know if I did). It seems like there are some mentions of dependency scanning and notifications, but I'm not sure which one is next.
Based on this, I see these as the next steps in order of priority given that they are already on the roadmap:
MVC (this issue) - in the works
Component visibility for validating the version of a component (there is an issue where we have been discussing this but I can't find it right now) - in discussion
Ah, I remember now (thanks to Sunjung): the historical data comes from a mockup from months ago, which has already been iterated on a few times, so much so that there isn't even a high-fidelity design for it. So I think we can safely delete here.
Regarding data older than 30 days, I would want to understand @lauraXD and @avielle's thoughts too.
@rkadam3 I agree with Laura on this - generally I don't think we should store data unless we're using it. Feature designs and planning change too often for us to store data just in case we need it in the future
@lauraXD @avielle @rkadam3 I understand your point; we should not store data we don't need. Now, the plan is to change how we store the usage in the p_catalog_resource_component_usages table, and all of the implementation will change. I am fine with this as long as we are sure that we'll never have to reimplement it again. That's why it's important to get approval from @dhershkovitch and @sunjungp on this.
It's a WIP mockup, but you can already take a look. I'm trying to represent the number of projects that used this catalog resource in the last 30 days, and users should select a group first to see the list.
And a small note: there's a delivered design where we want to show the number of projects using this component (not the catalog resource) on the details page as well: #443632 [_443632-usage-details-page-tooltip.png]
Given Dov's availability, I think we can probably proceed with this. @sunjungp has confirmed that there are no plans to use historical data, which means we do not have this feature on our roadmap for the next year or so. The mockup that showed it was from several iterations ago, so I think we're safe. Either way, this will take a bit of time to implement, so I think we don't need to delay it further. What do y'all think?
The challenge is that we'll need the distinct project ids using components within a given time frame. This cannot be obtained from the DB tables efficiently (the partitioned table is not suitable for a distinct query).
What's today's slow SQL query to force us to change how we store the component usage?
The issue is that the new API cannot have performant queries with the existing table structure; this is a bottleneck even if we paginate the number of projects we read in one call.
So to fix it, changing the table schema seems like the right approach, even though it will slow down the API implementation a bit.
What's the faster SQL query that we aim to have with the plan?
Ideally, we want a single query, without distinct usage, that is performant regardless of how many components a single project uses and records in the usage table.
Also, I am curious whether you see any risks in changing the usage table schema, or whether you would prefer somehow making the existing query performant instead?
@furkanayhan - Yes, with the new schema we will not need DISTINCT in the new query, as the primary key will be the combination of used_by_project_id and component_id.
As described in the approach thread, the used_by_project <> component pair will be unique, so fetching the components used by a list of projects can be grouped on project id, giving us a list per project.
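A pure-Ruby sketch of the upsert-by-(project, component) idea: because each pair is unique, recording a usage overwrites the existing row, and reads never need DISTINCT. A Hash stands in for the reformed table here, and the method names are hypothetical:

```ruby
require 'date'

# The reformed table keys each row by [used_by_project_id, component_id],
# so recording a usage is an upsert. A Hash models that unique key.
usage_table = {}

def record_usage(table, project_id, component_id, used_date)
  table[[project_id, component_id]] = used_date
end

record_usage(usage_table, 10, 1, Date.new(2024, 9, 1))
record_usage(usage_table, 10, 1, Date.new(2024, 9, 15)) # same pair: row updated
record_usage(usage_table, 10, 2, Date.new(2024, 9, 3))

# Components used by project 10: one row per component, already unique,
# so no DISTINCT (and no group_by dedup) is needed at read time.
components_for_10 = usage_table.keys.select { |(pid, _)| pid == 10 }.map(&:last)
```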
The GQL work can get pushed to %17.7, as the collected data needs to be at least 30 days old.
I will keep the second task ready for review, but we may merge it towards the end of %17.6 so that the data on the UI computed by the background worker comes from the new table and not the old one.
Use Case: Customer wants visibility into who is using which components, where they are being used, and which version of each component is being used. A GQL API would go a long way in providing these details while the dashboard is being built out.
Now that the usages are being recorded, I will next work on powering our Catalog listing UI from the new table data. But for that, we need at least 30 days of data.
This MR changes the component usage counting to use the new table added as part of this parent issue, powering the Catalog UI. The change is behind a feature flag to de-risk the .com deployment.
I am working on adding the license requirements, but the query is much more efficient now with the new table structure. I was able to use BatchLoader for fetching the list of components, the component names, and the versions too.
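The batching pattern behind this can be illustrated in pure Ruby. This is not the BatchLoader gem's API, just a sketch of the idea it implements: queue ids as lazy values, then resolve them all with one batched lookup instead of one query per record (`fetch_names` is a hypothetical datastore call):

```ruby
# Sketch of the batch-loading pattern: defer individual lookups,
# then satisfy all of them with a single batched fetch (avoiding N+1).
class NameBatchLoader
  def initialize(fetch_names)
    @fetch_names = fetch_names # proc: [ids] -> { id => name }
    @pending = []
    @cache = {}
  end

  # Returns a lazy value; forcing it triggers one batched resolve.
  def queue(id)
    @pending << id
    -> { resolve; @cache[id] }
  end

  private

  def resolve
    return if @pending.empty?

    @cache.merge!(@fetch_names.call(@pending.uniq))
    @pending.clear
  end
end

calls = 0
store = { 1 => "sast", 2 => "dast" }
loader = NameBatchLoader.new(->(ids) { calls += 1; store.slice(*ids) })

lazy_names = [1, 2, 1].map { |id| loader.queue(id) }
names = lazy_names.map(&:call) # all three resolved by one fetch
```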
Why interested: Without a list of the projects that are consuming their published components, this customer will not be able to show progress or impact to their leadership team, potentially risking future investment in the Component Catalog. Additionally, for standardization purposes, they need to know who is consuming the proper templates.
Current solution for this problem: None
Impact to the customer of not having this: mentioned above
Questions: My question is, with this backend MVC, is there anything immediately consumable by the customer starting in 17.7, or will they need to wait until the API and UI are done?
@conleyr The goal of this issue is to ensure that the API is available for our users to consume before a UI solution is implemented. Once this issue is merged, I expect users will be able to use the API immediately. Additionally, this API will likely serve as the foundation for a future UI solution. However, since the UI may take some time to implement, the API serves as a practical solution in the meantime.
@dhershkovitch @nmezzopera - The work for this issue is complete now. I've moved the issue to verification; once the MR is deployed, I will verify the API and update the status of this issue.