As a pipeline author wanting to accomplish a task with pipeline configurations, I would like to search for capabilities and components specifically in the CI catalog so that I can quickly find what I need for my pipeline.
Context
Data we have available today for each catalog resource (items shown in the CI catalog) is:
project name
project path
project description
README.md (the resource's documentation)
Data we could gather or add in the future:
parse the template.yml files in the repository to extract metadata about each component.
project topics
Proposal
Add an Advanced Search migration to add new top level fields
ci_catalog boolean field to track whether a project has a catalog_resources record in the database. Note: recommend not to name it catalog_resource so that name can be used as a nested field in the future. It's difficult to rename fields in Elasticsearch
readme_path text field
This page may contain information related to upcoming products, features and functionality.
It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
@avielle we discussed that but I don't think we had an issue. We can use this one now - cc @dhershkovitch
Perhaps we should start with what problems we want to solve and by consequence, what search use-cases are needed. Then we can go in the specifics of the implementation. WDYT?
@fabiopitino #393214 (closed) specifically deals with the frontend for search and filter, so we can use this issue to discuss the backend implementation for search.
Perhaps we should start with what problems we want to solve and by consequence, what search use-cases are needed. Then we can go in the specifics of the implementation. WDYT?
This makes sense to me! Beyond searching by name, I'm not sure what problems we need to solve. Maybe users will also want to search by namespace?
@avielle @fabiopitino let's use this issue to discuss the problem. I believe we should separate it into two:
Public catalog: made of all public catalog item projects (similar to public projects), non-authenticated users should be able to see the catalog resources.
Private catalog: made of all catalog item projects inside the root namespace (public or private once published).
It would be ideal to focus on both problems, but if we can't, our focus should be on the private catalog first.
Ideally, users should be able to search components by name, description, and some arbitrary tag (which we haven't defined yet).
For the public catalog we would need a wider range of search criteria since we are looking at a larger pool, so we should consider namespace as well. Over time I expect we'll add further search criteria (e.g. certified by GitLab, certified by partners), which makes me wonder if we should focus on private catalog search only.
When I was going through the sketching exercise before, aside from doing a keyword search on name and description as well as any tags, I was thinking that a date range search (i.e. recently uploaded, etc.) and/or searching for official components (i.e. AWS, GCP, etc.) would give novice users an easy way to get up and running (noted above as "certified by" criteria). Searching and/or filtering on star popularity could also be useful. This may not be MVC, but it would certainly be impactful compared to the UI experiences our competitors currently provide.
I'm not sure if we have a process for adding new data types (we've been discussing adding vulnerabilities to the index), but in general we have the Advanced Search migration framework, which can be used to add new document types. Most likely it will require our team's help since we would also need to implement indexing and searching.
Maybe we can start with an estimation of the new index size? Also, it would be great to see the list of fields and the number of documents. We might want to create an issue/epic specifically for implementing Advanced Search.
Adding README.md to a project's index semantically makes sense. That would also save a lot on storage if we don't make a new index
I think we should list all the fields we need and then discuss the next steps. If a component is a big object and we also need to filter/search by those fields, it might make sense to create a new index
As others pointed out above, we don't have a unified framework to allow other teams to add their content in Advanced Search, but the idea has been circulated around. For now, I'd suggest the two teams, Pipeline Authoring and Global Search, can collaborate to see what's the best path forward to assist you on this
@dhershkovitch @dgruzd @terrichu @john-mason The data we need to index for the search is #396556 (closed). We also need to understand how feasible it is to add fields to the index in the future, for example when we have a way to list all the components in a project.
If we want to do sorting, for example by popularity (stars, downloads, etc.), would this need to be a search field too?
We also need to understand how feasible it is to add fields to the index in the future
Adding fields is feasible with Advanced Search migration framework. Changing data types for existing fields is where things get tricky because it requires a full reindex.
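To illustrate the distinction between adding fields and changing types, here is a minimal sketch that assumes mappings are represented as plain dictionaries of field name to Elasticsearch type. The function name `classify_mapping_changes` and the sample fields are hypothetical and are not part of the Advanced Search migration framework.

```python
def classify_mapping_changes(current, proposed):
    """Split a proposed mapping into additive fields and type changes.

    New fields can be introduced by a mapping update, while a changed type
    on an existing field cannot be applied in place and needs a reindex.
    """
    added = {
        name: ftype for name, ftype in proposed.items() if name not in current
    }
    retyped = {
        name: (current[name], ftype)
        for name, ftype in proposed.items()
        if name in current and current[name] != ftype
    }
    return added, retyped


# Hypothetical subset of the projects mapping, for illustration only.
current = {"name": "text", "path": "text", "description": "text"}
proposed = {**current, "ci_catalog": "boolean", "description": "keyword"}

added, retyped = classify_mapping_changes(current, proposed)
print("can be added by a migration:", added)
print("needs a full reindex:", retyped)
```

In this sketch only the retyped `description` field would fall into the "full reindex" bucket, while `ci_catalog` is purely additive.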
If we want to do sorting, for example by popularity (stars, downloads, etc.), would this need to be a search field too?
Yes
have a way to list all the components in a project
Can you help me understand what a "component" is inside the GitLab application? Because I don't have the necessary context, I am unsure what that means.
@lauraX thanks. So far it seems what y'all are wanting to do is project search. There is an existing Elasticsearch index for projects, and it would be reasonable to add more attributes of a project (ie: README.md) to the index if needed.
After reading the glossary you shared, I suspect there may be component-specific attributes that I'm not aware of due to my lack of knowledge and context, such as component path.
Data we could gather or add in the future:
parse the template.yml files in the repository to extract metadata about each component.
It would be really helpful to have a more specific "wish list" of all possible attributes that could ever be searched for. That would help the group::global search team advise on whether a dedicated components index would be necessary or not.
@john-mason pretty much! I guess at the end of the day a catalog is indeed a project, so we do want a project search. This is what we would like to be able to search for:
* project name <- we can already do this, right?
* project path
* project description
* README.md (the resource's documentation)
* File name <- is this possible now, given that we can search for paths already? (this is "component name")
What do you think? I am not sure what is in the scope of group::global search and I'm unfamiliar with the domain, so please correct me on anything I have wrong (and I will learn!). :D
As far as a "wish list" - I will try to get that to you as soon as I can - given that this is a new project there are a few details we need to iron out. We do want to be able to parse a metadata file at some point (template.yml) which will contain tags, versions, categories, etc. but we haven't sorted this out yet. Is this at least a bit helpful?
I just saw the issue @dgruzd posted and I see that name, path, description are already indexed, which is great! Does that mean we can already search on them?
@lauraX that's correct! One caveat: we only index paid projects (we are considering indexing all as part of #340857 (closed)).
Currently, by default we use this list of fields for project searches: [ "name^10", "name_with_namespace^2", "path_with_namespace", "path^9", "description" ], but it's possible to override the list and/or relevance boosts for your custom use case. This is an example project search: https://gitlab.com/search?group_id=9970&repository_ref=master&scope=projects&search=gitlab
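As a rough illustration of how the boosted field list could be overridden, the sketch below builds an Elasticsearch `multi_match` query body as a plain dictionary. The helper name `build_project_query` and the catalog-specific field list are assumptions made for this example. The query GitLab actually generates is not shown in this thread and may differ.

```python
import json

# Default project search fields quoted above; "^N" is the relevance boost.
DEFAULT_PROJECT_FIELDS = [
    "name^10",
    "name_with_namespace^2",
    "path_with_namespace",
    "path^9",
    "description",
]


def build_project_query(search_term, fields=None):
    """Build a multi_match query body over boosted fields."""
    return {
        "query": {
            "multi_match": {
                "query": search_term,
                "fields": list(fields or DEFAULT_PROJECT_FIELDS),
            }
        }
    }


# Hypothetical override for a catalog-specific search.
catalog_fields = ["name^10", "description^3"]
print(json.dumps(build_project_query("gitlab", catalog_fields), indent=2))
```

Calling `build_project_query("gitlab")` without a field list falls back to the defaults, so a custom use case only needs to pass its own list and boosts.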
@dgruzd Ok, great! As for the paid projects ... good to know. I guess that is ok for the MVC but I'm thinking what we could do for after that. Would it be reasonable to create an index containing only projects with a matching record in the CatalogResources table?
Would it be reasonable to create an index containing only projects with a matching record in the CatalogResources table?
@lauraX Having a separate index is definitely an option for us. I believe we want to avoid having too much duplicate data, so I'd say the decision depends on the outcome of the storage assessment.
We do want to be able to parse a metadata file at some point (template.yml) which will contain tags, versions, categories, etc but we haven't sorted this out yet. Is this at least a bit helpful?
This sounds like we are going to have more fields related to CI Catalog, which only increases the chance of having a dedicated index.
Maybe we can draft an end state for the list of the fields and an approximate number of records? I think we'd like to avoid extra work like adding fields to projects and later on switching to a dedicated index for CI Catalog. In this case it would be better to start with a separate index and add new fields to it as we go.
Having a separate index is definitely an option for us. I believe we want to avoid having too much duplicate data, so I'd say the decision depends on the outcome of the storage assessment.
Ok cool! Our frontend team tried to use the current search on the projects, but found that they got all the projects back, not just the ones marked as a CatalogResource. Would the index be helpful for this?
This sounds like we are going to have more fields related to CI Catalog, which only increases the chance of having a dedicated index.
That's true! I guess for now we would do the same search we would do in any project. The template.yml idea is not yet fully fleshed out, and we don't even know if we will end up doing it. I hope this doesn't complicate things.
Maybe we can draft an end state for the list of the fields and an approximate number of records
I will try to get this end state of the list of fields, but it might take a while since we haven't finished refining yet. Once we have a good draft I will send it over. We hope to get a lot of feedback from internal testing which will guide our decision.
I do understand not wanting to do any extra work, so I am keen to do whatever you think is simplest. For now, I think using the search the same way as we do a project search is ideal, although only on projects marked as a catalog resource. WDYT?
For now, I think using the search the same way as we do a project search is ideal, although only on projects marked as a catalog resource.
To use the existing project index, we will need to have a new field to help identify which projects are catalog resources. It looks like it's all projects that have a record in the catalog_resources table?
@dhershkovitch I would say that a basic search by project name and description would be a nice-to-have for the MVC but we should not delay the MVC because of that. There will be many iterations between the MVC and the GA and we can iterate on the search several times.
If we don't have the project name and description in MVC for basic searching, can users search for anything in MVC?
If searching by project name and description is not critical for the first iteration of MVC, should this issue be nudged to %16.0 or %16.1 and focus on issues like #396559 (closed) now for MVC? @lauraX @avielle WDYT?
@marknuzzo I would agree that sorting is more important than search. We don't expect in the first month that users will have hundreds of components that would be hard to paginate through. And if they do we know to prioritize search in the upcoming milestones.
Thanks @fabiopitino for your thoughts here. I agree that with not a lot of components to search for in the first month, paginating won't be too overwhelming. @dhershkovitch @lauraX - it sounds like flipping this issue for #396559 (closed) could be a better use of our time right now. For now, I'm going to adjust but open to hear all of your thoughts.
Thanks @lauraX - since it's related to GA and we need to define what we want to search so this effort can be weighted, that's fine for now. @dhershkovitch - we need to determine what elements we want to extract from template.yml to search for here.
@dhershkovitch - Since weights normally relate only to what our team is doing, I'm wondering if it would be helpful to add a comment noting that group::global search will have an effort here too, in conjunction with PA's efforts, so that our weights still cover only our work. Laura, this would be my preference, but I'm curious about Dov's thoughts here as it could set a precedent for other joint efforts going forward (i.e. Steps with group::runner, etc.). WDYT Dov?
Hi @john-mason @terrichu @dgruzd - I hope you are having a good start to your week.
I wanted to reconnect with all of you to understand the level of effort in incorporating search criteria into the CI Catalog with global search. I see some discussions have started to happen above already but we also wanted to better understand what the next steps are here. Thank you!
/cc @lauraX @dhershkovitch - please add any additional thoughts you have here.
Would the CI Catalog search need to work on all projects on GitLab.com, not just projects which are indexed in Elasticsearch? For some context, on GitLab.com only paid namespaces are indexed.
Should projects that have a catalog_resource database record come up in general project search? Or would there be a Project Search (only return projects without a catalog_resource record) and CI Catalog Search (only return projects with a catalog_resource record)?
Would the CI Catalog search need to work on all projects on GitLab.com, not just projects which are indexed in Elasticsearch? For some context, on GitLab.com only paid namespaces are indexed.
Hi @terrichu - CI Catalog will be available to Ultimate users, so only paid namespaces should have access to CI Catalog to search through components. I hope that answers your question. The good news is that what is indexed right now seems to align with who the feature is intended for.
As for the catalog_resource question, @lauraX @f_caplette can you provide your thoughts here, please? Thank you!
Should projects that have a catalog_resource database record come up in general project search?
I would think that we would use the project search for the catalog resource projects, but I think that may be more of a product/ux question so I will defer to @dhershkovitch and @sunjungp
CI Catalog will be available to Ultimate users so only paid namespaces should only have access to CI Catalog to search through components. I hope that answers your question but the good news is that it seems to align with what is indexed right now with who it is intended for.
While this is true for the first iteration, we would also like to release a community catalog that should be available to all users, so eventually we would need to allow users to search across all projects that have published components (users need to publish a component in order for it to be available in the catalog and available for search).
@dhershkovitch @lauraX thanks for the feedback. We discussed this in our team meeting and recommend using the existing projects index vs. creating a completely new one. This will allow a faster implementation for the initial iteration. Here are the next steps:
Note: using the existing projects index
Add an Advanced Search migration to add new top level fields
ci_catalog boolean field to track whether a project has a catalog_resources record in the database. Note: recommend not to name it catalog_resource so that name can be used as a nested field in the future. It's difficult to rename fields in Elasticsearch
readme_content text field
Add an Advanced Search migration to populate the new fields
Update project search to add a catalog_resource filter capability to global and group project searches, it should default to false
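The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the mapping fragment uses the field names and types from the list, the helper `with_catalog_filter` is hypothetical, and "default to false" is read here as "the filter is not applied unless explicitly requested", which the thread does not spell out.

```python
# New top-level fields from the steps above, written as an Elasticsearch
# "properties" fragment. Names and types follow the proposal in this comment.
NEW_FIELDS = {
    "properties": {
        "ci_catalog": {"type": "boolean"},
        "readme_content": {"type": "text"},
    }
}


def with_catalog_filter(query, catalog_only=False):
    """Optionally restrict a project query to catalog resources.

    catalog_only defaults to False (an assumption about the intended
    default), so existing project searches are returned unchanged.
    """
    if not catalog_only:
        return query
    return {
        "bool": {
            "must": [query],
            "filter": [{"term": {"ci_catalog": True}}],
        }
    }


# Hypothetical base query for illustration.
base = {"multi_match": {"query": "deploy", "fields": ["name", "description"]}}
print(with_catalog_filter(base))
print(with_catalog_filter(base, catalog_only=True))
```

Using a `filter` clause rather than a scoring clause keeps the catalog flag from influencing relevance, which seems appropriate for a yes/no attribute.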
Thanks for the investigation and next steps @terrichu !
@marknuzzo - I think we now have our next steps for the search, as Terri has outlined them. I guess the easiest thing to do would be to keep this issue and create a bunch of MRs with the steps:
Add an Advanced Search migration to add new top level fields
ci_catalog boolean field to track whether a project has a catalog_resources record in the database. Note: recommend not to name it catalog_resource so that name can be used as a nested field in the future. It's difficult to rename fields in Elasticsearch
readme_path text field
Add an Advanced Search migration to populate the new fields
Update project search to add a catalog_resource filter capability to global and group project searches, it should default to false
or create separate issues, but they will each be blocked by the previous one. wdyt?
Also, since this involves a migration, should we try to schedule this for %16.1? I'm not entirely sure how long this will take but it is multiple steps.
or create separate issues, but they will each be blocked by the previous one. wdyt?
Hi @lauraX - with the list noted above, what do you think the weight would be for this effort if all of the steps were bundled together in one issue? I want to manage this as efficiently as possible, and having subsequent issues blocked by each other seems inefficient if it can all be grouped together.
The only other question is should this be considered a multi-milestone effort? If yes (which it seems), it may be easier to figure out what is achievable in %16.1 and then separate the remaining effort into %16.2. WDYT?
@dgruzd good point, I don't know why I assumed the path would be stored. Do you think that indexing can be handled in Ruby or will it need to be done in the Go indexer?
@terrichu I think it is going to be handled in Ruby. We might want to create a Sidekiq worker though since I'm not sure that we should make calls to Gitaly from the main indexing pipeline.
to manage this as efficiently as possible but with subsequent issues being blocked by each other seems inefficient if it can all be grouped together
@marknuzzo I agree! As for effort, this is definitely a 5+ altogether, which is why I suggested we break it down into the steps Terri mentioned. And I do think it's a multi-milestone effort, especially with the hackathon taking up a week or so next week. I think we could try to aim to do this:
Add an Advanced Search migration to add new top level fields
ci_catalog boolean field to track whether a project has a catalog_resources record in the database. Note: recommend not to name it catalog_resource so that name can be used as a nested field in the future. It's difficult to rename fields in Elasticsearch
readme_path text field
For %16.1 - and maybe populate the new fields as a stretch goal. WDYT?
@marknuzzo after an initial review, we are probably going to need 3 MRs for this. I will create a separate issue when the time comes, just wanted you to be aware.
Thanks for the heads up @lauraX - yeah, I had felt that it was possible that it would be spread across multiple MRs. Do we need to adjust the weight on this issue since a separate issue will be coming eventually or should we leave as-is? WDYT?
@marknuzzo I had this issue set as 4, then downgraded to a 3 after I realized we'd need to split it up. So far a 3 seems accurate but I will weigh the other issues accordingly!