Elastic upgrade: Implement Go Indexer option "full"
Goal: Implement Go indexer option index_type: "full"
Suggested steps:
- Use go-elasticsearch to implement the indexer logic: https://github.com/elastic/go-elasticsearch/tree/main
- Create a Go script that takes a parameter called `index_type`. This parameter controls what type of indexing is performed. This issue covers the `full` type.
  Full indexing will:
  - Go through every project's `.md` doc files (doc/docs/doc-locale directories), using parallel processing.
  - Break down each content page by header sections. Each header section needs to be processed to provide values for the index fields listed below.
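The script steps above can be sketched roughly as follows, using only the Go standard library. This is a minimal sketch under several assumptions: the `Section` type, the `splitSections`, `splitAll`, and `slugify` helpers, and the worker count are all hypothetical names and values, the anchor rule is a simplification that may not match how the docs site generates anchors, and reading the `.md` files from disk is left out in favor of an in-memory page.

```go
package main

import (
	"bufio"
	"flag"
	"fmt"
	"os"
	"regexp"
	"strings"
	"sync"
)

// Section is one heading plus the text under it (hypothetical type).
type Section struct {
	Level                  int
	Title, Anchor, Content string
}

var (
	headingRE = regexp.MustCompile(`^(#{1,6})\s+(.+?)\s*$`)
	nonSlugRE = regexp.MustCompile(`[^a-z0-9]+`)
	fence     = strings.Repeat("`", 3) // Markdown code fence marker
)

// slugify is a simplified anchor generator; the docs site's rules may differ.
func slugify(title string) string {
	return strings.Trim(nonSlugRE.ReplaceAllString(strings.ToLower(title), "-"), "-")
}

// splitSections cuts a Markdown page at ATX headings ("#" through "######").
// Text before the first heading, such as front matter, is dropped.
func splitSections(md string) []Section {
	var sections []Section
	inFence := false
	sc := bufio.NewScanner(strings.NewReader(md))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(strings.TrimSpace(line), fence) {
			inFence = !inFence // "#" lines inside code fences are not headings
		}
		if m := headingRE.FindStringSubmatch(line); m != nil && !inFence {
			sections = append(sections, Section{Level: len(m[1]), Title: m[2], Anchor: "#" + slugify(m[2])})
		} else if n := len(sections); n > 0 {
			sections[n-1].Content += line + "\n"
		}
	}
	for i := range sections {
		sections[i].Content = strings.TrimSpace(sections[i].Content)
	}
	return sections
}

// splitAll splits pages concurrently using a fixed-size worker pool.
// Each worker writes to a distinct slice index, so no mutex is needed.
func splitAll(pages []string, workers int) [][]Section {
	if workers < 1 {
		workers = 1
	}
	out := make([][]Section, len(pages))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = splitSections(pages[i])
			}
		}()
	}
	for i := range pages {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	indexType := flag.String("index_type", "full", "type of indexing to run")
	flag.Parse()
	if *indexType != "full" {
		fmt.Fprintf(os.Stderr, "unsupported index_type %q\n", *indexType)
		os.Exit(2)
	}
	// In-memory stand-in for the .md files a real run would read from disk.
	pages := []string{"# Project Settings\n\n## Access Tokens\n\nYou can create access tokens.\n"}
	for _, secs := range splitAll(pages, 4) {
		for _, s := range secs {
			fmt.Printf("h%d %s %s\n", s.Level, s.Anchor, s.Title)
		}
	}
}
```

A bounded worker pool is used here rather than one goroutine per file so that the number of concurrently open files stays predictable on large doc trees.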
- Set an explicit index mapping. We need to ensure the right data type is set for each field.
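One possible shape for an explicit mapping is sketched below as a Go value that marshals to the JSON body of an Elasticsearch create-index request. The field names come from the document format in this issue; the type chosen for each field (`keyword`, `text`, `integer`, `date`), the `"dynamic": "strict"` setting, and the helper names `docsMapping` and `field` are assumptions for illustration, not a settled design. The resulting JSON could be passed as the request body when creating the index through go-elasticsearch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// M is a shorthand for a JSON object.
type M = map[string]any

// field returns a mapping entry with only a type set.
func field(t string) M { return M{"type": t} }

// docsMapping builds a create-index body. Every type below is an assumed
// choice: keyword for exact-match values, text for analyzed full-text.
func docsMapping() M {
	return M{
		"mappings": M{
			// "strict" rejects documents that contain unmapped fields.
			"dynamic": "strict",
			"properties": M{
				"id":         field("keyword"),
				"title":      field("text"),
				"page_title": field("text"),
				"anchor":     field("keyword"),
				"url_path":   field("keyword"),
				"content":    field("text"),
				"heading_hierarchy": M{
					"type": "object",
					"properties": M{
						"level":  field("integer"),
						"text":   field("text"),
						"anchor": field("keyword"),
					},
				},
				"gitlab_docs_breadcrumbs": field("text"),
				"gitlab_docs_section":     field("keyword"),
				"language":                field("keyword"),
				"product":                 field("keyword"),
				"version":                 field("keyword"),
				"last_updated":            field("date"),
				"last_indexed":            field("date"),
			},
		},
	}
}

func main() {
	body, err := json.MarshalIndent(docsMapping(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```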
- When updating the index, use the following format:

      {
        "id": "docs/user/project/settings#access-tokens",
        "title": "Access Tokens",
        "page_title": "Project Settings",
        "anchor": "#access-tokens",
        "url_path": "/docs/user/project/settings/#access-tokens",
        "content": "You can create access tokens to authenticate with GitLab APIs. Personal access tokens are scoped to a user account, while project access tokens are scoped to a specific project.",
        "heading_hierarchy": [
          { "level": 1, "text": "Project Settings", "anchor": null },
          { "level": 2, "text": "Security", "anchor": "#security" },
          { "level": 3, "text": "Access Tokens", "anchor": "#access-tokens" }
        ],
        "gitlab_docs_breadcrumbs": "User > Project > Settings > Security > Access Tokens",
        "gitlab_docs_section": "user",
        "language": "en",
        "product": "gitlab",
        "version": "16.5",
        "last_updated": "2024-01-15T10:30:00Z",
        "last_indexed": "2024-01-15T14:45:00Z"
      }
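The document format above could be modeled in Go along these lines. This is a sketch, not a settled design: the `Doc` and `Heading` types, the `bulkBody` helper, and the index name `gitlab-docs` are hypothetical. `bulkBody` produces the newline-delimited body that the Elasticsearch Bulk API expects (one action line followed by one document line); sending it through go-elasticsearch is left out so the example stays standard-library only.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"time"
)

// Heading is one entry of heading_hierarchy. Anchor is a pointer so the
// top-level heading can serialize as null, as in the format above.
type Heading struct {
	Level  int     `json:"level"`
	Text   string  `json:"text"`
	Anchor *string `json:"anchor"`
}

// Doc mirrors the JSON document format from this issue.
type Doc struct {
	ID               string    `json:"id"`
	Title            string    `json:"title"`
	PageTitle        string    `json:"page_title"`
	Anchor           string    `json:"anchor"`
	URLPath          string    `json:"url_path"`
	Content          string    `json:"content"`
	HeadingHierarchy []Heading `json:"heading_hierarchy"`
	Breadcrumbs      string    `json:"gitlab_docs_breadcrumbs"`
	Section          string    `json:"gitlab_docs_section"`
	Language         string    `json:"language"`
	Product          string    `json:"product"`
	Version          string    `json:"version"`
	LastUpdated      time.Time `json:"last_updated"`
	LastIndexed      time.Time `json:"last_indexed"`
}

// bulkBody renders docs as Bulk API NDJSON: an "index" action line, then
// the document itself, each terminated by a newline.
func bulkBody(index string, docs []Doc) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends "\n" after each value
	enc.SetEscapeHTML(false)     // keep ">" in breadcrumbs readable
	for _, d := range docs {
		action := map[string]any{"index": map[string]any{"_index": index, "_id": d.ID}}
		if err := enc.Encode(action); err != nil {
			return nil, err
		}
		if err := enc.Encode(d); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	anchor := "#access-tokens"
	doc := Doc{
		ID:        "docs/user/project/settings#access-tokens",
		Title:     "Access Tokens",
		PageTitle: "Project Settings",
		Anchor:    anchor,
		URLPath:   "/docs/user/project/settings/#access-tokens",
		HeadingHierarchy: []Heading{
			{Level: 1, Text: "Project Settings"},
			{Level: 3, Text: "Access Tokens", Anchor: &anchor},
		},
		Language:    "en",
		LastUpdated: time.Date(2024, 1, 15, 10, 30, 0, 0, time.UTC),
		LastIndexed: time.Now().UTC(),
	}
	body, err := bulkBody("gitlab-docs", []Doc{doc}) // index name is hypothetical
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```

Using the section `id` as the Elasticsearch `_id` means that re-running a full index overwrites existing documents instead of creating duplicates.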
- Add the serverless project connection, including CI/CD variables.
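As a rough illustration of reading connection settings from CI/CD variables, the sketch below loads an endpoint and an API key from the environment and validates them before any client is created. The variable names `ELASTIC_ENDPOINT` and `ELASTIC_API_KEY`, the `ConnConfig` type, and the `loadConfig` helper are assumptions, not names defined by this issue. The validated values could then populate the `Addresses` and `APIKey` fields of go-elasticsearch's `elasticsearch.Config`.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"os"
)

// ConnConfig holds the connection settings (hypothetical type).
type ConnConfig struct {
	Endpoint string
	APIKey   string
}

// loadConfig reads settings through getenv so tests can inject values.
// ELASTIC_ENDPOINT and ELASTIC_API_KEY are assumed variable names.
func loadConfig(getenv func(string) string) (ConnConfig, error) {
	cfg := ConnConfig{
		Endpoint: getenv("ELASTIC_ENDPOINT"),
		APIKey:   getenv("ELASTIC_API_KEY"),
	}
	if cfg.Endpoint == "" || cfg.APIKey == "" {
		return ConnConfig{}, errors.New("ELASTIC_ENDPOINT and ELASTIC_API_KEY must both be set")
	}
	u, err := url.Parse(cfg.Endpoint)
	if err != nil {
		return ConnConfig{}, fmt.Errorf("invalid ELASTIC_ENDPOINT: %w", err)
	}
	if u.Scheme != "https" || u.Host == "" {
		return ConnConfig{}, errors.New("ELASTIC_ENDPOINT must be an https URL")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	// Never print the API key itself; CI job logs are often widely visible.
	fmt.Println("connecting to", cfg.Endpoint)
}
```

Failing fast on missing variables makes a misconfigured pipeline stop with a clear message instead of an opaque connection error later in the job.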
- Ensure the indexer respects “noindex” meta tags by not indexing pages that have them.
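How a page signals "noindex" is not specified in this issue, so the following is only a sketch under two assumptions: that a Markdown page may carry a `noindex: true` line in its leading front matter block, or an HTML robots meta tag with the `name` attribute before `content`. The `hasNoindex` helper name is hypothetical, and the front matter check is a plain line match rather than a full YAML parse.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// metaNoindexRE matches a robots meta tag whose content includes "noindex".
// It assumes the name attribute appears before the content attribute.
var metaNoindexRE = regexp.MustCompile(
	`(?i)<meta[^>]+name=["']robots["'][^>]*content=["'][^"']*noindex`)

// hasNoindex reports whether a page asks not to be indexed, either through
// a robots meta tag or a "noindex: true" front matter line (both assumed).
func hasNoindex(page string) bool {
	if metaNoindexRE.MatchString(page) {
		return true
	}
	lines := strings.Split(page, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return false // no front matter block at the top of the page
	}
	for _, l := range lines[1:] {
		l = strings.TrimSpace(l)
		if l == "---" {
			break // end of front matter
		}
		if l == "noindex: true" {
			return true
		}
	}
	return false
}

func main() {
	pages := map[string]string{
		"hidden.md":  "---\ntitle: Hidden\nnoindex: true\n---\n# Hidden\n",
		"visible.md": "---\ntitle: Visible\n---\n# Visible\n",
	}
	for name, content := range pages {
		if hasNoindex(content) {
			fmt.Println("skipping", name)
			continue
		}
		fmt.Println("indexing", name)
	}
}
```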
Edited by Hiru Fernando