Create AI context harvester for product strings

Since we have decided not to use the Crowdin Context Harvester, we need to create our own custom context harvester for product strings.

Idea

The idea is to get the code occurrences of each string and include the contents of those files in the LLM prompt to infer the context of the string. The context of the string in Crowdin will contain a description of where and how the given string is used in the GitLab project.

For example, this is the generated AI context for the string AccessTokens|Rotate using an in-development prompt:

Implementation

First we need to extract the code occurrences alongside the strings from the codebase:

Create a Ruby + Node environment (needed to extract the strings + code occurrences into the gitlab.pot file)
Clone the GitLab codebase with --depth set to 1 to create a shallow clone without the whole commit history.
Install only the dependencies needed for the gitlab.pot file extraction:
- Ruby: parallel, gettext, json, hamlit
- Node: chalk, commander, gettext-extractor, gettext-extractor-vue, vue-template-compiler, wrap-ansi
Enable include_reference_comment in @po_format_options in the GettextExtractor (see here) as mentioned in &81 (comment 2172176810)
Run the command tooling/bin/gettext_extractor locale/gitlab.pot as mentioned in the Internationalization docs
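
The clone and extraction steps above could be scripted roughly as follows. This is only a sketch: the repository URL and the function names are assumptions for illustration, and installing the dependencies and enabling include_reference_comment are left as manual steps between the two calls.

```python
import subprocess
from pathlib import Path

# Assumption: the public GitLab repository URL; swap in a mirror if needed.
GITLAB_REPO_URL = "https://gitlab.com/gitlab-org/gitlab.git"


def clone_argv(checkout_dir):
    """Build the shallow-clone command (--depth 1 skips the commit history)."""
    return ["git", "clone", "--depth", "1", GITLAB_REPO_URL, str(checkout_dir)]


def extract_argv():
    """Build the extraction command from the Internationalization docs."""
    return ["tooling/bin/gettext_extractor", "locale/gitlab.pot"]


def shallow_clone(checkout_dir):
    """Clone the codebase; raises CalledProcessError if git fails."""
    subprocess.run(clone_argv(checkout_dir), check=True)


def extract_pot(checkout_dir):
    """Run the extractor inside the checkout and return the .pot path.

    Call this only after the Ruby and Node dependencies are installed and
    include_reference_comment is enabled; neither is automated here.
    """
    subprocess.run(extract_argv(), cwd=str(checkout_dir), check=True)
    return Path(checkout_dir) / "locale" / "gitlab.pot"
```

Keeping the command builders separate from the functions that run them makes it easy to log or dry-run the commands in a CI job before executing them.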

Now that the gitlab.pot file contains the code occurrences, we can use them in the AI context harvesting:

Create a Python environment and install the polib library
Get all strings in the production Crowdin project that don't have AI context yet
Parse the gitlab.pot file using the polib library and filter out any strings that already have AI context.
For each string:
1. Get the contents of each file in the codebase where the string occurs, based on its occurrences (potentially limiting the number of files if the prompt gets too large)
2. Call the Google Vertex AI API with the prompt to infer the context from the string and the file contents
3. Add this context to the string in Crowdin
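
A minimal sketch of the harvesting loop above, assuming reference comments are present in gitlab.pot (polib exposes them as `entry.occurrences`, a list of `(path, line)` tuples). The limits, the prompt wording, and the `infer_context` and `save_context` callables are hypothetical placeholders; the actual Vertex AI and Crowdin calls are not shown and would be supplied by the caller.

```python
from pathlib import Path

# Hypothetical limits to keep the prompt small; tune for the model in use.
MAX_FILES = 3
MAX_CHARS_PER_FILE = 8000


def load_occurrences(pot_path, skip_msgids):
    """Map each msgid to the unique files it occurs in.

    skip_msgids holds the strings that already have AI context in Crowdin.
    """
    import polib  # third-party (pip install polib); imported lazily

    result = {}
    for entry in polib.pofile(str(pot_path)):
        if entry.msgid in skip_msgids:
            continue
        # Drop line numbers and de-duplicate paths while keeping their order.
        result[entry.msgid] = list(dict.fromkeys(p for p, _line in entry.occurrences))
    return result


def build_prompt(msgid, repo_root, files,
                 max_files=MAX_FILES, max_chars=MAX_CHARS_PER_FILE):
    """Combine the string with (truncated) contents of the files that use it."""
    sections = []
    for rel_path in files[:max_files]:
        path = Path(repo_root) / rel_path
        if not path.is_file():
            continue  # skip occurrences that point at missing files
        text = path.read_text(encoding="utf-8", errors="replace")[:max_chars]
        sections.append(f"--- {rel_path} ---\n{text}")
    # Placeholder wording, not the in-development prompt.
    return (
        "Describe where and how the following UI string is used "
        "in the GitLab project.\n"
        f"String: {msgid}\n\n" + "\n\n".join(sections)
    )


def harvest(pot_path, repo_root, skip_msgids, infer_context, save_context):
    """Run the per-string loop.

    infer_context(prompt) -> str stands in for the Vertex AI call and
    save_context(msgid, context) for the Crowdin update.
    """
    for msgid, files in load_occurrences(pot_path, skip_msgids).items():
        prompt = build_prompt(msgid, repo_root, files)
        save_context(msgid, infer_context(prompt))
```

Passing the LLM and Crowdin calls in as callables keeps the loop testable without network access, and the file and character caps give one simple way to limit the prompt size mentioned in step 1.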

Edited Nov 20, 2024 by Martin Chrástek