Create AI context harvester for product strings
Since we decided not to use the Crowdin Context Harvester, we need to create our own context harvester for product strings.
Idea
The idea is to get the code occurrences of each string and include the contents of those source files in the LLM prompt to infer the context of the string. The context of the string in Crowdin will contain a description of where and how the given string is used in the GitLab project.
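To make the idea concrete, here is a minimal sketch of how a prompt could be assembled from a string and the files that reference it. The function name `build_context_prompt`, the prompt wording, and the example file path are all hypothetical; this is not the in-development prompt used for the example in this issue.

```python
from typing import Mapping


def build_context_prompt(string_id: str, files: Mapping[str, str]) -> str:
    """Assemble an LLM prompt from a string and the files that reference it.

    The wording below is a hypothetical placeholder, not the real prompt.
    """
    sections = [
        "You are documenting UI strings of the GitLab project for translators.",
        f"String: {string_id}",
        "Describe where and how this string is used, based on the files below.",
    ]
    # One section per source file, labelled with its path in the repository.
    for path, contents in files.items():
        sections.append(f"--- File: {path} ---\n{contents}")
    return "\n\n".join(sections)


# Hypothetical file path and contents, for illustration only.
prompt = build_context_prompt(
    "AccessTokens|Rotate",
    {"app/example/access_tokens.vue": "<button>{{ s__('AccessTokens|Rotate') }}</button>"},
)
```

Labelling each file with its path lets the model mention the location of the string in its description, which is part of what translators need.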
For example, this is the generated AI context for the string `AccessTokens|Rotate` using an in-development prompt:
Implementation
First we need to extract the code occurrences alongside the strings from the codebase:
- Create a Ruby + Node environment (needed to extract the strings and code occurrences into the `gitlab.pot` file).
- Clone the GitLab codebase with `--depth` set to 1 to create a shallow clone without the whole commit history.
- Install only the dependencies needed for the `gitlab.pot` file extraction:
  - Ruby: `parallel`, `gettext`, `json`, `hamlit`
  - Node: `chalk`, `commander`, `gettext-extractor`, `gettext-extractor-vue`, `vue-template-compiler`, `wrap-ansi`
- Enable `include_reference_comment` in `@po_format_options` in the `GettextExtractor` (see here) as mentioned in &81 (comment 2172176810).
- Run the command `tooling/bin/gettext_extractor locale/gitlab.pot` as mentioned in the Internationalization docs.
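The clone and extraction steps could be scripted roughly as follows. This is only a sketch: it assumes the Ruby and Node dependencies are already installed and `include_reference_comment` is already enabled, the repository URL is an assumption (the public GitLab repository), and the helper names `clone_command`, `extract_command`, and `regenerate_pot` are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: the public GitLab repository; adjust if you use a mirror.
GITLAB_REPO_URL = "https://gitlab.com/gitlab-org/gitlab.git"


def clone_command(target_dir: str, repo_url: str = GITLAB_REPO_URL) -> list[str]:
    """Shallow clone: --depth 1 skips the commit history."""
    return ["git", "clone", "--depth", "1", repo_url, target_dir]


def extract_command(pot_path: str = "locale/gitlab.pot") -> list[str]:
    """Extractor invocation as given in the steps above."""
    return ["tooling/bin/gettext_extractor", pot_path]


def regenerate_pot(target_dir: str) -> Path:
    """Clone the codebase and regenerate gitlab.pot inside the clone."""
    subprocess.run(clone_command(target_dir), check=True)
    # The extractor path is relative, so it is run from the repository root.
    subprocess.run(extract_command(), cwd=target_dir, check=True)
    return Path(target_dir) / "locale" / "gitlab.pot"
```

Keeping the commands as plain lists makes them easy to log or reuse in a CI job definition without going through a shell.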
Now that the `gitlab.pot` file contains the code occurrences, we can use them in the AI context harvesting:
- Create a Python environment and install the `polib` library.
- Get all strings in the production Crowdin project that don't have the AI context.
- Parse the `gitlab.pot` file using the `polib` library and filter out any strings that already have the AI context.
- For each string:
  - Get the contents of each file where the string occurs in the codebase, based on its occurrences (potentially limit the number of files if the prompt gets too large).
  - Call the Google Vertex AI API with the prompt to infer the context from the string and the file contents.
  - Add this context to the string in Crowdin.
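The harvesting loop above could be sketched as follows. It relies on `polib.pofile` and the `msgid` and `occurrences` attributes of its entries. Everything else is an assumption: strings are matched to Crowdin by `msgid`, the default limit of 5 files is arbitrary, and `infer_context` and `save_context` are hypothetical callables standing in for the Vertex AI and Crowdin calls, which are not shown here.

```python
from pathlib import Path
from typing import Callable, Iterable


def load_pot_entries(pot_path: str):
    """Parse gitlab.pot with polib (third-party: pip install polib)."""
    import polib  # imported lazily so the helpers below work without it

    return polib.pofile(pot_path)


def entries_missing_context(entries: Iterable, already_done: set[str]) -> list:
    """Keep entries whose msgid has no AI context yet (assumed match key)."""
    return [e for e in entries if e.msgid and e.msgid not in already_done]


def read_occurrence_files(entry, repo_root: str, max_files: int = 5) -> dict[str, str]:
    """Read each distinct file referenced by the entry, up to max_files."""
    contents: dict[str, str] = {}
    # polib exposes occurrences as (relative path, line number) tuples.
    for rel_path, _lineno in entry.occurrences:
        if rel_path in contents:
            continue  # same file referenced on several lines
        if len(contents) >= max_files:
            break  # keep the prompt from growing too large
        path = Path(repo_root) / rel_path
        if path.is_file():
            contents[rel_path] = path.read_text(encoding="utf-8", errors="replace")
    return contents


def harvest(
    entries: Iterable,
    already_done: set[str],
    repo_root: str,
    infer_context: Callable[[str, dict[str, str]], str],  # hypothetical LLM call
    save_context: Callable[[str, str], None],  # hypothetical Crowdin update
    max_files: int = 5,
) -> int:
    """Run the per-string loop; returns the number of contexts saved."""
    saved = 0
    for entry in entries_missing_context(entries, already_done):
        files = read_occurrence_files(entry, repo_root, max_files)
        if not files:
            continue  # nothing to show the model, skip this string
        save_context(entry.msgid, infer_context(entry.msgid, files))
        saved += 1
    return saved
```

Passing the model call and the Crowdin update in as callables keeps the loop testable without network access, and a real run would pass `load_pot_entries("locale/gitlab.pot")` as `entries`.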
Edited by Martin Chrástek