Use the tool to simulate manual research, build exception scenarios
Saturate strings that are synced to Crowdin for translation with context info, such as MR or issue/epic links
Inject context info directly into the .pot file as comments/attributes, so Crowdin can read it (see the sketch below)
Agentize it to use Blame for context harvesting. Create a series of prompts, e.g. find this string reference in code, look for fragments, look for adjacent strings, explain this code, etc. - use ideas from this experiment: https://gitlab.com/gitlab-com/localization/localization-team/-/issues/130+
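For the .pot comment idea above, a minimal sketch of what the injection could look like, assuming Python with polib; the file path, the context dictionary, and the MR link are placeholders:

```python
# Sketch: inject per-string context (e.g. an MR link) into the .pot file as
# extracted comments ("#." lines), which Crowdin can pick up as string context.
# The pot path and the context_by_msgid mapping are hypothetical inputs.
import polib

def inject_context(pot_path: str, context_by_msgid: dict[str, str]) -> None:
    pot = polib.pofile(pot_path)
    for entry in pot:
        extra = context_by_msgid.get(entry.msgid)
        if not extra:
            continue
        # Append to any existing extracted comment rather than overwriting it.
        entry.comment = f"{entry.comment}\n{extra}".strip() if entry.comment else extra
    pot.save(pot_path)

if __name__ == "__main__":
    inject_context(
        "locale/gitlab.pot",  # hypothetical path
        {
            "UserMapping|Source name":
                "Context: 'source' = the originating user during a group import. "
                "See https://gitlab.com/gitlab-org/gitlab/-/merge_requests/<MR_IID>"
        },
    )
```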
One interesting possibility would be an option to integrate with the GitLab pipeline, which might be able to automatically generate context based on merge requests. The assumption is that a newly added string has the relevant code snippet in the same merge request (in most cases).
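To make this concrete, here is a rough sketch of what such a merge request pipeline job could do, assuming the usual CI predefined variables plus a hypothetical GITLAB_API_TOKEN; the pairing logic is deliberately naive:

```python
# Sketch: pair each msgid added to the .pot in this MR with the non-.pot diffs
# from the same MR, on the assumption that the code for a new string usually
# lands in the same merge request.
import os
import requests

API = os.environ["CI_API_V4_URL"]              # e.g. https://gitlab.com/api/v4
PROJECT = os.environ["CI_PROJECT_ID"]
MR_IID = os.environ["CI_MERGE_REQUEST_IID"]
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_API_TOKEN"]}  # hypothetical variable

def collect_context() -> dict[str, list[str]]:
    url = f"{API}/projects/{PROJECT}/merge_requests/{MR_IID}/diffs"
    diffs = requests.get(url, headers=HEADERS, params={"per_page": 100}, timeout=30).json()

    new_msgids, code_diffs = [], []
    for d in diffs:
        if d["new_path"].endswith(".pot"):
            # Added lines look like: +msgid "UserMapping|Source name"
            new_msgids += [
                line[len('+msgid "'):-1]
                for line in d["diff"].splitlines()
                if line.startswith('+msgid "')
            ]
        else:
            code_diffs.append(d["diff"])

    # Naive pairing: every new msgid gets the MR's code diffs as candidate context.
    return {msgid: code_diffs for msgid in new_msgids}
```

The output of collect_context() could then feed the .pot comment injection above, or be passed to an LLM as raw material for context harvesting.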
Even if this isn't an option, it can still be beneficial if it can be plugged into the localization pipeline, something similar to this:
Argo as a request management system (or any of its integrations - see linked epics here) is not involved in the localization process of product strings. There is also no MT involved, and no plans for MT (yet). Localizability would need to be addressed first, before we are able to get the most value out of machine-translating product strings.
Hello @rasamhossain I had a quick look at the Crowdin AI Context Harvester and I believe it takes advantage of scanning any available surrounding information around the source code. Would you have an example merge request to test, to see how much data it contains compared to the whole resource file and what the difference in the AI output is? I started to do tests on a few Vue components that are available in the open source repo. Is this a representative MR commit? gitlab-org/gitlab!110295 (dc209e72)
Hi there @jan.bares.argos, you've asked a very good question. Are you looking for a specific Merge Request that contains all the source code changes and the relevant string resource changes (often the source code and the related string resource content are in the same MR only when new string resource content is added), or is your question more about the string contents/resource file(s) only?
MR with only UI text changes (I am not sure these exist; they usually come with other relevant changes to non-.pot files, like the !150724 one above):
msgid "BranchRules|Edit branch protections, approval rules, and status checks from a single page. %{docs_link_start} How to use branch rules?%{docs_link_end}"
The interesting thing is, this MR did not show the actual string msgid "UserMapping|Source name" on its screenshots. Here is the screenshot I received after reaching out to the developer in a DM chat:
And here is the clarification from the developer on what the term "source" means in this context:
Question:
What does Source mean in this case? a system (code repo, issue tracking) from which something is imported to GitLab?
Answer:
Source...means the originating user. This is within the context of imports.... User mapping is the process where when you transfer or import a group to a different GitLab instance, you need to make sure the contributions of Susan in the old instance are assigned correctly to Susan in the new instance. Susan in the old instance would be the 'source user'.
I have uploaded the screenshot to Crowdin and tagged the string on it:
I also added context info by typing it in the comments section on the right:
We have not been uploading screenshots as context to Crowdin. We have not been adding context as comments in Crowdin either. This is a manual process I have tried only now, and it will not scale.
Would love to explore how we could leverage context harvester for that, if at all possible.
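For reference, a rough idea of how the manual context step could be scripted against the Crowdin API v2 (Edit String endpoint). The project ID, string ID, and token are placeholders, the string ID would need to be looked up first (e.g. via the List Strings endpoint), and whether the context field is writable for strings imported from a .pot file likely depends on the project setup:

```python
# Sketch: set a source string's context via the Crowdin API v2 instead of
# typing it into the comments section manually. IDs and token are placeholders.
import requests

CROWDIN_TOKEN = "<personal-access-token>"   # placeholder
PROJECT_ID = 0                              # placeholder Crowdin project ID
STRING_ID = 0                               # placeholder source string ID

def set_string_context(context: str) -> None:
    url = f"https://api.crowdin.com/api/v2/projects/{PROJECT_ID}/strings/{STRING_ID}"
    resp = requests.patch(
        url,
        headers={"Authorization": f"Bearer {CROWDIN_TOKEN}"},
        json=[{"op": "replace", "path": "/context", "value": context}],
        timeout=30,
    )
    resp.raise_for_status()

set_string_context(
    "'Source' = the originating user when a group is imported or transferred "
    "to another GitLab instance (user mapping)."
)
```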
After a discussion with @mchrastek-ext, I realize we are not using the Context Harvester as Crowdin originally intended it to be used. I will let Martin add more notes on the proposed direction for context enhancement of the translation process before closing this issue.
Even though Crowdin added support for more AI providers two weeks ago (including Google Vertex AI), I still believe it would be best to have custom code for harvesting the context from the codebase.
This would bring the following benefits:
No need to include a library of which we would not use 90%.
Have direct control over how files and other context get passed to an LLM, including potentially fetching context from other sources like MRs, commits, etc.
If we use the approach to extracting the code occurrences of the strings that I suggested in this comment &81 (comment 2172176810) (enabling include_reference_comment in @po_format_options), we can get a faster and more accurate search for code occurrences (and therefore more accurate context) than with the Crowdin Context Harvester, due to escaped characters, multiline strings, etc.
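As a rough sketch of that last point, assuming Python with polib and a local GitLab checkout (the context window size is an arbitrary choice), reading occurrences straight from the reference comments could look like this:

```python
# Sketch: once include_reference_comment writes "#: path:line" references into
# the .pot, code occurrences can be read directly instead of searched for.
from pathlib import Path
import polib

REPO_ROOT = Path(".")   # assumed to be a GitLab checkout
WINDOW = 10             # lines of surrounding code to keep as context

def snippets_for(msgid: str, pot_path: str = "locale/gitlab.pot") -> list[str]:
    pot = polib.pofile(pot_path)
    snippets = []
    for entry in pot:
        if entry.msgid != msgid:
            continue
        for ref_path, ref_line in entry.occurrences:   # parsed from "#:" comments
            if not ref_line:
                continue
            lines = (REPO_ROOT / ref_path).read_text().splitlines()
            line_no = int(ref_line)
            start = max(0, line_no - 1 - WINDOW)
            snippets.append("\n".join(lines[start:line_no + WINDOW]))
    return snippets
```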