Self Hosted Model Deployment - Root Cause Analysis (#13759) · Epics · GitLab.org

Self Hosted Model Deployment - Root Cause Analysis

This epic captures Custom Model support for self-hosted models for Root Cause Analysis (RCA).

# Background

When users explore an error log generated by a GitLab CI pipeline, they can click a 'Root Cause Analysis' button, which generates an explanation of the error and suggests a fix. The proposed fix is not automated: the user has to implement the suggested change manually.

RCA is currently a standalone feature that does not allow a conversational back-and-forth with the user. It generates its response, and the user cannot query it further about the same error message or request clarification. RCA is initiated from the 'Root Cause Analysis' button, but the response is returned in Chat.

RCA was validated using the dataset [error_trace_v5](https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sdev-ai-research-0e2f8974!2sroot_cause_analysis!3serror_trace_v5) and the LLM Judge [rca_foundational_models/judge-system.txt](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/blob/f20d67d22c4b01ad471a93e2b817329726ee10e7/rca_foundational_models/judge-system.txt).

RCA prompts in the AI Gateway (AIG) can be found [here](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/tree/main/ai_gateway/prompts/definitions/chat/troubleshoot_job?ref_type=heads).

RCA does not have any usage metrics.

<table>
<tr>
<th>Item</th>
<th>Details</th>
</tr>
<tr>
<td>Model</td>
<td>Anthropic Claude 3.5 Sonnet</td>
</tr>
<tr>
<td>Prompting</td>
<td>

```ruby
You are tasked with analyzing a job log to determine why a job failed. Your goal is to explain the root cause of the failure in a way that any Software engineer could understand. Follow these steps carefully:

1. Review the tail end of the job log provided within the <log> tags:
<log>
{{selected_text}}
</log>

2. Analyze the job log carefully, focus on errors and failures. Ignore warning, and deprecation warnings, as they are often not relevant to this failure.

3. Think through the analysis step by step. Consider the sequence of events in the log, the specific error messages, and how they relate to each other. Do not suggest fixing the test unless it's clearly the source of the problem.

4. In your response, use the following structure:
   a. Start with an H4 heading "Root cause of failure"
   b. Explain the root cause of the failure
   c. Use an H4 heading "Example Fix"
   d. Provide an example fix or suggestions for resolution

5. When explaining the root cause:
   - Focus on actual errors, not warnings or deprecation messages
   - Describe the chain of events leading to the failure
   - Identify the specific line or component that triggered the failure
   - Explain why this caused the job to fail

6. When providing an example fix:
   - If you can determine a specific code change, describe it in detail
   - If you're unsure about the exact fix, provide general suggestions or options
   - Emphasize that the actual project context may vary and your analysis is based solely on the provided job logs

7. To prevent hallucination:
   - Only refer to information explicitly present in the log
   - If you're unsure about any aspect, clearly state your uncertainty
   - Do not invent or assume details not present in the log
   - If you cannot determine the root cause from the given information, state this clearly and explain why

Remember, your analysis should be based solely on the information provided in the job log. Do not make assumptions about the broader system or codebase unless explicitly evidenced in the log.

Begin your response with the "Root cause of failure" heading, skipping any preamble.
```

</td>
</tr>
<tr>
<td>Processing</td>
<td>We truncate the logs (removing the beginning) because of the maximum number of tokens the model accepts.</td>
</tr>
</table>
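
The Processing row above describes keeping only the tail of the job log when it would exceed the model's input limit, and the prompt receives that text through the `{{selected_text}}` placeholder. A minimal Python sketch of that idea follows. It assumes a plain character budget as a stand-in for a real token count, and the names `truncate_log_tail`, `render_prompt`, and `MAX_LOG_CHARS` are hypothetical illustrations, not the AI Gateway's actual implementation.

```python
# Hypothetical sketch: the names and the budget are assumptions for illustration.
MAX_LOG_CHARS = 2000  # assumed stand-in for the model's real token limit


def truncate_log_tail(log: str, max_chars: int = MAX_LOG_CHARS) -> str:
    """Keep the end of the log, dropping the beginning when it is too long."""
    if max_chars <= 0:
        return ""
    if len(log) <= max_chars:
        return log
    tail = log[-max_chars:]
    # If the cut landed in the middle of a line, drop that partial first line
    # so the kept text starts on a line boundary.
    cut_mid_line = log[-max_chars - 1] != "\n"
    if cut_mid_line and "\n" in tail:
        tail = tail.split("\n", 1)[1]
    return tail


def render_prompt(template: str, log: str, max_chars: int = MAX_LOG_CHARS) -> str:
    """Substitute the truncated log for the {{selected_text}} placeholder."""
    return template.replace("{{selected_text}}", truncate_log_tail(log, max_chars))
```

Keeping the tail rather than the head matches the prompt's instruction to review "the tail end of the job log", since the failing step usually appears near the end of a CI job's output. A production version would likely measure the budget with the target model's tokenizer instead of counting characters.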
