Low-quality answers due to a prompt template incompatible with null context
## Problem to solve
There are many answers that appear to result from a wrong prompt template. They often start with _Unfortunately, there is no context provided to answer..._. This is due to the empty `<context>` block for use cases that do not provide/need context in the prompt, e.g. Code Generation and Documentation.
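A minimal sketch of the failure mode (the template below is hypothetical, not the actual Duo Chat prompt): when a use case supplies no context, the `<context>` block renders empty, which models then comment on instead of answering.

```python
# Hypothetical prompt template with a <context> slot, for illustration only.
PROMPT_TEMPLATE = """\
<context>
{context}
</context>

{question}"""


def build_prompt(question: str, context: str = "") -> str:
    # Code Generation / Documentation tasks pass no context, leaving an
    # empty <context></context> block in the rendered prompt.
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt("Write a function that reverses a string.")
# The rendered prompt contains an empty <context> block, which tends to
# trigger answers like "Unfortunately, there is no context provided...".
```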
Below are the percentages of such answers from the `dev-ai-research-0e2f8974.duo_chat_foundation_models.llm_judge` table.
| created_at | task | answering_model | percentage |
|---|---|---|---|
| 2024-03-08 | Code Explanation | gemini-1.0-pro-001 | 5.85 |
| 2024-03-08 | Code Explanation | gemini-1.5-pro-preview-0215 | 0.57 |
| 2024-03-07 | Documentation | gemini-1.0-pro-001 | 85.8 |
| 2024-03-07 | Documentation | gemini-1.5-pro-preview-0215 | 39.05 |
| 2024-03-07 | Code Generation | gemini-1.0-pro-001 | 9.84 |
| 2024-03-07 | Code Generation | gemini-1.5-pro-preview-0215 | 1.25 |
| 2024-03-07 | Code Explanation | gemini-1.0-pro-001 | 5.34 |
| 2024-03-07 | Code Explanation | gemini-1.5-pro-preview-0215 | 1.0 |
| 2024-03-06 | Issue/Epic | claude-2 | 7.75 |
| 2024-03-06 | Issue/Epic | gpt-4 | 3.06 |
| 2024-03-06 | Issue/Epic | claude-3-sonnet | 4.55 |
| 2024-03-06 | Issue/Epic | claude-3-opus | 1.61 |
| 2024-03-06 | Code Explanation | claude-2 | 64.5 |
| 2024-03-06 | Code Explanation | claude-3-opus | 0.32 |
| 2024-03-06 | Code Explanation | claude-3-sonnet | 4.78 |
| 2024-03-05 | Documentation | claude-2 | 100.0 |
| 2024-03-05 | Documentation | gpt-4 | 9.02 |
| 2024-03-05 | Documentation | claude-3-sonnet | 92.62 |
| 2024-03-05 | Code Generation | claude-2 | 78.62 |
| 2024-03-05 | Code Generation | claude-3-sonnet | 45.37 |
| 2024-03-05 | Issue/Epic | claude-2 | 7.58 |
| 2024-03-05 | Issue/Epic | claude-3-sonnet | 5.68 |
| 2024-03-05 | Issue/Epic | claude-3-opus | 2.26 |
| 2024-03-05 | Issue/Epic | gpt-4 | 2.99 |
## SQL to generate the data

```sql
WITH totalCount AS (
  SELECT
    COUNT(*) AS total,
    EXTRACT(date FROM created_at) AS created_at,
    task,
    answering_model
  FROM
    `dev-ai-research-0e2f8974.duo_chat_foundation_models.llm_judge`
  GROUP BY
    created_at,
    task,
    answering_model
), totalErrorCount AS (
  SELECT
    COUNT(*) AS total,
    EXTRACT(date FROM created_at) AS created_at,
    task,
    answering_model
  FROM
    `dev-ai-research-0e2f8974.duo_chat_foundation_models.llm_judge`
  WHERE
    answer LIKE "Unfortunately%context%"
    OR answer LIKE "The provided context%"
  GROUP BY
    created_at,
    task,
    answering_model
)
SELECT
  ROUND(totalErrorCount.total / totalCount.total * 100, 2) AS percentage,
  totalErrorCount.created_at,
  totalErrorCount.task,
  totalErrorCount.answering_model
FROM
  totalErrorCount
JOIN
  totalCount
ON
  totalErrorCount.task = totalCount.task
  AND totalErrorCount.created_at = totalCount.created_at
  AND totalErrorCount.answering_model = totalCount.answering_model
```
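The same classification can be applied outside BigQuery. Below is a small sketch that translates the two `LIKE` patterns from the `WHERE` clause into Python, e.g. for spot-checking individual answers:

```python
import re

# Regex equivalents of the SQL patterns:
#   answer LIKE "Unfortunately%context%"  ->  starts with "Unfortunately",
#                                             contains "context" later
#   answer LIKE "The provided context%"   ->  starts with "The provided context"
NULL_CONTEXT_PATTERNS = [
    re.compile(r"^Unfortunately.*context", re.DOTALL),
    re.compile(r"^The provided context"),
]


def is_null_context_answer(answer: str) -> bool:
    """Return True if the answer matches one of the null-context patterns."""
    return any(p.search(answer) for p in NULL_CONTEXT_PATTERNS)
```

Like the SQL `LIKE` patterns, the match is case-sensitive and anchored to the start of the answer.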
## Proposal
- Revisit the answering prompt template.
- Consider using different templates for different tasks and models.
- Run the test on a problematic set of questions and compare results before and after.
## Further details
## Links / references
Edited by Tan Le