Update explain vulnerability prompt for moderation

Neil McCorrison requested to merge 414880-update-explain-exploit-prompt into master

What does this MR do and why?

Vertex AI introduced content moderation around 2023-06-04, and it is now failing our prompt about 50% of the time.

See #414880 (closed).

Changes prompt from:

Provide a code example with syntax highlighting on how to exploit it.

to

Provide a code example with syntax highlighting on how an attacker can take advantage of the vulnerability.

Example responses using the GCP sandbox:

[Image: example responses]

Our analysis suggests that this prompt change should reduce the share of prompt responses blocked by content moderation from 16.44% to 2.35%. It may also reduce failing requests by up to approximately 10%, likely because less time is spent in the Google APIs during the content moderation cycles.
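
To put the two block rates in perspective, the arithmetic can be worked through in a few lines. This is only an illustrative calculation from the 16.44% and 2.35% figures quoted above; the variable names are made up for the example, and it says nothing about the separate estimate for failing requests.

```python
# Block rates taken from the analysis quoted in this MR description.
before = 16.44  # % of responses blocked with the old prompt wording
after = 2.35    # % of responses blocked with the new prompt wording

absolute_drop = before - after          # difference in percentage points
relative_drop = absolute_drop / before  # fraction of blocked responses avoided

print(f"{absolute_drop:.2f} percentage points, {relative_drop:.1%} relative reduction")
# prints "14.09 percentage points, 85.7% relative reduction"
```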

How to set up and validate locally

There should be no operational difference except for a distinct decrease in the number of responses that are blocked by Google's content moderation.

When content is blocked by the moderation algorithm, the response received is: "I'm not able to help with that, as I'm only a language model. If you believe this is an error, please send us your feedback."
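
When validating locally, one way to count blocked responses is to match them against that fallback message. The following is a minimal sketch, not the code in this MR: `is_moderation_blocked` and `blocked_rate` are hypothetical helper names, and it assumes a blocked response starts with exactly the text quoted above.

```python
# Hypothetical helpers for local validation; not part of this MR.
# Assumption: a blocked response begins with this exact fallback text.
BLOCKED_PREFIX = "I'm not able to help with that, as I'm only a language model."


def is_moderation_blocked(response_text: str) -> bool:
    """Return True if the response looks like the moderation fallback message."""
    return response_text.strip().startswith(BLOCKED_PREFIX)


def blocked_rate(responses: list[str]) -> float:
    """Return the fraction of responses that were blocked (0.0 for no responses)."""
    if not responses:
        return 0.0
    blocked = sum(is_moderation_blocked(r) for r in responses)
    return blocked / len(responses)
```

Running `blocked_rate` over a batch of responses collected with the old and the new prompt wording would give figures comparable to the percentages reported above, provided the fallback text has not changed.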

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #414880 (closed)

Edited by Gregory Havenga
