Abstract: Large language models (LLMs) are artificial intelligence (AI) tools that respond to text prompts. They can be made to pursue goals and work on complex, multi-stage tasks by wrapping them in repeat-prompting systems like Auto-GPT. Many such systems exist and are freely accessible. Given Internet access and user assistance, they can cause significant harm in the real world as soon as they become sufficiently effective. The guardrails built into LLMs are currently unable to prevent such harm.
With repeat-prompting systems freely downloadable, ensuring their safety is an urgent goal. Rather than shutting down LLMs outright, one may be able to mitigate the risks they pose by constraining the size of their publicly accessible context windows.
To account for the rapidly changing state of the art of both repeat-prompting systems and LLM guardrails, we propose an adaptive mechanism for determining safe context window sizes. The underlying approach is a continuous adversarial testing regime with gradual window size adaptation as either the safety or risk of LLMs is demonstrated. The mechanism combines a downward binary search with an upward exponential back-off strategy to establish boundaries quickly while still allowing for exponential growth as systems are shown safe.
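To make the mechanism concrete, here is a minimal sketch of one way such an update rule could look. All names (AdaptiveWindow, record_test, the specific sizes) are invented for illustration; the abstract does not specify an implementation, only the combination of downward binary search and upward exponential growth.

```python
# Hypothetical sketch of the adaptive window-sizing mechanism described above.
# Assumption: each call to record_test() reports the outcome of one round of
# adversarial testing at the currently permitted context window size.

class AdaptiveWindow:
    """Tracks the publicly permitted context window size under continuous testing."""

    def __init__(self, initial_size: int, minimum_safe_size: int = 256):
        self.size = initial_size              # currently permitted window size (tokens)
        self.known_safe = minimum_safe_size   # largest size already demonstrated safe

    def record_test(self, passed: bool) -> int:
        """Adjust the permitted window size after one adversarial testing round."""
        if passed:
            # Safety demonstrated at the current size: remember it and
            # grow exponentially by doubling the window.
            self.known_safe = max(self.known_safe, self.size)
            self.size *= 2
        else:
            # Risk demonstrated: binary-search downward, halving the gap
            # between the current size and the largest known-safe size.
            self.size = max(self.known_safe, (self.known_safe + self.size) // 2)
        return self.size


# Example: start at 8192 tokens; a failed test halves toward the safe floor,
# and a subsequent pass lets the window grow again.
window = AdaptiveWindow(initial_size=8192)
window.record_test(passed=False)  # 8192 -> 4224
window.record_test(passed=True)   # 4224 -> 8448
```

The doubling step gives the exponential growth as systems are shown safe, while repeated failures converge on the boundary between safe and risky sizes at binary-search speed.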