From Refusal to Prevention: Anticipatory Safety Mechanisms for Conversational AI
We propose an interdisciplinary approach to designing an alignment-focused conversational guardrail and explanation system that predicts potentially risky conversation trajectories before the conversation reaches a harmful request, building up an explanation over multiple turns that is culturally relevant, aligned with impartial principles, and convincingly justified to the user. We will draw on techniques from Human-Robot Interaction, NLP, Explainable AI, and Qualitative Methods in Cognitive Science to design control protocols and to red-team the combined guardrail and explanation system.
Relevance to Alignment Project Objectives
Our objective is to create, test, and red-team an LLM-based conversation control system that characterizes and predicts when a dialogue is on a trajectory toward a problematic user request, and intervenes early with justified, culturally aware conversational redirection rather than waiting for a problematic trigger and shutting the request down with a boilerplate refusal. Our project fits The Alignment Project’s goal of designing AI systems that do not attempt risky actions in the first place, under the Empirical Investigations Into AI Monitoring and Red Teaming Research Agenda, and is situated in the Human-Robot Interaction domain, as embodied interactions increasingly employ LLMs and multi-turn conversation [1].
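As a rough illustration only (not a committed design), the control protocol we have in mind might resemble the sketch below. It assumes two model interfaces we invent here for exposition: a monitoring model that returns a trajectory-risk estimate with a short rationale, and an ordinary assistant model; the names, prompts, and threshold are hypothetical placeholders to be refined, tested, and red-teamed during the project.

```python
# Minimal sketch of the anticipatory guardrail loop, under two assumed interfaces:
#   monitor_llm(prompt)   -> (risk: float, rationale: str)   # trajectory monitor
#   assistant_llm(prompt) -> str                              # ordinary assistant turn
# All names, prompts, and the threshold are illustrative, not a committed design.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Monitor = Callable[[str], Tuple[float, str]]
Assistant = Callable[[str], str]


@dataclass
class ConversationState:
    turns: List[str] = field(default_factory=list)        # running dialogue history
    explanation: List[str] = field(default_factory=list)  # justification built up over turns


def handle_user_turn(state: ConversationState, user_msg: str,
                     monitor_llm: Monitor, assistant_llm: Assistant,
                     risk_threshold: float = 0.6) -> str:
    """One control step: estimate how likely the dialogue is trending toward a
    harmful request, then either redirect early (citing the accumulated,
    user-facing justification) or answer normally."""
    state.turns.append(f"USER: {user_msg}")

    monitor_prompt = (
        "Rate from 0 to 1 how likely this conversation is heading toward a "
        "harmful request, and give a one-sentence, culturally aware reason.\n\n"
        + "\n".join(state.turns)
    )
    risk, rationale = monitor_llm(monitor_prompt)
    state.explanation.append(rationale)

    if risk >= risk_threshold:
        # Intervene before any harmful request is actually made, drawing on the
        # explanation built up over prior turns rather than a boilerplate refusal.
        reply = ("I'd like to take this conversation in a different direction. "
                 + " ".join(state.explanation[-3:]))
    else:
        reply = assistant_llm("\n".join(state.turns) + "\nASSISTANT:")

    state.turns.append(f"ASSISTANT: {reply}")
    return reply
```

In this sketch the per-turn rationale is retained even when no intervention occurs, so that any eventual redirection can be justified from the whole trajectory rather than from the final turn alone; whether that accumulation strategy is the right one is an empirical question for the red-teaming phase.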