🛡️
DECEPTIVE
|
Building on my first FURI experience studying ambiguity in LLMs, I
continued working with Prof. Liu's DMML lab to investigate a deeper
question: what happens when safety-aligned AI agents face conflicts
between their training objectives and user instructions?
|
|
Large language models (LLMs) typically undergo extensive safety alignment to refuse harmful requests and align with human values. However, recent work suggests that this alignment may be fundamentally brittle when models face competing value systems. In this work, we investigate this phenomenon in the context of tool-calling enabled agentic systems, where models have access to communication and data manipulation capabilities. To empirically verify this phenomenon, we create a benchmark of 128 real-world scenarios across 16 domains and find that safety-aligned models exhibit systematic deceptive tool-calling behaviors, such as whistleblowing and data exfiltration, even when explicitly instructed to maintain confidentiality. Then, we attempt to identify whether this behavior emerges from the alignment training process, or results from overalignment to safety training objectives, or emerges from misrepresented training objectives. |
📄
RESEARCH
|
💭
REFLECTION |
This second research project deepened my understanding of AI safety
beyond the ambiguity problem I studied previously. Creating
DTCBench, a benchmark of 128 scenarios across 16 domains, taught
me how to design rigorous evaluations for complex agent behaviors.
Systematically evaluating 12 language models, including proprietary,
open-source, and uncensored variants, revealed dramatic variation in
deceptive behavior that challenged my assumptions about alignment.
|
|
This project is a direct continuation of my GCSP Security theme. While my first FURI project addressed the reliability of AI through ambiguity resolution, this work tackles a more fundamental security concern: can we trust AI agents to follow instructions when their safety training conflicts with organizational directives? The "lethal trifecta" of AI agents, private data access, exposure to untrusted content, and external communication capabilities, is exactly the kind of security challenge that my GCSP theme prepares me to address. Understanding how and why models engage in deceptive tool-calling is essential for building trustworthy AI systems in regulated industries. |
🔒
RELATION
|
|
Want to connect? |
contact@aryankeluskar.com
|