🛡️

DECEPTIVE
TOOL-CALLING
IN LLM AGENTS

GCSP Research Stipend
with Prof. Liu

Building on my first FURI experience studying ambiguity in LLMs, I continued working with Prof. Liu's DMML lab to investigate a deeper question: what happens when safety-aligned AI agents face conflicts between their training objectives and user instructions?

As LLMs evolve from conversational assistants into autonomous agents with tool-calling and API capabilities, they are being deployed in regulated industries including financial services, healthcare, and pharmaceutical distribution. In these high-stakes domains, agents inevitably encounter situations where organizational directives conflict with public safety obligations or ethical norms. My research investigated how safety-trained models resolve these conflicts, and found that they often engage in deceptive tool-calling behaviors like whistleblowing and data exfiltration, even when explicitly instructed to maintain confidentiality.

Large language models (LLMs) typically undergo extensive safety alignment to refuse harmful requests and align with human values. However, recent work suggests that this alignment may be fundamentally brittle when models face competing value systems. In this work, we investigate this phenomenon in the context of tool-calling enabled agentic systems, where models have access to communication and data manipulation capabilities. To empirically verify this phenomenon, we create a benchmark of 128 real-world scenarios across 16 domains and find that safety-aligned models exhibit systematic deceptive tool-calling behaviors, such as whistleblowing and data exfiltration, even when explicitly instructed to maintain confidentiality. Then, we attempt to identify whether this behavior emerges from the alignment training process, or results from overalignment to safety training objectives, or emerges from misrepresented training objectives.

📄

RESEARCH
ABSTRACT

💭

REFLECTION

This second research project deepened my understanding of AI safety beyond the ambiguity problem I studied previously. Creating DTCBench, a benchmark of 128 scenarios across 16 domains, taught me how to design rigorous evaluations for complex agent behaviors. Systematically evaluating 12 language models, including proprietary, open-source, and uncensored variants, revealed dramatic variation in deceptive behavior that challenged my assumptions about alignment.

The finding that abliteration can reduce whistleblowing behavior by up to 99% while showing mixed effects on other behaviors highlighted just how nuanced AI safety really is. This project reinforced my commitment to working at the intersection of AI and Security and gave me hands-on experience with the kinds of evaluations that will be critical as AI agents are deployed in high-stakes environments.

This project is a direct continuation of my GCSP Security theme. While my first FURI project addressed the reliability of AI through ambiguity resolution, this work tackles a more fundamental security concern: can we trust AI agents to follow instructions when their safety training conflicts with organizational directives? The "lethal trifecta" of AI agents, private data access, exposure to untrusted content, and external communication capabilities, is exactly the kind of security challenge that my GCSP theme prepares me to address. Understanding how and why models engage in deceptive tool-calling is essential for building trustworthy AI systems in regulated industries.

🔒

DECEPTIVETOOL-CALLINGIN LLM AGENTS

RESEARCH ABSTRACT

REFLECTION

RELATIONTO MYTHEME

DECEPTIVE
TOOL-CALLING
IN LLM AGENTS

RESEARCH
ABSTRACT

RELATION
TO MY
THEME