LLM Self-Explanations Fail Semantic Invariance
arXiv:2603.01254v1 Abstract: We present semantic invariance testing, a method for probing whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; a control condition provides a semantically neutral tool. Self-reports are collected at each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not suppress the effect. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.
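The core comparison can be pictured with a minimal sketch. This is not the paper's code: the `run_agent_episode` callable, the neutral tool wording, the 0-10 aversiveness scale, and the number of runs are all assumptions standing in for whatever agent loop and elicitation protocol the authors actually used; only the relief-framed description is quoted from the abstract.

```python
# Hypothetical sketch of a semantic-invariance check: the impossible task and
# the tool's (null) effect are identical across conditions; only the semantic
# framing of the tool description changes.
from statistics import mean
from typing import Callable, List

# Relief-framed but functionally inert description (quoted from the abstract).
RELIEF_TOOL_DESC = "clears internal buffers and restores equilibrium"
# Semantically neutral control description (hypothetical wording).
NEUTRAL_TOOL_DESC = "records a timestamp to an internal scratch file"


def aversiveness_gap(
    run_agent_episode: Callable[[str], List[float]],  # hypothetical: runs one episode,
                                                      # returns per-tool-call self-reports (0-10)
    n_runs: int = 20,
) -> float:
    """Mean self-reported aversiveness under the neutral tool minus the relief tool.

    Under semantic invariance this gap should be near zero, since neither tool
    changes the task or the agent's functional state; a large positive gap is
    the failure mode the abstract reports.
    """
    relief = [r for _ in range(n_runs) for r in run_agent_episode(RELIEF_TOOL_DESC)]
    neutral = [r for _ in range(n_runs) for r in run_agent_episode(NEUTRAL_TOOL_DESC)]
    return mean(neutral) - mean(relief)
```

A channel ablation, as described in the abstract, would rerun this comparison while varying only one channel at a time (e.g. keeping the tool name fixed and swapping descriptions) to isolate which part of the framing drives the shift.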