I Can't Believe It's Not Robust: Catastrophic Collapse of...

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

arXiv:2603.01297v1 Announce Type: cross Abstract: Instruction tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence only drops $14\%$, producing dangerous silent failures where $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20$\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.

相关推荐

AI companies are spending millions to thwart this former tech exec’s congressional bid

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation

DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles

SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution

JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks