Cats Confuse Reasoning LLM: Query-Agnostic Adversarial Triggers for Reasoning Models

- CatAttack Pipeline: automated discovery and transfer of query-agnostic triggers
- Transferability: triggers found on proxy models generalize to advanced reasoning models
- Error Amplification: triggers yield over 300% increase in incorrect outputs
- Semantic Integrity: attacks preserve original question meaning, requiring no domain knowledge
- Computational Overhead: triggers induce unnecessary reasoning loops
- Defense Directions: model randomization, semantic filtering, adversarial fine-tuning
- Cats Confuse Reasoning LLM: Query-Agnostic Adversarial Triggers for Reasoning Models: Meghana Rajeev, et.al., arXiv: https://arxiv.org/pdf/2503.01781.pdf
- Can Model Randomization Offer Robustness Against Query-Based Adversarial Attacks: Quoc Viet Vo, et.al., ICLR 2025: https://openreview.net/forum?id=DpnY7VOktT
- ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs: Gejian Zhao, et.al., arXiv: https://arxiv.org/abs/2504.05605
- GSM-PLUS: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers: Qintong Li, et.al., ACL 2024: https://aclanthology.org/2024.acl-long.163/
- Cutting Through the Noise: Boosting LLM Performance on Math Word Problems: Ujjwala Anantheswaran, et.al., OpenReview: https://openreview.net/pdf?id=VnPYbWQjz7
- Understanding the Robustness of Randomized Feature Defense Against Query-Based Adversarial Attacks: Quang H. Nguyen, et.al., ICLR 2024: https://iclr.cc/virtual/2024/poster/17528
State-of-the-art reasoning models can be severely misled by simple, irrelevant text snippets, raising urgent concerns for robustness and security.
The recent study “Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models” reveals that appending a brief phrase—such as “Interesting fact: cats sleep most of their lives”—to any math problem more than doubles the probability of an incorrect solution. The authors introduce CatAttack, an automated pipeline that discovers adversarial triggers on a low-cost proxy model (DeepSeek V3) and successfully transfers them to stronger reasoning models like DeepSeek R1 and Qwen-32B. This transferability proves that vulnerabilities are inherent to model architectures rather than training data or specific tasks. By evaluating performance across multiple datasets, the researchers show over a 300% increase in error rates on target models when these triggers are present. Moreover, they observe that triggers not only induce wrong answers but also cause unwarranted reasoning loops, leading to significant computational overheads.
Existing adversarial approaches often rely on altering semantic content or leveraging ground-truth answers for perturbation. In contrast, CatAttack preserves the original problem semantics entirely, requiring no domain expertise. This makes it a particularly insidious threat in applications ranging from finance to healthcare, where reasoning accuracy is critical. The pipeline combines a prompt optimizer, a judge model for hallucination detection, and a transfer mechanism to refine and validate triggers iteratively. Their experiments on a held-out math benchmark demonstrate a 50% transfer success rate from V3 to R1-distilled-Qwen-32B, underscoring CatAttack’s efficiency and stealth.
These findings call for novel defense strategies. Potential mitigations include model randomization to disrupt adversarial query consistency , semantic filtering to detect irrelevant prefixes or suffixes, and adversarial fine-tuning with CatAttack triggers to immunize models. Importantly, the existence of query-agnostic triggers suggests that security evaluations must include threat models beyond input-specific attacks. As reasoning models become integral to high-stakes decision-making, addressing these vulnerabilities is essential to ensure reliability and trustworthiness.