See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles

arXiv:2509.13615v2

Abstract: The advent of multimodal agents facilitates effective interaction with graphical user interfaces (GUIs), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark of binary toggle instructions derived from public datasets. Evaluations of existing agents show that they are notably unreliable, particularly when the current toggle state already matches the desired state. To address this challenge, we propose State-aware Reasoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. Experiments on four multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public agentic benchmarks show that StaR also enhances general agentic task performance. Finally, evaluations in a dynamic environment highlight the potential of StaR for real-world applications. Code and benchmark: https://github.com/ZrW00/StaR.
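The three steps named in the abstract (perceive the current toggle state, infer the desired state, act accordingly) can be illustrated with a minimal sketch. This is not the authors' implementation: in StaR both perception and inference are carried out by a multimodal model, whereas here the current state is simply passed in and the desired state is guessed by a toy keyword rule. The names `ToggleState`, `infer_desired_state`, `decide_action`, and the action labels `"click"` and `"no_op"` are all hypothetical.

```python
from enum import Enum


class ToggleState(Enum):
    ON = "on"
    OFF = "off"


def infer_desired_state(instruction: str) -> ToggleState:
    """Toy stand-in for inferring the desired state from an instruction.

    Assumption: simple keyword matching; a real agent would reason over
    the instruction with a language model.
    """
    text = instruction.lower()
    # Check "off" phrases first, because "deactivate" contains "activate".
    off_phrases = ("turn off", "switch off", "disable", "deactivate")
    on_phrases = ("turn on", "switch on", "enable", "activate")
    if any(p in text for p in off_phrases):
        return ToggleState.OFF
    if any(p in text for p in on_phrases):
        return ToggleState.ON
    raise ValueError(f"cannot infer desired state from: {instruction!r}")


def decide_action(current: ToggleState, instruction: str) -> str:
    """Return a hypothetical action label for a binary toggle instruction.

    `current` is assumed to come from a perception step (not modeled here).
    """
    desired = infer_desired_state(instruction)
    # Act only when the perceived state differs from the desired one;
    # clicking when they already match would flip the toggle the wrong way.
    return "click" if current != desired else "no_op"


if __name__ == "__main__":
    print(decide_action(ToggleState.OFF, "Turn on Wi-Fi"))
    print(decide_action(ToggleState.ON, "Enable Bluetooth"))
```

The second call covers the case the abstract flags as especially error-prone: the toggle is already in the desired state, so the correct behavior is to leave it alone.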
