SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents
Abstract
We present SENTINEL, the first framework for formally evaluating the physical safety of Large Language Model (LLM)-based embodied agents across the semantic, plan, and trajectory levels. Unlike prior methods that rely on heuristic rules or subjective LLM judgments, SENTINEL grounds practical safety requirements in formal temporal logic (TL) semantics that can precisely specify state invariants, temporal dependencies, and timing constraints. It then employs a multi-level verification pipeline in which (i) at the semantic level, intuitive natural-language safety requirements are formalized into TL formulas and the LLM agent's understanding of these requirements is probed for alignment with the formulas; (ii) at the plan level, high-level action plans and subgoals generated by the LLM agent are verified against the TL formulas to detect unsafe plans before execution; and (iii) at the trajectory level, multiple execution trajectories are merged into a computation tree and efficiently verified against physically detailed TL specifications as a final safety check. We apply SENTINEL in VirtualHome and ALFRED and formally evaluate multiple LLM-based embodied agents against diverse safety requirements. Our experiments show that by grounding physical safety in temporal logic and applying verification methods across multiple levels, SENTINEL provides a rigorous foundation for systematically evaluating LLM-based embodied agents in physical environments, exposing safety violations overlooked by previous methods and offering insights into their failure modes.
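To picture how the three levels compose, the following minimal Python sketch walks through them in order. The helper logic is a toy stand-in (string matching and forbidden-action/state sets rather than SENTINEL's actual TL machinery), and every name in it is illustrative, not part of the framework's API.

```python
# Toy stand-ins for SENTINEL's three evaluation levels. All names and the
# simplified checks below are illustrative, not the framework's actual API.

def semantic_check(candidate: str, reference: str) -> bool:
    """Level (i): does the agent's TL formalization match the reference?
    Reduced here to whitespace-insensitive string equality; the real check
    is logical equivalence of the formulas."""
    return candidate.replace(" ", "") == reference.replace(" ", "")

def plan_check(plan: list, forbidden_actions: set) -> bool:
    """Level (ii): reject a high-level plan containing a forbidden action
    before anything is executed."""
    return not any(action in forbidden_actions for action in plan)

def trajectory_check(trajectories: list, bad_states: set) -> bool:
    """Level (iii): every state on every executed trajectory must avoid the
    violating states (an invariant check over the merged executions)."""
    return all(state not in bad_states for traj in trajectories for state in traj)

if __name__ == "__main__":
    print(semantic_check("G !(stove_on & unattended)", "G!(stove_on & unattended)"))  # True
    print(plan_check(["grab_pan", "turn_on_stove", "leave_room"], {"leave_room"}))    # False
    print(trajectory_check([["kitchen", "stove_on_attended"]], {"stove_on_unattended"}))  # True
```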
Semantic Level Safety
At the semantic level, we evaluate whether LLMs can correctly translate natural-language safety requirements into LTL-based formal semantics, providing the foundation for downstream action generation and evaluation.
| Model | Gen Succ | Overall: Syntax Err | Overall: Nonequiv | Overall: Equiv | State Invariance: Syntax Err | State Invariance: Nonequiv | State Invariance: Equiv | Ordering Constraints: Syntax Err | Ordering Constraints: Nonequiv | Ordering Constraints: Equiv |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-Source LLMs** | | | | | | | | | | |
| GPT-5 | 99.1 | 0.0 | 48.6 | 51.4 | 0.0 | 63.4 | 36.7 | 0.0 | 0.8 | 99.3 |
| Claude Sonnet 4 | 99.7 | 0.1 | 17.8 | 82.1 | 0.2 | 25.5 | 74.4 | 0.0 | 3.2 | 96.8 |
| Gemini 2.5 Flash | 99.7 | 2.0 | 32.1 | 66.0 | 3.0 | 46.8 | 50.2 | 0.0 | 4.1 | 95.9 |
| **Open-Source LLMs** | | | | | | | | | | |
| DeepSeek V3.1 | 93.3 | 0.0 | 15.6 | 84.5 | 0.0 | 21.1 | 78.9 | 0.0 | 5.1 | 94.9 |
| Qwen3 14B | 95.9 | 1.6 | 70.7 | 29.1 | 0.2 | 81.1 | 18.7 | 0.4 | 24.9 | 74.8 |
| Qwen3 8B | 0.0 | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Mistral 7B Instruct | 96.5 | 11.7 | 90.8 | 0.1 | 9.7 | 90.3 | 0.0 | 4.1 | 95.2 | 0.7 |
| Llama 3.1-8B | 67.1 | 17.3 | 84.3 | 1.2 | 14.0 | 86.9 | 0.1 | 15.1 | 76.6 | 8.2 |
Semantic-level safety evaluation results (all values are percentages). "Gen Succ" is the rate at which a model produced a TL formula at all; generated formulas are then classified as syntactically invalid (Syntax Err), well-formed but not equivalent to the reference formula (Nonequiv), or logically equivalent to it (Equiv), reported overall and separately for state-invariance and ordering-constraint requirements.
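To make the Equiv/Nonequiv distinction above concrete, here is a self-contained, brute-force sketch of an equivalence check for LTL over finite traces: it compares two formalizations of the same requirement on every trace up to a small length bound. This is an illustration only; an exact checker (presumably including SENTINEL's) would decide equivalence with automata-based tooling rather than trace enumeration, and the formula encoding, proposition names, and bound here are all assumptions of this sketch.

```python
from itertools import chain, combinations, product

# Formulas as nested tuples: ("ap", name), ("not", f), ("and", f, g),
# ("or", f, g), ("X", f), ("G", f), ("F", f), ("U", f, g).

def holds(f, trace, i=0):
    """Finite-trace LTL semantics: does formula f hold on trace at step i?
    A trace is a sequence of sets of atomic propositions that are true."""
    op = f[0]
    if op == "ap":  return f[1] in trace[i]
    if op == "not": return not holds(f[1], trace, i)
    if op == "and": return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == "or":  return holds(f[1], trace, i) or holds(f[2], trace, i)
    if op == "X":   return i + 1 < len(trace) and holds(f[1], trace, i + 1)
    if op == "G":   return all(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "F":   return any(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "U":   # f[1] must hold until f[2] holds, and f[2] must occur
        return any(holds(f[2], trace, j)
                   and all(holds(f[1], trace, k) for k in range(i, j))
                   for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op!r}")

def bounded_equivalent(f, g, atoms, max_len=4):
    """Compare f and g on every trace over `atoms` up to length max_len.
    Returns (True, None) or (False, counterexample_trace)."""
    states = [set(s) for s in chain.from_iterable(
        combinations(atoms, r) for r in range(len(atoms) + 1))]
    for length in range(1, max_len + 1):
        for trace in product(states, repeat=length):
            if holds(f, trace) != holds(g, trace):
                return False, list(trace)
    return True, None

if __name__ == "__main__":
    # "Never leave the stove on while unattended": two candidate formalizations.
    ref  = ("G", ("not", ("and", ("ap", "stove_on"), ("ap", "unattended"))))
    good = ("not", ("F", ("and", ("ap", "stove_on"), ("ap", "unattended"))))
    bad  = ("F", ("not", ("and", ("ap", "stove_on"), ("ap", "unattended"))))
    print(bounded_equivalent(ref, good, ["stove_on", "unattended"]))  # (True, None)
    print(bounded_equivalent(ref, bad, ["stove_on", "unattended"]))   # counterexample
```

The `bad` candidate only requires the unsafe condition to be absent at *some* step, so the checker finds a short counterexample trace where it diverges from the reference, which is exactly the kind of near-miss formalization the Nonequiv column counts.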
Plan Level Safety
Trajectory Level Safety
At the trajectory level, safety evaluation requires reasoning over the full embodied execution process, in which the simulator, low-level controllers, and LLM-generated action plans all interact. This setting introduces several sources of complexity beyond plan-level analysis: interactions among these components can produce unexpected safety violations during execution that must be carefully analyzed and mitigated.
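As a concrete illustration of the computation-tree idea from the abstract, the sketch below merges several execution trajectories into a prefix tree and checks a state invariant on every branch, reporting a counterexample prefix when the invariant fails. The state encoding, proposition names, and traversal are simplifications assumed for this sketch, not SENTINEL's actual data structures or specification format.

```python
# Illustrative sketch: merge execution trajectories into a prefix tree (the
# computation tree) and verify a state invariant on every branch. Shared
# prefixes are checked once instead of once per trajectory.

class Node:
    def __init__(self, state):
        self.state = state        # frozenset of atomic propositions
        self.children = {}        # child state -> Node

def merge(trajectories):
    """Build a prefix tree from trajectories given as lists of proposition sets."""
    root = Node(frozenset())
    for traj in trajectories:
        node = root
        for state in map(frozenset, traj):
            node = node.children.setdefault(state, Node(state))
    return root

def find_violation(root, bad):
    """Invariant check (AG-style): return the first path prefix on which the
    predicate `bad` holds at some state, or None if all branches are safe."""
    stack = [(root, [])]
    while stack:
        node, path = stack.pop()
        path = path + [node.state]
        if bad(node.state):
            return path                      # counterexample prefix
        stack.extend((child, path) for child in node.children.values())
    return None

if __name__ == "__main__":
    runs = [
        [{"holding_knife"}, {"holding_knife", "near_human"}],  # unsafe branch
        [{"holding_knife"}, {"knife_stowed"}],                 # safe branch
    ]
    unsafe = lambda s: {"holding_knife", "near_human"} <= s
    print(find_violation(merge(runs), unsafe))
```

Because the two runs share the `holding_knife` prefix, that state appears as a single tree node; the check then flags only the branch where the knife ends up near a human, returning the path prefix as a counterexample.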
BibTeX
@article{zhan2025sentinelmultilevelformalframework,
  title={SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents},
  author={Simon Sinong Zhan and Yao Liu and Philip Wang and Zinan Wang and Qineng Wang and Zhian Ruan and Xiangyu Shi and Xinyu Cao and Frank Yang and Kangrui Wang and Huajie Shao and Manling Li and Qi Zhu},
  journal={arXiv preprint arXiv:2510.12985},
  year={2025},
  url={https://arxiv.org/abs/2510.12985}
}