PAPER_TITLE

FIRST_AUTHOR_LAST, FIRST_AUTHOR_FIRST; SECOND_AUTHOR_LAST, SECOND_AUTHOR_FIRST

SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of Foundation Model-based Embodied Agents

Simon Sinong Zhan^*, Philip Wang^*, Yao Liu^*, Zinan Wang, Qineng Wang, Yiyan Peng, Zhian Ruan, Xiangyu Shi, Xinyu Cao, Frank Yang, Kangrui Wang, Zhenyang Ni, Huajie Shao, Manling Li, Qi Zhu

^*Indicates Equal Contribution

Code arXiv

Abstract

We present SENTINEL, a framework for formally evaluating the physical safety of foundation model (FM)-based embodied agents. SENTINEL is the first to provide multi-level safety evaluation across semantic interpretation, plan generation, and physical execution within a unified formal framework. Unlike prior methods that rely on heuristic rules or subjective FM judgments, SENTINEL grounds practical safety requirements in formal temporal logic (TL) semantics that can precisely specify state invariants, temporal dependencies, and timing constraints. It employs a multi-level verification pipeline where (i) at the semantic level, intuitive natural language safety requirements are formalized into TL formulas and the agent's understanding of these requirements is probed for alignment with the TL formulas; (ii) at the plan level, high-level action plans and subgoals generated by the agent are verified against the TL formulas to detect unsafe plans before execution; and (iii) at the trajectory level, multiple execution trajectories are merged into a computation tree and efficiently verified against physically-detailed TL specifications for a final safety check.

We apply SENTINEL in VirtualHome and AI2-THOR, and formally evaluate multiple FM-based embodied agents against diverse safety requirements. Our experiments show that by grounding physical safety in temporal logic and applying verification methods across multiple levels, SENTINEL provides a rigorous foundation for systematically evaluating the safety of FM-based embodied agents in simulation-based physical environments, and can effectively expose potential safety violations in interpreting, planning, and executing the tasks.

SENTINEL framework overview illustration

Multi-level, progressive formal safety evaluation pipeline of SENTINEL. The process begins with semantic-level evaluation, where intuitive natural language safety requirements are interpreted into formal safety semantics. These interpretations then guide plan-level generation of high-level action plans by LLM agents, which are verified through LTL checking. Finally, at the trajectory-level, physical execution trajectories are generated based on the action plans and simulated, and verified through CTL checking. At every level, evaluation is automated and grounded in temporal-logic–based safety semantics.

Semantic Level Safety

At the semantic level, we evaluate whether LLMs can correctly translate natural-language safety requirements into LTL-based formal semantics, providing the foundation for downstream action generation and evaluation.

Model	Gen Succ	Overall Performance			State Invariance			Ordering Constraints
Model	Succ	Syntax Err	Nonequiv	Equiv	Syntax Err	Nonequiv	Equiv	Syntax Err	Nonequiv	Equiv
Closed-Source LLMs
GPT-5	99.1	0.0	48.6	51.4	0.0	63.4	36.7	0.0	0.8	99.3
Claude Sonnet 4	99.7	0.1	17.8	82.1	0.2	25.5	74.4	0.0	3.2	96.8
Gemini 2.5 Flash	99.7	2.0	32.1	66.0	3.0	46.8	50.2	0.0	4.1	95.9
Open-Source LLMs
DeepSeek V3.1	93.3	0.0	15.6	84.5	0.0	21.1	78.9	0.0	5.1	94.9
Qwen3 14B	95.9	1.6	70.7	29.1	0.2	81.1	18.7	0.4	24.9	74.8
Qwen3 8B	0.0	--	--	--	--	--	--	--	--	--
Mistral 7B Instruct	96.5	11.7	90.8	0.1	9.7	90.3	0.0	4.1	95.2	0.7
Llama 3.1-8B	67.1	17.3	84.3	1.2	14.0	86.9	0.1	15.1	76.6	8.2

Semantic-level safety evaluation results.

Plan Level Safety

Plan-level evaluation is important for determining whether such interpretations translate into safe planning. We evaluate LLM-generated high-level plan on a subset of VirtualHome tasks under three prompting strategies: (i) no explicit safety mention, (ii) natural-language safety guidance, and (iii) formal LTL safety prompts.

Model	LTL Safety Prompt			NL Safety Prompt			No Safety Prompt
Model	Succ	Safe	Succ & Safe	Succ	Safe	Succ & Safe	Succ	Safe	Succ & Safe
Closed-Source LLMs
GPT-5	68.2	73.9	67.7	66.0	71.8	66.0	62.4	68.0	62.3
Claude Sonnet 4	85.5	91.2	84.6	84.6	90.6	83.7	77.3	82.2	76.4
Gemini 2.5 Flash	87.1	86.5	76.3	84.3	84.3	73.6	83.4	76.5	72.6
Open-Source LLMs
DeepSeek V3.1	89.5	96.5	88.8	88.9	94.2	84.1	89.1	83.4	78.2
Qwen3 14B	34.2	38.2	34.1	37.1	40.9	37.1	32.2	36.7	32.2
Qwen3 8B	0.3	0.0	0.0	0.0	0.0	0.0	0.2	0.0	0.0
Mistral 7B Instruct	13.0	3.9	0.9	13.7	4.7	1.2	13.9	4.1	1.5
Llama 3.1-8B	16.5	5.7	1.3	17.3	5.8	1.3	17.2	5.9	1.0

Plan-level safety evaluation results.

Trajectory Level Safety

At the trajectory level, safety evaluation requires reasoning over the full embodied execution process, where the simulator, low-level controllers, and FM-generated action plans all interact.

To empirically test the trajectory-level safety, we extend the evaluation to four LLMs that already demonstrate strong upstream planning ability (\Cref{tab:vh-semantic-and-plan}) as well as four VLMs on a comprehensive suite of safety-critical scenarios. These scenarios were adapted from a subset of tasks and scenes from the ALFRED dataset, emphasizing scenarios where physical interactions (e.g., handling liquids near electronics, operating appliances, object placement hazards) are directly safety-critical (see paper for more details).

Model	Trajectory-level
	Succ	Safe	Succ & Safe
LLMs
GPT-5	59.2	53.6	30.8
Claude-Sonnet-4	55.3	26.9	11.1
Gemini-2.5-Flash	58.0	30.7	15.2
DeepSeek-V3.1	54.7	34.9	15.0
VLMs
GPT-5	54.8	67.8	42.8
Gemini-2.5-Flash	51.5	55.1	28.8
GLM-4.6V	43.6	56.9	28.6
Gemma-3-27B-it	30.5	77.3	15.6

Trajectory-level safety evaluation results.

Select a task to view the corresponding trajectory safety violation

Pick a task above to load the trajectory safety violation example.

BibTeX

@misc{zhan2026sentinelmultilevelformalframework,
      title={SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of Foundation Model-based Embodied Agents}, 
      author={Simon Sinong Zhan and Yao Liu and Philip Wang and Zinan Wang and Qineng Wang and Yiyan Peng and Zhian Ruan and Xiangyu Shi and Xinyu Cao and Frank Yang and Kangrui Wang and Huajie Shao and Manling Li and Qi Zhu},
      year={2026},
      eprint={2510.12985},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.12985}, 
}