Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
Published in Scaling Environments for Agents (SEA) Workshop at NeurIPS 2025
This paper addresses the challenge of generating believable human-like behavior in web environments using Large Language Models (LLMs). While prior work has explored augmenting training data with LLM-synthesized rationales and supervised fine-tuning, performance remains bounded by the reasoning capabilities of the rationale-generating model. Shop-R1 introduces a novel reinforcement learning framework that decomposes human behavior simulation into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, the framework leverages internal model signals to guide reasoning in a self-supervised manner. For action prediction, it proposes a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment.
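To make the two-stage reward design concrete, here is a minimal illustrative sketch, not the paper's implementation. It assumes a rationale reward derived from the model's own token log-probabilities (one possible "internal model signal" for self-supervision), a hierarchical action reward that gives partial credit for the coarse action type and additional credit for exact attribute matches, and a hypothetical `difficulty_weight` that scales harder predictions; all names, weights, and signatures are invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Action:
    action_type: str   # e.g. "click", "search", "purchase", "terminate"
    attributes: dict   # e.g. {"product_id": "..."} or {"query": "..."}


def rationale_reward(token_logprobs):
    """Self-supervised rationale signal (illustrative): mean log-probability the
    policy assigns to its own rationale tokens, mapped into (0, 1]."""
    if not token_logprobs:
        return 0.0
    avg_lp = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_lp)  # higher model confidence -> larger reward


def action_reward(pred: Action, gold: Action, difficulty_weight: float = 1.0):
    """Hierarchical action reward (illustrative): partial credit for the coarse
    action type, full credit only when the fine-grained attributes also match.
    `difficulty_weight` is a hypothetical difficulty-aware scale so that easy,
    frequent actions cannot dominate the signal (a reward-hacking guard)."""
    if pred.action_type != gold.action_type:
        return 0.0
    type_reward = 0.3                                             # coarse level
    hits = sum(pred.attributes.get(k) == v for k, v in gold.attributes.items())
    attr_reward = 0.7 * (hits / max(len(gold.attributes), 1))     # fine level
    return difficulty_weight * (type_reward + attr_reward)


def total_reward(token_logprobs, pred, gold, difficulty_weight=1.0, alpha=0.2):
    """Combine the two stages; `alpha` balances rationale vs. action terms."""
    return alpha * rationale_reward(token_logprobs) + action_reward(
        pred, gold, difficulty_weight)


if __name__ == "__main__":
    gold = Action("click", {"product_id": "B0123"})
    pred = Action("click", {"product_id": "B0123"})
    print(total_reward([-0.2, -0.4, -0.1], pred, gold, difficulty_weight=1.5))
```

The actual reward terms, weights, and difficulty scaling in Shop-R1 are defined differently; this sketch only conveys the two-stage, hierarchical shape of the training signal.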
Authors: Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang
Citation
@inproceedings{zhang2025shop,
  title     = {Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning},
  author    = {Zhang, Yimeng and Wang, Tian and Gesi, Jiri and Wang, Ziyi and Lu, Yuxuan and Lin, Jiacheng and Zhan, Sinong and Gao, Vianne and Jiao, Ruochen and Liu, Junze and Qian, Kun and Tang, Yuxin and Xue, Ran and Zhang, Houyu and Cui, Qingjun and Guo, Yufan and Wang, Dakuo},
  booktitle = {Scaling Environments for Agents (SEA) Workshop at NeurIPS 2025},
  year      = {2025},
  url       = {https://arxiv.org/abs/2507.17842}
}