Large Language Models (LLMs) have recently demonstrated strong potential in generating "believable human-like" behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales.
In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulating real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals.
For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment.
Simulating human shopping behavior on Amazon with a structured action space
Given a shopping context and webpage state, predict the next user action together with a rationale explaining the decision-making process.
Type text into an input field and submit the form (equivalent to typing + pressing Enter)
```json
{
  "type": "type_and_submit",
  "name": "input_name",
  "text": "search_text"
}
```
Click on a button or clickable element identified by name
```json
{
  "type": "click",
  "name": "clickable_name"
}
```
Close the browser and terminate when unsatisfied with search results
```json
{
  "type": "terminate"
}
```
The model outputs a JSON object containing both the predicted action and a first-person rationale explaining the reasoning:
```json
{
  "rationale": "<rationale>",  // explains why this action is taken
  "action": {
    "type": "<type>",
    ...
  }
}
```
A two-stage reinforcement learning framework for human behavior simulation
Figure 1: Overview of the Shop-R1 framework
Leverages internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner, enabling the model to generate high-quality rationales.
Hierarchical reward structure with difficulty-aware scaling evaluates both high-level action types and fine-grained sub-action details (attributes and values).
Difficulty-aware scaling mechanism prevents reward hacking by rewarding outputs proportionally to their difficulty, ensuring robust learning.
Encourages the model to produce responses in a structured JSON format with two keys: rationale and action. A response earns a format reward of 0.5 if it is valid JSON; otherwise, it receives no format reward.
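As a concrete illustration, a check of this kind takes only a few lines. The sketch below is not the paper's released code; the function name and the key-validation details are our own assumptions.

```python
import json

REQUIRED_KEYS = {"rationale", "action"}  # the two keys named in the output schema above

def format_reward(response: str) -> float:
    """Minimal sketch of the format reward: 0.5 for valid JSON with both keys, else 0.0."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj.keys()):
        return 0.0
    return 0.5
```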
Quantifies the model's confidence in its generated rationale by computing the KL divergence between the model's predictive distribution and a uniform distribution:

$$ s(r) \;=\; \frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{|V|} U_i \,\log\frac{U_i}{p_{ij}} $$

where N is the number of generated tokens, p_ij is the predicted probability of token i at position j, and U_i = 1/|V| is the uniform distribution over the vocabulary V. Higher values indicate greater certainty.
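A minimal sketch of this score, assuming access to the per-token logits of the generated rationale (the tensor shape and function name are illustrative, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL(U || p) over the N generated positions.

    logits: (N, V) raw logits for each generated rationale token over a
    vocabulary of size V. Larger values mean sharper, more confident distributions.
    """
    N, V = logits.shape
    log_p = F.log_softmax(logits, dim=-1)          # log p_{ij}
    log_u = -torch.log(torch.tensor(float(V)))     # log U_i = log(1/|V|)
    kl_per_pos = (log_u - log_p).mean(dim=-1)      # (1/|V|) * sum_i (log U_i - log p_{ij})
    return kl_per_pos.mean()                       # average over the N positions
```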
Replaces brittle binary signals with a hierarchical scheme that credits both coarse-grained action types and fine-grained sub-actions. This densifies the reward landscape, lifts the agent out of 'no-reward' plateaus, and makes reward hacking uneconomical.
| Action Type | Type Reward | Sub-action Attribute Reward | Text-Similarity Value Reward |
|---|---|---|---|
| terminate | 0.3 | None | None |
| click | 0.3 | +0.2 (if name ≠ ∅) | +DARS × ROUGE-L(name) |
| type_and_submit | 0.3 | +0.1 (if name ≠ ∅), +0.1 (if text ≠ ∅) | +0.1 × ROUGE-L(name), +DARS × ROUGE-L(text) |
Table 4: Hierarchical reward schedule with Difficulty-Aware Reward Scaling (DARS).
Long-text sub-actions (e.g., button labels, search queries) are substantially harder since modern webpages can expose thousands of candidate elements. DARS amplifies rewards for correctly predicting these components, preventing reward hacking where the agent repeatedly selects trivial terminate actions to secure easy points.
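The schedule in Table 4 can be read as a small scoring function. The sketch below is an illustrative rendering of that table, not the released code: the DARS constant, the ROUGE-L helper, and the dictionary layout of actions are all assumptions.

```python
# DARS is the difficulty-aware scaling factor; its value is a tunable
# hyperparameter here (2.0 is an arbitrary illustrative choice).
DARS = 2.0

def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (basis of ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(pred: str, ref: str) -> float:
    """Token-level ROUGE-L F-measure; stands in for any ROUGE-L scorer."""
    p, r = pred.split(), ref.split()
    if not p or not r:
        return 0.0
    lcs = lcs_len(p, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def action_reward(pred: dict, label: dict) -> float:
    """Hierarchical action reward following the schedule in Table 4 (a sketch)."""
    if pred.get("type") != label.get("type"):
        return 0.0                                 # wrong coarse action type: no credit
    reward = 0.3                                   # action-type reward
    if label["type"] == "click":
        if pred.get("name"):
            reward += 0.2                          # sub-action attribute present
            reward += DARS * rouge_l(pred["name"], label["name"])
    elif label["type"] == "type_and_submit":
        if pred.get("name"):
            reward += 0.1
            reward += 0.1 * rouge_l(pred["name"], label["name"])
        if pred.get("text"):
            reward += 0.1
            reward += DARS * rouge_l(pred["text"], label["text"])
    return reward                                  # terminate: type reward only
```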
Shop-R1 maximizes the combined reward signal while regularizing with a KL divergence to a reference policy:

$$ \max_{\pi_\theta}\;\mathbb{E}\big[\, r_{\text{format}} + \alpha\, s(r) + v(a) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) $$

where r_format is the format reward, v(a) is the action reward, s(r) is the self-certainty score, and α, β are hyperparameters controlling the weight of the rationale reward and the strength of the KL regularization, respectively.
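Putting the pieces together, the scalar reward fed to the RL optimizer might be assembled as below, reusing the sketches above. The placement of α on the self-certainty term is our reading of the objective, its default value is arbitrary, and the β-weighted KL term is typically applied by the RL trainer as a per-token penalty against the frozen reference policy rather than inside the reward.

```python
def total_reward(response: str, pred_action: dict, label_action: dict,
                 rationale_logits, alpha: float = 0.1) -> float:
    """Combine format, rationale (self-certainty), and hierarchical action rewards.

    A sketch built on the helper functions sketched above; the KL regularizer
    (weighted by beta) is handled separately by the RL trainer.
    """
    r_format = format_reward(response)                       # 0.5 or 0.0
    r_rationale = alpha * float(self_certainty(rationale_logits))
    r_action = action_reward(pred_action, label_action)      # Table 4 schedule
    return r_format + r_rationale + r_action
```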
Simulation accuracy under different fine-tuning methods across models of different sizes
| Model | Settings | Exact Action Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | Zero-shot prompting | 0.32% | 15.33% | 16.15% |
| | RL (Binary) | 1.01% | 6.17% | 9.92% |
| | SFT | 16.76% | 22.25% | 24.52% |
| | SFT + RL (Binary) | 16.55% | 23.74% | 28.07% |
| | Shop-R1 (Ours) | 27.72% | 36.40% | 31.28% |
| Qwen-2.5-1.5B-Instruct | Zero-shot prompting | 0.53% | 3.94% | 6.16% |
| | SFT | 10.86% | 23.58% | 29.02% |
| | Shop-R1 (Ours) | 24.11% | 34.54% | 29.19% |
| Qwen-2.5-0.5B-Instruct | Zero-shot prompting | 6.76% | 12.88% | 15.55% |
| | SFT | 9.90% | 17.72% | 21.61% |
| | Shop-R1 (Ours) | 27.72% | 31.83% | 21.20% |
Table 1: Simulation accuracy under different fine-tuning methods, reported with three complementary metrics: exact action accuracy (all sub-fields must match the label), action type accuracy, and action type F1, to disentangle mistakes in coarse intent classification from those in long-text arguments.
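For reference, the three metrics could be computed as in the sketch below; the macro averaging for F1 and the handling of malformed predictions are assumptions, not necessarily the paper's exact protocol.

```python
from sklearn.metrics import f1_score

def evaluate(preds: list[dict], labels: list[dict]) -> dict:
    """Exact action accuracy, action type accuracy, and action type F1 (a sketch)."""
    exact_acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)   # all sub-fields match
    pred_types = [p.get("type") or "invalid" for p in preds]               # map missing types
    label_types = [l["type"] for l in labels]
    type_acc = sum(pt == lt for pt, lt in zip(pred_types, label_types)) / len(labels)
    type_f1 = f1_score(label_types, pred_types, average="macro")
    return {"exact_action_acc": exact_acc,
            "action_type_acc": type_acc,
            "action_type_f1": type_f1}
```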
| Model | SFT | Format Reward | Rationale Reward | Reward Scale | Action Reward | Exact Action Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | ✗ | ✓ | ✓ | ✓ | hierarchical | 4.63% | 36.56% | 21.92% |
| | ✓ | ✗ | ✓ | ✓ | hierarchical | 2.87% | 3.19% | 5.04% |
| | ✓ | ✓ | ✗ | ✓ | hierarchical | 26.93% | 37.25% | 33.74% |
| | ✓ | ✓ | ✓ | ✗ | hierarchical | 27.83% | 27.20% | 11.70% |
| | ✓ | ✓ | ✓ | ✓ | binary | 27.04% | 27.46% | 12.11% |
| | ✓ | ✓ | ✓ | ✓ | hierarchical | 27.72% | 36.40% | 31.28% |
Table 3: Ablation study on different training component configurations, evaluated by exact match action accuracy and action type accuracy / F1.
Yimeng Zhang1,2
Tian Wang2
Jiri Gesi2
Ziyi Wang3
Yuxuan Lu3
Jiacheng Lin4
Sinong Zhan5
Vianne Gao2
Ruochen Jiao2
Junze Liu2
Kun Qian2
Yuxin Tang2
Ran Xue2
Houyu Zhang2
Qingjun Cui2
Yufan Guo2
Dakuo Wang3
If you find our work useful, please cite us:
```bibtex
@inproceedings{zhang2026shopr1,
  title={Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning},
  author={Zhang, Yimeng and Wang, Tian and Gesi, Jiri and Wang, Ziyi and Lu, Yuxuan and Lin, Jiacheng and Zhan, Sinong and Gao, Vianne and Jiao, Ruochen and Liu, Junze and Qian, Kun and Tang, Yuxin and Xue, Ran and Zhang, Houyu and Cui, Qingjun and Guo, Yufan and Wang, Dakuo},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```