Large Language Models (LLMs) have recently demonstrated strong potential in generating "believable human-like" behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales.
In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulating real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals.
For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment.
Simulating human shopping behavior on Amazon with a structured action space
Given a shopping context and webpage state, predict the next user action together with a rationale explaining the decision-making process.
Type text into an input field and submit the form (equivalent to typing + pressing Enter)
```json
{
  "type": "type_and_submit",
  "name": "input_name",
  "text": "search_text"
}
```
Click on a button or clickable element identified by name
```json
{
  "type": "click",
  "name": "clickable_name"
}
```
Close the browser and terminate when unsatisfied with search results
```json
{
  "type": "terminate"
}
```
The model outputs a JSON object containing both the predicted action and a first-person rationale explaining the reasoning:
```json
{
  "rationale": "<rationale>",  // explains why this action is taken
  "action": {
    "type": "<type>",
    ...
  }
}
```
A two-stage reinforcement learning framework for human behavior simulation
Figure 1: Overview of the Shop-R1 framework
Leverages internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner, enabling the model to generate high-quality rationales.
Hierarchical reward structure with difficulty-aware scaling evaluates both high-level action types and fine-grained sub-action details (attributes and values).
Difficulty-aware scaling mechanism prevents reward hacking by rewarding outputs proportionally to their difficulty, ensuring robust learning.
Encourages the model to produce responses in a structured JSON format with two keys: rationale and action. A response earns a format reward of 0.5 if it is valid JSON; otherwise, it receives no format reward.
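As a concrete illustration, a check of this kind takes only a few lines. The sketch below is not the paper's released code; the function name and the key-validation details are our own assumptions.

```python
import json

REQUIRED_KEYS = {"rationale", "action"}  # the two keys named in the output schema above

def format_reward(response: str) -> float:
    """Minimal sketch of the format reward: 0.5 for valid JSON with both keys, else 0.0."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj.keys()):
        return 0.0
    return 0.5
```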
Quantifies the model's confidence in its generated rationale by computing the KL divergence between the model's predictive distribution and a uniform distribution:

$$ s(r) \;=\; \frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{|V|} U_i \,\log\frac{U_i}{p_{ij}} $$

where N is the number of generated tokens, p_ij is the predicted probability of token i at position j, and U_i = 1/|V| is the uniform distribution over the vocabulary V. Higher values indicate greater certainty.
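A minimal sketch of this score, assuming access to the per-token logits of the generated rationale (the tensor shape and function name are illustrative, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL(U || p) over the N generated positions.

    logits: (N, V) raw logits for each generated rationale token over a
    vocabulary of size V. Larger values mean sharper, more confident distributions.
    """
    N, V = logits.shape
    log_p = F.log_softmax(logits, dim=-1)          # log p_{ij}
    log_u = -torch.log(torch.tensor(float(V)))     # log U_i = log(1/|V|)
    kl_per_pos = (log_u - log_p).mean(dim=-1)      # (1/|V|) * sum_i (log U_i - log p_{ij})
    return kl_per_pos.mean()                       # average over the N positions
```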
Replaces brittle binary signals with a hierarchical scheme that credits both coarse-grained action types and fine-grained sub-actions. This densifies the reward landscape, lifts the agent out of 'no-reward' plateaus, and makes reward hacking uneconomical.
| Action Type | Type Reward | Sub-action Attribute Reward | Text-Similarity Value Reward |
|---|---|---|---|
| terminate | 0.3 | None | None |
| click | 0.3 | +0.2 (if name ≠ ∅) | +DARS × ROUGE-L(name) |
| type_and_submit | 0.3 | +0.1 (if name ≠ ∅), +0.1 (if text ≠ ∅) | +0.1 × ROUGE-L(name), +DARS × ROUGE-L(text) |
Table 4: Hierarchical reward schedule with Difficulty-Aware Reward Scaling (DARS).
Long-text sub-actions (e.g., button labels, search queries) are substantially harder since modern webpages can expose thousands of candidate elements. DARS amplifies rewards for correctly predicting these components, preventing reward hacking where the agent repeatedly selects trivial terminate actions to secure easy points.
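The schedule in Table 4 can be read as a small scoring function. The sketch below is an illustrative rendering of that table, not the released code: the DARS constant, the ROUGE-L helper, and the dictionary layout of actions are all assumptions.

```python
# DARS is the difficulty-aware scaling factor; its value is a tunable
# hyperparameter here (2.0 is an arbitrary illustrative choice).
DARS = 2.0

def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (basis of ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(pred: str, ref: str) -> float:
    """Token-level ROUGE-L F-measure; stands in for any ROUGE-L scorer."""
    p, r = pred.split(), ref.split()
    if not p or not r:
        return 0.0
    lcs = lcs_len(p, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def action_reward(pred: dict, label: dict) -> float:
    """Hierarchical action reward following the schedule in Table 4 (a sketch)."""
    if pred.get("type") != label.get("type"):
        return 0.0                                 # wrong coarse action type: no credit
    reward = 0.3                                   # action-type reward
    if label["type"] == "click":
        if pred.get("name"):
            reward += 0.2                          # sub-action attribute present
            reward += DARS * rouge_l(pred["name"], label["name"])
    elif label["type"] == "type_and_submit":
        if pred.get("name"):
            reward += 0.1
            reward += 0.1 * rouge_l(pred["name"], label["name"])
        if pred.get("text"):
            reward += 0.1
            reward += DARS * rouge_l(pred["text"], label["text"])
    return reward                                  # terminate: type reward only
```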
Shop-R1 maximizes the combined reward signal while regularizing with a KL divergence to a reference policy:

$$ \max_{\pi_\theta}\;\mathbb{E}\big[\, r_{\text{format}} + \alpha\, s(r) + v(a) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) $$

where r_format is the format reward, v(a) is the action reward, s(r) is the self-certainty score, and α, β are hyperparameters controlling the weight of the rationale reward and the strength of the KL regularization, respectively.
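Putting the pieces together, the scalar reward fed to the RL optimizer might be assembled as below, reusing the sketches above. The placement of α on the self-certainty term is our reading of the objective, its default value is arbitrary, and the β-weighted KL term is typically applied by the RL trainer as a per-token penalty against the frozen reference policy rather than inside the reward.

```python
def total_reward(response: str, pred_action: dict, label_action: dict,
                 rationale_logits, alpha: float = 0.1) -> float:
    """Combine format, rationale (self-certainty), and hierarchical action rewards.

    A sketch built on the helper functions sketched above; the KL regularizer
    (weighted by beta) is handled separately by the RL trainer.
    """
    r_format = format_reward(response)                       # 0.5 or 0.0
    r_rationale = alpha * float(self_certainty(rationale_logits))
    r_action = action_reward(pred_action, label_action)      # Table 4 schedule
    return r_format + r_rationale + r_action
```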
Simulation accuracy under different fine-tuning methods across models of different sizes
| Model | Settings | Exact Action Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | Zero-shot prompting | 0.32% | 15.33% | 16.15% |
| | RL (Binary) | 1.01% | 6.17% | 9.92% |
| | SFT | 16.76% | 22.25% | 24.52% |
| | SFT + RL (Binary) | 16.55% | 23.74% | 28.07% |
| | Shop-R1 (Ours) | 27.72% | 36.40% | 31.28% |
| Qwen-2.5-1.5B-Instruct | Zero-shot prompting | 0.53% | 3.94% | 6.16% |
| | SFT | 10.86% | 23.58% | 29.02% |
| | Shop-R1 (Ours) | 24.11% | 34.54% | 29.19% |
| Qwen-2.5-0.5B-Instruct | Zero-shot prompting | 6.76% | 12.88% | 15.55% |
| | SFT | 9.90% | 17.72% | 21.61% |
| | Shop-R1 (Ours) | 27.72% | 31.83% | 21.20% |
Table 1: Simulation accuracy under different fine-tuning methods, reported with three complementary metrics: exact action accuracy (all sub-fields must match the label), action type accuracy, and action type F1, to disentangle mistakes in coarse intent classification from those in long-text arguments.
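For reference, the three metrics could be computed as in the sketch below; the macro averaging for F1 and the handling of malformed predictions are assumptions, not necessarily the paper's exact protocol.

```python
from sklearn.metrics import f1_score

def evaluate(preds: list[dict], labels: list[dict]) -> dict:
    """Exact action accuracy, action type accuracy, and action type F1 (a sketch)."""
    exact_acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)   # all sub-fields match
    pred_types = [p.get("type") or "invalid" for p in preds]               # map missing types
    label_types = [l["type"] for l in labels]
    type_acc = sum(pt == lt for pt, lt in zip(pred_types, label_types)) / len(labels)
    type_f1 = f1_score(label_types, pred_types, average="macro")
    return {"exact_action_acc": exact_acc,
            "action_type_acc": type_acc,
            "action_type_f1": type_f1}
```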
| Model | SFT | Format Reward | Rationale Reward | Reward Scale | Action Reward | Exact Action Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | ✗ | ✓ | ✓ | ✓ | hierarchical | 4.63% | 36.56% | 21.92% |
| | ✓ | ✗ | ✓ | ✓ | hierarchical | 2.87% | 3.19% | 5.04% |
| | ✓ | ✓ | ✗ | ✓ | hierarchical | 26.93% | 37.25% | 33.74% |
| | ✓ | ✓ | ✓ | ✗ | hierarchical | 27.83% | 27.20% | 11.70% |
| | ✓ | ✓ | ✓ | ✓ | binary | 27.04% | 27.46% | 12.11% |
| | ✓ | ✓ | ✓ | ✓ | hierarchical | 27.72% | 36.40% | 31.28% |
Table 3: Ablation study on different training component configurations, evaluated by exact match action accuracy and action type accuracy / F1.
Yimeng Zhang1,2
Tian Wang2
Jiri Gesi2
Ziyi Wang3
Yuxuan Lu3
Jiacheng Lin4
Sinong Zhan5
Vianne Gao2
Ruochen Jiao2
Junze Liu2
Kun Qian2
Yuxin Tang2
Ran Xue2
Houyu Zhang2
Qingjun Cui2
Yufan Guo2
Dakuo Wang3
If you find our work useful, please cite us:
```bibtex
@inproceedings{zhang2026shopr1,
  title={Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning},
  author={Zhang, Yimeng and Wang, Tian and Gesi, Jiri and Wang, Ziyi and Lu, Yuxuan and Lin, Jiacheng and Zhan, Sinong and Gao, Vianne and Jiao, Ruochen and Liu, Junze and Qian, Kun and Tang, Yuxin and Xue, Ran and Zhang, Houyu and Cui, Qingjun and Guo, Yufan and Wang, Dakuo},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```