ICLR 2026

Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

2-Stage RL Framework · Hierarchical Reward Design

Paper · Code (Under Legal Review) · Dataset (Under Legal Review)

Abstract

Large Language Models (LLMs) have recently demonstrated strong potential in generating "believable human-like" behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales.

In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals.

For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment.

Task Setup

Simulating human shopping behavior on Amazon with a structured action space

Objective

Given a shopping context and webpage state, predict the next user action together with a rationale explaining the decision-making process.

Action Space

type_and_submit

Type text into an input field and submit the form (equivalent to typing + pressing Enter)

{
  "type": "type_and_submit",
  "name": "input_name",
  "text": "search_text"
}
click

Click on a button or clickable element identified by name

{
  "type": "click",
  "name": "clickable_name"
}
terminate

Close the browser and terminate the session when unsatisfied with the search results

{
  "type": "terminate"
}

Output Format

The model outputs a JSON object containing both the predicted action and a first-person rationale explaining the reasoning:

{
  "rationale": "<rationale>",  // explains why making this action
  "action": {
    "type": "<type>",
    ...
  }
}

Method Overview

A two-stage reinforcement learning framework for human behavior simulation

Figure 1: Overview of the Shop-R1 framework

Stage 1: Rationale Generation

Leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner, enabling the model to generate high-quality rationales.

Stage 2: Action Prediction

Hierarchical reward structure with difficulty-aware scaling evaluates both high-level action types and fine-grained sub-action details (attributes and values).

Reward Hacking Prevention

Difficulty-aware scaling prevents reward hacking by rewarding outputs in proportion to their difficulty, so the agent cannot accumulate reward by defaulting to trivial, easy-to-predict actions.

Reward Design Details

Binary Format Reward

Encourages the model to produce responses in a structured JSON format with two keys: rationale and action. A response earns a format reward of 0.5 if it is valid JSON; otherwise, it receives no format reward.
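
A minimal sketch of this check, assuming the reward is granted only when the output parses as a JSON object containing the two expected keys (the function name and the strictness of the key check are our own choices, not the authors' code):

import json

def format_reward(response: str) -> float:
    # Return 0.5 if the response parses as a JSON object with the two
    # expected keys, and 0.0 otherwise (illustrative sketch only).
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if isinstance(obj, dict) and "rationale" in obj and "action" in obj:
        return 0.5
    return 0.0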

Self-Certainty Score (Rationale Reward)

Quantifies the model's confidence in its generated rationale by computing the KL divergence between the model's predictive distribution and a uniform distribution:

s(r \mid q) = \frac{1}{N\,|V|} \sum_{j=1}^{N} \sum_{i=1}^{|V|} p_{ij} \log \frac{p_{ij}}{U_i}

where N is the number of tokens, p_{ij} is the predicted probability of token i at position j, and U_i = 1/|V| is the uniform distribution. Higher values indicate greater certainty.
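
For concreteness, this is one way the score could be computed from per-token logits over the generated rationale (a sketch of the formula above; the tensor shape and function name are our own):

import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    # logits: [N, V] logits for the N rationale tokens over a vocabulary of size V.
    N, V = logits.shape
    log_p = F.log_softmax(logits, dim=-1)       # log p_ij
    p = log_p.exp()                             # p_ij
    log_u = -torch.log(torch.tensor(float(V)))  # log U_i = log(1/|V|)
    # (1 / (N * |V|)) * sum_j sum_i p_ij * log(p_ij / U_i)
    return (p * (log_p - log_u)).sum() / (N * V)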

Hierarchical Action Reward

Replaces brittle binary signals with a hierarchical scheme that credits both coarse-grained action types and fine-grained sub-actions. This densifies the reward landscape, lifts the agent out of 'no-reward' plateaus, and makes reward hacking uneconomical.

| Action Type | Type Reward | Sub-action Attribute Reward | Text-Similarity Value Reward |
|---|---|---|---|
| terminate | 0.3 | None | None |
| click | 0.3 | +0.2 (if name ≠ ∅) | +DARS × ROUGE-L(name) |
| type_and_submit | 0.3 | +0.1 (if name ≠ ∅), +0.1 (if text ≠ ∅) | +0.1 × ROUGE-L(name), +DARS × ROUGE-L(text) |
Table 4: Hierarchical reward schedule with Difficulty-Aware Reward Scaling (DARS).

Difficulty-Aware Reward Scaling (DARS)

Long-text sub-actions (e.g., button labels, search queries) are substantially harder since modern webpages can expose thousands of candidate elements. DARS amplifies rewards for correctly predicting these components, preventing reward hacking where the agent repeatedly selects trivial terminate actions to secure easy points.
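
One possible implementation of the schedule in Table 4, assuming the type reward is granted only when the predicted action type matches the label and that rouge_l returns a similarity in [0, 1] (the rouge_l helper and the dars argument are placeholders, not the authors' code):

def action_reward(pred: dict, label: dict, dars: float, rouge_l) -> float:
    # Hierarchical reward sketch following Table 4.
    reward = 0.0
    if pred.get("type") != label.get("type"):
        return reward                      # wrong action type: no action reward
    reward += 0.3                          # coarse action-type reward

    if label["type"] == "click":
        if pred.get("name"):
            reward += 0.2                                           # attribute present
            reward += dars * rouge_l(pred["name"], label["name"])   # DARS-scaled value match
    elif label["type"] == "type_and_submit":
        if pred.get("name"):
            reward += 0.1
            reward += 0.1 * rouge_l(pred["name"], label["name"])
        if pred.get("text"):
            reward += 0.1
            reward += dars * rouge_l(pred["text"], label["text"])
    # "terminate" earns only the 0.3 type reward
    return reward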

Training Objective

Shop-R1 maximizes the combined reward signal while regularizing with KL divergence to a reference policy:

\max_{\pi_\theta} \; \mathbb{E}_{(r,a)\sim \pi_\theta(\cdot \mid q)} \Big[\, v(a) + \alpha\, s(r) - \beta\, \mathrm{KL}\!\big(\pi_\theta(r,a \mid q)\,\|\,\pi_{\mathrm{ref}}(r,a \mid q)\big) \Big]

where v(a) is the hierarchical action reward, s(r) is the self-certainty score of the rationale, α weights the rationale reward, and β controls the strength of the KL regularization toward the reference policy π_ref.
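
Putting the pieces together, the per-rollout scalar reward could be assembled roughly as below; in common PPO/GRPO-style trainers the β-weighted KL term is typically added by the trainer itself rather than inside the reward function. This is a sketch that reuses the helpers from the earlier snippets; self_cert is the precomputed s(r).

import json

def total_reward(response: str, label: dict, self_cert: float,
                 alpha: float, dars: float, rouge_l) -> float:
    # Combine the format reward, hierarchical action reward v(a), and alpha * s(r).
    # The KL penalty toward the reference policy is assumed to be handled by the RL trainer.
    reward = format_reward(response)       # 0.5 if valid structured JSON, else 0.0
    try:
        pred = json.loads(response).get("action", {})
    except (json.JSONDecodeError, AttributeError):
        return reward                      # malformed output: format reward only
    reward += action_reward(pred, label, dars, rouge_l)
    reward += alpha * self_cert
    return reward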

Results

Simulation accuracy under different fine-tuning methods across models of different sizes

| Model | Settings | Exact Action Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | Zero-shot prompting | 0.32% | 15.33% | 16.15% |
| | RL (Binary) | 1.01% | 6.17% | 9.92% |
| | SFT | 16.76% | 22.25% | 24.52% |
| | SFT + RL (Binary) | 16.55% | 23.74% | 28.07% |
| | Shop-R1 (Ours) | 27.72% | 36.40% | 31.28% |
| Qwen-2.5-1.5B-Instruct | Zero-shot prompting | 0.53% | 3.94% | 6.16% |
| | SFT | 10.86% | 23.58% | 29.02% |
| | Shop-R1 (Ours) | 24.11% | 34.54% | 29.19% |
| Qwen-2.5-0.5B-Instruct | Zero-shot prompting | 6.76% | 12.88% | 15.55% |
| | SFT | 9.90% | 17.72% | 21.61% |
| | Shop-R1 (Ours) | 27.72% | 31.83% | 21.20% |

Table 1: Simulation accuracy under different fine-tuning methods. We report three complementary metrics: exact action accuracy (all sub-fields must match the label), action type accuracy, and action type F1, which disentangle mistakes in coarse intent classification from those in long-text arguments.

| Model | SFT | Format Reward | Rationale Reward | Reward Scale | Action Reward | Exact Action Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | | | | | hierarchical | 4.63% | 36.56% | 21.92% |
| | | | | | hierarchical | 2.87% | 3.19% | 5.04% |
| | | | | | hierarchical | 26.93% | 37.25% | 33.74% |
| | | | | | hierarchical | 27.83% | 27.20% | 11.70% |
| | | | | | binary | 27.04% | 27.46% | 12.11% |
| | | | | | hierarchical | 27.72% | 36.40% | 31.28% |

Table 3: Ablation study on different training component configurations, evaluated by exact match action accuracy and action type accuracy / F1.

Team

Yimeng Zhang¹, Tian Wang², Jiri Gesi², Ziyi Wang³, Yuxuan Lu³, Jiacheng Lin⁴, Sinong Zhan⁵, Vianne Gao², Ruochen Jiao², Junze Liu², Kun Qian², Yuxin Tang², Ran Xue², Houyu Zhang², Qingjun Cui², Yufan Guo², Dakuo Wang³

¹Michigan State University  ²Amazon  ³Northeastern University  ⁴UIUC  ⁵Northwestern University

Citation

If you find our work useful, please cite us:

BibTeX
@inproceedings{zhang2026shopr1,
  title={Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning},
  author={Zhang, Yimeng and Wang, Tian and Gesi, Jiri and Wang, Ziyi and Lu, Yuxuan and Lin, Jiacheng and Zhan, Sinong and Gao, Vianne and Jiao, Ruochen and Liu, Junze and Qian, Kun and Tang, Yuxin and Xue, Ran and Zhang, Houyu and Cui, Qingjun and Guo, Yufan and Wang, Dakuo},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}