Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Qing Jiang1,2*, Xingyu Chen3*, Zhaoyang Zeng1, Junzhi Yu3, Lei Zhang1

1International Digital Economy Academy (IDEA)
2South China University of Technology
3Peking University
* Equal contribution † Corresponding author

Live Reasoning Demo

Watch Rex-Thinker analyze and reason step-by-step


Grounded Object Referring

Grounded Object Referring is the task of locating objects in an image based on natural language descriptions—for example, "the man in a blue shirt" or "the dog sitting under the table." Unlike standard object detection, this task requires a deeper understanding of both visual details and linguistic nuances, including attributes, spatial relationships, and interactions.

What makes it grounded is the focus on explainable and verifiable reasoning. Instead of guessing the object from the description, a grounded model reasons step-by-step—just like a human would. It first identifies possible candidates, then examines each one carefully based on the description, and finally selects the best match. This process reduces hallucination (predicting non-existent objects) and makes the model's decisions transparent and trustworthy.

Our model, Rex-Thinker, tackles this task using Chain-of-Thought (CoT) reasoning, breaking down each decision into clear steps: Planning, Action, and Summarization. Combined with our HumanRef-CoT dataset and fine-tuning methods, this leads to state-of-the-art accuracy and interpretability.
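In schematic form, the plan → evaluate → summarize loop can be pictured as follows. The Python sketch below is purely illustrative: the Candidate type and the plan and check callables are assumptions made for exposition, not the actual Rex-Thinker implementation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    index: int                               # box-level hint index shown to the model
    box: tuple[float, float, float, float]   # (x1, y1, x2, y2) candidate region

def refer(expression: str,
          candidates: list[Candidate],
          plan: Callable[[str], list[str]],
          check: Callable[[str, Candidate], bool]) -> list[Candidate]:
    # Planning: decompose the referring expression into subgoals,
    # e.g. "the man in a blue shirt" -> ["is a man", "wears a blue shirt"].
    subgoals = plan(expression)

    # Action: evaluate every candidate against every subgoal, step by step.
    passed = {c.index: all(check(goal, c) for goal in subgoals) for c in candidates}

    # Summarization: keep only candidates that satisfy all subgoals; an empty
    # result means "no matching object" rather than a hallucinated box.
    return [c for c in candidates if passed[c.index]]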

Grounded Object Referring Illustration

Chain-of-Thought object referring through planning, action, and summarization.

HumanRef-CoT Dataset

HumanRef-CoT Dataset Engine

Overview of the proposed CoT reasoning referring data engine.

HumanRef-CoT is a large-scale dataset designed to support Chain-of-Thought (CoT) reasoning for the Referring Expression Comprehension (REC) task. It includes 90,824 high-quality annotations generated using GPT-4o, based on the HumanRef dataset—a benchmark focusing on complex, human-centric referring expressions.

Each annotation follows a structured CoT reasoning format with three stages:

  • Planning: Decomposes the referring expression into subgoals.
  • Action: Evaluates each candidate region step-by-step using box-level visual hints.
  • Summarization: Aggregates the reasoning results to produce the final prediction.
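For illustration only, a single HumanRef-CoT record might be laid out along the following lines; the field names below are assumptions made for exposition, not the dataset's actual schema.

# Hypothetical shape of one HumanRef-CoT annotation (field names are illustrative).
example_annotation = {
    "image_id": "humanref_000123",
    "referring_expression": "the man in a blue shirt",
    "candidate_boxes": [[34, 50, 180, 420], [200, 48, 350, 430]],   # box-level hints
    "cot": {
        "planning": ["find all men in the image", "check who wears a blue shirt"],
        "action": [
            {"box_hint": 1, "reasoning": "a man, but his shirt is red", "match": False},
            {"box_hint": 2, "reasoning": "a man wearing a blue shirt", "match": True},
        ],
        "summarization": "Only box 2 satisfies the expression.",
    },
    "answer": [[200, 48, 350, 430]],   # ground-truth box(es) for this expression
}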

To generate these annotations, we use GPT-4o with carefully designed in-context prompts, grounded visual markers, and strict answer verification. Only samples where the model's final prediction matches the ground truth are retained.
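The retention rule can be thought of as a box-matching check between GPT-4o's final prediction and the ground truth. The snippet below is a sketch under that assumption; the IoU threshold and exact matching criterion are placeholders rather than the verification procedure we actually use.

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_sample(predicted, ground_truth, thr=0.5):
    # Retain the sample only if predicted and ground-truth boxes match one-to-one.
    if len(predicted) != len(ground_truth):
        return False
    return all(any(iou(p, g) >= thr for p in predicted) for g in ground_truth)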

HumanRef-CoT enables interpretable and verifiable reasoning supervision and serves as the backbone for both supervised fine-tuning and reinforcement learning (GRPO) stages in our training pipeline.

Two-Stage Training with SFT and GRPO

Two-Stage Training Pipeline

Overview of the Rex-Thinker architecture and our two-stage training pipeline.

To effectively teach the model how to reason step-by-step, we adopt a two-stage training pipeline:

Supervised Fine-Tuning (SFT)

We start by fine-tuning the model on the HumanRef-CoT dataset using token-level supervision. This stage teaches the model to generate structured reasoning traces—including planning, action, and summarization—guided by ground-truth annotations. The goal is to instill the model with a strong foundation in Chain-of-Thought reasoning grounded in visual evidence.
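In spirit, this stage is standard next-token prediction over the gold reasoning trace. The sketch below assumes a Hugging Face-style causal LM interface and a label mask that excludes prompt and image tokens; both are assumptions for illustration, not our exact training code.

import torch.nn.functional as F

def sft_step(model, optimizer, batch):
    # batch["input_ids"]: prompt + visual tokens + gold CoT trace, shape (B, T)
    # batch["labels"]:    same shape, with non-trace positions set to -100 so the
    #                     loss covers only the reasoning trace and final answer.
    logits = model(batch["input_ids"]).logits   # (B, T, V), HF-style output assumed
    loss = F.cross_entropy(                     # shift: position t predicts token t+1
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()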

GRPO-Based Reinforcement Learning

After SFT, we apply Group Relative Policy Optimization (GRPO) to further improve performance. GRPO enables the model to explore alternative reasoning paths by sampling multiple responses and reinforcing those that achieve higher task-level rewards. This enhances the model's generalization, robustness, and ability to handle ambiguous or out-of-distribution cases, while also reducing hallucinations.
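Concretely, GRPO samples a group of responses for the same query, scores each with a task-level reward (for instance, how well the predicted boxes match the ground truth plus a format check), and standardizes the rewards within the group to obtain advantages, without a separate value network. The snippet below sketches only that normalization step and leaves out the clipped policy-gradient objective and KL regularization; the reward definition mentioned here is an assumption for illustration.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (G,), one task-level reward per sampled response for one query.
    # Each advantage is the reward standardized within its group, so responses that
    # beat the group mean are reinforced and the rest are discouraged.
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

# Example: four sampled reasoning paths for the same referring expression.
advantages = group_relative_advantages(torch.tensor([0.9, 0.2, 0.6, 0.1]))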

Together, these two stages result in a model that is not only accurate, but also interpretable and trustworthy in its reasoning process.

More Examples

Explore additional examples of Rex-Thinker's reasoning capabilities across diverse scenarios.

  • Please detect the youngest one.
  • Please detect the two people holding hands.
  • Please detect all non-real persons.
  • Please detect the one with rich Vitamin C.
  • Please detect the leader.
  • Please detect the potato, and tell me how to pick and place it into the container using the robot gripper.
  • Please detect apple with disease, and tell me what kind of disease it may be.
  • Please detect ripe tomato.
  • Please detect damaged container.
  • Please detect athletes with an even number of number plates.
  • Please detect person holding two footballs.
  • Please detect car in a crash.
  • Please detect Hot dogs on the grill.
  • Please detect all person to the right of the person wearing a yellow tie.
  • Please detect person wearing cloth that has letter A.

Citation

@misc{jiang2025rexthinkergroundedobjectreferring,
      title={Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning}, 
      author={Qing Jiang and Xingyu Chen and Zhaoyang Zeng and Junzhi Yu and Lei Zhang},
      year={2025},
      eprint={2506.04034},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.04034}
}