Grounded Object Referring
Grounded Object Referring is the task of locating objects in an image based on natural language descriptions—for example, "the man in a blue shirt" or "the dog sitting under the table." Unlike standard object detection, this task requires a deeper understanding of both visual details and linguistic nuances, including attributes, spatial relationships, and interactions.
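To make the task concrete, here is a minimal sketch of its input/output contract in Python. The `Box` type and the `refer` entry point are illustrative assumptions for this sketch, not the actual Rex-Thinker API.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

def refer(image, expression: str) -> list[Box]:
    """Return one box per object matching the free-form expression.

    Unlike closed-set detection, the category, attributes ("blue shirt"),
    and relations ("under the table") all come from the expression, and
    zero, one, or several boxes may be a valid answer.
    """
    raise NotImplementedError  # model-specific

# Hypothetical usage (the real entry point depends on the released code):
# boxes = refer(load_image("street.jpg"), "the man in a blue shirt")
```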
What makes it grounded is the focus on explainable and verifiable reasoning. Instead of guessing the object directly from the description, a grounded model reasons step by step, much as a human would: it first identifies possible candidates, then examines each one against the description, and finally selects the best match. This process reduces hallucination (predicting non-existent objects) and makes the model's decisions transparent and trustworthy.
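This candidate-then-verify loop can be sketched in a few lines. The helper names `propose_candidates` and `matches` are hypothetical stand-ins (for example, an open-vocabulary detector and a per-box verifier); only the control flow reflects the idea described above.

```python
from typing import Callable

def grounded_refer(image, expression: str,
                   propose_candidates: Callable,
                   matches: Callable) -> list:
    # Step 1: enumerate plausible candidates up front, so the final
    # answer can only come from boxes that actually exist in the image.
    candidates = propose_candidates(image, expression)

    # Step 2: verify each candidate against the description
    # individually, keeping the judgment for transparency.
    verdicts = [(box, matches(image, box, expression)) for box in candidates]

    # Step 3: keep only verified matches; an empty list is a valid
    # answer, which is what suppresses hallucinated objects.
    return [box for box, ok in verdicts if ok]

# Toy run with stand-in helpers:
boxes = grounded_refer(
    image=None,
    expression="the man in a blue shirt",
    propose_candidates=lambda img, expr: ["box_A", "box_B"],
    matches=lambda img, box, expr: box == "box_A",
)
assert boxes == ["box_A"]
```

Because the answer is drawn only from boxes the proposer actually produced, and an empty list is an acceptable outcome, the model cannot invent an object that is not in the image.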
Our model, Rex-Thinker, tackles this task using Chain-of-Thought (CoT) reasoning, breaking each decision into three explicit steps: Planning, Action, and Summarization. Combined with our HumanRef-CoT dataset and a two-stage training recipe (supervised fine-tuning followed by reinforcement learning), this leads to state-of-the-art accuracy along with interpretable predictions.
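As a rough illustration of how a three-step trace could be organized, here is a hypothetical data layout for Planning, Action, and Summarization; the actual trace format is defined by HumanRef-CoT and the paper, not by this snippet.

```python
from dataclasses import dataclass, field

@dataclass
class CoTTrace:
    planning: str  # decompose the expression into sub-checks
    actions: list[str] = field(default_factory=list)  # one judgment per candidate box
    summary: str = ""  # final answer: matched box(es), or "no match"

trace = CoTTrace(
    planning="Locate every person, then check who wears a blue shirt.",
    actions=[
        "Box 1: man in a red jacket -> does not match.",
        "Box 2: man in a blue shirt -> matches.",
    ],
    summary="Box 2 is the referred object.",
)
print(trace.summary)
```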