Open Collections / UBC Undergraduate Research

GRPO Finetuning for VLM Mathematical Reasoning
Chohan, Tayyib
Abstract
Vision Language Models (VLMs) like GPT-4V [1] have demonstrated remarkable capabilities but struggle with visual mathematical reasoning tasks. These tasks require numerical precision, multi-step logic, and accurate interpretation of visual elements like charts and diagrams. This thesis investigates the potential of Group Relative Policy Optimization (GRPO), a reinforcement learning technique, to enhance these specific capabilities in VLMs. We finetuned the Qwen2VL-7B model using GRPO with reward signals based on answer accuracy and response formatting. The study also compared full-model finetuning against runs with the vision or language components selectively frozen. Our findings indicate that GRPO finetuning yields observable improvements in mathematical reasoning accuracy compared to the baseline model. Notably, finetuning the complete model outperformed freezing either the vision or language components, suggesting potential benefits of adapting both modalities. A qualitative analysis of evaluation outputs also revealed persistent error types. While GRPO clearly enhances visual–mathematical reasoning in VLMs, it does not solve every problem: gaps remain around fine-grained graph interpretation and formula application. Future work should explore more sophisticated RL strategies, from carefully tuned reward functions to innovative model architectures.
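The reward design and the frozen-component ablation described above lend themselves to a short illustration. The sketch below, written in the style of a TRL GRPOTrainer setup, shows one common way to implement an accuracy reward and a formatting reward, and how one modality can be frozen before training. The <think>/<answer> tag convention, the function signatures, and the Qwen2-VL attribute names are assumptions for illustration, not the thesis's actual code.

```python
import re

from transformers import Qwen2VLForConditionalGeneration


def format_reward(completions, **kwargs):
    """Return 1.0 per completion that keeps a <think>...</think><answer>...</answer> layout."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c.strip(), re.DOTALL) else 0.0 for c in completions]


def accuracy_reward(completions, answers, **kwargs):
    """Return 1.0 per completion whose <answer> text matches the reference answer."""
    rewards = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(1.0 if predicted == str(answer).strip() else 0.0)
    return rewards


# Frozen-component ablation (illustrative): freeze the vision tower so GRPO only
# updates the language side; freezing the language model instead is symmetric.
# The attribute name "visual" follows one Hugging Face layout of Qwen2-VL and
# should be verified against the checkpoint actually loaded.
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
for param in model.visual.parameters():
    param.requires_grad = False
```

In such a setup the two callables would typically be passed together as `reward_funcs` to the trainer, which normalizes rewards within each group of sampled completions; treat this as a sketch of the general recipe rather than the exact configuration used in the thesis.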
Item Metadata
Title | GRPO Finetuning for VLM Mathematical Reasoning
Creator | Chohan, Tayyib
Date Issued | 2025-04
Description | Vision Language Models (VLMs) like GPT-4V [1] have demonstrated remarkable capabilities but struggle with visual mathematical reasoning tasks. These tasks require numerical precision, multi-step logic, and accurate interpretation of visual elements like charts and diagrams. This thesis investigates the potential of Group Relative Policy Optimization (GRPO), a reinforcement learning technique, to enhance these specific capabilities in VLMs. We finetuned the Qwen2VL-7B model using GRPO with reward signals based on answer accuracy and response formatting. The study also compared full-model finetuning against runs with the vision or language components selectively frozen. Our findings indicate that GRPO finetuning yields observable improvements in mathematical reasoning accuracy compared to the baseline model. Notably, finetuning the complete model outperformed freezing either the vision or language components, suggesting potential benefits of adapting both modalities. A qualitative analysis of evaluation outputs also revealed persistent error types. While GRPO clearly enhances visual–mathematical reasoning in VLMs, it does not solve every problem: gaps remain around fine-grained graph interpretation and formula application. Future work should explore more sophisticated RL strategies, from carefully tuned reward functions to innovative model architectures.
Genre |
Type |
Language | eng
Series |
Date Available | 2025-05-14
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0448888
URI |
Affiliation |
Peer Review Status | Unreviewed
Scholarly Level | Undergraduate
Rights URI |
Aggregated Source Repository | DSpace