UBC Undergraduate Research

GRPO Finetuning for VLM Mathematical Reasoning
Chohan, Tayyib

Abstract

Vision-Language Models (VLMs) such as GPT-4V [1] have demonstrated remarkable capabilities but struggle with visual mathematical reasoning tasks, which demand numerical precision, multi-step logic, and accurate interpretation of visual elements such as charts and diagrams. This thesis investigates the potential of Group Relative Policy Optimization (GRPO), a reinforcement learning technique, to strengthen these capabilities in VLMs. We finetuned the Qwen2-VL-7B model using GRPO with reward signals based on answer accuracy and response formatting. We also examined where GRPO's impact lies by comparing full-model finetuning against variants in which the vision or language components were selectively frozen. Our findings indicate that GRPO finetuning yields observable improvements in mathematical reasoning accuracy over the baseline model. Notably, finetuning the complete model outperformed freezing either the vision or the language components, suggesting that adapting both modalities is beneficial. A qualitative analysis further revealed persistent error types during evaluation: although GRPO improves visual-mathematical reasoning in VLMs, it does not solve every problem, and gaps remain in fine-grained graph interpretation and formula application. Future work should explore more sophisticated RL strategies, from carefully tuned reward functions to innovative model architectures.
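
To make the reward signal described above concrete, the sketch below shows one plausible implementation of an accuracy-plus-format reward together with the group-relative advantage normalization that gives GRPO its name. The <think>/<answer> tag layout, the exact-match criterion, and the equal weighting of the two reward terms are illustrative assumptions, not the configuration actually used in the thesis.

import re
from typing import List

# Assumed reward components: the abstract only states that rewards were based
# on answer accuracy and response formatting, so the details below are a sketch.

def format_reward(response: str) -> float:
    """1.0 if the response follows an assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text inside the <answer> tag exactly matches the ground-truth answer."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return accuracy_reward(response, ground_truth) + format_reward(response)

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO scores each sampled response relative to the other responses generated
    for the same prompt, which removes the need for a separate learned value model."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # A hypothetical group of 4 sampled responses to one chart question whose answer is 19.
    group = [
        "<think>Read the bar chart: 12 + 7 = 19</think><answer>19</answer>",
        "<think>12 + 7 = 18</think><answer>18</answer>",
        "The answer is 19",  # correct value but wrong format, so it earns no reward here
        "<think>...</think><answer>19</answer>",
    ]
    rewards = [total_reward(r, "19") for r in group]
    print(rewards)                              # [2.0, 1.0, 0.0, 2.0]
    print(group_relative_advantages(rewards))   # normalized within the group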

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International