UBC Undergraduate Research

GRPO Finetuning for VLM Mathematical Reasoning
Chohan, Tayyib

Abstract

Vision-Language Models (VLMs) such as GPT-4V [1] have demonstrated remarkable capabilities but struggle with visual mathematical reasoning tasks, which demand numerical precision, multi-step logic, and accurate interpretation of visual elements such as charts and diagrams. This thesis investigates the potential of Group Relative Policy Optimization (GRPO), a reinforcement learning technique, to strengthen these capabilities in VLMs. We finetuned the Qwen2-VL-7B model using GRPO with reward signals based on answer accuracy and response formatting. We also examined where GRPO's impact lies by comparing full-model finetuning against variants in which the vision or language components were selectively frozen. Our findings indicate that GRPO finetuning yields observable improvements in mathematical reasoning accuracy over the baseline model. Notably, finetuning the complete model outperformed freezing either the vision or the language components, suggesting that adapting both modalities is beneficial. A qualitative analysis further revealed persistent error types during evaluation: although GRPO improves visual-mathematical reasoning in VLMs, it does not solve every problem, and gaps remain in fine-grained graph interpretation and formula application. Future work should explore more sophisticated RL strategies, from carefully tuned reward functions to innovative model architectures.
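
To make the reward signal described above concrete, the sketch below shows one plausible implementation of an accuracy-plus-format reward together with the group-relative advantage normalization that gives GRPO its name. The <think>/<answer> tag layout, the exact-match criterion, and the equal weighting of the two reward terms are illustrative assumptions, not the configuration actually used in the thesis.

import re
from typing import List

# Assumed reward components: the abstract only states that rewards were based
# on answer accuracy and response formatting, so the details below are a sketch.

def format_reward(response: str) -> float:
    """1.0 if the response follows an assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text inside the <answer> tag exactly matches the ground-truth answer."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return accuracy_reward(response, ground_truth) + format_reward(response)

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO scores each sampled response relative to the other responses generated
    for the same prompt, which removes the need for a separate learned value model."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # A hypothetical group of 4 sampled responses to one chart question whose answer is 19.
    group = [
        "<think>Read the bar chart: 12 + 7 = 19</think><answer>19</answer>",
        "<think>12 + 7 = 18</think><answer>18</answer>",
        "The answer is 19",  # correct value but wrong format, so it earns no reward here
        "<think>...</think><answer>19</answer>",
    ]
    rewards = [total_reward(r, "19") for r in group]
    print(rewards)                              # [2.0, 1.0, 0.0, 2.0]
    print(group_relative_advantages(rewards))   # normalized within the group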

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International