UBC Theses and Dissertations
Visual grounding through iterative refinement
Fan, Zicong
The problem of visual grounding has attracted much attention in recent years due to its pivotal role in more general visio-linguistic high-level reasoning tasks (e.g., image captioning, VQA). Despite the tremendous progress in this area, the performance of most approaches has been limited by the precision of the bounding box proposals obtained in the early stages of recent pipelines. To address this limitation, we propose a general, progressive, query-guided bounding box refinement architecture (OptiBox) that regresses the output of a visual grounding system closer to the ground truth. We apply this architecture in the context of the GroundeR model and the One-Stage Grounding model. The results with the GroundeR model show that OptiBox can provide an additional gain in grounding accuracy for a two-stage grounding system. Further, our experiments show that OptiBox can significantly improve bounding box precision when the box predicted by a grounding system deviates from the ground truth.
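As a rough illustration of the progressive refinement idea described above, the following minimal Python sketch repeatedly applies predicted corner offsets to a box and measures its overlap with a ground-truth box using intersection over union (IoU). It is not the OptiBox implementation: the names `iou`, `refine`, and `toy_offsets` are hypothetical, the learned query-guided regressor is replaced by a toy function that moves each corner halfway toward a known target, and the IoU threshold of 0.5 is a commonly used convention assumed here rather than taken from the abstract.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine(box: Box, predict_offsets: Callable[[Box], Box], steps: int = 3) -> Box:
    """Apply predicted corner offsets to the box for a fixed number of steps."""
    for _ in range(steps):
        dx1, dy1, dx2, dy2 = predict_offsets(box)
        box = (box[0] + dx1, box[1] + dy1, box[2] + dx2, box[3] + dy2)
    return box


def toy_offsets(target: Box, rate: float = 0.5) -> Callable[[Box], Box]:
    """Hypothetical stand-in for a learned regressor conditioned on image and query.

    It simply moves each corner a fraction `rate` of the way to `target`.
    """
    def predict(box: Box) -> Box:
        return tuple(rate * (t - b) for t, b in zip(target, box))
    return predict


ground_truth: Box = (10.0, 10.0, 50.0, 50.0)
initial: Box = (0.0, 0.0, 40.0, 40.0)  # imprecise box from a grounding system
refined = refine(initial, toy_offsets(ground_truth), steps=3)

# Assumed convention: a prediction counts as correct when IoU >= 0.5.
print(f"IoU before: {iou(initial, ground_truth):.3f}")
print(f"IoU after:  {iou(refined, ground_truth):.3f}")
```

In this toy setting the initial box falls below the assumed 0.5 threshold and the refined box rises above it, which mirrors the case the abstract highlights: refinement matters most when the predicted box deviates from the ground truth.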
Attribution-NonCommercial-NoDerivatives 4.0 International