Graph-based Language GroundingbyMohit BajajB.Eng., University of Delhi, 2015A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)August 2019c©Mohit Bajaj, 2019The following individuals certify that they have read, and recommend to theFaculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Graph-based Language Groundingsubmitted by Mohit Bajaj in partial fulfillment of the requirements forthe degree of Master of Sciencein Computer ScienceExamining Committee:Leonid Sigal, Computer ScienceSupervisorJames Little, Computer ScienceSecond ReaderiiAbstractIn recent years, phrase (or more generally language) grounding has emerged as afundamental task in computer vision. Phrase grounding is a generalization of moretraditional computer vision tasks with the goal of localizing a natural languagephrase spatially in a given image. Most recent work use state-of-the-art deep learn-ing techniques to achieve good performance on this task. However, they do notcapture complex dependencies among proposal regions and phrases that are cru-cial for the superior performance on the task. In this work we try to overcomethis limitation through a model that makes no assumptions regarding the underly-ing dependencies in both of the modalities. We present an end-to-end frameworkfor grounding of the phrases in images that uses graphs to formulate more com-plex, non-sequential dependencies among proposal image regions and phrases. Wecapture intra-modal dependencies using a separate graph neural network for eachmodality (visual and lingual), and then use conditional message-passing in anothergraph neural network to fuse their outputs and capture cross-modal relationships.This final representation is used to make the grounding decisions. The frameworksupports many-to-many matching and is able to ground single phrase to multipleimage regions and vice versa. We validate our design choices through a seriesof ablation studies and demonstrate state-of-the-art performance on the Flickr30kEntities dataset and the ReferIt Game dataset.iiiLay SummarySpatial localization of text in images has multiple applications in the field of com-puter vision and can be found at the core of HCI and HRI systems. The task ischallenging because the space of natural language phrases is exponentially large ascompared to other tasks such as object detection and semantic segmentation. Mostprevious work leverages deep networks to achieve good performance but does notcapture complex dependencies among the phrases and image regions that can leadto better results. In this thesis, we propose a model that uses graphs to capturecomplex dependencies among both modalities and achieve superior performanceon this task.ivPrefaceThe entire work presented here is original work done by the author, Mohit Bajaj,in collaboration with Lanjun Wang and under the supervision of Dr. Leonid Sigal.A version of this work has been accepted for publication:• M. Bajaj, L. Wang and L. Sigal. GraphGround: Graph-based LanguageGrounding. In IEEE International Conference on Computer Vision(ICCV),2019Design, implementation and the experiments were done by me. Leonid providedme the feedback during each step and Lanjun helped me in refining the model. Theinitial draft of the paper was written by me, and was later revised by Lanjun andLeonid.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . 41.1.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.1.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2 Method Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 62 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Multimodal learning . . . . . . . . . . . . . . . . . . . . . . . . 72.1.1 Cross-modal transfer . . . . . . . . . . . . . . . . . . . . 72.1.2 Cross-modal interpretation . . . . . . . . . . . . . . . . . 82.1.3 Joint multimodal processing . . . . . . . . . . . . . . . . 8vi2.2 Phrase Grounding . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3 Graph Neural Networks (GNNs). . . . . . . . . . . . . . . . . . . 103 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.1 Text and Visual Encoders . . . . . . . . . . . . . . . . . . . . . . 133.2 G3RAPHGROUND Network . . . . . . . . . . . . . . . . . . . . . 143.2.1 Phrase Graph . . . . . . . . . . . . . . . . . . . . . . . . 143.2.2 Visual Graph . . . . . . . . . . . . . . . . . . . . . . . . 153.2.3 Fusion Graph . . . . . . . . . . . . . . . . . . . . . . . . 163.2.4 Prediction Network . . . . . . . . . . . . . . . . . . . . . 173.2.5 Post Processing . . . . . . . . . . . . . . . . . . . . . . . 183.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.1 Setup and Inference . . . . . . . . . . . . . . . . . . . . . . . . . 204.2 Datasets and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 204.3 Results and Comparison . . . . . . . . . . . . . . . . . . . . . . 224.4 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . 244.5 Ablation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.6 Multi-box evaluation . . . . . . . . . . . . . . . . . . . . . . . . 274.7 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 285 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31viiList of TablesTable 4.1 State-of-the-art comparison on Flickr30k. Phrase grounding ac-curacy on the test set reported in percentages. . . . . . . . . . 21Table 4.2 Phrase grounding accuracy comparison over coarse categorieson Flickr30k dataset. Refer table 4.3 for the category names.The models with (*) as suffix are finetuned. . . . . . . . . . . . 22Table 4.3 Category mappings from ids to names. . . . . . . . . . . . . . 22Table 4.4 State-of-the-art comparison on ReferIt Game. Phrase groundingaccuracy on the test set reported in percentages. . . . . . . . . 23Table 4.5 Ablation results. Flickr30k and ReferIt Game datasets. . . . . . 
26Table 4.6 Box level accuracy on Flickr30k Entities dataset. . . . . . . . . 28Table 4.7 Effect of k on the accuracy of G3RAPHGROUND++. . . . . . . 28viiiList of FiguresFigure 1.1 Illustration of G3RAPHGROUND. Two separate graphs are formedfor the phrases and the image regions respectively, and are thenfused together to make the final grounding predictions. Thecolored bounding-boxes correspond to the phrases in the samecolor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Figure 3.1 G3RAPHGROUND Architecture. The phrases are encoded intothe phrase graph while image regions are extracted and en-coded into the visual graph. The fusion graph is formed byindependently conditioning the visual graph on each node ofthe phrase graph. The output state of each node of the fusiongraph after message-passing is fed to the prediction networkto get the final grounding decision. . . . . . . . . . . . . . . . 13Figure 4.1 Sample attention results for visual graph. Aggregated attentionover each image region projected in an image. . . . . . . . . . 24Figure 4.2 Sample results obtained by G3RAPHGROUND on Flickr30kEntities dataset. The colored bounding-boxes correspond tothe phrases in same color. . . . . . . . . . . . . . . . . . . . 25Figure 4.3 Sample results obtained by G3RAPHGROUND on ReferIt Gamedataset. The colored bounding-boxes correspond to the phrasesin same color. . . . . . . . . . . . . . . . . . . . . . . . . . . 25ixAcknowledgmentsFirst and foremost, I would like to extend my heartfelt gratitude to my supervisorDr. Leonid Sigal for his support, feedback and patience. He always encouragedme to explore new ideas and was a constant source of motivation to me during mythesis. I will always be grateful to him for accepting me in his research group andagreeing to nurture me into a researcher. I would also like to thank Prof. James J.Little for taking his time out and agreeing to be the second reader of my thesis.Next I would like to thank Lanjun Wang for collaborating with us on thisproject. She kept me motivated and helped me with the ideas through discussions.I would also like to thank my lab-mate Suhail for the discussions and motivatingme to pursue this problem. Thanks should also get to Gursimran, Siddhesh and myother lab-mates for being such great colleagues and being kind and helpful.I would like to thank the Department of Computer Science of University ofBritish Columbia (UBC) for providing this platform and supporting me financiallyas a teaching assistant. I got the wonderful opportunity to learn from some amaz-ing instructors during my coursework. I would like to extend my gratitude to allmy course instructors. This work was supported by Huawei Technologies. A bigthanks to them as well. I am also thankful to Vector Institute for AI for providingme the financial and infrastructure support.Last but not the least, I would forever be grateful and indebted to my familyfor their unconditional love and support. None of this would have been possiblewithout the sacrifices and blessings of my mom. Special thanks to my best friendTanvi for always supporting me and keeping me sane during these years.xChapter 1IntroductionIn last few years, phrase (or more generally language) grounding has emerged asa fundamental task in computer vision. Phrase grounding is a generalization ofmore traditional computer vision tasks, such as object detection [14] and semanticsegmentation [32]. Grounding requires spatial localization of free-form linguisticphrases in images. 
The core challenge is that the space of natural phrases is ex-ponentially large, as compared to, for example, object detection or segmentationwhere the label sets are typically much more limited (e.g., , 80 categories in MSCOCO [22]). This exponential expressivity of the label set necessitates amortizedlearning, which is typically formulated using continuous embeddings of visual andlingual data. Despite challenges, phrase grounding emerged as the core problemin vision due to the breadth of applications that span image captioning [23], visualquestion answering [2, 45] and referential expression recognition [24] (which is atthe core of many HCI and HRI systems).Significant progress has been made on the task in the last couple of years,fueled by large scale datasets (e.g., , Flickr30k [27] and ReferIt Game [17]) andneural architectures of various forms. Most approaches treat the problem as oneof learning an embedding where class-agnostic region proposals [30] or attendedimages [11, 38] are embedded close to the corresponding phrases. A variety ofembedding models, conditional [29] and unconditional [16, 33], have been pro-posed for this task. Recently, the use of contextual relationships among the regionsand phrases has started to be explored and shown to substantially improve the per-1Figure 1.1: Illustration of G3RAPHGROUND. Two separate graphs areformed for the phrases and the image regions respectively, and are thenfused together to make the final grounding predictions. The coloredbounding-boxes correspond to the phrases in the same color.formance. Specifically, [12] and [8] encode the context of previous decisions byprocessing multiple phrases sequentially, and/or contextualizing each decision byconsidering other phrases and regions [12]. A non-differentiable process usingpolicy gradient is utilized in [8], while [12] uses an end-to-end differentiable for-mulation using LSTMs. In both cases, the contextual information is modeled usingsequential propagation (e.g., , using LSTMs [8, 12]).In reality, contextual information in the image, e.g., among the proposed re-gions, can hardly be regarded as sequential. Same can be argued for phrases, par-ticularly in cases where they do not come from an underlying structured source like2a sentence (which is explicitly stated as an assumption and limitation of [12]). Inessence, previous methods impose sequential serialization of fundamentally non-sequential data for convenience. We posit that addressing this limitation explicitlycan lead to both better performance and a more sensibly structured model. Cap-italizing on recent advances in object detection that have addressed conceptuallysimilar limitations with the use of transitive reasoning in graphs (e.g., , using con-volutional graph neural networks [18, 21, 41]), we propose a new graph-basedframework for phrase grounding. To our knowledge, this work is the first to ex-plore graph neural architectures for phrase grounding. Markedly, this formulationallows us to take into account more complex, non-sequential dependencies amongboth proposal image regions and the linguistic phrases that require grounding.Specifically, as illustrated in Figure 1.1, region proposals are first extractedfrom the image and encoded, using a CNN and bounding-box coordinates, intonode features of the visual graph. The phrases are similarly encoded, using a bi-directional RNN, into node features of the phrase graph. 
The strength of connec-tions (edge weights) between the nodes in both graphs are predicted based on thecorresponding node features and the global image/caption context. Gated GraphNeural Networks (GG-NNs) [21] are used to refine the two feature representationsthrough a series of message-passing iterations. The refined representations are thenused to construct the fusion graph for each phrase by fusing the visual graph withthe selected phrase. Again the fused features are refined using message-passing inGG-NN. Finally, the fused features for each node, which correspond to the encod-ing of <phrasei, image region j> tuples, are used to predict the probabilityof grounding phrasei to image region j. These results are further refined bya simple scheme that does non-maxima suppression (NMS), and predicts whethera given phrase should be grounded to one or more regions. The final model, wecall G3RAPHGROUND, is end-to-end differentiable and is shown to produce state-of-the-art results.While we clearly designed our architecture with phrase grounding in mind, wewant to highlight that it is much more general and would be useful for any multi-modal assignment problem where some contextual relations between elements ineach modality exist, for example, text-to-clip [39] / caption-image [19, 44] retrievalor more general cross-modal retrieval and localization [3].3Contributions: We make several contributions through this work. First, we pro-pose a novel graph-based grounding architecture which consists of three connectedsub-networks (visual, phrase and fusion) implemented using Gated Graph NeuralNetworks. Our design is modular and can model rich context both within a givenmodality and across modalities, without making strong assumptions on sequentialnature of data. Second, we show how this architecture can be learned in an end-to-end manner effectively. Third, we propose a simple but very effective refinementscheme that in addition to NMS helps to resolve one-to-many groundings. Finally,we validate our design choices through a series of ablation studies, and illustrateup to 5.33% and 10.21% better than state-of-the-art performance on the Flickr30k[27] and the ReferIt Game [17] datasets.1.1 Problem DefinitionThe problem we are trying to address in this work is grounding free languagephrases in images. More formally, given an image and a phrase or a descriptionalready parsed into multiple phrases, the task is to spatially localize each phrase toa specific region or multiple regions of the image. Each phrase can be groundedto multiple regions and multiple phrases are also allowed to be grounded to thesame region. For instance, the phrase people can be grounded to multiple imageregions where each region corresponds to a single person. Here are some desiredproperties of the model:• It should be able to capture intra-modal dependencies for each modality• It should be able to capture cross-modal dependencies between the text andthe image• It should be able to deal with complex free language descriptions that areparsed into phrases• It should support one-to-many matching for phrases and image-regions andvice versa41.1.1 ScopeWe make some assumptions to limit the scope of the problem. We assume that ei-ther phrases to be grounded exist independently as in ReferIt Game dataset [17] orthey are available as parsed from a given image description/caption as in [27]. 
Sec-ondly, we rely on existing pre-trained region proposal networks to extract image-regions from the given image.1.1.2 DataWe validate our model on the Flickr30k [27] and the Referit Game [17] datasets.Flickr30k contains 31,783 images where each image is annotated with five cap-tions/sentences. Each caption is further parsed into phrases, and the correspondingbounding-box annotations are available. A phrase may be annotated with morethan one ground-truth bounding-box, and a bounding-box may be annotated tomore than one phrase. We use the same dataset split as previous work [27, 29]which use 29,783 images for training, 1000 for validation, and 1000 for testing.Referit Game dataset contains 20,000 images and we use he same split as usedin [15, 29] where we use 10,000 images for training and validation while the other10,000 for testing. Each image is annotated with multiple referring expressions(phrases) and corresponding bounding-boxes. We note that the phrases corre-sponding to a given image of this dataset do not come from a sentence but existindependently.1.2 Method OutlinePhrase grounding is a challenging many-to-many matching problem where a singlephrase can, in general, be grounded to multiple regions, or multiple phrases canbe grounded to a single image region. The G3RAPHGROUND framework usesgraph networks to capture rich intra-modal and cross-modal relationships betweenthe phrases and the image regions. We illustrate the architecture in Figure 3.1.We assume that the phrases are available, e.g., parsed from an image caption (seeFlickr30k [27] dataset) or exist independently for a single image (see ReferIt Game[17] dataset).We encode these phrases using a bi-directional RNN that we call the phrase5encoder. These encodings are then used to initialize the nodes of the phrase graphthat is built to capture the relationships between the phrases. Similarly, we formthe visual graph that models the relationships between the image regions that areextracted from an image using RPN and then encoded using the visual encoder.The caption and the full image provide additional context information that we useto learn the edge-weights for both graphs. Message-passing is independently donefor these graphs to update the respective node features. This allows each phrase/im-age region to be aware of other contextual phrases/image regions. We finally fusethe outputs of these two graphs by instantiating one fusion graph for each phrase.We concatenate the features of all nodes of the visual graph with the feature vectorof a given node of the phrase graph to condition the message-passing in this newfusion graph.The final state of each node of the fusion graph, which corresponds to a pair<phrasei, image region j>, is fed to a fully connected prediction networkto make a binary decision if phrasei should be grounded to image region j.Note that all predictions are implicitly inter-dependent due to series of message-passing iterations in three graphs. We also predict if the phrase should be groundedto a single or multiple regions and use this information for post processing to refineour predictions.1.3 Thesis OrganizationWe have organized this thesis as follows: first, in Chapter 2, we review the relatedworks and the literature relevant to the problem. We discuss different methods andtechniques proposed in the past years to address the problem of spatial localiza-tion of text in images. We also discuss Graph Neural Networks and their recentapplications that motivate our work. 
In Chapter 3, we describe each componentof our model in detail. We also discuss the implementation details and set-up fortraining and inference. In Chapter 4, we discuss the experiments carried out withthis network demonstrating its effectiveness as compared to other works. We alsodiscuss the results of ablation studies that justify our model design choices. Wealso present some qualitative results at the end of this section. Finally in Chapter5, we highlight the contributions of our work and conclude the thesis.6Chapter 2Related WorkOur task of language (phrase) grounding is related to rich literature on multimodallearning; with architectural design building on recent advances in Graph NeuralNetworks. We review the most relevant literature and point the reader to recentsurveys [1, 5] and [37, 46] for added context.2.1 Multimodal learningThe term ‘multimodal’ generally involves different modalities that refer to dif-ferent sensory input such as audio, visual, touch, smell and taste. As discussedin [5], multimodal tasks can be classified based on the information flow betweenmodalities into cross-modal transfer, cross-modal interpretation and joint modalprocessing.2.1.1 Cross-modal transferThis approach assumes that processing occurs in domain specific modules that donot interact with each other. This was inspired by the theory of the modularity ofthe mind (Fodor, 1985). Earlier approaches to multimodal engineering modeledthe flow of information in one modality independent of the information flow inanother one. Finally, the results are aligned to each other at the end. The searchand retrieval task is a classical example where the search query is provided as text toenquire an about an image, video or text file from database [4]. For speech related7tasks, the query needs to be alligned to the audio samples. In such tasks, inputprocessing in one modality is not directly dependent upon the information flowfrom the output modality. Finding appropriate allignments and mappings betweenthe modalities is the challenging and crucial task for this view.2.1.2 Cross-modal interpretationGenerally the goal of this view is to use structured representation of one modality toobtain useful representation of other modality. The concept of attention has becomevery popular as a mediator between two modalities for cross-modal interpretation.Bridwell and Bello [6] argue that it serves as a bottleneck for information flow in acognitive system.For instance, in image captioning [40], attention is used to attend to importantregions of the image to produce the textual description. On the other hand, workslike [20] try to generate visual representations for the summarization of documents.2.1.3 Joint multimodal processingThis view tries to jointly model multiple modalities by trying to combine the infor-mation flow such that interpretation of one modality aids in the interpretation of theothers. For instance, visual question answering (VQA) [25] requires interpretationof the question, interpretation of image features with respect to the question andgeneration of the answer coherent to the question. The main challenge that lies inthis way of modelling is to find an efficient mechanism to combine both modalities.Jointly modelling both text and image is a popular choice for the recent works. Wealso propose an efficient mechanism in this work to combine both modalities.Now we discuss some prior work that focusses on the multimodal task of phrasegrounding.2.2 Phrase GroundingPrior work, such as Karpathy et al. 
[16], draw an inspiration from image retrievalworks that project both modalities into a common regularized subspace. Insteadthey propose to align sentence fragments and image regions in a subspace. Todo this, they introduce a structured max-margin objective that allows their model8at associate sentence fragments to image-regions. They experimentally show thataligning the sub-parts of text and image also helps in image retrieval task. Sim-ilarly, Wang et al. [35] propose a structured matching approach that encouragesthe semantic relations between phrases to agree with the visual relations betweenregions. They formulate this as a discrete optimization problem and relax it to alinear program. This structured matching is integrated with the neural networksthat are trained end-to-end. In [33], Wang et al. propose learning a joint visual-textembedding using two branch neural networks with multiple layers of linear projec-tions followed by non-linearities. They use a large margin objective that combinescross-view ranking constraints with within-view neighborhood structured preser-vation constraints. The idea is further extended in [34] that uses two differentnetwork structures that produce different output representations. The first one isan embedding network that learns an explicit shared latent embedding space witha max-margin loss and neighborhood constraints. The second network is a similar-ity network that fuses two branches via element-wise product and is trained withregression loss. Plummer et al. [29] further build on this idea with a model thatjointly learns multiple text-conditioned embeddings in a single end-to-end model.They propose a concept weight branch that automatically assigns the phrases tothe embeddings instead of pre-defining such assignments. This allows the modelto separate phrases into groups and learn conditional embeddings for these groupsin a single end-to-end model.It has been shown that both textual and visual context information can aidphrase grounding. Plummer et al. [28] perform global inference using a widerange of visual-text constraints from attributes, verbs, prepositions, and pronouns.For instance, relationships between people and clothing or body parts can be usefulin distinguishing individuals. However, these cues are predefined and the weightsfor these cues are learnt. Chen et al. [8] try to leverage the semantic and spatialrelationships between the phrases and corresponding visual regions by proposinga context policy network that accounts for the predictions made for other phraseswhen localizing a given phrase. They also propose and finetune the query-guidedregression network to boost the performance by obtaining better proposals andfeatures. SeqGROUND [12] uses the full image and sentence as the global contextwhile formulating the task as a sequential and contextual process that conditions9the grounding decision of a given phrase on previously grounded phrases of thesentence. They encode image-regions and all phrases into two stacks of LSTMcells, along with already grounded phrase-region pairs into another LSTM whichprovides context for the grounding of the next phrase.We formulate the problem of phrase grounding using Graph Neural Networksthat allow us to jointly model the grounding decision of all the given phrases. Wenow discuss Graph Neural Networks and their recent applications in the field ofmultimodal learning.2.3 Graph Neural Networks (GNNs).Graph Convolution Networks were first introduced in [18] for semi-supervisedclassification. 
Each layer of a GCN performs localized computations involving neighbourhood nodes. These layers can be stacked to form a deeper network that is capable of performing complex computations on graph data. The propagation in the network is based on a first-order approximation of spectral convolutions on graphs. For a two-layer graph network this can be written as

Z = F(X, A) = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)}) W^{(1)}\big),    (2.1)

where \hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}, A and D are the adjacency and degree matrices respectively, W^{(0)} is the input-to-hidden weight matrix of the hidden layer, and W^{(1)} is the hidden-to-output weight matrix.

In vision, Yang et al. [42] enhanced GCNs with attention and found them to be effective for scene graph generation. They design a Relation Proposal Network to propose relations between objects and use an attentional Graph Convolution Network to effectively capture contextual information between objects and relations. The work in [36] deploys GCNs to model videos as space-time graphs and obtains impressive results on the video classification task. The graph nodes are obtained from object region proposals extracted from different frames. These nodes are connected by two types of edges: similarity relations that capture the similarity between the nodes, and spatial-temporal relations that capture the interactions between nearby objects. Visual reasoning among image regions for object detection using GCNs was shown in [9] and served as conceptual motivation for our visual graph sub-network.

Recently, [41] presented a theoretical framework for analyzing the expressive power of GNNs to capture different graph structures. They observe that message-passing in GNNs can be described by two functions, AGGREGATE and COMBINE: the AGGREGATE function aggregates the messages from the neighbourhood nodes, and the COMBINE function updates the state of each node by combining the aggregated message with the previous state of that node. They prove that the choice of these functions is crucial to the expressive power of GNNs. Li et al. [21] propose Gated Graph Neural Networks (GG-NNs) that use the Gated Recurrent Units (GRUs) introduced by Cho et al. [10] for the gating in the COMBINE step, which can be described as

z_v(t) = \sigma(W_z \cdot [h_v(t-1), a_v(t)])
r_v(t) = \sigma(W_r \cdot [h_v(t-1), a_v(t)])
\tilde{h}_v(t) = \tanh(W \cdot [r_v(t) * h_v(t-1), a_v(t)])
h_v(t) = (1 - z_v(t)) \cdot h_v(t-1) + z_v(t) \cdot \tilde{h}_v(t)    (2.2)

where h_v(t-1) and h_v(t) are the states of node v before and after the t-th iteration respectively, and a_v(t) is the aggregated message received by the node from its neighborhood during the t-th iteration.

Our model is inspired by these works. We use one GG-NN to model the spatial relationships between the image regions and another to capture the semantic relationships between the phrases. We finally use a third GG-NN to fuse the text and visual embeddings obtained from the corresponding graphs. The output of the fusion network is used to predict whether a given phrase should be grounded to a specific image region or not.
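For concreteness, the following PyTorch sketch illustrates the two propagation rules above: the two-layer GCN of Eq. (2.1) and a single GG-NN iteration in which the COMBINE step of Eq. (2.2) is realized with a standard GRU cell. This is an illustrative rendering, not code from the thesis; class names, layer widths and the use of nn.GRUCell are assumptions.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Two-layer GCN of Eq. (2.1): Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w1 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # Symmetrically normalize the adjacency: A_hat = D^(-1/2) A D^(-1/2).
        deg_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_hat = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        h = torch.relu(a_hat @ self.w0(x))
        return torch.softmax(a_hat @ self.w1(h), dim=-1)

class GatedGraphLayer(nn.Module):
    """One GG-NN iteration: linear AGGREGATE over neighbours, GRU COMBINE (Eq. 2.2)."""
    def __init__(self, dim):
        super().__init__()
        self.kernel = nn.Linear(dim, dim, bias=False)  # shared graph kernel W
        self.gru = nn.GRUCell(dim, dim)                # gates z, r, h-tilde of Eq. (2.2)

    def forward(self, h, adj):
        # a_v(t): weighted sum of linearly transformed neighbour states.
        messages = adj @ self.kernel(h)
        # h_v(t): GRU combines the aggregated message with the previous state.
        return self.gru(messages, h)
```

Note that in G3RAPHGROUND the aggregation weights are not a fixed normalized adjacency: they are predicted per edge for the phrase graph and via attention for the visual and fusion graphs, as described in Chapter 3.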
Chapter 3

Approach

Phrase grounding is a challenging many-to-many matching problem where a single phrase can, in general, be grounded to multiple regions, or multiple phrases can be grounded to a single image region. The G3RAPHGROUND framework uses graph networks to capture rich intra-modal and cross-modal relationships between the phrases and the image regions. We illustrate the architecture in Figure 3.1. We assume that the phrases are available, e.g., parsed from an image caption (see the Flickr30k [27] dataset) or existing independently for a single image (see the ReferIt Game [17] dataset).

We encode these phrases using a bi-directional RNN that we call the phrase encoder. These encodings are then used to initialize the nodes of the phrase graph that is built to capture the relationships between the phrases. Similarly, we form the visual graph that models the relationships between the image regions that are extracted from the image using an RPN and then encoded using the visual encoder. The caption and the full image provide additional context information that we use to learn the edge-weights for both graphs. Message-passing is done independently for these graphs to update the respective node features. This allows each phrase/image region to be aware of other contextual phrases/image regions. We finally fuse the outputs of these two graphs by instantiating one fusion graph for each phrase. We concatenate the features of all nodes of the visual graph with the feature vector of a given node of the phrase graph to condition the message-passing in this new fusion graph.

The final state of each node of the fusion graph, which corresponds to a pair <phrase_i, image region_j>, is fed to a fully connected prediction network to make a binary decision on whether phrase_i should be grounded to image region_j. Note that all predictions are implicitly inter-dependent due to the series of message-passing iterations in the three graphs. We also predict whether each phrase should be grounded to a single region or to multiple regions, and use this information for post-processing to refine our predictions.

Figure 3.1: G3RAPHGROUND architecture. The phrases are encoded into the phrase graph while image regions are extracted and encoded into the visual graph. The fusion graph is formed by independently conditioning the visual graph on each node of the phrase graph. The output state of each node of the fusion graph after message-passing is fed to the prediction network to get the final grounding decision.

3.1 Text and Visual Encoders

Phrase Encoder. We assume one or more phrases are available and need to be grounded. Each phrase consists of a word or a sequence of words. We encode each word using its GloVe [26] embedding and then encode the complete phrase using the last hidden state of a bi-directional RNN. Finally, we obtain phrase encodings p_1, ..., p_n for the corresponding n input phrases P_1, ..., P_n.

Caption Encoder. We use another bi-directional RNN to encode the complete input caption C and obtain the caption encoding c_enc. This is useful as it provides global context information missing in the encodings of individual phrases.

Visual Encoder. We use a region proposal network (RPN) to extract region proposals R_1, ..., R_m from an image. Each region proposal R_i is fed to the pre-trained VGG-16 network to extract a 4096-dimensional vector from the first fully-connected layer. We transform this vector into a 300-dimensional vector r_i by passing it through a network with three fully-connected layers with ReLU activations and a batch normalization layer at the end.
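A minimal PyTorch sketch of these two encoders is given below. It is not the thesis implementation: the choice of GRU cells for the bi-directional RNN and the intermediate widths of the projection head are assumptions; only the 300-dimensional GloVe inputs, the 4096-dimensional VGG-16 feature, the three fully-connected layers and the final batch normalization come from the description above.

```python
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    """Encode a phrase (a sequence of 300-d GloVe vectors) with a bi-directional RNN.
    The thesis does not name the recurrent cell; a GRU is assumed here."""
    def __init__(self, word_dim=300, hidden_dim=150):
        super().__init__()
        self.rnn = nn.GRU(word_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, glove_seq):                  # (batch, n_words, 300)
        _, h_n = self.rnn(glove_seq)               # h_n: (2, batch, hidden_dim)
        # Concatenate the last hidden state of each direction -> 300-d phrase code.
        return torch.cat([h_n[0], h_n[1]], dim=-1)

class VisualEncoder(nn.Module):
    """Project a 4096-d VGG-16 fc feature to a 300-d region encoding via three
    fully-connected layers with ReLU activations and a batch norm at the end."""
    def __init__(self, in_dim=4096, out_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),    # intermediate widths are assumptions
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.ReLU(),
            nn.BatchNorm1d(out_dim),
        )

    def forward(self, vgg_fc_feat):                # (num_regions, 4096)
        return self.net(vgg_fc_feat)               # (num_regions, 300)
```

The image encoder described next reuses the same projection architecture, applied to the VGG-16 feature of the full image.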
Image Encoder. We use the same architecture as the visual encoder to also encode the full image into a corresponding 300-dimensional vector i_enc that serves as global image context for the grounding network.

3.2 G3RAPHGROUND Network

3.2.1 Phrase Graph

To model relationships between the phrases, we construct the phrase graph G^P, where the nodes of the graph correspond to the phrase encodings and the edges correspond to the context among them. The core idea is to make the grounding decision for each phrase dependent upon the other phrases present in the caption. This provides important context for the grounding of the given phrase. Formally, G^P = (V^P, E^P), where V^P are the nodes corresponding to the phrases and E^P are the edges connecting these nodes. We model this using a Gated Graph Neural Network, where the AGGREGATE step of the message-passing for each node v \in V^P can be described as

a^P_v(t) = \mathrm{AGGREGATE}(\{h^P_u(t-1) : u \in \mathcal{N}(v)\}) = \sum_{u \in \mathcal{N}(v)} A^P_{u,v} \big(W^P_k \cdot h^P_u(t-1)\big),    (3.1)

where a^P_v(t) is the aggregated message received by node v from its neighbourhood \mathcal{N} during the t-th iteration of message-passing, h^P_u(t-1) is the d-dimensional feature vector of phrase-node u before the t-th iteration of message-passing, W^P_k \in \mathbb{R}^{d \times d} is a learnable graph kernel matrix, and A^P_{u,v} corresponds to the scalar entry of the learnable adjacency matrix that denotes the weight of the edge connecting nodes u and v.

We initialize h^P_u(0) with the corresponding phrase encoding p_u \in \mathbb{R}^d produced by the phrase encoder. To obtain each entry of the adjacency matrix A^P_{u,v}, we concatenate the caption embedding (c_enc), the full image embedding (i_enc) and the sum of the corresponding phrase embeddings p_u and p_v. The concatenated feature is passed through a two-layer fully-connected network f_adj followed by a sigmoid:

A^P_{u,v} = A^P_{v,u} = \sigma\big(f_{adj}(\mathrm{Concat}(p_u + p_v, c_{enc}, i_{enc}))\big).    (3.2)

The aggregated message a^P_v(t) received by node v is used to update the state of node v during the t-th iteration:

h^P_v(t) = \mathrm{COMBINE}(\{h^P_v(t-1), a^P_v(t)\}).    (3.3)

We use GRU gating in the COMBINE step, as proposed by [21]. After k (k = 2 for all experiments) stages of message-passing on this graph network, we obtain h^P_v(k), which encodes the final state of the phrase node v \in V^P of the phrase graph. The final state of each node is then used in the fusion step.
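The following sketch shows one such iteration on the phrase graph, i.e., Eqs. (3.1)-(3.3) with GRU gating. It is illustrative rather than the released implementation: the hidden width of f_adj, the reuse of a standard GRUCell, and the inclusion of self-connections are assumptions, while the pairwise sum p_u + p_v, the global caption/image context, and the shared kernel W^P_k follow the equations above.

```python
import torch
import torch.nn as nn

class PhraseGraphLayer(nn.Module):
    """One message-passing iteration of the phrase graph (Eqs. 3.1-3.3)."""
    def __init__(self, d=300, ctx_dim=600):
        super().__init__()
        self.kernel = nn.Linear(d, d, bias=False)            # W_k^P in Eq. (3.1)
        self.f_adj = nn.Sequential(                           # two-layer f_adj of Eq. (3.2)
            nn.Linear(d + ctx_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        self.gru = nn.GRUCell(d, d)                           # GRU COMBINE, Eq. (3.3)

    def edge_weights(self, p, c_enc, i_enc):
        # p: (n, d) phrase encodings; c_enc, i_enc: 1-D global context vectors.
        n = p.size(0)
        pair = p.unsqueeze(0) + p.unsqueeze(1)                # p_u + p_v, (n, n, d), symmetric
        ctx = torch.cat([c_enc, i_enc], dim=-1).expand(n, n, -1)
        return torch.sigmoid(self.f_adj(torch.cat([pair, ctx], dim=-1))).squeeze(-1)

    def forward(self, h, p, c_enc, i_enc):
        adj = self.edge_weights(p, c_enc, i_enc)              # A^P, (n, n)
        agg = adj @ self.kernel(h)                            # AGGREGATE, Eq. (3.1)
        return self.gru(agg, h)                               # COMBINE,   Eq. (3.3)
```

Running this layer for k = 2 iterations yields the refined phrase states h^P(k) used in the fusion step below.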
3.2.2 Visual Graph

Similarly, we instantiate another GG-NN to model the visual graph G^V, which models the spatial relationships between the image regions present in the image. Each node of the graph corresponds to an image region extracted by the RPN. To initialize the states of these nodes, we use the encoded features of the image regions produced by the visual encoder and concatenate them with the position of the corresponding image region in the image, denoted by four normalized coordinates. V^V denotes the nodes of the visual graph G^V. The AGGREGATE step of message-passing on this network for each node v \in V^V can be described as

a^V_v(t) = \sum_{u \in \mathcal{N}(v)} \alpha_u \big(W^V_k \cdot h^V_u(t-1)\big),    (3.4)

where we initialize h^V_u(0) with the vector [r_u, x^{min}_u, y^{min}_u, x^{max}_u, y^{max}_u], obtained by concatenating the visual encoding r_u of the u-th image region with its normalized position, and \alpha_u represents the attention weight given to node u during the message-passing. To obtain \alpha_u, we concatenate the visual encoding r_u of that node with the caption encoding c_enc and the full image encoding i_enc, and then pass this vector through a fully-connected network f_attn followed by a sigmoid:

\alpha_u = \sigma\big(f_{attn}(\mathrm{Concat}(r_u, c_{enc}, i_{enc}))\big).    (3.5)

This is similar to the AGGREGATE step of message-passing on the phrase graph, except that we do not learn the complete adjacency matrix for this graph. We note that it is computationally expensive to learn this matrix, as the number of entries in the adjacency matrix increases quadratically with the number of image regions. Instead, we use unsupervised attention \alpha over the nodes of the visual graph to decide the edge-weights. All edges that originate from node u are weighted by \alpha_u, where \alpha_u \in [0,1].

Similar to the phrase graph, we use the GRU mechanism [21] for the COMBINE step of message-passing on this graph. After k stages of message-passing on this graph network, we obtain h^V_v(k), which encodes the final state of the image-region node v \in V^V of the visual graph. The updated visual graph is conditioned on each node of the phrase graph in the fusion step that we explain next.

3.2.3 Fusion Graph

As we have phrase embeddings and image-region embeddings from the phrase graph and the visual graph respectively, the fusion graph is designed to merge these embeddings before the grounding decisions are made. One fusion graph is instantiated for each phrase by assigning the selected phrase node from the phrase graph to all nodes of the visual graph. We concatenate the features of all nodes of the visual graph with the node features of the selected phrase node from the phrase graph. That is to say, the fusion graph has the following properties: 1) it has the same structure (i.e., the number of nodes as well as the adjacency matrix) as the visual graph; 2) the number of fusion graphs instantiated is the same as the number of nodes in the phrase graph. We can also view this graph as the visual graph conditioned on a node from the phrase graph.

After k iterations of message-passing in the fusion graph, we use the final state of each node to predict the grounding decision for the corresponding image region with respect to the phrase on which that fusion graph was conditioned. This is repeated for all of the phrases by instantiating a new fusion graph from the visual graph for each phrase and conditioning the message-passing in this new graph on the selected phrase node of the phrase graph. Note that while it may seem that message-passing on the fusion graphs occurs independently for each phrase, this is not the case: each phrase embedding that is used to condition message-passing on a fusion graph is an output of the phrase graph and is therefore aware of the other phrases present in the caption.

Let G^F_i denote the fusion graph obtained by conditioning the visual graph on node i of the phrase graph. The initialization of node j in this fusion graph can be described as

h^{F_i}_j(0) = \mathrm{Concat}(h^P_i(k), h^V_j(k)), \quad \forall j \in V^V,    (3.6)

where h^V_j(k) corresponds to the final feature vector of node j in the visual graph and h^P_i(k) is the final feature vector of the selected node i in the phrase graph. The AGGREGATE and COMBINE steps of message-passing on each fusion graph remain the same as described for the visual graph in Eqs. (3.4) and (3.3).
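A minimal sketch of this conditioning is given below. It assumes a fully connected graph and reuses the visual-graph attention weights \alpha for the fusion graph; the text states that the fusion graph shares the visual graph's structure and update steps, but the exact form of its attention weights is not spelled out, so that choice, the feature sizes and the fixed iteration count are assumptions.

```python
import torch
import torch.nn as nn

class FusionGraph(nn.Module):
    """Fusion graph for a single phrase i (Eqs. 3.4 and 3.6): each visual node is
    initialized with [h_i^P(k), h_j^V(k)] and refined with attention-weighted
    aggregation and a GRU combine, as in the visual graph."""
    def __init__(self, phrase_dim=300, visual_dim=304, iters=2):
        super().__init__()
        dim = phrase_dim + visual_dim          # 304 = 300-d region code + 4 coordinates
        self.kernel = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.iters = iters

    def forward(self, h_phrase_i, h_visual, alpha):
        # h_phrase_i: (phrase_dim,)   final state of the selected phrase node i
        # h_visual:   (m, visual_dim) final states of the m visual-graph nodes
        # alpha:      (m,)            attention weights reused from the visual graph
        m = h_visual.size(0)
        h = torch.cat([h_phrase_i.expand(m, -1), h_visual], dim=-1)   # Eq. (3.6)
        for _ in range(self.iters):
            # Eq. (3.4): every node receives the alpha-weighted sum of transformed states.
            msg = alpha @ self.kernel(h)                               # (dim,)
            h = self.gru(msg.expand(m, -1), h)                         # GRU COMBINE
        return h                                                       # fed to f_pred
```

One such module is run per phrase; the per-node outputs h^{F_i}_j(k) are then scored by the prediction network described next.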
3.2.4 Prediction Network

While grounding, we predict a scalar \hat{d}_{ij} for each phrase-region pair that denotes the probability that phrase P_i is grounded to image region R_j. The probability of this decision, conditioned on the given image and caption, can be approximated from the fused embedding of that image region conditioned on the given phrase. We pass the fused embedding of node j of the fusion graph G^F_i through the prediction network f_pred, which consists of three fully-connected layers with interleaved ReLU activations and a sigmoid function at the end:

P(\hat{d}_{ij} = 1 \mid h^{F_i}_j(k)) = \sigma\big(f_{pred}(h^{F_i}_j(k))\big).    (3.7)

3.2.5 Post Processing

Note that a given phrase may be grounded to a single region or to multiple regions. We find that the model's performance can be significantly boosted if we post-process the grounding predictions for the two cases separately. Hence, we predict a scalar \hat{\beta}_v for each phrase v \in V^P which denotes the probability of the phrase being grounded to more than one image region. We pass the updated phrase embedding h^P_v(k) of node v obtained from the phrase graph through a two-layer fully-connected network f_count:

\hat{\beta}_v = \sigma\big(f_{count}(h^P_v(k))\big).    (3.8)

If \hat{\beta}_v is greater than 0.5, we select those image regions for which the output of the prediction network is above a fixed threshold and then apply non-maximum suppression (NMS) as a final step. Otherwise, we simply ground the phrase to the image region with the maximum decision probability output by the prediction network.

3.3 Training

We pre-train the encoders to provide them with a good initialization for end-to-end learning. First, we pre-train the phrase encoder as an autoencoder, and then, keeping it fixed, we pre-train the visual encoder using a ranking loss. The loss enforces that the cosine similarity S_C(\cdot) between the phrase encoding and the visual encoding of a ground-truth pair (p_i, r_j) exceeds that of a contrastive pair by at least the margin \gamma:

L = \sum \Big( \mathbb{E}_{\tilde{p} \neq p_i} \max\{0, \gamma - S_C(p_i, r_j) + S_C(\tilde{p}, r_j)\} + \mathbb{E}_{\tilde{r} \neq r_j} \max\{0, \gamma - S_C(p_i, r_j) + S_C(p_i, \tilde{r})\} \Big),    (3.9)

where \tilde{r} and \tilde{p} denote a randomly sampled contrastive image region and phrase respectively. The caption encoder and the image encoder are pre-trained in a similar fashion. After pre-training the encoders, we jointly train all the modules in an end-to-end fashion.

For end-to-end training, we formulate this as a binary classification task where the model predicts the grounding decision for each phrase-region pair. We minimize the binary cross-entropy loss BCE(\cdot) between the model prediction and the ground-truth label. We also jointly train f_count and apply a binary cross-entropy loss for the binary classification task of predicting whether a phrase should be grounded to a single region or to multiple regions. The total training loss is

L_{train} = \mathrm{BCE}(\hat{d}_{i,j}, d_{i,j}) + \lambda \, \mathrm{BCE}(\hat{\beta}_i, \beta_i),    (3.10)

where \hat{d}_{i,j} and d_{i,j} are the predicted and ground-truth grounding decisions for the i-th phrase and j-th region respectively, \hat{\beta}_i and \beta_i are the prediction and ground truth on whether the i-th phrase is grounded to multiple regions or not, and \lambda is a hyperparameter that is tuned using grid search.
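As a compact reference, the two losses above can be written in PyTorch roughly as follows. This is a sketch: tensor shapes, the sampling of contrastive pairs and the margin value are assumptions; \lambda = 0.1 is the value reported in Section 4.7.

```python
import torch
import torch.nn.functional as F

def grounding_loss(d_hat, d_true, beta_hat, beta_true, lam=0.1):
    """Total training loss of Eq. (3.10): BCE on the per-pair grounding
    probabilities plus lambda-weighted BCE on the multi-region indicator."""
    pair_loss = F.binary_cross_entropy(d_hat, d_true)          # BCE(d_hat_ij, d_ij)
    count_loss = F.binary_cross_entropy(beta_hat, beta_true)   # BCE(beta_hat_i, beta_i)
    return pair_loss + lam * count_loss

def ranking_loss(s_pos, s_neg_phrase, s_neg_region, margin=0.1):
    """Margin-based pre-training loss of Eq. (3.9) for a batch of ground-truth
    pairs: the cosine similarity of (p_i, r_j) must exceed that of contrastive
    pairs by at least the margin. The margin value here is an assumption.
    s_pos, s_neg_phrase, s_neg_region: (batch,) cosine similarities."""
    return (torch.clamp(margin - s_pos + s_neg_phrase, min=0)
            + torch.clamp(margin - s_pos + s_neg_region, min=0)).mean()
```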
Thesecond prediction is for each phrase-region pair, which is to determine the proba-bility of grounding the given phrase to the given image region. Based on the firstprediction, results of the second prediction are accordingly post-processed, and thephrase is grounded to a single or multiple image regions.4.2 Datasets and EvaluationWe validate our model on the Flickr30k [27] and the Referit Game [17] datasets.Flickr30k contains 31,783 images where each image is annotated with five cap-20Method AccuracySMPL [35] 42.08NonlinearSP [33] 43.89GroundeR [31] 47.81MCB [13] 48.69RtP [27] 50.89Similarity Network [34] 51.05IGOP [43] 53.97SPC+PPC [28] 55.49SS+QRN (VGGdet) [8] 55.99CITE [29] 59.27SeqGROUND [12] 61.60CITE [29] (finetuned) 61.89QRC Net [8] (finetuned) 65.14G3RAPHGROUND++ 66.93Table 4.1: State-of-the-art comparison on Flickr30k. Phrase grounding accu-racy on the test set reported in percentages.tions/sentences. Each caption is further parsed into phrases, and the correspondingbounding-box annotations are available. A phrase may be annotated with morethan one ground-truth bounding-box, and a bounding-box may be annotated tomore than one phrase. We use the same dataset split as previous works [27, 29]which use 29,783 images for training, 1000 for validation, and 1000 for testing.The Referit Game dataset contains 20,000 images and we use the same splitas used in [15, 29] where we use 10,000 images for training and validation whileother 10,000 for testing. Each image is annotated with multiple referring expres-sions (phrases) and corresponding bounding-boxes. We note that the phrases cor-responding to a given image of this dataset do not come from a sentence but existindependently.Consistent with the prior work [31], we use grounding accuracy as the eval-uation metric which is the ratio of correctly grounded phrases to total number ofphrases in the test set. If a phrase is grounded to multiple boxes, we first takethe union of the predicted boxes over the image plane. The phrase is correctly21grounded if the predicted region has IoU of more than 0.5 with the ground-truth.4.3 Results and ComparisonMethod C1 C2 C3 C4 C5 C6 C7 C8SMPL [35] 57.89 34.61 15.87 55.98 52.25 23.46 34.22 26.23GroundeR [31] 61.00 38.12 10.33 62.55 68.75 36.42 58.18 29.08RtP [27] 64.73 46.88 17.21 65.83 68.72 37.65 51.39 31.77IGOP [43] 68.71 56.83 19.50 70.07 73.75 39.50 60.38 32.45SPC+PPC [28] 71.69 50.95 25.24 76.23 66.50 35.80 51.51 35.98SeqGROUND [12] 76.02 56.94 26.18 75.56 66.00 39.36 68.69 40.60CITE [29]* 75.95 58.50 30.78 77.03 79.25 48.15 58.78 43.24QRC Net [8]* 76.32 59.58 25.24 80.50 78.25 50.62 67.12 43.60GG++ 78.86 68.34 39.80 81.38 76.58 42.35 68.82 45.08Table 4.2: Phrase grounding accuracy comparison over coarse categories onFlickr30k dataset. Refer table 4.3 for the category names. The modelswith (*) as suffix are finetuned.Category Id Category NameC1 peopleC2 clothingC3 body partsC4 animalsC5 vehiclesC6 instrumentsC7 sceneC8 otherTable 4.3: Category mappings from ids to names.Flickr30k. We test our model on the Flickr30k dataset and report our results in Ta-ble 4.1. Our full model G3RAPHGROUND++ surpasses all other work by achievingthe best accuracy of 66.93%. The model achieves 5.33% increase in the ground-ing accuracy over the state-of-the-art performance of SeqGROUND [12]. Mostmethods, as do we, do not finetune the features on the target dataset. 
Exceptions22Method AccuracySCRC [15] 17.93MCB + Reg + Spatial [7] 26.54GroundeR + Spatial [31] 26.93Similarity Network + Spatial [34] 31.26CGRE [24] 31.85MNN + Reg + Spatial [7] 32.21EB+QRN (VGGcls-SPAT) [8] 32.21CITE [29] 34.13IGOP [43] 34.70QRC Net [8] (finetuned) 44.07G3RAPHGROUND++ 44.91Table 4.4: State-of-the-art comparison on ReferIt Game. Phrase groundingaccuracy on the test set reported in percentages.include CITE [29] and QRC Net [8] designated as (finetuned) in the table. We high-light that comparison to those methods isn’t strictly fair as they use the Flickr30kdataset itself to finetune feature extractors. Despite this, we outperform them by5% and 1.8% respectively, without utilizing specialized feature extractors. Whencompared to the versions of these models (CITE and SS+QRN (VGGdet)) that arenot finetuned, our model outperform them by 7.7% and 10.9% respectively. Thishighlights the power of our contextual reasoning in G3RAPHGROUND. Finetuningof features is likely to lead to additional improvements.Table 4.2 shows the phrase grounding performance of the models for differ-ent coarse categories in Flickr30k dataset. we observe that G3RAPHGROUND++achieves a consistent increase in accuracy compared to other methods in all ofthe categories except for the “instruments”; in fact our model performs best in sixout of the eight categories even when compared with the finetuned methods like[8, 29]. Improvement in the accuracy for “clothing” and “body parts” categories ismore than 8% and 9% respectively.ReferIt Game. We report results of our model on the ReferIt Game dataset in Table4.4. G3RAPHGROUND++ clearly outperforms all other state-of-the-art techniques23Figure 4.1: Sample attention results for visual graph. Aggregated attentionover each image region projected in an image.and achieves the best accuracy of 44.91%. Our model improves the groundingaccuracy by 10.21% over the state-of-the-art IGOP [43] model that uses similarfeatures.4.4 Qualitative ResultsIn Figure 4.1 we visualize the attention (α) on the nodes (image regions) of thevisual graph (image). We find that the model is able to differentiate the importantimage regions from the rest, for example, in (a), the model assigns higher attentionweights to important foreground objects such as child and man than the backgroundobjects like wall and pillar. Similarly in (d), the woman and the car get moreattention than any other region in the image.We visualize some phrase grounding results on the Flickr30k Entities dataset24Figure 4.2: Sample results obtained by G3RAPHGROUND on Flickr30k Enti-ties dataset. The colored bounding-boxes correspond to the phrases insame color.Figure 4.3: Sample results obtained by G3RAPHGROUND on ReferIt Gamedataset. The colored bounding-boxes correspond to the phrases in samecolor.in Figure 4.2. We find that our model is successful in grounding phrases for chal-lenging scenarios. In (e) the model is able to distinguish two women from otherwomen and is also able to infer that colorful clothing corresponds to the dress oftwo women not other women. In (b), (d) and (e) our model is able to ground a single25Method Flickr30k ReferIt GameGG - PhraseG 60.82 38.12GG - VisualG 62.23 38.82GG - FusionG 59.13 36.54GG - VisualG - FusionG 56.32 32.89GG - ImageContext 62.32 40.92GG - CaptionContext 62.73 n.a.GGFusionBase 60.41 38.65G3RAPHGROUND (GG) 63.87 41.79G3RAPHGROUND++ 66.93 44.91Table 4.5: Ablation results. Flickr30k and ReferIt Game datasets.phrase to multiple corresponding bounding-boxes. 
Also note the correct groundingof hand in (h) despite the presence of the other hand candidate. We also point outfew mistakes, for example in (h), blue Bic pen is incorrectly grounded to a braceletwhich is spatially close. In (g), curly hair is grounded to a larger bounding-box.We also visualize phrase grounding results on ReferIt Game dataset in Figure 4.3.4.5 AblationWe conduct ablation studies on our model to clearly understand the benefits of eachcomponent. Table 4.5 shows the results on both datasets. G3RAPHGROUND++ isour full model which achieves the best accuracy. G3RAPHGROUND lacks the sep-arate count prediction branch, and therefore post processes all the predictions ofthe network using the threshold mechanism. The model GG-PhraseG lacks thephrase graph to share information across the phrases, and directly uses the outputof the phrase encoder during the fusion step. In a similar approach, the model GG-VisualG lacks the visual graph, i.e., , there occurs no message-passing among pro-posal image regions. The output of the visual encoder is directly used during the fu-sion. The model GG-FusionG lacks the fusion graph, i.e., , the prediction networkmakes the predictions directly from the output of the visual graph concatenatedwith the output of the phrase graph. GG-VisualG-FusionG is missing both the vi-sual graph and the fusion graph. GG-ImageContext and GG-CaptionContext do26not use the full image and caption embedding respectively in the context informa-tion. We design another strong baseline GGFusionBase for G3RAPHGROUND tovalidate our fusion graph. In this method we do not instantiate one fusion graph oneach phrase for conditional massage-passing, but instead perform fusion throughmessage-passing on a single big graph that consists of the updated nodes of both,the phrase graph and the visual graph, such that each phrase node is connected toeach image region node with an edge of unit weight; no edges between the nodesof the same modality exist.We find that the results show consistent patterns in both of the datasets. Theworse performance of GG-PhraseG and GG-VisualG as compared to the versionG3RAPHGROUND confirms the importance of capturing intra-modal relationships.GG-VisualG-FusionG performs worst for both of the datasets. Even when ei-ther one of the visual graph or the fusion graph is present, accuracy is signifi-cantly boosted. However, the fusion graph is the most critical individual com-ponent of our model as its absence causes the maximum drop in accuracy. GG-FusionBase is slightly better than GG-FusionG but still significantly worse thanG3RAPHGROUND. This is strong proof of the efficacy of our fusion graph. Therole of our post processing technique is also evident from the performance gap be-tween G3RAPHGROUND and G3RAPHGROUND++. Since each ablated model per-forms significantly worse than the combined model, we conclude that each moduleis important.4.6 Multi-box evaluationWe also provide a new (stricter) metric for the box-level accuracy to evaluate multi-box predictions. We call the phrase correctly grounded if it meets two conditions:1) every box in the ground truth for the phrase has an IOU of more than 0.5 withat least one box among those that are matched to the phrase by the model. 2)Every box among those matched to the phrase by the model has an IOU of greaterthan 0.5 with at least one box from the ground truth for the phrase. We report thismetric separately for phrases with single (n = 1) and multiple (n > 1) ground truthannotations in Table 4.6. 
We also report this metric for pickTop1 version of ourmodel that grounds every phrase to only one box with the maximum prediction27score.Method Acc (n = 1) Acc (n > 1) mean AccpickTop1 69.03 4.80 56.12GG 53.17 25.78 48.08GG++ 67.46 25.61 59.07Table 4.6: Box level accuracy on Flickr30k Entities dataset.Dataset k = 1 k = 2 k = 3 k = 5Flickr30k Entities 65.51 66.93 67.21 67.64ReferIt Game 43.31 44.91 45.13 45.79Table 4.7: Effect of k on the accuracy of G3RAPHGROUND++.4.7 HyperparametersThe threshold for post-processing is set to 0.55 and λ to 0.1 using cross-validation.We use the Adam optimizer with learning rate of 0.0002 and decay rate of 0.8every 4 epochs. We also use the weight decay of 0.0005. For the effect of numberof iterations k see Table 4.7.28Chapter 5ConclusionIn this work, we proposed the G3RAPHGROUND framework that deploys GatedGraph Neural Networks to capture intra-modal and cross-modal relationships be-tween the phrases and the image regions to perform the task of language grounding.Image regions are extracted from the images using the pretrained Region ProposalNetwork. G3RAPHGROUND encodes the phrases into the phrase graph and imageregions into the visual graph to finally fuse them into the fusion graph using con-ditional message-passing. This allows the model to jointly make predictions for allphrase-region pairs without making any assumption about the underlying structureof the data. The effectiveness of our approach is demonstrated on two benchmarkdatasets, with up to 10% improvement on state-of-the-art.In the future, we would like to extend this framework for temporal and spatiallocalization of the descriptions of activities in videos. Videos can be modelledas space-time graphs by linking the spatial graphs obtained from the respectiveframes across time. In this scenario, the adjacency matrix would capture spatial andsemantic similarities among the image regions across the frames. This would allowus to capture and visualize interactions among objects across space and time, andhence provide good insight behind the model’s predictions for the task of temporallocalization.Another thread of future work would involve improving the performance ofthe current work. We would like to try multi-kernel approach and see if it furtherimproves the performance. Currently, each graph in the model uses single kernel29for the message-passing step. In future, we would like to use a set of kernels, wherethe choice of kernel for message passing is made with respect to a particular captionusing the the activity information (verb) mentioned in the caption. For instance,“walking” and “running” are more similar in terms of interaction as comparedto “walking” and “holding”, hence it seems more sensible to share a kernel for“walking” and “running” but not for “walking” and “holding”. We can model thisby using a separate branch that learns to choose the kernels based on the contentof the caption. This is inspired by the idea of concept-weight branch used in [29].Besides this, finetuning the Region Proposal Network on the target datasets as donein [8] is likely to improve the performance of our current model.We would also like to see if the current model can be modified and applied suc-cessfully on the inverse task of image and video captioning which would requireincremental graph construction from the output produced by any given time, andthen using message passing and fusion process to generate further output. 
Chapter 5

Conclusion

In this work, we proposed the G3RAPHGROUND framework, which deploys Gated Graph Neural Networks to capture intra-modal and cross-modal relationships between phrases and image regions for the task of language grounding. Image regions are extracted from the images using a pretrained Region Proposal Network. G3RAPHGROUND encodes the phrases into the phrase graph and the image regions into the visual graph, and finally fuses them in the fusion graph using conditional message-passing. This allows the model to jointly make predictions for all phrase-region pairs without making any assumption about the underlying structure of the data. The effectiveness of our approach is demonstrated on two benchmark datasets, with up to 10% improvement over the state-of-the-art.

In the future, we would like to extend this framework to temporal and spatial localization of descriptions of activities in videos. Videos can be modelled as space-time graphs by linking the spatial graphs obtained from the individual frames across time. In this scenario, the adjacency matrix would capture spatial and semantic similarities among image regions across frames. This would allow us to capture and visualize interactions among objects across space and time, and hence provide good insight into the model's predictions for the task of temporal localization.

Another thread of future work involves improving the performance of the current model. We would like to try a multi-kernel approach and see whether it further improves performance. Currently, each graph in the model uses a single kernel for the message-passing step. In the future, we would like to use a set of kernels, where the kernel used for message-passing is chosen with respect to a particular caption based on the activity information (verb) mentioned in the caption. For instance, "walking" and "running" are more similar in terms of interaction than "walking" and "holding", so it seems more sensible to share a kernel between "walking" and "running" but not between "walking" and "holding". We can model this with a separate branch that learns to choose the kernels based on the content of the caption, inspired by the concept-weight branch used in [29]. Besides this, finetuning the Region Proposal Network on the target datasets, as done in [8], is likely to improve the performance of our current model.

We would also like to see whether the current model can be modified and applied successfully to the inverse task of image and video captioning, which would require incremental graph construction from the output produced up to a given time, followed by message-passing and fusion to generate further output. We acknowledge that more experiments are also needed to see whether the current model generalizes well to other localization tasks that involve combinations of modalities other than text and vision, for instance audio and vision.

Lastly, the current model can be extended to handle more than two modalities, which would be useful for tasks such as localization of text in videos that are accompanied by synchronized audio. All three modalities could be modelled as separate graphs and then fused in a similar way to obtain the final predictions. This would require a smarter and more efficient fusion approach to cope with the increase in computation arising from the exponential growth in the number of fusion graphs as the number of modalities increases.

Bibliography

[1] N. Aafaq, S. Z. Gilani, W. Liu, and A. Mian. Video description: A survey of methods, datasets and evaluation metrics. CoRR, abs/1806.00186, 2018. → pages 7

[2] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. → pages 1

[3] R. Arandjelovic and A. Zisserman. Objects that sound. In European Conference on Computer Vision (ECCV), 2018. → pages 3

[4] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345–379, 2010. → pages 7

[5] L. Beinborn, T. Botschen, and I. Gurevych. Multimodal grounding for language processing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2325–2339. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/C18-1197. → pages 7

[6] W. Bridewell and P. F. Bello. A theory of attention for cognitive systems. Advances in Cognitive Systems, 4:1–16, 2016. → pages 8

[7] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia. MSRC: Multimodal spatial regression with semantic context for phrase grounding. In ACM International Conference on Multimedia Retrieval, pages 23–31, 2017. → pages 23

[8] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In IEEE International Conference on Computer Vision (ICCV), pages 824–832, 2017. → pages 2, 9, 21, 22, 23, 30

[9] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. Iterative visual reasoning beyond convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. → pages 11

[10] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. → pages 11

[11] C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan. Visual grounding via accumulated attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. → pages 1

[12] P. Dogan, L. Sigal, and M. Gross. Neural sequential phrase grounding (SeqGROUND). In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. → pages 2, 3, 9, 21, 22

[13] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016. → pages 21

[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. → pages 1
[15] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4555–4564, 2016. → pages 5, 21, 23

[16] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems (NeurIPS), pages 1889–1897, 2014. → pages 1, 8

[17] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014. → pages 1, 4, 5, 12, 20

[18] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. → pages 3, 10

[19] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In Transactions of the Association for Computational Linguistics (TACL), 2015. → pages 3

[20] K. Kucher and A. Kerren. Text visualization techniques: Taxonomy, visual survey, and community insights. In 2015 IEEE Pacific Visualization Symposium (PacificVis), pages 117–121. IEEE, 2015. → pages 8

[21] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016. → pages 3, 11, 15, 16

[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014. → pages 1

[23] J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. → pages 1

[24] R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7102–7111, 2017. → pages 1, 23

[25] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In IEEE International Conference on Computer Vision (ICCV), pages 1–9, 2015. → pages 8

[26] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. → pages 13

[27] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In IEEE International Conference on Computer Vision (ICCV), pages 2641–2649, 2015. → pages 1, 4, 5, 12, 20, 21, 22

[28] B. Plummer, A. Mallya, C. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In IEEE International Conference on Computer Vision (ICCV), pages 1928–1937, 2017. → pages 9, 21, 22

[29] B. Plummer, P. Kordas, H. Kiapour, S. Zheng, R. Piramuthu, and S. Lazebnik. Conditional image-text embedding networks. In European Conference on Computer Vision (ECCV), pages 249–264, 2018. → pages 1, 5, 9, 21, 22, 23, 30

[30] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015. → pages 1, 20

[31] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision (ECCV), pages 817–834, 2016. → pages 21, 22, 23
[32] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(4), 2017. → pages 1

[33] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5005–5013, 2016. → pages 1, 9, 21

[34] L. Wang, Y. Li, J. Huang, and S. Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(2):394–407, 2019. → pages 9, 21, 23

[35] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase localization. In European Conference on Computer Vision (ECCV), pages 696–711, 2016. → pages 9, 21, 22

[36] X. Wang and A. Gupta. Videos as space-time region graphs. In European Conference on Computer Vision (ECCV), pages 399–417, 2018. → pages 10

[37] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. CoRR, abs/1901.00596, 2019. URL http://arxiv.org/abs/1901.00596. → pages 7

[38] F. Xiao, L. Sigal, and Y. J. Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. → pages 1

[39] H. Xu, K. He, B. Plummer, L. Sigal, S. Sclaroff, and K. Saenko. Multilevel language and vision integration for text-to-clip retrieval. In AAAI Conference on Artificial Intelligence (AAAI), 2019. → pages 3

[40] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), pages 2048–2057, 2015. → pages 8

[41] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019. → pages 3, 11

[42] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph R-CNN for scene graph generation. In European Conference on Computer Vision (ECCV), pages 670–685, 2018. → pages 10

[43] R. Yeh, J. Xiong, W.-M. Hwu, M. Do, and A. Schwing. Interpretable and globally optimal prediction for textual grounding using image concepts. In Advances in Neural Information Processing Systems (NeurIPS), pages 1912–1922, 2017. → pages 21, 22, 23, 24

[44] Y. Zhang and H. Lu. Deep cross-modal projection learning for image-text matching. In European Conference on Computer Vision (ECCV), pages 686–701, 2018. → pages 3

[45] Y. Zhang, J. C. Niebles, and A. Soto. Interpretable visual question answering by visual grounding from attention supervision mining. → pages 1

[46] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018. URL http://arxiv.org/abs/1812.08434. → pages 7
