Spatio-temporal relational reasoning for video question answering

UBC Theses and Dissertations

Spatio-temporal relational reasoning for video question answering Singh, Gursimran

Abstract

Video question answering is the task of automatically answering questions about videos. Beyond its direct practical interest, it is a useful benchmark for progress on a range of video understanding tasks. A successful algorithm must ground the objects of interest and model the relationships among them jointly in the spatial and temporal domains. We show that existing state-of-the-art approaches, which are based on Convolutional Neural Networks or Recurrent Neural Networks, are not effective at joint reasoning across both domains. Moreover, they are short-sighted and struggle with long-range dependencies in videos. To address these challenges, we present a novel spatio-temporal reasoning neural module that models complex multi-entity relationships in space and long-term dependencies in time. Our model captures both time-changing object interactions and the action dynamics of individual objects. We evaluate our module on two benchmark datasets that require spatio-temporal reasoning, TGIF-QA and SVQA, and achieve state-of-the-art performance on both. More significantly, we achieve substantial improvements on some of the most challenging question types, such as counting, which demonstrates the effectiveness of the proposed spatio-temporal relational module.

Item Metadata

Title	Spatio-temporal relational reasoning for video question answering
Creator	Singh, Gursimran
Publisher	University of British Columbia
Date Issued	2019
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2019-10-22
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0384578
URI	http://hdl.handle.net/2429/72033
Degree	Master of Science - MSc
Program	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2019-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
