UBC Theses and Dissertations
On effective learning for multimodal data
Rahman, Tanzila
Abstract
Humans can perceive the world through multiple modalities. Strong evidence from behavioral science suggests that this ability, with its inherent implicit information integration and cross-modal alignment, is critical for human learning. Nevertheless, until relatively recently, most deep learning methods focused primarily on single-modality problems, learning from vision, sound, or text alone. In recent years, however, researchers have turned to multi-modal learning, with particular emphasis on high-level visual comprehension challenges such as image-text matching, video captioning, and audio-visual content generation. In this thesis, we aim to broaden the scope of learning from multimodal information, enhance its integration, and solve problems related to human-centric spatio-temporal perception in a manner that does not necessarily require complete supervision (e.g., granular spatio-temporal multi-modal alignment). Specifically, we focus on two fundamental challenges: (1) multimodal learning; and (2) weak supervision. We address these challenges across a range of diverse tasks. First, we focus on weakly-supervised dense video captioning, where we combine audio with visual features to improve state-of-the-art performance; we also show that audio alone carries a surprising amount of information compared to existing visual-only models. Second, we introduce an end-to-end audio-visual co-segmentation network that recognizes individual objects and their corresponding sounds using only object labels, without requiring any additional supervision or bounding-box proposals. Third, we propose TriBERT, a transformer-based architecture with co-attention that learns contextual features across three modalities: vision, pose, and audio. We show that these features are general and improve performance on a variety of tasks spanning audio-visual sound source separation and cross-modal retrieval. Fourth, we turn to generative text-to-image (TTI) models, specifically addressing consistency when generating complex story visualizations by augmenting diffusion models with a memory module. Finally, we look at personalization within TTI, which allows us to generate diverse visuals for custom, user-specified concepts (e.g., a specific person or dog). Through our analysis of these tasks, this thesis presents significant algorithmic, theoretical, and empirical contributions to multimodal machine learning and computer vision.
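To make the co-attention idea behind TriBERT more concrete, the sketch below shows one simple way a tri-modal co-attention block could be realized in PyTorch: each modality's token features attend to the concatenated tokens of the other two modalities. This is an illustrative assumption, not the thesis implementation; the class name, dimensions, and exact fusion scheme are hypothetical.

```python
# Minimal sketch of tri-modal co-attention (illustrative, not the thesis code).
import torch
import torch.nn as nn

MODALITIES = ("vision", "pose", "audio")


class TriModalCoAttention(nn.Module):
    """Each modality attends to the concatenation of the other two."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # One cross-attention module per modality: queries come from that
        # modality, keys/values from the other two modalities combined.
        self.attn = nn.ModuleDict(
            {m: nn.MultiheadAttention(dim, num_heads, batch_first=True) for m in MODALITIES}
        )
        self.norm = nn.ModuleDict({m: nn.LayerNorm(dim) for m in MODALITIES})

    def forward(self, feats: dict) -> dict:
        # feats maps modality name -> (batch, seq_len, dim) token features.
        out = {}
        for m, x in feats.items():
            others = torch.cat([v for k, v in feats.items() if k != m], dim=1)
            attended, _ = self.attn[m](query=x, key=others, value=others)
            out[m] = self.norm[m](x + attended)  # residual connection + layer norm
        return out


if __name__ == "__main__":
    # Toy inputs: 8 visual, 6 pose, and 10 audio tokens per example (batch of 2).
    feats = {
        "vision": torch.randn(2, 8, 256),
        "pose": torch.randn(2, 6, 256),
        "audio": torch.randn(2, 10, 256),
    }
    fused = TriModalCoAttention()(feats)
    print({k: tuple(v.shape) for k, v in fused.items()})
```

The contextualized features returned for each modality keep their original sequence lengths, so downstream heads (e.g., for sound source separation or cross-modal retrieval) could consume them in place of the unimodal inputs.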
Item Metadata
Title: On effective learning for multimodal data
Creator: Rahman, Tanzila
Publisher: University of British Columbia
Date Issued: 2024
Language: eng
Date Available: 2024-05-06
Provider: Vancouver : University of British Columbia Library
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International
DOI: 10.14288/1.0442340
Degree Grantor: University of British Columbia
Graduation Date: 2024-11
Scholarly Level: Graduate
Aggregated Source Repository: DSpace