UBC Theses and Dissertations
On effective learning for multimodal data
Rahman, Tanzila
Abstract
Humans can perceive the world through multiple modalities. Strong evidence from behavioral science suggests that this ability, with its inherent implicit information integration and cross-modal alignment, is critical for human learning. Nevertheless, until relatively recently, most deep learning methods focused primarily on single-modality problems, learning from vision, sound, or text alone. In recent years, however, researchers have turned to multi-modal learning, with particular emphasis on high-level visual comprehension challenges such as image-text matching, video captioning, and audio-visual content generation. In this thesis, we aim to broaden the scope of learning from multimodal information, enhance its integration, and solve problems related to human-centric spatio-temporal perception in a manner that does not necessarily require complete supervision (e.g., granular spatio-temporal multi-modal alignment). Specifically, we focus on two fundamental challenges: (1) multimodal learning; and (2) weak supervision. We address these challenges across a range of diverse tasks. First, we focus on weakly-supervised dense video captioning, where we combine audio with visual features to improve state-of-the-art performance; we also show that audio alone carries a surprising amount of information compared to existing visual-only models. Second, we introduce an end-to-end audio-visual co-segmentation network that recognizes individual objects and their corresponding sounds using only object labels, without requiring any additional supervision or bounding-box proposals. Third, we propose TriBERT, a transformer-based architecture with co-attention that learns contextual features across three modalities: vision, pose, and audio. We show that these features are general and improve performance on a variety of tasks spanning audio-visual sound source separation and cross-modal retrieval. Fourth, we turn to generative text-to-image (TTI) models, specifically addressing consistency when generating complex story visualizations by augmenting diffusion models with a memory module. Finally, we look at personalization within TTI, which allows us to generate diverse visuals for custom, user-specified concepts (e.g., a specific person or dog). Through our analysis of these tasks, this thesis presents significant algorithmic, theoretical, and empirical contributions to multimodal machine learning and computer vision.
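To make the co-attention idea behind TriBERT more concrete, the sketch below shows one simple way a tri-modal co-attention block could be realized in PyTorch: each modality's token features attend to the concatenated tokens of the other two modalities. This is an illustrative assumption, not the thesis implementation; the class name, dimensions, and exact fusion scheme are hypothetical.

```python
# Minimal sketch of tri-modal co-attention (illustrative, not the thesis code).
import torch
import torch.nn as nn

MODALITIES = ("vision", "pose", "audio")


class TriModalCoAttention(nn.Module):
    """Each modality attends to the concatenation of the other two."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # One cross-attention module per modality: queries come from that
        # modality, keys/values from the other two modalities combined.
        self.attn = nn.ModuleDict(
            {m: nn.MultiheadAttention(dim, num_heads, batch_first=True) for m in MODALITIES}
        )
        self.norm = nn.ModuleDict({m: nn.LayerNorm(dim) for m in MODALITIES})

    def forward(self, feats: dict) -> dict:
        # feats maps modality name -> (batch, seq_len, dim) token features.
        out = {}
        for m, x in feats.items():
            others = torch.cat([v for k, v in feats.items() if k != m], dim=1)
            attended, _ = self.attn[m](query=x, key=others, value=others)
            out[m] = self.norm[m](x + attended)  # residual connection + layer norm
        return out


if __name__ == "__main__":
    # Toy inputs: 8 visual, 6 pose, and 10 audio tokens per example (batch of 2).
    feats = {
        "vision": torch.randn(2, 8, 256),
        "pose": torch.randn(2, 6, 256),
        "audio": torch.randn(2, 10, 256),
    }
    fused = TriModalCoAttention()(feats)
    print({k: tuple(v.shape) for k, v in fused.items()})
```

The contextualized features returned for each modality keep their original sequence lengths, so downstream heads (e.g., for sound source separation or cross-modal retrieval) could consume them in place of the unimodal inputs.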
Item Metadata
Title: On effective learning for multimodal data
Creator: Rahman, Tanzila
Publisher: University of British Columbia
Date Issued: 2024
Language: eng
Date Available: 2024-05-06
Provider: Vancouver : University of British Columbia Library
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International
DOI: 10.14288/1.0442340
Degree Grantor: University of British Columbia
Graduation Date: 2024-11
Scholarly Level: Graduate
Aggregated Source Repository: DSpace