UBC Theses and Dissertations

On effective learning for multimodal data
Rahman, Tanzila

Abstract

Humans perceive the world through multiple modalities. Strong evidence from behavioral science suggests that this ability, with its inherent implicit information integration and cross-modal alignment, is critical for human learning. Nevertheless, until relatively recently, most deep learning methods focused primarily on single-modality problems in learning from vision, sound, or text. In recent years, however, researchers have turned to multi-modal learning, with particular emphasis on high-level visual comprehension challenges such as image-text matching, video captioning, and audio-visual content generation. In this thesis, we aim to broaden the scope of learning from multimodal information, enhance its integration, and solve problems related to human-centric spatio-temporal perception in a manner that does not necessarily require complete supervision (e.g., granular spatio-temporal multi-modal alignment). Specifically, we focus on two fundamental challenges: (1) multimodal learning and (2) weak supervision. We address these challenges across a range of diverse tasks. First, we focus on weakly-supervised dense video captioning, where we combine audio with visual features to improve state-of-the-art performance; we also show that audio alone carries a surprising amount of information compared to existing visual-only models. Second, we introduce an end-to-end audio-visual co-segmentation network that recognizes individual objects and their corresponding sounds using only object labels, without requiring any additional supervision or bounding box proposals. Third, we propose TriBERT, a transformer-based architecture with co-attention that learns contextual features across three modalities: vision, pose, and audio. We show that these features are general and improve performance on a variety of tasks spanning audio-visual sound source separation and cross-modal retrieval. Fourth, we delve into generative text-to-image (TTI) models, specifically to address consistency when generating complex story visualizations by augmenting diffusion models with a memory module. Finally, we look at personalization within TTI, which allows us to generate diverse visuals for custom, user-specified concepts (e.g., a specific person, dog, etc.). Through our analysis of these tasks, this thesis presents significant algorithmic, theoretical, and empirical contributions to the fields of multimodal machine learning and computer vision.
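To make the tri-modal co-attention idea mentioned above concrete, the following is a minimal, hypothetical PyTorch sketch of one co-attention block in which each modality's tokens attend to the tokens of the other two. The module name, layer sizes, and token counts are illustrative assumptions only and do not reproduce the thesis's actual TriBERT implementation.

# Hypothetical sketch of tri-modal co-attention (vision, pose, audio).
# Illustrates the general idea only; it is not the thesis's TriBERT model.
import torch
import torch.nn as nn


class TriModalCoAttention(nn.Module):
    """Each modality's tokens attend to the concatenated tokens of the other two."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        modalities = ("vision", "pose", "audio")
        # One cross-attention block per modality (queries come from that modality).
        self.attn = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for m in modalities
        })
        self.norm = nn.ModuleDict({m: nn.LayerNorm(dim) for m in modalities})

    def forward(self, feats: dict) -> dict:
        # feats[m] has shape (batch, num_tokens_m, dim)
        out = {}
        for m, x in feats.items():
            # Context = tokens from the two other modalities, concatenated along the token axis.
            context = torch.cat([feats[o] for o in feats if o != m], dim=1)
            attended, _ = self.attn[m](query=x, key=context, value=context)
            out[m] = self.norm[m](x + attended)  # residual connection + layer norm
        return out


if __name__ == "__main__":
    model = TriModalCoAttention(dim=256, num_heads=4)
    feats = {
        "vision": torch.randn(2, 50, 256),  # e.g., frame-level visual tokens
        "pose": torch.randn(2, 30, 256),    # e.g., keypoint-sequence tokens
        "audio": torch.randn(2, 40, 256),   # e.g., spectrogram-patch tokens
    }
    fused = model(feats)
    print({m: tuple(v.shape) for m, v in fused.items()})

In this sketch, the fused per-modality features keep their original shapes, so such a block could in principle be stacked or followed by task-specific heads (e.g., for source separation or retrieval); those design choices are assumptions, not details taken from the thesis.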

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International