Multimodal understanding of long documents : from topic modeling to question answering

UBC Theses and Dissertations

Multimodal understanding of long documents : from topic modeling to question answering Abaskohi, Amirhossein

Abstract

Long multimodal documents, which contain text, images, and other types of content, are common in real-world settings but remain difficult for natural language processing (NLP) models to process. These documents pose challenges in both understanding their content and training models when labeled data is limited. This thesis presents two contributions that address these problems from different angles. First, we introduce CEMTM, a topic modeling method designed for long documents that include both text and images. Instead of relying on bag-of-words or treating different modalities separately, CEMTM uses contextual embeddings and cross-modal alignment to produce more coherent and meaningful topics. It performs well across several datasets and offers better topic diversity and interpretability. Second, we present FM²DS, a pipeline for generating synthetic training data for multimodal multihop question answering (MMQA). FM²DS uses prompting and document retrieval to create realistic question answering (QA) examples, and applies knowledge distillation to transfer reasoning ability from a large teacher model to a smaller multimodal model. This approach makes it possible to train competitive QA systems with only a few examples, reducing the need for large annotated datasets. Together, these two methods support more effective processing of long multimodal documents: CEMTM for exploring and summarizing content, and FM²DS for enabling downstream MMQA systems in low-resource settings. We evaluate both approaches across multiple tasks and demonstrate a substantial performance improvement.

Item Metadata

Title	Multimodal understanding of long documents : from topic modeling to question answering
Creator	Abaskohi, Amirhossein
Supervisor	Carenini, Giuseppe
Publisher	University of British Columbia
Date Issued	2025
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2025-08-27
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0449900
URI	http://hdl.handle.net/2429/92096
Degree (Theses)	Master of Science - MSc
Program (Theses)	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2025-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
