Open Collections
UBC Theses and Dissertations
Multimodal understanding of long documents : from topic modeling to question answering
Abaskohi, Amirhossein
Abstract
Long multimodal documents, which contain text, images, and other types of content, are common in real-world settings but remain difficult for natural language processing (NLP) models to process. These documents pose challenges in both understanding their content and training models when labeled data is limited. This thesis presents two contributions that address these problems from different angles.
First, we introduce CEMTM, a topic modeling method designed for long documents that include both text and images. Instead of relying on bag-of-words or treating different modalities separately, CEMTM uses contextual embeddings and cross-modal alignment to produce more coherent and meaningful topics. It performs well across several datasets and offers better topic diversity and interpretability.
Second, we present FM²DS, a pipeline for generating synthetic training data for multimodal multihop question answering (MMQA). FM²DS uses prompting and document retrieval to create realistic question answering (QA) examples, and applies knowledge distillation to transfer reasoning ability from a large teacher model to a smaller multimodal model. This approach makes it possible to train competitive QA systems with only a few examples, reducing the need for large annotated datasets.
Together, these two methods support more effective processing of long multimodal documents: CEMTM for exploring and summarizing content, and FM²DS for enabling downstream MMQA systems in low-resource settings. We evaluate both approaches across multiple tasks and demonstrate a substantial performance improvement.
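The abstract mentions topic diversity as one of the qualities on which CEMTM is assessed. As a rough illustration only, the sketch below computes one common formulation of that idea: the fraction of distinct words among the top-k words of all topics. This record does not say which diversity measure the thesis uses, so the function name `topic_diversity`, the `top_k` parameter, and the toy topics are all assumptions made for this example, not the author's implementation.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Return the fraction of unique words among the top_k words of each topic.

    Assumed formulation (not confirmed by the source): a value near 1.0 means
    topics share few top words; a value near 0.0 means they largely overlap.
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    # Flatten the first top_k words of every topic into a single list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words supplied")
    return len(set(top_words)) / len(top_words)


# Hypothetical toy topics, invented purely for illustration.
toy_topics = [
    ["galaxy", "telescope", "orbit"],
    ["orbit", "rocket", "launch"],
]
print(topic_diversity(toy_topics, top_k=3))
```

In the toy input, "orbit" appears in both topics, so five of the six top words are distinct. A model whose topics repeat the same words would score lower on this measure.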
Item Metadata
Title | Multimodal understanding of long documents : from topic modeling to question answering |
Creator | Abaskohi, Amirhossein |
Supervisor | |
Publisher | University of British Columbia |
Date Issued | 2025 |
Genre | |
Type | |
Language | eng |
Date Available | 2025-08-27 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0449900 |
URI | |
Degree (Theses) | |
Program (Theses) | |
Affiliation | |
Degree Grantor | University of British Columbia |
Graduation Date | 2025-11 |
Campus | |
Scholarly Level | Graduate |
Rights URI | |
Aggregated Source Repository | DSpace |