Open Collections
UBC Faculty Research and Publications
A Multimodal Recommender System Using Deep Learning Techniques Combining Review Texts and Images
Jeong, Euiju; Li, Xinzhe; Kwon, Angela Eunyoung; Park, Seonu; Li, Qinglong; Kim, Jaekyeong
Abstract
Online reviews consisting of texts and images are an essential source of information for alleviating data sparsity in recommender systems. Although texts and images convey different types of information, they can offer complementary or substitutive advantages. However, most studies fail to capture the complementary effect between texts and images in recommender systems: they overlook the informational value of images and propose recommenders based solely on textual representations. To address this research gap, this study proposes a novel recommender model that captures the dependence between texts and images. It uses the RoBERTa and VGG-16 models to extract textual and visual information from online reviews and applies a co-attention mechanism to capture the complementarity between the two modalities. Extensive experiments on Amazon datasets confirm the superiority of the proposed model. Our findings suggest that the complementarity of texts and images is crucial for enhancing recommendation accuracy and performance.
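The co-attention step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes pre-extracted feature matrices standing in for RoBERTa token features and VGG-16 region features, with hypothetical dimensions and a random projection matrix, and computes one common form of co-attention (a tanh affinity matrix followed by softmax-weighted pooling of each modality).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(T, V, W):
    """Pool text features T (n_t, d) and visual features V (n_v, d)
    through a learned affinity W (d, d); returns a joint vector (2d,)."""
    C = np.tanh(T @ W @ V.T)        # affinity matrix, shape (n_t, n_v)
    a_t = softmax(C.max(axis=1))    # attention over text tokens, (n_t,)
    a_v = softmax(C.max(axis=0))    # attention over image regions, (n_v,)
    t_vec = a_t @ T                 # attended text representation, (d,)
    v_vec = a_v @ V                 # attended visual representation, (d,)
    return np.concatenate([t_vec, v_vec])

# Hypothetical stand-ins for extracted review features.
rng = np.random.default_rng(0)
T = rng.standard_normal((4, 8))     # 4 "token" features of dim 8
V = rng.standard_normal((6, 8))     # 6 "region" features of dim 8
W = rng.standard_normal((8, 8))
rep = co_attention(T, V, W)
print(rep.shape)                    # (16,) joint multimodal representation
```

In the paper's setting the joint vector would feed a rating-prediction layer; here it simply shows how each modality's pooling weights depend on the other modality, which is the complementarity the abstract describes.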
Item Metadata
Title | A Multimodal Recommender System Using Deep Learning Techniques Combining Review Texts and Images
Publisher | Multidisciplinary Digital Publishing Institute
Date Issued | 2024-10-10
Language | eng
Date Available | 2024-10-28
Provider | Vancouver : University of British Columbia Library
Rights | CC BY 4.0
DOI | 10.14288/1.0447147
Citation | Applied Sciences 14 (20): 9206 (2024)
Publisher DOI | 10.3390/app14209206
Peer Review Status | Reviewed
Scholarly Level | Faculty; Researcher
Aggregated Source Repository | DSpace