UBC Theses and Dissertations


Accelerating recommendation system training by leveraging popular choices

Ebrahimzadeh Maboud, Yassaman

Abstract

Recommendation systems are deployed in e-commerce and online advertising to surface items that users are likely to want. To this end, various deep learning based recommendation models have been employed, such as Facebook's Deep Learning Recommendation Model (DLRM). The inputs to such models fall into dense and sparse categories: dense features are continuous inputs such as time or age, whereas sparse features are the categorical representations of items and users, encoded as discrete parameters. These models comprise two main components: compute-intensive components such as the multilayer perceptron (MLP), and memory-intensive components such as the embedding tables that store the numerical representations of the sparse features. Training these large-scale recommendation models requires ever-increasing amounts of data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this thesis dives deep into the semantics of the training data and the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, accesses to the embeddings are highly skewed: a few embedding entries are accessed up to 10,000× more often than the rest. In this thesis, we focus on improving end-to-end training performance using this insight and offer a framework called Frequently Accessed Embeddings (FAE), which provides a hot-embedding-aware data layout for training recommender models. This layout uses the scarce GPU memory to store the highly accessed embeddings, thus reducing data transfers from CPU to GPU. We choose DLRM~\cite{dlrm} and XDL~\cite{xdl} as baselines; both models have been commercialized and are well established in industry, with DLRM deployed by Facebook and XDL by Alibaba. We choose XDL because of its high CPU utilization and its notably scalable approach to training recommendation models. Experiments on production-scale recommendation models with real-world datasets show that FAE reduces overall training time by 2.3× and 1.52× compared to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
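
The abstract describes FAE's hot-embedding-aware layout only at a high level; the short Python sketch below illustrates the general idea rather than the thesis's actual implementation. The class and function names, the 5% hot fraction, and the Zipf-distributed synthetic access counts are assumptions made purely for illustration: frequently accessed ("hot") rows of an embedding table are kept resident in GPU memory, while the remaining cold rows stay in CPU memory and are transferred only when accessed.

import numpy as np
import torch

def select_hot_rows(access_counts: np.ndarray, hot_fraction: float = 0.05) -> set:
    # Rank rows by access count and keep the top hot_fraction as "hot".
    num_hot = max(1, int(hot_fraction * len(access_counts)))
    return set(np.argsort(access_counts)[::-1][:num_hot].tolist())

class HotColdEmbedding:
    # Hot rows live on the GPU; the full table stays in host (CPU) memory.
    def __init__(self, weight: torch.Tensor, hot_rows: set, device: str):
        self.device = device
        hot_sorted = sorted(hot_rows)
        self.hot_index = {row: i for i, row in enumerate(hot_sorted)}
        self.hot = weight[hot_sorted].to(device)   # small slice kept resident in GPU memory
        self.cold = weight                         # cold rows are looked up from CPU memory

    def lookup(self, indices):
        rows = []
        for idx in indices:
            if idx in self.hot_index:
                rows.append(self.hot[self.hot_index[idx]])    # no CPU-to-GPU transfer needed
            else:
                rows.append(self.cold[idx].to(self.device))   # occasional transfer for cold rows
        return torch.stack(rows)

# Example usage with synthetic, popularity-skewed access counts.
device = "cuda" if torch.cuda.is_available() else "cpu"
table = torch.randn(100_000, 16)            # 100k embedding rows, 16 dimensions
counts = np.random.zipf(2.0, size=100_000)  # skewed access pattern, mimicking popular inputs
emb = HotColdEmbedding(table, select_hot_rows(counts), device)
vectors = emb.lookup([3, 42, 99_999])       # mixed hot/cold lookups

The hot/cold split shown here is static; the thesis targets a similar placement decision, but the concrete policy and data layout used by FAE are not specified in the abstract.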


Rights

Attribution-NonCommercial-NoDerivatives 4.0 International