UBC Theses and Dissertations
Accelerating input dispatching for deep learning recommendation models training
Adnan, Muhammad
Abstract
Deep-learning and time-series based recommendation models require copious amounts of compute for their deep learning part and large memory capacities for their embedding tables. Training these models typically involves using GPUs to accelerate the deep learning phase while restricting the memory-intensive embedding tables to the CPUs. This causes data to be constantly transferred between the CPU and GPUs, which limits the overall throughput of the training process. This thesis offers a heterogeneous acceleration pipeline, called Hotline, that leverages the insight that only a small number of embedding entries are accessed frequently and can easily fit in a single GPU's local memory. Hotline pipelines the training mini-batches by efficiently utilizing (1) the main memory for infrequently accessed embeddings and (2) the GPUs' local memory for frequently accessed embeddings and their compute for the entire recommender model, while stitching their execution together through a novel hardware accelerator that gathers the required working parameters and dispatches training inputs.

The Hotline accelerator processes multiple input mini-batches to collect the ones that access only the frequently accessed embeddings and dispatch them directly to the GPUs. For inputs that require infrequently accessed embeddings, Hotline hides the CPU-GPU transfer time by proactively obtaining them from main memory. This enables recommendation system training, for the entirety of its mini-batches, to be performed on low-capacity, high-throughput GPUs. Results on real-world datasets and recommender models show that Hotline reduces the average training time by 3.45× in comparison to an XDL baseline when using 4 GPUs. Moreover, Hotline increases the overall training throughput to 20.8 epochs/hr, compared to 5.3 epochs/hr, on the Criteo Terabyte dataset.
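The dispatching idea in the abstract can be illustrated with a minimal sketch. The names below (`HOT_EMBEDDINGS`, `dispatch_minibatch`) are hypothetical, not the thesis's actual API: inputs whose sparse features all hit the small hot embedding set cached in GPU memory are dispatched to the GPU immediately, while the remaining inputs have their cold embedding rows gathered for a prefetch from CPU main memory, hiding the CPU-GPU transfer behind ongoing GPU work.

```python
# Illustrative sketch of Hotline-style input dispatching (assumed names,
# not the thesis's implementation).

# Assumed "hot" embedding rows: accessed frequently enough to be kept
# resident in a single GPU's local memory.
HOT_EMBEDDINGS = {0, 1, 2, 5, 8}

def dispatch_minibatch(batch):
    """Split a mini-batch into GPU-ready inputs and inputs needing a prefetch.

    batch: list of inputs, each a list of embedding-row indices.
    Returns (gpu_ready, needs_prefetch, cold_rows_to_fetch).
    """
    gpu_ready, needs_prefetch, cold_rows = [], [], set()
    for inp in batch:
        cold = [idx for idx in inp if idx not in HOT_EMBEDDINGS]
        if cold:
            needs_prefetch.append(inp)
            cold_rows.update(cold)   # rows to copy from CPU main memory
        else:
            gpu_ready.append(inp)    # all accesses hit the GPU-resident hot set
    return gpu_ready, needs_prefetch, cold_rows

batch = [[0, 1], [2, 9], [5, 8], [3]]
gpu_ready, needs_prefetch, cold_rows = dispatch_minibatch(batch)
# gpu_ready -> [[0, 1], [5, 8]]; cold rows {9, 3} would be prefetched
# while the GPU trains on the popular inputs.
```

In the thesis this classification and gathering is done by the hardware accelerator across several mini-batches at once, so the prefetch of cold rows overlaps with GPU training on the popular inputs.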
Item Metadata
Title |
Accelerating input dispatching for deep learning recommendation models training
|
Creator |
Adnan, Muhammad
|
Supervisor | |
Publisher |
University of British Columbia
|
Date Issued |
2021
|
Genre | |
Type | |
Language |
eng
|
Date Available |
2021-10-25
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0402605
|
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2021-11
|
Campus | |
Scholarly Level |
Graduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|