UBC Theses and Dissertations


Towards efficient machine learning management systems

Ding, Dujian

Abstract

Machine learning (ML), and deep learning (DL) in particular, has become a leading force in both academic research and industrial applications. Alongside the impressive capabilities of ML models, model sizes have exploded in recent years. Gigantic models not only pose computational challenges but also raise ethical concerns around green AI and the democratization of AI. Significant research effort has been invested in making ML more efficient. Model-level ML efficiency studies the fundamental trade-off between model efficiency and effectiveness, while system-level ML efficiency addresses the overall efficiency of answering ML inference queries that invoke multiple models. In this dissertation, we aim to answer the central question: how can we make machine learning services more efficient without compromising overall performance? Our work spans both model-level and system-level ML efficiency. On the model level, we propose effective approaches to extract efficient subnetworks from gigantic ML models that meet user-specified sparsity targets while maintaining high performance (Chapter 2). On the system level, we identify two important sub-tasks, bulk query processing and streaming query processing, where query objects are either provided all at once as a bulk or arrive in a stream. For bulk query processing, we study the Fixed-Radius Near Neighbour query and develop approximate algorithms that efficiently deliver high-quality answers with statistical guarantees using one cheap proxy model and one expensive oracle model (Chapter 3). We also investigate the more general setup where we have a set of ML models with different cost-accuracy trade-offs, and we design principled algorithms to select the optimal model assignment for ML classification queries (Chapter 4). For streaming query processing, we consider powerful conversational ML services, such as ChatGPT, that are supported by large language models (LLMs). We develop novel adaptive query routing solutions that significantly reduce overall cost by diverting query traffic from expensive cloud models to small on-device models without compromising response quality (Chapter 5). Finally, we extend the routing framework to a spectrum of LLMs and leverage best-of-n sampling to further improve efficiency (Chapter 6).
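To give a concrete flavour of the cost-aware routing idea summarized above, the following minimal Python sketch shows a thresholded cascade that answers with a cheap model when it is confident enough and escalates to an expensive model otherwise. All names here (small_model, large_model, the confidence field, the 0.8 threshold) are hypothetical placeholders for illustration; this is not the dissertation's actual algorithm.

# Illustrative sketch only: route each query to a cheap on-device model
# first, and fall back to an expensive cloud model when the cheap model's
# self-reported confidence is low. Every identifier is a placeholder.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Answer:
    text: str
    confidence: float  # quality score in [0, 1] reported by the model

def route(query: str,
          small_model: Callable[[str], Answer],
          large_model: Callable[[str], Answer],
          threshold: float = 0.8) -> Answer:
    """Answer with the cheap model if it clears the confidence
    threshold; otherwise escalate to the expensive model."""
    cheap = small_model(query)
    if cheap.confidence >= threshold:
        return cheap           # cloud model never invoked: cost saved
    return large_model(query)  # low-confidence query escalated

# Example usage with stub models standing in for real ML services:
# print(route("2+2?", lambda q: Answer("4", 0.95), lambda q: Answer("4", 0.99)).text)

The design choice this sketch captures is that routing decisions are made per query, so easy queries are served cheaply while hard ones still receive a high-quality answer.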


Rights

Attribution-NonCommercial-NoDerivatives 4.0 International