Open Collections
UBC Theses and Dissertations
Towards efficient machine learning management systems
Ding, Dujian
Abstract
Machine learning (ML), especially deep learning (DL), has become a leading force in both academic research and industrial applications. In tandem with the impressive capabilities of ML models, model sizes have exploded in recent years. Gigantic models not only pose computational challenges but also raise broader concerns around green AI and democratizing AI. Significant research effort has been invested in making ML more efficient. Model-level ML efficiency studies the fundamental trade-off between model efficiency and effectiveness, while system-level ML efficiency addresses the overall efficiency of answering ML inference queries that invoke multiple models. In this dissertation, we aim to answer the central question: how can machine learning services be made more efficient without compromising overall performance? Our work spans both model-level and system-level ML efficiency. On the model level, we propose effective approaches to extract efficient subnetworks from gigantic ML models at user-specified sparsity targets while maintaining high performance (Chapter 2). On the system level, we identify two important sub-tasks, bulk query processing and streaming query processing, in which query objects are either provided all at once as a bulk or arrive in a stream. For bulk query processing, we study the Fixed-Radius Near Neighbour query and develop approximate algorithms that efficiently deliver high-quality answers with statistical guarantees using one cheap proxy model and one expensive oracle model (Chapter 3). We also investigate the more general setup in which we have a set of ML models with different cost and accuracy trade-offs, and conceive principled algorithms to select the optimal model assignments for ML classification queries (Chapter 4). For streaming query processing, we consider powerful conversational ML services such as ChatGPT, which are built on large language models (LLMs). We develop novel adaptive query routing solutions that significantly reduce overall cost by distributing query traffic from expensive cloud models to small on-device models without compromising response quality (Chapter 5). Furthermore, we extend the routing framework to a spectrum of LLMs and leverage best-of-n sampling to further enhance efficiency (Chapter 6).
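As a concrete point of reference for the Chapter 2 setting (extracting a subnetwork at a user-specified sparsity target), the sketch below shows plain magnitude pruning. This is a generic baseline, not the dissertation's method; the function name and NumPy-based interface are assumptions for illustration.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights.

    A generic baseline for producing a sparse subnetwork at a
    user-specified sparsity target (e.g. sparsity=0.9 keeps ~10%
    of weights). Illustration only, not the Chapter 2 algorithm.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)      # number of weights to prune
    if k == 0:
        return weights.copy()
    # The k-th smallest magnitude is the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold  # ties at the threshold are pruned
    return weights * mask
```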
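For the Chapter 3 proxy/oracle setup, a minimal sketch of the underlying idea follows: a cheap proxy distance screens candidates, and only those in an uncertain band are verified by the expensive oracle. The thresholds `low` and `high`, and how they would be calibrated to yield the statistical guarantees described above, are assumptions for illustration.

```python
def frnn_proxy_oracle(query, candidates, proxy_dist, oracle_dist,
                      radius, low, high):
    """Approximate fixed-radius near-neighbour query (sketch).

    proxy_dist: cheap, possibly noisy distance estimate
    oracle_dist: expensive, trusted distance
    low/high: proxy thresholds bracketing the uncertain band;
    calibrating them to meet a target precision/recall is the
    hard part and is out of scope for this sketch.
    """
    answers = []
    for c in candidates:
        d = proxy_dist(query, c)
        if d <= low:                     # confidently within radius
            answers.append(c)
        elif d <= high:                  # uncertain: pay for the oracle
            if oracle_dist(query, c) <= radius:
                answers.append(c)
        # d > high: confidently outside the radius, rejected for free
    return answers
```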
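For the streaming setting of Chapters 5 and 6, the sketch below combines the two ideas named in the abstract: a router sends a query to the small on-device model when its predicted success probability clears a threshold, falling back to the cloud model otherwise, and best-of-n sampling keeps the highest-scoring of n candidate responses. The router, models, scorer, and threshold are all hypothetical stand-ins, not the dissertation's components.

```python
def answer(query, router_score, on_device_model, cloud_model,
           threshold=0.5, n=4, score=None):
    """Cost-aware LLM routing with optional best-of-n sampling (sketch).

    router_score(query): estimated probability the on-device model
    suffices; score(query, response): response-quality estimate.
    Both are assumed to be learned components; illustration only.
    """
    # Route: prefer the cheap on-device model when predicted to suffice.
    model = on_device_model if router_score(query) >= threshold else cloud_model
    if score is None:
        return model(query)
    # Best-of-n: draw n samples, keep the highest-scoring response.
    samples = [model(query) for _ in range(n)]
    return max(samples, key=lambda r: score(query, r))
```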
Item Metadata
Title | Towards efficient machine learning management systems
Creator | Ding, Dujian
Supervisor |
Publisher | University of British Columbia
Date Issued | 2025
Genre |
Type |
Language | eng
Date Available | 2025-07-04
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0449274
URI |
Degree (Theses) |
Program (Theses) |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2025-11
Campus |
Scholarly Level | Graduate
Rights URI |
Aggregated Source Repository | DSpace