UBC Theses and Dissertations


Towards efficient machine learning management systems

Ding, Dujian

Abstract

Machine learning (ML), and deep learning (DL) in particular, has become a leading force in both academic research and industrial applications. Alongside the impressive capabilities of ML models, model sizes have exploded in recent years. Gigantic models not only pose computational challenges but also raise ethical concerns around green AI and the democratization of AI. Significant research effort has been invested in making ML more efficient. Model-level ML efficiency studies the fundamental trade-off between model efficiency and effectiveness, while system-level ML efficiency addresses the overall efficiency of answering ML inference queries that invoke multiple models. In this dissertation, we aim to answer the central question: how can we make machine learning services more efficient without compromising overall performance? Our work spans both model-level and system-level ML efficiency. On the model level, we propose effective approaches to extract efficient subnetworks from gigantic ML models that meet user-specified sparsity targets while maintaining high performance (Chapter 2). On the system level, we identify two important sub-tasks, bulk query processing and streaming query processing, where query objects are either provided all at once as a bulk or arrive in a stream. For bulk query processing, we study the Fixed-Radius Near Neighbour query and develop approximate algorithms that efficiently deliver high-quality answers with statistical guarantees using one cheap proxy model and one expensive oracle model (Chapter 3). We also investigate the more general setup where we have a set of ML models with different cost-accuracy trade-offs, and we design principled algorithms to select the optimal model assignment for ML classification queries (Chapter 4). For streaming query processing, we consider powerful conversational ML services, such as ChatGPT, that are supported by large language models (LLMs). We develop novel adaptive query routing solutions that significantly reduce overall cost by diverting query traffic from expensive cloud models to small on-device models without compromising response quality (Chapter 5). Finally, we extend the routing framework to a spectrum of LLMs and leverage best-of-n sampling to further improve efficiency (Chapter 6).
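To give a concrete flavour of the cost-aware routing idea summarized above, the following minimal Python sketch shows a thresholded cascade that answers with a cheap model when it is confident enough and escalates to an expensive model otherwise. All names here (small_model, large_model, the confidence field, the 0.8 threshold) are hypothetical placeholders for illustration; this is not the dissertation's actual algorithm.

# Illustrative sketch only: route each query to a cheap on-device model
# first, and fall back to an expensive cloud model when the cheap model's
# self-reported confidence is low. Every identifier is a placeholder.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Answer:
    text: str
    confidence: float  # quality score in [0, 1] reported by the model

def route(query: str,
          small_model: Callable[[str], Answer],
          large_model: Callable[[str], Answer],
          threshold: float = 0.8) -> Answer:
    """Answer with the cheap model if it clears the confidence
    threshold; otherwise escalate to the expensive model."""
    cheap = small_model(query)
    if cheap.confidence >= threshold:
        return cheap           # cloud model never invoked: cost saved
    return large_model(query)  # low-confidence query escalated

# Example usage with stub models standing in for real ML services:
# print(route("2+2?", lambda q: Answer("4", 0.95), lambda q: Answer("4", 0.99)).text)

The design choice this sketch captures is that routing decisions are made per query, so easy queries are served cheaply while hard ones still receive a high-quality answer.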


Rights

Attribution-NonCommercial-NoDerivatives 4.0 International