
UBC Theses and Dissertations

An FPGA memory architecture to enable efficient weight implementations for machine learning applications

Chua, Martin

Abstract

Hardware acceleration for machine learning applications has become increasingly important as models grow and evolve rapidly. FPGAs can adapt to these changes quickly because they are hardware-reconfigurable, while also providing low latency, high throughput, and efficiency. The efficiency of machine learning acceleration is intrinsically tied to memory access latency, capacity, and bandwidth. On an FPGA, fine-grained resources such as flip-flops and LUTRAMs provide low-latency access but offer limited storage capacity. Dedicated on-chip BRAMs provide higher density, but their capacity is still finite. Off-chip DRAM suffers from increased latency and constrained bandwidth, which limits the throughput of model training and inference. Prior work has proposed an architectural enhancement that allows the user to re-purpose unused configuration bits as user-accessible memory. In typical FPGA implementations, a significant portion of routing segments remains unused. By modifying the switch block architecture, the configuration bits controlling these unused segments can be re-purposed as user storage. Inspired by this work and the growing demand for machine learning acceleration, we present three research contributions. The first is an FPGA architecture enhancement, called switch block memory, that allows the user to re-purpose unused FPGA switch block configuration bits to implement weight memory in machine learning applications. The second is a comprehensive analysis of machine learning memory utilization to identify the specific contexts where our switch block memories are most effective. The third is an augmented CAD flow, integrated into the open-source VTR CAD suite, to evaluate the proposed architecture. When applied to selected machine learning workloads, our approach achieves up to a 9% improvement in Fmax, a 3% reduction in total wire length, and up to 80 Mb of additional on-chip storage on large FPGA devices.
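To give intuition for how repurposed switch block configuration bits can add up to tens of megabits, the following is a minimal back-of-envelope sketch. All parameter names and values here are hypothetical illustrative choices, not figures reported in the thesis; the actual capacity depends on the device grid, switch block topology, and routing utilization of a given design.

```python
def reclaimable_bits(num_switch_blocks: int,
                     muxes_per_block: int,
                     config_bits_per_mux: int,
                     routing_utilization: float) -> int:
    """Estimate the configuration bits freed when the bits controlling
    unused routing multiplexers are exposed as user-accessible memory.

    Only multiplexers driving unused routing segments contribute, so the
    total scales with (1 - routing_utilization).
    """
    unused_fraction = 1.0 - routing_utilization
    return int(num_switch_blocks * muxes_per_block
               * config_bits_per_mux * unused_fraction)


# Hypothetical large-device parameters (illustrative only):
bits = reclaimable_bits(
    num_switch_blocks=200_000,   # switch blocks in the device grid
    muxes_per_block=200,         # routing muxes per switch block
    config_bits_per_mux=4,       # select bits per mux
    routing_utilization=0.5,     # half of the segments are in use
)
print(f"{bits / 1e6:.0f} Mb reclaimable")  # → 80 Mb with these values
```

With these invented numbers the estimate lands at 80 Mb, the same order of magnitude as the storage gain reported for large devices; the point of the sketch is only that modest per-block bit counts multiply quickly across a large FPGA grid.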

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International