
UBC Theses and Dissertations

An FPGA memory architecture to enable efficient weight implementations for machine learning applications

Chua, Martin

Abstract

Hardware acceleration for machine learning applications has become increasingly important as models grow and evolve rapidly. FPGAs can adapt to these changes quickly because they are hardware-reconfigurable, while also providing low latency, high throughput, and efficiency. The efficiency of machine learning acceleration is intrinsically tied to memory access latency, capacity, and bandwidth. On an FPGA, fine-grained resources such as flip-flops and LUTRAMs provide low-latency access but offer limited storage capacity. Dedicated on-chip BRAMs provide higher density, but their capacity is still finite. Off-chip DRAM suffers from increased latency and constrained bandwidth, which limits the throughput of model training and inference. Prior work has proposed an architectural enhancement that allows the user to re-purpose unused configuration bits as user-accessible memory. In typical FPGA implementations, a significant portion of routing segments remains unused. By modifying the switch block architecture, the configuration bits controlling these unused segments can be re-purposed as user storage. Inspired by this work and the growing demand for machine learning acceleration, we present three research contributions. The first is an FPGA architecture enhancement, called switch block memory, that allows the user to re-purpose unused FPGA switch block configuration bits to implement weight memory in machine learning applications. The second is a comprehensive analysis of machine learning memory utilization to identify the specific contexts where our switch block memories are most effective. The third is an augmented CAD flow, integrated into the open-source VTR CAD suite, to evaluate the proposed architecture. When applied to selected machine learning workloads, our approach achieves up to a 9% improvement in Fmax, a 3% reduction in total wire length, and up to 80 Mb of additional on-chip storage on large FPGA devices.
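To give intuition for how repurposed switch block configuration bits can add up to tens of megabits, the following is a minimal back-of-envelope sketch. All parameter names and values here are hypothetical illustrative choices, not figures reported in the thesis; the actual capacity depends on the device grid, switch block topology, and routing utilization of a given design.

```python
def reclaimable_bits(num_switch_blocks: int,
                     muxes_per_block: int,
                     config_bits_per_mux: int,
                     routing_utilization: float) -> int:
    """Estimate the configuration bits freed when the bits controlling
    unused routing multiplexers are exposed as user-accessible memory.

    Only multiplexers driving unused routing segments contribute, so the
    total scales with (1 - routing_utilization).
    """
    unused_fraction = 1.0 - routing_utilization
    return int(num_switch_blocks * muxes_per_block
               * config_bits_per_mux * unused_fraction)


# Hypothetical large-device parameters (illustrative only):
bits = reclaimable_bits(
    num_switch_blocks=200_000,   # switch blocks in the device grid
    muxes_per_block=200,         # routing muxes per switch block
    config_bits_per_mux=4,       # select bits per mux
    routing_utilization=0.5,     # half of the segments are in use
)
print(f"{bits / 1e6:.0f} Mb reclaimable")  # → 80 Mb with these values
```

With these invented numbers the estimate lands at 80 Mb, the same order of magnitude as the storage gain reported for large devices; the point of the sketch is only that modest per-block bit counts multiply quickly across a large FPGA grid.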

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International