UBC Theses and Dissertations
Enhancing FPGA accelerated machine learning debug with lossless compression
Nafziger, Zakary Snider
Acceleration of machine learning models is proving to be an important application for FPGAs. Unfortunately, debugging such models during training or inference is difficult. Software simulations of a machine learning system may lack the detail needed to provide meaningful debug insight, or may require infeasibly long run-times. Thus, it is often desirable to debug the accelerated model while it is running on real hardware. Effective on-chip debug often requires instrumenting a design with additional circuitry to store run-time data, consuming valuable chip resources. Previous work has developed methods to perform lossy compression of signals by exploiting machine-learning-specific knowledge, thereby increasing the amount of debug context that can be stored in an on-chip trace buffer. However, all prior work compresses each successive element in a signal of interest independently. Since debug signals may exhibit temporal similarity in many machine learning applications, there is an opportunity to further increase trace buffer utilization. To that end, this thesis presents two major research contributions. The first contribution is an architecture that performs lossless temporal compression in addition to the existing lossy element-wise compression. It is shown that, when this architecture is applied to typical machine learning algorithms in realistic debug scenarios, approximately twice as much information can be stored in an on-chip buffer, while the total area of the debug instrument increases by approximately 25%. The impact is that, for a given instrumentation budget, a significantly larger trace window is available during debug, possibly allowing a designer to narrow down the root cause of a bug faster. The second contribution is an evaluation of the margin for further improvement in compression performance. An attempt was made to determine the entropy at the input of the proposed encoder using information theory, but this proved mathematically intractable.
Instead, the proposed encoder was compared against a best-in-class software compression algorithm. It was demonstrated that, while not superior to the software algorithm, the proposed encoder performs well in the memory-scarce context of FPGA debug.
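To make the distinction between element-wise and temporal compression concrete, here is a minimal Python sketch. It is not the thesis's encoder architecture, which the abstract does not describe. It assumes a hypothetical uniform quantizer (`quantize`, with an illustrative `step`) as a stand-in for the prior lossy element-wise stage, and uses plain run-length encoding (`rle_encode`/`rle_decode`) as a stand-in for a lossless temporal stage. All names and sample values are hypothetical.

```python
from typing import List, Tuple


def quantize(values: List[float], step: float) -> List[int]:
    """Hypothetical lossy element-wise stage: each sample is mapped to the
    nearest multiple of `step`, independently of its neighbors."""
    return [round(v / step) for v in values]


def rle_encode(symbols: List[int]) -> List[Tuple[int, int]]:
    """Stand-in lossless temporal stage (plain run-length encoding):
    consecutive identical symbols collapse into (symbol, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            # Same value as the previous sample: extend the current run.
            runs[-1] = (s, runs[-1][1] + 1)
        else:
            # Value changed (or first sample): start a new run.
            runs.append((s, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Inverse of rle_encode; reproduces the quantized symbols exactly."""
    out: List[int] = []
    for s, n in runs:
        out.extend([s] * n)
    return out


if __name__ == "__main__":
    # Hypothetical debug trace whose consecutive samples are similar.
    trace = [0.51, 0.49, 0.50, 0.52, 1.01, 0.98, 1.00, 0.0, 0.0, 0.0]
    symbols = quantize(trace, step=0.5)  # [1, 1, 1, 1, 2, 2, 2, 0, 0, 0]
    runs = rle_encode(symbols)           # [(1, 4), (2, 3), (0, 3)]
    assert rle_decode(runs) == symbols   # the temporal stage loses nothing
    print(len(symbols), "symbols ->", len(runs), "runs")
```

Because each sample is quantized on its own, the quantizer alone cannot exploit the fact that neighboring samples repeat. The run-length stage stores the ten quantized symbols as three (symbol, run length) pairs and reconstructs them exactly. Whether this saves trace buffer space in hardware depends on how often values actually repeat, on the bit width chosen for the run-length field, and on the area the extra encoder consumes.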
Attribution-NonCommercial-NoDerivatives 4.0 International