UBC Theses and Dissertations

Using mixed low-precision formats in multiply-accumulate (MAC) units for DNN training

Tatsumi, Mariko


Because of their limited size, cost, and power, embedded devices do not offer the same computational throughput as graphics processing units (GPUs) for training deep neural networks (DNNs). The most compute-intensive stage of multilayer perceptron (MLP) and convolutional neural network (CNN) training is the general matrix multiply (GEMM) kernel, which is executed three times per layer in each iteration: once for forward-propagation and twice for back-propagation. To reduce the number of operations, techniques such as distillation (to reduce model size) and pruning (to introduce sparsity) are commonly applied. This thesis considers another technique, in which the computational effort of each operation is reduced using low-precision arithmetic. While small data types are common in DNN inference, they are not yet common in DNN training. Previous work in the area is somewhat limited, sometimes considering only 16-bit floating-point formats or overlooking implementation details, such as the area and accuracy tradeoffs of exact digital multiplier designs. This thesis considers the use of mixed-precision operations (MPO) within the GEMM kernel for DNN training. To conduct these investigations, we have implemented a complete DNN training framework for embedded systems, Archimedes-MPO. Starting with the C++ library TinyDNN, we have abstracted each layer to use custom data types and accelerated the GEMM stage with CUDA and Vitis HLS to produce bit-accurate GPU and FPGA implementations. This framework allows us to measure exactly the impact of various multiplier and accumulator designs on area and accuracy. Compared to a 32-bit floating-point multiplier, a multiplier with as few as 2 mantissa bits attains similar accuracy. Accuracy losses are reduced with adaptive loss scaling and the removal of hardware for rounding and not-a-number (NaN) representations. Removal of subnormals saves area as well, but hurts accuracy, so we propose a custom subnormal encoding as a compromise.
For accumulation, 12-bit floating-point and 21-bit fixed-point formats perform similarly. Fixed-point accumulation appears to have an area advantage, but the impact of a wider output data size can be costly on downstream logic. While precise results depend upon the model and dataset used, the observed trends and framework can help guide the design of future GEMM-based hardware accelerators for DNNs.
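
To make the idea of a mixed-precision MAC concrete, the following is a minimal Python sketch, not the Archimedes-MPO implementation. It multiplies operands whose float32 mantissas have been cut to 2 bits by truncation (no rounding) and accumulates the products in a saturating 21-bit fixed-point register. The function names, the split of the 21 bits into 10 fractional bits, the saturation behaviour, and the handling of subnormals (left untouched here) are all assumptions made for illustration; the abstract does not specify them. Inputs are assumed to be finite.

```python
import numpy as np

def truncate_mantissa(x, mantissa_bits):
    """Keep the top `mantissa_bits` of the 23-bit float32 mantissa.

    Lower bits are zeroed (truncation toward zero, no rounding hardware).
    Hypothetical helper; sign and exponent fields are left unchanged.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    drop = 23 - mantissa_bits
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

def mac_fixed_point(a, b, mantissa_bits=2, total_bits=21, frac_bits=10):
    """Dot product with low-precision multiplies and fixed-point accumulation.

    Assumed layout: signed two's-complement accumulator of `total_bits`
    bits, of which `frac_bits` are fractional; overflow saturates.
    """
    a_q = truncate_mantissa(a, mantissa_bits)
    b_q = truncate_mantissa(b, mantissa_bits)
    # Products of two short mantissas are exactly representable in float64.
    products = a_q.astype(np.float64) * b_q.astype(np.float64)

    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    acc = 0  # integer accumulator holding value * 2**frac_bits
    for p in products:
        acc += int(np.floor(p * (1 << frac_bits)))  # drop bits below the LSB
        acc = max(lo, min(hi, acc))                 # saturate on overflow
    return acc / (1 << frac_bits)

if __name__ == "__main__":
    a = [1.0, 1.5, 1.9, -2.0]
    b = [1.0, 2.0, 1.0, 0.5]
    print(truncate_mantissa(a, 2))   # 1.9 is truncated to 1.75
    print(mac_fixed_point(a, b))
```

Sweeping `mantissa_bits`, `total_bits`, and `frac_bits` in a sketch like this gives a rough feel for how multiplier and accumulator widths trade off against numerical error, although only a bit-accurate hardware model can report area.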

Attribution-NonCommercial-NoDerivatives 4.0 International