In this paper, we have presented a case for using low-bitwidth floating-point arithmetic for Transformer-based DNN inference. We demonstrate that low-bitwidth floating-point (bfp8) matrix multiplication can be implemented effectively in hardware with only a marginal increase over an 8-bit integer equivalent while attaining processing throughput close to the platform maximum. In addition, we show that such an array can be effectively reconfigured at run time into a programmable fp32 vector processing unit capable of implementing any non-linear function required by future Transformer-based DNN models. With efficient support for both datatypes, the proposed design eliminates the need to quantize and retrain Transformer models, a task that is increasingly challenging due to their size. We argue that mixed-precision floating point is a promising datatype that strikes a favorable balance between model accuracy and hardware performance for Transformer-based DNN acceleration. An automatic compilation framework that provides full-stack acceleration of Transformer models is currently under development, and the vector processing unit is being further optimized to improve non-linear function performance.
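As a generic illustration of why such an array can stay close to integer cost, the sketch below assumes bfp8 denotes a block floating-point scheme in which a block of values shares one exponent and keeps short signed mantissas, so a dot product reduces to integer multiply-accumulates plus a single exponent addition; the exact bfp8 format and array design are defined in the paper.
\begin{verbatim}
import numpy as np

def to_bfp8(block, mant_bits=8):
    """Quantize a 1-D block to shared-exponent block floating point.

    Illustrative only: one shared exponent per block, signed mantissas
    of `mant_bits` bits (including sign).
    """
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block, dtype=np.int32), 0
    # Shared exponent chosen so the largest magnitude fits the mantissa range.
    shared_exp = int(np.floor(np.log2(max_abs))) - (mant_bits - 2)
    mant = np.clip(np.round(block / 2.0 ** shared_exp),
                   -(2 ** (mant_bits - 1)),
                   2 ** (mant_bits - 1) - 1).astype(np.int32)
    return mant, shared_exp

def bfp8_dot(a, b):
    """Dot product of two bfp8 blocks: integer MACs plus one exponent add."""
    ma, ea = to_bfp8(a)
    mb, eb = to_bfp8(b)
    return int(np.dot(ma, mb)) * 2.0 ** (ea + eb)

x = np.random.randn(64).astype(np.float32)
w = np.random.randn(64).astype(np.float32)
print(bfp8_dot(x, w), float(np.dot(x, w)))  # close, within quantization error
\end{verbatim}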
In this work, we present SqueezeBlock, a transparent weight compression scheme that effectively reduces the memory footprint of DNN models with minimal accuracy degradation and without requiring network retraining. SqueezeBlock uses a cluster-aided FP8 quantization method to preprocess the weights, which facilitates a subsequent block-based weight encoding scheme that achieves a favorable tradeoff between compression ratio and accuracy loss. In addition, the proposed design space exploration framework allows us to identify optimal encoding parameter configurations for different DNN models. A hardware decoder customized to the specific encoding can then be generated automatically to decode the weights at run time. Experiments showed that SqueezeBlock retains the most significant information from the original model despite the large compression ratio, even where this is not immediately apparent, as demonstrated by the ViT model in comparison with other vision models. In the future, we plan to extend the proposed scheme to compress and decompress activations at run time in hardware, which can improve both DNN inference and training time.
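The cluster-aided FP8 quantization and block encoding are specific to SqueezeBlock; purely as a hypothetical illustration of the overall idea, the sketch below rounds weights to an FP8-like grid and then applies a simple per-block dictionary code, reporting an approximate compression ratio against fp32.
\begin{verbatim}
import numpy as np

def fp8_like_quantize(w, mant_bits=3):
    """Round each weight to an FP8-style grid: sign, power-of-two exponent,
    and `mant_bits` mantissa bits (illustrative; ignores saturation and
    subnormal handling of any specific FP8 standard)."""
    out = np.zeros_like(w)
    nz = w != 0
    e = np.floor(np.log2(np.abs(w[nz])))
    scale = 2.0 ** (e - mant_bits)
    out[nz] = np.round(w[nz] / scale) * scale
    return out

def block_encode(wq, block=64):
    """Per-block dictionary coding: store each block's distinct quantized
    values once, plus a short index per weight (hypothetical encoding,
    not the SqueezeBlock format)."""
    codes = []
    for i in range(0, wq.size, block):
        blk = wq[i:i + block]
        table, idx = np.unique(blk, return_inverse=True)
        bits = max(1, int(np.ceil(np.log2(len(table)))))
        codes.append((table.astype(np.float16), idx.astype(np.uint8), bits))
    payload_bits = sum(t.size * 16 + idx.size * b for t, idx, b in codes)
    return codes, 32 * wq.size / payload_bits  # rough ratio vs fp32

w = np.random.randn(4096).astype(np.float32)
codes, ratio = block_encode(fp8_like_quantize(w))
print(f"approximate compression ratio vs fp32: {ratio:.2f}x")
\end{verbatim}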
This paper has proposed a novel hardware-aware quantization framework, together with a fused mixed-precision accelerator, to efficiently support a distribution-adaptive data representation named DyBit. Its variable-length bit fields enable DyBit to adapt to the tensor distributions in DNNs. Evaluation results show the effectiveness of DyBit-based quantization at very low bitwidths.
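The exact DyBit encoding is defined in the paper; purely as a generic illustration of a variable-length bit-field format, the sketch below decodes a hypothetical 8-bit code whose unary exponent prefix grows with magnitude, so code density concentrates near zero, where DNN weight distributions are densest.
\begin{verbatim}
def decode_taper(code, nbits=8):
    """Decode a variable-length-exponent code.

    Generic illustration only, not the exact DyBit format.
    Layout (MSB to LSB): [sign | unary exponent prefix | 0 | mantissa].
    Longer prefixes encode larger magnitudes, so short codes (and fine
    resolution) concentrate near zero, matching typical DNN tensors.
    """
    sign = -1.0 if (code >> (nbits - 1)) & 1 else 1.0
    body = code & ((1 << (nbits - 1)) - 1)
    if body == 0:
        return 0.0                          # reserve the all-zero body for zero
    exp, pos = 0, nbits - 2
    while pos >= 0 and (body >> pos) & 1:   # count leading ones = exponent
        exp += 1
        pos -= 1
    pos -= 1                                # skip the terminating 0 bit
    mant_bits = max(pos + 1, 0)
    mant = body & ((1 << mant_bits) - 1) if mant_bits else 0
    frac = 1.0 + (mant / (1 << mant_bits) if mant_bits else 0.0)
    return sign * frac * 2.0 ** (exp - 4)   # fixed bias of 4 (assumed)

# Enumerate all 8-bit codes: note the denser spacing near zero.
values = sorted(decode_taper(c) for c in range(256))
print(values[:4], values[126:130], values[-4:])
\end{verbatim}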
In this paper, we have presented AGNA, an open-source deep neural network (DNN) accelerator hardware and software generator for FPGA platforms. AGNA relies on the two proposed MIGP formulations and their relaxed solutions to perform DSE when customizing a generic accelerator template for a given network model-platform combination. Through extensive experiments with many combinations of DNN models and platforms, we have demonstrated AGNA's capability to produce DNN accelerators with performance comparable to the state-of-the-art in research and vendor-provided frameworks. Importantly, although the accelerators currently produced may not be the fastest for every model-platform combination, AGNA is vendor-agnostic and designed to be easily extensible, making it suitable for real-world deployment and for serving as a basis for future research. In the future, we plan to improve AGNA with advanced scheduling capability to work with multi-bank memories, as well as to improve its performance through low-level hardware optimizations. We also plan to explore novel network model quantization and pruning techniques by leveraging AGNA's processing architecture and scheduling capabilities.
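AGNA's MIGP formulations and accelerator template are its own; as a toy stand-in for the kind of DSE they solve, the sketch below exhaustively picks hypothetical PE-array dimensions that minimize an estimated latency subject to a simplified DSP/BRAM resource model.
\begin{verbatim}
from itertools import product

# Hypothetical resource and latency models; AGNA's actual MIGP
# formulations and accelerator template are far more detailed.
DSP_BUDGET, BRAM_BUDGET = 2520, 912          # assumed mid-range FPGA budgets
LAYER_MACS = [115_605_504, 231_211_008]      # assumed per-layer MAC counts

def dsp_cost(rows, cols):    # one DSP per MAC unit (simplification)
    return rows * cols

def bram_cost(rows, cols):   # toy on-chip buffer model
    return 2 * rows + 2 * cols + rows * cols // 8

def latency(rows, cols):     # MACs divided by array throughput
    return sum(m / (rows * cols) for m in LAYER_MACS)

best = min(
    (p for p in product(range(4, 129, 4), repeat=2)
     if dsp_cost(*p) <= DSP_BUDGET and bram_cost(*p) <= BRAM_BUDGET),
    key=lambda p: latency(*p),
)
print("chosen PE array:", best, "estimated cycles ~", int(latency(*best)))
\end{verbatim}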
This work has proposed MSD, an FPGA-tailored heterogeneous DNN acceleration framework that utilizes both LUTs and DSPs as computation resources and exploits bit-sparsity. The RSD data representation enables MSD to fine-tune and encode the DNN weights into a bit-sparsity-aware format, making the bit-serial computation on LUTs more efficient. Furthermore, we integrate a latency-driven search algorithm into MSD, which searches for the optimal schedule, number of EBs, and workload split ratio for each layer under a latency constraint. Evaluation results on various DNN models and edge FPGA devices demonstrate that MSD achieves a 1.52$\times$ speedup and 1.36$\times$ higher throughput than the state-of-the-art on the ResNet-18 model, and 4.78\% higher accuracy on MobileNet-V2. In the future, we will explore more efficient scheduling methods for workload splitting in the heterogeneous architecture and EB selection in the bit-serial computation, and exploit FPGA-layout-tailored hardware design to further raise the hardware clock frequency.
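MSD's RSD encoding and LUT mapping are specific to the paper; as a generic illustration of why bit-sparsity speeds up bit-serial arithmetic, the sketch below recodes a weight into canonical signed digits and performs one shift-add per nonzero digit, so the cycle count tracks the number of nonzero digits rather than the full bitwidth.
\begin{verbatim}
def csd_digits(w):
    """Recode a non-negative integer weight into canonical signed digits
    (each digit in {-1, 0, +1}), which minimizes the nonzero-digit count.
    Shown as a generic signed-digit example; MSD's RSD format and
    fine-tuning are its own."""
    digits, i = [], 0
    while w != 0:
        if w & 1:
            d = 2 - (w & 3)          # +1 if ...01, -1 if ...11
            w -= d
            digits.append((i, d))
        w >>= 1
        i += 1
    return digits                    # list of (bit position, +/-1)

def bit_serial_mac(x, w):
    """Multiply x*w with one shift-add 'cycle' per nonzero digit of w."""
    acc = 0
    for pos, d in csd_digits(w):
        acc += d * (x << pos)        # bit-sparsity => fewer cycles
    return acc

assert bit_serial_mac(57, 119) == 57 * 119
print(csd_digits(119))  # [(0, -1), (3, -1), (7, 1)] -> 3 cycles vs 8
\end{verbatim}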
This paper presents an energy-efficient DBN processor based on a heterogeneous multi-core architecture with transposable weight memory and on-chip local learning. In the future, we will focus on an ASIC implementation of the proposed DBN processor in a GALS architecture with a 7T/8T SRAM-based transposable memory design to address the remaining bottlenecks and further improve throughput and energy efficiency.
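As a brief, generic illustration of why a transposable weight memory matters for on-chip local learning in a DBN (the processor's actual dataflow is its own), the sketch below runs one contrastive-divergence step for an RBM layer: the same weight array is read row-wise for the visible-to-hidden pass and column-wise (as W.T) for reconstruction.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, v0, lr=0.1):
    """One contrastive-divergence step for an RBM layer (illustrative).
    Note the two access patterns on the same weight array: W for the
    visible->hidden pass and W.T for reconstruction, which is what a
    transposable weight memory serves without duplicating the weights."""
    h0 = sigmoid(v0 @ W)                                  # row-wise reads
    spikes = (h0 > rng.random(h0.shape)).astype(float)
    v1 = sigmoid(spikes @ W.T)                            # column-wise reads
    h1 = sigmoid(v1 @ W)
    return W + lr * (np.outer(v0, h0) - np.outer(v1, h1)) # local update

W = 0.01 * rng.standard_normal((784, 256))
v = rng.random(784)
W = cd1_step(W, v)
\end{verbatim}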
To address the aforementioned issues of the program-verify scheme, this paper proposes a novel in-situ error monitoring scheme for IMPLY-based memristive CIM systems.
This paper presents a novel CORDIC-based Spiking Neural Network (SNN) design with online STDP learning and high hardware efficiency. A system design and evaluation methodology for the CORDIC SNN is proposed to assess hardware efficiency across different CORDIC algorithm types and bit-width precisions.
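The paper's CORDIC variants and learning-rule implementation are its own; purely as a generic illustration of how a shift-and-add CORDIC kernel can serve STDP, the sketch below evaluates the exponential decay exp(-dt/tau) of a typical STDP window using hyperbolic CORDIC in rotation mode.
\begin{verbatim}
import math

def cordic_exp(z, iters=16):
    """exp(z) via hyperbolic CORDIC in rotation mode (shift-add style).
    Valid for roughly |z| < 1.1; iterations 4 and 13 are repeated as
    required for hyperbolic convergence. Generic illustration only."""
    # Build the iteration schedule with the standard repeats.
    idx, repeat, i = [], {4, 13}, 1
    while len(idx) < iters:
        idx.append(i)
        if i in repeat and len(idx) < iters:
            idx.append(i)
        i += 1
    # Gain compensation so x converges to cosh(z) and y to sinh(z).
    K = 1.0
    for i in idx:
        K *= math.sqrt(1.0 - 2.0 ** (-2 * i))
    x, y = 1.0 / K, 0.0
    for i in idx:
        d = 1.0 if z >= 0 else -1.0
        x, y, z = (x + d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * math.atanh(2.0 ** -i))
    return x + y                       # cosh(z) + sinh(z) = exp(z)

# STDP-style exponential window: weight change ~ A * exp(-dt / tau)
tau, A, dt = 20.0, 0.01, 5.0
dw = A * cordic_exp(-dt / tau)
print(dw, A * math.exp(-dt / tau))     # shift-add result vs reference
\end{verbatim}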