
A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference

In this paper, we have presented a case for using low-bitwidth floating-point arithmetic for Transformer-based DNN inference. We demonstrate that low-bitwidth floating-point (bfp8) matrix multiplication can be implemented effectively in hardware with only a marginal cost increase over an 8-bit integer equivalent while attaining processing throughput close to the platform maximum. In addition, we show that such an array can be reconfigured at run time into a programmable fp32 vector processing unit capable of implementing any non-linear function required by future Transformer-based DNN models. With efficient support for both datatypes, the proposed design eliminates the need to quantize and retrain Transformer models, which is increasingly challenging as these models grow in size. We argue that mixed-precision floating-point is a promising datatype that provides a favorable balance between model accuracy and hardware performance for Transformer-based DNN acceleration. An automatic compilation framework that provides full-stack acceleration of Transformer models is currently under development, and the vector processing unit is being optimized to improve non-linear function performance.
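To make the low-bitwidth floating-point matrix-multiply idea concrete, the sketch below quantizes fp32 operands onto a toy 8-bit floating-point grid (an e4m3-style format chosen purely for illustration; the paper's exact bfp8 format is not specified in this excerpt) and accumulates the products in fp32, mimicking a mixed-precision MAC array:

```python
import numpy as np

def quantize_fp8(x, exp_bits=4, man_bits=3):
    """Round an fp32 array onto a toy 8-bit floating-point grid
    (1 sign, 4 exponent, 3 mantissa bits); no saturation or denormal handling."""
    x = np.asarray(x, dtype=np.float32)
    sign, mag = np.sign(x), np.abs(x)
    bias = 2 ** (exp_bits - 1) - 1
    exp = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), -bias + 1, bias)
    step = 2.0 ** (exp - man_bits)              # spacing between representable values
    return (sign * np.round(mag / step) * step).astype(np.float32)

def bfp8_style_matmul(a, b):
    """Quantize both operands to 8-bit floats, accumulate the products in fp32."""
    return quantize_fp8(a) @ quantize_fp8(b)

a = np.random.randn(64, 64).astype(np.float32)
b = np.random.randn(64, 64).astype(np.float32)
print("max |error| vs fp32 matmul:", np.abs(bfp8_style_matmul(a, b) - a @ b).max())
```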

SqueezeBlock: A Transparent Weight Compression Scheme for Deep Neural Networks

In this work, we present SqueezeBlock, a transparent weight compression scheme that effectively reduces the memory footprint of DNN models with minimal accuracy degradation and without requiring network retraining. SqueezeBlock uses a cluster-aided FP8 quantization method to preprocess the weights, which facilitates a subsequent block-based weight encoding scheme and yields a good tradeoff between compression ratio and accuracy loss. In addition, the proposed design space exploration framework allows us to identify optimal encoding parameter configurations for different DNN models, and a hardware decoder customized to the chosen encoding can then be generated automatically to decode the weights at run time. Experiments showed that, despite the large compression ratio, SqueezeBlock retains the most significant information from the original model even when this is not immediately apparent, as demonstrated by the ViT model in comparison with other vision models. In the future, we plan to extend the proposed scheme to compress and decompress activations at run time in hardware, which can benefit both DNN inference and training.
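As an illustration of the two stages named above, the sketch below clusters a weight tensor into a small codebook, snaps the centroids onto a coarse FP8-like grid, and stores per-block index arrays. The clustering method, grid, block size, and encoding are all placeholder choices, since this excerpt does not define SqueezeBlock's actual algorithm:

```python
import numpy as np

def cluster_fp8_codebook(weights, n_clusters=16, iters=20):
    """Toy cluster-aided quantization: k-means the weights into a small codebook,
    then snap each centroid onto an FP8-like grid (illustrative stand-in only)."""
    w = weights.ravel().astype(np.float32)
    centroids = np.quantile(w, np.linspace(0.02, 0.98, n_clusters)).astype(np.float32)
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    exp = np.floor(np.log2(np.abs(centroids) + 1e-12))          # snap to 3-mantissa-bit grid
    centroids = np.sign(centroids) * np.round(np.abs(centroids) / 2 ** (exp - 3)) * 2 ** (exp - 3)
    return centroids.astype(np.float32), idx.reshape(weights.shape)

def encode_blocks(indices, block=64):
    """Block-based encoding placeholder: store per-block codebook-index arrays."""
    flat = indices.ravel()
    return [flat[i:i + block].astype(np.uint8) for i in range(0, flat.size, block)]

w = np.random.randn(256, 256).astype(np.float32)
codebook, idx = cluster_fp8_codebook(w)
blocks = encode_blocks(idx)
recon = codebook[idx]
print("clusters:", len(codebook), "blocks:", len(blocks),
      "reconstruction MSE:", float(((w - recon) ** 2).mean()))
```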

Model-Platform Optimized Deep Neural Network Accelerator Generation through Mixed-integer Geometric Programming

In this paper, we have presented AGNA, an open-source deep neural network (DNN) accelerator hardware and software generator for FPGA platforms. AGNA relies on the two proposed MIGP formulations and their relaxed solutions to perform design-space exploration (DSE) when customizing a generic accelerator template for a given network model-platform combination. Through extensive experiments with many combinations of DNN models and platforms, we have demonstrated AGNA's ability to produce DNN accelerators with performance comparable to state-of-the-art research and vendor-provided frameworks. Importantly, although the accelerators produced may not currently be the fastest for every model-platform combination, AGNA is vendor-agnostic and designed to be easily extensible, making it suitable for real-world deployments and for serving as a basis for future research. In the future, we plan to improve AGNA with advanced scheduling capabilities for multi-bank memories, as well as to improve its performance through low-level hardware optimizations. We also plan to explore novel network quantization and pruning techniques by leveraging the processing architecture and scheduling capabilities of AGNA.
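The toy sketch below illustrates the kind of relaxed geometric program an MIGP-based DSE flow might solve before rounding to integer design parameters; the variables, cost model, and resource numbers are invented for illustration and are not AGNA's actual formulation (it uses cvxpy's geometric-programming mode):

```python
import cvxpy as cp

# Relaxed (continuous) design variables: number of PEs and on-chip buffer depth.
pe = cp.Variable(pos=True)
buf = cp.Variable(pos=True)

# Hypothetical model/platform constants.
total_macs = 1e9          # MAC operations in the network
dsp_per_pe = 5            # DSPs consumed per PE
dsp_budget = 2000         # DSPs available on the device
buffer_budget = 1e6       # on-chip buffer words available
traffic_coeff = 4e4       # toy reuse model: off-chip traffic shrinks as buffers grow

# Latency = compute term + memory term; both are posynomials, so this is a valid GP.
compute_cycles = total_macs / pe
memory_cycles = total_macs * traffic_coeff / (buf * 1e3)
latency = compute_cycles + memory_cycles

constraints = [pe * dsp_per_pe <= dsp_budget, buf <= buffer_budget]
prob = cp.Problem(cp.Minimize(latency), constraints)
prob.solve(gp=True)       # solve the log-log convex relaxation
print(f"PEs ~ {pe.value:.1f}, buffer ~ {buf.value:.0f} words, latency ~ {prob.value:.3g} cycles")
```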

MSD: Mixing Signed Digit Representations for Hardware-efficient DNN Acceleration on FPGA with Heterogeneous Resources

This work has proposed MSD, an FPGA-tailored, heterogeneous DNN acceleration framework that utilizes both LUTs and DSPs as computation resources and exploits bit-sparsity. The RSD data representation enables MSD to fine-tune and encode the DNN weights into a bit-sparsity-aware format, making bit-serial computation on LUTs more efficient. Furthermore, we integrate a latency-driven search algorithm into MSD, which searches for the optimal schedule, the number of EBs, and the workload split ratio for each layer under a given latency constraint. Evaluation results on various DNN models and edge FPGA devices demonstrate that MSD achieves a 1.52× speedup and 1.36× higher throughput than the state-of-the-art on ResNet-18, and 4.78% higher accuracy on MobileNet-V2. In the future, we will explore more efficient scheduling methods for workload splitting in the heterogeneous architecture and EB selection in the bit-serial computation, and exploit FPGA-layout-tailored hardware designs to further increase the clock frequency.
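To illustrate why a signed-digit encoding exposes bit-sparsity for bit-serial MACs, the sketch below converts an integer weight into canonical signed-digit (CSD) form, a standard signed-digit encoding used here purely as an example (MSD's RSD format and its fine-tuning procedure are not detailed in this excerpt); fewer nonzero digits mean fewer bit-serial partial products:

```python
def to_csd(value):
    """Convert an integer to canonical signed-digit (CSD) form, digits in {-1, 0, +1},
    least-significant digit first. Nonzero digits are non-adjacent, which minimizes
    the number of partial products a bit-serial MAC has to process."""
    digits, v = [], value
    while v != 0:
        if v & 1:
            d = 2 - (v & 3)   # +1 if v == 1 (mod 4), -1 if v == 3 (mod 4)
            v -= d
        else:
            d = 0
        digits.append(d)
        v >>= 1
    return digits or [0]

w = 0b01110110            # an example 8-bit weight (118)
csd = to_csd(w)
print("CSD digits (LSB first):", csd)
print("nonzero signed digits:", sum(d != 0 for d in csd),
      "vs ones in plain binary:", bin(w).count("1"))
```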

Energy-Efficient Intelligent Pulmonary Auscultation for Post COVID-19 Era Wearable Monitoring Enabled by Two-Stage Hybrid Neural Network

This paper proposes an energy-efficient intelligent pulmonary auscultation system for post COVID-19 era wearable monitoring.

A Reconfigurable Area and Energy Efficient Hardware Accelerator of Five High-order Operators for Vision Sensor Based Robot Systems

This paper proposes a reconfigurable hardware accelerator of five high-order operators for robot vision applications.

An Energy-efficient Multi-core Restricted Boltzmann Machine Processor with On-chip Bio-plausible Learning and Reconfigurable Sparsity

In this paper, a multi-core RBM processor with on-chip learning and reconfigurable sparsity is proposed to reduce energy consumption and improve processing throughput. FPGA implementation results show that the proposed design achieves a 44.0% energy saving and a 24.3% speed improvement in RBM training compared with the baseline CD-based RBM design. In the future, we will focus on an ASIC implementation of the proposed RBM processor to further improve energy efficiency and minimize hardware cost.
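For reference, the contrastive-divergence baseline mentioned above corresponds to the standard CD-1 update for a binary RBM, sketched below in NumPy (layer sizes, learning rate, and input data are placeholders; the paper's on-chip bio-plausible learning rule and sparsity mechanism are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_v, b_h, lr=0.01):
    """One CD-1 update for a binary RBM (the standard contrastive-divergence baseline)."""
    # Positive phase: sample hidden units conditioned on the data.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(np.float32)
    # Negative phase: one Gibbs step back to visible, then hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # Update parameters from the difference of data and reconstruction correlations.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h

n_vis, n_hid, batch = 784, 256, 64
W = 0.01 * rng.standard_normal((n_vis, n_hid)).astype(np.float32)
b_v = np.zeros(n_vis, dtype=np.float32)
b_h = np.zeros(n_hid, dtype=np.float32)
v0 = (rng.random((batch, n_vis)) < 0.5).astype(np.float32)   # placeholder binary batch
W, b_v, b_h = cd1_step(v0, W, b_v, b_h)
```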