In this work, we have presented TATAA, a programmable accelerator on FPGA for transformer models built on a novel transformable arithmetic architecture. Using TATAA, we demonstrate that both low-bitwidth integer (int8) and floating-point (bfloat16) operations can be implemented efficiently on the same underlying processing array hardware. By transforming the array from systolic mode for int8 MatMul to SIMD mode for vectorized bfloat16 operations, we show that end-to-end acceleration of modern transformer models, covering both linear and non-linear functions, can be achieved with state-of-the-art performance and efficiency. In the future, we plan to explore more general FPGA implementations of TATAA with support for a wider range of devices (e.g., with or without HBM) and to enhance the flexibility of our compilation framework so that it can accelerate future transformer models as they continue to evolve rapidly.
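As a purely functional illustration of the two modes, the following minimal NumPy sketch emulates int8 MatMul with int32 accumulation (the systolic-mode analogue) and a bfloat16-truncated softmax (the SIMD-mode analogue). It is not the TATAA hardware or its instruction set; the function names, the mantissa-truncation trick, and the dequantization scale are assumptions made for illustration only.

\begin{verbatim}
# Functional sketch only (not the TATAA hardware): the same logical "array"
# serves int8 matrix multiplication and bfloat16-style vector math.
import numpy as np

def int8_matmul(a_int8, b_int8):
    # Systolic-mode analogue: int8 x int8 MACs accumulated in int32.
    return a_int8.astype(np.int32) @ b_int8.astype(np.int32)

def to_bf16(x):
    # Emulate bfloat16 by truncating the low 16 mantissa bits of fp32.
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

def softmax_bf16(x):
    # SIMD-mode analogue: vectorized non-linear function at bfloat16 precision.
    x = to_bf16(x)
    e = to_bf16(np.exp(x - x.max(axis=-1, keepdims=True)))
    return to_bf16(e / e.sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
A = rng.integers(-128, 127, size=(4, 8), dtype=np.int8)
B = rng.integers(-128, 127, size=(8, 4), dtype=np.int8)
scores = int8_matmul(A, B).astype(np.float32) * 0.01  # toy dequant scale
print(softmax_bf16(scores))
\end{verbatim}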
In this paper, we have presented a case for using low-bitwidth floating-point arithmetic for Transformer-based DNN inference. We demonstrate that low-bitwidth floating-point (bfp8) matrix multiplication can be implemented effectively in hardware with only a marginal cost increase over an 8-bit integer equivalent, while attaining processing throughput close to the platform maximum. In addition, we show that such an array can be reconfigured at run time into a programmable fp32 vector processing unit capable of implementing any non-linear function required by future Transformer-based DNN models. With efficient support for both datatypes, the proposed design eliminates the need to quantize and retrain Transformer models, which is increasingly challenging due to their size. We argue that mixed-precision floating-point is a promising datatype that offers a favorable balance between model accuracy and hardware performance for Transformer-based DNN acceleration. An automatic compilation framework that provides full-stack acceleration of Transformer models is currently under development, and the vector processing unit is being optimized to improve non-linear function performance.
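For illustration, the NumPy sketch below shows one plausible reading of this datapath: a block floating-point style bfp8 format with a shared per-block scale and 8-bit signed mantissas, so that the MAC path can remain integer-like, followed by an fp32 GELU standing in for the programmable vector unit. The format details, block size, and function names are assumptions, not the exact design of this work.

\begin{verbatim}
# Sketch only: assumed bfp8-style MatMul (shared per-block scale, 8-bit
# mantissas); real block floating point uses a power-of-two shared exponent,
# which is simplified to an exact scale here.
import numpy as np

def bfp8_quantize(x, block=16):
    x = np.asarray(x, dtype=np.float32)
    mant = np.empty_like(x, dtype=np.int32)
    scale = np.empty(x.shape[:-1] + (x.shape[-1] // block,), dtype=np.float32)
    for b in range(scale.shape[-1]):
        blk = x[..., b * block:(b + 1) * block]
        amax = np.maximum(np.max(np.abs(blk), axis=-1, keepdims=True), 1e-12)
        s = amax / 127.0                          # shared per-block scale
        mant[..., b * block:(b + 1) * block] = np.round(blk / s).astype(np.int32)
        scale[..., b] = s[..., 0]
    return mant, scale

def bfp8_matmul(a, b, block=16):
    am, ascale = bfp8_quantize(a, block)
    bm, bscale = bfp8_quantize(b.T, block)        # quantize along the shared axis
    out = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for blk in range(ascale.shape[-1]):
        lo, hi = blk * block, (blk + 1) * block
        # Integer-domain partial product, rescaled by both block scales.
        part = am[:, lo:hi].astype(np.int64) @ bm[:, lo:hi].T.astype(np.int64)
        out += part * ascale[:, blk:blk + 1] * bscale[None, :, blk]
    return out

def gelu_fp32(x):
    # Non-linear function handled by the fp32 vector path (tanh approximation).
    x = np.asarray(x, dtype=np.float32)
    return 0.5 * x * (1.0 + np.tanh(0.7978845608 * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(1)
A, B = rng.standard_normal((8, 32)), rng.standard_normal((32, 8))
print(np.max(np.abs(gelu_fp32(bfp8_matmul(A, B)) - gelu_fp32(A @ B))))
\end{verbatim}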
In this work, we present SqueezeBlock, a transparent weight compression scheme that effectively reduces the memory footprint of DNN models with low accuracy degradation and without requiring network retraining. SqueezeBlock uses a cluster-aided FP8 quantization method to preprocess weights, which facilitates a subsequent block-based weight encoding scheme and yields a good trade-off between compression ratio and accuracy loss. In addition, the proposed design space exploration framework allowed us to identify optimal encoding parameter configurations for different DNN models. A hardware decoder customized to the chosen encoding can then be generated automatically to decode the weights at run time. Experiments showed that SqueezeBlock retains the most significant information of the original model despite the large compression ratio, a property that is not immediately obvious but is demonstrated by the ViT model in comparison with other vision models. In the future, we plan to extend the proposed scheme to compress and decompress activations at run time on hardware, which can improve both DNN inference and training time.
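The sketch below illustrates the general idea of FP8-style quantization followed by block-based encoding. It deliberately simplifies the cluster-aided step to a single per-tensor scale and uses an illustrative per-block codebook encoding, so it should be read as a hedged approximation rather than the actual SqueezeBlock algorithm; the exponent range and the size accounting are assumptions.

\begin{verbatim}
# Illustrative sketch only (the actual SqueezeBlock encoding differs in detail):
# weights are snapped to an FP8-like (E4M3-style) value grid, then each block
# of codes is stored as a small per-block codebook plus short indices.
import numpy as np

def fp8_grid():
    # Enumerate E4M3-style positive magnitudes (illustrative exponent range).
    vals = [0.0]
    for e in range(-6, 9):
        for m in range(8):
            vals.append((1 + m / 8.0) * 2.0 ** e)
    g = np.array(vals, dtype=np.float32)
    return np.concatenate([-g[1:][::-1], g])

def quantize_fp8(w, grid):
    # Snap each (scaled) weight to its nearest grid value.
    idx = np.argmin(np.abs(w[..., None] - grid[None, :]), axis=-1)
    return grid[idx], idx

def block_encode_bits(codes, block=64):
    # Per-block codebook + indices: codebook entries cost 8 bits each,
    # indices cost ceil(log2(codebook size)) bits per weight.
    bits = 0
    for b in range(0, codes.size, block):
        blk = codes[b:b + block]
        uniq = np.unique(blk)
        bits += uniq.size * 8 + blk.size * max(1, int(np.ceil(np.log2(uniq.size))))
    return bits

rng = np.random.default_rng(2)
w = rng.standard_normal(4096).astype(np.float32) * 0.05
wn = w / np.abs(w).max()                  # toy per-tensor scale (not cluster-wise)
wq, codes = quantize_fp8(wn, fp8_grid())
print("mean |err|:", np.mean(np.abs(wq - wn)),
      "compression vs fp32:", 32 * w.size / block_encode_bits(codes))
\end{verbatim}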
This paper has proposed a novel hardware-aware quantization framework, with a fused mixed-precision accelerator, to efficiently support a distribution-adaptive data representation named DyBit. The variable-length bit-fields enable DyBit to adapt to the tensor distribution in DNNs. Evaluation results show that DyBit-based quantization at very low bitwidths ($
In this paper, we have presented AGNA, an open-source hardware and software generator for deep neural network (DNN) accelerators on FPGA platforms. AGNA relies on the two proposed MIGP formulations and their relaxed solutions to perform DSE, customizing a generic accelerator template for the given network model-platform combination. Through extensive experiments across many combinations of DNN models and platforms, we have demonstrated AGNA's ability to produce DNN accelerators with performance comparable to the state of the art in both research and vendor-provided frameworks. Importantly, although the accelerators produced are not yet the fastest in every model-platform combination, AGNA is vendor-agnostic and designed to be easily extensible, making it suitable for real-world deployments and for serving as a basis for future research. In the future, we plan to improve AGNA with advanced scheduling capabilities for multi-bank memories, as well as to improve its performance through low-level hardware optimizations. We also plan to explore novel network model quantization and pruning techniques by leveraging AGNA's processing architecture and scheduling capabilities.
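As a toy illustration of the kind of decision the DSE step makes (not AGNA's MIGP formulation or its relaxation), the snippet below brute-forces a PE-tile choice for a single MatMul layer under an assumed DSP budget and a simplistic latency model; all numbers and names are made up for the example.

\begin{verbatim}
# Toy stand-in for DSE: exhaustive search over PE-array tiling for one layer,
# under an illustrative DSP budget and a simple cycle-count latency model.
import itertools

DSP_BUDGET = 1024          # illustrative platform constraint
M, N, K = 512, 512, 512    # illustrative layer (MatMul) dimensions

def latency(tm, tn):
    # Each (tm x tn) PE tile produces one output tile per K cycles.
    tiles = -(-M // tm) * -(-N // tn)          # ceil-division tile count
    return tiles * K

best = min(
    ((tm, tn) for tm, tn in itertools.product([4, 8, 16, 32, 64], repeat=2)
     if tm * tn <= DSP_BUDGET),
    key=lambda t: latency(*t),
)
print("chosen tile:", best, "cycles:", latency(*best))
\end{verbatim}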
This work has proposed MSD, an FPGA-tailored, heterogeneous DNN acceleration framework that uses both LUTs and DSPs as computation resources and exploits bit sparsity. The RSD data representation enables MSD to fine-tune and encode DNN weights in a bit-sparsity-aware format, making bit-serial computation on LUTs more efficient. Furthermore, MSD incorporates a latency-driven search algorithm that finds the optimal schedule, the number of EBs, and the workload split ratio for each layer under a latency constraint. Evaluation results on various DNN models and edge FPGA devices demonstrate that MSD achieves a 1.52 $\times$ speedup and 1.36 $\times$ higher throughput than the state of the art on the ResNet-18 model, and 4.78\% higher accuracy on MobileNet-V2. In the future, we will explore more efficient scheduling methods for workload splitting in the heterogeneous architecture and for EB selection in bit-serial computation, and exploit FPGA-layout-tailored hardware design to further increase the hardware clock frequency.
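The following sketch illustrates why a bit-sparsity-aware signed-digit representation benefits bit-serial computation. It uses canonical signed-digit (CSD-style) recoding as a generic stand-in, since the exact RSD format is not reproduced here: the multiply reduces to one shift-add per non-zero digit, so zero digits cost nothing on the LUT-based datapath.

\begin{verbatim}
# Sketch of bit-serial multiplication with bit sparsity (CSD-style recoding as
# a stand-in for the RSD format).
def to_signed_digits(w):
    # Canonical signed-digit recoding: digits in {-1, 0, +1} with no two
    # adjacent non-zeros, which maximises the number of zero (skippable) digits.
    digits, i = [], 0
    while w != 0:
        if w & 1:
            d = 2 - (w & 3)          # +1 if bits end in ...01, -1 if ...11
            w -= d
            digits.append((i, d))
        w >>= 1
        i += 1
    return digits                     # list of (bit position, +/-1)

def bit_serial_mul(x, w):
    # One shift-add per non-zero digit; zero digits are skipped entirely.
    return sum(d * (x << pos) for pos, d in to_signed_digits(w))

assert bit_serial_mul(13, 119) == 13 * 119
print(bit_serial_mul(13, 119), "non-zero digits:", len(to_signed_digits(119)))
\end{verbatim}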
This paper has presented an energy-efficient DBN processor based on a heterogeneous multi-core architecture with transposable weight memory and on-chip local learning. In the future, we will focus on an ASIC implementation of the proposed DBN processor using a GALS architecture with a 7T/8T SRAM-based transposable memory design to resolve the remaining bottlenecks and further improve throughput and energy efficiency.
To address the aforementioned issues of the program-verify scheme, this paper proposes a novel in-situ error monitoring scheme for IMPLY-based memristive CIM systems.