In this work, we have presented TATAA, a programmable FPGA accelerator for transformer models built on a novel transformable arithmetic architecture. With TATAA, we demonstrate that both low-bitwidth integer (int8) and floating-point (bfloat16) operations can be implemented efficiently on the same underlying processing array. By transforming the array between systolic mode for int8 MatMul and SIMD mode for vectorized bfloat16 operations, we achieve end-to-end acceleration of modern transformer models, covering both linear and non-linear functions, with state-of-the-art performance and efficiency. In the future, we plan to explore more general FPGA implementations of TATAA with support for a wider range of devices (e.g., with or without HBM) and to enhance the flexibility of our compilation framework to keep pace with rapidly evolving transformer models.
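The dual-mode idea can be illustrated numerically: the same set of multiply-accumulate lanes performs int8 matrix multiplication with wide accumulation in one mode, and vectorized bfloat16 elementwise functions in the other. The NumPy sketch below is purely illustrative, not the TATAA hardware; bfloat16 is emulated by truncating the float32 mantissa (an assumption: truncation rather than round-to-nearest).

```python
import numpy as np

def to_bf16(x):
    # Emulate bfloat16 by truncating the float32 mantissa to 7 bits
    # (round-toward-zero; real hardware may round-to-nearest).
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Mode 1: int8 MatMul with wide (int32) accumulation, as a systolic array would do.
A = np.array([[1, -2], [3, 4]], dtype=np.int8)
B = np.array([[5, 6], [-7, 8]], dtype=np.int8)
C = A.astype(np.int32) @ B.astype(np.int32)

# Mode 2: the same lanes reused SIMD-style for a vectorized bfloat16
# non-linear function (exp, as used inside softmax).
x = np.array([0.5, -1.25, 2.0], dtype=np.float32)
y = to_bf16(np.exp(to_bf16(x)))
```

The key point the sketch captures is that only the dataflow (systolic vs. SIMD) and operand format change; the arithmetic substrate is shared.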
This paper has proposed a novel hardware-aware quantization framework, together with a fused mixed-precision accelerator, to efficiently support a distribution-adaptive data representation named DyBit. DyBit's variable-length bit-fields allow it to adapt to the tensor distributions found in DNNs. Evaluation results show that DyBit-based quantization at very low bitwidths ($
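The benefit of a distribution-adaptive representation can be sketched with a toy quantizer. The code below is a generic illustration (a quantile-based codebook), not the actual DyBit bit-field encoding: it places the available code points at quantiles of the tensor's value distribution, so densely populated regions of the distribution get finer resolution than a uniform grid would give them.

```python
import numpy as np

def quantile_quantize(x, bits=4):
    # Distribution-adaptive quantization sketch (hypothetical; not the
    # DyBit format): put the 2**bits code points at quantiles of x, so
    # value-dense regions are represented with finer resolution.
    levels = 2 ** bits
    qs = (np.arange(levels) + 0.5) / levels
    codebook = np.quantile(x, qs)
    idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
    return idx.astype(np.uint8), codebook

x = np.random.default_rng(0).normal(size=1000).astype(np.float32)
codes, cb = quantile_quantize(x)
xq = cb[codes]  # dequantized tensor
```

DyBit achieves adaptivity through its variable-length bit-fields in hardware; the sketch only conveys why matching code points to the distribution reduces quantization error at very low bitwidths.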
This paper presents an energy-efficient DBN processor based on a heterogeneous multi-core architecture with transposable weight memory and on-chip local learning. In future work, we will focus on an ASIC implementation of the proposed DBN processor in a GALS architecture with a 7T/8T SRAM-based transposable memory design to resolve the remaining bottlenecks and further improve throughput and energy efficiency.
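The role of a transposable weight memory can be seen from the two access patterns DBN training needs: the forward (visible-to-hidden) pass reads the weight matrix row-wise, while reconstruction and local learning read the same matrix column-wise (i.e., its transpose). A transposable memory serves both patterns from one stored copy. A minimal sketch of the two accesses, under that assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3)).astype(np.float32)  # one stored weight array

# Forward pass (visible -> hidden): row-wise reads of W.
v = rng.normal(size=3).astype(np.float32)
h = W @ v

# Reconstruction for learning (hidden -> visible): column-wise reads of
# the SAME array, i.e. W.T. A transposable memory provides this access
# pattern without keeping a second, transposed copy of the weights.
v_rec = W.T @ h
```

Note that `W.T` here is a view over the same buffer, which mirrors the hardware point: no duplicated weight storage is needed for the transposed access.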
To address the aforementioned issues of the program-verify scheme, this paper proposes a novel in-situ error monitoring scheme for IMPLY-based memristive computing-in-memory (CIM) systems.
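For context, IMPLY-based memristive logic composes every Boolean function from material implication plus a FALSE (reset) operation; for example, NAND(p, q) = p IMPLY (q IMPLY 0). The Boolean sketch below shows that composition only; it is not the proposed error-monitoring circuit.

```python
def imply(p: bool, q: bool) -> bool:
    # Material implication. In a memristive crossbar this operation
    # overwrites the state of the q device; here we return the new value.
    return (not p) or q

def nand(p: bool, q: bool) -> bool:
    s = False            # FALSE operation: reset a work memristor to 0
    s = imply(q, s)      # s = NOT q
    return imply(p, s)   # (NOT p) OR (NOT q) = NAND(p, q)
```

Because each IMPLY step destructively updates a device state, an undetected write error propagates through the whole sequence, which is what motivates monitoring errors in situ rather than with a separate program-verify pass.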
This paper presents a novel CORDIC-based spiking neural network (SNN) design with online STDP learning and high hardware efficiency. A system-level design and evaluation methodology for CORDIC SNNs is proposed to assess their hardware efficiency across different CORDIC algorithm types and bit-width precisions.
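CORDIC's appeal for SNN hardware is that it evaluates trigonometric and exponential-style functions with only shifts, adds, and a small arctangent table, i.e., no multipliers. The sketch below is the generic circular rotation-mode CORDIC, not the paper's specific neuron datapath; the iteration count plays the role of the bit-width precision trade-off studied in the paper.

```python
import math

def cordic_sin_cos(theta, iters=24):
    # Circular rotation-mode CORDIC: each iteration rotates by
    # +/- atan(2**-i) using only shift-and-add style updates.
    K = 1.0
    for i in range(iters):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))  # pre-computed gain correction
    x, y, z = K, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x, y  # ~ (cos(theta), sin(theta)) for |theta| below ~1.74 rad

c, s = cordic_sin_cos(0.5)
```

Fewer iterations (lower effective bit-width) trade accuracy for latency and area, which is exactly the design space the proposed evaluation method explores for SNN neurons.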