# MSD: Mixing Signed Digit Representations for Hardware-efficient DNN Acceleration on FPGA with Heterogeneous Resources

Jiajun Wu, Jiajun Zhou, Yizhao Gao, Yuhao Ding, Ngai Wong, Hayden Kwok-Hay So

Department of Electrical and Electronic Engineering, University of Hong Kong

#### MOTIVATION

**DNN Quantization**

- DNN training produces the original FP32 model (Conv1, Conv2, FC).
- Quantization converts it into an 8-bit integer model (Conv1, Conv2, FC), which is then deployed.

**Quantized Multiplication on LUTs and DSPs**

- DSP48: parallel multiplication of two 8-bit inputs, e.g. $00001110 \times 00011011$, producing a 16-bit result.
- LUTs: bit-serial scheme with bit-sparsity optimization.
- Improve peak performance by deploying multiplication on both LUTs and DSPs.
- The bit-serial scheme is combined with bit-sparsity: only the effectual partial products are accumulated (figure labels: ACT, PSUM, Parallel, Serial, Accumulation).
- Only one module needs to be implemented for the bit-serial design.

**Bit-serial Implementation and the Effectual Bits**

- Figure: LUT cost of multiplier.
- Bit-serial scheme: $A[n:0] \times B[n:0] = \sum_{i} (A[n:0] \times B[i]) \ll i$. Only the bits with $B[i] = 1$ contribute, so '0'-bits are skipped.
- Example: $00001110 \times 00011011$ has effectual bits at index 0, 1, 3 and 4, so the product is $14 \ll 0 + 14 \ll 1 + 14 \ll 3 + 14 \ll 4$.
- Latency cycles = number of effectual bits (EB). With $EB \leq 3$, the scheme is more efficient!
- Goal: make the bit-serial scheme more efficient than the conventional parallel design.

**Workload Imbalance**

- Operands with different numbers of EB need different numbers of cycles, e.g. 30 (int8) = 00011110 takes 4 cycles.
- To solve the problem of workload imbalance in the bit-serial scheme, the number of EB is restricted. Restriction: EB = 2.
  - 30 (int8) = 00011110 becomes 00011000, i.e. $30 \rightarrow 24$
  - 46 (int8) = 00101110 becomes 00101000, i.e. $46 \rightarrow 40$
  - 56 (int8) = 00111000 becomes 00110000, i.e. $56 \rightarrow 48$
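The bit-serial scheme and the EB restriction described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the function names `effectual_bits`, `bit_serial_multiply` and `restrict_eb` are hypothetical, the second operand is assumed to be non-negative, and the restriction is modelled as keeping the most significant '1' bits, which is an assumption that matches the three examples on the poster.

```python
def effectual_bits(b: int) -> list[int]:
    """Return the positions of the '1' bits of a non-negative integer, LSB first."""
    return [i for i in range(b.bit_length()) if (b >> i) & 1]


def bit_serial_multiply(a: int, b: int) -> tuple[int, int]:
    """Multiply a by a non-negative b, handling one effectual bit per cycle.

    Returns (product, cycles). Zero bits of b are skipped, so the cycle
    count equals the number of effectual bits (EB) of b.
    """
    acc = 0
    cycles = 0
    for i in effectual_bits(b):
        acc += a << i      # partial product (A x B[i]) << i with B[i] = 1
        cycles += 1
    return acc, cycles


def restrict_eb(b: int, max_eb: int) -> int:
    """Keep only the max_eb most significant '1' bits of b (assumed truncation)."""
    kept = sorted(effectual_bits(b), reverse=True)[:max_eb]
    return sum(1 << i for i in kept)


product, cycles = bit_serial_multiply(0b00001110, 0b00011011)  # 14 x 27
print(product, cycles)
print([restrict_eb(v, 2) for v in (30, 46, 56)])
```

In this model every operand restricted to EB = 2 finishes in at most two cycles, which is what removes the imbalance between, say, 30 (four '1' bits) and a value with a single '1' bit.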




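[Figure: RSD encoding examples. One surviving label reads "= 00010001".]

Example: $00111000 = 0100\overline{1}000$, where $\overline{1}$ denotes the digit $-1$ (that is, $56 = 64 - 8$), so two non-zero digits are enough.

Smaller quantization errors compared with 2's complement!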
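The signed-digit example above ($00111000$ written as $0100\overline{1}000$) can be reproduced with a small greedy encoder. The poster does not specify how RSD codes are chosen, so the following is a sketch under an assumed rule: repeatedly subtract the signed power of two nearest to the remaining residual, using at most `max_eb` non-zero digits. The names `rsd_quantize` and `rsd_value` are hypothetical, and the int8 range limit is not enforced.

```python
def rsd_quantize(value: int, max_eb: int) -> list[tuple[int, int]]:
    """Greedily approximate value with at most max_eb signed powers of two.

    Returns (sign, shift) terms; the approximation is sum(sign * 2**shift).
    Assumed encoding rule for illustration, not taken from the poster.
    """
    terms = []
    residual = value
    for _ in range(max_eb):
        if residual == 0:
            break
        mag = abs(residual)
        low = mag.bit_length() - 1  # largest power of two <= mag
        # Pick the nearer of 2**low and 2**(low + 1); ties go to the smaller one.
        if mag - (1 << low) <= (1 << (low + 1)) - mag:
            shift = low
        else:
            shift = low + 1
        sign = 1 if residual > 0 else -1
        terms.append((sign, shift))
        residual -= sign * (1 << shift)
    return terms


def rsd_value(terms: list[tuple[int, int]]) -> int:
    """Evaluate a list of (sign, shift) terms back to an integer."""
    return sum(sign * (1 << shift) for sign, shift in terms)


for v in (30, 46, 56):
    terms = rsd_quantize(v, 2)
    print(v, terms, rsd_value(terms))
```

Under this sketch, 56 is represented exactly as $64 - 8$ with two non-zero digits and 30 as $32 - 2$, while 46 becomes 48. Keeping only the two leading '1' bits of the 2's complement form would instead give 48, 24 and 40 respectively, which illustrates the smaller quantization error.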







#### Our work has passed the artifact evaluation process

### **METHODOLOGY**

#### Restricted Signed-Digit Representation (RSD)




#### Heterogeneous Architecture









- The bit-serial PEs process RSD-based weights, which need to be fine-tuned with quantization-aware training (QAT).
- The bit-parallel PEs process standard weights. The workloads of the two PE types must be balanced.
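To make the balancing requirement concrete, the sketch below splits a layer's multiply-accumulate operations (MACs) between the two PE types so that both finish at roughly the same time. The cost model is an assumption made only for illustration and is not taken from the poster: a bit-parallel PE is assumed to complete one MAC per cycle, a bit-serial PE is assumed to need `eb` cycles per MAC, and the function names and PE counts are hypothetical.

```python
def split_workload(total_macs: int, n_parallel_pe: int, n_serial_pe: int, eb: int):
    """Split MACs in proportion to the assumed throughput of each PE type.

    Assumed cost model: 1 MAC/cycle per bit-parallel PE, 1/eb MAC/cycle
    per bit-serial PE. Returns (parallel_macs, serial_macs).
    """
    parallel_rate = float(n_parallel_pe)
    serial_rate = n_serial_pe / eb
    serial_macs = round(total_macs * serial_rate / (parallel_rate + serial_rate))
    return total_macs - serial_macs, serial_macs


def latency_cycles(parallel_macs: int, serial_macs: int,
                   n_parallel_pe: int, n_serial_pe: int, eb: int) -> int:
    """Layer latency is set by whichever PE group finishes last."""
    parallel_cycles = -(-parallel_macs // n_parallel_pe)   # ceiling division
    serial_cycles = -(-serial_macs // n_serial_pe) * eb
    return max(parallel_cycles, serial_cycles)


# Hypothetical configuration: 100 bit-parallel PEs, 200 bit-serial PEs.
for eb in (1, 2):
    par, ser = split_workload(1200, 100, 200, eb)
    print(eb, par, ser, latency_cycles(par, ser, 100, 200, eb))
```

In this toy model a smaller EB shifts more work to the bit-serial PEs and lowers the overall latency, which is the lever that the per-layer EB restriction controls.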

#### End-to-End Framework: Mixed-EB Quantization



Using the scheduler and the search algorithm, we build a mixed-EB quantization scheme in which each layer has its own EB restriction, ranging from 1 to 3.

#### **RESULTS**

#### Accuracy-speedup Trade-off

We need lower latency! Setting a more aggressive constraint ($\omega \uparrow$) makes the search target a higher speedup, which results in more layers with smaller EB.

[Figure labels: DSP-only, DSP+MUL, MSD; $\omega$ = 1.7, 1.75, 2, 2.1, 2.2, 2.5]

A balance between accuracy and speedup is reached in the region marked by the red box.

[Figure labels: DSP-only, DSP+MUL; $\omega$ = 2.35, 2.5, 2.6, 2.75, 3.1, 3.5]

Our results also show that the bit-serial scheme with bit-sparsity is more efficient than the conventional bit-parallel multiplier design in terms of latency.

#### Comparison



Results with accuracy loss

Accuracy-speedup trade-off