



Research Objects Reviewed



# MSD: Mixing Signed Digit Representations for Hardware-efficient DNN Acceleration on FPGA with Heterogeneous Resources

Jiajun Wu, Jiajun Zhou, Yizhao Gao, Yuhao Ding, Ngai Wong, Hayden Kwok-Hay So University of Hong Kong



FCCM 2023

9 May 2023

### Contents

- Motivation & Background
- The Proposed Methodology
- Experiment Results
- Conclusion & Future Works

### **DNN Quantization**



Can we further compress the computation of int8 to improve the inference performance without loss of accuracy?

## Deploy Quantized Multiplication

#### Conventional Int8 multiplication



#### How do we further improve the performance?

- 1. To use both DSPs and LUTs for multiplication.
- 2. To make our LUT implementation smaller.

#### **Parallel Multiplier**





Latency: 1 cycle

## Bit-Serial Scheme with Bit-Sparsity

#### An alternative approach for the multiplier on LUT



## Workload Imbalance in Bit-Sparsity Scheme

The workload imbalance issue in bit-sparsity:



Ref: BitCluster (TCAD, Volume: 41, Issue: 11, November 2022)

Need a method to use a restricted and small number of EB without causing major loss of accuracy.

### Contents

- Motivation & Background
- The Proposed Methodology
- Experiment Results
- Conclusion & Future Works

## Mixing Signed-digit Representations (MSD)

 A framework that automatically partitions the DNN workloads to run on DSPs and LUTs

 We proposed a new representation called restricted signed-digit (RSD) to help restrict the number of EB

## Signed-Digit Representation

To solve the imbalance issue



Restriction: EB = 2

Large quantization error

Find another number format with higher representation capability

#### Our Approach: Signed-digit representation

Let X[i] (i-th bit in the number with n-bit) expand to 0, 1 and -1 ( $\overline{1}$ ):

$$X = \sum_{i=0}^{n-1} (X[i] \times 2^i), \qquad X[i] = \{0, \boxed{1, -1}\}$$
 Effectual bits (EB)

Note that 2's complement is actually a special case of signed-digit:

$$X = (-1)^{X[n-1]} \times 2^{n-1} + \sum_{i=0}^{n-2} (X[i] \times 2^i), \qquad X[i] = \{0, 1\}$$
$$= \sum_{i=0}^{n-1} (X[i] \times 2^i), \qquad X[i] = \begin{cases} \{0, 1\} & \text{if } i \neq n-1 \\ \{0, -1\} & \text{if } i = n-1 \end{cases}$$

## Hardware Does Not Change a lot!



#### Signed-digit scheme

$$\begin{array}{c}
14 \times 27 \\
= 00001110 \times 00100\overline{1}0\overline{1} \\
& 00001110 \\
\times 00100\overline{1}0\overline{1} \\
& - 00001110 \\
& + 00001110 \\
& + 00001110
\end{array}$$
Effectual partial products

Subtraction is naturally the same with addition. We only need to add a simple circuit for negative values.

## Restricted Signed-Digit

#### Restrict EB: Restricted signed-digit (RSD)

| Original<br>Numbers     | 2's complement $EB = 2$ ,          | Error | $\frac{RSD}{EB} = 2$                            | Error    |
|-------------------------|------------------------------------|-------|-------------------------------------------------|----------|
| 30 (int8)<br>= 00011110 | $30 \rightarrow 24$<br>= 00011+++0 | 6     | 30 = 32 - 2<br>= $001000\overline{1}0$          | 0        |
| 46 (int8)<br>= 00101110 | $46 \rightarrow 40$<br>= 00101++0  | 6     | 48 = 32 + 16<br>= 00110000                      | 2        |
| 56 (int8)<br>= 00111000 | $56 \rightarrow 48$ = 00010001     | 8     | $\frac{56}{6} = 64 - 8$ $= 0100\overline{1}000$ | <b>○</b> |

Algorithm: set up 2's integer power (0, 2, 4, 8,...) as the bases, iteratively find the base and sign bits according to the restriction.  $30 \rightarrow \text{get } 32$  and -2 in two iterations



## Weights Encoding & Bit-serial PE Design

We focused on layer-wise weights quantization in this work, and using the de facto
 8-bit (int8) model as the starting point.





## Heterogeneous Architecture



- BSPE & BPPE: Bit-serial & Bit-parallel PE, set up as a systolic array
- BSPEs are implemented on LUTs, while BPPEs are based on DSPs, running simultaneously



INT8 multiplication on DSPs (standard representation)

|                                                         | Weights            | Activations |  |  |  |  |
|---------------------------------------------------------|--------------------|-------------|--|--|--|--|
| BSPE                                                    | RSD representation | Standard    |  |  |  |  |
| BPPE                                                    | Standard           | Standard    |  |  |  |  |
| Consistence representation by mixing signed-digit (MSD) |                    |             |  |  |  |  |

 Standard 2's complement can be regarded as a special case of signeddigit representation.

### Search Schedules

Parameters that need to be searched by the scheduler (for each layer)



Part I: 6-dimension for-loop for each layer

Tile size in each dimension:

$$\mathbf{T} = [t_K, t_H, t_W, t_C, t_I, t_I]$$

Part II: Weight Split Ratio (workload partitioning)





### Hardware-aware Mixed-EB Quantization

End-to-end framework



- Inputs: DNN models and FPGA device constraints
- Mixed-EB search for each layer, under the constraint of MSE and latency (calculated based on the hardware model)

$$MSE = \sqrt{\sum_{i=1}^{n} \left(\frac{x - \hat{x}}{\sigma_i}\right)^2}$$





### Contents

- Motivation & Background
- The Proposed Methodology
- Experiment Results
- Conclusion & Future Works

### Mixed-EB Quantization

TABLE I: Mixed-EB speedup results with different constraints  $\omega$ . A larger  $\omega$  means a higher speedup and a more aggressive quantization strategy.

| Model     | $\omega$ | Layer-wise Mixed-EB Result A                                                              | Speedup |
|-----------|----------|-------------------------------------------------------------------------------------------|---------|
| VGG-16    | 1.5      | [3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3                                                    | 1.52    |
|           | 1.6      | [2 3 2 2 2 3 2 2 3 3 3 2 3 3 3 2]                                                         | 1.60    |
|           | 1.7      | [2 2 2 2 2 3 2 2 3 2 3 2 2 2 2 2]                                                         | 1.70    |
|           | 2.0      | $[2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2$                                               | 2.00    |
|           | 2.1      | $[1\ 1\ 2\ 1\ 1\ 1\ 1\ 2\ 2\ 1\ 2\ 1\ 2\ 1\ 2\ 1]$                                        | 2.14    |
|           | 2.2      | $[1\ 1\ 2\ 1\ 1\ 1\ 1\ 2\ 1\ 1\ 2\ 1\ 2\ 1\ 1\ 1]$                                        | 2.24    |
| ResNet-18 | 1.5      | [3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3                                                    | 1.51    |
|           | 1.6      | [2 3 2 3 2 3 3 3 2 2 2 3 3 2 2 2 2 3 2 3                                                  | 1.62    |
|           | 1.65     | [2 3 2 3 2 3 2 3 2 2 2 2 3 3 2 2 2 2 3 2 3 2]                                             | 1.65    |
|           | 1.7      | [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2                                                    | 1.71    |
|           | 1.8      | $[2\; 2\; 2\; 2\; 2\; 2\; 2\; 1\; 1\; 1\; 1\; 1\; 1\; 2\; 2\; 1\; 1\; 2\; 1\; 2\; 1\; 1]$ | 1.84    |
|           | 1.9      | [2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 1 1 1 1                                                  | 1.90    |

- Device: Ultra-96
- Layer-wise EB configuration
- **The final speedup meets the constraint of**  $\omega$

$$\min_{\mathbf{A}} \sum_{j=1}^m MSE(j, \mathbf{A}[j])$$
 Objective: Minimize quantization error s.t.  $\omega \times \sum_{j=1}^m Lat_{\mathbf{L}}(j, \mathbf{A}[j]) \leq Lat_{\mathrm{base}}$  Constraint: speedup ratio,  $\omega$ 

The baseline is we only use DSPs for computation

- Higher speedup constraint -> larger  $\omega$
- More aggressive search for higher speedup, to reach the constraint

## Accuracy-Speedup Trade-off



# This area should be OK for most cases!



Also, RSD-based bit-serial scheme is more efficient than conventional parallel design on LUTs!

## Comparison with Previous Works





### Conclusion & Future Works

- MSD framework:
  - is a heterogeneous DNN acceleration framework that utilizes both LUTs and DSPs as computation resources and to exploit bit-sparsity.
  - fine-tunes and encodes the DNN weights into a bit-sparsity-aware format, making the bit-serial computation on LUTs more efficient.
  - uses a latency-driven algorithm to search for the optimal schedule and trade-off based on layer-wise mixed EB.
- We will explore more efficient scheduling methods and exploit FPGA-layout-tailored hardware design to enhance the hardware clock frequency further.









MSD open-source in GitHub

# Thanks for Your Listening

Find us in the poster session!

Q & A









MSD open-source in GitHub