SqueezeBlock: A Transparent Weight Compression Scheme for Deep Neural Networks

Abstract

Modern Deep Neural Networks (DNNs) are notorious for their large memory footprint, which impacts not only the storage capacity requirement in resource-constrained embedded systems, but also the performance of an inference machine due to data movement. In this work, we demonstrate a transparent weight compression scheme, called SqueezeBlock, which effectively reduces the memory footprint of DNN models with only minimal impact on their accuracy and without the need for retraining. SqueezeBlock employs three steps, namely clustering, quantization, and block encoding, to compress the weights of DNN models, and relies on automatic design space exploration to derive the optimal encoding configuration. Custom hardware decoders can be generated automatically for seamless integration with the memory subsystem. Experiments on a range of DNNs show that SqueezeBlock can effectively compress the original fp32 weights by up to 4.88× to 6 bits per weight, with the loss of accuracy kept within 0.92% across the tested models.
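The following is a minimal sketch of the three-step pipeline described above, assuming k-means-style weight clustering and fixed-width index packing for illustration; the actual SqueezeBlock encoding configuration is derived by design space exploration and is not reproduced here.

```python
# Illustrative sketch only: cluster weights, quantize them to cluster
# indices, and group the indices into fixed-size blocks. Function names
# and parameters (n_clusters, block_size) are hypothetical.
import numpy as np

def compress_weights(weights: np.ndarray, n_clusters: int = 64, block_size: int = 16):
    """Compress a weight tensor into (codebook, packed index blocks, bits per weight)."""
    flat = weights.astype(np.float32).ravel()

    # Step 1: clustering -- a few iterations of 1-D k-means over the weights.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(10):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            members = flat[idx == k]
            if members.size:
                centroids[k] = members.mean()

    # Step 2: quantization -- replace every weight by its nearest cluster index.
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    bits = int(np.ceil(np.log2(n_clusters)))  # e.g. 6 bits for 64 clusters

    # Step 3: block encoding -- group indices into fixed-size blocks so a
    # decoder can fetch and expand each block independently.
    pad = (-len(idx)) % block_size
    blocks = np.pad(idx, (0, pad)).reshape(-1, block_size).astype(np.uint8)
    return centroids, blocks, bits

def decompress(centroids: np.ndarray, blocks: np.ndarray, original_size: int) -> np.ndarray:
    """Software model of the inverse mapping a hardware decoder would perform."""
    idx = blocks.ravel()[:original_size]
    return centroids[idx]
```

With 64 clusters, each weight is represented by a 6-bit index plus a shared codebook, which is consistent with the bits-per-weight figure quoted in the abstract; the real scheme additionally tunes the encoding per block rather than using a single fixed configuration.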

Publication
In 2023 International Conference on Field Programmable Technology
Jiajun Wu
PhD Student

My research interests include hardware accelerators, reconfigurable computing, and computer architecture.
