International Journal On Advances in Systems and Measurements, volume 13, numbers 1 and 2, 2020
Authors:
Junghee Lee
Chrysostomos Nicopoulos
Keywords: Convolutional neural network; Hardware accelerator; On-chip memory optimization; On-chip communication
Abstract:
Deep Convolutional Neural Networks (CNNs) are expanding their territory to many applications, including vision processing, because they achieve higher accuracy than traditional signal processing algorithms. For real-time vision processing, however, their high demand for computational power and data movement limits their applicability to battery-powered devices. For such applications, which require both real-time processing and power efficiency, hardware accelerators are essential to meeting the requirements. Recent CNN frameworks, such as SqueezeNet and GoogLeNet, necessitate a redesign of hardware accelerators, because their irregular architectures cannot be supported efficiently by traditional designs. In this paper, we propose a novel hardware accelerator for advanced CNNs aimed at realizing real-time vision processing with high accuracy. The proposed design employs data-driven scheduling, which enables support for irregular CNN architectures without run-time reconfiguration, and it offers high scalability through its modular design concept. Specifically, the design's on-chip memory management and on-chip communication fabric are tailored to CNNs. As a result, the new accelerator completes all layers of SqueezeNet and GoogLeNet in 14.30 ms and 27.12 ms at 2.47 W and 2.51 W, respectively, with 64 processing elements. The performance offered by the proposed accelerator is comparable to high-performance FPGA-based approaches (which achieve 1.06 to 262.9 ms at 25 to 58 W), albeit with significantly lower power consumption. If the hardware budget allows, these latencies can be further reduced to 6.71 ms and 11.70 ms, respectively, with 256 processing elements. In comparison, the latency reported by existing architectures executing large-scale deep CNNs ranges from 115.3 ms to 4309.5 ms.
Pages: 131 to 141
Copyright: Copyright (c) to authors, 2020. Used with permission.
Publication date: June 30, 2020
Published in: journal
ISSN: 1942-261X