Abstract— Object detection algorithms using convolutional neural networks (CNNs) achieve high detection accuracy, but it is challenging to realize real-time object detection due to their high computational complexity, especially on resource-constrained mobile platforms. In this article, we propose an algorithm-hardware co-optimization approach to designing a real-time object detection system. We first develop a compact object detection model based on a binarized neural network (BNN), which employs a new layer structure, the DenseToRes layer, to mitigate information loss due to deep quantization. We also propose an efficient object detection processor that runs object detection with high throughput using limited hardware resources. We develop a resource-efficient processing unit supporting variable precision with minimal hardware overheads. Implemented in a field-programmable gate array (FPGA), the object detection processor achieves 64.51 frames/s throughput with 64.92 mean average precision (mAP) accuracy. Compared to prior FPGA-based designs for object detection, our design achieves high throughput with competitive accuracy and lower hardware implementation costs.

Index Terms— Binarized neural network (BNN), convolutional neural network (CNN), field-programmable gate array (FPGA), low-power design, object detection.

I. INTRODUCTION

To address this issue, a wide range of hardware accelerators for deep learning models have been proposed. Most deep learning models employ deep neural networks such as convolutional neural networks (CNNs), and they are usually realized using matrix multiplications. Hence, deep learning accelerators are aimed at efficiently accelerating matrix multiplication through parallelization and architecture optimization to maximize throughput [6]. However, the computational complexity of deep learning models for object detection continues to increase [7], [8], [9], and the amount of computation and memory bandwidth well exceeds those available on embedded platforms. Therefore, to achieve real-time object detection on embedded platforms with limited hardware resources and power budget, we need to develop a highly optimized deep neural network accelerator paired with a lightweight object detection algorithm having high detection accuracy.

Deep learning model optimization is an active field of research aimed at reducing the computational complexity and memory requirements of deep learning models while preserving algorithm accuracy. For instance, quantization [10], pruning [11], and compression [12] are often employed.
Ando et al. [21] propose an efficient binary/ternary neural network accelerator with an in-memory computing scheme. On the other hand, the design in [23] supports variable weight bit precision to support various quantized neural networks. The high degree of freedom of ASIC allows designers to apply hardware design techniques to optimize the hardware for the target application and model. Hence, ASIC is usually considered the most power-efficient solution when it comes to accelerating BNN models. However, the initial design and prototyping costs for ASIC are huge, especially when fabricated in advanced nodes, and hence it is not an ideal solution for applications with low-volume production. Also, due to complex design verification and fabrication processes, it is challenging to reduce time-to-market.

Field-programmable gate arrays (FPGAs) are increasingly utilized for accelerating deep learning as they offer an optimal design choice in the trade-off between hardware flexibility and processing efficiency. Since we can reconfigure the hardware at the gate level in an FPGA, the same hardware design techniques could be applied to maximize performance and efficiency. In addition, the prototyping and production costs are significantly lower than those of ASIC, which makes FPGA an attractive design choice in embedded applications.

In this work, we present an FPGA-based processor for real-time object detection on embedded platforms. We first propose a deeply quantized neural network model for object detection. We minimize performance degradation while binarizing weights and activations in most layers throughout the network by proposing quantization-friendly layer structures. Then, we present an object detection processor supporting the proposed object detection algorithm efficiently. The processor features a resource-efficient variable bit-precision processing element (PE) and completely removes off-chip DRAM access by storing the entire model in on-chip memory. Our design demonstrates real-time object detection with 64.51 frames/s throughput and 64.92 mAP, and consumes 6.58 W.

The rest of the article is organized as follows. Section II summarizes the background and related work. Section III details the proposed object detection algorithm, and Section IV discusses the processor architecture. Experimental results are presented in Section V. Finally, Section VI concludes the article.
II. RELATED WORK AND BACKGROUND

A. Related Work on BNN

As BNNs use binary weights and binary activations, external memory bandwidth is greatly reduced, and convolutional layers are efficiently implemented using simple XNOR and bit count operations. An early work in [25] demonstrates a BNN that adopts GoogLeNet [26], and it achieves 47.1% top-1 accuracy on ImageNet [27]. The XNOR-Net [15] shows an improved performance of 51.2% top-1 accuracy for ImageNet classification by introducing scaling factors to reduce quantization errors and reordering the sequence of layers for effective training. In addition, the authors suggest that multiplication between activation and weight can be efficiently implemented using XNOR operations in binarized layers. However, some layers (e.g., the first and last layers) are not binarized to avoid performance drops, which necessitates separate arithmetic units to support high-precision operations and decreases hardware utilization. The Bi-Real Net [16] utilizes additional shortcut connections to compensate for information loss due to deep quantization, along with a piecewise linear approximation of the derivative of the sign function. The authors also propose a magnitude-aware gradient and a clip function to replace the ReLU function. The Bi-Real Net achieves 56.4% and 62.2% top-1 accuracy on ImageNet with 18 and 34 layers, respectively. The real-to-binary network [17] adopts the block structure of the XNOR-Net, and the double-skip connections and real-valued downsample layers of the Bi-Real Net. It also introduces trainable scaling factors, which results in 65.4% top-1 accuracy on ImageNet.
B. Related Work on FPGA-Based Object Detection Processor

Recent studies on FPGA-based object detection hardware employ deep quantization to enhance processing efficiency. Nakahara et al. [28] propose Lightweight YOLOv2 and demonstrate the first FPGA object detector using a BNN. However, since binarizing the model severely degrades the algorithm accuracy, they adopt a full-precision parallel support vector regressor for localization, which significantly increases external memory access. Tincy you-only-look-once (YOLO) [29] employs 1-bit weights and 3-bit activations in the entire neural network. The authors report the end-to-end processing time from capturing an image to displaying detection results, but their design processes the first and last layers in software as they severely limit throughput. Sim-YOLO-v2 [30] uses 1-bit weights and 4–6-bit activations in the neural network. The authors propose a streaming accelerator architecture that reduces external DRAM access to improve throughput and power efficiency. CoDeNet [31] adopts input-adaptive deformable convolution to develop an efficient network for object detection, and the network is quantized with 4-bit weights and 8-bit activations. The authors propose a hardware-friendly modification of the deformable convolution operation on FPGA. The real-time SSDLite [32] adopts MobileNetV2 [33] as a backbone network for object detection. This model is not deeply quantized, but the design demonstrates the highest throughput to date. This is achieved through an efficient processing unit supporting both depthwise and pointwise convolutions and a task control unit that alleviates the overhead of job scheduling.

C. YOLO Object Detection Models

The YOLO models [4], [7], [8], [34] are representative one-stage object detection algorithms based on CNNs, which predict the bounding box and class of objects simultaneously in the input image. They have a fast detection speed that enables real-time object detection, and the object detection accuracy has improved in more recent versions. YOLO models consist of two parts: a backbone network and a head network. The backbone network extracts features from the input image, and the head network detects objects by performing localization and classification using the extracted features. The backbone network is usually pretrained for the image classification task on a large dataset such as ImageNet. Then, the head network is attached to it, and the entire model is trained for the object detection task.

YOLOv2 [34] uses DarkNet-19, which is similar to VGG-19, as the backbone network and uses five convolution layers in the head network. The model divides the input image into S × S grids and predicts five bounding boxes for each grid based on anchor boxes. Each bounding box consists of a location (x and y), size (w and h), a confidence score, and classification scores for 20 classes. Therefore, the detector outputs 125 values for each grid. Finally, post-processing such as non-maximum suppression (NMS) is performed to find the best candidates from the detected boxes and complete object detection. While more recent versions of YOLO models provide higher object detection accuracy, they utilize much larger backbone networks and complex connections between the backbone and head networks. Hence, the original YOLO and YOLOv2 have been widely used for ASIC [35], [36] and FPGA [28], [29], [30] implementations.
find the best candidate from the detected boxes and complete
such as quantization have been extensively studied, and
object detection. While more recent versions of YOLO models
commercial devices now provide optimized datapaths for low-
provide higher object detection accuracy, they utilize much
precision operations [37], [38]. In this work, we propose
larger backbone networks and complex connections between
a software–hardware co-optimization approach to design an
the backbone and head networks. Hence, the original YOLO
energy-efficient object detection processor. In this section,
and YOLOv2 have been widely used for ASIC [35], [36], and
we first propose a deeply quantized object detection
FPGA [28], [29], [30] implementations.
model with a good algorithm accuracy and low hardware
implementation costs. More specifically, we aim at minimizing
D. Model Quantization the model size so that the entire model could be stored
The power consumption and area of arithmetic blocks are in on-chip memory. Then, the processor can avoid costly
dictated by the precision they support, and hence model external memory access and achieves higher energy efficiency.
quantization is a key to efficient hardware implementation. In addition, representing weight and activation in low precision
In addition, we can significantly reduce memory footprint also reduces the power consumption of computing units.
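As a concrete reading of (1)–(4), the NumPy sketch below implements the quantizer and verifies the XNOR/bit-count identity. It is an illustration of the arithmetic only, not the hardware datapath, which realizes these operations with LUTs as noted above.

```python
import numpy as np

def quantize(x, n, s):
    """Uniform symmetric quantizer of (1)-(2): the levels are odd multiples
    of s/2, so zero is excluded; values beyond x_q,max are clipped."""
    xq = s * (np.round(x / s + 0.5) - 0.5)
    xq_max = s * (2**n - 1) / 2
    return np.clip(xq, -xq_max, xq_max)

def xnor_popcount_dot(x_hat, w_hat):
    """Binary dot product of (4): with {-1,+1} values encoded as {0,1},
    X_b . W_b = 2 * BitCount(XNOR(X^_b, W^_b)) - N."""
    return 2 * int(np.sum(x_hat == w_hat)) - len(x_hat)

# Sanity check against the +/-1 arithmetic that (4) replaces.
rng = np.random.default_rng(0)
x_hat = rng.integers(0, 2, 64)
w_hat = rng.integers(0, 2, 64)
assert xnor_popcount_dot(x_hat, w_hat) == np.dot(2*x_hat - 1, 2*w_hat - 1)
```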
and bandwidth by quantizing data stored in external memory.
Provided the number of bits n and quantization scale s, the A. Backbone Network Design
quantized value of a real number xr is expressed by
x In object detection models consisting of a backbone
r
xq = s × round + 0.5 − 0.5 . (1) network and a head network, the backbone network is usually
s pretrained for image classification tasks. Experimental results
Note that the equation above represents a uniform and in Fig. 1 show that the object detection performance is highly
symmetric quantization scheme that does not include zero. correlated with the classification accuracy of the backbone
Then, the maximum value which can be represented in this network; the better the image classification performance of the
scheme is determined by backbone network, the better the object detection performance
n
2 −1
of the entire network. The backbone networks in YOLO
xq max = s × . (2) models are generally prone to performance drop when deeply
2
quantized. For instance, the backbone network of YOLOv2,
If xr is larger than xq max , then its quantized value is clipped DarkNet-19, has a similar structure to VGG-19, and it suffers
to xq max . large performance degradation when activations are quantized
In 1-bit convolution layers of typical BNNs, the weights into less than three bits [30].
and activations are both binarized to −1 and 1 through a sign To alleviate this issue, we propose a new structure called
function DenseToRes layer shown in Fig. 3(a). The DenseToRes layer
(
+1, if xr ≥ 0 places both dense and residual connections in series. The
xb = sign(xr ) = (3) layer first processes the input using dense connections to
−1, if xr < 0.
concatenate the input and output features. This preserves the
If −1 and 1 are expressed as 0 and 1 as shown above, undamaged information [19] as separate channels, but the
multiplication and addition are replaced by bitwise XNOR and preserved information is not yet distributed over the channels
bit count operations that went through the quantized layers. Then, the DenseToRes
layer performs convolution with residual connections; the
Xb · Wb = 2 · BitCount XNOR X̂b , Ŵb − N
(4)
convolution operation distributes the preserved information
where Xb and Wb are the vectors of binary activations and to all channels, while the residual connections minimize the
binary weights represented as −1 or 1 with N elements, while information loss due to quantized convolution operations by
X̂b and Ŵb are identical vectors represented as 0 or 1. These adding the undamaged input features to the output features
operations can be efficiently implemented in hardware; for [16]. As a result, this architecture allows for preserving
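A minimal PyTorch sketch of this idea is shown below. It only illustrates the connectivity: a concatenating dense stage followed by an additive residual stage. The kernel sizes, normalization placement, and the use of ordinary full-precision convolutions here are assumptions; the actual model binarizes weights and activations, and the exact topology of Fig. 3(a) may differ.

```python
import torch
import torch.nn as nn

class DenseToRes(nn.Module):
    """Sketch of a DenseToRes-style block: dense and residual connections
    placed in series, as described in the text."""

    def __init__(self, in_ch, growth=64):
        super().__init__()
        # Dense stage: generate `growth` new channels to concatenate with
        # the untouched input features (the "undamaged information").
        self.dense = nn.Sequential(
            nn.Conv2d(in_ch, growth, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(growth),
        )
        # Residual stage: convolution over the concatenated features with a
        # shortcut, distributing the preserved information to all channels.
        self.res = nn.Sequential(
            nn.Conv2d(in_ch + growth, in_ch + growth, kernel_size=3,
                      padding=1, bias=False),
            nn.BatchNorm2d(in_ch + growth),
        )

    def forward(self, x):
        d = torch.cat([x, self.dense(x)], dim=1)  # dense connection
        return d + self.res(d)                    # residual connection

# Channels grow by 64 per layer, as stated in Section IV:
y = DenseToRes(in_ch=64)(torch.randn(1, 64, 52, 52))
print(y.shape)  # torch.Size([1, 128, 52, 52])
```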
TABLE II
EFFECTS OF EACH DESIGN OPTIMIZATION ON PERFORMANCE AND MODEL SIZE OF THE BACKBONE NETWORK

TABLE III
COMPARISONS OF DIFFERENT QUANTIZATION SCHEMES IN THE BACKBONE NETWORK
Fig. 5. Example of convolution layer processing.

TABLE V
COMPARISONS OF PROCESSING SCHEMES

Table V summarizes the hardware overhead of three different processing schemes, where B_i and B_o are the bitwidths of the input activation and output accumulation, respectively, and C_o is the number of output channels calculated simultaneously in the processor. H, W, and N represent the dimensions of the input feature map, and K is the filter size, as depicted in Fig. 5. In the weight-stationary scheme, each weight is reused for the entire input feature map. The input feature map is processed by each of C_o weight blocks. To produce the final C_o output feature maps, temporary output accumulations with the same size and the entire input feature maps need to be stored, necessitating large on-chip buffers. For the row-stationary scheme, the input and output buffers are smaller as only some rows of input features are processed and stored at a time, but it still requires larger buffers than the output-stationary scheme. With our configuration of hardware parallelism, the buffer size is 1138× and 9× larger for the weight-stationary and row-stationary schemes, respectively. This is especially critical in resource-constrained platforms such as FPGA devices. Due to output-stationary processing, we need to load a new weight value for each input feature. However, the weights are read from internal memory, and the energy savings due to removing external DRAM access exceed this overhead. In the proposed DenseToRes-Net model, the number of channels increases by 64 after each DenseToRes layer. Hence, we can easily split the output channels into blocks of 64, where each block is processed by 64 PEs at a time. In other words, each PE is responsible for generating one output channel.
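To make the dataflow concrete, here is a toy output-stationary convolution in Python: each output value is accumulated in a local register while a new weight is fetched for every input position, mirroring the trade-off described above. The loop order and shapes are illustrative assumptions, not the processor's exact schedule.

```python
import numpy as np

def conv_output_stationary(x, w):
    """Toy output-stationary convolution ('valid' padding)."""
    N, H, W = x.shape            # input channels, height, width
    Co, _, K, _ = w.shape        # output channels, input channels, K, K
    out = np.zeros((Co, H - K + 1, W - K + 1))
    for co in range(Co):                         # in hardware: one of 64 PEs
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                acc = 0.0                        # output-stationary register
                for n in range(N):
                    for ki in range(K):
                        for kj in range(K):      # weight re-read per input
                            acc += x[n, i + ki, j + kj] * w[co, n, ki, kj]
                out[co, i, j] = acc              # written once per output
    return out

# Sanity check against a direct sliding-window reference.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))
w = rng.standard_normal((4, 8, 3, 3))
ref = np.array([[[np.sum(x[:, i:i+3, j:j+3] * w[co]) for j in range(4)]
                 for i in range(4)]
                for co in range(4)])
assert np.allclose(conv_output_stationary(x, w), ref)
```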
B. Variable-Precision PE

In most BNN models, not all layers are binarized to suppress performance degradation; for instance, the first layer and the last classification layer are susceptible to performance drops due to deep quantization, so they still remain in high precision. Following the binarization in (3), a binary value x̂ ∈ {0, 1} in the circuit represents the signed value x ∈ {−1, +1} through

x = 2x̂ − 1    (6)
w = 2ŵ − 1    (7)
x · w = 2 · (x̂ ⊙ ŵ) − 1    (8)

where ⊙ represents the bitwise XNOR operation. Then, the accumulation of multiplication results is substituted by a simple bit count operation. If we extend the transformation in (6) and (7) to n bits, an n-bit binary number X̂ in the circuit translates to an integer X which is represented in n-digit binary number format with signed digits (i.e., X = (x_{n−1} . . . x_1 x_0)_2 where x_k ∈ {−1, 1}, and X̂ = (x̂_{n−1} . . . x̂_1 x̂_0)_2 where x̂_k ∈ {0, 1})

X = (x_{n−1} . . . x_1 x_0)_2
  = x_{n−1} · 2^{n−1} + · · · + x_1 · 2^1 + x_0
  = (2x̂_{n−1} − 1) · 2^{n−1} + · · · + (2x̂_0 − 1)
  = 2 · (x̂_{n−1} · 2^{n−1} + · · · + x̂_0) − (2^{n−1} + · · · + 1)
  = 2X̂ − (2^n − 1) = 2 · X̂ − 2^n + 1.    (9)

Note that the circuit can represent only odd values since X ∈ {−2^n + 1, −2^n + 3, . . . , 2^n − 1} for X̂ ∈ {0, . . . , 2^n − 1}. Consider the multiplication of an activation X and a weight W which are represented using n and m digits in the binary number format with signed digits of {−1, 1}. According to (6)–(8), the multiplication can be implemented as

X · W = (x_{n−1} · 2^{n−1} + · · · + x_0) · (w_{m−1} · 2^{m−1} + · · · + w_0)
      = (x_{n−1} · w_{m−1}) · 2^{m+n−2} + · · · + (x_0 · w_0)
      = (2 · (x̂_{n−1} ⊙ ŵ_{m−1}) − 1) · 2^{m+n−2} + · · · + (2 · (x̂_0 ⊙ ŵ_0) − 1)
      = 2 · ((x̂_{n−1} ⊙ ŵ_{m−1}) · 2^{m+n−2} + · · · + (x̂_0 ⊙ ŵ_0)) − (2^{m+n−2} + · · · + 1)
      = 2 · (X̂ ⊙ Ŵ) − (2^{n−1} + · · · + 1) · (2^{m−1} + · · · + 1).    (10)

This suggests that multibit multiplication X · W can be implemented using simple bitwise XNOR operations along with a linear transform of y = 2x − b. Fig. 6 displays an example of multiplying +5 (= (+1) · 2^2 + (+1) · 2^1 + (−1) · 2^0) by −3 (= (−1) · 2^1 + (−1) · 2^0). It should be noted that this only applies when multiplying odd values, since X and W can only take odd values as discussed above. Hence, we employ uniform nonzero quantization as the quantization scheme in our hardware, since it maps quantized values in (1) exactly onto these odd levels.
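The identity in (10) is easy to check numerically. The sketch below evaluates the XNOR form against ordinary integer multiplication for the +5 × −3 example from Fig. 6; storing the digits LSB-first is our encoding choice for the sketch.

```python
def sd_value(bits):
    # Signed-digit value of circuit bits (LSB-first): each digit is 2*bit - 1.
    return sum((2 * b - 1) << k for k, b in enumerate(bits))

def xnor_mult(x_bits, w_bits):
    # Right-hand side of (10): XNOR partial products, then y = 2x - b.
    n, m = len(x_bits), len(w_bits)
    s = sum((1 if xb == wb else 0) << (i + j)      # XNOR of each digit pair
            for i, xb in enumerate(x_bits)
            for j, wb in enumerate(w_bits))
    return 2 * s - (2**n - 1) * (2**m - 1)

x_bits = [0, 1, 1]  # (+1)*4 + (+1)*2 + (-1)*1 = +5
w_bits = [0, 0]     # (-1)*2 + (-1)*1 = -3
assert sd_value(x_bits) == 5 and sd_value(w_bits) == -3
assert xnor_mult(x_bits, w_bits) == 5 * (-3)  # == -15
```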
Fig. 7. (a) Architecture of the proposed variable-precision MAC unit and configurations for (b) 64 1-b × 1-b MACs, (c) 32 2-b × 1-b MACs, and (d) one 8-b × 8-b MAC.
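The three configurations in Fig. 7 keep the same XNOR array fully utilized: following (10), an n-bit by m-bit product consumes n × m one-bit XNOR partial products, so each mode issues 64 partial products per cycle. A one-line check, under that assumption:

```python
# (MACs per cycle) x (activation bits) x (weight bits) = 64 XNOR partial
# products per cycle in every Fig. 7 configuration.
modes = {"(b) 1b x 1b": (64, 1, 1), "(c) 2b x 1b": (32, 2, 1), "(d) 8b x 8b": (1, 8, 8)}
for name, (macs, a_bits, w_bits) in modes.items():
    assert macs * a_bits * w_bits == 64, name
```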
implemented in the controller, which informs the memory and the buffer when to write the output.

V. EXPERIMENTAL RESULTS

A. Demonstration System

To validate the proposed object detection model and processor, we implement the design on the Xilinx ZCU102 evaluation board. The board features a Xilinx Zynq UltraScale+ MPSoC device [46], which consists of a general-purpose CPU (Processing System) and reconfigurable FPGA logic (Programmable Logic), providing a high degree of flexibility. Fig. 9 shows the overall architecture of the demonstration system.

The system realizes end-to-end processing of object detection, from capturing images to displaying detection results overlaid on the input image (Fig. 11). The proposed object detection processor is implemented in FPGA logic, and it runs the object detection model described in Section III. Data transfers between the CPU and the FPGA are handled by Xilinx interface IPs using the AXI protocol [47]. All model parameters are stored in the processor, and the data channel for input images is implemented using a 512-bit AXI-Stream interface along with a control channel using a 32-bit AXI-Lite interface. The CPU and the object detection processor exchange START and DONE signals through the AXI-Lite interface, while the input image and the network output go through the AXI-Stream interface. The entire process is as follows. When the CPU opens a saved image file or a video stream from the USB port, the image is resized to 416 × 416 and normalized to match the input resolution of the model. Resized images are copied to the shared memory area of DRAM. The DMA IP on the FPGA transfers the data to the object detection processor and writes the object detection results back to DRAM. The CPU converts the data type of the object detection results from integer to 32-bit floating point, reorders the data sequence to match the processing order of the CPU, and copies the 8-bit data between different memory spaces using NEON [48]. After the CPU receives the results from the FPGA, it performs NMS and overlays the location and object type of the detected objects on the image, which is displayed via output channels such as Ethernet and DisplayPort.
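In pseudocode form, the host-side loop might look like the sketch below. All function names passed in (dma_write, dma_read, run_nms, draw_boxes) are placeholders for the platform-specific calls, not an actual Xilinx or project API; only the step order follows the description above.

```python
import cv2
import numpy as np

def detect_one_frame(frame, dma_write, dma_read, run_nms, draw_boxes):
    # 1. Resize to the model's 416x416 input resolution (plus model-specific
    #    input normalization, omitted here).
    img = cv2.resize(frame, (416, 416))
    # 2. Copy into the shared DRAM region; the DMA IP streams it to the
    #    object detection processor over AXI-Stream.
    dma_write(img)
    # 3. Read the raw detector output back from DRAM and convert it to
    #    32-bit floating point for CPU post-processing.
    raw = dma_read().astype(np.float32)
    # 4. NMS and overlay run on the CPU; the result goes to the display.
    boxes = run_nms(raw)
    return draw_boxes(frame, boxes)
```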
B. Implementation Results

Our object detection processor utilizes the variable-precision MAC units described in Section IV-B. We compare the implementation costs of our design and a baseline that has independent datapaths dedicated to (1 W, 1 A), (1 W, 2 A), and (8 W, 8 A) MAC operations. Fig. 10 shows the schematic of the baseline that exhibits the same throughput as our design. Table VI displays the hardware implementation costs of those two designs when synthesized for the target FPGA device. Experimental results show that the proposed variable-precision MAC unit reduces the number of LUTs, flip-flops, and carry logics (CARRY8) by 21.15%, 20.51%, and 40.00%, respectively.

Fig. 10. Schematic of the baseline MAC unit for comparison.

TABLE VI
HARDWARE OVERHEAD COMPARISON

Our object detection processor design is synthesized and placed-and-routed in Xilinx Vivado 2018.3. The target frequency is set to 300 MHz. Hardware resource utilization is summarized in Table VII. The variable-precision MAC unit only uses XNOR gates and adder trees, and those blocks are mapped onto LUT-based logic.

TABLE VII
HARDWARE RESOURCE UTILIZATION

C. Performance Evaluations

We measure the processing latency of the processor as the time difference between when the START command is issued by the CPU and when the CPU receives the DONE signal from the processor. When the processor runs at a 300 MHz clock frequency, the processing latency is measured at 15.5 ms, which translates to a throughput of 64.51 frames/s.

Table VIII compares our design against prior FPGA-based object detection processors. Since they use different bit precisions, we calculate the BitGOPs [51], [52], [53] for each design, which represents the operation count normalized with respect to the bit precision considering actual hardware implementation costs. In general, our object detection model achieves a competitive object detection accuracy, while the numbers of parameters and operations are lower than those of other models with a similar object detection accuracy. The designs in [29] and [31] achieve low BitGOPs and small model sizes, but their accuracy and frame rates are significantly lower than those of our design. The designs in [28] and [30] exhibit similar mAP to ours, but their models have a larger size and require significantly more operations in BitGOPs, which translate to more hardware resources such as DSPs, LUTs, and FFs. In addition, the design in [28] uses a costly full-precision SVR as a detector, resulting in a lower frame rate. The design in [32] achieves a higher mAP score since the model is not deeply quantized and it is trained on a larger train set including the MS COCO dataset [54]. This design also exhibits a high frame rate of 84.8 frames/s, but using high-precision weights and activations unavoidably requires a large amount of external memory access and more complicated computing units.
For comparison between designs, we first calculate the implementation cost as the number of equivalent LUTs. Each FPGA device has different types of DSPs and LUTs, but we can normalize these resources into an equivalent LUT count for a fair cross-device comparison.
TABLE VIII
COMPARISONS WITH PRIOR FPGA-BASED OBJECT DETECTION PROCESSORS
[34] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[35] X. Chen, J. Xu, and Z. Yu, "A 68-mW 2.2 Tops/W low bit width and multiplierless DCNN object detection processor for visually impaired people," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 11, pp. 3444–3453, Nov. 2019.
[36] S. Yin et al., "A high energy efficient reconfigurable hybrid neural network processor for deep learning applications," IEEE J. Solid-State Circuits, vol. 53, no. 4, pp. 968–982, Dec. 2017.
[37] J.-S. Park et al., "9.5 A 6K-MAC feature-map-sparsity-aware neural processing unit in 5nm flagship mobile SoC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 152–154.
[38] A. Agrawal et al., "A 7nm 4-core AI chip with 25.6TFLOPS hybrid FP8 training, 102.4 TOPS INT4 inference and workload-aware throttling," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 64, Feb. 2021, pp. 144–146.
[39] R. J. Wang, X. Li, and C. X. Ling, "Pelee: A real-time object detection system on mobile devices," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Red Hook, NY, USA: Curran Associates, 2018, pp. 1–77. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/9908279ebbf1f9b250ba689db6a0222b-Paper.pdf
[40] L. Liu et al., "On the variance of the adaptive learning rate and beyond," in Proc. Int. Conf. Learn. Represent. (ICLR), 2020, pp. 1–14. [Online]. Available: https://openreview.net/forum?id=rkgz2aEKDr
[41] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," 2016, arXiv:1608.03983.
[42] J. Bethge, C. Bartz, H. Yang, Y. Chen, and C. Meinel, "MeliusNet: An improved network architecture for binary neural networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2021, pp. 1439–1448.
[43] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2009.
[44] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., vol. 32. Vancouver, BC, Canada, Dec. 2019, pp. 8026–8037.
[45] L. Crockett, D. Northcote, C. Ramsay, F. Robinson, and R. Stewart, Exploring Zynq MPSoC: With PYNQ and Machine Learning Applications. Glasgow, Scotland: Strathclyde Academic Media, 2019.
[46] Xilinx. (2023). Zynq UltraScale+ Device Technical Reference Manual. Accessed: Jan. 28, 2023. [Online]. Available: https://docs.xilinx.com/r/en-US/ug1085-zynq-ultrascale-trm/Zynq-UltraScale-Device-Technical-Reference-Manual
[47] ARM. AMBA AXI and ACE Protocol Specification: AXI3, AXI4, and AXI4-Lite, ACE and ACE-Lite. Accessed: Jan. 28, 2023. [Online]. Available: https://developer.arm.com/documentation/ihi0022/e
[48] ARM. Fundamentals of Armv8 Neon Technology. Accessed: Jan. 28, 2023. [Online]. Available: https://developer.arm.com/documentation/102474/0100/Fundamentals-of-Armv8-Neon-technology
[49] Xilinx. (2017). UltraScale Architecture Configurable Logic Block User Guide. Accessed: Jan. 28, 2023. [Online]. Available: https://docs.xilinx.com/v/u/en-US/ug574-ultrascale-clb
[50] Xilinx. UltraScale Architecture DSP Slice User Guide. Accessed: Jan. 28, 2023. [Online]. Available: https://docs.xilinx.com/v/u/en-US/ug579-ultrascale-dsp
[51] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, and K. Keutzer, "Mixed precision quantization of ConvNets via differentiable neural architecture search," 2018, arXiv:1812.00090.
[52] Z. Guo et al., "Single path one-shot neural architecture search with uniform sampling," in Proc. Eur. Conf. Comput. Vis. (ECCV). Cham, Switzerland: Springer, 2020, pp. 544–560.
[53] L. Yang and Q. Jin, "FracBits: Mixed precision quantization via fractional bit-widths," 2020, arXiv:2007.02017.
[54] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV). Cham, Switzerland: Springer, 2014, pp. 740–755.
[55] Xilinx. 7 Series DSP48E1 Slice User Guide. Accessed: Jan. 28, 2023. [Online]. Available: https://docs.xilinx.com/v/u/en-US/ug479_7Series_DSP48E1
[56] Xilinx. Zynq UltraScale+ MPSoC Power Advantage Tool 2018.1. Accessed: Jan. 28, 2023. [Online]. Available: https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841803/Zynq+UltraScale+MPSoC+Power+Advantage+Tool+2018.1
[57] Xilinx. ZCU102 Evaluation Board User Guide. Accessed: Jan. 28, 2023. [Online]. Available: https://www.xilinx.com/support/documents/boards_and_kits/zcu102/ug1182-zcu102-eval-bd.pdf

Wonjae Lee received the B.S. degree from the Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea, in 2019, and the M.S. degree from the Graduate School of Convergence Science and Technology, Seoul National University, in 2022. Since 2022, he has been an Engineer with Samsung Electronics Company Ltd., Hwaseong, South Korea. His research interests include quantized neural network algorithms and systems, and low-power and resource-efficient hardware design for deep learning inference/training.

Kukbyung Kim received the B.S. degree in electrical engineering from Korea University, Seoul, South Korea, in 2009, and the M.S. degree from the Graduate School of Convergence Science and Technology, Seoul National University, Seoul, in 2022. Since 2012, he has been a Senior Engineer with Hanwha Systems Company Ltd., Seongnam, South Korea, where he is currently working on embedded hardware development. His current research interests include object detection algorithms, hardware accelerators, field-programmable gate arrays (FPGAs), and low-power embedded systems.

Woohyun Ahn received the B.S. and M.S. degrees in electrical engineering from Chungbuk National University, Cheongju-si, South Korea, in 2012 and 2014, respectively. Since 2015, he has been a Senior Engineer with Hanwha Systems Company Ltd., Seongnam, South Korea, where he is currently working on embedded software development. His current research interests include digital signal processing and embedded software implementation such as DSP, field-programmable gate arrays (FPGAs), and system-on-chip (SoC).

Jinho Kim received the B.S. and M.S. degrees in information and telecommunication engineering from Korea Aerospace University, Goyang-si, South Korea, in 2004 and 2006, respectively. Since 2010, he has been the Chief Engineer of Hanwha Systems Company Ltd., Seongnam, South Korea, where he is currently working as a System Engineer. His current research interests include object detection algorithms, hardware accelerators, and system integration.

Dongsuk Jeon (Member, IEEE) received the B.S. degree in electrical engineering from Seoul National University, Seoul, South Korea, in 2009, and the Ph.D. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2014. From 2014 to 2015, he was a Postdoctoral Associate with the Massachusetts Institute of Technology, Cambridge, MA, USA. He is currently an Associate Professor with the Graduate School of Convergence Science and Technology, Seoul National University. His current research interests include hardware-oriented machine learning algorithms, hardware accelerators, and low-power circuits.