Hardware/Software Co-Design for JPEG Encoder Test Bench
Xiaoying Liang

Guangdong Women's Polytechnic College, minnielxy@gmail.com
Abstract
This paper presents a hardware/software (HW/SW) co-design approach that uses the System on a
Programmable Chip (SOPC) technique to implement the Joint Photographic Experts Group (JPEG)
algorithm. It first introduces JPEG image compression technology and the system architecture, and
then describes the hardware/software design process of the JPEG encoder test bench. The design
exploits the structure of the Field-Programmable Gate Array (FPGA) to implement the JPEG algorithm,
including an improved Discrete Cosine Transform (DCT), and exploits the customizable Nios II
embedded processor: image acquisition, JPEG image compression and the Thin Film Transistor Liquid
Crystal Display (TFT-LCD) controller are packaged as user-defined modules that meet the Altera
Avalon bus requirements and are added with SOPC Builder to a system controlled by the Nios II soft
core. Finally, the whole system is verified on a single FPGA chip. The experimental results show that
implementing the JPEG algorithm as an FPGA hardware module offers low power consumption, high
image quality, low production cost and stable performance. This is of practical significance for
reducing cost and improving image processing speed.
Keywords: JPEG, SOPC, FPGA, Nios II Processor, Intellectual Property (IP) Core

1. Introduction
In recent years, with the development of the Internet and multimedia technology, ever higher demands
have been placed on the ability of computers to process multimedia, and compression coding has
become crucial for storing and transmitting the large amounts of data involved. It is therefore
necessary to study how image coding algorithms can be mapped onto application-specific integrated
circuits. Advances in Integrated Circuit (IC) manufacturing processes and Electronic Design
Automation (EDA) technologies have greatly promoted Very Large Scale Integration (VLSI) design and
make it possible to implement digital image signal processing on a programmable chip. The FPGA
belongs to this type of chip. FPGAs offer many advantages for implementing digital systems, including
programmable hard-wired logic, fast time-to-market, a shorter design cycle, embedded processors, low
power consumption and higher density. The FPGA provides a bridge between application-specific
integrated circuit (ASIC) hardware and general-purpose processors [1]. Furthermore, embedded
processor IP and application IP can now be developed and downloaded into an FPGA to construct an
SOPC environment [2-4]. This allows the user to design an SOPC module by combining hardware and
software in a single FPGA chip. Software/hardware co-design increases the programmability and
flexibility of the designed digital system, reduces development time and enhances system
performance [5-7].
This paper presents an efficient and flexible HW/SW co-design architecture that implements a
baseline JPEG encoder. The specific contributions are as follows:
1) By analyzing the JPEG coding standard and using top-down modular design, a baseline JPEG
encoder is proposed that makes full use of pipelining and parallel structures to achieve higher
speed and throughput.
2) In the RTL design, an improved Discrete Cosine Transform is presented, based on a study of the
DCT, the core algorithm of JPEG. It both speeds up the DCT and reduces memory consumption.
3) To test the JPEG encoder, an embedded system platform based on the Nios II is built on an FPGA.
Following IP core design standards, the Camera IP, TFT-LCD IP and other IP resources are designed
around the Avalon standard bus interface. These IP cores are designed for platform reuse, and both
their hardware and software can be modified, scaled and reconfigured. They can meet the demands of
general image and video processing systems.
Advances in Information Sciences and Service Sciences (AISS), Volume 4, Number 2, February 2012
doi: 10.4156/AISS.vol4.issue2.32
4) Compared with traditional image processing systems that use only software or only hardware, the
software and hardware of this system work closely together, so the system achieves a better
balance between flexibility and performance.
2. JPEG baseline encoder

The basic model for the JPEG encoder is shown in Figure 1.
Before the image data are input to the JPEG encoder, they are divided into macro blocks of 16 x 16
pixels, and every macro block is divided into four non-overlapping sub-blocks of 8 x 8 pixels. The
encoder processes the data one sub-block at a time, and the unsigned integer pixel values are first
converted to signed integer format. The DCT is then computed on each block, yielding 64 discrete
cosine transform coefficients in the frequency domain. The first coefficient in every 8 x 8 block is
the Direct Current (DC) coefficient; the remaining 63 are the Alternating Current (AC) coefficients.
The DCT concentrates most of the block energy in the lower spatial frequencies, while the
coefficients at higher frequencies have values equal or close to zero and can be discarded during
encoding without significantly affecting the image quality.
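The level shift and block transform described above can be illustrated with a minimal Python sketch. It is a floating-point reference model, not the hardware implementation of this paper; the helper names (dct_matrix, matmul, transpose, dct2) are hypothetical, and the level shift by 128 assumes 8-bit input pixels.

```python
import math

N = 8  # JPEG sub-block size


def dct_matrix(n=N):
    """Orthonormal DCT-II matrix C, arranged so that a row vector x maps to x @ C."""
    c = [[0.0] * n for _ in range(n)]
    for m in range(n):        # sample index (row of C)
        for l in range(n):    # frequency index (column of C)
            scale = math.sqrt(1.0 / n) if l == 0 else math.sqrt(2.0 / n)
            c[m][l] = scale * math.cos((2 * m + 1) * l * math.pi / (2 * n))
    return c


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(block):
    """2-D DCT of one 8 x 8 block of unsigned 8-bit pixels."""
    # Level shift: unsigned [0, 255] -> signed [-128, 127] (assumes 8-bit pixels).
    shifted = [[p - 128 for p in row] for row in block]
    c = dct_matrix()
    return matmul(matmul(transpose(c), shifted), c)


# A flat block has all of its energy in the DC coefficient.
flat = [[130] * N for _ in range(N)]
coeffs = dct2(flat)
```

For the flat block, coeffs[0][0] is the only coefficient that differs from zero (up to rounding error), which is the energy-compaction behaviour in its simplest form.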
To discard the higher frequencies, a quantization step follows the DCT computation. It uses
quantization tables predefined by the user; their selection is critical, since it affects both the
compression efficiency and the reconstructed image quality.
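As a hedged illustration of the quantization step, the sketch below divides each DCT coefficient by the matching table entry and rounds. The table is the example luminance table from Annex K of the JPEG standard, used here only as an assumption, since the paper does not state which tables the encoder uses; quantize and round_half_away are hypothetical helper names.

```python
import math

# Example luminance quantization table from Annex K of the JPEG standard
# (assumption: the paper does not say which user-defined table it uses).
LUMA_QTABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def round_half_away(v):
    """Round to the nearest integer, with ties going away from zero."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def quantize(coeffs, qtable):
    """Element-wise division of DCT coefficients by the table, then rounding."""
    return [[round_half_away(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, qtable)]


# Hypothetical coefficient block: a DC term, one low and one high frequency.
coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0] = 160.0
coeffs[0][1] = -35.0
coeffs[7][7] = 40.0
quantized = quantize(coeffs, LUMA_QTABLE)
```

The high-frequency entry is divided by a large step and rounds to zero, while the low-frequency entries survive; this is how the table trades image quality against compression.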
Quantization yields the quantized DCT coefficient matrix. Because the statistical properties of the
DC coefficient and the AC coefficients differ considerably, they are processed separately. JPEG
applies Differential Pulse Code Modulation (DPCM) to the DC coefficient, which is the first element
in the top left corner of the 8 x 8 block. As most of the remaining 63 (AC) coefficients have values
equal or close to zero, they are compressed with run-length encoding, with Huffman codes assigned to
the run lengths. To make the run-length encoding efficient, the highest frequencies should be
visited last, so zigzag reordering is used.
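The zigzag scan, DC prediction and run-length pairing can be sketched as follows. This is a simplified model under stated assumptions: Huffman coding and the special code for runs of 16 zeros are omitted, the pair (0, 0) stands for end-of-block, and all function names are hypothetical.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zigzag scan order."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Walk the anti-diagonals; odd diagonals run top-to-bottom, even ones bottom-to-top.
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def zigzag(block):
    """Flatten a quantized block so that high frequencies come last."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def dpcm_dc(dc_values):
    """Replace each DC value by its difference from the previous block's DC."""
    diffs, prev = [], 0
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


def run_length_ac(ac):
    """Encode AC coefficients as (zero_run, value) pairs; (0, 0) marks end-of-block."""
    pairs, run = [], 0
    for v in ac:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run > 0:               # trailing zeros collapse into one end-of-block marker
        pairs.append((0, 0))
    return pairs


# Hypothetical quantized block with four non-zero low-frequency coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 10, -3, 2, 1
scan = zigzag(block)
ac_pairs = run_length_ac(scan[1:])     # scan[0] is the DC coefficient
dc_diffs = dpcm_dc([10, 12, 9])        # DC values of three consecutive blocks
```

Because the scan visits the high frequencies last, the 59 trailing zeros of this block collapse into a single end-of-block pair.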
To achieve a better compression result, input images are transformed to a different color space
(or color coordinates) before being input to the encoder. One of the most appropriate color spaces
for the JPEG algorithm has been shown to be YCbCr, which takes the three standard channels (Red,
Green, Blue) and maps them into a different representation based on a luminance (brightness)
channel and two opposing color channels. After this conversion step, the JPEG image compression
algorithm can apply more compression to the color information channels than to the luminance
channel and still arrive at an acceptable image quality.
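A minimal sketch of the color space conversion is given below. It assumes the full-range coefficients of the JFIF convention for 8-bit samples; the paper does not state which variant of YCbCr the system uses, and rgb_to_ycbcr is a hypothetical name.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr (JFIF full-range coefficients, assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128

    def clamp(v):
        # Round and keep the result inside the 8-bit range.
        return max(0, min(255, int(round(v))))

    return clamp(y), clamp(cb), clamp(cr)


white = rgb_to_ycbcr(255, 255, 255)
black = rgb_to_ycbcr(0, 0, 0)
red = rgb_to_ycbcr(255, 0, 0)
```

Gray pixels map to Cb = Cr = 128, so for typical images the two chroma channels vary less than the luminance channel and tolerate coarser quantization.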
Figure 1. The JPEG Baseline Encoder
3. The architecture
An FPGA hardware/software co-design approach is becoming increasingly popular for implementing
digital circuits. Part of the system is developed in software for flexibility and ease of upgrading,
and it is completed with hardware IP blocks for cost reduction and performance. Altera provides the
SOPC Builder tool for the quick creation and easy evaluation of embedded systems. Using SOPC
Builder, the system proposed in this paper has been developed with the Nios II processor and the
peripherals needed to support its correct operation. These peripherals are the program and data
memories (DDR SDRAM, SRAM and FLASH), two UARTs that communicate with the PC, provide debug
information and program the processor, input and output ports that read the data from the camera
and deliver the output signal to the LCD, and ports for timing and synchronization. All these
peripherals are connected to the Avalon bus in a single master/slave configuration, where the bus
masters are the Nios II processor and the DMA controller. In addition, the Nios II configuration
chosen is the Nios II/fast, which provides the best performance for the processing unit. The diagram
of the system structure is shown in Figure 2.
Figure 2. System Structure Diagram
4. The HW/SW co-design platform

4.1. Hardware platform
Figure 3. Cyclone II Development Board

The hardware structure of the JPEG encoder test bench is shown in Figure 3.
1) The FPGA (Cyclone II EP2C35) is the core component. It controls the image camera, the TFT
display, the DDR SDRAM memory and so on, and its embedded Nios II soft CPU performs the image
processing and analysis.
2) The CMOS camera module (connected to the Altera Daughter Card) acquires the target image. Using
the single-chip CMOS color digital camera OV7620, the target image is converted directly to a
digital image.
3) DDR SDRAM, SRAM and FLASH serve as the image frame buffer, as storage for intermediate data and
as storage for the image processing program.
4) The TFT-LCD interface module (connected to the Altera Daughter Card) acts as a bridge between
the SRAM and the TFT display.
5) The serial configuration device stores the configuration data of the FPGA. When the FPGA powers
up, the serial configuration device sends this data to the FPGA.
6) The JTAG port is a dedicated port that uses the IEEE Std 1149.1 JTAG interface pins and supports
the JAM STAPL standard.
7) The UART serial port is used as the debug port for the Nios II and for image data output.
8) The clock module produces the system clock from a 50 MHz external clock.
9) The Altera Daughter Card is a connector that meets the Altera development board expansion
standard and is used to connect the image camera module and the TFT-LCD interface module.
10) The keys and LEDs provide program control and result display.
4.2. The software implementation

4.2.1. Camera controller
Figure 4 shows a simple block diagram of the camera controller, which consists of three parts. The
first part is the CMOS camera interface, which is responsible for capturing the image data
reliably. The second part is a FIFO that temporarily stores the output of the CMOS camera. The
third part is the Avalon Streaming Interface, which supports the unidirectional flow of data,
including multiplexed streams, packets, and DSP data.
Correspondingly, the HDL code of the camera controller consists of three files:
camera_interface.v, camera_pixel_fifo.v and camera_controller_stream.v. camera_interface.v is the
top-level file, which includes both the Avalon Streaming Interface and an instance of the FIFO
module.
Figure 4. Structure of the Camera IP Core

4.2.2. LCD controller
Figure 5. Structure of the LCD IP Core

Driving the TFT-LCD requires transferring a large amount of data. In standard VGA mode (640 x 480,
60 Hz), the scan period of each pixel is only about 40 ns. Clearly, it is hard to sustain such a
high-speed data transfer with Nios II CPU software alone. The solution is to implement a TFT-LCD
controller with an Avalon Streaming Interface and to build a transmission channel between the
TFT-LCD controller and the SDRAM using a DMA controller. The Nios II can then update the TFT-LCD
simply by operating on the SDRAM frame buffer.
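The 40 ns figure can be checked with a short calculation. The sketch below assumes the common VGA timing of 800 x 525 total pixel slots per frame, including blanking; these totals are not given in the paper and are stated here as an assumption.

```python
# Assumed standard VGA timing for 640 x 480 at 60 Hz (blanking intervals included).
H_ACTIVE, V_ACTIVE = 640, 480
H_TOTAL, V_TOTAL = 800, 525
REFRESH_HZ = 60

pixel_clock_hz = H_TOTAL * V_TOTAL * REFRESH_HZ          # pixel slots per second
pixel_period_ns = 1e9 / pixel_clock_hz                   # time available per pixel
active_pixels_per_s = H_ACTIVE * V_ACTIVE * REFRESH_HZ   # pixels actually fetched

print(f"pixel clock  : {pixel_clock_hz / 1e6:.1f} MHz")
print(f"pixel period : {pixel_period_ns:.2f} ns")
print(f"frame buffer : {active_pixels_per_s / 1e6:.3f} Mpixel/s")
```

Under these assumptions a pixel must be delivered roughly every 40 ns, continuously, which is why the transfer is handed to a DMA controller and a FIFO instead of a software loop.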
Figure 5 shows a simple block diagram of the TFT-LCD controller, which consists of three parts. The
first part is the TFT-LCD timing generator. The second part is the FIFO. The third part is the
Avalon Streaming Interface, which supports the unidirectional flow of data, including multiplexed
streams, packets, and DSP data. Correspondingly, the HDL code of the LCD controller consists of
three files: lcd_timing.v, lcd_pixel_fifo.v and lcd_controller_stream.v. lcd_controller_stream.v is
the top-level file, which includes the Avalon Streaming Interface as well as instances of the
timing generator and the pixel FIFO.
4.2.3. JPEG encoder
Figure 6. Class Hierarchy for JPEG Encoder

The block diagram of the implemented JPEG encoder is shown in Figure 6. In the baseline JPEG
process, the DCT is the most complex and important operation that needs to be performed. Our
implementation of the DCT is a slightly modified version of that presented in [8]. The Discrete Cosine
Transform is a Fourier-related transform consisting of a set of basis vectors that are sampled cosine
functions. The two-dimensional DCT of an N-by-N matrix X is defined as follows.
$$ Z = C^{t} X C \tag{1} $$
where X is the data matrix, C is the matrix of DCT coefficients, and $C^{t}$ is the transpose of C.
Denoting the 1-D DCT of an N x N data matrix X by Y = XC, and letting the elements of the data
matrix X be represented in two's complement code, the (k, l)th element of Y is
$$ y_{k,l} = -\sum_{m=1}^{N} c_{m,l}\, x_{k,m}^{(n-1)}\, 2^{n-1} + \sum_{j=0}^{n-2} \sum_{m=1}^{N} c_{m,l}\, x_{k,m}^{(j)}\, 2^{j} \tag{2} $$
where $c_{m,l}$ is the element in the mth row and lth column of C, $x_{k,m}^{(j)}$ is the jth bit of
$x_{k,m}$, which is the element in the kth row and mth column of X and has a value of either 0 or 1,
n is the number of bits $x_{k,m}$ carries, and $x_{k,m}^{(n-1)}$ is the sign bit.
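Equation (2) can be checked numerically with a small sketch. It evaluates one inner product bit plane by bit plane, so that each partial sum depends only on a single bit of every input. Integer coefficients are used purely to make the comparison exact; the function names and the test vector are hypothetical.

```python
def bit(x, j):
    """Bit j of x in two's complement (Python's >> sign-extends negative integers)."""
    return (x >> j) & 1


def dot_bit_serial(xs, cs, n):
    """Inner product sum(c * x) evaluated one bit plane at a time, as in Equation (2)."""
    # The sign-bit plane carries the negative weight -2^(n-1).
    acc = -sum(c * bit(x, n - 1) for x, c in zip(xs, cs)) * 2 ** (n - 1)
    # The remaining planes j = 0 .. n-2 carry the positive weights 2^j.
    for j in range(n - 1):
        acc += sum(c * bit(x, j) for x, c in zip(xs, cs)) * 2 ** j
    return acc


xs = [5, -3, 0, -128]   # 8-bit two's complement samples (one row of X)
cs = [2, -1, 7, 3]      # stand-in integer coefficients (one column of C)
direct = sum(c * x for x, c in zip(xs, cs))
serial = dot_bit_serial(xs, cs, n=8)
```

Both results agree, which illustrates why the products can be replaced by sums of coefficients selected by input bits, followed by shifts and additions.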
By considering characteristics of the DCT matrix, it can be shown that
$$ y_{k,l} = \sum_{m=1}^{N/2} u_{k,m}\, c_{m,l} \tag{3} $$
where l = 1, 3, ..., N-1 with $u_{k,m} = x_{k,m} + x_{k,N-m+1}$, and
$$ y_{k,l} = \sum_{m=1}^{N/2} v_{k,m}\, c_{m,l} \tag{4} $$
where l = 2, 4, ..., N, with $v_{k,m} = x_{k,m} - x_{k,N-m+1}$.

Equations (3) and (4) imply that the variables u and v can be used to replace the original data
sequence x, which halves the number of terms in each sum. Figure 7 shows the detailed schematic
diagram of the implemented 1-D DCT. The substitution is the same as the first-stage butterfly used
in most fast algorithms. The transform is performed with serial adders and subtractors rather than
multipliers and requires far fewer logic resources.
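The folding expressed by Equations (3) and (4) can be verified against the direct 1-D transform with the following sketch. The code uses 0-based indices, so the paper's l = 1, 3, ... corresponds to even values of l in the code; the orthonormal scaling of C and the sample row are assumptions made for the example.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C with y = x C (0-based indices)."""
    c = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for l in range(n):
            scale = math.sqrt(1.0 / n) if l == 0 else math.sqrt(2.0 / n)
            c[m][l] = scale * math.cos((2 * m + 1) * l * math.pi / (2 * n))
    return c


def dct_1d_direct(x, c):
    """Reference transform: N multiply-accumulates per output."""
    n = len(x)
    return [sum(x[m] * c[m][l] for m in range(n)) for l in range(n)]


def dct_1d_folded(x, c):
    """Transform using the first-stage butterfly: N/2 terms per output."""
    n = len(x)
    half = n // 2
    u = [x[m] + x[n - 1 - m] for m in range(half)]   # sums feed the symmetric columns
    v = [x[m] - x[n - 1 - m] for m in range(half)]   # differences feed the antisymmetric ones
    y = []
    for l in range(n):
        src = u if l % 2 == 0 else v                 # 0-based even l is the paper's odd l
        y.append(sum(src[m] * c[m][l] for m in range(half)))
    return y


row = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical pixel row
C = dct_matrix(len(row))
y_direct = dct_1d_direct(row, C)
y_folded = dct_1d_folded(row, C)
```

The two outputs match to within rounding error, because each column of C is either symmetric or antisymmetric about its midpoint.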
Figure 7. Schematic of Actual 8 x 1 DCT
4.2.4. Workflow software
Figure 8. Main Program Flowchart

The working process of the system is shown in Figure 8. After power-on, the system first initializes
the camera for 8-bit data output. The sampled data is then stored in DDR SDRAM through the DDR
SDRAM controller. Because the system has two DDR SDRAM devices, it can easily implement ping-pong
operation to meet the demands of high-speed data buffering and pipelined processing. In addition,
the function of the system can be selected through the switches. If SW4 is on, a bridge is built
between the DDR SDRAM data bus and the TFT-LCD controller, and the sampled image is displayed
directly on the TFT-LCD. If SW4 is off, then under the coordination of the Nios II processor the
first DMA controller sends the image data from DDR SDRAM to the JPEG hardware encoder for encoding,
and the second DMA controller stores the encoded data in SRAM for further processing. To verify the
correctness of the JPEG hardware encoder, the system provides two methods: 1) The encoded data from
SRAM is decoded by a software program on the Nios II using a decoding function library. 2) When SW5
is turned on, the encoded data from SRAM is transferred to the PC through the serial port, where it
is decoded and displayed by third-party software. If the data can be decoded and displayed
successfully and the result is essentially the same as the original image, the design is considered
correct.
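The ping-pong operation mentioned above can be pictured with a small software model. It is only an illustration of the buffer alternation under simplifying assumptions: in the real system the capture and processing sides run concurrently in hardware, whereas this sketch serializes them, and the function name is hypothetical.

```python
def ping_pong(frames):
    """Alternate two buffers: one is filled while the other is drained."""
    buffers = [None, None]
    write_idx = 0
    processed = []
    for frame in frames:
        buffers[write_idx] = frame                # capture side fills one buffer
        read_idx = 1 - write_idx
        if buffers[read_idx] is not None:         # processing side drains the other
            processed.append((read_idx, buffers[read_idx]))
        write_idx = read_idx                      # swap roles for the next frame
    last = 1 - write_idx
    if buffers[last] is not None:                 # drain the final captured frame
        processed.append((last, buffers[last]))
    return processed


order = ping_pong(["frame0", "frame1", "frame2"])
```

Each frame is read from the buffer it was written to one step earlier, so capture never overwrites data that is still being encoded or displayed.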
5. Experimental results
After the design of the system software and hardware is finished, the SOPC system must be tested to
confirm the correctness of the design and the performance of the system.
5.1. Image processing results

To verify the result of image compression in the design, one method is to connect the PC to the
RS232 serial port of the FPGA development board with a serial cable. Using serial port
communication software (such as "Serial port debug assistant"), the data read from SRAM can be
observed to verify the correctness of the SOPC system. Figure 9 shows the JPEG encoder development
board and the JPEG compressed data observed in the serial port debug assistant.
Figure 9. Experiment. Left: JPEG Encoder Development Board. Right: JPEG Compression Data
5.2. Performance analysis of image compression

5.2.1. Objective evaluation
The peak signal-to-noise ratio (PSNR) is the most commonly used objective measure of grayscale
image quality. It is defined as
$$ \mathrm{PSNR} = 10 \log_{10} \frac{A^{2}}{\frac{1}{NM} \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} \left[ f(n,m) - f_{n}(n,m) \right]^{2}} \ (\mathrm{dB}) \tag{5} $$
where f(n,m) is the original image, $f_{n}(n,m)$ is the reconstructed grayscale image, the image
size is N x M, and A is the maximum of f(n,m). The results are shown in Table 1.
Table 1. Comparison of image objective evaluation

Design            Test image   Bit rate (bpp)   PSNR (dB)
Proposed encoder  Lenna        1.597            37.574
ACDSee            Lenna        1.597            39.255
As can be seen from the table, at the same bit rate the PSNR of the proposed encoder is within
about 1.7 dB of that of the pure software encoder, so the difference in compression quality is
small.
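Equation (5) translates directly into code. The sketch below is a plain Python reference with a hypothetical function name and toy images; it takes A as the maximum of the original image unless a peak value is supplied.

```python
import math


def psnr(original, reconstructed, peak=None):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    n, m = len(original), len(original[0])
    if peak is None:
        peak = max(max(row) for row in original)   # A: maximum of the original image
    mse = sum((original[i][j] - reconstructed[i][j]) ** 2
              for i in range(n) for j in range(m)) / (n * m)
    if mse == 0:
        return math.inf                            # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Toy 2 x 2 images that differ by one gray level in every pixel.
original = [[255, 0], [0, 255]]
decoded = [[254, 1], [1, 254]]
value = psnr(original, decoded)
```

With a mean squared error of 1 and A = 255, the result is 10 log10(255^2), or about 48.13 dB.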
5.2.2. Subjective evaluation
Subjective evaluation means judging the quality of an image with the naked eye. The experimental
results show that the JPEG files produced by the proposed encoder can be decoded and displayed
correctly by third-party software. Compared with images encoded and decoded purely in software, no
difference can be distinguished by a human observer. In particular, when the compression quality
is 50%, the two images are essentially the same. The reason for the remaining difference is that
the internal arithmetic in the FPGA uses at most 12 bits. A lower quality factor means a larger
quantization step, so the difference in quantization error between the proposed DCT and ACDSee
also becomes smaller.
5.2.3. Comparisons over space and time

To show the efficiency of the encoder, Table 2 compares the proposed encoder with some commercial
IP cores. On the same devices and speed grades, the proposed encoder achieves logic element counts
and clock frequencies comparable to those of the commercial cores, without using the embedded
multipliers in the device.
Table 2. Comparison of compression efficiency

Developer                 Device   Speed grade   Resource                          Clock frequency
Proposed encoder          EP2C35   -8            6606 LEs                          107 MHz
Proposed encoder          EP1S25   -7            6682 LEs                          119 MHz
Proposed encoder          EP2C20   -6            6608 LEs                          150 MHz
JPEG_Fast_E (CAST, Inc)   EP1S25   -7            6355 LEs                          93 MHz
JPEG_E (CAST, Inc)        EP2C20   -6            5,337 LEs, 9 M4Ks, 19 DSP-9bit    154 MHz
6. Conclusion
The new generation of FPGA technologies enables a commercial soft-core processor and application IP
to be integrated in an SOPC development environment. The benefit of a soft-core processor is that
it adds micro-programmed logic, which introduces more flexibility. In this paper, we have therefore
presented an efficient HW/SW co-design architecture for a JPEG encoder and its FPGA implementation.
It is based on a Nios II CPU and a set of specialized processors and interfaces that implement the
JPEG baseline encoder. The whole design has been tested on a Nios II development board, and
experimental results have been presented. The results show that the proposed system is flexible and
stable and can be used in a wide range of video system applications, particularly in consumer
products such as smartphones.
7. References
[1] Jianbo Xu, Jing Long, Wei Liang, Weihong Huang, "A DFA-based Distributed IP Watermarking
Method Using Data", JCIT: Journal of Convergence Information Technology, Vol. 6, No. 8, pp.
152-160, 2011.
[2] Yang-Hsin Fan, Trong-Yen Lee, "Grey Relational Hardware-Software Partitioning for Embedded
Multiprocessor FPGA Systems", AISS: Advances in Information Sciences and Service Sciences,
Vol. 3, No. 3, pp. 32-39, 2011.
[3] Hejin Liu, Kejun Li, Ying Sun, Ruzhen Li, Wenli Wang, Zhenyu Zou, "Design and
implementation of SOPC-based frequency variable inverter", Dianwang Jishu/Power System
Technology, Vol. 35, No. 2, pp. 194-200, 2011.
[4] Yang Yu, Yefu Chen, Yu Peng, "An SOPC test strategy based on wrapper/TAM co-optimization",
In Proceedings of the 10th International Conference on Electronic Measurement and Instruments,
pp.331-335, 2011.
[5] Jigang Tong, Zhenxin Zhang, Qinglin Sun, Zengqiang Chen, "Design of node with SOPC in the
wireless sensor network", ICIC Express Letters, Vol. 4, No. 5B, pp. 1869-1874, 2010.
[6] Chih-Min Lin, Ming-Hung Lin, Chun-Wen Chen, "SoPC-based adaptive PID control system
design for magnetic levitation system", IEEE Systems Journal, Vol. 5, No. 2, pp. 278-287, 2010.
[7] Lionel Damez, Loic Sieler, Alexis Landrault, Jean Pierre Drutin, "Embedding of a real time
image stabilization algorithm on a parameterizable SoPC architecture a chip multi-processor
approach", Journal of Real-Time Image Processing, Vol. 6, No. 1, pp. 47-58, 2011.
[8] Ming-Ting Sun, Ting Chung Chen, Albert M. Gottlieb, "VLSI Implementation of a 16 X 16
Discrete Cosine Transform", IEEE Transactions on Circuits and Systems, Vol. 36, No. 4, pp. 610-617,
1989.