
Cairo University SPARC V2 (CUSPARC V2) PROCESSOR
S. E. D. Habib*, Mohamed Wagih I Ismail, Ahmed Ibrahim S. Khalil, Ezz El-Din O. Hussein, Alhassan F. Khedr,
Safaa A. Abdelfattah, Ahmed Reda, and Mohamed Elgendy
Electronics and Communications Engineering Department, Cairo University, Giza, Egypt
*Corresponding Author: seraged@ieee.org

Abstract- The Cairo University SPARC (CUSPARC) processor is an IP embedded processor based on the SPARC V8 ISA standard. It is fully developed at Cairo University and is the first Egyptian processor. CUSPARC was implemented on several FPGA and ASIC platforms. Its first ASIC version (CUSPARC V1) was implemented on Si in 2010 using the IBM 0.13 µm CMOS 8RF-DM process.
This paper reports the design, fabrication and testing of an enhanced version of CUSPARC, labelled CUSPARC V2. CUSPARC V2 is augmented with an ADPLL, a hardware multiplier, and four JTAG scan chains.

Keywords:
IP processors, Embedded processor, SPARC V8, CUSPARC, ADPLL, JTAG.

I: INTRODUCTION
CUSPARC [1,2] is an intellectual property embedded processor fully developed at Cairo University, Egypt. It conforms to the SPARC V8 ISA [3]. CUSPARC was implemented on several FPGA platforms as well as on the IBM 0.13 µm CMOS 8RF-DM process [2]. This early ASIC implementation of the CUSPARC processor is, henceforth, labelled CUSPARC V1. A many-core message-passing processor based on the CUSPARC V1 core was designed [4]. It consists of 16 cores arranged in a 4x4 mesh architecture.

This paper introduces the design, fabrication, and testing of the CUSPARC V2 processor. This V2 version integrates a 32-bit hardware multiplier, an All-Digital PLL, and four JTAG scan chains with the CUSPARC V1 core. Section II highlights the enhancements added to CUSPARC V2. Section III outlines the design methodology adopted. Section IV presents the performance measures of CUSPARC V2, and compares them with other processors manufactured at the same technology node. Section V concludes this paper.

II: CUSPARC V2 ENHANCEMENTS
Fig. 1 shows the overall CUSPARC V2 architecture. The three blocks with a gray background color identify the CUSPARC V2 enhancements over the V1 version. The other blocks with a white background color are essentially the same as in CUSPARC V1 [1,2]. The following paragraphs present the features of the main enhancements added to CUSPARC V2.

Fig. 1: CUSPARC V2 Architecture

A: HW Multiplier
An on-chip integer multiplier is added in CUSPARC V2. Hardware multiplication improves the Clocks Per Instruction (CPI), DMIPS/MHz, and power-efficiency metrics of the processor as well as its performance for DSP applications.
The implemented multiplier is a single-cycle 32-bit multiplier that is optimized for the power-delay product. A single-cycle multiplier is selected rather than a pipelined multiplier so as to keep the processor pipeline simple. However, a single-cycle multiplier should be carefully optimized to avoid a significant decrease in the maximum frequency of operation. Our final multiplier design uses a simple partial product generator with (7;3) compressors for the partial product accumulator. The final stage adder is a ripple block-carry look-ahead adder [5,6].

B: All Digital PLL
On-chip frequency synthesizers are mandatory for

978-1-5090-5721-4/16/$31.00 ©2016 IEEE 301 ICM 2016


Authorized licensed use limited to: Concordia University Library. Downloaded on January 26,2022 at 15:34:29 UTC from IEEE Xplore. Restrictions apply.
processors working at frequencies exceeding ~300 MHz. PLLs are typically used to implement these frequency synthesizers. We opted for an All-Digital PLL (ADPLL) design approach for its ease of design and ease of portability amongst different technology nodes. Fig. 2 shows a simplified block diagram of the ADPLL architecture. Our ADPLL is built mainly of a Digitally Controlled Oscillator (DCO), a Digital Loop Filter (DLF), and synchronization blocks.

Our DCO is composed of many inverting delay sections connected as a ring oscillator. Coarse control bits control the number of sections in the ring. Additionally, fine and very fine frequency control are effected via Digitally Controlled Varactor (DCV) cells [7,8]. The DCO control word is made of 44 control bits, of which 16 are for coarse control, 12 for fine control, and 16 for very fine control. A binary search algorithm is used to speed up the lock time. A synchronizer is used to sample the reference clock by the fast output clock of the DCO. This synchronizer represents a compromise solution between the low-performance bang-bang phase detector and the high-cost TDC. The digital loop filter is a PI controller with an optimized digital implementation to reduce the power consumption and speed up the calculations.

[Fig. 2 block diagram. Labelled blocks: Synchronization (Ref Edge, Div Edge), Binary Search Algorithm FSM, coarse and fine shift registers (Shift R/L), Digital Loop Filter (DLF), Digitally Controlled Oscillator (DCO), 1st-order ΣΔ Fraction Modulator; signals: Fref, Phase Error, FDCO.]
Fig. 2: Architecture of the implemented ADPLL

Several features were added to the ADPLL to facilitate its testing. Mainly, we augmented our ADPLL with an internal custom scan path to control and initialize the ADPLL operation. Using this scan chain, we can set the ADPLL to work in open loop or closed loop mode. The frequency multiplication ratio of the PLL can also be set via this scan chain.

C: JTAG Scan chains
One boundary scan chain as well as three internal JTAG scan chains are implemented in CUSPARC V2. The boundary scan chain covers all I/Os of the chip. The internal JTAG scan chains are organized as follows:
• Register file scan chain, covering all registers of the windowed register file of CUSPARC V2.
• Memory data and address (bus) scan chain, covering the signals of the fast Wishbone bus of the processor.
• Status and control registers scan chain.

Fig. 3 depicts a bird's-eye view of how the three internal scan chains of CUSPARC V2 are organized.

Fig. 3: Organization of the three internal JTAG scan chains

III: Design Flow
A classical digital design flow is adopted. Our RTL coding style for CUSPARC is mainly structural. A standard cell design flow is followed. ARM's Artisan standard cell library for the IBM CMOS 130 nm 8RF technology is used in this flow. Synopsys Design Compiler (DC) is used for the synthesis process. Cadence SoC Encounter is used for standard cell place and route. The Mentor Graphics HDL simulator ModelSim is heavily used at all design levels up to full post-layout testing of the full chip.

IV: Test Results and Performance Measures

A: CUSPARC Test Board
A test PCB is developed specifically to test the fabricated CUSPARC chip. This board contains mainly the CUSPARC chip, a 256 Mbit flash memory, three 2Mx36 bit SSRAM chips, a Cyclone FPGA chip, a 24 MHz crystal oscillator, an RS232 interface, and a 7-inch capacitive touch LCD.

The configuration of this board allows several modes of testing the CUSPARC V2 chip. The processor bus is passed to the FPGA to do glue logic operations. More importantly, passing the important processor signals to the FPGA enables us to monitor all the processor bus activities via the SignalTap on-chip logic analyzer of Altera FPGAs. Also, CUSPARC can run small programs loaded into the FPGA built-in memory modules, independent of the SRAM and flash memories. Furthermore, the digital clock management modules inside the FPGA allow a very simple way of testing the processor at different clock frequencies.

B: Core Testing
B.1 DMIPS/MHz
The Dhrystone metric {Dhrystone Million Instructions Per Second (DMIPS)} is widely used to measure the performance of embedded processors [9]. The DMIPS/MHz metric is obtained from the following equation:
$$\mathrm{DMIPS/MHz} = \frac{10^{6}}{M \times 1757}$$

where M is the number of clock cycles needed to execute the Dhrystone program once. For CUSPARC V2, M was determined to be 445 using SW simulations as well as HW testing. Therefore, the DMIPS/MHz rating for CUSPARC V2 is 1.279, compared with 1.025 for CUSPARC V1 [2]. The improvement of DMIPS/MHz for CUSPARC V2 is due to the implementation of the HW multiplier.

B.2 Maximum Frequency of Operation
Post-layout ModelSim simulation of the CUSPARC V2 core was carried out at the different process corners. At the typical-typical corner, the maximum frequency is 188 MHz.

Hardware testing determined that the maximum frequency of the core clock is 188 MHz for the tested IC.

B.3 Core Power Dissipation
Again, the power dissipation metrics for CUSPARC V2 are evaluated using post-layout simulation in parallel with real hardware testing. In both cases, the dynamic power consumption is evaluated for the processor running the Dhrystone V2.1 benchmark.
On the simulation side, the signal activity of the individual nodes is extracted and passed to the Synopsys PrimeTime tool to evaluate the power consumption. On the hardware testing side, a small resistor was inserted in series with the 1.2 V core supply. The processor ran Dhrystone V2.1 in an infinite loop, and the DC voltage across this resistor was measured. This test was repeated for several clock frequencies, and the slope of the power-frequency curve was calculated to yield the mW/MHz rating of the processor. CUSPARC V2 scored 0.406 mW/MHz (including caches, but excluding pads).

C: ADPLL Testing
C.1 ADPLL testing in open loop mode.
In the open loop mode, the DCO is working alone. The phase (delay) detector, loop filter, and feedback divider are bypassed, and the ADPLL reduces to the DCO alone. The user inputs the DCO control word (44 bits: 16 coarse, 12 fine, and 16 very fine bits) via the ADPLL scan chain.

On the simulation side, we carried out detailed analog simulations to characterize the DCO operation. Analog simulations are necessary since the DCV cells cannot be modeled in the digital domain [7]. Fig. 7 shows how the clock frequency of the DCO varies with the coarse and fine control words. Different curves in this figure correspond to different coarse control values. For example, the curve labelled C1 corresponds to a coarse control value of 1. For this case, the DCO output clock varies with the fine control word (12 cases) as shown. The maximum clock frequency of the DCO is ~340 MHz in the tt corner. The clock period step is almost constant, approximately 204 ps for the coarse tuning step and 26.4 ps for the fine tuning step. Similar simulations for the DCO output clock frequency variation with the very fine control word showed that the very fine control step in clock period is approximately constant at 2.6 ps.

On the hardware test side, we tested the ADPLL under the same conditions, as shown in Fig. 7. The ADPLL was configured via its scan chain to work in the open loop mode. The coarse control word was scanned over its 16 settings, and the DCO output frequency was measured via the ADPLL divided output clock pin. These hardware tests revealed, as expected, some jitter in the output DCO frequency. Therefore, for each control word, we have two measured DCO output frequencies: the maximum measured frequency (max_measur) and the minimum measured frequency (min_measur). The hardware DCO output frequency measurement is shown by the max_measur and min_measur traces of Fig. 8. The measured DCO output frequency falls within the ss and ff technology corners.

Fig. 7: DCO Frequency Variation with Fine Cells

C.2: ADPLL testing in closed loop mode.
The ADPLL was simulated and hardware tested successfully under closed loop conditions. The reference clock was set to 24 MHz (the crystal oscillator frequency on the test PCB), and the ADPLL multiplier was varied to vary the processor's core frequency.

D: CUSPARC Core and ADPLL test.
We carried out simulations and hardware testing for the CUSPARC V2 core clocked by the ADPLL in closed loop configuration. The processor functioned correctly and ran our set of benchmark programs successfully. Again, the processor achieved a measured maximum frequency of operation of 188 MHz.
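
The open-loop tuning figures quoted in Section C.1 (period steps of about 204 ps, 26.4 ps and 2.6 ps, and a maximum DCO frequency of about 340 MHz at the tt corner) can be combined into a rough behavioural sketch of the DCO. The sketch below is an illustration only and not the authors' model: it assumes that the period grows linearly with each control index, that index zero on all three words gives the 340 MHz maximum, and that the 16/12/16 control bits correspond to 16/12/16 settings per word. The function names `dco_period_ps` and `settings_for` are hypothetical.

```python
# Rough linear model of the DCO period, built only from the figures in Section C.1.
# ASSUMPTIONS (not from the paper): linear steps, index 0 = fastest setting,
# and N control bits give N settings (N - 1 usable steps) per word.

COARSE_STEP_PS = 204.0      # coarse tuning step in clock period
FINE_STEP_PS = 26.4         # fine tuning step in clock period
VERY_FINE_STEP_PS = 2.6     # very fine tuning step in clock period
COARSE_SETTINGS = 16
FINE_SETTINGS = 12
VERY_FINE_SETTINGS = 16
F_MAX_MHZ = 340.0           # approximate maximum DCO frequency, tt corner

T_MIN_PS = 1e6 / F_MAX_MHZ  # shortest period in picoseconds


def dco_period_ps(coarse, fine, very_fine):
    """Period in ps for the given control indices under the assumed linear model."""
    return (T_MIN_PS
            + coarse * COARSE_STEP_PS
            + fine * FINE_STEP_PS
            + very_fine * VERY_FINE_STEP_PS)


def settings_for(target_mhz):
    """Greedy coarse -> fine -> very fine decomposition of the target period."""
    remaining = 1e6 / target_mhz - T_MIN_PS
    if remaining < 0:
        raise ValueError("target is above the assumed maximum DCO frequency")
    coarse = min(int(remaining // COARSE_STEP_PS), COARSE_SETTINGS - 1)
    remaining -= coarse * COARSE_STEP_PS
    fine = min(int(remaining // FINE_STEP_PS), FINE_SETTINGS - 1)
    remaining -= fine * FINE_STEP_PS
    very_fine = min(round(remaining / VERY_FINE_STEP_PS), VERY_FINE_SETTINGS - 1)
    return coarse, fine, very_fine


# Each finer range should span at least one step of the next coarser control,
# otherwise some periods could not be reached.
fine_covers_coarse = (FINE_SETTINGS - 1) * FINE_STEP_PS >= COARSE_STEP_PS
very_fine_covers_fine = (VERY_FINE_SETTINGS - 1) * VERY_FINE_STEP_PS >= FINE_STEP_PS

if __name__ == "__main__":
    c, f, v = settings_for(188.0)  # 188 MHz is the measured core limit
    print("indices:", c, f, v, "period (ps):", round(dco_period_ps(c, f, v), 1))
    print("ranges overlap:", fine_covers_coarse, very_fine_covers_fine)
```

Under these assumptions the fine range spans more than one coarse step and the very fine range spans more than one fine step, so the three control words leave no gaps in the reachable clock periods.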

Fig. 8: Measured and simulated DCO output frequency with the coarse control word.

E: Comparison of CUSPARC V2 with relevant Processors.
Table 2 compares CUSPARC V2 and CUSPARC V1 to several well-known ARM [7] and MIPS [8] cores. All these processors are implemented using 130 nm CMOS technology. This table shows that CUSPARC offers a good balance between high total DMIPS, high maximum frequency of operation, and high power efficiency (DMIPS/mW). It should be noted that the metric values for CUSPARC as well as for the ARM and MIPS processors are for the cores only, excluding caches.

V: Conclusions
Version V2 of the CUSPARC processor was designed, implemented on Si using the IBM 0.13 µm CMOS 8RF-DM process, and tested. This version features a HW multiplier, an ADPLL, and four JTAG scan chains. This processor achieves a maximum frequency of operation of 188 MHz together with a DMIPS/mW metric of 8.63.

Acknowledgements
The authors acknowledge a MOSIS MEP grant # 4960. Also, the authors acknowledge the help of Mohamed AbdElilah, Bassam Mohy Eldin, AbdElRahman ElMashad, and Nagham Samir in the design of the CUSPARC core test PCB and in testing of the processor itself.

Table 2: CUSPARC V2, V1, ARM and MIPS processors comparison


Feature                 CUSPARC V2  CUSPARC V1  ARM926EJ    ARM968E     ARM7TDMI    MIPS M14K   MIPS M14Kc
CPU ISA                 SPARC       SPARC       ARM/Thumb   ARM/Thumb   ARM/Thumb   MIPS32 R2   MIPS32 R2
Arch. Width             32-bit      32-bit      32-bit      32-bit      32-bit      32-bit      32-bit
Pipeline Depth          4 stage     4 stage     5 stage     5 stage     3 stage     5 stage     5 stage
Core Frequency (MHz)    188         260         238         297         184         100         100
Core Area (mm^2)        2.24        1.38        1.45        0.45        0.35        0.35        0.61
DMIPS/MHz               1.279       1.025       1.1         1.1         0.9         1.5         1.5
DMIPS at max. freq.     240.4       266.5       261.8       326.7       165.6       150         150
Power (mW/MHz)          0.148       0.11        0.36        0.14        0.18        0.12        0.14
Efficiency (DMIPS/mW)   8.64        9.31        3.05        7.85        5.22        12.5        10.7
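
The derived rows of Table 2 can be cross-checked from the primary ones. The sketch below is a reader's check, not the authors' procedure: it assumes that total DMIPS is the DMIPS/MHz rating multiplied by the core frequency, and that efficiency is DMIPS/MHz divided by mW/MHz. It also evaluates the DMIPS/MHz equation of Section B.1 for M = 445. The constant name `VAX_DHRYSTONES_PER_SEC` and the function names are hypothetical; 1757 is the conventional Dhrystones-per-second score of the VAX 11/780 reference machine.

```python
# Cross-check of the derived metrics in Table 2 (illustrative sketch).
# ASSUMED relations: DMIPS = (DMIPS/MHz) * f_max, DMIPS/mW = (DMIPS/MHz) / (mW/MHz).

VAX_DHRYSTONES_PER_SEC = 1757  # divisor used in the DMIPS/MHz equation


def dmips_per_mhz(cycles_per_iteration):
    """DMIPS/MHz = 10^6 / (M * 1757), with M in clock cycles per Dhrystone run."""
    return 1e6 / (cycles_per_iteration * VAX_DHRYSTONES_PER_SEC)


def derived_metrics(dmips_mhz, f_max_mhz, mw_per_mhz):
    """Return (DMIPS at maximum frequency, DMIPS/mW) from the primary table rows."""
    return dmips_mhz * f_max_mhz, dmips_mhz / mw_per_mhz


# (DMIPS/MHz, core frequency in MHz, power in mW/MHz), copied from Table 2.
CORES = {
    "CUSPARC V2": (1.279, 188, 0.148),
    "CUSPARC V1": (1.025, 260, 0.11),
    "ARM926EJ": (1.1, 238, 0.36),
    "ARM968E": (1.1, 297, 0.14),
    "ARM7TDMI": (0.9, 184, 0.18),
    "MIPS M14K": (1.5, 100, 0.12),
    "MIPS M14Kc": (1.5, 100, 0.14),
}

if __name__ == "__main__":
    print("CUSPARC V2 DMIPS/MHz for M = 445:", round(dmips_per_mhz(445), 3))
    for name, (rating, f_max, power) in CORES.items():
        total, efficiency = derived_metrics(rating, f_max, power)
        print(f"{name:11s} {total:7.1f} DMIPS  {efficiency:6.2f} DMIPS/mW")
```

With these assumed relations, the tabulated values are reproduced to within rounding for every row except the ARM7TDMI efficiency, where the rounded inputs give 5.0 rather than the listed 5.22.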

REFERENCES
[1] E. Hussein et al., “CUSPARC IP processor: Design, characterization and applications,” in Microelectronics (ICM), 2010 International Conference on, 2010, pp. 435–438.
[2] A. Suleiman, A. Khedr, and S. Habib, “ASIC implementation of Cairo University SPARC "CUSPARC" embedded processor,” in Microelectronics (ICM), 2010 International Conference on, 2010, pp. 439–442.
[3] The SPARC Architecture Manual, Version 8. Available at http://www.sparc.com/standards/V8.pdf
[4] Muhammad R. Soliman, Hossam A. H. Fahmy and S. E.-D. Habib, "NoC-based Many-Core Processor Using CUSPARC Architecture," 26th International Conference on Microelectronics (ICM 2014), 14–16 December 2014, Doha, Qatar.
[5] Neil H.E. Weste, David Harris, “CMOS VLSI Design: A Circuits and Systems Perspective,” 3rd Edition, Addison Wesley, 2004.
[6] Alhassan Mohamed Fattin Mohamed Zaki Khedr, “Enhanced Performance of Cairo University SPARC Processor at 65nm node”, Cairo University, M.Sc. thesis, Aug. 2011. Available from authors.
[7] P.-L. Chen, C.-C. Chung, and C.-Y. Lee, “A portable digitally controlled oscillator using novel varactors,” IEEE Trans. on Circuits and Systems II: Express Briefs, vol. 52, pp. 233–237, 2005.
[8] Ezzeldin Omar Ahmed Hussein Hamed, “ASIC Design of All Digital PLLs for Processor-Clock Generation”, Cairo University, M.Sc. thesis, Aug. 2012. Available from authors.
[9] "Dhrystone benchmark." Available at http://morloch.hd.free.fr/qdos/download.html
