
Paper – 2 Heterogeneous Multicore Systems: A Future Trend Analysis

ABSTRACT
David Patterson predicted that the memory wall and the power wall would impede the growth of the semiconductor industry, which otherwise largely follows the trend of Moore’s Law. The architectural paradigm shift towards multicore computing helped usher in unprecedented growth in the past decade. The computing paradigm shift towards mobile handheld devices is now emphasizing the use of heterogeneous multicore architectures. The main distinction of these evolving technologies is low power usage. There are well-understood challenges to the design of future computer architectures, solutions to which are just starting to appear. This paper discusses those solutions and predicts the current and future trends in heterogeneous multicore systems.

Keywords
Heterogeneous Multicore Processors, Wide I/O, Intelligent RAM, Hybrid Memory Cube, TOMI, Interconnects, Epiphany.

1. INTRODUCTION

The intelligence and number-crunching capabilities of a common present-day laptop or tablet are beyond the documented capabilities of supercomputers from 2 or 3 decades ago, as shown in Figure 1a. This opens up vistas for the development of new applications on personal devices, which make us, the users of these devices, feel more connected to our environment. Computers have already transcended the physical walls of our homes and moved into our pockets. They are as powerful as, if not more powerful than, the desktop that sat on a table in the corner of the room just a decade or two ago. The future of computing lies in developing newer forms of interaction with it. This raw processing power has ushered in a race in the design and implementation of natural user interfaces, which encompass the seamless integration of our computing infrastructure with our natural physical world. Our future goal is to emulate, up to a practical limit, our natural-world interactions, the richness of the physical world and human intelligence using computers.

The best known and most efficient processor with huge intelligence that does similar work continually is the human brain. Analyzing the working of the human brain, one fact we can state without much contention is that it is largely parallel in its operation. The processor development race until the last decade was predominantly focused on increasing the working speed of the central processor. But this trend in computer architecture was hit by the predictions of Dr. Patterson, namely the memory wall and the power wall. The power wall is the observation that the energy spent switching logic states inside the processor is dissipated as heat, which hampers performance and thus puts an upper limit on switching speed. The answer to this was the historic shift from scaling up operating frequencies towards increasing the number of processors working in parallel, i.e. multicore processors.

The caveat to this is the memory wall, which is the reason why a system with n computing cores working in parallel is not n times as fast as a single-core system. The memory wall is the limitation of memory access posed by multicore computing, which uses shared coherent memory for message passing among the multiple cores. In this paper I discuss the ways to move forward and some of the present technologies to climb these walls.

With the shift in focus of the major computing platform towards mobile platforms of varying sizes, newer kinds of applications are being developed. The scope and reach of a mobile device put more emphasis on applications that demand power-aware, smaller and specialized cores in chip multiprocessors. This ties in directly with the earlier analysis of the human brain. Another takeaway from that analysis is that the processing of everything we humans hear, see, taste, smell, think or do, number crunching included, is done by various specialized locations in the brain, which vary in their size, properties and functioning. Taking a cue from this, we can argue that the processors of future devices, which have to support natural user interfaces, need to have non-uniform types of specialized data processors for the individual kinds of sensing the system needs to support.

Thus, the future of computer architecture lies not only in multicore processors, but in the heterogeneous variety of them. The International Technology Roadmap envisions two kinds of semiconductor devices for future computers: devices following the “More-Moore” trend and devices following the “More-than-Moore” trend. The former follow the trend predicted by Moore’s Law, i.e. they depend on advancements in semiconductor feature technology, whereas the latter are used to interface with the analog world. Section 2 describes the problem statement of the memory and power walls, and section 3 provides some of the ideal and cutting-edge solutions to the same. In section 4, I delve deeper into the trends in computer architecture and present my views on the number of cores in future SoCs, and I present a conclusion to all of these in section 5.
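The memory-wall remark in this section (n cores are not n times as fast as one core) can be illustrated with a toy model. The sketch below is not taken from the paper: it uses the textbook Amdahl-style bound and assumes that a fixed fraction of single-core execution time is spent in shared-memory accesses that serialize across cores. The function name `speedup` and the parameter `memory_fraction` are hypothetical, chosen only for illustration.

```python
def speedup(cores: int, memory_fraction: float) -> float:
    """Amdahl-style speedup bound (illustrative assumption, not the paper's model).

    memory_fraction is the share of single-core run time spent in shared-memory
    accesses that cannot be overlapped across cores; the rest scales with cores.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be within [0, 1]")
    serial = memory_fraction
    parallel = (1.0 - memory_fraction) / cores
    return 1.0 / (serial + parallel)


if __name__ == "__main__":
    # With 20% of the time bound by shared memory, extra cores give diminishing returns.
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} cores -> speedup {speedup(n, 0.2):.2f}x")
```

Under this assumption the speedup can never exceed 1 / memory_fraction however many cores are added, which is one way to read the argument for moving memory closer to the processor rather than only adding cores.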

2. PROBLEM STATEMENT

Supercomputers of today are the video games of tomorrow. An analysis of the problems of the former can give valuable insight into the problems of future computers of all sizes. Figure 1a shows a comparison of the computing power of a 1997 Sandia Labs supercomputer and a 2006 Sony PlayStation, which clearly shows the disparity.

Figure 1a – Comparison of Supercomputer (1997) and Video Game (2006). (Picture credit )

Acquiring and transferring detailed information from sensors for the intuitive human-computer interfaces of the future, and reducing their response times to match and then surpass human stimuli, is the single most critical requirement for the computer systems of tomorrow. This requires fast instruction execution and faster sensory systems, which inevitably means advancements in both More-Moore and More-than-Moore devices as described earlier. For More-Moore devices, which are predominantly multicore systems, the main concerns are getting around the memory wall and the power wall. The execution time of multicore systems is heavily dominated by the memory access time of the multiple CPUs. The obvious ways to reduce this access time are: (i) increase the speed of memory transfer, (ii) increase the memory transfer size, (iii) bring memory close to the CPU.

The problem with increasing memory transfer speed is that it runs straight into the power wall, i.e. increasing the transfer speed consumes more power. Other techniques which do not work by increasing frequency, like the Rambus technology adopted by Intel, were developed and proved to be effective, but the technology was highly expensive and did not do well commercially. Memory transfer size has increased over the years, from fetching a single instruction at a time to fetching multiple instructions. But wide memory buses running over long traces consume more power, because a wide bus means more memory I/O pins. Thus, unless the CPU and memory are close, this would not work for portable systems.

Moving the data close to the CPU would bring a huge reduction in access time. One way of doing it is by increasing the size of the cache memories, which are the fastest type of memory cells. But since caches are placed on the microprocessor die and amount to almost 50 percent of the die real estate in present-day chips, they make chips highly expensive, thus putting a commercial limit on their size. A comparison of the cost per billion transistors of microprocessor transistors and DRAM transistors in Table 2a shows why caches are so expensive.

Table 2a – Comparison of cost per billion transistors used for making microprocessors vs memory. (Data source – )

Cost per Billion Transistors     Cost
Microprocessor Transistors       $387
Memory Transistors               $0.67

But all possibilities of overcoming the memory wall suggest that we need to blur the distinction between processor and memory as much as possible, which is virtually impossible with present technology. How many IP cores will be present in the CPUs of the near future, what will their types be, and will the number even be well defined or depend on the applications the future devices support? These are the basic questions that come in the path of predicting the future trend of chip multiprocessors.

3. PROPOSED SOLUTIONS

The previous section describes the grueling problems that researchers in numerous architecture groups around the globe are facing. Here I present a few potential solutions for overcoming the memory wall and the power wall.

3.1 Intel and Micron Hybrid Memory Cube

Intel, in collaboration with DRAM technology giant Micron, has developed a highly efficient, fast and most advanced DRAM technology, named the Hybrid Memory Cube, called HMC hereafter. The process technologies for fabricating processors and memory are inherently different. DRAM has the ability to pack massive memory capacity within a small silicon die at an amazingly low price, but this inherently brings limitations on logic density and I/O performance, which are the strong points of the manufacturing technology for CPUs. HMC is precisely a 3D stacking of a multilayer DRAM stack over an I/O buffer based on CPU process technology. HMC uses a new technology, Through-Silicon-Via, called TSV hereafter, to stack the DRAM array onto the logic die. TSV technology supports dies manufactured in different process technologies. TSV allows a high number of parallel interconnects between the memory and the processor die, thereby making memory access much faster than the planar technology currently employed to interface memories with processors. This technology also helps the layout of the DRAM by locating I/O channels nearer the memory array, implying a higher bank count per die. This also frees the processor of memory-controller-related burdens.

Intel has demonstrated an I/O prototype that achieved an energy efficiency of 1.4 mWatt/Gbit/sec and is optimized for hybrid 3D stacking. The HMC has demonstrated a sustained transfer rate of 1 Tbit/sec. To put this in perspective, it has 10 times more bandwidth and is 7 times more energy efficient than the most advanced present-day DDR3 module. The performance comparison in Figure 3a shows the critical lead of HMC over other DRAM technologies such as DDR3 and DDR4 (extrapolated).

Figure 3a – Comparison b/w DDR and HMC performance (Picture credit ).

However, this being a new technology, the cost of development of such systems would be on the higher side of the pricing spectrum, and Intel is targeting server and data-center type applications at the beginning. The long-term reliability of TSV is also an unverified parameter. But as mentioned earlier, supercomputers of today would be video games of tomorrow, so this technology is worth keeping an eye on.

3.2 Samsung Wide I/O Memory

Samsung’s Wide I/O is a DDR technology for mobile devices which holds the promise of highly power-efficient data transfers between the processor and memory. Mobile devices are already the most popular computing devices, and data from Samsung Mobile units shows a trend of roughly 56% YoY growth of Samsung smartphones. The trend is similar across all vendors.

Figure 3b – Comparison between LPDDR2 and Wide I/O (Picture Source )

The conventional method of reducing the power budget of mobile memory was to reduce the operating voltage of the DDR, e.g. DDR2 at 1.8V, DDR3 at 1.5V. But with semiconductor feature sizes reducing to sub-10nm, operating voltages would practically cease to reduce because they would approach Near Threshold Voltage operation, also an advanced technology, but one whose economics are still complex and under consideration. Wide I/O solves some of these sub-10nm power criticalities.

Wide parallel interfaces have been written off by the electronics industry due to their power-hungry nature. Also, they are expensive due to the wide buses and complex Printed Circuit Boards (PCBs). But contrary to this semiconductor industry trend, Samsung went in the opposite direction of making a wider I/O bus and argued that it gives faster and more energy-efficient data transfer, and they have the numbers to prove it. Wide I/O is a 512-bit wide data bus, compared to conventional 32-bit wide data buses. Counting the power and control pins, the Wide I/O bus can have up to 1200 pins. The new Wide I/O mobile 1Gb DDR has a reported data rate of 12.8 GBytes/sec, which is a bandwidth almost 4 times that of LPDDR2 of the same size. The power consumption is reported to be reduced by 87% in the same comparison. A comparison between conventional 3D PoP stacking of DRAM and TSV Wide I/O memory shows a 4-fold reduction in I/O power consumption, as shown in Table 3c and Figure 3b.

Table 3c – I/O Power Comparison (Source )

                                   I/O Power
Conventional 3D PoP for LPDDR2     176mW
TSV SiP for Samsung Wide I/O       44mW

So, Wide I/O, as compared to conventional serial I/O, gives higher bandwidth at lower power, and thus higher battery life for mobile devices.

3.3 Venray TOMI™ Aurora Chip

Venray Technologies (www.venraytechnology.com) is a promising startup which has finally succeeded in pushing existing technology to carve memory and CPU on a single die. This is very similar in concept to Patterson’s Intelligent-RAM project. The microprocessor and memory transistor fabrication technologies are different. Table 3a shows the comparison between memory transistor and processor transistor properties.

Table 3a – Property comparison between Microprocessor and Memory Transistor (Source ).

                 Microprocessor Transistors     Memory Transistors
Speed            1x                             0.8x
Leakage Power    1x                             0.00001x
Cost             1x                             0.01x

Functionally both are the same, barring the speed difference. But as we can see, memory transistors are better in terms of cost and leakage power, which means processors can reasonably be fabricated using memory transistors. There were mainly 2 challenges to TOMI’s microprocessor creation:
1. Memory-transistor-specific wiring libraries are not available from any standard manufacturer. Such libraries are created to reduce design time. TOMI engineers created a small library of 60 digital cells and 14 analog cells to make a test processor at first.
2. Due to identical memory cells, wiring is minimal in memories. To put this in perspective, the number of wiring layers is up to 3-4 in memories, whereas for processors there are millions of wires which need to be spread over 10-12 layers. TOMI designers created a newer and simpler architecture that fits everything in those 3 layers with fewer transistors, only 22000 per core.

TOMI is manufactured by packing CPU cores between memory banks. Thus, the microprocessors and the memory arrays are fabricated together. This is the closest that memory can get to the processor, and thus the most enhanced performance that current-technology processors can get. This is a brand new approach towards co-design of memory cells and processor cells on the same die. TOMI is the first multi-core milliwatt microprocessor. TOMI Aurora is a 20mW/CPU at 500MHz, quad-core 32-bit CPU + 64M DRAM, with 4096-bit wide internal memory buses, packed in an area of 5.6 x 6.7 sq. mm. The cost of this quad-core chip is astonishingly low: $0.99.

There are various implications of TOMI™ technology, as mentioned here:
1. 5x-20x power reduction as compared to ARM or Intel Atom architecture CPUs, enabling previously impossible low-power product designs such as HUD computers, implant computers, medical applications etc.
2. 5x-10x cost reduction as compared to similar ARM architecture CPUs, thus cheaper ubiquitous computing systems.
3. Lighter and smaller system designs.
4. Cost-effective solution to massively parallel problems like DNA sequencing and MapReduce, making highly complex data processing possible at cheap prices.
5. Highly enhanced performance at very small scale.

SoC design towards realizing natural user interfaces would be made possible with such architectures. But as of now, these are limited to working with huge unstructured data in cloud computing. Thus TOMI technology can be regarded as a breakthrough enabler for future computing hardware. This is surely the hardware architecture that will help towards the development of tomorrow’s natural user interfaces, which would require massive data handling and processing capabilities in ever decreasing form factors.

4. DISCUSSION ON SOC TRENDS

As seen in the previous section, we are at a technological juncture where we are capable of overcoming both the power wall and the memory wall, and can thus think of SoC designs with higher numbers of IP cores. Let us foray into the SoC trends of future computing infrastructure.

4.1 Number and Types of Unique IP Cores

For mobile systems, SoCs are primarily data-flow problems, i.e. data from various sources has to be processed at any given time, most of which is real-time data such as audio, video, camera etc. for a smartphone. For accessing and processing these large amounts of data, the chip needs to access the memory frequently and repeatedly, which is the real bottleneck for most systems. But with TOMI technology, memory has come inside the SoC, which would not only mitigate this problem but would also help in reducing the power consumption. Also, due to hard real-time constraints on most of the data processing on mobile platforms, memory access times are critical. Often data from all these interfaces, which obviously do not operate at a similar pace, would need to be operated on simultaneously. This, together with design optimizations with respect to area and power, results in the use of heterogeneous multiprocessors, due to their best fit for the inherently varying workload that mobile systems pose. Heterogeneous CMPs are best suited for systems where multithreading is used extensively. By definition, an SoC is a heterogeneous system and thus best suited for use in mobile devices. As noted earlier, mobile applications are best suited for heterogeneous cores. However, all present-day SoCs have homogeneous CPUs.

Looking at the trend chart in Figure 4a, compiled by Semico Research (www.semico.com), of the number of differentiating IP cores on a single chip, we can see that with advancements in technology nodes the number of unique IP cores increases rapidly. It shows that in 2012 there are 75-80 unique IP cores per SoC, and it shows a YoY growth of 18.7% in the number of different IP cores on a single chip.

Figure 4a – Trend for SoC IP Core (Source )

Extrapolating this chart to 5 years from now, the technology node will move to below 20nm. From the trend we can easily say that the number of unique IP cores on an SoC will be close to 140-160. These would include the IP cores for CPU, GPU, DSP for computational purposes, hardware accelerators, memory interfaces, various input/output protocols to communicate with the world, ADC and DAC cores, RF cores, similar sets of cores for wifi and networks, NFC, camera, audio, video, image processing, 3D graphics, VOIP, network. Another important subsystem is the security subsystem, such as SoC firewall, content protection etc., which will be present on future SoCs and add to the IP core count. We have to keep in mind that these are unique IP cores and there can be cores that are used multiple times on a die. With increasing pixel density for displays such as the New-iPad Retina Display and the invention of newer display technologies like Mirasol Displays (www.mirasoldisplays.com) from Qualcomm, the number of GPUs can very well go to 6-8 per SoC (Quad-GPU-core A5X for New-iPad). Also, the support for communications processors for 3G/4G on the same SoC would increase the number of IP cores hugely. With the future natural user interfaces, the numbers and types of input/output IP cores would definitely increase to a huge number.

Right now, the maximum number of computational cores on any SoC across the industry is 4+1 CPU and 1 GPU on Nvidia’s Tegra3 processor. TOMI™ Aurora also has 4 CPUs. Whereas for CPU performance, as benchmarked by Anandtech (www.anandtech.com), the Qualcomm Snapdragon S4 dual-core CPU running at 1.5GHz beats the Nvidia Tegra3 quad-core CPU running at 1.3GHz in single-threaded as well as multi-threaded applications. The website notes that, at present, due to the offloading of all video functionalities to GPUs, there are very few applications that need more than 2 cores to run efficiently. It also notes that having more cores is better, but having 2 faster cores is better than having 4 slower cores. Though the software and OS at present are not optimized, the only mode of system improvement is to increase the thread count, in which case more cores would be required. Also, with the invention of TOMI technologies, tighter integration of memory and processor would indeed help to overcome the memory limitation, and thus in future we would definitely see more CPU cores on a die. As per the trends in mobile processors, I can predict up to 10-14 CPU cores on a single SoC in the next 5 years. The reason for this would be a considerable increase in thread performance and further development of advanced OS and software techniques.

However, these trends are for chips that support today’s hardware technologies. There are a few rebellious architectures which go beyond these predictions. These are being patented and are about to be commercialized in the near future. They are discussed in the next section.
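The extrapolation in this section can be checked with a short compound-growth calculation. The sketch below assumes that the quoted 18.7% YoY growth stays constant over the 5 years, which is my simplifying assumption and not something the chart source confirms; the function names `extrapolate` and `implied_growth` are hypothetical.

```python
def extrapolate(count: float, yoy_growth: float, years: int) -> float:
    """Compound `count` forward by a constant year-over-year growth rate."""
    return count * (1.0 + yoy_growth) ** years


def implied_growth(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    years = 5
    # 2012 range and growth rate quoted in the text.
    for cores_2012 in (75, 80):
        projected = extrapolate(cores_2012, 0.187, years)
        print(f"{cores_2012} cores at 18.7% YoY for {years} years -> {projected:.0f}")
    # Growth rate implied by the 140-160 estimate read off the chart.
    for start, end in ((75, 140), (80, 160)):
        rate = implied_growth(start, end, years)
        print(f"{start} -> {end} in {years} years implies {rate * 100:.1f}% YoY")
```

Under the constant-rate assumption the 2012 range of 75-80 grows to roughly 177-189 unique IP cores, somewhat above the 140-160 read off the chart; the 140-160 estimate corresponds to about 13-15% compounded yearly growth, so the chart-based figure implicitly assumes the growth rate tapers off.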

4.2 A Few Different Approaches

A new approach is taken by Adapteva Inc (www.adapteva.com), which is a startup making smartphone processors. The main innovation from Adapteva is a patented low-power network-on-a-chip architecture that sustains 25 Gbytes per second of local memory bandwidth and 6.4 Gbytes per second of network bandwidth per processor. Epiphany-IV, using 28nm technology and 64 independent RISC cores, each with 32Kbytes of memory, accommodated on an area of 8.2 sq. mm, consumes 25mW per core, gives the most efficient floating point processor at 70 GigaFlops/Watt and can scale up to 1000’s of cores. As per per-core CoreMark benchmarking, Adapteva’s Epiphany-IV chips prove to be four times better in performance when compared to Intel Xeon and consume less than 2 watts of peak power, as per the report. It is designed for use as an accelerator of DSP-related tasks such as speech recognition, machine vision, image processing, medical diagnostics, SDR etc.

Another different approach to SoC development is using reconfigurable architectures. One way to look at conventional heterogeneous computer architectures is that one IP core does the same thing all the time and, if not used, it is powered down. Why not build a reconfigurable architecture on a gate array and let the improvements in technology increase the size and speed of the reconfigurable gate array? A patent has been awarded to RMT Inc. for using such a computer architecture. Here an FPGA is the controller of the computer, and a CPU and other hardware are its peripherals. This device can be reprogrammed as a re-definable circuit board, a CPU, any peripheral device, memory device etc. without any hardware support. This means the IP cores are not hardcoded at all, and the flexibility is with the system designer to program one part of the array as whichever core they wish it to be. This brings the new perspective of an on-demand reconfigurable parallel processing facility which is not available to any other architecture at present.

5. CONCLUSION

In this paper we see that we are presently standing at a position in the computer-architecture timeline where we do have some promising technologies which present innovative solutions to age-old problems related to high-performance computing. This will definitely help redefine computer architecture. Still, they are so new that not much can be said with concreteness about whether all or some of these technologies will prevail in the highly competitive market of today. Only time will tell.

6. REFERENCES