http://www.tomshardware.com/reviews/fusion-hsa-opencl-history,3262.html 12:00 AM - August 14, 2012 by William Van Winkle Source: Tom's Hardware US
Table of contents • 1 - The Story Of Fusion Begins • 2 - Looking For The Other Half • 3 - Merger And Mayhem • 4 - Scaling The Brick Wall • 5 - Up From The Ashes • 6 - Fusion Ignites • 7 - Heterogeneous Roots • 8 - OpenCL And HSA • 9 - Focus On The Programmer • 10 - HSA's Big Picture • 11 - More About The Big Picture • 12 - HSA Tomorrow
The Story Of Fusion Begins “Nothing is more difficult than the art of maneuver. What is difficult about maneuver is to make the devious route the most direct and to turn misfortune to advantage.” —Sun Tzu, The Art of War When I interviewed Dave Orton, then the president of ATI Technologies, in 2002, one of the first things he told me was, "It’s always what’s possible in the business that keeps people going." No more prophetic words could describe the coming merger between his company and CPU manufacturer AMD. The big question, of course, isn’t what is possible or even whether the possible will become reality. The real question is whether the possible will become reality soon enough.
Orton spent most of the '90s with Silicon Graphics and, in 1999, when almost anything in technology seemed possible, he left SGI to join a little core logic startup called ArtX. The little company won the development contract for Nintendo’s GameCube, which went on to sell a few units (somewhere north of 20 million). That fall, ArtX showed its first integrated chipset at Comdex, and immediately the company flashed on the industry’s radar as a prime acquisition target.

Ultimately, ATI was able to put ArtX in its pocket and made Orton its president and COO. Then the tech bubble burst, driver problems abounded, schedules slipped, and, for a while, it seemed that ATI could do nothing right. Part of the road back to glory hinged on Orton figuring out how to complete the meshing of these two development teams. He was the one who figured out how to get ATI on a 12-month cycle for new architectures and six- to nine-month cycles for iterative design revisions. Product teams were given more control and responsibility. And slowly, over 18 months, perhaps, with Nvidia kicking it in the ribs at every turn, ATI managed to get back on its feet. The company rediscovered how to execute.

"Just step back and understand your roots," said Orton. "Constantly build. You can never be satisfied with where you are. You’ve got to be satisfied with where you can be and then drive to that." Back on top of its game, Orton knew it was time to keep driving—but to where? I detected no glimmer of the future in our 2002 discussion. ATI continued to excel at integrating graphics into northbridge chips, and Intel, which still viewed integrated graphics as only needing to be good enough for business apps, was still more of a partner than a competitor. However, in a keenly prescient moment, Orton told me, "I guess if I could change one thing about computing, I’d like it to be more open to create a broader range of innovation. I recognize the advantages of standards. Standards provide opportunity."
At two different points in our conversation, Orton lamented his daily Silicon Valley commute, even saying that if he could invent anything, no matter how fantastic, it would be a Star Trek-esque transporter. So perhaps we can take him at his word when, in 2007, he left his post as executive vice president of AMD in order to spend more time with his family. But this is jumping ahead. First, Orton’s drive from Toronto was about to take a hard southern turn, straight down to Texas.
Looking For The Other Half Right around the time I was speaking with Dave Orton, AMD founder Jerry Sanders was stepping down to begin his well-earned retirement. Since just after its founding in 1969, Sanders had led AMD through over three decades of highs and lows, cementing its place as the only major rival to Intel in the global CPU market. Sanders hired Hector Ruiz away from Motorola Semiconductor in 2000 to become his right-hand man (as president and COO) and next in line for the CEO's chair. Two years later, Sanders took his final bow and Ruiz seized AMD's steering wheel.
Meanwhile, Dirk Meyer, a former processor engineer for DEC and Intel, was rising through AMD's ranks. Meyer led the team in 1998 and '99 that produced the Athlon (K7), a design so successful (and based on a bus derived from DEC) that AMD crushed the then-leading Pentium III and beat Intel to the 1 GHz threshold.
In 2003, AMD followed up its Athlon success with the K8 (Hammer) architecture, which almost immediately trounced everything Intel had going in the server market. Intel’s NetBurst, far from realizing its 10 GHz aspirations, turned out to be a blistering disappointment as Intel slowly began to realize that efficiency trumps brute force. Intel didn’t have a follow-up ready until 2006, when it released its Core architecture, which in turn presaged the groundbreaking Nehalem design of 2008.

We all know that the Fates are fickle in technology. Everything ebbs and flows. If Athlon lit the fuse under AMD, Hammer took the company into orbit. By 2005, Ruiz sensed it was time to try to outflank his bigger competitor. That December, Ruiz elevated Dirk Meyer to chief operating officer of AMD’s microprocessor business, effectively making Meyer number two in the company. By that time, the two of them understood that having graphics on the northbridge was good...but not good enough.

According to Forbes, AMD approached Nvidia about merging. As we go forward in this story, consider the element of personalities and corporate cultures, and ask yourself if AMD could have survived such a mix. The answer is likely found in Nvidia’s response: CEO Jen-Hsun Huang was willing to entertain the idea, provided that he would be the chief executive of the resulting organization. Ruiz, at that point feeling on top of the microprocessor world, understandably went looking elsewhere.
In July of 2006, AMD announced that it would buy ATI for the princely, almost hyperbolic, sum of $5.4 billion at a time when AMD was worth somewhere shy of $9 billion. According to Joe Macri, who was then ATI’s director of engineering and another Silicon Graphics alumnus, it was a "grand vision" spawned from Dave Orton and Dirk Meyer. "There was a lot of risk on AMD’s part," says Macri, now AMD’s product chief technology officer. "But there was a lot of courage between Dirk and Dave. They could see a future of the need to converge the CPU and GPU in a way that would allow it to be treated as a unified compute model. That initial vision sounded simple. It started with a big business deal that was quite the effort to pull off. And then they brought us together as leaders and said, ‘Do it!’" "Quite the effort" is an understatement of epic proportions.
Merger And Mayhem In the summer of 2006, most people didn’t grasp the strategy of what would eventually be called Fusion, the melding of CPU and GPU on a common die. Like most, Ars Technica at the time viewed the merger as a way for AMD to bolster its portfolio breadth, freeing it from reliance on third parties for chipsets and expanding the company into areas such as ultramobile graphics and digital TV.
For their part, AMD and ATI remained thoroughly mute. While some of the silence may have been mandated, owing to the lengthy legal process of two large companies melding, a more practical reason might have been the safeguarding of existing sales. "At any point in time, you’re married to a lot of partners," says AMD’s Macri. "I use the word married because they’re very deep relationships, both in business and, at the end of the day, a personal level. Business is about personal relationships. We make commitments to each other. We might embody them in contracts, but part of that is we’re making a big personal commitment. And you are only as good as your word. If you throw the grand vision out there without the time that it takes to move all your partners to the vision, you lose all your partners. AMD at the time had Nvidia as a very strong chipset and graphics partner. You can’t flip those relationships on and off like a switch. So the guys were somewhat limited in their ability to explain to the world this grand vision and how it would all play out."

In early 2006, AMD’s stock price hovered just above $40 per share. One year later, at a time when the market was nearing its pre-recession peak, AMD had tumbled to under $15. Two years later, it was bouncing on a $2 floor. A five-year comparison between AMD and Intel shows the story from pre- to post-recession. While Intel looks relatively flat, the rise and fall of AMD is as exhilarating as it is heartbreaking.
Economic downturn aside, what happened? Heading into late 2006, AMD entered into the first of what would become seven consecutive quarterly losses preceding Hector Ruiz’s resignation. Intel’s Core architecture was out and ramping. Nvidia’s GeForce 7-series, launched in June 2005 to considerable fanfare, gave way to the even better 8-series in November 2006. Meanwhile, the delay-plagued ATI Radeon X1000 series arrived in 2005; there was no major 2006 update. The follow-up Radeon HD 2000 didn’t launch until April 2007—in Tunisia, of all places—and even though AMD/ATI’s performance was starting to edge back up, its momentum in the market had significantly slipped. And those were just the visible problems. Behind the scenes, in the back rooms where the two companies were trying to figure out how to coexist and blend, matters were even more muddled.
Scaling The Brick Wall The AMD core team found itself with two fundamental problems, one technical and the other philosophical, and both had to be solved before anything could move forward. "On a pure technology and transistor side, we had a conundrum on our hands," says AMD’s Macri. "What makes CPUs go really fast ends up burning a lot of power on a GPU. What makes GPUs go really fast will actually slow a CPU down incredibly. So the first thing we ran into was just getting them to live on the die together. We had the high-speed transistor combined with the very low-resistance metal stack that’s optimal for CPUs versus the GPU’s more moderate-speed transistor optimized around very dense metallization. If you look at the GPU’s metal stack, it looks like the letter T. In a CPU, it looks like the letter Z. One’s low-resistance and lower density; the other’s denser, and so higher-resistance. We knew we had to get these guys to live on the same die where they both perform very well, because no one’s going to give us any accolades if the CPU drops off or the GPU power goes up or performance falls. We needed to do both well. We very quickly discovered that wall."
Imagine the pressure on that team. With billions of dollars and the company’s future at stake, the group eventually realized that a hybrid solution couldn’t exist on the current 45 nm process. Ultimately, 45 nm was too optimized for CPU. Understanding that, the question then became how to tune 32 nm silicon-on-insulator (SOI) so that it would effectively play both sides of the fence. Of course, 32 nm didn’t exist outside of the lab yet, and much of what finally defined the 32 nm node for AMD grew from the Fusion pursuit. Unfortunately, until the 32 nm challenge was solved, Fusion was at a standstill—and it took a year of work to reach that solution. Only then could design work begin.
Meanwhile, the Fusion team was also fighting a philosophical battle. The transistor and process struggle was massive, but at least the team knew where it needed to go and what the finish line looked like. Even with the transistor challenge figured out, the question still remained of how to best architect an APU.

"One view was like, the GPU should be mostly used for visualization. We should really keep the compute on the CPU side," says Macri. "A second view said, no, we’ve gotta split the views across the two halves. We’ve got this beautiful compute engine on the GPU side; we need to take advantage of it. There were two camps. One said things should be more tightly coupled between the CPU and GPU. Another camp said things should be more loosely coupled. So we had to have this philosophical debate of deciding what we should treat as a compute engine. Through a lot of modeling, we proved that there was an enormous advantage to a vector engine when you have inherent parallelism in your code."

This might have seemed obvious from ATI’s prior work with Stream, but the question was how much work to throw at the GPU. Despite being highly parallel, GPUs remain optimized for visualization. They can process traditional parallel compute tasks, but this introduces more overhead. With more overhead comes more impact on visualization. With infinite available transistors on the die, one could just keep throwing resources at the problem. But, of course, there are only a few hundred million transistors to go around.

"Think of all the applications of the world as a bathtub," says Macri. "If you look at the left edge of the bathtub, we call those applications the least parallel, the ones with the least amount of inherent parallelism. A good example of that would be pointer chasing, right? You need a reference. You need to go grab that memory to figure out the next memory you gotta go grab. No parallelism there at all.
The only way to parallelize is to start guessing–prediction. Then, if you go to the right edge of the bathtub, matrix multiply is a great example of a super-parallel piece of code. Everything is disambiguated very nicely, read and write stream is all separate, it’s just beautiful. You can parallelize that out the wazoo. For those applications, it’s very low overhead to go and map that into a GPU. To do the left side well, though, means building a low-latency memory system, and that would load all kinds of problems into a GPU that really wants a high-bandwidth, throughput-optimized memory system. So we said, 'How do we shrink the edges of the bathtub?' Because, the closer we could bring those edges, the more programs we could address in a very efficient way."

A big part of the philosophical debate boiled down to how much to shrink those bathtub edges while preserving all of AMD’s existing visualization performance. Naturally, though, while all of this debate was happening, AMD was getting hammered in the market.
Up From The Ashes No one on the outside could see the engineers frantically fighting for answers and a path forward. What they saw were stony-faced executives and delays. Lots and lots of delays in multiple segments from graphics to CPUs to chipsets. In July, ten months after the merger, AMD executive vice president Dave Orton, arguably one of the most influential minds behind today’s hybrid processor trend, resigned. In September, chief sales and marketing officer Henri Richard followed Orton out the door. Ultimately, the $5.4 billion decision to buy ATI rested on Hector Ruiz’s shoulders, and onlookers couldn’t help but associate this purchase with the catastrophic plummet in AMD’s financials. Sources at TheStreet.com indicated that Ruiz might also be resigning, although his contract ran until April 26, 2008. As it turned out, Ruiz survived another three months, finally leaving in July.
Ruiz went on to become the first CEO of GlobalFoundries in March 2009. More significantly, GlobalFoundries started out as AMD’s spun-off manufacturing arm. With its various setbacks, AMD could no longer afford to maintain so much manufacturing capacity all for itself. This marks the latest and hopefully last major divestiture of the company’s assets, capping 2008’s jettisoning of the digital TV business and (this one really had to hurt) early 2009’s $65 million sale to Qualcomm of the old ATI handset division and all of the mobile graphics and multimedia intellectual property that went with it. By that point, AMD had already written off $3.2 billion in bottom-line value.

Of course, as with most major downturns in life, so long as you don’t stop moving, you’re not dead, and the odds are that things will improve. Having shed much of its former self, AMD is now left as a predominantly R&D- and IP-based company. It’s smaller, lighter, and more flexible. But is that enough? AMD’s board didn’t seem to think so. In January 2011, the last of the old directorate, Dirk Meyer, was pushed aside, according to the press release, to "accelerate the company’s ability" to "have significant growth, establish market leadership and generate superior financial returns."
Some argued that this was undeserved for a guy who apparently brought AMD back from the brink of ruin and saw the first fruits of his Fusion "grand vision" finally start to reach the market. But others questioned at the time of Meyer’s ascendance to the top post whether he had the sales chops to make the big deals that AMD so desperately needed. With that in mind, one might examine his replacement, Rory Read. Read spent 23 years at IBM, where he held a "broad range of management positions." Following IBM’s PC division to its new owner, Read moved to Lenovo in 2006 and eventually became its president and COO in 2009. During his time there, Lenovo became the third-largest PC vendor in the world. Read was appointed president and CEO of AMD on August 25, 2011. Not even a year into the job, it’s still too early to render a verdict on Read. But if the AMD board was hoping for a guy who would yield big deals and new directions, it appears they got exactly what they wanted.
Fusion Ignites Throughout this shuffling of top office name plates, AMD engineers continued their dogged pursuit of Fusion. What began as a team of four people—former ATI vet Joe Macri, the recently deceased AMD fellow Chuck Moore, then-graphics CTO Eric Demers (now at Qualcomm), and AMD fellow Phil Rogers, who was the group’s technical lead—had grown to envelop the top three layers of engineers from both the CPU and GPU sides of the company. Macri describes the early phase of their collaboration as "the funnest five months I’ve ever had." The first 90% of the Fusion effort was an executive engineer’s dream. "The last 10% was excruciating pain in some ways."
"That effort resulted in a couple of things," adds Macri. "One, we ended up with the best architecture out there that’s unifying scalar and vector compute. It blows away what [Intel] did with Larrabee. The Nvidia guys have only attacked part of the problem, because they only have the IP portfolio to attack part of the problem. What they’ve done isn’t bad. It’s actually good for having one hand tied behind their back. But with [Fusion], we had the full IP capability, and it truly is the first unified architecture, top to bottom." Technical architecture aside, AMD developed something else: a cohesive, merged company. Out of the pressure and pain of Fusion development emerged a different company than either of the two that had gone into it. The old days of talking about "red" and "green" teams were finally gone. "We were similar in that we were both in a major fist fight with one guy," says Macri. "I think ATI had the fairer fight in that we were up against a similarly-sized company [Nvidia]. But this had a lot of impact on design and implementation cycles. Now, the guys at AMD had won a number of times, but it was more like David and Goliath [Intel]. It was like, 'Wow, we actually beat Goliath!' With ATI, we’d been in a fist fight for many years with Nvidia, and we won as many as we lost. So we had a different attitude about winning. ATI needed to learn that there were some Goliaths out there, and you have to be pretty damned smart to beat a Goliath. AMD learned that it actually was a Goliath in certain cases. It could be an equal. Now merge that with some faster time to market strategies. Today, our product cycle time is faster than ever across the board. So the melding gave both sides a better ability to attack not just their traditional competitors but also new competitors coming up. And those new guys coming up aren’t big. They’re all kind of small. They’re all AMD-sized. 
I don’t think AMD ever would have had the right attitude on how to beat someone their own size without inheriting ATI. And I don’t think ATI could have figured out how to beat someone several times bigger without AMD’s attitude of asking how you aim where the other guy’s not aiming."
Just as the two organizations were completing their cultural fusion, the Fusion effort itself was nearing the end of its first stage. AMD showed its first Fusion APUs to the world at CES in early 2011, and product started shipping shortly thereafter. In the consumer space, the Llano platforms, based on the 32 nm K10 core, arrived in the A4, A6, A8, and E2 APU series. Another announcement from 2011 CES was that the Fusion System Architecture would henceforward be known as the Heterogeneous System Architecture (HSA). According to AMD, the company wanted to turn HSA into an open industry standard, and a name that didn’t reflect a long-standing AMD-centric effort would help illustrate that fact. This would prove to be the first hint of AMD’s even larger aspirations.
Heterogeneous Roots In the end, did Fusion matter? Quite simply, it changed the direction of modern mainstream computing. All parties agree that discrete graphics will remain firmly entrenched at the high-end. But according to IDC, by the end of 2011, nearly three out of every four PC processors sold were integrated, hybrid processors—APUs, as AMD calls them. AMD adds that half of all processors sold across all computing device segments, including smartphones, are now what it refers to as APUs.
Ubiquitous as that might sound, though, the APU is not the endgame; it’s only the beginning. Simply having two different cores on the same die may improve latency, but the aim of Fusion was always to leverage heterogeneous computing in the most effective ways possible. Having discrete CPUs and GPUs each chew on the application tasks best suited to them is still heterogeneous computing. Having those two cores present on the same die is merely an expression of heterogeneous computing more suited to certain system goals, such as optimizing high performance in a lower power envelope. Of course, this assumes that programs are being written to leverage a heterogeneous compute model—and most are not. Ageia was one of the first companies in the PC world to address this problem. In 2004, a fledgling semiconductor company named Ageia purchased a physics middleware company called NovodeX, and thus was born the short-lived field of physics processing units (PPUs), available on third-party standalone cards. For games coded to leverage Ageia’s PhysX engine, these cards could radically improve physics simulation and particle motion. PhysX caught on with many developers, and Nvidia bought Ageia in 2008. Over time, Nvidia phased out the PPU side of the technology and supported PhysX on any CUDA-ready 8-series or newer GeForce card.
Ageia’s fame drew the attention of Dave Orton and others at ATI. Even before the AMD merger, ATI had been working to enable general-purpose GPU computing (GPGPU) in its Radeon line. In 2006, the R580 GPU became the first ATI product to support GPGPU, which the company soon branded as Stream. The confusing nomenclature of Stream, FireStream, stream processors, and so on gives some indication of the initial lack of cohesion in this effort. Stanford’s Folding@home distributed computing project became ATI’s first showcase for just how mind-blowing the GPGPU performance advantage could be. The trouble was that Stream never caught on. Nvidia seized its 2006/2007 execution upswing, capitalized on the confusion reigning at AMD at that time, and solidified CUDA as the go-to technology for GPGPU computing. But this is a bit like describing a goldfish as the biggest creature in the tank when all of the other fish are guppies. Despite a lot of notoriety in gaming and academic circles, GPGPU development remained very niche and far from mainstream awareness. "AMD has been promoting GPU compute for a really long time," says Manju Hegde, former CEO of Ageia and now corporate vice president of heterogeneous applications and developer solutions at AMD. "But eight years ago, it wasn’t right. Five years ago, it wasn’t right. Now, with the explosion of the low-power market, smartphones and tablets, it’s right. And for developers to create the kinds of experiences that normal PC users expect, they have to go to GPU computing—but it has to be based on something easy like HSA."
OpenCL And HSA The slow take-off of GPGPU computing had less to do with niche markets than it did with problematic programming. Simply put, the world was built to code compute for CPUs, and shifting some of that code over to GPUs was anything but straightforward. "Various specialized hardware designs, such as Cell, GPGPUs, and MIC, have gained traction as alternative hardware designs capable of delivering higher flop rates than conventional designs," notes IEEE author D.M. Kunzman in the abstract for the paper Programming Heterogeneous Systems. "However, a drawback of these accelerators is that they simultaneously increase programmer burden in terms of code complexity and decrease portability by requiring hardware specific code to be interleaved throughout application code...Further, balancing the application workload across the cores becomes problematic, especially if a given computation must be split across a mixture of core types with variable performance characteristics."
Not surprisingly, all of the traditional APIs built to interface with GPUs were designed for graphics. To make a GPU compute math, one had to pretend operations were based on textures and rectangles. The great advance of OpenCL was that it dispensed with this work-around and provided a straight compute interface for the GPU. OpenCL is managed by the non-profit Khronos Group, and it is now supported by a wide range of industry players involved with heterogeneous computing, including AMD, ARM, Intel, and Nvidia.

So, if OpenCL provides a software framework for heterogeneous computing, that still doesn’t address the hardware side of the problem. Whether discussing servers, PCs, or smartphones, how should the hardware platform (distinct from the CPU, GPU, and/or APU) perform heterogeneous computing? Clearly, platforms were not designed for this paradigm in the past. The computing device typically has one system memory pool, yet the programmer has to copy data from the CPU memory space to the GPU memory space—within the same pool—before the application can start executing its process. The same is true for fetching the results back again. In a system with only one memory pool, repetitive copying of data to different areas within the same memory is highly inefficient.

This is where HSA comes in. HSA brings the GPU into a shared memory environment with the CPU as a co-processor. The application gives work directly to the GPU just as it does to the CPU. The two cores can work together on the same data sets. With a shared memory space, the processors use the same pointers and addresses, making it much more efficient to offload smaller amounts of work because all of that old copying overhead is gone.
In addition to unified memory, AMD notes that HSA establishes cache coherency between the CPU and GPU, eliminating the need to do a DMA flush every time the programmer wants to move data between the CPU and GPU. The GPU is also now allowed to reference pageable memory, so the entire virtual memory space is available. Not least of all, HSA adds context switching, enabling quality of service. With these features in hardware, an HSA platform becomes very similar in programming style to that of a CPU. "Shared memory makes the whole system much easier to program," adds AMD fellow Phil Rogers.
"One of the barriers to using GPU compute today is a lot of programmers tell us they find it too hard. They have to learn a new API. They have to manage these different address spaces. They’re not sure when the right time is to copy the data. When you eliminate barriers like this across the board and enable high-level languages, you make it so much easier to program that suddenly you get tens of thousands of programmers working on your platform instead of dozens or hundreds. That’s a really big deal."
Focus On The Programmer "Every programmer has their own favorite language," says Phil Rogers. "They can almost be religious about it. You don’t want to tell a programmer that he has to change the way he currently develops applications in order to deliver better experiences. HSA enables heterogeneous computing for all high-level languages over time." Compatibility with C and C++ might have been sufficient for some, but AMD wanted to make sure everyone was covered, so it expanded HSA to work with C#, Java, and even functional languages. And because there’s only so much AMD can do on its own, the company decided to turn HSA into an open standard governed by the HSA Foundation, which boasts founding members including ARM, MediaTek, and Texas Instruments. Officially launched in June 2012, the HSA Foundation’s goal is to promote HSA-enabled platforms and software at all levels. This includes making SDKs, libraries, training, and other resources available to all programmers, often for free. From a developer perspective, the whole idea of HSA is that programmers can now easily take advantage of the heterogeneous compute model in their apps without being bound to design or write in any certain way.
"Programmers don’t just program to the metal," says AMD’s Manju Hegde. "They need proper compilers, debuggers, profilers, optimization tools, libraries. These are the tasks ahead of us, which is why we established the HSA Foundation, to drive the standard forward. A lot of the tools we do will be open sourced. This will give partners quicker time to market and lower the financial burden. When the software ecosystem sees that this is a genuine industry effort, the value of that will not be lost upon them. Because this is one of the first times that hardware companies are making changes to chip architecture to accommodate ease of software development. Tons of companies make changes because they want new capabilities, new features. But we’ve made changes simply to make it easier for programmers. That’s what’s needed to make things pervasive." It’s important to keep in mind that OpenCL and HSA are two very different things, and it’s likely that the former will evolve to better fit the latter with time. Even without HSA, OpenCL offers a much different programming experience than it did even two years ago. For example, OpenCL 1.2 drastically reduces the amount of initialization code and other overhead code that used to be required for OpenCL. With HSA, that trend toward simplicity and performance will continue as programmers no longer need to manage two different memory spaces.
"Say a programmer uses Visual Studio today, writing C++ apps in Windows," says Phil Rogers.
"There are hundreds of thousands of programmers doing that. For them, they can use this new technology Microsoft released in Visual Studio called C++ AMP, short for Accelerated Massive Parallelism. In C++ AMP, they only added two keywords to the language, restrict and array_view, and just by adding those two, those programs are marked for offload onto the GPU. The tiny change to the program gives numerous benefits when they have chunks of very parallel code in existing applications. It’s a much easier transition than one might expect."
HSA's Big Picture With all of the focus on APUs and GPGPU, it’s easy to forget that there’s more to life than parallel code. The CPU remains a critical part of heterogeneous computing. Much of the code in modern applications remains serial and scalar in nature and will only run well on strong CPU cores. But even for the CPU, there are different types of workloads. Some loads do best on a few fast cores, while others excel on a larger number of lower-power cores. In both cases, and as mentioned earlier, applications need to be tailored to fit a power envelope for a particular device, whether it’s an all-in-one, notebook, tablet, or phone. As APUs gradually take over most (but not all) of the CPU arena, we’re seeing APUs diversify and segment in order to address these different power requirements. The difference now is that APUs seem likely to soon offer nearly twice the diversity of recent CPU families since they must address both scalar- as well as vector-based needs across device markets.
I attended AMD’s 2012 Fusion Developer Summit (AFDS) in Seattle, and, to my ears, it sounded like the last thing on anyone’s mind was the desktop market. There was a lot of buzz about AMD leveraging HSA to find better roads into the mobile markets. Far and away the biggest news was ARM, the leading name in ultramobile processors, joining the HSA Foundation (remember, HSA is architecture-agnostic). This carries significant ramifications in many directions. Everybody knows that mobile is hot and desktop is flat, at least in industry sales terms. To me, much of the messaging at AFDS seemed to reflect this, perhaps because desktop is the segment that cares least about the power and efficiency benefits HSA promises. So when I was able to sit down with Phil Rogers, I asked him whether one outcome of HSA would be a gradual shift by AMD toward battery-powered devices and away from desktops.
"This is a common misconception," he said. "Power matters a lot on the whole range of platforms. On the battery situation, everybody gets it. Even with desktops, and more and more, what we’re calling desktops are becoming all-in-ones, people want to know not just how fast it runs but how quiet it is and how attractive it is as a product. We’re seeing that even gamers don’t want a box next to their leg pumping hot air on their shins while they’re playing. What they really want is a 30” screen on the wall with a PC built into it that runs fantastic. And in that environment, you do care about power. Even if you don’t care about the electricity bill, you don’t want fans whining and screaming or heating up and taking away clock speed." At the other end of the market, servers stand to benefit greatly from HSA. Consider data centers and the continuing growth of cloud computing. With even smallish data centers now hosting more
than 10 000 servers each, power efficiency continues to grow in urgency. Generally speaking, hardware costs comprise only one-third of a server's total cost over its service life. Another third is spent on electricity used, and the remaining third goes toward cooling costs. If HSA can help improve compute efficiency, allowing systems to complete tasks more quickly so they can turn off large logic blocks or entire cores, then power consumption can decline drastically.
"You can only pack processors so densely," says Hegde. "HSA allows you to process more densely and at a lower power envelope. I don’t need to tell you that CUDA has been doing HPC applications for five years. All those applications are so much easier with HSA because HSA is heterogeneous between GPU and CPU. Nvidia always leans toward the GPU. And certainly, there are some embarrassingly parallel applications out there, but the vast majority of applications are not, including many HPC applications. So don’t think that HSA is just about client; it’s an architecture that spans many platforms."
More About The Big Picture

For many PC users, "integrated" chips are synonymous with lower performance. This perception is a holdover from the days of northbridge-based graphics, which were mediocre even on their best days. Most likely, that stigma will soon vanish. We don’t have side-by-side data showing how the same CPU die performs in both APU and graphics-free versions. But it seems safe to say that in a heterogeneous environment, running software designed for APU-style processing, an APU will deliver better total performance than the same CPU and GPU cores separated into discrete components. If the APU paradigm weren’t inherently better, the entire industry wouldn’t be shifting to it so rapidly.
Testing at Tom’s Hardware shows that, in a toe-to-toe battle between AMD and Intel looking only at x86 performance, Intel is today’s clear victor. Far less clear is what happens once heterogeneous graphics are factored in. Just as Intel has the stronger CPU, AMD clearly has the stronger graphics architecture. What happens when the software running across these platforms makes fair and ample use of OpenCL and other heterogeneous architectures? That’s what our ongoing heterogeneous compute series aims to answer.

"Today," says AMD’s Hegde, "the CPUs in our APUs may be lower in performance in your tests against Intel. But that’s not an indication of where we are going. We’re now embracing the low-power space in a very strong way, so we are building CPUs that are going to have very good performance with a very low power envelope. So when we say ‘balanced platform,’ we’re not saying that you have to take one or the other. We’re saying that, in a balanced platform, for every workload, do it in the place that makes the most sense. Intel’s approach of doing everything on the CPU is just wrong, because when you put a balanced workload on just the CPU, your power draw is dramatic and unnecessary. That’s why we don’t think that’s a good model. We’re going to make the cost of transition from CPU to GPU next to nothing. Then it’s up to each application to choose the right execution engine in terms of performance and power."
Similarly, AMD feels that HSA may be its path to success in smartphones and tablets, because when applications are properly optimized for balance, the GPU becomes much more influential in determining battery life. HSA targets two things: ease of programming and performance per watt. Once GPGPU compute comes into play, AMD's analysis shows the GPU delivering four to six times the performance per watt of a CPU. ARM may have the biggest piece of that heterogeneous smartphone opportunity, but AMD is betting there’s still plenty of room for a strong number-two player, one that happens to share a programming architecture with ARM. As mentioned, the implications for Nvidia and ultramobile would-be contender Intel are significant.
HSA Tomorrow

I asked Manju Hegde what part of the whole HSA effort keeps him awake at night. He answered without hesitation: "How we’re going to get adoption. Our plan is to go broad. With the HSA Foundation, we made a huge bet. We’re contributing...I don’t know how many millions of dollars of IP. The reason we’re donating this to the industry is we want a rising tide to lift all. When that happens, the software ecosystem gets excited, and that will be a catalyst to increasing adoption."
AMD is dead serious about this. Today, all of the company’s OpenCL tools are freely distributed; to my knowledge, every other company in the heterogeneous space charges for its tools. AMD wants developers of all kinds, from the largest names to one-person garage outfits, to be free from worrying about the economics of development tools. Will HSA complete the reversal of AMD’s misfortunes and restore the company to its former glory? A cynic might answer: it sure couldn’t hurt. The more thoughtful answer is that HSA is the culmination of a heterogeneous strategy born almost a decade ago. Dave Orton and Dirk Meyer saw the inevitability of today and set out to make that future happen in a way that would benefit everyone involved. We see time and again that the industry gravitates toward open standards and greater efficiency. Given that, it seems likely that AMD has finally scored the victory it has sought for so long.
"At the end of the day, we’re not maniacally focused on beating a single company," says AMD’s Joe Macri. "We’re maniacally focused on up-leveling the experience of all consumers. By doing that, many of our competition will fall out. They’ll not have the IP, or they’ll have only some of the IP. Or maybe they’ll be forced to merge, which is a very difficult thing. Intel is a wildcard. No doubt about it, they’re a very capable company and a great bunch of folks. But when you’re the incumbent, it’s much more difficult to embrace change. You only want to embrace unavoidable change because everything else costs money. So we gotta see how they pull their triggers. But I’d hate to be in their position. You don’t know how many people I interview from Intel specifically because their research doesn’t get utilized, because a VP says, 'Hey, that’s just going to cost us
money. Maybe we’ll utilize your research somewhere over the next ten years.' That kills an engineer.” At a time when cracks are appearing in Moore’s Law and the costs of shrinking fab processes continue to skyrocket, an industry that wants to keep accelerating compute capabilities must turn increasingly to optimizing efficiency. Ultimately, this is what GPGPU and HSA enable. By old methods, how much would a CPU need to evolve in order to facilitate a 5x performance gain? Now, such gains are possible simply through hardware and software vendors adopting an end-toend platform such as HSA. No pushing the envelope of lithography physics. No new multi-billiondollar factories. Just more efficient utilization of the technologies already on the table. And through that, the world of computing can take a quantum leap forward. Tom's Hardware - http://www.tomshardware.com