Introduction to VLSI Circuits and Systems, NCUT 2007
Parallel processing, Fall 2018

Outline
» Parallel processors
» GPUs
» A brief history of GPUs
» Early limitations in GPGPU
» CUDA's step toward GPGPU
» Applications of CUDA
» Introduction to CUDA
Parallel Processors: Why multi-processor?
» Since 2003, energy-consumption and heat-dissipation issues have limited both the increase of the clock frequency and the amount of work that can be performed in each clock period within a single CPU.
» Virtually all microprocessor vendors have switched to models in which multiple processing units, referred to as processor cores, are used in each chip to increase processing power.
» This switch has exerted a tremendous impact on the software developer community.
» Traditionally, the vast majority of software applications have been written as sequential programs, as described by von Neumann [1945] in his seminal report. The execution of these programs can be understood by a human sequentially stepping through the code.
Parallel Processors: The concurrency revolution
» The applications that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to complete the work faster. This new, dramatically escalated incentive for parallel program development has been referred to as the concurrency revolution [Sutter 2005].
» The practice of parallel programming is by no means new. The high-performance computing community has been developing parallel programs for decades, but those programs run on large-scale, expensive computers. Only a few elite applications could justify the use of such machines, which limited the practice of parallel programming to a small number of application developers.
» Now that all new microprocessors are parallel computers, the number of applications that must be developed as parallel programs has increased dramatically. There is now a great need for software developers to learn about parallel programming, which is the focus of this topic.
Parallel Processors: GPUs as parallel computers
» Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors:
1. Multi-core
2. Many-core
» Multi-core: seeks to maintain the execution speed of sequential programs while moving into multiple cores. Multi-cores began as two-core processors, with the number of cores approximately doubling with each semiconductor process generation. A recent example is the Intel Core i9 X-series, which has up to 16 processing cores with hyper-threading, maximizing the execution speed of sequential programs.
Parallel Processors
» Many-core: focuses more on the execution throughput of parallel applications. For example, the NVIDIA GeForce GTX TITAN X graphics processing unit (GPU) has 3072 cores, each of which is heavily multithreaded.
» Many-core processors, especially GPUs, have led the race in floating-point performance since 2003.
Parallel Processors
Why is there such a large performance gap between many-core GPUs and general-purpose multi-core CPUs?
» The answer lies in the fundamental design philosophies of the two types of processors, as illustrated in the figure.
Parallel Processors: Constraints (multi-core CPUs)
» The design of a CPU is optimized for sequential code performance.
» It makes use of sophisticated control logic to allow instructions from a single thread of execution to execute in parallel, or even out of their sequential order, while maintaining the appearance of sequential execution.
» More importantly, large cache memories are provided to reduce the instruction and data access latencies of large, complex applications.
» Memory bandwidth is another important issue. Graphics chips have been operating at approximately 10 times the bandwidth of contemporaneously available CPU chips.
Parallel Processors: GPUs (the idea behind their evolution)
» A GPU is capable of moving data at about 336 gigabytes per second (GB/s) in and out of its main dynamic random access memory (DRAM).
» Microprocessor system memory bandwidth will probably not grow beyond 50 GB/s for about 3 years, so CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time.
Why use the GPU for computing?
GPUs
GPU Architecture
GPUs
A GPU is organized into an array of highly threaded streaming multiprocessors (SMs) (see the figure in the previous slide). Two SMs form a building block; however, the number of SMs in a building block can vary from one generation of CUDA GPUs to another. Each SM in Figure 1.3 has a number of streaming processors (SPs) that share control logic and an instruction cache.

Each GPU currently comes with up to 8 gigabytes of graphics double data rate (GDDR) DRAM, referred to as global memory in Figure 1.3. These GDDR DRAMs differ from the system DRAMs on the CPU motherboard in that they are essentially the frame buffer memory used for graphics. For graphics applications, they hold video images and texture information for three-dimensional (3D) rendering; for computing, they function as very-high-bandwidth, off-chip memory, though with somewhat more latency than typical system memory. For massively parallel applications, the higher bandwidth makes up for the longer latency.

The massively parallel G80 chip has 128 SPs (16 SMs, each with 8 SPs). Each SP has a multiply-add (MAD) unit and an additional multiply unit; with 128 SPs, that is a total of over 500 gigaflops. In addition, special function units perform floating-point functions such as square root (SQRT). With 240 SPs, the GT200 exceeds 1 teraflops.

Because each SP is massively threaded, the chip can run thousands of threads per application. A good application typically runs 5000-12,000 threads simultaneously on this chip. (For those who are used to simultaneous multithreading, note that Intel CPUs support only 2 or 4 threads per core, depending on the machine model.) The G80 supports up to 768 threads per SM, which sums to 768 × 16 SMs ≈ 12,000 threads for the chip. The GT200 supports 1024 threads per SM and up to about 30,000 threads for the chip.
A Brief History of GPUs
» In the late 1980s and early 1990s, the growth in popularity of graphically driven operating systems such as Microsoft Windows helped create a market for a new type of processor.
» In the early 1990s, users began purchasing 2D display accelerators for their personal computers.
» In 1992, Silicon Graphics opened the programming interface to its hardware by releasing the OpenGL library. Silicon Graphics intended OpenGL to be used as a standardized, platform-independent method for writing 3D graphics applications.
» The release of NVIDIA's GeForce 256 further pushed the capabilities of consumer graphics hardware. For the first time, transform and lighting computations could be performed directly on the graphics processor, thereby enhancing the potential for even more visually interesting applications.
Early GPU Computing
» The release of GPUs that possessed programmable pipelines attracted many researchers to the possibility of using graphics hardware for more than simply OpenGL- or DirectX-based rendering.
» The GPUs of the early 2000s were designed to produce a color for every pixel on the screen using programmable arithmetic units known as pixel shaders. In general, a pixel shader uses its (x, y) position on the screen, as well as some additional information, to combine various inputs in computing a final color. The additional information could be input colors, texture coordinates, or other attributes passed to the shader when it ran.
» Because the arithmetic being performed on the input colors and textures was completely controlled by the programmer, researchers observed that these input "colors" could actually be any data.
Early Limitations in GPGPU
» The programming model was still far too restrictive for any critical mass of developers to form.
» There were tight resource constraints, since programs could receive input data only from a handful of input colors and a handful of texture units.
» There were serious limitations on how and where the programmer could write results to memory, so algorithms requiring the ability to write to arbitrary locations in memory (scatter) could not run on a GPU.
» It was nearly impossible to predict how a particular GPU would handle floating-point data.
» Finally, when a program inevitably computed incorrect results, failed to terminate, or simply hung the machine, there existed no reasonably good method to debug code executing on the GPU.
CUDA: A Step toward GPGPU
» In November 2006, NVIDIA unveiled the GeForce 8800 GTX, the first GPU built with NVIDIA's CUDA architecture. This architecture included several new components designed strictly for GPU computing and aimed to alleviate many of the limitations that prevented previous graphics processors from being legitimately useful for general-purpose computation.
» The CUDA architecture included a unified shader pipeline, allowing each and every arithmetic logic unit (ALU) on the chip to be marshaled by a program intending to perform general-purpose computations.
» The ALUs were built to comply with IEEE requirements for single-precision floating-point arithmetic and were designed to use an instruction set tailored for general computation rather than specifically for graphics.
» The execution units on the GPU were allowed arbitrary read and write access to memory, as well as access to a software-managed cache known as shared memory.
Applications of CUDA
» Medical imaging
» Computational fluid dynamics
» Environmental sciences
Getting Started
Introduction to CUDA
» NVIDIA introduced CUDA in 2006 as a programming model for GPUs (C/C++).
» To a CUDA programmer, the computing system consists of a host, which is a traditional central processing unit (CPU) such as an Intel architecture microprocessor in today's personal computers, and one or more devices, which are massively parallel processors equipped with a large number of arithmetic execution units.
» In modern software applications, program sections often exhibit a rich amount of data parallelism, a property that allows many arithmetic operations to be safely performed on program data structures simultaneously. CUDA devices accelerate the execution of these applications by harvesting this data parallelism.
Introduction to CUDA
» Host refers to the CPU and its memory; device refers to the GPU and its memory.
» A CUDA program consists of one or more phases that are executed on either the host (CPU) or a device such as a GPU. The phases that exhibit little or no data parallelism are implemented in host code; the phases that exhibit a rich amount of data parallelism are implemented in device code.
» The host code is straight ANSI C; it is compiled with the host's standard C compilers and runs as an ordinary CPU process. The device code is written in ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures.
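As a minimal sketch of what such a kernel looks like (the function and variable names here are illustrative, not from the text), a data-parallel function is labeled with the __global__ keyword and is executed by many threads, each computing its own array element:

```cuda
// Hypothetical kernel: each thread scales one array element.
// __global__ marks a data-parallel function (a kernel) that runs on the device.
__global__ void vecScale(float *data, float factor, int n)
{
    // Each thread derives a unique global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // guard: the grid may contain more threads than elements
        data[i] *= factor;
}
```

The host would launch this kernel over n elements with, for example, vecScale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n); the triple-angle-bracket syntax specifies the number of thread blocks and the threads per block.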
Introduction to CUDA
» CUDA assumes that the device is a co-processor to the host, and that the host and device have their own separate, dedicated memories in the form of DRAM.
» In this relationship, the host runs the main program and sends work to the device by invoking programs, called kernels, that run parallel tasks on the GPU. The GPU can respond to CPU requests to send or receive data from host to device or from device to host.
» For a given heterogeneous system with a host and a device, a typical sequence of operations for a CUDA program is:
1. The host allocates storage on the device using cudaMalloc().
2. The host copies the input data to the device using cudaMemcpy().
3. The host launches kernel(s) on the device to process the data.
4. The host copies the results back from the device using cudaMemcpy().
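The four steps above can be sketched as one small, complete program (the kernel name addOne and the array size are illustrative assumptions, not from the text):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: each thread increments one element.
__global__ void addOne(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int n = 256;
    float h[256];                      // host array
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = NULL;                   // device pointer
    cudaMalloc((void **)&d, n * sizeof(float));                   // 1. allocate device storage
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // 2. copy input host -> device
    addOne<<<(n + 127) / 128, 128>>>(d, n);                       // 3. launch the kernel
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // 4. copy results device -> host
    cudaFree(d);

    printf("h[0] = %.1f\n", h[0]);
    return 0;
}
```

Note that the second cudaMemcpy() also acts as a synchronization point: the kernel launch is asynchronous, and the host blocks on the copy until the device has finished.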
Introduction to CUDA
» In the CUDA programming model, the specific hardware architecture is identified by its compute capability, as shown in Table 1. The compute capability of a device is represented by a version number that identifies the features supported by the GPU hardware.
» The compute capability version consists of a major and a minor revision number. The major revision number is 5 for devices based on the Maxwell architecture, 3 for Kepler, 2 for Fermi, and 1 for devices based on the Tesla architecture. The minor number corresponds to incremental improvements in the core architecture.
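A program can query a device's compute capability at run time through the CUDA runtime API; the following sketch prints the major and minor revision numbers for device 0:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query the properties of device 0.
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        // prop.major and prop.minor together form the compute capability,
        // e.g. 5.x for Maxwell, 3.x for Kepler, 2.x for Fermi, 1.x for Tesla.
        printf("Device 0: %s, compute capability %d.%d\n",
               prop.name, prop.major, prop.minor);
    }
    return 0;
}
```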
Questions?