Professional Documents
Culture Documents
Underpinned by Moores Moore s Law Computers i C in automobiles bil Cell phones Human genome project World Wide Web Search Engines
Classes of Computers
Desktop computers
z
Designed to deliver good performance to a single user at low cost usually executing 3rd party software, usually incorporating a graphics display, a keyboard, and a mouse Used to run larger programs for multiple multiple, simultaneous users typically accessed only via a network and that places a greater emphasis on dependability and (often) security A high performance, high cost class of servers with hundreds to thousands of processors, terabytes of memory and petabytes of storage that are used for high-end scientific and engineering applications A computer inside another device used for running one predetermined application
Irwin, PSU, 2008
Servers
z
Supercomputers
z
Kilobyte 210 or 1,024 bytes Megabyte 220 or 1,048,576 Megabyte 1 048 576 bytes
z
Often have minimum performance requirements. Example? Often have stringent limitations on cost. Example? Often have Oft h stringent ti t limitations li it ti on power consumption. ti Example? Often have low tolerance for failure failure. Example?
System y software
z
Operating system supervising program that interfaces the users program with the hardware (e.g., Linux, MacOS, Windows)
- Handles basic input and output operations - Allocates storage and memory - Provides for protected sharing among multiple applications
Compiler translate programs written in a high-level language (e.g., C, Java) into instructions that the hardware can execute
Irwin, PSU, 2008
one-to-many C compiler
one-to-one assembler
Higher-level languages
Higher-level languages
z
Allow the programmer to think in a more natural language and for their intended use (Fortran for scientific computation, C b l for Cobol f business b i programming, i Li Lisp f for symbol b l manipulation, Java for web programming, ) Improve programmer productivity more understandable code that is easier to debug and validate Improve program maintainability Allow programs to be independent of the computer on which they are developed (compilers and assemblers can translate high-level language programs to the binary instructions of any machine) Emergence of optimizing compilers that produce very efficient assembly code optimized for the target machine
z z z
Five classic components of f a computer input, output, memory, datapath, and control
Anatomy of a Computer
Output device
Network cable
Input device
Input device
Anatomy of a Mouse
Optical mouse
Datapath: performs operations on data Control: sequences datapath datapath, memory memory, ... Cache memory
51 12KB L2 2
51 12KB L2 2
Four out-oforder cores on one chip p 1.9 GHz clock rate 65nm technology Three levels of caches (L1 L2 (L1, L2, L3) on chip Integrated g Northbridge
Irwin, PSU, 2008
Core 1
Core 2
Northbridge
512KB B L2
Core 3
512KB B L2
Core 4
http://www.techwarelabs.com/reviews/processors/barcelona/
ISA, S or simply architecture the abstract interface f between the hardware and the lowest level software that encompasses all the information necessary to write a machine language program, including instructions, registers, memory access, I/O,
z
The combination Th bi ti of f the th basic b i instruction i t ti set t (the (th ISA) and the operating system interface is called the application binary interface (ABI)
z
ABI The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.
Moores Law
In 1965, Intels G Gordon Moore predicted that the number of transistors that can be integrated on single chip would double about every two years
Courtesy, Intel
2004 90 2
2006 65 4
2008 45 6
2010 32 16
2012 22 32
30 million can fit on the head of a pin You could fit more than 2,000 across the width of a human hair If car prices had fallen at the same rate as the price of a single transistor has since 1968, a new car today would cost about 1 cent
256K
120 100
Powe er(Watts)
CSE431 Chapter 1.22
80 60 40 20 0
For the P6, success criteria included performance above a certain level and failure criteria included power di i i above dissipation b some threshold. h h ld
Bob Colwell, Pentium Chronicles
Since 2002 the rate of improvement p in the response p time of programs on desktop computers has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year
As of 2006 all desktop and server companies are shipping microprocessors with multiple processors cores per chip
P d Product Cores per chip Clock rate Power AMD Barcelona 4 2.5 GHz 120 W Intel I l Nehalem 4 ~2.5 GHz? ~100 W? IBM Power P 6 Sun S Ni Niagara 2 2 4.7 GHz ~100 W? 8 1.4 GHz 94 W
Plan of record is to double the number of cores per chip per generation (about every two years)
Irwin, PSU, 2008
The hard hardware/software are/soft are interface What determines program performance
Understanding Performance
Algorithm
Determines number of operations executed Determine number of machine instructions executed per operation Determine how fast instructions are executed Determines how fast I/O operations are executed
Abstractions
The BIG Picture
Hide lower-level detail The hardware/software interface The ISA p plus system y software interface The details underlying and interface
Chapter 1 Computer Abstractions and Technology 27
Implementation
Loses instructions and data when power off Magnetic disk Flash memory Optical disk (CDROM, DVD)
Networks
Within a building
Wide area network (WAN: the Internet Wireless network: WiFi, Bluetooth
Technology Trends
DRAM capacity
Defining Performance
Passenger Capacity
Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC-8-50 0 500 1000 1500
Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC8-50 0 100000 200000 300000 400000 Passengers x mph
Response time
How long it takes to do a task Total work done p per unit time
Throughput
Replacing the processor with a faster version? Addi more processors? Adding ?
Relative Performance
10s on A, 15s on B Execution TimeB / Execution TimeA = 15s / 10s = 1.5 So A is 1.5 1 5 times faster than B
Chapter 1 Computer Abstractions and Technology 33
Elapsed time
CPU time
Comprises user CPU time and system CPU time ti Different programs are affected differently by CPU and system performance
Chapter 1 Computer Abstractions and Technology 34
CPU Clocking
CPU Time
CPU Time = CPU Clock Cycles Clock Cycle Time CPU Clock Cycles = Clock Rate
Performance improved by
Reducing number of clock cycles Increasing clock rate Hardware designer must often trade off clock rate against cycle count
Aim for 6s CPU time Can do faster clock, but causes 1.2 clock cycles
Clock CyclesB 1.2 Clock CyclesA = Clock RateB = CPU Time B 6s Clock Cycles A = CPU Time A Clock Rate A = 10s 2GHz = 20 10 9 1.2 20 10 9 24 10 9 Clock RateB = = = 4GHz 6s 6s
Chapter 1 Computer Abstractions and Technology 37
Determined by program, ISA and compiler Determined by CPU hardware If different instructions have different CPI
CPI Example
Computer A: Cycle Time = 250ps, CPI = 2.0 Computer p B: Cycle y Time = 500ps, p , CPI = 1.2 Same ISA Which is faster, and by y how much?
CPU Time A B = Instruction Count CPI Cycle Time A A = I 2.0 2 0 250ps = I 500ps A is faster faster = Instruction Count CPI Cycle Time B B = I 1.2 1 2 500ps = I 600ps
by this much
n Clock Cycles Instruction Count i = CPIi CPI = I t ti n Count Instructio C t i=1 I t ti n Count Instructio C t
Relative frequency
Chapter 1 Computer Abstractions and Technology 40
CPI Example
B 2 1 1
C 3 2 1
Sequence 1: IC = 5
Sequence 2: IC = 6
Performance Summary
The BIG Picture
Instructions Clock cycles Seconds CPU Time = Program Instruction Clock cycle
Performance depends on
Algorithm: affects IC, possibly CPI Programming language: affects IC, CPI Compiler: affects IC, CPI Instruction set architecture: affects IC, CPI, Tc
Chapter 1 Computer Abstractions and Technology 42
Power Trends
In CMOS IC technology
Power = Capacitive load Voltage 2 Frequency
30 5V 1V 1000
Reducing Power
85% of capacitive load of old CPU 15% voltage and 15% frequency reduction
Pnew Cold 0.85 (Vold 0.85)2 Fold 0.85 4 = = 0.85 = 0.52 2 Pold Cold Vold Fold
Uniprocessor Performance
Multiprocessors
Multicore microprocessors
More than one processor per chip C Compare with ihi instruction i l level l parallelism ll li
Hardware executes multiple instructions at once Hidden from the programmer Programming for performance Load balancing Optimizing communication and synchronization
Chapter 1 Computer Abstractions and Technology 46
Hard to do
Manufacturing ICs
X2: 300mm wafer, 117 chips, 90nm technology X4: 45nm technology
Chapter 1 Computer Abstractions and Technology 48
Wafer cost and area are fixed Defect rate determined by manufacturing process Die area determined by architecture and circuit design
Chapter 1 Computer Abstractions and Technology 49
Supposedly typical of actual workload Develops benchmarks for CPU, I/O, Web, Elapsed time to execute a selection of programs
SPEC CPU2006
Geometric mean
10 10 Overall ssj_ops ssj ops per Watt = ssj_ops ssj ops i poweri i=0 i=0
SPECpower_ssj2008 for X4
Target Load % 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Overall sum ssj_ops/ ssj ops/ power Performance (ssj_ops/sec) 231,867 211,282 185,803 163,427 , 140,160 118,324 920,35 70,500 47,126 23 066 23,066 0 1,283,590 Average Power (Watts) 295 286 275 265 256 246 233 222 206 180 141 2,605 493
Improving an aspect of a computer and expecting p gap proportional p improvement p in overall performance
Timproved Taffected = + Tunaffected improvemen t factor
At 100% load: 295W At 50% load: 246W (83%) At 10% load: 180W (61%) Mostly operates at 10% 50% load At 100% load less than 1% of the time
MIPS =
Instruction count Clock rate = = 6 Instruction count CPI CPI 10 6 10 Clock rate
Concluding Remarks
Cost/performance is improving
Due to underlying technology development In both hardware and software The hardware/software interface