A Summary of "Characterizing Processor Architectures for Programmable Network Interfaces"
Network interfaces are the crossing points in communication, the nodes of the network.
Goal: speed up these nodes (faster equipment) by characterizing network processors against application workloads and emerging applications.
Processor Architectures
- Out-of-order, speculative superscalar processor
- Fine-grain multithreaded processor
- Single-chip multiprocessor
- Simultaneous multithreaded processor (effective for the NI environment)
Current trend: programmable microprocessors on network interfaces (PNIs) that can be customized with domain-specific software.
The open question is which chip architectures, designed specifically to match the network application workload of PNIs, should fill this role: what workloads must the processor architecture support, what level of performance is required, and what type of architecture provides that level of performance.
Performance Evaluation of Network Processor Architectures
Bottlenecks in the current trend of networking technology (the key metric is the number of messages per second):
- Instruction issue and execution
- Cache accesses
- Memory bandwidth and latency
- Memory contention between the processor and DMA transfers caused by network send and receive operations
Performance evaluation is done with cycle-accurate simulation.
Workflow
- Identify the applications that can be considered components of the workload
- Measure the maximum sustainable link rate
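The maximum sustainable link rate can be estimated from the per-packet processing cost: a core running at f Hz that spends c cycles per packet of s bytes sustains f/c packets per second, i.e. 8·s·f/c bits per second. A minimal Python sketch; the clock rate, cycle count, and packet size below are hypothetical illustration values, not figures from the paper:

```python
def sustainable_link_rate_bps(clock_hz: float, cycles_per_packet: float,
                              packet_bytes: int) -> float:
    # Packets/sec the processor can handle, times bits per packet.
    packets_per_sec = clock_hz / cycles_per_packet
    return packets_per_sec * packet_bytes * 8

# Hypothetical example: 500 MHz core, 10,000 cycles per 1500-byte packet
rate = sustainable_link_rate_bps(500e6, 10_000, 1500)  # 600 Mb/s
```

Halving the cycles per packet (or doubling the clock) doubles the sustainable rate, which is why the architecture's per-packet throughput is the quantity being compared.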
The best performance is shown by architectures designed for a high degree of thread-level parallelism.
The conventional application workload of such communication devices used to consist of simple packet forwarding and filtering algorithms based on the addresses found in layer-2 or layer-3 protocol packets.
Current Trend: Application Workloads
Emerging workloads include:
- Traffic shaping
- Network firewalls
- Network address and protocol translation (NAT)
- High-level data transcoding (e.g., to convert a data stream going from a high-speed link to a low-speed link)
- Load balancing HTTP client requests over a set of WWW servers to increase service availability
These workloads exploit packet-level parallelism at the architectural level to achieve sustained high performance.
Application domains for programmable NIs: server NI software, web switching software, active networking software, and application-specific packet processing routines, including:
- Packet classification/filtering
- IP packet forwarding
- Network address translation (NAT)
- Flow management (TCP/IP)
- Web switching
- Virtual private networks / IP Security (IPSec)
- Data transcoding
- Duplicate data suppression
Some of these applications process a limited amount of data in protocol headers; others process all of the data contained in the packet. Packet processing is parallel in these applications.
Benchmarks drawn from the workloads: IPv4 (IP forwarding), MD5 (IP Security), and 3DES (IP Security).
Packets are delivered to the NI via the host controller (outbound packets) or the network controller (inbound packets). Lower-level performance metrics, such as branch prediction rates, are also measured.
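The parallelism these workloads expose is easy to see in the MD5 benchmark: each packet's payload is hashed independently, with no shared state between packets, so packets can be dispatched to separate threads. A minimal Python sketch (illustrative only; the payloads and pool size are made up, and this is not the paper's simulator):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def process_packet(payload: bytes) -> str:
    # Each packet is hashed independently: no cross-packet state,
    # so the work exposes packet-level (thread-level) parallelism.
    return hashlib.md5(payload).hexdigest()

# Hypothetical batch of eight 64-byte payloads
packets = [bytes([i]) * 64 for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(process_packet, packets))
```

An architecture with multiple hardware thread contexts (CMP, SMT) can run such independent per-packet handlers concurrently instead of waiting on a single instruction stream.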
Execution Environment of a PNI
The PNI follows a store-process-forward mechanism:
- The store and forward stages transfer data into and out of the NI's buffer memory.
- The process stage invokes application-specific handlers based on matching criteria applied to the message.
The generic programmable network interface architecture consists of a network processor, a host controller, a network controller, and buffer memory. High throughput and low latency can be achieved by pipelining messages through these stages.
Candidate architectures:
- Superscalar (SS): deep pipeline (7 stages); scoreboarding and register renaming to resolve dynamic dependencies; issues a number of instructions each cycle.
- Fine-Grain Multithreaded (FGMT): multiple hardware thread contexts extending the out-of-order superscalar core; exploits ILP within a thread of execution; round-robin fetch and issue policy improves system throughput.
- Chip Multiprocessor (CMP): separate execution pipelines, separate register files, and separate fetch units; private L1 instruction and data caches with a shared L2 cache.
- Simultaneous Multithreaded (SMT): fetches and issues instructions from multiple threads each cycle; exploits both ILP and TLP.
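The store-process-forward stages can be sketched as a small software pipeline, with a queue standing in for buffer memory and a handler table standing in for the matching criteria. Everything here is a toy model for illustration; the handler table keyed on the first byte of the message is a made-up matching rule:

```python
from queue import Queue
from threading import Thread

def store(messages, buf_in):
    # Store stage: transfer incoming messages into buffer memory (a queue).
    for m in messages:
        buf_in.put(m)
    buf_in.put(None)  # sentinel marks end of stream

def process(buf_in, buf_out, handlers):
    # Process stage: invoke an application-specific handler chosen by a
    # matching criterion (here, hypothetically, the message's first byte).
    while (m := buf_in.get()) is not None:
        handler = handlers.get(m[0], lambda x: x)
        buf_out.put(handler(m))
    buf_out.put(None)

def forward(buf_out, delivered):
    # Forward stage: transfer processed messages out of the NI.
    while (m := buf_out.get()) is not None:
        delivered.append(m)

handlers = {0x01: lambda m: m.upper()}  # hypothetical handler table
buf_in, buf_out, delivered = Queue(), Queue(), []
msgs = [b"\x01abc", b"\x02xyz"]
stages = [Thread(target=store, args=(msgs, buf_in)),
          Thread(target=process, args=(buf_in, buf_out, handlers)),
          Thread(target=forward, args=(buf_out, delivered))]
for t in stages:
    t.start()
for t in stages:
    t.join()
```

Because the three stages run concurrently, one message can be forwarded while the next is processed and a third is stored, which is how pipelining the stages yields high throughput without raising per-message latency.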
Experiments were run with respect to:
- Standalone application performance
- Standalone operating system overhead
- OS-governed application performance
Each architecture embodies a different strategy: dynamic discovery of ILP (aggressive superscalar), tolerating blocked threads (FGMT), and simple replication (CMP).
Observations:
- SS and FGMT have basically the same performance on the workloads; likewise, CMP and SMT have roughly equivalent performance that is 2 to 4 times greater.
- The key is to scale both issue width and the number of hardware thread contexts.
- Network processor workloads exhibit a high degree of parallelism at the packet level, which represents an opportunity for high performance.
- SMT performs better than CMP, and more than a factor of two better than FGMT and SS, by dynamically exploiting both instruction- and thread-level parallelism.
Questions and Answers
1) What is a DMA controller?
A) Direct Memory Access (DMA) is one of several methods for coordinating the timing of data transfers between an input/output (I/O) device and the core processing unit or memory in a computer.
DMA saves core MIPS because the core can operate in parallel with the transfer. DMA saves power because it requires less circuitry than the core to move data. DMA saves pointers because core AGU pointer registers are not needed. DMA has no modulo block-size restrictions, unlike the core AGU.
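The "core can operate in parallel" point can be mimicked with a background thread standing in for the DMA engine: it moves a buffer while the core does arithmetic. This is a toy model of the overlap, not real hardware behavior; the buffer sizes and the core's workload are invented:

```python
from threading import Thread

src = list(range(1000))
dst = [0] * 1000

def dma_copy():
    # "DMA engine": moves data without occupying the core's compute loop.
    dst[:] = src

engine = Thread(target=dma_copy)
engine.start()
core_result = sum(i * i for i in range(100))  # core computes during the copy
engine.join()  # wait for the transfer to complete before using dst
```

Without the DMA engine, the core would spend its own cycles (and an AGU pointer register pair) stepping through the copy before it could start computing.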
2) Specify the OS used for the experiment on OS overhead and its details.
A) SPINE is the OS used. In the standalone operating system overhead experiment, it delivers basic packets at an equal rate on the FGMT, SS, and SMT architectures, and it runs a single thread of execution.
3) Which processor architectures exploit packet-level parallelism better (considering the first experiment set)?
A) CMP and SMT clearly demonstrate their superiority over SS and FGMT in exploiting the packet-level parallelism available within the workloads. The ability to issue from multiple threads simultaneously is key to this scalable performance.