
Technical Whitepaper

WCDMA RNC HSPA User Plane Acceleration Using LSI Networking Solution
Version 3.0, February 3, 2009

Abstract
As High Speed Packet Access (HSPA) peak data rates increase, current Radio Network Controller (RNC) platforms, which rely on a collection of general purpose processors (CPUs) for user plane processing, cannot scale to meet increased traffic workloads. RNCs therefore require a new approach for processing user plane wireless protocols such as PDCP, Radio Link Control (RLC), MAC, and FP. RNCs need HSPA acceleration because of both the higher peak rate demands and the overall increase in the number of HSPA users and the associated traffic in WCDMA networks. This paper describes how to accelerate current HSPA user plane designs, regardless of the hardware they run on today, by using the LSI APP650 Advanced PayloadPlus network processor. By offloading user plane processing such as RLC segmentation/concatenation and reassembly to the APP650, RNCs can achieve a user peak data rate of 100+ Mb/s for small RLC Service Data Units (SDUs) and an aggregate throughput of over 700 Mb/s across 30k users. Using an APP650 offload approach with flexible RLC (3GPP Release 7+), the peak throughput per user can exceed 200 Mb/s.

Introduction
An important requirement for cellular systems is to provide high data rates for packet data services. To meet this requirement, HSPA was introduced in Releases 5 and 6 of the 3GPP/WCDMA specifications [1, 2]. Although packet data communication is supported in the first release of the 3GPP/WCDMA standard, HSPA brings further enhancements, including higher order modulation, fast scheduling, and rate control, to support higher peak data rates per end user. As the evolution of HSPA continues, the peak data rate will only increase; in 3GPP Release 7+ [3], it is expected to exceed 200 Mb/s per user.

In a 3G network with HSPA, an RNC typically controls several hundred base stations. The RNC is in charge of call setup and radio resource management for the cells under its control. The WCDMA user plane protocol layers, including PDCP, RLC, MAC, and FP, are initiated in the RNC in the downlink direction and terminated in the RNC in the uplink direction. The RLC layer is the only RNC user plane layer that is terminated in the user equipment (the mobile device); all the other layers run only between the RNC and the base station.

Existing RNC platforms typically use several general purpose CPUs to process the WCDMA user plane protocol stack. With the evolution of HSPA and the increase in cell data rates, existing RNC architectures do not scale to meet the increased traffic workload in WCDMA networks. Moore's Law yields roughly a 2x scaling of CPU performance every 18 months, while, according to several market research studies, network demand for RNC user plane processing capacity is estimated to grow by roughly 3x every 12 months. This gap is a fundamental problem for current RNC user plane processing approaches (Figure 1). LSI provides both short-term and long-term solutions to this problem.
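The scaling mismatch between CPU capability and network demand quoted above can be made concrete with a quick calculation (illustrative exponential-trend figures only, normalized to 1.0 at year zero):

```python
# Illustrative sketch: CPU capability per Moore's Law (2x every 18 months)
# versus RNC user-plane demand (3x every 12 months, per the market
# estimates cited in this paper), both normalized to 1.0 at month 0.

def cpu_capability(months):
    """Moore's Law trend: doubles every 18 months."""
    return 2 ** (months / 18)

def network_demand(months):
    """Demand trend: triples every 12 months."""
    return 3 ** (months / 12)

for years in range(5):
    m = years * 12
    gap = network_demand(m) / cpu_capability(m)
    print(f"year {years}: demand/CPU gap = {gap:.1f}x")
```

After three years the demand curve is already 27x while Moore's Law delivers only 4x, a 6.75x shortfall, which is the widening gap sketched in Figure 1.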

Figure 1: Moore's Law Versus Network Demand
[Figure: log-scale plot of performance (peak rate and total rate) versus year, 2000 through 2012. Network demand on the RNC grows 3x every 12 months, while Moore's Law provides 2x every 18 months, opening a widening gap. LSI bridges the gap: short term, APP650 acceleration; long term, LSI next-generation products.]
This paper proposes a solution that offloads RLC segmentation/concatenation to substantially reduce the load on existing CPUs. Segmentation offload has previously been used for other protocols (e.g., TCP) to save CPU cycles in server platforms; this paper applies similar concepts to accelerate WCDMA user plane processing, specifically the RLC protocol. The LSI APP650 processor [5] offloads segmentation/concatenation and reassembly for up to 30k RLC connections. Figure 2 shows several CPUs using a single APP650 processor as the acceleration engine. This acceleration approach provides significant advantages over non-accelerated implementations, which are typically limited by single-core or single-thread performance. Supporting the increasing HSPA peak rate and number of users with the typical model that runs user plane processing on CPUs under an operating system would require the single-user processing software to be parallelized or pipelined across multiple processors, a software effort that would be extremely complex, expensive, and error prone. In contrast, moving some of the most CPU-intensive processing to the LSI APP650 processor can eliminate 50% or more of the CPU processing load, enabling peak rates and overall aggregate throughput to more than double on the same hardware.
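As a concrete illustration of the operation being offloaded, the following sketch segments and concatenates a stream of SDUs into fixed-size PDU payloads. It is a deliberate simplification: real RLC AM PDUs carry headers with sequence numbers and length indicators (per 3GPP TS 25.322), which are omitted here.

```python
# Simplified sketch of RLC-style segmentation/concatenation (not the exact
# 3GPP 25.322 header layout): SDUs are packed into fixed-size PDU payloads.
# One SDU may span several PDUs, and one PDU may carry the tail of one SDU
# plus the head of the next.

PDU_PAYLOAD = 100  # bytes of payload per RLC PDU, as in the prototype

def segment_and_concatenate(sdus):
    stream = b"".join(sdus)            # concatenation across SDU boundaries
    return [stream[i:i + PDU_PAYLOAD]  # segmentation into fixed-size PDUs
            for i in range(0, len(stream), PDU_PAYLOAD)]

def reassemble(pdus, sdu_lengths):
    # Length information would normally come from RLC header length indicators.
    stream = b"".join(pdus)
    out, pos = [], 0
    for n in sdu_lengths:
        out.append(stream[pos:pos + n])
        pos += n
    return out

sdus = [b"A" * 142, b"B" * 42]
pdus = segment_and_concatenate(sdus)
assert reassemble(pdus, [142, 42]) == sdus
print(len(pdus), "PDUs")  # 184 bytes of SDU data -> 2 PDUs
```

Performing this byte-shuffling per TTI across tens of thousands of channels is exactly the kind of data-movement work that dominates CPU cycles and that the APP650 performs in hardware.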

Figure 2: Offloading RLC Segmentation/Concatenation and Reassembly to the APP650 Processor


[Figure: multiple RLC instances running on CPUs exchange status RLC PDUs, control RLC PDUs, and command & configuration messages with the APP650, which performs RLC segmentation & concatenation on incoming RLC SDUs and RLC reassembly on incoming RLC PDUs.]

Advantages of the APP650 for User Plane Processing

The APP650 network processor consists of several processing units, including a pattern processor, a traffic manager, and state engines (Figure 3). The pattern processor is mainly responsible for packet classification. It uses a pipelined, multithreaded, multiprocessor architecture: each pipeline stage in the pattern processor can work on a different context/thread every clock cycle. This differs from a conventional general purpose architecture, where all instructions in the pipeline must belong to a single context and the executing context is switched only when it encounters a stall (cache miss, memory access, branch misprediction, etc.). In a conventional single-threaded architecture, keeping the execution pipeline busy is challenging because all the instructions in the pipeline belong to a single thread. In the APP650 architecture, if a context executes a high-latency function call, its place in the pipeline is assigned to another context. Consequently, the APP650 multithreaded architecture provides a zero-cycle context switch capability that is not present in single-threaded multi-core architectures. The pattern processing engine has 144 separate contexts, allowing it to fully utilize hardware resources and hide memory latency effects. In contrast, the memory bottleneck in CPUs prevents resources from being fully utilized and causes CPU cycles to be wasted. The APP650 network processor allocates a context to each incoming packet, and many packets are processed concurrently. By processing many packets at the same time, resources are fully utilized and a data rate of up to 5.9 Gb/s can be achieved.

In the APP650 architecture, mechanisms are separated from policies: hardware provides mechanisms, and software provides policies. The APP650 architecture implements packet memory management and data movement in hardware. Consequently, software does not consume cycles to allocate or free memory, keep track of packet pointers, or copy data to different memory addresses. For each packet, the APP650 hardware invokes the software to provide policy decisions, eliminating wasted cycles for processing interrupts or polling. The APP650 network processor also includes a Pre-Queuing Modification (PQM) engine that can insert or delete data at different parts of a packet and can segment a packet into many subpackets. These PQM features significantly accelerate RLC segmentation/concatenation. Another important feature of the APP650 network processor is hardware assist for multifield packet classification; classification can consume significant cycles on CPUs but is highly efficient on the APP650. The APP650 state engine provides a mechanism to keep track of state associated with packets. In RLC processing, this engine keeps track of RLC connection state; for example, the 12-bit sequence number associated with each RLC connection is one piece of protocol state kept by the state engine.
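Why many hardware contexts keep a pipeline busy can be shown with a toy utilization model. All cycle counts below are illustrative assumptions, not APP650 figures; the point is only that once enough contexts are available, every stall window can be filled with another packet's work, which is the effect of the zero-cycle context switch.

```python
# Toy model of multithreaded latency hiding: with one context, every memory
# stall idles the pipeline; with enough contexts, other packets' instructions
# fill each stall window. All numbers are illustrative assumptions.

STALL = 20      # assumed cycles lost per memory access
WORK = 5        # assumed compute cycles between memory accesses
ACCESSES = 10   # assumed memory accesses per packet

def utilization(contexts):
    """Fraction of pipeline cycles doing useful work for a packet batch."""
    busy = WORK * ACCESSES
    # With N contexts, up to N-1 other packets can cover each stall window.
    hidden = min(STALL, (contexts - 1) * WORK)
    total = (WORK + STALL - hidden) * ACCESSES
    return busy / total

print(f"1 context : {utilization(1):.0%}")   # stalls dominate
print(f"5 contexts: {utilization(5):.0%}")   # stalls fully hidden
```

In this toy model a single context keeps the pipeline only 20% busy, while five contexts reach 100%; with 144 contexts per pattern processing engine, the APP650 has far more parallelism than is needed to hide realistic memory latencies.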


In the APP650 network processor, hardware invokes software as a subroutine to provide policy decisions on buffer management, traffic shaping/scheduling, and packet modification. The software runs on three compute engines based on a very long instruction word (VLIW) architecture. The buffer management compute engine enforces packet discard policies and keeps queue statistics. The traffic shaper compute engine determines Quality of Service (QoS) and Class of Service (CoS) treatment for each queue. The SED stream editor compute engine performs Protocol Data Unit (PDU) modification. The APP650 network processor's hardware-assisted traffic management supports deterministic behavior across thousands of queues, while providing a framework to customize traffic management algorithms in software via a subset of the C programming language. Since traffic management functions run in separate engines, the classification workload does not impact traffic management determinism. In contrast, CPU architectures execute traffic management algorithms either on the same processor pool that supports packet processing applications or on a separately allocated core; in both cases, hardware resources are underutilized to achieve determinism, and software programmers are responsible for all aspects of developing a traffic management solution. The APP650 architecture hides most of this complexity in a framework implemented in hardware and exposes only policy decisions to software programmers.

The APP650 architecture also hides hardware multithreading and parallel processing from the software developer. Consequently, it requires many fewer lines of software and provides significantly higher throughput than existing CPU-based wireless user plane solutions. LSI provides a rich software development environment, including a cycle-accurate simulator that can be used for functional debugging and performance analysis of applications, as well as for determining the utilization of different hardware resources.
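The policy/mechanism split described above can be sketched as follows. The names and the tail-drop threshold are hypothetical, not the APP650 API: the point is that the framework ("mechanism") owns the queue and its bookkeeping, and the software "policy" routine it invokes per PDU only returns a decision.

```python
# Sketch of a policy/mechanism split (hypothetical names, not the APP650 API):
# the framework owns the queue; the per-PDU policy routine only decides.

QUEUE_LIMIT = 64  # hypothetical per-queue depth limit (a policy parameter)

def discard_policy(queue_depth, pdu_len):
    """Policy: decide whether to discard; no queue manipulation here."""
    return queue_depth >= QUEUE_LIMIT

def enqueue(queue, pdu):
    """Mechanism: invoked per PDU; performs the actual queue operation."""
    if discard_policy(len(queue), len(pdu)):
        return False  # PDU discarded; statistics would be updated here
    queue.append(pdu)
    return True
```

On the APP650, the policy side corresponds to the code written for the buffer management compute engine in the C subset; memory allocation, pointer tracking, and data movement stay in hardware.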

Figure 3: APP650 Block Diagram

[Figure: APP650 block diagram. Inputs and outputs on ports 0-4 pass through a PPE + classifier and a traffic manager containing the buffer manager, traffic shaper, and SED compute engines, backed by a reassembly buffer and a PDU buffer. A packet generation engine, a state engine with state memory, the PQM, context memories, and program memory complete the datapath; a PCI interface connects to an external host CPU.]


RLC Segmentation/Concatenation Offload


The RLC has three modes of operation depending on the level of reliability required; this paper discusses only RLC Acknowledged Mode (AM). RLC AM uses an Automatic Repeat Request (ARQ) protocol to provide reliable communication. In the RNC downlink direction, the transmitter performs segmentation and concatenation of SDUs: RLC SDUs are mapped to RLC PDUs, which are sent for transmission and also placed in the retransmission queue. Under various conditions, the transmitter generates status reports back to the peer RLC. Status reports are either transmitted as separate RLC PDUs or piggybacked at the end of a data PDU if there is enough padding. In the RNC uplink direction, the RLC AM entity receives RLC PDUs from the MAC layer. After deciphering, the RLC header is extracted and used to reassemble SDUs. All status and control PDUs are processed, and the relevant information is passed to the RLC transmitting side, which checks its retransmission buffer against the received status PDU. The information in the RLC header is also used to generate status PDUs.

In the CPU farm, most of the cycles are consumed by segmentation/concatenation of SDUs into RLC PDUs in the RLC layer. Because segmentation/concatenation and reassembly are performed across all RLC channels at a high data rate, offloading these operations saves significant CPU cycles. Figure 4 shows the components of an RLC transmitter and how they are partitioned between the acceleration engine and the CPU farm; the goal is to offload the high-bandwidth operations. In this design, RLC status management and control are still handled by the CPU farm: status PDUs are processed there, and a set of commands is then issued to the offload engine. For example, asserting the POLL bit in an RLC PDU from the RNC transmitter causes the RLC peer to transmit a status PDU. That status PDU is processed by the RNC CPU farm, which then instructs the acceleration engine to release RLC PDUs from the retransmission queue or to retransmit PDUs to the peer RLC. As Figure 4 also shows, RLC SDU buffers are kept in the acceleration engine; since the CPU farm never receives the SDUs, it saves the cycles otherwise spent classifying and buffering them. Flow control is another function of the RLC protocol: it allows an RLC receiver to control the rate at which the peer transmits RLC PDUs. The flow control logic is implemented in the CPU farm, which submits commands to the acceleration engine to stop or resume RLC channels.
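The command flow from the CPU farm to the offload engine can be sketched as follows. The field names and the dictionary-based retransmission queue are illustrative only, not the 3GPP 25.322 status PDU encoding or the APP650 command interface: the CPU farm parses a status report and emits release/retransmit commands keyed by sequence number.

```python
# Sketch of the CPU-farm side of status handling (illustrative names, not the
# 3GPP 25.322 encoding): a parsed status PDU is turned into commands for the
# acceleration engine's retransmission queue.

def process_status(retransmit_queue, acked, nacked):
    """Return offload-engine commands derived from a parsed status PDU."""
    commands = []
    for sn in acked:
        if sn in retransmit_queue:
            commands.append(("release", sn))     # free the buffered PDU
    for sn in nacked:
        if sn in retransmit_queue:
            commands.append(("retransmit", sn))  # resend to the peer RLC
    return commands

queue = {0: b"pdu0", 1: b"pdu1", 2: b"pdu2"}
cmds = process_status(queue, acked=[0, 1], nacked=[2])
print(cmds)  # [('release', 0), ('release', 1), ('retransmit', 2)]
```

Only this low-rate decision logic runs on the CPU farm; the high-rate work of buffering, releasing, and retransmitting the PDUs themselves stays in the acceleration engine.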

Figure 4: Partitioning RLC Processing Between CPU Farm and Acceleration Engine
[Figure: the CPU farm holds RLC control, status management, processing of incoming status, and extraction of piggybacked status; the APP650 holds the SDU buffer, segmentation & concatenation, reassembly, and the receive, retransmit, and status queues; ciphering/deciphering is external logic. SDUs flow into the SDU buffer and out of reassembly; PDUs flow to and from the external ciphering/deciphering logic; status and control PDUs flow between the APP650 queues and the CPU farm.]


RLC Segmentation/Concatenation and Reassembly Offload Performance Analysis


To demonstrate the capabilities of the APP650 network processor as an RLC acceleration engine and to analyze the performance of the system, a proof-of-concept prototype was designed and implemented. In the prototype, incoming RLC SDUs are buffered in the APP650 network processor. Every Transmission Time Interval (TTI), all the buffered SDUs are segmented and concatenated, and the resulting RLC PDUs are transmitted to a Gigabit Ethernet port. The RLC PDUs are then looped back to the APP650 network processor and, after going through reassembly, the SDUs are transmitted back to the test equipment. Figure 5 shows the test configuration. Up to 30k RLC connections are created, and sustainable throughput is measured for different SDU sizes, with segmentation/concatenation and reassembly performed across all RLC connections. In all experiments, periodic bursts with a burst size of two SDUs are used; the burst period represents the TTI, and the RLC PDU size is set to 100 bytes. Table 1 shows the aggregate RLC SDU throughput for 30k RLC channels as the SDU size is varied from 142 bytes to 442 bytes. Note that the throughput for all 30k channels is near 700 Mb/s irrespective of SDU size; this level of determinism cannot be achieved by general purpose processor architectures. For 30k connections, the throughput is limited by the bandwidth of the Gigabit Ethernet interface transmitting RLC PDUs, not by APP650 processing power. The number of provisioned RLC connections does not impact the throughput, because all the RLC configuration data is kept in the classification trees (lookup delay depends on the pattern size, not on the number of entries in the tree) and all the state associated with RLC connections is kept in the state engine's internal memory.
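The measured rates can be sanity-checked with simple arithmetic. For example, taking the 442-byte SDU case at the 197.5 kpps reported in Table 1 reproduces the "near 700 Mb/s" aggregate, and the 100-byte PDU payload fixes how many PDUs each SDU generates:

```python
# Cross-check of the measurement arithmetic: SDU throughput in Mb/s versus
# SDU packet rate in kpps for a given SDU size, plus PDUs generated per SDU
# at the prototype's fixed 100-byte PDU payload.

def throughput_mbps(kpps, sdu_bytes):
    """Aggregate SDU throughput implied by an SDU rate and SDU size."""
    return kpps * 1e3 * sdu_bytes * 8 / 1e6

def pdus_per_sdu(sdu_bytes, pdu_payload=100):
    """Each SDU is carried in ceil(size / payload) fixed-size RLC PDUs."""
    return -(-sdu_bytes // pdu_payload)

print(f"{throughput_mbps(197.5, 442):.1f} Mb/s")   # 698.4 Mb/s
print(pdus_per_sdu(442), "PDUs per 442-byte SDU")  # 5
```

At 197.5 kpps and 5 PDUs per SDU, the APP650 is also generating roughly a million RLC PDUs per second while staying within the Gigabit Ethernet line-rate bound.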

Figure 5: Prototype Test Setup
[Figure: a tester injects RLC SDUs over IP/MPLS/Ethernet into the APP6xx on port 0; segmented/concatenated RLC PDUs exit on a Gigabit Ethernet port to a tester/analyzer in echo mode (port 1) and are looped back for reassembly, with SDUs returned on ports 2, 3, and 4; an embedded processor with system memory attaches over the PCI bus.]

Table 1: RLC Throughput in Proof-of-Concept Prototype (30K RLC Channels)

SDU Size = 142 Bytes:  666.1 Mbps,  568. kpps
SDU Size = 242 Bytes:  68.5 Mbps,   5 kpps
SDU Size = 342 Bytes:  689.6 Mbps,  5 kpps
SDU Size = 442 Bytes:  698. Mbps,   197.5 kpps

WCDMA RNC HSPA User Plane Acceleration Using LSI Networking Solution

The APP650 simulator was used to obtain resource utilization information (Table 2). The results show the APP650 context utilization at high RLC channel counts and an aggregate throughput of 700 Mb/s. First pass and second pass context utilization are 51% and 10%, respectively, showing that even at such a high rate the APP650 network processor still has plenty of headroom for additional functionality.

Table 2: Resource Utilization for High Channel Counts and 700 Mb/s Aggregate Throughput (APP Simulator Environment)

Average Flow Instructions: 9 instructions/PDU
Average Microroot Tree Instructions: 1 instruction/PDU
Average Internal Tree Instructions: 0 instructions/PDU
Average External Tree Instructions: 10 instructions/PDU
Flow Instruction Budget: 66 instructions/PDU
Tree Instruction Budget (@100% efficiency): 66 instructions/PDU
Average Flow Engine Utilization: 0%
Average Tree Engine Utilization: 9%
Classification Program Memory Efficiency: 9.506%
Classification Program Memory Utilization (Internal): 0%
Classification Program Memory Utilization (External): 1.15%
Classification PDU Buffer Memory Utilization: 11.6%
Classification Control Memory Utilization: 9.98%
Average First Pass Context Utilization: 51.7%
Average Second Pass Context Utilization: 10.788%
Maximum Active First Pass Contexts: 9
Maximum Active Second Pass Contexts: 17
Maximum Active PDUs: 19

Conclusion
As HSPA peak data rates increase, existing RNC platforms that rely on a collection of CPU cores for WCDMA user plane processing do not scale to meet increased traffic workloads. The problem with existing RNC platforms is that the nature of user plane processing (mostly data processing) is not well suited to general purpose CPU architectures. Wireless user plane processing requires optimization of cycle-consuming functions such as RLC segmentation/concatenation and reassembly. This paper describes an approach that accelerates an existing RNC WCDMA user plane stack by offloading RLC segmentation/concatenation and reassembly to an APP650 network processor, and discusses the advantages of the APP650 architecture, such as determinism in efficiently processing packets. Simulation and prototyping show that the APP650 network processor can sustain up to 700 Mb/s aggregate throughput for 30k RLC channels, and for Flexible RLC a single-channel peak rate of more than 200 Mb/s can be achieved. In short, the APP650 network processor can be used as a user plane accelerator to solve both the peak and aggregate rate challenges in today's RNC systems.


Revision History
Version 3.0, 02/03/2009: Merging comments. (Reza Etemadi)
Version .1, 0/0/009: Final edits. (Henri Tervonen)
Version .0, 0/0/009: Merging comments from Henri Tervonen, Curtis Hillier, Robert Munoz, Tareq Bustami and Jas Tremblay. (Reza Etemadi)
Version 1., 01/1/009: Edits. (Henri Tervonen)
Version 1., 01/0/009: Edits. (Robert Munoz)
Version 1.1, 01/9/009: Edits. (Curtis Hillier)
Version 1.0, 01/7/009: First draft. (Reza Etemadi)
Version 0.0, 01/1/009: Edits focusing on benefits of LSI solution. (Henri Tervonen)
Version 0.01, 01/0/009: Initial draft; abstract added. (Reza Etemadi)

References
1. 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; High Speed Downlink Packet Access (HSDPA); Overall Description (Release 5), 3GPP TS 25.308.
2. 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; FDD Enhanced Uplink; Overall Description (Release 6), 3GPP TS 25.309.
3. 3rd Generation Partnership Project; Technical Specification Group Services and System Aspects; 3GPP System Architecture Evolution; Report on Technical Options and Conclusions (Release 7), 3GPP TR 23.882.
4. 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Radio Link Control (RLC) Protocol Specification (Release 7), 3GPP TS 25.322 V7.3.0.
5. APP650 Product Brief, LSI Corporation.

For more information and sales office locations, please visit the LSI web sites at: lsi.com lsi.com/contacts
LSI and LSI logo design are trademarks or registered trademarks of LSI Corporation or its subsidiaries. All other brand and product names may be trademarks of their respective companies. LSI Corporation reserves the right to make changes to any products and services herein at any time without notice. LSI does not assume any responsibility or liability arising out of the application or use of any product or service described herein, except as expressly agreed to in writing by LSI; nor does the purchase, lease, or use of a product or service from LSI convey a license under any patent rights, copyrights, trademark rights, or any other of the intellectual property rights of LSI or of third parties. Copyright 2009 by LSI Corporation. All rights reserved. February 2009 PB06-028CMPR
