
Wireless network cloud: Architecture and system requirements

With the growth of the mobile communication network, from second-generation to third-generation or fourth-generation networks, technologists in the mobile industry continue to consider advanced wireless network architectures that have the potential to reduce networking costs and provide increased flexibility with respect to network features. In this paper, we propose the wireless network cloud (WNC), a wireless system architecture for a wireless access network. This system makes use of emerging cloud-computing technology and various technologies involved with wireless infrastructure, such as software radio technology and remote radio head technology. Based on open information technology architecture, the WNC provides all the necessary transmission and processing resources for a wireless access network operating in a cloud mode. Note that it is useful to separate the hardware and software for different wireless standards and various services and business models, as well as to meet the new system requirements for emerging wireless technologies, such as collaborative processing at different scales of network use. We analyze several important system challenges involving computational requirements of virtual base stations, I/O throughput, and timing networks for synchronization. Based on current information technologies, we make several suggestions with respect to future system design.

Y. Lin L. Shao Z. Zhu Q. Wang R. K. Sabhikhi

Introduction
Given the rapid growth of mobile communications in recent years, mobile communication systems are now required to support much larger system capacities and much higher service data rates over large coverage areas and in high-mobility environments. To meet these requirements, many new wireless technologies have evolved and are represented by standards involving wideband code-division multiple access (WCDMA), 802.16d/e (represented by WiMAX**, i.e., Worldwide Interoperability for Microwave Access), Long-Term Evolution (LTE), and 802.11a/b/g/n technologies. At the same time, the fourth-generation (4G) mobile networks are entirely Internet Protocol (IP)-based networks and systems. Most of the existing Global System for Mobile communication (GSM) operators are considering the upgrade from GSM networks to newer generation mobile networks [1].

Digital Object Identifier: 10.1147/JRD.2009.2037680

The wireless access network has been regarded as the most important part of the entire wireless network. There are two reasons for this: First, the radio interface technology used in wireless access networks determines the service data rates, geographic coverage, and system capacity supported in a cell area. Thus, the differences among wireless standards mainly involve the wireless access network. Second, base stations (BSs), which are the main components of wireless access networks, usually represent the largest monetary investment associated with a mobile network. Usually, this investment exceeds 40% of total investment. Today, BSs in wireless access networks make use of proprietary hardware designs and support dedicated standards. When the wireless network is upgraded, almost all of the network equipment must be replaced. Furthermore, during the transition, to satisfy the coexistence of new standards (such as WCDMA in 3G) and old standards (such as GSM in 2G), mobile operators must keep the old network and create another one for the new standard. Therefore, the wireless network upgrades require huge financial investment and have limited

Copyright 2010 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. 0018-8646/10/$5.00 © 2010 IBM

IBM J. RES. & DEV.

VOL. 54

NO. 1

PAPER 4

JANUARY/FEBRUARY 2010

Y. LIN ET AL.

4:1

adoption of emerging wireless technologies.

In addition, the current structure of typical wireless access network installations will affect the adoption of future wireless communication technologies. In existing wireless access networks, the BS equipment is located close to its antenna tower and radio head. Links exist only between the BS and its access network gateway, and there is no link between BSs. Therefore, the BS equipment can serve only the radio-frequency channels in its physical cell, and the hardware resources cannot be reallocated to address varied communication throughput needs in different cells. However, emerging wireless technology, such as cooperative multiple-input multiple-output (MIMO) technology, requires cooperation among BSs. In cooperative MIMO systems, assuming full BS cooperation, the multiple BSs can be viewed as a single BS with multiple geographically dispersed antennas. The uplink or downlink between mobile users and BSs can be modeled as a virtual MIMO channel [2]. Because it requires significant data sharing and exchange in the physical (PHY) layer, such communication cannot be supported efficiently by the traditional wireless access network architecture. Here, the term PHY layer refers to the signal processing technology for the radio signal transmission in a wireless network.

In this paper, we propose a new approach to system architecture for next-generation wireless access networks. This approach is called the wireless network cloud (WNC). With this concept, a radio head is decoupled from the BS by using remote radio heads (RRHs). The BS system is built on an open information technology (IT) architecture using software radio (SWR) technology. This approach allows BS equipment to be located in data centers. SWR uses general-purpose processors (GPPs) with multicore and multithread techniques to implement baseband processing, such as the PHY and media access control (MAC) layers. Both the PHY and MAC layers are implemented as software components. This approach addresses the issue of supporting multiple standards in the same BS. The upgrading of BS systems to different standards can be accomplished in software and does not require hardware upgrades.

The WNC may be regarded as a breakthrough combination of emerging technologies from both the wireless industry and the IT industry. These technologies include RRH, SWR, and cloud-computing technologies. The RRH allows the radio head and antenna to be separated from the BS. This flexibility allows RRHs to be installed at distributed remote towers. All the baseband units (BBUs) may be clustered as a BBU pool in a centralized location. SWR technology involves the implementation of all wireless baseband processing in software [3]. It creates a new trend in research, in which commodity IT platforms may be used in wireless BSs, replacing the traditional dedicated hardware platform design. As mentioned, in the WNC, RRH technology provides the possibility to support wireless signal processing in a centralized location, and SWR technology further enables all

centralized processing using platforms based on an open IT architecture. Cloud computing provides the capability for virtualization, resource provisioning, and dynamic resource allocation, among other features.

The distributed wireless communication system (DWCS) [3], along with concepts from [4], introduces the concept of distributed antennas with centralized computation to support the wireless access network. Ramamurthi [4] only provides an analysis of optical links for such a wireless system architecture. Without multicore technology, without standards to support RRH technology, and without 10-Gb/s Ethernet (10-GbE) and InfiniBand [5] products for reference, the author of [3] could only describe the ideal concept for the DWCS, with no detailed analysis of the various system issues. In contrast with the work presented in [3] and [4], the WNC is the first architecture to make use of cloud computing in a wireless access network system and the first one with emphasis on open IT architecture. Based on up-to-date technologies, in this paper, we also provide detailed analyses of system requirements.

In this paper, we present our vision of the future with respect to the WNC. In the remainder of this paper, we introduce some possible application scenarios associated with the WNC, to show the value proposition of the WNC with respect to mobile communications. A novel system architecture for the WNC is proposed to support the next-generation wireless access network. We also analyze the most important system requirements together with suggestions for future system design.

Application scenarios for the WNC


Our vision of the WNC involves a new deployment model and a business model for wireless access networks and derives new application scenarios that may enhance mobile communication systems in different ways. Let us consider several scenarios.

In Scenario 1, the WNC can be regarded as a resource pool to support various kinds of wireless access networks. When an operator plans to build a new wireless access network, it may lease the required computation and transmission resources from cloud service providers who own and operate active cloud-computing systems to deliver service to third parties. Initial investments can be lowered, and the time-to-market can be shortened.

Scenario 2 involves a mobile virtual network operator (MVNO), which is a popular operation model in which service providers do not have their own licensed spectrum or the infrastructure required to provide a mobile telephone service. Service providers simply lease the spectrum and infrastructure or directly buy the time from physical mobile operators who actually own the spectrum and infrastructure. With the WNC, physical mobile operators can provide the resources to MVNOs on demand.

In Scenario 3, the wireless traffic will be quite different in the daytime and nighttime for some districts. For example,


in an industry park, the wireless traffic will be heavy in the daytime but very light in the nighttime, and the reverse trend is often seen in a residential area. In the WNC, through dynamic allocation, the utilization of computation resources may be increased significantly.

Scenario 4 involves a rural area with rough or difficult-to-access terrain. Here, it is difficult and costly to construct an environment that satisfies industry requirements for housing the electronic equipment. With the WNC, the operators may obtain the computation resources from computation clouds that are built in a more reliable environment, with only the antenna towers set up in the outdoor rural environment.

In Scenario 5, to increase the wireless system performance, more researchers have turned their attention to collaboration of multiple BSs in the PHY-layer signal processing, such as in cooperative MIMO mentioned earlier in this paper. This will require BS systems to be flexible with respect to data sharing and communication in different stages within the PHY layer among BSs. In the traditional model with proprietary hardware designs, there is no data interface among PHY-layer modules of different BSs. Compared with the traditional model, the WNC centralizes the computation resources of BSs using general IT platforms so that it can efficiently support different kinds of cooperative digital signal processing among BSs.

In Scenario 6, the wireless access network will be constructed using open IT platforms in the WNC. The applications in the service layer of the mobile network can be run on the same physical nodes as the wireless access network. Therefore, the service layer application can obtain more subscriber information through the integration with the virtual BS and provide lower latency due to much shorter distances between the service layer and the access network.
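The utilization argument in Scenario 3 can be illustrated with a small calculation. The sketch below is only an illustration: the load figures for the two districts are hypothetical numbers chosen for the example, not measurements from this paper, and the function names are illustrative. It compares the capacity needed when each district is provisioned for its own peak with the capacity needed when both districts share one pool.

```python
# Hypothetical load per period (arbitrary units); not data from the paper.
def dedicated_capacity(loads):
    """Capacity if every district is provisioned for its own peak."""
    return sum(max(district) for district in loads)


def pooled_capacity(loads):
    """Capacity if all districts share one pool sized for the combined peak."""
    return max(sum(period) for period in zip(*loads))


# Four sample periods: night, morning, afternoon, evening.
industry_park = [1, 8, 9, 2]
residential = [6, 2, 2, 9]

dedicated = dedicated_capacity([industry_park, residential])
pooled = pooled_capacity([industry_park, residential])
print(dedicated, pooled)
```

With these sample figures the shared pool needs less capacity than the two dedicated installations combined, because the two peaks do not coincide. The size of the saving depends entirely on the assumed load profiles.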

Description of the WNC


Architecture of the WNC

Figure 1 shows the architecture of the WNC at the conceptual level, divided into several parts. The radio front end (RFE) (Figure 1, left) includes the antenna, RRH, and antenna tower. Usually, RFEs need to be deployed in a mobile cell. Digital-to-analog (D/A) and analog-to-digital (A/D) converters are integrated into the RRH. The link between the RFE and the BBU is a digital interface. Both the Reference Point 3 specification of the Open Base Station Architecture Initiative [6] and the Common Public Radio Interface (CPRI) [7] have defined standards for this interconnection. High-end field-programmable gate arrays (FPGAs) are often integrated within RRHs. To reduce the cost and enhance the flexibility, high-speed serial link (HSSL) standards will be adopted (e.g., 10-GbE or InfiniBand) to carry the CPRI streams over optical links. Either the FPGA or another chipset should support 10-GbE or InfiniBand in the RRH. The RRH also contains analog devices, such as the variable-gain amplifier and the

voltage-controlled oscillator (VCO), to control the receive and transmit power and to compensate for frequency and timing offsets among BSs. These parameters will be controlled by the system control layer in virtual BSs.

The term RB link refers to the link between the RFE and the virtual BS pool. In our proposed architecture, this link is able to support the topology of multiple-point-to-multiple-point (MM) models. Switches will exist between RFEs and the virtual BS pool. Currently, the CPRI is popular in actual deployments. However, being designed in the time-division multiplexing (TDM) mode, the CPRI does not support any switching in the MM model. Therefore, we propose the use of 10-GbE or InfiniBand technology in this architecture to carry the CPRI protocol over the optical link. The RB link may be presented as the three-layer structure shown in Figure 2. Commercial off-the-shelf 10-GbE or InfiniBand switches can be used to support the MM switching topology between RFEs and the virtual BS pool.

The virtual BS pool refers to a resource pool containing all the required processing resources for BSs. It supports the entire digital signal processing in the PHY layers and the packet processing in the MAC layers of traditional BSs. Hardware platforms are based on open IT architectures, and the interconnections among them use standardized interface technologies widely used in IT infrastructure, for example, GbE, 10-GbE, InfiniBand, and PCI Express** (PCIe**) [8]. The BS system is enabled by SWR technology. All the functionalities will be presented as a software package, which will be called a virtual BS. With virtualization technology, hardware resources may be allocated dynamically to virtual BSs, according to the complexity of different wireless standards, the number of subscribers, throughput, etc.
Based on the characteristics of BS processing, we further divide the virtual BS pool into a PHY-layer part (virtual BS-PHY), a MAC-layer part (virtual BS-MAC), and the interconnection between them (PHY-MAC link). A detailed analysis of this will be provided in the next section.

In every digital communication network, precise synchronization and timing are required for the reliable transmission of voice, video, and data. A wireless system with a time-division duplex (TDD) mode adds the requirement to synchronize the BSs to a common time reference to ensure that transmitters are able to synchronize their uplink and downlink timeslots to avoid interference [9]. In the WNC, the timing and synchronization system includes two parts, namely, the master timeserver to provide the accurate timing reference and the timing network to distribute the precise timing signal throughout the virtual BS pool and RFEs. To reduce the additional cost, the timing network will be constructed using the existing Ethernet/IP network, together with a timing protocol such as the IEEE 1588 Precision Time Protocol (PTP) [10]. The IEEE 1588 master clock could be installed in the master timeserver with the GPS time reference, and the IEEE 1588 slave clock may


Figure 1
Wireless network structure with the WNC. (RFE: radio front end; RRH: remote radio head; RB link: RFE-to-base-station pool link; BS-PHY: base station physical layer; BS-MAC: base station media access control layer; PHY-MAC link: link between the physical layer and the MAC layer.)

Figure 2
Layer structure of the RB link. (CPRI: Common Public Radio Interface; HSSL: high-speed serial link; BS: base station; RB link: RFE-to-base-station pool link.)


Figure 3
Structure of the virtual BS pool. (BS: base station; RRH: remote radio head; BS-PHY: base station physical layer; BS-MAC: base station media access control layer; BS-PHY-CH: base station physical layer channel; BS-PHY-CENTRAL: base station physical layer central; PHY-MAC link: link between the physical layer and the MAC layer.) (a) Basic model. (b) Large-scale clustering model. (c) Cooperative processing model.

be a software implementation in every virtual BS. Using the RB link, the virtual BS will relay the master clock to its RFE in software, and an IEEE 1588 PTP slave implementation using FPGAs or application-specific integrated circuits (ASICs) needs to be embedded in each RRH. The timing and frequency offsets it measures will also be used to control the VCO device in the RRH.

Virtual BS pool

Several different choices are possible for the construction of the virtual BS pool. Figure 3 shows three kinds of structures that we are considering. Figure 3(a) is the most basic structure. The software packages of virtual BS-PHY and virtual BS-MAC are combined into one virtual BS in software. The resource is dynamically allocated for each virtual BS. This structure inherits characteristics from legacy BS designs, in which there is one MAC component for each PHY-layer module. This structure is the starting point for constructing WiMAX BS prototypes using an IT platform.

Figure 3(b) illustrates the structure proposed for better performance efficiency in a large virtual BS pool, where virtual BS-PHY and virtual BS-MAC will be separated. The resource is separately allocated for virtual BS-PHY and virtual BS-MAC. Furthermore, one virtual BS-MAC can serve multiple virtual BS-PHY components. The behaviors of the PHY and MAC layers are quite different. PHY-layer processing requires vector execution techniques to accelerate the signal processing, such as the single instruction, multiple data (SIMD) unit of the Cell Broadband Engine** (Cell/B.E.**) processor, the Streaming SIMD Extensions (SSE) in Intel processors, and the vector multimedia extension (VMX) in IBM POWER* processors. MAC-layer processing requires multithread architecture and

some network accelerators for high-efficiency packet and protocol processing. Furthermore, the MAC layer only occupies 5% to 10% of the processing resources in a BS. To support a larger system with numerous virtual BSs, we could separate the PHY and MAC layers onto different platforms and merge the MAC-layer modules of several BSs into one virtual BS-MAC. Not only could this concept enhance the hardware efficiency, but it could also enable flexible collaboration in MAC layers among BSs, such as handoff and spectrum resource management. Multithread processors designed for network applications, such as the IBM wire-speed processor (WSP) [11] (with 16 cores and 64 hardware threads) and the Raza XLR** processor [12] (with eight cores and 32 hardware threads), could match the multithread program model in the MAC layer. Moreover, they have network accelerators and cryptographic coprocessors to accelerate MAC-layer processing. Thus, they can provide high performance to support a large virtual BS-MAC node in the WNC.

We propose the structure in Figure 3(c) for the virtual BS pool to support cooperative radio processing. In this structure, virtual BS-PHY is separated into virtual BS-PHY-CH and virtual BS-PHY-CENTRAL. In wireless applications, cooperative processing among BSs at the PHY layer has become a strong trend in academic research and standards. For example, coordinated multiple-point transmission and reception (CoMP), or the so-called cooperative MIMO, is the main technical feature under discussion in the LTE-Advanced communication standard [13], which could significantly improve system capacity. In this PHY-layer system, the components close to the RB link side, such as synchronization, fast Fourier transform and inverse FFT


Table 1

Million-instructions-per-second (MIPS) requirements for some key wireless standards on BSs. (GSM: Global System for Mobile communication; W-CDMA: Wideband Code-Division Multiple Access; WiMAX: Worldwide Interoperability for Microwave Access.)

what kind of GPP platforms can provide computation efficiency similar to that of proprietary hardware platforms to support wireless BSs? Before answering this question, we first introduce the workload model of the PHY layer in wireless BSs.
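To give a feel for the scale of the computation involved, the sketch below scales the per-Mb/s instruction rates quoted in the text for Table 1 (1.5 GIPS for WiMAX and 6.9 GIPS for GSM per 1 Mb/s of throughput, under serial processing) to a target throughput. Linear scaling with throughput is an assumption made only for this example, and the 20-Mb/s target is a sample value.

```python
# GIPS needed per 1 Mb/s of PHY throughput, as quoted in the text for Table 1.
GIPS_PER_MBPS = {"WiMAX": 1.5, "GSM": 6.9}


def required_gips(standard, throughput_mbps):
    """Serial instruction rate for a throughput, assuming linear scaling."""
    return GIPS_PER_MBPS[standard] * throughput_mbps


# Sample target throughput; chosen for illustration only.
print("WiMAX at 20 Mb/s:", required_gips("WiMAX", 20.0), "GIPS")
print("GSM at 1 Mb/s:", required_gips("GSM", 1.0), "GIPS")
```

Under this assumption, a modest WiMAX data rate already calls for tens of GIPS, which is why the degree of parallelism in the PHY workload matters for GPP platforms.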

Workload model of the PHY layer in BSs


The workload of the PHY layer can be summarized using the categories of algorithm- and system-level behaviors. Figure 4 shows the algorithm-level behavior in the PHY layer of current popular orthogonal frequency-division multiplexing (OFDM) and MIMO wireless systems. Here, the rectangle labeled space-time encoding performs encoding of a data stream across a number of antennas used in wireless communication systems, and the function labeled interleave is used in digital data transmission technology to protect the transmission against burst errors. There are two main processes in the BS PHY layer. One is a downlink process, which is used to handle the signals to be sent out from the BS to subscribers. The other is an uplink process, which is used to handle the received signal sent by subscribers. According to different wireless standards, data-dependent processing will occur within one, two, or three OFDM symbols. Here, an OFDM symbol block is defined as the minimum unit with data-dependent processing, which will be equal to one OFDM symbol, or the combination of two or three OFDM symbols in different standards.

As summarized in [19], both uplink and downlink processes consist of multiple signal processing algorithm kernels connected together, and data are sequentially streamed through kernels. Most of these kernels are signal processing algorithms with abundant data-level parallelism. This kind of parallelism occurs within the processing of each OFDM symbol block and could be called in-block parallelism. A number of OFDM symbol blocks will be combined into an uplink or a downlink frame. Usually, there is a strict time duration for a frame, such as 5, 2.5, or 1 ms. This means that the downlink or uplink process of an OFDM symbol block should be finished within the duration of a frame.

The system-level behavior is shown in Figure 5. Usually, the antenna system will be designed to support different angles, such as 360°, 180°, or 120°. Accordingly, one mobile cell of 360° supported by a wireless BS can be divided into one, two, or three sectors. It is assumed that N sectors will be supported in a wireless BS. There is no data dependence among the N sectors in the PHY layer. Within one sector, there will be uplink and downlink logics. These can be decoupled and individually processed in parallel. Thus, the PHY workload within a wireless BS can inherently be divided into 2N logics. These logics can be executed in parallel on a platform, which we call link-logic parallelism in this paper. Within an uplink or a downlink logic, there is no data dependence among the frames, as well as no data dependence among the OFDM symbol blocks within a frame. Thus, each uplink or downlink logic has multiple process threads executed

(FFT/IFFT), and channel estimation, need to process the baseband signals of each RB link. We group these components into a virtual BS-PHY-CH module for each RB link channel. Additionally, a centralized processing module will be required to complete the joint processing for signals from all BSs within one CoMP group. The virtual BS-PHY-CENTRAL module plays this role in the structure. The resources will be allocated dynamically for each virtual BS-PHY-CH module and virtual BS-PHY-CENTRAL module.

System requirements
In this section, we discuss the most important system requirements.

Computation requirement to support virtual BSs

The Software Defined Radio Forum (SDRF) has defined five tiers of solutions, namely, hardware radio, software-controlled radio, software-defined radio (SDR), ideal SWR, and ultimate SWR [16]. SDRF defines an SDR as one that implements a specified range of capabilities through elements that are software reconfigurable, which is largely employed in today's BS design. If the functionalities of the radio can be redefined entirely in software, this would be the ideal implementation of SWR [17]. In the WNC, the virtual BS using open IT architecture with a GPP is a kind of ideal SWR solution.

The wireless BS has MAC and PHY layers. When compared with its corresponding PHY layer, MAC-layer processing will require less than 10% of the computation resources of the whole BS. The PHY layer will occupy the remaining 90% or more. In this section, we focus on the analysis of the PHY layer. Assuming there is only serial processing, Table 1 shows the required instructions for PHY layers in some typical wireless standards. To handle 1-Mb/s data throughput, the required processing capability will range from 1.5 giga-instructions per second (GIPS) for WiMAX to 6.9 GIPS for GSM on the BS. To meet such high levels of computation requirements, the design of most BSs relies on different kinds of dedicated hardware components (such as digital signal processing chipsets, FPGAs, and ASICs) [18]. Open IT architectures with GPPs can provide much higher flexibility for platforms being shared among different standards, different network functionalities, and different deployment requirements. However, several questions may be raised. For example,


Figure 4
Algorithm-level behavior of the PHY layer in a MIMO OFDM wireless BS. (I/Q: in-phase/quadrature; A/D: analog to digital; D/A: digital to analog; FFT: fast Fourier transform; IFFT: inverse fast Fourier transform; RRH: remote radio head; MAC: media access control.)

Figure 5
System-level behavior of the PHY layer in a wireless BS. (PHY-MAC link: link between the physical layer and the media access control layer; RB link: RFE-to-base-station pool link.)

Figure 6
In-link-logic parallelism. (OFDM: orthogonal frequency-division multiplexing.)

in parallel, as shown in Figure 6. One component dispatches the OFDM symbol blocks to the process threads. After processing, the output data blocks are merged and reordered to keep their sequence in the uplink or downlink logic. This kind of parallelism is called in-link-logic parallelism in this paper.
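The dispatch, process, and reorder pattern described above can be sketched as follows. This is a minimal illustration and not the implementation used in this work: `process_block` is a hypothetical stand-in for the PHY kernels, and Python threads are used only to show the structure (a real PHY implementation would run native code on separate cores or hardware threads).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_block(block):
    """Hypothetical stand-in for the PHY kernels applied to one OFDM symbol block."""
    return [2 * sample for sample in block]


def run_link_logic(blocks, workers=4):
    """Dispatch blocks to worker threads, then merge results in original order."""
    results = [None] * len(blocks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tag every submitted block with its sequence number.
        futures = {pool.submit(process_block, b): i for i, b in enumerate(blocks)}
        # Blocks may finish in any order; the stored index restores the sequence.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


blocks = [[i, i + 1] for i in range(6)]
print(run_link_logic(blocks))
```

Keeping the sequence number with each block is what allows the merge step to restore the original order regardless of which thread finishes first.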

In total, there are three kinds of parallelism in the workload model of the PHY layer, namely, link-logic parallelism, in-link-logic parallelism, and in-block parallelism. It is obvious that the workload of the PHY layer is an "embarrassingly parallel" workload with rare data dependences [20]. An embarrassingly parallel workload is


Table 2 System performance projection for different processor configurations. (PHY: physical; SIMD: single instruction, multiple data; SMT: simultaneous multithreading.)

defined in the literature as a workload for which little or no effort is required to separate the workload into parallel tasks.
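The first of the three levels, link-logic parallelism, follows directly from the sector structure: a BS with N sectors has 2N independent logics. The short sketch below simply enumerates them; the function name and the tuple representation are illustrative choices, not part of the system described here.

```python
from itertools import product


def link_logics(num_sectors):
    """Enumerate the independent uplink/downlink logics of an N-sector BS."""
    return list(product(range(num_sectors), ("uplink", "downlink")))


logics = link_logics(3)
print(len(logics), logics)
```

Each entry can be scheduled on its own, since the text states that there is no data dependence among sectors or between the uplink and downlink logics of a sector.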

Consideration with respect to multicore, SMT, SIMD, and hardware accelerators


Today, multicore, simultaneous multithreading (SMT), and SIMD processing are the three main technologies widely utilized to provide high parallel performance with lower power [21]. Usually, they are used in combination. For example, a multicore processor can have two or four hardware threads in each core, and the core will also support SIMD instructions. However, not every kind of workload can get significant benefits from all of these technologies being considered in modern architecture design. SIMD processing requires workloads with certain characteristics relating to data parallelism at the algorithm level. With SMT technology, multiple threads of the same CPU core will share the same set of execution units. If one thread stalls, perhaps waiting for the memory I/O, the execution unit continues to execute the other thread, resulting in a more fully utilized CPU.

We have tried a WiMAX (802.16e) BS solution using an IBM QS21 blade with two Cell/B.E. processors on the blade. We found that one QS21 blade with dual Cell/B.E. processors can support a computation corresponding to 60-Mb/s data throughput for both uplink and downlink at the same time [17]. This is a competitive performance result provided by a single-board platform. We use this as the example for the analysis in this section.

Link-logic parallelism is used for parallel processing among different uplink or downlink logics and among different sectors. Without data dependence and sequence limitation, the uplink and downlink logics can be wrapped into stand-alone images. Through the virtualization provided by cloud computing, these images can be assigned to the virtual CPUs with the required computation resources.

In-link-logic parallelism is used to parallelize the processing of all OFDM symbol blocks belonging to the same uplink or downlink logic. All of the OFDM symbol blocks should keep the original sequence after being processed in the PHY layer, and reordering with low delay is required. Therefore, it is useful that the computation resources (cores or hardware

threads) for in-link-logic parallelism share the same system memory. The multicore and SMT architectures may directly benefit the in-link-logic parallelism.

The in-block parallelism in the PHY layer is mainly used to accelerate the algorithm-level process for each OFDM symbol block. Obviously, it takes advantage of SIMD processing in the architecture design. The WiMAX PHY layer using the Cell/B.E. platform [17] is used as an example. We implemented all the components shown in Figure 4. We also used a convolutional encoder and a Viterbi decoder for forward error correction (FEC; with a constraint length of 7 and a code rate of 1/2), along with 16-quadrature amplitude modulation (16QAM), a 1,024-point FFT, and a 2 × 2 MIMO technique. The statistics of our optimized WiMAX PHY layer indicate that only 29.6% of the instructions in the PHY layer are SIMD instructions. However, we need to further explore how these SIMD instructions will have an impact on the performance. The bit width of different operands will result in different parallelism for a given SIMD instruction. Thus, the operands of each module should be taken into account. For the entire PHY-layer stack, the performance enhancement due to the 128-bit SIMD instruction is a factor of 3.7. A 128-bit width is the most popular design in the SIMD execution unit today, as exemplified by the SIMD unit in the Cell/B.E. synergistic processor unit, the SSE2/3/4 in Intel processors, and the VMX in POWER processors. However, there is still some research with respect to wider SIMD units, e.g., SODA in [19] for a 256-bit width. Thus, the SIMD engine will be important for supporting the PHY layer.

In various wireless standards, the channel decoder is always the most computationally intensive component in the PHY layer. Viterbi, Turbo, and low-density parity-check (LDPC) are the three major decoders used in a wireless PHY layer. The complexities of Turbo and LDPC are almost five times that of Viterbi.
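The remark above that operand bit width determines the parallelism of a SIMD instruction can be made concrete with a short calculation. The sketch below is illustrative only: it gives the number of operands that fit in one register and the resulting instruction count for one element-wise pass over a vector. It is an upper bound on per-instruction parallelism and does not by itself reproduce the measured factor of 3.7, which also depends on the instruction mix. The 1,024-element vector matches the FFT size quoted above; the operand widths are sample values.

```python
SIMD_WIDTH_BITS = 128  # register width discussed in the text


def lanes(operand_bits, simd_bits=SIMD_WIDTH_BITS):
    """Number of operands one SIMD instruction processes at once."""
    if simd_bits % operand_bits != 0:
        raise ValueError("operand width must divide the SIMD width")
    return simd_bits // operand_bits


def simd_ops(num_elements, operand_bits, simd_bits=SIMD_WIDTH_BITS):
    """SIMD instructions needed for one element-wise pass over a vector."""
    per_op = lanes(operand_bits, simd_bits)
    return -(-num_elements // per_op)  # ceiling division


for bits in (8, 16, 32):
    print(bits, lanes(bits), simd_ops(1024, bits))
```

Halving the operand width doubles the number of lanes, so modules that can work on narrow fixed-point samples gain more from the same SIMD unit than modules that need wide operands.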
To improve the overall system efficiency without losing flexibility, a hardware accelerator for the channel decoder can also be considered. Table 2 compares different techniques (multicore, SMT, SIMD, and hardware accelerator techniques) for virtual BS workloads. Assuming a basic processor with 16 cores, a single hardware thread on each core,

4:8

Y. LIN ET AL.

IBM J. RES. & DEV.

VOL. 54

NO. 1

PAPER 4

JANUARY/FEBRUARY 2010

and no SIMD engine, we denote by P1 the throughput of the WiMAX PHY layer (with Viterbi) that this processor supports. Performance projections can then be made for different configurations, such as multicore with SMT, multicore with an SIMD engine, and multicore with a hardware decoder accelerator. For multicore with SMT, we use 2.8 as the scaling factor for a four-thread SMT architecture (such as the A2 core in the WSP). This factor was obtained by testing a signal processing workload on the Mambo [22] simulator for the A2 SMT core.

From Table 2, we can draw conclusions for a processor that uses multicore and SMT techniques simultaneously, which will become mainstream in modern processor design. With either the Viterbi or the Turbo decoder, both SIMD and hardware accelerator techniques can significantly increase performance. SIMD techniques are more flexible across different kinds of wireless algorithms, but they consume more power. A hardware accelerator in an ASIC has higher power efficiency, but it supports only a dedicated function, such as a Viterbi or a Turbo decoder; once the channel decoder in a wireless standard changes, the accelerator must also change. As a tradeoff, an FPGA, which can be reconfigured as required, can be considered to support a channel decoder in hardware.

Throughput requirement on the RB link
After a series of PHY-layer signal processing steps, such as channel coding (1/2 coding rate), modulation (16QAM), pilot mapping, and space-time coding, the size of the original user data increases significantly before transmission. For example, with our selected WiMAX parameters, a 20-Mb/s data stream from the MAC layer becomes 1.3 Gb/s when transmitted to a 16-bit D/A converter. The result is similar for LTE. If both MIMO and beam forming (2 × 4 antenna array) are used, this number increases by a factor of 4.
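The arithmetic behind these figures can be sketched as follows. The three constants are the values quoted in the text; the function names are hypothetical, and applying the same expansion ratio to other user rates is a simplifying assumption made only for illustration.

```python
USER_RATE_MBPS = 20.0     # MAC-layer data stream quoted in the text
LINK_RATE_MBPS = 1300.0   # rate at the 16-bit D/A converter quoted in the text
ANTENNA_FACTOR = 4        # MIMO plus beam forming with a 2 x 4 array (from the text)


def expansion_ratio(user_mbps: float, link_mbps: float) -> float:
    """Ratio of RB-link traffic to MAC-layer user traffic."""
    return link_mbps / user_mbps


def rb_link_rate_mbps(user_mbps: float, ratio: float, antenna_factor: int = 1) -> float:
    """RB-link rate, assuming the expansion ratio scales linearly (an assumption)."""
    return user_mbps * ratio * antenna_factor


ratio = expansion_ratio(USER_RATE_MBPS, LINK_RATE_MBPS)
single_antenna = rb_link_rate_mbps(USER_RATE_MBPS, ratio)
antenna_array = rb_link_rate_mbps(USER_RATE_MBPS, ratio, ANTENNA_FACTOR)

print(f"expansion ratio: {ratio:.0f}x")
print(f"single antenna path: {single_antenna / 1000:.1f} Gb/s")
print(f"with MIMO and beam forming: {antenna_array / 1000:.1f} Gb/s")
```

Under these assumptions, each bit of user data turns into roughly 65 bits on the RB link, and the antenna-array case reaches about 5.2 Gb/s for a single 20-Mb/s stream.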
Such a throughput requirement must be satisfied by each of the three layers in the RB link shown in Figure 2.

Optical cabling layer
The optical cabling layer can be regarded as the PHY layer of an RB link. With wavelength-division multiplexing (WDM) technology, high-throughput optical communication has been realized over long distances and at low cost. For example, in 2008, a commercial product featuring 40 Gb/s with dense WDM technology became available. Thus, providing high throughput over long distances for RB links poses no problem.

HSSL layer
The HSSL layer is the MAC layer used to carry the CPRI protocol. Today, both 10-GbE and InfiniBand are popular and mature commercial technologies. Moreover, both 10-GbE and InfiniBand support remote direct memory access (RDMA) [23] capabilities with very low CPU overhead. This is very helpful when a GPP is used to support very high data throughput.

To illustrate this point, we designed an experiment based on our optimized WiMAX PHY stack using a Cell/B.E. blade. The computation capability of the two Cell/B.E. processors on a QS21 blade can support 75.2 Mb/s of user data in the downlink and 65.8 Mb/s of user data in the uplink simultaneously, which requires a total data throughput of 6.2 Gb/s at the RB link. In Figure 7, the left two bars show the computation-only scenario, in which all the RB link data are kept in memory rather than transferred through an Ethernet interface. Thus, it can support the full user data rate in both the uplink and the downlink. The second result is for computation plus data transfer over a 10-GbE interface without RDMA. Compared with the computation-only case, it has a 68% performance loss. Here, the PPU core becomes the bottleneck because it must respond to massive numbers of interrupts and handle the network stack. The third result is for computation plus data transfer over a 10-GbE interface with RDMA. It has only a 4.5% performance loss compared with the computation-only case.

Figure 7
System throughput of the WiMAX BS-PHY stack using a Cell/B.E. blade. (WiMAX: Worldwide Interoperability for Microwave Access.)

With RDMA, data can be moved directly between the memory of the RRH and the virtual BS, or between two virtual BSs, without involving the operating system of either one. Only a small overhead is required for the zero-copy protocol used in RDMA. With RDMA capability, there is no problem in supporting the point-to-point RB link at the HSSL layer. However, this requires RDMA enablement on the RRH side, which current RRH products do not support. Aside from RDMA technology, there are other techniques that provide high throughput, low latency, and low cost for a host on a 10-GbE interface.
For example, the network accelerators adopted on some processors, such as the WSP and Raza processors, can directly handle the high-throughput streams between a hardware accelerator and memory. Some network interface cards (NICs) support a zero-copy OS-bypass functionality for Ethernet streams [24]. With these technologies, the system can achieve the required data rate at the HSSL layer with little CPU overhead and without any special requirement on the RRH.

CPRI protocol layer
The high-throughput requirement generates two challenges for the CPRI protocol layer. First, according to version 4.0 of the CPRI protocol, the highest CPRI line bit rate is 3.072 Gb/s [7]. At this rate, a single CPRI link could not support an LTE RRH with more than three antennas. Thus, the CPRI frame structure should be revised to support a higher data rate. Second, in a WNC with open IT architecture designs, the CPRI protocol will be handled on a virtual BS. Without a special accelerator embedded in the IT servers, a very efficient implementation is required to process CPRI packets with low overhead on a GPP.

Challenge of timing and synchronization
In traditional BS systems, a TDM network and a GPS-based receiver are the two major technologies used for timing and synchronization, but they are not suitable in a WNC. A TDM network implements timing mechanisms within the physical network itself, creating a reliable end-to-end synchronization chain. Running a traditional TDM backhaul to a BS, however, can account for 30% to 50% of operating expenses. With a GPS receiver, it may be difficult to ensure that the antenna for each virtual BS is installed with a clear view of the sky [25]. For the WNC, we need to find ways to synchronize clocks through an Ethernet/IP network. The most common technique is the Network Time Protocol (NTP), which is widely used in the Internet and achieves accuracy in the millisecond range [26]. However, this does not satisfy the high-precision requirement in wireless BS systems.
In TDD systems, such as WCDMA and WiMAX, the required time accuracy among BSs is better than 1 µs. As introduced in the WNC architecture, the IEEE 1588 PTP will be employed to construct a precise timing network using the IP/Ethernet network. The IEEE 1588 PTP provides a means by which networked computer systems can agree on a master clock reference time, as well as a means by which slave clocks can estimate their offset from the master clock time [27]. The time and offset information is carried by PTP messages between master and slave computers. Thus, the accuracy of the PTP system is affected by the variation of the latencies in PTP messaging. Reference [28] provides a good summary of this topic. The closer to the hardware transmission or reception of a message the timestamp is taken, the smaller the latency variation and, consequently, the better the accuracy.
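The offset estimate and its sensitivity to latency variation can be sketched with the standard delay request-response calculation of IEEE 1588. The class and function names below, the microsecond units, and the numbers in the simulation are illustrative assumptions; the sketch is not taken from the implementations in [27] or [28].

```python
from dataclasses import dataclass


@dataclass
class PtpExchange:
    t1: float  # master sends Sync (master clock)
    t2: float  # slave receives Sync (slave clock)
    t3: float  # slave sends Delay_Req (slave clock)
    t4: float  # master receives Delay_Req (master clock)


def offset_and_delay(x: PtpExchange) -> tuple[float, float]:
    """Estimate slave-minus-master offset and one-way delay, assuming symmetric paths."""
    master_to_slave = x.t2 - x.t1
    slave_to_master = x.t4 - x.t3
    offset = (master_to_slave - slave_to_master) / 2
    delay = (master_to_slave + slave_to_master) / 2
    return offset, delay


def simulate(true_offset: float, d_ms: float, d_sm: float,
             t1: float = 0.0, turnaround: float = 100.0) -> PtpExchange:
    """Build the four timestamps for given path latencies (hypothetical values)."""
    t2 = t1 + d_ms + true_offset      # arrival time read on the slave clock
    t3 = t2 + turnaround              # slave replies after a fixed turnaround
    t4 = t3 - true_offset + d_sm      # arrival time read on the master clock
    return PtpExchange(t1, t2, t3, t4)


# Symmetric latency: the estimate recovers the true offset (5.0 us here).
print(offset_and_delay(simulate(true_offset=5.0, d_ms=2.0, d_sm=2.0)))
# Asymmetric latency: the error equals half the latency difference.
print(offset_and_delay(simulate(true_offset=5.0, d_ms=2.0, d_sm=4.0)))
```

In the second case, a 2-µs difference between the two directions shifts the estimated offset by 1 µs, which is already at the level of the accuracy requirement quoted above. This is why timestamping close to the hardware matters.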

With a software implementation, the accuracy can be within 10 µs [27, 28]. In particular, as indicated in [28], with an implementation in a device driver, the accuracy is better than 1.8 µs both for a direct connection of two nodes and through a switched network. However, when complex applications and network I/O coexist on the platform, the challenge is how to take the timestamp close to the hardware in software. In [28], the timestamp is stored when the rx_frame or tx_done interrupt is raised. However, when other applications generate very large data flows, the interrupt handler needs some mechanism to filter out the timestamps of different protocols, and the accuracy is strongly affected.

Here, we consider making effective use of multicore GPPs with network accelerators (such as the IBM WSP and the Raza XLR series) to provide a better software solution. For example, in a multicore setting, a dedicated core (e.g., core 1) can be used to record the timestamp, and all other PTP processing can be handled by other cores shared with other applications. Without scheduling of other tasks on core 1, there will be little variance in its software processing cycles. When a network accelerator is available, such as the wire-speed processor host Ethernet adaptor (wspHEA) of the WSP or the network accelerator on Raza processors, the accelerator provides a set of hardware assist functions to offload the network stack handling from the processor cores. The network accelerator can parse the PTP frames and dispatch them into the queue for core 1. With this hardware assistance, the data flows of other applications have no significant impact. Moreover, the processing steps that precede timestamp recording in core 1 are executed in hardware, so the time variance they introduce is small.
Aside from the software implementation, some hardware implementations embed the IEEE 1588 PTP functionality directly into the Ethernet PHY component, for example, the PHYTER Ethernet transceivers from National Semiconductor [29]. Thus, an Ethernet PHY component with IEEE 1588 enablement could be considered for a low-cost system design.
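The dispatch step performed by a network accelerator, as described above for the multicore software solution, can be modeled in software as follows. This is a minimal sketch under stated assumptions: the `Dispatcher` class and its queue names are hypothetical, frames are untagged Ethernet frames, and only PTP carried directly over Ethernet (EtherType 0x88F7) is recognized. PTP over UDP/IP and VLAN-tagged frames would need additional parsing. It does not represent the interface of the wspHEA or of any Raza accelerator.

```python
from collections import deque

ETHERTYPE_PTP = 0x88F7  # EtherType assigned to PTP over IEEE 802.3 Ethernet


def ethertype(frame: bytes) -> int:
    """Read the EtherType field of an untagged Ethernet frame."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    return int.from_bytes(frame[12:14], "big")


class Dispatcher:
    """Software model of a classifier that isolates PTP frames (hypothetical)."""

    def __init__(self) -> None:
        self.ptp_queue = deque()   # served only by the dedicated timestamp core
        self.data_queue = deque()  # served by the cores shared with applications

    def dispatch(self, frame: bytes, rx_timestamp: float) -> None:
        # The timestamp is attached at classification time, before any
        # application traffic can delay the frame in a shared queue.
        if ethertype(frame) == ETHERTYPE_PTP:
            self.ptp_queue.append((rx_timestamp, frame))
        else:
            self.data_queue.append(frame)


d = Dispatcher()
d.dispatch(bytes(12) + b"\x88\xf7" + bytes(34), rx_timestamp=1.0)  # PTP frame
d.dispatch(bytes(12) + b"\x08\x00" + bytes(46), rx_timestamp=2.0)  # IPv4 frame
print(len(d.ptp_queue), len(d.data_queue))
```

Keeping PTP frames in their own queue means the dedicated core never has to filter bulk data traffic, which is the source of the timestamp variance discussed above.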

Conclusion
The WNC is a new architecture proposed for the next-generation wireless network. The WNC involves two important ideas. First, open IT architectures will replace today's proprietary hardware design in a BS system. Second, cloud-computing concepts are used in building the wireless access network. As analyzed in this paper, the WNC can provide operators with unprecedented flexibility in building a mobile network with a lower investment risk, and it matches the evolutionary trend of next-generation wireless systems. Based on this architecture, we have discussed the most important system requirements and offered some suggestions. To meet the computational requirements, SIMD techniques or reconfigurable hardware accelerators (e.g., FPGAs) for channel decoders can be considered in future multicore and SMT-based system designs. To relieve the CPU overhead of the high I/O throughput on an RB link, different methods should be considered for the virtual BS platform, such as RDMA over the RB link, processors with network accelerators, or an advanced Ethernet NIC with OS-bypass functionality. To construct an accurate timing network with the IEEE 1588 PTP, we propose a software implementation using multicore processors with a network accelerator. An Ethernet PHY with IEEE 1588 PTP enablement is also suggested for system design. Additionally, a virtual BS pool will require real-time support from the OS, hypervisor, and scheduler. Future research on such considerations will be important.

Acknowledgments
The authors would like to thank several friends and colleagues: S. Kalyanaraman, S. Daijavad, and J. M. Tracey from the IBM Research Division; R. K. Sabhikhi from the IBM Systems and Technology Group (for technology and trend discussions and valuable comments); and J. Chen and L. Chen from the China Research Laboratory, with whom they worked together on generating the first-of-a-kind WiMAX BS prototype on a blade server.

* Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
** Trademark, service mark, or registered trademark of WiMAX Forum, InfiniBand Trade Association, PCI-SIG, Sony Computer Entertainment, Inc., or Raza Microelectronics, Inc., in the United States, other countries, or both.

References
1. M. Hata, "Fourth generation mobile communication systems beyond IMT-2000," in Proc. 5th IEEE Asia-Pacific Conf. Commun., Beijing, China, Oct. 18-22, 1999, pp. 765-767.
2. Y. Liang and A. Goldsmith, "Symmetric rate capacity of cellular systems with cooperative base stations," in Proc. GLOBECOM, San Francisco, CA, Nov. 27-Dec. 1, 2006, pp. 1-5.
3. S. Zhou, M. Zhao, X. Xu, J. Wang, and Y. Yao, "Distributed wireless communication system: A new architecture for future public wireless," IEEE Commun. Mag., vol. 41, no. 3, pp. 108-113, Mar. 2003.
4. B. Ramamurthi, "Next-generation wireless system architecture with optical-fiber backhaul," in Proc. 1st Int. Workshop SRT, Beijing, China, Oct. 16-17, 2008.
5. Mellanox, Inc., ConnectX: Dual-Port InfiniBand Adapter Cards With PCI Express 2.0. [Online]. Available: http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=4&menu_section=41
6. OBSAI. [Online]. Available: http://www.obsai.com
7. CPRI Specification V4.0 (2008-6-30). [Online]. Available: http://www.cpri.info/downloads/CPRI_v_4_0_2008-06-30.pdf
8. PCI-SIG. [Online]. Available: http://www.pcisig.com
9. Synchronization Requirements for Next-Generation WiMAX Networks. [Online]. Available: www.chronos.co.uk/pdfs/tel/symmetricom/Sync_Requirements_for_NGN_WiMax_Networks.pdf
10. PTPd: Precision Time Protocol Daemon. [Online]. Available: http://ptpd.sourceforge.net/
11. H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson, "Introduction to the wire-speed processor and architecture," IBM J. Res. & Dev., vol. 54, no. 1, Paper 3:1-11, 2010, this issue.
12. XLR700 Processor Series Product Brief. [Online]. Available: http://www.razamicro.com/assets/docs/10768V200PB-PR_XLR700_Series_Throughput_Optimized_MIPS64_Multi_Core_Processor.pdf
13. N. Magnani, "3GPP presentation on the LTE-advanced as an IMT-advanced technology solution," in Proc. ITU-R IMT-Adv. Workshop (RP-080756), Seoul, Korea, Oct. 7, 2008. [Online]. Available: http://groups.itu.int/Portals/17/SG5/WP5D/Workshop/3GPP%20-%20LTE-Advanced%20as%20an%20IMT-Advanced%20Technology%20Solution%20-%20N%20Magnani.pdf
14. M. Alsliety and D. N. Aloi, "Signal processing choices and challenges for SDR in telematics," in Proc. ISSPA, Sharjah, United Arab Emirates, Feb. 12-15, 2007, pp. 1-4.
15. Q. Wang, D. Fan, and Y. H. Lin, "Design of BS transceiver for IEEE 802.16E OFDMA mode," in Proc. ICASSP, Las Vegas, NV, Mar. 30-Apr. 4, 2008, pp. 1513-1516.
16. [Online]. Available: http://www.sdrforum.org
17. L. Mitola, III, "Technical challenges in the globalization of software radio," IEEE Commun. Mag., vol. 37, no. 2, pp. 84-89, Feb. 1999.
18. L. Pucker and J. Holt, "Extending the SCA core framework inside the modem architecture of a software defined radio," IEEE Commun. Mag., vol. 42, no. 3, pp. 21-25, Mar. 2004.
19. Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, and T. Mudge, "SODA: A low-power architecture for software radio," in Proc. ISCA, Boston, MA, Jun. 17-21, 2006, pp. 89-101.
20. I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Boston, MA: Addison-Wesley Longman Publishing Co., Inc., 1995.
21. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed. San Francisco, CA: Morgan Kaufmann, 2002.
22. P. Bohrer, J. Peterson, M. Elnozahy, R. Rajamony, A. Gheith, R. Rockhold, C. Lefurgy, H. Sha, T. Nakra, R. Simpson, E. Speight, K. Sudeep, E. V. Hensbergen, and L. Zhang, "Mambo: A full system simulator for the PowerPC architecture," ACM SIGMETRICS Perform. Eval. Rev., vol. 31, no. 4, pp. 8-12, Mar. 2004.
23. RDMA Consortium. [Online]. Available: http://www.rdmaconsortium.org
24. P. Shivam, P. Wyckoff, and D. Panda, "EMP: Zero-copy OS-bypass NIC-driven gigabit Ethernet message passing," in Proc. ACM/IEEE Conf. Supercomputing, Denver, CO, Nov. 10-16, 2001, p. 49.
25. Timing and Synchronization in WiMAX Networks. [Online]. Available: http://www.chronos.co.uk/pdfs/tel/symmetricom/Timing_and_Sync_in_WiMAX_Networks.pdf
26. White Paper: Precision Clock Synchronization: The Standard IEEE 1588. [Online]. Available: http://hus.hirschmann.com/English/industrial-ethernet-products/Downloadcenter/technology-and-white-paper/white-papers/White_Paper_Precision_Clock_Synchronization/index.phtml
27. K. Correll, N. Barendt, and M. Branicky, "Design considerations for software only implementations of the IEEE 1588 Precision Time Protocol," in Proc. IEEE ISPCS, Zurich, Switzerland, Oct. 2005.
28. J. Kannisto, T. Vanhatupa, M. Hannikainen, and T. D. Hamalainen, "Software and hardware prototypes of the IEEE 1588 Precision Time Protocol on wireless LAN," in Proc. 14th IEEE Workshop LANMAN, Chania, Greece, Sep. 18-21, 2005, p. 6.
29. Precision PHYTER - 10/100 PHY With IEEE 1588 PTP Support. [Online]. Available: http://www.national.com/analog/interface/ethernet

Received December 12, 2008; accepted for publication January 16, 2009


Yonghua Lin IBM Research Division, China Research Laboratory, Beijing 100193, China (linyh@cn.ibm.com). Ms. Lin received the B.S. and M.S. degrees in information and communication from Xi'an Jiaotong University, Xi'an, China, in 2000 and 2003, respectively. She is a Research Staff Member with the System Software and Networking department, IBM China Research Laboratory. She subsequently joined IBM at the China Research Laboratory, where she has worked on multiple projects related to multicore processors (general multicore processors and network processors) and networking, including such topics as high-end routers, Internet Protocol television media gateways, and mobile base stations. She is leading an IBM Research group for appliance and infrastructure for next-generation wireless access networks. She is the author or coauthor of seven technical papers in related areas. She is the holder of 17 patents. She is a member of the Association for Computing Machinery. She was also the Chair of the Technical Program Committee of the International Software Radio Technology Workshop in 2008.

Ling Shao IBM Research Division, China Research Laboratory, Beijing 100193, China (shaol@cn.ibm.com). Mr. Shao received the B.S. and M.S. degrees from Fudan University, Shanghai, China, in 1997 and 2000, respectively. He joined IBM in 2000. He is a Senior Technical Staff Member and Senior Manager with the China Research Laboratory, responsible for the System Software and Network Appliance department. His major research focus is on applying multicore technology to multimedia and networking systems, and large-scale parallel programming. He is currently a member of the IBM Academy of Technology and the leader of the Technical Expert Council - Greater China, an IBM Academy of Technology affiliate in China. He is the holder of more than 20 patents.

Ravinder K. Sabhikhi IBM Systems and Technology Group, Research Triangle Park, NC 27709 USA (sravi@us.ibm.com). Mr. Sabhikhi received the B.S. and M.S. degrees in computer science and the M.S. degree in business management from North Carolina State University, Raleigh. He is a Distinguished Engineer at IBM. He is currently the Next-Generation Networks (NGN) Chief Technology Officer for Asia Pacific. In this role, he works with leaders in the industry on the strategy, architecture, and development of next-generation network solutions, Internet Protocol television, information management systems, soft switches, and wireless base stations. He understands the challenges and opportunities of convergence with respect to all IP networks from a business and technical point of view. He has been instrumental in developing NGN solutions based on commercial off-the-shelf technology to provide customers lower capital and operational expenditures. He has worked with leading telephone companies in the United States, Europe, and Asia Pacific on NGN topics. He has more than 25 years of experience in designing, developing, and leading teams of engineers in the development of world-class networking products, as well as system architecture. He has several patents in the field of networking.

Zhenbo Zhu IBM Research Division, China Research Laboratory, Beijing 100193, China (zhuzb@cn.ibm.com). Mr. Zhu received the B.S. and M.S. degrees in control theory from Tsinghua University, Beijing, China, in 2003 and 2006, respectively. He is a Researcher with the System Software and Networking department, IBM China Research Laboratory. He subsequently joined IBM at the China Research Laboratory, where he has worked on the workload study and optimization of signal processing algorithms and networks. He is the author or coauthor of six technical papers. He is the holder of four patents.

Qing Wang IBM Research Division, China Research Laboratory, Beijing 100193, China (wangqing@cn.ibm.com). Dr. Wang received the B.S. degree in electronic engineering and the M.S. degree in control engineering from Northwestern Polytechnic University, Xi'an, China, in 1992 and 1999, respectively, and the Ph.D. degree in electronic engineering from Nanyang Technological University, Singapore, in 2004. She is a Research Staff Member with the Network and Workload Appliance Team, IBM China Research Laboratory (CRL). She subsequently joined IBM CRL, where she has worked on embedded systems, media processing, and system design for wireless communications. She is the author or coauthor of 12 technical papers. She is the holder of two patents. She is a Senior Member of the China Institute of Electronics.

