
The 9th International Conference on Computer Supported Cooperative Work in Design Proceedings

SHUM-uCOS: A RTOS Using Multi-task Model to Reduce Migration Cost between SW/HW Tasks
Bo Zhou, Weidong Qiu, Yan Chen, Chenglian Peng
Department of Computer and Information Technology, Fudan University, Shanghai, China
allenzhou@xasumail.corn
The design of embedded systems has become more complex than ever, and design quality depends increasingly on the cooperation of multidisciplinary design teams, in general hardware engineers and software engineers. However, because these teams lack a uniform programming model and common system components, the cost of migrating a function from software to hardware is high. Such migrations are nevertheless necessary in the hardware-software partitioning of embedded systems, especially in prototype designs. To cope with this problem, we adopt a uniform multi-task model and implement an RTOS (Real-Time Operating System), called SHUM-uCOS, which handles hardware functions in the same way as software tasks. This RTOS uses uCOSII as its prototype, and it traces and manages the states of reconfigurable resources (FPGAs), which allows hardware tasks to execute in a true multitasking manner. Moreover, SHUM-uCOS defines a standard hardware-task interface, which supports a shared-bus protocol. Experiments show that SHUM-uCOS can shorten the migration time from software implementations to hardware implementations while improving performance.

Keywords: Reconfigurable Computing System, RTOS, Multi-task Model, uCOS

1. Introduction
Embedded systems have expanded considerably in the last few years. With advances in silicon technology, more powerful devices (e.g., higher-frequency CPUs and larger memories) are available. At the same time, design complexity has increased dramatically, and design quality depends more on the effective cooperation of multidisciplinary design teams, in general hardware engineers and software engineers. But how should the designer decide where to place the dividing line between the work of software engineers and that of hardware engineers? This is a well-known and still unsolved problem in embedded systems, called hardware-software partitioning. Currently, the dividing line is drawn by hand. An experienced system analyst attempts to have hardware engineers implement the time-consuming components, thus maximizing execution speed. To determine which part is the performance bottleneck, we often need several product prototypes that place the hardware-software dividing line differently and realize the same functions in different ways. Comparing them yields the proper boundary between software and hardware. This procedure involves many migrations between software and hardware. However, because there is no uniform programming model and no common system components for these different implementation methods, the cost of migrating a function implementation from software to hardware is normally high. Even a small task migration needs excessive modification, because it involves both design teams.

Recent developments in configurable devices, however, have increasingly blurred the traditional line between hardware and software. By exploiting this characteristic, it seems that we can reduce the migration cost greatly. An operating system is a reasonable solution because it is the traditional boundary between hardware and software. Although commercial RTOSs available for popular embedded processors significantly reduce design time, they typically do not take advantage of the intrinsic parallelism of hardware tasks, probably because FPGAs and ASICs have historically been treated as hardware accelerators, for which the operating system provides only device drivers. To cope with this problem, we have adopted a uniform multi-task (thread) model and implemented an RTOS, called Software Hardware Uniform Management uCOS (SHUM-uCOS), with the uCOSII RTOS [1] as its prototype. The basic concept of the multi-thread model was first discussed in [2], where it was proposed for hybrid chips containing both CPU and FPGA components.

We extend this model to embedded system designs composed of a host processor and several reconfigurable devices. This programming model allows hardware tasks on reconfigurable devices to execute in a truly parallel multitasking manner while being organized like software tasks, and it substantially decreases the time needed to migrate a task from a SW implementation to a HW implementation.

Authorized licensed use limited to: Bharat University. Downloaded on August 09,2010 at 08:25:53 UTC from IEEE Xplore. Restrictions apply.

Sumanth Donthi [3] classifies FPGAs into two categories. If only a portion of the chip is modified while the remaining logic operates normally without any disruption, the device is partially reconfigurable. If the whole chip is modified at once, with a total loss of the previous configuration and the state of the flip-flops, it is fully reconfigurable. The main functions of SHUM-uCOS are task and resource management. Several recent publications deal with task and resource management problems, e.g., [4] and [5], and especially with the problem of finding placements for hardware tasks on a reconfigurable surface, e.g., [6] and [7]. However, these discussions mainly focus on partially reconfigurable FPGAs. Operating systems seem to have paid little attention to fully reconfigurable FPGAs, which currently hold a large share of the FPGA market. SHUM-uCOS deals with these devices and uses a preconfiguration table to increase the utilization of reconfigurable resources.

2. SHUM-uCOS Framework
The SHUM-uCOS is an extended version of uCOSII that expands its management range by adding extra functions. It retains most of the data structures of uCOSII, as well as the priority-based scheduling policy.

When dealing with software tasks only, the SHUM-uCOS is almost the same as uCOSII. When hardware tasks are involved, the SHUM-uCOS adopts the uniform multi-task model to manage them, as shown in Figure 1. The whole model is divided into three parts: the CPU, the hardware-task manager and the reconfigurable devices. The software tasks execute on the CPU and the hardware tasks run on the FPGAs. The software part of SHUM-uCOS includes the software task interface, the task scheduler and the resource manager. The hardware part of SHUM-uCOS is called the hardware-task manager; it is usually implemented in the FPGAs and includes the communication controller, the standard hardware-task interface, the configuration interface and the hardware-task configuration controller.

Figure 1. SHUM-uCOS framework

In detail, the SHUM-uCOS is composed of the following parts.

Software task interface: a set of API functions. Designers interact with the operating system through these functions by calling system services, e.g., creating semaphores and mutexes.

Hardware-task preconfiguration table: to reduce the configuration cost at runtime, the configuration sequences of the configurable devices are obtained by analyzing the task graph statically. The scheduler uses this data to configure devices before the hardware tasks run.

Scheduler: the core of the RTOS. It is responsible for managing the states of tasks (HW and SW) and for handling synchronous and asynchronous events, such as the scheduling of software tasks, the configuration of hardware tasks, and the synchronization between tasks.

Resource manager: because hardware tasks are created and deleted dynamically, the usage of the reconfigurable resources changes continually. The resource manager traces and records these changes and provides the scheduler with the information it needs to configure hardware tasks.

Communication controller: this module handles the low-level communication details and translates commands into binary signals according to the application, e.g., the number of hardware tasks.

Hardware-task configuration database: this database contains all the hardware-task configuration data, which are synthesized in advance.

Hardware-task configuration controller: after receiving a configuration command from the scheduler, the controller retrieves the corresponding configuration data from the database and configures the devices. Because the workload is light, a 4-bit or 8-bit microcontroller can serve as the configuration controller.

Hardware-task interface: it supplies the communication controller with the standard signals and protocols.

Hardware-task implementation: it includes all the function modules in the FPGAs, which are described in Section 3.3.
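
As a rough illustration of the uniform multi-task model, the sketch below keeps software and hardware tasks in one task table and lets a single dispatch routine start them. All names (`Task`, `dispatch`, the kind and state enums) are hypothetical and are not taken from SHUM-uCOS or uCOSII. The only behaviour borrowed from the text is that every ready hardware task may run in parallel on its FPGA, while the CPU runs the single highest-priority ready software task (a lower number is assumed to mean a higher priority, as in uCOSII).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the SHUM-uCOS data structures. */
typedef enum { TASK_SW, TASK_HW } TaskKind;
typedef enum { TASK_SLEEPING, TASK_READY, TASK_RUNNING } TaskState;

typedef struct {
    int       prio;   /* unique priority; lower value = higher priority */
    TaskKind  kind;   /* where the task is implemented */
    TaskState state;
} Task;

/* Start every ready hardware task (they run in parallel on the FPGAs) and
 * the single highest-priority ready software task (the CPU runs one task
 * at a time). Assumes no software task is already running. Returns the
 * number of tasks that were started. */
static size_t dispatch(Task *tasks, size_t n)
{
    size_t started = 0;
    Task *best_sw = NULL;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].state != TASK_READY)
            continue;
        if (tasks[i].kind == TASK_HW) {
            tasks[i].state = TASK_RUNNING;
            started++;
        } else if (best_sw == NULL || tasks[i].prio < best_sw->prio) {
            best_sw = &tasks[i];
        }
    }
    if (best_sw != NULL) {
        best_sw->state = TASK_RUNNING;
        started++;
    }
    return started;
}
```

In this sketch, migrating a task from software to hardware amounts to changing its `kind` field, while the task table and the dispatch logic stay untouched. A real system would also have to check that the task's reconfigurable resource is configured before starting it.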

3. The implementation of SHUM-uCOS

3.1 Preconfiguration table generation
Many embedded applications can be represented by data flow graphs (DFG). A DFG is a directed acyclic graph (DAG). The problem of generating the hardware-task preconfiguration table can be viewed as two separate problems:

1. From the spatial point of view, hardware tasks are organized into task groups, and the total area of each task group must be smaller than that of the reconfigurable device in which the group is placed.
2. From the temporal point of view, the task groups must be scheduled so that they need the minimum number of reconfigurable devices.

Both the grouping and the scheduling of a DAG are NP-complete problems [8][9][10]. Paper [9] discusses the task-group partitioning problem in detail and proposes two algorithms: a level-based partitioning algorithm and a clustering-based partitioning algorithm. The former mainly exposes the parallelism hidden in the graph nodes, while the latter aims to decrease the communication overhead, i.e., the number of terminal edges resulting from partitioning. In the multiprocessor field, there has already been much discussion of how to obtain parallelism by analyzing the task graph statically, and numerous methods have been proposed, such as the MCP algorithm and the DCP algorithm [10].

With the above considerations, the basic idea of generating the preconfiguration table is as follows: first, divide the hardware tasks into task groups that fit into the reconfigurable devices; then, view the configuration procedures as tasks with deadlines; finally, obtain the preconfiguration table by scheduling these tasks. The following steps describe this procedure in detail (an example is shown in Figure 2).

1. Remove the software-task nodes from the original task graph G1 to obtain a task graph G2 that contains only hardware tasks. The precedence relations between hardware tasks in G2 must be the same as in G1. For example, Figure 2(a) contains three tasks, T4, T8 and T12, each of which depends on the one before it, and T8 is the only software task. Thus, we must remove T8 and keep T12 depending on T4 in Figure 2(b).
2. Replace each hardware-task node Ti with a configuration-task node Ci, whose deadline equals the arrival time of Ti minus the configuration time.
3. Obtain the task groups under the area constraint according to the level-based partitioning algorithm [9].
4. Merge each task group into one configuration node.
5. Use the DCP algorithm [10] to schedule the configuration nodes; the result is the preconfiguration table.

Figure 2. The generation of the preconfiguration table
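
Steps 1 and 2, followed by a simple ordering pass, can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The `TaskNode` fields, the adjacency-matrix representation and all function names are hypothetical. The deadline rule of step 2 is read as "arrival time minus configuration time". The final pass merely sorts configuration nodes by earliest deadline as a stand-in for the grouping of [9] and the DCP scheduling of [10], neither of which is reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 16

/* Hypothetical task-graph node; the field names are assumptions. */
typedef struct {
    bool is_hw;         /* true for a hardware task, false for software */
    int  arrival_time;  /* time at which the task is expected to start */
    int  config_time;   /* time needed to configure the task's device */
} TaskNode;

/* Step 1: remove software nodes while preserving precedence between
 * hardware tasks. edge[i][j] == true means task j depends on task i.
 * Each software node k is bypassed: every predecessor of k is linked
 * to every successor of k, then all edges touching k are deleted. */
static void remove_sw_nodes(const TaskNode *t, int n,
                            bool edge[MAX_TASKS][MAX_TASKS])
{
    assert(n >= 0 && n <= MAX_TASKS);
    for (int k = 0; k < n; k++) {
        if (t[k].is_hw)
            continue;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (edge[i][k] && edge[k][j])
                    edge[i][j] = true;
        for (int i = 0; i < n; i++) {
            edge[i][k] = false;
            edge[k][i] = false;
        }
    }
}

/* Step 2: deadline of configuration node Ci for hardware task Ti. */
static int config_deadline(const TaskNode *t)
{
    return t->arrival_time - t->config_time;
}

/* Stand-in for steps 3-5: list hardware tasks by earliest configuration
 * deadline (insertion sort). Writes task indices to order[] and returns
 * how many hardware tasks were found. */
static int order_by_deadline(const TaskNode *t, int n, int order[MAX_TASKS])
{
    int count = 0;

    assert(n >= 0 && n <= MAX_TASKS);
    for (int i = 0; i < n; i++) {
        if (!t[i].is_hw)
            continue;
        int pos = count++;
        while (pos > 0 &&
               config_deadline(&t[order[pos - 1]]) > config_deadline(&t[i])) {
            order[pos] = order[pos - 1];
            pos--;
        }
        order[pos] = i;
    }
    return count;
}
```

For the T4, T8, T12 chain of Figure 2, storing the tasks at indices 0, 1 and 2 with edges 0 to 1 and 1 to 2 leaves a single edge from index 0 to index 2 after `remove_sw_nodes`, which matches the precedence that step 1 requires.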



The procedure demonstrated in Figure 2 can generate the preconfiguration table, but there is no guarantee that the result is optimal. However, optimization is not the focus of the SHUM-uCOS, and in any case, the preconfiguration of the reconfigurable devices is always beneficial to the execution of hardware tasks.

3.2 Reconfigurable resources management

The SHUM-uCOS uses the RCB (Resource Control Block) structure to trace and control the usage of reconfigurable resources. An RCB is the following data structure:

    typedef struct os_rcb {
        INT8U  ResourceArea;        // the area of the resource
        INT16U ResourceNo;          // the unique ID of the resource
        INT8U  ActiveTaskCount;     // the count of sleeping tasks in the resource
        struct os_rcb *OSRCBNext;   // pointer to the next RCB
        struct os_hcb *OSHCBFirst;  // the pointer to the first task in this resource
    } OS_RCB;

In the SHUM-uCOS, a reconfigurable resource is always in one of four states: used, preconfiguration, blank and configuring, as shown in Figure 3. The SHUM-uCOS maintains four chains, one for each state.

Used state: the resource has been configured with a task group, and at least one task in the group is controlled by the scheduler.
Preconfiguration state: the resource has been configured with a task group, but all the tasks of the group are sleeping, waiting for activation.
Blank state: there is no task group in the resource, or the resource is going to be reconfigured with a new task group.
Configuring state: the configuration procedure on the resource is ongoing.

If a task in the preconfiguration state is activated by the scheduler within a given time interval, we call it a preconfiguration hit; otherwise, we call it a preconfiguration miss.

To reduce the cost of resource configuration, when a resource no longer contains any active task, the scheduler sets its state to preconfiguration instead of putting it into the blank state directly. If a preconfiguration miss subsequently occurs, the resource is moved to the blank state. This approach inserts the preconfiguration state between the used state and the blank state when reconfigurable resources are recycled, so that resource recycling works much like a cache for memory. As a result, it improves the preconfiguration efficiency.

Figure 3. The state graph of configurable resources
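
The four resource states and the hit/miss recycling rule can be expressed as a small state machine. The sketch below is one reading of the text rather than a transcription of Figure 3, whose content is not available here. The enum names and the event set are hypothetical, and the transition out of the configuring state is an assumption: it is taken to lead to the preconfiguration state, because a freshly configured task group contains only sleeping tasks.

```c
#include <assert.h>

/* Hypothetical names for the four resource states of Section 3.2. */
typedef enum { RES_BLANK, RES_CONFIGURING, RES_PRECONFIG, RES_USED } ResState;

/* Hypothetical events; the text describes them only in prose. */
typedef enum {
    EV_CONFIG_START,    /* scheduler issues a configuration command */
    EV_CONFIG_DONE,     /* configuration of the task group has finished */
    EV_TASK_ACTIVATED,  /* a sleeping task is activated: preconfiguration hit */
    EV_NO_ACTIVE_TASK,  /* the last active task in the resource went to sleep */
    EV_PRECONFIG_MISS   /* no activation within the given time interval */
} ResEvent;

/* Return the state that follows `s` when `ev` occurs. Events that are
 * not meaningful in the current state leave it unchanged. */
static ResState res_next_state(ResState s, ResEvent ev)
{
    assert(s == RES_BLANK || s == RES_CONFIGURING ||
           s == RES_PRECONFIG || s == RES_USED);
    switch (s) {
    case RES_BLANK:
        return ev == EV_CONFIG_START ? RES_CONFIGURING : s;
    case RES_CONFIGURING:
        return ev == EV_CONFIG_DONE ? RES_PRECONFIG : s;
    case RES_PRECONFIG:
        if (ev == EV_TASK_ACTIVATED)
            return RES_USED;
        if (ev == EV_PRECONFIG_MISS)
            return RES_BLANK;
        return s;
    case RES_USED:
        /* Recycle through the preconfiguration state, not straight to blank. */
        return ev == EV_NO_ACTIVE_TASK ? RES_PRECONFIG : s;
    }
    return s;
}
```

A resource whose tasks are reactivated before the interval expires therefore goes from used to preconfiguration and back to used without passing through the configuring state, which is the saving that the cache analogy describes.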

3.3 Hardware-task implementation

In the SHUM-uCOS, the hardware-task implementation is divided into three layers: 1) the timing-convert layer, whose main function is to convert other timings to standard memory timing, e.g., CAN or I2C timing to memory timing, with the aim of reducing the usage of precious FPGA pins; 2) the primitive layer, which is responsible for managing the states of hardware tasks and providing the synchronization mechanisms; 3) the function entity layer, which implements the user's functions.

Figure 4. Hardware-task implementation

The first two layers belong to the SHUM-uCOS and are provided as IP (Intellectual Property). The timings between layers are all standard memory timing. The SHUM-uCOS provides two methods for inter-task communication: global variables and message passing, where message passing includes the mutex, the semaphore and the message box. At present, hardware tasks do not support semaphore queues or message-box queues.

The hardware-task implementation uses a shared-bus protocol: a hardware task can access the main memory whenever the shared bus is available. If the main memory already uses standard memory timing, the timing-convert layer is not needed. In the standard hardware-task implementation, the primitive operation layer is composed of four parts:

Data pathway: connected to the main memory, it allows the function entity to access the data stored in memory.
Control pathway: connected to the DMA (direct memory access) signals of the CPU or to the bus arbiter, it handles bus requests and releases.
Initialization pathway: connected to the hardware-task controller, it is used to initialize the internal registers of the primitive layer after a hardware task is created.
Hardware state controller: the core of the primitive operation layer. It interprets CPU commands, controls the hardware-task state and reports the task status.

There are no local registers or memory in the primitive layer; all data are stored in the main memory. Each hardware task has a Task Interface Control Block (TICB) data structure that defines its control registers, which are mapped into the main memory:

    typedef struct os_ticb {
        INT32U Receive_Cmd;   // command received from CPU
        INT32U Send_Req;      // request sent to CPU
        INT32U Return_Code;   // the result code
        INT32U Param_Reg;     // command parameter
        INT32U Pointer_Reg;   // the pointer to data frame
        INT32U Len_Reg;       // the length of data frame
    } OS_TICB;

After a hardware task is created, the task ID and the start address of its TICB are saved into registers of the hardware task. These parameters are apt to change at runtime. Only with the start address of the TICB can the task state controller access the memory data. When a command must be sent to a hardware task, the CPU first writes the command into the memory location of the Receive_Cmd parameter in the TICB and then sets Cmd_Aquire in the TICB to tell the hardware task that there is a new command. Finally, the hardware task requests the bus and obtains the data. When a hardware task asks for a CPU service, it writes the service type into the memory location of the Send_Req parameter in the TICB and then raises an interrupt to notify the CPU. Finally, according to the Send_Req parameter in the TICB, the CPU selects the proper service function.

4. Experiment Results

4.1 OS performance evaluation

Operating systems that use a uniform multi-task model are a rather new line of research, and so far there are no explicit numerical results to compare with. To demonstrate the quality of the proposed operating system, we evaluate the performance of the SHUM-uCOS using the Rhealstone benchmark [11] and compare it with uCOS. Rhealstone is a well-known benchmark for real-time operating systems. It identifies the execution times (or time delays) associated with six operations that are vital indicators of real-time multitasking system performance: task-switch time, preemption time, interrupt latency, semaphore-shuffle time, deadlock-break time and intertask message latency. Rhealstone is intended to be independent of the CPU architecture, and it adopts a small Whetstone benchmark as the workload of each task. Because the tasks in uCOS are all software tasks, we do not implement Whetstone in hardware in the SHUM-uCOS; instead, the hardware tasks wait for the same time as the software execution takes, which keeps the result independent of the task workload. In Rhealstone, the task-switch time is defined as the average time to switch between two active tasks of equal priority. However, the SHUM-uCOS inherits the priority-based scheduling policy of uCOS and likewise does not allow tasks of equal priority. Thus, we have omitted this first experiment.

We use a platform composed of four Altera FPGAs. The experiment environment is as follows:
1) FPGA: EP1C6-8 of the Altera Cyclone series [12].
2) CPU: NIOSII standard version at 50 MHz [13], a soft-core CPU from Altera Corporation.
3) Benchmark: Rhealstone benchmark; 10 bytes are sent each time the message box is used.
4) Targets for test: SHUM-uCOS Ver1.0 and uCOSII Ver2.76.
5) System tick period: 1 ms.
6) Main memory: IDT71V416 SRAM.

Table 1. The Rhealstone benchmark results (unit: us)
Remark: SWT: software task; HWT: hardware task.
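
Returning to the TICB protocol of Section 3.3, the two exchanges it describes (a CPU command to a hardware task, and a service request from a hardware task to the CPU) can be simulated in plain C. This is a minimal sketch under stated assumptions, not the SHUM-uCOS code. The `OS_TICB` fields follow the listing in the text, but the `HwTaskPort` wrapper, the `cmd_acquire` and `irq_pending` flags and all function names are hypothetical. In particular, the text mentions Cmd_Aquire without listing it as a TICB field, so it is modelled here as a separate flag. Ordinary memory stands in for the shared bus, so bus arbitration is not modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t INT32U;

/* TICB layout as listed in Section 3.3. */
typedef struct os_ticb {
    INT32U Receive_Cmd;   /* command received from CPU */
    INT32U Send_Req;      /* request sent to CPU */
    INT32U Return_Code;   /* the result code */
    INT32U Param_Reg;     /* command parameter */
    INT32U Pointer_Reg;   /* the pointer to data frame */
    INT32U Len_Reg;       /* the length of data frame */
} OS_TICB;

/* Hypothetical wrapper: the TICB plus two signalling flags. On real
 * hardware these would be volatile, memory-mapped locations. */
typedef struct {
    OS_TICB ticb;
    INT32U  cmd_acquire;  /* assumed flag: set by CPU when a command waits */
    INT32U  irq_pending;  /* assumed flag: set by hardware to interrupt CPU */
} HwTaskPort;

/* CPU side: write the command, then signal the hardware task. */
static void cpu_send_command(HwTaskPort *p, INT32U cmd, INT32U param)
{
    assert(p->cmd_acquire == 0);  /* previous command must be consumed */
    p->ticb.Param_Reg = param;
    p->ticb.Receive_Cmd = cmd;
    p->cmd_acquire = 1;
}

/* Hardware side: fetch a pending command. Returns 1 if one was read. */
static int hw_poll_command(HwTaskPort *p, INT32U *cmd, INT32U *param)
{
    if (!p->cmd_acquire)
        return 0;
    *cmd = p->ticb.Receive_Cmd;
    *param = p->ticb.Param_Reg;
    p->cmd_acquire = 0;
    return 1;
}

/* Hardware side: request a CPU service and raise the interrupt. */
static void hw_request_service(HwTaskPort *p, INT32U service)
{
    p->ticb.Send_Req = service;
    p->irq_pending = 1;
}

/* CPU side (interrupt handler): acknowledge and return the service type,
 * which the caller would use to select the proper service function. */
static INT32U cpu_isr_fetch_request(HwTaskPort *p)
{
    p->irq_pending = 0;
    return p->ticb.Send_Req;
}
```

Writing the parameter before the command, and the command before the flag, means the hardware side never observes the flag without a complete command behind it. On a real shared bus this ordering would additionally need volatile accesses or memory barriers.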


4.2 Case study

The SHUM-uCOS has been used in a VOIP terminal. In this project, the most important part is the voice compression and decompression, which greatly affects system performance because of its heavy computation load. To demonstrate the performance difference between the two implementations, we migrate the ADPCM compression (decompression) from a software implementation to a hardware implementation. The hardware task communicates with the CPU through a message box. We choose the ITU G.726 standard for voice compression (decompression) and increase the system workload by increasing the compression ratio. To make the final result more distinct, we set the frequency of the NIOS to 15 MHz, which is much lower than usual. If the CPU is still busy with an older frame, the new frame is discarded, and we evaluate the performance through the frame-loss ratio. The result is shown in Table 2.
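
The discard rule behind the frame-loss ratio can be made concrete with a small model. The sketch below is an illustration only and does not reproduce the measurement setup or any value from Table 2. It assumes, hypothetically, that frames arrive at a fixed period, that each accepted frame occupies the processor for a fixed time, and that all times are integers in the same arbitrary unit. The function names are likewise hypothetical.

```c
#include <assert.h>

/* Count discarded frames when one frame arrives every `period` time units
 * and each accepted frame needs `proc_time` units of processing. A frame
 * that arrives while an older frame is still being processed is discarded. */
static int count_lost_frames(int period, int proc_time, int n_frames)
{
    int lost = 0;
    long busy_until = 0;  /* time at which the processor becomes free */

    assert(period > 0 && proc_time >= 0 && n_frames >= 0);
    for (int i = 0; i < n_frames; i++) {
        long arrival = (long)i * period;
        if (arrival < busy_until)
            lost++;                            /* still busy: discard */
        else
            busy_until = arrival + proc_time;  /* accept and process */
    }
    return lost;
}

/* Frame-loss ratio in percent, rounded down. */
static int frame_loss_percent(int lost, int n_frames)
{
    assert(n_frames > 0 && lost >= 0 && lost <= n_frames);
    return lost * 100 / n_frames;
}
```

In this model no frame is lost as long as the processing time does not exceed the frame period, which is why moving the codec to a faster hardware implementation can reduce the loss ratio sharply.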
Table 2. Frame-loss ratios for various implementations
Implementations compared: software task, hardware task.

Table 2 shows that the frame-loss ratio decreases dramatically after the compression (decompression) task migrates from software to hardware. Admittedly, any migration from SW to HW is apt to increase system performance, and the more important the migrated function is, the greater the benefit. With the SHUM-uCOS, however, this kind of migration is more natural and affects the other parts of the system less. In this case study, we changed only 13 locations to migrate the compression/decompression functions from software to hardware successfully, which even exceeded our expectations.

5. Conclusion
We implemented an RTOS based on the multi-task model. The aim of this approach is to provide a uniform platform for both software and hardware engineers and to reduce the migration cost in embedded system design, since migration is a time-consuming step in the whole design flow. The SHUM-uCOS traces and manages the states of reconfigurable resources (FPGAs), allowing hardware tasks to execute in a true multitasking manner. The Rhealstone benchmark has shown that the SHUM-uCOS has almost the same performance as uCOSII when dealing with software tasks only. Furthermore, it can also handle hardware tasks. The thirteen modifications in our VOIP case study show that the SHUM-uCOS can greatly shorten the migration time while improving performance.

References
[1] Jean J. Labrosse, MicroC/OS-II: The Real-Time Kernel, Second Edition, CMP Books, 2002.
[2] David Andrews and Douglas Niehaus, Programming Models for Hybrid FPGA-CPU Computational Components: A Missing Link, IEEE Micro, Volume 24, Issue 4, July-Aug. 2004, pp. 42-53.
[3] S. Donthi and R. L. Haggard, A Survey of Dynamically Reconfigurable FPGA Devices, Proceedings of the 35th Southeastern Symposium on System Theory, 16-18 March 2003, pp. 422-426.
[4] O. Diessel, H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt, Dynamic scheduling of tasks on partially reconfigurable FPGAs, IEE Proceedings on Computers and Digital Techniques, volume 147, pp. 181-188, May 2000.
[5] Katherine Compton, James Cooley, Stephen Knol, and Scott Hauck, Configuration Relocation and Defragmentation for Reconfigurable Computing, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE CS Press, April 2003.
[6] Kiarash Bazargan, Ryan Kastner, and Majid Sarrafzadeh, Fast Template Placement for Reconfigurable Computing Systems, IEEE Design and Test of Computers, volume 17, pp. 68-83, 2000.
[7] Herbert Walder, Christoph Steiger, and Marco Platzner, Fast Online Task Placement on FPGAs: Free Space Partitioning and 2D-Hashing, Proceedings of the 10th Reconfigurable Architectures Workshop (RAW), IEEE CS Press, April 2003.
[8] Thomas H. Cormen and Charles E. Leiserson, Introduction to Algorithms, The MIT Press, 2001, pp. 1043-1054.
[9] Karthikeya M. Gajjala Purna and Dinesh Bhatia, Temporal Partitioning and Scheduling Data Flow Graphs for Reconfigurable Computers, IEEE Transactions on Computers, 1999, pp. 579-590.
[10] Y.-K. Kwok and I. Ahmad, Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors, IEEE Trans. on Parallel and Distributed Systems, 1996, 7(5): 506-521.
[11] Rabindra P. Kar, Implementing the Rhealstone Real-Time Benchmark, Dr. Dobb's Journal, Sep. 1990.
[12] Altera Corporation, Cyclone Programmable Logic Device Family Datasheet, 2003, http://www.altera.com.
[13] Peng Cheng-lian and Zhou Bo, SOPC design and practice using NIOS, Beijing, Tsinghua Press, 2004.

