
International Journal of Advanced Computer Science, Vol. 3, No. 7, Pp. 368-377, Jul., 2013.

A Hybrid Operating System for a Computing Node with Multi-Core and Many-Core Processors
Mikiko Sato, Go Fukazawa, Kazumi Yoshinaga, Yuichi Tsujita, Atsushi Hori, & Mitaro Namiki
Manuscript received: 11 Apr. 2013; revised: 27 May 2013; accepted: 31 May 2013; published: 15 Jun. 2013

Keywords
Many-core processor, Multi-core processor, Operating system, Process management, I/O delegation.

Abstract: This paper describes the design of an operating system that manages a hybrid computer system architecture with multi-core and many-core processors for exa-scale computing. In this study, a host operating system (Host OS) on a multi-core processor performs some functions on behalf of a lightweight operating system (LWOS) on a many-core processor, so that the many-core processor can be dedicated to executing the parallel program. Specifically, process execution and I/O processing on the many-core processor are supported by Host OS through an efficient access method. To demonstrate this design, we implemented a prototype system using Intel Xeon dual-core CPUs, with Linux and our original LWOS each loaded onto one processor. The basic performance of process control and file I/O access for LWOS was evaluated. An LWOS process can be started with an overhead of at least 110 µsec for the many-core program, and the file I/O bandwidth was about the same as on Linux for an I/O access size of about 16 MB. These results show that, with this hybrid OS approach, a many-core processor can be used effectively in a cluster system while minimizing system overhead.

This work is partially supported by JST, CREST (Japan Science and Technology Agency, CREST). Mikiko Sato & Go Fukazawa, Tokyo University of Agriculture and Technology, Japan ({mkko, fzawa}@namklab.tuat.ac.jp). Kazumi Yoshinaga, Yuichi Tsujita, & Atsushi Hori, RIKEN Advanced Institute for Computational Science, Japan ({kazum.yoshnaga, yuch.tsujta, ahor}@rken.jp). Mitaro Namiki, Tokyo University of Agriculture and Technology, Japan (namk@cc.tuat.ac.jp).

1. Introduction
The supercomputers used for large-scale simulations and other tasks that demand high computational performance operate at the petaflops level [1]. One approach to implementing such supercomputers is to use general-purpose GPUs (GPGPU) [2], which are capable of fast numerical computation, as accelerators. A different approach that is coming into use is to use many-core CPUs, which integrate many general-purpose CPU cores. Many-core systems can be constructed from units of many-core CPUs such as the Intel SCC (Single-chip Cloud Computer) [3], or by using many-core CPUs such as the Intel MIC (Many Integrated Core) [4] chip as accelerators. For systems in which the many-core CPU is used as an accelerator, program development is mainly done in C or Fortran. That approach has the advantage of reusing existing parallel programs. We can therefore expect active use of that approach in future high-performance computing.

Heterogeneous computing systems that use many-core CPUs as accelerators include two types of CPUs with different characteristics. The computing performance of the individual cores that constitute a many-core CPU is designed to be lower than that of the cores of a multi-core CPU by about one-third, and the processor cache memory is also smaller by several hundred kilobytes per core. The on-board memory capacity of PCIe accelerator boards equipped with the Intel Xeon Phi [5] is also low, at 8 GB. Thus, when Linux or another conventional general-purpose operating system (OS) is run on a many-core platform, the memory and core performance are inadequate for executing the rich functions of the OS, so it is highly likely that the full benefit of the many-core CPU cannot be obtained.

In this study, we have proposed an architecture in which the OSs that run on many-core CPUs and multi-core CPUs are suited to the respective features of each type of processor, in order to extract the best parallel computing performance from many-core CPUs [6]. A general-purpose OS is applied on the multi-core processor as the master OS (called Host OS), and the many-core processor is used as an accelerator for parallel computing. A lightweight OS (called LWOS) is applied on the many-core processor and hosts a subset of a standardized application interface (POSIX). Host OS on the multi-core processor performs some functions of LWOS so that the many-core processor can be dedicated to executing the parallel program.
Parallel computation is executed by LWOS on the many-core CPUs, and file I/O and other I/O resources are managed by Host OS on the multi-core CPUs. In this approach, the two types of OS together provide all of the functions required to construct a high-performance computing environment, and the parallel computing performance of the many-core processor can be utilized without significant changes to programs developed for the existing x86 architecture. This paper describes the hybrid OS design and, in particular, an efficient method for delegating file I/O processing.

The rest of this paper is organized as follows. Section 2 presents the system architecture of our computer system with multi-core and many-core processors and the hybrid OS. Section 3 presents the OS design. Section 4 presents an OS structure for realizing lightweight process management. Finally, Section 5 presents evaluations on a prototype system, followed by related work in Section 6 and a summary in Section 7.

2. System Overview
This section gives an overview of the hardware and system software architecture for a computing node with multi-core and many-core processors.

A. Hardware Architecture
Fig. 1 shows the computer system architecture with multi-core and many-core processors considered in this study. Two usage styles are assumed for a many-core processor such as the Intel MIC: (1) an accelerator board equipped with the many-core processor is plugged into the PCIe bus of the host computer, and (2) the many-core processor is placed in a socket on the CPU board and used with a Xeon processor as a hybrid-processor system [7]. In both styles, communication and memory sharing between the processors are possible through the low-latency, high-bandwidth internal bus. Our study therefore assumes a hybrid computer that connects a homogeneous many-core processor, such as the MIC, which integrates many conventional CPU cores, with an existing multi-core processor through an internal bus. This hybrid structure is designated as one node. Several nodes are connected by a high-speed network such as InfiniBand to build a large-scale parallel computer cluster.

In the many-core processor of each node, the performance of a single core is about the same as that of an older-generation CPU. Although the cores are integrated in units of around a dozen cores, cache coherence can be retained among the cores. The multi-core processor and the many-core processor each have their own physical memory. A part of the multi-core CPU memory region is mapped into the many-core CPU memory space, and a part of the many-core CPU memory region is mapped into the multi-core CPU memory space, so that a shared memory is prepared between the two CPUs. This shared memory is used for high-speed communication between the multi-core and many-core processors.

Fig. 1 The computer system architecture with multi-core and many-core processors

B. Software Architecture
In this study, an OS is loaded onto both the multi-core processor and the many-core processor. By providing a POSIX interface, the many-core processor OS enables execution of a parallel program without significant changes to source code developed for a conventional general-purpose OS. However, if a general-purpose OS such as Linux is chosen as the many-core processor OS, its various management functions consume CPU resources and the execution speed of the parallel program can be expected to decrease. Therefore, a lightweight OS that specializes in parallel computing is applied to the many-core processor as LWOS. On the multi-core processor, a general-purpose OS such as Linux is loaded as Host OS. To keep LWOS lightweight, LWOS delegates heavy functions, such as management of I/O resources, the file system, and network I/O, to Host OS. Through this collaboration between LWOS and Host OS, a parallel program is expected to achieve high execution performance on the many-core processor. To enable the collaboration, an inter-OS communication mechanism is used on both OSs. Host OS supports the execution of a parallel program on LWOS, and LWOS delegates resource management to Host OS.

C. Process Model
The process that runs on LWOS (LW-process) is assigned core resources and memory resources on the many-core platform. The lightweight threads (Th in Fig. 1), which share the resources assigned to the LW-process, are managed by LWOS so that they execute efficiently. The process that runs on Host OS (Host-process), on the other hand, is generated to support execution of the LW-process. The Host-process and the LW-process cooperate via inter-OS communication to execute an application program. Executing many threads on the many-core processor raises the parallel computing performance of a many-core computing system. The hybrid OS support for this process model is intended to improve parallel computation performance by making effective use of the CPU resources of the many-core processor. The Host-process of the computing node is assumed to communicate via conventional MPI messages. A high-speed communication scheme in which the LW-processes of each node also use MPI over a high-speed network such as InfiniBand has also been proposed [8].

3. Operating System Design


In this hybrid OS, LWOS is dedicated to the parallel execution of threads that use the resources assigned to the LW-processes, and Host OS supports LW-process execution. In this section, we describe the basic design of the hybrid OS, which takes full advantage of the parallel computation performance of many-core processors.

A. Resource Management of the Hybrid OS
LWOS assigns many-core CPU resources and memory resources to an LW-process at the kernel level at start-up, and I/O management is delegated to Host OS. The processing needed for thread execution, such as thread management and I/O processing, is mostly a user-level function on LWOS. Accordingly, the overhead of privilege-level switching and of copying data between the user space and the kernel space is eliminated during thread execution. The hybrid OS provides the following resource management for LW-processes.

1) Core Management for LW-process: LWOS manages the physical cores of the many-core processor at the kernel level. As shown in Fig. 2, LWOS assigns the number of physical cores requested by the LW-process at start-up. The LW-process occupies the assigned physical cores, and the user-level thread library called MULiTh (User-level Thread Library for Multithreaded architecture) virtualizes these cores and performs lightweight, flexible thread control according to the behavior of the threads. Thread management is described in Section 3-B. In this study, the many-core platform is assumed to be equipped with a large number of cores. Therefore, an LW-process is not preempted by other LW-processes until it finishes executing, which eliminates LW-process switching that would involve multiple hardware contexts.
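The exclusive, non-preemptive core assignment described above can be illustrated with a small model. This is only a sketch under stated assumptions: the class and method names (`CoreManager`, `assign`, `release`) are hypothetical and are not part of LWOS, and the policy of refusing a request when too few cores are free is an assumption made for the example.

```python
class CoreManager:
    """Toy model of exclusive physical-core assignment (names are hypothetical)."""

    def __init__(self, num_cores):
        self.free = list(range(num_cores))  # physical core IDs not yet assigned
        self.owned = {}                     # LW-process ID -> list of core IDs

    def assign(self, lwp_id, wanted):
        """Give `wanted` cores to a new LW-process, or refuse if too few are free."""
        if wanted <= 0 or wanted > len(self.free):
            return None
        cores, self.free = self.free[:wanted], self.free[wanted:]
        self.owned[lwp_id] = cores          # held until the LW-process ends
        return cores

    def release(self, lwp_id):
        """Return the cores to the free pool when the LW-process ends."""
        self.free.extend(self.owned.pop(lwp_id, []))
        self.free.sort()
```

Because cores are never taken back before `release`, no LW-process context switch occurs in this model, which mirrors the no-preemption assumption in the text.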

Fig. 2 CPU-core management for LW-processes on LWOS

2) Memory Management for LW-process: LWOS manages the physical memory of the many-core processor and assigns contiguous memory areas to each LW-process. The overview of the address space managed in the hybrid OS is illustrated in Fig. 3. The inter-OS communication buffer, the LW-process address space, and even the page table of the LW-process are shared between LWOS and Host OS. Host-process can also directly access the address space of the LW-process. This makes lightweight data transfer possible using the memory shared between Host-process and LW-process. When an executable file of the LW-process is loaded into physical memory, the stack and heap areas, which are independent of the executable header, are fully assigned. This approach aims to reduce the number of page faults handled on LWOS so that they do not interfere with LW-process execution. If a page fault does occur during LW-process execution, LWOS handles it. That is, LWOS allocates a free page of physical memory and updates the page table. Further, so that Host-process can access the extended memory areas, LWOS notifies Host OS of the address translation information, and Host OS updates the address translation information of Host-process. In this hybrid OS, the executable files for LW-processes are managed in the Host OS file system. Accordingly, the analysis of the executable file header and the setup of the page table are probably better performed on a multi-core processor, which has high CPU performance, than on a many-core processor.

Fig. 3 The overview of the address space managed in the hybrid OS

3) I/O Management for LW-process: To provide applications with I/O access functions, a conventional OS is equipped with a standard I/O library, a file system, and I/O device drivers. If LWOS were equipped with all of those functions, it would become heavy. We therefore chose an approach in which I/O access requests from applications running on LWOS are processed by Host OS, so that existing parallel programs that include I/O requests can be run on a many-core CPU. For I/O management, LWOS is configured with only the standard I/O library. The file system and the management of I/O devices are delegated to Host OS. The CPU and memory resources of the many-core processor can therefore be used by LWOS for user-level thread execution.
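The division of labor in the I/O management above, with a thin library on the LWOS side and the file system on the Host OS side, can be mimicked in a single program. The following is a minimal sketch, not the authors' implementation: a Python queue stands in for the shared-memory communication buffer and the IPI notification, a thread stands in for the Host OS worker, and the names `delegate`, `host_worker`, and `demo` are hypothetical. For simplicity every request blocks until the reply arrives, and negative return values for errors are an assumption of the example.

```python
import os
import queue
import tempfile
import threading

# Stand-in for the inter-OS communication buffer (assumption of this sketch).
requests = queue.Queue()


def host_worker():
    """Host OS side: execute delegated file I/O with the host file system."""
    while True:
        op, args, reply = requests.get()
        if op == "shutdown":
            break
        try:
            if op == "open":
                result = os.open(*args)
            elif op == "write":
                fd, buf, size = args
                # Write straight from the caller's buffer, without staging it.
                result = os.write(fd, memoryview(buf)[:size])
            elif op == "read":
                fd, buf, size = args
                # os.read returns new bytes, so this model copies once here.
                data = os.read(fd, size)
                buf[:len(data)] = data
                result = len(data)
            elif op == "lseek":
                result = os.lseek(*args)
            elif op == "close":
                os.close(*args)
                result = 0
            else:
                result = -1
        except OSError as err:
            result = -err.errno
        reply.put(result)


def delegate(op, *args):
    """LWOS library side: post a request and wait for the host's reply."""
    reply = queue.Queue(maxsize=1)
    requests.put((op, args, reply))
    return reply.get()


def demo():
    """Write a message through the delegation path and read it back."""
    worker = threading.Thread(target=host_worker, daemon=True)
    worker.start()
    path = os.path.join(tempfile.mkdtemp(), "lw_demo.dat")
    fd = delegate("open", path, os.O_RDWR | os.O_CREAT, 0o600)
    out_buf = bytearray(b"hello from LW-process")
    written = delegate("write", fd, out_buf, len(out_buf))
    delegate("lseek", fd, 0, os.SEEK_SET)
    in_buf = bytearray(64)
    nread = delegate("read", fd, in_buf, len(in_buf))
    delegate("close", fd)
    requests.put(("shutdown", (), None))
    worker.join()
    return written, bytes(in_buf[:nread])
```

Calling `demo()` exercises open, write, seek, read, and close through the delegation path. The caller never touches the file system itself, which is the property the design relies on to keep LWOS small.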

B. Thread Management on Many-core Processor
Thread management is performed at the user level, as shown in Fig. 4. MULiTh manages the threads belonging to an LW-process and controls them directly without OS intervention. MULiTh also manages the thread contexts at the user level, so that a thread switch can be performed with little overhead [9]. MULiTh virtualizes the CPU cores that are assigned to the LW-process. The threads share the virtual address space that has been set up for the LW-process. Because only the assigned cores execute the LW-process, updates to the cache memory and TLB are kept to the necessary minimum and performance is improved.
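User-level, non-preemptive scheduling of this kind can be sketched with generators, where each `yield` marks a point at which a thread voluntarily gives up its core. This is an illustrative model only and not the MULiTh API: `run_user_threads` and `worker` are hypothetical names, and the round-robin order is an assumption.

```python
from collections import deque


def worker(steps):
    """A toy thread body that yields control after each unit of work."""
    for i in range(steps):
        yield i


def run_user_threads(threads, num_cores):
    """Run generator-based threads cooperatively; return the execution trace."""
    if num_cores < 1:
        raise ValueError("at least one core must be assigned")
    ready = deque(threads.items())
    trace = []
    while ready:
        # One pass models one scheduling step on each assigned core.
        for core in range(min(num_cores, len(ready))):
            name, thread = ready.popleft()
            try:
                step = next(thread)          # runs until the thread yields
                trace.append((core, name, step))
                ready.append((name, thread)) # still runnable, requeue it
            except StopIteration:
                pass                         # thread finished, drop it
    return trace
```

No kernel call appears anywhere in the switch path of this model. A switch is just saving and resuming a generator frame in user space, which is the property the text attributes to user-level context management.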

Fig. 4 Thread management in LW-processes with user-level thread library MULiTh

C. I/O Delegation
In this system, LWOS is equipped with the standard I/O library, which includes the functions for delegating I/O access to Host OS. That is, I/O processing on the many-core platform side is mostly a user-level library function. The flow of I/O access from LWOS to Host OS is illustrated in Fig. 5. In the case of file I/O access, requests from the application running on LWOS are performed by the kernel file system and I/O device drivers on Host OS. When Host OS receives a file I/O delegation request from the I/O library of LWOS, it sends an I/O access request to the file system. The result of executing the I/O access is then returned to the I/O library of LWOS. The overhead of privilege-level switching on LWOS is therefore eliminated. Furthermore, excessive data copying between the two OSs is eliminated by direct data transfer, using a shared I/O buffer, between the file system of Host OS and the LW-process. As a result of this design, I/O access on LWOS is processed with minimum delay.

Fig. 5 I/O delegation for the I/O device on Host OS (The example of a file I/O request)

D. Inter-OS Communication
In the hybrid OS, shared memory is used to exchange communication data between Host OS and LWOS. The contents of a notification for inter-OS communication are stored in the communication buffer, and an IPI is sent to signal the beginning of communication. Because the shared memory for inter-OS communication can be accessed by both Host OS and LWOS, communication overhead is reduced by eliminating data copying from the communication process. Furthermore, a two-way communication data queue is set up in the shared memory for use as a communication buffer, so that sending and receiving can be done on independent paths. The notifications described below are exchanged between Host OS and LWOS for collaboration in executing an LW-process.

1) LW-process Control Notifications: Host OS sends requests to LWOS to control the LW-processes that run on LWOS. The notifications include LW-process creation, suspension, resumption, and stopping, as shown in Table 1. Upon receiving these notifications, LWOS generates the LW-process or changes its execution state. These process control notifications return the results of procedure execution to the requester, so synchronous communication is used. When an LW-process running on LWOS ends because of an error or for another reason, Host OS must also terminate management of that LW-process. For that purpose, LWOS notifies Host OS of the change in LW-process execution state by unidirectional communication.

TABLE 1 LW-PROCESS MANAGEMENT NOTIFICATIONS

Items                   | Notification Argument                                              | Send to | Type
LW-process Create       | the number of physical cores (receive the LW-process ID)           | LWOS    | sync.
LW-process Suspend      | LW-process ID (receive the result)                                 | LWOS    | sync.
LW-process Resume       | LW-process ID (receive the result)                                 | LWOS    | sync.
LW-process Stop         | LW-process ID (receive the result)                                 | LWOS    | sync.
LW-process State Notify | LW-process State (terminate / suspend / stop with exception error) | HostOS  | uni.

*sync. : Synchronous communication
*uni. : Uni-direction notification
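The synchronous control notifications in Table 1 amount to a small state machine on the LWOS side. The sketch below models that state machine under stated assumptions: the class `LwosKernelModel`, the string item names, the numeric return codes (0 for success, -1 for a rejected request), and the rule that a stopped LW-process accepts no further requests are all hypothetical choices for illustration. The paper specifies only the arguments and that a result is returned.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    STOPPED = "stopped"


class LwosKernelModel:
    """Toy handler for the Table 1 notifications sent from Host OS to LWOS."""

    def __init__(self):
        self.next_id = 1
        self.procs = {}  # LW-process ID -> {"cores": n, "state": State}

    def handle(self, item, arg):
        """Process one synchronous notification and return its result."""
        if item == "create":
            # Argument: number of physical cores. Result: new LW-process ID.
            lwp_id = self.next_id
            self.next_id += 1
            self.procs[lwp_id] = {"cores": arg, "state": State.RUNNING}
            return lwp_id
        # All other items take an LW-process ID and return a status code.
        proc = self.procs.get(arg)
        if proc is None or proc["state"] is State.STOPPED:
            return -1
        if item == "suspend" and proc["state"] is State.RUNNING:
            proc["state"] = State.SUSPENDED
        elif item == "resume" and proc["state"] is State.SUSPENDED:
            proc["state"] = State.RUNNING
        elif item == "stop":
            proc["state"] = State.STOPPED
        else:
            return -1
        return 0
```

Returning a value from every request is what makes these notifications synchronous in the sense of Table 1: the requester cannot proceed until it knows whether the state change took effect.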


2) Memory Management Notification: When a page fault occurs on a many-core processor and a page table operation is required, this type of notification is sent to Host OS. The Page Fault Notice is used by the LWOS to notify Host OS. The notification is shown in Table 2. Host OS receives this notice and updates the address translation information of Host-process to be able to access the LW-process address space from Host-process.
TABLE 2 MEMORY MANAGEMENT NOTIFICATION

Items             | Notification Argument                                  | Send to | Type
Page Fault Notice | address translation information (receive the result)  | HostOS  | sync.

*sync. : Synchronous communication

3) File I/O Delegation Notifications: When the LW-process accesses a disk or other device, LWOS sends a notification that contains the access information to Host OS. The notification returns the results of process execution to the requester. Because the time required for I/O processing may depend on the amount of data to be accessed, the communication is mostly asynchronous. The inter-OS communication for providing file I/O access from LWOS is shown in Table 3. To implement access to the Host OS file system from LWOS, messages are defined for file open, close, read, write, lseek (moving the read/write file offset), and disk cache flush requests. The inter-OS communication may be synchronous or asynchronous. File read and write requests, and other communication for which the timing of the response is not fixed, are handled asynchronously. Location setting requests, which must be guaranteed to complete before the file I/O, are handled by synchronous communication.

TABLE 3 FILE I/O ACCESS NOTIFICATIONS

Items                        | Notification Argument                                                   | Send to | Type
File Open                    | file path, file option (receive File Descriptor of the file)            | HostOS  | async.
File Close                   | File Descriptor (receive the result)                                    | HostOS  | async.
File Read, File Write        | File Descriptor, buffer address, I/O size (receive the result)          | HostOS  | async.
Move File Offset (File Seek) | File Descriptor, offset size, position information (receive the result) | HostOS  | sync.
Disk Cache Flush             | LW-process State (terminate / suspend / stop with exception error)      | HostOS  | async.

*sync. : Synchronous communication
*async. : Asynchronous communication

E. Application Programming Interface for LW-process
Tables 4 and 5 show the LW-process management APIs provided by Host OS and LWOS. Host-process uses the APIs in Table 4 for LW-process control, and LW-process uses the APIs in Table 5. Calls to the exit API and the I/O access APIs are sent to Host OS, whereas the memory and thread management APIs are handled within LWOS without notifying Host OS. Pseudo-code for LW-process control by Host-process is shown in Fig. 6. When Host-process generates an LW-process, it specifies the number of CPU cores assigned to the LW-process ((a) in Fig. 6), the executable file path of the LW-process ((b) in Fig. 6), the stack memory size ((c) in Fig. 6), and the heap memory size ((d) in Fig. 6), and calls the LW-process generation function ((1) in Fig. 6). It then waits for the completion of the LW-process execution ((2) in Fig. 6).

TABLE 4 APPLICATION PROGRAMMING INTERFACE ON HOST OS

Function name                                     | Specification
lwp_create, lwp_suspend, lwp_resume, lwp_destroy | LW-process control from Host-process (send to LWOS)
lwp_wait                                          | Wait the LW-process exit

TABLE 5 APPLICATION PROGRAMMING INTERFACE ON LWOS

Function name                                     | Specification
exit                                              | LW-process exit (send to HostOS)
brk, sbrk                                         | Memory management of heap memory
open, close, read/write, ioctl/fseek/flush        | Linux File I/O (send to HostOS)
pthread_*                                         | POSIX thread I/F

Fig. 6 The pseudo code which has controlled LW-process using LWOS API

4. Hybrid OS Structure
Both Host OS and LWOS are constructed from a user-level library and a kernel module. In this section, we describe the structure of the hybrid OS for LW-process management and the file I/O delegation mechanism.

Fig. 7 The Host OS structure

Fig. 8 The LWOS structure

A. Host OS
The Host OS structure is shown in Fig. 7. The user-level Host OS takes charge of LW-process control shown in

Table 4. When creating the LW-process, Host OS loads the executable binary file into the memory allocated by LWOS and also initializes the page table for the LW-process at user level. The kernel-level Host OS provides the functions for executing the I/O access delegated by LWOS, which is described in Section 4-C. It also provides, as privileged-level operations, the IPI sending and receiving functions and the function for mapping the physical memory of the many-core processor, both of which are required by the user-level Host OS functions.

B. LWOS
The LWOS structure is shown in Fig. 8. LWOS comprises a user-level library (LWOS Library in Fig. 8), which provides the greatest possible compatibility with the standard POSIX API for calls made by LW-process code, and a kernel module (LWOS Kernel in Fig. 8), which provides functions for cooperation with Host OS and functions for managing LW-processes. MULiTh provides thread management at the user level. MULiTh assigns threads to the physical cores of the many-core processor that have been assigned to the LW-process, and these threads are executed with non-preemptive scheduling. When an I/O device is accessed, the system call arguments are sent to Host OS using inter-OS communication, and Host OS executes the request.

At the kernel level, the LW-process management function receives process control requests from Host OS, and LWOS then starts LW-process management, that is, core and memory assignment. The LWOS kernel up-calls a notification to MULiTh for destroy, suspend, and resume. The LWOS kernel also keeps the following LW-process context.
(1) The number of physical cores and the physical core IDs.
(2) The start address of the page table.
(3) The inter-OS communication buffer address.
(4) The entry point address of the program.
In addition to the functions listed above, the LWOS Kernel also handles exceptions that accompany LW-process run-time errors, such as addressing errors and page faults.

C. File I/O Management
In this system, the file system and I/O device drivers provided by Linux can be used for the I/O delegation processing. The Linux file system is constructed under the assumption that calls come from process contexts. Therefore, a delegatee process (Delegatee-process in Fig. 7) is created as a child process of Host-process when the LW-process is created, and the Delegatee-process serves to access the Linux file system. So that multiple file I/O requests can be processed simultaneously within the Delegatee-process, Host OS provides multiple worker threads in the Delegatee-process. Each worker thread stands by at the kernel level to receive file I/O delegation requests from the LW-process by inter-OS communication. When a request is received, the file system is called immediately within the kernel space so that I/O access begins with low delay.

To reduce the time required for I/O access, we implement direct I/O access to the LWOS I/O buffer, as shown in Fig. 3. This requires pre-mapping the many-core physical memory into the virtual address space of the Delegatee-process. The virtual address conversion from LW-process to Delegatee-process must also have low delay. Our system uses a method in which the virtual address spaces of the LW-process and the Delegatee-process are directly linked via a shadow page table that is set up on Host OS. Address conversion by this method is achieved with lower delay because only one page table is searched to convert an I/O buffer address, rather than the two page tables of the LW-process and the Delegatee-process. To manage the shadow page table, Host OS has to update the virtual address mapping for the Delegatee-process whenever LWOS allocates a new physical memory page to an LW-process.
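The difference between the two-table lookup and the shadow page table can be made concrete with dictionaries standing in for page tables. This is a simplified sketch under stated assumptions: real page tables are multi-level hardware structures, the 4 KB page size is assumed, and the names `ShadowTranslator`, `premap_frame`, and `on_lwos_page_alloc` are hypothetical.

```python
PAGE = 4096  # assumed page size in bytes


class ShadowTranslator:
    """Toy model of LW-process to Delegatee-process address conversion."""

    def __init__(self):
        self.lw_pt = {}        # LW-process virtual page -> physical frame
        self.frame_to_dg = {}  # physical frame -> Delegatee virtual page
        self.shadow = {}       # LW-process virtual page -> Delegatee virtual page

    def premap_frame(self, frame, dg_vpage):
        """Record where a many-core physical frame is pre-mapped on Host OS."""
        self.frame_to_dg[frame] = dg_vpage

    def on_lwos_page_alloc(self, lw_vpage, frame):
        """Called when LWOS reports a new page; keeps the shadow table in sync."""
        self.lw_pt[lw_vpage] = frame
        self.shadow[lw_vpage] = self.frame_to_dg[frame]

    def translate_two_step(self, lw_addr):
        """Conversion that searches two tables, one after the other."""
        vpage, offset = divmod(lw_addr, PAGE)
        frame = self.lw_pt[vpage]
        return self.frame_to_dg[frame] * PAGE + offset

    def translate_shadow(self, lw_addr):
        """Conversion that searches only the shadow table."""
        vpage, offset = divmod(lw_addr, PAGE)
        return self.shadow[vpage] * PAGE + offset
```

Both methods return the same address, but the shadow path performs one lookup per conversion. The cost moves to `on_lwos_page_alloc`, which corresponds to the update Host OS must perform each time LWOS allocates a page.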

5. Evaluation
A. Environment
Fig. 10 shows the evaluation environment. As a substitute for a multi-core and many-core hybrid system, we constructed a NUMA system with two Intel Xeon X5690 processors (6 cores, 3.47 GHz). One of the two processors serves as the multi-core processor and runs the Linux Host OS. The other processor serves as the many-core processor and runs the LWOS kernel named Future, which we are developing ourselves. The LWOS is equipped with an I/O library that has a POSIX.1-compatible API. We also created a prototype I/O delegation mechanism for Linux on the Host OS side. We then measured the time required when an LWOS application issues a request for I/O access to the RAM disk set up on Host OS via the I/O library and the read/write functions are called.

Fig. 10 The prototype system for evaluation

B. Evaluation of LW-process control
We measured the execution time of a program that generates an LW-process and then immediately terminates it. We then compared the measured times with the Linux fork-exec overhead. Fig. 11 shows the test code. The round-trip time from generation of the process context up to Host OS receiving the reply from LWOS was at most about 700 µsec, 420 µsec on average, and at least 110 µsec for the many-core test code, whereas the Linux fork and exec system calls took at most about 8 msec and at least 75 µsec. The processing for lwp_create includes reading and parsing the executable file, loading it into memory, creating the LW-process page table, and notifying LWOS of the process generation by inter-OS communication. Since all of this processing is performed on the Host OS side, whereas Linux uses a demand-paging mechanism for process generation, the comparison with Linux includes a difference in memory-copying overhead. For the target MIC architecture, even with a large number of cores, the cache memory is small and the per-core performance is low. Therefore, a system software architecture in which LW-process management is supported by Host OS and the LW-process can be executed while switching to the kernel level as little as possible is considered to be effective.

Fig. 11 The test code for the evaluation of LW-process control

C. I/O Delegation Overhead
First, we measured the overhead of a file I/O request from LWOS to Host OS. The breakdown of the delay is presented in Table 6. The total overhead for the file I/O is about 5 µsec. About half of that is the delay for inter-OS communication, and the rest is delay due to the I/O delegation mechanism of Host OS. On the Host OS side, it takes time to call and prepare the worker thread, but most of that processing determines which worker thread to call, so the time can be reduced by improving the implementation. Although the shadow page table search takes 7% of the total time, using the existing page tables would require at least twice that time. Therefore, introducing the shadow page table can reduce the total delay by about 6%.

TABLE 6 THE OVERHEAD OF FILE I/O REQUEST

Items                    | Processing time (µsec) | Ratio (%)
Inter-OS communication   | 2.24                   | 46
Worker Thread Call       | 1.85                   | 38
Shadow Page Table Search | 0.34                   | 7
Other                    | 0.45                   | 9

Fig. 12 The bandwidth of read/write access

Fig. 13 The data access flow by the direct access method

The results of a bandwidth evaluation for various I/O access sizes are presented in Fig. 12. As a reference value, measured results for when the same RAM disk is accessed by Linux on the Host OS platform are also presented. The write performance is lower than the bandwidth for Host OS for all I/O sizes. For the 256MB I/O unit, it was 0.85 times the Linux bandwidth. For both the I/O delegation and Host OS, the bandwidth first increases in a convex curve and then converges on a constant value. That result comes from the change in the hit rate for the multi-core CPU cache on Host OS with I/O size. In particular, the L3 cache hit rate is 49% for data transfer when the I/O size is 4MB, which exhibits the maximum value for the I/O delegation, so the write performance is high. When the I/O size is less than 4MB, Host OS bandwidth is higher than that of the I/O delegation. The reason for that result is the difference in the bandwidth of the I/O read performed by the OS during the write operation. For Host OS alone, the bandwidth for reading from local memory is 9.04GB/s. By comparison, the bandwidth for reading the remote memory (that is, accessing the local memory managed by Host OS

from LWOS) by the I/O delegation is a low 5.06GB/s. Therefore, the write performance has not reached that of Host OS alone. The read performance differs from the write performance in that for the I/O size of 4MB or larger, the LWOS bandwidth was somewhat higher than that of Host OS alone. For the 256MB I/O size, the bandwidth was 1.05 times as high as for Linux. Furthermore, in the same way as for write, the bandwidth first increased in a convex curve and then converged on a constant value. That results from the high read performance due to the effect of the multi-core CPU cache memory on Host OS side. The L3 cache hit rate is a high 43% when the I/O size is 1MB, which exhibits the highest value for the I/O delegation. The read performance is higher than for Host OS alone because of the difference in the memory management method for allocating memory to the process. For Host OS alone, paging processing is invoked when data is written to the I/O buffer used by the process. When transferring 4KB of data, a page fault in the Linux file system results in fetching a new page and zero-clearing. For the I/O delegation, on the other hand, the LWOS reserves a static I/O buffer area for file read operations. The result is higher file read performance. D. Effect of Direct Access Method for I/O Buffer To reduce the data copying that occurs on the path to Host OS file system after an LWOS application has issued an I/O request, the file system of Host OS reads from and writes to the I/O buffer of the application running on the LWOS directly. To clarify the effect of this direct access method as shown in Fig. 13, we compare the bandwidth with that of the method in which I/O delegation requests are made via the kernel-level communication buffer as shown in Fig. 14. In the read/write system call handlers on the LWOS side, data is copied once from the application I/O buffer to the memory shared by the two OS. 
When Host OS receives an I/O delegation request, it calls the file system and copies the data between the page cache and the shared memory. The data copying required for I/O is thus done twice, once in LWOS and once in the file system of Host OS. That is one more copy than in the proposed direct access method. The bandwidths for the direct access and via-LWOS methods for file reads and writes are shown in Fig. 15.

Fig. 14 The data access flow by the I/O delegation via LWOS kernel
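The difference between the two data paths can be illustrated with a minimal sketch that only counts buffer-to-buffer copies for a write request. It is not the prototype's code: the class and function names (CopyCounter, write_via_lwos_kernel, write_direct) and the use of bytearray objects as stand-ins for the shared memory and the Host OS page cache are assumptions made for illustration.

```python
# Hypothetical model of the two write paths; all names are illustrative only.

class CopyCounter:
    """Counts how many times the payload is copied between buffers."""

    def __init__(self):
        self.copies = 0

    def copy(self, dst: bytearray, src) -> None:
        dst[:len(src)] = src
        self.copies += 1


def write_via_lwos_kernel(app_buf: bytes, counter: CopyCounter) -> bytearray:
    """Path of Fig. 14: application buffer -> shared memory -> page cache."""
    shared_mem = bytearray(len(app_buf))   # kernel-level communication buffer
    page_cache = bytearray(len(app_buf))   # stand-in for the Host OS page cache
    counter.copy(shared_mem, app_buf)      # copy 1: LWOS system call handler
    counter.copy(page_cache, shared_mem)   # copy 2: Host OS file system
    return page_cache


def write_direct(app_buf: bytes, counter: CopyCounter) -> bytearray:
    """Path of Fig. 13: the Host OS file system reads the application buffer."""
    page_cache = bytearray(len(app_buf))
    counter.copy(page_cache, memoryview(app_buf))  # the only copy
    return page_cache


if __name__ == "__main__":
    data = bytes(range(256)) * 16          # 4KB payload
    via, direct = CopyCounter(), CopyCounter()
    assert write_via_lwos_kernel(data, via) == write_direct(data, direct)
    print(via.copies, direct.copies)       # prints "2 1"
```

Both paths deliver the same bytes to the page cache stand-in; only the number of copies differs, which is the quantity the bandwidth comparison isolates.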

Fig. 15 The bandwidths for the direct access and via-LWOS methods for file reads and writes

The increase in bandwidth up to the peaks near the I/O size of 4MB is the effect of the CPU cache. When direct access is not performed, the bandwidth decreases by roughly half: the single round of data copying in the direct access method becomes two rounds, so the time required for data copying increases, and copying data accounts for the greater part of the time required for the I/O delegation. The bandwidth of the two methods is the same for the I/O size of 4KB because the inter-OS communication that accompanies the I/O request, together with other such overhead, is large compared to the data copying time.

The direct I/O buffer access method proposed for this system clearly improves bandwidth. Because direct access allows the LWOS to rely on the file cache on the Host OS side, memory use on LWOS can be reduced. However, this method requires a static mapping between virtual memory and physical memory on the LWOS side. Future introduction of paging on the LWOS side would require paging processing by LWOS.
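The dependence of the bandwidth gap on the I/O size can be reproduced qualitatively with a simple cost model: one delegated I/O costs a fixed request overhead plus one transfer time per copy. The sketch below is only such a model. The overhead and copy bandwidth values are assumed for illustration and are not measurements from this evaluation, and the model ignores the CPU cache effect noted above.

```python
# Illustrative cost model; parameter values are assumptions, not measured data.

def delegation_time(size_bytes, n_copies, overhead_s, copy_bw):
    """Fixed inter-OS request overhead plus the time of n_copies data copies."""
    return overhead_s + n_copies * size_bytes / copy_bw


def bandwidth(size_bytes, n_copies, overhead_s, copy_bw):
    """Effective bandwidth in bytes per second for one delegated I/O."""
    return size_bytes / delegation_time(size_bytes, n_copies, overhead_s, copy_bw)


def via_to_direct_ratio(size_bytes, overhead_s, copy_bw):
    """Bandwidth of the two-copy path relative to the one-copy path."""
    return (bandwidth(size_bytes, 2, overhead_s, copy_bw)
            / bandwidth(size_bytes, 1, overhead_s, copy_bw))


ASSUMED_OVERHEAD_S = 20e-6   # hypothetical fixed cost per I/O request
ASSUMED_COPY_BW = 5e9        # hypothetical memory copy rate in bytes/s

if __name__ == "__main__":
    for size in (4 * 1024, 4 * 1024 ** 2):
        ratio = via_to_direct_ratio(size, ASSUMED_OVERHEAD_S, ASSUMED_COPY_BW)
        print(f"{size:>8} bytes: via-LWOS / direct = {ratio:.2f}")
```

Under these assumed parameters the ratio stays close to 1 for 4KB requests, where the fixed overhead dominates, and approaches one half for 4MB requests, where copying dominates.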

6. Related Work
There has been much research on implementing operating systems for distributed systems that bundle multiple nodes into a single system image. BlueGene/L [10] runs a small kernel on compute nodes, with Linux running on the I/O node. The I/O node handles program execution management and file I/O for the compute nodes. Shimizu [11] proposes a remote process and remote file I/O management architecture that enables processes on compute nodes with dedicated processors to be executed from a management node with general purpose processors. The division of OS functions in both BlueGene/L and Shimizu's approach is very similar to ours. In particular, Shimizu's approach performs remote paging on the management node: page faults for files mapped into the remote process space are resolved by the remote pager function in the remote I/O management kernel module, which issues a remote page read to the management node. In our proposed system, Host OS shares the page management information with LWOS, and the Host-process can access the virtual address space of the LW-process. Therefore, file I/O access is completed directly between Host OS and LWOS using the shared memory and inter-processor communication through an internal bus.

fos [12] is a multi-core operating system that consists of a microkernel and OS function servers. Through the microkernel, OS functions such as scheduling, memory management, and the file system are distributed to several cores. fos conducts inter-core communication by message passing, whereas our proposed system communicates through shared memory and transfers data directly between the many-core and multi-core processors.

FlexSC [13] proposes an implementation of exception-less system calls in the Linux kernel for multi-core processors. A data structure called a syscall page is assigned to each process, and the process posts system call requests to the syscall page. The actual execution of system calls is performed asynchronously by special in-kernel syscall threads. The mechanism for offloading a system call in our work is similar in that it uses a data space like the syscall page of FlexSC. In our work, we apply the mechanism to a many-core/multi-core computing system, so user-level LW-processes on the many-core processor delegate I/O requests to Host OS on another processor through inter-OS communication. Our work assigns the data space for receiving I/O requests to each core of the many-core processor, and the I/O system of Host OS takes requests from the FIFO data queue, while FlexSC allocates a kernel thread to each syscall page and takes the system call function from the syscall page.

FusedOS [14] is a hybrid OS approach that combines a full-weight kernel and a light-weight kernel on a heterogeneous multi-core system. FusedOS provides an infrastructure capable of partitioning the resources of a heterogeneous multi-core system and collaboratively running different operating environments on subsets of the cores and memory, without the use of a virtual machine monitor. In their prototype system, FusedOS ships system calls to a proxy process running in a Linux system. In our system, the proxy process (that is, the delegatee Host-process) runs on an entirely different processor from the parallel application, while in FusedOS the proxy runs on the same multi-core processor.
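The request path contrasted with FlexSC above, in which each core of the many-core processor has its own data space for I/O requests and the Host OS I/O system takes requests in FIFO order, can be sketched as follows. This is a hedged illustration rather than the prototype's implementation: the names IORequest and HostIOServer, the use of one queue per core, and the core-by-core drain order are all assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class IORequest:
    core_id: int   # many-core core that issued the request
    op: str        # "read" or "write"
    nbytes: int    # requested transfer size


class HostIOServer:
    """Hypothetical Host OS side: one FIFO request queue per many-core core."""

    def __init__(self, n_cores: int):
        self.queues = [deque() for _ in range(n_cores)]

    def submit(self, req: IORequest) -> None:
        # An LW-process enqueues into the data space owned by its own core,
        # so cores never contend for a single shared queue.
        self.queues[req.core_id].append(req)

    def drain(self) -> list:
        # The Host OS I/O system serves each queue in FIFO order.
        # Visiting cores in index order is an assumption of this sketch.
        served = []
        for queue in self.queues:
            while queue:
                served.append(queue.popleft())
        return served


if __name__ == "__main__":
    server = HostIOServer(n_cores=2)
    server.submit(IORequest(1, "write", 4096))
    server.submit(IORequest(0, "read", 8192))
    server.submit(IORequest(1, "read", 16))
    for req in server.drain():
        print(req)
```

Requests from the same core are served in the order they were issued, which is the property the FIFO data queue provides; ordering across cores is left to the scheduling policy of the Host OS side.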

7. Conclusion
We have proposed the design of an operating system that manages a hybrid computer system architecture with multi-core and many-core processors for exa-scale computing. The host operating system (Host OS) on the multi-core processor performs some functions on behalf of the lightweight operating system (LWOS) on the many-core processor, so that the many-core processor can be dedicated to executing the parallel program. We have described the structures of the Host OS and LWOS and explained the I/O delegation mechanism and inter-OS communication. Looking ahead, parallel computing performance can be expected to vary with how the CPU cores of the many-core processor are arranged and combined. We are investigating core assignment algorithms with the LWOS thread library to implement those functions, and will continue to improve the parallel computing performance of many-core processors. We will also evaluate the effect of this hybrid OS using an Intel MIC processor.

Acknowledgment
This work is partially supported by JST, CREST (Japan Science and Technology Agency, CREST).

References
[1] Top500. Supercomputer sites (online). Available via WWW: http://www.top500.org/statistics/perfdevel/ (accessed Apr. 7, 2013).
[2] B. Dally, "GPU computing to exascale and beyond," (online). Available via WWW: http://www.nvidia.com/content/PDF/sc_2010/theater/Dally_SC10.pdf (accessed Apr. 7, 2013).
[3] P. Gschwandtner, T. Fahringer & R. Prodan, "Performance Analysis and Benchmarking of the Intel SCC," (2011) IEEE International Conference on Cluster Computing, pp. 139-149.
[4] Intel, "Many integrated core (MIC) architecture - advanced," (online). Available via WWW: http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html?wapkw=many-integrated-core (accessed Apr. 7, 2013).
[5] Intel, "Intel Xeon Phi Coprocessor 5110P," (online). Available via WWW: http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html (accessed Apr. 7, 2013).
[6] M. Sato, G. Fukazawa, K. Nagamine, R. Sakamoto, M. Namiki, K. Yoshinaga, Y. Tsujita, A. Hori & Y. Ishikawa, "A design of hybrid operating system for a parallel computer with multi-core and many-core processors," (2012) the 2nd International Workshop on Runtime and Operating Systems for Supercomputers, ROSS'12, Article No. 9.
[7] K. Skaugen, "Petascale to exascale - extending intel's hpc commitment," (online). Available via WWW: http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf (accessed Apr. 7, 2012).
[8] K. Yoshinaga, Y. Tsujita, A. Hori, M. Sato, M. Namiki & Y. Ishikawa, "Delegation-Based MPI Communications for a Hybrid Parallel Computer with Many-Core Architecture," (2012) 19th European MPI Users' Group Meeting, EuroMPI, pp. 47-56.
[9] K. Sasada, M. Sato, S. Kawahara, N. Kato, M. Yamato, H. Nakajo & M. Namiki, "Implementation and Evaluation of a Thread Library for Multithreaded Architecture," (2003) The 2003 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'03), Vol. II, pp. 609-615.
[10] M. Giampapa, T. Gooding, T. Inglett & R. W. Wisniewski, "Experiences with a lightweight supercomputer kernel: lessons learned from Blue Gene's CNK," (2010) the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC'10, pp. 1-10.
[11] M. Shimizu & A. Yonezawa, "Remote process execution and remote file I/O for heterogeneous processors in cluster systems," (2010) the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, CCGRID'10, pp. 145-154.
[12] D. Wentzlaff & A. Agarwal, "Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores," (2009) ACM SIGOPS Operating Systems Review, Vol. 43, Issue 2, pp. 76-85.
[13] L. Soares & M. Stumm, "FlexSC: flexible system call scheduling with exception-less system calls," (2010) the 9th USENIX conference on Operating systems design and implementation, OSDI'10, Article No. 1-8.
[14] Y. Park, E. Van Hensbergen, M. Hillenbrand, T. Inglett, B. Rosenburg, K. D. Ryu & R. W. Wisniewski, "FusedOS: fusing LWK performance with FWK functionality in a heterogeneous environment," (2012) the 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 211-218.

Mikiko Sato received her B.E., M.E. and Ph.D. in Computer and Information Sciences from Tokyo University of Agriculture and Technology in 1988, 1990 and 2006, respectively. She is currently an Assistant Professor in the Faculty of Engineering at Tokyo University of Agriculture and Technology. Her research interests include operating systems, parallel computers, and embedded systems. She is a member of IEEE and Information Processing Society of Japan (IPSJ).

Go Fukazawa received his B.E. in Computer and Information Science from Tokyo University of Agriculture and Technology in 2012. He is currently a Graduate Student at the Department of Computer and Information Science of Tokyo University of Agriculture and Technology. His research interests include system software for distributed systems. He is a student member of Information Processing Society of Japan (IPSJ).

Kazumi Yoshinaga received his B.E., M.E., and Ph.D. degrees in Information Engineering from Kyushu Institute of Technology, Japan, in 2004, 2006, and 2011, respectively. He is currently a Postdoctoral Researcher at RIKEN Advanced Institute for Computational Science, Japan. His research interests include high performance computing, many-core systems and MPI middleware. He is a member of Information Processing Society of Japan (IPSJ).

Yuichi Tsujita received his B.E., M.E., and Ph.D. in Engineering from University of Tsukuba in 1994, 1996, and 1999, respectively. He is currently a Research Scientist at RIKEN Advanced Institute for Computational Science, Japan. His research interests include parallel I/O and parallel computing middleware. Dr. Tsujita is a member of IEEE, ACM SIGHPC, and IPSJ.

Atsushi Hori is a Senior Scientist of the System Software Team at RIKEN Advanced Institute for Computational Science, Japan. His current research interests include parallel operating systems. He received B.S. and M.S. degrees in Electrical Engineering from Waseda University, and a Ph.D. from the University of Tokyo.

Mitaro Namiki is a Professor in the Faculty of Engineering at Tokyo University of Agriculture and Technology. His research interests include operating systems, programming languages, parallel processing, and computer networks. He has a Ph.D. in computer science from Tokyo University of Agriculture and Technology. He is a member of IEICE, ACM and IPSJ.

International Journal Publishers Group (IJPG)