
JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 1, ISSUE 1, MAY 2010 47

FLEXIBLE MULTI-CORE PROCESSOR DESIGN


H. G. Konsowa¹, E. M. Saad¹, M. H. A. Awadalla¹,²
¹ Faculty of Engineering, Helwan University, Cairo, Egypt
² Electrical and Computer Engineering Department, SQU, Oman

Abstract: Multicore processors integrate several cores on a single chip. The continuing need to improve multicore performance motivates dynamic designs. This paper presents three new algorithms, MTRANS, MTNMM and MICRO_COUNT, that improve the performance of multicore processors and avoid long-latency loads through dynamic multicore resource allocation. The proposed technique distributes the resources dynamically among the threads. Extensive simulation experiments have been conducted using different SPLASH benchmark programs. The achieved results show a performance improvement for the multicore processor system, especially with the MICRO_COUNT algorithm.
Index Terms: Multicore design, Fetch policy, Performance, Simulation architecture.

1 INTRODUCTION
A multicore processor relies on a multithreading architecture in many of its stages. Multithreading architectures, especially Simultaneous Multithreading (SMT), provide the microprocessor with significant parallelism, which is regarded as the key to better performance and higher throughput. SMT improves resource efficiency by scheduling and executing concurrent threads on the same core. However, sharing multicore resources leads to an uneven distribution of resources among threads and to unfair performance, which cannot be corrected without a well-designed allocation scheme. In this paper, multicore resources are distributed according to a transaction metric, which is described in Section 4. The metric is used to evaluate multicore performance and depends on the number of instructions in the fetch queue, reorder buffer and commit queue of each thread.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the simulation framework and the benchmark suite. Section 4 presents the developed methodology. Section 5 presents the experiments and discussion. Section 6 concludes the paper.

2 RELATED WORK
Simultaneous Multithreading (SMT) increases processor throughput by allowing several threads to execute in parallel. Static resource partitioning techniques have been suggested, but they are less effective than dynamically controlling the resource usage of each thread, because program phases change over time. Static resource partitioning [1], [2] splits critical resources evenly among all threads, thus preventing a single thread from monopolizing them. However, this method lacks flexibility and can leave resources idle when one thread has no need for them, even if other threads could benefit from additional resources. In this section we briefly review previous work on dynamic resource allocation in multiprocessor, multithreaded and multicore platforms. Although several proposals in the literature address the management of a single micro-architectural resource, proposals to manage multiple interacting resources on multicore chips at runtime are much scarcer.

H. G. Konsowa is a PhD student at the Faculty of Engineering, Helwan University. She received the B.Sc. in Communication and Electronics Engineering from Ain Shams University, Cairo, Egypt. She works in the Information Technology department of the Coca-Cola Bottling Company. E. M. Saad is a Professor of Electronic Circuits, Faculty of Engineering, Helwan University. He is a member of the National Radio Science Committee and a member of the scientific consultant committee in the Egyptian Eng. Syndicate for Electrical Engineers, Communication, Electronics and Computers, Egypt. M. H. A. Awadalla is with the Electrical and Computer Engineering Department, SQU, Oman, on leave from the Department of Communication, Electronics and Computers, Helwan University, Egypt.


Martínez [3] introduced a machine learning approach to multicore resource management. It produces self-optimizing on-chip hardware agents capable of learning, planning, and continuously adapting to changing workload demands. This results in more efficient and flexible management of critical hardware resources at runtime.

A history-aware, resource-based dynamic (HARD) scheduler for heterogeneous CMPs has been developed in [4]. HARD is a dynamic scheduler for heterogeneous multicore systems that records application resource utilization and throughput in order to adaptively change the cores assigned to applications at runtime, achieving both performance and power improvements. HARD uses past thread assignments to find the best matching core for every application: it saves power by downgrading applications with low resource utilization to weaker cores and improves performance by upgrading demanding applications to stronger cores.

The authors of [5] proposed an algorithm that dynamically assigns resources to each thread according to changes in thread behavior. The Adaptive Resource Partitioning Algorithm (ARPA) analyzes the resource usage efficiency of each thread over a time period and assigns more resources to the threads that can use them more efficiently. The purpose of ARPA is to improve the efficiency of resource utilization, thereby improving overall instruction throughput.

Three further dynamic fetch policies (EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH) are introduced in [6]. These policies depend on the Ordinary Least Squares (OLS) regression method [7]. To predict the future number of L2 data cache misses, an OLS regression equation is fitted for each thread using a number of samples equal to the window array size; that is, the function that calculates the future L2 data cache misses, called the regression engine, exists at the thread level. All of these algorithms select for fetching the thread with the lowest L2 data cache miss value, in order to increase data locality. WZ-FETCH is the best of these fetch policies for all benchmark programs and metrics used.
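The regression-based selection described for [6] can be illustrated with a minimal sketch. It is not the implementation from [6]: it assumes that the regressor is simply the sample index within the window (x = 0, 1, ..., n-1) and that the prediction is the fitted line evaluated at the next index. The function names and the flat window layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Ordinary least squares fit of y = a + b*x over a window of n samples,
 * with x = 0, 1, ..., n-1 (assumption: the sample index is the regressor).
 * Returns the fitted value at x = n, i.e. the predicted next sample.
 */
static double ols_predict_next(const double *window, size_t n)
{
    assert(window != NULL && n > 0);
    if (n == 1)
        return window[0];               /* one sample: no slope can be fitted */

    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        sx  += x;
        sy  += window[i];
        sxx += x * x;
        sxy += x * window[i];
    }
    double dn = (double)n;
    double slope = (dn * sxy - sx * sy) / (dn * sxx - sx * sx);
    double intercept = (sy - slope * sx) / dn;
    return intercept + slope * dn;
}

/*
 * Hypothetical selection step: windows holds num_threads consecutive windows
 * of window_len L2 miss samples each. Returns the thread whose predicted next
 * miss count is lowest (ties go to the lowest thread index).
 */
static size_t thread_with_least_predicted_misses(const double *windows,
                                                 size_t num_threads,
                                                 size_t window_len)
{
    assert(windows != NULL && num_threads > 0);
    size_t best = 0;
    double best_pred = ols_predict_next(windows, window_len);
    for (size_t t = 1; t < num_threads; t++) {
        double pred = ols_predict_next(windows + t * window_len, window_len);
        if (pred < best_pred) {
            best_pred = pred;
            best = t;
        }
    }
    return best;
}
```

With a window of rising miss counts for one thread and falling counts for another, the sketch favors the thread whose misses are trending down, which matches the stated goal of fetching from the thread with the lowest expected L2 miss value.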

benchmark programs that can be executed as a sequential workload [4]. Our simulation tool supports a set of parameters that specify how the stages are organized in a multithreaded design. The pipeline model in the multicore simulator is divided into five stages, as shown in Fig. 1: fetch, decode/rename, issue, execute and commit. Stages can be shared among threads or private per thread, except the execute stage, which is shared by definition in a multithreaded design.
Fig. 1. Multicore pipeline stages. [Block diagram: trace/instruction cache, trace queue, opcode queue, fetch, fetch queue, decode, dispatch, load/store queue, data cache, issue, reorder buffer, functional units (FU), writeback, commit, register file.]

4 THE PROPOSED METHODOLOGY


A multicore processor switches among threads according to a thread selection policy. An SMT processor is able to issue instructions from different threads in a single cycle. The time-slice policy for SMT processors is used as the baseline for comparison. The proposed algorithms start by partitioning the resources equally among the threads as a resource baseline. During execution, the resources are redistributed among the threads to improve performance. In this paper the shared allocated resources are used, and the resource sizes change dynamically among the threads at run time. The deviation from the baseline allocation is driven by a transaction metric, a modified version of the metric in [3]. It is defined as the ratio of the number of instructions in the commit queue to the sum of the numbers of instructions in the micro opcode queue, the fetch queue and the reorder buffer. The metric is used to differentiate between the threads and can be regarded as a change evaluation metric: it identifies the active thread, which needs more resources. The instruction fetch engine was recognized very early as a natural resource usage controller in an SMT processor [8], [9]. In practice, the fetch policy indirectly controls the usage of all processor resources. It can therefore be leveraged

3 SIMULATION FRAMEWORK
The simulator used (Multi2Sim) is modified [8] to be capable of simulating dynamic resources for multicore processors. It is a simulation framework for heterogeneous computing systems, including models for superscalar, multithreaded, multicore, and graphics processors. Multi2Sim is adapted to dynamic multicore processor design by adding a dynamic feature to the thread selection policy in the fetch stage [9]. Our framework consists of the multicore simulation tool and a subset of benchmark programs used to evaluate architectural enhancements of the multicore. All threads of the multicore are loaded either with one benchmark program that can be executed in parallel or with multiple


to avoid phenomena that can starve the concurrent threads, e.g. monopolization of the instruction queues by a thread due to long-latency load misses in the last cache levels. Memory-level parallelism and branch mispredictions can also be taken into account by the fetch policy. The fetch stage, as one of the multicore resources, is selected to implement the proposed algorithms.

In this paper three algorithms are developed. The first, the Maximum Transaction algorithm (MTRANS), increments the resource allocated to the thread with the maximum value of the transaction metric, so that this thread can execute its instructions faster. The additional resource is taken from the other threads, so resources are passed from thread to thread gradually. The resource sizes are bounded between a lower and an upper limit. The transaction metric is calculated periodically, after a number of cycles called an epoch. After each epoch, an additional resource is granted in steps.

The second, the Maximum Transaction No Max Misses algorithm (MTNMM), also increments the resource allocated to a specific thread. This thread must have the maximum value of the transaction metric and must not have the maximum number of L1 data cache misses. In this case, the thread gains more resources to speed up the execution of its instructions. However, if the thread with the maximum transaction metric also has the maximum number of L1 data cache misses, it keeps its old resource size. Thus the transaction metric is recalculated at each epoch to select the active thread, but the resource sizes do not necessarily change every time; that depends on the thread selection criterion.

The third algorithm, MICRO_COUNT, follows the same concept as the second but uses a different condition for selecting the thread whose resources change: the selected thread must not have the minimum number of opcodes in the micro opcode queue. The increase in resource allocation is granted to the active thread only if it does not have the minimum number of opcodes in the micro opcode queue. The pseudo-code and flowcharts of the algorithms are shown in Figs. 2-6.

// MTRANS
// Select the thread with the maximum transaction metric, regardless of data misses.
get_thrd_max_Transaction() {
    for (i = 0; i < 4; i++) {
        if (thrd_max_trans == i) continue;
        // take step_q entries from thread i, bounded below by q_sz_min
        q_sz[i] = q_sz[i] - step_q;
        if (q_sz[i] <= q_sz_min) q_sz[i] = q_sz_min;
        // grant step_q entries to the selected thread, bounded above by q_sz_max
        q_sz[thrd_max_trans] = q_sz[thrd_max_trans] + step_q;
        if (q_sz[thrd_max_trans] >= q_sz_max) q_sz[thrd_max_trans] = q_sz_max;
        break;
    }
}
Fig. 2. MTRANS pseudo-code
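The MTRANS step can also be written as a small compilable sketch. It is an illustration under stated assumptions rather than the simulator code: the struct and function names are hypothetical, the metric is taken to be 0 when the denominator queues are empty, and, unlike Fig. 2, the donor is the first thread that can actually spare entries, so that the total number of queue entries is conserved.

```c
#include <assert.h>

#define NUM_THREADS 4   /* the pseudo-code in Fig. 2 loops over four threads */

/* Per-thread occupancy sampled at the end of an epoch (hypothetical names). */
typedef struct {
    unsigned commit_q;  /* instructions in the commit queue */
    unsigned uop_q;     /* entries in the micro opcode queue */
    unsigned fetch_q;   /* instructions in the fetch queue */
    unsigned rob;       /* instructions in the reorder buffer */
} thread_sample;

/* Transaction metric: commit / (uop + fetch + rob).
 * Assumption: an empty denominator yields a metric of 0. */
static double transaction_metric(const thread_sample *s)
{
    unsigned denom = s->uop_q + s->fetch_q + s->rob;
    return denom == 0 ? 0.0 : (double)s->commit_q / (double)denom;
}

/* Thread with the largest metric; ties go to the lowest index. */
static int thread_max_transaction(const thread_sample samples[NUM_THREADS])
{
    int best = 0;
    for (int i = 1; i < NUM_THREADS; i++)
        if (transaction_metric(&samples[i]) > transaction_metric(&samples[best]))
            best = i;
    return best;
}

/* One MTRANS step: move up to step_q fetch queue entries from the first
 * thread that can spare them to the winner, keeping every size within
 * [q_min, q_max]. Returns the number of entries actually moved. */
static int mtrans_step(int q_sz[NUM_THREADS], int winner,
                       int step_q, int q_min, int q_max)
{
    assert(winner >= 0 && winner < NUM_THREADS);
    assert(step_q > 0 && q_min <= q_max);
    int room = q_max - q_sz[winner];        /* what the winner can still take */
    for (int i = 0; i < NUM_THREADS && room > 0; i++) {
        if (i == winner) continue;
        int spare = q_sz[i] - q_min;        /* what thread i can give up */
        if (spare <= 0) continue;
        int moved = step_q;
        if (moved > spare) moved = spare;
        if (moved > room) moved = room;
        q_sz[i] -= moved;
        q_sz[winner] += moved;
        return moved;
    }
    return 0;
}
```

Calling `thread_max_transaction` once per epoch and passing its result to `mtrans_step` reproduces the gradual, bounded hand-over of entries described in the text.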

// MTNMM
// Select the thread with the maximum transaction metric, provided it does not
// have the maximum number of L1 data cache misses.
get_thrd_max_Transaction() {
    get_thrd_max_miss();
    for (i = 0; i < 4; i++) {
        if (thrd_max_trans == i) continue;
        // exclude the thread with the maximum data cache misses
        if (thrd_max_trans == thrd_max_misses) continue;
        q_sz[i] = q_sz[i] - step_q;
        if (q_sz[i] <= q_sz_min) q_sz[i] = q_sz_min;
        q_sz[thrd_max_trans] = q_sz[thrd_max_trans] + step_q;
        if (q_sz[thrd_max_trans] >= q_sz_max) q_sz[thrd_max_trans] = q_sz_max;
        break;
    }
}
Fig. 3. MTNMM pseudo-code
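A compilable sketch of one MTNMM epoch follows. It is only an illustration: the function names are hypothetical, the per-thread metric and L1 data cache miss counts are assumed to be supplied by the simulator as arrays, and the donor is assumed to be the first thread that can spare entries.

```c
#include <assert.h>

#define N_THREADS 4   /* the pseudo-code in Fig. 3 loops over four threads */

/* Index of the largest metric; ties go to the lowest index. */
static int argmax_metric(const double v[N_THREADS])
{
    int best = 0;
    for (int i = 1; i < N_THREADS; i++)
        if (v[i] > v[best]) best = i;
    return best;
}

/* Index of the largest miss count; ties go to the lowest index. */
static int argmax_misses(const unsigned v[N_THREADS])
{
    int best = 0;
    for (int i = 1; i < N_THREADS; i++)
        if (v[i] > v[best]) best = i;
    return best;
}

/* One MTNMM epoch. The thread with the maximum transaction metric receives
 * up to step_q entries unless it also has the most L1 data cache misses, in
 * which case all sizes stay as they were. Returns the entries moved. */
static int mtnmm_epoch(int q_sz[N_THREADS], const double metric[N_THREADS],
                       const unsigned l1d_misses[N_THREADS],
                       int step_q, int q_min, int q_max)
{
    assert(step_q > 0 && q_min <= q_max);
    int winner = argmax_metric(metric);
    if (winner == argmax_misses(l1d_misses))
        return 0;                           /* keep the old resource sizes */

    int room = q_max - q_sz[winner];
    for (int i = 0; i < N_THREADS && room > 0; i++) {
        if (i == winner) continue;
        int spare = q_sz[i] - q_min;        /* entries thread i can give up */
        if (spare <= 0) continue;
        int moved = step_q;
        if (moved > spare) moved = spare;
        if (moved > room) moved = room;
        q_sz[i] -= moved;
        q_sz[winner] += moved;
        return moved;
    }
    return 0;
}
```

The early return is the only difference from the MTRANS step: the metric is still evaluated every epoch, but the sizes change only when the selection criterion is met.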

// MICRO_COUNT
// Select the thread with the maximum transaction metric, provided it does not
// have the minimum number of opcodes in the micro opcode queue.
uopq_count(core, thread);
min_micr = 0; frst = 0;
for (i = 0; i < 4; i++) {
    if (frst == 0) { min_micr = micr_count[0]; thrd_min_micr = 0; frst = 1; }
    if (micr_count[i] == 0) continue;
    if (micr_count[i] < min_micr) { min_micr = micr_count[i]; thrd_min_micr = i; }
}
for (i = 0; i < 4; i++) {
    if (thrd_max_trans == i) continue;
    // exclude the thread with the minimum micro opcode count
    if (thrd_max_trans == thrd_min_micr) continue;
    q_sz[i] = q_sz[i] - step_q;
    if (q_sz[i] <= q_sz_min) q_sz[i] = q_sz_min;
    q_sz[thrd_max_trans] = q_sz[thrd_max_trans] + step_q;
    if (q_sz[thrd_max_trans] >= q_sz_max) q_sz[thrd_max_trans] = q_sz_max;
    break;
}
Fig. 4. MICRO_COUNT pseudo-code
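The MICRO_COUNT condition can be sketched in the same compilable form. The names are hypothetical, and two choices are assumptions made for illustration: threads with an empty micro opcode queue are ignored when searching for the minimum, and the donor is the first thread that can spare entries.

```c
#include <assert.h>

#define MC_THREADS 4   /* the pseudo-code in Fig. 4 loops over four threads */

/* Thread with the fewest micro opcodes among threads whose queue is not
 * empty; returns -1 when every queue is empty (assumption). */
static int thread_min_uops(const unsigned uop_count[MC_THREADS])
{
    int best = -1;
    for (int i = 0; i < MC_THREADS; i++) {
        if (uop_count[i] == 0) continue;
        if (best < 0 || uop_count[i] < uop_count[best]) best = i;
    }
    return best;
}

/* One MICRO_COUNT epoch. winner is the thread with the maximum transaction
 * metric. It receives up to step_q entries unless it is also the thread
 * with the minimum micro opcode count. Returns the entries moved. */
static int micro_count_epoch(int q_sz[MC_THREADS], int winner,
                             const unsigned uop_count[MC_THREADS],
                             int step_q, int q_min, int q_max)
{
    assert(winner >= 0 && winner < MC_THREADS);
    assert(step_q > 0 && q_min <= q_max);
    if (winner == thread_min_uops(uop_count))
        return 0;                           /* keep the old resource sizes */

    int room = q_max - q_sz[winner];
    for (int i = 0; i < MC_THREADS && room > 0; i++) {
        if (i == winner) continue;
        int spare = q_sz[i] - q_min;        /* entries thread i can give up */
        if (spare <= 0) continue;
        int moved = step_q;
        if (moved > spare) moved = spare;
        if (moved > room) moved = room;
        q_sz[i] -= moved;
        q_sz[winner] += moved;
        return moved;
    }
    return 0;
}
```

A thread whose micro opcode queue is nearly drained has little work ready to consume extra fetch queue entries, which is one way to read the reason for this exclusion.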

Fig. 5. Dynamic resource allocation based on the MTRANS algorithm. [Flowchart: initiation with equally allocated resources; take baseline; thread behavior analysis; metric calculation; thread selection; update allocated resource; increment counters; test "Epochs = max?" (Yes/No).]


5 SIMULATION RESULTS
The number of executed instructions per cycle (IPC) is the main measurement unit in this paper. The measured IPC is the overall throughput of all threads. To apply the proposed algorithms, suitable values must be selected for two parameters: (a) the epoch size, which is the period after which the transaction metric is calculated and the dynamic resource change is triggered; and (b) the step size, which is the number of fetch queue entries transferred from one thread to another at each epoch. In the static fetch policy the resource has a fixed size and instructions are fetched from the threads according to a fixed time slice.

The results are divided into several sets:
1) The results of running four benchmark applications with two different fetch policies (MTRANS and static) are shown in Fig. 7.

Fig. 6. Dynamic resource allocation based on the MTNMM algorithm. [Flowchart: initiation with equally allocated resources; take baseline; thread behavior analysis; metric calculation; thread selection; test "Thread has max L2 misses?" (Yes/No); deal with old resource values; update allocated resource; increment counters; test "Epochs = max?" (Yes/No).]

Fig. 7. Instructions per cycle for some SPLASH benchmark programs when epoch=2000 instructions for the static and MTRANS fetch policies.

TABLE 1
INSTRUCTIONS PER CYCLE (IPC) FOR SOME SPLASH BENCHMARK PROGRAMS FOR STATIC AND MTRANS FETCH POLICIES WHEN EPOCH=2000 AND STEP=1.

Application    MTRANS    Static
Sort           2.039     1.9
Ocean          1.83      1.82
Barnes         3.412     3.3
Fmm            1.532     1.5
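To put the Table 1 figures in relative terms, the following sketch computes the IPC gain of MTRANS over the static policy for each application. The data are copied from Table 1; the type and function names are hypothetical and the relative gain is simply (MTRANS - Static) / Static.

```c
#include <assert.h>

/* One row of Table 1 (values copied from the table). */
typedef struct {
    const char *app;
    double mtrans_ipc;
    double static_ipc;
} ipc_row;

static const ipc_row table1[] = {
    { "Sort",   2.039, 1.9  },
    { "Ocean",  1.83,  1.82 },
    { "Barnes", 3.412, 3.3  },
    { "Fmm",    1.532, 1.5  },
};

/* Relative IPC gain of MTRANS over the static policy for one row. */
static double relative_gain(const ipc_row *row)
{
    assert(row != 0 && row->static_ipc > 0.0);
    return (row->mtrans_ipc - row->static_ipc) / row->static_ipc;
}
```

For these values the gain ranges from about 0.5% (Ocean) to about 7.3% (Sort), with Barnes at about 3.4% and Fmm at about 2.1%.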

Table 1 and Fig. 7 show the change in performance according to whether the allocated resource is fixed or dynamic. The dynamic resource allocation is applied to the fetch queue. They compare the effect of changing the fetch queue size by a delta value after every 2000 instructions against the use of a static fetch queue size. The obtained results show that performance increases with the MTRANS fetch policy.
2) The results of running four benchmark applications with the two new fetch policies (MTRANS and MTNMM) are shown in Fig. 8.

TABLE 2
AVERAGE INSTRUCTIONS PER CYCLE (AVG_IPC) FOR SOME SPLASH BENCHMARK APPLICATIONS USING THE MTRANS, MTNMM AND MICRO_COUNT FETCH POLICIES.

Fetch policy    Fmm, Fft    Ocean, Fmm    Barnes, Lu    Radix, Fft    Sort, Fmm, Fft
MTRANS          1.524       1.358         1.087         0.841         1.636
MTNMM           1.4         1.358         1.087         0.837         1.656
MICRO_COUNT     1.522       1.359         1.087         0.8395        1.77
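A single summary number per policy makes Table 2 easier to compare. The sketch below takes the unweighted mean of the five Avg_IPC values in each row; the unweighted mean is an assumption made here for illustration, not a figure reported in the paper, and the identifiers are hypothetical.

```c
#include <assert.h>

#define NUM_MIXES 5   /* application mixes (columns) in Table 2 */

/* Rows of Table 2, in column order:
 * Fmm+Fft, Ocean+Fmm, Barnes+Lu, Radix+Fft, Sort+Fmm+Fft. */
static const double mtrans_avg_ipc[NUM_MIXES]      = { 1.524, 1.358, 1.087, 0.841,  1.636 };
static const double mtnmm_avg_ipc[NUM_MIXES]       = { 1.4,   1.358, 1.087, 0.837,  1.656 };
static const double micro_count_avg_ipc[NUM_MIXES] = { 1.522, 1.359, 1.087, 0.8395, 1.77  };

/* Unweighted mean of one policy's Avg_IPC across the mixes. */
static double mean_over_mixes(const double ipc[NUM_MIXES])
{
    assert(ipc != 0);
    double sum = 0.0;
    for (int i = 0; i < NUM_MIXES; i++)
        sum += ipc[i];
    return sum / NUM_MIXES;
}
```

Under this summary MICRO_COUNT comes out highest (about 1.316), followed by MTRANS (about 1.289) and MTNMM (about 1.268), although MTRANS is marginally ahead on the Fmm, Fft and Radix, Fft mixes.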

Fig. 8. Instructions per cycle for some SPLASH benchmark programs when epoch=2000 instructions for the MTRANS and MTNMM algorithms.
Fig. 9. Average instructions per cycle for SPLASH benchmark programs when epoch=2000 instructions for the MTRANS, MTNMM and MICRO_COUNT algorithms.

The best algorithm for one application is not necessarily the best for all applications; in this sense the MTNMM algorithm is application specific. As shown in Fig. 8, the MTRANS fetch policy outperforms MTNMM for all benchmarks except the FFT application.
3) The results of running mixes of SPLASH benchmark applications together. In this case the average instructions per cycle (Avg_IPC) of each group of applications is used as the performance metric. As shown in Fig. 9, the MICRO_COUNT fetch policy performs best overall.

4) The results illustrate how the values of epoch and step used in the result sets above were selected. As shown in Fig. 10, Fig. 11, Table 3 and Table 4, the highest IPC occurs at an epoch of 6000 and a step size of 5 using the MTRANS algorithm.
TABLE 3
AVERAGE INSTRUCTIONS PER CYCLE (AVG_IPC) FOR SOME SPLASH BENCHMARK APPLICATIONS AT DIFFERENT EPOCH VALUES FOR THE MTRANS ALGORITHM.

Epoch    IPC
1000     1.91
2000     1.965
6000     1.965
10000    1.965
20000    1.953
30000    1.965


6 CONCLUSION AND FUTURE WORK


This paper has introduced three algorithms: MTRANS, MTNMM and MICRO_COUNT. The results show that the developed algorithms outperform the static policy. The dynamic resource allocation technique is based on a dynamic distribution of the resources among the threads. It identifies which of the threads competing for a given resource is able to give part of its resources to another thread without degrading system performance. The proposed algorithms continuously redistribute the resources and ensure that no resource-hungry thread exceeds its rightful allocation. For future work, the target is to apply the same algorithms to other resources.

Fig. 10. Average instructions per cycle for SPLASH benchmark programs at different step values for the MTRANS algorithm.

TABLE 4
AVERAGE INSTRUCTIONS PER CYCLE (AVG_IPC) FOR SOME SPLASH BENCHMARK APPLICATIONS AT DIFFERENT STEP VALUES FOR THE MTRANS ALGORITHM.

Step    IPC
1       1.965
2       1.966
3       1.918
4       1.965
5       1.975

Fig. 11. Average instructions per cycle for SPLASH benchmark programs at different epoch values for the MTRANS algorithm.

7 REFERENCES
[1] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton, "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology J., vol. 6, no. 1, pp. 4-15, Feb. 2002.
[2] S. E. Raasch and S. K. Reinhardt, "The Impact of Resource Partitioning on SMT Processors," Proc. 12th Int'l Conf. Parallel Architecture and Compilation Techniques, pp. 15-26, Sept. 2003.
[3] J. F. Martínez, "Dynamic Multicore Resource Management: A Machine Learning Approach," IEEE Micro, vol. 29, no. 5, Sept. 2009.
[4] A. Z. Jooya, A. Baniasadi, and M. Analou, "History-Aware, Resource-Based Dynamic Scheduling for Heterogeneous Multicore Processors," IET Computers & Digital Techniques, pp. 254-262, 2011.
[5] H. Wang, I. Koren, and C. Krishna, "An Adaptive Resource Partitioning Algorithm in SMT Processors," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 7, July 2011.
[6] H. Konsowa, E. M. Saad, and M. H. Awadalla, "New Fetch Policies for Multicore Processor Simulator to Support Dynamic Design in Fetch Stage," Journal of Computer Science and Engineering, vol. 12, no. 2, pp. 6-14.
[7] A. Cottrell, "Regression Analysis: Basic Concepts," [Online]. Available: Regression.pdf
[8] The Multi2Sim Simulation Framework. http://www.multi2sim.org.
[9] H. Konsowa, E. M. Saad, and M. H. Awadalla, "Updating Multicore Processor Simulator to Support Dynamic Design in Fetch Stage," 29th National Radio Science Conference (NRSC 2012), pp. 1-7.
[10] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, "Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor," Proc. 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, pp. 191-202, 1996.
[11] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Proc. 22nd Annual International Symposium on Computer Architecture, pp. 392-403, 1995.


Eng. Hanan Mohamed Gamal Konsowa is a PhD student at the Faculty of Engineering, Helwan University. She received the B.Sc. in Communication and Electronics Engineering from Ain Shams University, Cairo, Egypt, and the M.Sc. in Electronics and Communications Engineering from Helwan University, Egypt, in 2006. Her research interests include multicore design, soft computing and real-time systems. She works in the Information Technology department of the Coca-Cola Bottling Company.

Prof. Dr. Elsayed Mostafa Saad is a Professor of Electronic Circuits, Faculty of Engineering, Helwan University. He is author and/or co-author of 132 scientific papers. He is a member of the National Radio Science Committee, was a member of the scientific consultant committee in the Egyptian Eng. Syndicate for Electrical Engineers till May 1995, and is a member of the Egyptian Eng. Syndicate, the European Circuit Society (ECS) and the Society of Electrical Engineering (SEE). His research interests are electronic circuits and communication.

Dr. Medhat Awadalla is an Associate Professor in the Electrical and Computer Engineering Department, SQU. He obtained his PhD from the University of Cardiff, UK, and his M.Sc. and B.Sc. from Helwan University, Egypt. His research interests include multicore design and real-time systems.