
Process Scheduling

National Chung Cheng University
Institute of Computer Science and Information Engineering
Prof. 羅習五

1
Outline
• OS schedulers
• Unix scheduling
• Linux scheduling
• Linux 2.4 scheduler
• Linux 2.6 scheduler
– O(1) scheduler
– Completely Fair Scheduler (CFS)

2
5/31 ~

Mon Tue Wed Thu Fri Sat Sun
5/31 6/1 6/2 6/3 6/4 6/5 6/6
6/7 6/8 6/9 6/10 6/11 6/12 6/13
6/14 6/15 6/16 6/17 6/18 6/19 6/20
6/21 6/22 6/23 6/24 6/25 6/26 6/27
6/28 6/29 6/30 7/1 7/2 7/3 7/4
7/5 7/6 7/7 7/8 7/9 7/10 7/11

3
Introduction
preemptive & cooperative multitasking
• A multitasking operating system is one that
can simultaneously interleave execution of
more than one process.
• Multitasking operating systems come in two
flavors: cooperative multitasking and
preemptive multitasking.
– Linux provides preemptive multitasking
– Mac OS 9 and earlier are the most notable
examples of cooperative multitasking.

4
Linux scheduler –
Scheduling Policy
• Scheduling policy determines what runs when
– fast process response time (low latency)
– maximal system utilization (high throughput)
• Processes classification:
– I/O-bound processes: spend much of their time
submitting and waiting on I/O requests
– Processor-bound processes: spend much of their time
executing code
• Unix variants tend to favor I/O-bound processes,
thus providing good process response time
5
Linux scheduler –
Process Priority
• Linux’s priority-based scheduling
– Rank processes based on their worth and need for
processor time.
– Both the user and the system may set a process's priority
to influence the scheduling behavior of the system.
• Dynamic priority-based scheduling
– Begins with an initial base priority
– Then enables the scheduler to increase or decrease the
priority dynamically to fulfill scheduling objectives.
– E.g., a process that is spending more time waiting on I/O
will receive an elevated dynamic priority.

6
Linux scheduler –
Priority Ranges
• Two separate priority ranges.
– nice value, from -20 to +19 with a default of 0.
• Larger nice values correspond to a lower priority. (you
are being nice to the other processes on the system).
– real-time priority, by default range from 0 to 99.
• All real-time processes are at a higher priority than
normal processes.
• Linux implements real-time priorities in accordance
with POSIX standards on the matter.

7
2.4 scheduler
• Non-preemptible kernel
– Set p->need_resched if schedule() should be
invoked at the next opportunity (on the return
from kernel to user mode).
• Round-robin
– task_struct->counter: number of clock ticks left to
run in this scheduling slice, decremented by a
timer.

8
2.4 scheduler
1. Check if schedule() was invoked from an interrupt
handler (which would be a bug) and panic if so.
2. Use spin_lock_irq() to lock ‘runqueue_lock’
3. Check if a task is ‘runnable’
– in TASK_RUNNING state
– in TASK_INTERRUPTIBLE state and a signal is pending
4. Examine the ‘goodness’ of each process
5. Context switch

9
2.4 scheduler – ‘goodness’
• ‘goodness’: identifying the best candidate
among all processes in the runqueue list.
– ‘goodness’ = 0: the entity has exhausted its
quantum.
– 0 < ‘goodness’ < 1000: the entity is a conventional
process/thread that has not exhausted its
quantum; a higher value denotes a higher level of
goodness.

10
2.4 scheduler – ‘goodness’
if (p->mm == prev->mm)
	return p->counter + p->priority + 1;
else
	return p->counter + p->priority;

• A small bonus is given to the task p if it shares the
address space with the previous task, since no address-space
switch is then needed.

11
2.4 scheduler - SMP

[Figure: a single run queue shared by all CPUs]
12
2.4 scheduler - SMP
The scheduler examines the processor field of each process
and gives a fixed bonus (PROC_CHANGE_PENALTY, usually 15) to
the process that last ran on the current CPU (‘this_cpu’).
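The goodness computation on these slides can be sketched as a small
self-contained model (an assumption-laden simplification: the struct
below mirrors only the counter, priority, processor, and mm fields
mentioned on the slides, not the real task_struct):

```c
#define PROC_CHANGE_PENALTY 15  /* affinity bonus, as on the slide */

/* Simplified stand-ins for the kernel structures. */
struct task {
    int counter;    /* remaining ticks in this slice */
    int priority;   /* static priority */
    int processor;  /* CPU the task last ran on */
    void *mm;       /* address space */
};

/* Sketch of the 2.4 'goodness' value for a conventional task:
 * base value plus an affinity bonus and an mm-sharing bonus. */
static int goodness(const struct task *p, const struct task *prev,
                    int this_cpu)
{
    int weight;

    if (p->counter == 0)
        return 0;                       /* quantum exhausted */
    weight = p->counter + p->priority;
    if (p->processor == this_cpu)
        weight += PROC_CHANGE_PENALTY;  /* cache-affinity bonus */
    if (p->mm == prev->mm)
        weight += 1;                    /* shares address space */
    return weight;
}
```

The scheduler evaluates this value for every runnable task and picks
the maximum, which is why a full runqueue scan is needed per decision.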

13
2.4 scheduler - performance
• The algorithm does not scale well
– It is inefficient to re-compute all dynamic priorities at
once.
• The predefined quantum is too large for high
system loads (for example: a server)
• I/O-bound process boosting strategy is not
optimal
– a good strategy to ensure a short response time for
interactive programs, but…
– some batch programs with almost no user interaction
are I/O-bound.

14
Recalculating Timeslices
(kernel 2.4)

• Problems:
– Can take a long time. Worse, it scales O(n) for n tasks
on the system.
– Recalculation must occur under some sort of lock
protecting the task list and the individual process
descriptors. This results in high lock contention.
– The nondeterministic duration of the recalculation is a
problem for deterministic real-time programs.
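The recalculation the slide criticizes can be modeled with the classic
2.4-style loop (a sketch; the real kernel walks every task in the
system under the tasklist lock, which is exactly the O(n) cost and
lock contention described above):

```c
struct task {
    int counter;   /* remaining ticks */
    int priority;  /* static priority, used as the base quantum */
};

/* Sketch of the 2.4-style global recalculation: once every runnable
 * task has used up its slice, each task in the system gets
 * counter = (counter >> 1) + priority.  Tasks that were sleeping keep
 * half of their old counter as an I/O bonus; the walk over all n
 * tasks is what makes this O(n). */
static void recalc_timeslices(struct task tasks[], int n)
{
    for (int i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}
```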

15
2.6 scheduler

[Figure: one run queue per CPU; load imbalances are resolved by
task migration (put + pull) between run queues]
16
2.6 scheduler –
User Preemption
• User preemption can occur
– When returning to user-space from a system call
– When returning to user-space from an interrupt
handler

17
2.6 scheduler –
Kernel Preemption
• The Linux kernel is a fully preemptive kernel.
– It is possible to preempt a task at any point, so long as the
kernel is in a state in which it is safe to reschedule.
– “safe to reschedule”: kernel does not hold a lock
• The Linux design:
– the addition of a preemption counter, preempt_count, to each
process's thread_info
– This count increments once for each lock that is acquired and
decrements once for each lock that is released; preemption is
safe only when preempt_count is zero
• Kernel preemption can also occur explicitly, when a task in
the kernel blocks or explicitly calls schedule().
– no additional logic is required to ensure that the kernel is in a
state that is safe to preempt!
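A minimal model of the preempt_count rule described above (assumption:
a single task and no interrupt nesting, which the real counter also
tracks):

```c
/* Per-task preemption counter, as described on the slide.  In the
 * kernel it lives in thread_info; here a single global stands in. */
static int preempt_count;

static void preempt_disable(void) { preempt_count++; }  /* e.g. lock acquired */
static void preempt_enable(void)  { preempt_count--; }  /* e.g. lock released */

/* The kernel may preempt the current task only when no locks are
 * held, i.e. when the counter has returned to zero. */
static int safe_to_preempt(void)  { return preempt_count == 0; }
```

Nesting works for free: two acquired locks mean two increments, and
preemption stays unsafe until both are released.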

18
Kernel Preemption
• Kernel preemption can occur
– When an interrupt handler exits, before returning
to kernel-space
– When kernel code becomes preemptible again
– If a task in the kernel explicitly calls schedule()
– If a task in the kernel blocks (which results in a call
to schedule())

19
O(1) & CFS scheduler
• 2.5 ~ 2.6.22: O(1) scheduler
– Time complexity: O(1)
– The ready queue is realized as a “run queue”
holding two arrays (an active Q and an expired Q)
• 2.6.23~present: Completely Fair Scheduler
(CFS)
– Time complexity: O(log n)
– the ready queue is implemented as a red-black
tree

20
O(1) scheduler
• Implement fully O(1) scheduling.
– Every algorithm in the new scheduler completes in constant-time, regardless of the
number of running processes. (Since the 2.5 kernel).
• Implement perfect SMP scalability.
– Each processor has its own locking and individual runqueue.
• Implement improved SMP affinity.
– Attempt to group tasks to a specific CPU and continue to run them there.
– Only migrate tasks from one CPU to another to resolve imbalances in runqueue sizes.
• Provide good interactive performance.
– Even during considerable system load, the system should react and schedule interactive
tasks immediately.
• Provide fairness.
– No process should find itself starved of timeslice for any reasonable amount of time.
Likewise, no process should receive an unfairly high amount of timeslice.
• Optimize for the common case of only one or two runnable processes, yet
scale well to multiple processors, each with many processes.

21
The Priority Arrays
• Each runqueue contains two priority arrays (defined in
kernel/sched.c as struct prio_array)
– Active array: all tasks with timeslice left.
– Expired array: all tasks that have exhausted their timeslice.
• Priority arrays provide O(1) scheduling.
– Each priority array contains one queue of runnable processes per
priority level.
– The priority arrays also contain a priority bitmap used to efficiently
discover the highest-priority runnable task in the system.
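The priority arrays and their bitmap can be sketched as follows (a toy
model: per-level counters stand in for the kernel's per-priority task
lists, and the GCC builtin `__builtin_ctzl` stands in for
sched_find_first_bit()):

```c
#define MAX_PRIO 140   /* priority levels 0-139, as on the slides */
#define NWORDS ((MAX_PRIO + 63) / 64)

/* Toy prio_array: a task count per priority level plus a bitmap of
 * non-empty levels. */
struct prio_array {
    unsigned char nr_tasks[MAX_PRIO];
    unsigned long bitmap[NWORDS];
};

static void enqueue(struct prio_array *a, int prio)
{
    a->nr_tasks[prio]++;
    a->bitmap[prio / 64] |= 1UL << (prio % 64);
}

static void dequeue(struct prio_array *a, int prio)
{
    if (--a->nr_tasks[prio] == 0)
        a->bitmap[prio / 64] &= ~(1UL << (prio % 64));
}

/* Find the highest-priority (lowest-numbered) non-empty level by
 * scanning a fixed-size bitmap: constant time regardless of how many
 * tasks are queued. */
static int first_prio(const struct prio_array *a)
{
    for (int w = 0; w < NWORDS; w++)
        if (a->bitmap[w])
            return w * 64 + __builtin_ctzl(a->bitmap[w]);
    return -1;  /* no runnable task */
}
```

All three operations touch a bounded number of words, which is the
whole point of the O(1) design.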

22
The Linux O(1) scheduler algorithm

23
 Each runqueue contains two priority arrays – active and expired.
 Each of these priority arrays contains a list of tasks indexed
according to priority (0-139).

[Figure: a runqueue with its active and expired priority arrays,
each holding priority queues 0-139]

25
 Linux assigns higher-priority tasks longer time-slices:
time quantum ≈ 1/priority (a smaller priority number
means a higher priority).

[Figure: tsk1, tsk2, and tsk3 queued in the active array of the
runqueue]

26
 Linux chooses the task with the highest priority from the
active array for execution.

[Figure: tsk1 is selected from the active array; tsk2 and tsk3 wait]

27
[Figure: tasks of equal priority (tsk2, tsk3) are scheduled
round-robin within their priority queue; tsk1 runs at a higher
priority]

28
 Most tasks have dynamic priorities based on their “nice”
value (static priority) plus or minus 5:

dynPrio = staticPrio + bonus,  bonus = -5 ~ +5

 The bonus reflects how interactive a task is: a task that
spends more time sleeping on I/O receives a larger priority
boost.

[Figure: the I/O-bound tsk3 is moved up to a higher-priority queue
in the active array]
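A sketch of the dynamic-priority computation above (the mapping from
sleep percentage to bonus is an illustrative assumption; the real
scheduler derives the bonus from a sleep_avg estimate, and 100-139 is
the normal-task portion of the 0-139 priority scale):

```c
/* bonus in [-5, +5]: a task that sleeps 100% of the time gets -5
 * (a boost, since lower numbers mean higher priority); a task that
 * never sleeps gets +5 (a penalty). */
static int bonus(int sleep_pct)
{
    return 5 - sleep_pct / 10;
}

/* dynPrio = staticPrio + bonus, clamped to the normal-task range. */
static int dyn_prio(int static_prio, int sleep_pct)
{
    int p = static_prio + bonus(sleep_pct);
    if (p < 100) p = 100;
    if (p > 139) p = 139;
    return p;
}
```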

31
 When all tasks have exhausted their time slices, the two
priority arrays (active and expired) are simply exchanged!

[Figure: the active and expired array pointers are swapped]

32
The O(1) scheduling algorithm

sched_find_first_bit()

[Figure: a bitmap with one bit per priority level; set bits mark the
non-empty queues holding tsk1, tsk2, and tsk3, and the first set bit
gives the highest-priority runnable task]

33
The O(1) scheduling algorithm

• Insert: O(1)
• Remove: O(1)
• Find first set bit: O(1)

34
find first set bit O(1)

static inline unsigned long __ffs(unsigned long word)
{
	int num = 0;

#if BITS_PER_LONG == 64
	if ((word & 0xffffffff) == 0) {
		num += 32;
		word >>= 32;
	}
#endif
	if ((word & 0xffff) == 0) {
		num += 16;
		word >>= 16;
	}
	if ((word & 0xff) == 0) {
		num += 8;
		word >>= 8;
	}
	if ((word & 0xf) == 0) {
		num += 4;
		word >>= 4;
	}
	if ((word & 0x3) == 0) {
		num += 2;
		word >>= 2;
	}
	if ((word & 0x1) == 0)
		num += 1;
	return num;
}
35
2.6 scheduler –
CFS
• Classical schedulers compute time slices for
each process in the system and allow them to
run until their time slice/quantum is used up.
– After that, the time slices of all processes need to be recalculated.
• CFS considers only the wait time of a process
– The task with the most need for CPU time is
scheduled.

36
2.6 scheduler –
CFS

Fairness ≅ 1 / Σ(waiting time)

37
2.6 scheduler –
CFS (motivation)
• Traditional Unix scheduling policy

[Figure: high-priority (HP) tasks run ahead of low-priority (LP)
tasks]
38
2.6 scheduler –
CFS (motivation)
• Traditional Unix scheduling policy

[Figure: waiting time (WT) accumulates for each task while other
tasks run]
39
2.6 scheduler –
CFS (the ideal case)

[Figure: six tasks share one CPU; through virtualization, each task
appears to run on its own CPU, with high-priority tasks getting
faster virtual CPUs than low-priority tasks]

40
2.6 scheduler –
CFS (the ideal case)

[Figure: six tasks with weights 8 : 8 : 8 : 3 : 3 : 3; each task's
virtual CPU speed is proportional to its weight]
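The 8 : 8 : 8 : 3 : 3 : 3 example can be checked numerically: each
task's ideal share of the CPU is its weight divided by the total
weight (33 here). A minimal helper:

```c
/* Proportional sharing, as in the 8:8:8:3:3:3 example: task i's
 * ideal share of the CPU is weight[i] / sum(weights).  Returned in
 * tenths of a percent to stay in integer arithmetic. */
static int share_permille(const int w[], int n, int i)
{
    int total = 0;
    for (int k = 0; k < n; k++)
        total += w[k];
    return w[i] * 1000 / total;
}
```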

42
2.6 scheduler –
CFS (the ideal case)

[Figure: timeline of the six tasks (weights 8 8 8 3 3 3), each
receiving CPU time in proportion to its weight]

43
2.6 scheduler –
CFS (the implementation)

[Figure: animation of CFS on a real CPU, which repeatedly runs, for
one step at a time, whichever of the six tasks (weights 8 8 8 3 3 3)
has so far received the least CPU time relative to its weight]

48
2.6 scheduler –
CFS

[Figure: the runnable tasks organized in a red-black tree; the
leftmost task runs next]
65
2.6 scheduler –
the RB-Tree
• To sort tasks on the red-black tree, the kernel
uses the difference fair_clock - wait_runtime.
– fair_clock is a measure of the CPU time a
task would have received if scheduling were
completely fair,
– wait_runtime is a direct measure of the
unfairness caused by the imperfection of real
systems.
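The ordering rule above can be sketched without the tree itself
(assumption: a linear scan replaces the O(log n) red-black tree
lookup; the task with the smallest fair_clock - wait_runtime key,
i.e. the leftmost tree node, runs next):

```c
/* Minimal model of the CFS pick-next rule: entities are ordered by
 * the key fair_clock - wait_runtime, and the one with the smallest
 * key runs next.  A task owed a lot of time (large wait_runtime)
 * therefore sorts first. */
struct entity {
    long fair_clock;    /* CPU time due under perfectly fair scheduling */
    long wait_runtime;  /* accumulated unfairness */
};

static int pick_next(const struct entity e[], int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (e[i].fair_clock - e[i].wait_runtime <
            e[best].fair_clock - e[best].wait_runtime)
            best = i;
    return best;
}
```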

66
2.6 scheduler –
issues
• Different priority levels for tasks (i.e., nice
values) must be taken into account.
• Tasks must not be switched too often, because a
context switch has a certain overhead.

67
2.6 scheduler –
fields in the task_struct

68
2.6 scheduler –
fields in the task_struct
• prio and normal_prio indicate the dynamic
priorities, static_prio the static priority of a
process.
– The static priority is the priority assigned to the
process when it was started.
– normal_prio denotes a priority that is
computed based on the static priority and the
scheduling policy of the process.

69
2.6 scheduler –
fields in the task_struct

• The scheduler is not limited to scheduling
processes, but can also work with larger
entities. This allows for implementing group
scheduling.

70
2.6 scheduler –
fields in the task_struct

• cpus_allowed is a bit field used on
multiprocessor systems to restrict the CPUs on
which a process may run.
– sched_setaffinity()
– sched_getaffinity()

71
2.6 scheduler –
priority

72
2.6 scheduler –
priority
kernel/sched.c
static const int prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};

prio_to_weight[i] ≈ prio_to_weight[i+1] × 1.25
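The weight table feeds the virtual-runtime accounting: a nice-0 task
(weight 1024) accrues virtual time at wall-clock speed, while heavier
tasks accrue it more slowly and therefore receive proportionally more
CPU. A sketch (the real kernel avoids the division with a precomputed
inverse-weight table):

```c
#define NICE_0_WEIGHT 1024  /* prio_to_weight entry for nice 0 */

/* Virtual runtime advances as delta_exec * NICE_0_WEIGHT / weight,
 * so a heavier (lower-nice) task ages more slowly in virtual time
 * and keeps getting picked for more CPU. */
static long vruntime_delta(long delta_exec, int weight)
{
    return delta_exec * NICE_0_WEIGHT / weight;
}
```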

73
2.6 scheduler –
priority

74
Summary
• The concept of OS schedulers
• Maximize throughput.
– This is what system administrators care about.
– How to maximize throughput (CPU & I/O).
• What is the major drawback of Linux 2.4
scheduler
• Pros and cons of Linux 2.6 schedulers
– O(1)
– CFS
75