
Adaptive Partition Scheduling Part 1: Why we did it

Cool stuff from QNX

A.Danko

January 24, 2012

Why?

Evolution of schedulers
Timeline
> Priority pre-emptive (SCHED_FIFO)
> Timeslicing (SCHED_RR)
> Time-varying priority
> SCHED_SPORADIC
> Really clever time-varying
> Fair Share scheduling
> Adaptive configuration

Yes, but:
> System locks up
> Backhoes and Mother's Day
> Untuneable for more than 1 application
> US Military Satcom
> Hard to manage share interactions
> Not invented until now


Why?

Evolution: Lessons learned


Numerical priorities are chosen by applications, but system scheduling behavior must be designed globally.

Degradation and overload: priorities are not constants. Importance of work depends on circumstances.
> Modes: normal operation, restart, emergency maintenance

Scheduling strategy needs to be based on the unit of work, but what we have is communicating threads.

Must measure real-time behavior.
> 0.1% accuracy

Want to specify shares as global percentages.
> Applications don't get to pick their importance or shares. System engineers do.

Need to throttle CPU usage without losing real-time latencies.



Design

What is Partitioning?
General Answer
Separation of work, to isolate:
> CPU usage
> memory usage
> system resource usage
> failures

QNX Answer
POSIX-compatible design which can be applied to existing systems with little or no recoding

Adaptive Partition Scheduling: a global hard real-time scheduler with overload protection and CPU guarantees
> Separation of work based on working for a common purpose

Runtime typed memory and kernel object guarantees and limits
> With full inheritance and accounting for all children

Persistent storage (file system) guarantees and limits

Process model for fault isolation

Dynamic configuration

Design

Principles
Scheduler must not trigger an overload
> Overhead may not increase with # of threads

[Figure: throughput versus offered load]

Real-time during underload
> Same behavior as today

Real-time during overload
> At least for interrupt handling

Must also be a fair-share scheduler
> Global scheduler algorithm
> Globally configured

Must mesh with current QNX architecture
> Preemptive priority, individual thread scheduling
> Heavy use of message passing

Easy to drop onto existing applications
> Can't be a bag on the side

[Insert picture of juggling watermelons here]

Simple enough for customers to use
> Engineerable
> Reconfigure on the fly


Design

Counting time
What does 14% CPU mean?
> CPU usage is calculated over a sliding window.

[Figure: sliding window from T = -100 ms to T = now]

Partition usage: nanoseconds of CPU time executed during the last sliding window, expressed as a percentage
Partition budget: guaranteed percentage of CPU time, balanced over the sliding window

Accuracy:
> Counting ticks is not enough.
> Micro-billing is used to track actual CPU utilization even when threads don't use their whole timeslice.
> Microsecond and nanosecond resolution
> Threads are billed based on real usage, not statistics.
> Trade off maximum READY-state latency against accuracy of CPU budgeting.
> 100 ms window -> 1% accuracy or better
> Internal arithmetic accurate to 0.5% or better

windowsize is configurable as an argument to the kernel at boot.
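To make "partition usage over a sliding window" concrete, here is a minimal sketch in Python. It is illustrative only and not the kernel's implementation: the class name `PartitionUsage`, its methods, and the choice to keep a queue of bills are all assumptions made for the example (the real scheduler is described later in the deck as history-less and division-free).

```python
from collections import deque


class PartitionUsage:
    """Hypothetical tracker of CPU time billed to one partition."""

    def __init__(self, window_ns=100_000_000):  # 100 ms window, as on the slide
        self.window_ns = window_ns
        self.bills = deque()  # (end_time_ns, run_ns), oldest first
        self.total_ns = 0

    def bill(self, end_time_ns, run_ns):
        """Micro-billing: record the actual run time, not whole ticks."""
        self.bills.append((end_time_ns, run_ns))
        self.total_ns += run_ns

    def usage_percent(self, now_ns):
        """CPU time billed inside the window ending at now_ns, as a percentage.

        Simplification: a bill is dropped as a whole once its end time leaves
        the window; a bill that straddles the window edge is counted in full.
        """
        cutoff = now_ns - self.window_ns
        while self.bills and self.bills[0][0] <= cutoff:
            _, run_ns = self.bills.popleft()
            self.total_ns -= run_ns
        return 100.0 * self.total_ns / self.window_ns
```

With this sketch, billing 4 ms and then 10 ms inside one 100 ms window yields a usage of 14%; once the 4 ms bill slides out of the window, usage drops to 10%.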


Design

Who's got time: Partition Inheritance

[Figure: receive threads in a File System Process take messages from threads in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); both partitions have CPU budget available.]

Resource manager threads work on behalf of the sender.
Priority and adaptive partition are inherited on receive.
> Execution time in the server is billed to the client's partition.

This allows proper accounting for shared resources.
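The inheritance rule can be sketched in a few lines of Python. Everything named here (`Thread`, `receive`, `run`, the "System" home partition, the thread names and the numbers) is hypothetical and exists only to illustrate the idea that server time is charged to the client's partition.

```python
class Thread:
    def __init__(self, name, priority, partition):
        self.name = name
        self.priority = priority
        self.partition = partition


def receive(server, client):
    """On message receive, the server thread inherits the sender's
    priority and adaptive partition."""
    server.priority = client.priority
    server.partition = client.partition


def run(thread, run_ns, billed_ns):
    """Bill run_ns to whichever partition the thread currently belongs to."""
    billed_ns[thread.partition] = billed_ns.get(thread.partition, 0) + run_ns


billed_ns = {}
client = Thread("player", 10, "Multi-media")
server = Thread("fs-worker", 9, "System")  # assumed home partition

receive(server, client)          # server now works on the client's behalf
run(server, 250_000, billed_ns)  # 250 us of file system work
```

After the receive, the 250 microseconds of server work appear under "Multi-media" in `billed_ns`, not under the server's own partition.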



Design

Real time: Behavior under normal load


[Figure: threads in the Blocked, Ready, and Running states for Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); both partitions have CPU budget available.]

Hard real-time scheduler under normal load
Running thread is selected as the highest-priority READY thread
No delay on scheduling if the adaptive partition has budget

Design

Out of time: Behavior under overload


[Figure: threads in the Blocked, Ready, and Running states; Adaptive Partition 1 (Multi-media) has CPU budget available, Adaptive Partition 2 (Java application) has exceeded its CPU budget.]

Highest-priority READY thread in a partition with budget runs
No delay on scheduling if the adaptive partition has budget


Design

Free Time: Behavior with unused CPU


[Figure: Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application) have both exceeded their CPU budgets; Adaptive Partition 3 has CPU budget available but no READY threads.]

If no partitions with remaining budget have READY threads, the highest-priority READY thread is selected to run from the other partitions.
This allows free time to be given based upon priority.
> Free time is still accounted and may have to be paid back (for example, if partition 3 becomes ready within 1 averaging window)
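The selection rules from the last three slides (normal load, overload, free time) can be written as a two-pass search. This is a sketch under assumptions: the dictionary layout, the function name, and the priorities in the usage note are invented for illustration, and ties between equal priorities are broken arbitrarily here (by name), which is not what the real scheduler does.

```python
def select_thread(partitions):
    """Return (priority, partition_name) of the thread to run, or None.

    partitions: list of dicts with keys
      'name'       -- partition name
      'has_budget' -- True if the partition still has CPU budget
      'ready'      -- priorities of its READY threads
    """
    def best(candidates):
        ready = [(max(p["ready"]), p["name"]) for p in candidates if p["ready"]]
        return max(ready) if ready else None

    # Pass 1: highest-priority READY thread among partitions with budget.
    choice = best([p for p in partitions if p["has_budget"]])
    if choice is None:
        # Pass 2 (free time): nobody with budget is READY, so fall back
        # to plain priority across all partitions.
        choice = best(partitions)
    return choice
```

For example, if Multi-media (READY priorities 6, 7, 10) and Java application (READY priorities 4, 9) are both over budget and Partition 3 has budget but nothing READY, the sketch picks the priority-10 thread from Multi-media. If instead Multi-media has budget with READY priorities 6 and 8 while an over-budget Java application has a READY priority-10 thread, the priority-8 thread runs.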

Design

Borrowed Time: Critical Threads


[Figure: Adaptive Partition 1 (Multi-media) has CPU budget available; Adaptive Partition 2 (Air Bag Control) has exceeded its CPU budget but contains a critical thread, which is shown running.]

Critical threads still run (based on priority) even if the partition has no budget.
Critical threads provide deterministic scheduling even in overload.
Critical threads are given a critical budget and can go into short-term debt.
> Critical time is accounted and has to be repaid.
> Exceeding the critical budget is considered an error and causes notification/action.
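A small sketch of critical-time accounting follows. The slides do not say how repayment or the notification works, so the `CriticalAccount` class, its `repay` method, and the use of a counter in place of a real notification are assumptions made purely for illustration.

```python
class CriticalAccount:
    """Hypothetical per-partition record of critical time in debt."""

    def __init__(self, critical_budget_ns):
        self.critical_budget_ns = critical_budget_ns
        self.critical_used_ns = 0
        self.over_budget_events = 0  # stands in for the notification/action

    def bill_critical(self, run_ns):
        """Bill time a critical thread ran while its partition had no
        normal budget. Returns False if the critical budget is exceeded."""
        self.critical_used_ns += run_ns
        if self.critical_used_ns > self.critical_budget_ns:
            self.over_budget_events += 1
            return False
        return True

    def repay(self, ns):
        """Assumed repayment model: the debt shrinks as time is paid back."""
        self.critical_used_ns = max(0, self.critical_used_ns - ns)
```

With a 1 ms critical budget, a first 0.6 ms burst is accepted, a second 0.6 ms burst pushes the partition past its critical budget and is flagged, and repaying 1.2 ms clears the debt.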

Design

Equal time.
How to choose between partitions of equal priority?
> Unimportant?
> Many threads run at default priority, therefore equal priority

Possible algorithms:
> Round robin
> Favor partition with most free time
> Favor longest waiter

Requirement:
> Minimize latencies during underload
> Would be nice: divide free time by % CPU share

Solution:
Interleave partitions by ratio of partition shares
We found a clever way to do that, so it's in the patent.


How it does it

[Diagram labels: uKernel; process creation; messaging; libmod_aps.a; per-partition ready queue; scheduler: clock interrupt handler, ready(), block(), select_thread()]

For all partitions p, define the merit function
    m(p) -> ( bud(p) || crit(p), prio(p), run_t/wsize/bud(p) )
Then schedule the partition ps, defined by
    ps -> rdy(ps) and ( m(ps) < m(pi) ) for all i != s
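The merit function above can be read as a lexicographic comparison of three components. The Python sketch below encodes one plausible reading, and the direction of each comparison is an assumption: partitions that have budget or a critical thread come first, then higher priority wins, then the partition that has used the smaller fraction of its budget wins. The field names are hypothetical, and `budget_fraction` is assumed to be nonzero.

```python
def merit(p):
    """Sort key: smaller tuples are scheduled first."""
    may_run = p["has_budget"] or p["critical"]           # bud(p) || crit(p)
    used = p["run_ns"] / p["window_ns"] / p["budget_fraction"]  # run_t/wsize/bud(p)
    return (not may_run, -p["prio"], used)


def schedule(partitions):
    """Pick the READY partition with the best merit, or None."""
    ready = [p for p in partitions if p["ready"]]
    return min(ready, key=merit)["name"] if ready else None
```

Because tuples compare element by element, the usage ratio only matters when the first two components tie, which matches the short-circuit calculation of the merit function mentioned on the overhead slide. Note that when no READY partition has budget, the first component ties for everyone and the choice falls back to priority, which is the free-time rule.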


Overhead: Fancy, but is it fast?

Scheduling overhead increases with:
> Number of partitions
> Number of messages/sec
> Number of clock interrupts/sec, i.e. ClockPeriod()
> *Does not increase with number of threads*

Free or almost free operations:
> Inheriting partition as part of message receive
> Joining a thread to a partition
> Dynamically changing budgets

Computational requirements
> 32-bit multiply, 64-bit add
> *No floating point*
> *No divides*
> *No address space swapping*
> *Short-circuit calculation of merit function*
> *No inter-CPU messaging on SMP*
> *History-less algorithm*

Overhead typically 1% of total CPU
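One standard way to compare two usage ratios with integer multiplies and no divides is cross-multiplication. The slides do not describe the kernel's actual arithmetic, so the sketch below only illustrates the general technique; the function name and parameters are hypothetical.

```python
def less_used(run_a_ns, budget_a_pct, run_b_ns, budget_b_pct):
    """True if partition A has used a smaller fraction of its budget than B.

    run_a / budget_a < run_b / budget_b is rewritten as
    run_a * budget_b < run_b * budget_a, which is valid for positive
    budgets and needs only integer multiplies. With run times below 2**32 ns
    and budgets of at most 100 percent, each product fits in 64 bits.
    """
    return run_a_ns * budget_b_pct < run_b_ns * budget_a_pct
```

For instance, a partition that ran 10 ms against a 25% budget has used less of its share than one that ran 30 ms against a 50% budget, so `less_used(10_000_000, 25, 30_000_000, 50)` is true.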



Any queries?
