You are on page 1of 45

Ftrace Tutorial

Steven Rostedt <srostedt@redhat.com>


(rostedt@goodmis.org)

Introduction

Kernel internal tracer


Derived from -rt patch Latency Tracer
Plugin tracers

ftrace : function tracer


irqsoff : interrupt disabled latency
wakeup : latency of highest priority task to
wake up
sched_switch: task context switches
(more)

Ring buffer
Saved traces (snap shots)

used to save maximum latency traces

The Debug File System

/sys/kernel/debug
I prefer:

mkdir /debug
mount -t debugfs nodev /debug

/etc/fstab

debugfs /sys/kernel/debug debugfs defaults 0 0


debugfs /debug debugfs defaults 0 0

/debug/tracing

available_tracers
current_tracer
tracing_enabled
trace
latency_trace
trace_pipe
iter_ctrl
tracing_max_latency
tracing_cpumask
trace_entries

Selecting a tracer

#cat/debug/tracing/available_tracers
wakeuppreemptirqsoffpreemptoffirqsoffftracesysprofsched_switchnone
#echowakeup>/debug/tracing/current_tracer
#cat/debug/tracing/current_tracer
wakeup

The none tracer

No tracer selected
none is special

it is not a tracer

echo none > /debug/tracing/current_tracer

Starting a trace

do not relay on tracing being enabled


echo 1 > /debug/tracing/tracing_enabled

note, make sure to have a space between the '1'


and the '>'. This has burnt many a kernel
programmer.
The enabled stays across tracers.

echo 1 > /debug/tracing/tracing_enabled


echo ftrace > /debug/tracing/current_tracer
echo irqsoff > /debug/tracing/current_tracer

Stopping a trace

echo 0 > /debug/tracing/tracing_enabled

do not forget that space!

Or in a program:
inttrace_fd;
[...]
intmain(intargc,char*argv[]){
[...]
trace_fd=open("/debug/tracing/tracing_enabled",O_WRONLY);
[...]
if(condition_hit()){
write(trace_fd,"0",1);
}
[...]
}

Reading the Output

latency_trace
trace
trace_pipe

Latency Trace Output


#tracer:irqsoff
#
irqsofflatencytracev1.1.5on2.6.26tip

latency:971us,#3/3,CPU#1|(M:preemptVP:0,KP:0,SP:0HP:0#P:2)

|task:swapper0(uid:0nice:20policy:0rt_prio:0)

=>startedat:acpi_os_acq uire_lock
=>endedat:cpuidle_idle_call
#_=>CPU#
#/_=>irqsoff
#|/_=>needresched
#||/_=>hardirq/softirq
#|||/_=>preemptdepth
#||||/
#|||||delay
#cmdpid|||||time|caller
#\/|||||\|/
<idle>01d..11us!:_spin_lock_irqsave(acpi_os_acq uire_lock)
<idle>01d..1971us:acpi_idle_enter_bm(cpuidle_idle_call)
<idle>01d..2972us:trace_hardirqs_on(cpuidle_idle_call)

Various outputs

<idle>01d.h21335164us:tick_sched_timer(__run_hrtimer)

<idle>00.Ns21386686us+:_spin_lock_irq(run_timer_softirq)

<idle>01d.H4138821
7us:ktime_get_ts(ktime_get)

bash34981....1576794us:rw_verify_area(vfs_write)

bash34981d..4120768us+:0:140:R+3096:120:Sgnometer minal

bash34981d..3120796us!:3498:120:S==>0:140:R<idle>

trace output

<idle>0[01]1977
.853298:read_hpet<getnstimeofday
<idle>0[01]1977
.853300:set_nor malized_timespec<ktime_get_ts
<idle>0[01]1977
.853301:_spin_lock<hrtimer_interrupt

bash3498[01]2057
.488856:0:140:R+3096:120:S
gnometer minal3096[00]2057
.488882:3096:120:S==>0:140:R

iter_ctrl

print-parent
sym-offset
sym-addr
verbose
raw
hex
binary
block
stacktrace
sched-tree

Using iter_ctrl

<idle>0[01]2975.463936:tick_program_event<tick_broadcast_oneshot_control
#echonoprintparent>/debug/tracing/iter_ctrl
<idle>0[01]2975.463936:tick_program_event
#echosymoffset>/debug/tracing/iter_ctrl
<idle>0[01]2975.463936:tick_program_event+0xb/0x6a
#echosymaddr>/debug/tracing/iter_ctrl
<idle>0[01]2975.463936:tick_program_event+0xb/0x6a<ffffffff81057c49>

The tracers

sched_switch
ftrace
wakeup
irqsoff
preemptoff
preemptirqsoff

Available Tracers?

#cat/debug/tracing/available_tracers
wakeuppreemptirqsoffpreemptoffirqsoffftracesysprofsched_switchnone

sched_switch

Traces task wakeups


Traces task context switches
bash3498[01]5459.824565:0:140:R+7971:120:R
<idle>0[00]5459.824836:0:140:R==>7971:120:R
bash3498[01]5459.824984:3498:120:S==>0:140:R
<idle>0[01]5459.825342:0:140:R==>7971:120:R
ls7971[00]5459.825380:7971:120:R+3:0:S
ls7971[00]5459.825384:7971:120:R==>3:0:R
migration/03[00]5459.825401:3:0:S==>0:140:R
ls7971[01]5459.825565:7971:120:R+598:115:S

stacktrace

iter_ctrl that effects the tracing itself

bash3498[01]6216.772637
:0:140:R+8495:120:R
bash3498[01]6216.772639:do_fork<=sys_clone<=ptregscall_common<=<=0<=0
<=0<=0
<idle>0[00]6216.773108:0:140:R==>8495:120:R
<idle>0[00]6216.773109:schedule<=cpu_idle<=rest_init<=<=0<=0<=0<=
0
bash3498[01]6216.773234:3498:120:S==>0:140:R
bash3498[01]6216.773235:schedule<=do_wait<=sys_wait4<=tracesys<=<=0<=
0<=0
ls8495[00]6216.773719:8495:120:R+3:0:S
ls8495[00]6216.773720:wake_up_process<=sched_exec<=do_execve<=sys_execve
<=stub_execve<=<=0<=0
ls8495[00]6216.773889:8495:120:R==>3:0:R

ftrace - function tracer

Traces at every non inline function


Other functions not traced

annotated with notrace


Makefile with CFLAGS_REMOVE_... = -pg

Must have /proc/sys/kernel/ftrace_enabled=1


Appears in most other tracers
Very verbose

init1[00]6710.079562:_spin_lock<tick_sched_timer
init1[00]6710.079563:do_timer<tick_sched_timer
bash3498[01]6710.079810:mutex_unlock<tracing_ctrl_write
bash3498[01]6710.079812:dnotify_parent<vfs_write
bash3498[01]6710.079813:_spin_lock<dnotify_parent
bash3498[01]6710.079814:_spin_unlock<dnotify_parent
bash3498[01]6710.079815:inotify_dentry_parent_q ueue_event<vfs_write
bash3498[01]6710.079815:inotify_inode_q ueue_event<vfs_write
bash3498[01]6710.07981
:syscall_trace_leave<int_very_careful
7

Latency Tracers

Stores the last maximum latency trace


wakeup : scheduling latency of RT tasks
irqsoff : interrupts off
preemptoff : preemption off
preemptirqsoff: interrupts and/or preemption
off
tracing_max_latency

wakeup - sched latency

Only traces RT tasks

use LatencyTop for non-RT tasks

Records and traces the maximum latency an


RT task took from wake up to schedule
Remember to reset tracing_max_latency
#echowakeup>/debug/tracing/current_tracer
#echo0>/debug/tracing/tracing_max_latency
#echo1>/debug/tracing/tracing_enabled
#chr tf10usleep10
#echo0>/debug/tracing/tracing_enabled

Wakeup without
function tracing
#tracer:wakeup
#
wakeuplatencytracev1.1.5on2.6.26tip

latency:9us,#2/2,CPU#1|(M:preemptVP:0,KP:0,SP:0HP:0#P:2)

|task:migration/17663(uid:0nice:5policy:1rt_prio:99)

#_=>CPU#
#/_=>irqsoff
#|/_=>needresched
#||/_=>hardirq/softirq
#|||/_=>preemptdepth
#||||/
#|||||delay
#cmdpid|||||time|caller
#\/|||||\|/
usleep102371d..22us+:try_to_wake_up(wake_up_process)
usleep102371d..39us:schedule(preempt_schedule)

With function tracing


#tracer:wakeup
#
wakeuplatencytracev1.1.5on2.6.26tip

latency:19us,#18/18,CPU#0|(M:preemptVP:0,KP:0,SP:0HP:0#P:2)

|task:usleep10133(uid:0nice:0policy:1rt_prio:10)

#_=>CPU#
#/_=>irqsoff
#|/_=>needresched
#||/_=>hardirq/softirq
#|||/_=>preemptdepth
#||||/
#|||||delay
#cmdpid|||||time|caller
#\/|||||\|/
<idle>00d.h41us:try_to_wake_up(wake_up_process)
<idle>00dNh42us:_spin_unlock_irqrestore(try_to_wake_up)
<idle>00dNh33us:_spin_lock(__run_hrtimer)
<idle>00dNh44us:_spin_unlock(hrtimer_interrupt)
<idle>00dNh35us:tick_program_event(hrtimer_interrupt)
[...]
<idle>00.N.214us:_spin_lock_irqsave(hrtick_set)
<idle>00dN.315us+:_spin_unlock_irqrestore(hrtick_set)
<idle>00dN.216us:_spin_lock(schedule)

irqsoff

local_irq_save(flags);
[...]
preempt_disabled();
[...]
local_irq_restore(flags);
[...]
preempt_enabled();

preemptoff

local_irq_save(flags);
[...]
preempt_disabled();
[...]
local_irq_restore(flags);
[...]
preempt_enabled();

preemptirqsoff

local_irq_save(flags);
[...]
preempt_disabled();
[...]
local_irq_restore(flags);
[...]
preempt_enabled();

trace_entries

Not enough data recorded


Too much data recorded
Run-time configurable
Must be done with none tracer or it will give
an -EBUSY
Number is number of entries, but the buffers
are allocate via pages.

If more entries can fit on a page that was


allocated to handle requested entries, the
remaining page will be filled with entries

Dynamic Ftrace
(the fun begins!)

Produces non-measurable overhead


Requires kernel thread ftraced to check for
more updates
Calls kstop_machine to execute text
modification

Not safe to modify code text in SMP environment

/debug/tracing/ftraced_enabled

How it works?

With the gcc profiler switch -pg


Every non-inline function calls mcount
00001adb<do_fork>:
1adb:55push%ebp
1adc:89e5mov%esp,%ebp
1ade:57push%edi
1adf:56push%esi
1ae0:53push%ebx
1ae1:83ec1csub$0x1c,%esp
1ae4:e8fcffffffcall1ae5<do_fork+0xa>
1ae5:R_386_PC32mcount
1ae9:89c3mov%eax,%ebx
1aeb:89c7mov%eax,%edi
1aed:81e300000002and$0x2000000,%ebx
1af3:89cemov%ecx,%esi

Non dynamic i368 mcount


ENTRY(mcount)
cmpl$ftrace_stub,ftrace_trace_function
jnztrace
.globlftrace_stub
ftrace_stub:
ret
/*takenfromglibc*/
trace:
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
movl0x4(%ebp),%edx
subl$MCOUNT_INSN_SIZE,%eax
call*ftrace_trace_function
popl%edx
popl%ecx
popl%eax
jmpftrace_stub
END(mcount)

Dynamic i386 mcount


ENTRY(mcount)
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
subl$MCOUNT_INSN_SIZE,%eax
.globlmcount_call
mcount_call:
callftrace_stub
popl%edx
popl%ecx
popl%eax
ret
END(mcount)

Call ftrace_record_ip
ENTRY(mcount)
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
subl$MCOUNT_INSN_SIZE,%eax
.globlmcount_call
mcount_call:
callftrace_record_ip
popl%edx
popl%ecx
popl%eax
ret
END(mcount)

ftrace_record_ip
HASH

do_fork

do_fork+0x9
ftrace_record_ip

ftraced

HASH

ftraced

List
do_fork+0x9

do_fork+0x9

kstop_machine
(modify code: nop)

nop

00001adb<do_fork>:
1adb:55push%ebp
1adc:89e5mov%esp,%ebp
1ade:57push%edi
1adf:56push%esi
1ae0:53push%ebx
1ae1:83ec1csub$0x1c,%esp
1ae4:908d742600nop
1ae9:89c3mov%eax,%ebx
1aeb:89c7mov%eax,%edi
1aed:81e300000002and$0x2000000,%ebx
1af3:89cemov%ecx,%esi

Starting of ftrace

00001adb<do_fork>:
1adb:55push%ebp
1adc:89e5mov%esp,%ebp
1ade:57push%edi
1adf:56push%esi
1ae0:53push%ebx
1ae1:83ec1csub$0x1c,%esp
1ae4:e8fcffffffcall1ae5
<do_fork+0xa>
1ae5:R_386_PC32ftrace_caller
1ae9:89c3mov%eax,%ebx
1aeb:89c7mov%eax,%edi
1aed:81e300000002and$0x2000000,%ebx
1af3:89cemov%ecx,%esi

ftrace_caller
ENTRY(ftrace_caller)
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
movl0x4(%ebp),%edx
subl$MCOUNT_INSN_SIZE,%eax
.globlftrace_call
ftrace_call:
callftrace_stub
popl%edx
popl%ecx
popl%eax
.globlftrace_stub
ftrace_stub:
ret
END(ftrace_caller)

Registering an ftrace caller


ENTRY(ftrace_caller)
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
movl0x4(%ebp),%edx
subl$MCOUNT_INSN_SIZE,%eax
.globlftrace_call
ftrace_call:
callfunction_trace_call
popl%edx
popl%ecx
popl%eax
.globlftrace_stub
ftrace_stub:
ret
END(ftrace_caller)

Selective function tracer

tracing is dynamically enabled


have a list of functions that need to be traced
Why not filter which functions we trace?

Picking what functions to trace

/debug/tracing/available_filter_functions
/debug/tracing/set_ftrace_filter
/debug/tracing/set_ftrace_notrace

available_filter_functions

$cat/debug/tracing/available_filter_functions|
head
filelock_init
__rcu_read_lock
kmem_cache_create
notifier_call_chain
down_write
__rcu_read_unlock
_spin_lock_irq
_spin_unlock_irq
_spin_lock
__kmalloc

set_ftrace_filter
#echosys_open>/debug/tracing/set_ftrace_filter
#echoftrace>/debug/tracing/current_tracer
#echo1>/debug/tracing/tracing_enabled
#ls>/dev/null
#echo0>/debug/tracing/tracing_enabled
#cat/debug/tracing/trace
#tracer:ftrace
#
#TASKPIDCPU#TIMESTAMPFUNCTION
#|||||
ls5652[00]2320.450897
:sys_open<tracesys
ls5652[01]2320.452238:sys_open<tracesys
ls5652[01]2320.452311:sys_open<tracesys
ls5652[01]2320.452436:sys_open<tracesys
ls5652[01]2320.452548:sys_open<tracesys
ls5652[01]2320.452651:sys_open<tracesys
ls5652[01]2320.452749:sys_open<tracesys
ls5652[01]2320.452854:sys_open<tracesys
ls5652[01]2320.452950:sys_open<tracesys
ls5652[01]2320.453551:sys_open<tracesys
ls5652[01]2320.453767
:sys_open<tracesys
ls5652[01]2320.453851:sys_open<tracesys
bash5495[00]2320.453977
:sys_open<tracesys
ls5652[01]2320.454030:sys_open<tracesys

set_ftrace_notrace

Modify like set_ftrace_filter


Acts like a notrace added to the function
The function will not be traced even if in the
set_ftrace_filter

set_ftrace_* wildcards

Prefix: echo 'sys_*' > /debug/tracing/set...


Postfix: echo '*lock' > /debug/tracing/set...
Included: echo '*device*' > /debug/tracing/set...
Anything else:

use grep on available_filter_functions

#grep'^selinux.*open$'/debug/tracing/available_filter_functions>\
/debug/tracing/set_ftrace_notrace

Todo:

ftrace dump on OOPS


change sleep interval of ftraced thread
use CPU clock (aka TSC) for interrupt and
preemption latency traces
option to force per CPU trace interleaving
integrity
printk like hooks (for debugging purposes
only)
Hooks for tuna to show in the oscilloscope

You might also like