
Linux Kernel Internals

Booting the Kernel


Jeremy Beker
2
Players in Kernel Booting
BIOS
Boot Loader
Kernel
Init
Startup scripts
Login
3
Big Picture

BIOS -> Boot Loader -> 16-bit kernel assembly -> 32-bit kernel assembly -> 32-bit kernel C -> 32-bit user mode
4
Hard Disk Layout

Master Boot Record: Boot Code, then partition table entries 1-4
Partition 1, Partition 2: Boot Sector, then Data
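
The layout above can be sketched in C. This is only an illustration and assumes the conventional PC MBR format: 446 bytes of boot code, four 16-byte partition entries starting at offset 446, and the 0x55 0xAA signature in the last two bytes of the sector. The helper names (mbr_is_bootable, mbr_partition_start_lba) are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

enum {
    MBR_SIZE         = 512,  /* one sector */
    MBR_TABLE_OFFSET = 446,  /* boot code occupies bytes 0..445 */
    MBR_ENTRY_SIZE   = 16,
    MBR_ENTRIES      = 4,
    MBR_SIG_OFFSET   = 510
};

/* Return 1 if the sector ends with the 0x55 0xAA boot signature. */
int mbr_is_bootable(const uint8_t sector[MBR_SIZE])
{
    return sector[MBR_SIG_OFFSET] == 0x55 &&
           sector[MBR_SIG_OFFSET + 1] == 0xAA;
}

/* Starting LBA of partition `index` (0..3); it is stored little-endian
 * at byte offset 8 inside the 16-byte partition entry. */
uint32_t mbr_partition_start_lba(const uint8_t sector[MBR_SIZE], int index)
{
    assert(index >= 0 && index < MBR_ENTRIES);
    const uint8_t *e = sector + MBR_TABLE_OFFSET + index * MBR_ENTRY_SIZE;
    return (uint32_t)e[8] | (uint32_t)e[9] << 8 |
           (uint32_t)e[10] << 16 | (uint32_t)e[11] << 24;
}
```

A caller would read the first 512 bytes of the disk into a buffer, check mbr_is_bootable(), and then look up each partition's first sector to find its boot sector.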
5
Power On
CPU registers and RAM contain random data
Voltage is applied to RESET pin on CPU
CPU registers are set to known good values
CPU executes code at physical address 0xfffffff0

(Excerpt from the Intel manual, chapter "Processor Management and Initialization")

Bit     Name  Meaning                                  Value after reset
31      PG    Paging disabled                          0
30      CD    Caching disabled                         1
29      NW    Not write-through disabled               1
28-19         Reserved
18      AM    Alignment check disabled                 0
17            Reserved
16      WP    Write-protect disabled                   0
15-6          Reserved
5       NE    External x87 FPU error reporting         0
4       ET    (Not used)                               1
3       TS    No task switch                           0
2       EM    x87 FPU instructions not trapped         0
1       MP    WAIT/FWAIT instructions not trapped      0
0       PE    Real-address mode                        0

Figure 9-1. Contents of CR0 Register after Reset
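
As a rough illustration of the figure, the reset state of CR0 can be written as a bit mask in C. The macro and function names below are made up for this sketch; the bit positions and reset values are the ones listed in the figure.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_PE (1u << 0)    /* protection enable; 0 means real-address mode */
#define CR0_ET (1u << 4)    /* reads as 1 after reset */
#define CR0_NW (1u << 29)   /* not write-through */
#define CR0_CD (1u << 30)   /* cache disable */
#define CR0_PG (1u << 31)   /* paging */

/* Only CD, NW and ET are set after reset. */
#define CR0_RESET_VALUE (CR0_CD | CR0_NW | CR0_ET)

int cr0_real_mode(uint32_t cr0)      { return (cr0 & CR0_PE) == 0; }
int cr0_paging_enabled(uint32_t cr0) { return (cr0 & CR0_PG) != 0; }
int cr0_cache_disabled(uint32_t cr0) { return (cr0 & CR0_CD) != 0; }

/* Sanity check of the reset state described by the figure. */
void cr0_reset_state_check(void)
{
    assert(CR0_RESET_VALUE == 0x60000010u);
    assert(cr0_real_mode(CR0_RESET_VALUE));
    assert(!cr0_paging_enabled(CR0_RESET_VALUE));
    assert(cr0_cache_disabled(CR0_RESET_VALUE));
}
```

Setting CR0_PE is what later moves the processor into protected mode; setting CR0_PG enables paging.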

9.1.3 Model and Stepping Information


Following a hardware reset, the EDX register contains component identification and revision
information (see Figure 9-2). For example, the model, family, and processor type returned for
the first processor in the Intel Pentium 4 family is as follows: model (0000B), family (1111B),
and processor type (00B).
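
A small sketch of how those fields could be pulled out of the EDX value, assuming the usual signature layout (stepping in bits 3:0, model in bits 7:4, family in bits 11:8, processor type in bits 13:12). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cpu_signature {
    unsigned stepping;  /* bits 3:0   */
    unsigned model;     /* bits 7:4   */
    unsigned family;    /* bits 11:8  */
    unsigned type;      /* bits 13:12 */
};

/* Split the EDX value left by a hardware reset into its fields. */
struct cpu_signature decode_reset_edx(uint32_t edx)
{
    struct cpu_signature s;
    s.stepping = edx & 0xf;
    s.model    = (edx >> 4) & 0xf;
    s.family   = (edx >> 8) & 0xf;
    s.type     = (edx >> 12) & 0x3;
    return s;
}

/* The Pentium 4 example from the text: family 1111B, model 0000B, type 00B. */
void decode_reset_edx_example(void)
{
    struct cpu_signature s = decode_reset_edx(0x00000f00u);
    assert(s.family == 0xf && s.model == 0 && s.type == 0);
}
```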
6
BIOS
Remember, we are an 8086
Hardware self tests (POST)
Initialize hardware devices
PCI IRQ distribution
Find a boot device
Load first sector of device at location 0x00007c00
Jumps to that address
7
Boot Loader
For Linux on i386, either lilo or grub is used
Multi-stage loaders (first stage can’t be larger than 512 bytes)
lilo is 100% assembler
grub is a combination of assembler and C
Boot loader is responsible for finding kernel image on disk and
loading it into memory
8
Assumptions about Code
Running on i386 or greater
Not SMP
Not NUMA
No PAE
9
Getting to protected mode
arch/i386/boot/setup.S
Responsible for using the BIOS and real-mode code to get the
system to the point where protected mode can be enabled
Get HD data
Check for PS/2 devices
Check for APM BIOS
Examines memory of computer
Prepares system for protected mode
10
ACPI RAM detection
Map stored at 0x2d0

meme820:
    xorl    %ebx, %ebx              # continuation counter
    movw    $E820MAP, %di           # point into the whitelist
                                    # so we can have the bios
                                    # directly write into it.

jmpe820:
    movl    $0x0000e820, %eax       # e820, upper word zeroed
    movl    $SMAP, %edx             # ascii 'SMAP'
    movl    $20, %ecx               # size of the e820rec
    pushw   %ds                     # data record.
    popw    %es
    int     $0x15                   # make the call
    jc      bail820                 # fall to e801 if it fails

    cmpl    $SMAP, %eax             # check the return is `SMAP'
    jne     bail820                 # fall to e801 if it fails

    # If this is usable memory, we save it by simply advancing %di by
    # sizeof(e820rec).
good820:
    movb    (E820NR), %al           # up to 32 entries
    cmpb    $E820MAX, %al
    jnl     bail820

    incb    (E820NR)
    movw    %di, %ax
    addw    $20, %ax
    movw    %ax, %di
again820:
    cmpl    $0, %ebx                # check to see if
    jne     jmpe820                 # %ebx is set to EOF
bail820:

Address Range Descriptor:

Offset in Bytes  Name             Description
0                BaseAddrLow      Low 32 bits of base address
4                BaseAddrHigh     High 32 bits of base address
8                LengthLow        Low 32 bits of length in bytes
12               LengthHigh       High 32 bits of length in bytes
16               Type             Address type of this range
20               Ext. Attributes  Extended attributes

Extended attributes:

Bit   Name                     Description
0     AddressRangeEnabled      If clear, ignore the Address Range Descriptor.
1     AddressRangeNonVolatile  If set, the Address Range Descriptor represents nonvolatile memory.
2-31  Reserved                 Reserved for future use.
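
The 20-byte record the BIOS writes can be mirrored by a C struct, and the collected map can then be summed. This is a minimal sketch based on the descriptor table above, not the kernel's own definition; the struct and function names are hypothetical, and it assumes that type 1 marks usable RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E820_TYPE_RAM 1u   /* assumption: type 1 = memory usable by the OS */

/* One 20-byte record, field for field as in the descriptor table. */
struct e820_record {
    uint32_t base_low;
    uint32_t base_high;
    uint32_t length_low;
    uint32_t length_high;
    uint32_t type;
};

/* Add up the length of every usable range in a map of n records. */
uint64_t e820_usable_bytes(const struct e820_record *map, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (map[i].type != E820_TYPE_RAM)
            continue;
        total += ((uint64_t)map[i].length_high << 32) | map[i].length_low;
    }
    return total;
}

/* The setup code requests exactly 20 bytes per record (movl $20, %ecx). */
void e820_record_size_check(void)
{
    assert(sizeof(struct e820_record) == 20);
}
```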
11
E801h RAM detection
Available Memory stored at 0x1e0

# method E801H:
# memory size is in 1k chunksizes, to avoid confusing loadlin.
# we store the 0xe801 memory size in a completely different place,
# because it will most likely be longer than 16 bits.
# (use 1e0 because that's what Larry Augustine uses in his
# alternative new memory detection scheme, and it's sensible
# to write everything into the same place.)

meme801:
    stc                             # fix to work around buggy
    xorw    %cx,%cx                 # BIOSes which dont clear/set
    xorw    %dx,%dx                 # carry on pass/error of
                                    # e801h memory size call
                                    # or merely pass cx,dx though
                                    # without changing them.
    movw    $0xe801, %ax
    int     $0x15
    jc      mem88

    cmpw    $0x0, %cx               # Kludge to handle BIOSes
    jne     e801usecxdx             # which report their extended
    cmpw    $0x0, %dx               # memory in AX/BX rather than
    jne     e801usecxdx             # CX/DX. The spec I have read
    movw    %ax, %cx                # seems to indicate AX/BX
    movw    %bx, %dx                # are more reasonable anyway...

e801usecxdx:
    andl    $0xffff, %edx           # clear sign extend
    shll    $6, %edx                # and go from 64k to 1k chunks
    movl    %edx, (0x1e0)           # store extended memory size
    andl    $0xffff, %ecx           # clear sign extend
    addl    %ecx, (0x1e0)           # and add lower memory into
                                    # total size.
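
The arithmetic in that routine can be restated as a short C function. It is only an illustration of the register handling shown above; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the int 0x15, ax=0xe801 results into one size in 1k units.
 * cx holds a count of 1k chunks, dx a count of 64k chunks. BIOSes that
 * leave cx and dx at zero are assumed to have answered in ax/bx instead.
 */
uint32_t e801_total_kb(uint16_t ax, uint16_t bx, uint16_t cx, uint16_t dx)
{
    if (cx == 0 && dx == 0) {   /* the AX/BX kludge */
        cx = ax;
        dx = bx;
    }
    /* shll $6 turns 64k chunks into 1k chunks, then the 1k count is added */
    return ((uint32_t)dx << 6) + cx;
}

/* Example: 15360 1k chunks plus 256 64k chunks gives 31744 KB. */
void e801_example(void)
{
    assert(e801_total_kb(0, 0, 0x3c00, 0x0100) == 31744u);
}
```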
12
E88h RAM detection
Available Memory stored at 0x2

# Ye Olde Traditional Methode. Returns the memory size (up to
# 16mb or 64mb, depending on the bios) in ax.
mem88:
    movb    $0x88, %ah
    int     $0x15
    movw    %ax, (2)
13
ACPI RAM detection
14
A20 is the devil
“This is at the very best an annoying procedure.”
For compatibility with the 8088, at boot time bit 20 of every
address is forced to 0
The A20 line must be enabled to allow all memory (even in real
mode) to be addressed.
First attempt: via int 0x15, AX=0x2401
Second attempt: via keyboard controller (original method)
Third attempt: write to I/O port 0x92
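
To see why the masked bit matters, here is a minimal C sketch of real-mode address formation with and without A20. The function name is hypothetical; the segment-times-16-plus-offset rule is standard real-mode addressing.

```c
#include <assert.h>
#include <stdint.h>

/* Physical address produced by a real-mode segment:offset pair.
 * With A20 disabled, bit 20 is forced to 0, so addresses just above
 * 1 MB wrap around to the bottom of memory as they did on the 8088. */
uint32_t real_mode_phys(uint16_t seg, uint16_t off, int a20_enabled)
{
    uint32_t addr = ((uint32_t)seg << 4) + off;
    if (!a20_enabled)
        addr &= ~(1u << 20);
    return addr;
}

/* ffff:0010 is the first byte above 1 MB, or address 0 when A20 is off. */
void a20_example(void)
{
    assert(real_mode_phys(0xffff, 0x0010, 1) == 0x100000u);
    assert(real_mode_phys(0xffff, 0x0010, 0) == 0x000000u);
}
```

Addresses below 1 MB, such as the boot sector at 07c0:0000, are unaffected by the state of A20.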
15
Protected mode: Ahhh......
Finally
Note that there is no explicit jmp at the end of the code.
The jmp instruction is created at run-time

# Well, that certainly wasn't fun :-(. Hopefully it works, and we don't
# need no steenking BIOS anyway (except for the initial loading :-).
# The BIOS-routine wants lots of unnecessary data, and it's less
# "interesting" anyway. This is how REAL programmers do it.
#
# Well, now's the time to actually move into protected mode. To make
# things as simple as possible, we do no register set-up or anything,
# we let the gnu-compiled 32-bit programs do that. We just jump to
# absolute address 0x1000 (or the loader supplied one),
# in 32-bit protected mode.
#
# Note that the short jump isn't strictly needed, although there are
# reasons why it might be a good idea. It won't hurt in any case.
    movw    $1, %ax                 # protected mode (PE) bit
    lmsw    %ax                     # This is it!
    jmp     flush_instr

flush_instr:
    xorw    %bx, %bx                # Flag to indicate a boot
    xorl    %esi, %esi              # Pointer to real-mode code
    movw    %cs, %si
    subw    $DELTA_INITSEG, %si
    shll    $4, %esi                # Convert to 32-bit pointer

# jump to startup_32 in arch/i386/boot/compressed/head.S
#
# NOTE: For high loaded big kernels we need a
#       jmpi 0x100000,__BOOT_CS
#
#       but we yet haven't reloaded the CS register, so the default size
#       of the target offset still is 16 bit.
#       However, using an operand prefix (0x66), the CPU will properly
#       take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
#       Manual, Mixing 16-bit and 32-bit code, page 16-6)

    .byte 0x66, 0xea                # prefix + jmpi-opcode
code32: .long   0x1000              # will be set to 0x100000
                                    # for big kernels
    .word   __BOOT_CS
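
The hand-assembled jump at the end of that listing is just eight bytes: the 0x66 prefix, the 0xea opcode, a 32-bit offset and a 16-bit selector. A C sketch of that encoding and of the %esi computation follows. The function names are hypothetical, and the selector and segment values used in the example are illustrative placeholders, not values taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Build the bytes emitted by ".byte 0x66, 0xea; .long offset; .word selector".
 * Offset and selector are stored little-endian. */
void encode_far_jump32(uint8_t out[8], uint32_t offset, uint16_t selector)
{
    out[0] = 0x66;                       /* operand-size prefix */
    out[1] = 0xea;                       /* far jump opcode */
    for (int i = 0; i < 4; i++)
        out[2 + i] = (uint8_t)(offset >> (8 * i));
    out[6] = (uint8_t)(selector & 0xff);
    out[7] = (uint8_t)(selector >> 8);
}

/* Pointer to the real-mode code, as computed into %esi:
 * (cs - DELTA_INITSEG) << 4. */
uint32_t realmode_pointer(uint16_t cs, uint16_t delta_initseg)
{
    assert(cs >= delta_initseg);
    return (uint32_t)(uint16_t)(cs - delta_initseg) << 4;
}
```

Because the offset sits in memory at the code32 label, earlier code can overwrite it (0x1000 or 0x100000) before the jump executes, which is what "created at run-time" refers to.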
16
startup_32: Take 1
Clears interrupts
Clears eflags
Decompress the kernel
Move the kernel to its final location at 0x00100000
Perform unconditional jump to 0x00100000
17
startup_32: Take 1
show some code
18
startup_32: Take 2
This is a different function with the same name. Confusing.
Enable Paging (and extended paging)
Reinitialize eflags
Load initial Global Descriptor Table
Initialize IDT: setup_idt
Calls start_kernel
19
Global Descriptor Table
Loaded immediately after we enter startup_32
Loads 4GB address windows for kernel and user mode

.quad 0x0000000000000000  /* NULL descriptor */
.quad 0x0000000000000000  /* 0x0b reserved */
.quad 0x0000000000000000  /* 0x13 reserved */
.quad 0x0000000000000000  /* 0x1b reserved */
.quad 0x0000000000000000  /* 0x20 unused */
.quad 0x0000000000000000  /* 0x28 unused */
.quad 0x0000000000000000  /* 0x33 TLS entry 1 */
.quad 0x0000000000000000  /* 0x3b TLS entry 2 */
.quad 0x0000000000000000  /* 0x43 TLS entry 3 */
.quad 0x0000000000000000  /* 0x4b reserved */
.quad 0x0000000000000000  /* 0x53 reserved */
.quad 0x0000000000000000  /* 0x5b reserved */

.quad 0x00cf9a000000ffff  /* 0x60 kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff  /* 0x68 kernel 4GB data at 0x00000000 */
.quad 0x00cffa000000ffff  /* 0x73 user 4GB code at 0x00000000 */
.quad 0x00cff2000000ffff  /* 0x7b user 4GB data at 0x00000000 */

.quad 0x0000000000000000  /* 0x80 TSS descriptor */
.quad 0x0000000000000000  /* 0x88 LDT descriptor */

/* Segments used for calling PnP BIOS */
.quad 0x00c09a0000000000  /* 0x90 32-bit code */
.quad 0x00809a0000000000  /* 0x98 16-bit code */
.quad 0x0080920000000000  /* 0xa0 16-bit data */
.quad 0x0080920000000000  /* 0xa8 16-bit data */
.quad 0x0080920000000000  /* 0xb0 16-bit data */
/*
 * The APM segments have byte granularity and their bases
 * and limits are set at run time.
 */
.quad 0x00409a0000000000  /* 0xb8 APM CS code */
.quad 0x00009a0000000000  /* 0xc0 APM CS 16 code (16 bit) */
.quad 0x0040920000000000  /* 0xc8 APM DS data */

.quad 0x0000000000000000  /* 0xd0 - unused */
.quad 0x0000000000000000  /* 0xd8 - unused */
.quad 0x0000000000000000  /* 0xe0 - unused */
.quad 0x0000000000000000  /* 0xe8 - unused */
.quad 0x0000000000000000  /* 0xf0 - unused */
.quad 0x0000000000000000  /* 0xf8 - GDT entry 31: double-fault TSS */
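
To make the descriptor values above less opaque, the sketch below decodes one 64-bit entry into base, size and privilege level. It assumes the standard x86 segment descriptor layout; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct segment_info {
    uint32_t base;     /* linear base address */
    uint64_t size;     /* segment size in bytes */
    unsigned dpl;      /* descriptor privilege level, 0..3 */
    int      present;  /* P bit */
    int      is_code;  /* code/data descriptor with the executable bit set */
};

/* Decode a 64-bit GDT entry as written with .quad. */
struct segment_info gdt_decode(uint64_t d)
{
    struct segment_info s;
    uint32_t limit  = (uint32_t)(d & 0xffff) | ((uint32_t)(d >> 48) & 0xf) << 16;
    uint8_t  access = (uint8_t)(d >> 40);
    int      gran4k = (int)((d >> 55) & 1);    /* G bit: limit counts 4k pages */

    s.base    = (uint32_t)((d >> 16) & 0xffffff) | ((uint32_t)(d >> 56) << 24);
    s.size    = ((uint64_t)limit + 1) << (gran4k ? 12 : 0);
    s.dpl     = (access >> 5) & 3;
    s.present = (access >> 7) & 1;
    s.is_code = (access & 0x18) == 0x18;
    return s;
}

/* The kernel code entry at 0x60 covers 4 GB from base 0 at privilege 0. */
void gdt_decode_example(void)
{
    struct segment_info s = gdt_decode(0x00cf9a000000ffffULL);
    assert(s.base == 0 && s.size == (1ULL << 32) && s.dpl == 0 && s.is_code);
}
```

Decoding the four kernel and user entries this way shows they differ only in the code/data type and in the privilege level (0 for kernel, 3 for user).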
20
setup_idt
Loop through all 256 interrupt vectors
Point them at ignore_int

/*
 * setup_idt
 *
 * sets up a idt with 256 entries pointing to
 * ignore_int, interrupt gates. It doesn't actually load
 * idt - that can be done only after paging has been enabled
 * and the kernel moved to PAGE_OFFSET. Interrupts
 * are enabled elsewhere, when we can be relatively
 * sure everything is ok.
 *
 * Warning: %esi is live across this function.
 */

setup_idt:

lea ignore_int,%edx

movl $(__KERNEL_CS << 16),%eax

movw %dx,%ax

/* selector = 0x0010 = cs */

movw $0x8E00,%dx
/* interrupt gate - dpl=0, present */


lea idt_table,%edi

mov $256,%ecx
rp_sidt:

movl %eax,(%edi)

movl %edx,4(%edi)

addl $8,%edi

dec %ecx

jne rp_sidt

ret
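
The register shuffling in setup_idt builds one 8-byte interrupt gate and copies it into every slot. A rough C equivalent is sketched below; the type and function names are hypothetical, and the handler address and selector passed in are whatever the caller supplies.

```c
#include <assert.h>
#include <stdint.h>

#define IDT_ENTRIES      256
#define GATE_INTR_DPL0_P 0x8E00u   /* interrupt gate, dpl=0, present */

struct idt_gate {
    uint32_t low;    /* selector << 16 | handler bits 15:0  (%eax) */
    uint32_t high;   /* handler bits 31:16 | type and flags  (%edx) */
};

struct idt_gate make_interrupt_gate(uint32_t handler, uint16_t selector)
{
    struct idt_gate g;
    g.low  = ((uint32_t)selector << 16) | (handler & 0xffffu);
    g.high = (handler & 0xffff0000u) | GATE_INTR_DPL0_P;
    return g;
}

/* Recover the handler address stored in a gate. */
uint32_t gate_handler(struct idt_gate g)
{
    return (g.high & 0xffff0000u) | (g.low & 0xffffu);
}

/* Point every vector at the same handler, like the rp_sidt loop. */
void idt_fill(struct idt_gate table[IDT_ENTRIES], uint32_t handler,
              uint16_t selector)
{
    struct idt_gate g = make_interrupt_gate(handler, selector);
    for (int i = 0; i < IDT_ENTRIES; i++)
        table[i] = g;
    assert(gate_handler(table[IDT_ENTRIES - 1]) == handler);
}
```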
21
ignore_int
Dummy function
Prints out error message
Placeholder until real interrupt handlers are ready.

/* This is the default interrupt "handler" :-) */
    ALIGN
ignore_int:
    cld
    pushl %eax
    pushl %ecx
    pushl %edx
    pushl %es
    pushl %ds
    movl $(__KERNEL_DS),%eax
    movl %eax,%ds
    movl %eax,%es
    pushl 16(%esp)
    pushl 24(%esp)
    pushl 32(%esp)
    pushl 40(%esp)
    pushl $int_msg
    call printk
    addl $(5*4),%esp
    popl %ds
    popl %es
    popl %edx
    popl %ecx
    popl %eax
    iret

/* ... */

int_msg:
    .asciz "Unknown interrupt or fault at EIP %p %p %p\n"
22
start_kernel
No more assembly, I promise
Orchestrates all initialization of kernel
Launches init process
23
start_kernel overview - 1
asmlinkage void __init start_kernel(void)
{

char * command_line;

extern struct kernel_param __start___param[], __stop___param[];
/*
* Interrupts are still disabled. Do necessary setups, then
* enable them
*/

lock_kernel();

page_address_init();

printk(linux_banner);

setup_arch(&command_line);

setup_per_cpu_areas();


/*

* Mark the boot cpu "online" so that it can call console drivers in

* printk() and can access its per-cpu storage.

*/

smp_prepare_boot_cpu();


/*

* Set up the scheduler prior starting any interrupts (such as the

* timer interrupt). Full topology setup happens at smp_init()

* time - but meanwhile we still have a functioning scheduler.

*/

sched_init();

/*

* Disable preemption - early bootup scheduling is extremely

* fragile until we cpu_idle() for the first time.

*/

preempt_disable();

build_all_zonelists();

page_alloc_init();

printk("Kernel command line: %s\n", saved_command_line);

parse_early_param();

parse_args("Booting kernel", command_line, __start___param,


__stop___param - __start___param,


&unknown_bootoption);

sort_main_extable();

trap_init();
24


start_kernel overview - 2
rcu_init();

init_IRQ();

pidhash_init();

init_timers();

softirq_init();

time_init();


console_init();

if (panic_later)


panic(panic_later, panic_param);

profile_init();

local_irq_enable();

vfs_caches_init_early();

mem_init();

kmem_cache_init();

numa_policy_init();

if (late_time_init)


late_time_init();

calibrate_delay();

pidmap_init();

pgtable_cache_init();

prio_tree_init();

anon_vma_init();


fork_init(num_physpages);

proc_caches_init();

buffer_init();

unnamed_dev_init();

security_init();

vfs_caches_init(num_physpages);

radix_tree_init();

signals_init();

/* rootfs populating might need page-writeback */

page_writeback_init();
#ifdef CONFIG_PROC_FS

proc_root_init();
#endif

check_bugs();


acpi_early_init(); /* before LAPIC and SMP init */
25
start_kernel overview - 3



/* Do the rest non-__init'ed, we're now alive */

rest_init();
}
26
Scheduler Initialization
Quick review
Two arrays, active and
expired processes
Each array contains a
list for each priority
27
Scheduler Initialization
Sets up active and expired arrays
Initializes priority arrays and lists

void __init sched_init(void)
{
    runqueue_t *rq;
    int i, j, k;

    for (i = 0; i < NR_CPUS; i++) {
        prio_array_t *array;

        rq = cpu_rq(i);
        spin_lock_init(&rq->lock);
        rq->active = rq->arrays;
        rq->expired = rq->arrays + 1;
        rq->best_expired_prio = MAX_PRIO;
        /* ... */
        atomic_set(&rq->nr_iowait, 0);

        for (j = 0; j < 2; j++) {
            array = rq->arrays + j;
            for (k = 0; k < MAX_PRIO; k++) {
                INIT_LIST_HEAD(array->queue + k);
                __clear_bit(k, array->bitmap);
            }
            // delimiter for bitsearch
            __set_bit(MAX_PRIO, array->bitmap);
        }
    }

    /* ... */

    /*
     * Make us the idle thread. Technically, schedule() should not be
     * called from this thread, however somewhere below it might be,
     * but because we are the idle thread, we just pick up running
     * again when this runqueue becomes "idle".
     */
    init_idle(current, smp_processor_id());
}
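
The role of the bitmap and its delimiter bit may be easier to see in a stripped-down model. The sketch below is an assumption-laden illustration, not kernel code: it replaces the per-priority list heads with simple counters, uses its own helper names, and assumes MAX_PRIO is 140.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PRIO      140   /* assumed value for this sketch */
#define BITS_PER_WORD 32
#define BITMAP_WORDS  ((MAX_PRIO + 1 + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct prio_array {
    unsigned nr_active;
    uint32_t bitmap[BITMAP_WORDS];
    unsigned queue_len[MAX_PRIO];   /* stand-in for the list heads */
};

static void set_bit_in(uint32_t *map, int bit)
{
    map[bit / BITS_PER_WORD] |= 1u << (bit % BITS_PER_WORD);
}

static int test_bit_in(const uint32_t *map, int bit)
{
    return (int)((map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1u);
}

/* Empty every queue, clear the bitmap, then set the delimiter bit. */
void prio_array_init(struct prio_array *a)
{
    memset(a, 0, sizeof(*a));
    set_bit_in(a->bitmap, MAX_PRIO);
}

void prio_array_enqueue(struct prio_array *a, int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO);
    a->queue_len[prio]++;
    a->nr_active++;
    set_bit_in(a->bitmap, prio);
}

/* Lowest set bit = best priority. The delimiter at MAX_PRIO guarantees
 * that the search stops even when every queue is empty. */
int prio_array_first(const struct prio_array *a)
{
    int prio = 0;
    while (!test_bit_in(a->bitmap, prio))
        prio++;
    return prio;
}
```

A result of MAX_PRIO from prio_array_first() means the array holds no runnable task.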
28
IDT Initialization
Replacing ignore_int with real interrupt and trap handlers
29
IDT Initialization
void __init trap_init(void)
{


set_trap_gate(0,&divide_error);

set_intr_gate(1,&debug);

set_intr_gate(2,&nmi);

set_system_intr_gate(3, &int3); /* int3-5 can be called from all */

set_system_gate(4,&overflow);

set_system_gate(5,&bounds);

set_trap_gate(6,&invalid_op);

set_trap_gate(7,&device_not_available);

set_task_gate(8,GDT_ENTRY_DOUBLEFAULT_TSS);

set_trap_gate(9,&coprocessor_segment_overrun);

set_trap_gate(10,&invalid_TSS);

set_trap_gate(11,&segment_not_present);

set_trap_gate(12,&stack_segment);

set_trap_gate(13,&general_protection);

set_intr_gate(14,&page_fault);

set_trap_gate(15,&spurious_interrupt_bug);

set_trap_gate(16,&coprocessor_error);

set_trap_gate(17,&alignment_check);

set_trap_gate(19,&simd_coprocessor_error);


set_system_gate(SYSCALL_VECTOR,&system_call);
}
30
Tasklet Initialization
void __init softirq_init(void)
{

open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);

open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
}

/* ... */

void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)


{

softirq_vec[nr].data = data;

softirq_vec[nr].action = action;
}
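
A toy model of that table may help show what registering a softirq buys you. Everything here beyond the open_softirq() shape shown above is an assumption for illustration: the pending mask, raise_softirq_sketch(), run_softirqs_sketch() and the sample action are hypothetical and much simpler than the kernel's per-CPU machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SOFTIRQS 32

struct softirq_action {
    void (*action)(struct softirq_action *);
    void *data;
};

static struct softirq_action softirq_vec[NR_SOFTIRQS];
static uint32_t pending_mask;   /* one bit per raised softirq */

/* Same shape as the function on the slide: record handler and data. */
void open_softirq(int nr, void (*action)(struct softirq_action *), void *data)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS);
    softirq_vec[nr].data = data;
    softirq_vec[nr].action = action;
}

/* Mark softirq nr as needing to run. */
void raise_softirq_sketch(int nr)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS);
    pending_mask |= 1u << nr;
}

/* Run every pending handler, lowest number first; return how many ran. */
int run_softirqs_sketch(void)
{
    int ran = 0;
    uint32_t pending = pending_mask;
    pending_mask = 0;
    for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
        if ((pending & (1u << nr)) && softirq_vec[nr].action != NULL) {
            softirq_vec[nr].action(&softirq_vec[nr]);
            ran++;
        }
    }
    return ran;
}

/* Sample action for experiments: increments the int that data points to. */
void count_action(struct softirq_action *a)
{
    (*(int *)a->data)++;
}
```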
31
Time Initialization
Starts with CMOS time (second resolution)
Sets nanosecond based on CPU frequency and cycles since power on.
Picks timer to use for system from list of available timers

void __init time_init(void)
{
    xtime.tv_sec = get_cmos_time();
    xtime.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
    set_normalized_timespec(&wall_to_monotonic,
        -xtime.tv_sec, -xtime.tv_nsec);

    cur_timer = select_timer();
    printk(KERN_INFO "Using %s for high-res timesource\n",
        cur_timer->name);

    time_init_hook();
}
/* -------- */
struct timer_opts* __init select_timer(void)
{
    int i = 0;

    /* find most preferred working timer */
    while (timers[i]) {
        if (timers[i]->init)
            if (timers[i]->init(clock_override) == 0)
                return timers[i]->opts;
        ++i;
    }

    panic("select_timer: Cannot find a suitable timer\n");
    return NULL;
}
32
Memory Initialization
33
Memory Initialization
void __init mem_init(void)
{

extern int ppro_with_ram_bug(void);

int codesize, reservedpages, datasize, initsize;

int tmp;

int bad_ppro;


bad_ppro = ppro_with_ram_bug();


set_max_mapnr_init();


high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);


/* this will put all low memory onto the freelists */

totalram_pages += __free_all_bootmem();


reservedpages = 0;

for (tmp = 0; tmp < max_low_pfn; tmp++)


/*


* Only count reserved RAM pages


*/


if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))



reservedpages++;


set_highmem_pages_init(bad_ppro);


codesize = (unsigned long) &_etext - (unsigned long) &_text;

datasize = (unsigned long) &_edata - (unsigned long) &_etext;

initsize = (unsigned long) &__init_end - (unsigned long) &__init_begin;


kclist_add(&kcore_mem, __va(0), max_low_pfn << PAGE_SHIFT);

kclist_add(&kcore_vmalloc, (void *)VMALLOC_START,


VMALLOC_END-VMALLOC_START);


printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init, %ldk highmem)\n",


(unsigned long) nr_free_pages() << (PAGE_SHIFT-10),num_physpages << (PAGE_SHIFT-10),
codesize >> 10,


reservedpages << (PAGE_SHIFT-10),datasize >> 10,initsize >> 10,(unsigned long) (totalhigh_pages << (PAGE_SHIFT-10)));


if (boot_cpu_data.wp_works_ok < 0)


test_wp_bit();
}
34
CPU speed is calculated
35
CPU speed is calculated
void __devinit calibrate_delay(void)
{
    unsigned long ticks, loopbit;
    int lps_precision = LPS_PREC;

    if (preset_lpj) {
        /* ... */
    } else {
        loops_per_jiffy = (1<<12);

        printk(KERN_DEBUG "Calibrating delay loop... ");
        while ((loops_per_jiffy <<= 1) != 0) {
            /* wait for "start of" clock tick */
            ticks = jiffies;
            while (ticks == jiffies)
                /* nothing */;
            /* Go .. */
            ticks = jiffies;
            __delay(loops_per_jiffy);
            ticks = jiffies - ticks;
            if (ticks)
                break;
        }

        /*
         * Do a binary approximation to get loops_per_jiffy set to
         * equal one clock (up to lps_precision bits)
         */
        loops_per_jiffy >>= 1;
        loopbit = loops_per_jiffy;
        while (lps_precision-- && (loopbit >>= 1)) {
            loops_per_jiffy |= loopbit;
            ticks = jiffies;
            while (ticks == jiffies)
                /* nothing */;
            ticks = jiffies;
            __delay(loops_per_jiffy);
            if (jiffies != ticks) /* longer than 1 tick */
                loops_per_jiffy &= ~loopbit;
        }

        /* Round the value and print it */
        printk("%lu.%02lu BogoMIPS (lpj=%lu)\n",
            loops_per_jiffy/(500000/HZ),
            (loops_per_jiffy/(5000/HZ)) % 100,
            loops_per_jiffy);
    }
}
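
The two-phase search in calibrate_delay() can be tried outside the kernel by swapping the real timing for a predicate. The sketch below is such a simulation, under stated assumptions: the callback type, the threshold predicate and all function names are hypothetical stand-ins for "did __delay(loops) last longer than one tick", and HZ is passed in as a parameter.

```c
#include <assert.h>

/* Returns nonzero if a delay of `loops` iterations spans a clock tick. */
typedef int (*spans_tick_fn)(unsigned long loops, void *ctx);

/* Coarse phase: double until a tick elapses. Fine phase: binary
 * approximation, trying one lower bit per step, `precision` times. */
unsigned long calibrate_sketch(spans_tick_fn spans_tick, void *ctx,
                               int precision)
{
    unsigned long lpj = 1UL << 12;
    unsigned long loopbit;

    while ((lpj <<= 1) != 0) {
        if (spans_tick(lpj, ctx))
            break;
    }

    lpj >>= 1;
    loopbit = lpj;
    while (precision-- && (loopbit >>= 1)) {
        lpj |= loopbit;
        if (spans_tick(lpj, ctx))   /* longer than 1 tick: drop the bit */
            lpj &= ~loopbit;
    }
    return lpj;
}

/* Sample predicate: the "true" loops-per-tick value is *(unsigned long *)ctx. */
int threshold_spans_tick(unsigned long loops, void *ctx)
{
    return loops >= *(unsigned long *)ctx;
}

/* BogoMIPS times 100, using the same arithmetic as the printk above. */
unsigned long bogomips_x100(unsigned long lpj, unsigned long hz)
{
    assert(hz > 0 && hz <= 5000);
    return lpj / (5000 / hz);
}
```

With a simulated true value of 1000000 loops per tick and 8 bits of precision, the search settles just below the true value, which is the intended behavior: the largest tested count that still fits inside one tick.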
36
Fork of init Process
static void noinline rest_init(void)

__releases(kernel_lock)
{

kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);

numa_default_policy();

unlock_kernel();

preempt_enable_no_resched();

cpu_idle();
}
37
Fork of init Process
static int init(void * unused)
{

lock_kernel();

/*

* Tell the world that we're going to be the grim

* reaper of innocent orphaned children.

*

* We don't want people to have to make incorrect

* assumptions about where in the task array this

* can be found.

*/

child_reaper = current;


/* ... */


/*

* Ok, we have completed the initial bootup, and

* we're essentially up and running. Get rid of the

* initmem segments and start the user-mode stuff..

*/

free_initmem();

unlock_kernel();

system_state = SYSTEM_RUNNING;

numa_default_policy();


if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)


printk("Warning: unable to open an initial console.\n");


(void) sys_dup(0);

(void) sys_dup(0);



if (execute_command)


run_init_process(execute_command);
38
Fork of init Process



/*

* We try each of these until one succeeds.

* The Bourne shell can be used instead of init if we are

* trying to recover a really broken machine.

*/


if (execute_command)


run_init_process(execute_command);


run_init_process("/sbin/init");

run_init_process("/etc/init");

run_init_process("/bin/init");

run_init_process("/bin/sh");


panic("No init found. Try passing init= option to kernel.");
}
39
Kernel is Done initializing
System is now up and running from kernel’s POV
However, the system is not up from user’s POV
init process is responsible for bringing up system to usability
Running startup scripts
Running login processes
Handling other system events (power failure, ctrl-alt-delete, etc.)
40
Startup Scripts - inittab
41
Startup Scripts - rc.sysinit
Set hostname
Load modules for hardware
Performs checks on filesystems if needed
Mount filesystems (real and nodev)
Enable swap
42
Startup Scripts
43
Login Prompt
44
Booting in the Future: EFI
Extensible Firmware Interface
Replaces legacy BIOS
Provides a new, robust interface between OS and firmware
Fully 32-bit protected mode operation
Clean abstraction layers
Disk I/O
Graphics

(Intel)
45
Legacy Boot
Current Kernel Initialization
Kernel Initialization

Processor in Real Mode upon transfer of control to kernel
Multi-stage boot loader required
Legacy BIOS calls employed to obtain system information
Kernel self-decompression

Boot flow:
BIOS -> Loader (1st stage loader, 2nd stage loader: Select Kernel, Load Kernel) -> Kernel (setup.S, video.S, startup_32, startup_32, start_kernel) -> /sbin/init

*Third party marks and brands are the property of their respective owner
(2004 Intel Developers Forum)


46
EFI Boot
Kernel Initialization
Simplifies the Kernel Boot Process
Initializes System in EFI Protected Mode
Boot Manager enables automatic kernel boot via ELILO invocation
Kernel loaded, boot parameters collected
Clear Handoff via ExitBootServices()
Control transferred directly to native mode kernel entry point

Boot flow:
EFI / Loader (EFI Boot Manager -> ELILO: Load Kernel, Collect Boot Parameters, Jump to kernel) -> Kernel (startup_32, start_kernel) -> /sbin/init
