[Figure: disk layout. Master Boot Record (boot code, entries 1-4), followed by Partition 1 and Partition 2, each with a boot sector and data.]
Power On
CPU registers and RAM contain random data
Voltage is applied to RESET
PROCESSOR MANAGEMENT AND INITIALIZATION
Paging (CR0.PG): 0 (disabled)
Cache disable (CR0.CD): 1
Not write-through (CR0.NW): 1
        incb    (E820NR)
        movw    %di, %ax
        addw    $20, %ax
        movw    %ax, %di
again820:
        cmpl    $0, %ebx                # check to see if
        jne     jmpe820                 # %ebx is set to EOF
bail820:

Bit     Name                      Description
0       AddressRangeEnabled       If clear, ignore the Address Range Descriptor.
1       AddressRangeNonVolatile   If set, the Address Range Descriptor represents nonvolatile memory.
2-31    Reserved                  Reserved for future use.
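The loop above advances %di by 20 bytes per entry, which matches the classic E820 record of a 64-bit base, a 64-bit length and a 32-bit type. A minimal C sketch of how such a map could be summed follows. The struct name, the helper e820_usable_bytes and the sample map are illustrative assumptions, not the kernel's own definitions, and the packed attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

#define E820_RAM      1  /* usable memory */
#define E820_RESERVED 2  /* reserved, not available to the OS */

/* One 20-byte BIOS map entry; packed so no padding is added (GCC/Clang). */
struct e820_entry {
    uint64_t addr;   /* start of the range */
    uint64_t size;   /* length of the range in bytes */
    uint32_t type;   /* E820_RAM, E820_RESERVED, ... */
} __attribute__((packed));

/* Sum the sizes of all entries marked as usable RAM. */
uint64_t e820_usable_bytes(const struct e820_entry *map, size_t count)
{
    uint64_t total = 0;

    for (size_t i = 0; i < count; i++)
        if (map[i].type == E820_RAM)
            total += map[i].size;
    return total;
}

/* Hypothetical three-entry map, for illustration only. */
static const struct e820_entry sample_map[] = {
    { 0x00000000, 0x0009fc00, E820_RAM },
    { 0x0009fc00, 0x00000400, E820_RESERVED },
    { 0x00100000, 0x07f00000, E820_RAM },
};
```

Calling e820_usable_bytes(sample_map, 3) counts only the two RAM ranges and skips the reserved one.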
E801h RAM detection
Available Memory stored at 0x1e0
# method E801H:
# memory size is in 1k chunksizes, to avoid confusing loadlin.
# we store the 0xe801 memory size in a completely different place,
# because it will most likely be longer than 16 bits.
# (use 1e0 because that's what Larry Augustine uses in his
# alternative new memory detection scheme, and it's sensible
# to write everything into the same place.)
meme801:
        stc                             # fix to work around buggy
        xorw    %cx,%cx                 # BIOSes which dont clear/set
        xorw    %dx,%dx                 # carry on pass/error of
                                        # e801h memory size call
                                        # or merely pass cx,dx though
                                        # without changing them.
        movw    $0xe801, %ax
        int     $0x15
        jc      mem88

        cmpw    $0x0, %cx               # Kludge to handle BIOSes
        jne     e801usecxdx             # which report their extended
        cmpw    $0x0, %dx               # memory in AX/BX rather than
        jne     e801usecxdx             # CX/DX. The spec I have read
        movw    %ax, %cx                # seems to indicate AX/BX
        movw    %bx, %dx                # are more reasonable anyway...

e801usecxdx:
        andl    $0xffff, %edx           # clear sign extend
        shll    $6, %edx                # and go from 64k to 1k chunks
        movl    %edx, (0x1e0)           # store extended memory size
        andl    $0xffff, %ecx           # clear sign extend
        addl    %ecx, (0x1e0)           # and add lower memory into
                                        # total size.
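The arithmetic at e801usecxdx can be restated in C as a small sketch. CX holds a count of 1 KB units and DX a count of 64 KB units, so shifting DX left by 6 converts it to 1 KB units before the two are added. The function name e801_total_kb is an assumption made for this example.

```c
#include <stdint.h>

/*
 * cx: memory reported in 1 KB units
 * dx: memory reported in 64 KB units
 * Returns the combined size in 1 KB units, as stored at 0x1e0.
 */
uint32_t e801_total_kb(uint16_t cx, uint16_t dx)
{
    uint32_t high_kb = (uint32_t)dx << 6;  /* 64 KB chunks -> 1 KB chunks */

    return high_kb + cx;
}
```

The result needs more than 16 bits as soon as DX exceeds 1023, which is why the code stores it as a 32-bit value.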
88h RAM detection
Available Memory stored at 0x2
# Ye Olde Traditional Methode. Returns the memory size (up to
# 16mb or 64mb, depending on the bios) in ax.
mem88:
        movb    $0x88, %ah
        int     $0x15
        movw    %ax, (2)
ACPI RAM detection
A20 is the devil
“This is at the very best an annoying procedure.”
For compatibility with the 8088, bit 20 of every address is forced to 0 at boot time.
The A20 line must be enabled before all memory can be addressed (even in real mode).
First attempt: int 0x15, AX=0x2401
Second attempt: keyboard controller (the original method)
Third attempt: write to I/O port 0x92
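The effect of a masked A20 line can be modelled in a few lines of C. This is only a sketch of the address arithmetic, not of the enabling procedure itself, and both function names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Real-mode address formation: segment * 16 + offset. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* With A20 disabled, bit 20 of the address is forced to 0. */
uint32_t a20_effective_address(uint32_t addr, bool a20_enabled)
{
    if (a20_enabled)
        return addr;
    return addr & ~(UINT32_C(1) << 20);
}
```

With A20 disabled, segment 0xFFFF and offset 0x0010 form address 0x100000, which wraps around to 0x000000, just as it would on an 8088.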
Protected mode: Ahhh......
Finally
The jmp instruction is created at run-time
# Well, that certainly wasn't fun :-(. Hopefully it works, and we don't
# need no steenking BIOS anyway (except for the initial loading :-).
# The BIOS-routine wants lots of unnecessary data, and it's less
# "interesting" anyway. This is how REAL programmers do it.
#
# Well, now's the time to actually move into protected mode. To make
# ...
flush_instr:
        xorw    %bx, %bx                # Flag to indicate a boot
        xorl    %esi, %esi              # Pointer to real-mode code
        movw    %cs, %si
        subw    $DELTA_INITSEG, %si
        shll    $4, %esi                # Convert to 32-bit pointer

# jump to startup_32 in arch/i386/boot/compressed/head.S
#
# NOTE: For high loaded big kernels we need a
#       jmpi    0x100000,__BOOT_CS
#
#       but we yet haven't reloaded the CS register, so the default size
#       of the target offset still is 16 bit.
#       However, using an operand prefix (0x66), the CPU will properly
#       take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
#       Manual, Mixing 16-bit and 32-bit code, page 16-6)

        .byte 0x66, 0xea                # prefix + jmpi-opcode
code32: .long   0x1000                  # will be set to 0x100000
                                        # for big kernels
        .word   __BOOT_CS
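The hand-assembled jump above is eight bytes: the operand-size prefix, the far-jump opcode, a 32-bit offset and a 16-bit selector. A C sketch of the same byte layout follows. The function name encode_far_jmp32 is an assumption, and the selector 0x10 used in the usage note is only an example value, since the numeric value of __BOOT_CS is not shown on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write "0x66 0xea <offset32> <selector16>" into out, little-endian,
 * mirroring the .byte/.long/.word sequence at code32.
 * Returns the number of bytes written.
 */
size_t encode_far_jmp32(uint8_t out[8], uint32_t offset, uint16_t selector)
{
    out[0] = 0x66;                        /* operand-size prefix */
    out[1] = 0xea;                        /* far jmp opcode */
    out[2] = (uint8_t)(offset & 0xff);    /* offset, low byte first */
    out[3] = (uint8_t)((offset >> 8) & 0xff);
    out[4] = (uint8_t)((offset >> 16) & 0xff);
    out[5] = (uint8_t)((offset >> 24) & 0xff);
    out[6] = (uint8_t)(selector & 0xff);  /* code segment selector */
    out[7] = (uint8_t)(selector >> 8);
    return 8;
}
```

Because the offset lives in data rather than in a fixed instruction, a loader can patch it (0x1000 or 0x100000) before the jump executes; encode_far_jmp32(buf, 0x100000, 0x10) shows the big-kernel case.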
startup_32: Take 1
Disables interrupts
Clears eflags
Decompresses the kernel
Moves the kernel to its final location at 0x00100000
Performs an unconditional jump to 0x00100000
startup_32: Take 2
This is a different function with the same name (in arch/i386/kernel/head.S). Confusing.
Enables paging (and extended paging)
Reinitializes eflags
Loads the initial Global Descriptor Table
Initializes the IDT: setup_idt
Calls start_kernel
Global Descriptor Table
Loaded immediately after we enter startup_32
user mode: the 0x73 (code) and 0x7b (data) entries
        .quad 0x0000000000000000        /* NULL descriptor */
        .quad 0x0000000000000000        /* 0x0b reserved */
        .quad 0x0000000000000000        /* 0x13 reserved */
        .quad 0x0000000000000000        /* 0x1b reserved */
        .quad 0x0000000000000000        /* 0x20 unused */
        .quad 0x0000000000000000        /* 0x28 unused */
        .quad 0x0000000000000000        /* 0x33 TLS entry 1 */
        .quad 0x0000000000000000        /* 0x3b TLS entry 2 */
        /* ... */
        .quad 0x00cffa000000ffff        /* 0x73 user 4GB code at 0x00000000 */
        .quad 0x00cff2000000ffff        /* 0x7b user 4GB data at 0x00000000 */
        .quad 0x0000000000000000        /* 0x80 TSS descriptor */
        .quad 0x0000000000000000        /* 0x88 LDT descriptor */
        /* ... */
        .quad 0x0000000000000000        /* 0xd0 - unused */
        .quad 0x0000000000000000        /* 0xd8 - unused */
        .quad 0x0000000000000000        /* 0xe0 - unused */
        .quad 0x0000000000000000        /* 0xe8 - unused */
        .quad 0x0000000000000000        /* 0xf0 - unused */
        .quad 0x0000000000000000        /* 0xf8 - GDT entry 31: double-fault TSS */
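The two non-zero quads pack base, limit and access bits into one 64-bit value. The following C sketch unpacks them using the standard x86 segment descriptor layout; the struct and function names are assumptions made for this example.

```c
#include <stdint.h>

struct seg_info {
    uint32_t base;          /* 32-bit segment base */
    uint32_t limit;         /* raw 20-bit limit field */
    unsigned dpl;           /* descriptor privilege level, 0-3 */
    int present;            /* P bit */
    int is_code;            /* type bit 3 (meaningful for code/data segments) */
    int page_granular;      /* G bit: limit counted in 4 KB pages */
};

struct seg_info decode_descriptor(uint64_t d)
{
    struct seg_info s;
    uint8_t access = (uint8_t)((d >> 40) & 0xff);
    uint8_t flags = (uint8_t)((d >> 52) & 0x0f);

    s.base = (uint32_t)((d >> 16) & 0xffffff) |
             ((uint32_t)((d >> 56) & 0xff) << 24);
    s.limit = (uint32_t)(d & 0xffff) |
              ((uint32_t)((d >> 48) & 0x0f) << 16);
    s.dpl = (access >> 5) & 3;
    s.present = (access >> 7) & 1;
    s.is_code = (access >> 3) & 1;
    s.page_granular = (flags >> 3) & 1;
    return s;
}

/* Size covered by the segment in bytes. */
uint64_t segment_bytes(struct seg_info s)
{
    return ((uint64_t)s.limit + 1) << (s.page_granular ? 12 : 0);
}
```

Decoding 0x00cffa000000ffff yields base 0, limit 0xfffff with page granularity (4 GB) and DPL 3, which agrees with the "user 4GB code at 0x00000000" comment.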
setup_idt
Loops through all 256 interrupt vectors and points them at ignore_int
/*
 * setup_idt
 *
 * sets up a idt with 256 entries pointing to
 * ignore_int, interrupt gates. It doesn't actually load
 * idt - that can be done only after paging has been enabled
 * and the kernel moved to PAGE_OFFSET. Interrupts
 * are enabled elsewhere, when we can be relatively
 * sure everything is ok.
 *
 * Warning: %esi is live across this function.
 */
setup_idt:
        lea ignore_int,%edx
        movl $(__KERNEL_CS << 16),%eax
        movw %dx,%ax            /* selector = 0x0010 = cs */
        movw $0x8E00,%dx        /* interrupt gate - dpl=0, present */
        lea idt_table,%edi
        mov $256,%ecx
rp_sidt:
        movl %eax,(%edi)
        movl %edx,4(%edi)
        addl $8,%edi
        dec %ecx
        jne rp_sidt
        ret
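The register shuffling in setup_idt builds one 8-byte gate and copies it 256 times. A C sketch of the same packing follows. The names idt_gate, make_gate and fill_idt are assumptions for illustration, and the handler address in the usage note is made up.

```c
#include <stdint.h>

#define IDT_ENTRIES 256

struct idt_gate {
    uint32_t lo;  /* selector in bits 31-16, handler bits 15-0 */
    uint32_t hi;  /* handler bits 31-16, flags in bits 15-0 */
};

/* Pack a gate the way %eax/%edx are assembled in setup_idt. */
struct idt_gate make_gate(uint32_t handler, uint16_t selector, uint16_t flags)
{
    struct idt_gate g;

    g.lo = ((uint32_t)selector << 16) | (handler & 0xffffu);
    g.hi = (handler & 0xffff0000u) | flags;
    return g;
}

/* Point every vector at the same handler (0x8E00: present, dpl=0, interrupt gate). */
void fill_idt(struct idt_gate table[IDT_ENTRIES], uint32_t handler,
              uint16_t selector)
{
    for (int i = 0; i < IDT_ENTRIES; i++)
        table[i] = make_gate(handler, selector, 0x8E00);
}
```

For a hypothetical handler at 0xc0101234 and the selector 0x0010 from the comment, the two words come out as 0x00101234 and 0xc0108e00.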
ignore_int
Dummy function installed until the real handlers are ready; prints out an error message
/* This is the default interrupt "handler" :-) */
        ALIGN
ignore_int:
        cld
        pushl %eax
        pushl %ecx
        pushl %edx
        pushl %es
        pushl %ds
        movl $(__KERNEL_DS),%eax
        movl %eax,%ds
        movl %eax,%es
        pushl 16(%esp)
        pushl 24(%esp)
        pushl 32(%esp)
        pushl 40(%esp)
        pushl $int_msg
        call printk
        addl $(5*4),%esp
        popl %ds
        popl %es
        popl %edx
        popl %ecx
        popl %eax
        iret
/* ... */
int_msg:
        .asciz "Unknown interrupt or fault at EIP %p %p %p\n"
start_kernel
No more assembly, I promise
Orchestrates all initialization of kernel
Launches init process
start_kernel overview - 1
asmlinkage void __init start_kernel(void)
{
    char * command_line;
    extern struct kernel_param __start___param[], __stop___param[];
    /*
     * Interrupts are still disabled. Do necessary setups, then
     * enable them
     */
    lock_kernel();
    page_address_init();
    printk(linux_banner);
    setup_arch(&command_line);
    setup_per_cpu_areas();
    /*
     * Mark the boot cpu "online" so that it can call console drivers in
     * printk() and can access its per-cpu storage.
     */
    smp_prepare_boot_cpu();
    /*
     * Set up the scheduler prior starting any interrupts (such as the
     * timer interrupt). Full topology setup happens at smp_init()
     * time - but meanwhile we still have a functioning scheduler.
     */
    sched_init();
    /*
     * Disable preemption - early bootup scheduling is extremely
     * fragile until we cpu_idle() for the first time.
     */
    preempt_disable();
    build_all_zonelists();
    page_alloc_init();
    printk("Kernel command line: %s\n", saved_command_line);
    parse_early_param();
    parse_args("Booting kernel", command_line, __start___param,
               __stop___param - __start___param,
               &unknown_bootoption);
    sort_main_extable();
    trap_init();
start_kernel overview - 2
    rcu_init();
    init_IRQ();
    pidhash_init();
    init_timers();
    softirq_init();
    time_init();
    console_init();
    if (panic_later)
        panic(panic_later, panic_param);
    profile_init();
    local_irq_enable();
    vfs_caches_init_early();
    mem_init();
    kmem_cache_init();
    numa_policy_init();
    if (late_time_init)
        late_time_init();
    calibrate_delay();
    pidmap_init();
    pgtable_cache_init();
    prio_tree_init();
    anon_vma_init();
    fork_init(num_physpages);
    proc_caches_init();
    buffer_init();
    unnamed_dev_init();
    security_init();
    vfs_caches_init(num_physpages);
    radix_tree_init();
    signals_init();
    /* rootfs populating might need page-writeback */
    page_writeback_init();
#ifdef CONFIG_PROC_FS
    proc_root_init();
#endif
    check_bugs();
    acpi_early_init(); /* before LAPIC and SMP init */
start_kernel overview - 3
    /* Do the rest non-__init'ed, we're now alive */
    rest_init();
}
Scheduler Initialization
Quick review
Two arrays: active and expired processes
Each array contains a list for each priority
Scheduler Initialization
Initializes priority arrays and lists
void __init sched_init(void)
{
    /* ... */
    for (i = 0; i < NR_CPUS; i++) {
        prio_array_t *array;

        rq = cpu_rq(i);
        spin_lock_init(&rq->lock);
        rq->active = rq->arrays;
        rq->expired = rq->arrays + 1;
        rq->best_expired_prio = MAX_PRIO;
        /* ... */
        atomic_set(&rq->nr_iowait, 0);
        for (j = 0; j < 2; j++) {
            array = rq->arrays + j;
            for (k = 0; k < MAX_PRIO; k++) {
                INIT_LIST_HEAD(array->queue + k);
                __clear_bit(k, array->bitmap);
            }
            // delimiter for bitsearch
            __set_bit(MAX_PRIO, array->bitmap);
        }
    }
    /* ... */
    /*
     * Make us the idle thread. Technically, schedule() should not be
     * called from this thread, however somewhere below it might be,
     * but because we are the idle thread, we just pick up running
     * again when this runqueue becomes "idle".
     */
    init_idle(current, smp_processor_id());
}
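The bitmap and the "delimiter for bitsearch" are easier to see in a reduced model. The C sketch below replaces the per-priority lists with simple counters and uses a linear bit scan; MAX_PRIO is set to 140 as an assumption, and all names are illustrative rather than the kernel's.

```c
#include <limits.h>
#include <string.h>

#define MAX_PRIO 140  /* assumed value for this sketch */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((MAX_PRIO + 1 + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct prio_array {
    unsigned long bitmap[BITMAP_WORDS]; /* bit p set: priority p has tasks */
    unsigned int queued[MAX_PRIO];      /* stand-in for the per-priority lists */
};

static void bit_set(unsigned long *map, unsigned int n)
{
    map[n / BITS_PER_WORD] |= 1UL << (n % BITS_PER_WORD);
}

static void bit_clear(unsigned long *map, unsigned int n)
{
    map[n / BITS_PER_WORD] &= ~(1UL << (n % BITS_PER_WORD));
}

static int bit_test(const unsigned long *map, unsigned int n)
{
    return (map[n / BITS_PER_WORD] >> (n % BITS_PER_WORD)) & 1UL;
}

/* Empty every queue, then set the delimiter bit at MAX_PRIO. */
void prio_array_init(struct prio_array *a)
{
    memset(a, 0, sizeof(*a));
    bit_set(a->bitmap, MAX_PRIO);
}

void enqueue(struct prio_array *a, unsigned int prio)
{
    a->queued[prio]++;
    bit_set(a->bitmap, prio);
}

void dequeue(struct prio_array *a, unsigned int prio)
{
    if (a->queued[prio] > 0 && --a->queued[prio] == 0)
        bit_clear(a->bitmap, prio);
}

/* Lowest set bit = best priority; MAX_PRIO means "nothing runnable". */
unsigned int first_prio(const struct prio_array *a)
{
    for (unsigned int p = 0; p < MAX_PRIO; p++)
        if (bit_test(a->bitmap, p))
            return p;
    return MAX_PRIO;
}
```

Because the delimiter bit is always set, a bit search over the map always terminates with a valid answer, even when every queue is empty.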
IDT Initialization
Replacing ignore_int with real interrupt and trap handlers
IDT Initialization
void __init trap_init(void)
{
    set_trap_gate(0,&divide_error);
    set_intr_gate(1,&debug);
    set_intr_gate(2,&nmi);
    set_system_intr_gate(3, &int3); /* int3-5 can be called from all */
    set_system_gate(4,&overflow);
    set_system_gate(5,&bounds);
    set_trap_gate(6,&invalid_op);
    set_trap_gate(7,&device_not_available);
    set_task_gate(8,GDT_ENTRY_DOUBLEFAULT_TSS);
    set_trap_gate(9,&coprocessor_segment_overrun);
    set_trap_gate(10,&invalid_TSS);
    set_trap_gate(11,&segment_not_present);
    set_trap_gate(12,&stack_segment);
    set_trap_gate(13,&general_protection);
    set_intr_gate(14,&page_fault);
    set_trap_gate(15,&spurious_interrupt_bug);
    set_trap_gate(16,&coprocessor_error);
    set_trap_gate(17,&alignment_check);
    set_trap_gate(19,&simd_coprocessor_error);
    set_system_gate(SYSCALL_VECTOR,&system_call);
}
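The set_*_gate variants differ only in the gate type and the privilege level required to invoke the vector from software. The sketch below computes the 16-bit flags word for each combination. The type numbers (14 interrupt, 15 trap, 5 task) follow the x86 architecture; the mapping of each helper to a type/DPL pair is stated from memory of the 2.6 i386 sources and should be read as an assumption.

```c
#include <stdint.h>

enum gate_type {
    GATE_TASK = 5,
    GATE_INTERRUPT = 14,  /* interrupts disabled on entry */
    GATE_TRAP = 15        /* interrupts left as they were */
};

/* Flags word: present bit, DPL in bits 14-13, type in bits 11-8. */
uint16_t gate_flags(enum gate_type type, unsigned int dpl)
{
    return (uint16_t)(0x8000u + (dpl << 13) + ((unsigned int)type << 8));
}

/*
 * Assumed correspondence with the calls in trap_init:
 *   set_intr_gate        -> gate_flags(GATE_INTERRUPT, 0)
 *   set_system_intr_gate -> gate_flags(GATE_INTERRUPT, 3)
 *   set_trap_gate        -> gate_flags(GATE_TRAP, 0)
 *   set_system_gate      -> gate_flags(GATE_TRAP, 3)
 */
```

DPL 3 is what lets user code issue int3, into, bound and the system call vector directly, hence the "can be called from all" comment.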
Tasklet Initialization
void __init softirq_init(void)
{
    open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
    open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
}
Sets nanosecond based on CPU frequency and cycles since power on.
Picks timer to use for system from list of available timers
    /* ... */
    cur_timer = select_timer();
    printk(KERN_INFO "Using %s for high-res timesource\n",
           cur_timer->name);
    time_init_hook();
}
/* -------- */
struct timer_opts* __init select_timer(void)
{
    int i = 0;
    /* find most preferred working timer */
    while (timers[i]) {
        if (timers[i]->init)
            if (timers[i]->init(clock_override) == 0)
                return timers[i]->opts;
        ++i;
    }
    panic("select_timer: Cannot find a suitable timer\n");
    return NULL;
}
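select_timer walks a NULL-terminated list ordered by preference and takes the first entry whose init routine reports success with 0. A self-contained C sketch of that pattern follows. The struct layouts are simplified, and the two sample timers ("tsc" failing, "pit" succeeding) are invented for illustration.

```c
#include <stddef.h>

struct timer_opts {
    const char *name;
};

struct init_timer_opts {
    int (*init)(const char *override);   /* returns 0 on success */
    const struct timer_opts *opts;
};

/* Return the first timer whose init succeeds, or NULL if none does. */
const struct timer_opts *pick_timer(const struct init_timer_opts *const *timers,
                                    const char *override)
{
    for (size_t i = 0; timers[i]; i++)
        if (timers[i]->init && timers[i]->init(override) == 0)
            return timers[i]->opts;
    return NULL;
}

/* Hypothetical candidates: the preferred one fails, the fallback works. */
static int init_fails(const char *override) { (void)override; return -1; }
static int init_works(const char *override) { (void)override; return 0; }

static const struct timer_opts tsc_opts = { "tsc" };
static const struct timer_opts pit_opts = { "pit" };
static const struct init_timer_opts tsc_init = { init_fails, &tsc_opts };
static const struct init_timer_opts pit_init = { init_works, &pit_opts };

static const struct init_timer_opts *const sample_timers[] = {
    &tsc_init, &pit_init, NULL
};
```

Where this sketch returns NULL, the kernel version panics instead, since it cannot continue without a time source.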
Memory Initialization
void __init mem_init(void)
{
    extern int ppro_with_ram_bug(void);
    int codesize, reservedpages, datasize, initsize;
    int tmp;
    int bad_ppro;

    bad_ppro = ppro_with_ram_bug();
    set_max_mapnr_init();
    high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
    /* this will put all low memory onto the freelists */
    totalram_pages += __free_all_bootmem();
    reservedpages = 0;
    for (tmp = 0; tmp < max_low_pfn; tmp++)
        /*
         * Only count reserved RAM pages
         */
        if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))
            reservedpages++;
    set_highmem_pages_init(bad_ppro);
    codesize = (unsigned long) &_etext - (unsigned long) &_text;
    datasize = (unsigned long) &_edata - (unsigned long) &_etext;
    initsize = (unsigned long) &__init_end - (unsigned long) &__init_begin;
    kclist_add(&kcore_mem, __va(0), max_low_pfn << PAGE_SHIFT);
    kclist_add(&kcore_vmalloc, (void *)VMALLOC_START,
               VMALLOC_END-VMALLOC_START);
    printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init, %ldk highmem)\n",
           (unsigned long) nr_free_pages() << (PAGE_SHIFT-10),
           num_physpages << (PAGE_SHIFT-10),
           codesize >> 10,
           reservedpages << (PAGE_SHIFT-10),
           datasize >> 10,
           initsize >> 10,
           (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10)));
    if (boot_cpu_data.wp_works_ok < 0)
        test_wp_bit();
}
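The printk mixes two unit conversions: page counts are shifted left by PAGE_SHIFT-10 to get kilobytes, while byte counts are shifted right by 10. A small C sketch, assuming 4 KB pages (PAGE_SHIFT of 12) and helper names invented for the example:

```c
#define PAGE_SHIFT 12  /* assumed: 4 KB pages, as on i386 */

/* pages * 4096 / 1024, done as a single shift */
unsigned long pages_to_kb(unsigned long pages)
{
    return pages << (PAGE_SHIFT - 10);
}

/* bytes / 1024 */
unsigned long bytes_to_kb(unsigned long bytes)
{
    return bytes >> 10;
}
```

So 65536 free pages are reported as 262144k, and a code size of 2048 bytes as 2k.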
CPU speed is calculated
void __devinit calibrate_delay(void)
{
    unsigned long ticks, loopbit;
    int lps_precision = LPS_PREC;

    if (preset_lpj) {
        /* ... */
    } else {
        loops_per_jiffy = (1<<12);
        printk(KERN_DEBUG "Calibrating delay loop... ");
        while ((loops_per_jiffy <<= 1) != 0) {
            /* wait for "start of" clock tick */
            ticks = jiffies;
            while (ticks == jiffies)
                /* nothing */;
            /* Go .. */
            ticks = jiffies;
            __delay(loops_per_jiffy);
            ticks = jiffies - ticks;
            if (ticks)
                break;
        }
        /*
         * Do a binary approximation to get
         * loops_per_jiffy set to
         * equal one clock (up to lps_precision bits)
         */
        loops_per_jiffy >>= 1;
        loopbit = loops_per_jiffy;
        while (lps_precision-- && (loopbit >>= 1)) {
            loops_per_jiffy |= loopbit;
            ticks = jiffies;
            while (ticks == jiffies)
                /* nothing */;
            ticks = jiffies;
            __delay(loops_per_jiffy);
            if (jiffies != ticks)   /* longer than 1 tick */
                loops_per_jiffy &= ~loopbit;
        }
        /* Round the value and print it */
        printk("%lu.%02lu BogoMIPS (lpj=%lu)\n",
               loops_per_jiffy/(500000/HZ),
               (loops_per_jiffy/(5000/HZ)) % 100,
               loops_per_jiffy);
    }
}
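The calibration is a doubling search followed by a bit-by-bit refinement. The C sketch below runs the same two phases against a simulated machine, where the hypothetical helper crosses_tick stands in for "the delay loop ran into the next jiffy". LPS_PREC is assumed to be 8, and true_lpj is the made-up number of loops that fit in one tick.

```c
#define LPS_PREC 8  /* assumed number of refinement bits */

/* Simulated timing check: does a delay of `loops` last a full tick? */
static int crosses_tick(unsigned long loops, unsigned long true_lpj)
{
    return loops >= true_lpj;
}

unsigned long calibrate(unsigned long true_lpj)
{
    unsigned long lpj = 1UL << 12;
    unsigned long loopbit;
    int precision = LPS_PREC;

    /* Phase 1: double until the delay is at least one tick long. */
    while ((lpj <<= 1) != 0) {
        if (crosses_tick(lpj, true_lpj))
            break;
    }

    /* Phase 2: step back, then try each lower bit in turn. */
    lpj >>= 1;
    loopbit = lpj;
    while (precision-- && (loopbit >>= 1)) {
        lpj |= loopbit;
        if (crosses_tick(lpj, true_lpj))
            lpj &= ~loopbit;  /* too long: drop this bit again */
    }
    return lpj;
}
```

With true_lpj set to 2000000, the doubling phase stops at 2097152 and the refinement settles on 1998848, the largest value below the target that the eight refinement bits can express.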
Fork of init Process
static void noinline rest_init(void)
    __releases(kernel_lock)
{
    kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
    numa_default_policy();
    unlock_kernel();
    preempt_enable_no_resched();
    cpu_idle();
}
Fork of init Process
static int init(void * unused)
{
    lock_kernel();
    /*
     * Tell the world that we're going to be the grim
     * reaper of innocent orphaned children.
     *
     * We don't want people to have to make incorrect
     * assumptions about where in the task array this
     * can be found.
     */
    child_reaper = current;
    /* ... */
    /*
     * Ok, we have completed the initial bootup, and
     * we're essentially up and running. Get rid of the
     * initmem segments and start the user-mode stuff..
     */
    free_initmem();
    unlock_kernel();
    system_state = SYSTEM_RUNNING;
    numa_default_policy();
    if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
        printk("Warning: unable to open an initial console.\n");
    (void) sys_dup(0);
    (void) sys_dup(0);
Fork of init Process
    /*
     * We try each of these until one succeeds.
     * The Bourne shell can be used instead of init if we are
     * trying to recover a really broken machine.
     */
    if (execute_command)
        run_init_process(execute_command);
    run_init_process("/sbin/init");
    run_init_process("/etc/init");
    run_init_process("/bin/init");
    run_init_process("/bin/sh");
    panic("No init found. Try passing init= option to kernel.");
}
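The chain of run_init_process calls works because a successful exec never returns, so reaching the next line means the previous candidate failed. The C sketch below models that fallback order with a callback in place of the real exec. pick_init, exec_fn and only_sh_exists are hypothetical names for this example.

```c
#include <stddef.h>
#include <string.h>

typedef int (*exec_fn)(const char *path);  /* returns 0 on success */

/*
 * Try the init= override first, then the built-in candidates in order.
 * Returns the path that "ran", or NULL where the kernel would panic.
 */
const char *pick_init(const char *execute_command, exec_fn try_exec)
{
    static const char *const fallbacks[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
    };

    if (execute_command && try_exec(execute_command) == 0)
        return execute_command;
    for (size_t i = 0; i < sizeof fallbacks / sizeof fallbacks[0]; i++)
        if (try_exec(fallbacks[i]) == 0)
            return fallbacks[i];
    return NULL;
}

/* Hypothetical stand-in for exec: pretends only /bin/sh exists. */
int only_sh_exists(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}
```

On the "really broken machine" from the comment, where only a shell survives, pick_init(NULL, only_sh_exists) falls through the first three candidates and lands on /bin/sh.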
Kernel is Done initializing
System is now up and running from kernel’s POV
However, the system is not up from user’s POV
init process is responsible for bringing up system to usability
Running startup scripts
Running login processes
Handling other system events (power failure, ctrl-alt-delete, etc.)
Startup Scripts - inittab
Startup Scripts - rc.sysinit
Set hostname
Load modules for hardware
Perform checks on filesystems if needed
Mount filesystems (real and nodev)
Enable swap
Startup Scripts
Login Prompt
Booting in the Future: EFI
Extensible Firmware Interface
Replaces legacy BIOS
Provides a new, robust interface between OS and firmware
Fully 32-bit protected mode operation
Clean abstraction layers
Disk I/O
Graphics
(Intel)
Legacy Boot
[Figure: Current Kernel Initialization. BIOS -> 1st stage loader -> startup_32 -> startup_32 -> start_kernel -> /sbin/init. Callouts: processor in real mode upon transfer; system information; kernel self-decompression.]
*Third party marks and brands are the property of their respective owners (2004 Intel Developers Forum)