Initialization (1)
Taku Shimosawa

For the new Linux kernel book



Agenda
• Initialization Phase of the Linux Kernel
• Turning on the paging feature
• Calling *init functions
• And miscellaneous things related to initialization

1. vmlinux
This is the Linux kernel

vmlinux
• Main kernel binary
• Runs with the final CPU state
• Protected Mode in x86_32 (i386)
• Long Mode in x86_64
• And so on…
• Runs in the virtual memory space
• Above PAGE_OFFSET (default: 0xC0000000) (32-bit)
• Above __START_KERNEL_map (default: 0xFFFFFFFF80000000) (64-bit)
• i.e., all the absolute addresses in the binary are virtual ones
• Entry points
Architecture  Name        Location                         Name (secondary)
x86_32        startup_32  arch/x86/kernel/head_32.S        startup_32_smp
x86_64        startup_64  arch/x86/kernel/head_64.S        secondary_startup_64
ARM           stext       arch/arm/kernel/head[_nommu].S   secondary_startup
ARM64         stext       arch/arm64/kernel/head.S         secondary_holding_pen, secondary_entry
PPC           _stext      arch/powerpc/kernel/head_32.S*   (__secondary_start)

Virtual memory mapping

[Figure: kernel virtual-to-physical mappings.
 i386: kernel text/data mapped above PAGE_OFFSET (0xC0000000); up to ~896 MB of
 physical memory (LOWMEM) is direct-mapped there.
 x86_64: kernel text/data mapped in the top 2 GB, above __START_KERNEL_map
 (0xFFFFFFFF80000000); physical memory is direct-mapped at PAGE_OFFSET
 (0xFFFF880000000000).]
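To make the constants concrete, here is a simplified sketch of the x86_64
conversion macros (illustrative only; the real definitions in
arch/x86/include/asm/page_64.h also account for phys_base and handle both
mappings in __pa):

#define __START_KERNEL_map 0xffffffff80000000UL /* kernel text/data map */
#define PAGE_OFFSET        0xffff880000000000UL /* direct (linear) map  */

/* physical -> virtual, via the direct map */
#define __va(p) ((void *)((unsigned long)(p) + PAGE_OFFSET))
/* virtual -> physical, for a direct-map address */
#define __pa(v) ((unsigned long)(v) - PAGE_OFFSET)
/* virtual -> physical, for a kernel-text symbol (ignoring phys_base) */
#define __pa_symbol(v) ((unsigned long)(v) - __START_KERNEL_map)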

Why different mapping in 64-bit?


• The kernel code, data, and BSS reside in the last 2 GB of the address space
  => Addressable with (sign-extended) 32-bit values!
• -mcmodel option in GCC
  • Specifies the assumptions about the size of the code/data sections

-mcmodel option (x86)   text                 data
small                   within 2GB           within 2GB
kernel                  within -2GB          within -2GB
medium                  within 2GB           can be > 2GB
large                   anywhere in 64-bit   anywhere in 64-bit

Column: -mcmodel in gcc

int g_data = 4;

int main(void)
{
        g_data += 7;
        ...
}

small / kernel:
8b 05 c6 0b 20 00               mov 0x200bc6(%rip),%eax  # 601040 <g_data>
8d 50 07                        lea 0x7(%rax),%edx
bf 01 00 00 00                  mov $0x1,%edi
*The offset of RIP-relative addressing is 32-bit

large:
48 b8 40 10 60 00 00 00 00 00   movabs $0x601040,%rax
bf 01 00 00 00                  mov $0x1,%edi
8b 30                           mov (%rax),%esi
8d 56 07                        lea 0x7(%rsi),%edx

#define SZ (1 << 30)

int buf[SZ] = {1};

int main(void)
{
        buf[0] += 3;
}

small / kernel:
$ gcc -O3 -o ba -mcmodel=small bigarray.c
/usr/lib/gcc/x86_64-linux-gnu/4.8/crtbegin.o: In function `deregister_tm_clones':
crtstuff.c:(.text+0x1): relocation truncated to fit: R_X86_64_32 against symbol
`__TMC_END__' defined in .data section in ba

medium / large:
48 b8 60 10 a0 00 00 00 00 00   movabs $0xa01060,%rax
8b 08                           mov (%rax),%ecx
8d 51 03                        lea 0x3(%rcx),%edx

Column: -mcmodel in gcc (2)

• Code?
void nop(void)
{
        asm volatile(".fill (2 << 30), 1, 0x90");
}

small / medium / kernel:
$ gcc -O3 -o ba -mcmodel=small supernop.c
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol
`__libc_csu_fini' defined in .text section in
/usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS)

large:
$ gcc -O3 -o ba -mcmodel=large supernop.c
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol
`__libc_csu_fini' defined in .text section in
/usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS)

• Note that even -mcmodel=large fails: the failing relocation is in crt1.o,
  which itself was not compiled with the large model.

Initialization Overview

arch/*/boot/ : Booting Code
  (Preparing CPU states, gathering HW information, decompressing vmlinux, etc.)
      |
      v  vmlinux
arch/*/kernel/head*.S, head*.c : Low-level Initialization
  (Switching to the virtual-memory world, getting prepared for C programs)
      |
      v
init/main.c (start_kernel) : Initialization
  (Initializing all the kernel features, including the architecture-dependent
   parts; calls into arch/*/kernel, arch/*/mm, ...)
      |
      v
init/main.c (rest_init) : Creates the "init" process (PID=1) and lets it do the
  rest of the initialization (setting up multiprocessing, scheduling)
      |
      +-> init/main.c (kernel_init) : Performs the final initialization and
      |   "exec"s the user "init"
      +-> kernel/sched/idle.c (cpu_idle_loop) : The "swapper" (PID=0) now sleeps

2. Towards Virtual Memory

Enabling paging
• The early part is executed with paging off
  • i.e., in the physical address space
• vmlinux is assumed to be executed with paging on
  • The addresses in the binary are not physical addresses
• The first big job in vmlinux is enabling paging:
  • Creating a (transitional) page table
  • Setting the CPU to use the page table, and enabling paging
  • Jumping to the entry point in C (compiled for the virtual address space)

Identity Map
• At first, the goal page table cannot be used
  • Since changing the PC and enabling paging are (at least on x86) separate
    instructions

[Figure: with only the goal mapping, the instruction fetched right after
 "enable paging" is at the old (physical) PC, which is no longer mapped
 => page fault!]

Identity Map
• Therefore, an identity map is created in addition to the (goal) map

[Figure: (1) Create an initial page table containing both mappings.
 (2) Enable paging, and jump to a virtual address. (3) Zap the low
 (identity) mapping.]

Addresses in the transitional phase


• x86_64
  • The decompression routine enables paging and creates an identity page
    table (only for the first 4 GB)
    • Paging is required for the CPU to switch to 64-bit mode
    • Located in 6 pages (pgtable) in the decompression routine
  • Symbols in vmlinux are accessed with RIP-relative addressing
    • No trick is necessary for using the symbols
leaq _text(%rip), %rbp
subq $_text - __START_KERNEL_map, %rbp
...
leaq early_level4_pgt(%rip), %rbx
...
movq $(early_level4_pgt - __START_KERNEL_map), %rax
addq phys_base(%rip), %rax
movq %rax, %cr3
movq $1f, %rax
jmp *%rax
1: (arch/x86/kernel/head_64.S)

Addresses in the transitional phase


• i386
  • Symbols in vmlinux are accessed with absolute addresses
  • Before paging is enabled, PAGE_OFFSET is always subtracted from symbol
    addresses (the pa() macro)
#define pa(X) ((X) - __PAGE_OFFSET)

        movl $pa(__bss_start),%edi
        movl $pa(__bss_stop),%ecx
        subl %edi,%ecx
        shrl $2,%ecx
        rep ; stosl
...
movl $pa(initial_page_table), %eax
movl %eax,%cr3 /* set the page table pointer.. */
movl $CR0_STATE,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
1:
...
lgdt early_gdt_descr
lidt idt_descr
(arch/x86/kernel/head_32.S)

3. Initialization
At last, we have come here!

Initialization (start_kernel)
• A lot of *_init functions!
  • Furthermore, some init functions call other init functions
  • At least 80 functions are called from start_kernel
• These slides pick up some topics from the initialization functions

2.9. Before Initialization


A few more tricks

Special directives
• What are these?
asmlinkage __visible void __init start_kernel(void)
{
        ...
}

• “I’m curious!”

asmlinkage
• asmlinkage
• Ensures the symbol is not mangled
  • (in x86_32) Ensures all the parameters are passed on the stack
#ifdef __cplusplus
#define CPP_ASMLINKAGE extern "C"
#else
#define CPP_ASMLINKAGE
#endif

#ifndef asmlinkage
#define asmlinkage CPP_ASMLINKAGE
#endif
include/linux/linkage.h

#ifdef CONFIG_X86_32
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
arch/x86/include/asm/linkage.h
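For contrast, a hypothetical declaration: on i386 the kernel is built with
-mregparm=3, so without asmlinkage the first three arguments would arrive in
registers.

/* Illustrative only: regparm(0) forces the arguments onto the stack
 * (instead of eax/edx/ecx), so hand-written assembly can simply push
 * them and call into C. */
asmlinkage long sys_example(long a, long b);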

__visible
• (Effective in gcc >=4.6)
#if GCC_VERSION >= 40600
/*
 * Tell the optimizer that something else uses this function or variable.
 */
#define __visible __attribute__((externally_visible))
#endif
include/linux/compiler-gcc4.h
commit 9a858dc7cebce01a7bb616bebb85087fa2b40871
author Andi Kleen <ak@linux.intel.com> Mon Sep 17 21:09:15 2012
committer Linus Torvalds <torvalds@linux-foundation.org> Mon Sep 17 22:00:38 2012

compiler.h: add __visible

gcc 4.6+ has support for a externally_visible attribute that prevents the
optimizer from optimizing unused symbols away. Add a __visible macro to
use it with that compiler version or later.

This is used (at least) by the "Link Time Optimization" patchset.



__init (1)
• Marks code (text) and data as needed only during initialization
#define __init __section(.init.text) __cold notrace
#define __initdata __section(.init.data)
#define __initconst __constsection(.init.rodata)
#define __exitdata __section(.exit.data)
#define __exit_call __used __section(.exitcall.exit)
(include/linux/init.h)
#ifndef __cold
#define __cold __attribute__((__cold__))
#endif
(include/linux/compiler-gcc4.h)
#ifndef __section
# define __section(S) __attribute__ ((__section__(#S)))
#endif
...
#define notrace __attribute__((no_instrument_function))
(include/linux/compiler.h)
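To illustrate how these markers are typically used (a hypothetical snippet,
not from the kernel):

/* Hypothetical driver code, for illustration */
static int __initdata boot_mode;          /* placed in .init.data */

static int __init my_early_setup(void)    /* placed in .init.text */
{
        boot_mode = 1;
        return 0;
}

Both the function and the variable are thrown away when the init sections are
freed (see two slides ahead).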

__init (2)
• The init* sections are gathered into a contiguous memory area

. = ALIGN(PAGE_SIZE);
.init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
        __init_begin = .; /* paired with __init_end */
}
...
INIT_TEXT_SECTION(PAGE_SIZE)
#ifdef CONFIG_X86_64
:init
#endif
INIT_DATA_SECTION(16)
....
. = ALIGN(PAGE_SIZE);
...
.init.end : AT(ADDR(.init.end) - LOAD_OFFSET) {
        __init_end = .;
}
arch/x86/kernel/vmlinux.lds.S

[Diagram: __init_begin -> init.text -> init.data -> ... -> __init_end]

__init (3)
• And they are discarded (freed) after initialization
  • Called from kernel_init
void free_initmem(void)
{
free_init_pages("unused kernel",
(unsigned long)(&__init_begin),
(unsigned long)(&__init_end));
}
arch/x86/mm/init.c

void free_initmem(void)
{
...
poison_init_mem(__init_begin, __init_end - __init_begin);
if (!machine_is_integrator() && !machine_is_cintegrator())
free_initmem_default(-1);
}
arch/arm/mm/init.c

head32.c, head64.c
• Before start_kernel, i386_start_kernel or x86_64_start_kernel is called on x86
  • Located in arch/x86/kernel/head{32,64}.c
    • No underscore between head and 32!
• x86 (32-bit)
  • Reserve BIOS memory (in conventional memory)
• x86 (64-bit)
  • Erase the identity map
  • Clear BSS, copy boot information from the low memory
  • And reserve BIOS memory (see the sketch below)
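Roughly, the 64-bit entry looks like this (paraphrased and abbreviated from
arch/x86/kernel/head64.c of this kernel generation; not verbatim):

asmlinkage void __init x86_64_start_kernel(char *real_mode_data)
{
        ...
        reset_early_page_tables();      /* zap the identity mapping   */
        clear_bss();
        ...
        copy_bootdata(__va(real_mode_data));
        ...
        x86_64_start_reservations(real_mode_data);  /* reserves BIOS memory,
                                                       then calls start_kernel() */
}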

Reserve? But how?


• This is a very early time; no sophisticated memory management is working yet
• memblock (logical memory blocks) is working!
#define BIOS_LOWMEM_KILOBYTES 0x413
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
...
memblock_reserve(lowmem, 0x100000 - lowmem);
arch/x86/kernel/head.c
• memblock simply manages memory blocks
  • In some architectures, the information is handed over to another mechanism
    and discarded after initialization

#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK  /* set in S+core, IA64, S390, SH, MIPS and x86 */
#define __init_memblock __meminit    /* without memory hotplug, __meminit is __init */
#define __initdata_memblock __meminitdata
#else
...
#endif
include/linux/memblock.h

memblock
• Data structure (include/linux/memblock.h)

[Diagram: the global variable "memblock" (struct memblock) contains two
 struct memblock_type members, "memory" and "reserved"; each memblock_type
 points to an array of struct memblock_region (base, size, flags[, nid]).]
• Initially the arrays are allocated statically


static struct memblock_region
memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region
memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
*INIT_MEMBLOCK_REGIONS = 128
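For reference, the structures look roughly like this (trimmed from
include/linux/memblock.h of this kernel generation):

struct memblock_region {
        phys_addr_t base;
        phys_addr_t size;
        unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        int nid;
#endif
};

struct memblock_type {
        unsigned long cnt;       /* number of regions            */
        unsigned long max;       /* size of the allocated array  */
        phys_addr_t total_size;
        struct memblock_region *regions;
};

struct memblock {
        bool bottom_up;
        phys_addr_t current_limit;
        struct memblock_type memory;
        struct memblock_type reserved;
};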

Reserving in memblock
• Reserving adds the region to the region array in the
“reserved” type
static int __init_memblock memblock_reserve_region(phys_addr_t base,
phys_addr_t size,
int nid,
unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;

...
return memblock_add_region(_rgn, base, size, nid, flags);
}

int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
{
        return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
}

• The function for adding an available region is memblock_add
  (usage sketched below)
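A hypothetical early-boot sequence showing how the two region types work
together (the function names are the real memblock API; the addresses are
made up):

/* Make a bank of RAM known to memblock ("memory" type) */
memblock_add(0x00000000, 0x40000000);        /* 1 GB of RAM */
/* Carve out a firmware area ("reserved" type) */
memblock_reserve(0x000f0000, 0x00010000);    /* 64 KB       */
/* Early allocation: finds space that is in memory && !reserved */
phys_addr_t p = memblock_alloc(SZ_1M, PAGE_SIZE);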
When is the available memory added?
• x86
• memblock_x86_fill
• called by setup_arch (8/80)
void __init memblock_x86_fill(void)
{
        ...
        memblock_allow_resize();    /* <- BTW, what's this? (next slide) */

        for (i = 0; i < e820.nr_map; i++) {
                ...
                memblock_add(ei->addr, ei->size);
        }
        memblock_trim_memory(PAGE_SIZE);
        ...
}

• ARM
• arm_memblock_init
• Also called by setup_arch (8/80)

Resizing, or reallocation.
• memblock uses slab for resizing, if available
  • The number of e820 entries may be more than 128
  • However, slab only becomes available at kmem_cache_init, called from
    mm_init (25/80), so it is not available at this point
• Otherwise, memblock allocates the new array by itself, by finding an area
  that is in "memory" && !"reserved"
static int __init_memblock memblock_double_array(struct memblock_type *type,
phys_addr_t new_area_start,
phys_addr_t new_area_size)
{

addr = memblock_find_in_range(new_area_start + new_area_size,
memblock.current_limit,
new_alloc_size, PAGE_SIZE);

memblock: Debug options


• “memblock=debug”
static int __init early_memblock(char *p)
{
if (p && strstr(p, "debug"))
memblock_debug = 1;
return 0;
}
early_param("memblock", early_memblock);

static int __init_memblock memblock_reserve_region(...)
{
        ...
        memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
                     (unsigned long long)base,
                     (unsigned long long)base + size - 1,
                     flags, (void *)_RET_IP_);

3. Initialization
Okay, okay.

start_kernel
• What’s the first initialization function called?
  smp_setup_processor_id()   ((at least 2.6.18) ~ 3.2)
  lockdep_init()             (3.3 ~)
commit 73839c5b2eacc15cb0aa79c69b285fc659fa8851
Author: Ming Lei <tom.leiming@gmail.com>
Date: Thu Nov 17 13:34:31 2011 +0800

init/main.c: Execute lockdep_init() as early as possible


This patch fixes a lockdep warning on ARM platforms:

[ 0.000000] WARNING: lockdep init error! Arch code didn't call lockdep_init() early
enough?
[ 0.000000] Call stack leading to lockdep invocation was:
[ 0.000000] [<c00164bc>] save_stack_trace_tsk+0x0/0x90
[ 0.000000] [<ffffffff>] 0xffffffff

The warning is caused by printk inside smp_setup_processor_id().



init (1/80) : lockdep_init

• Initializes lockdep (lock validator)
  • "Runtime locking correctness validator"
  • Config: CONFIG_LOCKDEP, selected by PROVE_LOCKING, DEBUG_LOCK_ALLOC
    or LOCK_STAT
  • Detects:
    • Lock inversion
    • Circular lock dependencies
• When enabled, lockdep is called whenever any spinlock or mutex is acquired
  • Thus, lockdep must be initialized first
• Initialization is simple (just initializing the list_heads of the hash tables)
void lockdep_init(void)
{...
for (i = 0; i < CLASSHASH_SIZE; i++)
INIT_LIST_HEAD(classhash_table + i);

for (i = 0; i < CHAINHASH_SIZE; i++)


INIT_LIST_HEAD(chainhash_table + i);
...}
kernel/locking/lockdep.c

init (2/80) : smp_setup_processor_id

• Only effective on some architectures
  • ARM, s390, SPARC

u32 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = MPIDR_INVALID };

void __init smp_setup_processor_id(void)
{
        int i;
        /* mpidr: hardware CPU (core) ID */
        u32 mpidr = is_smp() ? read_cpuid_mpidr() & MPIDR_HWID_BITMASK : 0;
        u32 cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);

        /* Exchange the logical ID for the boot CPU and the logical ID
         * for CPU 0 */
        cpu_logical_map(0) = cpu;
        for (i = 1; i < nr_cpu_ids; ++i)
                cpu_logical_map(i) = i == cpu ? 0 : i;

        set_my_cpu_offset(0);

        pr_info("Booting Linux on physical CPU 0x%x\n", mpidr);
}
arch/arm/kernel/setup.c

(e.g., booting on physical CPU 2 gives cpu_logical_map: 2 1 0 3)

init (3/80) : debug_objects_early_init

• Initializes debugobjects
  • Config: CONFIG_DEBUG_OBJECTS
  • A lifetime-debugging facility for objects
  • Seems to be used by timer, hrtimer, workqueue, percpu_counter and RCU
• Again, this function just initializes locks and list heads
void __init debug_objects_early_init(void)
{
int i;

for (i = 0; i < ODEBUG_HASH_SIZE; i++)


raw_spin_lock_init(&obj_hash[i].lock);

for (i = 0; i < ODEBUG_POOL_SIZE; i++)


hlist_add_head(&obj_static_pool[i].node, &obj_pool);
}
lib/debugobjects.c

init (4/80): boot_init_stack_canary

• Sets up the stack protector
  • include/asm/stackprotector.h
  • Decides the canary value based on a random value and the TSC
static __always_inline void boot_init_stack_canary(void)
{
u64 canary;
u64 tsc;

#ifdef CONFIG_X86_64
BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);

current->stack_canary = canary;
#ifdef CONFIG_X86_64
this_cpu_write(irq_stack_union.stack_canary, canary);
#else
this_cpu_write(stack_canary.canary, canary);
#endif
}

init (5/80): cgroup_init_early

• Initializes cgroups
  • For subsystems that have early_init set, initializes the subsystem
    • cpu, cpuacct, cpuset
    • The rest of the subsystems are initialized in cgroup_init (71/80)
  • Initializes the structures, and the names of the subsystems
    (sketched below)
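The shape of the function is roughly the following (paraphrased from
kernel/cgroup.c; abbreviated, not verbatim):

int __init cgroup_init_early(void)
{
        struct cgroup_subsys *ss;
        int i;
        ...
        for_each_subsys(ss, i) {
                /* fill in IDs and default names for every subsystem */
                ...
                if (ss->early_init)
                        cgroup_init_subsys(ss);   /* cpu, cpuacct, cpuset */
        }
        return 0;
}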

init (6/80): boot_cpu_init

• Initializes various cpumasks for the boot CPU
  • online   : available to the scheduler   (!HOTPLUG_CPU => same as active)
  • active   : available to migration
  • present  : the CPU is populated         (!HOTPLUG_CPU => same as possible)
  • possible : the CPU can be populated
• set_cpu_online also adds the CPU to active
• set_cpu_present does not add the CPU to possible
static void __init boot_cpu_init(void)
{
        int cpu = smp_processor_id();
        /* Mark the boot cpu "present", "online" etc for SMP and UP case */
        set_cpu_online(cpu, true);
        set_cpu_active(cpu, true);
        set_cpu_present(cpu, true);
        set_cpu_possible(cpu, true);
}
init/main.c

cpumask
• A bit map
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
include/linux/cpumask.h

#define DECLARE_BITMAP(name,bits) \
        unsigned long name[BITS_TO_LONGS(bits)]
include/linux/types.h

#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
include/linux/bitops.h

[Diagram: "bits" is an array of longs (4/8 bytes each) holding NR_CPUS bits.]
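A concrete (hypothetical) instance: with NR_CPUS = 256 on a 64-bit machine,

/* BITS_TO_LONGS(256) = DIV_ROUND_UP(256, 8 * sizeof(long)) = 256/64 = 4 */
typedef struct cpumask { unsigned long bits[4]; } cpumask_t;  /* 32 bytes */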

Set bit! (x86)


#define IS_IMMEDIATE(nr) (__builtin_constant_p(nr))
...
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "orb %1,%0"
: CONST_MASK_ADDR(nr, addr)
: "iq" ((u8)CONST_MASK(nr))
: "memory");
} else {
asm volatile(LOCK_PREFIX "bts %1,%0"
: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
}
}
arch/x86/include/asm/bitops.h
• The register bit-offset operand for bts is
  • -2^31 ~ 2^31-1, or -2^63 ~ 2^63-1

Set bit! (ARM)

#if __LINUX_ARM_ARCH__ >= 6
        .macro bitop, name, instr
ENTRY(  \name           )
UNWIND( .fnstart        )
        ands    ip, r1, #3
        strneb  r1, [ip]                @ assert word-aligned
        mov     r2, #1
        and     r3, r0, #31             @ Get bit offset
        mov     r0, r0, lsr #5
        add     r1, r1, r0, lsl #2      @ Get word offset
        ...
        mov     r3, r2, lsl r3
1:      ldrex   r2, [r1]
        \instr  r2, r2, r3
        strex   r0, r2, [r1]
        cmp     r0, #0
        bne     1b
        bx      lr
UNWIND( .fnend          )
ENDPROC(\name           )
        .endm

(instantiated as: bitop _set_bit, orr)

smp_processor_id
• Returns the core ID (as seen by the kernel)
• On ARM (and on x86 in the old days)
  • Stored in the thread_info ("current"), found at the base of the
    current stack
• On x86
  • Stored in the per-cpu area
#define raw_smp_processor_id() (this_cpu_read(cpu_number))
arch/x86/include/asm/smp.h

#define raw_smp_processor_id() (current_thread_info()->cpu)


arch/arm/include/asm/smp.h
static inline struct thread_info *current_thread_info(void)
{
register unsigned long sp asm ("sp");
return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
arch/arm/include/asm/thread_info.h

Next
• Topics and the rest of initialization
• Setup parameters (early_param() etc.)
• Initcalls
• Multiprocessor supports
• Per-cpus
• SMP boot (secondary boot)
• SMP alternatives
• And other alternatives
• And Others?
• Modules?
