
LINUX INTERNALS

• Topics:
– Process definition and scheduling
– Memory management
– Process control and interaction
– Boot sequence
– I/O subsystem
– File system and VFS
– Networking

1
Introduction to Linux
• Free, open-source OS
– developed originally by Linus Torvalds, a Finnish computer
science student, in 1991
– placed under the GNU General Public License (anyone can
use, copy, and modify it)
– designed primarily for Intel PCs, but also runs on many
other processors (SPARC, Alpha, etc.)

2
Introduction to Linux (continued)
• distributed (CD with everything, including source and
installation/management tools) by a number of companies
(Red Hat, Slackware, ...), which is sometimes easier than pulling
it off the Internet
• "maintained" by Internet users at large

3
Linux Information Sources

4
Linux Kernel Source Code
• The top level of the source tree is /usr/src/linux. The OS
source is found in subdirectories according to functionality:
– arch - all of the architecture specific kernel code;
subdirectories for each architecture.
– include - most of the header files needed to build the kernel
code.
– init - initialization code for the kernel (good place to start
looking at how the kernel works).
– mm - memory management code (architecture-dependent
mm code is under arch/*/mm/).

5
Linux Kernel Source Code (continued)
• drivers - contains all the system’s device drivers, further
sub-divided into classes (char, block, net, ...)
• ipc - interprocess communications code.
• modules - used to hold built modules.
• fs - file system code; further divided into fs types (e.g., vfat and
ext2)
• kernel - main kernel code (except architecture specific)
• net - networking code (sockets, tcp, ip,...)
• lib - various simple library routines.
• scripts - scripts (for example awk and tk scripts) that are used
when the kernel is configured.

6
Traditional Unix Kernel
• Unix is a monolithic operating system. Traditionally, the entire
Unix kernel was loaded into physical memory and remained
memory resident.
• Newer systems, such as Linux, relax this requirement, by
allowing demand-loadable kernel modules.

7
Traditional Unix Kernel (continued)
• Typical kernel functions
– Controlling the execution of processes by allowing their
creation, termination or suspension, and communication
– Scheduling processes fairly for execution on the CPU
– Allocating main memory for an executing process
– Allowing processes controlled access to peripheral devices
– Allocating secondary memory for efficient storage and
retrieval of user files and process images
• In Unix, the kernel does not persist as a process itself. Rather,
its routines are executed (in protected mode) on behalf of user
processes.

8
Outline of Linux Topics
• Process definition and scheduling
• Memory management and paging
• Process control and interprocess communication
• Linux boot sequence
• I/O subsystem and device drivers
• Linux file system design
• Linux networking
• Dynamically loadable modules

9
Process Definition
• In many (old and new) versions of Unix, two kernel data
structures describe the state of a process
– proc table entry: everything that must be known even when
the process is swapped out
– u area: everything else
• In Linux, these are combined into a single task_struct
structure, containing information on:
– current process state, timers, signals, links to other tasks,
pointers to memory-management info, open files, permissions, etc.

10
Process Definition (continued)
• Process space comprises three regions
– text: read only and can be shared by other processes
– data: global variables, usually private to the process
– stack: growable, for execution of program

11
Process Activities
• Process creation
– In Unix processes are created through system call fork or
derivatives (vfork, clone).
– In all versions of Unix, certain processes created at boot
time have special significance.
• Process termination through exit.
• Process suspension and resumption
– Example: Sleep/Wakeup, Processes go to sleep because they
are awaiting the occurrence of some event (sleep on an
event).
– Sleeping processes do not consume CPU resources
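• These calls are easiest to see from user space; a minimal
sketch using the standard C library interface (hypothetical
example, not from the kernel source):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();            /* create a new process */
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {                /* child: fork returned 0 */
        printf("child %d running\n", (int)getpid());
        exit(42);                  /* terminate via exit */
    }
    /* parent: sleeps in the kernel until the child dies */
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}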

12
Linux Task Structure
• task_struct data structure (large!)
– one per process, pointed to in an array task[] of length 512
(default) in the kernel, defined in include/linux/sched.h
– task array allocated in kernel/sched.c
struct task_struct *task[NR_TASKS] = {&init_task, };
– a global variable in the kernel, current, points to the
currently running process

13
Linux Task Structure (continued)
• State
– running - either running or ready to run
– waiting, interruptible - can be interrupted by signals
– waiting, uninterruptible - cannot be interrupted under any
circumstances.
– stopped - for example, being debugged
– zombie - halted, but task struct still allocated
struct task_struct {
/* these are hardcoded - don't touch */
volatile long state;
/* -1 unrunnable, 0 runnable, >0 stopped */
...

14
Linux Task Structure (continued)
• Scheduling information
– determines which process is selected to run
– policy - scheduling policy for this process
– counter - how many clock ticks left in its time slice
– priority - static priority of process
– rt_priority - real-time priority
• Identifiers
– pid, pgrp, session, leader, groups[],
– uid, euid, gid, egid, ...
– (what is effective uid and how is it set?)

15
Linux Task Structure (continued)
• IPC information
– status of signals, which are blocked (signal and blocked)
– signal handlers
/* signal handlers */

struct signal_struct *sig; <-- in task_struct

...

struct signal_struct {
int count;
struct sigaction action[32];
...
};

16
Task Structure Contents (continued)
• Links to parent and children processes
– tree of processes, wait queue for dying children
/*
* pointers to (original) parent process,
* youngest child, younger sibling,
* older sibling, respectively.
* (p->father can be replaced with
* p->p_pptr->pid)
*/

struct task_struct *p_opptr, *p_pptr, *p_cptr,
                   *p_ysptr, *p_osptr;
struct wait_queue *wait_chldexit;  /* for wait4() */

17
Task Structure Contents (continued)
• can see tree using pstree command (ptree in Solaris)
– also, a doubly linked list of all processes, and a doubly
linked list of processes on the run queue
• struct task_struct *next_task, *prev_task;
• struct task_struct *next_run, *prev_run;
• Exercise: Use the Solaris ptree command to examine the
processes running on your machine.

18
Task Structure Contents (continued)
• Times and Timers
– scheduling information
– user-defined timers, interval timers (see setitimer and
getitimer system calls)
unsigned long timeout, policy, rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
long utime, stime, cutime, cstime, start_time;
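• The user-level side of these timers is the setitimer(2) and
getitimer(2) interface; a small sketch (ITIMER_REAL drives the
it_real_* fields above and delivers SIGALRM on expiry):

#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;
    write(1, "tick\n", 5);          /* async-signal-safe output */
}

int main(void)
{
    struct itimerval it;
    signal(SIGALRM, on_alarm);
    it.it_value.tv_sec = 1;         /* first expiry after 1 second */
    it.it_value.tv_usec = 0;
    it.it_interval.tv_sec = 1;      /* then every second */
    it.it_interval.tv_usec = 0;
    setitimer(ITIMER_REAL, &it, NULL);
    for (;;)
        pause();                    /* sleep until a signal arrives */
}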

19
Task Structure Contents (continued)
• Memory management
– pointer to structure defining virtual memory and process
image
/* memory management info */
struct mm_struct *mm;
...

/* mm fault and swap info: this can arguably
 * be seen as either mm-specific or
 * thread-specific */
unsigned long min_flt, maj_flt, nswap,
              cmin_flt, cmaj_flt, cnswap;
int swappable:1;

20
unsigned long swap_address;
unsigned long old_maj_flt;
/* old value of maj_flt */
unsigned long dec_flt;
/* page fault count of the last time */
unsigned long swap_cnt;
/* number of pages to swap on next pass */

21
File Information in the Task Structure
• File system information
– local filesystem root, current working directory, and open
files
/* filesystem information */
struct fs_struct *fs;
/* open file information */
struct files_struct *files;
struct fs_struct{
int count; /* reserved*/
unsigned short umask;
struct inode * root, * pwd;
...
}

22
File Information in the Task Structure (continued)
struct files_struct {
int count;
fd_set close_on_exec;
/* bit map - close these on exec */
fd_set open_fds;
struct file *fd[NR_OPEN];
/* indexed by file descriptor */
};

23
File Structures

24
Task Structure Contents (continued)
• Personality - Linux can emulate Unix environments other than
native i386 Linux
struct exec_domain *exec_domain;
unsigned long personality;
• Status information
int exit_code, exit_signal;
int errno;
• Program name
char comm[16];

25
Task Structure Contents (continued)
• Multiprocessor information
#ifdef __SMP__
int processor;
int last_processor;
#endif
• Processor specific context
– different thread_struct for each architecture
– include/asm-i386/processor.h
∗ struct thread_struct tss; /* tss for this task */

26
Unix Scheduling
• Common features
– all versions of Unix support a time-slice scheduler (time
slice varies)
– process may also give up processor when it waits on an event
– timeout, sleep, wakeup: internal kernel routines used

27
Unix Scheduling (continued)
• Example states
– executing in user mode
– executing in kernel mode
– not executing but is ready to run as soon as the kernel
schedules it
– sleeping and interruptible
– sleeping and non-interruptible
– returning from the kernel to user mode, but the kernel
preempts it and does a context switch to schedule another
process
– stopped (for example, by a debugger) has executed the exit
system call and is in the zombie state, which is the final
state of a process

28
Process/Scheduler Interaction
• The scheduler always executes in the context of a user process.
• Context Switching
– Context of process is its state (text, values in data and
registers, values in process structure(s), and stack)
– When doing a context switch, the kernel saves enough
information so that process can be recovered and resumed
later.

29
Linux Scheduler
• Operation
– runs whenever a process voluntarily relinquishes control or
its time slice expires
– time slice of 200 ms
– selects "most deserving" process on run queue
• Priority
– in priority field of task_struct
– equal to the number of clock ticks (jiffies) for which it will
run if it does not relinquish the processor
– can be changed dynamically (e.g., via the nice system call or
the renice command)
– counter field in task_struct initially set to priority of process
– decremented with each clock tick

30
Linux Scheduler (continued)
• Linux also supports real-time processes
– identified by t.policy
– have higher priority than any non-real-time process
– t.rt_priority holds relative real-time priority

31
Process selection
• Algorithm
– step through run queue, and note process with highest
priority
– uses goodness function to compute priority (includes real
time and SMP weights)
• goodness() (kernel/sched.c)
/*
* This is the function that decides how desirable a
* process is. You can weigh different processes
* against each other depending on what CPU they’ve
* run on lately etc to try to handle cache and TLB
* miss penalties.
*
* Return values:
* -1000: never select this

32
* 0: out of time, recalculate counters
* (but it might still be selected)
* +ve: "goodness" value (the larger, the better)
* +1000: realtime process, select this.
*/

33
Process selection (cont)
int weight;

#ifdef __SMP__
/* We are not permitted to run a task someone
* else is running */
if (p->processor != NO_PROC_ID)
return -1000;
#ifdef PAST_2_0
/* This process is locked to a processor group */
if (p->processor_mask &&
    !(p->processor_mask & (1<<this_cpu)))
return -1000;
#endif
#endif

/*

34
* Realtime process, select the first one on the
* runqueue (taking priorities within processes
* into account).
*/
if (p->policy != SCHED_OTHER)
return 1000 + p->rt_priority;
/*
* Give the process a first-approximation goodness value
* according to the number of clock-ticks it has left.
*
* Don’t do any other calculations if the time slice is
* over..
*/

35
Process selection (continued)
weight = p->counter;
if (weight) {

#ifdef __SMP__
/* Give a largish advantage to the same processor... */
/* (this is equivalent to penalizing other processors) */
if (p->last_processor == this_cpu)
weight += PROC_CHANGE_PENALTY;
#endif
/* .. and a slight advantage to the current process */
if (p == prev)
weight += 1;
}

return weight;
}

36
Scheduler Invocation
• Scheduler is invoked "voluntarily" from many places in the
kernel
• In addition, scheduler called whenever
current->counter expires.
• Wait queues
– simply a list of processes associated with some resource
– processes add themselves to queue, then call scheduler

37
Scheduler Invocation
struct wait_queue {
struct task_struct *task;
struct wait_queue *next;
};

#define WAIT_QUEUE_HEAD(x) ((struct wait_queue *)((x)-1))

static inline void init_waitqueue(struct wait_queue **q)
{
*q = WAIT_QUEUE_HEAD(q);
}
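• A sketch of the usual idiom from kernels of this era, using the
interruptible_sleep_on/wake_up_interruptible helpers (dev_wq and
data_ready are hypothetical driver variables):

#include <linux/sched.h>
#include <linux/errno.h>

static struct wait_queue *dev_wq = NULL;
static volatile int data_ready = 0;

static int dev_read_wait(void)
{
    while (!data_ready) {
        /* add self to dev_wq, mark INTERRUPTIBLE, call schedule() */
        interruptible_sleep_on(&dev_wq);
        if (current->signal & ~current->blocked)
            return -EINTR;          /* woken by a signal instead */
    }
    return 0;
}

static void dev_interrupt(void)     /* called from the IRQ handler */
{
    data_ready = 1;
    wake_up_interruptible(&dev_wq); /* make the sleepers runnable */
}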

38
Bottom Half Handling
• Requiring interrupt handlers of device drivers to do all
processing of tasks related to a given interrupt may not be
advisable
– the rest of the system is suspended during interrupt
– many such tasks may not be time critical
– what is time critical is "registering" these tasks to be done
later
• Linux supports up to 32 "bottom half" handlers
– bh_base points to handling routines
– bh_mask indicates which entries are valid
– bh_active indicates which need service
– handlers are scanned in index order, so lower numbers (0 is
for timers) are effectively higher priority
• Jobs to be handled later are placed on task queues (see the
sketch below)
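• A sketch of the registration idiom, roughly the 2.0-era API from
include/linux/interrupt.h (MYDEV_BH is a hypothetical free slot):

#include <linux/interrupt.h>

#define MYDEV_BH 15                  /* hypothetical bh_base[] slot */

static void mydev_bh(void)           /* runs later, interrupts enabled */
{
    /* drain the work queued by the interrupt handler ... */
}

static void mydev_init(void)
{
    init_bh(MYDEV_BH, mydev_bh);     /* install handler, set bh_mask bit */
}

static void mydev_interrupt(void)
{
    /* time-critical work only, then: */
    mark_bh(MYDEV_BH);               /* set bh_active bit; handler runs
                                        when the kernel next checks */
}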

39
Bottom Half Data Structures
Bit maps and function vector

40
Bottom Half Data Structures (continued)
Bit maps and function vector

41
Memory Management Techniques
• Swapping
– used in early System V Unix versions
– whole processes are moved in and out of memory
– first-fit allocation of main memory
– swapper process wakes up periodically and makes swapping
decisions

• Paging
– bring pages into memory on demand
– use page-fault handler
– pre-paging may be done for performance
– page-replacement algorithm (such as LRU or approximation) is
major design decision
– if paging system is overloaded (thrashing), swapper can swap out
whole processes

42
Linux Memory Management
• Features
– demand paged virtual memory
– memory space protection provided by architecture
– processes can share virtual memory (text, shared (dynamic)
libraries, shmem)
• Implementation
– virtual and physical memory divided into pages (4K on Intel
processors)
– if a referenced page is not present in memory, a page fault
results

43
Linux Memory Management (continued)

44
Linux Page Tables
• Memory address structure
– Linux assumes three levels of page tables
– virtual address is broken into fields, three of which point to
page table entries, another is offset
– definition of field bits is architecture dependent, but macros
hide this from Linux kernel
– idea is to have a "virtual" mm, just as we see Unix supports
a virtual file system

45
Linux Page Tables (continued)

46
Details for i386 Architecture
• Example for 386 architecture
– two levels of indirection in address translation
– page directory contains pointers to 1024 page tables
– each page table contains pointers to 1024 pages
– the register CR3 contains the physical base address of the
page directory and is stored as part of the TSS in the
task struct and is loaded on each task switch
– 32 bit linear address is divided as follows: 31-22 DIR, 21-12
TABLE, 11-0 OFFSET

47
Details for i386 Architecture
• Example for 386 architecture (continued)
– physical address is then computed (in hardware) as follows:
∗ page table base = page directory entry DIR, read from the
directory at CR3
∗ page base = page table entry TABLE, read from that table
∗ physical address = page base + OFFSET
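• Written out in C, the walk looks roughly like this (illustrative
only: the MMU does this in hardware, physical memory is assumed
directly addressable, and a real walk would first check each
entry's present bit):

#include <stdint.h>

#define DIR(va)    (((va) >> 22) & 0x3ff)  /* bits 31-22 */
#define TABLE(va)  (((va) >> 12) & 0x3ff)  /* bits 21-12 */
#define OFFSET(va) ((va) & 0xfff)          /* bits 11-0  */

uint32_t translate(uint32_t cr3, uint32_t va)
{
    uint32_t *pgdir = (uint32_t *)(cr3 & ~0xfffu);   /* page directory */
    uint32_t pde = pgdir[DIR(va)];                   /* directory entry */
    uint32_t *pgtable = (uint32_t *)(pde & ~0xfffu); /* page table */
    uint32_t pte = pgtable[TABLE(va)];               /* table entry */
    return (pte & ~0xfffu) | OFFSET(va);             /* frame + offset */
}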

48
Page Table Entry
• Upper 20 bits contain address information
• lower 12 bits are used to store useful information about the page
table (or page) pointed to by the entry
• Format for page directory and page table entries:
31-12  11-9  8  7  6  5  4  3  2    1    0
ADDR   OS    0  0  D  A  0  0  U/S  R/W  P
OS : bits available to the OS (used by replacement policy).
D : marks a page as dirty (1) (undefined for a page directory
entry)
A : set when the page has been accessed (1).
U/S : denotes a user page (1) or a system page (0).
R/W : 0 if the page is read-only.
P : set (1) if the page is present in memory.

49
Page table entry (continued)
• When a page is swapped out, bits 1-31 of the page table entry
are used to mark where the page is stored in swap (bit 0, the
present bit, must be 0).
• Of course, a TLB (translation lookaside buffer) is used as a
"cache" of address translations.

50
Page Allocation and Deallocation
• Each physical page is described by a mem_map_t data structure
(an array of these, called mem_map, is defined in
include/linux/mm.h):
typedef struct page {
/* these must be first (free area handling) */
struct page *next;
struct page *prev;
struct inode *inode;
unsigned long offset;
struct page *next_hash;
atomic_t count;
unsigned flags;
/* atomic flags, some possibly updated asynchronously */
unsigned dirty:16,
age:8;

51
Page Allocation and Deallocation (continued)
struct wait_queue *wait;
struct page *prev_hash;
struct buffer_head * buffers;
unsigned long swap_unlock_entry;
} mem_map_t;
/* Page flag bit values */
#define PG_locked 0
#define PG_error 1
#define PG_referenced 2
#define PG_uptodate 3
#define PG_free_after 4
#define PG_decr_after 5
#define PG_swap_unlock_after 6
#define PG_DMA 7
#define PG_reserved 31

52
Free Area
• free_area vector used by page allocation code to find free pages.
– 1st element points to list of single free pages
– 2nd element points to blocks of two (consecutive) free pages
– and so on
• Each element also points to a bit map of blocks of the
corresponding size.
• Standard buddy algorithm used for allocation, deallocation.

53
Free Area (continued)

54
MM Definitions
• Here are the definitions as they appear in mm/page_alloc.c
/*
* Free area management
*
* The free_area_list arrays point to the queue heads of the
* free areas of different sizes
*/

#define NR_MEM_LISTS 6

/* The start of this MUST match the start of "struct page" */


struct free_area_struct {
struct page *next;
struct page *prev;
unsigned int * map;
};
#define memory_head(x) ((struct page *)(x))
static struct free_area_struct free_area[NR_MEM_LISTS];

55
Memory Mapping
• When an image is executed, the contents of the executable
image must be brought into the virtual address space of the
process.
The same is true for any shared libraries to which the image
has been linked.
• Of course, these parts are not brought into memory at once,
but rather are mapped into the address space of the process,
then demand paged.
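• From user space this mechanism is visible through mmap(2); a
minimal sketch (the file path is an arbitrary example):

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;
    /* Map the file into our address space; no I/O happens yet. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    /* Touching the mapping page-faults; the area's nopage routine
     * then reads each page from the file (via the page cache). */
    write(1, p, st.st_size);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}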

56
Memory Mapping (continued)
• Specifically, when a process attempts to access a virtual
address within the new memory region for the first time,
– the processor will attempt to decode the virtual address,
– since there are no page table entries for this new area, the
processor will raise a page fault exception
– kernel will create entries, allocate physical page frame, and
bring in one (or more) pages from disk (either filesystem or
swap)
– process is then resumed at instruction that caused page fault

57
Process Address Space
• The address space of a process is described by an mm_struct
data structure (sched.h), which points to a number of
vm_area_struct data structures (mm.h)
• Each vm_area_struct describes the start and end of part of the
virtual memory of the process, such as the code, data, shared
regions, etc. Key is that the area is treated as a unit
(permissions, etc).

58
Process Address Space (continued)
struct mm_struct {
int count;
pgd_t * pgd;
unsigned long context;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack, start_mmap;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
struct vm_area_struct * mmap;
struct vm_area_struct * mmap_avl;
struct semaphore mmap_sem;
};

59
Process Virtual Memory
• Since an area of memory may be associated with an image on
disk, the vm_area_struct points to an inode.
• Also, the vm_ops field points to the specific functions to be
used to map and unmap this area, etc.
• A very important such operation is the nopage operation,
which specifies what to do when a page fault occurs (for
example, bring in a page from an image on disk). Different page
fault routines may be applied to different areas.

60
Process Virtual Memory (continued)

61
Traversing the Memory Structure of a Process
• The vm_area_structs of a process are arranged as an AVL tree
for fast searching. (An AVL tree guarantees worst case O(log n)
time for insert, delete, and membership, instead of linear.)

62
Demand Paging
• See mm/memory.c
• When a page fault occurs, Linux searches the AVL tree to find
which area is involved. Possibilities:
– if no such area is found, what happens?
– legal area, but wrong operation.
• Assuming address is legal, Linux checks to see if the page is
– in swap file (page table entry is marked invalid but the
address is not empty)
– in an executable image (invalid and address is empty), in
which case the page is read through page cache.

63
Page Cache
• Role
– similar to file cache, used to speed up access to any memory
mapped files (images)
– always checked first before going to disk
– read-aheads are done when pages are brought in from files
• Exercise: Peruse mm/filemap.c to become familiar with
functions that manage the page cache.

64
Page Cache Implementation
• Structure
– page hash table is a vector of pointers to mem_map_t data
structures
– hash function (index) derived from VFS inode number and
the offset of the page in the file
– if page is present in cache, pointer to its mem_map_t
structure is returned to fault handler
– page copied into user space

65
Page Cache Implementation (continued)

66
Kernel Swap Daemon (kswapd)
• Description
– job is to keep enough free pages in system for Linux’s needs
– kernel thread - runs in kernel mode in the physical address
space
• Operation
– periodically awakens and ensures number of free pages is
not too low
– tries to free up 4 pages each time it runs
• Three methods
– reduce size of buffer cache and page cache
– swap out shared pages
– swap out (or discard) other pages

67
Kernel Swap Daemon (kswapd)
• Exercise: Find kswapd in mm/vmscan.c and peruse the routines
it calls to see how pages are freed.

68
Swapping Out Pages
• Rules
– never save a page to swap if it can be later retrieved from
some other place
– pages cannot be swapped or discarded if they are locked in
memory
• Linux swap algorithm
– based on page aging; age counter in mem_map_t structure
– initial age of 3, count bumped when referenced, up to max
of 20
– swap daemon ages pages by decrementing count
– only pages with age = 0 are considered

69
Swapping Out Pages (continued)
• When a page is swapped, its PTE is replaced by one marked as
invalid but holding a pointer (offset) to its location in the swap file
• When a shared page is swapped, page tables of all processes
using it must be modified.

70
Page and Buffer Cache Sizes
• Kernel swap daemon checks these to see if they are getting too
large; may discard some pages from memory.
• Uses clock algorithm (cyclical scan through mem map page
vector)
• Buffer cache page may actually contain several buffers,
depending on block size of file system (when all buffers in a
page are freed, page is freed)
• While pages are only in the buffer or page cache, none of them
is in the virtual memory of any process, so no page tables need
updating.

71
Swap Cache
• Simply a list of page table entries, one per physical page in the
system
– just page table entries for swapped out pages
– non zero entry represents page, held in swap file on disk,
that has not been modified
– when page is modified, its entry is removed from swap cache
• When swapping out a page
– first check swap cache
– if valid entry, a copy is already on disk and page can simply
be freed

72
Swap Cache (continued)
• When swapping in a page, its PTE points to the location in the
swap file. If the access that caused the fault was not a write, then
– entry for page is left in swap cache.
– page table entry is not marked as writable
– if page is later written, page fault occurs, page marked as
dirty, and entry removed from swap cache
• If access was a write then the entry is removed from swap
cache and page table entry marked as dirty and writable.

73
Process Control in Unix/Linux
• System calls controlling process context
– fork, clone - create a new process
– exit - terminate a process
– wait - synchronize with death of child
– exec variations- invoke a new program
– sbrk - change address space
• Signal Operations
– inform processes of asynchronous events
– may be sent (posted) by other processes
– may be sent by the kernel

74
Evolution of Fork
• Traditional fork system call in swapped systems
– complete copy of data and stack segments created for new
process
– text segment could be shared as read-only
• BSD optimization
– full copy, as in swapping system, is very wasteful
– in paging systems, just copy page tables and update page
frame data table reference counts
– for data pages, remain shared until written to, at which
time copy occurs

75
Evolution of Fork (continued)
• Vfork
– child executes in parent’s address space
– can be ”dangerous”
• In Linux, fork is simply an alias for vfork, both of which are
copy on write (note: NOT like original BSD vfork)
• Linux clone call does allow parent and child to share writeable
data.

76
Process Creation in Linux
• Fork/vfork System Call
– pid = fork();
– pid - process id of child returned to parent, 0 returned to
child
– child process differs from the parent process only in its PID
and PPID; file locks and pending signals are not inherited.

77
Process Creation in Linux (continued)
• Fork procedures of kernel (see kernel/fork.c, do fork())
– allocate new task struct
– assign unique id to child
– get free page for kernel stack
– make ”logical” copy of parent process
– copy-on-write: vm area structs (segments) have copy on
write flag set, generating page fault upon write access
– increment file and inode counters for files associated with
the process
– return appropriate values to parent and child

78
Clone System Call
• Synopsis:
pid_t clone(void *sp, unsigned long flags)
• clone is an alternate interface to fork, with more options; fork
is roughly equivalent to
clone(0, SIGCHLD).

79
Clone System Call (continued)
• Parameters
– if sp is non-zero, child process uses sp as its initial stack
pointer.
– CLONE_VM flag: if set, the child shares the same pages as
the parent, and both parent and child may write the same
data. If not set, the child gets copy-on-write copies of the
parent's pages, as in fork.
– CLONE_FILES flag: if set, the child's file descriptors are
shared with the parent's. If not set, the child gets copies of
the parent's file descriptors.
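• The two-argument form above is the early Linux interface; the
glibc wrapper used today takes a start function and a stack
instead. A sketch with that interface (stack size chosen
arbitrarily):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)

static int child_fn(void *arg)
{
    printf("child sees x = %d\n", *(int *)arg);
    return 0;
}

int main(void)
{
    int x = 7;
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return 1;
    /* CLONE_VM: share the address space; SIGCHLD: notify parent on
     * exit. The stack grows down on i386, so pass its top. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_VM | SIGCHLD, &x);
    if (pid < 0)
        return 1;
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}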

80
Process Termination
• Exit system call (see sys_exit in kernel/exit.c)
– exit(status)
– SIGCHLD signal posted against parent process
– status is returned to parent process
– most of the code is cleaning up of resources
– if process exited due to uncaught signal, status is signal
number

81
Awaiting Process Termination
• Wait system call
– pid_t pid = wait(int *status) or pid_t waitpid(pid_t pid, int
*status, int options)
– pid : process ID of zombie child; waitpid can also specify any
of a set of children (see man wait)
– status : address at which the child's exit status will be
stored
– if child already in ZOMBIE state, returns immediately
– else waits in kernel for child specified by pid

82
Awaiting Process Termination (continued)
• In the kernel (sys_wait4 in kernel/exit.c)
– parent adds itself to wait queue
add_wait_queue(&current->wait_chldexit,
&wait);
– loops, checking for a child to have died (in earlier versions of
Unix, would have slept in INTERRUPTIBLE state)
– invokes scheduler each time through loop

83
Invoking Other Programs
• execve(2) system call
– many variations of front end: execl, execlp, execle, exect,
execv, execvp
– synopsis:
int execve(const char *filename,
           char *const argv[],
           char *const envp[]);

84
Invoking Other Programs (continued)
• Operation
– executes the program pointed to by a filename; can be
either a binary executable or a shell script
– on success, does not return
– text, data, bss, and stack of the calling process are
overwritten by that of the program loaded.
– new program invoked inherits the calling process's PID and
any open file descriptors that are not set to close on exec
– signals pending on the calling process are cleared

85
Invoking Other Programs (continued)
• Kernel (see do_execve in fs/exec.c) locates and reads the
beginning of the image, tries different binary formats till one
works, sets up the memory map, and lets the executable get
demand paged.
• Linux can support various object file formats, but the most
commonly used is ELF.

86
Executable and Linkable Format
• An object file format designed at Unix System Laboratories, an
alternative to earlier formats (ECOFF, a.out)
• Description
– tables in image describe how program should be placed in
memory
– statically linked images are built by linker (ld) into single
image containing all code and data
– dynamically linked images list library routines in tables so
that each library can be found and linked at run time

87
Executable and Linkable Format (continued)
• Loading an image (ELF or otherwise)
– flush current executable image (e.g., shell) from its virtual
memory, clear any signals, close all files
– set up mm struct (start of text, data, pointers to
environment, etc)
– set up vm area struct structures and corresponding page
tables.

88
An ELF Example

89
Dynamically Linked (Shared) Libraries
• DLLs have been a part of Unix since the development of libc
• Executable image tables provide information on all library
routines referenced, indicating to dynamic linker (e.g., ld.so.1,
lib.so.1, ...) how to locate the library routine and link it into
the address space of the program.

90
Linux System Calls
• System call implementation is architecture specific
• i386 Implementation
– i386-compatible architectures support programmed
exceptions.
– execution of a system call is invoked by a programmed
exception, caused by the instruction "int 0x80"
– interrupt vector 0x80 is set up (at boot, along with other
interrupt vectors) to transfer control to the kernel

91
Linux System Calls (continued)
• Library expansion
– each call is vectored through a stub in libc
– the stub is generally produced by a _syscallX() macro, where
X is the number of parameters used by the actual routine
– each syscall macro expands to an assembly routine which
sets up the calling stack frame and enters the kernel's
system_call() through an interrupt, via the instruction int $0x80

92
Implementation
Example library macro
#define _syscall1(type,name,type1,arg1) \
type name(type1 arg1) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
: "=a" (__res) \
: "0" (__NR_##name), \
"b" ((long)(arg1))); \
if (__res >= 0) \
return (type) __res; \
errno = -__res; \
return -1; \
}
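• A program of that era could generate its own stub with these
macros; a hypothetical fragment (assumes a 2.0-style
<linux/unistd.h>):

#include <linux/unistd.h>   /* _syscallX macros, __NR_* numbers */
#include <errno.h>

/* Expands to `int sync(void)`: loads __NR_sync into EAX, executes
 * int $0x80, and converts a negative result into errno. */
_syscall0(int, sync)

int main(void)
{
    sync();                 /* traps into the kernel via 0x80 */
    return 0;
}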

93
Implementation (continued)
• See arch/*/kernel/entry.S, which defines entry points for
interrupts/exceptions set up at boot (segmentation error, divide
by zero, system call, ...)
– ENTRY(system_call): this code is responsible for saving all
registers, checking to make sure a valid system call was
invoked and then ultimately transferring control to the
actual system call code via the offsets in sys_call_table
– ret_from_sys_call() checks to see if the scheduler should be
run, and if so, calls it

94
System Call Table
• Defined at the end of entry.S
– simply a table that maps a routine to an index
– over 170 system calls presently defined
.data
ENTRY(sys_call_table)
.long SYMBOL_NAME(sys_setup) /* 0 */
.long SYMBOL_NAME(sys_exit)
.long SYMBOL_NAME(sys_fork)
.long SYMBOL_NAME(sys_read)
.long SYMBOL_NAME(sys_write)
.long SYMBOL_NAME(sys_open) /* 5 */
.long SYMBOL_NAME(sys_close)
.long SYMBOL_NAME(sys_waitpid)

95
Linux Interprocess Communication
• Signals
• Pipes
• System V IPC
– message queues
– semaphores
– shared memory
• sockets

96
Unix Signals
• Operations
– inform processes of asynchronous events
– may be sent by other processes using kill system call
– may be sent by the kernel

97
Unix Signals (continued)
• Classes of signals
– indicating termination of a process
– process induced exceptions
– unrecoverable conditions during system call
– unexpected error during system call
– user mode signals
– terminal-related signals
– tracing-related signals

98
List of Signals (Linux)
#define SIGHUP 1
#define SIGINT 2
#define SIGQUIT 3
#define SIGILL 4
#define SIGTRAP 5
#define SIGABRT 6
#define SIGIOT 6
#define SIGBUS 7
#define SIGFPE 8
#define SIGKILL 9
#define SIGUSR1 10
#define SIGSEGV 11
#define SIGUSR2 12
#define SIGPIPE 13
#define SIGALRM 14

...

#define SIGUNUSED 31

99
Kernel and Signals
• Sending a signal to a process
– set bit in signal field of task struct
– if process asleep and interruptible, kernel wakes it up
– processes do not know how many times a signal fired
• Kernel handling of signals
– kernel checks for received signals when process ready to
return to user mode
– signal has no instant effect on kernel mode process
– if process is running in user mode, it is interrupted; signal
takes effect when returning from the interrupt
– kernel only dumps core for signals that imply program error

100
Kernel and Signals (continued)
• Possible actions taken by process receiving signal
– exit in kernel mode (default)
– ignore the signal
– execute a particular user function

101
Signal System Call
• Syntax
void (*signal(int signum, void (*handler)(int)))(int);
– signum is the number of the signal
– handler is a user-defined function to be called upon receipt
of the signal, or special flag:
– if handler is SIG_IGN, the signal will be ignored if it is
posted against the process
– if handler is SIG_DFL, default action for the signal is
reinstated.
– kernel keeps track of how it is to handle signals in the
task_struct (see the example below)
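• For example (user level; sigaction(2) is the more robust modern
interface):

#include <signal.h>
#include <unistd.h>

static void on_int(int signum)
{
    (void)signum;
    write(1, "caught SIGINT\n", 14);
}

int main(void)
{
    signal(SIGINT, on_int);    /* install user-defined handler */
    signal(SIGQUIT, SIG_IGN);  /* ignore SIGQUIT entirely */
    for (;;)
        pause();               /* ^C now runs on_int instead of
                                  terminating the process */
}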

102
Sending a Signal from a Process
• kill(pid, signum) system call
– pid - identifies set of processes to receive signal
– signum - signal being sent
• pid can specify process or a process group
• In the kernel (kernel/exit.c)
– simply post the signal against the process or each member
of a process group

103
Executing a User-Defined Signal Catcher
• Kernel activities
– access saved register context to get PC and SP
– set signal handler field to default state
– create new stack frame on user stack, so it looks like user
program directly called the signal handler
– change PC and SP in saved register context to indicate new
function
– when kernel returns to user mode, the process will execute
the signal handling code, returning to place in code where
interrupted

104
Pipes
• Enable standard output from one process to be directed to
standard input of another process
– processes themselves are not aware of the redirection
– example: ls | more
– created by shell using pipe(2) call (do_pipe in fs/pipe.c)
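• Roughly what the shell does to build ls | more (sketch; error
checking omitted):

#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    pipe(fd);                    /* fd[0]: read end, fd[1]: write end */

    if (fork() == 0) {           /* first child runs ls */
        dup2(fd[1], 1);          /* stdout -> pipe write end */
        close(fd[0]); close(fd[1]);
        execlp("ls", "ls", (char *)NULL);
        _exit(127);
    }
    if (fork() == 0) {           /* second child runs more */
        dup2(fd[0], 0);          /* stdin <- pipe read end */
        close(fd[0]); close(fd[1]);
        execlp("more", "more", (char *)NULL);
        _exit(127);
    }
    close(fd[0]); close(fd[1]);  /* parent must close both ends */
    wait(NULL); wait(NULL);
    return 0;
}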

105
Pipes
• Implementation
– two file data structures point at temporary inode
– inode points at a physical page within memory
– file points to operations that are specific to pipes (pipe_read,
pipe_write, ...) instead of the regular file operations
– linux uses locks, wait queues, and signals to synchronize
access to the pipe (see fs/pipe.c)

106
Pipe Implementation

107
System V Interprocess Communication
• Message queues - allow processes to send formatted messages
to other processes
– msgget - create (or return) message queue
– msgctl - control
– msgsnd - send message
– msgrcv - receive message
• Shared memory - processes communicate by sharing parts of
their virtual address space
– shmget - create (or return) new shared memory region
– shmat - attach to a region
– shmdt - detach
– shmctl - control
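• A minimal round trip through the shared memory calls (sketch;
a forked child inherits the parent's attached segment):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>

int main(void)
{
    /* Create a private 4K region, read/write for the owner. */
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    char *p = shmat(id, NULL, 0);   /* map into our address space */

    if (fork() == 0) {              /* child shares the region */
        strcpy(p, "hello from child");
        _exit(0);
    }
    wait(NULL);
    printf("parent reads: %s\n", p);

    shmdt(p);                       /* detach */
    shmctl(id, IPC_RMID, NULL);     /* destroy the segment */
    return 0;
}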

108
System V Interprocess Communication (continued)
• Semaphores - generalization of Dijkstra’s P and V operations
– semget - allocate entry for an array of semaphores
– semop - manipulate a semaphore with an operation

109
Linux System Boot
• PC boot is carried out by the BIOS
– initializes interrupt vector
– tries to read first sector (boot sector) of the first floppy
– if that fails, tries to read boot sector (called the master boot
record, MBR) of first hard disk
– (in many systems, you can re-order this sequence by
reconfiguring the BIOS)
– the MBR contains code to determine which partition to
boot from (active partition), load boot sector of that
partition, jump to beginning of that code

110
Linux System Boot (continued)
• Boot sector (code in arch/*/boot/bootsect.S)
– loaded by the BIOS to address 0x7C00
– relocates self to 0x90000
– loads the next 2 kBytes of code from the boot device to
address 0x90200
– loads the rest of the kernel to address 0x10000.
– prints the message "Loading..."
– passes control to setup (boot/setup.S)

111
Linux Boot (continued)
• Setup
– identifies various hardware features of the host system
(memory size, video card type, hard disk info...)
– prompts user to choose the video mode for the console
– moves the whole system from address 0x10000 to address
0x1000, enters protected mode, and jumps to the rest of the
system (at 0x1000)
• Kernel decompression (head.S)
– invokes decompress_kernel(), which in turn is made up of
inflate.c, unzip.c and misc.c.
– decompressed kernel is put at address 0x100000 (1 MB) and
execution transfers to start_kernel()

112
Linux Boot (continued)
• start_kernel (init/main.c)
– no assembly language after this point
– sets the memory bounds and calls paging_init()
– initializes the traps, IRQ channels and scheduling.
– parses the boot command line
– initializes all the device drivers and disk buffering, as well as
many other data structures

113
Linux Boot (continued)
asmlinkage void start_kernel(void)
{
char * command_line;

/*
* Interrupts are still disabled. Do necessary setups, then
* enable them
*/
setup_arch(&command_line, &memory_start, &memory_end);
memory_start = paging_init(memory_start,memory_end);
trap_init();
init_IRQ();
sched_init();
time_init();
parse_options(command_line);

114
Linux Boot (continued)
calibrate_delay();
...
mem_init(memory_start,memory_end);
buffer_init();
sock_init();
ipc_init();
dquot_init();
arch_syms_export();
sti();
check_bugs();

printk(linux_banner);

sysctl_init();

115
Linux Boot (continued)

/*
* We count on the initial thread going ok
* Like idlers init is an unlocked kernel thread,
* which will make syscalls (and thus be locked).
*/
kernel_thread(init, NULL, 0);

116
Linux Boot (continued)

/*
 * task[0] is meant to be used as an "idle" task: it may not
 * sleep, but it might do some general things like count free
 * pages or it could be used to implement a reasonable LRU
 * algorithm for the paging routines: anything that can be
 * useful, but shouldn't take time from the real processes.
 * Right now task[0] just does an infinite idle loop.
 */

cpu_idle(NULL);
}

117
Init Process
• System’s first real process
– created by start_kernel
– process id of 1
– descends from task[0] (whose task_struct is the global
init_task)

118
Init Process (continued)
• Operation
– initial processing: open console, mount root file system
– execs the system initialization program, historically
/etc/init, but now usually found in /sbin/init
– reads /etc/inittab, which specifies the programs and rc shell
scripts that start a variety of daemons (see /etc/rc.d)
– as such, init becomes the ancestor of all other processes
(except the idle loop)
– from now on, the kernel runs only in one of two modes:
∗ executing a system call on behalf of a user process
∗ handling some asynchronous event, such as an interrupt

119
Unix Shells
• A getty process is usually set up by init to wait on each tty line.
• Getty gets the user login id and spawns login, gets and checks
password, then spawns whatever shell is indicated in password
file.
• A number of shells have evolved over the years: sh, csh, ksh,
tcsh, ...

120
Unix Shells (continued)
• Shell operation
– shell is one large loop
– each time through, shell reads command line
– interprets line according to standard set of rules
– built-in commands (cd, for, while, etc) executed internally
– else, assumes command is name of an executable file, forks,
then execs new program
• After exec'ing the command
– by default the shell executes a wait for the child to die, then
goes back to the top of the loop
– if the command was executed in background mode, goes
straight back to the top of the loop (a toy version of the
loop follows)
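• A toy version of that loop (no parsing, pipelines, or job
control; foreground execution only):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char line[256];
    for (;;) {                         /* the shell's big loop */
        fputs("$ ", stdout);
        if (!fgets(line, sizeof line, stdin))
            break;                     /* EOF: user logged out */
        line[strcspn(line, "\n")] = '\0';
        if (line[0] == '\0')
            continue;
        pid_t pid = fork();
        if (pid == 0) {                /* child: become the command */
            execlp(line, line, (char *)NULL);
            perror("exec");            /* only reached on failure */
            _exit(127);
        }
        waitpid(pid, NULL, 0);         /* foreground: wait for death */
    }
    return 0;
}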

121
Unix I/O Subsystem
• Components
– general device driver code
– drivers for specific hardware devices
– buffer cache
• Device access
– drivers are accessed through device files (/dev/...)
– device numbers (major and minor) specify device

122
Unix I/O Subsystem (continued)
• Three kinds of I/O
– block devices (disks and tapes) - usually accessed via file
system instead of directly
– character devices (terminals, printers,...)
– network devices

123
Unix I/O Subsystem (continued)
• Buffer cache
– keep disk blocks in memory for future use (read and write)
– dirty buffers written periodically to secondary storage using
an elevator algorithm

124
Linux Device Drivers
• Every physical device in the system has its own hardware
controller
– SuperIO chip for keyboard
– IDE controller for IDE disks
– SCSI controller for SCSI disks...
• Each hardware controller has its own control and status
registers, used by the device driver to interact with the device

125
Linux Device Drivers
• Device files
– one of the most elegant features of Unix
– make hardware devices look like files, to be opened, closed,
read, written
– e.g., /dev/tty, /dev/hda
– created with mknod command or by Linux upon
initialization
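• The mknod command is a thin wrapper over the mknod(2) system
call; a sketch with made-up device numbers (must run as root):

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() */

int main(void)
{
    /* Create a character-special file with (hypothetical)
     * major number 42, minor number 0. */
    return mknod("/dev/mydev", S_IFCHR | 0600, makedev(42, 0));
}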
• Major and minor device numbers
– all devices controlled by same driver have same major
device number
– minor device numbers distinguish different (physical or
logical) devices
– e.g., disk partitions, ttys

126
Features of Linux Drivers
• Kernel code - even though drivers are often added to system
for new devices, by third parties, they are kernel code and, if
buggy, can easily crash the system or worse.
• Kernel interfaces - must provide a standard interface to Linux
kernel or subsystem (file I/O interface, SCSI interface, etc)
• Kernel mechanisms - make use of standard kernel services, such
as wait queues
• Most drivers can be configured as modules, so they are demand
loadable as well as boot configurable. If driver is present but
hardware is not, no problem.
• Drivers may use DMA for data transfers between an adapter
card and main memory

127
DMA
• Data transfer between devices and memory
• A small number of DMA channels
• A DMA channel cannot be shared between devices
• Limited addressing capabilities (24 bits)
• A device registers with the kernel for a DMA channel

128
Polling vs. Interrupts
• Polling device drivers
– don’t use true polling
– instead, timer interrupt routine checks status of command
and indicates to driver when it is complete
– floppy driver (has been) implemented this way
• Interrupts
– more efficient and responsive than polling
– device driver needs to register its usage of the interrupt
with the kernel

129
Polling vs. Interrupts
• /proc/interrupts indicates which drivers use which interrupts
0: 727432 timer
1: 20534 keyboard
2: 0 cascade

...
• Drivers should do as little as possible in the interrupt handling
routine, deferring non-time-critical work to a "bottom half"
handler (called when the scheduler runs)

130
The Kernel and Character Devices
• Simplest of Linux’s devices
– accessed as files
– standard open, close, read, write calls used
• Initialization
– device driver registers itself by adding an entry to the
chrdevs vector
– major device number is the index into this vector
– each entry contains a pointer to the name of the driver and
a pointer to its set of file operations

131
Kernel and Character Devices (continued)
• File operations
– inode for character device file points only to open operation
– upon open, other ops retrieved from chrdevs entry and
placed in open file structure

132
The Kernel and Block Devices
• blkdevs table plays similar role as chrdevs
• There are classes of block devices (SCSI, IDE)
– class registers with the kernel and provides file operations
– driver provides interfaces to appropriate subsystem (e.g.,
SCSI), which the subsystem uses when providing interfaces
to kernel
• blk_dev vector
– each block device driver must also provide interface to
buffer cache (address of request routine and pointer to list
of requests, each containing pointer to one or more
buffer_head structures)

133
Kernel and Block Devices (continued)
• when buffer cache wishes to read/write block of data, it places
request on the appropriate list, and request function is called
• after a request is completed, each buffer_head is unlocked,
waking up any waiting processes
• Details of device drivers for specific hard disk types can be
found in Rusling, Ch. 8.

134
Network Devices
• Typically an adapter card, but could be software only, such as
loopback device
• All packets represented by sk_buff structures, which allow
headers to be easily added and removed (see
include/linux/skbuff.h)
• Device names
– assigned at boot time as network devices are discovered and
initialized (unlike char and block devices, network devices
have no /dev entries)
– names are standard; multiple devices of the same type are
numbered starting at 0
– e.g., eth0, eth1, ...

135
Device Data Structure Contents
• See include/linux/netdevice.h
• Name, as discussed earlier
• Bus information
– interrupt, base address in I/O memory, DMA channel being
used
• Interface flags - characteristics/abilities of device

136
Device Data Structure Contents (continued)
• Protocol information
– MTU for this interface
– family (AF_INET for all Linux network devices)
– type (Ethernet, X.25, SLIP, PPP, ...)
– addresses (hw address, IP address, ...)
• Packet queue - queue of sk_buff packets awaiting transmission
• Support functions
– setup and frame routines
– statistics routines
– etc.

137
Interface Flags
#define IFF_UP 0x1 /* interface is up */
#define IFF_BROADCAST 0x2 /* broadcast address valid */
#define IFF_DEBUG 0x4 /* turn on debugging */
#define IFF_LOOPBACK 0x8 /* is a loopback net */
#define IFF_POINTOPOINT 0x10 /* interface is has p-p link */
#define IFF_NOTRAILERS 0x20 /* avoid use of trailers */
#define IFF_RUNNING 0x40 /* resources allocated */
#define IFF_NOARP 0x80 /* no ARP protocol */
#define IFF_PROMISC 0x100 /* receive all packets */
#define IFF_ALLMULTI 0x200 /* receive all multicast packets */
#define IFF_MASTER 0x400 /* master of a load balancer */
#define IFF_SLAVE 0x800 /* slave of a load balancer */
#define IFF_MULTICAST 0x1000 /* Supports multicast */

138
Linux Virtual File System
• Linux actually supports many file systems (simultaneously)
– ext, ext2, xia, minix, msdos, vfat, proc, ...
– an extremely powerful and useful feature
• Virtual file system
– supplies applications with system calls for file management
– hides details of individual file systems from user programs

139
Unix File Systems - Review
• Two main objects, files and directories
– directories are just files with a special format
– files are made up of data blocks on disk

140
Unix File Systems - Review (continued)
• A file is represented by an inode.
– resides on disk, copies in memory
– defines ownership, permissions, status
– type: plain, directory, symbolic link, character device, block
device, socket
• Mapping path names to inodes
– responsibility of kernel
– follow path and read inodes
– if directory, read to get inodes

141
Unix File Systems - Review (continued)
• Files opened by applications
– file descriptor points into table of open files
– entry there points to file-structure table entry
– entry there points to in-core copy of inode

142
Unix File System Structure
• File system data structures maintained by the kernel
– each process has its own file descriptor table, which
identifies all open files for a process
– a system-wide file table keeps track of the byte offset in
the file where the user's next read or write will start, and
the access rights allowed to the opening process
– a system-wide inode table

143
Unix File System Structure (continued)
• Traditional Unix file system layout
– boot block: bootstrap code (typically the first sector)
– superblock: describes the state of a file system - how large it
is, how many files it can store, where to find free space, etc.
– inode list: includes the root inode
– data blocks: each block can only belong to one file

144
Traditional Unix FS Operation
• Contents of superblock
– size of the file system
– number of free blocks in file system
– list of free blocks in file system
– index of next free block in the free block list
– size of inode list
– number of free inodes in file system
– list of free inodes in file system
– index of next free inode in the free inode list

145
Traditional Unix FS Operation (continued)
• Linear list of inodes follows super block
– each inode has a type field: 0 (available), 1 (used)
– linear search for a free one would be expensive (many disk
accesses)

146
Traditional Unix FS Operation
• Superblock holds list of free inodes (free inode list) and kernel
keeps track of search point (in the real inode list) where it
should begin looking next time it needs to fill up free inode list.
• When the free inode list is empty, the kernel will bring more
free inodes to the superblock; it will start looking at the
remembered inode
• Freeing an inode
– increment the total number of available inodes
– place in superblock free inode list if there is a slot available
• Access to the superblock must be treated as a critical section.
• Managing regular disk blocks is similar to that for inodes,
except that an explicit list of free disk blocks is maintained.

147
Performance Problems
• For Unix file management system described, the effective data
transfer rate is less than the disk bandwidth
• Major factors that affect the file system performance
– allocation of blocks of a file: the free list of blocks are
scrambled as files were created and removed, eventually the
free list becomes random causing files to have their blocks
allocated randomly
– allocation of inodes: files in the same directory are typically
not allocated consecutive slots in the inode list (need many
non-consecutive blocks of inodes to be accessed)
– inodes are not usually near their respective files

148
EXT2 File System
• History
– Minix file system originally supported, but very restrictive
(14-character filenames and 64Mbyte file size limit)
– EXT introduced in 1992 specifically for Linux (VFS added
at same time), but lacked performance
– EXT2 added in 1993
• EXT2
– design heavily influenced by BSD Fast File System (FFS)
– logical partitions are divided into block groups
– superblock is replicated on each block group, for fault
tolerance

149
EXT2 Inode
• Fairly traditional Unix inode
– type (file,directory, symbolic link, block device, char device,
FIFO)
– ownership
– permissions (owner, group, others)
– size
– timestamps (inode creation and modification times)
– pointers to data blocks

150
EXT2 Superblock
• Read into memory when file system is mounted
• Each block group contains a duplicate in case of corruption
• Contents
– magic number (0xEF53) indicates superblock of ext2 fs
– revision level
– mount count and maximum mount count.
– block group number - which group holds this copy
– block size (e.g. 1024 bytes)
– blocks per group
– number of free blocks in fs
– number of free inodes in fs
– inode number of first inode in fs (directory entry for ’/’)

151
Group Descriptor
• Describes a block group, but all group descriptors are
replicated in each block group
• Contents
– blocks bitmap for allocation and deallocation
– inode bitmap for inode alloc and dealloc
– first block of inode table
– free block count
– free inode count
– used directory count

152
EXT2 Directories
• Defn: list of directory entries, each containing
– inode (index into inode table of the block group)
– entry length
– name length
– name [255]
– first two entries are always "." and ".."
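• The on-disk record, roughly as declared in
include/linux/ext2_fs.h of that era:

#include <linux/types.h>   /* __u32, __u16 */

/* Entries are variable length: rec_len gives the distance to the
 * next entry, so an entry can be removed simply by growing the
 * previous entry's rec_len over it. */
struct ext2_dir_entry {
    __u32 inode;       /* inode number (0 = unused entry) */
    __u16 rec_len;     /* total length of this entry */
    __u16 name_len;    /* length of the name that follows */
    char  name[255];   /* file name, not null-terminated */
};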

153
EXT2 Directories (continued)
• When a file's size is increased, EXT2 tries to allocate new
blocks physically close to its current data blocks, or at least
in the same Block Group
• Also, whenever it needs a new block for a file, it looks for a
run of 8 free blocks if it can find one (preallocation)

154
EXT2 Directories (continued)

155
Finding Files in EXT2
• A path name is a series of directory names, separated by
forward slashes, ending in the file's name.
• The file name itself can be up to 255 characters and consist
of any printable characters
• Linux parses pathname a directory at a time until it finds the
inode it wants
– first, get inode of root directory (given in superblock)
– read contents (all but last is a directory)
– get inode of next path element, and repeat
• Exercise: Find the code in Linux that parses pathnames to find
the inode of the file being referenced.

156
Changing File Size
• EXT2 tries to allocate new blocks for file in same Block Group
as its current blocks (and inode)
• When a write will go past last allocated block
– Linux suspends process
– locks EXT2 Superblock for this file system
– checks to make sure there are at least some free blocks left
– allocates block

157
Changing File Size (continued)
• Block allocation sequence
– may have been preallocated (prealloc_block and
prealloc_count in the inode); then grab one and update fields
– else, see if next block after last in file is available (ideal)
– else, look for block within 64 blocks of ideal (same block
group)
– look in other block groups (look for sets of 8 blocks, if
available)
• When free block is found, update block bitmap in superblock
and allocate buffer in buffer cache (zeroed and marked dirty)
• Superblock is marked as dirty and is unlocked

158
Virtual File System
• Provides processes with transparent access to many types of
local file systems and to remote file systems
• Maintains data structures that describe the whole (virtual) file
system and the real, mounted file systems
• Maintains (VFS) superblocks and (VFS) inodes, just like real
file systems, BUT these exist only in memory, not on disk, and
are constructed based on information in their "real"
counterparts

159
VFS Operation
• As each file system is initialized at boot time, it registers itself
with the VFS
• A file system can be either built into kernel or built as a
loadable module.
• When a file system is mounted, VFS reads its superblock and
maps this information onto a VFS superblock structure.
• VFS keeps list of mounted file systems with their superblocks.
• Each VFS superblock contains pointers to routines that
perform file-system-specific functions, e.g., reading an inode.
So the read_inode routine (fs/inode.c) is generic and calls the
operation in the superblock, which fills in a (VFS!) inode.

160
VFS Caches
• Three major caches are used by the VFS;
• Inode cache
– a hash table containing VFS inodes
– reading an inode will put it in cache, and references keep it
there
• Buffer cache
– cache of blocks from underlying file systems
– contains not only files but also (raw) inodes from various file
systems
– shared by all file systems
– buffers are identified by block number and unique device
identifier

161
VFS Caches (continued)
• Directory cache
– hash table that stores mapping between full directory names
and their inode numbers

162
VFS Superblock Contents
• Device identifier (e.g., /dev/hda1 is 0x301)
• Inode pointers
– mounted - points at first inode in this file system
– covered - points at directory that got covered by mount
(root fs does not have one)
• Blocksize for this fs
• Pointer to set of superblock operations
• File system type
• File system specific information

163
VFS Inode Contents
• Identifier of device holding file (or whatever inode represents)
• (Raw) inode number (unique within given file system)
• Mode - file type and permissions
• User and group ids
• Pointer to set of file system specific inode routines
• Count of number of current users of the inode
• Lock used when being read from file system (disk/buffer cache)
• Dirty flag - raw inode will have to be written
• File system specific information

164
File System Mounting
• Three pieces of info passed to kernel
– file system type
– physical block device containing file system
– where in existing file system to mount file system
• Steps taken by kernel
– checks registered file system types, gets routine to read
superblock
– get (VFS) inode of mount point directory
– allocate VFS superblock and read the superblock
– fill in vfsmount structure

165
VFS Mount List
• List of all mounted file systems
• Each entry points to superblock, root inode, file system type

166
Buffer Cache
• As mounted file systems are used, they produce requests to the
underlying device drivers to read blocks from various devices.
• These requests are in the form of buffer head data structures,
which contain all information necessary to read appropriate
block
• All block devices viewed as linear collection of blocks of same
size

167
Buffer Cache (continued)
• Buffer cache shared among all devices and real file systems
– lists of free buffers of different sizes (512, 1024, 2048, ...)
– hash table that contains used buffers, index generated from
device identifier and block number
• bdflush (kflushd) kernel daemon awakened when no. of dirty
buffers grows too large (60% of total!) or if there are not
enough free buffers and writes some (e.g., up to 500) to disk

168
The /proc File System
• Example of the power of the Linux VFS
• Neither /proc nor its subdirectories actually exist on disk
• /proc registers itself with VFS
• When requested files are opened, /proc routines create files
based on info in kernel
• /proc is a user-readable window into the kernel's inner workings
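• Reading these files is ordinary file I/O; for example (any file
under /proc works the same way):

#include <stdio.h>

int main(void)
{
    char line[256];
    /* The contents are generated on the fly by a /proc routine;
     * nothing is read from disk. */
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}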

169
