You are on page 1of 72

Chapter 2.

Memory Addressing

Today's microprocessors include several circuits to make memory managment both more efficient and more robust In this chapter we study details on how 80x86 (I !"#$ microprocessors address memory chips and how %inux uses the available addressing circuits& t the same time&&& 'ou will better understand the theory of paging 'ou will learn how to research the implementation on other platforms

This is the first of three chapters related to memory management (hapter 8 discusses how the kernel allocates main memory to itself (hapter ) considers how linear addresses are assigned to processes

2.1. Memory Addresses

*rogrammers refer to a memory address as the way to access a memory cell& +ut when dealing with 80 x 86 microprocessors, we have to distinguish three kinds of addresses

Logical address: Included in the machine language instructions to specify the address of an operand or of an instruction&

embodies the well!known 80 x 86 segmented architecture& consists of a segment and an offset single "#!bit unsigned integer

Linear address (also known as virtual address):

can be used to address up to . /+ usually represented in hexadecimal notation (from 0x00000000 to 0xffffffff$

hysical address: 0sed to address memory cells in memory chips&

They correspond to the electrical signals sent along the address pins of the microprocessor to the memory bus& *hysical addresses are represented as "#!bit or "6!bit unsigned integers&

Memory Managment !nit (MM!)

Translation *rotection

2.2. "egmentation in #ardware

1tarting with the 80#86 model, Intel microprocessors perform address translation in two different ways called real mode and protected mode.

2e'll focus in the next sections on address translation when protected mode is ena$led&

3eal mode exists mostly to maintain processor compati$ility with older models and to allow the operating system to $ootstrap&

2.2.1. "egment "electors and "egmentation %egisters

A logical address consists o& two parts- a segment identifier and an offset that specifies the relative address within the segment&

The segment identifier is a 46!bit field called the "egment "elector, while the offset is a "#!bit field&

To make it easy to retrieve segment selectors 5uickly, the processor provides segmentation registers whose only purpose is to hold 1egment 1electors6 these registers are called cs, ss, ds, es, fs, and gs&

"egmentation registers

Three of the six segmentation registers have specific purposescs: The code segment register, which points to a segment containing program instructions ss: The stack segment register, which points to a segment containing the current program stack ds: The data segment register, which points to a segment containing global and static data

The cs register has another important function- it includes a #!bit field that specifies the Current rivilege Level (C L) o& the C !& The value 0 denotes the highest privilege level, while the value " denotes the lowest one& %inux uses only levels 0 and ", which are respectively called 7ernel 8ode and 0ser 8ode&

"elector and descriptor

2.2.2. "egment 'escriptors

9ach segment is represented by an 8!byte 1egment :escriptor that describes the segment characteristics& 1egment :escriptors are stored either in the (lo$al 'escriptor )a$le ((') ) or in the Local 'escriptor )a$le(L'))& 0sually only one (') is de&ined, while each process is permitted to have its own %:T if it needs to create additional segments besides those stored in the /:T& The address and si;e of the /:T in main memory are contained in the gdtr control register, while the address and si;e of the currently used %:T are contained in the ldtr control register& <igure #!" illustrates the format of a 1egment :escriptor6 the meaning of the various fields is explained in Table #!4&

"egment descriptor &ormat

)a$le 2*1. "egment 'escriptor &ield

+ase: (ontains the linear address of the first byte of the segment& (: /ranularity flag- if it is cleared (e5ual to 0$, the segment si;e is expressed in bytes6 otherwise, it is expressed in multiples of .0)6 bytes& Limit: =olds the offset of the last memory cell in the segment, thus binding the segment length& 2hen / is set to 0, the si;e of a segment may vary between 4 byte and 4 8+6 otherwise, it may vary between . 7+ and . /+& ": 1ystem flag- if it is cleared, the segment is a system segment that stores critical data structures such as the %ocal :escriptor Table6 otherwise, it is a normal code or data segment& )ype: (haracteri;es the segment type and its access rights& ' L: :escriptor *rivilege %evel- used to restrict accesses to the segment& It represents the minimal (*0 privilege level re5uested for accessing the segment& Therefore, a segment with its :*% set to 0 is accessible only when the (*% is 0 that is, in 7ernel 8ode while a segment with its :*% set to " is accessible with every (*% value& : 1egment!*resent flag- is e5ual to 0 if the segment is not stored currently in main memory& %inux always sets this flag (bit .>$ to 4, because it never swaps out whole segments to disk& ' or +: (alled : or + depending on whether the segment contains code or data& Its meaning is slightly different in the two cases, but it is basically set (e5ual to 4$ if the addresses used as segment offsets are "# bits long, and it is cleared if they are 46 bits long (see the Intel manual for further details$& A,L: 8ay be used by the operating system, but it is ignored by %inux&

)ypes o& segments widely used in Linu

Code "egment 'escriptor: Indicates that the 1egment :escriptor refers to a code segment6 it may be included either in the /:T or in the %:T& The descriptor has the 1 flag set (non!system segment$& 'ata "egment 'escriptor: Indicates that the 1egment :escriptor refers to a data segment6 it may be included either in the /:T or in the %:T& The descriptor has the 1 flag set& 1tack segments are implemented by means of generic data segments& )ask "tate "egment 'escriptor ()""'): Indicates that the 1egment :escriptor refers to a Task 1tate 1egment (T11$ that is, a segment used to save the contents of the processor registers (see the section ?Task 1tate 1egment? in (hapter "$6 it can appear only in the /:T& The corresponding Type field has the value 44 or ), depending on whether the corresponding process is currently executing on a (*0& The 1 flag of such descriptors is set to 0& Local 'escriptor )a$le 'escriptor (L')'): Indicates that the 1egment :escriptor refers to a segment containing an %:T6 it can appear only in the /:T& The corresponding Type field has the value #& The 1 flag of such descriptors is set to 0& The next section shows how 80 x 86 processors are able to decide whether a segment descriptor is stored in the /:T or in the %:T of the process&

Any "egment "elector includes three &ields

.nde-: Identifies the 1egment :escriptor entry contained in the /:T or in the %:T ).: Table Indicator - specifies whether the 1egment :escriptor is included in the /:T (TI @ 0$ or in the %:T (TI @ 4$& % L: 3e5uestor *rivilege %evel - specifies the (urrent *rivilege %evel of the (*0 when the corresponding 1egment 1elector is loaded into the cs register6 it also may be used to selectively weaken the processor privilege level when accessing data segments (see Intel documentation for details$&

2.2./. "egmentation !nit

<igure #!A shows in detail how a logical address is translated into a corresponding linear address& The segmentation unit performs the following operations

9xamines the TI field of the 1egment 1elector to determine which :escriptor Table stores the 1egment :escriptor& This field indicates that the :escriptor is either in the /:T (in which case the segmentation unit gets the base linear address of the /:T from the gdtr register$ or in the active %:T (in which case the segmentation unit gets the base linear address of that %:T from the ldtr register$& (omputes the address of the 1egment :escriptor from the index field of the 1egment 1elector& The index field is multiplied by 8 (the si;e of a 1egment :escriptor$, and the result is added to the content of the gdtr or ldtr register& dds the offset of the logical address to the +ase field of the 1egment :escriptor, thus obtaining the linear address&

Botice that, thanks to the nonprogrammable registers associated with the segmentation registers, the first two operations need to be performed only when a segmentation register has been changed&

)ranslating a logical address

2.0. "egmentation in Linu

%inux uses segmentation in a very limited way& In fact, segmentation and paging are somewhat redundant, because both can be used to separate the physical address spaces of processes

segmentation can assign a different linear address space to each process, while paging can map the same linear address space into different physical address spaces&

Linu- pre&ers paging to segmentation for the following reasons

8emory management is simpler when all processes use the same segment register values that is, when they share the same set of linear addresses& Cne of the design obDectives of %inux is portability to a wide range of architectures6 3I1( architectures in particular have limited support for segmentation&

Memory managment models

!se o& segment registers in segmented memory model

!se o& segment registers in &lat memory model

Code and data segments used $y Linu

The #&6 version of %inux uses segmentation only when re5uired by the 80 x 86 architecture& ll %inux processes running in 0ser 8ode use the same pair of segments to address instructions and data& These segments are called user code segment and user data segment , respectively& 1imilarly, all %inux processes running in 7ernel 8ode use the same pair of segments to address instructions and data- they are called kernel code segment and kernel data segment , respectively& Table #!" shows the values of the 1egment :escriptor fields for these four crucial segments& +ase ( 4 4 4 4 Limit 0xfffff 0xfffff 0xfffff 0xfffff " 4 4 4 4 )ype 40 # 40 # ' L " " 0 0 '1+ 4 4 4 4 4 4 4 4

"egment

user code 0x00000000 user data 0x00000000

kernel code 0x00000000 kernel data 0x00000000

+esides the four segments Dust described, %inux makes use of a few other speciali;ed segments& 2e'll introduce them in the next section while describing the %inux /:T&

Code and data segments used $y Linu

The corresponding 1egment 1electors are defined by the macros E E0193E(1, E E0193E:1, E E793B9%E(1, and E E793B9%E:1, respectively& To address the kernel code segment, for instance, the kernel Dust loads the value yielded by the E E793B9%E(1 macro into the cs segmentation register&

Botice that the linear addresses associated with such segments all start at 0 and reach the addressing limit of #F"# !4& This means that all processes2 either in !ser Mode or in 3ernel Mode2 may use the same logical addresses.

nother important conse5uence of having all segments start at 0x00000000 is that in %inux, logical addresses coincide with linear addresses6 that is, the value o& the 4&&set &ield o& a logical address always coincides with the value o& the corresponding linear address.

2.0.1 )he Linu- (')

In uniprocessor systems there is only one /:T, while in multiprocessor systems there is one (') &or every C ! in the system& ll /:Ts are stored in the cpu5gdt5ta$le array, while the addresses and si;es of the /:Ts (used when initiali;ing the gdtr registers$ are stored in the cpu5gdt5descr array& The layout of the /:Ts is shown schematically in <igure #!6& 9ach /:T includes 48 segment descriptors and 4. null, unused, or reserved entries& !nused entries are inserted on purpose so that 1egment :escriptors usually accessed together are kept in the same "#!byte line of the hardware cache ll copies of the /:T store identical entries, except for a few cases&

<irst, each processor has its own )"" segment, thus the corresponding /:T's entries differ& 8oreover, a few entries in the /:T may depend on the process that the (*0 is executing (L') and )L" "egment 'escriptors$& <inally, in some cases a processor may temporarily modify an entry in its copy of the /:T6 this happens, for instance, when invoking an *8's +IC1 procedure&

Linu- (')

2./ aging in #ardware

The paging unit translates linear addresses into physical ones& Cne key task in the unit is to check the re6uested access type against the access rights o& the linear address& If the memory access is not valid, it generates a age 7ault e-ception (see (hapter . and (hapter 8$& <or the sake of efficiency, linear addresses are grouped in &i-ed*length intervals called pages 6 contiguous linear addresses within a page are mapped into contiguous physical addresses& In this way, the kernel can speci&y the physical address and the access rights o& a page instead of those of all the linear addresses included in it& <ollowing the usual convention, we shall use the term ?page? to refer both to a set of linear addresses and to the data contained in this group of addresses& The paging unit thinks of all %AM as partitioned into &i-ed*length page &rames (sometimes referred to as physical pages $&

9ach page frame contains a page that is, the length of a page frame coincides with that of a page& page frame is a constituent of main memory, and hence it is a storage area& .t is important to distinguish a page &rom a page &rame6 the former is Dust a block of data, which may be stored in any page frame or on disk&

aging in #ardware

The data structures that map linear to physical addresses are called page ta$les 6 they are stored in main memory and must be properly initiali;ed by the kernel before enabling the paging unit&

1tarting with the 80"86, all 80 x 86 processors support paging6 it is enabled by setting the ( &lag o& a control register named cr8 &

9hen ( : 82 linear addresses are interpreted as physical addresses.

2./.1. %egular aging

1tarting with the ;80;<, the paging unit of Intel processors handles / 3+ pages& The "# bits of a linear address are divided into three fields

:irectory- The most significant 40 bits Table- The intermediate 40 bits Cffset- The least significant 4# bits

The translation of linear addresses is accomplished in two steps, each based on a type of translation table& The first translation table is called the *age :irectory, and the second is called the *age Table& (In the discussion that follows, the lowercase ?page table? term denotes any page storing the mapping between linear and physical addresses, while the capitali;ed ?*age Table? term denotes a page in the last level of page tables&$

multilevel page ta$les

If a simple one!level *age Table was used, then it would re5uire up to #F#0 entries (i&e&, at . bytes per entry, . 8+ of 3 8$ to represent the *age Table for each process (if the process used a full . /+ linear address space$, even though a process does not use all addresses in that range& )he two*level scheme reduces the memory $y re6uiring age )a$les only &or those virtual memory regions actually used $y a process. 9ach active process must have a *age :irectory assigned to it& =owever, there is no need to allocate 3 8 for all *age Tables of a process at once6 it is more efficient to allocate %AM &or a age )a$le only when the process e&&ectively needs it. The physical address of the *age :irectory in use is stored in a control register named cr0 & The :irectory field within the linear address determines the entry in the *age :irectory that points to the proper *age Table& The address's Table field, in turn, determines the entry in the *age Table that contains the physical address of the page frame containing the page& The Cffset field determines the relative position within the page frame (see <igure #!>$& +ecause it is 4# bits long, each page consists of .0)6 bytes of data&

2 level paging

=ntry &ields

+oth the :irectory and the Table fields are 40 bits long, so *age :irectories and *age Tables can include up to 4,0#. entries& It follows that a *age :irectory can address up to 40#. x 40#. x .0)6 @ #GG"# memory cells, as you'd expect in "#!bit addresses& )he entries o& age 'irectories and age )a$les have the same structure. 9ach entry includes the following fields

resent &lag: If it is set, the referred!to page (or *age Table$ is contained in main memory6 if the flag is 0, the page is not contained in main memory and the remaining entry bits may be used by the operating system for its own purposes& If the entry of a *age Table or *age :irectory needed to perform an address translation has the *resent flag cleared, the paging unit stores the linear address in a control register named cr# and generates exception 4.- the *age <ault exception& 7ield containing the 28 most signi&icant $its o& a page &rame physical address: +ecause each page frame has a .!7+ capacity, its physical address must be a multiple of .0)6, so the 4# least significant bits of the physical address are always e5ual to 0& If the field refers to a *age :irectory, the page frame contains a *age Table6 if it refers to a *age Table, the page frame contains a page of data&

aging in #ardware

Accessed &lag: 1et each time the paging unit addresses the corresponding page frame& This flag may be used by the operating system when selecting pages to be swapped out& The paging unit never resets this flag6 this must be done by the operating system& 'irty &lag: pplies only to the *age Table entries& It is set each time a write operation is performed on the page frame& s with the ccessed flag, :irty may be used by the operating system when selecting pages to be swapped out& The paging unit never resets this flag6 this must be done by the operating system& %ead19rite &lag: (ontains the access right (3eadH2rite or 3ead$ of the page or of the *age Table (see the section ?=ardware *rotection 1cheme? later in this chapter$& !ser1"upervisor &lag: (ontains the privilege level re5uired to access the page or *age Table (see the later section ?=ardware *rotection 1cheme?$& C' and 9) &lags: (ontrols the way the page or *age Table is handled by the hardware cache (see the section ?=ardware (ache? later in this chapter$& age "i>e &lag: pplies only to *age :irectory entries& If it is set, the entry refers to a # 8+! or . 8+!long page frame (see the following sections$& (lo$al &lag: pplies only to *age Table entries& This flag was introduced in the *entium *ro to prevent fre5uently used pages from being flushed from the T%+ cache (see the section ?Translation %ookaside +uffers (T%+$? later in this chapter$& It works only if the *age /lobal 9nable (*/9$ flag of register cr. is set&

2./.2. =-tended aging

1tarting with the *entium model, 80 x 86 microprocessors introduce e-tended paging, which allows page &rames to $e / M+ instead of . 7+ in si;e& 9xtended paging is used to translate large contiguous linear address ranges into corresponding physical ones6 in these cases, the kernel can do without intermediate *age Tables and thus save memory and preserve )L+ entries& s mentioned in the previous section, extended paging is enabled by setting the age "i>e &lag o& a age 'irectory entry& In this case, the paging unit divides the "# bits of a linear address into two fields

'irectory- The most significant 40 bits 4&&set- The remaining ## bits

age 'irectory entries &or e-tended paging are the same as for normal paging, except that

The *age 1i;e flag must be set& Cnly the 40 most significant bits of the #0!bit physical address field are significant& This is because each physical address is aligned on a .!8+ boundary, so the ## least significant bits of the address are 0&

=-tended paging coe-ists with regular paging? it is ena$led $y setting the "= &lag o& the cr/ processor register.

=-tended paging

2./.0. #ardware rotection "cheme

The paging unit uses a different protection scheme from the segmentation unit& 2hile 80 x 86 processors allow four possible privilege levels to a segment, only two privilege levels are associated with pages and age )a$les, because privileges are controlled by the !ser1"upervisor &lag

2hen this flag is 0, the page can be addressed only when the (*% is less than " (this means, for %inux, when the processor is in 7ernel 8ode$& 2hen the flag is 4, the page can always be addressed&

<urthermore, instead of the three types of access rights (3ead, 2rite, and 9xecute$ associated with segments, only two types o& access rights (%ead and 9rite) are associated with pages:

If the 3eadH2rite flag of a *age :irectory or *age Table entry is e5ual to 0, the corresponding *age Table or page can only be read6 otherwise it can be read and written& 3ecent Intel *entium . processors support an @A (@o eAecute) &lag in each </* $it age )a$le entry&

2././. An =-ample o& %egular aging

simple example will help in clarifying how regular paging works& %et's assume that the kernel assigns the linear address space between 0x#0000000 and 0x#00"ffff to a running process& This space consists of exactly 6. pages& 2e don't care about the physical addresses of the page frames containing the pages6 in fact, some of them might not even be in main memory& 2e are interested only in the remaining fields of the *age Table entries& %et's start with the 40 most significant bits of the linear addresses assigned to the process, which are interpreted as the :irectory field by the paging unit&

The addresses start with a # followed by ;eros, so the 40 bits all have the same value, namely 0x080 or 4#8 decimal& (binary- 0088 1888 8888) Thus the :irectory field in all the addresses refers to the 4#)th entry of the process *age :irectory& The corresponding entry must contain the physical address of the *age Table assigned to the process (see <igure #!)$& If no other linear addresses are assigned to the process, all the remaining 4,0#" entries of the *age :irectory are filled with ;eros&

The values assumed by the intermediate 40 bits, (that is, the values of the Table field$ range from 0 to 0x0"f, or from 0 to 6" decimal& Thus, only the first 6. entries of the *age Table are valid& The remaining )60 entries are filled with ;eros&

An e-ample o& paging

An e-ample o& paging

1uppose that the process needs to read the byte at linear address 8-28821/8<& This address is handled by the paging unit as follows

The :irectory field 0x80 is used to select entry 0x80 of the *age :irectory, which points to the *age Table associated with the process's pages& The Table field 0x#4 is used to select entry 0x#4 of the *age Table, which points to the page frame containing the desired page& <inally, the Cffset field 0x.06 is used to select the byte at offset 0x.06 in the desired page frame&

If the *resent flag of the 0x#4 entry of the *age Table is cleared, the page is not present in main memory6 in this case, the paging unit issues a *age <ault exception while translating the linear address& The same exception is issued whenever the process attempts to access linear addresses outside o& the interval delimited by 0x#0000000 and 0x#00"ffff, because the *age Table entries not assigned to the process are filled with ;eros6 in particular, their *resent flags are all cleared&

2./.B. )he hysical Address =-tension ( A=) aging Mechanism

The amount of 3 8 supported by a processor is limited by the number of address pins connected to the address bus& Clder Intel processors from the 80"86 to the *entium used 02*$it physical addresses& In theory, up to / (+ o& %AM could be installed on such systems6

in practice, due to the linear address space re6uirements of 0ser 8ode processes, the kernel cannot directly address more than 1 (+ o& %AM, as we will see in the later section ?*aging in %inux&?

=owever, big servers that need to run hundreds or thousands of processes at the same time re5uire more than . /+ of 3 8& Intel has satisfied these re5uests by increasing the num$er o& address pins on its processors &rom 02 to 0<& "tarting with the entium ro2 all .ntel processors are now a$le to address up to 2CC0< : </ (+ o& %AM& 2ith the *entium *ro processor, Intel introduced a mechanism called *hysical ddress 9xtension (* 9$&

A=

* 9 is activated by setting the *hysical ddress 9xtension (* 9$ flag in the cr/ control register& Intel has changed the paging mechanism in order to support * 9&

The </ (+ o& %AM are split into 2CC2/ distinct page &rames, and the physical address field of *age Table entries has been expanded from #0 to #. bits& +ecause a * 9 *age Table entry must include the 4# flag bits (described in the earlier section ?3egular *aging?$ and the #. physical address bits, for a grand total of "6, the age )a$le entry si>e has $een dou$led &rom 02 $its to </ $its& s a result, a /*3+ A= age )a$le includes B12 entries instead o& 1282/. new level of *age Table called the age 'irectory ointer )a$le ( ' )) consisting of &our </*$it entries has been introduced& The cr0 control register contains a #>!bit age 'irectory ointer )a$le $ase address field& +ecause *:*Ts are stored in the first . /+ of 3 8 and aligned to a multiple of "# bytes (#GGA$, #> bits are sufficient to represent the base address of such tables&

The *age 1i;e (*1$ flag in the page directory entry enables large page si;es (2 M+ when A= is ena$led$&

A=

2hen mapping linear addresses to . 7+ pages ( " &lag cleared in *age :irectory entry$, the "# bits of a linear address are interpreted in the following way

cr0: *oints to a *:*T $its 01*08: *oint to 4 of . possible entries in *:*T $its 2D*21: *oint to 4 of A4# possible entries in *age :irectory $its 28*12: *oint to 4 of A4# possible entries in *age Table $its 11*8: Cffset of .!7+ page

2hen mapping linear addresses to #!8+ pages ( " &lag set in *age :irectory entry$, the "# bits of a linear address are interpreted in the following way

cr0: *oints to a *:*T $its 01*08: *oint to 4 of . possible entries in *:*T $its 2D*21: *oint to 4 of A4# possible entries in *age :irectory $its 28*8: Cffset of #!8+ page

)o summari>e

Cnce cr" is set, it is possible to address up to . /+ of 3 8& If we want to address more 3 8, we'll have to put a new value in cr" or change the content of the *:*T& #owever2 the main pro$lem with A= is that linear addresses are still 02 $its long. )his &orces kernel programmers to reuse the same linear addresses to map di&&erent areas o& %AM. 2e'll sketch how %inux initiali;es *age Tables when * 9 is enabled in the later section, ?<inal kernel *age Table when 3 8 si;e is more than .0)6 8+&? Clearly2 A= does not enlarge the linear address space o& a process2 $ecause it deals only with physical addresses& <urthermore, only the kernel can modify the page tables of the processes, thus a process running in 0ser 8ode cannot use a physical address space larger than . /+& Cn the other hand, * 9 allows the kernel to exploit up to 6. /+ of 3 8, and thus to increase significantly the number of processes in the system&

2./.<. aging &or </*$it Architectures

The third level of paging present in 80 x 86 processors with * 9 enabled has been introduced only to lower &rom 182/ to B12 the num$er o& entries in the *age :irectory and *age Tables& This enlarges the *age Table entries &rom 02 $its to </ $its so that they can store the #. most significant bits of the physical address& Two!level paging, however, is not suitable for computers that adopt a 6.!bit architecture& %et's use a thought experiment to explain why

1tart by assuming a standard page si;e of . 7+& +ecause 4 7+ covers a range of #F40 addresses, . 7+ covers #F4# addresses, so the Cffset field is 4# bits& This leaves up to A# bits of the linear address to be distributed between the Table and the :irectory fields& If we now decide to use only .8 of the 6. bits for addressing (this restriction leaves us with a comfortable #A6 T+ address spaceI$, the remaining .8!4# @ "6 bits will have to be split among Table and the :irectory fields& If we now decide to reserve 48 bits for each of these two fields, both the *age :irectory and the *age Tables of each process should include #F48 entries that is, more than #A6,000 entries&

aging levels in some </*$it architectures

ll hardware paging systems for 6.!bit processors make use of additional paging levels& The num$er o& levels used depends on the type o& processor& Table #!. summari;es the main characteristics of the hardware paging systems used by some 6.!bit platforms supported by %inux&

*latform alpha ia6. ppc6. sh6. x86E6.

*age si;e 8 7b G . 7+ G . 7+ . 7+ . 7+

address bits used ." ") .4 .4 .8

paging levels " " " " .

splitting 40 J 40 J 40 J 4" ) J ) J ) J 4# 40 J 40 J ) J 4# 40 J 40 J ) J 4# ) J ) J ) J ) J 4#

GThis architecture supports different page si;es6 we select a typical page si;e adopted by %inux

s we will see in the section ?*aging in %inux? later in this chapter, %inux succeeds in providing a common paging model that fits most of the supported hardware paging systems&

2./.E. #ardware Cache

Today's microprocessors have clock rates of several gigahert;, while dynamic 3 8 (:3 8$ chips have access times in the range of hundreds of clock cycles& This means that the (*0 may be held back considerably while executing instructions that re5uire fetching operands from 3 8 andHor storing results into 3 8& =ardware cache memories were introduced to reduce the speed mismatch between (*0 and 3 8& They are based on the well!known locality principle , which holds both for programs and data structures& It therefore makes sense to introduce a smaller and &aster memory that contains the most recently used code and data& <or this purpose, a new unit called the line was introduced into the 80 x 86 architecture& It consists of a &ew do>en contiguous $ytes that are trans&erred in $urst mode between the slow :3 8 and the fast on!chip static 3 8 (13 8$ used to implement caches&

#it and miss2 write*through and write*$ack

2hen accessing a 3 8 memory cell, the (*0 extracts the subset index from the physical address and compares the tags of all lines in the subset with the high!order bits of the physical address& If a line with the same tag as the high!order bits of the address is found, the (*0 has a cache hit6 otherwise, it has a cache miss& 2hen a cache hit occurs, the cache controller behaves differently, depending on the access type&

<or a read operation, the controller selects the data from the cache line and transfers it into a (*0 register6 the 3 8 is not accessed and the (*0 saves time, which is why the cache system was invented& <or a write operation, the controller may implement one of two basic strategies called write*through and write*$ack & In a write!through, the controller always writes into both 3 8 and the cache line, effectively switching off the cache for write operations& In a write! back, which offers more immediate efficiency, only the cache line is updated and the contents of the 3 8 are left unchanged& fter a write!back, of course, the 3 8 must eventually be updated& The cache controller writes the cache line back into 3 8 only when the (*0 executes an instruction re5uiring a flush of cache entries or when a <%01= hardware signal occurs (usually after a cache miss$&

2hen a cache miss occurs, the cache line is written to memory, if necessary, and the correct line is fetched from 3 8 into the cache entry&

Cache snooping

Multiprocessor systems have a separate hardware cache for every processor, and therefore they need additional hardware circuitry to synchroni;e the cache contents& s shown in <igure #!44, each (*0 has its own local hardware cache& +ut now updating becomes more time consuming- whenever a (*0 modifies its hardware cache, it must check whether the same data is contained in the other hardware cache6 if so, it must notify the other (*0 to update it with the proper value& This activity is often called cache snooping & %uckily, all this is done at the hardware level and is of no concern to the kernel&

Caches in "M

Caches and page ta$le entries

The C' &lag o& the cr8 processor register is used to enable or disable the cache circuitry& The @9 &lag, in the same register, specifies whether the write!through or the write!back strategy is used for the caches& nother interesting feature of the *entium cache is that it lets an operating system associate a different cache management policy with each page frame& <or this purpose, each *age :irectory and each *age Table entry includes two flagsC' ( age Cache 'isa$le), which specifies whether the cache must be enabled or disabled while accessing data included in the page frame6 and 9) ( age 9rite*)hrough), which specifies whether the write!back or the write! through strategy must be applied while writing data into the page frame&

%inux clears the *(: and *2T flags of all *age :irectory and *age Table entries6 as a result, caching is enabled for all page frames, and the write!back strategy is always adopted for writing&

2./.;. )ranslation Lookaside +u&&ers ()L+)

+esides general!purpose hardware caches, 80 x 86 processors include another cache called Translation %ookaside +uffers (T%+$ to speed up linear address translation& 2hen a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the *age Tables in 3 8& The physical address is then stored in a T%+ entry so that further references to the same linear address can be 5uickly translated& In a multiprocessor system, each (*0 has its own T%+, called the local )L+ of the (*0& (ontrary to the hardware cache, the corresponding entries of the )L+ need not $e synchroni>ed, because processes running on the existing (*0s may associate the same linear address with different physical ones& 2hen the cr0 control register o& a C ! is modi&ied, the hardware automatically invalidates all entries o& the local )L+, because a new set of page tables is in use and the T%+s are pointing to old data&

aging in Linu

%inux adopts a common paging model that fits both 02*$it and </*$it architectures& 0p to version #&6&40, the %inux paging model consisted of three paging levels& "tarting with version 2.<.112 a &our*level paging model has $een adopted. This change has been made to fully support the linear address bit splitting used by the x86E6. platform& The &our types o& page ta$les illustrated in <igure #!4# are called

*age /lobal :irectory *age 0pper :irectory *age 8iddle :irectory *age Table

Thus the linear address can $e split into up to &ive parts& <igure #!4# does not show the bit numbers, because the si>e o& each part depends on the computer architecture&

)he Linu- paging model

paging levels &or 02 and </ $its

<or 02*$it architectures with no are sufficient&

hysical Address =-tension, two paging levels

%inux essentially eliminates the *age 0pper :irectory and the *age 8iddle :irectory fields by saying that they contain ;ero bits& =owever, the positions of the *age 0pper :irectory and the *age 8iddle :irectory in the se5uence of pointers are kept so that the same code can work on "#!bit and 6.!bit architectures&

<or 02*$it architectures with the hysical Address =-tension enabled, three paging levels are used&

The %inux's *age /lobal :irectory corresponds to the 80 x 86's *age :irectory *ointer Table, the *age 0pper :irectory is eliminated, the *age 8iddle :irectory corresponds to the 80 x 86's *age :irectory, and the %inux's *age Table corresponds to the 80 x 86's *age Table&

<inally, for </*$it architectures three or four levels of paging are used depending on the linear address bit splitting performed by the hardware (see Table #!#$&

paging

%inux's handling o& processes relies heavily on paging& In fact, the automatic translation of linear addresses into physical ones makes the following design obDectives feasible

ssign a different physical address space to each process, ensuring an e&&icient protection against addressing errors& 'istinguish pages (groups o& data) &rom page &rames (physical addresses in main memory)& This allows the same page to be stored in a page frame, then saved to disk and later reloaded in a different page frame& This is the basic ingredient of the virtual memory mechanism&

In the remaining part of this chapter, we will refer for the sake of concreteness to the paging circuitry used by the 80 x 86 processors& s we will see in (hapter ), each process has its own age (lo$al 'irectory and its own set o& age )a$les& 2hen a process switch occurs, %inux saves the cr0 control register in the descriptor of the process previously in execution and then loads cr" with the value stored in the descriptor of the process to be executed next& Thus, when the new process resumes its execution on the (*0, the paging unit refers to the correct set of *age Tables&

2.B.0. hysical Memory Layout

:uring the initiali;ation phase the kernel must build a physical addresses map that specifies which physical address ranges are usable by the kernel and which are unavaila$le (either because they map hardware devices' IHC shared memory or because the corresponding page frames contain +IC1 data$& The kernel considers the following page frames as reserved

Those falling in the unavailable physical address ranges Those containing the kernel's code and initiali;ed data structures

page contained in a reserved page frame can never be dynamically assigned or swapped to disk&

3ernel installed in 8-88188888

s a general rule, the %inux kernel is installed in 3 8 starting from the physical address 0x00400000 i&e&, from the second megabyte& The total number of page frames re5uired depends on how the kernel is configured& configuration yields a kernel that can be loaded in less than " 8+ of 3 8& typical

2hy isn't the kernel loaded starting with the first available megabyte of 3 8K 2ell, the *( architecture has several peculiarities that must be taken into account& <or example

*age frame 0 is used by +IC1 to store the system hardware configuration detected during the *ower!Cn 1elf!Test(*C1T$6 the +IC1 of many laptops, moreover, writes data on this page frame even after the system is initiali;ed& *hysical addresses ranging from 0x000a0000 to 0x000fffff are usually reserved to +IC1 routines and to map the internal memory of I1 graphics cards& This area is the well! known hole from 6.0 7+ to 4 8+ in all I+8!compatible *(s- the physical addresses exist but they are reserved, and the corresponding page frames cannot be used by the operating system& dditional page frames within the first megabyte may be reserved by specific computer models& <or example, the I+8 Think*ad maps the 0xa0 page frame into the 0x)f one&

hysical addresses map

In the early stage of the boot se5uence (see ppendix $, the kernel 5ueries the +.4" and learns the si>e o& the physical memory& In recent computers, the kernel also invokes a +.4" procedure to build a list o& physical address ranges and their corresponding memory types& (http-HHwiki&osdev&orgH=owE:oEIE:etermineETheE mountECfE3 8$ %ater, the kernel executes the machine5speci&ic5memory5setup( ) function, which builds the physical addresses map (see Table #!) for an example$& Cf course, the kernel builds this table on the basis of the +IC1 list, if this is available6 otherwise the kernel builds the table following the conservative default setup- all page frames with numbers from 0x)f (%C28981IL9( $$ to 0x400 (=I/=E898C3'$ are marked as reserved&

)a$le 2*D. =-ample o& +.4"*provided physical addresses map


"tart 0x00000000 0x000f0000 0x00400000 0x0>ff0000 0x0>ff"000 0xffff0000 =nd 0x000)ffff 0x000fffff 0x0>feffff 0x0>ff#fff 0x0>ffffff 0xffffffff )ype 0sable 3eserved 0sable (*I data (*I BM1 3eserved

typical configuration for a computer having 4#8 8+ of 3 8 is shown in Table #!)& The physical address range from 0x0>ff0000 to 0x0>ff#fff stores information about the hardware devices of the system written by the +IC1 in the *C1T phase6 during the initiali;ation phase, the kernel copies such information in a suitable kernel data structure, and then considers these page frames usable& (onversely, the physical address range of 0x0>ff"000 to 0x0>ffffff is mapped to 3C8 chips of the hardware devices& The physical address range starting from 0xffff0000 is marked as reserved, because it is mapped by the hardware to the +IC1's 3C8 chip (see ppendix $& Botice that the +.4" may not provide in&ormation &or some physical address ranges (in the table, the range is 0x000a0000 to 0x000effff$& To be on the safe side, %inux assumes that such ranges are not usable&

)a$le 2*18. ,aria$les descri$ing the kernelFs physical memory layout


The setup5memory( ) function is invoked right after machine5speci&ic5memory5setup( )- it analy;es the table of physical memory regions and initiali;es a few variables that describe the kernel's physical memory layout&
,aria$le name numEphyspages ramEpages minElowEpfn maxEpfn maxElowEpfn totalhighEpages highstartEpfn highendEpfn 'escription *age frame number of the highest usable page frame total Total number of usable page frames *age frame number of the first usable page frame after the kernel mage in 3 8 *age frame number of the last usable page frame *age frame number of the last page frame directly mapped by the kernel (low mem$ Total number of page frames not directly mapped by the kernel (high mem$ *age frame number of the first page frame not directly mapped by the kernel *age frame number of the last page frame not directly mapped by the kernel

"kip &irst mega$yte

2.B./. rocess age )a$les

The linear address space of a process is divided into two parts%inear addresses from 0x00000000 to 0xbfffffff can be addressed when the process runs in either 0ser or 7ernel 8ode& %inear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process runs in 7ernel 8ode&

2hen a process runs in 0ser 8ode, it issues linear addresses smaller than 0xc00000006 when it runs in 7ernel 8ode, it is executing kernel code and the linear addresses issued are greater than or e5ual to 0xc0000000& In some cases, however, the kernel must access the 0ser 8ode linear address space to retrieve or store data&

8AC888888

The * /9EC<<19T macro yields the value 0xc00000006 this is the offset in the linear address space of a process where the kernel lives&

The content of the first entries of the *age /lobal :irectory that map linear addresses lower than 0xc0000000 (the first >68 entries with * 9 disabled, or the first " entries with * 9 enabled$ depends on the specific process&

(onversely, the remaining entries should be the same for all processes and e5ual to the corresponding entries of the master kernel age (lo$al 'irectory (see the following section$&

2.B.B. 3ernel age )a$les

The kernel maintains a set of page tables for its own use, rooted at a so!called master kernel age (lo$al 'irectory& fter system initiali;ation, this set of page tables is never directly used by any process or kernel thread6 rather, the highest entries of the master kernel *age /lobal :irectory are the reference model for the corresponding entries of the *age /lobal :irectories of every regular process in the system& 2e now describe how the kernel initiali;es its own page tables& This is a two!phase activity& In fact, right after the kernel image is loaded into memory, the (*0 is still running in real mode6 thus, paging is not enabled& In the first phase, the kernel creates a limited address space including the kernel's code and data segments, the initial *age Tables, and 4#8 7+ for some dynamic data structures& This minimal address space is Dust large enough to install the kernel in 3 8 and to initiali;e its core data structures& In the second phase, the kernel takes advantage of all of the existing 3 8 and sets up the page tables properly& %et us examine how this plan is executed&

2.B.B.1. rovisional kernel age )a$les

provisional *age /lobal :irectory is initiali;ed statically during kernel compilation, while the provisional *age Tables are initiali;ed by the startupE"#( $ assembly language function defined in archHi"86HkernelHheadE"#&1 2e won't bother mentioning the *age 0pper :irectories and *age 8iddle :irectories anymore, because they are e5uated to *age /lobal :irectory entries& * 9 support is not enabled at this stage& The provisional *age /lobal :irectory is contained in the swapperEpgEdir variable& The provisional *age Tables are stored starting from pg0, right after the end of the kernel's uninitiali;ed data segments (symbol Eend in <igure #!4"$& <or the sake of simplicity, letFs assume that the kernelFs segments2 the provisional age )a$les2 and the 12; 3+ memory area &it in the &irst ; M+ o& %AM& In order to map 8 8+ of 3 8, two *age Tables are re5uired& The obDective of this first phase of paging is to allow these 8 8+ of 3 8 to be easily addressed both in real mode and protected mode& Identity mapping J linear mapping

rovisional kernel age )a$les

The kernel must create a mapping from both the linear addresses 0x00000000 through 0x00>fffff and the linear addresses 0xc0000000 through 0xc0>fffff into the physical addresses 0x00000000 through 0x00>fffff& In other words, the kernel during its first phase of initiali;ation can address the first 8 8+ of 3 8 by either linear addresses identical to the physical ones or 8 8+ worth of linear addresses, starting from 0xc0000000& The 7ernel creates the desired mapping by filling all the swapperEpgEdir entries with ;eroes, except for entries 0, 4, 0x"00 (decimal >68$, and 0x"04 (decimal >6)$6 the latter two entries span all linear addresses between 0xc0000000 and 0xc0>fffff& The 0, 4, 0x"00, and 0x"04 entries are initiali;ed as follows

The address field of entries 0 and 0x"00 is set to the physical address of pg0, while the address field of entries 4 and 0x"04 is set to the physical address of the page frame following pg0& The *resent, 3eadH2rite, and 0serH1upervisor flags are set in all four entries& The ccessed, :irty, *(:, *2:, and *age 1i;e flags are cleared in all four entries&

=na$ling paging

The startupE"#( $ assembly language function also enables the paging unit& This is achieved by loading the physical address o& swapper5pg5dir into the cr0 control register and by setting the ( &lag o& the cr8 control register, as shown in the following e5uivalent code fragment-

movl NswapperEpgEdir ! 0xc0000000,Oeax movl Oeax,Ocr" movl Ocr0,Oeax orl N0x80000000,Oeax movl Oeax,Ocr0 HG &&and set paging (*/$ bit GH HG set the page table pointer&& GH

2.B.E. #andling the #ardware Cache and the )L+

The last topic of memory addressing deals with how the kernel makes an optimal use of the hardware caches& =ardware caches and Translation %ookaside +uffers play a crucial role in boosting the performance of modern computer architectures& 1everal techni5ues are used by kernel developers to reduce the number of cache and T%+ misses&

2.B.E.1. #andling the hardware cache

=ardware caches are addressed by cache lines& The %4E( (=9E+'T91 macro yields the si;e of a cache line in bytes&

Cn Intel models earlier than the *entium ., the macro yields the value "#6 on a *entium ., it yields the value 4#8&

To optimi>e the cache hit rate, the kernel considers the architecture in making the following decisions&

The most fre5uently used &ields o& a data structure are placed at the low offset within the data structure, so they can be cached in the same line& 2hen allocating a large set of data structures, the kernel tries to store each of them in memory in such a way that all cache lines are used uni&ormly&

Cache synchroni>ation is per&ormed automatically by the 80 x 86 microprocessors, thus the %inux kernel for this kind of processor does not perform any hardware cache flushing& )he kernel does provide2 however2 cache &lushing inter&aces &or processors that do not synchroni>e caches.

2.B.E.2. #andling the )L+

rocessors cannot synchroni>e their own )L+ cache automatically because it is the kernel, and not the hardware, that decides when a mapping between a linear and a physical address is no longer valid& .ntel microprocessors o&&ers only two )L+*invalidating techni6ues:

ll *entium models automatically flush the T%+ entries relative to non!global pages when a value is loaded into the cr0 register& In *entium *ro and later models, the invlpg assem$ly language instruction invalidates a single T%+ entry mapping a given linear address&

Table #!4# lists the %inux macros that exploit such hardware techni5ues6 these macros are the basic ingredients to implement architecture!independent methods

)a$le 2*12. )L+*invalidating macros &or the .ntel entium ro and later processors

Macro name 5 5&lush5tl$( )

'escription 3ewrites cr" register back into itself

5 5&lush5tl$5glo$al( ) :isables global pages by clearing the */9 flag of cr., rewrites cr" register back into itself, and sets again the */9 flag 5 5&lush5tl$5single(addr) 9xecutes invlpg assembly language instruction with parameter addr

"M s and avoiding )L+ &lushes

The architecture!independent T%+!invalidating methods are extended 5uite simply to multiprocessor systems& The function running on a (*0 sends an .nterprocessor .nterrupt to the other (*0s that forces them to execute the proper T%+!invalidating function& s a general rule, any process switch implies changing the set of active page tables& The kernel succeeds, however, in avoiding )L+ &lushes in the following cases

2hen performing a process switch between two regular processes that use the same set o& page ta$les& 2hen performing a process switch between a regular process and a kernel thread&

More cases o& &lushes and la>y )L+ mode

+esides process switches, there are other cases in which the kernel needs to flush some entries in a T%+&

<or instance, when the kernel assigns a page &rame to a !ser Mode process and stores its physical address into a *age Table entry, it must &lush any local )L+ entry that refers to the corresponding linear address& Cn multiprocessor systems, the kernel also must flush the same )L+ entry on the C !s that are using the same set of page tables, if any&

To avoid useless T%+ flushing in multiprocessor systems, the kernel uses a techni5ue called la>y )L+ mode &

The basic idea is the following- if several (*0s are using the same page tables and a T%+ entry must be flushed on all of them, then T%+ flushing may, in some cases, be delayed on (*0s running kernel threads&

You might also like