V2.0.0.3

AIX 5L Kernel Internals
(Course Code BE0070XS)

Student Notebook
ERC 4.0

eServer UNIX Technical Education IBM Certified Course Material

Trademarks

The reader should recognize that the following terms, which appear in the content of this training document, are official trademarks of IBM or other companies:

IBM® is a registered trademark of International Business Machines Corporation.

The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, or other countries, or both:

AIX®, AIX 5L™, AS/400®, Chipkill™, DB2®, DFS™, Electronic Service Agent™, IBM®, iSeries™, LoadLeveler®, NUMA-Q®, PowerPC®, pSeries™, PTX®, RS/6000®, S/370™, Sequent®, SP™, zSeries™

ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United States, other countries, or both.
Intel is a trademark of Intel Corporation in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States and other countries.
Other company, product and service names may be trademarks or service marks of others.

June 2003 Edition
The information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. © Copyright International Business Machines Corporation 2001, 2003. All rights reserved. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users — Documentation related to restricted rights — Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.

Contents

Trademarks
Course Description
Agenda

Unit 1. Introduction to the AIX 5L Kernel
    Unit Objectives
    Operating System and the Kernel
    Kernel Components
    Address Space
    Mode and Context
    Context Switches
    Interrupt Processing
    AIX 5L Kernel Characteristics
    AIX 5L Execution Environment
    System Header Files
    Conditional Compile Values
    Checkpoint
    Exercise
    Unit Summary

Unit 2. Kernel Analysis Tools
    Unit Objectives
    What tools will you be using in this class?
    The Major Functions of KDB are:
    Enabling the Kernel Debugger
    Verifying the Debugger is Enabled
    Starting the Debugger
    System Dumps
    kdb
    Checkpoint
    Exercise
    Unit Summary

Unit 3. Process Management
    Unit Objectives
    Parts of a Process
    Threads
    1:1 Thread Model
    M:1 Thread Model
    M:N Thread Model
    Creating Processes
    Creating Threads
    Process State Transitions
    The Process Table
    pvproc
    pv_stat
    Table Management
    Extending the pvproc
    PID Format
    Finding the Slot Number
    Kernel Processes
    Thread Table
    pvthread Elements
    TID Format
    u-block
    Six Structures
    Thread Scheduling Topics
    Thread State Transitions
    Thread Priority
    Run Queues
    Dispatcher and Scheduler Functions
    Dispatcher
    Scheduler
    Preemption
    Preemptive Kernels
    Scheduling Algorithms
    SMP - Multiple Run Queues
    NUMA
    Memory Affinity
    Global Run Queues
    Checkpoint
    Exercise
    Unit Summary

Unit 4. Addressing Memory
    Unit Objectives
    Memory Management Definitions
    Pages and Frames
    Address Space
    Translating Addresses
    Segments
    Segment Addressing
    32-bit Hardware Address Resolution
    64 Bit Hardware Address Resolution
    Segment Types
    Shared Memory
    shmat Memory Services
    Memory Mapped Files
    32-bit User Address Space
    32-bit Kernel Address Space
    64-bit User/Kernel Address Space
    Checkpoint
    Exercise
    Unit Summary

Unit 5. Memory Management
    Unit Objectives
    Virtual Memory Management (VMM)
    Object Types
    Demand Paging
    Data Structures
    Hardware Page Mapping
    Page not in Hardware Table
    Page on Paging Space
    External Page Table (XPT)
    Loading Pages From the File System
    Object Type / Backing Store
    Paging Space Management Process
    Paging Space Allocation Policy
    Free Memory
    Clock Hand Algorithm
    Fatal Memory Exceptions
    Checkpoint
    Exercise
    Unit Summary

Unit 6. Logical Partitioning
    Unit Objectives
    Partitioning
    Physical Partitioning
    Logical Partitioning
    Components Required for LPAR
    Operating System Interfaces
    Virtual Memory Manager
    Real Address Range
    Real Mode Memory
    Operating System Real Mode Issues
    Address Translation
    Allocating Physical Memory
    Partition Page Tables
    Translation Control Entries
    Hypervisor
    Dividing Physical Memory
    Checkpoint
    Unit Summary

Unit 7. LFS, VFS and LVM
    Unit Objectives
    What is the Purpose of LFS/VFS?
    Kernel I/O Layers
    Major Data Structures
    Logical File System Structures
    User File Descriptor
    The file Structure
    vnode/vfs Interface
    vnode
    vfs
    root (/) and usr File Systems
    vmount
    File and File System Operations
    gfs
    vnodeops
    vfsops
    gnode
    kdb devsw Subcommand Output
    kdb volgrp Subcommand Output
    AIX lsvg Command Output
    kdb lvol Subcommand Output
    AIX lslv Command Output
    kdb pvol Subcommand Output
    AIX lspv Command Output
    Checkpoint (1 of 2)
    Checkpoint (2 of 2)
    Exercise
    Unit Summary

Unit 8. Journaled File System
    Unit Objectives
    JFS File System
    Reserved Inodes
    Disk Inode Structure
    In-core Inodes
    Direct (No Indirect Blocks)
    Single Indirect
    Double Indirect
    Checkpoint
    Unit Summary

Unit 9. Enhanced Journaled File System
    Unit Objectives
    Numbers
    Aggregate and Fileset
    Aggregate
    Allocation Group
    Fileset
    Inode Allocation Map
    Extents
    Increasing an Allocation
    Binary Tree of Extents
    Inodes
    Inline Data
    Binary Trees
    More Extents
    Continuing to Add Extents
    Another Split
    fsdb Utility
    Exercise
    Directory
    Directory Root Header
    Directory Slot Array
    Small Directory Example
    Adding a File
    Adding a Leaf Node
    Adding an Internal Node
    Checkpoint
    Exercise
    Unit Summary

Unit 10. Kernel Extensions
    Unit Objectives
    Kernel Extensions
    Relationship With the Kernel Nucleus
    Global Kernel Name Space
    Why Export Symbols?
    Kernel Libraries
    Configuration Routines
    Compiling and Linking Kernel Extensions
    How to Build a Dual Binary Extension
    Loading Extensions
    sysconfig() - Loading and Unloading
    sysconfig() - Configuration
    sysconfig() - Device Driver Configuration
    The loadext() Routine
    System Calls
    Sample System Call - Export/Import File
    Sample System Call - question.c
    Sample System Call - Makefile
    Argument Passing
    User Memory Access
    Checkpoint
    Exercise
    Unit Summary

Appendix A. Checkpoint Solutions

Appendix B. KI Crash Dump
    Unit Objectives
    Crash Dumps
    Process Flow
    About This Exercise

viii

Kernel Internals

© Copyright IBM Corp. 2001, 2003
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

Linux is a registered trademark of Linus Torvalds in the United States and other countries. Pentium and ProShare are trademarks of Intel Corporation in the United States. or both. or both. © Copyright IBM Corp. or both. or other countries. or both: AIX® Chipkill™ Electronic Service Agent™ LoadLeveler® pSeries™ S/370™ zSeries™ AIX 5L™ DB2® IBM® NUMA-Q® PTX® Sequent® AS/400® DFS™ iSeries™ PowerPC® RS/6000® SP™ ActionMedia. and the Windows logo are trademarks of Microsoft Corporation in the United States. other countries. 2003 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. Trademarks ix . Windows NT.0.0. other countries. in the United States. Intel is a trademark of Intel Corporation in the United States. Inc. Windows. Microsoft. are official trademarks of IBM or other companies: IBM® is a registered trademark of International Business Machines Corporation. MMX. Other company. which appear in the content of this training document. 2001. LANDesk. or both. Java and all Java-based trademarks are trademarks of Sun Microsystems. other countries. product and service names may be trademarks or service marks of others.V2. UNIX is a registered trademark of The Open Group in the United States and other countries.3 Student Notebook TMK Trademarks The reader should recognize that the following terms. other countries. The following are trademarks or registered trademarks of International Business Machines Corporation in the United States.


Course Description

AIX 5L Kernel Internals Concepts

Duration: 5 days

Purpose
This is a course in basic AIX 5L Kernel concepts. It is designed to provide background information useful to support engineers and AIX development/application engineers who are new to the AIX 5L Kernel environment as implemented in AIX releases 5.1 and 5.2. This course also provides background knowledge helpful for those planning to attend the AIX 5L Device Driver (Q1330) course.

Audience
- AIX technical support personnel
- Application developers who want to achieve a conceptual understanding of AIX 5L Kernel Internals

Prerequisites
Students are expected to have programming knowledge in the C programming language, working knowledge of AIX system calls, and user-level working knowledge of AIX/UNIX, including editors, shells, pipes, and Input/Output (I/O) redirection. Additionally, knowledge of basic system administration skills is required, such as the use of SMIT, configuring file systems and configuring dump devices. These skills can be obtained by attending the following courses or through equivalent experience:
- Introduction to C Programming - AIX/UNIX (Q1070)
- AIX 5L System Administration II: Problem Determination (AU16/Q1316)

In addition, the following courses are helpful:
- KornShell Programming (AU23/Q1123)
- AIX Application Programming Environment (AU25/Q1125)

Objectives
At the end of this course you will be able to:
- List the major features of the AIX 5L kernel
- Quickly traverse the system header files to find data structures
- Use the kdb command to examine data structures in the memory image of a running system or system dump
- Understand the structures used by the kernel to manage processes and threads, and the relationships between them
- Describe the layout of the segmented addressing model, and how logical to physical address translation is achieved
- Describe the operation of the VMM subsystem and the different paging algorithms
- Describe the mechanisms used to implement logical partitioning
- Understand the purpose of the logical file system and virtual file system layers and the data structures they use
- List and describe the components and function of the JFS2 and JFS file systems
- Identify the steps required to compile, link and load kernel extensions

Agenda

Day 1
Welcome
Unit 1 - Introduction to the AIX 5L Kernel lecture
Exercise 1 - Introduction to the AIX 5L Kernel
Unit 2 - Kernel Analysis Tools lecture
Exercise 2 - Kernel Analysis Tools

Day 2
Daily review
Unit 3 - Process Management lecture
Exercise 3 - Process Management
Unit 4 - Addressing Memory lecture

Day 3
Daily review
Exercise 4 - Addressing Memory
Unit 5 - Memory Management lecture
Exercise 5 - Memory Management
Unit 6 - Logical Partitioning lecture

Day 4
Daily review
Unit 7 - LFS, VFS and LVM lecture
Exercise 6 - LFS, VFS and LVM
Unit 8 - Journaled File System lecture
Unit 9 - Enhanced Journaled File System - Topic 1 lecture
Exercise 7 - Enhanced Journaled File System - Topic 1
Unit 9 - Enhanced Journaled File System - Topic 2 lecture
Exercise 8 - Enhanced Journaled File System - Topic 2

Day 5
Daily review
Unit 10 - Kernel Extensions lecture
Exercise 9 - Kernel Extensions


Unit 1. Introduction to the AIX 5L Kernel

What This Unit Is About
This unit describes the purpose, concepts and features of the AIX 5L kernel.

What You Should Be Able to Do
After completing this unit, you should be able to:
• Describe the role the kernel plays in an operating system
• Define user and kernel mode and list the operations that can only be performed in kernel mode
• Describe when the kernel must make a context switch
• Describe the role of the mstsave area in a context switch
• Name the execution environments available on each of the platforms supported by AIX 5L
• Using the system header files, identify data element types for each of the available kernels in AIX 5L

How You Will Check Your Progress
Accountability:
• Exercises using your lab system
• Check-point activity
• Unit review

References
The Design of the UNIX Operating System, by Maurice J. Bach, ISBN: 0132017997
AIX Online Documentation: http://publib16.boulder.ibm.com/pseries/en_US/infocenter/base/aix.htm

Unit Objectives

At the end of this unit you should be able to:
• Describe the role the kernel plays in an operating system
• Define user and kernel mode and list the operations that can only be performed in kernel mode
• Describe when the kernel must make a context switch
• Describe the role of the mstsave area in a context switch
• Name the execution environments available on each of the platforms supported by AIX 5L
• Using the system header files, identify data element types for each of the available kernels in AIX 5L

Figure 1-1. Unit Objectives BE0070XS4.0

Operating System and the Kernel

Figure 1-2. Operating System and the Kernel (diagram: processes sit above the system call interface, the kernel is in the middle, and the hardware interface connects the kernel to the CPUs and tty)

Notes:

Operating system
The principal purpose of the AIX operating system is to provide an environment where application programs can be executed. This mainly involves the management of hardware resources including memory, CPU and I/O.

Kernel
The kernel is the base program of the operating system. It acts as an intermediary between the application programs and the computer hardware. It provides the system call interface allowing programs to request use of the hardware. The kernel prioritizes these requests and manages the hardware through its hardware interface.

The kernel is the key program
The operating system is made up of many programs including the kernel. It is safe to say that the kernel is the most important part of the operating system; if the kernel is not running, nothing else in the operating system can function. This class discusses the internal working of the kernel in the AIX 5L operating system.

Kernel Components

Figure 1-3. Kernel Components (diagram: applications in user space; in the kernel, buffered I/O and raw I/O paths, file systems, disk space management (LVM), the I/O subsystem with its device drivers, process management and virtual memory management; below the kernel, the CPUs, disk and tty)

Notes:

Introduction
The kernel may be broken up into several sections based on the services provided to application programs. The kernel components are shown in the visual above. Each of these sections is discussed in this class.

Process management
The process management function of the kernel is responsible for the creation and termination of processes and threads, along with scheduling threads on CPUs.

Virtual memory management
The Virtual Memory Management (VMM) function of the kernel is responsible for managing all aspects of virtual and physical memory use by processes and the kernel. This includes allocating physical page frames to virtual pages, providing space for file system buffering and keeping track of which process memory is resident in physical memory and which is stored on disk.

File system
AIX supports several types of file systems including JFS, JFS2, NFS and several CD-ROM file systems. This class covers the JFS and JFS2 file systems. The file system software interacts with the disk space management software.

Disk space management
The management of disk space in AIX is handled by a layer above the disk’s drivers. The Logical Volume Manager (LVM) provides the function of disk space management.

I/O subsystem
Parts of the kernel that interact directly with I/O devices are called device drivers. Typically each type of device installed on the system will require its own device driver. Device drivers are covered in detail in a separate class on writing device drivers.

Address Space

Figure 1-4. Address Space (diagram: processes A, B and C each have their own address space, and each address space spans a user part and a kernel part)

Notes:

Introduction
AIX implements a virtual memory system. Addresses referenced by a user program do not directly reference physical memory; instead they reference a virtual address. A process’ address space contains both user- and kernel-memory addresses.

Virtual address space
By using the concept of virtual memory, each process on the system can appear to have its own address space that is separate and isolated from other processes.

Memory management
Virtual addresses are mapped by the hardware to a physical memory address. Translation tables are used by the hardware to map virtual to physical addresses. The address translation tables are controlled by the kernel. One set of address translation tables is kept for each process. To switch from one process’ address space to another, the kernel loads the appropriate address translation table into the hardware.

Mode and Environment

Figure 1-5. Mode and Context (diagram: a grid of user mode and kernel mode against process environment and interrupt environment. Application code runs in user mode in the process environment; a system call enters kernel code in kernel mode in the process environment; a hardware interrupt enters kernel code in kernel mode in the interrupt environment; user mode in the interrupt environment is an invalid combination, because interrupts always run in kernel mode)

Notes:

Introduction
Two key concepts of mode and environment are described in this section.

Mode
The computer hardware provides two modes of execution, a privileged kernel mode and a less-privileged user mode. Application programs must run in user mode and thus are given limited access to the hardware. The kernel, as you would expect, runs in kernel mode. The following table compares these two modes.

User mode | Kernel mode
Memory access is limited to the user’s private memory. Kernel memory is not accessible. | Can access all memory on the system.
I/O instructions are blocked. | All I/O is performed in kernel mode.
Can’t modify hardware registers related to memory management. | Memory management registers may be modified.
 | Interrupts must be handled in kernel mode.

Environment
The AIX kernel may execute in one of two environments: process environment or interrupt environment. In process environment, the kernel is running on behalf of a user process. This generally occurs when a user program makes a system call, although it is also possible to create a kernel-mode only process. When the kernel responds to an interrupt, it is running in the interrupt environment. In this context the kernel cannot access the user address space or any kernel data related to the user process that was running on the processor just before the interrupt occurred.

Context Switches

Figure 1-6. Context Switches (diagram: the CPU performs a context switch between thread 1 and thread 2; each thread has an mstsave area holding the saved CPU registers, stack pointer and instruction pointer)

Notes:

Introduction
A context switch is the action of exchanging one thread of execution on a CPU for another.

Thread of execution
Threads of execution are simply logical paths through the instructions of a program. The AIX kernel manages many threads of execution by switching the CPUs between the different threads on the system.

Context switches
Context switches can occur at two points:
a. A hardware interrupt occurs.
b. Execution of the thread is blocked waiting for the completion of an event.

mstsave
The context of the running thread must be saved when a context switch occurs. This context includes information such as the values of the CPU registers, the instruction address register and stack pointer. This information is saved in a structure called the mstsave (machine state save) structure. Each thread of execution has an associated mstsave structure.

Restoring a context
When a thread is restored (switched in), the system register values stored in the mstsave of the thread are loaded into the CPU. The CPU then performs a branch instruction to the address of the saved instruction pointer.

Interrupt Processing

Figure 1-7. Interrupt Processing (diagram: the current save area pointer, csa, points to an unused mstsave where the next interrupt will be saved; that mstsave is chained back through the mstsave areas of the high priority interrupt and the low priority interrupt to the mstsave of the thread at the base interrupt level)

Notes:

Introduction
A hardware interrupt results in a temporary context switch. Each time an interrupt occurs, the current context of the processor must be saved so that processing can be continued after handling the interrupt.

mstsave pool
Interrupts can occur when the CPU is currently processing an interrupt; therefore, multiple mstsave areas are needed to save the context of each interrupt. AIX keeps a pool of mstsave areas to use. This is because a thread structure has an mstsave structure, whereas an interrupt is a transient entity and does not have its own thread structure.

csa pointer
Each processor has a pointer to the mstsave area it should use when an interrupt occurs. This pointer is called the current save area, or csa pointer.

Interrupt processing

Saving context
When an interrupt occurs, the steps AIX takes to save the currently running context are:
Step 1. Save the current context in the mstsave area pointed to by the CPU’s csa.
Step 2. Get the next available mstsave area from the pool.
Step 3. Link the just used mstsave to the new mstsave.
Step 4. Update the CPU’s csa pointer to point to the new mstsave area.

Interrupt history
When AIX receives an interrupt that is of higher priority than the one it is currently handling, it must save the current state in a new mstsave area, linking the new save area to the previous one. This forms a history of interrupt processing. The last or base-level mstsave in the chain is the mstsave of the thread that was running when the first interrupt occurred.

Unwinding the interrupts
As the processing of each interrupt is completed, the chain of mstsave areas is unlinked, working backwards from the highest priority interrupt to the lowest and finally to the base-level mstsave. The steps to restore a context are:
Step 1. Set the CPU’s csa pointer to the previous mstsave area.
Step 2. Return the current mstsave area to the pool.
Step 3. If returning to the base interrupt level and the interrupt has made a thread runnable, invoke the dispatcher. The dispatcher will move the thread originally on the end of the MST chain back to the run queue, and place the best runnable thread at the end of the MST chain.
Step 4. Reload the registers from the saved context.
Step 5. Branch to the instruction referenced by the instruction address register.

Finding the current mstsave
The csa always points to an unused mstsave area. This mstsave will be used if a higher-priority interrupt occurs. The data in this mstsave will not be valid except for its pointer to the next mstsave in the chain. The last used mstsave area can be located by following the prev pointer from the mstsave pointed to by the csa.

AIX 5L Kernel Characteristics

• Preemptable kernel
• Pageable kernel memory
• Dynamically extensible kernel

Figure 1-8. AIX 5L Kernel Characteristics

Notes:

Introduction
The AIX kernel was the first mainstream UNIX kernel to implement several important features. These features are listed above.

Preemptable
Preemptable means that the kernel can be running in kernel mode (running a system call, for example) and be interrupted by another more important task. Preemption causes a context switch to another thread inside the kernel. Many other UNIX kernels will not allow preemption to occur when running in kernel mode. This can result in long delays in the processing of real time threads. AIX improves real time processing by allowing for preemption in kernel mode. As an example, at the time of writing (2003), Linux does not support preemption when in kernel mode.

Pageable
Not all of the kernel’s virtual memory space needs to be resident in physical memory at all times. Portions of the kernel memory may be paged out to disk when not needed. This allows for better utilization of physical memory. Most kernels support the paging of user-virtual-address space. AIX supports paging both user- and kernel-address space. The ability to page kernel memory is a feature not found in all UNIX kernels. As an example, the kernel memory of the Linux operating system is resident in physical memory at all times.

Pinning memory
Some areas of the kernel’s memory must stay resident, meaning they may not be paged to disk; for example, portions of device drivers must be pinned in memory. Areas of memory that are not subject to paging are called pinned memory.

Extensible
The AIX kernel is dynamically extensible. This means that not all the code required for the kernel needs to be included in a single binary (/unix). Portions of the kernel’s code will be loaded at runtime. Dynamically loaded modules are called kernel extensions. Kernel extensions typically add functionality that may not be needed by all systems. This keeps the kernel smaller and requires less memory. Kernel extensions can include:
- Device drivers
- Extended system calls
- File systems

AIX 5L Execution Environment

Figure 1-9. AIX 5L Execution Environment (diagram with three columns: 32-bit hardware running 32-bit applications on the 32-bit kernel; 64-bit hardware running 32-bit and 64-bit applications on the 32-bit kernel; 64-bit hardware running 32-bit and 64-bit applications on the 64-bit kernel)

Notes:

Introduction
AIX 5L supports both 32-bit and 64-bit execution environments. On 32-bit hardware platforms only the 32-bit environment can be used, but on 64-bit platforms either can be used. The key to this 64-bit platform flexibility is that a 64-bit VMM (Virtual Memory Manager) is run in both cases, using left zero fill of addresses for the 32-bit kernel environment.

32-bit and 64-bit kernel
The primary advantage of the 64-bit kernel is the increased kernel address space. This allows systems to support increased workloads. However, there is an added cost to managing a 64-bit address space. Not all applications will require the increased address space of the 64-bit kernel. In these cases, a 32-bit kernel is provided.

Selecting a kernel
The file /unix is a link to the kernel image file that is loaded at boot time. Depending on the hardware type and kernel type (32-bit or 64-bit) the link will point to the appropriate file as shown in this table.

Hardware platform | Kernel type | Kernel file
32-bit or 64-bit | 32-bit | /usr/lib/boot/unix_mp or /usr/lib/boot/unix_up
64-bit | 64-bit | /usr/lib/boot/unix_64

User applications
Both 32-bit and 64-bit applications are supported when running on 64-bit hardware, regardless of the kernel that is running.

Kernel extensions
Only 64-bit kernel extensions are supported under the 64-bit kernel. Only 32-bit kernel extensions are supported under the 32-bit kernel. Earlier versions of AIX supported running non-SMP safe kernel extensions on SMP hardware using a mechanism called funneling. Funneling is not supported on the 64-bit AIX 5L kernel. All kernel extensions must be SMP safe.

User commands
User level commands included with the AIX 5L operating system are designed to work with either the 32-bit or 64-bit kernel. However, some commands require both a 32-bit and a 64-bit version. These are typically commands that must work directly with the internal structures of the kernel. For these commands, the 32-bit version of the command will determine if a 32-bit or 64-bit kernel is running. If a 64-bit kernel is detected, then the 64-bit version of the command is run. The steps are shown in this table.

Step 1. The 32-bit version of the command is run by the user.
Step 2. The 32-bit command checks the kernel type (32- or 64-bit).
Step 3. If a 64-bit kernel is detected, then a 64-bit version of the command is started.
Step 4. If a 32-bit kernel is detected, the 32-bit command completes its execution.

For example, under the initial release of AIX 5.1 the command vmstat would run the command vmstat64. In later versions of AIX 5.1, and in AIX 5.2, vmstat (along with other performance commands) uses a performance tools API.

System Header Files

Figure 1-10. System Header Files (diagram: directory tree rooted at /usr/include)

/usr/include
    stdio.h  fcntl.h  signal.h
    sys/     proc.h  thread.h  uthread.h  user.h  types.h  mode.h
    jfs/     dir.h  filsys.h  ino.h  inode.h  jfsmount.h
    j2/      j2-btree.h  j2-inode.h  j2-dinode.h  j2-types.h

Notes:

Introduction
The system header files contain the definition of structures that are used by the AIX kernel. We will reference these files throughout this class, since they contain the C language definitions of the structures we will be describing.

Finding header files
The drawing above shows the location of the system header files.

Location of header files
The /usr/include directory contains several sub-directories containing header files. Some of the sub-directories are described in this table.

Header file directory | Description
/usr/include | General program header files
/usr/include/sys | Header files dealing directly with the operations of the system
/usr/include/jfs | Header files for the JFS file system
/usr/include/j2 | Header files for the JFS2 file system

Conditional Compile Values

Value | Meaning
_POWER_MP | Code is being compiled for a multiprocessor machine.
_KERNSYS | Enable kernel symbols in header files. This value should always be used when compiling kernel code.
_KERNEL | Compiling kernel extension or device driver code. This value should always be used when compiling kernel code.
__64BIT_KERNEL | Code is being compiled for a 64-bit kernel. This value should always be used for 64-bit kernel extensions and device drivers.
__64BIT__ | Code is being compiled in 64-bit mode. This value is automatically defined by the compiler if the -q64 option is specified.

Figure 1-11. Conditional Compile Values

Notes:

Conditional compile values
Several conditional compiler directives are used in the system header files to select the platform and environment (32-bit or 64-bit kernel). This is because certain data types have different sizes depending on the execution environment (for example, 32-bit or 64-bit).

Example
Shown here is a portion of the definition of a struct thread. The compiler directive #ifndef __64BIT_KERNEL is used to create different definitions for the 32-bit and 64-bit kernels.

struct thread {
        /* identifier fields */
        tid_t           t_tid;          /* unique thread identifier */
        tid_t           t_vtid;         /* Virtual tid */

        /* related data structures */
        struct pvthread *t_pvthreadp;   /* my pvthread struct */
        struct proc     *t_procp;       /* owner process */
        struct t_uaddress {
                struct uthread  *uthreadp;  /* local data */
                struct user     *userp;     /* owner process' ublock (const)*/
        } t_uaddress;                   /* user addresses */

#ifndef __64BIT_KERNEL
        uint            t_ulock64;      /* high order 32-bits */
        uint            t_ulock;        /* user addr - lock or cv */
        uint            t_uchan64;      /* high order 32-bits */
        uint            t_uchan;        /* key of user addr */
        uint            t_userdata64;   /* high order 32-bits if 64-bit mode */
        int             t_userdata;     /* user-owned data */
        uint            t_cv64;         /* high order 32-bits if 64-bit mode */
        int             t_cv;           /* User condition variable */
        uint            t_stackp64;     /* high order 32-bits if 64bit mode */
        char            *t_stackp;      /* saved user stack pointer */
        uint            t_scp64;        /* high order 32-bits if 64bit mode */
        struct sigcontext *t_scp;       /* sigctx location in user space*/
#else
        long            t_ulock;        /* user addr - lock or cv */
        long            t_uchan;        /* key of user addr */
        long            t_userdata;     /* user-owned data */
        long            t_cv;           /* User condition variable */
        char            *t_stackp;      /* saved user stack pointer */
        struct sigcontext *t_scp;       /* sigctx location in user space*/
#endif
        .
        .
        .

Checkpoint

1. The ______ is the base program of the operating system.
2. The processor runs interrupt routines in ______ mode.
3. The AIX kernel is _______, ________ and __________.
4. The 64-bit AIX kernel supports only _______ kernel extensions, and only runs on _______ hardware.
5. The 32-bit kernel supports 64-bit user applications when running on ________ hardware.

Figure 1-12. Checkpoint

Exercise

Complete exercise one
• Consists of theory and hands-on
• Ask questions at any time
• Activities are identified by a [icon]

What you will do:
• Use the cscope tool to examine system header files

Figure 1-13. Exercise

Notes:
Turn to your lab workbook and complete exercise one.

Unit Summary

• Describe the role the kernel plays in an operating system
• Define user and kernel mode and list the operations that can only be performed in kernel mode
• Describe when the kernel must make a context switch
• Describe the role of the mstsave area in a context switch
• Name the execution environments available on each of the platforms supported by AIX 5L
• Using the system header files, identify data element types for each of the available kernels in AIX 5L

Figure 1-14. Unit Summary

Unit 2. Kernel Analysis Tools

What This Unit Is About
This unit describes the different tools that are available to debug the AIX 5L kernel.

What You Should Be Able to Do
After completing this unit, you should be able to:
• List the tools available for analyzing the AIX 5L kernel
• Use KDB to display and modify memory locations and interpret a stack trace
• Use basic kdb navigation to explore a crash dump and a live system

How You Will Check Your Progress
Accountability:
• Exercises using your lab system

References
AIX Documentation: Kernel Extensions and Device Support Programming Concepts

Unit Objectives

At the end of this unit you should be able to:
• List the tools available for analyzing the AIX 5L kernel
• Use KDB to display and modify memory locations and interpret a stack trace
• Use basic kdb navigation to explore a crash dump and a live system

Figure 2-1. Unit Objectives

2003 Unit 2. .3 Student Notebook Uempty What tools will you be using in this class? Figure 2-2. Kernel Analysis Tools 2-3 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.0.V2. This table list the primary tools we will be covering in this unit. 2001.0.0 Notes: Kernel Analysis Tools Several tools are available in AIX 5L that are used to examine and debug the kernel. © Copyright IBM Corp. and lowercase kdb is used when referring to the image analysis command. Description Kernel debugger for live system debugging Used for system image analysis KDB kdb Tool Typographic conventions In this class an uppercase KDB will be used when referring to the kernel debugger. What tools will you be using in this class? BE0070XS4.

The Major Functions of KDB are:

• Set breakpoints within the kernel or kernel extensions
• Execution control through various forms of step execution commands
• Format display of selected kernel data structures
• Display and modification of kernel data
• Display and modification of kernel instructions
• Modify the machine state through alteration of system registers

Figure 2-3. The Major Functions of KDB are:

Notes:

Introduction
This section describes the kernel debugger available in AIX 5L.

Overview
The kernel debugger is built into the AIX 5L production kernel. For the debugger to be used, it must be enabled prior to booting.

Interfacing with the debugger
Once started, the kernel debugger is operated from a terminal connected to a native serial port of the system, or from a serial terminal connected via an 8-port or 128-port adapter. The debugger cannot be operated from the LFT graphics display.

Concept
When KDB is invoked, it is the only running program until you exit the debugger. All processes are stopped and interrupts are disabled. The kernel debugger runs with its own Machine State Save Area (mst) and a special stack. In addition, the kernel debugger does not run operating system routines. Though this requires that kernel code be duplicated within the debugger, this means it is possible to set breakpoints anywhere within the kernel code. When exiting the kernel debugger, all processes continue to run unless the debugger was entered via a system halt.

Enabling the Kernel Debugger

Perform these steps to enable the kernel debugger:
1. Set kernel boot flags (bosdebug -D)
2. Build a new boot image (bosboot -ad /dev/ipldevice)
3. Boot the new image (shutdown -Fr)
4. Verify the debugger is enabled (check dbg_avail)

Figure 2-4. Enabling the Kernel Debugger

Notes:

Kernel flags
The kernel debugger feature is enabled by setting flags in the boot image prior to booting the kernel. After changing these flags you must create a new boot image and reboot the system to use this new image.

Building a new boot image
The bosboot command is used to build boot images. Arguments supplied to the bosboot command set flags in the boot image that cause the kernel debugger to be enabled or disabled. After the boot image has been built, the system must be rebooted for the new options to take effect.

bosboot syntax
The syntax of the bosboot command is:

    bosboot -a [-D | -I] -d device

Argument     Description
-a           Creates complete boot image.
-d device    Specifies the boot device. The current boot disk is represented by the device: /dev/ipldevice
-D           Loads the kernel debugger. The kernel debugger will not automatically be invoked when the system boots.
-I           Loads and invokes the kernel debugger. The kernel debugger will be invoked immediately on boot.

Example
The following command will build a new boot image with the kernel debugger loaded:

    # bosboot -a -D -d /dev/ipldevice

The system must be rebooted for the change to take effect.

bosdebug
Attributes in the SWservAt ODM database can be set so that bosboot will enable the kernel debugger regardless of the command line argument used when building the boot image. The bosdebug command is used to view or set these attributes.

To view the setting of the debug flags in the ODM database use the command:

    # bosdebug
    Memory debugger         off
    Memory sizes            0
    Network memory sizes    0
    Kernel debugger         on
    Real Time Kernel        off

To set the kernel debugger attribute on use the command:

    # bosdebug -D

To set the kernel debugger attribute off use the command:

    # bosdebug -o

Note: All this command does is set attributes in the SWservAt ODM database. The bosboot command reads these values and sets up the boot image accordingly.

Verifying the Debugger is Enabled

Step    Action
1       Start the kdb command:
            # kdb
2       View the dbg_avail memory flag:
            (0)> dw dbg_avail 1
            dbg_avail+000000: 00000002
3       Compare the value of dbg_avail against the mask value in this table.

Mask          Description
0x00000000    Debugger is not ever to be called.
0x00000001    Do invoke at bootup.
0x00000002    Don't invoke at boot, but debugger is still invokable.

Figure 2-5. Verifying the Debugger is Enabled

Notes:

Verifying the kernel debugger is enabled
Once the kernel is booted, you can use the following procedure to verify that the kernel debugger has been enabled.
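The comparison in step 3 can be sketched in C. This is only an illustration: the constant and function names are hypothetical (the kernel headers may name these values differently), and the sketch assumes that 0 means the debugger is never called, 1 means it is invoked at bootup, and 2 means it is loaded but not invoked at boot, which is consistent with the value 00000002 seen after enabling the debugger with the -D option.

```c
/* Hypothetical names for the three dbg_avail values discussed above.
 * The value-to-meaning mapping is an assumption stated in the lead-in. */
enum dbg_avail_value {
    DBG_NEVER       = 0x00000000, /* debugger is not ever to be called   */
    DBG_INVOKE_BOOT = 0x00000001, /* debugger is invoked at bootup       */
    DBG_LOADED_ONLY = 0x00000002  /* not invoked at boot, but invokable  */
};

/* Return a short description of a dbg_avail value read with kdb. */
const char *dbg_avail_describe(unsigned int value)
{
    switch (value) {
    case DBG_NEVER:       return "debugger is never called";
    case DBG_INVOKE_BOOT: return "debugger is invoked at bootup";
    case DBG_LOADED_ONLY: return "loaded, invokable after boot";
    default:              return "unrecognized value";
    }
}

/* Return 1 if the debugger can be entered at all, 0 otherwise. */
int dbg_is_available(unsigned int value)
{
    return value == DBG_INVOKE_BOOT || value == DBG_LOADED_ONLY;
}
```

For the value shown in step 2, dbg_is_available(0x00000002) returns 1.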

Starting the Debugger

The debugger is started when one of the following occurs:
• From a native serial port, the key sequence Ctrl-\ is typed
• From the LFT keyboard, the key sequence Ctrl-Alt-Numpad4 is typed
• A kernel extension or application makes a call to brkpoint()
• A breakpoint previously set using the debugger has been reached
• A fatal system error occurs

Figure 2-6. Starting the Debugger

Notes:

Invoke vs. load only
When the kernel debugger is configured to be invoked (the -I option), the debugger will start immediately after booting. If configured to be loaded but not invoked (the -D option), one of the conditions listed above must occur after the system is booted for the debugger to be started.

System Dumps

• What is in a system dump?
• What is the effect of kernel paging?
• What is the role of the Master Dump Table?
• What tools are used to analyze system dumps?

Figure 2-7. System Dumps

Notes:

What is in a system dump
Typically, an AIX 5L dump includes all of the information needed to determine the nature of the problem. A dump image is not actually a full image of the system memory but a set of memory areas copied out by the dump routines. The dump contains:
 - Operating system (kernel) code and data
 - Most of the kernel extensions code and data
 - Some data from the current running application

Paged memory
The dump facility cannot page in memory, so only what is currently in physical memory can be dumped. Normally this is not a problem since most of the kernel data structures are in memory. The process and thread tables are pinned, and the uthread and ublock structures of the running thread are pinned as well.

The master dump table
The system dump function captures data areas by processing information returned by routines registered in the Master Dump Table. Kernel specific areas to be included in the dump are pre-loaded at kernel initialization. Kernel extensions can specify a routine to be called to include data in a system dump. On AIX 5.1 this is done with the dmp_add() kernel service. AIX 5.2 uses the dmp_ctl() kernel service.

Analyzing dumps
System dumps can be examined using the kdb command.

Dump Creation Process

Introduction
This section describes the dump process.

Process overview
The following steps are used to write a dump to the dump device:

Step    Action
1.      Interrupts are disabled.
2.      0c9 or 0c2 is written to the LED display, if present.
3.      Header information about the dump is written to the dump device.
4.      The kernel steps through each entry in the Master Dump Table, calling each Component Dump routine twice:
        • Once to indicate that the kernel is starting to dump this component (1 is passed as a parameter).
        • Again to say that the dump process is complete (2 is passed as a parameter).
        After the first call to a Component Dump routine, the kernel processes the CDT that was returned. For each CDT entry, the kernel:
        • Checks every page in the identified data area to see if it is in memory or paged out
        • Builds a bitmap indicating each page's status
        • Writes a header, the bitmap, and those pages which are in memory to the dump device

Step    Action
5.      Once all dump routines have been called, the kernel enters an infinite loop, displaying 0c0 or flashing 888.
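The per-entry page check in the dump process (test every page of a data area, record its status in a bitmap) can be sketched as follows. This is not the kernel's dump code: the function names, the callback type and the 4096-byte page size are assumptions made for illustration, and the data area is assumed to start on a page boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical predicate: nonzero if the page at page_addr is in memory. */
typedef int (*page_resident_fn)(uintptr_t page_addr);

/*
 * Build a bitmap with one bit per page of the data area [start, start+len).
 * A set bit means the page is in memory and would be written to the dump.
 * The caller supplies a bitmap of at least (npages + 7) / 8 bytes.
 * Returns the number of pages found in memory.
 */
size_t build_page_bitmap(uintptr_t start, size_t len,
                         page_resident_fn resident, unsigned char *bitmap)
{
    size_t npages = (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    size_t in_memory = 0;

    memset(bitmap, 0, (npages + 7) / 8);
    for (size_t i = 0; i < npages; i++) {
        if (resident(start + i * SKETCH_PAGE_SIZE)) {
            bitmap[i / 8] |= (unsigned char)(1u << (i % 8));
            in_memory++;
        }
    }
    return in_memory;
}

/* Demo predicate: pretend that only even-numbered pages are in memory. */
int demo_even_pages_resident(uintptr_t page_addr)
{
    return ((page_addr / SKETCH_PAGE_SIZE) % 2) == 0;
}
```

Only the pages whose bit is set would then be copied to the dump device after the header and the bitmap, which is why paged-out data never appears in a dump.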

kdb

• The kdb command allows examination of an operating system image
• Requires system image and /unix
• Can be run on a running system using /dev/mem
• Typical invocations:
      # kdb -m vmcore.X -u /usr/lib/boot/unix
  or
      # kdb

Figure 2-8. kdb

Notes:

kdb Command

Files needed
The kdb command requires both a memory image (dump device, vmcore or /dev/mem) and a copy of /unix to operate. The /unix file provides the necessary symbol mapping needed to analyze the memory image file. It is imperative that the /unix file supplied is the one that was running at the time the memory image was created. The memory image (whether a device such as /dev/dumplv or a file such as vmcore.0) must not be compressed.

Parameters
The kdb command may be used with the following parameters:

Parameter               Description
no parameter            Use /dev/mem as the system image file and /usr/lib/boot/unix as the kernel file. In this case root permissions are required.
-m system_image_file    Use the image file provided.
-u kernel_file          Use the kernel file. This is required to analyze a system dump on a different system.
-k kernel_modules       Add the kernel_modules listed.
-w                      View XCOFF object.
-v                      Print CDT entries.
-h                      Print help.
-l                      Disable in-line more, useful when running a non-interactive session.

Example
To run kdb against a vmcore file use the following command line:

    # kdb -m vmcore.X -u /unix

To run kdb against the live (running) kernel, no parameters are required:

    # kdb

Checkpoint

1. _____ is used for live system debugging.
2. _____ is used for system image analysis.
3. The value of the _______ kernel variable indicates how the debugger is loaded.
4. A system dump image contains everything that was in the kernel at the time of the crash. True or False?

Figure 2-9. Checkpoint

Exercise

Complete exercise two
• Consists of theory and hands-on
• Ask questions at any time
• Activities are identified by a [icon]

What you will do:
• Enable and start the kernel debugger
• Display and interpret stack traces
• Display and modify variables in kernel memory
• Perform basic kdb navigations on live system and crash dump

Figure 2-10. Exercise

Notes:

Introduction
Turn to your lab workbook and complete exercise two. Read the information blocks included with the exercises. They will provide you with information needed to do the exercise.

Unit Summary

• List the tools available for analyzing the AIX 5L kernel
• Use KDB to display and modify memory locations and interpret a stack trace
• Use basic kdb navigation to explore crash dump and live system

Figure 2-11. Unit Summary


Unit 3. Process Management

What This Unit Is About
This unit describes how processes and threads are managed in AIX 5L.

What You Should Be Able to Do
After completing this unit, you should be able to:
• List the three thread models available in AIX 5L
• Identify the relationship between the six internal structures: pvproc, proc, pv_thread, thread, user and u_thread
• Use the kernel debugging tools in AIX to locate and examine a process’ proc, thread, user and u_thread data structures
• Identify the states of processes and threads on a live system and in a crash dump
• Analyze a crash dump caused by a run-away process
• Identify the features of AIX scheduling algorithms
• Identify the primary features of the AIX scheduler supporting SMP and large system architectures
• Identify the action the threads of a process will take when a signal is received by the process

How You Will Check Your Progress
Accountability:
• Exercises using your lab system
• Check-point activity
• Unit review

References
AIX Documentation: Performance Management Guide
AIX Documentation: System Management Guide: Operating System and Devices

Unit Objectives

At the end of this unit you should be able to:
• List the three thread models available in AIX 5L
• Identify the relationship between the six internal structures: pvproc, proc, pv_thread, thread, user and u_thread
• Use the kernel debugging tools in AIX to locate and examine a process’ proc, thread, user and u_thread data structures
• Identify the states of processes and threads on a live system and in a crash dump
• Analyze a crash dump caused by a run-away process
• Identify the features of AIX scheduling algorithms
• Identify the primary features of the AIX scheduler supporting SMP and large system architectures
• Identify the action the threads of a process will take when a signal is received by the process

Figure 3-1. Unit Objectives

Parts of a Process

Figure 3-2. Parts of a Process
(Diagram: a process made up of its resources (address space, open files pointers, user credentials, management data) and three threads, each with its own stack and CPU registers.)

Notes:

Processes and threads
A process is a self-contained entity that consists of the information required to run a single program, such as a user application.

Process
A process can be divided into two components:
 - A collection of resources
 - A set of one or more threads

Resources
The resources making up a process are shared by all threads in the process. The resources are:
 - Address space (program text, data and heap)
 - A set of open files pointers
 - User credentials
 - Management data

Threads
A thread can be thought of as a path of execution through the instructions of the process. Each thread has a private execution context that includes:
 - A stack
 - CPU register values (loaded into the CPU when the thread is running)

Threads

Three types of threads are available in AIX:
• Kernel
• Kernel-managed
• User

Three thread programming models are available for user threads:
• 1:1
• M:1
• M:N

Figure 3-3. Threads

Notes:

Threads
Threads provide the execution context to the process.

Kernel threads
Kernel threads are not associated with a user process and therefore have no user context. Kernel threads run completely in kernel mode and have their own kernel stack. They are cheap to create and manage, and so are typically used to perform a specific function such as asynchronous I/O.

Kernel-managed threads
Kernel-managed threads are sometimes called "Light Weight Processes" or LWPs and are the fundamental unit of execution in AIX. The scheduling and running of kernel-managed threads is managed by the kernel. Each user process contains one or more kernel-managed threads. Each thread is scheduled to run on a CPU independent of the other threads of the process. On SMP systems, the threads of one process can run concurrently.

User threads
User threads are an abstraction entirely at the user level. The kernel has no knowledge of their existence. They are managed by a user-level threads library and their scheduling and execution are managed at the user level.

Programming models
AIX 5L provides three models for mapping user threads on top of kernel-managed threads. The application developer can choose between the 1:1, M:1 and M:N models.

1:1 Thread Model

Figure 3-4. 1:1 Thread Model
(Diagram: three user threads, each mapped through the thread library to its own kernel-managed thread.)

Notes:

1:1 Model
In the 1:1 model, each user thread is mapped to a single kernel-managed thread.

M:1 Thread Model

Figure 3-5. M:1 Thread Model
(Diagram: three user threads mapped by the library scheduler in the thread library onto one kernel-managed thread.)

Notes:

M:1
In the M:1 model all user threads are mapped to one kernel-managed thread. The scheduling and management of the user threads are completely handled by the thread library.

M:N Thread Model

Figure 3-6. M:N Thread Model
(Diagram: four user threads mapped by the library scheduler in the thread library onto a pool of three kernel-managed threads.)

Notes:

M:N
In the M:N model, user threads are mapped to a pool of kernel-managed threads. A user thread may be bound to a specific kernel-managed thread. An additional "hidden" user scheduler thread may be started by the library to handle mapping user threads onto kernel-managed threads.

Thread model for this unit
This unit focuses on the management and scheduling of kernel-managed threads. Primarily, the 1:1 model is discussed. Unless specified, the term "thread" refers to a kernel-managed thread. Note that the thread model is selectable. The default for AIX 4.3.1 and higher is the M:N model. Using the 1:1 model can improve performance. The following will select the 1:1 model:

    # export AIXTHREAD_SCOPE=S
    # <your_program>

There are many similar options available for thread tuning. See the Performance Management Guide in the AIX online documentation.
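The AIXTHREAD_SCOPE environment variable chooses the model for a whole program from outside. A program can also ask for the same behavior per thread through the standard POSIX contention scope attribute. The sketch below uses only POSIX calls; the helper name is hypothetical, and the statement that a system-scope thread corresponds to the 1:1 model is the assumption this example rests on.

```c
#include <pthread.h>

/*
 * Initialize a thread attribute object and request system contention scope.
 * A thread created with this attribute competes for the CPU with all threads
 * in the system, which is assumed here to mean it is bound to its own
 * kernel-managed thread (the 1:1 model).
 * Returns 0 on success or an error number, as the pthread calls do.
 */
int init_system_scope_attr(pthread_attr_t *attr)
{
    int rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setscope(attr, PTHREAD_SCOPE_SYSTEM);
    if (rc != 0)
        pthread_attr_destroy(attr);  /* do not leak the attribute on failure */
    return rc;
}
```

The initialized attribute is then passed as the second argument of pthread_create() and released with pthread_attr_destroy() once the thread has been created.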

Creating Processes

When a process is created it is given:
• A process table entry
• Process identifier (PID)
• An address space (its contents are copied from the parent process)
   - User-area
   - Program text
   - Data
   - User and kernel stacks
• A single kernel-managed thread (even if the parent process had many threads)

Figure 3-7. Creating Processes

Notes:

Creating processes
A new process is created when an existing process executes a fork() system call. The new process is called a child process; the creating process is the child’s parent.

Exec
When a process is first created it is running the same program as its parent. One of the exec() class of system calls is normally used to load a new program into the process’ address space.

Example
Here is an example of fork and exec to start a new program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t child;

        if ((child = fork()) == -1) {
            perror("could not fork a child process");
            exit(1);
        }
        if (child == 0) {   /* child */
            /* exec a new program; by convention argv[0] is the program name */
            if (execl("/bin/ls", "ls", "-l", (char *)NULL) == -1) {
                perror("error on execl");
                exit(1);
            }
            exit(0);        /* all done, end the new process */
        } else {            /* parent */
            wait(NULL);     /* ensure parent terminates after child */
        }
        return 0;
    }   /* main */

Creating Threads

A new thread is created by the thread_create() system call. When created the thread is assigned:
• A thread table entry
• A thread identifier
• An execution context (stack pointer and CPU registers)

Figure 3-8. Creating Threads

Notes:

Creating threads
When a process is first created it contains a single kernel-managed thread. A process can create additional threads using the thread_create() system call.

Thread library
AIX provides a thread library to assist programmers with the creation and management of threads. The thread library allows for creation and management of both kernel-managed threads and user threads using the same interface. Typically, the library function pthread_create() is used to create threads rather than calling thread_create() directly.

pthread_create example
Here is an example of creating a new thread using pthread_create:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    void *new_thread(void *arg);

    int main(void)
    {
        pthread_t threadId;
        int rc;

        /* start up a new thread */
        rc = pthread_create(&threadId, NULL, new_thread, NULL);
        if (rc != 0) {
            /* pthread_create returns an error number; it does not set errno */
            fprintf(stderr, "pthread_create: %s\n", strerror(rc));
            exit(rc);
        }
        /* main thread code here */
        pthread_join(threadId, NULL);   /* wait for the new thread to finish */
        return 0;
    }

    void *new_thread(void *arg)
    {
        /* new thread code here */
        return NULL;
    }

Process State Transitions

Figure 3-9. Process State Transitions
(Diagram: process creation by fork() enters the Idle state; the other states shown are Active, Swapped, Stopped, Zombie and Non-existent.)

Notes:

Process states
The illustration above shows the states of a process during its life. In AIX a process can be in one of five states:
 - Idle
 - Active
 - Stopped
 - Swapped
 - Zombie

States
The five process states are described in this table:

State      Description
Idle       A process is started with a fork() system call. During creation the process is in the idle state. This state is temporary until all of the necessary resources have been allocated.
Active     Once the creation of the process is done it is placed in the active state. This is the normal process state. The threads of the process can now be scheduled to run on a CPU.
Stopped    When a process receives a SIGSTOP signal, it is placed in the stopped state. If a process is stopped, all its threads are stopped and will not be scheduled on a CPU. A stopped process can be restarted by the SIGCONT signal.
Swapped    A swapped process has lost its memory resources and its address space has been moved onto disk. It cannot run until swapped back into memory.
Zombie     When a process terminates, some of its resources are not automatically released. A process is placed in the zombie state until its parent cleans up after it and frees the resources. The parent must execute a wait() system call to retrieve the process’ exit status before the process will be removed from the process table.

Zombie process
Sometimes a zombie process will stay in the process list for a long time. One example of this situation is a process that has exited while its parent process is busy or waiting in the kernel and unable to read the return code. If the parent process no longer exists when a child process exits, the init process (PID 1) frees the remaining resources held by the child.
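The parent's part in clearing a zombie can be illustrated with a short C sketch. The function name is hypothetical; fork(), _exit() and waitpid() are the standard calls. Between the child's exit and the parent's waitpid() the child sits in the zombie state.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that terminates at once with exit_code, then reap it.
 * Returns the child's exit status, or -1 if fork/waitpid fails or the
 * child did not exit normally.
 */
int spawn_and_reap(int exit_code)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0)
        _exit(exit_code);            /* child: terminates, becomes a zombie */

    /* parent: retrieving the status lets the kernel free the table entry */
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A parent that never reaches the waitpid() call, for example because it is blocked elsewhere, is exactly the case in which the zombie stays visible in the process list.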

.0. 2001.0. Process Management 3-17 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. # ps -l F 240001 200001 200001 200011 S A A A T UID PID 201 0 0 0 17670 19172 19392 19928 PPID 16390 17670 19172 19172 C 0 0 3 0 PRI NI ADDR SZ 496 496 308 436 WCHAN TTY pts/3 pts/3 pts/3 pts/3 TIME CMD 0:00 ksh 0:00 ksh 0:00 ps 0:00 vi 60 20 61f4 60 20 59da 61 20 2605 60 20 4dff S Flag O I A T W Z Nonexistent Idle Active Stopped Swapped Zombie State Process state in a crash dump The state of a process can also be found in a crash dump using kdb: # kdb (0)> proc * SLOTNAME pvproc+000000 0 pvproc+000200 1 pvproc+000400 2 pvproc+000600 3 STATE PID PPID PGRP UID ADSPACE swapperACTIVE 00000 00000 00000 00000 00004812 init wait netm ACTIVE 00001 00000 00000 00000 0000342D ACTIVE 00204 00000 00000 00000 00004C13 ACTIVE 00306 00000 00000 00000 0000282A © Copyright IBM Corp. 2003 Unit 3.V2.3 Student Notebook Uempty Process state on a running system The state of a process can be found on a running system using the ps command.

The Process Table

Figure 3-10. The Process Table
(Diagram: the process table as an array of pvproc entries indexed by slot number 0, 1, 2, 3 ... NPROC; the pv_procp field of each pvproc points to its proc structure.)

Notes:

The process table
The kernel maintains a table entry for each process on the system. This table is called the process table. Each process is represented by one entry in the table. Each entry contains:
 - A process identifier
 - The process state
 - A list of threads
 - A description of the process’ address space
 - Other process management data

Process table
The process table is a fixed-length array of pvproc structures allocated from kernel memory. For the 64-bit kernel, this table is divided into a number of sections called zones. At system startup, one zone is allocated on each SRAD (see later topic, Table Management).

Slot number
Each entry in the process table is referred to by its slot number.

proc structure
The proc structure is an extension of the pvproc structure. The pv_procp in the pvproc points to its associated proc structure. The proc and pvproc structures are split to accommodate large system architectures.

pvproc

Element           Description
pv_pid            Unique process identifier (PID)
pv_ppid           Parent's process identifier (PPID)
pv_uid            User identifier
pv_stat           Process state
pv_flags          Process flags
*pv_procp         Pointer to the proc entry
*pv_threadlist    Head of list of threads
*pv_child         Head of list of children
*pv_siblings      NULL terminated sibling list

Figure 3-11. pvproc

Notes:

pvproc structure
The definition of the pvproc structure can be found in /usr/include/sys/proc.h. Some of the key elements are shown above.
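How the child and sibling pointers fit together can be sketched with a cut-down structure. This is not the definition from /usr/include/sys/proc.h: the sketch_ types are hypothetical stand-ins that keep only a few of the elements listed above, and the traversal assumes that a parent's children are chained from pv_child through each child's pv_siblings pointer.

```c
#include <stddef.h>
#include <sys/types.h>

/* Stand-in for the larger proc structure; contents omitted in this sketch. */
struct sketch_proc {
    int placeholder;
};

/* Simplified stand-in for pvproc with a subset of the listed elements. */
struct sketch_pvproc {
    pid_t pv_pid;                       /* process identifier          */
    pid_t pv_ppid;                      /* parent's process identifier */
    struct sketch_proc *pv_procp;       /* pointer to the proc entry   */
    struct sketch_pvproc *pv_child;     /* head of list of children    */
    struct sketch_pvproc *pv_siblings;  /* NULL terminated sibling list */
};

/* Count the children of a process by walking the sibling chain. */
size_t count_children(const struct sketch_pvproc *parent)
{
    size_t n = 0;
    const struct sketch_pvproc *c;

    for (c = parent->pv_child; c != NULL; c = c->pv_siblings)
        n++;
    return n;
}
```

The same walk is what one does by hand in kdb when following a process's children from one pvproc entry to the next.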

pv_stat

Values     Meaning
SNONE      Slot is not being used
SIDL       Process is being created
SACTIVE    Process has at least one active thread
SSWAP      Process is swapped out
SSTOP      Process is stopped
SZOMB      Process is zombie

Figure 3-12. pv_stat

Notes:

pv_stat
The process state is stored in the pvproc->pv_stat data element. Values for pv_stat are defined in /usr/include/sys/proc.h as shown in this table.

Process table size
The size of the process table determines how many processes the system can have. The size of the table is defined as NPROC in the file /usr/include/sys/proc.h.

Table Management

Figure 3-13. Table Management
(Diagram: the process table divided into zones, Zone 0, Zone 1 ... Zone 32; within Zone 0, the pages from Slot 0 up to the high water mark are pinned, with the zone extending to Slot 8192.)

Notes:

Table management
If the entire process table were pinned in memory it would consume a significant amount of memory. In reality the entire table is rarely needed; therefore, only a portion of the table is pinned into memory at one time.

Zones
The process table used in the 64-bit kernel is split into equal sized sections called zones. Each zone contains a fixed number of process slots. The number of zones, and the number of process slots per zone, is version dependent. The details can be determined by examining the value of PM_NUMSRAD_ZONES, defined in the header file <sys/pmzone.h>. At system startup, one zone is allocated on each SRAD in the system. When a zone on an SRAD fills up (that is, all of the process slots in that zone are used) then another zone is

allocated to the SRAD and added to the pool.

Large systems
On some systems (64-bit kernel only) a zone would typically be associated with a single RAD (a group of resources connected together by some physical proximity). At the moment, there is only one SRAD per system.

32-bit kernel
The process table on 32-bit kernels has only one zone encompassing the entire process table. A single high water mark is used and pages are pinned as explained below.

Pinning pages of the process table
Each zone of the process table contains a high water mark indicating the highest number of slots in the zone that have been in use. The memory pages containing the slots up to the high water mark are pinned in memory. As the table grows the high water mark is moved and additional pages of the table are pinned.

Details
Two structures are used to manage the process table. Both are defined in /usr/include/sys/pmzone.h. The table is defined by a struct pm_heap_global. This structure has pointers to several pm_heap structures, one for each zone in the table. The high water mark for the zone is found in the pm_heap.
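The high water mark scheme can be modeled in a few lines of C. This is a toy model, not the pm_heap code: the structure, the function and the figure of 8 slots per page are all hypothetical, and "pinning" is represented only by a counter.

```c
#include <stddef.h>

#define SKETCH_SLOTS_PER_PAGE 8u   /* assumed value, for illustration only */

/* Toy model of one process table zone. */
struct sketch_zone {
    size_t high_water;    /* slots 0 .. high_water-1 have been in use */
    size_t pinned_pages;  /* pages of the zone currently pinned       */
};

/*
 * Record that slot number `slot` is being used. If this moves the high
 * water mark onto pages that are not yet pinned, pin them.
 * Returns the number of pages newly pinned by this call.
 */
size_t zone_use_slot(struct sketch_zone *z, size_t slot)
{
    size_t newly_pinned = 0;

    if (slot + 1 > z->high_water) {
        size_t needed;

        z->high_water = slot + 1;
        needed = (z->high_water + SKETCH_SLOTS_PER_PAGE - 1)
                 / SKETCH_SLOTS_PER_PAGE;
        if (needed > z->pinned_pages) {
            newly_pinned = needed - z->pinned_pages;
            z->pinned_pages = needed;  /* a real kernel would pin pages here */
        }
    }
    return newly_pinned;
}
```

Reusing a slot below the high water mark costs nothing in this model, which is the point of the design: only growth of the table causes more memory to be pinned.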

Extending the pvproc

[Figure 3-14. Extending the pvproc: three SRADs, each with its own CPUs, its own proc structures, and its own zone of the pvproc table.]

Notes:

proc structure
The proc structure is an extension to the pvproc structure.

History
In older versions of AIX, the process table was made from an array of proc structures. In AIX 5L, each process is represented by two structures, the proc and a smaller pvproc.

Large systems
In some systems physical memory is divided into pools that have a degree of physical proximity to particular processors. Access speed to memory hosted from another processor may be slower than accessing memory hosted from the local processor. Using one large proc structure table could result in many "remote" accesses. The AIX 5L design allows the use of RADs (Resource Affinity Domains), a collection of resources grouped by some degree of physical proximity. An SRAD (scheduler RAD) is a RAD large enough to warrant a dedicated scheduler thread. The table of pvproc structures is separated into zones, which allows each zone to reside on its own SRAD, and refer to proc structures for processes running on that SRAD.

PID Format

Figure 3-15. PID Format

32-bit Kernel
  Bits 31-26   000000
  Bits 25-8    Process table slot index
  Bits 7-1     Generation count
  Bit 0        0

64-bit Kernel
  Bits 63-26   00...0
  Bits 25-13   Low order bits of process table slot index
  Bits 12-8    SRAD (upper bits of index)
  Bits 7-1     Generation count
  Bit 0        0

Notes:

Process identifier
The process identifier or PID is a unique number assigned to a process when the process is first created.

PID format
The format of a PID is shown above. It is composed of the process table slot number and a generation count. The generation count is incremented each time the process table slot is used. This means a process table slot can be used 128 times before a process ID is reused.

Bits                        Description
Bit 0                       Always set to zero making all PIDs even numbers, apart from init, which is a special case and always has process ID 1.
Generation count            A generation count used to prevent the rapid re-use of PIDs.
Process table slot index    The process table slot number.
SRAD (Scheduler Resource    These bits are used to select the zone on the process table. The number of bits used for the SRAD is
Affinity Domain)            version dependent, and defined by PM_NUMSRAD_BITS in <sys/pmzone.h>. AIX 5.1 uses 5 bits. AIX 5.2 currently uses 4 bits.
Remaining bits              Set to zero.

pid_t
Process identifiers are stored internally using the pid_t typedef.
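The 32-bit layout can be unpacked with shifts and masks. This is a minimal sketch that assumes the bit positions given for the 32-bit kernel PID (slot index in bits 25-8, generation count in bits 7-1); the structure and function names are hypothetical and do not come from any AIX header.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields of a 32-bit kernel PID. */
struct pid32_fields {
    uint32_t slot;        /* process table slot index, bits 25..8 */
    uint32_t generation;  /* generation count, bits 7..1 */
};

/* Split a PID into its slot index and generation count. */
struct pid32_fields decode_pid32(uint32_t pid)
{
    struct pid32_fields f;
    f.generation = (pid >> 1) & 0x7Fu;    /* 7 bits, so 128 values */
    f.slot       = (pid >> 8) & 0x3FFFFu; /* 18 bits */
    return f;
}

/* Rebuild a PID from its fields; bit 0 stays zero, so the result is even. */
uint32_t encode_pid32(uint32_t slot, uint32_t generation)
{
    return ((slot & 0x3FFFFu) << 8) | ((generation & 0x7Fu) << 1);
}
```

For example, the value 16388 (0x4004), read with this layout, gives slot 64 and generation count 2.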

Finding The Slot Number

[Figure 3-16. Finding the Slot Number: the SRAD bits and the process table index bits of the PID are swapped, so that the SRAD bits form the upper part and the index bits the lower part of the pvproc table slot number.]

Notes:

Finding the slot number
In a 32-bit kernel the process table slot number can easily be found from a PID by shifting the PID 8 bits to the right. In a 64-bit kernel the slot number is a combination of the SRAD bits with the index bits as shown above. On AIX 5.2, the SRAD field is 4 bits long, so the calculation is a little easier. On AIX 5.1, the SRAD field is 5 bits long; therefore, the index bits do not line up on an even nibble boundary. This makes calculating the slot number in your head a little difficult.

Why are the fields swapped?
The SRAD and index bits are shifted around so that indexing is partitioned by zones.
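The 64-bit combination can be sketched the same way. The example below assumes the AIX 5.1 layout from the PID format figure (5 SRAD bits in bits 12-8, 13 low-order index bits in bits 25-13). The macro and function names are hypothetical, and the index width for a 4-bit SRAD field is not stated in the text, so only the 5-bit case is shown.

```c
#include <stdint.h>

/* Field widths taken from the 64-bit PID layout; names are illustrative. */
#define SKETCH_GEN_SHIFT   8u  /* bit 0 plus the 7-bit generation count */
#define SKETCH_SRAD_BITS   5u  /* AIX 5.1 value given in the text */
#define SKETCH_INDEX_BITS 13u  /* bits 25..13 */

/* Combine the SRAD bits (upper part) with the index bits (lower part)
 * to form the pvproc table slot number. */
uint32_t pid64_to_slot(uint64_t pid)
{
    uint32_t srad = (uint32_t)(pid >> SKETCH_GEN_SHIFT)
                    & ((1u << SKETCH_SRAD_BITS) - 1u);
    uint32_t low  = (uint32_t)(pid >> (SKETCH_GEN_SHIFT + SKETCH_SRAD_BITS))
                    & ((1u << SKETCH_INDEX_BITS) - 1u);
    return (srad << SKETCH_INDEX_BITS) | low;
}
```

Because the SRAD bits end up as the high-order part of the slot number, all slots that share an SRAD value fall in one contiguous range of the table, which is what partitions the indexing by zone.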

Kernel Processes

Kernel processes:
- Are created by the kernel
- Have a private u-area and kernel stack
- Share text and data with the rest of the kernel
- Are not affected by signals
- Can not use shared library object code or other user-protection domain code
- Run in the Kernel Protection Domain
- Can have multiple threads, as can user processes
- Are scheduled like user processes, but tend to have higher priorities

Figure 3-17. Kernel Processes

Notes:

Kernel Processes
Some processes in the system are kernel processes. Kernel processes are created by the kernel itself and execute independently of user thread action.

Listing kernel processes
You can list the current kernel processes with the ps -k command.

# ps -k

[Sample output with the columns PID, TTY, TIME and CMD. PIDs listed: 0, 16388, 24582, ..., 98334, 114718, 163968, 172074. Entries listed: 0:02 swapper, 5681:27 wait, 11:20 wait, 0:00 j2pg, 0:00 rtcmd, 0:00 lvmbb, 0:00 dog.]

Thread Table

[Figure 3-18. Thread Table: an array of pvthread structures indexed by slot number (1, 2, 3, ... NTHREAD); the tv_threadp pointer in each pvthread points to its thread structure.]

Notes:

Thread Table
The kernel maintains a thread table. It is an array of pvthread structures allocated from kernel memory. Each kernel-managed thread is represented by one table entry which contains:
- A thread identifier (TID)
- A thread state
- Thread management data

The thread table is similar to the process table. Each entry in the table is referred to by its slot number. The thread table for 64-bit systems is divided into zones and the zones are allocated on different SRADs, just as with the process table.

thread structure
The thread structure is an extension of the pvthread structure. The tv_threadp item in the pvthread points to its associated thread structure. The thread and pvthread structures were split to accommodate large system architectures.

pvthread Elements

Element           Description
tv_tid            Unique thread identifier (TID)
*tv_threadp       Pointer to thread structure
*tv_pvprocp       Pointer to pvproc for this thread
*tv_nextthread    Pointer to next thread (pvthread) in the process
*tv_prevthread    Pointer to previous thread (pvthread) in the process
tv_state          Thread state

Figure 3-19. pvthread Elements

Notes:

pvthread and thread structures
Definitions for the pvthread and thread structures can be found in /usr/include/sys/thread.h.

Elements
Some of the key elements of the pvthread structure are shown above.

Table management
The memory pages for the thread table are managed using the same mechanism that was described for the process table. The thread table is split into multiple zones. The size of each zone and the number of zones are version dependent. Each zone contains a high water mark representing the largest slot number used since system boot. All memory pages for the slots up to the high water mark are pinned.

TID Format

Figure 3-20. TID Format

32-bit Kernel
  Bits 31-27   zero
  Bits 26-8    Thread table slot index
  Bits 7-1     Generation count
  Bit 0        1

64-bit Kernel
  Bits 63-27   zero
  Bits 26-13   Low order bits of thread table slot index
  Bits 12-8    SRAD (upper bits of index)
  Bits 7-1     Generation count
  Bit 0        1

Notes:

Thread identifier

Introduction
The thread identifier or TID is a unique number assigned to a thread. The format of a TID is shown above. The format of a TID is similar to that of a PID except that all TIDs are odd numbers and PIDs are even numbers.

tid_t
Thread identifiers are stored internally using the tid_t typedef.
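Because bit 0 distinguishes the two kinds of identifier, a TID can be recognized and unpacked with simple bit operations. This is a minimal sketch for the 32-bit layout (slot index in bits 26-8, generation count in bits 7-1); the function names are hypothetical.

```c
#include <stdint.h>

/* Bit 0 is 1 for a TID and 0 for a PID. The one exception on the PID
 * side is init, whose process ID is always 1. */
int looks_like_tid(uint64_t id)
{
    return (id & 1u) != 0;
}

/* Thread table slot index of a 32-bit kernel TID: bits 26..8 (19 bits). */
uint32_t tid32_slot(uint32_t tid)
{
    return (tid >> 8) & 0x7FFFFu;
}

/* Generation count of a 32-bit kernel TID: bits 7..1. */
uint32_t tid32_generation(uint32_t tid)
{
    return (tid >> 1) & 0x7Fu;
}
```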

u-block

u-block
- Location: process private memory segment
- Definition: /usr/include/sys/user.h

uthread
- Thread private data
- stack pointers
- mstsave

user
- shared between all threads in the process

Figure 3-21. u-block

Notes:

Introduction
Each process (including a kernel process) contains a u-block area. The u-block is made up of a user structure (one per process) and one or more uthreads (one per thread).

Access
The u-block is part of the process private memory segment; however, it is only accessible when in kernel mode. It maintains the process state information which is only required when the process is running; therefore, it need not be accessible when the process is not running. It need not be in memory when the process is swapped out. It is pinned when the process is swapped into memory, and unpinned when the process is swapped out.

For example.h. Information stored in the user structure is global and shared between all threads in the process. Threads are responsible for storing execution context. 2001. the uthread holds execution-specific items like the stack pointers and CPU registers. When a thread is interrupted or a context switch occurs the stack pointers and CPU registers of the interrupted thread are stored in the mst-save area of the uthread. 2003 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. uthread Each thread of a process has its own uthread structure. 3-36 Kernel Internals © Copyright IBM Corp. user Each process has one user structure. the file descriptor table and the user credentials are kept in the user structure. therefore.Student Notebook Definitions The u-block is described in the file /usr/include/sys/user. When execution of the thread continues the stack pointers and registers are loaded from the mst-save area. .

Six Structures

[Figure 3-22. Six Structures: one pvproc, three pvthread, one proc and three thread structures, plus a u-block holding three uthread sections and one user structure. The pointers shown are pv_threadlist, pv_procp, tv_pvprocp, tv_nextthread, tv_threadp, t_pvthreadp, t_procp, t_uthreadp, t_userp and U_procp.]

Notes:

Introduction
This unit has discussed the AIX 5L data structures: pvproc, proc, pvthread, thread, uthread and user. This section describes how these six structures are tied together.

Diagram
The above diagram depicts the structures for a single process containing three kernel-managed threads.

proc and thread
From the pvproc structure the first pvthread can be found by following the pv_threadlist pointer. All the pvthread structures for the process are linked via a circular doubly-linked list (see pointers tv_nextthread and tv_prevthread). The pvproc is extended into the proc structure via the pv_procp pointer. Similarly, the pvthread structures are extended into the thread structures via the tv_threadp pointer.

u-block
The u-block is divided into uthread sections, one per thread, and one process-wide user structure. Pointers in the thread structure point to both of these sections. Data that is private to the thread, like the stack pointers, is kept in the uthread. Process-wide data, for example the file descriptor table, is kept in the user area. This allows all threads in a process to share the same open files.
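Walking the threads of a process follows directly from these pointers. The sketch below uses deliberately reduced stand-ins for the real structures: only the pointer names pv_threadlist, tv_nextthread and tv_prevthread come from the text, and everything else (the field types, the function name) is an assumption made for illustration.

```c
#include <stddef.h>

/* Reduced stand-ins for the kernel structures; the real definitions
 * contain many more fields. */
struct pvthread {
    int tv_tid;                      /* thread identifier */
    struct pvthread *tv_nextthread;  /* next pvthread in the process */
    struct pvthread *tv_prevthread;  /* previous pvthread in the process */
};

struct pvproc {
    struct pvthread *pv_threadlist;  /* first pvthread of the process */
};

/* Count the threads of a process by walking the circular list once.
 * The walk stops when it arrives back at the first pvthread. */
size_t count_threads(const struct pvproc *pv)
{
    const struct pvthread *first = pv->pv_threadlist;
    const struct pvthread *t = first;
    size_t n = 0;

    if (first == NULL)
        return 0;
    do {
        n++;
        t = t->tv_nextthread;
    } while (t != first);
    return n;
}
```

Because the list is circular there is no NULL terminator; the comparison with the starting element is what ends the loop.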

Thread Scheduling Topics

- Thread states
- Thread priorities
- Run queues
- Software components of the kernel
  - Scheduler
  - Dispatcher
- Scheduling algorithms
- Support for SMP and large systems

Figure 3-23. Thread Scheduling Topics

Notes:

Introduction
The object of thread scheduling is to manage the CPU resources of the system, sharing these resources between all the threads.

Thread State Transitions

[Figure 3-24. Thread State Transitions: a state diagram with the states Idle, Ready to Run, Running, Sleeping, Stopped by a signal, and Zombie.]

Notes:

Introduction
In AIX, the kernel allows many threads to run at the same time, but there can be only one thread actually executing on each CPU at one time. The thread state shows if a thread is currently running or is inactive.

State transitions
Threads can be in one of several states. A thread typically changes its state between running, ready to run, sleeping and stopped several times during its lifetime. The diagram above shows all the state transitions a thread can make.

States
All the thread states are described in this table:

State           Description
Idle            When first created a thread is placed in the idle state. This state is temporary until all of the necessary resources for the thread have been allocated.
Ready to Run    Once the new thread creation is completed, it is placed in the ready to run state. The thread waits in this state until the thread is run.
Running         A thread in the running state is the thread executing on a CPU. The thread state will change between running and ready to run until the thread finishes execution; the thread then goes to the zombie state.
Sleeping        Whenever the thread is waiting for an event, the thread is said to be sleeping.
Stopped         A stopped thread is a thread stopped by the SIGSTOP signal. Stopped threads can be restarted by the SIGCONT signal.
Swapped         Though swapping takes place at the process level and all threads of a process are swapped at the same time, the thread table is updated whenever the thread is swapped.
Zombie          The zombie state is an intermediate state for the thread lasting only until all the resources owned by the thread are given up.

tv_state
The thread state is kept in the tv_state flag of the pvthread structure. The defined values for this flag are:

Flag       Meaning
TSNONE     slot is available
TSIDL      being created (idle)
TSRUN      runnable (or running)
TSSLEEP    awaiting an event (sleeping)
TSSWAP     swapped
TSSTOP     stopped
TSZOMB     being deleted (zombie)

Running threads
No tv_state flag value has been defined for the running state. The running state is implied when a thread is currently being run; therefore, a flag is not necessary. The value of the tv_state flag for running threads will be shown as ready to run (TSRUN). A thread must be ready to run before it can be run. A thread that is ready to run has a state of TSRUN and a wait type of TWCPU, i.e., the thread is waiting for CPU access. A thread that is actually running has a state of TSRUN and a wait type of TNOWAIT.
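The distinction between a running thread and a ready to run thread can be expressed as a pair of small predicates. This is a sketch only: the flag names come from the text, but the numeric values assigned by the enums below are assumptions, and the function names are hypothetical.

```c
/* Flag names from the text; the numeric values here are assumed and
 * need not match the real header. */
enum sketch_state { TSNONE, TSIDL, TSRUN, TSSLEEP, TSSWAP, TSSTOP, TSZOMB };
enum sketch_wtype { TNOWAIT, TWCPU };

/* Running: state TSRUN and not waiting for anything. */
int is_running(enum sketch_state state, enum sketch_wtype wtype)
{
    return state == TSRUN && wtype == TNOWAIT;
}

/* Ready to run: state TSRUN, but still waiting for a CPU. */
int is_ready_to_run(enum sketch_state state, enum sketch_wtype wtype)
{
    return state == TSRUN && wtype == TWCPU;
}
```

Both cases share the TSRUN state, so the wait type is the only thing that tells them apart.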

Thread Priority

[Figure 3-25. Thread Priority: priority values run from 0 to 255; the range beginning at 0 is labeled kernel, PUSER = 40 is the highest priority for user mode, and 255 is the lowest priority.]

Notes:

Introduction
All threads are assigned a priority value and a nice value. The dispatcher examines these values to determine what thread to run.

Thread priority
Each thread is assigned a priority number between 0 and 255. CPU time is made available to threads according to their priority number. Precedence is given to the thread with the lowest priority number. The highest priority a thread can run in user mode is defined as PUSER or 40. Priorities above PUSER (that is, numerically lower) are used for real-time threads.

Lower number means high priority
Do not confuse a high priority value with a high priority thread. The two are inversely related. In other words, a thread with a numerically low priority value is more important than one with a larger value.

nice
Each process is assigned a nice value between 0 and 39. The nice value is used to adjust thread priority. The default value for nice is 20. A process' nice value is saved in the proc structure as p_nice=nice+PUSER. The nice value of a process can be set using the nice command or changed using the renice command.

Run Queues

[Figure 3-26. Run Queues: a run queue with one list of threads per priority value (0, 20, 40, 60, 80, 100, ... 255 are shown); the wait thread is on the list for priority 255.]

Notes:

Introduction
All runnable threads on the system (except the currently running threads) are listed on a run queue. A run queue is arranged as a set of doubly-linked lists, with one linked list for each thread priority. Since there are 256 different thread priorities, a single run queue consists of 256 linked lists. A single CPU system has one run queue. AIX selects the next thread to run by searching the run queues for the highest priority (that is, numerically lowest) runnable thread.

Wait thread
The wait thread is always ready to run, and has a priority value of 255. It is the only thread on the system that will run at priority 255. If AIX finds no other ready to run thread, it will run the wait thread.
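A run queue of this shape can be modeled as an array of 256 list heads. The following is a minimal sketch under stated assumptions: the structure and function names are hypothetical, each list is kept NULL-terminated for simplicity, and threads are appended at the tail and taken from the head. It illustrates the search order only and is not the kernel's implementation.

```c
#include <stddef.h>

#define SKETCH_NPRI 256  /* one list per priority value, 0..255 */

/* Hypothetical, reduced thread record for the sketch. */
struct rq_thread {
    int priority;                   /* 0 (most favored) .. 255 */
    struct rq_thread *next, *prev;  /* doubly-linked list links */
};

struct run_queue {
    struct rq_thread *head[SKETCH_NPRI];
    struct rq_thread *tail[SKETCH_NPRI];
};

/* Append a thread to the list that matches its priority. */
void rq_enqueue(struct run_queue *rq, struct rq_thread *t)
{
    int p = t->priority;
    t->next = NULL;
    t->prev = rq->tail[p];
    if (rq->tail[p] != NULL)
        rq->tail[p]->next = t;
    else
        rq->head[p] = t;
    rq->tail[p] = t;
}

/* Remove and return the first thread on the numerically lowest
 * non-empty list, or NULL if the whole run queue is empty. */
struct rq_thread *rq_pick(struct run_queue *rq)
{
    for (int p = 0; p < SKETCH_NPRI; p++) {
        struct rq_thread *t = rq->head[p];
        if (t != NULL) {
            rq->head[p] = t->next;
            if (t->next != NULL)
                t->next->prev = NULL;
            else
                rq->tail[p] = NULL;
            t->next = t->prev = NULL;
            return t;
        }
    }
    return NULL;
}
```

A thread queued at priority 255, like the wait thread, is only returned when every list with a lower number is empty.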

Dispatcher and Scheduler Functions

Dispatcher
- Searches the run queues for the highest priority thread
- Dispatches the most-favored thread (highest priority)
- Invoked at various points in the kernel, including:
  - By the clock interrupt (every 1/100th of a second)
  - When the running thread gives up the CPU

Scheduler
- Runs once a second
- Recalculates thread priority for all runnable threads based on:
  - The amount of CPU time a thread has received
  - The priority value
  - The nice value

Figure 3-27. Dispatcher and Scheduler Functions

Notes:

Introduction
AIX is designed to handle many simultaneous threads. The scheduling and running of threads are the jobs of the dispatcher and scheduler.

Clock ticks
A clock tick is 1/100 of a second. The number of clock ticks a thread has accumulated will be used to calculate a new priority for the thread by the scheduler. Generally, a thread that has accumulated many clock ticks will have its priority decreased (i.e., the priority value will grow larger).

Dispatcher

Step  Action
1     If invoked because a clock tick has passed, then increment the t_cpu element of the currently running thread. t_cpu is limited to a maximum value of T_CPU_MAX.
          if (thread->t_cpu < T_CPU_MAX) thread->t_cpu++;
2     Scan the run queue(s) looking for the highest priority ready-to-run thread.
3     If the selected thread is different from the currently running thread, place the currently running thread back on the run queue, and place the selected thread at the end of the MST chain.
4     Resume execution of the thread at the end of the MST chain.

Figure 3-28. Dispatcher

Notes:

Dispatcher
The dispatcher runs under the following circumstances:
- A time interval has passed (1/100 sec).
- A thread has voluntarily given up the CPU.
- A thread (from a non-threaded process) that has been boosted is returning to user mode from kernel mode.
- A thread has been made runnable by an interrupt and the processor is about to finish interrupt processing and return to INTBASE.

The steps the dispatcher takes are listed above.
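Steps 1 to 3 can be condensed into a small decision function. This is a sketch under several assumptions: the value given to T_CPU_MAX is assumed for illustration, the structure and function names are hypothetical, the best ready thread is passed in rather than found by scanning a run queue, and a ready thread replaces the current one only when its priority value is strictly lower. Run queue and MST chain handling are left out.

```c
#include <stddef.h>

#define T_CPU_MAX 120u  /* assumed value, for illustration only */

/* Hypothetical, reduced thread record for the sketch. */
struct dthread {
    int priority;    /* lower number means more favored */
    unsigned t_cpu;  /* clock ticks charged to the thread */
};

/* Charge a tick to the current thread if a clock tick triggered the
 * dispatcher, then return the thread that should run next. */
struct dthread *dispatch(struct dthread *current,
                         struct dthread *best_ready,
                         int clock_tick)
{
    if (clock_tick && current->t_cpu < T_CPU_MAX)
        current->t_cpu++;

    if (best_ready != NULL && best_ready->priority < current->priority)
        return best_ready;  /* the current thread is preempted */
    return current;
}
```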

Scheduler

Step  Action
1     If the value of nice is greater than the default value of 20, double its value, making it possible to more strongly discriminate against upwardly nice'd threads.
          if ( p_nice > 60 ) new_nice = 2 × ( p_nice - 60 ) + 60
      Otherwise new_nice is the same as p_nice.
2     Calculate the new priority using the equation:
          priority = new_nice + t_cpu × ( r / 32 ) × ( ( new_nice + 4 ) / 64 )
3     Degrade the value of t_cpu so that ticks the thread has used in the past have less effect than recent ticks.
          t_cpu = t_cpu × ( d / 32 )

Figure 3-29. Scheduler

Notes:

Scheduler
The scheduler runs every second. Its job is to recalculate the priority of all runnable threads on the system. The priority of a sleeping thread will not be changed. The steps the scheduler uses to calculate thread priorities are shown in the table above. Recall that the value of p_nice is: nice+PUSER. Given: PUSER=40 and 0<=nice<=39.

r and d
The r and d values control how a process is impacted by the run time. r impacts how severely a process is penalized by used CPU time, while d controls how fast the system "forgives" previous CPU consumption. The values of r and d can be set using the schedo command.

0 <= r,d <= 32
The default value for r and d is 16.
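The three scheduler steps translate into a few lines of integer arithmetic. The sketch below follows the formulas as given. The function names are hypothetical, the order of the integer multiplications and divisions is an assumption, and the cap at 254 is also an assumption, based only on the statement that the wait thread is the only thread that runs at priority 255.

```c
#define PUSER 40
#define SKETCH_PRI_CAP 254  /* assumed cap; 255 is left to the wait thread */

/* p_nice as stored in the proc structure. */
int p_nice_from_nice(int nice)
{
    return nice + PUSER;
}

/* Step 1: double the part of p_nice that lies above the default (60). */
int effective_nice(int p_nice)
{
    return (p_nice > 60) ? 2 * (p_nice - 60) + 60 : p_nice;
}

/* Step 2: new priority from new_nice, accumulated ticks and r. */
int recalc_priority(int p_nice, int t_cpu, int r)
{
    int new_nice = effective_nice(p_nice);
    int priority = new_nice + (t_cpu * r * (new_nice + 4)) / (32 * 64);
    return (priority > SKETCH_PRI_CAP) ? SKETCH_PRI_CAP : priority;
}

/* Step 3: decay the accumulated ticks by d/32. */
int decay_t_cpu(int t_cpu, int d)
{
    return (t_cpu * d) / 32;
}
```

As a worked case with the defaults: nice = 20 gives p_nice = 60 and new_nice = 60. With t_cpu = 64 and r = 16 the CPU penalty is 64 × (16/32) × (64/64) = 32, so the new priority is 92. With d = 16, t_cpu then decays from 64 to 32.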

Preemption

- What is preemption?
- Non-preemptive kernel vs. preemptive kernel
- Preventing deadlock in preemptive kernels
- Priority boost

Figure 3-30. Preemption

Notes:

Preemption

Definition
When the dispatcher runs and finds a runnable thread with a higher priority than the current running thread, the running context is switched to the higher priority thread. The thread that was displaced before its time slice expired is said to have been preempted.

Non-preemptive kernel
Most UNIX systems will not allow preemption to occur when running in kernel mode. If the current running thread is in kernel mode and a higher priority thread becomes ready to run, it will not be granted CPU time until the running thread returns to user mode and voluntarily gives up the CPU. This can result in long delays in processing high-priority or real-time threads.

Preemption in kernel mode
AIX allows thread preemption in kernel mode. This feature supports real-time processing where a real-time thread must respond to an action in a known time-frame.

Preemptive Kernels

[Figure 3-31. Preemptive Kernels: Thread A (low priority) is holding a lock, Thread B (high priority) is waiting for the lock, and Thread C (medium priority) is running.]

Notes:

Problems with preemptive kernels
The above scenario demonstrates the problem that AIX has solved to make kernel preemption work. In this example threads A, B and C are all running in kernel mode.

Step  Action
1.    Thread A, a low priority thread, has obtained access to an exclusive resource lock.
2.    Thread B, running at a higher priority, is waiting to obtain the same resource lock. This thread cannot continue until thread A releases the lock.
3.    Thread C's priority is higher than thread A's and is ready to run. When the dispatcher runs, thread C preempts thread A.
4.    Thread A is still holding the resource lock. Thread A is not running so it can't release the lock. Even though thread B is the highest priority thread on the system it can't proceed until it obtains the resource held by thread A.

Priority boost
To resolve this situation, priority boost was added to AIX. Priority boost increases the priority of threads holding locks.
- When a high priority thread has to wait for a lock, it changes the priority of the thread that is holding the lock to its own priority.
- The priority boost only applies to the "low priority" thread when it is holding the lock. The priority is set back to the original value when either:
  - The scheduler notices that the boosted thread is no longer holding any locks.
  - The high priority thread that was waiting for the lock obtains the lock.
  - The boosted thread returns to user mode from kernel mode. A thread running in kernel mode must release any kernel locks it holds before returning to user mode.
- Priority boost applies to both kernel locks and user (pthreads library) locks.
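The boost rule can be modeled in a few lines. This is a minimal sketch of the idea only: all structure, field and function names are hypothetical, and the bookkeeping that decides when a boost is removed is reduced to a single reset function.

```c
#include <stddef.h>

/* Hypothetical, reduced records for the sketch. */
struct lthread {
    int priority;       /* current priority; lower number is more favored */
    int base_priority;  /* priority to restore when the boost ends */
};

struct klock {
    struct lthread *holder;  /* thread holding the lock, or NULL */
};

/* Called when 'waiter' has to wait for 'lk': if the waiter is more
 * favored than the holder, the holder takes on the waiter's priority. */
void boost_on_wait(struct klock *lk, const struct lthread *waiter)
{
    if (lk->holder != NULL && waiter->priority < lk->holder->priority)
        lk->holder->priority = waiter->priority;
}

/* Called when one of the conditions for ending the boost is met. */
void end_boost(struct lthread *t)
{
    t->priority = t->base_priority;
}
```

In the scenario above, thread A would take on thread B's priority while it holds the lock, so the medium priority thread C can no longer preempt it.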

Scheduling Algorithms

SCHED_RR
- Fixed priority
- Threads are timesliced

SCHED_FIFO
- Fixed priority
- Threads ignore timeslicing

SCHED_OTHER
- Default policy
- Priority based on CPU time and nice value

Figure 3-32. Scheduling Algorithms

Notes:

Introduction
AIX has 3 main types of scheduling algorithms that affect how a thread's priority is calculated by the scheduler. The main algorithms as defined in <sys/sched.h> are listed in the visual above.

SCHED_RR
This is a round robin scheduling mechanism in which the thread is time-sliced at a fixed priority. The amount of CPU time and the nice value have no effect on the thread's priority.
- This scheme is similar to creating a fixed-priority, real-time process.
- The thread must have root authority to be able to use this scheduling mechanism.
- It is possible to create a thread with SCHED_RR that has a high enough priority that it could monopolize the processor if it is always runnable and there are no other runnable threads with the same (or higher) priority.

SCHED_FIFO
Similar to SCHED_RR, however:
- The thread runs at fixed priority and is not time-sliced.
- It will be allowed to run on a processor until it voluntarily relinquishes by blocking or yielding, or until a higher priority thread is made runnable.
- A thread using SCHED_FIFO must have root authority to use it.
- It is possible to create a thread with SCHED_FIFO that has a high enough priority that it could monopolize the processor if it is always runnable.

There are actually three other related policies, SCHED_FIFO2, SCHED_FIFO3 and SCHED_FIFO4. The FIFO policies differ in how they return threads to the run queue, and thereby provide a way of differentiating between their effective priorities. See the Performance Management Guide of the AIX online documentation for more details.

SCHED_OTHER
This is the default AIX scheduling policy that was discussed earlier. Thread priority is constantly being adjusted based on the value of nice and the amount of CPU time a thread has received. Priority degrades with CPU usage.

Choosing scheduling algorithms
By default a thread will run with the SCHED_OTHER scheduling algorithm. Threads running as the root user can change scheduling algorithms using the thread_setsched() subroutine.

    int thread_setsched (tid, priority, policy)
    tid_t tid;
    int priority;
    int policy;
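The differences between the three main policies can be summarized as two yes/no properties. The sketch below is a model only: it defines its own enum with hypothetical names instead of using the constants from <sys/sched.h>, and treating SCHED_OTHER as time-sliced is an assumption, since the text states this only for SCHED_RR.

```c
/* Hypothetical stand-ins for the policy constants. */
enum sketch_policy { POL_OTHER, POL_RR, POL_FIFO };

/* SCHED_RR and SCHED_FIFO run at a fixed priority; under SCHED_OTHER
 * the scheduler keeps recalculating the priority. */
int priority_is_fixed(enum sketch_policy p)
{
    return p == POL_RR || p == POL_FIFO;
}

/* SCHED_FIFO threads ignore time slicing. SCHED_RR threads are
 * time-sliced; SCHED_OTHER is assumed to be time-sliced as well. */
int is_time_sliced(enum sketch_policy p)
{
    return p != POL_FIFO;
}
```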

t_policy
The scheduling policy a thread is using is stored in: thread->t_policy.

SMP - Multiple Run Queues

[Figure 3-33. SMP - Multiple Run Queues: a global run queue, plus a separate run queue for each of CPU 0, CPU 1 and CPU 2.]

Notes:

Introduction
On Symmetric Multi-Processing (SMP) systems, per-CPU run queues are used to compensate for the multiple memory caches used on these systems.

Memory cache
Each CPU in a symmetric multi-processing system has its own memory cache. The purpose of the cache is to speed up processing by pre-loading blocks of physical memory into the higher speed cache.

Cache warmth
A thread is said to have gained cache warmth to a CPU when a portion of the process memory has been loaded into the CPU's cache. In an SMP system, threads can be scheduled onto any CPU. The best performance is achieved when a thread runs on a CPU where it has gained some cache warmth. The AIX thread scheduler takes advantage of cache warmth by attempting to schedule a thread on the same CPU it ran on last.

Multiple run queues
In addition to a global run queue, each CPU is given its own run queue. Each CPU draws work from its own run queue, selecting the highest priority work from its queue.

Soft cache affinity
By having a run queue for each processor, we allow for some measure of soft cache affinity. As long as the thread is in the same CPU run queue it will run on the same processor.

Load balancing
The system uses load balancing techniques to ensure that work is distributed evenly between all of the CPUs in the system.

Hard affinity
Threads can be bound to a single CPU, meaning they are never placed in the global run queue. This is called hard affinity. The bindprocessor() subroutine is used to give a single thread or all threads of a process hard affinity to a CPU. Hard affinity (or binding) is recorded in thread->t_cpuid. If t_cpuid is set to PROCESSOR_CLASS_ANY=-1 the thread is not using hard affinity (note that t_cpuid=0 means bound to cpu 0).

RT_GRQ
If a thread has exported the environment variable RT_GRQ=ON, it will sacrifice soft cache affinity. The thread will be placed only in the global run queue and hence run on the first available CPU.
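The rules for hard affinity, RT_GRQ and soft affinity can be combined into one queue selection function. This is a sketch under stated assumptions: only t_cpuid and the value of PROCESSOR_CLASS_ANY come from the text, while the last_cpu and rt_grq fields, the SKETCH_GLOBAL_RQ marker and the function name are hypothetical. Load balancing is not modeled.

```c
#define PROCESSOR_CLASS_ANY (-1)  /* value given in the text */
#define SKETCH_GLOBAL_RQ    (-1)  /* hypothetical marker for the global run queue */

/* Hypothetical, reduced thread record for the sketch. */
struct sthread {
    int t_cpuid;   /* bound CPU, or PROCESSOR_CLASS_ANY if not bound */
    int last_cpu;  /* CPU the thread last ran on, or -1 if none (assumed field) */
    int rt_grq;    /* nonzero if RT_GRQ=ON was exported (assumed field) */
};

/* Return the CPU whose run queue should receive the thread, or
 * SKETCH_GLOBAL_RQ for the global run queue. */
int choose_run_queue(const struct sthread *t)
{
    if (t->t_cpuid != PROCESSOR_CLASS_ANY)
        return t->t_cpuid;        /* hard affinity: always the bound CPU */
    if (t->rt_grq)
        return SKETCH_GLOBAL_RQ;  /* RT_GRQ: global run queue only */
    if (t->last_cpu >= 0)
        return t->last_cpu;       /* soft affinity: stay on the same queue */
    return SKETCH_GLOBAL_RQ;      /* assumption: no history yet */
}
```

Note that a bound thread with t_cpuid equal to 0 is sent to CPU 0, which is why the unbound case needs the separate value -1.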

NUMA

[Visual: several nodes, each with CPUs, node local memory, I/O, and a remote cache, connected to one another through a memory interconnect.]
Figure 3-34. NUMA

Notes:
In a true SMP architecture, the S stands for symmetric. This means that any CPU can access any piece of memory with virtually the same cost in terms of latency and bandwidth. The SMP architecture has a limit on the size that it can grow to, both in terms of the number of CPUs and the amount of memory. The limits grow over time as individual technologies improve (such as processor speed and memory bandwidth); however, there is still a point at which adding more CPUs or adding more memory actually degrades performance.

One approach that has been taken in the past to allow the development of large systems is to use building blocks of SMP systems and couple them together into a single system. A good example of this is the NUMA-Q systems developed by Sequent. NUMA stands for Non-Uniform Memory Access.

The memory in a NUMA system is effectively divided into two classes: local memory, which is on the same system building block as the CPU trying to access it, and remote memory, which is located on a different system building block. In a NUMA architecture, there are relatively large differences in access latency (approximately 1 order of magnitude) and bandwidth between local and remote memory.

Local vs. remote memory access
Access to memory on the same node as the device requesting the access is defined as local access. Accessing memory on a different node is defined as remote access. To the device (CPU or I/O) accessing the memory, remote and local accesses are identical with the exception of speed; remote memory access may be slower.

Memory Affinity

[Visual: four multi-chip modules (MCM 0 to MCM 3), each holding processors (P) that share L2 caches, with attached L3 caches, GX slots, and memory slots.]
Figure 3-35. Memory Affinity

Notes:
The visual above shows the system architecture of the pSeries 690. This system is an SMP system that has some characteristics of a NUMA system. Looking at this diagram, we could consider this architecture to be a single system (since all of the components are inside a single cabinet). However, if we examine the diagram more closely, we can see that each MCM has two attached memory cards. Some memory is 'local' to a processor, and other parts of memory are 'remote'. The major difference between this architecture and a true NUMA one is that the latency and bandwidth differences between local and remote access are much smaller.

We could consider an MCM and its two memory cards to be a RAD, since these resources have a degree of physical proximity when compared to other parts of memory or other processors.

Definitions
This section defines some additional terms:
- RAD - Resource Affinity Domain, is a group of resources connected together by some physical proximity.
- A RAD exists at multiple levels. The top level is the entire system; the bottom, or atomic, level consists of individual CPUs and memory.
- SDL - System Decomposition Level. The SDL determines how small the RAD will be.
- SRAD - Scheduler RAD is the RAD that the scheduler will operate on, usually a physical node.

Global Run Queues

[Visual: one global run queue above two SRADs; CPU 0 through CPU 7 are divided between the SRADs.]
Figure 3-36. Global Run Queues

Notes:

Introduction
This section talks about design enhancements to facilitate future systems. The goal of the thread scheduler is to balance the process load between all the CPUs in the system and reduce the amount of time a runnable thread waits to be run when other CPUs are idle.

Run queues
The design of the AIX 5L thread scheduler has been extended to allow per-node run queues and one global run queue.

Process placement
For most applications the most frequent memory access is to the process’ text. Other frequent accesses include private data, stack and some kernel data structures. To minimize memory access time, the process text, data, stack and kernel data structures are allocated from memory on the RAD containing the CPUs that will execute the threads belonging to that process. This RAD or set of RADs is called the process home RAD.

RAD affinity scheduling
The purpose of RAD affinity scheduling is to exploit RAD local memory and RAD level caches by allocating a process’ private memory and text on the RAD(s) where it will be executed, and conversely, by attempting to execute process threads on CPUs where there is cache warmth.

Process migration
In order to keep the system efficient, AIX will occasionally migrate a process between SRADs. For a process to migrate, its memory must be copied to the process’ new home RAD.

Logical attachment
Processes that share resources may be logically attached. Logically attached processes are required to run on the same RAD. An API is provided for the control of logical attachments.

Physical attachment
Processes can be attached to a physical collection of resources (CPU and memory) called an RSet. Processes attached to an RSet can only migrate between members of the RSet.

Checkpoint

1. The process table is an _____ of _______ structures.
2. All process IDs (except pid 1) are _____.
3. A thread table slot number is included in a thread ID. True or False?
4. A thread holding a lock may have its priority _______.
5. AIX provides _____ programming models for user threads.
6. A new thread is created by the __________ system call.

Figure 3-37. Checkpoint

Exercise

Complete exercise 3
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a [icon]

What you will do:
- Examine the process and thread structures using kdb
- Apply what you learned to the analysis of a crash dump
- Learn about and configure system hang detection
- Explore how signal information is stored and used in AIX

Figure 3-38. Exercise

Notes:

Introduction
Turn to your lab workbook and complete exercise three. Read the information blocks contained within the exercise. They provide you with information you need to do the exercise.

Unit Summary

- The primary unit of execution in AIX is the thread.
- The six structures of a process are: pvproc, proc, pv_thread, thread, user, u_thread.
- AIX has three thread programming models available: 1:1, M:1, M:N.
- The dispatcher selects what thread to run.
- The scheduler adjusts thread priority based on: nice, CPU time.
- Scheduling algorithms are SCHED_RR, SCHED_FIFO, SCHED_OTHER.
- Processes can handle or ignore signals; threads can mask signals.

Figure 3-39. Unit Summary

Unit 4. Addressing Memory

What This Unit Is About
This unit describes how memory is organized and addressed in AIX 5L.

What You Should Be Able to Do
After completing this unit, you should be able to:
• List the types of addressing spaces used by AIX 5L
• List the attributes associated with each segment type
• Given the effective address of a memory object, identify the segment number and object type

How You Will Check Your Progress
Accountability:
• Exercises using your lab system
• Unit review

References
PowerPC Microprocessor Family: The Programmers Reference Guide
Available from http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC

Unit Objectives

At the end of this lesson you should be able to:
- List the types of addressing spaces used by AIX 5L.
- List the attributes associated with each segment type.
- Given the effective address of a memory object, identify the segment number and object type.

Figure 4-1. Unit Objectives

Memory Management Definitions

- Page
- Frame
- Address Space
- Effective address space
- Virtual address space
- Physical address space

Figure 4-2. Memory Management Definitions

Notes:

Introduction
To explore how AIX 5L addresses memory we must first define the terms and concepts listed above.

Pages and Frames

[Visual: real (physical) memory divided into page frames; page size = 4096 bytes.]
Figure 4-3. Pages and Frames

Notes:

Introduction
AIX manages memory in 4096-byte chunks called pages. Pages are organized and stored in real (physical) memory chunks called frames.

Page
A page is a fixed-sized chunk of contiguous storage that is treated as the basic entity transferred between memory and disk. Pages stay separate from each other; they do not overlap in virtual address space. AIX 5L uses a fixed page size of 4096 bytes. The smallest unit of memory managed by hardware and software is one page.
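Because the page size is a power of two, the page that holds an address and the offset within that page can be found with a shift and a mask. The helpers below are a minimal sketch of that arithmetic; the function and macro names are made up for this example and are not AIX interfaces.

```c
#include <stdint.h>

#define PAGE_SIZE  4096u  /* page size stated in the notes */
#define PAGE_SHIFT 12     /* log2(4096) */

/* Number of the page that contains the given byte address. */
uint64_t page_number(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Byte offset of the address within its page (0..4095). */
uint32_t page_offset(uint64_t addr)
{
    return (uint32_t)(addr & (PAGE_SIZE - 1));
}

/* Number of whole pages needed to hold nbytes, rounded up. */
uint64_t pages_needed(uint64_t nbytes)
{
    return (nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

For example, address 0x12345 lies in page 0x12 at offset 0x345, and an object of 4097 bytes needs two pages.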

Frame
The place in real memory used to hold the page is called the frame. Whereas a page is a collection of information, a frame is the place in memory to hold that information.

Large page support
POWER4 processors can handle 16MB pages. AIX 5L can be configured to allow a number of large page segments. See the AIX online documentation for more information.

Address Space

[Visual: the effective address spaces of Process 1 and Process 2 map into the virtual address space, which is backed by physical memory, filesystem pages, and paging space.]
Figure 4-4. Address Space

Notes:

Introduction
An address space is memory (real or virtual) defined by a range of addresses. AIX 5L defines several different address spaces:
- Effective address space
- Virtual address space
- Physical address space

Effective address space
Effective addresses are those referenced by the machine instructions of a program or kernel. The effective address space is the range of addresses defined by the instruction set. The effective address space is mapped to physical address space or to disk files for each process. However, programs and processes ‘see’ one contiguous address space.

Virtual address space
The virtual address space is the set of all memory objects that could be made addressable by the hardware. The virtual address space is bigger than the effective address space, since it is addressed using more address bits. Processes have access to a limited range of virtual addresses given to them by the kernel.

Physical address space
The physical address space is dependent on how much physical memory (DRAM) is on the machine, and the number of PCI host bus controllers in the machine. The physical address space is mapped to the machine’s hardware memory; however, depending on how much memory is installed, it may not be referenced in a single contiguous range. For example, a system with 8GB of memory installed may use the ranges 0-3GB and 4GB-9GB to reference physical memory, rather than a single 0-8GB address range. Physical addresses in the range 3GB-4GB would be used to access devices connected to PCI host bus controllers. There is more information about this subject in a later unit covering the implementation of LPAR.

Paging space
The paging space is the disk area used by the memory manager to hold inactive memory pages with no other home. In AIX, the paging space is mainly used to hold the pages from working storage (process data pages). If a memory page is not in physical memory, it may be loaded from disk; this is called a page-in. Writing a modified page to disk is called a page-out.
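The 8GB example in the physical address space notes can be worked through in a few lines. This sketch hard-codes that one example layout (memory at 0-3GB and 4GB-9GB, with 3GB-4GB left for PCI devices); the function name and the idea of an "offset into installed memory" are illustrative assumptions, not a description of how any particular machine assigns addresses.

```c
#include <stdint.h>

#define GB (1024ull * 1024ull * 1024ull)

/* Maps an offset into the 8GB of installed memory to a physical address
 * for the example layout only. Returns UINT64_MAX if the offset is
 * beyond the installed memory. */
uint64_t example_phys_addr(uint64_t ram_offset)
{
    if (ram_offset < 3 * GB)
        return ram_offset;            /* first range: 0-3GB */
    if (ram_offset < 8 * GB)
        return ram_offset + 1 * GB;   /* skip the 3GB-4GB PCI range */
    return UINT64_MAX;                /* no such memory installed */
}
```

The two ranges together hold 3GB + 5GB = 8GB, so the last byte of installed memory ends up just below the 9GB physical address.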

Translating Addresses

Step  Action
1     The effective address is referenced by a process or by the kernel.
2     The hardware translates the address into a system wide virtual address.
3     The page containing the virtual address is located in physical memory or on disk.
4     If the page is currently located on disk, a free frame is found in physical memory and the page is loaded into this frame.
5     The memory operation requested by the process or kernel is completed on the physical memory.

Figure 4-5. Translating Addresses

Notes:

Introduction
When a program accesses an effective address, the hardware translates the address into a physical address using the above process.

Segments

[Visual: the effective address space divided into 256 MB segments, numbered segment 0, segment 1, ... segment n.]
Figure 4-6. Segments

Notes:

Introduction
Effective memory address space in AIX 5L is divided into 256 MB objects called segments.

Segments
The maximum number of segments available to a process depends on the effective address space size (32-bit or 64-bit).

Available memory
A process can control how much of its effective address space is available in two ways:
- A process can create or destroy segments in its address space.
- A process can adjust the number of pages in a single segment (up to 256 MB).

Sharing address space
The benefit of the segmented addressing model is the high degree of memory sharing that can occur between processes. A segment can be mapped into more than one process’s effective address space, allowing the same physical memory to be shared. Once a shared segment is defined it can be attached or detached by many processes.

Segment Addressing

An effective address is broken down into the following three components:

Segment #    Virtual Page Index    Byte Offset
4/36 bits    16 bits               12 bits

- The first 4 bits (32 bit address) or 36 bits (64 bit address) is called an ESID and selects the segment register or STAB table slot
- The next 16 bits select the page within the segment
- The next 12 bits select the offset within the page

Figure 4-7. Segment Addressing

Notes:

Introduction
This section discusses how memory segments are addressed.

Segment addressing
Both the 64-bit and 32-bit effective address spaces are divided into 256 MB segments. Each segment has a Segment number or Effective Segment ID (ESID).

In the 32-bit model, this number is 4 bits long, allowing for 16 segments. In this case the ESID identifies one of 16 Segment Registers.

In the 64-bit model, 36 bits are used for the ESID, allowing for 2^36 (more than 64 billion) segments. In this case the value identifies an entry in the STAB table, which is pointed to by the ASR (Address Space Register).

In both cases the main item in the register/table entry is called a Virtual Segment ID (VSID). The virtual page index and byte offset are used together with the VSID to resolve the effective address. The address resolution information that follows describes this process.
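The three-field breakdown of an effective address can be expressed with shifts and masks. The sketch below follows the bit widths given in these notes; the structure and function names are invented for the example.

```c
#include <stdint.h>

/* The three components of an effective address. */
struct ea_fields {
    uint64_t esid;        /* segment number: 4 bits (32-bit) or 36 bits (64-bit) */
    uint32_t page_index;  /* 16 bits: page within the 256 MB segment */
    uint32_t byte_offset; /* 12 bits: byte within the 4096-byte page */
};

/* Split a 32-bit effective address into 4 + 16 + 12 bits. */
struct ea_fields split_ea32(uint32_t ea)
{
    struct ea_fields f;
    f.esid        = ea >> 28;
    f.page_index  = (ea >> 12) & 0xFFFFu;
    f.byte_offset = ea & 0xFFFu;
    return f;
}

/* Split a 64-bit effective address into 36 + 16 + 12 bits. */
struct ea_fields split_ea64(uint64_t ea)
{
    struct ea_fields f;
    f.esid        = ea >> 28;
    f.page_index  = (uint32_t)((ea >> 12) & 0xFFFFu);
    f.byte_offset = (uint32_t)(ea & 0xFFFu);
    return f;
}
```

For the 32-bit address 0x2FF22ABC this gives segment 2, page index 0xFF22, and byte offset 0xABC. The 16-bit page index and 12-bit offset together span 2^28 bytes, which is the 256 MB segment size.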

32 bit process on 64 bit hardware
Keeping a consistent segment size in both the 32-bit and 64-bit execution modes allows for a 32-bit environment that is compatible with 64-bit hardware. When running a 32-bit application the 64-bit hardware will zero extend the 32-bit effective address. Therefore, only the first 16 segments (ESID 0 - 15) can be accessed by a 32-bit application. This is consistent with the 32-bit hardware model which only has 16 segments.

32-bit Hardware Address Resolution

[Visual: the 4-bit segment number of the effective address selects one of 16 segment registers, which supplies a 24-bit segment ID. The segment ID and the 16-bit virtual page index form the 40-bit virtual page number of a 52-bit virtual address. The Translation Look-Aside Buffer (TLB), Hash Anchor Table (HAT), hardware Page Frame Table (PFT), and software Page Frame Table are used to find a 20-bit real page number, which is joined with the 12-bit page offset to give a 32-bit real address.]
Figure 4-8. 32-bit Hardware Address Resolution

Notes:

Introduction
As already noted, the effective address segment number identifies a register or table value. We call this table value the Virtual Segment ID (VSID), and it is 24/52 bits long for 32/64 bit hardware. This value together with the remaining effective address information (segment page number and page offset) is used to resolve our effective address to a machine-usable address. This visual, as well as the following visual, illustrates this process. Note that the virtual address space is larger than the effective or real address spaces (it is 52/80 bits wide on 32/64 bit hardware platforms, respectively).

32-bit Hardware Address Resolution
On 32-bit hardware, each 32 bit effective address uses the first 4 bits to select a segment register. The segment register contains a 24 bit Virtual Segment ID (VSID). These 24 bits are used with the 16 bit segment page number from the original address to yield a 40 bit virtual page number. Combine this with the 12 bit page offset, and we get a 52 bit virtual address which is used internally by the processor. The 40 bit virtual page number is then used in a lookup mechanism to find a 20 bit real page number, which is combined with the 12 bit page offset to end up with a 32 bit real address.
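The bit arithmetic of the 32-bit case (24 + 16 = 40 bits, 40 + 12 = 52 bits, 20 + 12 = 32 bits) can be written out directly. This is a sketch of the arithmetic only: the segment registers are modeled as a plain array, the function names are invented, and the lookup from virtual page number to real page number (TLB and page frame tables) is not modeled, so the real page number is simply passed in.

```c
#include <stdint.h>

/* 52-bit virtual address for a 32-bit effective address.
 * segreg[] is a hypothetical stand-in for the 16 segment registers,
 * each holding a 24-bit VSID. */
uint64_t virtual_address32(const uint32_t segreg[16], uint32_t ea)
{
    uint64_t vsid = segreg[ea >> 28] & 0xFFFFFFu;           /* 24 bits */
    uint64_t vpn  = (vsid << 16) | ((ea >> 12) & 0xFFFFu);  /* 40 bits */
    return (vpn << 12) | (ea & 0xFFFu);                     /* 52 bits */
}

/* 32-bit real address from a 20-bit real page number and the page offset. */
uint32_t real_address32(uint32_t real_page_number, uint32_t ea)
{
    return ((real_page_number & 0xFFFFFu) << 12) | (ea & 0xFFFu);
}
```

With a VSID of 0x00ABCD in segment register 2, effective address 0x2FF22ABC becomes virtual address 0xABCDFF22ABC. If the lookup then yields real page number 0x12345, the real address is 0x12345ABC. The 64-bit case follows the same pattern with a 52-bit VSID, but its 80-bit virtual address does not fit in a single 64-bit integer.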

64-bit Hardware Address Resolution

[Visual: the 36-bit segment number of the effective address is looked up in the Segment Lookaside Buffer or the Segment Table (STAB), which supplies a 52-bit segment ID. The segment ID and the 16-bit virtual page index form the 68-bit virtual page number of an 80-bit virtual address. The Translation Look-Aside Buffer (TLB), Hash Anchor Table (HAT), hardware Page Frame Table (PFT), and software Page Frame Table are used to find a 52-bit real page number, which is joined with the 12-bit page offset to give a 64-bit real address.]
Figure 4-9. 64 Bit Hardware Address Resolution

Notes:

64-bit Hardware Platform Address Resolution
The visual above illustrates the address resolution process for 64-bit hardware platforms. Note that it is completely analogous to the preceding 32 bit platform illustration.

64-bit hardware allows the operating system to define a virtual memory space that is significantly larger than the maximum amount of real memory that can be addressed. This is accomplished via the use of a segment table.

Each 64 bit effective address uses the first 36 bits as a segment number. The segment number is mapped to a 52 bit Virtual Segment ID (VSID) (using either a segment lookaside buffer (SLB) or a segment table (STAB)). These 52 bits are used with the segment page number from the original address to yield a 68 bit virtual page number. Combine this with the 12 bit page offset, and we get an 80 bit virtual address which is used internally by the processor. The 68 bit virtual page number is then used in a lookup mechanism to find a 52 bit real page number, which is combined with the 12 bit page offset to end up with a 64 bit real address.

Segment Types

- Kernel Segment
- User Text
- Process Private
- Shared Library Text
- Shared Data
- Shared Library Data

Figure 4-10. Segment Types

Notes:

Introduction
Several segment types are used in a process’s address space. The segment types are listed in the visual above.

Private vs. shared
Memory in a shared segment may be mapped to the same virtual address in more than one process. This allows the sharing of data between processes.

Memory in private segments is only mapped to a single process’ address space. This prevents one process from accessing or altering another process’ private memory.

Kernel segments
Kernel segments are segments that are shared by all processes on the system. These segments can only be accessed by code running in the kernel protection domain.

User text
The user text segments contain the code of the program. Threads in user mode have read-only access to text segments to prevent modification during program execution. This protection allows a single copy of a text segment to be shared by all processes associated with the same program. For example, if two processes in the system are running the ls command, then the instructions of ls are shared between them.

Running a debugger
When running a debugger, a private read/write copy of the text segment is used. This allows debuggers to set breakpoints directly in code. In that case, the status of the text segment is changed from shared to private.

Process private segment
The process private segment is not shared among other processes. The process private segment contains:
- The user data
- The user stack (for 32-bit programs)
- Text and data from explicitly loaded modules (for 32-bit programs)
- Kernel per-process data such as the u-block (accessible only in kernel mode)
- The primary kernel thread stack (accessible only in kernel mode)
- Per-process loader data (accessible only in kernel mode)

Performance advantage
When a process calls fork, the process private segment of the child process is created as a ‘copy-on-write’ segment. It shares its contents with the process private segment of the parent process. Whenever the parent or child process modifies a page that is part of the process private segment, the page is actually copied into the segment for the child process. This results in a major performance advantage for the kernel, especially in the (very common) situation where the newly created child process immediately performs an exec() call to start running a different program.
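The effect of copy-on-write that a program can observe is that parent and child behave as if each had its own copy of private data after fork(). The sketch below shows only that visible behavior using standard POSIX calls; it does not show, and cannot show from user space, when the kernel actually copies a page. The variable and function names are made up for the example.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* request POSIX declarations in strict compiler modes */
#endif
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int private_value = 1;   /* ordinary process private data */

/* Forks a child that overwrites private_value, waits for the child to
 * exit, and returns the value the parent still sees (-1 on error). */
int parent_value_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        private_value = 99;     /* the store affects only the child's copy */
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return private_value;       /* unchanged in the parent */
}
```

The parent still reads 1 after the child has stored 99, because the two processes no longer share the modified page.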

Shared library text
The shared library text segment contains mappings whose addresses are common across all processes. A shared library segment:
- Contains a copy of the program text (instructions) for the shared libraries currently in use in the system.
- Is added to the user address space by the loader when the first shared library is loaded.

The shared library text is loaded into this segment when a module is loaded via the exec() system call. Executable modules list the shared libraries they need at exec() time. Or, a program may issue load() calls to get additional shared modules.

Shared library data segment
Functions in shared libraries can define variables and other data elements that are private to a process. These elements are placed in the shared library data segment. Each process using text from the shared library text segment has a copy of the corresponding data in the per-process shared library data segment.
- Each process has one shared library data segment.
- Addresses of data items are generally the same across processes.
- Data itself is not shared.

The shared library data segments act as extensions of the process private segment.

Shared data
Mapped memory regions, also called shared memory areas, can serve as large pools for exchanging data among processes.
- A process can create and/or attach a shared data segment that is accessible by other processes.
- A shared data segment can represent a single memory object or a collection of memory objects.
- Shared memory can be attached read-only or read-write.

Shared Memory

[Visual: memory segments in virtual memory mapped into both the effective address space of Process A and the effective address space of Process B.]
Figure 4-11. Shared Memory

Notes:

Introduction
Shared memory areas can be most beneficial when the amount of data to be exchanged between processes is too large to transfer with messages, or when many processes maintain a common large database.

Shared memory address
The shared memory is process-based and can be attached at different effective addresses in different processes.

Methods of sharing
The system provides two methods of sharing memory:
- Mapping file data into the process address space (mmap() services).
- Mapping to anonymous memory regions that may be shared (shmat() services).

Serialization
There is no implicit serialization support when two or more processes access the same shared data segment. The available subroutines do not provide locks or access control among the processes. Therefore, processes using shared memory areas must set up a signal or semaphore control method to prevent access conflicts and to keep one process from changing data that another is using.

shmat Memory Services

shmctl()     Controls shared memory operations
shmget()     Gets or creates a shared memory segment
shmat()      Attaches a shared memory segment
shmdt()      Detaches a shared memory segment
disclaim()   Removes a mapping from a specified address range within a shared memory segment

Figure 4-12. shmat Memory Services

Notes:

Introduction
The shmat services are typically used to create and use shared memory objects from a program.

shmat functions
A program can use the functions listed in the visual above to create and manage shared memory segments.

Using shmat
The shmget() system call is used to create a shared memory region and, when supporting objects larger than 256 MB, creates multiple segments. The shmat() system call is used to gain addressability to a shared memory region.

EXTSHM
The environment variable EXTSHM=ON allows shared memory regions to be created with page granularity instead of the default segment granularity. This allows more shared memory regions within the same-sized address space, with no increase in the total amount of shared memory region space.
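A typical sequence of these calls is shmget() to create the region, shmat() to attach it, shmdt() to detach it, and shmctl() to remove it. The sketch below uses only the standard System V interfaces and is not specific to AIX; the function name is invented for the example, and waiting for the child with waitpid() stands in for the signal or semaphore that a real application would need for serialization.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations in strict compiler modes */
#endif
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a shared memory region, lets a child process store into it,
 * and returns the value the parent then reads (-1 on error). */
int shm_parent_sees_child_write(void)
{
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    int *shared = shmat(id, NULL, 0);       /* gain addressability */
    if (shared == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }

    *shared = 1;
    int result = -1;
    pid_t pid = fork();                     /* the child inherits the attachment */
    if (pid == 0) {
        *shared = 99;                       /* store into the shared region */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = *shared;                   /* parent sees the child's store */

    shmdt(shared);                          /* detach */
    shmctl(id, IPC_RMID, NULL);             /* remove the region */
    return result;
}
```

Unlike ordinary private data, the value stored by the child is visible to the parent, because both processes address the same region.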

however. This single-level store approach can also greatly improve performance by creating a form of © Copyright IBM Corp.0. 2003 Unit 4. mmap () The mmap() service is normally used to map disk files into a process address space. 2001. shmat()can also be used to map disk files. Instead of reading and writing to the file. the program would just access variables stored in the segment. This avoids the system call overhead of read() and write(). Memory Mapped Files BE0070XS4.0.3 Student Notebook Uempty Memory Mapped Files Virtual memory Effective Address space Disk File Figure 4-13. Advantages Memory mapped files provides easy random access.V2. Addressing Memory 4-23 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. as the file data is always available.0 Notes: Introduction Memory segments can be used to map any ordinary file directly into memory. using system calls. .

Synchronizes a mapped file with its underlying storage device. even if some are using mapping and others are using the read/ write system call interface. the mmap() subroutine extends this capability beyond that provided by the shmat() subroutine by allowing a relatively unlimited number of such mappings to be established. Of course. .Portability of the application is a concern. 2003 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. . 2001. However. Maps an object file into virtual memory. Both the mmap()and shmat() services provide the capability for multiple processes to map the same region of an object so that they share addressability to that object. mmap services The mmap() services are typically used for mapping files.Page-level protection needs to be set on the mapping.Many files are mapped simultaneously. Service madvise() mincore() mmap() mprotect() msync() munmap() Description Advises the system of a process' expected paging behavior. this may require synchronization between the processes. Instead of buffering the data in the kernel and copying the data from kernel to user. . Modifies the access protections of memory mapping.Only a portion of a file needs to be mapped. . When to use shmat() Use the shmat() services under the following circumstances: . Shared files A mapped file can be shared between multiple processes. the file data is mapped directly into the user’s address space.When mapping files larger than 256 MB. .Student Notebook Direct Memory Access (DMA) file access. 4-24 Kernel Internals © Copyright IBM Corp. When to use mmap() Use mmap() under the following circumstances: . although they may also be used for creating shared memory segments. Un-maps a mapped memory region.Private mapping is required. Determines residency of memory pages.

- For 32-bit applications, when eleven or fewer files are mapped simultaneously and each is smaller than 256 MB.
- When mapping shared memory regions which need to be shared among unrelated processes (no parent-child relationship).
- When mapping entire files.

Mapping types

There are three mapping types:
- Read-write mapping
- Read-only mapping
- Deferred-update mapping

Read-write mapping allows loads and stores in the segment to behave like reads and writes to the corresponding file.

Read-only mapping allows only loads from the segment. The operating system generates a SIGSEGV signal if a program attempts an access that exceeds the access permission given to a memory region. Just as with read-write access, a thread that loads beyond the end of the file loads zero values.

Deferred-update mapping also allows loads and stores to the segment to behave like reads and writes to the corresponding file. The difference between this mapping and read-write mapping is that the modifications are delayed. Any storing into the segment modifies the segment, but does not modify the corresponding file.

With deferred update (O_DEFER flag set on file open), the application can begin modifying the file data (by memory-mapped loads and stores) and then either commit the modifications to the file system (via fsync()) or discard the modifications completely. This can greatly simplify error recovery, and allows the application to avoid a costly temporary file that may otherwise be required. If all processes that have a file open with the O_DEFER flag set close that file before an fsync() or synchronous update operation is made against the file, then that file is not updated.
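As an illustration of accessing file data with loads instead of read() calls, the following is a minimal sketch using the POSIX mmap() interface with a read-only, private mapping. The function name count_newlines_mapped() is hypothetical and chosen only for this example; it is not taken from the course material.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count newline characters in a file by mapping it read-only instead of
 * calling read(). Returns -1 on error. (Hypothetical helper name.) */
long count_newlines_mapped(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              /* a zero-length mapping is rejected */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    /* Read-only, private mapping: the program may load but not store. */
    const char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping remains valid after close */
    if (data == MAP_FAILED)
        return -1;

    long count = 0;
    for (size_t i = 0; i < len; i++)    /* plain memory loads, no system calls */
        if (data[i] == '\n')
            count++;

    munmap((void *)data, len);
    return count;
}
```

A store through this mapping would exceed the PROT_READ permission and, as described above, result in a SIGSEGV.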

32-bit User Address Space

Segment Number (SID)   Segment Type and Use                          Attributes
0                      Kernel Segment                                shared, read-only
1                      User Text - application text (code)           shared, read-only
2                      Process Private Segment (Data, BSS, stack,    private, read-write
                       Ublock, uthread, heap)
3-12                   Shared data (shmat or mmap)                   shared, read-write
13                     Shared library text                           shared, read-only
14                     Shared data (shmat or mmap)                   shared, read-write
15                     Shared library data segment                   private, read-write

NOTE: for big data programs segments 3-10 can optionally be used as additional heap.

Figure 4-14. 32-bit User Address Space

Notes:

Introduction

For the 32-bit hardware platform, segment numbers (Effective Segment IDs) have different uses in user and kernel modes.

32-bit user mode

The table above shows the segment layout of a user mode 32-bit process. This represents how a process running in user mode will see its effective memory.

Segment zero contains the first kernel segment. The kernel segment contains the system call table and the kernel text.

The user program text (application code) is located in segment 1.

Segment 2 contains the data, BSS (uninitialized data), stack and heap for the program. The u-block for the process is also located in segment 2.

Segments 3-12 are used for shmat() and mmap() areas. Segment 13 contains the text for shared libraries (library code). Segment 14 provides an additional segment for shmat() and mmap(). Segment 15 holds the library data.

Big data model

A big data model is supported for 32-bit applications. Such a model is required for programs which exceed the limit imposed by the normal 32-bit address space (i.e. a single 256 MB segment for heap, data and stack). This allows an application to use more segments for heap, data, and stack. To accomplish this, segments 3-12 are added to the heap, eliminating these segments as shmat() and mmap() areas.
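Because each segment is 256 MB (2^28 bytes) and a 32-bit address space holds 16 of them, the segment number is the top 4 bits of the effective address. The sketch below decodes an address and reports the user-mode use of its segment according to the 32-bit user layout described above. The function names are hypothetical and the returned strings are shortened paraphrases of the table entries.

```c
#include <assert.h>
#include <stdint.h>

#define SEGMENT_SHIFT 28u   /* 256 MB segments: 2^28 bytes each */

/* Segment number (0-15): the top 4 bits of a 32-bit effective address. */
unsigned segment_of(uint32_t ea)
{
    return ea >> SEGMENT_SHIFT;
}

/* Byte offset of the address within its 256 MB segment. */
uint32_t segment_offset(uint32_t ea)
{
    return ea & 0x0FFFFFFFu;
}

/* User-mode use of a segment, following the 32-bit user layout table
 * (default layout, not the big data model). */
const char *user_segment_use(unsigned seg)
{
    assert(seg <= 15);
    if (seg == 0)  return "kernel segment";
    if (seg == 1)  return "user text";
    if (seg == 2)  return "process private segment";
    if (seg <= 12) return "shared data (shmat or mmap)";
    if (seg == 13) return "shared library text";
    if (seg == 14) return "shared data (shmat or mmap)";
    return "library data";
}
```

For example, the address 0x20001234 decodes to segment 2 with offset 0x1234, which places it in the process private segment.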

32-bit Kernel Address Space

Segment Number   Segment type and use
0                Kernel and kernel extension text and data
1                Extended kernel address space (file system and network data)
2                Private process segment
3                Kernel heap segment
7-10             MBUF segments
14               Kernel address space
15               Kernel thread segment

Figure 4-15. 32-bit Kernel Address Space

Notes:

32-bit kernel mode

When a process switches into kernel mode (32-bit kernel), the mapping of segments is changed so that the kernel may access its entire address space. The segment layout for the 32-bit kernel is shown above.

Private process segment

Segment 2 (the private process segment) is mapped the same for both user and kernel modes. This allows the kernel access to this section of the user address space. Any data passed between kernel and user will occur by copying data in and out of this segment.

64-bit User/Kernel Address Space

Segment Number (hex)             Segment usage
0x0000_0000_0                    System call tables, kernel text
0x0000_0000_1                    Reserved for system use
0x0000_0000_2                    Reserved for user mode loader (process private segment)
0x0000_0000_3 - 0x0000_0000_C    shmat or mmap use
0x0000_0000_D                    Reserved for user mode loader
0x0000_0000_E                    shmat or mmap use
0x0000_0000_F                    Reserved for user mode loader
0x0000_0001_0 - 0x06FF_FFFF_F    Application text, data, BSS and heap
0x0700_0000_0 - 0x07FF_FFFF_F    Default application shmat and mmap area
0x0800_0000_0 - 0x08FF_FFFF_F    Application explicit module load area
0x0900_0000_0 - 0x09FF_FFFF_F    Shared library text and per-process shared library data
0x0A00_0000_0 - 0x0EFF_FFFF_F    Reserved for future use
0x0F00_0000_0 - 0x0FFF_FFFF_F    Application primary thread stack
0x1000_0000_0 - 0xEFFF_FFFF_F    Reserved for future use
0xF000_0000_0 - 0xFFFF_FFFF_F    Additional kernel segments

Figure 4-16. 64-bit User/Kernel Address Space

Notes:

64-bit layout

The 64-bit model adds many more segments to the effective address space. This allows for 64-bit programs to be significantly larger than 32-bit programs. Also, for the 64-bit case one segment layout applies to both user and kernel modes.

Text, data, BSS, stack and heap

A program's text, data, BSS and heap can occupy segments 0x10 through 0x6FFFFFFF.

shmat() and mmap()

For 64-bit applications the default segments for shmat() and mmap() are segments 0x70000000 - 0x7FFFFFFF. Note that segments 0x3-0xC and segment 0xE are also reserved for shmat() and mmap(); this mirrors the 32-bit segment model.

Kernel segments

Segment 0 is the first kernel segment. The segments from 0xF00000000 and up may be used for additional kernel segments.
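The 64-bit segment ranges can be expressed as a simple range lookup. The following sketch assumes the same 256 MB segment size as the 32-bit case, so the segment number is the effective address shifted right by 28 bits. The function names and the shortened description strings are hypothetical; the boundaries are taken from the 64-bit layout table.

```c
#include <assert.h>
#include <stdint.h>

/* Segment number: the 64-bit effective address without its 28-bit offset. */
uint64_t esid_of(uint64_t ea)
{
    return ea >> 28;
}

/* Usage of a segment number, following the 64-bit layout table. */
const char *segment_use_64(uint64_t esid)
{
    assert(esid <= 0xFFFFFFFFFULL);     /* 36-bit segment number */
    if (esid == 0x0) return "kernel: system call tables, kernel text";
    if (esid == 0x1) return "reserved for system use";
    if (esid == 0x2) return "loader (process private segment)";
    if (esid <= 0xC) return "shmat or mmap";
    if (esid == 0xD) return "loader";
    if (esid == 0xE) return "shmat or mmap";
    if (esid == 0xF) return "loader";
    if (esid <= 0x06FFFFFFFULL) return "application text, data, BSS and heap";
    if (esid <= 0x07FFFFFFFULL) return "default shmat and mmap area";
    if (esid <= 0x08FFFFFFFULL) return "explicit module load area";
    if (esid <= 0x09FFFFFFFULL) return "library text and per-process data";
    if (esid <= 0x0EFFFFFFFULL) return "reserved for future use";
    if (esid <= 0x0FFFFFFFFULL) return "primary thread stack";
    if (esid <= 0xEFFFFFFFFULL) return "reserved for future use";
    return "additional kernel segments";
}
```

An address such as 0x0700000000001000 has segment number 0x70000000 and so falls in the default shmat and mmap area.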

Checkpoint

1. AIX divides physical memory into ______.
2. The _____________ provides each process with its own _______ address space.
3. A segment can be up to ______ in size.
4. A 32-bit effective address contains a ______ segment number.
5. Shared library data segments can be shared between processes. True or False?
6. The 32-bit user address space layout is the same as the 32-bit kernel address space layout. True or False?

Figure 4-17. Checkpoint

Notes:

Exercise

Complete exercise four
- Consists of theory and hands-on
- Ask questions at any time
- Activities are identified by a [icon]

What you will do:
Given the address of a memory object you will identify what segment the address belongs to and speculate as to how the object was created.

Figure 4-18. Exercise

Notes:

Turn to your lab workbook and complete exercise four.

Unit Summary

- Virtual memory management
- Address spaces: effective, virtual, physical
- Page size = 4096 bytes
- Segment size = 256 MB
- 32-bit vs 64-bit segment layout

Figure 4-19. Unit Summary

Notes:


Unit 5. Memory Management

What This Unit Is About

This unit describes how AIX 5L manages memory using demand paging.

What You Should Be Able to Do

After completing this unit, you should be able to:
• Identify the key functions of the AIX virtual memory manager
• Given a memory object type, identify the location of the backing store the VMM system will use for this object
• Describe the effect that different paging space allocation policies have on applications and the system
• Find the current paging space usage on the system
• Identify the paging characteristics of a system from a vmcore file

How You Will Check Your Progress

Accountability:
• Exercises using your lab system
• Unit review

References

PowerPC Microprocessor Family: The Programmers Reference Guide
Available from http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC

Unit Objectives

At the end of this lesson you should be able to:
- Identify the key functions of the AIX virtual memory management system
- Given a memory object type, identify the location of the backing store the VMM system will use for this object
- Describe the effect that different paging space allocation policies have on applications and the system
- Find the current paging space usage on the system
- Identify the paging characteristics of a system from a vmcore file

Figure 5-1. Unit Objectives

Notes:

Virtual Memory Management

[Figure: Process 1 and Process 2 effective address spaces, Virtual address space, Physical Memory, Filesystem pages, Paging space]

Figure 5-2. Virtual Memory Management (VMM)

Notes:

Introduction

In the Addressing Memory lesson we saw how AIX 5L manages the effective address space for both the user and kernel. This lesson focuses on the management of the virtual address space by the Virtual Memory Manager (VMM).

Function of the VMM

The VMM is responsible for keeping track of which program pages are resident in memory and which are on secondary storage (disk). It handles interrupts from the address translation hardware in the system to determine when pages must be retrieved from secondary storage and placed in physical memory. When all of the physical memory is in use, the VMM decides which programs' pages are to be replaced and paged out to secondary storage.

Each time a process accesses a virtual address, the virtual address is mapped (if it is not already mapped) by the VMM to a physical address (where the data is located).

Access protection

Another function of the VMM is to provide for access protection that prevents illegal access to data. This function protects programs from incorrectly accessing kernel memory or memory belonging to other programs. Access protection also allows programs to set up memory that may be shared between processes.

Object Types

- Working objects
- Persistent objects
- Client objects
- Log objects
- Mapping objects

Figure 5-3. Object Types

Notes:

Memory Object Types

Introduction

Memory objects in AIX 5L are classified based on how the object is used. All memory objects are assigned one of five classification types. The Virtual Memory Management system manages each memory object based on its type.

Working objects

Working objects (also called working storage and working segments) are temporary segments, such as stack and data areas, used during the execution of a program. Process data is created by the loader at run time and is paged in and out of paging space. The working storage segment holds the amount of paging space allocated to

pages in the segment. Part of the AIX kernel is also pageable and is part of the working storage.

Persistent objects

The VMM is used for performing I/O operations of file systems. Persistent objects are used to hold file data for the local file systems. File system reads and writes occur by attaching the appropriate file system object and performing loads/stores between the mapped object and the user buffer.

File data pages and program text are both part of persistent storage. When the process opens the file, the data pages are paged-in. When the contents of a file change, the page is marked as modified and eventually paged-out directly to the original disk location. The program text pages are read-only pages; they are paged-in, and never paged-out to disk. Persistent pages do not use paging space.

Client objects

Client objects are used for pages of client file systems. When remote pages are modified, they are marked and eventually paged-out to the original disk location across the network. Remote program text pages (read-only pages) page-out to paging space, from where they can be paged-in later if needed.

Log objects

Log objects are used for writing or reading journaled file system logs during journaling operations.

Mapping objects

Mapping objects are used to support the mmap() interfaces, which allow an application to map multiple objects to the same memory segment.

Demand Paging

[Figure: Kernel or user effective address space, Virtual address space, Physical Memory (with pinned pages), Page Fault, Backing store: Filesystem pages and Paging space]

Figure 5-4. Demand Paging

Notes:

Introduction

AIX is a demand paging system. Physical pages (frames) are not allocated for virtual pages until they are needed (referenced).

How it works

Data is copied to a physical page only when referenced by a program or by the kernel. Paging is done on-the-fly and is invisible to the program causing the page fault.

Page faults

A page fault occurs when a thread tries to access a page that is not currently in physical memory. A reference to a non-allocated page results in a page fault.

The mapping of effective addresses to physical addresses is done in the hardware on a page-by-page basis. When the hardware finds that there is no mapping to physical memory, it raises a page fault condition.

Page fault handler

The job of a virtual memory management system (VMM) is to handle page faults so that they are transparent to the thread using effective memory addresses. The steps to resolve a page fault are:

Step  Action
1     The hardware detects a page fault and raises the page fault condition.
2     Execution of the faulting thread is suspended.
3     Control is transferred to the page fault handler (part of the virtual memory management system).
4     The page is loaded from disk into physical memory.
5     Execution of the faulted thread is resumed.

Page validity

The VMM checks to ensure that the effective address being referenced is part of the valid address range of the segment that contains the effective address. There are a number of possible scenarios.

- The effective address is outside the valid address range for the segment. In this case, the page fault cannot be resolved. If the processor is running in user mode, then the unresolved page fault results in the running process being sent either a SIGSEGV (Segmentation violation) or SIGBUS (Bus error) depending on the address being referenced. If the processor is running in kernel mode, an unresolvable page fault results in a system crash.

- The effective address is within the valid address range for the segment, and the page containing the effective address has already been instantiated. The actions of the VMM in this case are described over the next few pages of the class.

- The effective address is within the valid address range for the segment, however the page containing the effective address has not been instantiated. For example, this happens when an application performs a large malloc() operation. The pages for the malloced space are not instantiated until they are referenced for the first time. In this case, the VMM allocates a physical frame for use by the page, and then updates the segment information to indicate that the page has been allocated. It then updates the hardware page frame table to reflect the physical location of the page, and allows the faulting thread to continue.
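The three page validity scenarios can be summarized as a small decision function. This is only a toy model: it assumes the valid address range of a segment can be represented as the offsets below a single limit, and the type and function names are hypothetical, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>

enum fault_outcome {
    FAULT_SIGNAL,      /* user mode: process is sent SIGSEGV or SIGBUS */
    FAULT_CRASH,       /* kernel mode: unresolvable fault, system crash */
    FAULT_PAGE_IN,     /* page already instantiated: VMM locates and recovers it */
    FAULT_NEW_FRAME    /* first reference: VMM allocates a physical frame */
};

/* Classify a page fault at `offset` within a segment whose valid range is
 * modelled (as an assumption of this sketch) as offsets below `seg_limit`. */
enum fault_outcome classify_fault(unsigned long offset, unsigned long seg_limit,
                                  bool instantiated, bool kernel_mode)
{
    assert(seg_limit <= (1UL << 28));   /* a segment is at most 256 MB */

    if (offset >= seg_limit)            /* outside the valid address range */
        return kernel_mode ? FAULT_CRASH : FAULT_SIGNAL;

    return instantiated ? FAULT_PAGE_IN : FAULT_NEW_FRAME;
}
```

The FAULT_NEW_FRAME case corresponds to the first touch of a page inside a large malloc() area.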

Advantages

The demand paging system in AIX allows more virtual pages to be allocated than can be stored in physical memory. Demand paging also saves much of the overhead of creating new processes because the pages for execution do not have to be loaded until they are needed. If a process never makes use of a portion of its virtual space, valuable physical memory will never be used.

Pageable kernel

AIX's kernel is pageable. Only some of the kernel is in physical memory at one time. Kernel pages that are not currently being used can be paged out.

Pinned pages

Some parts of the kernel are required to stay in memory because it is not possible to perform a page-in when those pieces of code execute. These pages are said to be pinned. The interrupt processing portion of a device driver is pinned. Only a small part of the kernel is required to be pinned.

Physical memory management

Data that has been recently used is kept in physical memory. Data that has not been recently used is kept in paging space. A pager daemon attempts to keep a pool of physical pages free. If the number of pages available goes below a low-water mark threshold, the pager frees the oldest referenced pages, and continues to do so until a high-water mark threshold is reached.
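The low-water and high-water behaviour of the pager can be sketched as a small calculation. The function and parameter names below are hypothetical, and the sketch only computes how many frames would be freed; selecting the oldest referenced pages is not modelled.

```c
#include <assert.h>

/* Number of frames the pager would free, given the current size of the
 * free pool and the two thresholds (all counted in page frames). */
unsigned pager_frames_to_free(unsigned free_frames,
                              unsigned low_water, unsigned high_water)
{
    assert(low_water <= high_water);

    if (free_frames >= low_water)
        return 0;                       /* pool is large enough: pager idle */

    /* Below the low-water mark: keep freeing until the pool is back up
     * to the high-water mark. */
    return high_water - free_frames;
}
```

Using two thresholds rather than one means the pager runs in bursts instead of being triggered again by every single allocation near the limit.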

Data Structures

[Figure: Effective address space, Segment ID and page number, Hardware Page Frame Table, Physical memory, Software page frame table, External page tables (XPT), Paging space, SID table, File inode, Filesystem pages]

Figure 5-5. Data Structures

Notes:

Introduction

The main function of the VMM is to make translations from the effective address to the physical address. Address translation requires both hardware and software components. This section covers the relationship between the hardware and software components of the VMM.

Data structures

The diagram above shows the overall relationships between the major AIX data structures involved in mapping a virtual page to a physical page or to paging space.

Page faults

A page fault causes the AIX VMM to do the bulk of its work. It handles the fault by first verifying that the requested page is valid. If the page is valid, the VMM determines the location of the page, recovers the page if necessary and updates the hardware page frame table with the location of the page. A faulted page will be recovered from one of the following locations:

- Physical memory (but not in the hardware PFT)
- Paging disk (working object)
- File system object (persistent object)

Hardware Page Mapping

[Figure: Effective address space, Hardware Page Frame Table, Physical memory, Software page frame table, External page tables (XPT), Paging space, SID table, File inode, Filesystem pages]

Figure 5-6. Hardware Page Mapping

Notes:

Introduction

In a normal situation, an effective address refers to a piece of memory that is currently in real memory. We say the memory is paged in. There is no need for a page fault to be generated.

Illustration

The flow of the best case address translation is illustrated above.

Hardware Page Frame Table (PFT)

A hardware Page Frame Table (PFT, sometimes "HWPFT") of address translations is used to make the conversions from effective addresses to physical addresses. These tables only contain a subset of all available translations for the contents of physical memory. If a translation is found in this table the physical page is returned to the requestor.

Page not in Hardware Table

[Figure: Effective address space, Hardware Page Frame Table, Physical memory, Software page frame table, External page tables (XPT), Paging space, SID table, File inode, Filesystem pages]

Figure 5-7. Page not in Hardware Table

Notes:

Introduction

The size of the hardware Page Frame Table is limited; therefore, the hardware cannot satisfy all address translation requests. The VMM software must supplement the hardware table with a software-managed page table.

Illustration

When a translation cannot be found in the hardware table, a page fault is generated. The VMM must be called to update the hardware tables. The physical page may be resident in memory; however, the translation entry is not in the hardware table. The procedure is shown in the table above.

What happened to the thread?

When this type of page fault is resolved the dispatcher is not run. The faulted thread just continues the execution at the instruction that caused the fault.

Software Page Frame Table

The Software Page Frame Table (SWPFT) is an extension of the hardware Page Frame Table, and is used and managed by the VMM software. The software PFT is big enough to contain translation information for every page resident in physical memory. SWPFTs contain information connected with a page as well as page-in flags, page-out flags, free list flags, and the block number. They contain the device information used to obtain the proper page from disk. Only some parts of the software PFT are pinned.

Procedure

These steps assume that the memory page is in memory but not in the hardware Page Frame Table.

Step  Action
1     The hardware Page Frame Table is searched for a page translation and none is found.
2     The hardware generates a page fault causing the VMM to be called.
3     The VMM first verifies that the requested page is valid. If the page is not valid, a kernel exception is generated.
4     If the page is valid, the VMM searches the software PFT for the page. This process resembles hardware processing, but uses a software page table instead.
5     If the page is found:
      • The hardware Page Frame Table is updated with the real page number for this page, and the process resumes execution.
      • No page-in of the page occurs, since it is already in memory.
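The relationship between the two tables can be illustrated with a toy model in which a small "hardware" table holds a subset of the translations kept in a complete "software" table. Everything in this sketch is hypothetical: the table sizes, the direct-mapped slot selection, and the structure and function names are illustrative assumptions and do not reflect the real PowerPC or AIX data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HW_SLOTS 4      /* deliberately tiny: the hardware table is a subset */
#define SW_SLOTS 16     /* one entry per resident page in this toy */

struct xlate { bool valid; unsigned vpage; unsigned frame; };

struct toy_vmm {
    struct xlate hw[HW_SLOTS];
    struct xlate sw[SW_SLOTS];
    unsigned reload_faults;     /* faults resolved without any page-in */
};

/* Translate a virtual page number. Returns true and sets *frame when the
 * page is resident in (toy) physical memory. */
bool toy_translate(struct toy_vmm *vm, unsigned vpage, unsigned *frame)
{
    assert(vm != NULL && frame != NULL);
    struct xlate *slot = &vm->hw[vpage % HW_SLOTS];

    if (slot->valid && slot->vpage == vpage) {  /* hardware hit: no fault */
        *frame = slot->frame;
        return true;
    }

    /* Hardware miss: a page fault, so the software table is searched. */
    for (size_t i = 0; i < SW_SLOTS; i++) {
        if (vm->sw[i].valid && vm->sw[i].vpage == vpage) {
            *slot = vm->sw[i];          /* update the hardware table */
            vm->reload_faults++;
            *frame = slot->frame;
            return true;                /* thread resumes, no page-in needed */
        }
    }
    return false;   /* not resident: page must come from its backing store */
}
```

In this model the first reference to a resident page costs one fault to reload the hardware table, and later references to the same page hit the hardware table directly.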

Page on Paging Space

[Figure: Effective address space, Hardware Page Frame Table, Physical memory, Software page frame table, External page tables (XPT), Paging space, SID table, File inode, Filesystem pages]

Figure 5-8. Page on Paging Space

Notes:

Introduction

If a page is not found in physical memory, the VMM determines whether it is on paging space or elsewhere on disk. If the page is in paging space, then the disk block containing the page is located, and the page is loaded into a free memory page.

Illustration

Working pages are mapped to disk blocks in the paging space. The procedure for loading a page from paging space is shown in the visual (Figure 5-8).

Waiting for I/O

Copying a page from the paging space to an available frame is not a synchronous process. Any process or thread waiting for a page fault to be handled is put to sleep until the page is available.

External Page Table (XPT)

[Figure: the XPT Root (entries 0-255) points to XPT Direct Block #0 through XPT Direct Block #255. Each direct block (entries 0-255) covers 1 MB and points to disk blocks in paging space: Direct Block #0 maps page 0 through page 255, Direct Block #255 maps page 65280 through page 65535.]

Figure 5-9. External Page Table (XPT)

Notes:

External Page Table (XPT)

The XPT maps a page within working storage segments to a disk block on external storage. There is one XPT for each working storage segment.

Structure

Each segment that is mapped to paging space has the following XPT structure. The XPT is a two-level tree structure.

Description

The first level of the tree is the XPT root block. Each word in the root block is a pointer to one of the direct blocks. The second level consists of 256 direct blocks. Each word in a direct block represents a single page in the segment. It contains the page's state and disk block information. Each XPT direct block covers 1 MB of the 256 MB segment.
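The two-level lookup follows directly from the sizes given: a 256 MB segment holds 65536 pages of 4096 bytes, each direct block covers 256 pages (1 MB), and the root holds 256 pointers. The sketch below shows the index arithmetic. The structure layout is a simplification for illustration only: the names are hypothetical, and each direct-block word here holds only a disk block number, whereas the real entry also carries page state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT     12       /* 4096-byte pages */
#define XPT_FANOUT     256      /* entries per root block and per direct block */
#define NO_DISK_BLOCK  0u       /* toy marker: no paging space block assigned */

struct xpt_direct { uint32_t entry[XPT_FANOUT]; };          /* one word per page */
struct xpt_root   { struct xpt_direct *direct[XPT_FANOUT]; };

/* Find the paging space disk block for a byte offset within a segment. */
uint32_t xpt_lookup(const struct xpt_root *root, uint32_t seg_offset)
{
    assert(root != NULL && seg_offset < (1u << 28));

    uint32_t page = seg_offset >> PAGE_SHIFT;   /* page number, 0 .. 65535 */
    uint32_t ri   = page / XPT_FANOUT;          /* which 1 MB direct block */
    uint32_t di   = page % XPT_FANOUT;          /* which page inside it */

    const struct xpt_direct *blk = root->direct[ri];
    if (blk == NULL)
        return NO_DISK_BLOCK;
    return blk->entry[di];
}
```

For example, an offset of 1 MB (0x00100000) is page 256, which is entry 0 of direct block 1.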

Procedure

In this procedure the faulting thread must be suspended until I/O for the faulting page has completed.

Step  Action
1     The thread causing the fault is suspended.
2     The VMM looks up the object ID for this address in the Segment ID table and gets the External Page Table (XPT) root pointer.
3     The VMM finds the correct XPT (direct block from XPT root).
4     The VMM gets the paging space disk block number from the XPT direct block.
5     The VMM takes the first available frame from the free frame list. The free list contains one entry for each free frame of real memory.
6     The VMM issues an I/O request to the device with the logical block and physical address of the page to be loaded.
7     When the I/O completes, the VMM is notified. The VMM updates the hardware PFT.
8     The thread waiting on the frame is awakened and resumes at the faulting instruction.

The net effect is that the process or thread has no knowledge that a page fault occurred except for a delay in its processing.

Loading Pages From the File System

[Figure: Effective address space, Hardware Page Frame Table, Physical memory, Software page frame table, External page tables (XPT), Paging space, SID table, File inode, Filesystem pages]

Figure 5-10. Loading Pages From the File System

Notes:

Introduction

Persistent pages do not use external page tables. The VMM uses the information contained in a file's inode structure to locate the pages for the file.

Procedure

Persistent pages are mapped to local files located on file systems. The effective address for the mapped page of the local file is indexed in the Segment Information Table (SID). The inode is pointed to by the SID entry, allowing the VMM to find and page-in the faulting block.

File System I/O

Introduction

The paging functions of the VMM are also used to perform file "reads" and "writes" by processes.

File system objects

File system reads and writes occur by attaching the appropriate file system object and performing loads/stores between the mapped object and the user buffer. It means that file objects are not directly addressable in the current address space but instead are temporarily attached. A local file has a segment allocated and has an entry (SID) in the segment information table. A file gnode contains information about which segment belongs to the particular file.

Persistent pages

AIX uses a large portion of memory as the file system buffer cache. The pages for files compete for storage the same way as other pages. The VMM schedules the modified persistent pages to be written to their original location on disk when:

- The VMM needs the frame for another page.
- The file is closed.
- The sync operation is performed.

Scheduling a page to be written does not mean that the data is written to disk immediately. A sync() operation flushes all scheduled pages to disk. The sync() operation is performed by the syncd daemon every 60 seconds by default or by a user running the sync command.

Object Type / Backing Store

Object Type        Backing Store
A. Working         1. A regular disk file
B. Persistent      2. An NFS disk file
C. Client          3. Paging disk

Figure 5-11. Object Type / Backing Store

Notes:

Introduction

Paging provides automatic backup copies of memory objects on disk. This copy is called the backing store and can be located on a paging disk, a regular disk file, or even on a network accessible disk file.

Questions

Using what you know about memory object types, match the object types on the left with the location of its backing store on the right in the visual above.

Paging Space Management Process

Step  Condition                                          Action
1     If the number of free paging space blocks falls    SIGDANGER is sent to all processes (except
      below the paging space warning level (npswarn).    kprocs) that have registered to handle the
                                                         signal.
2     If the number of free paging space blocks falls    A SIGKILL is sent to the newest process that
      below the paging space kill level (npskill).       does not have a signal handler for SIGDANGER
                                                         and whose UID is not less than nokilluid.
3     If paging space is still below the paging space    SIGKILL will continue to be sent to eligible
      kill threshold.                                    processes until the free paging space rises
                                                         above the kill threshold.

Figure 5-12. Paging Space Management Process

Notes:

Introduction

Proper management of paging space is required for the system to perform. Low paging space can result in failed applications and system crashes.

SIGDANGER

Application programs can ask AIX to notify them when paging space runs low by registering to receive a SIGDANGER signal. This feature allows applications to release memory or take other appropriate actions when paging space runs low. The default action for SIGDANGER is to ignore the signal.

Threshold

AIX has two paging space thresholds. They are:

- Paging space warning level

This helps to prevent long-running processes from being terminated due to a low paging space condition caused by a recently started process. The value of nokilluid is 0 by default. int danger(void) { if (own_pid == SPECIALPID) { console(NOLOG.Student Notebook . Process The table above describes the actions AIX takes when paging space becomes low.Paging space kill level Application programs can monitor these thresholds and free paging space using the psdanger() function. .1) and vmo (AIX 5. 2001. which means processes owned by root are eligible to be sent a SIGKILL. Age of the process The kernel send the SIGKILL signal to the youngest eligible process.2) commands. "Paging space low!\n"). unload(L_PURGE). /* unload and remove any * unused modules in kernel or * library */ } return(0).2) commands. Both thresholds are set with the vmtune (AIX 5. which can be set with the vmtune (AIX 5. Example The init process (pid 1) registers a signal handler for the SIGDANGER signal. 2003 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.1) and vmo (AIX 5. M_DANGER. Nokilluid The SIGKILL signal is only sent to processes that do not have a handler for SIGDANGER and where the UID of the process is greater than or equal to the kernel variable nokilluid. } 5-22 Kernel Internals © Copyright IBM Corp. The handler prints a warning message on the system console and attempts to free memory by unloading unused modules.
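The selection rules described under "Age of the process" and "Nokilluid" can be sketched in a few lines of C. This is only an illustration under stated assumptions: the struct proc_info layout, the start_time field and the function name pick_sigkill_victim are hypothetical and are not taken from the AIX kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a process, for this sketch only. */
struct proc_info {
    int          pid;
    unsigned int uid;
    long         start_time;        /* larger value = started more recently */
    int          handles_sigdanger; /* non-zero if a SIGDANGER handler exists */
};

/*
 * Return the index of the process that would receive SIGKILL, or -1 if
 * no process is eligible. A process is eligible when it has no SIGDANGER
 * handler and its UID is >= nokilluid. Among the eligible processes the
 * youngest one is chosen.
 */
int pick_sigkill_victim(const struct proc_info *procs, size_t n,
                        unsigned int nokilluid)
{
    int victim = -1;

    assert(procs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (procs[i].handles_sigdanger)
            continue;               /* registered for SIGDANGER: spared */
        if (procs[i].uid < nokilluid)
            continue;               /* protected by nokilluid */
        if (victim < 0 || procs[i].start_time > procs[victim].start_time)
            victim = (int)i;        /* youngest eligible so far */
    }
    return victim;
}
```

With nokilluid left at 0 a root-owned process can be selected; raising it to 1 in this sketch protects every process whose UID is 0.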

Paging Space Allocation Policy

Policy: Early allocation (PSALLOC=early)
Description: Causes paging space to be allocated as soon as the memory request is made. This is the algorithm that was used on AIX v3.1.

Policy: PSALLOC=
Description: The system-wide default applies.

Figure 5-13. Paging Space Allocation Policy

Notes: Introduction
Individual processes may select when paging space will be allocated for them. This is called a paging space policy.

PSALLOC
A process that has the environment variable PSALLOC=early will cause the VMM to allocate paging space for any memory which is requested, whether or not the memory is accessed. This helps to ensure that the paging space will be available if it is needed. Note that this policy holds only for this process and is not system-wide.

For AIX 4.3.2 and later releases the system default is Deferred Paging Space Allocation (DPSA), which means that paging space will not be allocated until a page-out occurs. This can be controlled with vmtune -d {0,1}. vmtune -d 0 will turn DPSA off, which means paging space will be allocated when requested memory is accessed. Note that this is a system-wide policy and applies to all processes running on the system.

Finding a process allocation policy
Use kdb to examine the process flags in the proc structure to determine a process’s current paging space allocation policy.

When early allocation is selected, the SPEARLYALLOC flag will be set in proc->p_flag. This flag is defined in proc.h as:

    #define SPEARLYALLOC 0x04000000 /* allocates paging space early */

This flag can be seen through kdb by running the p <slot_number> subcommand. If the flag is set, it will show up in the second set of “FLAGS”, indicated by the name “SPEARLYALLOC”.
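The flag test that kdb performs when it prints SPEARLYALLOC amounts to a single bit mask. The sketch below shows that test next to a check of the PSALLOC value. The function names are hypothetical, the flag word is assumed to fit in an unsigned int, and treating every value other than "early" as "use the system default" is an assumption made for this illustration.

```c
#include <string.h>

#define SPEARLYALLOC 0x04000000u  /* value quoted from proc.h in the text */

/* Non-zero when the SPEARLYALLOC bit is set in a copy of proc->p_flag. */
int uses_early_alloc(unsigned int p_flag)
{
    return (p_flag & SPEARLYALLOC) != 0;
}

/*
 * Hypothetical helper: non-zero when a PSALLOC value (for example the
 * result of getenv("PSALLOC")) selects early allocation.
 */
int psalloc_requests_early(const char *psalloc)
{
    return psalloc != NULL && strcmp(psalloc, "early") == 0;
}
```

A caller could pass getenv("PSALLOC") straight to psalloc_requests_early(), since the helper accepts a NULL pointer for an unset variable.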

Free Memory

[Flowchart: the free memory list is checked. When free pages < minfree, the page stealer runs, and it keeps running until free pages => maxfree.]

Figure 5-14. Free Memory

Notes: Introduction
To maintain system performance, the VMM always wants some physical memory to be available for page-ins. This section describes the free memory list and the algorithms used to keep pages on the list.

Free memory list
The VMM maintains a linked list containing all the currently free real memory pages in the system. When a page fault occurs, the VMM just takes the first page from this list and assigns it to the faulting page.

Page stealer
The page stealer is invoked when the number of memory pages on the free list drops below the threshold defined by the value of minfree. The page stealer attempts to replenish the free list until it reaches the high threshold defined by maxfree. The values of maxfree and minfree can be viewed or adjusted on AIX 5.1 with the vmtune command (/usr/samples/kernel/vmtune), and on AIX 5.2 with the vmo command.

Evidence
The page stealer is visible as the lrud kernel process.

Page replacement algorithm
The method used by the page stealer to select a page which should be placed on the free list is called the Page Replacement Algorithm. The page replacement algorithm used in AIX is called the clock-hand algorithm.
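The interaction between the free list and the minfree and maxfree thresholds can be sketched as follows. The structures and function names are hypothetical and much simpler than the real VMM data structures; the sketch only shows a frame being taken from the head of a linked list and the two threshold comparisons described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frame descriptor; the real VMM structures differ. */
struct frame {
    struct frame *next;
    unsigned long frame_no;
};

struct free_list {
    struct frame *head;
    unsigned long nfree;    /* frames currently on the list */
    unsigned long minfree;  /* low threshold: page stealer is invoked */
    unsigned long maxfree;  /* high threshold: page stealer stops */
};

/*
 * Take the first frame from the free list, as happens on a page fault.
 * *need_stealer is set when the list has dropped below minfree.
 */
struct frame *take_free_frame(struct free_list *fl, int *need_stealer)
{
    assert(fl != NULL && need_stealer != NULL);

    struct frame *f = fl->head;
    if (f != NULL) {
        fl->head = f->next;
        f->next = NULL;
        fl->nfree--;
    }
    *need_stealer = fl->nfree < fl->minfree;
    return f;
}

/* Number of frames the page stealer must add to reach maxfree. */
unsigned long frames_to_steal(const struct free_list *fl)
{
    assert(fl != NULL);
    return fl->nfree < fl->maxfree ? fl->maxfree - fl->nfree : 0;
}
```

Using two thresholds rather than one means the stealer does a batch of work each time it runs instead of being woken for every single page fault.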

Clock Hand Algorithm

[Diagram: physical pages arranged in a circle with a rotating clock hand. For a page with Reference = 1, the reference bit is changed to zero when the clock hand passes. A page with Reference = 0 is eligible to be stolen.]

Figure 5-15. Clock Hand Algorithm

Notes: Clock hand
The algorithm is called the clock-hand algorithm because the algorithm acts like a clock hand that is constantly pointing at frames in order. The clock-hand advances whenever the algorithm advances to the next frame. If a modified page is stolen, the clock-hand algorithm writes the page to disk (to paging space or a file system) before stealing the page.

Process
This algorithm is commonly used in operating systems when the hardware provides only a reference bit for each page in the physical memory. The hardware automatically sets the reference bit for a page translation whenever the page is referenced.

Step  Action
1.    Each time a page is referenced the hardware sets the referenced bit in the PTE (Page Table Entry) for that page.
2.    The clock hand algorithm scans all PTEs checking the reference bit.
3.    If the reference bit is found set the bit is reset.
4.    If the reference bit is found reset the page will be stolen.
5.    The process continues until the number of free pages reaches maxfree.

Bucket size
The clock hand algorithm examines a set of frames at a time. The number of frames considered in each cycle is known as the lrud bucket size. If it were to examine all memory frames in the system in one cycle, then it is likely that all frames would have been referenced by the time the algorithm starts its second pass.
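The steps in the table, together with the bucket size limit, can be sketched as a single sweep function. This is a minimal illustration, not the lrud implementation: the frame_state structure and the clock_sweep name are hypothetical, and writing a modified page to disk before it is stolen is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-frame state for this sketch. */
struct frame_state {
    int referenced;  /* reference bit, set by hardware on access */
    int is_free;     /* frame is already on the free list */
};

/*
 * Advance the clock hand over at most `bucket` frames. A frame whose
 * reference bit is set has the bit reset; a frame whose bit is already
 * reset is stolen. The sweep stops early once *nfree reaches maxfree.
 * Returns the new position of the clock hand.
 */
size_t clock_sweep(struct frame_state *frames, size_t nframes, size_t hand,
                   size_t bucket, size_t *nfree, size_t maxfree)
{
    assert(frames != NULL && nframes > 0 && nfree != NULL);

    for (size_t i = 0; i < bucket && *nfree < maxfree; i++) {
        struct frame_state *f = &frames[hand];

        if (!f->is_free) {
            if (f->referenced) {
                f->referenced = 0;  /* step 3: reset the bit */
            } else {
                f->is_free = 1;     /* step 4: steal the page */
                (*nfree)++;
            }
        }
        hand = (hand + 1) % nframes;
    }
    return hand;
}
```

A page that was referenced on the first pass is stolen on a later pass only if nothing has touched it in between, which is why the bucket is kept smaller than the whole of memory.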

Fatal Memory Exceptions

In all of the following cases, the VMM bypasses all kernel exception handlers and immediately halts the system:
- A page fault occurs in the interrupt environment.
- A page fault occurs with interrupts partially disabled.
- A protection fault occurs while in kernel mode on kernel data.
- An I/O error occurs when paging in kernel data.
- An instruction storage exception occurs while in kernel mode.
- A memory exception occurs while in kernel mode without an exception handler set up.

Figure 5-16. Fatal Memory Exceptions

Notes: Introduction
Not all page and protection faults can be handled by the OS. When a fault occurs that cannot be handled by the OS, the system will panic and immediately halt.

Checkpoint

- The ________ environment variable can be used to change the paging space policy of a process.
- A ______ ______ when interrupts are disabled will cause the system to crash.
- The system hardware maintains a table of recently referenced ______ to ______ address translations.
- The S_____ P____ F____ T____ contains information on all pages resident in _______ _______.
- A _________ signal is sent to every process when the free paging space drops below the warning threshold.
- Each ______ _______ has an XPT.

Figure 5-17. Checkpoint

Notes:

Exercise

- Complete exercise five
- Consists of theory and hands-on
- Ask questions at any time
- Activities are identified by a

What you will do:
- Observe the effect of the AIX paging space allocation policies on an application program
- Investigate what effect running out of paging space has on applications and the system
- Diagnose a crash dump from a system with paging space depletion

Figure 5-18. Exercise

Notes:
Turn to your lab workbook and complete exercise five.

Unit Summary

- Virtual memory management
- Memory object types
- Demand paging system
- Backing store
- Paging space allocation policies
- Free memory list - clock hand

Figure 5-19. Unit Summary

Notes:

Unit 6. Logical Partitioning
What This Unit Is About
This unit describes the implementation of logical partitioning (otherwise known as LPAR) on pSeries systems.

What You Should Be Able to Do
After completing this unit, you should be able to:
• Describe the implementation of logical partitioning
• List the components required to support partitioning
• Understand the terminology relating to partitions

How You Will Check Your Progress
Accountability:
• Checkpoint questions
• Unit review

References
AIX Documentation:
AIX Installation in a Partitioned Environment
Hardware Management Console for pSeries Installation and Operations Guide
Available from http://www-1.ibm.com/servers/eserver/pseries/library/hardware_docs/hmc.html

Unit Objectives
At the end of this lesson you should be able to:
- Describe the implementation of logical partitioning
- List the components required to support partitioning
- Understand the terminology relating to partitions

Figure 6-1. Unit Objectives

BE0070XS4.0

Notes:

Partitioning
- Subdivision of a single machine to run multiple operating system instances
- Collection of resources able to run an operating system image: processors, memory, I/O devices
- Physical partition: building blocks
- Logical partition: independent assignment of resources

Figure 6-2. Partitioning

BE0070XS4.0

Notes: Introduction
Partitioning is the term used to describe the ability to run multiple independent operating system images on a single server machine. Each partition has its own allocation of processors, memory and I/O devices. A large system that can be partitioned to run multiple images offers more flexibility than using a collection of smaller individual systems.

Reasons for partitioning
Partitioning is intended to address a number of pervasive requirements, including:
- Server consolidation: The ability to consolidate a set of disparate workloads and applications onto a smaller number of hardware platforms, in order to reduce total cost of ownership (administrative and physical planning overhead).
- Production and test environments: The ability to have an environment to test and migrate software releases or applications, which runs on exactly the same platform as the production environment to ensure compatibility, but does not cause any exposure to the production environment.
- Data and workload isolation: The ability to support a set of disparate applications and data on the same server, while maintaining very strong isolation of resource utilization and data access.
- Scalability balancing: The ability to create resource configurations appropriate to the scaling characteristics of a particular application, without being limited by hardware upgrade granularities.
- Flexible configuration: The ability to change configurations easily to adapt to changing workload patterns and capacity requirements.

Partitioning types
In the UNIX marketplace, there are two main types of partitioning available:
- Physical partitioning
- Logical partitioning
There are a number of distinct differences between the two implementations.

Physical Partitioning

[Diagram: three SMP building blocks joined by an interconnect, each with dedicated CPU, memory and I/O. The building blocks are grouped into two physical partitions, each running its own operating system.]

Figure 6-3. Physical Partitioning

BE0070XS4.0

Notes: Introduction
Physical partitioning is the term used to describe a system where the partitions are based around physical building blocks. Each building block contains a number of processors, system memory and I/O device connections. A partition consists of one or more physical building blocks. The diagram shows a system that contains three building block units. The system is currently configured to run two partitions. One partition consists of all of the resources (CPU, memory, I/O) on two physical building blocks. The other partition consists of all of the resources on the remaining building block.

Properties
A system that implements physical partitioning has the following characteristics:
- Multiple memory coherence domains, each with an OS image.
A memory coherence domain is a group of processors that are accessing the same physical system memory. Memory coherence traffic (such as cache line invalidation, and snooping) is shared between the processors in the domain.
- Separation controlled by interfaces between physical units.
Memory coherence information stays within the physical building blocks allocated to the partition. A processor that is part of one building block cannot access the memory on another building block that is not part of the memory coherence domain (partition).
- Strong software isolation, strong hardware fault isolation.
Applications running inside an operating system instance have no impact on applications running inside another partition. A failure of a component on one system building block will not (or should not) impact a partition running on other building blocks. However the system as a whole still contains components that could impact multiple partitions in the event of failure, for example a failure of the backplane interconnect.
- Granularity of allocation at the physical building block level.
A partition that does not have enough resources can only be grown by incorporating whole building blocks, and therefore will include all of the resources on the building block, even though they may not be desired. For example, a partition that needs more processors will need to add another building block. By doing so, the partition will also incorporate the memory and I/O devices on that building block.
- Resources allocated only by contents of complete physical group.
The granularity of growing individual resources (CPU, memory, I/O) is determined by the amount of each resource on the physical building block being added to the partition. For example, in a system where each building block contains 4 processors, a partition that required more CPU power would receive an increment of 4 processors, even though perhaps only 1 or 2 would be sufficient.
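The granularity example above is simple ceiling arithmetic. The sketch below works it through in C; the function names are hypothetical and the numbers come from the 4-processor building block in the example.

```c
#include <assert.h>

/* Whole building blocks needed to supply `extra_cpus` more processors. */
unsigned blocks_to_add(unsigned extra_cpus, unsigned cpus_per_block)
{
    assert(cpus_per_block > 0);
    return (extra_cpus + cpus_per_block - 1) / cpus_per_block;
}

/* Processors actually gained once whole building blocks are added. */
unsigned cpus_gained(unsigned extra_cpus, unsigned cpus_per_block)
{
    return blocks_to_add(extra_cpus, cpus_per_block) * cpus_per_block;
}
```

For a partition that needs 2 more processors on a machine with 4 processors per block, blocks_to_add(2, 4) is 1 and cpus_gained(2, 4) is 4, so two of the added processors exceed the requirement.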

Example
The Sun Enterprise 10000 and Sun Fire 15K are examples of systems that use physical partitioning. In the case of Sun machines, the term domain is used instead of partition.

Logical Partitioning

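[Diagram: a managed system running three logical partitions (LPAR 1, LPAR 2, LPAR 3), each with its own OS, on top of the hypervisor and the system's processors, memory, and I/O adapters and devices (up to 16 LPARs). The Hardware Management Console (HMC) is attached to the managed system through RS232/RS422 serial links and Ethernet.]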

Figure 6-4. Logical Partitioning

BE0070XS4.0

Notes: Introduction
Logical partitioning is the term used to describe a system where the partitions are created independently of any physical boundaries. The diagram shows a system configured with three partitions. Each partition contains an amount of resource (CPU, memory, I/O slots) that is independent of the physical layout of the hardware. In the case of pSeries systems, an additional system, the Hardware Management Console for pSeries (HMC), is required for configuring and administering a partitioned server. The HMC connects to the system through a dedicated serial link connection to the service processor. Additionally, applications running on the HMC communicate over an Ethernet connection with the operating system instances in the partitions to provide service functionality, and in the case of AIX 5.2, dynamic partitioning capabilities.

Properties
A system that implements logical partitioning has the following characteristics:
- One memory coherence domain with multiple OS images.
This basically means that all processors in the system are aware of the physical memory addresses being accessed by the other processors, even if they are in a different partition. Since each partition is allocated its own portion of physical memory, this has no real performance impact.
- Separation controlled mainly by address mapping mechanisms.
Rather than using physical boundaries between components to control the memory access available to each partition, a set of address mapping mechanisms provided by hardware and firmware features is used. The operating system running in each partition is restricted in its ability to access physical memory, and is only permitted to access physical memory that has been explicitly assigned to that partition.
- Strong software isolation, fair-to-strong hardware fault isolation.
Applications running inside an operating system instance have no impact on applications running inside another partition. The failure of the operating system in one partition has no impact on the others.
- Granularity of allocation at the logical resource level (or below).
In the case of pSeries systems, the current unit of allocation for each resource type is:
• One CPU
• Individual I/O slot
• 256MB memory
- Resources allocated in almost any combinations or amounts.
The amount of memory allocated to a partition is independent of the number of CPUs or I/O slots. Each resource quantity is based on the system administrator’s understanding of the needs of the partition, rather than the physical layout of the machine.
- Some resources can even be shared.
In the case of pSeries systems, some resources are shared by all partitions. These are divided into two classes:
• Physical resources (such as power supplies) that are visible to each partition.
• Logical resources, where each partition is given its own “instance”, for example, the operator panel and virtual console devices provided by the HMC.
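The allocation units listed above can be captured in a small request structure. The sketch below is illustrative only: the structure, the function names and the decision to round a memory request up to the next 256MB multiple are assumptions made for this example, not a description of how the HMC sizes a partition.

```c
#include <assert.h>

#define MEM_UNIT_MB 256u  /* memory allocation unit quoted in the text */

/* Hypothetical partition request: each resource is sized independently. */
struct lpar_request {
    unsigned cpus;       /* allocated in units of one CPU */
    unsigned io_slots;   /* allocated in units of one I/O slot */
    unsigned memory_mb;  /* always a multiple of MEM_UNIT_MB */
};

/* Round a memory request up to a whole number of 256MB units. */
unsigned round_up_memory_mb(unsigned wanted_mb)
{
    return ((wanted_mb + MEM_UNIT_MB - 1) / MEM_UNIT_MB) * MEM_UNIT_MB;
}

struct lpar_request make_request(unsigned cpus, unsigned io_slots,
                                 unsigned wanted_mb)
{
    struct lpar_request r = { cpus, io_slots, round_up_memory_mb(wanted_mb) };
    assert(r.memory_mb % MEM_UNIT_MB == 0);
    return r;
}
```

Unlike the physical partitioning case, asking for one more CPU here changes only the cpus field; memory and I/O slots are unaffected.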

Components Required for LPAR

Hardware
- Interrupt controller hardware
- Processors require RMO, RML and LPAR ID registers
- Processors require Hypervisor support
- Means no LPAR support on older machines
Firmware
- Global firmware image
- Partition specific firmware instance
- Hypervisor code
Operating System
- Use of Hypervisor callout by VMM
- Means no LPAR support for older operating systems (e.g. AIX 4.3)
All 3 required for LPAR operation

Figure 6-5. Components Required for LPAR

Notes: Introduction
No single feature determines whether a pSeries system is capable of implementing LPAR or not. Rather, it is a combination of features provided by different components, all of which must be present.

Hardware
The following hardware features are required for LPAR support:
- Interrupt controller hardware
The interrupt controller hardware on the system directs interrupts to a CPU for processing. In the case of a partitioned system, the interrupt controller hardware must be capable of maintaining multiple global interrupt queues, one for each partition. The hardware must be capable of recognizing the source of an interrupt and determining which partition should receive the interrupt notification. For example, an interrupt from a SCSI adapter card must be sent to the partition that controls the card and the devices connected to it. If the interrupt is sent to a CPU that is part of a different partition, the CPU would be unable to access the device to process the interrupt.
- Processor support
A processor requires 3 new registers in order to be used in a partitioned environment. The registers are:
• Real Mode Offset (RMO) register
The RMO register is used by the processor when referencing an address in real mode. The use of the register is described in detail in a later part of this unit. All processors in the same partition have the same value loaded in the RMO register.
• Real Mode Limit (RML) register
The RML register is also used when the processor is referencing an address in real mode. The use of the register is described in detail in a later part of this unit. All processors in the same partition have the same value loaded in the RML register.
• Logical Partition Identity (LPI) register
The LPI register contains a value that indicates the partition to which the processor is assigned. All processors in the same partition have the same value loaded in the LPI register.
In order to implement the required isolation between partitions, a processor must have hypervisor support. A processor implements hypervisor support by recognizing the HV bit in the Machine State Register (MSR). The HV bit of the MSR, along with the Problem State bit, indicates if the processor is in hypervisor mode. Hypervisor mode is implemented in a similar fashion to the system call mechanism used to transition the processor between Problem State (user mode) and Supervisor State (kernel mode). Hypervisor mode can only be invoked from Supervisor State. In other words, only kernel code can make hypervisor calls. The hypervisor is described in detail later.
The POWER4 processor is the first CPU used in pSeries systems that has the required capabilities.

Firmware
The job of firmware in a system is to:
- Identify and configure system components
- Initialize/Reset system components
- Create a device tree
- Locate an operating system boot image
- Load the boot image into memory and transfer control
- When the operating system is running, it has control over the hardware. In order to allow AIX to run on different hardware platform types, it uses a component of firmware called Run-Time Abstraction Services (RTAS) to interact with the hardware. The RTAS functions are provided by pSeries RISC Platform Architecture (RPA) platforms to insulate the operating system from having to know about and manipulate a number of key functions which ordinarily would require platform-dependent code. The OS calls these functions rather than manipulating hardware registers directly, reducing the need for hard-coding the OS for each platform. Examples of RTAS functions include accessing the time-of-day clock, and updating the boot list in NVRAM.
- When the operating system image is terminated, control is returned to firmware.
Since firmware in a partitioned system now has to deal with multiple operating system images, a special version is required that provides additional functionality. The functionality of firmware is now divided into two parts, known as Global firmware and Partition firmware.
The global firmware is initialized when the system is first powered on. It identifies and configures all of the hardware components in the system, and creates a global device tree that contains information on all devices.
When a partition is started, a partition specific instance of firmware is created. The partition specific instance contains a device tree that is a subset of the global device tree, and contains only the devices that have been assigned to the partition. It then continues with the task of locating an operating system image and loading it. The RTAS functionality provided by partition firmware performs validation checks and locking to ensure that the partition is permitted to access the particular hardware feature being used, and that its use does not conflict with that of another partition.
An additional component of firmware required for LPAR support is the hypervisor function.

Hypervisor
The hypervisor can be considered as a special set of RTAS features that run in hypervisor mode on the processor. The hypervisor is trusted code that allows a partition to manipulate physical memory that is outside the region assigned to the partition. The hypervisor code performs partition and argument validation before allowing the requested action to take place. The hypervisor provides the following functions:
- Page Table access
Page tables are described later in this unit when we examine the changes in translating a virtual address to a physical address in the LPAR environment.
- Virtual console serial device
When multiple partitions are running on a system, each partition requires an I/O device to act as the console. Most pSeries systems have two or three native serial ports, so it would be impractical to insist that each partition have its own native serial port, or enforce the addition of additional serial adapters. The hypervisor provides a virtual serial console interface to each partition. The I/O from the virtual device is communicated to the HMC via the serial link from the service processor in the partitioned system.
- Debugger support
The hypervisor also provides support that permits the debugger running on the system to access specific memory and register locations.

Operating system
The operating system that will run in a partition needs to be modified to use hypervisor calls to manipulate the Page Frame Table (PFT), rather than maintain the table directly in memory. A few other low level kernel components are aware of the fact that the OS is running inside a partition. The vast bulk of the kernel however is unaware, since there is no need for any changes. This allows the operating system to present a consistent interface to the application layer, regardless of whether it is running in a partition or running as the only operating system on a regular standalone machine.
The net effect of the required changes is that an operating system not designed for use in a partitioned environment will fail to boot. This means that older operating systems (such as AIX 4.3) will not work in a partition.
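The "partition and argument validation" that the hypervisor performs before acting on a request can be pictured with a short sketch. Everything named here is hypothetical: the partition_mem table, the hv_status codes and hv_validate_page() are invented for illustration and do not correspond to the actual hypervisor interface. The sketch assumes each partition owns a single contiguous range of physical memory, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARTITIONS 16  /* "Up to 16 LPARs" in the diagram */

/* Hypothetical record of the physical memory assigned to a partition. */
struct partition_mem {
    uint64_t base;  /* first physical address owned by the partition */
    uint64_t size;  /* number of bytes owned */
};

enum hv_status { HV_OK = 0, HV_BAD_PARTITION = -1, HV_BAD_ADDRESS = -2 };

/*
 * Accept a page-table update only if the caller is a known partition
 * and the page lies wholly inside the memory assigned to that partition.
 */
enum hv_status hv_validate_page(const struct partition_mem *table,
                                size_t nparts, size_t caller,
                                uint64_t phys_addr, uint64_t page_size)
{
    assert(table != NULL);

    if (nparts > MAX_PARTITIONS || caller >= nparts)
        return HV_BAD_PARTITION;

    const struct partition_mem *p = &table[caller];
    if (phys_addr < p->base || page_size > p->size ||
        phys_addr - p->base > p->size - page_size)
        return HV_BAD_ADDRESS;

    return HV_OK;
}
```

Because the operating system cannot write the table itself, a check of this kind is the point at which an attempt to map another partition's memory would be refused.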

Operating System Interfaces

[Diagram: Applications sit above the Operating System (AIX). Operating system components shown: Boot/Config, VMM, Kernel Debugger & Dump, Virtual TTY Dev Driver, and the Platform Adaptation Layer (PAL). Interfaces to the platform: Device Tree Subset, Hardware Service Calls, Virtual Page Mapping, Register & Memory Access, and TTY Data Streams. Firmware components shown: "Partitioned" Open Firmware, "Global" Open Firmware, Run-Time Abstraction Services (RTAS), and the Hypervisor with Partition Validation.]

Figure 6-6. Operating System Interfaces

Notes: Introduction
The diagram summarizes the interfaces used by the operating system to interact with the hardware platform. It details the different components of the OS that interact with each function provided by the platform firmware. The Platform Adaptation Layer (PAL) is an operating system component similar in function to the RTAS layer provided by firmware. In other words, its job is to mask the differences between hardware platforms from other parts of the kernel.

Virtual Memory Manager

[Diagram: the effective addresses of Process 1 and Process 2 map into the virtual address space, which is backed by physical memory, filesystem pages and paging space.]

Figure 6-7. Virtual Memory Manager

Notes: Introduction
The job of the Virtual Memory Manager (VMM) component of the operating system is to manage the effective address space of each process on the system, and ensure that pages are mapped to physical memory when required so that they can be accessed by the processors. The translation of a virtual address to a physical address is an area of the operating system that has undergone some changes to allow the implementation of a partitioned environment, since there are now multiple operating system images co-existing in a single machine.

Real Address Range

[Diagram: processor view of the real address range on a non-LPAR system with 2 PCI busses. From address 0 upward: Sys Mem 0, the HB0 region and its I/O adapters, the HB1 region and its I/O adapters, the 4GB mark, then Sys Mem 1. A legend marks regions that are invalid for load/store.]

Figure 6-8. Real Address Range

Notes: Introduction
Before examining the changes in address translation for the LPAR environment, we first take a closer look at the memory layout on a non-LPAR system.

Device I/O
The hardware provides memory mapped access to I/O devices. When the processor writes to specific addresses, the data is passed to the Host Bridge, rather than being stored in the DRAMs or other components used to implement physical memory. A system has at least one Host Bridge (HB). Each HB is allocated a unique portion of the system address space, which is mapped to a region in the address map. The Host Bridge device allocates portions of its address space to each I/O adapter plugged into a slot it controls. Data written to the Host Bridge is passed to a specific I/O adapter card, based on the address being written.
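The two-level decode described under "Device I/O" (address to Host Bridge, then Host Bridge to adapter slot) can be sketched as a pair of range lookups. The structures, the route_store() name and any addresses used with it are hypothetical; real hardware performs this decode in the bridge logic, not in software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical window of address space claimed by one I/O adapter slot. */
struct slot_window {
    uint64_t base, size;
    int slot;
};

/* Hypothetical Host Bridge: owns [base, base + size) and divides it
 * among the adapter slots it controls. */
struct host_bridge {
    uint64_t base, size;
    const struct slot_window *slots;
    size_t nslots;
};

/*
 * Return the slot that a store to `addr` is routed to, -1 if a bridge
 * owns the address but no adapter claims it, or -2 if no bridge owns it.
 */
int route_store(const struct host_bridge *hbs, size_t nhb, uint64_t addr)
{
    assert(hbs != NULL || nhb == 0);

    for (size_t i = 0; i < nhb; i++) {
        if (addr < hbs[i].base || addr - hbs[i].base >= hbs[i].size)
            continue;                       /* not this bridge */
        for (size_t j = 0; j < hbs[i].nslots; j++) {
            const struct slot_window *w = &hbs[i].slots[j];
            if (addr >= w->base && addr - w->base < w->size)
                return w->slot;             /* passed to this adapter */
        }
        return -1;
    }
    return -2;
}
```

An address that returns -2 here would either fall in system memory or in a region that is invalid for load/store.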

Physical memory
Another feature of the diagram that is worth noting is that the address range of physical memory in the system is not necessarily contiguous. The physical memory in the system always starts with address zero; however, depending on the total amount of memory and the number of Host Bridge devices in the system, the physical address range may be divided into multiple components. In other words, there appear to be ‘holes’ in the physical address range used by the system. This is perfectly normal, and the VMM system of AIX (and most other modern operating systems) is designed to cope with this. As an example, a system with 8GB total of physical memory may address 3GB of that memory using physical addresses in the range 0 to 3GB, and the remaining part of memory using addresses 4.5GB to 9.5GB.
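The 8GB example can be written down as a small table of populated ranges. The following is only a sketch: the mem_range type and the function names are invented for illustration and are not AIX structures. It shows that the two ranges add up to 8GB, and that an address such as 4GB falls into a hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB (1ULL << 20)
#define GB (1ULL << 30)

/* One contiguous range of populated physical addresses (hypothetical type). */
struct mem_range {
    uint64_t start; /* first physical address of the range */
    uint64_t size;  /* length of the range in bytes */
};

/* The 8GB example from the text: 0 to 3GB, then 4.5GB to 9.5GB. */
static const struct mem_range example_map[] = {
    { 0,         3 * GB },
    { 4608 * MB, 5 * GB },
};
static const size_t example_map_len =
    sizeof(example_map) / sizeof(example_map[0]);

/* Sum the sizes of all populated ranges. */
static uint64_t total_memory(const struct mem_range *map, size_t n)
{
    uint64_t total = 0;

    assert(map != NULL);
    for (size_t i = 0; i < n; i++)
        total += map[i].size;
    return total;
}

/* Return 1 if addr is backed by memory, 0 if it falls into a hole. */
static int is_populated(const struct mem_range *map, size_t n, uint64_t addr)
{
    assert(map != NULL);
    for (size_t i = 0; i < n; i++)
        if (addr >= map[i].start && addr - map[i].start < map[i].size)
            return 1;
    return 0;
}
```

With this map, is_populated() reports addresses between 3GB and 4.5GB as holes, while total_memory() still reports the full 8GB.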

Real Mode Memory

• Real address = address generated when translation disabled
  - Used by system startup code that runs before VMM is configured
  - Used by interrupt vector mechanism
  - Used by VMM itself to maintain tables
• Real mode memory normally starts at address zero
• Size of real mode memory region depends on operating system
• On non-LPAR systems, real address = physical address
• On LPAR systems real address != physical address
  - Only one physical address zero in the system
  - Physical address zero used by hypervisor
  - Each partition requires its own address zero
  - Requires mapping from real address generated by partition to physical address used by memory hardware

Figure 6-9. Real Mode Memory

Notes:

Introduction
In addition to considering the ranges used when addressing memory, another important distinction to make is the type of access being performed. Address translation can be enabled or disabled. The function of the VMM is to translate a virtual address into a real (or physical) address.

Real address
A real address is an address that is generated by the processor when address translation is disabled. Address translation for instructions and data can be enabled or disabled independently, and the status of this is indicated by bits in the MSR. Typically real addresses are used by specialized parts of kernel code, such as the boot process (before the VMM is initialized) or interrupt/exception handler code. Real mode memory starts at address zero. The size of real mode

memory is dependent on the requirements of the operating system. Another important thing to note is that on a non-LPAR system, a real address is equivalent to a physical address.

LPAR changes
The assertion that a real address is the same as a physical address no longer holds true in the partitioned environment. Each partition requires its own address zero, but there is only one true physical address zero inside a system. In actual fact, physical address zero is used by the hypervisor, but we can generalize the statement as: for any given address n, each partition expects to be able to access address n. Obviously they can’t all access the same physical address n, since a system only has a single overall physical address range (although it may be split into multiple sections), so something needs to be done to accommodate this. We explain things later, but for now, just know that:
• For real mode addresses, this is where the RMO register of the processor is used.
• For virtual addresses, partition page tables are used to translate the partition specific address n into a system-wide physical address.

Operating System Real Mode Issues

• Real mode memory aligned on same size address boundary
  - e.g. 1GB real mode region aligned on 1GB address boundary
• AIX 5.2 & Linux require 256MB of real mode memory
  - VMM requires fixed size real mode region
  - Most VMM tables accessed only with address translation enabled
• AIX 5.1 real mode requirement scales with memory in partition
  - AIX 5.1 VMM accesses many tables with translation disabled
  - Some of these tables scale with amount of memory in partition

AIX 5.1 Real Mode Requirements
Real mode region size    Supported partition sizes
256MB                    256MB - 4GB
1GB                      1GB - 16GB
16GB                     16GB - 256GB

Figure 6-10. Operating System Real Mode Issues

Notes:

Introduction
The amount of real mode memory required by a partition depends upon two factors:
1) The version of the operating system.
2) The amount of memory allocated to the partition.

Alignment
Physical memory allocated in a partitioned environment for use as Real Mode memory by a partition must be contiguous, and aligned on an address boundary that is divisible by the size of the real mode region. For example, a 16GB real mode region must be aligned on an address boundary divisible by 16GB (i.e. 16GB, 32GB, 48GB, 64GB etc.). As we will see later, address 0 cannot be used, since it is used by the hypervisor.
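The alignment rule reduces to a divisibility test. The sketch below uses an invented function name, and it assumes that the caller has already verified that the region is physically contiguous. It simply checks a candidate starting address against the rule stated above.

```c
#include <assert.h>
#include <stdint.h>

#define GB (1ULL << 30)

/*
 * Return 1 if a real mode region of `size` bytes may start at physical
 * address `start`, 0 otherwise. Hypothetical helper, not a firmware
 * interface.
 */
static int real_mode_placement_ok(uint64_t start, uint64_t size)
{
    assert(size != 0);
    if (start == 0)            /* address 0 belongs to the hypervisor */
        return 0;
    return start % size == 0;  /* boundary must be divisible by the size */
}
```

For a 16GB region, starting addresses of 16GB and 48GB pass the check, while 24GB and 0 do not.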

AIX 5.2 and Linux
Both AIX 5.2 and Linux require only 256MB of memory be accessible in real mode, since the VMM only uses real mode to maintain tables that do not scale with memory size.

AIX 5.1
The VMM in AIX 5.1 maintains tables in real mode memory that scale with the total amount of memory allocated to the partition. The result of this is that partitions running AIX 5.1 may need 256MB, 1GB or 16GB of real mode memory, rather than the 256MB required by AIX 5.2 and Linux. The alignment requirements of the 1GB and 16GB real mode regions can cause problems on systems that are using a large percentage of their physical memory. In these situations, the order in which partitions are started can determine whether all partitions can be started.
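The real mode requirement can be expressed as a small lookup. This is a sketch under assumptions: the enum and function names are invented, and where the supported partition size ranges overlap (for example at exactly 4GB or 16GB) the sketch picks the smaller region, which is our choice rather than documented behavior.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1ULL << 20)
#define GB (1ULL << 30)

/* Operating system running in the partition (hypothetical enum). */
enum os_level { OS_AIX_51, OS_AIX_52, OS_LINUX };

/*
 * Return the real mode region size for a partition of the given size.
 * AIX 5.2 and Linux always need 256MB; AIX 5.1 scales with partition
 * memory. Overlapping ranges resolve to the smaller region (assumption).
 */
static uint64_t real_mode_size(enum os_level os, uint64_t partition_bytes)
{
    if (os != OS_AIX_51)
        return 256 * MB;
    if (partition_bytes <= 4 * GB)
        return 256 * MB;
    if (partition_bytes <= 16 * GB)
        return 1 * GB;
    assert(partition_bytes <= 256 * GB); /* largest supported size */
    return 16 * GB;
}
```

A 4.5GB partition running AIX 5.1 therefore needs a 1GB real mode region, while the same partition running AIX 5.2 needs only 256MB.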

Address Translation

• If address translation enabled, VMM converts virtual address to real address
  - Treats address as segment ID, page number and page offset
  - Determines physical page starting address from segment ID and page number
    Non-LPAR systems use software PFT (page frame table)
    LPAR systems use partition page tables (stored outside partition)
  - Adds page offset to physical page address
  - Value of RMO register is not used
• If address translation disabled, value of RMO register added to address
  - All processors in the same partition have the same value in RMO register
  - RMO value set by firmware when partition is activated

Figure 6-11. Address Translation

Notes:

Introduction
The method used by a partition to interpret an address depends on whether virtual address translation is currently enabled or disabled.

Translation enabled
When address translation is enabled, the VMM is in charge. In a normal non-LPAR system, the VMM is effectively translating the virtual address to a real address, but because a real address is the same as a physical address, there is no problem. In a partitioned environment, the VMM uses a slightly different method to convert a virtual address into a true system-wide physical address. The VMM converts the virtual address into a real address; however, the real address is a “logical” address within all of the memory assigned to the partition. The VMM then performs an additional step, and converts the partition-specific real address into a system-wide physical address. It accomplishes this using partition page tables.

Translation disabled
When address translation is disabled, the RMO (Real Memory Offset) register of the processor is used in the address calculation. The processor knows when it is dealing with a real address, as indicated by the status bits in the MSR. When dealing with a real address, the processor automatically (and without the knowledge of the operating system) adds the value loaded in the RMO to the address to convert the partition specific real address into a true system-wide physical address before submitting it to the memory controller hardware as part of the request to read or write the memory location. The RML register is used to limit the amount of memory that a partition can access in real mode.
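The real mode calculation can be modeled in a few lines. This is only a sketch of what the hardware does: the structure and function names are invented, treating the RML value as a plain byte count is an assumption, and the notes do not say what the processor does when the limit is exceeded, so the sketch just returns an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB (1ULL << 20)
#define GB (1ULL << 30)

/* Per-partition real mode registers (hypothetical software model). */
struct real_mode_regs {
    uint64_t rmo; /* offset added to every real mode address */
    uint64_t rml; /* size of the real mode region in bytes (assumed encoding) */
};

/*
 * Convert a partition specific real address into a system-wide physical
 * address. Returns 0 on success, -1 if the address is beyond the limit.
 */
static int real_to_physical(const struct real_mode_regs *regs,
                            uint64_t real, uint64_t *phys)
{
    assert(regs != NULL && phys != NULL);
    if (real >= regs->rml)
        return -1;
    *phys = regs->rmo + real;
    return 0;
}
```

For a partition whose real mode region starts at physical address 1GB, real address 0 becomes physical address 1GB, so every partition sees its own address zero.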

Allocating Physical Memory

• Memory divided into 256MB Physical Memory Blocks (PMB)
  - Each PMB has a unique ID
• Multiple PMBs assigned to provide the logical address space for a partition
  - e.g. 2GB partition requires 8 PMBs
  - PMBs assigned to a partition need not be contiguous
• Logical Memory Block (LMB) is the name given to a block of memory when viewed from the partition perspective
  - Each LMB has a unique ID within a partition, and is associated with a PMB
• Some PMBs are used for special purposes, and cannot be allocated to partitions
  - Partition page tables
  - TCE space
  - Hypervisor

Figure 6-12. Allocating Physical Memory

Notes:

Introduction
The physical memory of a partitioned system must be divided up between the partitions that are to be started.

Terminology
The physical memory of the system is divided up into 256MB chunks called Physical Memory Blocks (PMBs). Each PMB has a unique ID within the system, so that the hypervisor can track which PMBs are allocated for specific purposes. In order to be activated, a partition will be allocated sufficient PMBs to satisfy the minimum memory requirement as indicated by the partition profile being activated. The PMBs assigned to a partition need not be contiguous. The partition views the memory assigned to it as a number of logical memory blocks (LMBs). Each LMB has an ID that is unique within the partition.
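The relationship between LMBs and PMBs can be pictured as a per-partition lookup table. The sketch below is a conceptual model only, not how the hardware or hypervisor stores the mapping. It assumes that LMB IDs run from 0 upward in logical address order and that PMB ID n covers the physical addresses starting at n times 256MB; all names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMB_SIZE (256ULL << 20) /* 256MB */

/* Per-partition map from LMB ID to PMB ID (hypothetical model). */
struct partition_memory {
    const uint32_t *lmb_to_pmb; /* indexed by LMB ID, holds the PMB ID */
    size_t          lmb_count;  /* number of LMBs in the partition */
};

/*
 * Convert a partition logical address into a physical address.
 * Returns 0 on success, -1 if the address is outside the partition.
 */
static int logical_to_physical(const struct partition_memory *pm,
                               uint64_t logical, uint64_t *phys)
{
    uint64_t lmb = logical / PMB_SIZE; /* which block */
    uint64_t off = logical % PMB_SIZE; /* offset inside the block */

    assert(pm != NULL && phys != NULL);
    if (lmb >= pm->lmb_count)
        return -1;
    *phys = (uint64_t)pm->lmb_to_pmb[lmb] * PMB_SIZE + off;
    return 0;
}
```

A 2GB partition has eight entries in its table, and because the PMBs need not be contiguous, consecutive LMBs may map to widely separated physical addresses.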

Some PMBs in the system are used for special purposes, and cannot be allocated for use by partitions. The number of PMBs allocated for these special purposes depends upon many factors.

Partition Page Tables

• Used when translating virtual address to physical address
• Stored outside memory area allocated to partition
• Under control of Hypervisor
  - VMM makes hypervisor call to read and update partition page table
• Scale with size of partition memory
  - Four 16 byte entries per 4K page of memory assigned to partition (rounded up to power of 2)
  - Equivalent to 1/64th of partition memory
• Placed in contiguous physical memory
• Aligned on address boundary divisible by table size
  - e.g. 64MB page table aligned on 64MB address

Figure 6-13. Partition Page Tables

Notes:

Introduction
As mentioned previously, each partition is allocated space for a partition page table. The table is used by the VMM in the partition to translate a partition specific virtual address into a system-wide physical address.

Page table requirements
The page table space for a partition is under the control of the hypervisor. In other words, the operating system instance cannot read or write the page table entries directly, but instead must make a hypervisor call to perform the requested action. Four 16 byte entries are required in the page table for each 4K page in the partition. This equates to a size equal to 1/64th of the memory allocated to the partition. For example, a partition with 1GB of memory requires a partition page table of 16MB in size. Page tables are allocated in sizes that are powers of two. A page table

requirement that is not a power of two is rounded up to the next size that is a power of two. So a partition that has 2.5GB of memory has a page table requirement of 40MB, but this would be rounded up to 64MB, the next power of two. The size of the page table allocated to a partition is large enough to handle the maximum memory amount the partition may grow to. The maximum memory amount is an attribute of the partition that is used in limiting the extent of dynamic LPAR operations. In addition, page tables must be allocated in contiguous physical memory. Page tables must be allocated on an address boundary that is divisible by the size of the page table. The hypervisor will attempt to place multiple page tables of 128MB or smaller inside a single PMB that has been allocated for page table use. If existing PMBs allocated for page table use do not contain sufficient space (or sufficient contiguous space), then the hypervisor will allocate more PMBs for page table use.

Performance penalty
There is a small performance penalty associated with the action of the VMM in a partition accessing the partition page tables. This performance penalty is only experienced when a virtual page is mapped into physical memory. If the virtual page is already in physical memory, then the VMM can perform the virtual to physical address translation by accessing the Translation Lookaside Buffer (TLB), a processor specific cache of the most recently accessed virtual to physical translations. The performance penalty is only really noticeable when a partition is performing heavy paging activity, since this means the page tables are being accessed frequently.
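The page table sizing rule is simple arithmetic, sketched below. The constants come from the notes above (four 16 byte entries per 4K page, rounded up to a power of two); the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1ULL << 20)
#define GB (1ULL << 30)

#define PAGE_BYTES    4096ULL /* 4K page */
#define PTE_BYTES     16ULL   /* size of one page table entry */
#define PTES_PER_PAGE 4ULL    /* entries required per page */

/* Round x up to the next power of two (x itself if already one). */
static uint64_t round_up_pow2(uint64_t x)
{
    uint64_t p = 1;

    assert(x != 0 && x <= (1ULL << 63));
    while (p < x)
        p <<= 1;
    return p;
}

/* Page table requirement before rounding: 1/64th of partition memory. */
static uint64_t page_table_raw(uint64_t max_mem)
{
    return (max_mem / PAGE_BYTES) * PTES_PER_PAGE * PTE_BYTES;
}

/* Size actually allocated, based on the partition's maximum memory. */
static uint64_t page_table_size(uint64_t max_mem)
{
    return round_up_pow2(page_table_raw(max_mem));
}
```

For a maximum memory of 2.5GB, page_table_raw() gives 40MB and page_table_size() gives 64MB, matching the example in the text; 1GB gives 16MB in both cases.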

V2.e. the TCE tables exist within the memory image of the operating system. TCE tables TCE tables contain information on the current TCE mappings for each host bridge device. © Copyright IBM Corp. therefore all PCI slots are controlled by a single operating system instance. The translation entries are used to convert the 32-bit I/O address generated by the adapter card on the I/O bus into a 64-bit address that the host bridge will submit to the system memory controller. 2003 Unit 6.0. In a standalone system.3 Student Notebook Uempty Translation Control Entries Used to allow 32-bit PCI adapters to access 64-bit memory space Similar in concept to partition page tables. but used for device I/O Provided as a function of the PCI Host Bridge device TCE space controlled by hypervisor Outside the control of a single partition Single PCI Host Bridge may have slots in different partitions Hypervisor required for dynamic LPAR operations TCE space allocated at the top of physical memory Amount of TCE space depends on number of PCI slots/drawers 512MB for 5-8 I/O drawers on p690 256MB for all others Figure 6-14. . the operating system controls all host bridge devices in the system.0 Notes: Introduction Host bridge devices use Translation Control Entries (TCEs) to allow a PCI adapter that can only generate a 32-bit address (i. In this case. 2001. Translation Control Entries BE0070XS4.0. an address in the range 0 to 4GB) to access system memory above the 4GB address range. Logical Partitioning 6-27 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

LPAR changes
In the partitioned environment, there is no requirement for all of the slots of a single host bridge device to be under the control of a single partition. As an example, a single host bridge device may support 4 PCI slots, and each slot may be assigned to a different partition. Since the TCEs need to be manipulated by the operating system as it establishes a mapping to the adapter card, we now have a situation where multiple partitions need to access adjacent memory locations. The memory locations are not under the control of any specific partition. Rather than having the TCE tables under the control of a special partition, they are placed under the control of the hypervisor. The hypervisor allocates each partition valid “windows” into the TCE address space that relate to the adapter slots that have been assigned to the partition.

Access to the TCE tables is performed by the partition in a manner similar to accessing partition page tables. The partition makes a hypervisor call (similar to a system call), and after validating the permissions and arguments, the hypervisor performs the requested action on the TCE table entry. Another benefit of having the TCE space under the control of the hypervisor is that it allows the “windows” that are valid for each partition to be changed on the fly, which is a requirement for the ability to dynamically reassign an I/O slot from one partition to another with a DLPAR operation.

The amount of memory allocated for TCE space depends on the number of host bridge devices (and PCI slots) in the system. TCE space is always located at the top of physical memory. Currently a p690 system that has between 5 and 8 I/O drawers will use 512MB of memory (2 PMBs) for TCE space. p690 systems with fewer than 5 I/O drawers, and all other LPAR capable pSeries systems, use 256MB (1 PMB) for TCE space.

Hypervisor

• Similar to system call mechanism
• Hypervisor bit in MSR indicating processor mode
• Can only be invoked from Supervisor (kernel) mode
• Used by operating system to access memory outside the partition
  - e.g. partition page tables
• Hypervisor code validates arguments and ensures each partition can only access its allocated page table & TCE space
  - Checks tables of PMBs allocated to each partition
  - Prevents a partition from accessing physical memory not assigned to the partition

Figure 6-15. Hypervisor

Notes:

Introduction
The hypervisor is the name given to code that runs under the hypervisor mode of the processor. The hypervisor code is supplied as part of the firmware image loaded onto the system. It is loaded in the first PMB in the system, starting at physical address zero.

Hypervisor mode
Hypervisor mode is entered using a mechanism similar to that used when a user application makes a system call. When a user application makes a system call, the processor state transitions from Problem State (user mode) to Supervisor State (kernel mode), and the kernel segment becomes visible. The transition to hypervisor mode can only be made from Supervisor State (i.e. kernel mode). Making a hypervisor call from user mode results in a permission denied error.

The HV bit in the MSR indicates if the processor is in hypervisor mode.

Purpose
The hypervisor is trusted code that allows a partition to manipulate memory that is outside the bounds of that allocated to the partition. The operating system must be modified for use in the LPAR environment to make use of hypervisor calls to maintain page frame tables and TCE tables that would normally be managed by the OS directly if it were running on a non-LPAR system. This means that the parts of the VMM used for page table management and device I/O mapping are aware of the fact that the operating system is running within a partition. The hypervisor routines first validate that the calling partition is permitted to access the requested memory before performing the requested action.
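The validation performed by a hypervisor routine can be sketched as two checks made before any work is done. This is a simplified model under stated assumptions: the return codes, the pmb_table structure and the function name are all invented for illustration, and the real firmware tables and call interface are not described in these notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMB_SIZE (256ULL << 20) /* 256MB */

#define HV_OK      0
#define HV_EPERM   (-1) /* caller was not in Supervisor State */
#define HV_EACCESS (-2) /* memory is not assigned to the caller */

enum cpu_state { PROBLEM_STATE, SUPERVISOR_STATE };

/* Which partition owns each PMB (hypothetical model of the PMB tables). */
struct pmb_table {
    const int *owner; /* owner[i] = partition ID, or a negative special value */
    size_t     count; /* number of PMBs in the system */
};

/* Decide whether `partition` may touch physical address `phys`. */
static int hv_check_access(const struct pmb_table *t, enum cpu_state state,
                           int partition, uint64_t phys)
{
    uint64_t pmb = phys / PMB_SIZE;

    assert(t != NULL);
    if (state != SUPERVISOR_STATE)
        return HV_EPERM;   /* user mode callers are rejected */
    if (pmb >= t->count || t->owner[pmb] != partition)
        return HV_EACCESS; /* not this partition's memory */
    return HV_OK;
}
```

Only when both checks pass would the routine go on to perform the requested update on behalf of the partition.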

Dividing Physical Memory

[Diagram: physical memory from physical address 0 upward. Labels: Hypervisor, Partition page tables, LPAR 1 (logical addresses 0 to 4.5, starting at physical address N, RMO = N), LPAR 2 (logical addresses 0 to 2, starting at physical address M, RMO = M), TCE space at the top]

Figure 6-16. Dividing Physical Memory

Notes:

Introduction
The diagram above shows a sample system that has two active partitions. The first PMB is allocated to the hypervisor, and the PMB at the top of physical memory is allocated for TCE space.

LPAR 1
LPAR 1 has 4.5GB of memory allocated to the partition. The partition needs to run AIX 5.1, so this means it has a real mode memory requirement of 1GB, which must be contiguous, and aligned on a 1GB address boundary. This means the first set of PMBs allocated to the partition must be contiguous for at least 1GB. The remaining 3.5GB allocated to the partition consists of 14 PMBs, which may or may not be contiguous.

A partition with 4.5GB of memory has a page table requirement of 72MB, which will be rounded up to 128MB. If this is the first partition to be activated, the page table will be placed in a PMB that is marked by the hypervisor for use as page table storage. The partition will be permitted to access the portions of TCE space that are used to map the I/O slots that are assigned to the partition.

LPAR 2
LPAR 2 has 2GB of memory assigned. It is running AIX 5.2, so it needs only 256MB of real mode memory. Since this is the same size as the PMB, it effectively means that a partition running AIX 5.2 can consist of the required number of PMBs to satisfy the requested memory amount. The allocated PMBs need not be contiguous; however, the system firmware will allocate them in a contiguous fashion where possible.

A partition with 2GB of memory (and an attribute of a maximum of 2GB) requires a page table of 32MB. This is already a power of 2, and so at partition activation time, the firmware allocates a page table of 32MB. It only allocates a new PMB for page tables if free space inside a PMB already being used for page tables cannot be found. In this example, there was one PMB allocated for page tables, and only 128MB was being used. This means the 32MB page table for LPAR 2 shares the same PMB as the 128MB page table for LPAR 1. LPAR 2 is permitted to access the portions of TCE space required for mapping the I/O slots assigned to the partition.

Typical example
The example shown in the diagram shows a situation where multiple partitions may have been activated and then terminated, resulting in the seemingly sparse allocation of PMBs. The algorithms used by the firmware to allocate PMBs try to make best use of those available, and are careful to avoid encroaching on a 16GB aligned 16GB contiguous group of PMBs if it can be avoided, since these are required for AIX 5.1 partitions that are 16GB or larger in size.
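The numbers in this example can be checked with a short sketch. The PMB count is straightforward arithmetic. The page table placement shown here is a deliberately simple "next aligned offset" scheme inside a single PMB, and is an assumption made for illustration, since the notes do not describe the actual firmware algorithm; all names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB       (1ULL << 20)
#define PMB_SIZE (256 * MB)

/* Number of 256MB PMBs needed to hold `bytes` of memory. */
static uint64_t pmbs_needed(uint64_t bytes)
{
    return (bytes + PMB_SIZE - 1) / PMB_SIZE;
}

/* One PMB set aside for page tables (hypothetical bookkeeping). */
struct pt_pmb {
    uint64_t used; /* bytes consumed from the start of the PMB */
};

/*
 * Place a page table of `size` bytes (a power of two) at the next offset
 * that is a multiple of its size. Returns the offset inside the PMB, or
 * -1 if the table does not fit and a new PMB would be needed.
 */
static int64_t pt_place(struct pt_pmb *pmb, uint64_t size)
{
    uint64_t off;

    assert(pmb != NULL);
    assert(size != 0 && (size & (size - 1)) == 0);
    off = (pmb->used + size - 1) & ~(size - 1); /* align up to size */
    if (off + size > PMB_SIZE)
        return -1;
    pmb->used = off + size;
    return (int64_t)off;
}
```

Under this scheme the 128MB table for LPAR 1 lands at offset 0 and the 32MB table for LPAR 2 at offset 128MB of the same PMB. The PMB counts also agree with the text: 18 PMBs for 4.5GB, of which 4 form the 1GB real mode region and 14 remain.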

Checkpoint

1) What processor features are required in a partitioned system?
2) Memory is allocated to partitions in units of __________MB.
3) All partitions have the same real mode memory requirements. True or False?
4) In a partitioned environment, a real address is the same as a physical address. True or False?
5) Any piece of code can make hypervisor calls. True or False?
6) Which physical addresses in the system can a partition access?

Figure 6-17. Checkpoint

Notes:

Introduction
Answer all of the questions above. We will review them as a group when everyone has finished.

Unit Summary

• Hardware and software (operating system) changes are required for LPAR
  - Can't run LPAR on just any system
  - Can't use just any OS inside a partition
• Resources (CPU, memory, IO slots) are allocated to partitions independently of one another
  - A partition can receive as much (or as little) of each resource as it needs
• Multiple partitions on a single machine imply changes to the addressing mechanism used by the operating system
  - Can't have all partitions using the same physical address range
• Hypervisor is special code called by the operating system that allows it to modify memory outside the partitions

Figure 6-18. Unit Summary

0. • Use kdb to identify the data structures representing an open file. and the LVM structures used by the kernel. VFS and LVM 7-1 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.3 Student Notebook Uempty Unit 7. 2003 Unit 7. LFS. you should be able to: • List the design objectives of the logical and virtual file systems. logical and physical volumes. LFS. 2001. VFS and LVM What This Unit Is About This unit describes the organization and operation of the logical and virtual file system. • Identify the data structures that make up the logical and virtual file systems. . • Given a file descriptor of a running process. What You Should Be Able to Do After completing this unit. Identify the kdb subcommands for displaying these structures. How You Will Check Your Progress Accountability: • Exercises using your lab system • Unit review References AIX Documentation: Kernel Extensions and Device Support Programming Concepts © Copyright IBM Corp. • Use kdb to identify the data structures representing a mounted file system. locate the file and the file system the descriptor represents.0.V2. • Identify the basic kernel structures for tracking LVM volume groups.

Unit Objectives

At the end of this lesson you should be able to:
• List the design objectives of the logical and virtual file systems.
• Identify the data structures that make up the logical and virtual file systems.
• Use kdb to identify the data structures representing an open file.
• Use kdb to identify the data structures representing a mounted file system.
• Given a file descriptor of a running process, locate the file and the file system the descriptor represents.
• Identify the basic kernel structures for tracking LVM volume groups, logical and physical volumes. Identify the kdb subcommands for displaying these structures.

Figure 7-1. Unit Objectives

What is the Purpose of LFS/VFS?

• Provide support for many different file system types simultaneously
• Allow for different types of file systems to be mounted together forming a single homogeneous view
• Provide a consistent user interface to all file type objects (regular files, special files, sockets...)
• Support the sharing of files over the network
• Provide an extensible framework allowing third party file system types to be added into AIX

Figure 7-2. What is the Purpose of LFS/VFS?

Notes:

Introduction
This unit covers the interface, services and data structures that are provided by the Logical File System (LFS) and the Virtual File System (VFS).

Supported file systems
Using the structure of the logical file system and the virtual file system, AIX 5L can support a number of different file system types that are transparent to application programs. These file systems reside below the LFS/VFS layer and operate relatively independently of each other. The following physical file system implementations are currently supported:
• Enhanced Journaled File System (JFS2)
• Journaled File System (JFS)
• Network File System (NFS)

• A CD-ROM file system, which supports ISO-9660, High Sierra and Rock Ridge formats

Extensible
The LFS/VFS interface also provides a relatively easy means by which third party file system types can be added without any changes to the LFS.

Kernel I/O Layers

[Diagram: layered stack, top to bottom: System call interface (read(), write()); Logical File System; Virtual File System; File System Implementation (JFS, JFS2); VMM fault handler (VMM); LVM device driver; Device]

Figure 7-3. Kernel I/O Layers

Notes:

Introduction
Several layers of the AIX kernel are involved in the support of file system I/O, as described in this section.

Hierarchy
Access to files and directories by a process is controlled by the various layers in the AIX 5L kernel, as illustrated above.

Layers
The layers involved in file I/O are described in this table:

Level                    Purpose
System call interface    A user application can access files using the standard interface of the read() and write() system calls.
Logical file system      The system call interface is supported in the LFS with a standard set of operations.
Virtual file system      The VFS defines a generic set of operations that can be performed on a file system.
File system              Different physical file systems can handle the request (JFS, JFS2, NFS). The file system type is invisible to the user.
VMM fault handler        Files are mapped to virtual memory. I/O to a file causes a page fault and is resolved by the VMM fault handler.
Device drivers           Device driver code to interface with the device. It is invoked by the page fault handler. The LVM is the device driver for JFS2 and JFS.

0. © Copyright IBM Corp. 2001. . The illustration is repeated throughout the unit highlighting the areas being discussed. close(). Major Data Structures BE0070XS4. such as open(). read() and write(). 2003 Unit 7. VFS and LVM 7-7 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.3 Student Notebook Uempty Major Data Structures u-block inode vnode gnode User File Descriptor Table System File Table vnodeops vfs gfs vmount vfsops Logical File System Virtual File System (Vnode-VFS Interface) File System Figure 7-4.0 Notes: Introduction This illustration shows the major data structures that will be discussed in this unit. LFS.0. The system calls implement services that are exported to users to provide a consistent user-mode programming interface that is independent of the underlying file system type.V2. Logical file system The LFS is the level of the file system at which users can request file operations by using system calls.

Virtual file system
The Virtual File System (VFS) defines a standard set of operations on an entire file system. Operations performed by a process on a file or file system are mapped through the VFS to the file system below. In this way, the process need not know the specifics of different file systems (such as JFS, JFS2, NFS or CD-ROM).

File system
Each file system type extension provides functions to perform operations on the file system and its files. Pointers to these functions are stored in the vfsops (file system operations) and vnodeops (file operations) structures.
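The idea of storing pointers to file system functions in an operations structure can be illustrated with a much-reduced sketch. The demo_ structures below are invented stand-ins with a single operation; they are not the real AIX vnodeops or vnode definitions, and the two "file systems" are toy examples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_vnode;

/* Hypothetical, much-reduced stand-in for a file operations table. */
struct demo_vnodeops {
    int (*op_read)(struct demo_vnode *vp, char *buf, size_t len);
};

/* Hypothetical stand-in for a vnode: it carries its operations table. */
struct demo_vnode {
    const struct demo_vnodeops *ops;
    const char *data; /* private data of the toy "memfs" file system */
};

/* Toy file system 1: the file contents are a string held in memory. */
static int memfs_read(struct demo_vnode *vp, char *buf, size_t len)
{
    size_t n = strlen(vp->data);

    if (n > len)
        n = len;
    memcpy(buf, vp->data, n);
    return (int)n;
}

/* Toy file system 2: every file is empty. */
static int nullfs_read(struct demo_vnode *vp, char *buf, size_t len)
{
    (void)vp; (void)buf; (void)len;
    return 0;
}

static const struct demo_vnodeops memfs_ops  = { memfs_read };
static const struct demo_vnodeops nullfs_ops = { nullfs_read };

/* Generic layer: callers never need to know which file system is used. */
static int demo_read(struct demo_vnode *vp, char *buf, size_t len)
{
    assert(vp != NULL && vp->ops != NULL && vp->ops->op_read != NULL);
    return vp->ops->op_read(vp, buf, len);
}
```

The caller always goes through demo_read(); which implementation runs depends only on the operations table attached to the object, which is why a new file system type can be added without changing the layer above.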

Logical File System Structures

Figure 7-5. Logical File System Structures (diagram: the u-block holds the User File Descriptor Table, which is process private; a descriptor such as the n returned by n=open("file"), or the 0 in read(0), indexes an entry whose fp points into the global System File Table; the f_data field of that entry points to a vnode, of which there is one per file)

Notes: Introduction
The user file descriptor table and the system file table are the key data bases used by the LFS. These memory structures and their relationship to vnodes are discussed in this section.

Structures in the LFS
The user file descriptor table (one per process) contains entries for each of the process' open files. The system open file table has entries for open files on the system. Each entry in the system file table points to a vnode in the virtual file system.

. device. System open file table The system file table is a global resource and is shared by all processes on the system. The index of the entry in the table is returned to open() as a file descriptor.Student Notebook User file descriptor table The user file descriptor table is private to a process and located in the processes u-area. If multiple processes have the same file open (or one process has opened the file several times) a separate entry exists in the table for each unique open. a vnode for that object is created. 2001. When a process opens a file. 2003 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. vnode The vnode provides the connection between the LFS and the VFS. or socket in the system. Each time an object is located. The vnode will be covered in more detail later. an entry is created in the user’s file descriptor table. One unique entry is allocated for each unique open of a file. It is the primary structure the kernel uses to reference files. 7-10 Kernel Internals © Copyright IBM Corp.

V2.0.h in the structure ufd. . There © Copyright IBM Corp. 2003 Unit 7. The index of the entry in the table is returned to open()as a file descriptor.3 Student Notebook Uempty User File Descriptor struct ufd { struct file * fp.0. an entry is created in the users file descriptor table. The file descriptor table can extend beyond the first page of the u-block. 2001. unsigned short flags. VFS and LVM 7-11 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. Table management One or more slots of the file descriptor table are used for each open file.0 Notes: Introduction The user file descriptor table is private to a process and located in the process’ u-area. #endif /* __64BIT_KERNEL */ }. Figure 7-6. LFS. and is pageable. Descriptor table definition The user file descriptor table consists of an array of user file descriptors as defined in /usr/include/sys/user. #ifdef __64BIT_KERNEL unsigned int reserved. unsigned short count. When a process opens a file. User File Descriptor BE0070XS4.

h). 2001. 7-12 Kernel Internals © Copyright IBM Corp.Student Notebook is a fixed upper limit of 65534 open file descriptors per process (defined as OPEN_MAX in /usr/include/sys/limits. This value is fixed. . 2003 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. and may not be changed.
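The descriptor table behaviour described above can be sketched in a few lines of user-space C. This is a minimal model, not the kernel's code: the names sketch_ufd, sketch_ublock, sketch_fd_alloc and sketch_fd_lookup are invented for illustration, the table is shrunk to eight slots, and the sketch assumes the lowest free slot is the one handed out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NOFILE 8   /* tiny table for the sketch; the real limit is OPEN_MAX */

struct sketch_file { int f_count; };   /* stand-in for struct file */

struct sketch_ufd {                    /* simplified from struct ufd */
    struct sketch_file *fp;            /* NULL means the slot is free */
    unsigned short flags;
};

struct sketch_ublock {                 /* stand-in for the per-process u-block */
    struct sketch_ufd u_ufd[SKETCH_NOFILE];
};

/* Claim the lowest free slot; its index is the file descriptor. -1 if full. */
int sketch_fd_alloc(struct sketch_ublock *u, struct sketch_file *fp)
{
    assert(u != NULL && fp != NULL);
    for (int fd = 0; fd < SKETCH_NOFILE; fd++) {
        if (u->u_ufd[fd].fp == NULL) {
            u->u_ufd[fd].fp = fp;
            u->u_ufd[fd].flags = 0;
            return fd;
        }
    }
    return -1;
}

/* Translate a descriptor back to its file table entry, or NULL if invalid. */
struct sketch_file *sketch_fd_lookup(const struct sketch_ublock *u, int fd)
{
    if (fd < 0 || fd >= SKETCH_NOFILE)
        return NULL;
    return u->u_ufd[fd].fp;
}
```

A read(fd, ...) call would begin with the equivalent of sketch_fd_lookup() to reach the system file table entry before going any further down the layers.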

The File Structure

    struct file {
        long     f_flag;        /* see fcntl.h */
        int      f_count;       /* reference count */
        short    f_options;     /* file flags not passed through vnode layer */
        short    f_type;        /* descriptor type */
        union {
            struct vnode *f_uvnode;     /* pointer to vnode structure */
            struct file  *f_unext;      /* next entry in freelist */
        } f_up;
        offset_t f_offset;      /* read/write character pointer */
        off_t    f_dir_off;     /* BSD style directory offsets */
        union {
            struct ucred *f_cpcred;     /* process credentials at open() */
            struct file  *f_cpqmnext;   /* next quick move chunk on free list*/
        } f_cp;
        Simple_lock f_lock;         /* file structure fields lock */
        Simple_lock f_offset_lock;  /* file structure offset field lock */
        caddr_t  f_vinfo;       /* any info vfs needs */
        struct fileops {
            . . .
    #else
            int (*fo_rw)();
            int (*fo_ioctl)();
            int (*fo_select)();
            int (*fo_close)();
            int (*fo_fstat)();
    #endif /* __64BIT_KERNEL || __FULL_PROTO */
        } *f_ops;
    };

Figure 7-7. The file Structure

Notes: Introduction
The system file table is a global resource and is shared by all processes on the system. One entry is allocated for each unique open of a file, device, or socket in the system.

Structure definition
The file structure is described in /usr/include/sys/file.h. In the visual above the fileops definitions for __FULL_PROTO have been omitted for clarity.

Table management
The system file table is a large array of file structures. The array is partly initialized. It grows on demand and is never shrunk. Once entries are freed, they are added back onto the free list. The head of the free list is pointed to by ffreelist. The table can contain a maximum of 1,000,000 entries and is not configurable.

Table entries
The file table array consists of struct file data elements. Several of the key members of this data structure are described in this table:

Member    Description
f_count   A reference count field detailing the current number of opens on the file. This value is increased each time the file is opened, and decremented on each close(). Once the reference count is zero, the slot is considered free, and may be re-used.
f_flag    Various flags described in fcntl.h
f_type    A type field describing the type of file:
              /* f_type values */
              #define DTYPE_VNODE   1   /* file */
              #define DTYPE_SOCKET  2   /* communications endpoint */
              #define DTYPE_GNODE   3   /* device */
              #define DTYPE_OTHER  -1   /* unknown */
f_offset  A read/write pointer.
f_data    Defined as f_up.f_uvnode, it is a pointer to another data structure representing the object (typically the vnode structure).
f_ops     A pointer to a structure containing pointers to functions for the following file operations: rw (read/write), ioctl, select, close and fstat.
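The interplay between the reference count and the free list can be modelled as follows. This is a sketch under simplifying assumptions: the table is a fixed four-entry array rather than one that grows on demand, there is no locking, and every sketch_ name is hypothetical. Only the idea of overlaying the vnode pointer and the free-list link in one union (f_up) is taken from the structure shown above.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NFILE 4

struct sketch_vnode { int v_count; };

struct sketch_file {
    int f_count;                        /* 0 means the slot is free */
    union {
        struct sketch_vnode *f_uvnode;  /* the object, while the slot is in use */
        struct sketch_file  *f_unext;   /* next free slot, while on the free list */
    } f_up;
};

static struct sketch_file sketch_table[SKETCH_NFILE];
static struct sketch_file *sketch_ffreelist;   /* head of the free list */

/* Thread every slot onto the free list, lowest slot first. */
void sketch_file_init(void)
{
    sketch_ffreelist = NULL;
    for (int i = SKETCH_NFILE - 1; i >= 0; i--) {
        sketch_table[i].f_count = 0;
        sketch_table[i].f_up.f_unext = sketch_ffreelist;
        sketch_ffreelist = &sketch_table[i];
    }
}

/* A unique open takes a slot off the free list and sets the count to 1. */
struct sketch_file *sketch_file_open(struct sketch_vnode *vp)
{
    struct sketch_file *fp = sketch_ffreelist;

    if (fp == NULL)
        return NULL;                    /* table exhausted in this sketch */
    sketch_ffreelist = fp->f_up.f_unext;
    fp->f_count = 1;
    fp->f_up.f_uvnode = vp;
    return fp;
}

/* Each close drops one reference; the last one returns the slot to the list. */
void sketch_file_close(struct sketch_file *fp)
{
    assert(fp != NULL && fp->f_count > 0);
    if (--fp->f_count == 0) {
        fp->f_up.f_unext = sketch_ffreelist;
        sketch_ffreelist = fp;
    }
}
```

Two calls to sketch_file_open() on the same vnode yield two distinct entries, which mirrors the statement that each unique open gets its own system file table entry.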

vnode/vfs Interface

Figure 7-8. vnode/vfs Interface (the Major Data Structures diagram repeated, highlighting the vnode/vfs interface)

Notes: Introduction
The interface between the logical file system and the underlying file system implementations is referred to as the vnode/vfs interface. This interface provides a logical boundary between generic objects understood at the LFS layer, and the file system specific objects that the underlying file system implementation must manage.

Data structures
vnodes and vfs structures are the primary data structures used to communicate through the interface (with help from vmount).

Description
Descriptions of the vnode, vfs and vmount structures are given in this table:

Part    Function
vnodes  Represents a single file or directory
vfs     Represents a mounted file system
vmount  Contains specifics of the mount request


vnode

    struct vnode {
        ushort        v_flag;
        ulong32int64  v_count;      /* the use count of this vnode */
        int           v_vfsgen;     /* generation number for the vfs */
        Simple_lock   v_lock;       /* lock on the structure */
        struct vfs    *v_vfsp;      /* pointer to the vfs of this vnode */
        struct vfs    *v_mvfsp;     /* pointer to vfs which was mounted over */
                                    /* this vnode; NULL if not mounted */
        struct gnode  *v_gnode;     /* ptr to implementation gnode */
        struct vnode  *v_next;      /* ptr to other vnodes that share same gnode */
        struct vnode  *v_vfsnext;   /* ptr to next vnode on list off of vfs */
        struct vnode  *v_vfsprev;   /* ptr to prev vnode on list off of vfs */
        union v_data {
            void *         _v_socket;    /* vnode associated data */
            struct vnode * _v_pfsvnode;  /* vnode in pfs for spec */
        } _v_data;
        char *        v_audit;      /* ptr to audit object */
    };

Figure 7-9. vnode

Notes: Introduction
A vnode represents an active file or directory in the kernel. Each time a file is located, a vnode for that object is located or created. Several vnodes may be created as a result of path resolution.

Structure definition
The vnode structure is defined in /usr/include/sys/vnode.h.

vnode management
vnodes are created by the vfs-specific code when needed, using the vn_get kernel service. vnodes are deleted with the vn_free kernel service. vnodes are created as the result of a path resolution.


Detail
Each time an object (file) within a file system is located (even if it is not opened), a vnode for that object is located (if already in existence), or created, as are the vnodes for any directory that has to be searched to resolve the path to the object. As a file is created, a vnode is also created, and will be re-used for every subsequent reference made to the file by a path name. Every path name known to the logical file system can be associated with, at most, one file system object, and each file system object can have several names because it can be mounted in different locations. Symbolic links and hard links to an object always get the same vnode if accessed through the same mount point.
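The "located (if already in existence), or created" rule can be sketched as a lookup over the vnodes chained from a vfs. This is a simplified user-space model, not the vn_get kernel service: sketch_vnode_get and the obj_id key (standing in for something like an inode number) are assumptions made for illustration, and locking and error handling are left out.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_vnode {
    int   v_count;                   /* use count of this vnode */
    long  obj_id;                    /* hypothetical key, e.g. an inode number */
    struct sketch_vnode *v_vfsnext;  /* next vnode on this vfs's list */
};

struct sketch_vfs {
    struct sketch_vnode *vfs_vnodes; /* all vnodes in this vfs */
};

/* Return the existing vnode for obj_id on this vfs, or create one. */
struct sketch_vnode *sketch_vnode_get(struct sketch_vfs *vfsp, long obj_id)
{
    struct sketch_vnode *vp;

    assert(vfsp != NULL);
    for (vp = vfsp->vfs_vnodes; vp != NULL; vp = vp->v_vfsnext) {
        if (vp->obj_id == obj_id) {
            vp->v_count++;           /* located: reuse it */
            return vp;
        }
    }
    vp = calloc(1, sizeof(*vp));     /* not found: create it */
    if (vp == NULL)
        return NULL;
    vp->v_count = 1;
    vp->obj_id = obj_id;
    vp->v_vfsnext = vfsp->vfs_vnodes;
    vfsp->vfs_vnodes = vp;
    return vp;
}
```

Because the search is per vfs, two names that reach the same object through the same mount share one vnode, while the same object reached through a different mount gets a separate one, which is the behaviour the Detail text describes.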


vfs

    struct vfs {
        struct vfs     *vfs_next;      /* vfs's are a linked list */
        struct gfs     *vfs_gfs;       /* ptr to gfs of vfs */
        struct vnode   *vfs_mntd;      /* pointer to mounted vnode, */
                                       /* the root of this vfs */
        struct vnode   *vfs_mntdover;  /* pointer to mounted-over */
                                       /* vnode */
        struct vnode   *vfs_vnodes;    /* all vnodes in this vfs */
        int            vfs_count;      /* number of users of this vfs */
        caddr_t        vfs_data;       /* private data area pointer */
        unsigned int   vfs_number;     /* serial number to help distinguish between */
                                       /* different mounts of the same object */
        int            vfs_bsize;      /* native block size */
    #ifdef _SUN
        short          vfs_exflags;    /* for SUN, exported fs flags */
        unsigned short vfs_exroot;     /* for SUN, " fs uid 0 mapping */
    #else
        short          vfs_rsvd1;      /* Reserved */
        unsigned short vfs_rsvd2;      /* Reserved */
    #endif /* _SUN */
        struct vmount  *vfs_mdata;     /* record of mount arguments */
        Simple_lock    vfs_lock;       /* lock to serialize vnode list */
    };

Figure 7-10. vfs

Notes: Introduction
There is one vfs structure for each file system currently mounted. The vfs structure connects the vnodes with the vmount information, and with the gfs structure that helps define the operations that can be performed on the file system and its files.

Structure definition
The vfs structure is defined in /usr/include/sys/vfs.h.

Key elements
Several key elements of the vfs structure are described in this table:

Element        Description
*vfs_next      The next mounted file system.
vfs_mntd       The vfs_mntd pointer points to the vnode within the file system which generally represents the root directory of the file system.
vfs_mntdover   The vfs_mntdover pointer points to a vnode within another file system, usually representing a directory, which indicates where the file system is mounted.
vfs_vnodes     The pointer to all vnodes for this file system.
*vfs_gfs       The path back to the gfs structure and its file system specific subroutines through the vfs_gfs pointer.
vfs_mdata      The pointer to vmount providing mount information for this file system.


root (/) and usr File Systems

Figure 7-11. root (/) and usr File Systems (diagram: rootvfs points to the vfs for the root file system, and vfs_next links it to the vfs for the usr file system; each vfs has vfs_mntd and vfs_mntdover pointers to vnodes; each vnode has a v_vfsp pointer back to its vfs; the vnode for /usr in the root file system has a v_mvfsp pointer to the usr vfs. The numbers 1 to 4 mark the pointers described in the table that follows.)

Notes: Relationship between vfs and vnodes
This illustration shows the relationship between the vfs and vnode objects for mounted file systems. This example shows the root (/) and usr file systems.


Description
The numbered items in the table match the numbers in the illustration.

Item  Description
1.    The global address rootvfs points to the vfs for the root file system.
2.    The vfs_next pointers create a linked list of mounted file systems.
3.    The vfs_mntd points to the vnode representing the root of the file system.
4.    The vfs_mntdover points to the vnode of the directory the file system is mounted over.
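The four pointers in the table can be exercised with a small model of the root and usr file systems. The structures below keep only the fields named in this unit; the functions sketch_count_mounts and sketch_cross_mount are hypothetical helpers written for this sketch, not kernel routines.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_vfs;

struct sketch_vnode {
    struct sketch_vfs *v_vfsp;    /* vfs this vnode belongs to */
    struct sketch_vfs *v_mvfsp;   /* vfs mounted over this vnode, or NULL */
};

struct sketch_vfs {
    struct sketch_vfs   *vfs_next;      /* next mounted file system */
    struct sketch_vnode *vfs_mntd;      /* root vnode of this file system */
    struct sketch_vnode *vfs_mntdover;  /* vnode this file system is mounted over */
};

/* Items 1 and 2: start at rootvfs and follow vfs_next to count the mounts. */
int sketch_count_mounts(const struct sketch_vfs *rootvfs)
{
    int n = 0;

    for (const struct sketch_vfs *v = rootvfs; v != NULL; v = v->vfs_next)
        n++;
    return n;
}

/* Items 3 and 4: if a file system is mounted over vp, step to its root vnode. */
struct sketch_vnode *sketch_cross_mount(struct sketch_vnode *vp)
{
    assert(vp != NULL);
    while (vp->v_mvfsp != NULL)
        vp = vp->v_mvfsp->vfs_mntd;
    return vp;
}
```

During path resolution of a name such as /usr/bin, a step like sketch_cross_mount() is what would move the search from the /usr directory vnode in the root file system to the root vnode of the usr file system.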


vmount

    struct vmount {
        uint    vmt_revision;   /* I revision level, currently 1 */
        uint    vmt_length;     /* I total length of structure & data */
        fsid_t  vmt_fsid;       /* O id of file system */
        int     vmt_vfsnumber;  /* O unique mount id of file system */
        uint    vmt_time;       /* O time of mount */
        uint    vmt_timepad;    /* O (in future, time is 2 longs) */
        int     vmt_flags;      /* I general mount flags */
                                /* O MNT_REMOTE is output only */
        int     vmt_gfstype;    /* I type of gfs, see MNT_XXX above */
        struct vmt_data {
            short vmt_off;      /* I offset of data, word aligned */
            short vmt_size;     /* I actual size of data in bytes */
        } vmt_data[VMT_LASTINDEX + 1];
    };

Figure 7-12. vmount

Notes: Introduction
The vmount structure contains specifics of the mount request. The vfs and vmount are created as pairs and linked together.

Structure definition
The vmount structure is defined in /usr/include/sys/vmount.h.
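The vmt_data array suggests a fixed header followed by variable-length items, each located by an offset and a size. The sketch below models that layout in portable C. It is an illustration only: the sk_ names and slot indexes are invented, the offsets are assumed to be measured from the start of the structure, and the strings used in any example call are arbitrary sample values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_VMT_OBJECT    0   /* hypothetical slot indexes for this sketch */
#define SK_VMT_STUB      1
#define SK_VMT_LASTINDEX 1

struct sk_vmount {
    unsigned int vmt_length;        /* total length of structure and data */
    struct {
        short vmt_off;              /* offset of data, word aligned */
        short vmt_size;             /* actual size of data in bytes */
    } vmt_data[SK_VMT_LASTINDEX + 1];
};

/* The header plus room for a little trailing data. */
union sk_vmount_buf {
    struct sk_vmount vmt;
    char bytes[sizeof(struct sk_vmount) + 64];
};

void sk_vmt_init(union sk_vmount_buf *b)
{
    memset(b, 0, sizeof(*b));
    b->vmt.vmt_length = sizeof(struct sk_vmount);
}

/* Append a string after the current end and record where it went. */
int sk_vmt_put(union sk_vmount_buf *b, int idx, const char *s)
{
    size_t size = strlen(s) + 1;
    size_t off = ((size_t)b->vmt.vmt_length + 3u) & ~(size_t)3;  /* word align */

    assert(idx >= 0 && idx <= SK_VMT_LASTINDEX);
    if (off + size > sizeof(b->bytes))
        return -1;
    memcpy(b->bytes + off, s, size);
    b->vmt.vmt_data[idx].vmt_off = (short)off;
    b->vmt.vmt_data[idx].vmt_size = (short)size;
    b->vmt.vmt_length = (unsigned int)(off + size);
    return 0;
}

/* Locate a data item from its recorded offset. */
const char *sk_vmt_get(const union sk_vmount_buf *b, int idx)
{
    return b->bytes + b->vmt.vmt_data[idx].vmt_off;
}
```

Keeping offsets rather than pointers means the whole block can be copied between user and kernel space as one piece, which is a plausible reason for a design like this one.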


vfs management
The mount helper creates the vmount structure and calls the vmount subroutine. The vmount subroutine then creates the vfs structure, partially populates it, and invokes the file system dependent vfs_mount subroutine, which completes the vfs structure and performs any operations required internally by the particular file system implementation.

File and File System Operations

Figure 7-13. File and File System Operations (the Major Data Structures diagram repeated, highlighting the gfs, vnodeops and vfsops structures)

Notes: Introduction
Each file system type extension provides functions to perform operations on the file system and its files. Pointers to these functions are stored in the vfsops (file system operations) and vnodeops (file operations) structures.

Data structures
For each file system type installed, one group of the three data structures highlighted above (gfs, vnodeops and vfsops) will be created.

Structure descriptions
Descriptions of gfs, vnodeops, and vfsops are given in this table:

Part      Function
gfs       Holds pointers to the vnodeops and the vfsops structures
vnodeops  Contains pointers to file system dependent operations on files (open, close, read, write, etc.)
vfsops    Contains pointers to file system dependent operations on the file system (mount, umount, etc.)

gfs

Figure 7-14. gfs (diagram: the vfs_gfs field of a vfs points to the gfs; the gn_ops field of the gfs points to vnodeops and its gfs_ops field points to vfsops)

Notes: Introduction
The gfs structure holds the pointers to the vnodeops and the vfsops structures.

Structure definition
The gfs structure is defined in /usr/include/sys/gfs.h:

    struct gfs {
        struct vfsops   *gfs_ops;
        struct vnodeops *gn_ops;
        int      gfs_type;        /* type of gfs (from vmount.h) */
        char     gfs_name[16];    /* name of vfs (eg. "jfs","nfs")*/
        int      (*gfs_init)();   /* ( gfsp ) - if ! NULL, */
                                  /* called once to init gfs */
        int      gfs_flags;       /* flags for gfs capabilities */
        caddr_t  gfs_data;        /* gfs private config data*/
        int      (*gfs_rinit)();
        int      gfs_hold;        /* count of mounts */
    };

gfs management
The gfs structures are stored within a global array accessible only by the kernel. The gfs entries are inserted with the gfsadd() kernel service, and only one gfs entry of a given gfs_type can be inserted into the array. Generally, gfs entries are added by the CFG_INIT section of the configuration code of the file system kernel extension. The gfs entries are removed with the gfsdel() kernel service. This is usually done within the CFG_TERM section of the configuration code of the file system kernel extension.

vnodeops

Figure 7-15. vnodeops (diagram: vfs_gfs leads from the vfs to the gfs, and gn_ops leads from the gfs to vnodeops, which holds entries such as vn_link(), vn_mkdir(), vn_open(), vn_close(), vn_remove(), vn_rmdir() and vn_lookup())

Notes: vnodeops
The vnodeops structure contains pointers to the file system dependent operations that can be performed on the vnode, such as link, mkdir, mknod, open, close and remove.

Structure definition
The vnodeops structure is defined in /usr/include/sys/vnode.h. Due to the size of this structure, only a few lines are detailed below:

    struct vnodeops {
        /* creation/naming/deletion */
        int (*vn_link)(struct vnode *, struct vnode *, char *, struct ucred *);
        int (*vn_mkdir)(struct vnode *, char *, int32long64_t, struct ucred *);
        int (*vn_mknod)(struct vnode *, caddr_t, int32long64_t, dev_t,
                        struct ucred *);
        int (*vn_remove)(struct vnode *, struct vnode *, char *, struct ucred *);
        int (*vn_rename)(struct vnode *, struct vnode *, caddr_t,
                         struct vnode *, struct vnode *, caddr_t, struct ucred *);
        . . .
    };

vfsops

Figure 7-16. vfsops (diagram: vfs_gfs leads from the vfs to the gfs, and gfs_ops leads from the gfs to vfsops, which holds entries such as vfs_mount(), vfs_unmount(), vfs_root(), vfs_sync(), vfs_vget(), vfs_cntl() and vfs_quotactl())

Notes: vfsops
The vfsops structure contains pointers to the file system dependent operations that can be performed on the vfs, such as mount, unmount, or sync.

Structure definition
The vfsops structure is defined in /usr/include/sys/vfs.h:

    struct vfsops {
        /* mount a file system */
        int (*vfs_mount)(struct vfs *, struct ucred *);
        /* unmount a file system */
        int (*vfs_unmount)(struct vfs *, int, struct ucred *);
        /* get the root vnode of a file system */
        int (*vfs_root)(struct vfs *, struct vnode **, struct ucred *);
        /* get file system information */
        int (*vfs_statfs)(struct vfs *, struct statfs *, struct ucred *);
        /* sync all file systems of this type */
        int (*vfs_sync)();
        /* get a vnode matching a file id */
        int (*vfs_vget)(struct vfs *, struct vnode **, struct fileid *,
                        struct ucred *);
        /* do specified command to file system */
        int (*vfs_cntl)(struct vfs *, int, caddr_t, size_t, struct ucred *);
        /* manage file system quotas */
        int (*vfs_quotactl)(struct vfs *, int, uid_t, caddr_t, struct ucred *);
    };
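The path from a vfs through its gfs to the vfsops function pointers is an ordinary dispatch table, and can be sketched as follows. The sk_ structures are cut down to the fields needed, argument lists are reduced to a single vfs pointer, and "demofs" with its demo_mount and demo_unmount routines is a made-up file system type. Counting mounts inside the demo routines is a shortcut so that the effect of each call is observable; it is not a claim about where the real gfs_hold is maintained.

```c
#include <assert.h>
#include <stddef.h>

struct sk_vfs;

struct sk_vfsops {
    int (*vfs_mount)(struct sk_vfs *);
    int (*vfs_unmount)(struct sk_vfs *);
};

struct sk_gfs {
    struct sk_vfsops *gfs_ops;   /* file system operations for this type */
    const char *gfs_name;
    int gfs_hold;                /* count of mounts */
};

struct sk_vfs {
    struct sk_gfs *vfs_gfs;      /* path back to the gfs */
};

/* The hypothetical file system type supplies its own implementations. */
static int demo_mount(struct sk_vfs *vfsp)   { vfsp->vfs_gfs->gfs_hold++; return 0; }
static int demo_unmount(struct sk_vfs *vfsp) { vfsp->vfs_gfs->gfs_hold--; return 0; }

static struct sk_vfsops demo_vfsops = { demo_mount, demo_unmount };
static struct sk_gfs demo_gfs = { &demo_vfsops, "demofs", 0 };

/* Generic code never names the file system type; it follows the pointers. */
int sk_do_mount(struct sk_vfs *vfsp)
{
    assert(vfsp != NULL && vfsp->vfs_gfs != NULL);
    return vfsp->vfs_gfs->gfs_ops->vfs_mount(vfsp);
}

int sk_do_unmount(struct sk_vfs *vfsp)
{
    assert(vfsp != NULL && vfsp->vfs_gfs != NULL);
    return vfsp->vfs_gfs->gfs_ops->vfs_unmount(vfsp);
}
```

The same pattern applies to vnodeops through gn_ops: adding a new file system type means registering a new table of functions, and the callers stay unchanged.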

gnode

Figure 7-17. gnode (diagram: three vnodes, each with a v_gnode pointer to a gnode; the gnodes are contained in an in-core inode, a specnode and an rnode respectively)

Notes: Introduction
gnodes are generic objects pointed to by vnodes but may be contained in different structures depending on the file system type.

Location
The gnode is contained in an in-core inode for a file on a local file system. Special files (such as /dev/tty) have gnodes contained in specnodes. NFS files have gnodes contained within rnodes.

Structure definition
The gnode structure is defined in /usr/include/sys/vnode.h:

    struct gnode {
        enum vtype   gn_type;     /* type of object: VDIR,VREG etc */
        short        gn_flags;    /* attributes of object */
        ulong        gn_seg;      /* segment into which file is mapped */
        long32int64  gn_mwrcnt;   /* count of map for write */
        long32int64  gn_mrdcnt;   /* count of map for read */
        long32int64  gn_rdcnt;    /* total opens for read */
        long32int64  gn_wrcnt;    /* total opens for write */
        long32int64  gn_excnt;    /* total opens for exec */
        long32int64  gn_rshcnt;   /* total opens for read share */
        struct vnodeops *gn_ops;
        struct vnode *gn_vnode;   /* ptr to list of vnodes per this gnode*/
        dev_t        gn_rdev;     /* for devices, their "dev_t" */
        chan_t       gn_chan;     /* for devices, their "chan", minor's minor*/
        Simple_lock  gn_reclk_lock;   /* lock for filocks list */
        int          gn_reclk_event;  /* event list for file locking */
        struct filock *gn_filocks;    /* locked region list */
        caddr_t      gn_data;     /* ptr to private data (usually contiguous) */
    };

Key elements
Some of the key elements of the gnode are described below:

Element  Description
gn_type  Identifies the type of object represented by the gnode. Some examples are directory, character, and block.
gn_ops   Identifies the set of operations that can be performed on the object.
gn_seg   Segment number to which the file is mapped.
gn_data  Pointer to private data. Points to the start of the inode in which the gnode is embedded.

Detail
Each file system implementation is responsible for allocating and destroying gnodes. Calls to the file system implementation serve as requests to perform an operation on a specific gnode. A gnode is needed, in addition to the file system inode, because some file system implementations may not include the concept of an inode. Thus the gnode structure substitutes for whatever structure the file system implementation may have used to uniquely identify a file system object. gnodes are created as needed by file system specific code, at the same time as implementation specific structures are created. This is normally immediately followed by a call to the vn_get kernel service to create a matching vnode. The gnode structure is usually deleted either when the file it refers to is deleted, or when the implementation specific structure is being reused for another file.
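The embedding of a gnode inside a file system's own structure, with gn_data pointing back to the start of that structure, can be illustrated with a short sketch. The sk_inode layout and the helper names sk_inode_attach and sk_vtoi are hypothetical; a real in-core inode carries far more state.

```c
#include <assert.h>
#include <stddef.h>

struct sk_gnode {
    int   gn_type;               /* type of object */
    void *gn_data;               /* ptr to private data: here, the containing inode */
};

struct sk_vnode {
    struct sk_gnode *v_gnode;    /* ptr to implementation gnode */
};

struct sk_inode {                /* hypothetical in-core inode */
    long i_number;
    struct sk_gnode i_gnode;     /* gnode embedded in the inode */
};

/* Wire up an inode, its embedded gnode and a vnode that refers to it. */
void sk_inode_attach(struct sk_inode *ip, struct sk_vnode *vp, long ino)
{
    assert(ip != NULL && vp != NULL);
    ip->i_number = ino;
    ip->i_gnode.gn_type = 0;
    ip->i_gnode.gn_data = ip;    /* points to the start of the inode */
    vp->v_gnode = &ip->i_gnode;
}

/* File system code recovers its own structure from the generic vnode. */
struct sk_inode *sk_vtoi(const struct sk_vnode *vp)
{
    return (struct sk_inode *)vp->v_gnode->gn_data;
}
```

The generic layers only ever see the vnode and gnode; an NFS-style implementation could embed the same sk_gnode in a different container and supply its own equivalent of sk_vtoi().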

kdb devsw Subcommand Output

    (0)> devsw 0xa
    Slot address 30057280
    MAJOR: 00A
       open:     0207DC40
       close:    0207D694
       read:     0207CDC0
       write:    0207CCF4
       ioctl:    0207B4DC
       strategy: 02095914
       ttys:     00000000
       select:   .nodev (000E12EC)
       config:   020795E8
       print:    .nodev (000E12EC)
       dump:     020A7530
       mpx:      .nodev (000E12EC)
       revoke:   .nodev (000E12EC)
       dsdptr:   310E3000
       selptr:   00000000
       opts:     0000002A  DEV_DEFINED DEV_MPSAFE
    (0)>

Figure 7-18. kdb devsw Subcommand Output

Notes: Introduction
The file systems discussed earlier in this unit are contained within Logical Volume Manager (LVM) Logical Volumes. The data defining LVM entities (including Volume Groups, Logical Volumes and Physical Volumes) is maintained both on disks and in the ODM. This architecture is discussed in other classes. Here we would like to introduce three kernel structures which maintain LVM data, and the kdb commands that display these structures. The structures are volgrp, lvol and pvol (defined in src/bos/kernel/sys/dasd.h, which is not distributed with the AIX product). The kdb subcommands to display these structures have corresponding names: volgrp, lvol and pvol. In the visuals in this section, the structure definitions are illustrated with example output from the kdb subcommands and the corresponding AIX commands. All definitions are from src/bos/kernel/sys/dasd.h unless otherwise noted.

volgrp structure
The administrative unit of LVM is a volume group. The kernel describes this in the volgrp structure. Portions of the structure definition follow:

    struct volgrp {
        Simple_lock      vg_lock;     /* lock for all vg structures */
        struct unique_id vg_id;       /* volume group id */
        int              major_num;   /* major number of volume group */
        . . .
        short            open_count;  /* count of open logical volumes */
        . . .
        struct lvol      *lvols[NEW_MAXLVS];  /* logical volume struct array*/
        struct pvol      *pvols[NEW_MAXPVS];  /* physical volume struct array */
        . . .
        struct volgrp    *nextvg;     /* pointer to next volgrp structure */
        . . .
    };

The items in bold are defined as:
- vg_id This is the 32-character volume group id.
- open_count This is the count of active logical volumes in this volume group.
- *lvols[NEW_MAXLVS] points to the array of lvol structures for this volume group. The array is indexed by logical volume minor number.
- *pvols[NEW_MAXPVS] points to the array of pvol structures for this volume group. The array is indexed by physical volume minor number.
- *nextvg This is the volgrp linked list item. A value of zero means this is the last or only volume group.

volgrp kdb subcommand
volgrp addresses are registered in the devsw table. This table is displayed with the kdb subcommand, devsw. At this point we introduce it only to obtain a volgrp address. We use the devsw subcommand with a single parameter 0xa, the major number of rootvg on this system. In the command output, the dsdptr: field is the address of rootvg's volgrp structure.

kdb volgrp Subcommand Output

    (0)> volgrp 310e3000
    VOLGRP............. 310E3000
    . . .
    lvols............... @ 310E302C
    pvols............... @ 310E382C
    major_num............. 0000000A
    vg_id................. 0001D2CA00004C00000000F11C1697A0
    nextvg................ 00000000
    opn_pin............. @ 310E3A2C
    . . .
    sa_hld_lst............ 00000000
    vgsa_ptr.............. 31107000
    config_wait........... FFFFFFFF
    sa_lbuf............. @ 310E3B10
    sa_pbuf............. @ 310E3B68
    . . .
    LVOL[007]....... 31108180
    work_Q.......... 3110BE00
    lv_status....... 00000002
    lv_options...... 00001000
    nparts.......... 00000001
    i_sched......... 00000000
    nblocks......... 00010000
    parts[0]........ 31108300
      pvol@ 310E4600 dev 00190000 start 00DE1100
    parts[1]........ 00000000
    parts[2]........ 00000000
    . . .
    LVOL[009]....... 31108380
    . . .

Figure 7-19. kdb volgrp Subcommand Output

Notes: Output of volgrp command
The visual above shows partial output of the kdb subcommand, volgrp, using the address just obtained with devsw. The volgrp subcommand formats volgrp structure data in a helpful way. The pointer values for pvol and lvol arrays are provided (“pvols” and “lvols”), but in addition the subcommand formats each lvols array entry. So we see an “LVOL” entry for each logical volume in our rootvg. In the example above there were 10 entries for lvols array data. We have shown only the entry for minor device number 7 which is the entry for hd3, the /tmp file system. We will examine this logical volume with other commands, and describe the bold items at that time.


Other items above in bold give:
- major_num = 0xA means this is the rootvg volume group.
- The vg_id value is rootvg's volume group id.
- *nextvg=0 means this volume group is the last or only one on the volgrp linked list.
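The role of nextvg as a list link can be sketched with a reduced structure. This is a model only: sk_volgrp keeps three of the fields discussed in this unit, sk_find_vg is a hypothetical helper, and any second volume group used when trying it out is invented, since the system shown here has rootvg alone on the list.

```c
#include <assert.h>
#include <stddef.h>

struct sk_volgrp {
    int major_num;               /* major number of volume group */
    short open_count;            /* count of open logical volumes */
    struct sk_volgrp *nextvg;    /* next volgrp, or NULL at the end of the list */
};

/* Follow nextvg from a starting volgrp until the major number matches. */
const struct sk_volgrp *sk_find_vg(const struct sk_volgrp *vg, int major_num)
{
    assert(major_num >= 0);
    for (; vg != NULL; vg = vg->nextvg) {
        if (vg->major_num == major_num)
            return vg;
    }
    return NULL;
}
```

A nextvg of zero, as in the output above, makes the loop stop after the first structure.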


AIX lsvg Command Output

    # lsvg rootvg
    VOLUME GROUP:   rootvg                   VG IDENTIFIER:  0001d2ca00004c00000000f11c1697a0
    VG STATE:       active                   PP SIZE:        32 megabyte(s)
    VG PERMISSION:  read/write               TOTAL PPs:      542 (17344 megabytes)
    MAX LVs:        256                      FREE PPs:       497 (15904 megabytes)
    LVs:            9                        USED PPs:       45 (1440 megabytes)
    OPEN LVs:       8                        QUORUM:         2
    TOTAL PVs:      1                        VG DESCRIPTORS: 2
    STALE PVs:      0                        STALE PPs:      0
    ACTIVE PVs:     1                        AUTO ON:        yes
    MAX PPs per PV: 1016                     MAX PVs:        32
    LTG size:       128 kilobyte(s)          AUTO SYNC:      no
    HOT SPARE:      no                       BB POLICY:      relocatable
    #

Figure 7-20. AIX lsvg Command Output

Notes: AIX lsvg command view of the same data
Now that we have seen the kernel's view of rootvg data, it is interesting to look at what our command line interface shows. The lsvg command provides a summary of volume group information. This visual shows lsvg output for the same rootvg that we just examined with volgrp. The items in bold print above correspond to kdb volgrp items described on the prior slide:
- "VOLUME GROUP: rootvg" corresponds to major_num=0xA
- "VG IDENTIFIER" corresponds to the vg_id value.


kdb lvol Subcommand Output
    (0)> lvol 31108180
    LVOL............ 31108180
    work_Q.......... 3110BE00
    lv_status....... 00000002
    lv_options...... 00001000
    nparts.......... 00000001
    i_sched......... 00000000
    nblocks......... 00010000
    parts[0]........ 31108300
      pvol@ 310E4600 dev 00190000 start 00DE1100
    parts[1]........ 00000000
    parts[2]........ 00000000
    maxsize......... 00000000
    tot_rds......... 00000000
    complcnt........ 00000000
    waitlist........ FFFFFFFF
    stripe_exp...... 00000000
    striping_width.. 00000000
    lvol_intlock. @ 311081BC
    lvol_intlock.... 00000000
    (0)>

Figure 7-21. kdb lvol Subcommand Output

BE0070XS4.0

Notes:


The lvol structure
Each active logical volume is represented by an lvol structure. The lvol structure is defined as follows:
struct lvol {
    struct buf     **work_Q;        /* work in progress hash table        */
    short          lv_status;       /* lv status: closed, closing, open   */
    ushort         lv_options;      /* logical dev options (see below)    */
    short          nparts;          /* num of part structures for this    */
                                    /* lv - base 1                        */
    char           i_sched;         /* initial scheduler policy state     */
    char           lv_avoid;        /* online backup mask indicator       */
    ulong          nblocks;         /* LV length in blocks                */
    struct part    *parts[3];       /* partition arrays for each mirror   */
    int            maxsize;         /* max number of pp allowed in lv     */
    ulong          tot_rds;         /* total number of reads to LV        */
    int            parent_minor_num;/* if this is an online backup copy   */
                                    /* this is the minor number of the    */
                                    /* 'real' or 'parent' logical volume  */

    /* These fields of the lvol structure are read and/or written by
     * the bottom half of the LVDD; and therefore must be carefully
     * modified.
     */
    int            complcnt;        /* completion count-used to quiesce   */
    tid_t          waitlist;        /* event list for quiesce of LV       */
    struct file    *fp;             /* file ptr for lv mir bkp open/close */
    unsigned int   stripe_exp;      /* 2**stripe_block_exp = stripe       */
                                    /* block size                         */
    unsigned int   striping_width;  /* number of disks striped across     */
    Simple_lock    lvol_intlock;
    uchar          lv_behavior;     /* special conditions lv may be under */
    struct io_stat *io_stats[3];    /* collect io statistics here         */
    unsigned int   syncing;         /* Count of SYNC requests             */
    unsigned int   blocked;         /* Count of blocked requests          */
};


Items shown in bold:
- lv_status: 0=>closed, 1=>trying to close, 2=>open, 3=>being deleted
- lv_options is a flag word. Some of the flags are: 0x0001=>write verify, 0x0020=>read-only, 0x0040=>dump in progress to this logical volume, 0x0080=>this logical volume is a dump device, 0x1000=>original default (not passive) mwcc (mirror write consistency check) on.
- nparts: Number of copies (1=>no mirror, 2=>single mirror, 3=>two mirrors). This gives the number of parts array elements that are meaningful.
- i_sched: Scheduling policy for this logical volume. Values include: 0=>regular, non-mirrored LV, 1=>sequential write, sequential read, 2=>parallel write, read closest, 3=>sequential write, read closest, 4=>parallel write, sequential read, 5=>striped
- nblocks: Number of 512 byte blocks in this logical volume
- *parts[3]: Each parts element is a part structure pointer, which points to an array of part structures that define the physical volume storage for one logical volume copy. Each of these part structures points to a pvol structure and gives the disk start address for one part of the logical volume data. The structure is defined as follows:
struct part {
    struct pvol *pvol;      /* containing physical volume     */
    daddr_t     start;      /* starting physical disk address */
    int         sync_trk;   /* current LTG being resynced     */
    char        ppstate;    /* physical partition state       */
    char        sync_msk;   /* current LTG sync mask          */
};

kdb lvol subcommand
The kdb subcommand, lvol, formats lvol structure data. The visual above shows the lvol output for lvols[7], from the rootvg volume group. This is the logical volume with minor # 7: hd3. Items above in bold give:
- lv_status=2 means the logical volume is open.
- nparts=1 means there is only one parts structure for this logical volume.
- i_sched=0 means the scheduling policy for this logical volume is “regular, non-mirrored”.
- nblocks=0x10000 is the number of 512 byte blocks in this logical volume. This translates to 65536 decimal.
The single part structure is at location 0x31108300. The lvol subcommand summarizes this part structure:
- It points to the pvol structure at 0x310E4600.
- The physical volume major/minor numbers are 0x19 (decimal 25)/0. The ls -l command on /dev/hd* tells us this is the major/minor number of hdisk0.
- The disk start address is 0x00DE1100.
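The field encodings listed for lv_status, lv_options and i_sched can be checked against the kdb output with a small decoder. The sketch below is illustrative only: the macro and function names are hypothetical, chosen for this example, and are not the names used in the LVM header files. Only the numeric values come from the notes above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag names; only the numeric values come from the notes. */
#define LVOPT_WRITE_VERIFY 0x0001u /* write verify                       */
#define LVOPT_READ_ONLY    0x0020u /* read-only                          */
#define LVOPT_DUMP_ACTIVE  0x0040u /* dump in progress to this LV        */
#define LVOPT_DUMP_DEVICE  0x0080u /* this LV is a dump device           */
#define LVOPT_MWCC         0x1000u /* original default (not passive) mwcc */

/* Describe an lv_status value (0..3), or return "unknown". */
const char *lv_status_name(int lv_status)
{
    static const char *const names[] = {
        "closed", "trying to close", "open", "being deleted"
    };
    if (lv_status < 0 || (size_t)lv_status >= sizeof names / sizeof names[0])
        return "unknown";
    return names[lv_status];
}

/* Describe an i_sched value (0..5), or return "unknown". */
const char *i_sched_name(int i_sched)
{
    static const char *const names[] = {
        "regular, non-mirrored",
        "sequential write, sequential read",
        "parallel write, read closest",
        "sequential write, read closest",
        "parallel write, sequential read",
        "striped"
    };
    if (i_sched < 0 || (size_t)i_sched >= sizeof names / sizeof names[0])
        return "unknown";
    return names[i_sched];
}

/* Return 1 if the given flag bit is set in lv_options, else 0. */
int lv_option_set(unsigned lv_options, unsigned flag)
{
    assert(flag != 0);
    return (lv_options & flag) != 0;
}
```

Applied to the hd3 values, lv_options=0x1000 has the mwcc bit set and the write verify bit clear, lv_status=2 decodes to open, and i_sched=0 decodes to the regular, non-mirrored policy.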


AIX lslv Command Output
# lslv hd3
LOGICAL VOLUME:     hd3                    VOLUME GROUP:   rootvg
LV IDENTIFIER:      0001d2ca00004c00000000f11c1697a0.7 PERMISSION: read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs                    WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                1                      PPs:            1
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       center                 UPPER BOUND:    32
MOUNT POINT:        /tmp                   LABEL:          /tmp
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO

Figure 7-22. AIX lslv Command Output


Notes: lslv command output
The visual above shows lslv command output for rootvg logical volume hd3, the /tmp logical volume. The items in bold print above correspond to kdb lvol items described on the prior slide:
- “LV STATE: opened/syncd” corresponds to lv_status=2.
- “WRITE VERIFY: off” corresponds to lv_options=00001000 (the write verify flag, 0x0001, is not set).
- “PP SIZE: 32”, “LPs: 1” and “PPs: 1” correspond to nblocks=00010000 (1 PP x 32 MB/PP = 65536 blocks x 512 bytes/block, and 65536 decimal = 10000 hexadecimal).
- “MIRROR WRITE CONSISTENCY: on/ACTIVE” corresponds to lv_options=00001000 (flag is 0x1000 for original default mwcc).

- “SCHED POLICY: parallel” is technically incorrect here, but it has no meaning because this logical volume is not mirrored. The i_sched=00000000 value from kdb correctly reflects this (SCH_REGULAR = 0 => regular, non-mirrored logical volume).
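The relationship between the lslv partition figures and the kdb nblocks value is plain arithmetic. The following is a minimal sketch of that conversion. The function name is hypothetical, and the function assumes a single copy, so that the number of physical partitions equals the number of logical partitions, as is the case for hd3.

```c
#include <assert.h>

/*
 * Number of 512-byte blocks in a logical volume made of `pps` physical
 * partitions of `pp_size_mb` megabytes each (one copy assumed).
 * Hypothetical helper written for this example.
 */
unsigned long lv_nblocks(unsigned long pps, unsigned long pp_size_mb)
{
    const unsigned long bytes_per_mb = 1024UL * 1024UL;
    const unsigned long block_size = 512UL;

    assert(pp_size_mb > 0);
    /* blocks per partition, then scaled by the partition count */
    return pps * (pp_size_mb * bytes_per_mb / block_size);
}
```

For hd3, lv_nblocks(1, 32) gives 65536, which is the 0x10000 shown in the nblocks field.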

kdb pvol Subcommand Output
(0)> pvol 310e4600
PVOL............ 310E4600
dev............. 00190000
xfcnt........... 00000000
pvstate......... 00000000
pvnum........... 00000000
vg_num.......... 0000000A
fp.............. 10000C60
flags........... 00000000
num_bbdir_ent... 00000000
fst_usr_blk..... 00001100
beg_relblk...... 021E6B9F
next_relblk..... 021E6B9F
max_relblk...... 021E6C9E
defect_tbl...... 310E4800
sa_area[0].... @ 310E4638
sa_area[1].... @ 310E4640
pv_pbuf....... @ 310E4648
oclvm......... @ 310E46F0

Figure 7-23. kdb pvol Subcommand Output

Notes: pvol structure
The basic hardware unit for LVM is a physical volume. The kernel describes this in the pvol structure. This structure is defined as follows:
struct pvol {
    dev_t             dev;           /* dev_t of physical device          */
    struct unique_id  pvid;
    short             pvstate;       /* PV state                          */
    short             pvnum;         /* LVM PV number 0-31/0-127          */
    int               vg_num;        /* VG major number                   */
    struct file       *fp;           /* file pointer from open of PV      */
    char              flags;         /* place to hold flags               */
    short             num_bbdir_ent; /* current number of BB Dir entries  */
    daddr_t           fst_usr_blk;   /* first available block on the PV   */
                                     /* for user data                     */
    daddr_t           beg_relblk;    /* first blkno in reloc pool         */
    daddr_t           next_relblk;   /* blkno of next unused relocation   */
                                     /* block in reloc blk pool at end    */
                                     /* of PV                             */
    daddr_t           max_relblk;    /* largest blkno avail for reloc     */
    struct defect_tbl *defect_tbl;   /* pointer to defect table           */
    struct sa_pv_whl {               /* VGSA information for this PV      */
        daddr_t       lsn;           /* SA logical sector number - LV 0   */
        ushort        sa_seq_num;    /* SA wheel sequence number          */
        char          nukesa;        /* flag set if SA to be deleted      */
    } sa_area[2];                    /* one for each possible SA on PV    */
    struct pbuf       pv_pbuf;       /* pbuf struct for writing cache     */
    short             bad_read;      /* changed to 1 on first bad read    */
#ifdef CLVM_2_3
    struct clvm_2_3pv *oclvm;        /* ptr to old CLVM pv struct         */
#endif /* CLVM_2_3 */
    int               xfcnt;         /* transfer count for this pv        */
};

Items shown in bold:
- dev: major/minor device number for this disk (dev(31-16) = major, dev(15-0) = minor). Defined in /usr/include/sys/types.h.
- pvstate: Physical volume state (0=>normal, 1=>cannot be accessed, 2=>no hw/sw relocation allowed, 3=>pv involved in snapshot)
- vg_num: volume group major number

pvol kdb subcommand
The visual above shows output of the kdb pvol command. The parameter used is from our volgrp output for rootvg. Items above in bold give:
- dev=00190000 means major/minor #s are 25/0 (decimal), and is for hdisk0.
- pvstate=0 means normal, accessible physical volume.
- pvnum=0 means physical volume number 0 in this volume group.
- vg_num=0xA is the major number of this volume group (rootvg). This can be confirmed by executing ls -l in /dev: this shows /dev/rootvg as having major number 10 (decimal).
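The split of the dev field into major and minor numbers can be sketched as follows. The helper names are hypothetical and the code assumes the 32-bit layout described in the notes above (bits 31-16 major, bits 15-0 minor). Other kernels or dev_t widths may use a different layout.

```c
#include <assert.h>
#include <stdint.h>

/* Major number: bits 31-16 of the 32-bit device number. */
unsigned dev32_major(uint32_t dev)
{
    return (unsigned)(dev >> 16);
}

/* Minor number: bits 15-0 of the 32-bit device number. */
unsigned dev32_minor(uint32_t dev)
{
    return (unsigned)(dev & 0xFFFFu);
}

/* Build a device number from major/minor; both must fit in 16 bits. */
uint32_t dev32_make(unsigned major, unsigned minor)
{
    assert(major <= 0xFFFFu && minor <= 0xFFFFu);
    return ((uint32_t)major << 16) | (uint32_t)minor;
}
```

With dev=0x00190000 from the kdb output, these helpers yield major 25 and minor 0, matching the ls -l listing for hdisk0.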

AIX lspv Command Output
# lspv hdisk0
PHYSICAL VOLUME:    hdisk0                   VOLUME GROUP:     rootvg
PV IDENTIFIER:      0001d2ca308b4251         VG IDENTIFIER     0001d2ca00004c00000000f11c1697a0
PV STATE:           active
STALE PARTITIONS:   0                        ALLOCATABLE:      yes
PP SIZE:            32 megabyte(s)           LOGICAL VOLUMES:  9
TOTAL PPs:          542 (17344 megabytes)    VG DESCRIPTORS:   2
FREE PPs:           497 (15904 megabytes)    HOT SPARE:        no
USED PPs:           45 (1440 megabytes)
FREE DISTRIBUTION:  108..92..80..108..109
USED DISTRIBUTION:  01..16..28..00..00
#

Figure 7-24. AIX lspv Command Output

Notes: AIX lspv command
The visual above shows output of the AIX lspv command for hdisk0. The items in bold print above correspond to kdb pvol items described on the prior visual:
- “PHYSICAL VOLUME: hdisk0” corresponds to pvnum=0, the LVM number for hdisk0 in rootvg.
- “VOLUME GROUP: rootvg” corresponds to vg_num=0xA, the rootvg major number.
The “PV IDENTIFIER” is maintained in the ODM class CuAt. The “VG IDENTIFIER” is maintained in the volgrp structure which points to this pvol structure. It is also maintained in ODM class CuAt.

Checkpoint (1 of 2)
- Each user process contains a private F___ D______ T____.
- The kernel maintains a _______ structure and a _______ structure for each mounted file system.
- There is one gfs structure for each mounted file system. True or False?
- The three kernel structures __________, __________ and __________ are used to track LVM volume group, logical volume and physical volume data, respectively.
- The kdb subcommand __________ and the AIX command _________ both reflect volume group information.

Figure 7-25. Checkpoint (1 of 2)

Notes:

Checkpoint (2 of 2)
- There is one vmount/vfs structure pair for each mounted filesystem. True or False?
- Every open file in a filesystem is represented by exactly one file structure. True or False?
- Each vnode for an open file points to a _______ structure.
- The inode number given by ls -id /usr is _____. Why?

Figure 7-26. Checkpoint (2 of 2)

Notes:

Exercise
Complete exercise six
- Consists of theory and hands-on
- Ask questions at any time
- Activities are identified by a
What you will do:
- Test what you have learned about the LFS and VFS
- Locate the LFS/VFS structures for an open file
- Identify what file a process has opened

Figure 7-27. Exercise

Notes:
Turn to your lab workbook and complete exercise six.

Unit Summary
- The LFS and VFS provide support for many different file system types simultaneously
- The LFS/VFS allows for different types of file systems to be mounted together, forming a single homogeneous view
- The LFS services the system call interface for read() and write()
- The VFS defines files (vnodes) and file systems (vfs)
- Each file system type provides unique functions for file and file system operations. Operations are defined by the vnodeops and vfsops structures.
- The gnode is a generic object connecting the VFS with the file system specific inode
- kdb has special subcommands for viewing LFS/VFS structures
- The kernel tracks LVM data in structures volgrp, lvol and pvol. There are kdb subcommands for displaying these structures.

Figure 7-28. Unit Summary

Notes:

Unit 8. Journaled File System

What This Unit Is About
This unit describes the internal structures of the Journaled File System (JFS).

What You Should Be Able to Do
After completing this unit, you should be able to:
• Describe basic concepts of the JFS disk layout
• Describe JFS elements: inodes, superblock, allocation groups, indirect block and double indirect block
• Contrast on disk and incore inode structures
• Describe the relationship between JFS and LVM in performing I/O

How You Will Check Your Progress
Accountability:
• Unit review

References
AIX Documentation: System Management Guide: Operating System and Devices

Unit Objectives
At the end of this lesson you should be able to:
- Describe basic concepts of the JFS disk layout
- Describe JFS elements: inodes, superblock, allocation groups, indirect block and double indirect block
- Contrast on disk and incore inode structures
- Describe the relationship between JFS and LVM in performing I/O

Figure 8-1. Unit Objectives

Notes:

JFS File System
- Boot Block
- Super Block
- Inodes
- Indirect Blocks
- Data Blocks

Figure 8-2. JFS File System

Notes: Journaled File System
Introduction
AIX 5L supports two main native file system types: JFS and JFS2. JFS (Journaled File System) is the original native file system for AIX. JFS2 (Enhanced Journaled File System) is a more recent development and is discussed in a following unit.
Each JFS file system occupies one logical volume. The visual illustrates some of the basic components of a JFS. JFS maintains file data and components that identify where a file or directory's data is located on the disk. These components include inodes, data blocks, super blocks, a boot block and one or more allocation groups. An allocation group contains disk inodes and fragments.
The actual on-disk layout of a JFS file system can be viewed with the fsdb command.

Boot Block
The boot block occupies the first 4096 bytes of a JFS starting at byte offset 0. This area is from the original Berkeley Software Distribution (BSD) Fast File System design, and is not used in AIX.

Superblock
The superblock is 4096 bytes in size and starts at byte offset 4096. The superblock maintains information about the entire JFS and includes the following fields:
- Size
- Number of data blocks
- A flag indicating the state
- Allocation group sizes
The superblock is critical to the JFS and if corrupted will prevent the file system from being mounted. For this reason a backup copy of the superblock is always written in block 31.

Blocks
A block is a 4096 byte data allocation unit.

Fragments
The journaled file system is organized in a contiguous series of fragments. JFS fragments are the basic allocation unit and the disk is addressed at the fragment level. JFS fragment support allows disk space to be divided into allocation units that are smaller than the default size of 4096 bytes. Smaller allocation units or fragments minimize wasted disk space by more efficiently storing the data in a file or directory's partial logical blocks. The functional behavior of JFS fragment support is based on that provided by Berkeley Software Distribution (BSD) fragment support.

Specifying fragment size
The fragment size for a JFS is specified during its creation. The allowable fragment sizes for JFS are 512, 1024, 2048, and 4096 bytes. The default fragment size is 4096 bytes. Different file systems can have different fragment sizes, but only one fragment size can be used within a single file system.
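To make the space saving from smaller fragments concrete, the sketch below rounds the data held in a partial logical block up to a whole number of fragments. It is a simplified illustration, not JFS allocation code. The function and macro names are hypothetical, and the only facts taken from the text are the 4096 byte block size and the four allowable fragment sizes.

```c
#include <assert.h>

#define JFS_BLOCK_BYTES 4096u /* logical block size, from the text */

/* Return 1 if frag_size is one of the allowable JFS fragment sizes. */
int jfs_frag_size_valid(unsigned frag_size)
{
    return frag_size == 512u || frag_size == 1024u ||
           frag_size == 2048u || frag_size == 4096u;
}

/*
 * Disk space consumed by `nbytes` of data stored in a partial logical
 * block when space is handed out in whole fragments of `frag_size` bytes.
 */
unsigned frag_alloc_bytes(unsigned nbytes, unsigned frag_size)
{
    assert(jfs_frag_size_valid(frag_size));
    assert(nbytes <= JFS_BLOCK_BYTES);
    /* round up to the next multiple of the fragment size */
    return (nbytes + frag_size - 1u) / frag_size * frag_size;
}
```

For example, 600 bytes of data occupy 1024 bytes with 512 byte fragments but a full 4096 bytes with the default fragment size.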

Inodes
The disk inode is the anchor for files in a JFS. There is a one to one correspondence between a disk inode, an i-number, and a file. The inode records file information such as size, allocation, owner, and so on. However, it is disjoint from the name since many different names can refer to the same inode via the inode number.
The collection of disk inodes can be referred to as the disk inode table. Inodes are 128 bytes in size and are identified by a unique inode number. The inode number maps an inode to its location on the disk or to an inode within its allocation group.

Allocation groups
The set of fragments making up a JFS are divided into one or more fixed-sized units of contiguous fragments. These are called allocation groups. An allocation group is similar to BSD cylinder groups.
Each allocation group contains disk inodes and free blocks. This permits inodes and data blocks to be dispersed throughout the file system and allows file data to lie in closer proximity to its inode. Despite the fact that the inodes are distributed through the disk, a disk inode can be located using a simple formula based on the i-number and the allocation group information contained in the super block.
The first 4096 bytes of the first allocation group holds the boot block and the second 4096 bytes holds the superblock. For the first allocation group, the inodes occupy the fragments immediately following the reserved block area. For subsequent groups, the inodes are found at the start of each group.

Allocation group sizes
Allocation groups are described by three sizes:
- The fragment allocation group size and the inode allocation group size are specified as the number of fragments and inodes that exist in each allocation group.
- The default allocation group size is 8 MB.
- Beginning in Version 4.2, it can be as large as 64 MB.
These three values are stored in the file system superblock, and they are set at JFS creation.
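The text does not give the formula for locating a disk inode, so the following is only a sketch of what such a calculation could look like under stated assumptions. The structure, its field names and the function are hypothetical. The size of the reserved area at the start of the first allocation group is passed in as a parameter because the text does not state it. Only the 128 byte inode size and the placement rules (after the reserved area in the first group, at the start of every other group) come from the notes above.

```c
#include <assert.h>
#include <stdint.h>

#define DINODE_BYTES 128u /* disk inode size, from the text */

/* Hypothetical geometry record; real values would come from the superblock. */
struct jfs_geom {
    uint32_t frag_size;      /* bytes per fragment                        */
    uint32_t agsize;         /* fragments per allocation group            */
    uint32_t iagsize;        /* inodes per allocation group               */
    uint32_t reserved_bytes; /* ASSUMED reserved area at start of group 0 */
};

/* Byte offset of disk inode `inum` within the logical volume (sketch). */
uint64_t inode_byte_offset(const struct jfs_geom *g, uint32_t inum)
{
    assert(g != 0 && g->iagsize > 0 && g->frag_size > 0);

    uint32_t ag  = inum / g->iagsize;  /* which allocation group      */
    uint32_t idx = inum % g->iagsize;  /* inode index within the group */
    uint64_t ag_start = (uint64_t)ag * g->agsize * g->frag_size;

    if (ag == 0)
        ag_start += g->reserved_bytes; /* skip the reserved block area */

    return ag_start + (uint64_t)idx * DINODE_BYTES;
}
```

With an 8 MB allocation group of 4096 byte fragments (2048 fragments), 2048 inodes per group, and an assumed reserved area of 32 blocks, inode 2048 would be the first inode of the second group, at byte offset 8388608.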

Virtual memory
AIX exploits the segment architecture to implement its JFS physical file system. Just as virtual memory looks contiguous to a user program but may be scattered about real memory or paging space, disk files are made to look contiguous to the user program even though the physical disk blocks may be very scattered.
When AIX needs to create a segment of virtual memory, it creates an External Page Table (XPT), which contains a collection of XPT blocks. When the physical file system creates a file, it creates a disk inode and possibly indirect blocks to describe the file. Disk inodes (and indirect blocks), and XPT blocks make their respective user-level resources appear contiguous.
The JFS maps all file system information into virtual memory, including user data blocks. The read and write operations are much simplified in that they merely initialize the mapping and then copy the data. Likewise, a directory lookup operation merely maps the directory into virtual memory and then goes walking through the directory structure. This greatly simplifies the code by dividing the algorithmic problem of searching directory entries from the task of performing disk I/O operations and managing a buffer cache.
The I/O function is handled by the Virtual Memory Manager (VMM). When a page fault occurs on a mapped file object, the VMM is able to determine what file is being accessed, examine the inode to determine where the data is, and initiate a page in to transfer the data from the file system into memory. Once completed, the faulting process can be resumed and the operation continues, oblivious to the fact that a memory mapped access caused a disk operation.

Reserved Inodes
0      Not used
1      Superblock (.superblock)
2      Root directory of file system
3      Disk inodes (.inodes)
4      Indirect blocks (.indirect)
5      Disk inode allocation map (.inodemap)
6      Disk block allocation map (.diskmap)
7      Disk inode extensions (.inodex)
8      Inode extension map (.inodexmap)
9-15   Reserved

Figure 8-3. Reserved Inodes

Notes: Reserved Inodes
Introduction
A unique feature of the JFS implementation is the implementation of file system data as unnamed files that reside in the file system. Every JFS file system has inodes 0-15 reserved. Most of these file names begin with a dot (“.”) because they are hidden files. But, these ‘hidden’ files do not appear in any directory. This is done by manipulating the inodes so they do not require a directory entry to support their link count value.
Every open file is represented by a segment in the VMM. Most of these reserved inodes never actually exist on the disk, but are only present in the VMM when a file system is mounted.

Superblock
Inode 1 is reserved for a file named .superblock. The superblock holds a concise description of the JFS: its size, allocation information, and an indication of the consistency of on-disk data structures. The inode points to two data blocks, 1 and 31. Data block 31 is a spare copy of the superblock at data block 1.

Root directory
Inode 2 is always used for the JFS root directory.

Disk inodes
Inode 3 is reserved for a file named .inodes. Every JFS object is described by a disk inode. Each disk inode is a fixed size: 128 bytes.

Indirect blocks
Inode 4 is reserved for a file named .indirect. The most common JFS object is a regular file. For a regular file, the inode holds a list of the data blocks which compose the file. It would be impractical to allocate inodes large enough to directly hold this entire list. The list of physical blocks is held in a tree structure, rather than an array. The intermediate nodes of this tree are the indirect blocks.

Disk inode allocation map
Inode 5 is reserved for a virtual file named .inodemap. This allocation map has bit flags turned on or off showing if an inode is in use or free.

Disk block allocation map
Inode 6 is reserved for a virtual file named .diskmap. This bit map indicates whether each block on the logical volume is in use or free.

Disk inode extensions
Inode 7 is reserved for a virtual file named .inodex. This file contains information about inode extensions which are used by access control lists.

Inode extension map
Inode 8 is reserved for the virtual file named .inodexmap. This bit map is used to keep track of free and allocated inode extensions.

Future use
Inodes 9 through 15 are reserved for future extensions.
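The reserved inode numbers and the idea of an allocation bit map can be sketched in a few lines of C. The enum names and helper functions below are hypothetical, invented for this example. The bit order within each byte and the polarity (1 = in use) are assumptions, since the notes only say that bit flags show whether an item is in use or free.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Reserved inode numbers from the table above (names are hypothetical). */
enum jfs_reserved_ino {
    INO_UNUSED        = 0,
    INO_SUPERBLOCK    = 1,
    INO_ROOTDIR       = 2,
    INO_INODES        = 3,
    INO_INDIRECT      = 4,
    INO_INODEMAP      = 5,
    INO_DISKMAP       = 6,
    INO_INODEX        = 7,
    INO_INODEXMAP     = 8,
    INO_LAST_RESERVED = 15
};

/* Return 1 if the inode number lies in the reserved range 0-15. */
int ino_is_reserved(unsigned long ino)
{
    return ino <= (unsigned long)INO_LAST_RESERVED;
}

/* Test bit n of an allocation map (assumed: 1 = in use, 0 = free). */
int map_test(const unsigned char *map, size_t n)
{
    assert(map != NULL);
    return (map[n / CHAR_BIT] >> (n % CHAR_BIT)) & 1;
}

/* Mark item n as in use (in_use != 0) or free (in_use == 0). */
void map_set(unsigned char *map, size_t n, int in_use)
{
    assert(map != NULL);
    if (in_use)
        map[n / CHAR_BIT] |= (unsigned char)(1u << (n % CHAR_BIT));
    else
        map[n / CHAR_BIT] &= (unsigned char)~(1u << (n % CHAR_BIT));
}
```

The same two helpers would serve for either the inode allocation map or the disk block allocation map, since both are described simply as one flag per item.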

Disk Inode Structure
struct dinode {
    uint    di_gen;
    mode_t  di_mode;
    ushort  di_nlink;
    uid_t   di_uid;
    gid_t   di_gid;
    . . .
    uint    di_nblocks;
    . . .   di_mtime_ts;
    . . .   di_atime_ts;
    . . .   di_ctime_ts;
    . . .
#   define di_rdaddr     _di_info._di_file._di_rdaddr
#   define di_rindirect  _di_info._di_file._di_indblk.id_raddr
};

Figure 8-4. Disk Inode Structure

Notes: Disk inode structure
Introduction
Inodes exist in a static form on disk and have access information for the file in addition to pointers to the real disk addresses of the file’s data blocks. The number of inodes in a JFS file system depends on its size, the allocation group size (8 MB by default), and the number of bytes per inode ratio (4096 by default).
The on disk inode structure is defined in /usr/include/jfs/ino.h. The most basic elements of this structure are shown on the slide above and described in the text that follows.

Inode header file
The inode structure is defined in /usr/include/jfs/ino.h and contains:

Symbol         Description
di_gen         The disk inode generation number
di_nlink       The number of directory entries which refer to the file
di_mode        The file type, access permissions and attributes
di_uid         User ID of owner
di_gid         Group ID
di_size        File size
di_nblocks     Number of blocks used by file. This does not include indirect blocks.
di_mtime       Time at which the contents of the file were last modified
di_atime       Time at which the file was last accessed by read
di_ctime       Time at which contents of disk inode were last updated
di_rdaddr[8]   Real disk addresses of the data
di_rindirect   Real disk address of the indirect block, if any

Inode types
The private portion of the inode depends on its type. The types are defined in /usr/include/sys/mode.h and compose portions of the di_mode field. Inode types are:

Type       Description
S_IFREG    Regular file. The format of the private portion of an inode for a data file (including some symbolic links and directories) depends on the size of the file. The AIX file system always allocates full blocks to data files.
S_IFDIR    Directory. The private portion of a directory inode is identical to that of a regular file.
S_IFBLK    Block device. Block device inodes have only the dev_t.
S_IFCHR    Character device. Character device inodes have only the dev_t.
S_IFLNK    Symbolic link
S_IFSOCK   A UNIX domain socket
S_IFIFO    FIFO. A FIFO inode has no persistent private data.
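Because the type bits are part of di_mode, the inode type can be recovered by masking the mode with S_IFMT. The sketch below uses the standard POSIX <sys/stat.h> definitions rather than the AIX header named above, and the function name is hypothetical. It works on any mode_t value, such as the st_mode returned by stat().

```c
/* Request the XSI file type macros on strict compilers. */
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/stat.h>

/* Return a short description of the file type encoded in a mode value. */
const char *inode_type_name(mode_t mode)
{
    switch (mode & S_IFMT) {   /* keep only the file type bits */
    case S_IFREG:  return "regular file";
    case S_IFDIR:  return "directory";
    case S_IFBLK:  return "block device";
    case S_IFCHR:  return "character device";
    case S_IFLNK:  return "symbolic link";
    case S_IFSOCK: return "socket";
    case S_IFIFO:  return "FIFO";
    default:       return "unknown";
    }
}
```

The permission bits in the low part of the mode do not affect the result, so S_IFDIR | 0755 and S_IFDIR | 0700 are both reported as a directory.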

In-core Inodes
- When a file is opened, an in-core inode is created in memory
- The in-core inode structure is defined in /usr/include/jfs/inode.h
- In-core inodes include:
  - An exclusive-use lock
  - Use count
  - Open counts
  - State flags
  - Exclusion counts
  - Hash table links
  - Free list links
  - Mount table entry
- In-core inode states:
  - Active
  - Cached
  - Free

Figure 8-5. In-core Inodes

Notes: In-core inodes
Introduction
When a JFS file is opened, an in-core inode is created in memory. The in-core inode contains a copy of all the fields defined in the disk inode in addition to fields for keeping track of the in-core inode.

In-core inode header file
The in-core inode structure is defined in /usr/include/jfs/inode.h. There are two parts to each in-core inode:
- First portion of data structure relevant only while the object is accessed
- The last 128 bytes is a copy of the disk inode

. 2001. it can be placed on a wait list for the inode (if the O_DELAY open flag was specified) Exclusion counts 8-12 Kernel Internals © Copyright IBM Corp. 2003 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.Student Notebook In-core inode header file The in-core inode includes: Item Notes • Must be held before the in-core inode is updated • Actually implemented with a simple lock The in-core inode cannot be destroyed while it has a non-zero use count • Separate reader and writer counts are maintained in the gnode in the in-core inode • Are incremented at each open. and decremented at close • A process which has opened the file for both reading and writing is counted as both a reader and writer Exclusive-use lock Use count Open counts State flags Maintain miscellaneous in-core inode state • A bit indicates that the file has been opened for exclusive access • A separate count of the number of readers who have specified read-only sharing (precluded writers) is also maintained • If a process attempts to open the inode with a mode which conflicts with the current open status.

it can be easily reacquired should a process need the inode again. There is no vnode that refers to this inode.Active. but still has a non-zero link count.Free.0. There is currently a vnode that refers to this inode. 2003 Unit 8. . accessed by device and index • Allows finding an inode by file handle. If an entry is not already in the table.Cached. The data the in-core inode and associated segment holds is still valid and may be reused if the inode is reopened. then it is placed on the free list. Entries are accessed by iget().3 Student Notebook Uempty Item Notes • All existing in-core inodes are kept in a hash table. Journaled File System 8-13 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. then it is placed in the cache list. a hash table of in-core inodes for recently accessed files. This implies that a process has the corresponding file open. it no longer has other references to it. This implies that the corresponding file is not open anywhere on the system.V2. and it has a zero link count. If an inode is iput(). its underlying device must currently be mounted • Each in-core inode points back to its mount table entry to avoid searching the mount table to find the entry for this object Hash table links Free list links Mount table entry In-core inode states There are three states for every in-core inode: . From this list. This avoids extra disk I/O. The structure is available for immediate use. If an inode is iput() and has no references to it. iget() will call iread to obtain the entry.0. It contains the inode number and a file system number. Entries are marked as unused by iput(). . . and assures that multiple inodes are not created for the same object All unused in-core inodes are kept in a free list • If an object is in use. 2001. © Copyright IBM Corp. In-core inode table Active in-core inodes are maintained in the inode table.
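The rules for where an inode goes after iput() amount to a small decision function. The sketch below models only that decision. The enum, the function names and the use of a single reference counter are simplifications made for this example, and none of it is taken from the JFS source.

```c
#include <assert.h>

/* Hypothetical labels for the three in-core inode states. */
enum icore_state { ICORE_ACTIVE, ICORE_CACHED, ICORE_FREE };

/*
 * Classify an in-core inode from its remaining reference count and its
 * link count, following the rules described in the text.
 */
enum icore_state icore_classify(unsigned refs, unsigned nlink)
{
    if (refs > 0)
        return ICORE_ACTIVE;                    /* still referenced      */
    return nlink > 0 ? ICORE_CACHED             /* kept on the cache list */
                     : ICORE_FREE;              /* goes to the free list  */
}

/*
 * Model of an iput()-style release: drop one reference and report the
 * state the inode ends up in. The caller must hold a reference.
 */
enum icore_state icore_release(unsigned *refs, unsigned nlink)
{
    assert(refs != 0 && *refs > 0);
    --*refs;
    return icore_classify(*refs, nlink);
}
```

An inode with two references stays active after the first release, moves to the cache list after the second if its link count is still non-zero, and would go to the free list instead if the file had been unlinked.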

In-core inode creation
The steps for in-core inode creation are:

Step   Action
1.     When a file is opened, the kernel searches the hash queue to see if there is an in-core inode already associated with the file.
2.     If an inode is found in the hash queue, the reference count of the in-core inode is incremented and the file descriptor is returned to the user.
3.     Otherwise, an in-core inode is removed from the free list and the disk inode is copied into the in-core inode.
4.     The in-core inode is then placed on the hash queue and remains there until the reference count is zero (no processes have the file open).

Inode locking
The JFS serializes operations by obtaining an exclusive lock on each inode involved in the operation. For all operations which require locking more than one inode, all involved inodes are known at the start of the operation. The ilocklist() routine sorts these into a descending order before locking (highest inode number is locked first). This prevents deadlock conditions.
Note: The iget() routine does not return a locked inode, nor does iput() free any lock on the inode.
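The deadlock avoidance rule, always locking inodes in descending inode number order, depends only on establishing one global order before any lock is taken. The sketch below shows just that ordering step using the standard qsort() routine. It is not the ilocklist() implementation. The function names are hypothetical and the actual lock acquisition is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: larger inode numbers sort first (descending). */
static int cmp_ino_desc(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/*
 * Arrange the inode numbers involved in an operation in the order in
 * which their locks would be taken: highest inode number first.
 */
void sort_lock_order(unsigned long *inos, size_t n)
{
    assert(inos != NULL || n == 0);
    if (n > 1)
        qsort(inos, n, sizeof inos[0], cmp_ino_desc);
}
```

Because every operation sorts its inode set the same way, two operations that share inodes always request the shared locks in the same relative order, so neither can hold one lock while waiting for a lock the other already holds in the opposite order.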

Direct (No Indirect Blocks)

[Figure 8-6. Direct (No Indirect Blocks): inode disk addresses for file size <= 32KB. The inode fields di_raddr[0] through di_raddr[7] hold logical volume block numbers that point to data block 0 through data block 7.]

Notes:

Indirect blocks

Introduction

JFS uses indirect blocks to address the disk space allocated to larger files. There are three methods for addressing the disk space:

- Direct
- Single indirect
- Double indirect

Beginning in AIX 4.2, file systems enabled for large files allow a maximum file size of slightly less than 64 gigabytes (68589453312 bytes). The first double indirect block contains 4096 byte fragments, and all subsequent double indirect blocks contain (32 x 4096 = 131072) byte fragments. The following produces the maximum file size for file systems enabling large files:

(1 * (1024 * 4096)) + (511 * (1024 * 131072))

The fragment allocation assigned to a directory is divided into records of 512 bytes each and grows in accordance with the allocation of these records.

Direct

The first eight addresses point directly to a single allocation of disk fragments. Each disk fragment is 4 KB in size (8 x 4 KB = 32 KB). This method is used for files of 32 KB or less.
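The maximum file size formula can be checked with a short calculation. The constant and function names below are made up for this sketch; the numbers (4096-byte fragments, 131072-byte large fragments, 1024 addresses per indirect block, 1 + 511 indirect blocks) are the ones used in the formula above.

```c
#include <assert.h>

#define FRAG_SIZE      4096LL            /* default fragment size in bytes */
#define BIG_FRAG_SIZE  (32 * FRAG_SIZE)  /* 131072-byte fragments */
#define ADDRS_PER_IND  1024LL            /* addresses per indirect block */
#define IND_BLOCKS     512LL             /* indirect blocks: 1 small + 511 large */

/* Maximum file size, in bytes, for a large-file-enabled file system. */
long long jfs_large_max_size(void)
{
    long long first = 1 * (ADDRS_PER_IND * FRAG_SIZE);
    long long rest  = (IND_BLOCKS - 1) * (ADDRS_PER_IND * BIG_FRAG_SIZE);
    assert(BIG_FRAG_SIZE == 131072);
    return first + rest;
}
```

The first term contributes 4 MB and the remaining 511 terms contribute 128 MB each, which is why the total falls just short of 64 GB.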

Single Indirect

[Figure 8-7. Single Indirect: file size between 32KB and 4MB. The inode's indirect field (a page index in .indirect) points to an indirect page whose entries indir[0] through indir[1023] hold logical volume block numbers for data block 0 through data block 1023.]

Notes:

Single indirect

The i_rindirect field of the inode contains the address of an indirect block containing 1024 addresses. These addresses point to disk fragments for each allocation. This method is used for files between 32 KB and 4 MB (1024 x 4 KB) in size.

Double Indirect

[Figure 8-8. Double Indirect: inode disk addresses for file size > 4MB. The inode's indirect field (a page index in .indirect) points to an indirect root whose entries indir[0] through indir[511] are page indices of indirect pages. Each indirect page holds ind[0] through ind[1023], the logical volume block numbers of data block 0 through data block 1023.]

Notes:

Double indirect

The i_rindirect field of the inode points to a double indirect block that contains 512 addresses that point to indirect blocks. The 512 addresses do not point to data but instead point to 1024 addresses that point to data blocks (512 x (1024 x 4 KB) = 2 GB). This method is used for files in the range from 4 MB to 2 GB.

With large file support enabled, the graphic still holds true. However, in this case all "data blocks" pointed to through indir[1] ... indir[511] are 32 x 4096 = 131072 bytes long, rather than the default fragment size of 4096 bytes.
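The three addressing methods can be told apart by file size alone, and a double indirect lookup is a simple division. The sketch below assumes a file system without large file support (every fragment is 4096 bytes). All names in it are hypothetical and are not taken from the JFS source.

```c
#include <assert.h>

#define FRAG      4096LL   /* fragment size in bytes */
#define NDIRECT   8LL      /* direct addresses held in the inode */
#define PER_IND   1024LL   /* addresses per indirect block */
#define PER_ROOT  512LL    /* addresses in the double indirect block */

enum method { DIRECT, SINGLE_INDIRECT, DOUBLE_INDIRECT };

/* Pick the addressing method from the file size in bytes. */
enum method addressing_method(long long size)
{
    assert(size >= 0 && size <= PER_ROOT * PER_IND * FRAG);
    if (size <= NDIRECT * FRAG)
        return DIRECT;            /* up to 32 KB */
    if (size <= PER_IND * FRAG)
        return SINGLE_INDIRECT;   /* up to 4 MB */
    return DOUBLE_INDIRECT;       /* up to 2 GB */
}

/* For a double indirect file, split a logical block number into the
 * index in the double indirect block (root) and the index in the
 * indirect block it selects (leaf). */
void double_indirect_index(long long lbn, long long *root, long long *leaf)
{
    assert(lbn >= 0 && lbn < PER_ROOT * PER_IND);
    *root = lbn / PER_IND;
    *leaf = lbn % PER_IND;
}
```

For instance, logical block 1500 of a 5 MB file would be reached through the second entry of the double indirect block (index 1) and entry 476 of the indirect block it points to.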

Checkpoint

1. The basic allocation unit in JFS is a disk block. True or False?
2. The root inode number of a filesystem is always 1. True or False?
3. The last 128 bytes of an in-core JFS inode is a copy of the disk inode. True or False?
4. An allocation group contains __________ and __________.
5. JFS maps user data blocks and directory information into virtual memory. True or False?

Figure 8-9. Checkpoint

Notes:

Unit Summary

• Principal components of the JFS are allocation groups, inodes, data blocks and indirect blocks.
• A JFS allocation group contains inodes and related data blocks.
• A JFS in-core inode contains the disk inode data together with activity information such as open count and in-core inode state information. The state information indicates whether the structure is active or available for reuse.
• JFS accomplishes I/O by mapping all file system information into virtual memory, thus relying on VMM to do the actual I/O operations.

Figure 8-10. Unit Summary

Notes:

Unit 9. Enhanced Journaled File System

What This Unit Is About

This unit is about the internal structures of the Enhanced Journaled File System (JFS2).

What You Should Be Able to Do

After completing this unit, you should be able to:
• List the difference between the terms aggregate and fileset.
• Identify the various data structures that make up the JFS2 file system.
• Use the fsdb command to trace the various data structures that make up files and directories.

How You Will Check Your Progress

Accountability:
• Exercises using your lab system.

References

AIX Documentation: System Management Guide: Operating System and Devices

Unit Objectives

At the end of this lesson you should be able to:
• List the difference between the terms aggregate and fileset.
• Identify the various data structures that make up the JFS2 filesystem.
• Use the fsdb command to trace the various data structures that make up files and directories.

Figure 9-1. Unit Objectives

Notes:

Numbers

Function                                                        Value
Block size                                                      512 - 4096 bytes, configurable block size
Architectural max. file size (this is not the supported size!)  4 Petabytes
Max. file size (supported)                                      1 Terabyte (16 Terabytes on AIX 5.2)
Max. file system size (supported)                               1 Terabyte (16 Terabytes on AIX 5.2)
Number of inodes                                                Dynamic, limited by disk space
Directory organization                                          B+ tree

Figure 9-2. Numbers

Notes:

Introduction

The Enhanced Journaled File System (JFS2) is an extent-based journaled file system. It is the default file system for the 64-bit kernel of AIX 5L. The table above lists some general information about JFS2.

Aggregate and Fileset

• There is exactly one aggregate per logical volume.
• There may be multiple filesets per aggregate.
• Currently only one fileset per aggregate is supported.
• The meta-data has been designed to support multiple filesets, and this feature may be introduced in a future release of AIX 5L.

Aggregate block size: 512 bytes, 1024 bytes, 2048 bytes or 4096 bytes.

Figure 9-3. Aggregate and Fileset

Notes:

Introduction

The term aggregate is defined in this section. The layout of a JFS2 aggregate is also described.

Definitions

JFS2 separates the notion of a disk space allocation pool, called an aggregate, from the notion of a mountable file system sub-tree, called a fileset. The rules that define aggregates and filesets in JFS2 are listed above in the visual.

Aggregate block size

An aggregate has a fixed block size (number of bytes per block) that is defined at configuration time. The aggregate block size defines the smallest unit of space allocation supported on the aggregate. The block size cannot be altered, and must be no smaller than the physical block size (currently 512 bytes). Legal aggregate block sizes are:

- 512 bytes
- 1024 bytes
- 2048 bytes
- 4096 bytes

Do not confuse aggregate block size with the logical volume block size, which defines the smallest unit of I/O.

Aggregate

[Figure 9-4. Aggregate. Note: the aggregate block size is 1 KB in this example. The figure shows aggregate blocks 0 to 31 reserved for LVM, the primary aggregate superblock at block 32, the first extent of the aggregate inode allocation map (control page at block 36, IAG at block 40, with its control section, working map, persistent map and ixd section), the first 16 KB inode extent of the aggregate inode table at block 44 (aggregate inodes 0 to 31, including inode #1 "self", inode #2 block map, inode #16 fileset 0 and inode #17 fileset 1), the secondary aggregate superblock, and the xad entries that address each inode's data.]

Notes:

Aggregate layout

The diagram above and the table below detail the layout of the aggregate.

Part / Function

Reserved area
  The first 32 KB is not used by JFS2. The first block is used by the LVM.

Primary aggregate superblock
  The primary aggregate superblock (defined as a struct superblock) contains aggregate-wide information, such as the:
  • Size of the aggregate.
  • Size of allocation groups.
  • Aggregate block size.

Secondary aggregate superblock
  The secondary aggregate superblock is a direct copy of the primary aggregate superblock. It is used if the primary aggregate superblock is corrupted. Both primary and secondary superblocks are located at fixed locations. This allows the superblocks to be found without depending on any other information.

Aggregate inode table
  Contains inodes that describe the aggregate-wide control structures. Inodes will be described later.

Secondary aggregate inode table
  Contains replicated inodes from the aggregate inode table. Since the inodes in the aggregate inode table are critical for finding file system information, they are replicated in the secondary aggregate inode table. The actual data for the inodes will not be repeated, just the addressing structures used to find the data and the inode itself.

Aggregate inode allocation map
  Describes the aggregate inode table. It contains allocation state information on the aggregate inodes as well as their on-disk location.

Secondary aggregate inode allocation map
  Describes the secondary aggregate inode table.

Block allocation map
  Describes the control structures for allocating and freeing aggregate disk blocks within the aggregate. The block allocation map maps one-to-one with the aggregate disk blocks.

fsck working space
  Provides space for fsck to be able to track the aggregate block allocations. This space is necessary because, for a very large aggregate, there might not be enough memory to track this information in memory when fsck is run. One bit is needed for every aggregate block. The space is described by the superblock. The fsck working space always exists at the end of the aggregate.

In-line log
  Provides space for logging the meta-data changes of the aggregate. The space is described by the superblock. The in-line log always exists after the fsck working space.

Aggregate inodes

When the aggregate is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. Each of these aggregate inodes describes certain aspects of the aggregate itself, as follows:

Inode # / Description

0
  Reserved.

1
  Called the "self" inode, this inode describes the aggregate disk blocks comprising the aggregate inode map. This is a circular representation, in that aggregate inode one is itself in the file that it describes. The obvious circular representation problem is handled by forcing at least the first aggregate inode extent to appear at a well-known location, namely, 4 KB after the primary aggregate superblock. Therefore, JFS2 can easily find aggregate inode one, and from there it can find the rest of the aggregate inode table by following the B+-tree in inode one.

2
  Describes the block allocation map.

3
  Describes the In-line Log when mounted. This inode is allocated but no data is saved to disk.

4 - 15
  Reserved for future extensions.

16 -
  Starting at aggregate inode 16 there is one inode per fileset (the fileset allocation map inode). These inodes describe the control structures that represent each fileset. As additional filesets are added to the aggregate, the aggregate inode table itself may have to grow to accommodate additional fileset inodes. Note that as of the AIX 5.2 release there can only be one fileset. The preceding graphic shows a fileset 17. This is included to show design potential, and is not realizable at present.

Allocation Group

• The maximum number of allocation groups per aggregate is 128.
• The minimum size of an allocation group is 8192 aggregate blocks.
• The allocation group size must always be a power of 2 multiple of the number of blocks described by one dmap page (for example 1, 2, 4, 8, ... dmap pages).

Figure 9-5. Allocation Group

Notes:

Introduction

Allocation Groups (AG) divide the space on an aggregate into chunks. Allocation groups are used for heuristics only. Allocation groups allow JFS2 resource allocation policies to use well known methods for achieving good JFS2 I/O performance.

Allocation policies

When locating data on the disk, JFS2 will attempt to:

- Group disk blocks for related data and inodes close together.
- Distribute unrelated data throughout the aggregate.

Allocation group sizes

Allocation group sizes must be selected which yield allocation groups that are sufficiently large to provide for contiguous resource allocation over time. The rules for setting the allocation group size are shown in the preceding visual. The allocation group size is stored in the aggregate superblock.

Partial allocation group

An aggregate whose size is not a multiple of the allocation group size contains a partial allocation group that is not fully covered by disk blocks. This partial allocation group is treated as a complete allocation group, except that the non-existent disk blocks are marked as allocated in the Block Allocation Map.
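The sizing rules can be illustrated with a small calculation. This is a sketch under two stated assumptions: that one dmap page describes 8192 aggregate blocks (so the minimum group size equals one dmap page), and that the smallest legal size is chosen. The notes do not say which legal size JFS2 actually picks, and the function names are hypothetical.

```c
#include <assert.h>

#define MIN_AG_BLOCKS 8192LL  /* minimum allocation group size, in blocks */
#define MAX_AGS       128LL   /* maximum allocation groups per aggregate */

/* Smallest legal allocation group size (in aggregate blocks) that keeps
 * the number of groups, including a partial one, within MAX_AGS. */
long long ag_size(long long aggr_blocks)
{
    long long size = MIN_AG_BLOCKS;
    assert(aggr_blocks > 0);
    while ((aggr_blocks + size - 1) / size > MAX_AGS)
        size *= 2;   /* doubling keeps a power-of-2 multiple of the minimum */
    return size;
}

/* Number of allocation groups, counting a trailing partial group. */
long long ag_count(long long aggr_blocks)
{
    long long size = ag_size(aggr_blocks);
    return (aggr_blocks + size - 1) / size;
}
```

With 2,000,000 aggregate blocks, groups of 8192 blocks would need 245 groups, so the size doubles to 16384 blocks, giving 122 full groups and one partial group.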

Fileset

[Figure 9-6. Fileset: the fileset inode table (inodes 0 to 31 in the first inode extent, including fileset inode #2, the root directory), the first and second extents of the fileset inode allocation map (control page and IAGs, each with a control section, working map, persistent map and ixd section), the AG free inode list for fileset #0, and the IAG free list.]

Notes:

Introduction

A fileset is a set of files and directories that form an independently mountable sub-tree that is equivalent to a UNIX file system file hierarchy. A fileset is completely contained within a single aggregate. The visual illustration above and the table below detail the layout of a fileset.

Part / Function

Fileset inode table
  Contains inodes describing the fileset-wide control structures. The fileset inode table logically contains an array of inodes.

Fileset inode allocation map
  A fileset inode allocation map describes the fileset inode table. The fileset inode allocation map contains allocation state information on the fileset inodes, as well as their on-disk location.

Inodes
  Every JFS2 object is represented by an inode, which contains the expected object-specific information such as time stamps and file type (regular or directory). Inodes also "contain" a B+-tree to record the allocation of extents. Note that all JFS2 meta-data structures (except for the superblock) are represented as "files." By reusing the inode structure for this data, the data format (on-disk layout) becomes inherently extensible.

Inode Allocation Map

[Figure 9-7. Inode Allocation Map: the first fileset inode extent, inodes 0 to 31.]

Fileset Inode # / Description

0
  Reserved.

1
  Additional fileset information that would not fit in the fileset allocation map inode in the aggregate inode table.

2
  The root directory inode for the fileset.

3
  The ACL file for the fileset.

4 -
  Fileset inodes from four onwards are used by ordinary fileset objects: user files, directories, and symbolic links.

Notes:

Super inode

Super inodes, found in the aggregate inode table (#16 and greater), describe the fileset inode allocation map and other fileset information. Since the aggregate inode table is replicated, there is also a secondary version of this inode, which points to the same data.

Inodes

When the fileset is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. The inodes in a fileset are allocated as shown above in the visual. Every file and directory in a fileset is described by an on-disk inode.

Extents

/usr/include/j2/j2_xtree.h

    struct xad {
            uint8   xad_flag;
            uint16  xad_reserved;
            uint40  xad_offset;
            uint24  xad_length;
            uint40  xad_address;
    };

[Figure 9-8. Extents: the XADs for a file (flag, reserved, offset, len, addr), each pointing to a run of file system disk blocks. In the example the runs have len=3 at addr=101, len=4 at addr=503 and len=2 at addr=856.]

Notes:

Introduction

Disk space in a JFS2 file system is allocated in a sequence of contiguous aggregate blocks called an extent.

. Details of the xad data structure are shown in the visual on the previous page.1 aggregate blocks.Are variable in size and can range from 1 to 224. Extent allocation descriptor Extents are described in an xad structure (a 16 byte structure). both the length and address are expressed in units of the aggregate block size. The xad_offset. Enhanced Journaled File System 9-15 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. Extents are generally grouped together to form a larger group of disk blocks. . Member xad_flag xad_reserved Description Flags set on this extent.Large extents may span multiple allocation groups. containing the length of the extent in aggregate blocks. and its address.0. xad_offset xad_length xad_address © Copyright IBM Corp. and is the block offset from the beginning of the aggregate.Are indexed in a B+-tree.V2. The address is in units of aggregate blocks. describes the logical block offset this extent represents in the larger group. .Are wholly contained within a single aggregate . containing the address of the first block of the extent. In an xad.Is made up of a series contiguous aggregate blocks.0. See /usr/include/j2/j2_xtree. The two main values describing an extent are its length.1 aggregate blocks.h for a list of flags. A 24-bit field. . An extent can range in size from 1 to 224 . xad description The elements of the xad structure are described in this table. 2003 Unit 9.3 Student Notebook Uempty Extent rules An extent: . 2001. A 40-bit field. Reserved for future use.

Increasing an Allocation

[Figure 9-9. Increasing an Allocation: before and after views of file system disk blocks, in runs of 100 disk blocks. To maximize contiguous allocation, an xad with offset=0 len=100 addr=101 grows in place to offset=0 len=200 addr=101. Where the following blocks are not free, a second xad is added instead (offset=0 len=100 addr=701 followed by offset=100 len=100 addr=1001).]

Notes:

Introduction

In general, the allocation policy for JFS2 tries to maximize contiguous allocation by allocating a minimum number of extents, keeping each extent as large and contiguous as possible. This allows for larger I/O transfers, resulting in improved performance.
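The effect of this policy on a file's extent list can be sketched as follows. The struct is an unpacked stand-in for struct xad with plain 64-bit fields, and the function is hypothetical. The sketch ignores the 2^24 - 1 block limit on extent length and does not model how free blocks are found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpacked stand-in for struct xad (field widths are illustrative). */
struct ext { uint64_t offset, len, addr; };

/* Append len blocks located at disk address addr to a file's extent
 * list of n entries (capacity cap). Returns the new entry count. */
size_t append_blocks(struct ext *x, size_t n, size_t cap,
                     uint64_t addr, uint64_t len)
{
    assert(x != NULL && len > 0);
    if (n > 0 && x[n - 1].addr + x[n - 1].len == addr) {
        x[n - 1].len += len;       /* contiguous: grow the last extent */
        return n;
    }
    assert(n < cap);
    x[n].offset = n ? x[n - 1].offset + x[n - 1].len : 0;
    x[n].len = len;                /* not contiguous: start a new extent */
    x[n].addr = addr;
    return n + 1;
}
```

Growing the last extent keeps the file described by one descriptor, while a non-adjacent allocation costs an extra descriptor whose offset continues where the previous extent ended.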

Exceptions

In special cases, it is not always possible to keep extent allocation contiguous. For example, a copy-on-write clone of a segment will cause a contiguous extent to be partitioned into a sequence of smaller contiguous extents.

Another case is restriction of the extent size. For example, the extent size is restricted for compressed files, since the entire extent must be read into memory and decompressed. Only a limited amount of memory is available, so there must be enough room for the decompressed extent.

Fragmentation

The user can configure a JFS2 aggregate with a small aggregate block size of 512 bytes to minimize internal fragmentation for aggregates with large numbers of small files. The defragfs utility can be used to defragment a JFS2 file system.

Binary Tree of Extents

[Figure 9-10. Binary Tree of Extents: a root node (header flags=BT_ROOT) points to an internal node (header flags=BT_INTERNAL) and to leaf nodes (header flags=BT_LEAF). Each leaf node holds an array of extent descriptors (xad, xad, xad).]

Notes:

Introduction

Objects in JFS2 are stored in groups of extents arranged in binary trees. The concepts of binary trees are introduced in this section.

Trees

Binary trees consist of nodes arranged in a tree structure. Each node contains a header describing the node. A flag in the node header identifies the role of the node in the tree. As we will show in subsequent material, these headers reside in the second inode quadrant and in 4 KB blocks referenced by the inode.

Header flags

This table describes the binary tree header flags:

Flag / Description

BT_ROOT
  The root or top of the tree.

BT_LEAF
  The bottom of a branch of a tree. Leaf nodes point to the extents containing the object's data.

BT_INTERNAL
  An internal node points to two or more leaf nodes or other internal nodes.

Why B+-tree?

B+-trees are used in JFS2, and help performance by:

- Providing fast reading and writing of extents, the most common operations.
- Providing fast search for reading a particular extent of a file.
- Providing efficient append or insert of an extent in a file.
- Being efficient for traversal of an entire B+-tree.

B+-tree index

There is one generic B+-tree index structure for all index objects in JFS2 (except for directories). The data being indexed depends upon the object. The B+-tree is keyed by the offset of the xad structure of the data being described by the tree. The entries are sorted by the offsets of the xad structures, each of which is an entry in a node of a B+-tree.
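Because the entries in a node are sorted by offset, finding the extent that covers a given logical block is a binary search. The sketch below searches a single leaf node only. The struct is an unpacked stand-in for struct xad, and the function name and the sample extents used with it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpacked stand-in for struct xad (field widths are illustrative). */
struct xent { uint64_t offset, len, addr; };

/* Binary search the n entries of a leaf, sorted by offset, for the
 * extent covering logical block lbn. Returns the aggregate block that
 * holds it, or -1 if no extent covers lbn. */
int64_t lookup_block(const struct xent *x, size_t n, uint64_t lbn)
{
    size_t lo = 0, hi = n;
    assert(x != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (lbn < x[mid].offset)
            hi = mid;                             /* look left */
        else if (lbn >= x[mid].offset + x[mid].len)
            lo = mid + 1;                         /* look right */
        else
            return (int64_t)(x[mid].addr + (lbn - x[mid].offset));
    }
    return -1;
}
```

In a full tree the same comparison on offsets would be applied in each internal node to choose the child to descend into, so a lookup touches one node per level.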

Inodes

Inode Layout

Section 1   POSIX attributes
Section 2   Extended attributes, block allocation maps, inode allocation maps, headers describing the inode data
Section 3   In-line data or xads
Section 4   Extended attributes, or more in-line data, or additional xads

Figure 9-11. Inodes

Notes:

Overview

Every file on a JFS2 file system is described by an on-disk inode. The inode holds the root header for the extent binary tree. File attribute data and block allocation maps are also kept in the inode.

Inode layout

The inode is a 512 byte structure, split into four 128 byte sections. The sections of the inode are described in this table.

Section / Description

1.
  This section describes the POSIX attributes of the JFS2 object, including the inode and fileset number, object type, object size, user ID, group ID, access time, modified time, created time and more.

2.
  This section contains several parts:
  • Descriptors for extended attributes.
  • Block allocation maps.
  • Inode allocation maps.
  • Header pointing to the data (B+-tree root, directory, in-line data).

3.
  This section can contain one of the following:
  • In-line file data for very small files (up to 128 bytes).
  • The first eight xad structures describing the extents for this file.

4.
  This section extends section 3 by providing additional storage for more attributes, xad structures or in-line data.

Design vs. implementation

Currently section 3 is only used for extent information. The in-line data function of JFS2 is not currently enabled. Currently section 4 is not used.
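The arithmetic behind the layout (four 128 byte sections, eight 16 byte xad structures per section) can be checked with a stand-in structure. The struct below is not the real on-disk inode. Its member names are made up, and each section is treated as an opaque byte array.

```c
#include <assert.h>
#include <stddef.h>

#define INODE_SIZE    512   /* bytes per on-disk inode */
#define SECTION_SIZE  128   /* bytes per inode section */
#define XAD_SIZE      16    /* bytes per xad structure */

/* Four opaque 128-byte sections standing in for the real inode. */
struct inode_layout {
    unsigned char posix[SECTION_SIZE];      /* section 1 */
    unsigned char extension[SECTION_SIZE];  /* section 2 */
    unsigned char data[SECTION_SIZE];       /* section 3 */
    unsigned char data_ext[SECTION_SIZE];   /* section 4 */
};

/* Number of xad structures that fit in one section. */
size_t xads_per_section(void)
{
    assert(sizeof(struct inode_layout) == INODE_SIZE);
    return SECTION_SIZE / XAD_SIZE;
}
```

This is why section 3 holds exactly the first eight xad structures of a file, and why section 3 starts at byte offset 256 of the inode.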

Structure

The current definition of the on-disk inode structure is:

    struct dinode {
            /*
             * I. base area (128 bytes)
             * ------------------------
             *
             * define generic/POSIX attributes
             */
            ino64_t   di_number;    /* 8: inode number, aka file serial number */
            uint32    di_gen;       /* 4: inode generation number */
            uint32    di_fileset;   /* 4: fileset #, inode # of inode map file */
            uint32    di_inostamp;  /* 4: stamp to show inode belongs to fileset */
            uint32    di_rsv1;      /* 4: */

            pxd_t     di_ixpxd;     /* 8: inode extent descriptor */
            int64     di_size;      /* 8: size */
            int64     di_nblocks;   /* 8: number of blocks allocated */
            uint32    di_uid;       /* 4: uid_t user id of owner */
            uint32    di_gid;       /* 4: gid_t group id of owner */
            int32     di_nlink;     /* 4: number of links to the object */
            uint32    di_mode;      /* 4: mode_t attribute format and permission */

            j2time_t  di_atime;     /* 16: time last data accessed */
            j2time_t  di_ctime;     /* 16: time last status changed */
            j2time_t  di_mtime;     /* 16: time last data modified */
            j2time_t  di_otime;     /* 16: time created */

            /*
             * II. extension area (128 bytes)
             * ------------------------------
             */
            /*
             * extended attributes for file system (96).
             */
            ead_t     di_ea;        /* 16: ea descriptor */

            union {
                    uint8 _data[80];

                    /*
                     * block allocation map
                     */
                    struct {
                            struct bmap *__bmap;     /* incore bmap descriptor */
                    } _bmap;
    #define di_bmap         _data2._bmap.__bmap

                    /*
                     * inode allocation map (fileset inode 1st half)
                     */
                    struct {
                            uint32 _gengen;          /* di_gen generator */
                            struct inode *__ipimap2; /* replica */
                            struct inomap *__imap;   /* incore imap control */
                    } _imap;
            } _data2;
    #define di_gengen       _data2._imap._gengen
    #define di_ipimap2      _data2._imap.__ipimap2
    #define di_imap         _data2._imap.__imap

            /*
             * B+-tree root header (32)
             *
             * B+-tree root node header, or dtroot_t for directory,
             * or data extent descriptor for inline data;
             * N.B. must be on 8-byte boundary.
             */
            union {
                    struct {
                            int32 _di_rsrvd[4];  /* 16: */
                            dxd_t _di_dxd;       /* 16: data extent descriptor */
                    } _xd;
                    int32   _di_btroot[8];       /* 32: xtpage_t or dtroot_t */
                    ino64_t _di_parent;          /* 8: idotdot in dtroot_t */
            } _data2r;
    #define di_dxd          _data2r._xd._di_dxd
    #define di_btroot       _data2r._di_btroot
    #define di_dtroot       _data2r._di_btroot
    #define di_xtroot       _data2r._di_btroot
    #define di_parent       _data2r._di_parent

            /*
             * III. type-dependent area (128 bytes)
             * ------------------------------------
             *
             * B+-tree root node xad array or inline data
             */
            union {
                    uint8 _data[128];
    #define di_inlinedata   _data3._data

                    /*
                     * regular file or directory
                     *
                     * B+-tree root node/inline data area
                     */
                    struct {
                            uint8 _xad[128];
                    } _file;

                    /*
                     * device special file
                     */
                    struct {
                            dev64_t _rdev;       /* 8: dev_t device major and minor */
                    } _specfile;
    #define di_rdev         _data3._specfile._rdev

                    /*
                     * symbolic link.
                     *
                     * link is stored in inode if its length is less than
                     * IDATASIZE. Otherwise stored like a regular file.
                     */
                    struct {
                            uint8 _fastsymlink[128];
                    } _symlink;
    #define di_fastsymlink  _data3._symlink._fastsymlink
            } _data3;

            /*
             * IV. type-dependent extension area (128 bytes)
             * ---------------------------------------------
             *
             * user-defined attribute, or
             * inline data continuation, or
             * B+-tree root node continuation
             */
            union {
                    uint8 _data[128];
    #define di_inlineea     _data4._data
            } _data4;
    };
    typedef struct dinode dinode_t;

Allocation policy

JFS2 allocates inodes dynamically, which provides the following advantages:

- Allows placement of inode disk blocks at any disk address, which decouples the inode number from the location. This decoupling simplifies supporting aggregate and fileset reorganization (to enable shrinking the aggregate). The inodes can be moved and still retain the same number, which makes it unnecessary to search the directory structure to update the inode numbers.
- There is no need to allocate "ten times as many inodes as you will ever need," as with file systems that contain a fixed number of inodes; thus, file system space utilization is optimized. This is especially important with the larger inode size of 512 bytes in JFS2.
- File allocation for large files can consume multiple allocation groups and still be contiguous. Static allocation forces a gap containing the initially allocated inodes in each allocation group. With dynamic allocation, all the blocks contained in an allocation group can be used for data.

Dynamic inode allocation causes a number of problems, including:

- With static allocation, the geometry of the file system implicitly describes the layout of inodes on disk. With dynamic allocation, separate mapping structures are required.
- The inode mapping structures are critical to JFS2 integrity. Due to the overhead involved in replicating these structures we accept the risk of losing these maps. However, replicating the B+-tree structures allows us to find the maps.

Inode extents
Inodes are allocated dynamically by allocating inode extents that are simply a contiguous chunk of inodes on the disk. By definition, a JFS2 inode extent contains 32 inodes. With a 512 byte inode size, an inode extent occupies 16 KB on the disk.

Inode initialization
When a new inode extent is allocated, the inodes in the extent are initialized, i.e., their inode numbers and extent addresses are set, and the mode and link count fields are set to zero. Information about the inode extent is also added to the inode allocation map.

Inode allocation map
Dynamic inode allocation implies that there is no direct relationship between an inode number and the disk address of the inode. Therefore we must have a means of finding the inodes on disk. The inode allocation map provides this function.

Inode generation numbers
Inode generation numbers are simply counters that will increment each time an inode is reused. Network file system protocols, such as NFS, (implicitly) require them; they form part of the file identifier manipulated by VNOP_FID() and VFS_VGET().

The static inode allocation practice of storing a per-inode generation counter will not work with dynamic inode allocation, because when an inode becomes free its disk space may literally be reused for something other than an inode (e.g., the space may be reclaimed for ordinary file data storage). Therefore, in JFS2, there is simply one inode generation counter that is incremented on every inode allocation (rather than one counter per inode that would be incremented when that inode is reused).

Although a fileset-wide generation counter will recycle faster than a per-inode generation counter, a simple calculation shows that the 32-bit value is still sufficient to meet NFS or DFS requirements.
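The inode extent size quoted in the notes follows directly from the two constants given there. The sketch below only restates that arithmetic in C; the helper names (inode_extent_bytes, inode_extent_blocks) are made up for illustration and are not JFS2 kernel routines.

```c
#include <assert.h>
#include <stdint.h>

/* Values stated in the notes. */
enum {
    INODE_SIZE_BYTES  = 512, /* JFS2 on-disk inode size */
    INODES_PER_EXTENT = 32   /* inodes in one inode extent */
};

/* Size in bytes of one inode extent: 32 * 512 = 16384 (16 KB). */
static uint32_t inode_extent_bytes(void)
{
    return (uint32_t)INODES_PER_EXTENT * INODE_SIZE_BYTES;
}

/*
 * Hypothetical helper: number of aggregate blocks one inode extent
 * occupies for a given aggregate block size (in bytes).
 */
static uint32_t inode_extent_blocks(uint32_t agg_block_size)
{
    assert(agg_block_size != 0);
    assert(inode_extent_bytes() % agg_block_size == 0);
    return inode_extent_bytes() / agg_block_size;
}
```

With the 512-byte aggregate block size that fsdb reports later in this unit, one inode extent would span 32 aggregate blocks.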

Inline Data

(Diagram: an inode divided into Inode Info, Header for in-line data, and In-line data.)

Figure 9-12. Inline Data

Notes:

In-line data
If a file contains small amounts of data, the data may be stored in the inode itself. This is called in-line storage. The header found in the second section of the inode points to the data that is stored in the third and fourth section of the inode. This design feature has not been implemented yet.

Binary Trees

(Diagram: an inode holding Inode Info, the B+-tree header, and xad entries that point to data extents:
  offset: 0    addr: 68     length: 16   ->  16KB data at 68
  offset: 84   addr: 4096   length: 48   ->  48KB data at 4096
  offset: 256  addr: 26624  length: 48   ->  8KB data at 26624
The last section of the inode is labeled In-line data.)

Figure 9-13. Binary Trees

Notes:

Binary trees
When more storage is needed than can be provided in-line, the data must be placed in extents. The header in the inode now becomes the B+-tree root header. If there are 8 or fewer extents for the file, then the xad structures describing the extents are contained in the inode. An inode containing 8 or fewer xad structures would look like the figure shown above.

INLINEEA bit
Once the 8 xad structures in the inode are filled, an attempt is made to use the last quadrant of the inode for more xad structures. If the INLINEEA bit is set in the di_mode field of the inode, then the last quadrant of the inode is available for 8 more xad structures. This design feature has not been implemented yet.
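To make the role of the in-inode xad array concrete, here is a minimal sketch of translating a logical file offset to a disk address by scanning up to 8 extent descriptors. The struct xad_sketch type and the xad_lookup function are hypothetical simplifications, not the real xad_t layout or a JFS2 routine, and the sketch assumes that offset, addr, and length are all expressed in the same block units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent descriptor; not the real xad_t layout. */
struct xad_sketch {
    uint64_t offset; /* logical offset of the extent within the file */
    uint64_t addr;   /* disk address of the first block of the extent */
    uint32_t length; /* extent length */
};

/*
 * Translate a logical block offset to a disk block address by scanning
 * the xad array. Returns 1 and stores the address on success, or 0 if
 * no extent covers the offset (a hole, or past the end of the file).
 */
static int xad_lookup(const struct xad_sketch *xads, size_t count,
                      uint64_t file_block, uint64_t *disk_block)
{
    assert(xads != NULL && disk_block != NULL);
    for (size_t i = 0; i < count; i++) {
        const struct xad_sketch *x = &xads[i];
        if (file_block >= x->offset && file_block - x->offset < x->length) {
            *disk_block = x->addr + (file_block - x->offset);
            return 1;
        }
    }
    return 0;
}
```

Using the three descriptors from the figure, offset 90 falls in the second extent (which starts at offset 84, address 4096) and would map to address 4102, while offset 20 is covered by no extent.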

More Extents

(Diagram: the inode holds Inode Info, the B+-tree header, and 8 xad entries. The first xad entry is offset: 0, addr: 412, length: 4; the remaining entries are offset: 0, addr: 0, length: 0. The leaf node at 412 consists of a header and 254 xad leaf node entries, 8 of them in use, which point to the data extents at 68 (16KB), 4096 (48KB), and 26624 (8KB).)

Figure 9-14. More Extents

Notes:

More extents
Once all of the available xad structures in the inode are used, the B+-tree must be split. 4 KB of disk space is allocated for a leaf node of the B+-tree, which is logically an array of xad entries with a header. The 8 xad entries are moved from the inode to the leaf node, and the header is initialized to point to the 9th entry as the first free entry.

The first xad structure in the inode is updated to point to the newly allocated leaf node, and the inode header is updated to indicate that only one xad structure is now being used, and that it contains the pure root of a B+-tree. The offset for this new xad structure contains the offset of the first entry in the leaf node. The organization of the inode now looks like the figure above.
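The split described in these notes can be sketched in a few lines of C. This is an illustration under simplifying assumptions, not the JFS2 implementation: the types root_sketch and leaf_sketch and the function split_root are hypothetical, the leaf is modeled as an in-memory array rather than a 4 KB disk page, and the header is reduced to a single "next free entry" index.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROOT_XADS 8    /* xad slots in the inode, per the notes */
#define LEAF_XADS 254  /* xad entries in a 4 KB leaf node, per the figure */

/* Hypothetical, simplified extent descriptor. */
struct xad_sketch { uint64_t offset, addr; uint32_t length; };

/* Stands in for the B+-tree root held in the inode. */
struct root_sketch {
    int used;                          /* number of xad slots in use */
    struct xad_sketch xad[ROOT_XADS];
};

/* Stands in for a 4 KB leaf node. */
struct leaf_sketch {
    int next_free;                     /* index of the first free entry */
    struct xad_sketch xad[LEAF_XADS];
};

/*
 * Split a full root: move its 8 entries into a new leaf node located at
 * disk address leaf_addr (length leaf_len), then make the root's first
 * xad point at that leaf.
 */
static void split_root(struct root_sketch *root, struct leaf_sketch *leaf,
                       uint64_t leaf_addr, uint32_t leaf_len)
{
    uint64_t first_offset;

    assert(root->used == ROOT_XADS);

    /* Move the 8 entries to the leaf; the 9th entry is the first free one. */
    memset(leaf, 0, sizeof(*leaf));
    memcpy(leaf->xad, root->xad, sizeof(root->xad));
    leaf->next_free = ROOT_XADS;

    /* The root keeps one xad whose offset is that of the leaf's first entry. */
    first_offset = root->xad[0].offset;
    memset(root->xad, 0, sizeof(root->xad));
    root->xad[0].offset = first_offset;
    root->xad[0].addr   = leaf_addr;
    root->xad[0].length = leaf_len;
    root->used = 1;
}
```

Calling split_root with leaf_addr 412 and leaf_len 4 reproduces the state in the figure: one root xad pointing at 412, and a leaf whose first 8 entries are the former root entries.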

Continuing to Add Extents

(Diagram: the inode holds Inode Info, the B+-tree header, and 8 xad entries. The first two are offset: 0, addr: 412, length: 4 and offset: 750, addr: 560, length: 4; the remaining entries are offset: 0, addr: 0, length: 0. The leaf nodes at 412 and 560 each consist of a header and 254 xad leaf node entries, which point to data extents such as those at 68 (16KB), 4096 (48KB), and 26624 (8KB).)

Figure 9-15. Continuing to Add Extents

Notes:

Continuing to add extents
As new extents are added to the file, they continue to be added to the leaf node in the necessary order, until the node fills. Once the node fills, an additional 4 KB of disk space is allocated for another leaf node of the B+-tree, and the second xad structure from the inode is set to point to this newly allocated node. The node now looks like the figure shown above.

Another Split

(Diagram: the inode holds Inode Info, the B+-tree header, and 8 xad entries. The first two are offset: 0, addr: 380, length: 4 and offset: 8340, addr: 212, length: 4; the remaining entries are offset: 0, addr: 0, length: 0. The nodes at 380 and 212 are internal nodes, each a header plus 254 xad internal node entries. They point to leaf nodes such as those at 412 and 560, each a header plus 254 xad leaf node entries, which in turn point to data extents such as those at 68 (16KB), 4096 (48KB), and 26624 (8KB).)

Figure 9-16. Another Split

Notes:

Another split
As extents are added to the inode, this behavior continues until all 8 xad structures in the inode contain leaf node xad structures, at which time another split of the B+-tree will occur. This split creates an internal node of the B+-tree, which is used purely to route the searches of the tree. An internal node looks exactly like a leaf node.

4 KB of disk space is allocated for the internal node of the B+-tree, the 8 xads of the leaf nodes are moved from the inode to the newly created internal node, and the internal node header is initialized to point to the 9th entry as the first free entry. The root of the B+-tree is then updated by making the inode's first xad structure point to the newly allocated internal node, and the header in the inode is updated to indicate that only 1 xad structure is now being used for the B+-tree.

As extents continue to be added, additional leaf nodes are created to contain the xad structures for the extents, and these leaf nodes are added to the internal node. Once the first internal node is filled, a second internal node is allocated, and the inode's second xad structure is updated to point to the new internal node. This behavior continues until all eight of the inode's xad structures contain internal nodes.
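The figures in this topic (8 xad entries in the inode, 254 entries per 4 KB node) determine how many extents each tree shape can describe. The following sketch works through that arithmetic. It assumes a 16-byte xad (128 bytes divided by 8 entries) and a node header that occupies two xad-sized slots; both are inferences from the numbers in the notes, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    XAD_SIZE       = 16,   /* assumed: 128-byte inode area / 8 xad entries */
    NODE_SIZE      = 4096, /* size of a leaf or internal node */
    NODE_HDR_SLOTS = 2,    /* assumed: the header uses two xad-sized slots */
    ROOT_XADS      = 8     /* xad entries held in the inode */
};

/* xad entries available in one 4 KB node: 4096 / 16 - 2 = 254. */
static uint64_t xads_per_node(void)
{
    return NODE_SIZE / XAD_SIZE - NODE_HDR_SLOTS;
}

/*
 * Maximum number of extents that can be described when `levels` levels
 * of 4 KB nodes sit below the inode:
 *   0 = xads in the inode only, 1 = leaf nodes, 2 = internal + leaf nodes.
 */
static uint64_t max_extents(unsigned levels)
{
    uint64_t n = ROOT_XADS;
    while (levels-- > 0)
        n *= xads_per_node();
    return n;
}
```

Under these assumptions, a tree with only leaf nodes describes up to 2032 extents, and a tree with one layer of internal nodes describes up to 516128 extents.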

fsdb Utility

    # fsdb /dev/lv00
    Aggregate Block Size: 512
    >
    > help
    Xpeek Commands
    a[lter] <block> <offset> <hex string>
    b[tree] <block> [<offset>]
    dir[ectory] <inode number> [<fileset>]
    d[isplay] [<block> [<offset> [<format> [<count>]]]]
    dm[ap] [<block number>]
    dt[ree] <inode number> [<fileset>]
    h[elp] [<command>]
    ia[g] [<IAG number>] [a | <fileset>]
    i[node] [<inode number>] [a | <fileset>]
    q[uit]
    su[perblock] [p | s]

Figure 9-17. fsdb Utility

Notes:

Introduction
The fsdb command enables you to examine, alter, and debug a file system.

Starting fsdb
It is best to run fsdb against an unmounted file system. Use the following syntax to start fsdb:

    fsdb <path to logical volume>

For example:

    # fsdb /dev/lv00
    Aggregate Block Size: 512
    >

Supported file systems
fsdb supports both the JFS and JFS2 file systems. The commands available in fsdb differ depending on the file system type it is running against. The following explains how to use fsdb with a JFS2 file system.

Commands
The commands available in fsdb can be viewed with the help command as shown in the visual.

Exercise

Complete exercise seven
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a

What you will do:
Use the fsdb utility to examine a JFS2 file system.
Identify a file's inode number
Identify extent descriptors
Locate the data extents that hold the contents of a file

Figure 9-18. Exercise

Notes:
Turn to your lab workbook and complete exercise seven.

Directory

(Diagram: a directory entry with the fields inumber, next, namelen, and name[22].)

Figure 9-19. Directory

Notes:

Introduction
In addition to files, an inode can represent a directory. A directory is a journaled meta-data file in JFS2, and is composed of directory entries, which indicate the files and sub-directories contained in the directory.

Directory entry
Stored in an array, the directory entries link the names of the objects in the directory to an inode number. The directory entry is a 32 byte structure and has the members shown here.

    Member      Description
    inumber     Inode number.
    next        If more than 22 characters are needed, additional entries
                are linked using the next pointer.
    namelen     Length of the name.
    name[22]    File name, up to 22 characters.

Directory entry structure definition
Following is the structure definition for a directory entry. It is from /usr/include/j2/j2_dtree.h.

    /*
     * leaf node entry head/only segment
     */
    typedef struct {
        ino64_t inumber;        /* 8: 4-byte aligned */
        int8    next;           /* 1: */
        uint8   namlen;         /* 1: */
    #ifdef _J2_UNICODE
        UniChar name[11];       /* 22: 2-byte aligned */
    #else
        char    name[22];       /* 22: 2-byte aligned */
    #endif
    } ldtentry_t;               /* (32) */

Directory Root Header

    typedef union {
        struct {
            ino64_t idotdot;    /* 8: parent inode number */
            int64   rsrvd1;     /* 8: */
            uint8   flag;       /* 1: */
            int8    nextindex;  /* 1: next free entry in stbl */
            int8    freecnt;    /* 1: free count */
            int8    freelist;   /* 1: freelist header */
            int32   rsrvd2;     /* 4: */
            int8    stbl[8];    /* 8: sorted entry index table */
        } header;               /* (32) */
        dtslot_t slot[9];
    } dtroot_t;

Figure 9-20. Directory Root Header

Notes:

Root header
In order to improve the performance of locating a specific directory entry, a B+-tree sorted by name is used. As with files, the header section of a directory inode contains the B+-tree root header. Each header describes an eight element array of directory entries. The root header is a 32 byte structure defined by dtroot_t in /usr/include/j2/j2_dtree.h.

    Member      Description
    idotdot     Inode number of parent directory.
    flag        Indicates if the node is an internal or leaf node, and
                whether it is the root of the B+-tree.
    nextindex   The last used slot in the directory entry slot array.
    freecnt     The number of free slots in the directory entry array.
    freelist    The slot number of the head of the free list.
    stbl[8]     The indices to the directory entry slots that are currently
                in use. The entries are sorted alphabetically by name.
    slot[9]     The array of directory entries. There are eight entries.
                The header is stored in the first slot.

Leaf and internal node header
When more than eight directory entries are needed, a leaf or internal node is added. The directory internal and leaf node headers are similar to the root node header, except that they may have up to 128 directory entries (corresponding to a 4096 byte leaf page). The page header is defined by a dtpage_t structure, contained in /usr/include/j2/j2_dtree.h.

Directory Slot Array

(Diagram: a directory entry table with slots 1 to 8, where slot 1 holds def, slot 2 holds abc, slot 3 holds xyz, and slot 4 holds hij; and the sorted table STBL[8] = {2, 1, 4, 3, 0, 0, 0, 0}.)

Figure 9-21. Directory Slot Array

Notes:

Directory slot array
The directory slot array (stbl[]) is a sorted array of indices of the directory slots that are currently in use. The entries are sorted alphabetically by name. This limits the amount of shifting necessary when directory entries are added or deleted, since the array is much smaller than the entries themselves. A binary search can be used on this array to search for particular directory entries.

Example
In the example shown above, the directory entry table contains four files. The stbl table contains the slot numbers of the entries, ordering the entries alphabetically.
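The binary search mentioned in the notes can be sketched as follows. The names dslot_sketch and stbl_lookup are hypothetical; the sketch models each slot by its name only, treats slot 0 as the header slot, and compares names with strcmp, which is an assumption about the collation the directory code uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified directory slot: only the name is modeled. */
struct dslot_sketch {
    const char *name;
};

/*
 * Binary search over the sorted index table stbl[0..nextindex-1].
 * Each stbl element is a slot number; the slots themselves stay in
 * place, only the small index table is kept in name order.
 * Returns the slot number holding `name`, or 0 if it is not present
 * (slot 0 is the header slot, so it never names a file).
 */
static int stbl_lookup(const struct dslot_sketch *slots, const int8_t *stbl,
                       int nextindex, const char *name)
{
    int lo = 0;
    int hi = nextindex - 1;

    assert(slots != NULL && stbl != NULL && name != NULL);
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int slotno = stbl[mid];
        int cmp = strcmp(name, slots[slotno].name);

        if (cmp == 0)
            return slotno;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return 0;
}
```

With the table from the figure (slots def, abc, xyz, hij and stbl = {2, 1, 4, 3, ...}), looking up hij visits def first, then hij, and returns slot 4.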

. and .. directories
A directory does not contain specific entries for the self (".") and parent ("..") directories. Instead, these will be represented in the inode itself. Self is the directory's own inode number, and the parent inode number is held in the "idotdot" field in the header.

Growing directory size
As the number of files in the directory grows, the directory tables must increase in size. This table describes the steps used.

    Step  Action
    1.    Initial directory entries are stored in the directory inode
          in-line data area.
    2.    When the in-line data area of the directory inode becomes full,
          JFS2 allocates a leaf node the same size as the aggregate block
          size.
    3.    When that initial leaf node becomes full and the leaf node is not
          yet 4 KB, double the current size. First attempt to double the
          extent in place; if there is not room to do this, a new extent
          must be allocated, and the data from the old extent must be
          copied to the new extent. The directory slot array will only have
          been big enough to reference enough slots for the smaller page,
          so a new slot array will have to be created. Use the slots from
          the beginning of the newly allocated space for the larger array
          and copy the old array data to the new location. Update the
          header to point to this array and add the slots for the old array
          to the free list.
    4.    If the leaf node again becomes full and is still not 4 KB, repeat
          step 3. Once the leaf node reaches 4 KB, allocate a new leaf
          node. Every leaf node after the initial one will be allocated as
          4 KB to start.
    5.    When all entries are free in a leaf page, the page will be
          removed from the B+-tree. When all the entries in the last leaf
          page are deleted, the directory will shrink back into the
          directory inode in-line data area.
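Steps 2 through 4 describe a simple size progression for the first leaf node: it starts at the aggregate block size and doubles until it reaches 4 KB. The sketch below captures only that progression; the function names are hypothetical, and the copying of entries and rebuilding of the slot array described in step 3 are not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEAF_BYTES 4096u /* a leaf node never grows beyond 4 KB */

/*
 * Size of the initial leaf node after one more growth step.
 * The size doubles until it reaches 4 KB; once it is 4 KB the caller
 * is expected to allocate an additional 4 KB leaf instead (step 4).
 */
static uint32_t grow_first_leaf(uint32_t current_bytes)
{
    uint32_t doubled;

    assert(current_bytes > 0 && current_bytes <= MAX_LEAF_BYTES);
    if (current_bytes >= MAX_LEAF_BYTES)
        return MAX_LEAF_BYTES;
    doubled = current_bytes * 2;
    return doubled > MAX_LEAF_BYTES ? MAX_LEAF_BYTES : doubled;
}

/* Number of doublings needed to go from the aggregate block size to 4 KB. */
static unsigned doublings_to_full(uint32_t agg_block_size)
{
    unsigned n = 0;
    uint32_t size = agg_block_size;

    while (size < MAX_LEAF_BYTES) {
        size = grow_first_leaf(size);
        n++;
    }
    return n;
}
```

For a 512-byte aggregate block size the first leaf would pass through 512, 1024, 2048, and 4096 bytes, that is, three doublings.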

Small Directory Example

    # ls -ai
    69651 .
        2 ..
    69652 foobar1
    69653 foobar12
    69654 foobar3
    69655 longnamedfilewithover22charsinitsname

    flag: BT_ROOT BT_LEAF
    nextindex: 4
    freecnt: 3
    freelist: 6
    idotdot: 2
    stbl: {1, 2, 3, 4, 0, 0, 0, 0}

    Slot  Contents
    1     inumber: 69652  next: -1  namelen: 7   name: foobar1
    2     inumber: 69653  next: -1  namelen: 8   name: foobar12
    3     inumber: 69654  next: -1  namelen: 7   name: foobar2
    4     inumber: 69655  next: 5   namelen: 37  name: longnamedfilewithover2
    5     next: -1  cnt: 0  name: 2charsinitsname

Figure 9-22. Small Directory Example

Notes:

Introduction
This section demonstrates how the directory structures change over time.

Small directories
Initial directory entries are stored in the directory inode in-line data area. Examine the example of a small directory. In the example shown above, all the inode information fits into the in-line data area.

Note: the file with a long name has its name split across two slots.
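The long file name in this example is stored as a 22-character head segment in slot 4 and a continuation in slot 5, linked by the next field. A minimal sketch of rebuilding the full name by following that chain is shown here. The nslot_sketch type and the assemble_name function are hypothetical: real slots store the bytes inline and use different layouts for head and continuation segments, whereas this sketch uses one simplified type with a C string per slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HEAD_NAME_MAX 22 /* name bytes held in the head entry, per ldtentry_t */

/* Hypothetical, simplified slot used for both head and continuation. */
struct nslot_sketch {
    int8_t      next;  /* next slot index, or -1 at the end of the chain */
    const char *chunk; /* name bytes stored in this slot */
};

/*
 * Rebuild a name of length namelen, starting at slot `head` and
 * following the next links. Returns the number of bytes copied, or -1
 * if the output buffer is too small or the chain ends before namelen
 * bytes have been collected.
 */
static int assemble_name(const struct nslot_sketch *slots, int head,
                         int namelen, char *out, size_t outsz)
{
    int copied = 0;
    int s;

    if (namelen < 0 || (size_t)namelen + 1 > outsz)
        return -1;
    for (s = head; s != -1 && copied < namelen; s = slots[s].next) {
        int n = (int)strlen(slots[s].chunk);

        if (n > namelen - copied)
            n = namelen - copied;
        memcpy(out + copied, slots[s].chunk, (size_t)n);
        copied += n;
    }
    out[copied] = '\0';
    return copied == namelen ? copied : -1;
}
```

Starting at slot 4 with namelen 37, the sketch copies the 22 bytes of the head segment, follows next to slot 5, and appends the remaining 15 bytes.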

Adding a File

    # ls -ai
    69651 .
        2 ..
    69656 afile
    69652 foobar1
    69653 foobar2
    69654 foobar3
    69655 longnamedfilewithover22charsinitsname

    flag: BT_ROOT BT_LEAF
    nextindex: 5
    freecnt: 2
    freelist: 7
    idotdot: 2
    stbl: {6, 1, 2, 3, 4, 0, 0, 0}

    Slot  Contents
    1     inumber: 69652  next: -1  namelen: 7   name: foobar1
    2     inumber: 69653  next: -1  namelen: 8   name: foobar12
    3     inumber: 69654  next: -1  namelen: 7   name: foobar2
    4     inumber: 69655  next: 5   namelen: 37  name: longnamedfilewithover2
    5     next: -1  cnt: 0  name: 2charsinitsname
    6     inumber: 69656  next: -1  namelen: 5   name: afile

Figure 9-23. Adding a File

Notes:

Adding a file
An additional file called "afile" is created. Details for this file are added at the next free slot (slot 6). As this is now, alphabetically, the first file in the directory, the search table array (stbl[]) is re-organized, so that the entry in slot 6 is now in the first entry.
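The re-organization of stbl[] when "afile" is added amounts to an insertion into a small sorted array. A minimal sketch follows, assuming names compare with strcmp; the function stbl_insert and the names[] lookup table indexed by slot number are hypothetical conveniences, not JFS2 interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STBL_MAX 8 /* entries in the root's sorted index table */

/*
 * Insert slot number new_slot into the sorted index table so that the
 * names it references stay in alphabetical order. names[] is indexed
 * by slot number. Only the one-byte indices are shifted; the 32-byte
 * directory slots stay where they are.
 * Returns the new nextindex, or -1 if the table is already full.
 */
static int stbl_insert(int8_t *stbl, int nextindex,
                       const char *const *names, int8_t new_slot)
{
    int pos = 0;

    if (nextindex < 0 || nextindex >= STBL_MAX)
        return -1;

    /* Find the first entry whose name does not sort before the new name. */
    while (pos < nextindex &&
           strcmp(names[stbl[pos]], names[new_slot]) < 0)
        pos++;

    /* Shift the tail up by one index and store the new slot number. */
    memmove(&stbl[pos + 1], &stbl[pos],
            (size_t)(nextindex - pos) * sizeof(stbl[0]));
    stbl[pos] = new_slot;
    return nextindex + 1;
}
```

Applied to the example, inserting slot 6 ("afile") into {1, 2, 3, 4} yields {6, 1, 2, 3, 4} and nextindex becomes 5, which matches the header shown in the figure.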

Adding a Leaf Node

(Diagram summary.
Inode, directory root: flag: BT_ROOT BT_INTERNAL, nextindex: 1, freecnt: 7, freelist: 2, idotdot: 2, stbl: {1, 2, 3, 4, 5, 6, 7, 8}. Slot 1 holds xd.len: 1, xd.addr1: 0, xd.addr2: 52, next: -1, namelen: 0, name: file0.
Block 52, leaf node: flag: BT_LEAF, nextindex: 20, freecnt: 103, freelist: 25, maxslot: 128, stbl: {1, 2, ...}. Slot 1 holds inumber: 5, next: -1, namelen: 5, name: file0; slot 2 holds inumber: 6, next: -1, namelen: 5, name: file1; slot 3 holds inumber: 15, next: -1, namelen: 6, name: file10; slot 19 holds inumber: 23, next: -1, namelen: 6, name: file18; slot 20 holds inumber: 24, next: -1, namelen: 6, name: file19.)

Figure 9-24. Adding a Leaf Node

Notes:

Adding a leaf node
When the directory grows to the point where there are more entries than can be stored in the in-line data area of the inode, JFS2 allocates a leaf node the same size as the aggregate block size. The in-line entries are moved to a leaf node, as illustrated above.

Once the leaf is full, an internal node is added at the next free in-line data slot in the inode, which will contain the address of the next leaf node.

Note: the internal node entry contains the name of the first file (in alphabetical order) for that leaf node.

Adding an Internal Node

(Diagram summary, three levels.
Inode, directory root: flag: BT_ROOT BT_INTERNAL, nextindex: 4, freecnt: 4, freelist: 5, idotdot: 2. Its entries reference second-level internal nodes, for example xd.addr2: 118 with name: file0, xd.addr2: 1991 with namelen: 9 and name: file13833, and xd.addr2: 1204 with namelen: 8 and name: file4845.
Block 118, internal node: flag: BT_INTERNAL, nextindex: 64, freecnt: 59, freelist: 76, maxslot: 128. Its entries reference leaf nodes, for example xd.addr2: 52 with namelen: 0 and name: file0.
Block 52, leaf node: flag: BT_LEAF, nextindex: 64, freecnt: 59, freelist: 21, maxslot: 128. Its entries include inumber: 5 (file0), inumber: 6 (file1), inumber: 15 (file10), inumber: 10057 (file10052), and inumber: 10041 (file10036).)

Figure 9-25. Adding an Internal Node

Notes:

Adding an internal node
Once all the in-line slots have been filled by internal nodes, a separate node block is allocated, the entries from the in-line data slots are moved to this new node, and the first in-line data slot is updated with the address of the new internal node.

After many extra files have been added to the directory, two layers of internal nodes are required to reference all the files.

Note: the internal node entries in the inode now contain the name of the alphabetically first entry referenced by each of the second-level internal nodes, and each entry in these references the name of the alphabetically first entry in each leaf node.

Checkpoint

• There is ____ aggregate per logical volume.
• An allocation group is at least ____ aggregate blocks.
• The number of inodes in a JFS2 file system is fixed. True or False?
• The data contents of a file are stored in objects called _____.
• A single extent can be up to ____ in size.
• A JFS2 directory contains directory entries for the . and .. directories. True or False?

Figure 9-26. Checkpoint

Notes:

Exercise

Complete exercise eight
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a

What you will do:
Use fsdb to examine the structures of directories in a JFS2 file system

Figure 9-27. Exercise

Notes:
Turn to your lab workbook and complete exercise eight.

Unit Summary

• Aggregate is a pool of space allocated to filesets
• A fileset is a mountable file system
• The contents of files and directories are stored in extents
• Extents are arranged in B+ trees for fast file and directory traversal

Figure 9-28. Unit Summary

Notes:


Unit 10. Kernel Extensions

What This Unit Is About
This unit describes how the AIX 5L kernel is dynamically extended.

What You Should Be Able to Do
After completing this unit, you should be able to:
• List the 3 uses for kernel extensions
• Build a kernel extension from scratch
• Compose an export file
• Create an extended system call

How You Will Check Your Progress
Accountability:
• Exercises using your lab system

References
AIX Documentation: Kernel Extensions and Device Support Programming Concepts
AIX Documentation: Technical Reference: Kernel and Subsystems, Volume 1
AIX Documentation: Technical Reference: Kernel and Subsystems, Volume 2

Unit Objectives

At the end of this lesson you should be able to:
• List the 3 uses for kernel extensions
• Build a kernel extension from scratch
• Compose an export file
• Create an extended system call

Figure 10-1. Unit Objectives

Notes:

Kernel Extensions

Kernel extensions can include:
• Device drivers
• System calls
• Virtual file systems
• Kernel processes
• Other device driver management routines
Kernel extensions run within the protection domain of the kernel
Extensions can be loaded into the kernel during:
• system boot
• runtime
Extensions can be removed at runtime

Figure 10-2. Kernel Extensions

Notes:

Introduction
The AIX kernel is dynamically extensible and can be extended by adding additional routines called kernel extensions. A kernel extension could best be described as a dynamically loadable module that adds functionality to the kernel. Kernel extensions add extensibility, configurability, and ease of system administration to AIX.

Kernel protection domain
These modules are extensions to the kernel in the sense that they run within the protection domain of the kernel. User-level code can only access kernel extensions through the system call interface.

Loading extensions
Extensions can be added at system boot or while the system is in operation. Extensions are loaded and removed from the running kernel using the sysconfig() system call.

Advantages
Allowing kernel extensions to be loaded and unloaded allows a system administrator to customize a system for particular environments and applications. Rather than bundling all possible options into the kernel at compile time (and creating a large kernel), kernel extensions allow maximum flexibility. The option of loading and unloading kernel extensions at runtime increases system availability and ease of use. In addition, development time is reduced since a new kernel does not have to be compiled and installed for each development cycle.

Disadvantages
Importing new code into the kernel allows the possibility of an unlimited number of runtime errors to be introduced into the system. Such issues as execution environment, path length, pageability, and serialization must be taken into account when writing extensions to the kernel.

Relationship With the Kernel Nucleus

(Diagram labels: Commands, System Calls, Kernel Protection Boundary, System Call Interface, Virtual File System, Device Drivers, Extended System Calls, Private routines, Extended Kernel Mode Experts, Extended Kernel Services, Nucleus Kernel Services.)

Figure 10-1. Relationship With the Kernel Nucleus

Notes:

Kernel Components
The schematic drawing above illustrates the relationship of kernel extensions to the kernel.

Global Kernel Name Space

(Diagram: core kernel services (/unix) export symbols into the global kernel name space, which is described by /usr/lib/kernex.exp. Kernel extensions, comprising device drivers, extended system calls, extended kernel services, and other kernel extensions, import symbols from the global kernel name space and can export their own symbols into it.)

Figure 10-2. Global Kernel Name Space

Notes:

Introduction
This section describes how symbol names are shared between the kernel and kernel extensions.

Name space
The kernel contains many functions and storage locations that are represented by symbols. The set of symbols used by the kernel makes up the kernel's name space. Some of these symbols are private to the parts of the kernel that use them. Some of these symbols are made available for other parts of the kernel and kernel extensions to use.

Exported symbols
The kernel makes symbols available for kernel extensions by exporting them. Extensions can make symbols they define visible to other extensions by exporting these symbols. If a kernel extension or other program wants to reference these symbols, it must import them.

Kernel exports file
The purpose of the kernel exports file is to list the symbols exported by the kernel. The kernel export file is /usr/lib/kernex.exp. The kernel exports file is imported by the kernel extension when the linker command (ld) is run. The linker uses the kernel export file to resolve the kernel symbols used by the kernel extension code.

Export file
The kernel export file has the following format:

    #!/unix
    * list of kernel exports
    devswadd
    devswchg
    devswdel
    devswqry
    devwrite
    e_assert_wait
    e_block_thread
    e_clear_wait

Exports file format
The first line of the kernel export file indicates the binary where the symbols are being exported from. In the case of the kernel exports file, they are exported from the /unix binary. The remainder of the file lists the symbols that are exported.

System calls
There is an additional file that lists the system calls that are exported from the kernel (/usr/lib/syscalls.exp).

The format of the syscalls.exp file is similar to the format of the kernel exports file, except for an additional tag for each system call. This tag indicates the ability of the system call to interact with 64-bit processes. Here is a fragment of the file syscalls.exp and a description of the tags.

    absinterval     syscall3264
    access          syscall3264
    accessx         syscall3264
    acct            syscall3264
    adjtime         syscall3264
    .
    .
    .

    Tag           Description
    syscall       This system call does not pass any arguments by reference
                  (address).
    syscall32     This system call is a 32-bit system call and passes
                  32-bit addresses.
    syscall64     This system call is only available in the 64-bit kernel.
    syscall3264   This system call supports both 32-bit and 64-bit
                  applications.
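The export file layout described here is line oriented: a first line beginning with #! names the exporting binary, and each later line carries a symbol and, in syscalls.exp, a tag. The sketch below classifies one such line in C. It is an illustration only, not how ld reads these files; the type and function names are hypothetical, and treating a line that begins with * as a comment is an assumption based on the "* list of kernel exports" line in the kernex.exp fragment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum exp_line_kind { EXP_BLANK, EXP_COMMENT, EXP_BINARY_PATH, EXP_SYMBOL };

struct exp_line {
    enum exp_line_kind kind;
    char name[128]; /* symbol name, or the binary path for a "#!" line */
    char tag[32];   /* optional tag such as "syscall3264"; empty if absent */
};

/* Classify one line of an export file and pull out its fields. */
static struct exp_line parse_exp_line(const char *line)
{
    struct exp_line out;

    assert(line != NULL);
    memset(&out, 0, sizeof(out));
    out.kind = EXP_BLANK;

    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '\0' || *line == '\n')
        return out;

    if (line[0] == '#' && line[1] == '!') {
        /* First-line form: the binary the symbols are exported from. */
        out.kind = EXP_BINARY_PATH;
        (void)sscanf(line + 2, "%127s", out.name);
        return out;
    }
    if (*line == '*') {
        /* Assumed comment form. */
        out.kind = EXP_COMMENT;
        return out;
    }

    /* Symbol line: a name, optionally followed by a tag. */
    out.kind = EXP_SYMBOL;
    (void)sscanf(line, "%127s %31s", out.name, out.tag);
    return out;
}
```

Feeding it the line "absinterval syscall3264" would yield a symbol entry named absinterval with the tag syscall3264, while "devswadd" yields a symbol entry with an empty tag.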

Why Export Symbols?

- To make symbols available for use by other extensions
- To share private symbols between extensions
- To define extended system calls to programs that will call them

Figure 10-3. Why Export Symbols?

Notes: Introduction
Kernel extensions can export symbols that are defined by the extension, which makes these symbols available for reference outside the kernel extension. All symbols within a kernel extension remain private by default. This means that other kernel extensions cannot use the routines and variables within the extension. This default action can be changed by creating an export file for the extension.

Symbols are exported by creating an export file. The export file lists the symbols you want to export from the kernel extension. The format of the exports file is identical to the format of the imports file. An exports file for a kernel extension is used as an import file by other kernel extensions that wish to use the symbols it exports. Any symbols which are exported by a kernel extension are automatically added to the kernel global name space when the module is explicitly loaded.

Using private routines
Kernel extensions can also consist of several separately link-edited object files that are bound at load time. Load-time binding is useful where several kernel extensions use common routines provided in a separate object file. The export file for the object file providing the services should specify the directory path to the object file as the first line in the exports file. For example, the first line of the export file should be:
#!/usr/lib/drivers/pci/scsi_ddpin
The filename specified should be where the file will be installed when the kernel extension is loaded into the kernel. For object files that reference each other's symbols, each file should use the other's export file as an import file during link-edit.

Extended system calls
When a kernel extension creates a new system call, an export file must be created containing the symbol name of the new system call. This was explained earlier. Examples of these files are shown here:
#!/unix
sys_call_name syscall
Note that the above will only work if "sys_call_name" has no parameters. If the system call has parameters a different "tag" value such as syscall3264 must be used.

Kernel Libraries

libcsys.a
a64l     l64a      memmove   strchr    strncat   strspn
atoi     memccpy   memset    strcmp    strncmp   strstr
bcmp     memchr    ovbcopy   strcpy    strncpy   strtok
bcopy    memcmp    remque    strcspn   strpbrk
bzero    memcpy    strcat    strlen    strrchr

libsys.a
d_align        newstack       xdump
d_roundup      secs_to_date
date_to_jul    timeout
date_to_secs   timeoutcf
untimeout

Figure 10-4. Kernel Libraries

Notes: Introduction
Normal C applications are linked with the C library, libc.a, which provides a set of useful programming routines. The C library for application programs is a shared object. It is not possible to access this user-level library from within the kernel protection domain. For this reason, kernel extensions should not be linked with the normal C library. Instead, the kernel extension may link with the libraries libcsys.a and libsys.a. These are static libraries (ar format library with static .o files), and contain special kernel safe versions of some useful routines such as atoi() and strlen() that are normally found in the regular C library. Note that the routines provided by libcsys.a are only a very small subset of those provided in the normal C library.

Kernel libraries
Libraries available to kernel extensions are shown in the visual on the previous page.

Reference
Additional information on the libcsys.a and libsys.a libraries is available in the AIX online documentation.

Configuration Routines

Kernel extension
int module_entry (cmd, uiop)
int cmd;
struct uio *uiop;

Device Driver
int dd_entry (dev, cmd, uiop)
dev_t dev;
int cmd;
struct uio *uiop;

Value of cmd   Description
CFG_INIT       Initialize
CFG_TERM       Terminate

Figure 10-5. Configuration Routines

Notes: Introduction
Unlike a normal user-level C language application, a kernel extension does not have a routine called main. Instead it has a configuration routine and one or more entry points. These routines can have any name, and are automatically exported to the global name space. In order to avoid conflicts in the kernel name space, it is normally best to prepend the names of exported symbols with something that indicates the extension which defines the symbol. For example, the symbol nfs_config is the entry point routine for the NFS kernel extension.

Configuration routine
An extension configuration routine is typically executed shortly after loading the extension. When linking the extension the configuration routine is specified with the -e option of the ld command. The format of the configuration routine is shown in the visual on the previous page. The value of cmd depends on the operation the configuration method is being requested to perform. The uio structure is used to pass arguments from the configuration method. See the later section on sysconfig() for details.

Entry points
Kernel extensions typically define one or more entry points. These are routines that could be called as a result of a system call or other action that invokes the kernel extension.

Compiling and Linking Kernel Extensions

Compile
cc -q64 -c ext.c -o ext64.o -D__64BIT_KERNEL -D_KERNEL -D_KERNSYS

Link
ld -b64 -o ext64 ext64.o -e init_routine -bE:extension.exp \
    -bI:/usr/lib/kernex.exp -lsys -lcsys

Figure 10-6. Compiling and Linking Kernel Extensions

Notes: Introduction
Compiling and linking a kernel extension must be split into two phases:
1) Compile each source file to create an object file.
2) Link the required object files to create the extension binary.

Compiler command
A number of different commands can be used to invoke the compiler on AIX. The commands call the same compiler core with a different set of options. In general, kernel code should be compiled with either the cc or xlc commands.

Conditional compiler values
One of the main requirements of the compile stage is that the appropriate conditional compile values are used to select the correct code sections. The compiler automatically defines a conditional compile variable to indicate which platform the code is being compiled on. Other conditional compile values should be used to ensure that the correct sections of system-provided header files are used for the environment (32-bit or 64-bit kernel) for which the extension is being built. Some conditional compile values will vary from extension to extension, and are decided by the developer. Additional values should be chosen appropriately.

Value            Meaning
_POWER_MP        Code is being compiled for a multiprocessor machine.
_KERNSYS         Enable kernel symbols in header files. This value should always be used.
_KERNEL          Compiling kernel extension or device driver code. This value should always be used.
__64BIT_KERNEL   Code is being compiled for a 64-bit kernel. This value should always be used for 64-bit kernel extensions and device drivers.
__64BIT__        Code is being compiled in 64-bit mode. This value is automatically defined by the compiler if the -q64 option is specified.

Compiler options
The default mode for the compiler is 32-bit. In order to compile 64-bit code, the -q64 option should be used. Other compiler options may be used to generate additional information about the source files being compiled.

Compiler option     Meaning
-q64                Generate 64-bit object files. (-q32 is the default)
-qlist              Produce an object listing, output goes to .lst file.
-qsource            Produce a source listing, output goes to .lst file.
-c                  Do not send files to the linkage editor.
-D<name>[=<def>]    Define <name> as in #define directive. If <def> is not specified, 1 is assumed.
-M                  Generate information to be included in a "make" description file.
-O                  Generate optimized code.
-S                  Produce a .s output file (assembler source)
-v                  Displays language processing commands as they are invoked by the compiler, output goes to stdout.
-qcpluscmt          Allow C++ style comments //
-qwarn64            Enables checking for possible long-to-integer or pointer-to-integer truncation.

Linking
Once you have created all of the object files, use the linker (ld) to create the kernel extension binary. Some linker options will always be used when creating the binary, some are optional, and some are platform dependent.

Linker option   Meaning
-b64            Generate a 64-bit executable
-b32            Generate a 32-bit executable
-eLabel         Set the entry point of the executable to Label.
-lcsys -lsys    Link the libcsys.a and libsys.a libraries with the kernel extension.
-oName          Names the output file Name.
-bE:FileID      Exports the external symbols listed in the file FileID.
-bI:FileID      Imports the symbols listed in FileID.

The general format of the linker command is:
ld -e entry_point [import files] [export files] \
    -o output_file object1.o object2.o -lcsys -lsys
The order of arguments is not important.

How to Build a Dual Binary Extension

Step   Action
1      Compile a 32-bit object file using the -q32 compiler option.
       cc -q32 -o ext32.o -c ext.c -D_KERNEL -D_KERNSYS
2      Link a 32-bit module file using the -b32 linker option.
       ld -b32 -o ext32 ext32.o -e ext_init \
           -bI:/usr/lib/kernex.exp -lcsys
3      Build a 64-bit object file from the same source file as step 1.
       cc -q64 -o ext64.o -c ext.c -D_KERNEL -D_KERNSYS \
           -D__64BIT_KERNEL
4      Link a 64-bit module file using the -b64 linker option.
       ld -b64 -o ext64 ext64.o -e ext_init \
           -bI:/usr/lib/kernex.exp -lcsys
5      Create an archive of both 32- and 64-bit extensions
       ar -X32_64 -r -v ext ext32 ext64

Figure 10-7. How to Build a Dual Binary Extension

Notes: Introduction
Machines with 64-bit hardware can run either the 32-bit kernel or the 64-bit kernel. A kernel extension must be of the same binary type as the kernel. A kernel extension that supports both 32-bit and 64-bit kernels is packaged as an ar format archive library. The library contains both the 32-bit and 64-bit binary versions of the kernel extension. When the extension is loaded, if the kernel detects that the file is an ar format library, it will load the appropriate binary for the type of kernel. For example, a 64-bit kernel will extract the 64-bit binary from the library.

Creating a dual binary extension
The table/visual on the previous page describes the steps to building a dual binary kernel extension.
Note: The name of the library file does not need to be of the libnnn.a format.

Loading Extensions

sysconfig() system call can be used to:
- Load kernel extensions
- Unload kernel extensions
- Invoke the extension's entry point
- Query the kernel to determine if an extension is loaded

loadext() library routine can be used to:
- Load kernel extensions
- Unload kernel extensions
- Query the kernel to determine if an extension is loaded

Figure 10-8. Loading Extensions

Notes: Introduction
A user-level program called a Configuration Method is used to load a kernel extension into the kernel. The program is normally a 32-bit executable, even on systems running the 64-bit kernel.

sysconfig() and loadext()
There are two routines available for loading the extension into the kernel as shown in the visual above.

sysconfig() - Loading and Unloading

sysconfig ( Cmd, &cfg_load, sizeof(cfg_load) )

struct cfg_load {
    caddr_t path;      /* ptr to object module pathname */
    caddr_t libpath;   /* ptr to a substitute libpath */
    mid_t   kmid;      /* kernel module id (returned) */
};

Cmd Value        Description
SYS_KLOAD        Loads a kernel extension object file into kernel memory.
SYS_SINGLELOAD   Loads a kernel extension object file only if it is not already loaded.
SYS_QUERYLOAD    Determines if a specified kernel object file is loaded.
SYS_KUNLOAD      Unloads a previously loaded kernel object file.

Figure 10-9. sysconfig() - Loading and Unloading

Notes: Loading, unloading and querying
When loading, unloading or querying, the sysconfig() subroutine is passed a pointer to a cfg_load structure (defined in <sys/sysconfig.h>) and one of the commands shown in this table. The caller provides the path value, and the sysconfig routine returns the kmid. The libpath is optional. When unloading, the caller specifies the kmid, and the path and libpath are ignored.

sysconfig() - Configuration

sysconfig(SYS_CFGKMOD, &cfg_kmod, sizeof(cfg_kmod) )

struct cfg_kmod {
    mid_t   kmid;     /* module ID of module to call */
    int     cmd;      /* command parameter for module */
    caddr_t mdiptr;   /* pointer to module dependent info */
    int     mdilen;   /* length of module dependent info */
};

Figure 10-10. sysconfig() - Configuration

Notes: Calling the entry point
Once the kernel extension has been loaded into the kernel, the next step is to call the entry point or configuration routine. For all extensions other than device drivers, a pointer to a cfg_kmod structure and the SYS_CFGKMOD command is passed to sysconfig(). The cfg_kmod structure is used with the SYS_CFGKMOD command to call the entry point of a kernel extension.

sysconfig() - Device Driver Configuration

sysconfig(SYS_CFGDD, &cfg_dd, sizeof(cfg_dd) )

struct cfg_dd {
    mid_t   kmid;     /* module ID of device driver */
    dev_t   devno;    /* device major/minor number */
    int     cmd;      /* config command code for device */
    caddr_t ddsptr;   /* pointer to DD structure */
    int     ddslen;   /* length of DD structure */
};

Figure 10-11. sysconfig() - Device Driver Configuration

Notes: Device driver entry point
The cfg_dd structure is used with the SYS_CFGDD command to the sysconfig() routine to call the entry point of a device driver.

Entry point options
A number of commands can be passed to the entry point of a kernel extension in the cmd parameter of the cfg_dd or cfg_kmod structure passed to sysconfig(). Values are defined in <sys/device.h> as follows:

Value       Meaning
CFG_INIT    Initialize the extension
CFG_TERM    Terminate the extension
CFG_QVPD    Query of vital product data

Value       Meaning
CFG_UCODE   Download of microcode

sysconfig() commands
This table provides a complete list of commands for the sysconfig() system call:

Cmd Value        Result
SYS_KLOAD        Loads a kernel extension object file into kernel memory.
SYS_SINGLELOAD   Loads a kernel extension object file only if it is not already loaded.
SYS_QUERYLOAD    Determines if a specified kernel object file is loaded.
SYS_KUNLOAD      Unloads a previously loaded kernel object file.
SYS_QDVSW        Checks the status of a device switch entry in the device switch table.
SYS_CFGDD        Calls the specified device driver configuration routine (module entry point).
SYS_CFGKMOD      Calls the specified module at its module entry point for configuration purposes.
SYS_GETPARMS     Returns a structure containing the current values of run-time system parameters found in the var structure.
SYS_SETPARMS     Sets run-time system parameters from a caller-provided structure.
SYS_64BIT        When running on the 32-bit kernel, this flag can be bit-wise OR'ed with the cmd parameter (if the cmd parameter is SYS_KLOAD or SYS_SINGLELOAD). For kernel extensions, this indicates that the kernel extension does not export 64-bit system calls, but that all 32-bit system calls also work for 64-bit applications. For device drivers, this indicates that the device driver can be used by 64-bit applications.

The loadext() Routine

The loadext() routine is defined as follows:

#include <sys/types.h>
mid_t loadext (dd_name, load, query)
char *dd_name;
int load, query;

Figure 10-12. The loadext() Routine

Notes: Introduction
The loadext() routine, defined in the libcfg.a library, is often used to perform the task of loading the extension code into the kernel. It uses a boolean logic interface to perform the query, load and unload of kernel extensions.

dd_name
The dd_name string specifies the pathname of the extension module to load. If the dd_name string is not a relative or absolute path name (in other words, it does not start with "./", "../", or a "/"), then it is concatenated to the string "/usr/lib/drivers/". For example, PCI device drivers are normally stored in the /usr/lib/drivers/pci directory. The dd_name argument "pci/fred" would result in the loadext routine trying to load the file /usr/lib/drivers/pci/fred into the kernel.
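The dd_name rule described here can be written down as a short function. This is a minimal sketch of the rule as stated in these notes, not the libcfg.a implementation; resolve_dd_name() and DRIVER_DIR are hypothetical names introduced for illustration.

```c
#include <stdio.h>
#include <string.h>

#define DRIVER_DIR "/usr/lib/drivers/"

/*
 * Hypothetical helper: build the pathname that would be loaded for a
 * given dd_name. Names starting with "/", "./" or "../" are used
 * unchanged; anything else is appended to DRIVER_DIR.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
int resolve_dd_name(const char *dd_name, char *out, size_t out_len)
{
    int is_path = dd_name[0] == '/' ||
                  strncmp(dd_name, "./", 2) == 0 ||
                  strncmp(dd_name, "../", 3) == 0;
    int n = snprintf(out, out_len, "%s%s",
                     is_path ? "" : DRIVER_DIR, dd_name);

    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

With this rule, "pci/fred" becomes /usr/lib/drivers/pci/fred, while "./fred" and "/tmp/fred" are left as they are.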


load and query parameters
The load and query parameters are either TRUE or FALSE, and indicate the action to be taken as follows:
loadext("pci/fred", FALSE, TRUE);   /* Query of pci/fred */
loadext("pci/fred", TRUE, FALSE);   /* SYS_SINGLELOAD of pci/fred */
loadext("pci/fred", FALSE, FALSE);  /* Unload pci/fred */

Multiple copies
If you require multiple copies of a kernel extension to be loaded, you should use the sysconfig interface with the SYS_KLOAD command, since loadext uses SYS_SINGLELOAD, which will only load the extension if it is not already loaded.

Calling entry points
Even if you use the loadext routine to load the kernel extension, you still need to use the sysconfig() routine to call the entry point.
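The load-then-configure sequence can be sketched as follows. This is a hedged illustration, not a real configuration method: so that it compiles and can be exercised without an AIX kernel, it uses stand-in types and placeholder command values (all names prefixed sk_ or SK_ are hypothetical), and the sysconfig() call is passed in as a function pointer with the same cmd, parmp, parmlen shape. On AIX the real definitions come from <sys/sysconfig.h> and <sys/device.h>. Unloading the module when its configuration call fails is a design choice of this sketch.

```c
#include <stddef.h>

/* Placeholder values for this sketch only, not the real AIX constants. */
enum { SK_SYS_SINGLELOAD = 1, SK_SYS_KUNLOAD = 2, SK_SYS_CFGKMOD = 3 };
enum { SK_CFG_INIT = 1 };

typedef unsigned long sk_mid_t;          /* stand-in for mid_t */

struct sk_cfg_load {                     /* mirrors struct cfg_load */
    char    *path;
    char    *libpath;
    sk_mid_t kmid;
};

struct sk_cfg_kmod {                     /* mirrors struct cfg_kmod */
    sk_mid_t kmid;
    int      cmd;
    char    *mdiptr;
    int      mdilen;
};

/* Same shape as sysconfig(cmd, parmp, parmlen). */
typedef int (*sysconfig_fn)(int cmd, void *parmp, int parmlen);

/*
 * Load the extension at path, then call its entry point with CFG_INIT.
 * Returns 0 on success, -1 if the load fails, -2 if the entry point
 * call fails (the module is unloaded again in that case).
 */
int load_and_init(sysconfig_fn sc, char *path, sk_mid_t *kmid_out)
{
    struct sk_cfg_load load;
    struct sk_cfg_kmod kmod;

    load.path = path;
    load.libpath = NULL;
    load.kmid = 0;
    if (sc(SK_SYS_SINGLELOAD, &load, (int)sizeof(load)) != 0)
        return -1;

    kmod.kmid = load.kmid;               /* kmid returned by the load */
    kmod.cmd = SK_CFG_INIT;
    kmod.mdiptr = NULL;                  /* no module dependent info */
    kmod.mdilen = 0;
    if (sc(SK_SYS_CFGKMOD, &kmod, (int)sizeof(kmod)) != 0) {
        (void)sc(SK_SYS_KUNLOAD, &load, (int)sizeof(load));
        return -2;
    }

    *kmid_out = load.kmid;
    return 0;
}

/* Test double standing in for the kernel; records the last command. */
int fake_fail_cfg = 0;
int fake_last_cmd = 0;

int fake_sysconfig(int cmd, void *parmp, int parmlen)
{
    (void)parmlen;
    fake_last_cmd = cmd;
    if (cmd == SK_SYS_SINGLELOAD) {
        ((struct sk_cfg_load *)parmp)->kmid = 42;
        return 0;
    }
    if (cmd == SK_SYS_CFGKMOD)
        return fake_fail_cfg ? -1 : 0;
    return 0;
}
```

Calling load_and_init(fake_sysconfig, path, &kmid) walks through both steps; a real configuration method would pass the system's sysconfig() and the real structures instead.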


System Calls

User address space
    User program
    main(){ ... sys_call(arg1, arg2, ...) ... }

Kernel address space
    System call code in kernel
    sys_call(arg1, arg2, ...) { ... }

1) Switch protection domain from user to kernel.
2) Switch to the kernel stack.
3) Execute the system call code.

Figure 10-13. System Calls


Notes: Introduction
A system call is a function called by user-process code that runs in the kernel protection domain.

What is a system call?
A system call:
- Provides user access to kernel functions and resources
- Runs with kernel-mode privileges
- Protects the kernel from direct user mode access to the kernel domain


Differences from a user-mode function
From an external view the mechanism used to call a system call appears the same as calling a user-mode function. There are, however, several significant differences between a user-mode function and a system call. In a system call:
- Execution mode is switched from user to kernel mode
- Code and data are located in global kernel memory
- Cannot use the shared user libraries
- Cannot reference symbols outside of the kernel protection domain
- System calls can't be interrupted by signals (must poll for signals)
- Can create kernel process to perform asynchronous processing


Sample System Call - Export/Import File
question.exp
#!/unix
question syscall

Figure 10-14. Sample System Call - Export/Import File


Notes: Introduction
This section describes the creation of a very simple kernel extension that adds a new system call to the kernel. The extended system call created here is called question().

Export and import files
When creating an extended system call, the function name of the system call must be exported by the kernel extension and imported by any program calling the system call. Shown above is the export and import file used for this example. Note: The “tag”, syscall, shown above works here because the question() function has no parameters. If it did have parameters we would need to use a “tag” such as syscall3264.


Sample System Call - question.c
/* question.c */
#include <stdio.h>
#include <sys/device.h>

/* Configuration routine: called with CFG_INIT or CFG_TERM */
int question_init(int cmd, struct uio *uio)
{
    switch(cmd) {
    case CFG_INIT: {
        /* do init stuff here */
        printf("question_init: command=CFG_INIT\n");
        break;
    }
    case CFG_TERM: {
        /* clean up */
        printf("question_init: command=CFG_TERM\n");
        break;
    }
    default:
        printf("question_init: command=%d\n", cmd);
    }
    return(0);
}

/* The new system call */
int question()
{
    return(42); /* return the answer to the user */
}

Figure 10-15. Sample System Call - question.c

Notes: Example extension
This is the kernel extension code. The init routine question_init() is run when the extension is loaded. The function question() is the new system call. The code uses kernel printf() calls. The output from these calls will be displayed on /dev/console if the running kernel image has the kernel debugger loaded.


Sample System Call - Makefile
question: question.c
	cc -q32 -D_KERNEL -D_KERNSYS -o question32.o \
		-c question.c
	ld -b32 -o question32 question32.o -e question_init \
		-bE:question.exp -bI:/usr/lib/kernex.exp
	cc -q64 -D_KERNEL -D_KERNSYS -D__64BIT_KERNEL \
		-o question64.o -c question.c
	ld -b64 -o question64 question64.o -e question_init \
		-bE:question.exp -bI:/usr/lib/kernex.exp
	rm -f question
	ar -X32_64 -r -v question question32 question64

Figure 10-16. Sample System Call - Makefile


Notes: System call makefile
This is the Makefile used to build the kernel extension. In this example both 32-bit and 64-bit objects are built. The two objects are archived (ar) into a single file. When loaded into the kernel, the object matching the kernel type will be extracted from the archive and loaded.


Argument Passing

[Visual: 32-bit and 64-bit user processes each call sys_call(int *) in user mode, and the call crosses into kernel mode. In the 64-bit kernel, 32-bit pointers are zero extended. In the 32-bit kernel, only the low-order 32 bits are passed.]

Figure 10-17. Argument Passing


Notes: Introduction
System calls can accept up to 8 arguments. Often these arguments are 64 bits long or are pointers to buffers in the user's address space. Because AIX supports a mix of 32-bit and 64-bit environments, care must be taken when processing 64-bit arguments.

64-bit kernels
When running a 64-bit kernel, pointer arguments passed from a 32-bit process will be zero extended. This case requires no special handling.

32-bit kernels
In the 32-bit kernel, a kernel service that accepts a pointer as a parameter expects a 32-bit value. When dealing with a 64-bit user process however, things are different. Although the kernel expects (and indeed receives) 32-bit values as the arguments, the parameters in the user process itself are 64-bit. The system call handler copies the low-order 32 bits of all parameters onto the kernel stack it creates before entering the system call. The high-order 32 bits are stored elsewhere. A new kernel service called get64bitparm() is used to retrieve the stored high-order 32 bits and reconstruct the 64-bit value inside the kernel.

get64bitparm()
The get64bitparm() kernel service is defined in the header file <sys/remap.h> as follows:

unsigned long long get64bitparm(unsigned long low32, int parmnum);

The get64bitparm() kernel service is used to reconstruct a 64-bit long pointer that was passed (and truncated) from a 64-bit user process to the 32-bit kernel. The 64-bit system call handler stores the high order 32-bits of all system call arguments. Once the 64-bit value has been re-constructed, the kernel service may use it for whatever purpose it deems necessary. In the following material we demonstrate the use of this service in forming a 64-bit address which is then used to read parameter data from a 64-bit process into a 32-bit kernel extension. In this case the get64bitparm() call is used to obtain a user space address which is then accessed by the copyin64() kernel service.
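The arithmetic behind truncation, zero extension and reconstruction can be modeled in ordinary C. This is only a sketch of the bit manipulation involved, not the implementation of get64bitparm(); the helper names below are hypothetical, and in the kernel the saved high-order half is looked up by parameter number rather than passed in.

```c
#include <stdint.h>

/* Low-order half: what a 32-bit kernel receives for a 64-bit argument. */
uint32_t low32(uint64_t value)
{
    return (uint32_t)(value & 0xffffffffu);
}

/* High-order half: what the system call handler saves separately. */
uint32_t high32(uint64_t value)
{
    return (uint32_t)(value >> 32);
}

/* Rebuild the original 64-bit value from the two halves. */
uint64_t rebuild64(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}

/* A 32-bit pointer value as seen by a 64-bit kernel: zero extended. */
uint64_t zero_extend32(uint32_t pointer32)
{
    return (uint64_t)pointer32;
}
```

For a 64-bit address such as 0x0000000110001234, the low half is 0x10001234 and the high half is 0x1; combining them again gives back the original address.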


User Memory Access

User address space
    user_buffer
    sys_call( &user_buffer, sizeof(user_buffer) );

Kernel address space
    kernel_buffer
    sys_call( void * buffer, int count ){
        copyin(buffer, &kernel_buffer, count);
        copyout(&kernel_buffer, buffer, count);
    }

Figure 10-18. User Memory Access

Notes: Introduction
Within the kernel, a number of services can be used to copy data from user space to kernel space, and from kernel space to user space.

Overview
User applications reside in the user protection domain and cannot directly access kernel memory. Kernel extensions reside in the kernel protection domain and cannot directly access user space memory.

List of services
The following services can be used to transfer data between user and kernel address space. Prototypes are defined for the services in the header file <sys/uio.h>.

Copy data from user to kernel space
Use the following services to copy data from user to kernel space:
- copies count bytes of data
  copyin (void * uaddr, void * kaddr, size_t count)
- copies a character string (including the null character)
  copyinstr(void * uaddr, void * kaddr, size_t max, size_t *actual);
- fetch a byte and word respectively
  int fubyte(void *uaddr);
  int fuword(void *uaddr);

Copy data from kernel to user space
Use the following services to copy data from kernel to user space:
- copies count bytes of data
  copyout(void * kaddr, void * uaddr, size_t count)
- store a byte and word respectively
  subyte(void *uaddr, uchar val);
  suword(void *uaddr, int val);

32-bit kernels
Additional services can be used by 32-bit kernels when dealing with a 64-bit user process.
  copyin64(unsigned long long uaddr, char * kaddr, int count)
  copyout64(char * kaddr, unsigned long long uaddr, int count)
  copyinstr64(unsigned long long uaddr, caddr_t kaddr, uint max, uint *actual);
  fubyte64(unsigned long long uaddr);
  fuword64(unsigned long long uaddr);
  subyte64(unsigned long long uaddr, char val);
  suword64(unsigned long long uaddr, int val);

IS64U
The macro IS64U can be used by system call code to determine if the calling process is 64-bit or 32-bit. The macro evaluates to true if the calling process is 64-bit. It checks the U_64bit member of the user structure described earlier. Both the user structure and the IS64U macro are defined in /usr/include/sys/user.h.

64-bit argument code sample
The following code sample shows the logic used in a kernel extension that can handle calls from 64-bit user applications when running in the 32-bit kernel.

int myservice(void * buf, int count, long size)
{
    void *localmem = xmalloc(count, 2, kernel_heap);
    int rc;
    char is64u = IS64U;
#ifndef __64BIT_KERNEL /* 32-bit kernel logic */
    unsigned long long lbuf, lsize;

    if(is64u) {
        /* 32-bit kernel & caller is a 64-bit process */
        lbuf = get64bitparm( (unsigned long) buf, 0);
        lsize = get64bitparm( size, 2);
        size = (long) lsize;
        copyin64(lbuf, localmem, count);
    } else
#endif
    {
        /* this path is taken if 32-bit kernel & 32-bit process
        ** OR any size process if running in 64-bit kernel */
        copyin(buf, localmem, count);
    }
    .
    . /* body of kernel service */
    .
#ifndef __64BIT_KERNEL /* 32-bit kernel logic */
    if(is64u) {
        /* 32-bit kernel & caller is a 64-bit process */
        rc = copyout64(localmem, lbuf, count);
    } else
#endif
    {
        rc = copyout(localmem, buf, count);
    }
    if (rc != 0 ) {
        .
        .
    }
    .
}

Checkpoint

- Kernel extensions can be loaded at _____ _____ and during _______.
- Kernel extensions are used mainly for D_____ D_____, F_____ S______ and S______ C______.
- The ________ system call is used to invoke the entry point of a kernel extension.
- A kernel extension must supply a routine called main(). True or False?
- A kernel extension can be compiled and linked like a regular user application. True or False?

Figure 10-19. Checkpoint

Notes:

Exercise

Complete exercise ten
• Consists of theory and hands-on
• Ask questions at any time
• Activities are identified by a [icon]

What you will do:
• Compile, link and load a kernel extension
• Write your own system call
• Write a kernel extension that creates kernel processes
• Create your own ps command

Figure 10-20. Exercise BE0070XS4.0

Notes:

Developing code for the kernel environment is very different from developing a user-level application. In general, kernel services perform very little (or no) checking of arguments for error conditions. The consequences of invoking a kernel service with incorrect arguments include data corruption, or even causing the system to crash. This is in stark contrast to similar problems in a user-level application, which normally would result in the application terminating because of a SIGSEGV signal.

Turn to your lab workbook and complete exercise ten.

Unit Summary

• Kernel extensions are used to implement device drivers, file systems and extended system calls
• Kernel extensions can be loaded at boot time or runtime, and can be unloaded at runtime
• Kernel extensions require special compile and link steps
• Kernel extensions need to match the binary type of the running kernel
• Kernel extension code must take into account that the kernel is pageable

Figure 10-21. Unit Summary BE0070XS4.0

Appendix A. Checkpoint Solutions

Unit 1 Checkpoint Solutions

• The kernel is the base program of the operating system.
• The processor runs interrupt routines in kernel mode.
• The AIX kernel is preemptable, pageable and dynamically extendable.
• The 64-bit AIX kernel supports only 64-bit kernel extensions, and only runs on 64-bit hardware.
• The 32-bit kernel supports 64-bit user applications when running on 64-bit hardware.

Unit 2 Checkpoint Solutions

• KDB is used for live system debugging.
• kdb is used for system image analysis.
• The value of the dbg_avail kernel variable indicates how the debugger is loaded.
• A system dump image contains everything that was in the kernel at the time of the crash. True or False? False. The system dump image contains only selected areas of kernel memory.

Unit 3 Checkpoint Solutions

• AIX provides three programming models for user threads.
• A new thread is created by the thread_create() system call.
• The process table is an array of pvproc structures.
• All process IDs (except pid 1) are even.
• A thread table slot number is included in a thread ID.
• A thread holding a lock may have its priority boosted. True or False? True.

Unit 4 Checkpoint Solutions

• AIX divides physical memory into frames.
• The virtual memory manager provides each process with its own effective address space.
• A 32-bit effective address contains a 4-bit segment number.
• A segment can be up to 256MB in size.
• The 32-bit user address space layout is the same as the 32-bit kernel address space layout. True or False? False.
• Shared library data segments can be shared between processes. True or False? False. The shared library text segments are shared, but the data segments are private.
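The answers on segment size and segment number fit together arithmetically: 4 bits select one of 16 segments, and the remaining 28 bits address 2^28 bytes, which is 256MB. The sketch below splits a 32-bit effective address on that basis. The function and macro names are illustrative only and are not taken from the AIX headers.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SHIFT   28u                               /* 2^28 bytes = 256MB */
#define SEG_OFFMASK ((UINT32_C(1) << SEG_SHIFT) - 1)  /* low 28 bits */

/* Upper 4 bits of a 32-bit effective address: the segment number. */
unsigned ea_segment(uint32_t ea)
{
    return ea >> SEG_SHIFT;
}

/* Lower 28 bits: the byte offset within the segment. */
uint32_t ea_offset(uint32_t ea)
{
    return ea & SEG_OFFMASK;
}

/* Rebuild an effective address from a segment number and an offset. */
uint32_t ea_join(unsigned seg, uint32_t off)
{
    assert(seg < 16u && off <= SEG_OFFMASK);
    return ((uint32_t)seg << SEG_SHIFT) | off;
}
```

For example, the address 0x2FF22000 splits into segment 2 and offset 0x0FF22000.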

Unit 5 Checkpoint Solutions

• The system hardware maintains a table of recently referenced virtual to physical address translations.
• The Software Page Frame Table contains information on all pages resident in physical memory.
• Each working storage segment has an XPT.
• A SIGDANGER signal is sent to every process when the free paging space drops below the warning threshold.
• The PSALLOC environment variable can be used to change the paging space policy of a process.
• A page fault when interrupts are disabled will cause the system to crash.

Unit 6 Checkpoint Solutions

1) What processor features are required in a partitioned system?
   RMO, RML and LPI registers are needed in a partitioned system.
2) Memory is allocated to partitions in units of 256MB.
3) All partitions have the same real mode memory requirements. True or False?
   The statement is False. AIX 5.1 requires 256MB. AIX 5.2 and Linux need 256MB, 1GB or 16GB, depending on the amount of memory allocated to the partition.
4) In a partitioned environment, a real address is the same as a physical address. True or False?
   The statement is False. A real address is not equivalent to a physical address in the partitioned environment.
5) Any piece of code can make hypervisor calls. True or False?
   The statement is False. Only kernel code can make hypervisor calls.
6) Which physical addresses in the system can a partition access?
   A partition can access the PMBs allocated to the partition, (and with hypervisor assistance) the partition's own page table, and the TCE windows for the allocated I/O slots.

Unit 7 Checkpoint Solutions (1 of 2)

• Each user process contains a private File Descriptor Table.
• The kernel maintains a vfs structure and a vmount structure for each mounted file system.
• There is one gfs structure for each mounted file system. True or False? False. There is one gfs structure for each file system type registered with the kernel.
• The three kernel structures volgrp, lvol and pvol are used to track LVM volume group, logical volume and physical volume data, respectively.
• The kdb subcommand volgrp and the AIX command lsvg both reflect volume group information.

Unit 7 (continued) Checkpoint Solutions (2 of 2)

• There is one vmount/vfs structure pair for each mounted filesystem. True or False? True.
• Every open file in a filesystem is represented by exactly one file structure. True or False? False. There is one file structure (system file table entry) for each unique open() of a file. So, a given file may be represented by several file structures.
• The inode number given by ls -id /usr is shown as 2. Why? The reason is that ls is giving us the root inode of the /usr filesystem, not the inode of the /usr directory in the / (root) filesystem. To obtain this directory inode we need to follow the vfs_mntdover pointer in the /usr filesystem vfs structure. This will point us to the vnode structure of directory /usr in the root filesystem, which contains the directory inode number.
• Each vnode for an open file points to a gnode structure. The reason for this is that gnode structures are of one format. They are embedded in the corresponding inode/specnode/rnode structure for the file in question. These structures are of different formats.

Unit 8 Checkpoint Solutions

• The root inode number of a filesystem is always 1. True or False? False. The root inode number is always 2.
• An allocation group contains disk inodes and fragments.
• The basic allocation unit in JFS is a disk block. True or False? False. The basic allocation unit is a fragment.
• The first part of an in-core JFS inode contains data relevant only when the associated object is being referenced. True or False? True. This includes such items as open count and in-core inode state.
• The last 128 bytes of an in-core JFS inode is a copy of the disk inode. True or False? True.
• JFS maps user data blocks and directory information into virtual memory. True or False? True. This is a reason for JFS I/O efficiency. JFS itself does copy operations and relies on VMM to do the actual I/O operations.

Unit 9 Checkpoint Solutions

• There is one aggregate per logical volume.
• An allocation group is at least 8192 aggregate blocks.
• The number of inodes in a JFS2 file system is fixed. True or False? False.
• The data contents of a file is stored in objects called extents.
• A single extent can be up to 2^24-1 aggregate blocks in size.
• A JFS2 directory contains directory entries for the . and .. directories. True or False? False. The information for . and .. is contained in the inode of the directory.
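The extent limit above can be turned into a byte count once a block size is chosen. The sketch below takes the extent length to be a 24-bit count of aggregate blocks, which is an assumption made for the calculation, and the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define EXTENT_LEN_BITS 24u   /* assumed width of the extent length field */

/* Largest block count a single extent can describe: 2^24 - 1. */
uint64_t max_extent_blocks(void)
{
    return (UINT64_C(1) << EXTENT_LEN_BITS) - 1;
}

/* Largest extent in bytes for a given aggregate block size. */
uint64_t max_extent_bytes(uint32_t block_size)
{
    assert(block_size != 0);
    return max_extent_blocks() * block_size;
}
```

With a 4096-byte block size this works out to 68,719,472,640 bytes, one block short of 64GB.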

Unit 10 Checkpoint Solutions

• Kernel extensions can be loaded at system boot and during runtime.
• The sysconfig system call is used to invoke the entry point of a kernel extension.
• A kernel extension must supply a routine called main(). True or False? False.
• A kernel extension can be compiled and linked like a regular user application. True or False? False.
• Kernel extensions are used mainly for Device Drivers, File Systems and System Calls.


Appendix B. KI Crash Dump

What This Unit Is About

This unit describes how to configure and perform system dumps on a system running a version of the AIX 5L operating system.

What You Should Be Able to Do

After completing this unit, you should be able to:
• Configure a system to perform a system dump
• Test the system dump configuration of a system
• Validate a dump file

How You Will Check Your Progress

Accountability:
• Exercises using your lab system

Unit Objectives

At the end of this unit you should be able to:
• Configure a system to perform a system dump
• Test the system dump configuration of a system
• Validate a dump file

Figure B-1. Unit Objectives BE0070XS4.0

Crash Dumps

• What is a crash dump?
• When is a crash dump created?
• What is a crash dump used for?

Figure B-2. Crash Dumps BE0070XS4.0

Notes:

System Dump Facility in AIX 5L

What is a crash dump?

A system crash dump is a snapshot of the operating system state at the time of the crash or manually initiated dump. When a manually-initiated or unexpected system halt occurs, the system dump facility automatically copies selected areas of kernel data to the primary dump device. These areas include kernel memory as well as other areas registered in the Master Dump Table by kernel modules or kernel extensions.

When is a crash dump created?

An AIX 5L system will generate a system crash dump when encountering a severe system error, such as unexpected or unrecoverable kernel mode exceptions. It can also be initiated by the system administrator when the system is hung.

What is a crash dump used for?

The system dump facility provides a mechanism to capture sufficient information about the AIX 5L kernel for later expert analysis. Once the preserved image is written to disk, the system will be booted and returned to production. The dump is then typically submitted to IBM for analysis.

Process Flow

[Flow chart]
AIX 5L in production
System panics
Stage 1: Memory dumper is run. Memory is copied to disk location specified in SWservAt ODM object class.
System is booted
Stage 2: copycore copies dump into /var/adm/ras. copycore is called by rc.boot.

Figure B-3. Process Flow BE0070XS4.0

Notes:

System dump process

Introduction

The process of performing a system dump is illustrated in the chart. The process involves two stages. In stage one, the contents of memory are copied to a temporary disk location. In stage two, AIX 5L is booted and the memory image is moved to a permanent location in the /var/adm/ras directory.

Exercise

Complete exercise A
• Consists of theory and hands-on
• Ask questions at any time
• Activities are identified by a [icon]

What you will do:
• Learn about the sysdumpdev command
• Configure your lab system to perform a system dump
• Test the crash dump configuration
• Verify you have obtained a successful system dump

Figure B-4. About This Exercise BE0070XS4.0

