The information contained within this document is subject to change without notice.
HEWLETT-PACKARD MAKES NO WARRANTY OF ANY KIND WITH REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Hewlett-Packard shall not be liable for errors contained herein nor for incidental or consequential damages in connection with the furnishing, performance, or use of this material.
Warranty. A copy of the specific warranty terms applicable to your Hewlett-Packard product and replacement parts can be obtained from your local Sales and Service Office.
Restricted Rights Legend. Use, duplication, or disclosure by the U.S. Government Department is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 for DOD agencies, and subparagraphs (c) (1) and (c) (2) of the Commercial Computer Software Restricted Rights clause at FAR 52.227-19 for other agencies.
Copyright Notices. (C)copyright 1983-2000 Hewlett-Packard Company, all rights reserved.
This documentation contains information that is protected by copyright. All rights are reserved. Reproduction, adaptation, or translation without written permission is prohibited except as allowed under the copyright laws.
(C)Copyright 1981, 1984, 1986 UNIX System Laboratories, Inc.
(C)copyright 1986-1992 Sun Microsystems, Inc.
(C)copyright 1985-86, 1988 Massachusetts Institute of Technology.
(C)copyright 1989-93 The Open Software Foundation, Inc.
(C)copyright 1986 Digital Equipment Corporation.
(C)copyright 1990 Motorola, Inc.
(C)copyright 1990, 1991, 1992 Cornell University
(C)copyright 1989-1991 The University of Maryland.
(C)copyright 1988 Carnegie Mellon University.
Trademark Notices. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited.
NFS is a trademark of Sun Microsystems, Inc.
OSF and OSF/1 are trademarks of the Open Software Foundation, Inc. in the U.S. and other countries.
First Edition: April 1997 (HP-UX Release 10.30)
Second Edition: September 2000 (HP-UX Release 11.11)
The memory management system is designed to make memory resources available safely and efficiently to threads and processes:
The data and instructions of any process (a program in execution) or thread of execution within a process must be available to the CPU by residing in physical memory at the time of execution.
To execute a process, the kernel creates a per-process virtual address space; portions of the virtual space are mapped onto physical memory. Virtual memory allows the total size of user processes to exceed physical memory. Through "demand paging", HP-UX enables you to execute threads and processes by bringing virtual pages into main memory only as needed (that is, "on demand") and pushing out portions of a process's address space that have not been recently used.
The term "memory management" refers to the rules that govern physical and virtual memory and allow for efficient sharing of the system's resources by user and system processes.
The system uses a combination of pageout and deactivation to manage physical memory. Paging involves writing recently unreferenced pages from main memory to disk from time to time. A page is the smallest unit of physical memory that can be mapped to a virtual address with a given set of access attributes. On a loaded system, total unreferenced pages might be a large fraction of memory.
Deactivation takes place if the system is unable to maintain a large enough free pool of physical memory. When an entire process is deactivated, the pages associated with the process can be written out to secondary storage, since they are no longer referenced. A deactivated process cannot run, and therefore, cannot reference its data.
Secondary storage supplements physical memory. The memory management system monitors available memory and, when it is low, writes out pages of a process or thread to a secondary storage device called a swap device. The data is read from the swap device back into physical memory when it is needed for the process to execute.
On a PA-RISC system, every page of physical memory is addressed by a physical page number (PPN), which software derives from the physical address by discarding the page offset. Access to pages (and thus to the data they contain) is done through virtual addresses, except under specific circumstances: when virtual translation must be turned off (the D and I bits are off), pages are accessed by their absolute addresses.
When a program is compiled, the compiler generates virtual addresses for the code. Virtual addresses represent a location in memory. These virtual addresses must be mapped to physical addresses (locations of the physical pages in memory) for the compiled code to execute. User programs use virtual addresses only.
The kernel and the hardware coordinate a mapping of these virtual and physical addresses for the CPU, called "address translation," to locate the process in memory.
The PA-RISC architecture is segmented; a complete virtual address consists of a space identifier (SID) and an offset within that space.
The offset may be 32 bits or 64 bits wide; earlier PA-RISC processors (before PA-RISC 2.0) only support 32 bit offsets.
From the point of view of a user program, the segmentation is not obvious; instead, user programs experience an almost flat address space with either 32 or 64 bit virtual addresses (depending on how the process was compiled).
The kernel, however, deals in the full complexity of space and offset.
From the kernel point of view, every process running on a PA-RISC processor shares a single global virtual address space, with global virtual addresses (GVAs) composed of both space and offset. (These GVAs are 96 bit on PA-RISC 2.0 processors running in 64-bit (wide) mode; smaller on earlier processors.) This global virtual address space is also shared by the kernel.
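As a minimal sketch of how a space identifier and an offset combine into one global virtual address, the following assumes the PA-RISC 2.0 wide-mode widths mentioned in this document (a 32-bit space ID and a 64-bit offset field, giving 96 bits). The function names `make_gva` and `split_gva` are illustrative only and are not kernel interfaces.

```python
SPACE_BITS = 32    # width of the space identifier field (PA-RISC 2.0)
OFFSET_BITS = 64   # width of the offset field in wide (64-bit) mode


def make_gva(space_id: int, offset: int) -> int:
    """Concatenate a space ID and an offset into one global virtual address."""
    if not 0 <= space_id < (1 << SPACE_BITS):
        raise ValueError("space ID does not fit in the space field")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in the offset field")
    return (space_id << OFFSET_BITS) | offset


def split_gva(gva: int) -> tuple:
    """Recover (space ID, offset) from a global virtual address."""
    return gva >> OFFSET_BITS, gva & ((1 << OFFSET_BITS) - 1)


# Hypothetical space ID and offset, chosen only for illustration.
gva = make_gva(0x1234, 0x4873)
print(hex(gva), split_gva(gva))
```

The largest value this scheme can produce occupies exactly 96 bits, which matches the GVA width quoted above for wide mode.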
Although any process can create and attempt to read or write any global virtual address, the kernel uses page granularity access control mechanisms to prevent unwanted interference between processes.
When a virtual page is "paged" into physical memory, free physical pages are allocated to it by the physical memory allocator. These pages may be randomly scattered throughout the memory depending on their usage history. Translations are needed to tell the processor where the virtual pages are loaded. The process of translating a virtual address into a physical address is called virtual address translation.
Potentially the virtual address space can be much greater than the physical address space. The virtual memory system enables the CPU to execute programs much larger than the available physical memory and allows you to run many more programs at a time than you could without a virtual memory system.
The more main memory in the system, the more data the system can access and the more (or larger) processes it can retain and execute without having to page or cause deactivation as frequently. Memory-resident resources (such as page tables) also take up space in main memory, reducing the space available to applications.
At boot time, the system loads HP-UX from disk into RAM, where it remains memory-resident until the system is shut down.
User programs and commands are also loaded from disk into RAM, but in small portions as they are needed. When a program terminates, the operating system frees the memory used by the process.
Disk access is slow compared to RAM access. Excessive disk access can lead to increased latency or reduced throughput and can lead to the disk access becoming the bottleneck in the system. To avoid this, you need to do some sort of buffering. Buffering, paging, and deactivation algorithms optimize disk access and determine when data and code for currently running programs are returned from RAM to disk. When a user or system program writes data to disk, the data is either written directly from the program's RAM (e.g. if writing to a "raw" device) or buffered in what is called the buffer cache and written to disk in relatively big chunks. Programs also read files and database structures from disk into RAM. When you issue the sync command before shutting down a system, all modified buffers of the buffer cache are flushed (written) out to disk.
On each processor, there are also registers and cache, which are even faster than main memory. Program execution actually happens in registers, which get data from the cache and other registers. The cache contains the current working copy of parts of main memory. Most of the time when discussing memory management, cache and registers will be completely ignored; data and instructions will be treated as being accessed directly from main memory. They are mentioned here only to reduce confusion.
From this point on, this section only discusses "main memory".
[Figure: physical memory, divided into the HP-UX kernel (loaded at bootup) and available memory; lockable memory is a portion of available memory.]
Not all physical memory is available to user processes. Kernel text and initialized data occupy about 10 MB of RAM; additional memory is used by kernel bss (uninitialized data), and (especially) various structures allocated during kernel boot. Many of the structures allocated during kernel boot can be quite large. The sizes of some are determined by kernel tunables, but many are sized based on the amount of physical memory in the system, e.g. such a structure might have one 96 byte entry for every 4096 byte page of physical memory.
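The sizing rule in the example above (one 96-byte entry for every 4096-byte page) can be worked through numerically. The sketch below is only arithmetic on the figures quoted in the text; `per_page_table_bytes` is a hypothetical helper and does not correspond to any particular kernel structure.

```python
PAGE_SIZE = 4096   # bytes per page, as stated in the text
ENTRY_SIZE = 96    # bytes per entry, the example figure from the text


def per_page_table_bytes(physmem_bytes, entry_size=ENTRY_SIZE, page_size=PAGE_SIZE):
    """Size of a table that holds one entry per page of physical memory."""
    pages = physmem_bytes // page_size
    return pages * entry_size


one_gb = 1 << 30
table = per_page_table_bytes(one_gb)
print(table // (1 << 20), "MB for 1 GB of RAM")   # prints: 24 MB for 1 GB of RAM
print(f"{ENTRY_SIZE / PAGE_SIZE:.2%} of physical memory")
```

Because the table grows linearly with physical memory, such a structure consumes the same fraction of RAM (a little over 2 percent in this example) regardless of system size.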
Instead of allocating all its data structures at system initialization, the HP-UX kernel dynamically allocates and releases some kernel structures as needed by the system during normal operation. This allocation comes from the available memory pool; thus, at any given time, part of the available memory is used by the kernel and the remainder is available for user programs.
Physical address space is the entire range of addresses used by hardware (4 GB on 32-bit (narrow mode) kernels), and is divided into memory address space, processor-dependent code (PDC) address space, and I/O address space. The next figure shows the expanse of memory available for computation. Memory address space takes up 15/16 of the system address space, while the address space allotted to PDC and I/O consumes a relatively small range of addresses.
          +-----------+
0x00000000| page zero |
          +-----------+
          |           |
          |           |       +-----------------------+
          |  Memory   |      /| PDC address space     |0xF0000000
          |  address  |     / |                       |
          |  space    |    /  +-----------------------+
          |           |   /   |                       |0xF1000000
          |           |  /    |                       |
          |           | /     |     I/O Register      |
0xF0000000+-----------+/      |        address        |
          | PDC & I/O |       |         space         |
0xFFFFFFFF+-----------+       |                       |
                       \      |                       |
                        \     +.......................+
                         \    |      Central bus      |
                          \   |     address space     |
                           \  +.......................+
                            \ |   Broadcast address   |0xFFFC0000
                             \| space (local, global) |0xFFFFFFFF
                              +-----------------------+
                   +-----------------------+
0x00000000 00000000|       page zero       |
                   +.......................+
                   |                       |
                   |                       |
                   |                       |
                   |                       |
                   |                       |
                   |                       |
                   |        Memory         |
                   |        address        |
                   |         space         |
                   |                       |
                   |                       |
                   |                       |
                   |                       |
                   |                       |
                   |                       |
                   |                       |
                   |                       |
                   +-----------------------+
0xF0000000 00000000|   PDC address space   |
0xF1000000 00000000|                       |
                   +-----------------------+
                   |     I/O Register      |
                   |        address        |
                   |         space         |
                   +.......................+
                   |      Central bus      |
                   |     address space     |
                   +.......................+
0xFFFFFFFF FFFC0000|   Broadcast address   |
0xFFFFFFFF FFFFFFFF| space (local, global) |
                   +-----------------------+
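For the 32-bit (narrow mode) layout, the boundaries shown in the figure can be turned into a small lookup. This is a sketch that assumes the region boundaries are exactly the addresses printed in the figure; the I/O register and central bus areas are treated as one range because the figure gives no address for the boundary between them. `classify` is a hypothetical helper name.

```python
# (first address, last address, region name), taken from the 32-bit figure.
REGIONS = [
    (0x00000000, 0xEFFFFFFF, "memory"),
    (0xF0000000, 0xF0FFFFFF, "PDC"),
    (0xF1000000, 0xFFFBFFFF, "I/O"),        # I/O registers and central bus
    (0xFFFC0000, 0xFFFFFFFF, "broadcast"),
]


def classify(paddr):
    """Return the name of the region that contains a 32-bit physical address."""
    for first, last, name in REGIONS:
        if first <= paddr <= last:
            return name
    raise ValueError("not a 32-bit physical address")


for addr in (0x00004873, 0xF0000000, 0xF1000000, 0xFFFC0000):
    print(hex(addr), classify(addr))

# Memory address space ends at 0xF0000000, which is 15/16 of the 4 GB range.
print(0xF0000000 / (1 << 32))
```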
Pages kept in memory for the lifetime of a process by means of a system
call (such as mlock, plock, or shmctl)
are termed locked memory.
Locked memory cannot be paged and processes with locked memory
cannot be deactivated. Typically, locked memory holds frequently
accessed programs or data structures, such as critical sections of
application code. Keeping them memory-resident improves application
performance.
The lockable_mem variable tracks how much memory can be locked.
Available memory is a portion of physical memory, minus the amount of
space required for the kernel and its data structures. The initial value of
lockable_mem is the available memory on the system after boot-up,
minus the value of the system parameter, unlockable_mem.
The value of lockable memory depends on several factors:
unlockable_mem is a kernel tunable
parameter. Changing the value of unlockable_mem also alters the
initial value of lockable_mem.
HP-UX places no explicit limit on the amount of available memory you may lock down; instead, HP-UX specifies how much memory must remain unlocked.
Other kernel resources that use memory (such as the dynamic buffer cache) can cause changes.
As the amount of memory that has been locked down increases, existing
processes compete for a smaller and smaller pool of usable memory. If
the number of pages in this remaining pool of memory falls below the
paging threshold called lotsfree,
the system activates its paging
mechanism by scheduling vhand, in an attempt to keep a reasonable
amount of memory free for general system use.
Care must be taken to allow sufficient space for processes to make forward progress; otherwise, the system is forced into paging and deactivating processes constantly, to keep a reasonable amount of memory free.
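The accounting described above can be modeled in a few lines. This is a deliberately simplified sketch, not the kernel's algorithm: it assumes that locking pages simply decrements lockable_mem, and it reduces the vhand trigger to a single comparison against lotsfree. The class and function names are hypothetical, and all quantities are in pages.

```python
class LockableMemory:
    """Toy model of lockable_mem accounting (all values in pages)."""

    def __init__(self, available_pages, unlockable_mem):
        # Initial value: available memory after boot minus unlockable_mem.
        self.lockable_mem = available_pages - unlockable_mem

    def lock(self, pages):
        """Lock pages down, refusing requests that exceed what is lockable."""
        if pages > self.lockable_mem:
            raise MemoryError("request exceeds lockable memory")
        self.lockable_mem -= pages

    def unlock(self, pages):
        self.lockable_mem += pages


def vhand_needed(free_pages, lotsfree):
    """Paging starts when the free pool drops below the lotsfree threshold."""
    return free_pages < lotsfree


# Example values are made up for illustration.
mem = LockableMemory(available_pages=10000, unlockable_mem=2000)
mem.lock(5000)
print(mem.lockable_mem)                               # prints: 3000
print(vhand_needed(free_pages=400, lotsfree=512))     # prints: True
```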
Data is removed to secondary storage if the system is short of main memory. The data is typically stored on disks accessible either via system buses or network to make room for active processes.
Swap refers to a physical memory management strategy (predating UNIX) where entire processes are moved between main memory and secondary storage. Modern virtual memory systems no longer swap entire processes, but rather use a paging scheme, where individual pages of data and instructions can be paged in from secondary storage as needed, or paged out again to free up memory for other uses. This is backed up by a deactivation scheme that allows whole processes to be pushed out if the system is desperately short of memory. However, the secondary storage dedicated to storing paged out data is still referred to as "swap space".
Device swap can take the form of an entire disk or LVM(1)
logical volume
of a disk. A file system can be configured to offer free space for swap; this
is termed file-system swap. If more swap space is required, it can be
added dynamically to a running system, as either device swap or
file-system swap.
The swapon command is used to allocate disk space or
a directory in a file system for swap.
(1) Logical Volume Manager (LVM) is a set of commands and underlying software to handle disk storage resources with more flexibility than offered by traditional disk partitions.
A computer has a finite amount of RAM available, but each 32-bit HP-UX process has a 4 GB virtual address space apportioned in four one-gigabyte quadrants. (64-bit HP-UX processes have an even larger virtual address space, though they cannot actually use the full (16 exabyte) range of virtual addresses addressable with 64 bits. It too is divided into four equal-sized quadrants.) This is termed virtual memory.
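Since a 32-bit process address space consists of four one-gigabyte quadrants, the quadrant that an address falls in is determined by its two most significant bits. The sketch below illustrates that arithmetic; the zero-based quadrant index and the function names are choices made for this example only.

```python
QUADRANT_SHIFT = 30                  # 1 GB = 2**30 bytes
QUADRANT_SIZE = 1 << QUADRANT_SHIFT


def quadrant(addr32):
    """Zero-based index (0 to 3) of the quadrant containing a 32-bit address."""
    if not 0 <= addr32 < (1 << 32):
        raise ValueError("not a 32-bit address")
    return addr32 >> QUADRANT_SHIFT


def quadrant_offset(addr32):
    """Byte offset of the address within its quadrant."""
    return addr32 & (QUADRANT_SIZE - 1)


for addr in (0x00004873, 0x40000000, 0x7FFFFFFF, 0xC0001000):
    print(hex(addr), quadrant(addr), hex(quadrant_offset(addr)))
```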
Virtual memory is the software construct that allows each process sufficient computational space in which to execute. It is accomplished with hardware support.
As software is compiled and run, it generates virtual addresses that provide programmers with memory space many times larger than physical memory alone.
HP-UX is a Shared Address Space (SAS) operating system. A given virtual address (including space ID) refers to the same page of memory for all processes; translations are not changed when the process context changes.
Thus, the number of bits available for the space ID (segment) and offset (often simply called "virtual address") determines the ultimate size of the total virtual address space available to the kernel and all processes together.
As PA-RISC evolved, the number of bits usable for space and offset has increased. On PA-RISC 2.0, the space ID is 32 bits (18 bits actually used in HP-UX 11.11) and the offset is effectively 42 bits (though stored in a 64-bit field). (PA-RISC 1.1 systems, and PA-RISC 2.0 systems running in narrow (32-bit) mode, have a smaller offset.)
NOTE: Understand, however, that a single process has significant
limitations on the virtual address space it is allowed to access.
For example, in a 32-bit SHARE_MAGIC executable, text is limited
to 1 GB and data is limited to 1 GB.
Also, the total amount of shared virtual address
space in the system is limited to much less than theoretically addressable;
without using memory windows, the total shared space on a wide mode
(64-bit) system is limited
to approximately 8 TB (i.e. 2 64-bit quadrants).
A physical address points to a page in memory that represents 4096 bytes of data. The physical address also contains an offset into this page. Thus, the complete physical address is composed of a physical page number (PPN) and page offset. The PPN is the 20 or 52 most significant bits of the physical address where the page is located. These bits are concatenated with a 12-bit page offset to form the 32-bit or 64-bit physical address.
     Page Number       Page Offset
+--------------------+------------+
|00000000000000000100|100001110011|
+--------------------+------------+
 0                 19 20        31

                    Page Number                      Page Offset
+---------------------------------------------------+------------+
|000000000000000000000000000000000000000000000000100|100001110011|
+---------------------------------------------------+------------+
 0                                                51 52        63
To handle the translation of the virtual address to a physical address, the virtual address also needs to be looked at as a virtual page number (VPN) and page offset. Since the page size is 4096 bytes, the low-order 12 bits of the offset are assumed to be the offset into the page. The space ID and the high-order bits of the offset are the VPN.
For any given address you can determine the page number by discarding the least significant 12 bits. What remains is the virtual page number for a virtual address or the physical page number for the physical address.
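The rule above (discard the low 12 bits to get the page number, keep them to get the page offset) is simple bit arithmetic. The following sketch applies it to the offset 0x4873 used in this section; the helper names are illustrative, and representing the VPN as a (space ID, page number) pair is a choice made for the example.

```python
PAGE_SHIFT = 12                       # 4096-byte pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1     # low 12 bits select a byte in the page


def split_address(addr):
    """Return (page number, page offset) for a virtual offset or physical address."""
    return addr >> PAGE_SHIFT, addr & PAGE_MASK


def join_address(page_number, page_offset):
    """Concatenate a page number with a 12-bit page offset."""
    return (page_number << PAGE_SHIFT) | (page_offset & PAGE_MASK)


def virtual_page_number(space_id, offset):
    """The VPN is the space ID together with the high-order bits of the offset."""
    return space_id, offset >> PAGE_SHIFT


page, off = split_address(0x4873)
print(hex(page), hex(off))                  # prints: 0x4 0x873
print(virtual_page_number(0x0, 0x4873))     # prints: (0, 4)
print(hex(join_address(page, off)))         # prints: 0x4873
```

Translation replaces only the page number; the page offset is carried over unchanged from the virtual address to the physical address.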
The next figure shows the bit layout of a 32-bit virtual address of 0x0.4873.
         32-bit Space ID                    32-bit Offset
+--------------------------------+--------------------+------------+
|00000000000000000000000000000000|00000000000000000100|100001110011|
+--------------------------------+--------------------+------------+
|                                                    | |           |
+----------------------------------------------------+ +-----------+
                          |                                  |
                      VPN = 0x4                         Page Offset
                                                           0x873
The virtual page number must be translated to obtain the associated physical page number; the page offset, 0x873, is carried over unchanged.
+---------------------------------------------------+
| +--------------------+                            |
| | Central Processing |                            |
| |     Unit (CPU)     |     +-------------------+  |
| +--------------------+     |  Floating Point   |  |
|            |-------------->|    Coprocessor    |  |
|            |               +-------------------+  |
|            |------------------------+             |
|            |                        |             |
|            V                        V             |
| +--------------------+    +-------------------+   |
| |                    |    |    Translation    |   |
| |       Cache        |    | Lookaside Buffer  |   |
| |                    |    |       (TLB)       |   |
| +--------------------+    +-------------------+   |
|            |                        |             |
|            |<-----------------------+             |
| +--------------------+                            |
| |  System Interface  |                            |
| |     Unit (SIU)     |                            |
| +--------------------+                            |
|            |                                      |
+------------V--------------------------------------+
             |                         Central Bus
==================================================================
The figure above and the table that follows name the principal processor components; of them, registers, translation lookaside buffer, and cache are crucial to memory management, and will be discussed in greater detail following the table.
| Component | Purpose |
|---|---|
| Central Processing Unit (CPU) | The main component responsible for reading program and data from memory, and executing the program instructions. Within the CPU are the following: |
| Instruction and Data Cache | The cache is a portion of high-speed memory used by the CPU for quick access to data and instructions. The most recently accessed data is kept in the cache. |
| Translation Lookaside Buffer (TLB) | The processor component that enables the CPU to access data through virtual address space by: |
| Floating Point Coprocessor | An assist processor that carries out specialized tasks for the CPU. |
| System Interface Unit (SIU) | Bus circuitry that allows the CPU to communicate with the central (native) bus. |
The translation lookaside buffer (TLB) translates virtual addresses to physical addresses.
+---------------------------+\
|                           |  \
|                           |    \
|                           |      \
|                           |        \
|                           |          \
|                           |            \
|                           |             +--------+
|          Virtual          |    +---+    |Physical|
|          address          |<-->|TLB|<-->|address |
|           space           |    +---+    |space   |
|                           |             +--------+
|                           |            /
|                           |          /
|                           |        /
|                           |      /
|                           |    /
|                           |  /
+---------------------------+/
Address translation is handled from the top of the memory hierarchy
hitting the fastest components first (such as the TLB on the processor)
and then moving on to the page directory table
(pdir in main memory)
and lastly to secondary storage.
The TLB looks up the translation for the virtual page numbers (VPNs) and gets the physical page numbers (PPNs) used to reference physical memory.
          Virtual address                                Main Memory
 +-------------------+-----------+                       +--------+
 |Virtual Page Number|Byte Offset|                       |   0    |
 +-------------------+-----------+                       |        |
       |                   |                             |        |
       |                   +----------------+            |        |
       V                                    |            |        |
      VPN        PPN Rights ID  O U T D P   |            |        |
 +------------+-------+----+---+-+-+-+-+-+  |            |        |
 |            |       |    |   | | | | | |  |         +------>[]  |
 +------------+-------+----+---+-+-+-+-+-+  |   PPN   |  |        |
T|            |       |    |   | | | | | |  |    +    |  |        |
L+------------+-------+----+---+-+-+-+-+-+  |  Offset |  |        |
B|            |       |    |   | | | | | |  |         |  |        |
 +------------+-------+----+---+-+-+-+-+-+  |         |  |        |
                  |                         |         |  |        |
                  V   Physical address      V         |  |        |
                 +--------------------+-----------+   |  |        |
                 |Physical Page Number|Byte Offset|---+  |physmem |
                 +--------------------+-----------+      +--------+
Ideally the TLB would be large enough to hold translations for every page of physical memory; however this is prohibitively expensive; instead the TLB holds a subset of entries from the page directory table (PDIR) in memory. The TLB speeds up the process of examining the PDIR by caching copies of its most recently utilized translations.
Because the purpose of the TLB is to satisfy virtual to physical address translation, the TLB is only searched when memory is accessed while in virtual mode. This condition is indicated by the D-bit in the PSW (or the I-bit for instruction access).
Depending on model, the TLB may be organized on the processor in one of two ways:
The advantage of having a split Data TLB (DTLB) and Instruction TLB (ITLB), is that it is possible to account for the different characteristics of data and instruction locality and type of access (frequent random access of data versus relatively sequential single usage of instructions).
Because TLB size is limited, it is desirable to use as few entries as possible to translate the largest possible amount of memory. PA-RISC 2.0 processors provide a variable page size, and memory is organized to use large page sizes wherever this is reasonable. In particular, the memory initially allocated for the kernel at boot time is mapped with the largest possible page size that fits it. (Other memory will be mapped with large pages if possible, but there are tradeoffs that may make this impractical, especially on small memory systems.)
PA-RISC processors before PA-RISC 2.0 do not support a general purpose variable page size. Instead, they may provide a block TLB. The block TLB is quite small, but its entries can map more than a single 4K page (i.e. multiple hpdes). Block TLB entries are used to reference kernel memory that remains resident. (Memory referenced by a block TLB entry cannot be paged out.) The block TLB is typically used for graphics, because graphics data is accessed in huge chunks. It is also used for mapping other static areas such as kernel text and data.
Since the TLB translates virtual to physical addresses, each entry contains both the Virtual Page Number (VPN) and the Physical Page Number (PPN). Entries also contain Access Rights, an Access Identifier, and the flags described in the following table.
| Flag | Name | Meaning |
|---|---|---|
| O | Ordered | Accesses to data for load and store are ranked by strength -- strongly ordered, ordered, and weakly ordered. (See PA-RISC 2.0 specifications for model and definitions.) |
| U | Uncacheable | Determines whether data references to a page from memory address space may be moved into the cache. Typically set to 1 for data references to a page that maps to the I/O address space or for memory address space that must not be moved into cache. |
| T(1) | Page Reference Trap | If set, any access to this page causes a reference trap to be handled either by hardware or software trap handlers. |
| D | Dirty | When set, this bit indicates that the associated page in memory differs from the same page on disk. The page must be flushed before being invalidated. |
| B | Break | This bit causes a trap on any instruction that is capable of writing to this page. |
| P | Prediction method for branching | Optional, used for performance tuning. |
(1) The T, D, and B flags are only present in data or unified TLBs.
In PA 1.x architecture, an E bit (or "valid" bit) indicates that the TLB entry reflects the current attributes of the physical page in memory.
The operating system maintains a table in memory called the Page Directory (PDIR) which keeps track of all virtual pages currently in memory. When a page is mapped in some virtual address space, it is allocated an entry in the PDIR. The PDIR is what links a virtual address to a physical page in memory.
The PDIR is implemented as a memory-resident table of software structures called hashed page directory entries (HPDEs), which contain virtual and physical addresses. When the processor needs to find a physical page not indexed in the TLB, it can search the PDIR with a virtual address to find the matching address.
The PDIR table is a hash table with collision chains. The virtual address is used to hash into one of the buckets in the hash table and the corresponding chain is searched until a chain entry with a matching virtual address is found.
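The lookup just described can be sketched as a chained hash table keyed by space ID and virtual page number. The field names below follow the hpde element names used in this document (pde_space, pde_vpage, pde_phys, pde_next), but the hash function, the bucket count, and the `Pdir` class itself are placeholders invented for illustration; they are not the kernel's actual hashing scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hpde:
    pde_space: int                      # virtual space ID
    pde_vpage: int                      # virtual page number within the space
    pde_phys: int                       # physical page number
    pde_next: Optional["Hpde"] = None   # next entry on the collision chain


class Pdir:
    def __init__(self, nbuckets=8):
        self.buckets = [None] * nbuckets

    def _hash(self, space, vpage):
        # Placeholder hash; the real function is implementation specific.
        return (space ^ vpage) % len(self.buckets)

    def insert(self, space, vpage, phys):
        """Add a translation at the head of its bucket's chain."""
        i = self._hash(space, vpage)
        self.buckets[i] = Hpde(space, vpage, phys, self.buckets[i])

    def lookup(self, space, vpage):
        """Walk the chain; return the physical page number or None on a miss."""
        entry = self.buckets[self._hash(space, vpage)]
        while entry is not None:
            if entry.pde_space == space and entry.pde_vpage == vpage:
                return entry.pde_phys
            entry = entry.pde_next
        return None


pdir = Pdir()
pdir.insert(0, 1, 0x100)
pdir.insert(0, 9, 0x200)    # lands in the same bucket as (0, 1)
print(pdir.lookup(0, 1), pdir.lookup(0, 9), pdir.lookup(0, 2))
```

Comparing both the space ID and the virtual page number on every chain entry is what allows two different virtual pages to share a bucket without being confused.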
Note that the page table is not a purely software construct. On systems that provide hardware for TLB miss handling, this is the table examined by the hardware to attempt to find an appropriate translation to insert in the TLB when resolving a TLB miss fault.
A TLB miss trap occurs when a translation is missing from the translation lookaside buffer (TLB). If the processor can find the missing translation in the PDIR, it installs it in the TLB and allows execution to continue. If not, a page fault occurs.
A page fault is a trap taken when the address needed by a process is missing from the main memory. This occurrence is also known as a PDIR miss. A PDIR miss indicates that the page is either on the free list, in the page cache, or on disk; the memory management system must then find the requested page on the swap device or in the file system and bring it into main memory.
Conversely, a PDIR hit indicates that a translation for the virtual address exists in the PDIR and can be installed in the TLB.
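The order of events on a memory reference (TLB first, then the PDIR, then a page fault) can be summarized in a short sketch. Here the TLB and the PDIR are modeled as plain dictionaries keyed by virtual page number, which ignores TLB capacity, replacement, and access checks; `translate` and `PageFault` are hypothetical names used only for this example.

```python
class PageFault(Exception):
    """Raised on a PDIR miss: the page has no translation in memory."""


def translate(vpn, tlb, pdir):
    """Return the physical page number for vpn, filling the TLB on a miss."""
    if vpn in tlb:                 # TLB hit: no further work needed
        return tlb[vpn]
    if vpn in pdir:                # TLB miss, PDIR hit: install and continue
        tlb[vpn] = pdir[vpn]
        return tlb[vpn]
    raise PageFault(vpn)           # PDIR miss: the page must be faulted in


tlb = {}
pdir = {(0, 0x4): 0x10}            # (space ID, virtual page) -> physical page

print(translate((0, 0x4), tlb, pdir))   # TLB miss resolved from the PDIR
print((0, 0x4) in tlb)                  # the translation is now cached
try:
    translate((0, 0x5), tlb, pdir)
except PageFault as fault:
    print("page fault for", fault.args[0])
```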
Each PDE contains information on the virtual-to-physical address translation, along with other information necessary for the management of each page of virtual memory.
PA-RISC 1.1 and PA-RISC 2.0 systems use different hashed page directory
entry structures, with mostly similar field names and purposes.
The following table combines the structural elements of the PA-RISC 1.1
hashed page directory entry (struct hpde) and the PA-RISC 2.0
hashed page directory entry (struct hpde2_0).
struct hpde and struct hpde2_0, the Hashed Page Directory

| Element | PA-RISC Version | Meaning |
|---|---|---|
| pde_valid | PA-RISC 1.1 | Flag set by the kernel to indicate a valid pde entry. |
| pde_invalid | PA-RISC 2.0 | Flag set by the kernel to indicate an invalid pde entry. |
| pde_vpage | both | Virtual page - the virtual offset divided by 4096. |
| pde_space | both | Contains the complete virtual space ID. |
| pde_rtrap | both | Data reference trap enable bit; when set, any access to the page causes a page reference trap interruption. |
| pde_dirty | both | Dirty bit; marked if the page differs in memory from what is on disk. |
| pde_dbrk | both | Data break; used by the TLB. |
| pde_ar | both | Access rights; used by the TLB.(1) |
| pde_uncache | both | Uncache bit. |
| pde_order | PA-RISC 2.0 | Strong ordering bit. |
| pde_br_predict | PA-RISC 2.0 | Branch prediction bit. |
| pde_ref_trickle | both | Trickle-up bit for references. Used with pde_ref on systems whose hardware can search the htbl directly. |
| pde_block_mapped | both | Block mapping flag; indicates page is mapped by block TLB and cannot be aliased. |
| pde_executed | both | Used by the stingy cache flush algorithm to indicate that page is referenced as text(2). |
| pde_ref | both | Reference bit set by the kernel when it receives certain interrupts; used by vhand to tell if a page has been used recently. |
| pde_accessed | both | Used by the stingy cache flush algorithm to indicate that the page may be in data cache. |
| pde_modified | both | Indicator to the high-level virtual memory routines as to whether the page has been modified since last written to a swap device. |
| pde_uip | both | Lock flag used by trap-handling code. |
| pde_protid | both | Protection ID, used by the TLB. |
| pde_os | PA-RISC 2.0 | Entry in use. |
| pde_alias | both | Virtual alias field. If set, the pde has been allocated from elsewhere in kernel memory, rather than as a member of the sparse PDIR. |
| pde_wx_demote | PA-RISC 2.0 (64-bit kernels only) | User space fic. |
| pde_phys | PA-RISC 1.1 | Physical page number; the physical memory address divided by the page size (4096 bytes). |
| pde_phys_u | PA-RISC 2.0 | Physical page number: most significant 25 bits. |
| pde_phys | PA-RISC 2.0 | Physical page number: least significant 27 bits (the physical address divided by the page size). |
| var_page | PA-RISC 2.0 | Page size. |
| pde_next | both | Pointer to next entry, or null if end of list. |
(1) For detailed information on access rights, see the
PA-RISC 2.0 Architectural reference,
chapter 3, "Addressing and Access Control."
For information about how programs can manipulate this field, see
mmap(2) and mprotect(2) manpages.
(2) Stingy cache flush is a performance enhancement by which the kernel recognizes whether or not to flush the cache.
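On PA-RISC 2.0 the table splits the physical page number across pde_phys_u (most significant 25 bits) and pde_phys (least significant 27 bits), which together give the 52-bit PPN described earlier. The sketch below assumes the two fields are simply concatenated in that order; this is an inference from the table, and the helper names are hypothetical.

```python
PHYS_LOW_BITS = 27     # width of pde_phys on PA-RISC 2.0
PHYS_HIGH_BITS = 25    # width of pde_phys_u
PAGE_SHIFT = 12        # 4096-byte pages


def ppn_from_hpde2_0(pde_phys_u, pde_phys):
    """Assumed reassembly of the 52-bit PPN from its two fields."""
    if pde_phys_u >> PHYS_HIGH_BITS or pde_phys >> PHYS_LOW_BITS:
        raise ValueError("field value wider than its bit field")
    return (pde_phys_u << PHYS_LOW_BITS) | pde_phys


def physical_address(ppn, page_offset):
    """Concatenate the PPN with a 12-bit page offset."""
    return (ppn << PAGE_SHIFT) | (page_offset & 0xFFF)


ppn = ppn_from_hpde2_0(0x0, 0x4)
print(hex(physical_address(ppn, 0x873)))    # prints: 0x4873
```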
Cache is fast, associative memory on the processor module that stores recently accessed instructions and data. From it, the processor learns whether it has immediate access to data or needs to go out to (slower) main memory for it.
Cacheable data going to the CPU from main memory passes through the cache. Conversely, the cache serves as the means by which the CPU passes data to and from main memory. Cache reduces the time required for the CPU to access data by maintaining a copy of the data and instructions most recently requested.
A cache improves system performance because most memory accesses are to addresses that are very close to or the same as previously accessed addresses. The cache takes advantage of this property by bringing into cache a block of data whenever the CPU requests an address. Although the hit rate depends on the size of the cache, its associativity, and the workload, performance measurements show that the vast majority of the time the cache already holds the data the CPU requests next.
Depending on model, PA-RISC processors are equipped with either a unified cache or separate caches for instructions and data (for better locality and faster performance). In multiprocessing systems, each processor has its own cache, and a cache controller maintains consistency.
Cache memory itself is organized as follows:
Cache Tag
+---------------------------+-+-+--------------------+ /|\
| |v|d| | |
| |a|i| | |
|Physical Page Number (PPN) |l|r| Tag Parity Bits | |
| |i|t| | |
| |d|y| | |
+---------------------------+-+-+--------------------+ |
|Cache
Cache Line |entry
+----------------------------------+-----------------+ |
| | | |
| | | |
| Data words |Data parity bits | |
| | | |
| | | |
+----------------------------------+-----------------+ \|/
When a process executes, it stores its code (text) and data in processor registers for referencing. If the data or code is not present in the registers, the CPU supplies the virtual address of the desired data to the TLB and to the cache controller. Depending on implementation, caches can be direct mapped, set associative, or fully associative. Recent PA implementations use direct-mapped caches and fully associative TLBs. Virtual addresses can be sent in parallel to the TLB and cache because the cache is virtually indexed.
A physical page may not be referenced by more than one virtual page, and a virtual address cannot translate to two different physical addresses; that is, PA-RISC does not support hardware address aliasing, although HP-UX implements software address aliasing for text only in EXEC_MAGIC executables.
The cache controller uses the low-order bits of the virtual address to index into the direct-mapped cache. Each index in the cache finds a cache tag containing a physical page number (PPN) and a cache line of data. If the cache controller finds an entry at the cache location, the cache line is checked to see whether it is the right one by looking at the PPN in the cache tag and the one returned by the TLB, because blocks from many different locations in main memory can be mapped legitimately to a given cache location. If the data is not in cache but the page is translated, the resultant data cache miss is handled completely by the hardware. A TLB miss occurs if the page is not translated in the TLB; if the translation is also not in the PDIR, HP-UX uses the page fault code to fault it in. If not in RAM, the data and code might have to be paged from disk, in which case the disk-to-memory transaction must be performed.
+---------------------------------+
|+-------+ processor |
|| CPU | |
|+-------+ |
| | : virtual address | +------------------+
| | :..................... | | RAM |
| | V V | | |
|+-------+ +-------+ | |page directory |
|| CPU | | TLB | | | +-----+ |
|+-------+ +-------+ | | |-----| |
| | : : | | |-----| |
| | : PPN PPN : | | +-----+ |
| | ....> <.... | | |
+---|-----------------------------+ +------------------+
| bus
===============================================================
|
+--------+
| disk |
+--------+
On a more detailed level, the next figure demonstrates the mapping of virtual and physical address components.
Virtual address
+-----------+-------------+
+--------------------| virtual | offset in |-----------------+
| | page # | page | |
| +-----------+-------------+ |
| |
| Address translation in |
| Translation Lookaside Buffer Physical address in Cache |
| +-------------+-------------+ +-------------+---------+ |
+->| Virtual | Physical |----->| Physical | Offset |<-+
| page number | page number | +->| page number | in page |<-+
+-------------+-------------+ | +-------------+---------+ |
| |
| Physical address in RAM |
| +-------------+---------+ |
+->| Physical | Offset |<-+
| page number | in page |
+-------------+---------+
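The relationship in the figure can be made concrete with a short C sketch. It assumes the 4096-byte page size stated earlier and ignores the space ID portion of a PA-RISC virtual address; the macro and function names are illustrative and are not HP-UX kernel identifiers. The offset within the page passes through translation unchanged, and only the page number is replaced.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u                    /* log2 of the 4096-byte page size */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1u)

/* Virtual page number: the address with the in-page offset removed. */
static uint64_t vpn_of(uint64_t vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

/* Offset in page: identical in the virtual and the physical address. */
static uint64_t offset_of(uint64_t vaddr)
{
    return vaddr & PAGE_MASK;
}

/* Physical address: the translated page number joined with the same offset. */
static uint64_t phys_addr(uint64_t ppn, uint64_t vaddr)
{
    return (ppn << PAGE_SHIFT) | offset_of(vaddr);
}
```

For example, if virtual page 0x12345 translates to physical page 0xABC, the virtual address 0x12345678 maps to the physical address 0xABC678.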
The sequence followed by the processor as it validates addresses is one of "hit or miss".
In addition to assisting in virtual address translation, the translation lookaside buffer (TLB) serves a security function on behalf of the processor, by controlling access and ensuring that a user process sees only data for which it has privilege rights.
The TLB contains access rights and protection identifiers. PA-RISC 2.0 allows up to eight protection IDs to be associated with each process. These IDs are held in control registers CR-8, CR-9, CR-12, and CR-13 (2 per register). (PA-RISC 1.1 only allows four protection IDs to be associated with each process.)
| Security check | Purpose |
|---|---|
| Protection Checks |
The P-bit (Protection ID Validation Enable bit) of
the Processor Status Word (PSW) is checked:
|
| Access Rights Check |
Access Rights are stored in a seven-bit field
containing permissible access type and two
privilege levels affecting the executing instruction:
|
The following figure shows the checkpoints for controlling access to a page of data through the TLB. Two checks are performed: the protection check and the access rights check. If both checks pass, access is granted to the page referenced by the TLB.
Control Registers
+-----------------+
| | TLB Entry
CR 8|Protection ID 1,2|-+ +---------------+
CR 9|Protection ID 3,4| | | |
| | +-+ +---------------+
CR 12|Protection ID 5,6| | | +----------------| Access ID |
CR 13|Protection ID 7,8|-+ | | +---------------+
| | | | +--| Access Rights |
+-----------------+ | | | +---------------+
| | Type of | | |
PSW | | Access | +---------------+
+-------+-+---+ | | | |
| |P| | | | / \ | IA Queue
+-------+-+---+ | | / \ | +------+--+
| | | / \ | | +---------+--+
+---------------+ | | | | | +-| | |
V V V V V V +---------+--+
+------------+ +--------+ |
| | | Access |<-------------+
| Protection | | Rights |
| Check | | Check |
+------------+ +--------+
| |
+---+ +---+
| |
V V
+-------------+
| Both Checks |
| Passed? |
+-------------+
|
V
Access Granted
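The two checks in the figure can be modeled with a small C sketch. This is a deliberately simplified model, not the PA-RISC encoding: the structure, its field names, and the reduction of the seven-bit access rights field to a writable flag plus two privilege levels are assumptions made for illustration. Details such as write-disable bits and execute or gateway access types are omitted. Privilege level 0 is the most privileged.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PIDS 8          /* eight protection IDs on PA-RISC 2.0, per the text */

enum access_kind { ACCESS_READ, ACCESS_WRITE };

/* Simplified TLB entry; the fields are illustrative, not the hardware format. */
struct tlb_entry_sketch {
    uint32_t access_id;     /* compared with the protection IDs */
    bool     writable;      /* stands in for the access-type part of the rights */
    unsigned read_level;    /* least privileged level that may read */
    unsigned write_level;   /* least privileged level that may write */
};

/* Protection check: skipped when the PSW P-bit is clear; otherwise the
 * entry's access ID must match one of the protection IDs. */
static bool protection_check(bool psw_p_bit, const uint32_t pids[NUM_PIDS],
                             uint32_t access_id)
{
    if (!psw_p_bit)
        return true;
    for (int i = 0; i < NUM_PIDS; i++)
        if (pids[i] == access_id)
            return true;
    return false;
}

/* Access rights check: the current level must be numerically no greater
 * than the level the entry requires for this kind of access. */
static bool access_rights_check(const struct tlb_entry_sketch *e,
                                enum access_kind kind, unsigned priv_level)
{
    if (kind == ACCESS_READ)
        return priv_level <= e->read_level;
    return e->writable && priv_level <= e->write_level;
}

/* Access is granted only if both checks pass. */
static bool access_granted(bool psw_p_bit, const uint32_t pids[NUM_PIDS],
                           const struct tlb_entry_sketch *e,
                           enum access_kind kind, unsigned priv_level)
{
    return protection_check(psw_p_bit, pids, e->access_id) &&
           access_rights_check(e, kind, priv_level);
}
```

In this model a user-level read of a page whose access ID matches one of the eight protection IDs succeeds, while a write to the same page fails if the entry reserves writing for level 0.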
If the two PPNs do not match (assuming a TLB hit), the cache line is loaded because the bytes referenced on the virtual page are not yet in the cache. The time it takes to service a cache miss depends on whether the line already present in the cache is clean or dirty. If the line is "clean" (that is, not modified), it does not have to be written back to main memory before the new contents are read in. If the line is dirty, the old contents must first be written out to memory, so the penalty is more instruction cycles than for a clean line.
Page found in PDIR (deposit in TLB)
+-----------+-------------------------+
V | +--|---+
+--------+ V | | |
+->| hashes | +-----+ TLB miss | [ ] | Not Found
| +--------+ | |------------------>| |-----------+
| /| | | TLB | TLB Hit +------+ |
| | VPN------>| |-----------+ PDIR V
| | | +-----+ |PPN s/w
| | | | (cache line) handler
| +- | -----------+------------ | -------------------+
| | | | |
| | | V |
| | V / \ |
CPU | | +-------+ PPN / \ No/Cache Miss +-----+
requests| +------->| Cache |------> =? -------------->| |
virtual | +-------+ \ / +-----+
address | \ / RAM
| |Yes/cache hit
| Return data to |
+-----+ CPU from cache |
| CPU |<-----------------------------+
+-----+
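The cache side of the sequence above can be sketched in C. This is a minimal software model rather than hardware or kernel code: the line size, the number of lines, and all type and function names are assumptions chosen for illustration. It shows a virtually indexed, physically tagged, direct-mapped lookup, in which the low-order virtual address bits select a line and the PPN in the tag is compared with the PPN supplied by the TLB.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 32u             /* assumed cache line size in bytes */
#define NUM_LINES 1024u           /* assumed number of lines in the cache */

/* One cache entry: the tag (PPN, valid, dirty) plus the line of data. */
struct cache_line_sketch {
    uint32_t ppn;                 /* physical page number held in the tag */
    bool     valid;
    bool     dirty;
    uint8_t  data[LINE_SIZE];
};

enum lookup_result { CACHE_HIT, CACHE_MISS_CLEAN, CACHE_MISS_DIRTY };

/* Index with the low-order bits of the virtual address (virtually indexed). */
static uint32_t cache_index(uint32_t vaddr)
{
    return (vaddr / LINE_SIZE) % NUM_LINES;
}

/* Compare the tag's PPN with the PPN returned by the TLB (physically tagged). */
static enum lookup_result cache_lookup(const struct cache_line_sketch *cache,
                                       uint32_t vaddr, uint32_t tlb_ppn)
{
    const struct cache_line_sketch *e = &cache[cache_index(vaddr)];

    if (e->valid && e->ppn == tlb_ppn)
        return CACHE_HIT;
    /* On a miss, a valid dirty line must be written back before the refill. */
    return (e->valid && e->dirty) ? CACHE_MISS_DIRTY : CACHE_MISS_CLEAN;
}
```

The three results correspond to the cases in the text: a hit returns data from the cache, a clean miss only reads the new line from memory, and a dirty miss pays for a write-back as well.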
Registers, high-speed memory in the processor's CPU, are used by the software as storage elements that hold data for instruction control flow, computations, interruption processing, protection mechanisms, and virtual memory management.
All computations are performed between registers or between a register and a constant (embedded in an instruction), which minimizes the need to access main memory or code. This register-intensive approach accelerates performance of a PA-RISC system. This memory is much faster than conventional main memory but it is also much more expensive, and therefore used for processor-specific purposes.
Registers are classified as privileged or non-privileged, depending on the privilege level of the instruction being executed.
| Type of Register | Purpose |
|---|---|
| 32 General Registers, each 64 bits in size (non-privileged) |
Used to hold immediate results or data that
is accessed frequently, such as the passing
of parameters. Listed are those with uses
specified by PA-RISC or HP-UX.
|
| 7 Shadow Registers (privileged) | Store contents of GR1,8,9,16,17,24, and 25 on interrupt, so that they can be restored on return from interrupt. Numbered SHR0-SHR6. |
| 8 Space Registers (SR5-SR7 are privileged) |
Hold the space IDs for the current running process.
|
| 25 Control Registers (numbered CR0, and CR8 through CR31), each 64 bits. (Most are privileged.) |
Used to reflect different states of the system, many related primarily
to interrupt handling.
|
| 32 Floating Point Registers, 64-bits each, or 64, 32-bits each |
Data registers used to hold computations.
|
| 2 Instruction Address Queues |
Two queues 2 elements deep.
The front elements of the queues (IASQ_Front and IAOQ_Front) form the
virtual address of the current instruction, while the back
elements (IASQ_Back and IAOQ_Back) contain the address of the
following instruction.
|
| 1 Processor Status Word (PSW), 64 bits (privileged) | Contains the current processor state. When an interruption occurs, the PSW is saved into the Interrupt Processor Status Word (IPSW), to be restored later. Low-order five bits of the PSW are the system mask, and are defined as mask/unmask or enable/disable. Interrupts disabled by PSW bit are ignored by the processor; interrupts masked remain pending until unmasked. |
uarea vas
+-------+ +---------------->+-----+
| | | +--------->| |<--------------+
+-------+ proc | | pregion +-----+ |
|u_procp|---->+-----+ | +->+-----+<->+--+<->+--+<->+--+<-+
+-------+ | | | | | | | | | | |
| | +-----+ | +-----+ +--+ +--+ +--+
+-------+ |p_vas|-+ |p_reg|--+
+-----+ +-----+ |
| | | | |
+-----+ +-----+ |
Process resources |
=========================================|============================
System resources | region
+--->+------+
| |
+------+ broot
|r_root|--->+------+
+------+ | |
chunk | | +------+
+-----+<----+ +------+ +-|b_root|
| | | | +------+
+-----+ | +--+<---------+ | |
hpde RAM <--| vfd | | B-tree | | +------+
+--------+ | dbd | | +--+
| | /|\ +-----+ | / | \
+--------+ | | | | V V \|
|pde_phys|--+ | | | +--+ +--+ +--+
+--------+ +-----+ | | | | | | |
| | | +--+ +--+ +--+
+--------+ | / | \
| |/ V \|
| +--+ +--+ +--+
+---| | | | | |
+--+ +--+ +--+
Process management uses kernel structures down to the pregions to
execute the threads of a process.
The uarea, proc structure,
vas, and pregion are
per-process resources, because each process has its own unique copies of
these structures, which are not shared among multiple processes.
Below the pregion level are the systemwide resources. These
structures can be shared among multiple processes (although they are
not required to be shared).
Memory management kernel structures map pregions to physical memory and provide support for the processor's ability to translate virtual addresses to physical memory. The table that follows introduces the structures involved in memory management; these are discussed later in detail.
| Kernel structure | Purpose |
|---|---|
vas |
Keeps track of the structural elements associated with a
process in memory. One vas maintained per process.
|
pregion |
A per-process resource that describes the regions attached to the process. |
region |
A memory-resident system resource that can be shared
among processes. Points to the process's B-tree,
vnode, pregions.
|
B-tree |
Balanced tree that stores pairs of page indices and
chunk addresses. At the root of a B-tree of VFDs and
DBDs is struct broot.
|
hpde |
Contains information for virtual to physical translation
(that is, from VFD to physical memory).
|
Virtual Address Space (vas)
The vas represents the virtual address space of a process
and serves as
the head of a doubly linked list of process region data structures called
pregions.
The vas data structure is always memory resident.
When a process is created, the system allocates a vas structure and puts
its address in p_vas,
a field in the proc structure.
The virtual address space of a process is broken down into logical chunks
of virtually contiguous pages. (See the Process Management white paper
for a table of vas entries.)
pregion
Each pregion represents a process's view
of a particular portion of
its virtual address space and information on getting to those pages.
The pregion points to
the region data structure
that describes the pages' physical locations in
memory or in secondary storage.
The pregion also contains the virtual
addresses to which the process's pages are mapped, the page usage (text,
data, stack, and so forth), and page protections (read, write, execute, and
so on).
pregion
+---------+
+------------->| vas |<-------------+
| +---------- |
| /| |\ |
| / \ |
V |/ \| V
+---------+ +---------+ +---------+ +---------+
| pregion |<->| pregion |<->| pregion |<->| pregion |
+---------+ +---------+ +---------+ +---------+
/|\
|
V
+---------+
| region |
+---------+
The following elements of a per-process pregion structure are
important to the virtual memory subsystem.
struct pregion
| Element | Purpose |
|---|---|
p_type |
Type of pregion.
|
*p_reg |
Pointer to the region attached by the pregion.
|
p_space, p_vaddr |
Virtual address of the pregion, based on virtual
space and virtual offset.
|
p_off |
Offset into the region, specified in pages.
|
p_count |
Number of pages mapped by the pregion.
|
p_ageremain, p_agescan,
p_stealscan, p_bestnice |
Used in the vhand algorithm to age and steal
pages of memory (discussed later).
|
*p_vas |
Pointer to the vas to which the pregion is linked.
|
p_forw, p_back |
The doubly-linked list, used by vhand to walk the active pregions.
|
p_deactsleep |
The address at which a deactivated process is sleeping. |
p_pagein |
Size of an I/O, used for scheduling when moving data into memory. |
p_strength, p_nextfault |
Used to track the ratio between sequential and
random faults; used to adjust p_pagein.
|
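As an illustration of how p_space, p_vaddr, p_off, and p_count fit together, the following C sketch maps a virtual address to a page index within the attached region. It is not the kernel's code: the structure is a cut-down stand-in for struct pregion, the field types are guesses, and treating p_vaddr as a page-aligned byte address is an assumption.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Cut-down stand-in for struct pregion; types are assumed. */
struct pregion_sketch {
    uint32_t p_space;   /* virtual space of the pregion */
    uint64_t p_vaddr;   /* virtual offset (assumed page-aligned byte address) */
    uint64_t p_off;     /* offset into the region, in pages */
    uint64_t p_count;   /* number of pages mapped by the pregion */
};

/* Returns the page index within the region, or -1 if the address
 * is not mapped by this pregion. */
static int64_t region_page_index(const struct pregion_sketch *prp,
                                 uint32_t space, uint64_t vaddr)
{
    if (space != prp->p_space || vaddr < prp->p_vaddr)
        return -1;

    uint64_t page = (vaddr - prp->p_vaddr) / PAGE_SIZE;
    if (page >= prp->p_count)
        return -1;

    return (int64_t)(prp->p_off + page);
}
```

A pregion with p_off of 2 and p_count of 3 therefore covers region pages 2 through 4, and any address past its third page is rejected.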
The region is a system-wide kernel data structure that associates groups
of pages with a given process. Regions can be one of two types, private
(used by a single process) or shared (able to be used by more than one
process). Space for a region data structure is allocated as needed. The
region structure is never written to a swap device,
although its B-tree
may be.
Regions are pointed to by pregions, which are a per-process resource.
Regions point to the vnode
where the blocks of data reside when not in
memory.
struct region
| Element | Meaning |
|---|---|
r_flags |
Region flags (enumerated shortly). |
r_type |
|
r_pgsz |
Size of region in pages (not just those presently in memory). |
r_nvalid |
Number of valid pages in region.
This equals the number of valid vfds
in the B-tree or b_chunk.
|
r_dnvalid |
Number of pages in swapped region. If the
system swaps the entire process, the value of
r_nvalid is copied here to later calculate how
many pages the process will need when it faults
back in. This information is used to decide
which process to reactivate.
|
r_swalloc |
Total number of pages reserved and allocated for
this region on the swap device. Does not account
for swap space allocated for vfd/dbd pairs.
|
r_swapmem, r_vfd_swapmem |
Memory reserved for pseudo-swap or vfd swap.
|
r_lockmem |
Number of pages currently allocated to the
region for lockable memory, including lockable
memory allocated for vfd/dbd pairs.
|
r_pswapf, r_pswapb |
Forward and backward pointers to list of regions using pseudo-swap pages
(pswaplist).
|
r_refcnt |
Number of pregions pointing at the region
|
r_zomb |
Set to indicate modified text. If an executing a.out file on a remote
system has changed, the pages are flushed from the processor's cache,
causing the next attempted access to fault. The fault handler finds
that r_zomb is non-zero,
prints the message Pid %d killed due to
text modification or page I/O error
and sends the process a SIGKILL.
|
r_off |
Offset into the page-aligned vnode, specified in pages; valid only if
RF_UNALIGNED is not set.
Page r_off of the vnode is referenced by the
first entry of the first chunk of the region's B-tree.
|
r_incore |
Number of pregions sharing the region whose
associated processes have the SLOAD flag set.
|
r_dbd |
Disk block descriptor for B-tree pages written
to a swap device. Specifies the location of the
first page; the pages are stored together in a contiguous area of swap space.
|
r_fstore, r_bstore |
Pointers to vnode of origin and destination of
block. This data depends on the type of
pregion above the region. In most cases,
r_bstore is set to the paging system vnode, the
global swapdev_vp that is initialized at system
startup.
|
r_forw, r_back |
Pointers to linked list of all active regions.
|
r_lock |
Region lock structure used to get read or read/write locks to modify the region structure. |
r_mlock |
Lock used to serialize mlock operations on this region.
|
r_poip |
Number of page I/Os in progress. |
r_root |
Root of B-tree; if referencing more than one
chunk, r_key is set to DONTUSE_IDX.
|
r_key, r_chunk |
Used instead of B-tree search (r_root)
if only a single chunk
of vfddbds is needed (referencing 32 or fewer pages on a 32-bit
kernel, or 64 or fewer pages on a 64-bit kernel).
|
r_next, r_prev |
Circularly linked list of all regions sharing vnode.
|
r_preg_un |
pregion(s) pointing to the region.
|
r_excproc |
Pointer to the proc table entry, if the process
has RF_EXCLUSIVE set in r_flags.
|
r_lchain |
Linked list of memory lock ranges. |
r_mlockswap |
Swap reserved to cover locks. |
r_pgszhint |
Page size hint. |
r_hdl |
Hardware-dependent layer structure. |
a.out Support for Unaligned Pages
Text and data of most executables start on a four-kilobyte page boundary. HP-UX can treat these as memory-mapped files, because a page in the file maps directly to a page in memory.
In addition to the fields shown above,
struct region has fields to support
executables compiled on older versions of HP-UX whose text and data do
not align on a (4 KB) page boundary. These executables are referenced by
regions whose r_flags has RF_UNALIGNED set.
a.out Support by Regions
| Element | Meaning |
|---|---|
r_byte, r_bytelen |
Offset into the a.out file and length of its text.
|
r_hchain |
Hash list of unaligned regions. |
r_flags.
Here are some of the possible flag values:
| Region Flag | Meaning |
|---|---|
RF_ALLOC |
Always set because HP-UX regions are allocated and freed on demand; there is no free list. |
RF_UNALIGNED |
Set if text of an executable does not start
on a page boundary. In this case, the text
is read through the buffer cache to align
it, and the vfds are pointed at the buffer
cache pages.
|
RF_WANTLOCK |
Set if a thread wanted to lock a vfd of this
region (to do I/O on the page), but found it already locked and
went to sleep. After the vfd is
unlocked, this flag ensures that
wakeup() is called so the waiting
thread(s) can proceed.
|
RF_HASHED |
The text is unaligned (RF_UNALIGNED)
and thus is on a hash chain. The region is
hashed with r_fstore and r_byte; the
head of each hash chain is in texts[].
The RF_UNALIGNED flag may be set
without the RF_HASHED flag (if the
system tries to get the hashed region but
it is locked, the system will create a
private one), but the RF_HASHED flag will
never be set without the RF_UNALIGNED
flag.
|
RF_EVERSWP, RF_NOWSWP |
Set if the B-tree has ever been or is now
written to a swap device. These flags are
used for debugging.
|
RF_IOMAP |
This region was created with an iomap()
system call, and thus requires special
handling when calling exit().
|
RF_LOCAL |
Remote file using local swap space. |
RF_EXCLUSIVE |
The mapping process is allowed exclusive
access to the region. This flag is set, and
r_excproc is set to the proc table
pointer.
|
RF_STATIC_PREDICT |
Text object uses static branch prediction for compiler optimization. |
RF_ALL_MLOCKED |
Entire region is memory locked. |
RF_SWAPMEM |
Region is using pseudo-swap; that is, a portion of memory is being held for swap use. |
RF_LOCKED_LARGE |
Region is locked using large pages. |
RF_SUPERPAGE_TEXT |
Text region using large pages. |
RF_FLIPPER_DISABLE |
Disable kernel assist prediction; a flag used for performance profiling. |
RF_MPROTECTED |
Some part of the region is subject to the system call mprotect,
which is performed on a memory-mapped file.
|
r_key, r_chunk,
and r_root are used to find information
about the individual pages of a region.
Each page is represented by a vfd (if it's in memory)
or dbd (if it's on disk).
For each page, the vfd and dbd are grouped
together into a struct vfddbd.
By definition, if the vfd's pg_v bit is set,
the vfd is used; if not, the
dbd is used.
Since information is typically needed about groups of (rather than individual) pages, pages are grouped into chunks. A chunk contains 32 or 64 pairs of virtual frame descriptors and disk block descriptors.
Virtual Frame Descriptor (vfd)
A one-word structure called a virtual frame descriptor enables processes
to reference pages of memory.
The vfd is used when the process is in
memory, and can be used to refer to the page of physical memory described in
the pfdat table (pfdat_ptr[],
described
below).
Virtual frame descriptor (vfd)
+----------+---------------------+
| flags | page frame number |
+----------+---------------------+
11 31
struct vfd
| Element | Meaning |
|---|---|
pg_v |
Valid flag. If set, this page of memory contains valid
data and pg_pfnum is valid. If not set, the page's
valid data is on a swap device.
|
pg_cw |
Copy-on-write flag. If set, a write to the page causes a data protection fault, at which time the system copies the page. |
pg_lock |
Lock flag. If set, raw I/O is occurring on this page. Either the data is being transferred between the page and the disk, or data is being transferred between two memory pages. The kernel sleeps waiting for completion of I/O before launching further raw I/O to or from this page. Nothing can read the page while it is being written to disk. |
pg_mlock |
If set, the page is locked in memory and cannot be paged out. |
pg_pfnum (aliased as pg_pfn) |
Page frame number, from which can be accessed the
correct pfdat entry for this page.
|
Disk Block Descriptor (dbd)
When the pg_v bit in a vfd is not set,
the vfd is invalid and the page of
data is not in memory but on disk. In this case, the disk block descriptor
(dbd) gives valid reference to the data.
Like the vfd structure, the dbd
is one word long.
Disk block descriptor (dbd)
+----+---------------------------+
|type| data                      |
+----+---------------------------+
 0  3                          31
struct dbd
| Element | Meaning |
|---|---|
dbd_type |
Type of data:
|
dbd_data |
vnode type (jfs, nfs,
ufs, swap space) specific data.
Used by the file system (or swap space management) code to find the data
in a file pointed to by a vnode.
|
(1) When the dbd_type is DBD_FSTORE,
it means that the page
of data resides in the file pointed to by r_fstore (typically a
file system). When the dbd_type is DBD_BSTORE,
the page of
data resides in the file or device pointed to by r_bstore
(typically a swap device).
A one-to-one correspondence is maintained between vfd
and dbd
through the vfddbd structure,
which simply contains one vfd (c_vfd)
and one dbd (c_dbd).
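The pairing of vfd and dbd, and the rule that pg_v selects between them, can be sketched in C as follows. The bit-field widths are read off the figures above (11 flag bits and a 21-bit page frame number in the vfd; a 4-bit type and 28 bits of data in the dbd) and are not verified against the kernel headers. Only the four flags named in the table are modeled, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Stand-in for struct vfd: named flags plus the page frame number. */
struct vfd_sketch {
    unsigned pg_v     : 1;   /* valid: page is in memory, pg_pfnum usable */
    unsigned pg_cw    : 1;   /* copy-on-write */
    unsigned pg_lock  : 1;   /* raw I/O in progress on this page */
    unsigned pg_mlock : 1;   /* locked in memory */
    unsigned pg_pfnum : 21;  /* page frame number */
};

/* Stand-in for struct dbd. */
struct dbd_sketch {
    unsigned dbd_type : 4;   /* type of data */
    unsigned dbd_data : 28;  /* vnode-type-specific data */
};

/* One vfd and one dbd per page, kept together. */
struct vfddbd_sketch {
    struct vfd_sketch c_vfd;
    struct dbd_sketch c_dbd;
};

enum page_where { PAGE_IN_MEMORY, PAGE_ON_DISK };

/* If pg_v is set the vfd is used; otherwise the dbd is used. */
static enum page_where locate_page(const struct vfddbd_sketch *vd, uint32_t *out)
{
    if (vd->c_vfd.pg_v) {
        *out = vd->c_vfd.pg_pfnum;
        return PAGE_IN_MEMORY;
    }
    *out = vd->c_dbd.dbd_data;
    return PAGE_ON_DISK;
}
```

The caller receives either a page frame number (to index the pfdat table) or the dbd data (to hand to the file system or swap code), depending on the valid bit.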
[Figure: chunks of vfddbds; each vfddbd in a chunk pairs one vfd with one dbd]
HP-UX regions use chunks of vfds and dbds
to keep track of page
ownership:
Each region contains either a single array of vfddbds
(a chunk)
or a
pointer to a B-tree.
The structure called a B-tree allows for quick
searches and efficient storage of sparse data.
A bnode is the same size
as a chunk; both can be obtained from the same source of memory.
The
region's B-tree stores pairs of page indices and chunk addresses. HP-UX
uses an order 29 B-tree.
A B-tree is searched with a key and yields a value.
In the region
B-tree, the key is the page number in the region divided by
the
number of vfddbds in a chunk.
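The key calculation amounts to one integer division. The sketch below also computes the position of the page inside its chunk as the remainder; that second step is an assumption that follows naturally from the division but is not stated in the text. The type and function names are hypothetical.

```c
#include <stdint.h>

struct chunk_pos {
    uint32_t key;    /* B-tree key: which chunk holds the page */
    uint32_t slot;   /* assumed: index of the vfddbd within that chunk */
};

/* per_chunk is the number of vfddbds in a chunk:
 * 32 on a 32-bit kernel, 64 on a 64-bit kernel. */
static struct chunk_pos chunk_position(uint32_t page, uint32_t per_chunk)
{
    struct chunk_pos pos = { page / per_chunk, page % per_chunk };
    return pos;
}
```

With 32 entries per chunk, page 70 of a region has key 2 and falls in slot 6 of that chunk.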
B-tree (order = 3, depth = 3)
++-+-+-+-++
||9| | | ||
+++++++++++
| | | | | |
+-+-+-+-+-+
| |
+-----+ +-----+
| |
V V
++-+-+-+-++ ++-+--+-+-++
||4|7| | || ||9|11| | ||
+++++++++++ +++++-++++++
| | | | | | | | | | | |
+-+-+-+-+-+ +-+-+--+-+-+
| | | | |
+-------------------+ | | | |
| +---------+ | | +---------+
| | | | |
V V V V V
++-+-+-+-++ ++-+-+-+-++ ++-+-+-+-++ ++-+--+-+-++ ++--+--+-+-++
||1|3| | || ||4|6| | || ||7|8| | || ||9|10| | || ||11|12| | ||
+++++++++++ +++++++++++ +++++++++++ +++++-++++++ +++-++-++++++
| |G|H| | | | |D|E| | | | |J|I| | | | |F| B| | | | | C| A| | |
+-+-+-+-+-+ +-+-+-+-+-+ +-+-+-+-+-+ +-+-+--+-+-+ +-+--+--+-+-+
Each node of a B-tree contains room for order+1 keys (or index
numbers) and order+2 values. If a node grows to contain more than
order keys, it is split into two nodes; half of the pairs are kept in the
original node and the other half are copied to the new node.
The B-tree
node data also includes the number of valid elements contained in that
node.
B-tree Node Description (struct bnode)
| Element | Meaning |
|---|---|
b_key[B_SIZE] |
The array of keys used for each page index of the bnode.
|
b_nelem |
Number of valid keys/values in the bnode.
|
b_down[B_SIZE+1] |
The array of values in the bnode,
either pointers to another bnode
(if this is an interior bnode)
or pointers to chunks (if this is a leaf
bnode).
|
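A lookup over bnodes might look like the following C sketch. The descent rule is inferred from the order-3 figure above, not taken from kernel source: at each node, count the keys less than or equal to the search key and follow that slot of b_down. In a leaf, this means the value for b_key[i] sits in b_down[i+1], as the figure's leaf rows suggest. The B_SIZE value, the int key type, and the function name are assumptions.

```c
#include <stddef.h>

#define B_SIZE 4   /* small illustrative size; the real value is not given here */

/* Stand-in for struct bnode. */
struct bnode_sketch {
    int   b_key[B_SIZE];        /* keys (page index / vfddbds per chunk) */
    int   b_nelem;              /* number of valid keys */
    void *b_down[B_SIZE + 1];   /* child bnodes, or chunks in a leaf */
};

/* Search a tree of the given depth (1 means the root is a leaf).
 * Returns the chunk pointer for key, or NULL if the key is absent. */
static void *btree_find(const struct bnode_sketch *root, int depth, int key)
{
    const struct bnode_sketch *node = root;

    for (int level = 1; level <= depth && node != NULL; level++) {
        int i = 0;

        /* i becomes the number of keys that are <= the search key. */
        while (i < node->b_nelem && node->b_key[i] <= key)
            i++;

        if (level == depth) {
            /* Leaf: require an exact match on the last key passed. */
            if (i == 0 || node->b_key[i - 1] != key)
                return NULL;
            return node->b_down[i];
        }
        node = node->b_down[i];  /* interior: descend */
    }
    return NULL;
}
```

The b_depth field of struct broot would supply the depth argument, and b_root the starting node.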
The B-tree's struct broot points to the start of the B-tree.
struct broot
| Element | Meaning |
|---|---|
b_root |
Pointer to the initial point of the B-tree.
|
b_depth |
Number of levels in the B-tree
|
b_npages |
Pages used to construct the B-tree, counting both
pages used for chunks and bnodes.
|
b_rpages |
Number of swap pages
reserved for the B-tree by the kernel, using the
routine grow_vfdpgs(). Amount of swap allocated for
the vfd/dbd pairs in the B-tree structure.
|
b_list |
Pointer to a linked list of memory pages used for bnodes
or chunks in this region. The first page in this list
usually has free space available (if b_nfrag is non-zero).
New bnodes or chunks
can be allocated from here and added to the B-tree.
|
b_nfrag |
Number of chunks available (not yet allocated) in b_list.
Since chunks are allocated from the end of the page, this is also
the index of the most recently allocated chunk in the page (decrement
it to get the next available one).
|
b_rp |
Pointer to the region using the B-tree.
|
b_protoidx, b_proto1,
b_proto2 |
Two prototype dbd values,
and the page index at which we switch from
b_proto1 to b_proto2.
This is used to
minimize time and memory costs when allocating chunk space.
|
b_vproto |
List of page ranges which are copy on write. This allows pages to be set
copy on write without having to immediately allocate the actual
B-tree entries.
This is used to determine the vfd prototype.
(See "vfd Prototypes" below.)
|
b_key_cache[], b_val_cache[] |
Caches of most recently used keys and pointers to
chunks associated with the keys; checked first when
looking for a particular struct vfddbd
(before searching the B-tree).
|
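The role of b_key_cache[] and b_val_cache[] can be illustrated with a small C sketch of a key cache that is consulted before the B-tree is searched. The cache size, the use of -1 to mark an empty slot, and the round-robin replacement policy are all assumptions; the text says only that recently used keys and chunk pointers are cached.

```c
#include <stddef.h>

#define KEY_CACHE_SIZE 2      /* assumed; the real size is not given */
#define NO_KEY (-1)           /* assumed marker for an empty slot */

/* Stand-in for the cache fields of struct broot. */
struct broot_cache_sketch {
    int   b_key_cache[KEY_CACHE_SIZE];
    void *b_val_cache[KEY_CACHE_SIZE];
    int   next;               /* next slot to overwrite (round-robin) */
};

static void cache_init(struct broot_cache_sketch *c)
{
    for (int i = 0; i < KEY_CACHE_SIZE; i++) {
        c->b_key_cache[i] = NO_KEY;
        c->b_val_cache[i] = NULL;
    }
    c->next = 0;
}

/* Returns the cached chunk for key, or NULL on a miss
 * (the caller would then search the B-tree). */
static void *cache_get(const struct broot_cache_sketch *c, int key)
{
    for (int i = 0; i < KEY_CACHE_SIZE; i++)
        if (c->b_key_cache[i] == key)
            return c->b_val_cache[i];
    return NULL;
}

/* Remember a key/chunk pair found by a B-tree search. */
static void cache_put(struct broot_cache_sketch *c, int key, void *chunk)
{
    c->b_key_cache[c->next] = key;
    c->b_val_cache[c->next] = chunk;
    c->next = (c->next + 1) % KEY_CACHE_SIZE;
}
```

Because consecutive faults tend to touch pages in the same chunk, even a very small cache of this kind avoids most B-tree descents.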
vfd Prototypes
The
When a file is opened as an a.out or shared library, the easiest way to
keep track of the region is to create a
The
All
Even after all processes using the
The page frame data (
If physical memory addresses always started with page zero, and increased
in a continuous sequence, it would be implemented as a single level
array. (Indeed, it was implemented this way in older HP-UX releases,
as the hardware they ran on had such a continuous address range.)
However, some recent systems have huge gaps in their physical addresses
(e.g. one might have memory from page 0 to page 0x1000, and then from page
0x20000 to 0x21000); a table that represented all addresses would be
much larger than actually needed.
Consequently the first layer (
The
(1) Hashing is done on the tuple
The
The PA-RISC hardware attempts to convert a virtual address to a
physical address by looking in the TLB. If it cannot resolve the
address, it generates a page fault (interrupt type 6 for an instruction
TLB miss fault; interrupt type 15 for a data TLB miss fault). The kernel
must then handle this fault.
PA-RISC uses a hashed page table (
See "The Page Table or PDIR"
above for additional discussion of this table,
and
"The Hashed Page Directory (
NOTE: For historical reasons, the entries of this table
can be referred to as
To find an address in the
As with any hash algorithm, multiple addresses can map to the same
In practice,
Each
HP-UX uses a hashed page directory to translate from virtual to physical
address.
Translations from physical to virtual
use the
Each
The
HP-UX supports software address aliasing on most platforms.
(Whereas the hardware implements address aliasing on
16 MB boundaries, software address aliasing is implemented on a per-page
basis; pages are 4KB apart.)
This is not used as much as it might be in other operating systems;
HP-UX doesn't generally map the same object at multiple virtual addresses.
When a text segment is first translated, it has no alias. However, if a
process or thread attaches to the same text segment, it may require
another translation. Processes sharing text segments do not use aliases.
Only processes with private text segments that share data pages using
When multiple virtual addresses translate to the same physical address,
HP-UX uses alias structures to keep track of them. Aliases for a page
frame (
To locate the
The
The global variable
The number of available alias
The number of available alias structures is kept in
Two computational elements maintain page availability:
NOTE: The
Memory management uses paging thresholds that trigger various paging
activities. The figure shows the full range of available memory and
indicates what paging activity occurs when memory level falls below
each paging threshold.
The value termed
Three tunable paging thresholds are initialized by the
The
When the system boots,
The system wants to keep free memory at
If
The paging thresholds are set as follows:
The routine
In actual implementation,
Using
However, it has the disadvantage of putting all the memory belonging
to a single process together. Thus, when the
It's important to keep an appropriate distance between the hands. Too close,
and pages are stolen that are in fact in regular use.
Too far, and the hands have to move faster to keep the same steal rate;
this means that
The two hands cycle through the active
The
When the age hand arrives at a
The steal hand uses the
How much to age and steal depends on several factors:
Refer to the table that follows for explanations of the vhand variables.
NOTE: None of the variables in the table that follows may be tuned.
Once
Note, the steal hand is moved first to keep it behind the age hand and
prevent aging and stealing a page in the same cycle.
The
NOTE: Deactivation occurs on a per-thread basis.
Deactivation occurs when
Reactivation occurs when the system is no longer low on memory or
thrashing.
Deactivation and reactivation are determined by:
If the system appears to be thrashing or experiencing memory pressure,
the
If the system is not thrashing or experiencing memory pressure, the
Once a process is chosen for deactivation,
The b_vproto field of struct broot contains a list of ranges of pages to be treated as copy-on-write.
This allows pages to be set copy-on-write without their B-tree
entries being allocated immediately.
It is of type struct vfdcw.
When creating vfds, the system checks whether the
page is present in this list to determine which prototype to use.
Table 15
struct vfdcw
| Element | Meaning |
|---|---|
| v_start[MAXVPROTO] | Page that indexes start of copy-on-write range; set to -1 if unused. |
| v_end[MAXVPROTO] | End of copy-on-write range. |
pseudo-vas for Text and Shared Library pregions
A pseudo-vas is created the first time the file is opened as an executable. This is done by calling mapvnode() and storing the vas pointer in the vnode's v_vas element. On subsequent opens of the file as an executable, the non-NULL value in v_vas aids in finding the region to which the virtual address space is being attached.
The pseudo-vas is type PT_MMAP, and the associated pregion has PF_PSEUDO set in p_flags. This pregion is attached to the region for this vnode. All the processes that use this executable or shared library (non-pseudo pregions) then attach to the region with type PT_TEXT (a.out) or PT_MMAP (shared library). The number of processes using a particular vnode as an executable is kept in the pseudo-vas in va_refcnt.
The pregions associated with a region are connected with a doubly-linked list that begins with the region element r_pregs and is linked through the pregions by p_prpnext and its companion back pointer. The list is sorted by p_off, the pregion's offset into the region, and is NULL-terminated.
When the processes using an a.out or shared library exit, the pseudo-vas handle to the region remains; its pages can be disposed of at that time.
Figure 21 Mapping the pseudo-vas Structures
a.out shlib
vnode vnode
+-----+ +---->+-------+ +-----+ +---->+-------+
| | | |pseudo | | | | |pseudo |
+-----+ | +>| vas |<+ +-----+ | +>| vas |<+
|v_vas|-+ | +-------+ | |v_vas|-+ | +-------+ |
+-----+ | | +-----+ | |
| | | +-------+ | | | | +-------+ |
+-----+ +>| MMAP |<+ +-----+ +>| MMAP |<+
.............|pregion| ................|pregion|
+-----------------| | : | |--------+
| : +-------+ : +-------+ |
| : : |
| : proc[n].p_vas--+ : |
| : V : V
| : +-------+ : +-------+
| : | vas | +----------------------------->| MMAP |
| : +--------->| |<-----------+ |region |
| : | +-------+ | : | +-------+
| V V V V V /|\
| +-------+ +-------+ +-------+ +-------+ |
| | TEXT |<->| |<->| MMAP |<->| | proc[m].pvas |
| |pregion| | | |pregion| | | | |
| +-------+ +-------+ +-------+ +-------+ | |
| : | :............. V |
| : | : +-------+ |
| : | r_prpnext +------------------->| vas |<---+ |
| :...|............. | : | | | |
| | : | : +-------+ | |
| V V V V V |
| +-------+ +-------+ +-------+ +-------+ +-------+ |
+->| TEXT | | TEXT |<->| |<->| MMAP |<->| | |
|region |<-------|pregion| | | |pregion| | | |
+-------+ +-------+ +-------+ +-------+ +-------+ |
| |
+---------------+
Hardware-Independent Page Information Table (pfdat)
The page frame data (pfdat) table is a two-level table that represents all reallocatable pages of physical memory. (Memory permanently allocated at kernel boot time is not represented.) Conceptually, it may be imagined as a giant array indexed by the page frame number (pfn, that is, the physical page number).
The first-level table (pfdat_ptr) is basically an array of pointers to sub-tables. Each pointer represents PFN_CONTIGUOUS_PAGES (0x1000) pages of possible physical address space, but the pointers are NULL unless there's actual physical memory in that range. (As a memory-saving optimization, memory allocated permanently at boot is treated as nonexistent for purposes of this table.)
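The two-level lookup can be pictured with a small model. This is only an illustrative sketch, not the kernel's code: the helper names make_pfdat_ptr and pfn_to_pfdat and the dictionary layout are assumptions made for the example; only pfdat_ptr, PFN_CONTIGUOUS_PAGES, pf_pfn and pf_use are names taken from the text.

```python
PFN_CONTIGUOUS_PAGES = 0x1000  # pages covered by one first-level pointer (from the text)


def make_pfdat_ptr(populated, top_entries):
    """Build a first-level table (hypothetical helper).

    Only the first-level slots listed in `populated` receive a sub-table;
    every other slot stays None, as for ranges without physical memory.
    """
    pfdat_ptr = [None] * top_entries
    for slot in populated:
        base = slot * PFN_CONTIGUOUS_PAGES
        pfdat_ptr[slot] = [
            {"pf_pfn": base + i, "pf_use": 0} for i in range(PFN_CONTIGUOUS_PAGES)
        ]
    return pfdat_ptr


def pfn_to_pfdat(pfdat_ptr, pfn):
    """Return the pfdat entry for `pfn`, or None if the table has no entry."""
    if pfn < 0:
        return None
    slot, index = divmod(pfn, PFN_CONTIGUOUS_PAGES)
    if slot >= len(pfdat_ptr) or pfdat_ptr[slot] is None:
        return None
    return pfdat_ptr[slot][index]
```

A lookup therefore costs one division, one pointer check, and one index, and address ranges with no memory behind them cost only a NULL pointer in the first-level table.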
points to the last pfdat structures are used for several purposes.
pfdats.
Table 16 Principal Entries in struct pfdat (Page Frame Data)
Element              Meaning
pf_hchain            Hash chain link.
pf_devvp(1)          vnode for device.
pf_next, pf_prev     Next and previous free pfdat entries.
pf_vnext, pf_vprev   Links for linked list of pages associated with the same
                     vnode.
pf_lock              Lock pfdat entry (beta semaphore), used to lock the page
                     while modifying the pde (physical-to-virtual
                     translation, access rights, or protection ID).
pf_pfn               Physical page frame number.
pf_use               Number of regions sharing the page; when pf_use drops to
                     zero, the page can be placed on the free linked list.
pf_cache_waiting     If set, a thread is waiting to grab the pf_lock on that
                     page. Required for synchronization.
pf_data              Disk block number or other data to uniquely identify
                     this page within pf_devvp.
pf_sizeidx           Identifies the page size for the base page of a large
                     page in a physical memory free list. That size
                     determines which free list it's placed in.
pf_size              Page size of a variable-sized page that's in use.
pf_flags             Page frame data flags (shown in the next table).
pf_hdl               Hardware-dependent layer elements (see the hdlpfdat
                     discussion, shortly).
(pf_devvp, pf_data).
Flags Showing the Status of the Page
Table 17 Principal pf_flags Values
Flag             Meaning
P_FREE           Page is free (available for allocation).
P_BAD            Page is marked as bad by the memory deallocation subsystem.
P_HASH           Page is on a hash queue.
P_SYS            Page is being used by the kernel rather than by a user
                 process. Pages marked with this flag include dynamic buffer
                 cache pages, B-tree pages and the results of dynamic kernel
                 memory allocation.
P_DMEM           Page is locked by the memory diagnostics subsystem; set and
                 cleared with an ioctl() call to the dmem driver.
P_LCOW           Page is being remapped by copy-on-write.
P_UAREA          Page is used by a pregion of type PT_UAREA.
P_KERN_DYNAMIC   Page is used for kernel dynamic memory. (Subset of P_SYS.)
                 This includes pages in the kernel dynamic memory free lists.
P_KERN_NO_LGPG   Page is allocated (as kernel dynamic memory) by a user who
                 intends to remap it. (Thus, it cannot be part of a large
                 page.) Subset of P_KERN_DYNAMIC.
P_SP_POOL        Page is in kernel dynamic memory allocator's superpage pool
                 free list. (Subset of P_KERN_DYNAMIC.)
Hardware-Dependent Layer Page Frame Data Entry
The pf_hdl field of the struct pfdat contains hardware-dependent information associated with each page. It is of type struct hdlpfdat, defined in hdl_pfdat.h.
Table 18 struct hdlpfdat
Element          Meaning
hdlpf_flags      Flags that show the HDL status of the page. Values include:
                 HDLPF_TRANS: A virtual address translation exists for this
                 page.
                 HDLPF_PROTECT: Page is protected from user access. If this
                 flag is set, the saved values (below) are valid unless
                 HDLPF_STEAL is also set.
                 HDLPF_STEAL: Virtual translation should be removed when
                 pending I/O is complete.
                 HDLPF_MOD: Analogous to changing the pde_modified flag in
                 the hpde.
                 HDLPF_REF: Analogous to changing the pde_ref flag in the
                 hpde.
                 HDLPF_READA: Read-ahead page in transit; used to indicate
                 to the hdl_pfault() routine that it should start the next
                 I/O request before waiting for the current I/O request to
                 complete.
hdlpf_savear     Saved page access rights.
hdlpf_saveprot   Saved page protection ID.
MAPPING VIRTUAL TO PHYSICAL MEMORY
The
HTBLhtbl)
of page directory entries (hpdes)
to pinpoint an address in the
enormous virtual address space.
Control register 25 (CR25) contains the hash table address
(see reg.h).
hpde and hpde2_0)
Structure" above for details of the contents of each table entry.
pdes, hpdes,
of pdirs.
htbl:
htbl
index.
htbl.
Each entry in the table is referred to as a pde (page directory
entry), and is of type struct hpde.
pde to verify the entry.
pde to
complete the translation from virtual address to physical address.
Figure 22 Mapping from the htbl Entry to the Page Directory Entry
htbl +-----+
| |
| |
| |
+-----+ +------+ | |
|Space| |Offset| | |
+-----+ +------+ | |
\ / | |
\ / | |
\ / | |
_/ \_ | |
----------- | |
\ hash / | |
\ / | |
| | |
V | |
+----------+ +-----+
|htbl index|------> htbl[n] | pde | ----> RAM
+----------+ +-----+
| |
| |
| |
+-----+
htbl[nhtbl-1] | pde |
+-----+
When Multiple Addresses Hash to the Same htbl Entry
More than one virtual address can hash to the same htbl index. The entry in htbl is actually the starting point for a linked list of pdes. Each entry has a pde_next pointer that points to another pde, or contains NULL if it is the last item of the linked list. The htbl contains sufficient entries so that the linked lists seldom grow beyond three links.
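The chained lookup can be sketched in a few lines. This is a simplified model under stated assumptions: the hash in htbl_index is a placeholder and not the PA-RISC hash function, the Pde class keeps only a (space, page) tag and a frame number, and htbl_insert and htbl_lookup are hypothetical helper names. Only htbl, pde and pde_next come from the text.

```python
class Pde:
    """Simplified page directory entry: tag, payload, and chain link."""

    def __init__(self, space, page, pfn):
        self.space = space
        self.page = page
        self.pfn = pfn
        self.pde_next = None  # None plays the role of NULL at the end of a chain


def htbl_index(space, page, nhtbl):
    # Placeholder hash for illustration only; NOT the hardware hash.
    return (space ^ page) % nhtbl


def htbl_insert(htbl, space, page, pfn):
    """Add a pde at the end of the chain that starts at its htbl slot."""
    pde = Pde(space, page, pfn)
    i = htbl_index(space, page, len(htbl))
    if htbl[i] is None:
        htbl[i] = pde
        return
    tail = htbl[i]
    while tail.pde_next is not None:
        tail = tail.pde_next
    tail.pde_next = pde


def htbl_lookup(htbl, space, page):
    """Walk the chain and compare tags; return the pfn, or None on a miss."""
    pde = htbl[htbl_index(space, page, len(htbl))]
    while pde is not None:
        if pde.space == space and pde.page == page:
            return pde.pfn
        pde = pde.pde_next
    return None
```

With this placeholder hash, (space 1, page 2) and (space 2, page 1) land in the same slot, so the second entry is reached through pde_next. The tag comparison is what tells the two apart.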
An htbl entry can point to two other collections of pdes, ranging from base_pdir to htbl and from pdir (which is also the end of htbl) to max_pdir. The entirety of the htbl and surrounding pdes is referred to collectively as the sparse pdir.
The htbl is always aligned to begin at an address that is a multiple of its size (that is, a multiple of nhtbl * sizeof(struct hpde)).
pdir_free_list or pd_fl2_0->head points to a linked list of sparse pdir entries that are not being used and are available for use. pdir_free_list_tail or its PA-RISC 2.0 counterpart points to the last pde on that linked list. (The variable names changed slightly from the PA-RISC 1.1 pdir implementation to the PA-RISC 2.0 pdir implementation.)
Figure 23 How Multiple Addresses Hash to the Same htbl Entry
+------------+
base_pdir | |
| |
| |
...> ============== -------> RAM
: | |\ |
: | \|
: | |\
: | | \
: | | pde
: | | /
: | |/
: | /|
: | |/ |
: ============== ..
: | | :
:....|............|..:
| |
| |
pdir +------------+
| |
| |
| |
| |
| |
| |
max_pdir | |
+------------+
Mapping Physical to Virtual Addresses
Physical addresses are mapped back to virtual addresses through the pfn_to_virt table. Like the pfdat table, this is a two-level table that can be imagined as a giant array containing one pfn_to_virt_entry_t entry for each page of physical memory. The first-level table is called pfn_to_virt_ptr[].
Each pfn_to_virt_entry_t contains either the space and offset of the virtual page (in the case of a single translation to a page) or a list of alias structures (when the physical page has more than one virtual address translation).
Figure 24 Physical-to-virtual Address Translation
pfn_to_virt_ptr[] pfn_to_virt_entry_t
+-----+ >+------------+
| | / | |
| | / +------------+
| | / | |
| | / | | struct alias entries
+-----+/ +------------+ +------+ +------+ +------+ +------+
pfn.>| | +..>| *alias |->|alias1|<->|alias2|<->|alias3|<->|aliasn|
: +-----+ : +------------+ +------+ +------+ +------+ +------+
: | | : | | |space.offset
: | | : +------------+ |vtopde()
: | | : |space.offset| |
: +-----+ : +------------+ V
: : | | +-----------------------------------------+
+...........+ +------------+ | hpde corresponding to this space.offset |
| | +-----------------------------------------+
+------------+
A pfn_to_virt_entry_t may contain the space.offset (virtual
address) corresponding to a physical address, or it may have a pointer to
a linked list of alias structures, each of which has a space.offset pair.
Address Aliasing
copy-on-write use aliases.
Aliases may also be used to add kernel translations of user pages.
Multiple translations to the same physical page (pfn) are maintained via alias chains off the pfn_to_virt_entry_t. (With large pages, the aliases are linked from the pfn_to_virt_entry_t corresponding to the base pfn of the page.)
When a pfn_to_virt_entry_t's space field is invalid and the offset field is non-zero, the non-zero value points to the beginning of a linked list of alias structures. Each alias structure contains the space and offset of the alias, and a temporary hold field for a pde's access rights and protection ID. The pf_lock of the alias's base pfn's pfdat protects the alias chain while it is being read or modified.
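The rule for telling a direct translation from an alias chain can be expressed as a short decoding routine. This is a sketch under assumptions: INVALID_SPACE is a hypothetical marker for "space field is invalid", the Alias class and its next_alias link are illustrative names, and treating "invalid space with zero offset" as "no translation" is an assumption the text does not state.

```python
INVALID_SPACE = -1  # hypothetical marker for an invalid space field


class Alias:
    """Illustrative alias structure: one space.offset pair plus a chain link."""

    def __init__(self, space, offset, next_alias=None):
        self.space = space
        self.offset = offset
        self.next_alias = next_alias


def translations(space, offset):
    """Return every (space, offset) pair recorded for one physical page.

    `space` and `offset` stand for the two fields of a pfn_to_virt_entry_t.
    With an invalid space, a non-zero `offset` is the head of the alias list.
    """
    if space != INVALID_SPACE:
        return [(space, offset)]  # single, direct translation
    if offset == 0:
        return []  # assumption: no translation recorded
    pairs = []
    alias = offset
    while alias is not None:
        pairs.append((alias.space, alias.offset))
        alias = alias.next_alias
    return pairs
```

In the kernel the same walk would be done while holding the pf_lock of the base pfn's pfdat, as described above; locking is omitted from the sketch.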
To find the hpde for a particular alias space and offset, the space and offset are hashed for the hpde chain and its corresponding pd_lock. Once the pd_lock is obtained, the vtopde() routine walks the hpde hash chain to find a match of the tag.
aa_entfreelist
is the head of the doubly-linked list of free alias
entries.
The system gets an alias structure from aa_entfreelist, in
which it stores the information for this new virtual-to-physical
translation.
max_aapdir
contains the total number of alias hpdes
on the system.
Once a page is allocated for use as alias hpdes, it is not
returned, so the value of max_aapdir may grow over time
but will never
shrink.
The number of available alias hpdes is stored in aa_pdircnt. When an alias hpde is used or reserved (one is reserved whenever an htbl hpde is included in an alias linked list, in case it has to be moved later), aa_pdircnt is decremented. When an alias hpde is returned to aa_pdirfreelist or unreserved, aa_pdircnt is incremented.
The number of available alias structures is stored in aa_entcnt. Once a page is allocated for use as a group of alias structures, it is not returned. The system does not keep track of the total number of alias structures, just the number of available structures.
MAINTAINING PAGE AVAILABILITY
The vhand and sched daemons (system processes) handle the actual paging and deactivation. vhand monitors free pages to keep their number above a threshold and ensure sufficient memory for demand paging. vhand governs the overall state of the paging system. sched becomes operative when the number of pages available in memory diminishes below a certain level. vhand and sched will be described in the context of their work shortly.
The sched process is known colloquially as the swapper.
Paging Thresholds
Figure 25 Available Memory in the System
total memory at boot-up --> +------------------------+ phys_mem_pages
| kernel static memory |
| |
freemem at boot --> +------------------------+
| |
. .
. .
| |
+------------------------+ lotsfree
| |
| |
| |
vhand begins paging --> +........................+ gpgslim*
| page |
+------------------------+ desfree
| |
sched begins deactivating --> +------------------------+ minfree
| deactivate |
+------------------------+ 0
* fluctuates between desfree and lotsfree
The value termed freemem represents the total number of free pages.
Three tunable paging thresholds are initialized by the setmemthresholds() routine.
Table 19 setmemthresholds() Paging Thresholds
Paging threshold   Meaning
lotsfree           Plenty of free memory, specified in pages. The upper bound
                   at which the paging daemon begins to steal pages.
desfree            Amount of memory desired free, specified in pages. This is
                   the lower bound at which the paging daemon begins
                   stealing pages.
minfree            The minimal amount of free memory tolerable, specified in
                   pages. If free memory drops below this boundary, sched()
                   recognizes that the system is desperate for memory and
                   deactivates entire processes whether they are runnable or
                   not.
The gpgslim Paging Threshold
The gpgslim paging threshold is the point at which vhand starts paging. gpgslim adjusts dynamically according to the needs of the system. It oscillates between an upper bound called lotsfree and a lower bound called desfree. Both lotsfree and desfree are calculated when the system boots up and are based on the size of system memory.
When the system boots, gpgslim is set to 1/4 the distance between lotsfree and desfree (desfree + (lotsfree - desfree)/4). As the system runs, this value fluctuates between desfree and lotsfree.
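As a worked example of the boot-time formula, the following sketch computes the initial gpgslim from the two bounds. The function name initial_gpgslim and the use of integer division are assumptions; the formula itself is the one given in the text, with all values in pages.

```python
def initial_gpgslim(lotsfree, desfree):
    """Boot-time gpgslim: one quarter of the way from desfree up to lotsfree."""
    return desfree + (lotsfree - desfree) // 4
```

For instance, with lotsfree = 4096 pages and desfree = 1024 pages, the initial threshold is 1024 + 3072/4 = 1792 pages.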
When the sum of available memory and the number of pages scheduled for I/O (soon to be freed) falls below gpgslim, vhand begins aging and stealing little-used pages in an attempt to increase the available memory above this threshold.
The system wants to keep free memory at gpgslim. If the system is not stressed, gpgslim starts falling, because it does not need to have a lot more pages freed. As memory becomes more scarce (defined as freemem reaching zero too often), the system increases gpgslim so that it will page earlier, and hopefully not have freemem reach zero as often.
If freemem decreases to minfree, the system starts to deactivate entire processes.
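The behavior of gpgslim can be modeled roughly as follows. This is a sketch, not the kernel's algorithm: the function names, the boolean freemem_hit_zero_too_often, and the fixed step size are assumptions introduced for illustration. The text only states the direction of each adjustment and that gpgslim stays between desfree and lotsfree.

```python
def adjust_gpgslim(gpgslim, freemem_hit_zero_too_often, desfree, lotsfree, step=64):
    """Move gpgslim one (hypothetical) step and clamp it to [desfree, lotsfree]."""
    if freemem_hit_zero_too_often:
        gpgslim += step  # memory is scarce: start paging earlier
    else:
        gpgslim -= step  # system is not stressed: let the threshold fall
    return max(desfree, min(lotsfree, gpgslim))


def vhand_should_page(freemem, pages_scheduled_for_io, gpgslim):
    """Paging starts when free plus soon-to-be-free pages fall below gpgslim."""
    return freemem + pages_scheduled_for_io < gpgslim
```

Counting pages already scheduled for I/O avoids starting a new round of stealing when enough memory is about to be freed anyway.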
How Memory Thresholds are Tuned
Table 20 Paging Threshold Values
Threshold   Basic Value    Limit if Initial    Additional Amount per 2G
                           freemem < 2 GB      of Initial freemem
lotsfree    1/16 freemem   32 MB               32 MB
desfree     1/64 freemem   4 MB                8 MB
minfree     1/4 desfree    1 MB                4 MB
How Paging is Triggered
The routine schedpaging() runs periodically and wakes up vhand whenever it finds that the sum of free memory and paroled memory (freemem + parolemem) is less than lotsfree. The rate at which schedpaging() runs is termed vhandrunrate, a tunable parameter (by default, eight times per second).
vhand can also be awakened by reserve_freemem()
and allocate_page().
reserve_freemem() is a routine that is called to reserve memory.
It will wake vhand if it can't reserve sufficient memory
and finds freemem + parolemem < gpgslim.
allocate_page() is a routine that is called to actually allocate
memory. If it is called by code that cannot wait (e.g. because it is
running on the interrupt stack), and cannot find the requested memory, it
will wake up vhand. Also, regardless of whether its caller
can wait, if it can't find the requested
memory it will wake up the unhashdaemon, which removes pages
from the page cache.
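The three wake-up paths can be summarized as simple predicates. This is a sketch only: the function names and boolean parameters are hypothetical and model just the conditions stated above, not the routines' real interfaces.

```python
def schedpaging_wakes_vhand(freemem, parolemem, lotsfree):
    """Periodic check: wake vhand when free plus paroled memory is below lotsfree."""
    return freemem + parolemem < lotsfree


def reserve_freemem_wakes_vhand(reservation_ok, freemem, parolemem, gpgslim):
    """Wake vhand only if the reservation failed and memory is below gpgslim."""
    return (not reservation_ok) and freemem + parolemem < gpgslim


def allocate_page_wakeups(found_memory, caller_can_wait):
    """Return the daemons woken by a failed allocation, per the text."""
    woken = []
    if not found_memory:
        if not caller_can_wait:
            woken.append("vhand")
        woken.append("unhashdaemon")  # woken whether or not the caller can wait
    return woken
```

Note the different thresholds: the periodic check compares against lotsfree, while the reservation path compares against gpgslim.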
vhand, the Pageout Daemon
vhand's function is to keep memory available by freeing up the
least recently referenced pages. It also performs other functions related
to maintaining memory availability, such as garbage collection of the
kernel memory allocator free lists.
Two-Handed Clock Algorithm
vhand uses a two-handed clock algorithm to decide which
pages to free. Conceptually, it has two hands
(called the "age hand" and the "steal hand")
passing through all of memory.
One hand marks each page as "not recently referenced". The other hand follows
after a delay, and checks each page to see whether it's been accessed (and
so marked as recently referenced) since the first hand cleared its referenced
bit. Those which have not been accessed may be stolen (paged out and
the memory made available to other users).
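The idea can be illustrated with a toy simulation over a circular array of reference bits. This is a conceptual sketch only: the real vhand walks pregions rather than a flat array, and the function name advance_hands, the fixed gap, and the list-of-booleans representation are all assumptions made for the example.

```python
def advance_hands(ref_bits, age_hand, gap, steps):
    """Advance both hands `steps` pages around a circular list of pages.

    ref_bits[i] is True if page i has been referenced since it was last aged.
    The steal hand trails the age hand by `gap` pages. Returns the new age
    hand position and the list of page indexes that would be stolen.
    (A real pager would also remove stolen pages from the scan; omitted here.)
    """
    n = len(ref_bits)
    stolen = []
    for _ in range(steps):
        steal_hand = (age_hand - gap) % n
        # Steal hand acts first: take pages not referenced since they were aged.
        if not ref_bits[steal_hand]:
            stolen.append(steal_hand)
        # Age hand: mark the page "not recently referenced".
        ref_bits[age_hand] = False
        age_hand = (age_hand + 1) % n
    return age_hand, stolen
```

For example, start with eight referenced pages and a gap of four. After the age hand clears pages 0 to 3, suppose pages 1 and 3 are touched again (their bits are set back to True). When the steal hand then sweeps pages 0 to 3, it takes only pages 0 and 2.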
In actual implementation, vhand steps through memory by following a doubly linked list of pregions, called the active pregion list. It doesn't step through all pregions each time it is woken, and normally looks at only a portion of the pages in each pregion. Since memory used for the file system buffer cache isn't associated with any pregion, a special dummy pregion called bufcache_preg is used to put it in the list of things for vhand to scan.
Using pregions rather than simply scanning all pages (e.g. using the pfdats) has the advantage of automatically skipping kernel memory, and memory that's already free. However, it has the disadvantage of putting all the memory belonging to a single process together. Thus, when the steal hand reached that process' pregions, all the pages it stole would come from that one process, leaving it frantically paging back in its working set, essentially thrashing. (This is particularly ugly if the process happens to be interactive and awaiting user input: the user doesn't want to wait for large numbers of pageins before the program responds to a mouse movement.) This is why only a portion of each pregion is aged or stolen on each pass, and vhand thus needs multiple passes through the active pregion list to visit all of pageable memory.
It's important to keep an appropriate distance between the hands. Too close, and pages are stolen that are in fact in regular use. Too far, and the hands have to move faster to keep the same steal rate; this means that vhand will consume more CPU time. The kernel automatically keeps an appropriate distance between the hands, based on the available paging bandwidth, the number of pages that need to be stolen, the number of pages already scheduled to be freed, and the frequency with which vhand runs.
Table 21 pregion Elements used by vhand
Element          Purpose
p_agescan        Last age hand location
p_stealscan      Last steal hand location
p_ageremain      Remaining pages to be aged
p_bestnice       Best nice value of all processes sharing the underlying
                 region
p_forw, p_back   Links in active pregion list
The two hands cycle through the active pregion linked list of physical memory to look for memory pages that have not been referenced recently and move them to secondary storage (the swap space). Pages that have not been referenced from the time the age hand passes to the time the steal hand passes are pushed out of memory. The hands rotate at a variable rate determined by the demand for memory.
The vhand daemon decides when to start paging by determining how much free memory is available. Once free memory drops below the gpgslim threshold, paging occurs. vhand attempts to free enough pages to bring the supply of memory back up to gpgslim.
The page daemon continues to age pages (that is, clear their reference bits) when woken even if there's enough memory that it doesn't need to steal pages; of course, it won't be woken very often in that situation.
Factors Affecting vhand
vhand responds to various workloads, transient situations, and memory configurations. When aging and stealing from pregions, vhand:
- Uses the pregion field p_agescan to track the last age hand location.
- Uses the pregion field p_ageremain to track remaining pages to be aged.
- Uses the pregion field p_stealscan to track the last steal hand location.
- Pushes vfd/dbd pairs to swap if they have no valid pages.
When the age hand arrives at a pregion, it ages some constant fraction of pages before moving to the next pregion (by default 1/16 of the region's total pages). The p_agescan tag enables the age hand to move to the location within a pregion where it left off during its previous pass, while p_ageremain charts how many pages must be aged to fill the 1/16 quota before moving on to the next pregion.
The steal hand uses the pregion field p_stealscan to locate itself within a pregion and resume taking pages that have not been referenced since last aged. If no valid pages remain, vhand pushes out of memory the vfd/dbd pairs associated with the region.
How much to age and steal depends on several factors:
- How often vhand runs (by default eight times per second).
- The gpgslim threshold.
- vhand is biased against threads that have nice priorities: the nicer a thread, the more likely vhand will steal its pages. The pregion field p_bestnice reflects the best (numerically, the smallest value) nice value of all threads sharing a pregion.
What Happens when vhand Wakes Up
- vhand uses the SCRITICAL flag to get access to the system critical memory pool. (The SCRITICAL flag for the vhand process is set when the process starts running for the first time.)
- vhand establishes pagecounts for pages to age and pages to steal.
- vhand updates the value of gpgslim, based on the value of memzeroperiod.
- vhand updates pageoutrate, using pageoutcnt.
- vhand updates targetlaps, the number of desired laps between the age and steal hands. If fewer CPU cycles are being used than the value of targetcpu, vhand increases the value of targetlaps (up to a maximum of 15); if more CPU cycles are being used than targetcpu, targetlaps is decreased.
- vhand updates agerate, the number of pages to age per second.
- If vhandinfoticks is non-zero, diagnostic information prints to the console.
Refer to the table that follows for explanations of the vhand variables.
NOTE: None of the variables in the table that follows may be tuned.
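The targetlaps adjustment described in the steps above can be sketched as a small feedback rule. The function name, the step of one lap per wake-up, and the lower bound of zero are assumptions; the text gives only the direction of change, the maximum of 15, and the 10% default for targetcpu.

```python
MAX_TARGETLAPS = 15  # upper limit stated in the text
MIN_TARGETLAPS = 0   # hypothetical lower limit; the text does not give one


def update_targetlaps(targetlaps, cpu_used_pct, targetcpu_pct=10):
    """Widen the gap when vhand is cheap to run, narrow it when vhand is costly."""
    if cpu_used_pct < targetcpu_pct:
        targetlaps += 1
    elif cpu_used_pct > targetcpu_pct:
        targetlaps -= 1
    return max(MIN_TARGETLAPS, min(MAX_TARGETLAPS, targetlaps))
```

A wider gap gives processes more time to re-reference an aged page before the steal hand arrives, at the price of more hand movement for the same steal rate.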
Table 22 Variables Affecting vhand
Variable        Purpose
memzeroperiod   Minimum permissible time period (default 3 seconds) between
                events in which freemem reaches zero; determines how often
                gpgslim is adjusted when vhand() is running. gpgslim is
                incremented if freemem reaches zero twice within
                memzeroperiod, and decremented if it does not.
pageoutrate     Current pageout rate, calculated empirically from number of
                pageouts completed.
pageoutcnt      Recent count of pageouts completed.
targetlaps      Ideal gap between steal and age hands for handlaps; adapts
                at run time. During normal operation, the hands should be
                as far apart as possible to give processes maximum time to
                reset a cleared reference bit being used by a page.
                targetlaps is defined in the kernel as a static variable;
                it does not appear in the symbol table.
targetcpu       Maximum percentage of CPU vhand should spend paging
                (default 10%).
handlaps        Actual number of laps between the age and steal hands.
agerate         Number of pages the age hand visits to age per second;
                adapts continually to system load. agerate is defined in
                the kernel as a static variable (meaning that it does not
                appear in the symbol table).
stealrate       Number of pages the steal hand visits per second; adapts
                continually to system load. stealrate is defined in the
                kernel as a static variable (meaning that it does not
                appear in the symbol table).
vhand Steals and Ages Pages
Once vhand establishes its criteria, it proceeds to traverse the linked list of pregions. Continuing in the clock-hands analogy, vhand is ready to move its hands.
Note, the steal hand is moved first to keep it behind the age hand and prevent aging and stealing a page in the same cycle.
- vhand determines how many pages and what pages are available to steal.
  - If the pregion is bufcache_preg, vhand steals buffers from the buffer cache with the stealbuffers() routine. The global parameter dbc_steal_factor determines how much more aggressively to steal buffer cache pages than pregion pages. If dbc_steal_factor has a value of 16, buffer cache pages are treated no differently than pregion pages; the default value of 48 means that buffer cache pages are stolen three times as aggressively as pregion pages.
  - For a pregion whose region has no valid pages (that is, r_nvalid == 0), and for which none of the processes using the region are loaded in memory (that is, r_incore == 0), vhand pushes its B-tree out to the swap device.
  - vhand steals all unreferenced pages between p_stealscan and (p_agescan - p_count/16 * handlaps), up to the steal quota.
- vhand updates p_stealscan to the page number following the last stolen page of the affected pregion.
- If vhand has not stolen as many pages as permissible, it moves to the next pregion and repeats the process until it satisfies the system's demand.
- vhand moves the age hand to clear the reference bit from a selected number of pages.
  - If the pregion is bufcache_preg, vhand ages one sixteenth of the pages in the buffer cache with the agebuffers() routine.
  - vhand determines the best nice value (that is, the lowest number) of all the pregions using the region. For each page in the region, if the nice value is less than a randomly generated number, vhand does not age the page. (That is, pages belonging to higher priority processes, which have numerically low nice values, are less likely to be aged.)
  - vhand ages all pages between p_agescan and (p_agescan + p_ageremain) by clearing the pde_ref bit and purging the TLB.
- vhand updates p_agescan to be the page number after the last page scanned (and potentially aged) in the affected pregion.
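Two of the rules in the steps above lend themselves to a small numeric sketch: the meaning of dbc_steal_factor and the nice-based bias in aging. Both functions are illustrative only. The names steal_aggressiveness and skip_aging are hypothetical, and the random number is assumed to be drawn uniformly from the range 0 to nice_max, which the text does not specify.

```python
import random


def steal_aggressiveness(dbc_steal_factor):
    """Relative weight of buffer cache pages versus pregion pages.

    16 means equal treatment; the default of 48 means three times as aggressive.
    """
    return dbc_steal_factor / 16


def skip_aging(bestnice, rng, nice_max=39):
    """Return True if the page is NOT aged on this visit.

    Follows the rule in the text: the page is skipped when the nice value is
    less than a randomly generated number. The range of that number
    (0..nice_max) is an assumption.
    """
    return bestnice < rng.randint(0, nice_max)
```

Under these assumptions a pregion whose best nice value is 0 is skipped on almost every visit, while one with the largest nice value is never skipped, so its pages are always aged and become candidates for stealing sooner.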
The sched() routine
The sched() routine (colloquially termed "the swapper") handles the deactivation and reactivation of processes when free memory falls below minfree, or when the system appears to be thrashing.
NOTE: Deactivation occurs on a per-thread basis. sched() chooses to deactivate on a process level and then deactivates each thread.
Deactivation occurs when sched() determines that the system is low on memory (freemem falls below the deactivation threshold minfree and more than one process is running) or is thrashing. Reactivation occurs when the system is no longer low on memory or thrashing.
What to Deactivate or Reactivate
sched() deactivates processes and prevents them from running, thus reducing the rate at which new pages are accessed. Once sched() detects that available memory has risen above minfree and the system is not thrashing, sched() reactivates the deactivated processes and continues monitoring memory availability.
Deactivation and reactivation are determined as follows:
- If the system appears to be thrashing or experiencing memory pressure, the sched() routine walks through the active process list calculating each process's deactivation priority based on type, state, length of time in memory, and how long it has been sleeping. (Batch processes and processes marked for serialization by the serialize() command are more likely to be deactivated than interactive processes.) The best candidate is then marked for deactivation.
- If the system is not thrashing or experiencing memory pressure, the sched() routine walks through the active process list calculating each deactivated process's reactivation priority based on how long it has been deactivated, its size, state, and type. Batch processes and those marked by the serialize() command are less likely to be reactivated than is an interactive process. Once the most deserving process has been determined, it is reactivated.
When a Process is Deactivated
Once a process is chosen for deactivation, sched():
- Sets the SDEACT flag in the proc struct and the TSDEACT flag in each thread struct.
- Adds the process's uareas to the active pregion list so that vhand can page them out.
- Moves the pregions associated with the target process in front of the steal hand, so that vhand can steal from them immediately.
- Enables vhand to scan and steal pages from the entire pregion, instead of 1/16.
Eventually, vhand pushes the deactivated process's pages to secondary storage.
Processes stay deactivated until the system has freed up enough memory and the paging rate has slowed sufficiently to reactivate processes. The process with the highest reactivation priority is then reactivated.
Once a process is chosen for reactivation,
sched():
uareas
from the active pregion list.
uareas.
Earlier HP-UX implementations did not permit a process to be swapped
out if it was holding a lock, doing I/O, or was not at a signalable priority.
Even if priority made it most likely to be deactivated,
vhand bypassed
the process.
Now, if the most deserving process cannot be deactivated immediately, it
is marked for self-deactivation; that is, sched()
sets the SDEACTSELF flag in its proc struct
and the TSDEACTSELF flag in each of its thread structs.
The next time one of the threads must fault in a page,
the thread deactivates the process.
Thrashing is defined as low CPU usage with high paging rate. Thrashing might occur when several processes are running, several processes are waiting for I/O to complete, or active processes have been marked for serialization.
On systems with very demanding memory needs (for example, systems that run many large processes), the paging daemons can become so busy deactivating/reactivating, and swapping pages in and out that the system spends too much time paging and not enough time running processes.
When this happens, system performance degrades rapidly, sometimes to such a degree that nothing seems to be happening. At this point, the system is said to be thrashing, because it is doing more overhead than productive work.
If your working set is larger than physical memory, the system will thrash. To solve the problem:
If you are left with one huge process constrained with physical memory and the system still thrashes, you will need to rewrite the application so that it uses fewer pages simultaneously, by grouping data structures according to access, for example.
All processes marked by the serialize command are run serially. This functionality unjams the bottleneck (recognizable by process throughput degradation) caused by groups of large processes contending for the CPU. By running large processes one at a time, the system can make more efficient use of the CPU as well as system memory since each process does not end up constantly faulting in its working set, only to have the pages stolen when another process starts running.
As long as there is enough memory in the system, processes marked by
serialize() behave no differently
than other processes in the system.
However, once memory becomes tight, processes marked by serialize are
run one at a time in priority order. Each process runs for a finite interval
of time before another serialized process may run. The user cannot
enforce an execution order on serialized processes.
serialize() can be run from the command line
or with a PID value.
serialize() also has a timeshare option
that returns the PID
specified to normal timeshare scheduling algorithms.
If serialization is insufficient to eliminate thrashing, you will need to add more main memory to the system.
Since vhand() is tuned
to be nice regarding I/O usage and CPU usage, it
allows the pager to fault out swapped processes. The swapper marks the
process to be swapped for deactivation,
and takes its threads off the run queue.
Since it cannot run, once its pages are aged, they cannot be referenced
again. When the steal hand comes around, it steals all the pages in the
region.
When memory pressure is high,
sched() selects a process to swap using
the routine choose_deactivate().
This routine is biased to choose
non-interactive processes over interactive ones, sleeping processes over
running ones, and long-running processes over newer ones.
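The stated biases can be illustrated with a simple ordering. This is a sketch only: the Proc record, its fields, and the strict lexicographic ordering are assumptions made for the example. The text says only which kinds of process the routine favors, not how the real choose_deactivate() weighs them against each other.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    """Hypothetical per-process summary used only for this illustration."""
    pid: int
    interactive: bool
    sleeping: bool
    seconds_in_memory: int


def choose_deactivate(procs):
    """Pick a victim: non-interactive first, then sleeping, then longest-running.

    Returns None when there are no candidates.
    """
    return max(
        procs,
        key=lambda p: (not p.interactive, p.sleeping, p.seconds_in_memory),
        default=None,
    )
```

With this ordering an interactive process is chosen only when no non-interactive candidate exists, which is a stronger rule than a mere bias; a weighted score would be closer to a priority calculation.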
Once a process has been chosen to be deactivated, the following actions occur:
- The process's SDEACT flag and its threads' TSDEACT flags are set.
- If the process cannot be deactivated immediately (for example, because it is doing I/O), its SDEACTSELF flag and its threads' TSDEACTSELF flags are set. When I/O completes, the process deactivates in the paging routines.
- p_deactime in the proc structure and the threads' kt_deactime in the kthread structure are set to the current time to establish a record of how long the process is deactivated.
- The process's pregions are positioned in the active pregion chain to ready them for the steal hand.
- The process's uarea pregions are added to the list of active pregions so that they can be paged out.
- deactive_cnt is incremented.
A process that has been inactive long enough for all its pages to have
been aged and stolen is virtually swapped out already. The global
deactprocs points to the head of a list of inactive processes,
its chain
running through the pregion element p_nextdeact.
When memory pressure eases, a deactivated process is reactivated. The
choose_reactivate() routine is biased to choose interactive processes over
non-interactive ones, runnable processes over sleeping ones,
and processes that have been deactivated longest over those more
recently deactivated.
Now, however, HP-UX provides the option of using Memory Resource Groups
to assign a group of processes their own memory pool. These processes
are in effect given their own physmem_pages, freemem,
minfree, desfree, lotsfree,
gpgslim, etc.
This allows groups of processes to page independently, producing a lot less interference between them. This may be useful for server consolidation, where several applications originally written for individual servers are instead run together on a single larger server.
With Memory Resource Groups, vhand and sched
behave almost as if each MRG were completely separate, with its own
individual pager and swapper. (The actual implementation is a bit more
complex, as it must account for processes and memory moving between
MRGs, the ability for one MRG to borrow memory from another, memory
use that can't be assigned to any single process (or any MRG), and the need
to maintain global memory availability as well as individual MRG memory
availability.) The global variables discussed above are still present,
and act as a summary of the overall system state.
Swap space is an area on a high-speed storage device (almost always a disk drive), reserved for use by the virtual memory system for deactivation and paging processes. At least one swap device (primary swap) must be present on the system.
During system startup, the location (disk block number) and size of each swap device are displayed in 512-KB blocks. You can add swap as needed (that is, dynamically) while the system is running, without having to regenerate the kernel.
The swapper reserves swap space at process creation time, but does not allocate swap space from the disk until pages need to go out to disk. Reserving swap at process creation protects the swapper from running out of swap space.
HP-UX uses both physical and pseudo-swap to enable efficient execution of programs.
System memory used for swap space is called pseudo-swap space. It
allows users to execute processes in memory without allocating physical
swap. Pseudo-swap is controlled by an operating-system parameter; by
default, swapmem_on is set to 1, enabling pseudo-swap.
Typically, when the system executes a process, swap space is reserved for the entire process, in case it must be paged out. According to this model, to run one gigabyte of processes, the system would have to have one gigabyte of configured swap space. Although this protects the system from running out of swap space, disk space reserved for swap is under-utilized if minimal or no swapping occurs.
To avoid such waste of resources, HP-UX is configured to access up to 7/8 of system memory capacity as pseudo-swap. This means that system memory serves two functions: as process-execution space and as swap space. By using pseudo-swap space, a two-gigabyte memory system with two gigabytes of swap can run up to 3.75 GB of processes. As before, if a process attempts to grow or be created beyond this extended threshold, it will fail.
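The 3.75 GB figure follows directly from the 7/8 rule. The sketch below reproduces the arithmetic in pages, assuming 4 KB pages; the function names are illustrative and are not kernel routines.

```c
#include <assert.h>
#include <stdint.h>

/* Pages of system memory usable as pseudo-swap: 7/8 of memory, per the text. */
static uint64_t pseudo_swap_pages(uint64_t mem_pages)
{
    return mem_pages - mem_pages / 8;
}

/* Upper bound on reservable process pages: disk swap plus pseudo-swap.
 * Hypothetical helper; it ignores the small fixed amounts the kernel
 * also withholds from pseudo-swap. */
static uint64_t max_process_pages(uint64_t mem_pages, uint64_t disk_swap_pages)
{
    assert(mem_pages > 0);
    return disk_swap_pages + pseudo_swap_pages(mem_pages);
}
```

With 2 GB of memory and 2 GB of disk swap (524288 pages each), this gives 983040 pages, which is 3840 MB, or 3.75 GB.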
When using pseudo-swap for swap, the pages are locked; as the amount of pseudo-swap increases, the amount of lockable memory decreases.
For factory-floor systems (such as controllers), which perform best when the entire application is resident in memory, pseudo-swap space can be used to enhance performance: you can either lock the application in memory or make sure the total number of processes created does not exceed 7/8 of system memory.
When the number of processes created approaches capacity, the system
might exhibit thrashing and a decrease in system response time. If
necessary, you can disable pseudo-swap space by setting the tunable
parameter swapmem_on
in /usr/conf/master.d/core-hpux to zero.
Regions that have pseudo-swap allocated are kept on a null-terminated,
doubly linked list whose head is pswaplist.
File-system swap space is located on a mounted file system and can vary in size with the system's swapping activity. However, its throughput is slower than device swap, because free file-system blocks may not always be contiguous, leading to extra read/write requests, and because of the extra overhead of an additional layer of code.
To optimize system performance, file-system swap space is allocated and
de-allocated in swchunk-sized chunks.
swchunk is a configurable
operating system parameter; its default is 2048 KB (2 MB). Once a
chunk of file system space is no longer being used by the paging system,
it is released for file system use, unless it has been preallocated with
swapon.
If swapping to file-system swap space, each chunk of swap space is a file
in the file system swap directory, and has a name constructed from the
system name and the swaptab index
(such as becky.6 for swaptab[6]
on a system named becky).
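The naming rule can be illustrated with a short sketch. The function name and buffer handling are assumptions for illustration only; the source gives the name format but not the kernel code that produces it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Build "<sysname>.<swaptab index>", for example "becky.6".
 * Returns 0 on success, -1 if the buffer is too small.
 * Hypothetical helper, not an HP-UX kernel routine. */
static int swap_chunk_name(char *buf, size_t len,
                           const char *sysname, int swaptab_index)
{
    int n;

    assert(buf != NULL && sysname != NULL);
    n = snprintf(buf, len, "%s.%d", sysname, swaptab_index);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```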
Several configurable parameters deal with swap space:
| Parameter | Purpose |
|---|---|
| swchunk | The number of DEV_BSIZE blocks in a unit of swap space; by default, 2 MB on all systems. |
| maxswapchunks | Maximum number of swap chunks allowed on a system. |
| swapmem_on | Parameter allowing creation of more processes than you have physical swap space for, by using pseudo-swap. |
There are a number of kernel global variables related to swap space,
shown in the next table.
The most important to swap space reservation are swapspc_cnt,
swapspc_max, swapmem_cnt,
swapmem_max, and sys_mem.
| Variable | Meaning |
|---|---|
| bswlist | Head of free swap header list. |
| swdevt[] | Device swap table. |
| fswdevt[] | File system swap table. |
| swaptab[] | Table of swap chunks. |
| swapphys_cnt | Pages of available physical swap space on disk. This counts unallocated pages, whether or not they've been reserved; swapspc_cnt (below) counts only unreserved pages. |
| swapphys_buf | Pages of physical swap space to keep available. If swapphys_cnt becomes less than this, vhand's age hand will free swap space when it finds that the in-memory copy of a page is newer than the on-disk copy. (Of course this means that swap space will need to be allocated again when the page needs to be paged out.) |
| swapspc_cnt | Total amount of swap currently available on all devices and file systems enabled, in units of pages. Updated each time swap is reserved or released, as well as each time a device or file system is enabled for swapping. |
| swapspc_max | Total amount of device and file-system swap currently enabled on the system, in units of pages. Updated each time a device or file system is enabled for swapping. |
| swapmem_cnt | Total number of pages of pseudo-swap currently available. Initialized to swapmem_max. |
| swapmem_max | Maximum number of pages of pseudo-swap enabled. Initialized to 7/8 available system memory. |
| pswaplist | Linked list of regions using pseudo-swap. |
| maxdev_pri | Highest available swap device priority. |
| maxfs_pri | Highest available swap file system priority. |
| phys_mem_pages | Page count of physical memory on the system. |
| sys_mem | Number of pages of memory not available for use as pseudo-swap. Normally initialized to 1/8 available system memory + 25 pages + sysmem_max pages. |
| sysmem_max | Added to sys_mem (number of pages not available for pseudo-swap) during system initialization on systems with device swap available, provided this leaves swapmem_max > 0. |
| maxmem | Set to the initial value of freemem after allocation of the initial dbc_min_pct of phys_mem_pages for the dynamic buffer cache. maxmem - swapmem_max is used as an upper limit for sys_mem when the kernel is returning pages stolen from pseudo-swap. |
| freemem | Page count of total remaining unreserved blocks of free memory. |
| freemem_cnt | Number of threads sleeping on global_freemem to wait for memory. (There are other ways to wait for memory which are not counted here.) |
System swap space values are calculated as follows:
Total swap enabled: swapspc_max (for device swap and file system swap) + swapmem_max (for pseudo-swap).
Swap in use: swapspc_max - [sum(swdevt[n].sw_nfpgs) + sum(fswdevt[n].fsw_nfpgs)] (for device swap and file system swap) + (swapmem_max - swapmem_cnt) (for pseudo-swap).
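The two expressions above can be written out as a small sketch. It assumes the first expression is the total swap enabled and the second is the amount in use; the struct and function names are hypothetical, and the per-device free counts stand in for sw_nfpgs and fsw_nfpgs.

```c
#include <assert.h>
#include <stddef.h>

/* Counters named after the kernel globals in the text; the struct itself
 * is only an illustration. All values are in pages. */
struct swap_counters {
    long swapspc_max;   /* device + file-system swap enabled */
    long swapmem_max;   /* pseudo-swap enabled */
    long swapmem_cnt;   /* pseudo-swap still available */
};

static long total_swap_pages(const struct swap_counters *c)
{
    return c->swapspc_max + c->swapmem_max;
}

/* dev_free[] and fs_free[] stand in for swdevt[n].sw_nfpgs and
 * fswdevt[n].fsw_nfpgs. */
static long used_swap_pages(const struct swap_counters *c,
                            const long *dev_free, size_t ndev,
                            const long *fs_free, size_t nfs)
{
    long free_pages = 0;
    size_t i;

    for (i = 0; i < ndev; i++)
        free_pages += dev_free[i];
    for (i = 0; i < nfs; i++)
        free_pages += fs_free[i];
    assert(free_pages <= c->swapspc_max);
    return (c->swapspc_max - free_pages) + (c->swapmem_max - c->swapmem_cnt);
}
```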
In HP-UX, only data area growth (using sbrk())
or stack growth will
cause a process to die for lack of swap space. Program text does not use
swap.
Swap reservation is a numbers game. The system has a finite number of pages of physical swap space. By decrementing the appropriate counters, HP-UX reserves space for its processes.
Most UNIX systems and UNIX-like systems allocate swap when needed.
However, if the system runs out of swap space but needs to write a
process' page(s) to a swap device, it has no alternative but to kill the
process. To alleviate this problem, HP-UX reserves swap at the time the
process is forked or exec'd.
When a new process is forked or executed, if
sufficient swap space cannot be reserved to handle the entire
process, the process may not execute.
At system startup,
swapspc_cnt and swapmem_cnt are initialized to
the total amount of swap space and pseudo-swap available.
Whenever the swapon() call is made
to add device or file system swap,
the amount
of swap newly enabled is converted to units of pages and added to the
two global swap-reservation counters swapspc_max
(total enabled swap)
and swapspc_cnt (available swap space).
Each time swap space is reserved for a process (that is, at process
creation or growth time), swapspc_cnt
is decremented by the number of
pages required. The kernel does not actually assign disk blocks until
needed.
Once swap space is exhausted (that is, swapspc_cnt == 0), any
subsequent request to reserve swap causes the system to allocate
additional chunks of file-system swap space. If successful, both
swapspc_max and swapspc_cnt are updated
and the current request (and
subsequent requests) can be satisfied. If a file-system chunk cannot be
allocated, the request fails, unless pseudo-swap is available.
When swap space is no longer needed (due to process termination or
shrinkage), swapspc_cnt is incremented
by the number of pages freed.
swapspc_cnt never exceeds swapspc_max
and is always greater than
or equal to zero. If a chunk of file-system swap is no longer needed, it is
released back to the file system
and swapspc_max and swapspc_cnt
are updated.
If no device or file system swap space is available, the system uses
pseudo-swap as a last resort.
It decrements swapmem_cnt and locks the
pages into memory. Pseudo-swap is either free or allocated; it is never
reserved.
The rswap_lock spinlock guards the swap reservation structures
swapspc_cnt, swapspc_max,
swapmem_cnt, swapmem_max, sys_mem,
and pswaplist.
Approximately 7/8 of available system memory is available as
pseudo-swap space
if the tunable parameter swapmem_on is set to 1.
Pseudo-swap is tracked in the global pseudo-swap reservation counters
swapmem_max (enabled pseudo-swap)
and swapmem_cnt (currently
available pseudo-swap). If physical swap space is exhausted and no
additional file-system swap can be acquired, pseudo-swap space is
reserved for the process by decrementing swapmem_cnt.
For example, on a 256 MB system,
swapmem_max and swapmem_cnt track
approximately 224 MB of pseudo-swap space, the remainder tracked by
the global sys_mem,
which represents the number of pages reserved for
system use only.
Processes track the number of pseudo-swap pages allocated to them by
incrementing a per region counter r_swapmem.
All regions using pseudo
swap are linked on the pseudo-swap list pswaplist.
Once both device swap and pseudo-swap
are exhausted
(that is, swapspc_cnt==0 and swapmem_cnt==0),
attempts at process creation or
growth will fail.
Once a process no longer needs its allocated pseudo-swap space,
swapmem_cnt is incremented by the amount released
and r_swapmem is
updated.
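The counter bookkeeping described in this section can be sketched as follows. This is a simplified model, not the kernel code: it treats each request as all-or-nothing, omits the file-system chunk growth step and the rswap_lock spinlock, and uses a hypothetical phys_reserved field to remember how much physical swap a region holds.

```c
#include <assert.h>

/* Counters named after the kernel globals in the text (units: pages). */
struct swap_state {
    long swapspc_cnt, swapspc_max;   /* device and file-system swap */
    long swapmem_cnt, swapmem_max;   /* pseudo-swap */
};

/* r_swapmem is from the text; phys_reserved is an assumption. */
struct region_sketch {
    long r_swapmem;
    long phys_reserved;
};

/* Reserve pages: physical swap first, pseudo-swap as a last resort.
 * Returns 0 on success, -1 if neither pool can cover the request. */
static int reserve_swap(struct swap_state *s, struct region_sketch *r, long pages)
{
    if (pages <= s->swapspc_cnt) {
        s->swapspc_cnt -= pages;
        r->phys_reserved += pages;
        return 0;
    }
    if (pages <= s->swapmem_cnt) {
        s->swapmem_cnt -= pages;
        r->r_swapmem += pages;
        return 0;
    }
    return -1;   /* process creation or growth fails */
}

/* Give back everything the region holds. */
static void release_swap(struct swap_state *s, struct region_sketch *r)
{
    s->swapspc_cnt += r->phys_reserved;
    s->swapmem_cnt += r->r_swapmem;
    r->phys_reserved = 0;
    r->r_swapmem = 0;
    assert(s->swapspc_cnt <= s->swapspc_max);
    assert(s->swapmem_cnt <= s->swapmem_max);
}
```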
Pseudo-swap consumes memory that could otherwise
be used for other purposes (see the sections below),
so it is used sparingly. The operating system periodically checks
to see if physical swap space has been recently freed. If it has, the
system attempts to migrate processes using pseudo-swap only to use the
available physical swap by walking the doubly linked list of pseudo-swap
regions. swapspc_cnt is decremented
by the r_swapmem value for each
region on the list until either swapspc_cnt drops to zero
or no other
regions utilize pseudo-swap.
swapmem_cnt is then incremented by the
amount of pseudo-swap successfully migrated.
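A minimal sketch of this migration pass follows. It assumes a region may be migrated partially when physical swap runs out mid-region, follows only the forward links of the list, and does not unlink fully migrated regions; none of these details is confirmed by the source.

```c
#include <assert.h>
#include <stddef.h>

/* One entry on the pseudo-swap list; only the fields needed here. */
struct pswap_region {
    long r_swapmem;               /* pseudo-swap pages held by the region */
    struct pswap_region *next;    /* forward link; NULL terminates the list */
};

/* Move reservations from pseudo-swap to physical swap.
 * Returns the number of pages migrated. */
static long migrate_pseudo_swap(struct pswap_region *pswaplist,
                                long *swapspc_cnt, long *swapmem_cnt)
{
    long migrated = 0;
    struct pswap_region *r;

    assert(swapspc_cnt != NULL && swapmem_cnt != NULL);
    for (r = pswaplist; r != NULL && *swapspc_cnt > 0; r = r->next) {
        long n = r->r_swapmem < *swapspc_cnt ? r->r_swapmem : *swapspc_cnt;

        *swapspc_cnt -= n;        /* reserve physical swap instead */
        r->r_swapmem -= n;
        migrated += n;
    }
    *swapmem_cnt += migrated;     /* pseudo-swap becomes available again */
    return migrated;
}
```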
Pseudo-swap competes with the kernel for the use of system memory.
1/8 of available memory (sys_mem pages)
is initially made unavailable for pseudo-swap
use; however, this is nowhere near enough to handle
both kernel dynamic memory and buffer cache space.
Instead, the kernel "steals" memory from pseudo-swap for
these purposes, decrementing swapmem_cnt when
it steals a page; once swapmem_cnt reaches zero,
it starts taking pages from sys_mem until
that too reaches zero.
When "stolen" pseudo-swap is returned,
the amount being released is first
added to sys_mem.
Once sys_mem grows to its maximum value
(maxmem - swapmem_max), any
additional pages returned are used to increase swapmem_cnt.
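The steal and return order can be sketched like this. The struct and function names are hypothetical; sys_mem_limit stands for the value maxmem - swapmem_max mentioned in the text.

```c
#include <assert.h>

struct pswap_pool {
    long swapmem_cnt;     /* available pseudo-swap pages */
    long sys_mem;         /* pages reserved for system use */
    long sys_mem_limit;   /* maxmem - swapmem_max */
};

static long min_long(long a, long b) { return a < b ? a : b; }

/* Kernel takes pages: swapmem_cnt first, then sys_mem.
 * Returns how many pages were actually obtained. */
static long steal_pages(struct pswap_pool *p, long pages)
{
    long from_swapmem, from_sys;

    assert(pages >= 0);
    from_swapmem = min_long(pages, p->swapmem_cnt);
    p->swapmem_cnt -= from_swapmem;
    from_sys = min_long(pages - from_swapmem, p->sys_mem);
    p->sys_mem -= from_sys;
    return from_swapmem + from_sys;
}

/* Kernel returns pages: refill sys_mem up to its limit, then swapmem_cnt. */
static void return_pages(struct pswap_pool *p, long pages)
{
    long room = p->sys_mem_limit - p->sys_mem;
    long to_sys;

    assert(pages >= 0);
    if (room < 0)
        room = 0;
    to_sys = min_long(pages, room);
    p->sys_mem += to_sys;
    p->swapmem_cnt += pages - to_sys;
}
```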
Because pseudo-swap is related to system memory usage, the swap reservation scheme reflects lockable memory policies.
Although the system is not necessarily allocating additional memory
when a process locks itself into memory, locked pages are no longer
available for general use.
This causes swapmem_cnt to be decremented
to account for the pages.
swapmem_cnt is also decremented by the size
of the entire process if that process gets plocked in memory.
All swap devices and file systems enabled for swap have an associated
priority, ranging from 0 to 10, indicating the order that swap space from
a device or file system is used. System administrators can specify
swap-space priority using a parameter of the swapon(1M) command.
Swapping rotates among both devices and file systems of equal priority. Given equal priority, however, devices are swapped to by the operating system before file systems, because devices make more efficient use of CPU time.
We recommend that you assign the same priority to most swap devices, unless a device is significantly slower than the rest. Assigning equal priorities limits disk head movement, which improves paging performance.
swdev_pri swdevt swaptab
+---------+ +--------+ /+--------+
0| |----->| dev1 |-----> +--------+
+---------+ +-| | \+--------+
1| |\ | +--------+ /+--------+
+---------+ \ +>| dev2 |-----> +--------+
| | \ | | | +--------+
| | \ +--------+ | +--------+
| | \>| dev3 |\ \+--------+
10+---------+ | | \ /+--------+
+--------+ \ > +--------+
| | \. \+--------+
| | \ /+--------+
+--------+ . > +--------+
: | +--------+
swfs_pri . | +--------+
+---------+ : \+--------+
0| | fswdevt . | |
+---------+ +--------+: | |
1| |----->| fs1 |. | |
+---------+ | | | |
| | +--------+ | |
| | | | | |
| | | | | |
10+---------+ +--------+ +--------+
Swap space is allocated on HP-UX using the following data structures:
swdev_pri[], used to link together swap devices with the same priority. That is, the entry in swdev_pri[n] is the head of a list of swap devices having priority n. The first field in the swdev_pri[] structure is the head of the list; the sw_next field in the swdevt[] structure links each device into the appropriate priority list.
swfs_pri[], which serves the same purpose as swdev_pri[], but for file system swap priority.
struct swdevt, used to establish the fundamental swap device information.
struct fswdevt, for supplementary file-system swap.
struct swaptab, which keeps track of the available free pages of swap space.
struct swapmap, whose entries together with swaptab combine for a swap disk block descriptor.
The following table details the elements of the struct swdevt.
swdevt[] (struct swdevt)

| Element | Meaning |
|---|---|
| sw_dev | Actual swap device, as defined by its major (upper 8 bits) and minor (lower 24 bits) numbers. |
| sw_flags | Several flags. The SW_ENABLE flag indicates that swap has been enabled on this device. |
| sw_start | Offset into the swap area on disk, in kilobytes. |
| sw_nblksavail | Size of swap area, in kilobytes. |
| sw_nblksenabled | Number of blocks enabled for swap. Must be a multiple of swchunk (2 MB default). |
| sw_nfpgs | Number of free swap pages on the device. Updated whenever a page is used or freed. |
| sw_priority | Priority of swap device (0-10). |
| sw_head, sw_tail | Indexes of first and last swaptab[] entry associated with this swap device. |
| sw_next | Pointer to the next device swap entry (swdevt) at this priority; implemented as a circular list used to update the pointer in swdev_pri for round-robin use of all devices at a particular priority. |
The following table details the elements of the struct
fswdevt.
fswdevt[] (struct fswdevt)

| Element | Meaning |
|---|---|
| fsw_next | Pointer to next file system swap (fswdevt entry) at this priority; implemented as a circular list. |
| fsw_flags | Several flags. The FSW_ENABLE flag indicates that the swap has been enabled on this file system. |
| fsw_nfpgs | Number of free swap pages in this file system swap; updated whenever a page is used or freed. |
| fsw_allocated | Number of swchunks allocated on this file system for swap. |
| fsw_min | Minimum swchunks to be preallocated when file system swap is enabled. |
| fsw_limit | Maximum swchunks allowed on file system; unlimited if set to zero. |
| fsw_reserve | Minimum blocks (of size fsw_bsize) reserved for non-swap use on this file system. |
| fsw_priority | Priority of file system (0-10). |
| fsw_vnode | vnode of the file system swap directory (/paging) under which the swap files are created. |
| fsw_bsize | Block size used on this file system; used to determine how much space fsw_reserve is reserving. |
| fsw_head, fsw_tail | Index into swaptab[] of first, last entry associated with this file system swap. |
| fsw_mntpoint | File system mount point; character representation of fsw_vnode, used for utilities (such as swapinfo(1M)) and error messages. |
swaptab and swapmap Structures
Two structures track swap space. The swaptab[] array tracks
chunks of swap space. swapmap entries hold swap information on a
per-page level.
swaptab defaults to track a 2MB chunk of space and
swapmap tracks each page within that 2MB chunk.
Each entry in the swaptab[] array has a pointer
(called st_swpmp) to a
unique swapmap.
swapmap entries have backwards pointers to the
swaptab index.
There is one entry in the swapmap for each page
represented by the swaptab entry (default 2 MB, or 512 pages);
that is,
swapmap conforms in size to swchunk.
A linked list of free swap pages begins
at the swaptab entry's st_free
and uses each free swapmap entry's sm_next.
When a page of swap is
needed, the kernel walks the structures
(using the get_swap() routine
in vm_swalloc.c),
which calls other routines that actually locate the
chunk, and so forth.
The walk starts at swdev_pri[].curr, which points to a swdevt entry.
If that entry's sw_nfpgs is zero (no free pages), we follow the pointer sw_next to get the next swdevt entry at this priority.
If no device at this priority has free pages, we look at swfs_pri[].curr, the file system swap at this priority, checking fsw_nfpgs for free pages.
Once we find a swdevt or fswdevt with free pages, we walk that device's swaptab list, starting with sw_head or fsw_head, and using st_next in each swaptab entry, until we find a swaptab entry with non-zero st_nfpgs.
In that entry, st_free points to the first free swapmap entry (and thus first free page) in this swaptab chunk.
The get_swchunk() routine creates a disk block descriptor (dbd) using 14 bits of dbd_data for the swaptab index and 14 bits for the swapmap index. The r_bstore in the region is set to the disk device swapdev_vp and the dbd is marked DBD_BSTORE.
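The device-selection part of this walk might look like the sketch below. It uses one simplified struct for both devices and file systems, and it assumes that lower priority numbers are tried first and that the current pointer advances past the chosen entry to give round-robin use. The names scan_ring and pick_swap are illustrative, not kernel routines.

```c
#include <assert.h>
#include <stddef.h>

#define NPRI 11   /* priorities 0 through 10 */

/* Stand-in for a swdevt or fswdevt entry. */
struct swdev_sketch {
    long nfpgs;                   /* free pages (sw_nfpgs or fsw_nfpgs) */
    struct swdev_sketch *next;    /* circular list at one priority */
};

/* Stand-in for one swdev_pri[] or swfs_pri[] slot. */
struct pri_slot {
    struct swdev_sketch *curr;
};

/* Scan one circular list for an entry with free pages. */
static struct swdev_sketch *scan_ring(struct pri_slot *slot)
{
    struct swdev_sketch *start = slot->curr;
    struct swdev_sketch *d = start;

    if (d == NULL)
        return NULL;
    do {
        assert(d->next != NULL);      /* ring must be closed */
        if (d->nfpgs > 0) {
            slot->curr = d->next;     /* rotate for round-robin use */
            return d;
        }
        d = d->next;
    } while (d != start);
    return NULL;
}

/* At each priority, try devices before file systems. */
static struct swdev_sketch *pick_swap(struct pri_slot dev[NPRI],
                                      struct pri_slot fs[NPRI])
{
    int pri;

    for (pri = 0; pri < NPRI; pri++) {
        struct swdev_sketch *d = scan_ring(&dev[pri]);

        if (d == NULL)
            d = scan_ring(&fs[pri]);
        if (d != NULL)
            return d;
    }
    return NULL;   /* no free swap anywhere */
}
```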
When faulting in from swap, the same process is followed as for
faulting in from the file system:
r_bstore and dbd_data are hashed
together and checked for a soft fault, then devswap_pagein() is
called. The devswap_pagein() routine
uses the dbd_data as a
14-bit swaptab index and a 14-bit swapmap index
to determine the
location of the page on disk.
Now all information needed to retrieve the page from swap has been stored.
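The two 14-bit indexes carried in dbd_data can be packed and unpacked as in this sketch. The field order (swaptab index in the upper bits) and the helper names are assumptions; the source states only the field widths.

```c
#include <assert.h>
#include <stdint.h>

#define SWP_BITS 14u
#define SWP_MASK ((1u << SWP_BITS) - 1u)

/* Pack a swaptab index and a swapmap index into 28 bits of dbd_data.
 * Bit layout is assumed for illustration. */
static uint32_t dbd_data_pack(uint32_t swptb, uint32_t swpmp)
{
    assert(swptb <= SWP_MASK && swpmp <= SWP_MASK);
    return (swptb << SWP_BITS) | swpmp;
}

/* Recover the swaptab index (which chunk). */
static uint32_t dbd_swptb(uint32_t dbd_data)
{
    return (dbd_data >> SWP_BITS) & SWP_MASK;
}

/* Recover the swapmap index (which page within the chunk). */
static uint32_t dbd_swpmp(uint32_t dbd_data)
{
    return dbd_data & SWP_MASK;
}
```

Fourteen bits allow up to 16384 chunks and up to 16384 pages per chunk, which comfortably covers the default of 512 pages per 2 MB chunk.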
swaptab and swapmap Structures
swapmap
>+---------+
/ | |
/ | |
/ | |
swaptab entry / | |
+-->+------------+ / | |
| | | / | |
| +------------+ / | |
| | st_swpmp |/ +---------+
| | st_free |-------->| sm_next |---+
| +------------+ +---------+ |
| | | | | |
| +------------+ +---------+<--+
| | sm_next |---+
| +---------+ |
| | | |
| | | |
| | | |
| +---------+<--+
| | sm_next |---+
+---+-+--------------+--------------+ +---------+ |
| | | dbd_swptb | dbd_swpmp |->| | -----
| | | (14 bits) | (14 bits) | +---------+ ---
+---+-+--------------+--------------+ | | -
| | |
+--- dbd_type (3 bits) = DBD_BSTORE +---------+
struct swaptab

| Element | Meaning |
|---|---|
| st_free | Index to the first free page in the chunk. Each entry maps to a 4KB page of swap. |
| st_next | Index to next swaptab entry for same device or file-system swap; at end of list, st_next is -1. |
| st_flags | ST_INDEL: File-system swap flag, indicating chunk is being deleted; do not allocate pages from it. Set only by the swapdel() routine. ST_FREE: File-system swap flag, indicating chunk may be deleted, because none of its pages are in use. In the case of remote swap, the chunk should not be deleted immediately; set st_free_time to current time plus 30 minutes (1800 seconds) when setting this flag. Once 30 minutes has elapsed, the chunk can be freed. If the chunk is needed during the interim, the flag can be cleared. ST_INUSE: swaptab entry is being changed. |
| st_dev, st_fsp | Pointers to swdevt[] entry or fswdevt[] that references the swaptab entry. |
| st_vnode | Vnode of device or swap file. |
| st_nfpgs | Number of free pages in this (swchunk) swaptab entry. |
| st_swpmp | Pointer to swapmap[] array that defines this swchunk of swap pages. |
| st_free_time | Indicates when remote fs chunk can be freed (see explanation of ST_FREE flag). |
struct swapmap

| Element | Meaning |
|---|---|
| sm_ucnt | Number of threads using the page. When decremented to zero, the swap page is free and the free pages linked list can be updated. |
| sm_next | Index of the next free page in the swapmap[]. This is valid only if sm_ucnt is zero; that means that this swapmap entry is included in the linked list beginning with swaptab's st_free. |
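The free list threaded through st_free and sm_next can be sketched as follows. The field names come from the tables above; the field types, the use of -1 as the end-of-list marker, and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_CHUNK 512   /* default 2 MB swchunk / 4 KB pages */

struct swapmap_sketch {
    unsigned short sm_ucnt;   /* users of this swap page */
    short sm_next;            /* next free page, valid when sm_ucnt == 0 */
};

struct swaptab_sketch {
    short st_free;                    /* first free page, -1 if none */
    int st_nfpgs;                     /* free pages in this chunk */
    struct swapmap_sketch *st_swpmp;  /* per-page map for this chunk */
};

/* Put every page of the chunk on the free list. */
static void chunk_init(struct swaptab_sketch *st, struct swapmap_sketch *map)
{
    int i;

    for (i = 0; i < PAGES_PER_CHUNK; i++) {
        map[i].sm_ucnt = 0;
        map[i].sm_next = (short)(i + 1 < PAGES_PER_CHUNK ? i + 1 : -1);
    }
    st->st_free = 0;
    st->st_nfpgs = PAGES_PER_CHUNK;
    st->st_swpmp = map;
}

/* Take the first free page; returns its swapmap index or -1. */
static int chunk_alloc_page(struct swaptab_sketch *st)
{
    int idx = st->st_free;

    if (idx < 0)
        return -1;
    st->st_free = st->st_swpmp[idx].sm_next;
    st->st_swpmp[idx].sm_ucnt = 1;
    st->st_nfpgs--;
    return idx;
}

/* Drop one use; when the count reaches zero, relink the page as free. */
static void chunk_release_page(struct swaptab_sketch *st, int idx)
{
    assert(idx >= 0 && idx < PAGES_PER_CHUNK);
    assert(st->st_swpmp[idx].sm_ucnt > 0);
    if (--st->st_swpmp[idx].sm_ucnt == 0) {
        st->st_swpmp[idx].sm_next = st->st_free;
        st->st_free = (short)idx;
        st->st_nfpgs++;
    }
}
```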
Recall that for a process to execute, all the regions
(for data, text, and so
forth) have to be set up; yet pages are not loaded into memory until the
process demands them. Only when the actual page is accessed is a
translation established.
A compiled program has a header containing information on the size of
the data and code regions. As a process is created from the compiled code
by fork and exec, the kernel sets up the process's data structures and the
process starts executing its instructions from user mode. When the
process tries to access an address that is not currently in main memory, a
page fault occurs. (For example, you might attempt to execute from a
page not in memory.) The kernel switches execution from user mode to
kernel mode and tries to resolve the page fault
by locating the pregion
containing the sought-after virtual address. The kernel then uses the
pregion's offset and region
to locate information needed for reading in the
page.
If the translation is not already present and the page is required, the
pdapage() routine executes
to add the translation (space ID, offset into
the page, protection ID and access permissions assigned the page, and
logical frame number of the page), and then on demand brings in that
page and sets up the translation, hashes in the table, and all the rest.
In main memory, the kernel also looks for a free physical page in which to load the requested page. If no free page is available, the system pages out selected used pages to make room for the requested page. The kernel then retrieves (pages in) the required page from file space on disk. It also often pages in additional (adjacent) pages that the process might need.
Then the kernel sets up the page's permissions and protections, and exits back to user mode. The process executes the instruction again, this time finding the page and continuing to execute.
The flexibility of demand paging lies in the fact that it allows a process to be larger than physical memory. Its disadvantage lies in the degree of complexity paging requires of the processor; instructions must be restartable to handle page faults.
By default, all HP-UX processes are load-on-demand. A demand paged process does not preload a program before it is executed. The process code and data are stored on disk and loaded into physical memory on demand in page increments. (Programs often contain routines and code that are rarely accessed. For example, error handling routines might constitute a large percentage of a program and yet may never be accessed.)
HP-UX now implements copy-on-write of EXEC_MAGIC processes, to
enable the system to manipulate processes more efficiently. The system
used to copy the entire data segment of a process every time the process
fork'd,
increasing fork time as the size of the data and code segments
increased. Only one translation of a physical page is maintained; a
parent process can point to and read a physical page, but copies it only
when writing on the page. The child process does not have a page
translation and must copy the page for either read or write access.
Copy-on-write means that pages in the parent's region
are not copied to
the child's region until needed.
Both parent and child can read the pages
without being concerned about sharing the same page. However, as soon
as either parent or child writes to the page, a new copy is written, so that
the other process retains the original view of the page.
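The behavior described here can be modeled in user space. The sketch below is a simulation, not the kernel mechanism: use_count plays the role of the pfdat pf_use count, the cow flag plays the role of pg_cw, and all struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

struct page_sketch {
    int use_count;                          /* mappings sharing this page */
    unsigned char data[SKETCH_PAGE_SIZE];
};

struct mapping_sketch {
    struct page_sketch *page;
    int cow;                                /* copy before first write */
};

/* At fork time: the child maps the parent's page; both become copy-on-write. */
static void cow_share(struct mapping_sketch *parent, struct mapping_sketch *child)
{
    parent->cow = 1;
    child->page = parent->page;
    child->cow = 1;
    parent->page->use_count++;
}

/* Write one byte, copying the page first if it is still shared.
 * Returns 0 on success, -1 if the private copy cannot be allocated. */
static int cow_write(struct mapping_sketch *m, size_t off, unsigned char v)
{
    assert(off < SKETCH_PAGE_SIZE);
    if (m->cow && m->page->use_count > 1) {
        struct page_sketch *copy = malloc(sizeof *copy);

        if (copy == NULL)
            return -1;
        memcpy(copy->data, m->page->data, SKETCH_PAGE_SIZE);
        copy->use_count = 1;
        m->page->use_count--;   /* the other side keeps the original */
        m->page = copy;
    }
    m->cow = 0;                 /* this mapping now owns its page */
    m->page->data[off] = v;
    return 0;
}
```

Reads need no special handling: both mappings see the same bytes until one of them writes.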
For more information about the implementation of EXEC_MAGIC,
see the
HP-UX Process Management white paper.
When a process is fork'd,
a duplicate copy of its parent process forms
the basis of the child process.
Under the kernel procdup() routine,
the system walks the pregion list
of the parent process, duplicating each pregion for the child process.
How this is done is dictated by the region type.
If the region is type RT_SHARED, a new pregion is created that attaches to the parent's region.
If the region is type RT_PRIVATE, the region is duplicated first, and then a new pregion is created and attached to the new region.
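The two cases can be sketched as a single dispatch on region type. The structures are reduced to the fields named in this section, the enum values are assumed, and dup_pregion is a hypothetical name rather than the kernel routine; the real private-region path also sets up copy-on-write state.

```c
#include <assert.h>
#include <stddef.h>

enum region_type_sketch { RT_PRIVATE, RT_SHARED };   /* values assumed */

struct region_sketch {
    enum region_type_sketch r_type;
    int r_refcnt;   /* pregions attached, in-core or paged */
    int r_incore;   /* in-core pregions attached */
};

struct pregion_sketch {
    struct region_sketch *p_reg;
};

/* Duplicate one pregion for the child. new_region supplies storage
 * for the copy that a private region needs. */
static void dup_pregion(const struct pregion_sketch *parent,
                        struct pregion_sketch *child,
                        struct region_sketch *new_region)
{
    assert(parent != NULL && child != NULL && new_region != NULL);
    if (parent->p_reg->r_type == RT_SHARED) {
        /* Shared: attach the new pregion to the parent's region. */
        child->p_reg = parent->p_reg;
        child->p_reg->r_refcnt++;
        child->p_reg->r_incore++;
    } else {
        /* Private: duplicate the region, then attach to the copy. */
        *new_region = *parent->p_reg;
        new_region->r_refcnt = 1;
        new_region->r_incore = 1;
        child->p_reg = new_region;
    }
}
```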
pregions for Shared regions
Because a region of type RT_SHARED
is shared by parent and child, fewer
changes occur to the pregions and region.
Only a new pregion must be
created and attached to the shared region.
A pregion is allocated and fields are copied from the parent pregion to the child pregion.
The pregion elements used by vhand (p_agescan, p_ageremain, and p_stealscan) are initialized to zero and the child pregion is added to the active pregion chain just before the stealhand, to prevent it from being stolen yet.
The region elements r_incore and r_refcnt are incremented to reflect the number of in-core pregions accessing the region and the number of pregions, in-core or paged, accessing the region.
pregions with
Shared regions
parent pregion child pregion
+------------+ +------------+
| | | |
| | \ | |
+------------+ =========+ +------------+
| p_reg |-+ / | p_reg |-+
+------------+ | +------------+ |
| | | | | |
| | | | | |
+------------+ | +------------+ |
| |
Per-process resources | |
======================|=================================|==========
System resources | |
| shared region |
+->+------------+<----------------+
| |
+------------+
| RT_SHARED |
+------------+
| |
| |
+------------+
pregions for Private regions
The procedure is considerably more complex when an RT_PRIVATE
region is copied.
region is allocated.
region's pointers are set:
r_fstore, the forward store pointer
is pointed to the same value
as the parent's,
and the vnode's reference count (v_count) is
incremented.
r_bstore, the backward store pointer
is set to the kernel global
swapdev_vp, and its v_count is incremented also.
region is attached to the end of the linked list of active
regions.
fork() fails
and returns the error ENOMEM.
region's B-tree structures
are initialized and sufficient
swap space is reserved for a completely filled B-tree.
vfd and dbd proto values
are copied to the child's
B-tree root.
vfd proto in both the parent region
and the child region are set
so that all pages of the region are copy-on-write.
B-tree element b_vproto
is set to indicate that the
copy-on-write flag (pg_cw)
must be set in the vfd for any new vfddbd
pair added to the B-tree.
vfddbds
is created for the child's B-tree (equal to each
chunk of vfddbds in the parent's B-tree)
and filled with proto
values.
The pg_cw bit is already set to copy-on-write for all default
vfds in the child B-tree's chunk.
region
of Type RT_PRIVATE
parent pregion child pregion
+------------+ +------------+
| | | |
| | \ | |
+------------+ =========+ +------------+
| p_reg |-+ / | p_reg |-+
+------------+ | +------------+ |
| | | | | |
| | | | | |
+------------+ | +------------+ |
| |
Per-process resources | |
======================|=================================|================
System resources | |
| shared region | private region
+->+------------+ +->+------------+
| | | |
+------------+ \ +------------+
| RT_PRIVATE | =========+ | RT_PRIVATE |
+------------+ / +------------+
| | | |
| | | |
+------------+ +------------+
copy-on-write When the vfd is Valid
Before the chunks of vfddbds in the child region can be used, the
validity of every entry must be checked.
If the vfd is not valid
(that is, its pg_v is not set), the pg_cw of the
parent's vfd must be set and copied to the child.
If pg_lock is set in
the parent, it must be unset in the child, as locks are not inherited.
Once the vfd is valid,
further modifications are made to the low-level
structures:
The r_nvalid element in the child region is incremented to reflect the number of valid pages.
The vfd contains a pfn (page frame number), which indexes into the pfdat[] array. The pfdat entry pf_use count (number of regions using this page) must be incremented.
If the vfd's copy-on-write bit isn't set, the pde must be set for translations to the page to behave as copy-on-write.
If a page has been written to a swap device, but has since been modified,
the swap-device data now differs from the data in memory. The disk
page must be disassociated from the page in memory
by setting the dbd
type to DBD_NONE.
Then, the next time the page is written to a swap
device, it will be assigned a new location.
Everything is now set up
from the perspective of the parent's B-tree for
copy-on-write.
region's copy-on-write Status
r_swalloc
is set to the number of region and B-tree
pages reserved.
r_prev and r_next
are set to link the child region to the parent
region.
pregion,
rather than copying it
from the parent pregion. This establishes two ranges of virtual
addresses (different space, same offset) translating to the single
range of physical address.
HTBL.
procdup() creates a duplicate copy of a process
based on forktype,
parent process (pp), child process (cp),
and parent thread (pt) and
child thread (ct).
procdup() allocates memory
for the uarea of the child. (In fact,
procdup() is the routine
that calls createU() to create the uarea
too.)
procdup() calls dupvas()
to duplicate the parent's virtual address
space, based on the kind of process (fork vs vfork)
being executed.
If the process was created by fork,
dupvas() duplicates the parent
process's virtual address space; if the process was vfork'd, the
parent's virtual address space is used.
dupvas() looks for and finds each private data object,
does whatever
each requires to be duplicated (there are special considerations
required for text, memory mapping, data objects, graphics), and when
it finishes duplicating the special objects, calls private_copy()
or
shared_copy(),
depending on whether it is dealing with a private or
shared region.
If the region is shared, shared_copy() increments the reference count on the region to indicate it is being shared.
If the region is private, private_copy() locks the region and enables the region to be duplicated by calling dupreg().
dupreg() allocates a new region for the child, duplicates the parent's
vfds and the entire region structure,
then calls do_dupc() to duplicate
entries under the region.
do_dupc() sets up a parent-child relationship,
and by duplicating
the relationship, sets up the child to be copy-on-write.
It makes
sure the parent's region is valid,
sets copy-on-write for the child, sets
the translation as rx (read-execute) only,
and duplicates information for
every vfddbd combination in the region.
do_dupc then calls hdl_cw()
to update the child's access rights and
make the child copy on write.
Once this is completed, the child process exists as a duplicated version of the parent process. The child process is attached to the child's address space and is no longer dependent on the parent.
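
The shared versus private handling in dupvas() can be pictured with a small model. The following is a minimal sketch in Python, not kernel code: the Region class, its cow set, and the "fork"/"vfork" strings are illustrative assumptions, and treating text as a shared region and marking the parent's pages copy-on-write as well are simplifications made here.

```python
from dataclasses import dataclass, field

RT_SHARED, RT_PRIVATE = "RT_SHARED", "RT_PRIVATE"


@dataclass
class Region:
    r_type: str
    pages: list                 # stand-in for the vfd/dbd entries of the region
    r_refcnt: int = 1
    cow: set = field(default_factory=set)   # page indexes marked copy-on-write


def shared_copy(region):
    """Shared region: the child reuses it, so only the reference count changes."""
    region.r_refcnt += 1
    return region


def private_copy(region):
    """Private region: build a new region for the child (the dupreg() step),
    then mark every page copy-on-write (the do_dupc() step)."""
    child = Region(region.r_type, list(region.pages))
    every_page = set(range(len(region.pages)))
    region.cow = set(every_page)     # assumption: parent pages become read-only too
    child.cow = set(every_page)
    return child


def dupvas(forktype, parent_regions):
    """Return the region list the child will use."""
    if forktype == "vfork":
        return parent_regions        # vfork: the parent's address space is used as is
    copy = {RT_SHARED: shared_copy, RT_PRIVATE: private_copy}
    return [copy[r.r_type](r) for r in parent_regions]


text = Region(RT_SHARED, ["t0", "t1"])
data = Region(RT_PRIVATE, ["d0", "d1", "d2"])
child_text, child_data = dupvas("fork", [text, data])
print(child_text is text, text.r_refcnt)             # True 2
print(child_data is data, sorted(child_data.cow))    # False [0, 1, 2]
```

The child's private region holds its own list of entries, but the entries still name the same pages as the parent's, which is what makes copy-on-write necessary.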
uarea for the Child Process
Each thread of a process has its own uarea.
When a process fork()s, the new process has only a single thread,
and that thread needs a uarea. procdup()
creates this uarea by calling createU().
(uarea pregions aren't copied
by dupvas(), so the child will have only one uarea,
no matter how many threads
(and associated uareas) the parent had.)
The createU() routine
builds a uarea and address space for the child
process. The uarea is set up last
for a fork'd process, to prevent the
child process from resuming
in the middle of pregion duplication code.
If the process is vfork'd,
the uarea is created during exec(). Until
then, the child uses the parent thread's uarea.
If the fork type is FORK_PROCESS, a temporary
space is allocated for a working copy of the parent's uarea to be
modified into the child's uarea. The temporary space is freed
after the uarea is copied to the new region.
fork() updates the
savestate in the parent uarea's u_pcb
just before copying the data.
(vfork() does not do this
because it creates the uarea during
exec(), and the savestate will change immediately.)
region is allocated for the new uarea,
its data structure is
initialized, its r_bstore value set to the swap device,
and the
new region is added to the list of active regions.
The uarea has no
r_fstore value, since it comes with ready-made data.
uarea's pregion,
which is initialized.
Each uarea has a unique space ID.
The new pregion is marked with
the PF_NOPAGE flag.
uarea pregions are unaffected by vhand
because they are not added to the list of active pregions.
Only if an
entire process is swapped out
are the uarea's pages written to a swap
device.
pregion is attached into the linked list of
pregions connected to the vas.
Its pointer is stored in r_pregs, its
p_prpnext set to NULL,
and its r_incore and r_refcnt set to one.
uarea and B-tree pages and
the default dbd is set to DBD_DFILL,
the uarea pages (UPAGES) are
allocated. Each page requires a page of physical memory
(sleeping if
none is available immediately).
The pfn is stored in the vfd, the
pg_v is set as valid, r_nvalid is incremented,
and a pde is created
for the physical-to-virtual translation.
The pfdat entry's P_UAREA
and HDLPF_TRANS flags are set,
and the dbd is set to DBD_NONE.
kt_upreg in the child's thread structure
is pointed to the child thread's uarea pregion.
Conceivably, the child can now run successfully. The current state is
therefore saved in the copied uarea
with a setjmp() call and pointed
to with pcb_sswap.
Thus, when the child first calls the resume()
routine, it detects that pcb_sswap is non-zero
and does a longjmp() to
get back here.
The child then returns from procdup() with the value
FORKRTN_CHILD.
The parent's open file table is copied to the child and the copied uarea is
copied into the actual pregion. This copy causes TLB miss faults that
cause the pregion's pdes to be written to the TLB,
thus associating the
uarea's virtual address with the physical pages just set up.
The process
completes by returning from procdup() with the return value
FORKRTN_PARENT.
copy-on-write Page
When the parent accesses
one of its RT_PRIVATE pages for read,
the processor generates a TLB miss fault, which the kernel handles as an
interrupt. The TLB miss fault handler finds the hpde
and inserts the
information (including the new access rights) into the processor's TLB.
On return from the interrupt, the processor retries the read and is
successful,
since PDE_AR_CW allows user-mode read access.
copy-on-write Page
address = space.offset address = spacep.offset
| +-----------------------------------------------------+ |
| | Situation: | |
+->| * No translation exists | |
| (miss handler cannot find pde). | |
+-----------------------------------------------------+ |
| +----------------------------------------------------+ |
| | Actions: | |
+->| * Create alias translation | |
| * Retry instruction. | |
+----------------------------------------------------+ |
| +---------------------------------------------------+ |
| | Situation: | |
+->| * Translation exists (miss handler finds pde). |<-+
| * Translation is marked invalid |
+---------------------------------------------------+
| +--------------------------------------------------+
| | Actions: |
+->| * Update TLB with PDE_AR_CW permissions. |
| * Retry instruction. |
+--------------------------------------------------+
copy-on-write Page
When the child accesses one of its pages for read, the TLB miss
handler does not find an hpde for the virtual address,
because none has
been set up yet.
The virtual address was set up in the pregion
structure. If you are not doing copy-on-access (which is now the default)
and the page is needed, the aliased translation must be made.
save_state is created.
vas pointer is taken
and the skip list searched to find the
pregion containing the page with this address.
When regions are initialized,
the dbd_data field of the disk block descriptor (dbd)
is set to DBD_DINVAL (0x1fffffff) in all cases. The
prototype dbd_type values are set as follows:
DBD_FSTORE for text and initialized data,
DBD_DZERO for stack and uninitialized data.
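
As an illustration of these prototype values, the following Python sketch builds the dbd that a newly initialized region would start with. The dictionary layout and the prototype_dbd() helper are hypothetical; only the constant names and the 0x1fffffff value come from the text.

```python
DBD_DINVAL = 0x1fffffff   # value quoted in the text

# Prototype dbd_type per kind of region contents, as listed above.
PROTO_DBD_TYPE = {
    "text": "DBD_FSTORE",
    "initialized data": "DBD_FSTORE",
    "stack": "DBD_DZERO",
    "uninitialized data": "DBD_DZERO",
}


def prototype_dbd(kind):
    """Return the dbd a freshly initialized region of this kind starts with."""
    return {"dbd_type": PROTO_DBD_TYPE[kind], "dbd_data": DBD_DINVAL}


for kind in PROTO_DBD_TYPE:
    print(kind, prototype_dbd(kind))
```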
When a page is read for the first time, a TLB miss fault results because
the physical page (and therefore its translation in the sparse PDIR) does
not yet exist. The fault handler is responsible for bringing in the page
and restarting the instruction that faulted. In determining whether or
not the page is valid,
the fault handler determines which pregion in the
faulting process contains the faulting address. The fault code eventually
calls virtual_fault(),
the primary virtual-fault handling routine.
The arguments passed to this routine are the virtual address causing the
fault, the pregion,
and a flag
indicating read or write access.
The kernel searches the B-tree
for the vfd and dbd of the page. If the
valid bit in the vfd flag is set,
another process has read the address into
memory already.
If the r_zomb flag is set in the region, the kernel
prints the message "Pid %d killed due to text modification or page I/O
error" and returns SIGKILL, which the handler sends to the
process.
If the dbd_type value is set to DBD_DZERO
(as is the case for stack and
uninitialized data),
the process sets the copy-on-write bit to zero. The
kernel then checks to determine whether the page pertains to a system
process or to a high-priority thread. If neither and memory is tight, the
process sleeps until free memory is driven down to the priority
associated with the process. (In the worst case, a thread might wait until
memory is above desfree.)
Once the process is restarted,
vfd and dbd pointers are examined to
ensure their continued accuracy.
A free pfdat entry is acquired from the physical memory
allocator, its pfn (pf_pfn) placed in the vfd,
the vfd's valid bit set, and
the region's r_nvalid counter (number of valid pages)
incremented.
The page is zeroed, and its
virtual-to-physical translation is
added to the sparse PDIR.
Finally, the kernel changes dbd_type to DBD_NONE
and dbd_data to
0xfffff0c.
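
The bookkeeping for a DBD_DZERO fault can be summarized in a short model. This is a minimal sketch in Python, assuming dictionary stand-ins for the region, vfd, dbd, free page pool, and sparse PDIR; the function name zero_fill_fault() is hypothetical, and the wait for free memory described above is left out.

```python
PAGE_SIZE = 4096


def zero_fill_fault(region, index, vaddr, free_pfns, memory, pdir):
    """Satisfy a fault on a DBD_DZERO page (memory-tight sleeping not modeled)."""
    vfd, dbd = region["vfd"][index], region["dbd"][index]
    if dbd["dbd_type"] != "DBD_DZERO":
        raise ValueError("not a zero-fill page")
    vfd["cw"] = False                      # copy-on-write bit set to zero
    pfn = free_pfns.pop()                  # acquire a free page
    vfd["pfn"] = pfn                       # pfn placed in the vfd
    vfd["pg_v"] = True                     # valid bit set
    region["r_nvalid"] += 1                # one more valid page in the region
    memory[pfn] = bytearray(PAGE_SIZE)     # the page is zeroed
    pdir[vaddr] = pfn                      # translation added to the PDIR
    dbd["dbd_type"] = "DBD_NONE"
    dbd["dbd_data"] = 0xfffff0c            # value as given in the text
    return pfn


region = {
    "r_nvalid": 0,
    "vfd": [{"pg_v": False, "pfn": None, "cw": True}],
    "dbd": [{"dbd_type": "DBD_DZERO", "dbd_data": 0x1fffffff}],
}
free_pfns, memory, pdir = [7, 8], {}, {}
pfn = zero_fill_fault(region, 0, ("space", 0x4000), free_pfns, memory, pdir)
print(pfn, region["r_nvalid"], region["dbd"][0]["dbd_type"])   # 8 1 DBD_NONE
```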
If a process has a virtual fault on a DBD_FSTORE page,
the kernel uses
the r_fstore pointer to the vnode,
to determine which
file-system specific pagein() routine
(for example, ufs_pagein(),
nfs_pagein(), cdfs_pagein(),
vx_pagein()) to call. The pagein()
routines are used to recover the correct page from a free list of memory
pages or to read in a correct page from disk.
The pagein() routine gets information
about the page being faulted from
the vm_pagein_init() routine,
which gets the vfd/dbd pairs, sets
up the region index, and ascertains that no valid page already exists.
One page must be reserved. Then vm_no_io_required() is called to
determine if the page fault can be satisfied locally,
either by a zero-filled page
(sparse file) or from the page cache.
vm_no_io_required() checks for the faulted page
in the page cache by calling lgpg_cache_lookup().
lgpg_cache_lookup() uses pageincache()
to find the base page, and then uses lgpg_lookup()
to find whether it's part of a suitable large page.
pageincache()
hashes on the vnode pointer and data
to choose a pfdat pointer in phash[].
The routine walks the
pf_hchain chain of pfdat entries
looking for a matching vnode
pointer (pf_devvp) and data value (pf_data).
If it finds a match, it
removes it from the free list.
If the page is found in the page cache,
the region's valid page
count (r_nvalid) is incremented,
the vfd is updated with the pfn
(pf_pfn),
and a virtual-to-physical translation for the page is added to the
sparse PDIR (if it had been removed).
DBD_FSTORE Page
pfdat
+--------------+
hash linked list +---->|P_HASH|P_FREE |<---+
(pf_hchain) | +--| |<-+ | free linked list
| | +--------------+ | |(pf_next, pf_prev)
| +->|P_HASH|P_FREE |<-+ |
| +--| |<-+ |
| | +--------------+ | |
| | | | | |
| | +--------------+ | |
devvp dbd_data | +->| P_HASH | | |
\ / | +--| | | |
\ / | V +--------------+ | |
\ / | --- | | | |
_/ \_ | - | | | |
----------- | +--------------+ | |
\ / phash | | P_FREE |<-|-+
\ / +-----+ | | |<-|-+
| | | | +--------------+ | |
V +-----+ | +--------------+ | |
index---->| |------->|P_HASH|P_FREE |<-+ |
+-----+ | +--| |<---|-+
| | | | +--------------+ | |
| | | | | P_FREE |<---+ |
| | | | | |<-----+
| | | | +--------------+
| | | | | |
| | | | +--------------+
+-----+ | +->| P_HASH |
+-----| |
+--------------+
| |
| |
+--------------+
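
The page cache lookup described above and pictured in the figure can be sketched as a hash chain walk. The following Python model is illustrative only: the Pfdat class, the phash_index() placeholder hash, the flag values, and the use of a Python list for the free list (the kernel links entries through pf_next and pf_prev) are assumptions made here.

```python
from dataclasses import dataclass
from typing import Optional

P_HASH, P_FREE = 0x1, 0x2    # illustrative flag values
PHASH_SIZE = 8


@dataclass(eq=False)
class Pfdat:
    pf_pfn: int
    pf_devvp: str                     # stand-in for the vnode pointer
    pf_data: int
    pf_flags: int = 0
    pf_hchain: Optional["Pfdat"] = None


def phash_index(devvp, data):
    """Placeholder hash on vnode and data; not the kernel's function."""
    return hash((devvp, data)) % PHASH_SIZE


def cache_insert(phash, free_list, pfd):
    """Put a page on its hash chain and on the free list."""
    i = phash_index(pfd.pf_devvp, pfd.pf_data)
    pfd.pf_hchain, phash[i] = phash[i], pfd
    pfd.pf_flags |= P_HASH | P_FREE
    free_list.append(pfd)


def pageincache(phash, free_list, devvp, data):
    """Walk the hash chain; on a match, take the page off the free list."""
    pfd = phash[phash_index(devvp, data)]
    while pfd is not None:
        if pfd.pf_devvp == devvp and pfd.pf_data == data:
            if pfd.pf_flags & P_FREE:
                free_list.remove(pfd)
                pfd.pf_flags &= ~P_FREE
            return pfd
        pfd = pfd.pf_hchain
    return None


phash = [None] * PHASH_SIZE
free_list = []
for pfn, block in [(10, 100), (11, 101), (12, 102)]:
    cache_insert(phash, free_list, Pfdat(pfn, "vnode_a", block))

hit = pageincache(phash, free_list, "vnode_a", 101)
miss = pageincache(phash, free_list, "vnode_a", 999)
print(hit.pf_pfn, len(free_list), miss)   # 11 2 None
```

A hit leaves the page on its hash chain but removes it from the free list, so the page can be handed back to the faulting region without any I/O.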
If the required page is not found in the page cache,
the pagein() routines
refer to the dbd to ascertain which page to fetch.
(The information had been
stored in the dbd by vm_no_io_required().)
The pagein() routines will generally try to read more than
just the single page where the fault occurred; they try both to use
larger than 4K pages (where that's appropriate, given memory availability,
file attributes, etc.) and to simply read-ahead extra pages from a file
that's being accessed sequentially, so that they'll be already available
at the time of the next page fault on that file.
A page (or more) of memory is allocated from the physical memory allocator,
a virtual-to-physical
translation is added to the sparse PDIR, the I/O is scheduled from the disk to
the page, and the process is put to sleep until the non-read-ahead I/O
completes (the process does not wait for read-ahead I/O). The
vfd is marked valid.
The dbd is left with dbd_type
set to DBD_FSTORE
and dbd_data set to the block address on the disk.
Regardless of whether the page data is retrieved from zero-fill, free list,
or disk, the page directory entry (pde) has been touched.
The instruction
is retried and gets a TLB miss fault; the miss handler writes the
modified pde data into the TLB; the instruction is retried again and
succeeds.
exec()
When the system performs an exec(), the virtual memory system
concerns itself with cleaning up old pregions/regions
and setting up
new ones.
vfork()
Cleanup in the vfork() case is simple.
vas
and attaches it to the child process
(p_vas).
uarea and stack of the parent process are copied and the
pregion and region
are created for the child uarea, just as for
a FORK_PROCESS fork type, and the
thread switches from using the parent's kernel stack to the new
child kernel stack.
pregions: dispreg()
If exec() is called
after a FORK_PROCESS fork, several regions must
be disposed of first.
Typically, all pregions are disposed of except for the
PT_UAREA pregion, which is still needed.
If the process is calling exec() on the same file it is already running,
we save a little processing and keep the PT_TEXT and
PT_NULLDREF regions, too.
deactivate_preg() is used
to deactivate the pregion
by removing
it from the active pregion list.
If the agehand is pointing to the
pregion being deactivated
and stealhand is pointing to the next
pregion in the active pregion list,
the agehand is moved back one
pregion to prevent the agehand
from exceeding the stealhand in
sequence. Otherwise if the agehand or stealhand
is pointing to the
pregion being deactivated, both hands are moved forward one
pregion.
hdl_detach() is called to handle hardware dependent aspects
of detaching the region from the process' address space. In particular, if
this is the last reference to the address space, its resources must be
freed up:
wait_for_io()
to await completion of any pending I/O to the
region (that is, r_poip = 0), so that no I/O request returns to
modify a page now assigned a different purpose.
do_deltransc()
on each chunk of the region's B-tree
to delete all the virtual address
translations. That is, for each valid vfd,
do_deltransc() calls hdl_deletetrans(),
which calls pddpage()
to:
hpde (set space to -1, address to 0,
pde_phys (pfn) to 0, pde_ref to 0,
pde_os to 0).
hpde is not the htbl entry,
the hpde is moved from hash list to
free list.
If it is the HTBL hpde and it is unused,
an effort is made to
fill it with a translation down its linked list, and then free the copied
hpde.
pfn_to_virt table.
region.)
pregion pointer is removed
from the r_pregs list and the
memory used by the pregion is freed
(that is, returned to the kernel
memory allocator).
r_incore and r_refcnt elements
are decremented. If
r_refcnt equals zero, the region is freed also.
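
The hpde handling described for pddpage() above distinguishes an entry that lives in the hash table itself from one that hangs off its chain. The following Python sketch models that distinction under stated assumptions: the Hpde class, the delete_translation() function, and the list used as a free list are hypothetical, and only the space, address, and pfn fields are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Hpde:
    space: int = -1                  # -1 marks an unused entry
    addr: int = 0
    pfn: int = 0
    next: Optional["Hpde"] = None    # overflow chain for this bucket


def invalidate(h):
    h.space, h.addr, h.pfn = -1, 0, 0


def delete_translation(htbl, free_list, bucket, space, addr):
    """Remove one translation from a bucket whose head lives in htbl."""
    head = htbl[bucket]
    prev, cur = None, head
    while cur is not None and (cur.space, cur.addr) != (space, addr):
        prev, cur = cur, cur.next
    if cur is None:
        return False
    invalidate(cur)
    if cur is not head:
        # Not the table entry: unlink it from the chain and free it.
        prev.next, cur.next = cur.next, None
        free_list.append(cur)
    elif head.next is not None:
        # Table entry now unused: refill it from the chain, free the copied hpde.
        moved = head.next
        head.space, head.addr, head.pfn = moved.space, moved.addr, moved.pfn
        head.next, moved.next = moved.next, None
        invalidate(moved)
        free_list.append(moved)
    return True


htbl = [Hpde(1, 0x1000, 7, Hpde(2, 0x1000, 8, Hpde(3, 0x1000, 9)))]
free_list = []
delete_translation(htbl, free_list, 0, 2, 0x1000)   # entry in the chain
delete_translation(htbl, free_list, 0, 1, 0x1000)   # entry in the table
print(htbl[0].space, htbl[0].pfn, htbl[0].next, len(free_list))   # 3 9 None 2
```

Keeping the table entry filled whenever the chain is non-empty means a lookup that hits the first entry never has to follow a pointer.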
The routine freereg() (called if the region
is to be freed) does the following:
pgfree() to:
wait_for_io() (again)
to await completion of any pending I/O to the
region (that is, r_poip = 0), so that no I/O request returns to
modify a page now assigned a different purpose.
region's B-tree (again),
calling do_freepagesc() on each
chunk of the B-tree to
free (freepfd()) all the valid pages of the region.
pf_use field of the pfdat is decremented.
pf_use will now be 0;
it can be
freed for other uses.
Its P_FREE flag is set
and the page is returned to the physical memory allocator.
The kernel global freemem is incremented. If any other
processes are waiting for memory, we wake them all up so that the
first one here can have the page (the losers of the race will go to
sleep again).
r_bstore is swapdev_vp,
the reserved swap pages (r_swalloc)
are released, as are the swap pages reserved for the B-tree
structure (r_root->b_rpages).
r_root and r_chunk region elements
are returned to the kernel memory allocator.
activeregions is decremented;
the region is removed from the
active region list and the list of regions
associated with its vnode,
and the region struct itself is returned
to the kernel memory allocator.
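
The page-freeing steps in the freereg() description can be modeled briefly. This Python sketch is illustrative: the dictionaries standing in for pfdat entries and the allocator, the waiters list, and the rule that a page is released only when pf_use reaches zero are assumptions made here to keep the example self-contained.

```python
P_FREE = 0x2   # illustrative flag value


def freepfd(pfd, mem):
    """Drop one use of a page; release it when no user remains.
    Returns the processes woken because memory became available."""
    pfd["pf_use"] -= 1
    if pfd["pf_use"] != 0:
        return []                      # another region still uses the page
    pfd["pf_flags"] |= P_FREE
    mem["free_list"].append(pfd)       # back to the physical memory allocator
    mem["freemem"] += 1
    woken, mem["waiters"] = mem["waiters"], []
    return woken                       # all waiters wake; one gets the page


mem = {"free_list": [], "freemem": 0, "waiters": ["proc_a", "proc_b"]}
private_page = {"pf_pfn": 5, "pf_use": 1, "pf_flags": 0}
shared_page = {"pf_pfn": 6, "pf_use": 2, "pf_flags": 0}

print(freepfd(private_page, mem), mem["freemem"])   # ['proc_a', 'proc_b'] 1
print(freepfd(shared_page, mem), mem["freemem"])    # [] 1
```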
If the process for which memory structures are being created is the first
to use the file as an executable,
the v_vas of the executable file's vnode is NULL,
so the pseudo-vas,
pseudo-pregion, and region must be created.
Otherwise, the pseudo-vas' reference count is updated.
PT_TEXT pregion is attached
depends on the type
of executable.
EXEC_MAGIC,
a PT_TEXT pregion is
attached to the pseudo-vas region.
EXEC_MAGIC,
VA_WRTEXT is set in the process
vas,
the pseudo-vas' region is duplicated
as a type RT_PRIVATE
region (performing all the steps discussed
for an RT_PRIVATE
region),
RF_SWLAZYWRT is set in the new region
so that no swap is
reserved before needed,
and a PT_TEXT pregion is attached to it.
pregion's virtual
address.
PT_NULLDREF pregion
is attached to the global region
(globalnullrp), using the same space as PT_TEXT.
vas' region
is duplicated as a type RT_PRIVATE
region using r_off to point to the beginning
of the data portion of
the executable file.
A PT_DATA pregion is attached to it. If this is
an EXEC_MAGIC executable,
we use the PT_TEXT pregion's
space, otherwise a new space is assigned.
PT_DATA pregion
is incremented by the size of bss
(uninitialized data area),
using dbd type DBD_DZERO. This sets
b_protoidx to the end of the initialized data area and
b_proto2 to DBD_DZERO. More swap is reserved.
SSIZE +1)
is created for the user
stack. The dbd proto value is set to DBD_DZERO,
and a PT_STACK
pregion is attached at USRSTACK.
The PT_UAREA pregion's
space is used.
PT_MMAP
pregions are created:
an RT_SHARED pregion containing text
mapped into the third quadrant with a space of KERNELSPACE and
an RT_PRIVATE pregion
containing associated data (such as
library global variables)
with the PT_DATA pregion's space.
VA_WRTEXT is set,
the data pregion takes the first available
address above where the text ends (in the first or second
quadrant); otherwise it is assigned the first available address
in the second quadrant.
exit()
From the virtual memory perspective,
an exit() resembles the first
part of an exec().
All virtual memory resources associated with the
process are discarded, but no new ones are allocated.
Thus, when exiting from a vfork child
before the child has performed
an exec(),
nothing needs to be cleaned up from virtual memory except
to return resources to the parent process.
If exiting from a non-vfork
child, the virtual memory resources are discarded by calling
dispreg().