The information contained within this document is subject to change without notice.
HEWLETT-PACKARD MAKES NO WARRANTY OF ANY KIND WITH REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Hewlett-Packard shall not be liable for errors contained herein nor for incidental or consequential damages in connection with the furnishing, performance, or use of this material.
Warranty. A copy of the specific warranty terms applicable to your Hewlett-Packard product and replacement parts can be obtained from your local Sales and Service Office.
Restricted Rights Legend. Use, duplication, or disclosure by the U.S. Government Department is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 for DOD agencies, and subparagraphs (c) (1) and (c) (2) of the Commercial Computer Software Restricted Rights clause at FAR 52.227-19 for other agencies.
Copyright Notices. (C)copyright 1983-2000 Hewlett-Packard Company, all rights reserved.
This documentation contains information that is protected by copyright. All rights are reserved. Reproduction, adaptation, or translation without written permission is prohibited except as allowed under the copyright laws.
(C)Copyright 1981, 1984, 1986 UNIX System Laboratories, Inc.
(C)copyright 1986-1992 Sun Microsystems, Inc.
(C)copyright 1985-86, 1988 Massachusetts Institute of Technology.
(C)copyright 1989-93 The Open Software Foundation, Inc.
(C)copyright 1986 Digital Equipment Corporation.
(C)copyright 1990 Motorola, Inc.
(C)copyright 1990, 1991, 1992 Cornell University
(C)copyright 1989-1991 The University of Maryland.
(C)copyright 1988 Carnegie Mellon University.
Trademark Notices. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited.
NFS is a trademark of Sun Microsystems, Inc.
OSF and OSF/1 are trademarks of the Open Software Foundation, Inc. in the U.S. and other countries.
First Edition: April 1997 (HP-UX Release 10.30) Second Edition: September 2000 (HP-UX Release 11.11)
The memory management system is designed to make memory resources available safely and efficiently to threads and processes:
The data and instructions of any process (a program in execution) or thread of execution within a process must be available to the CPU by residing in physical memory at the time of execution.
To execute a process, the kernel creates a per-process virtual address space; portions of the virtual space are mapped onto physical memory. Virtual memory allows the total size of user processes to exceed physical memory. Through "demand paging", HP-UX enables you to execute threads and processes by bringing virtual pages into main memory only as needed (that is, "on demand") and pushing out portions of a process's address space that have not been recently used.
The term "memory management" refers to the rules that govern physical and virtual memory and allow for efficient sharing of the system's resources by user and system processes.
The system uses a combination of pageout and deactivation to manage physical memory. Paging involves writing recently unreferenced pages from main memory to disk from time to time. A page is the smallest unit of physical memory that can be mapped to a virtual address with a given set of access attributes. On a loaded system, total unreferenced pages might be a large fraction of memory.
Deactivation takes place if the system is unable to maintain a large enough free pool of physical memory. When an entire process is deactivated, the pages associated with the process can be written out to secondary storage, since they are no longer referenced. A deactivated process cannot run, and therefore, cannot reference its data.
Secondary storage supplements physical memory. The memory management system monitors available memory and, when it is low, writes out pages of a process or thread to a secondary storage device called a swap device. The data is read from the swap device back into physical memory when it is needed for the process to execute.
On a PA-RISC system, every page of physical memory is addressed by a physical page number (PPN), which is a software "reduction" of the physical address. Access to pages (and thus to the data they contain) is done through virtual addresses, except under specific circumstances: when virtual translation must be turned off (the D and I bits are off), pages are accessed by their absolute addresses.
When a program is compiled, the compiler generates virtual addresses for the code. Virtual addresses represent a location in memory. These virtual addresses must be mapped to physical addresses (locations of the physical pages in memory) for the compiled code to execute. User programs use virtual addresses only.
The kernel and the hardware coordinate a mapping of these virtual and physical addresses for the CPU, called "address translation," to locate the process in memory.
The PA-RISC architecture is segmented; a complete virtual address consists of a space identifier (SID) and an offset within that space.
The offset may be 32 bits or 64 bits wide; earlier PA-RISC processors (before PA-RISC 2.0) only support 32 bit offsets.
From the point of view of a user program, the segmentation is not obvious; instead, user programs experience an almost flat address space with either 32 or 64 bit virtual addresses (depending on how the process was compiled).
The kernel however deals in the full complexity of space and offset.
From the kernel point of view, every process running on a PA-RISC processor shares a single global virtual address space, with global virtual addresses (GVAs) composed of both space and offset. (These GVAs are 96 bit on PA-RISC 2.0 processors running in 64-bit (wide) mode; smaller on earlier processors.) This global virtual address space is also shared by the kernel.
Although any process can create and attempt to read or write any global virtual address, the kernel uses page granularity access control mechanisms to prevent unwanted interference between processes.
When a virtual page is "paged" into physical memory, free physical pages are allocated to it by the physical memory allocator. These pages may be randomly scattered throughout the memory depending on their usage history. Translations are needed to tell the processor where the virtual pages are loaded. The process of translating the virtual into physical address is called virtual address translation.
Potentially, the virtual address space can be much greater than the physical address space. The virtual memory system enables the CPU to execute programs much larger than the available physical memory and allows you to run many more programs at a time than you could without a virtual memory system.
The more main memory in the system, the more data the system can access and the more (or larger) processes it can retain and execute without having to page or cause deactivation as frequently. Memory-resident resources (such as page tables) also take up space in main memory, reducing the space available to applications.
At boot time, the system loads HP-UX from disk into RAM, where it remains memory-resident until the system is shut down.
User programs and commands are also loaded from disk into RAM, but in small portions as they are needed. When a program terminates, the operating system frees the memory used by the process.
Disk access is slow compared to RAM access. Excessive disk access can lead to increased latency or reduced throughput and can lead to the disk access becoming the bottleneck in the system. To avoid this, you need to do some sort of buffering. Buffering, paging, and deactivation algorithms optimize disk access and determine when data and code for currently running programs are returned from RAM to disk. When a user or system program writes data to disk, the data is either written directly from the program's RAM (e.g. if writing to a "raw" device) or buffered in what is called the buffer cache and written to disk in relatively big chunks. Programs also read files and database structures from disk into RAM. When you issue the sync command before shutting down a system, all modified buffers of the buffer cache are flushed (written) out to disk.
On each processor, there are also registers and cache, which are even faster than main memory. Program execution actually happens in registers, which get data from the cache and other registers. The cache contains the current working copy of parts of main memory. Most of the time when discussing memory management, cache and registers will be completely ignored; data and instructions will be treated as being accessed directly from main memory. They are mentioned here in an attempt to reduce confusion:
From this point on, this section only discusses "main memory".
+------------------------------+ -+          -+          -+
|                              |  |           |           |
|                              |  | Lockable  |           |
|                              |  | memory    | Available |
|                              |  |           | memory    |
+..............................+ -+           |           | Physical
|                              |              |           | memory
|                              |              |           |
+------------------------------+             -+           |
|         HP-UX kernel         |                          |
|          at bootup           |                          |
+------------------------------+                         -+
Not all physical memory is available to user processes. Kernel text and initialized data occupy about 10 MB of RAM; additional memory is used by kernel bss (uninitialized data), and (especially) various structures allocated during kernel boot. Many of the structures allocated during kernel boot can be quite large. The sizes of some are determined by kernel tunables, but many are sized based on the amount of physical memory in the system, e.g. such a structure might have one 96 byte entry for every 4096 byte page of physical memory.
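To give a feel for how such per-page structures scale, the following minimal sketch computes the space consumed by a table that holds one fixed-size entry per physical page. The helper name and the 1 GB configuration in the note below are illustrative assumptions; only the 96-byte entry and the 4096-byte page come from the example above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* bytes per page, as in the text */

/* Bytes consumed by a boot-time table with one entry per physical page.
 * Hypothetical helper for illustration; not an HP-UX kernel routine. */
static uint64_t per_page_table_bytes(uint64_t physmem_bytes, uint64_t entry_bytes)
{
    assert(physmem_bytes % PAGE_SIZE == 0); /* whole pages only */
    return (physmem_bytes / PAGE_SIZE) * entry_bytes;
}
```

For a system with 1 GB of physical memory, such a table would need 262,144 entries of 96 bytes each, or 24 MB, which is a little over 2% of memory for that one structure.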
Instead of allocating all its data structures at system initialization, the HP-UX kernel dynamically allocates and releases some kernel structures as needed by the system during normal operation. This allocation comes from the available memory pool; thus, at any given time, part of the available memory is used by the kernel and the remainder is available for user programs.
Physical address space is the entire range of addresses used by hardware (4 GB on 32-bit (narrow mode) kernels), and is divided into memory address space, processor-dependent code (PDC) address space, and I/O address space. The next figure shows the expanse of memory available for computation. Memory address space takes up 15/16 of the system address space, while the address space allotted to PDC and I/O consumes a relatively small range of addresses.
+-----------+
0x00000000| page zero |
+-----------+
| |
| | +-----------------------+
| Memory | /| PDC address space |0xF0000000
| address | / | |
| space | / +-----------------------+
| | / | |0xF1000000
| | / | |
| | / | I/O Register |
0xF0000000+-----------+/ | address |
| PDC & I/O | | space |
0xFFFFFFFF+-----------+ | |
\ | |
\ +.......................+
\ | Central bus |
\ | address space |
\ +.......................+
\ | Broadcast address |0xFFFC0000
\| space (local, global) |0xFFFFFFFF
+-----------------------+
+-----------------------+
0x00000000 00000000| page zero |
+.......................+
| |
| |
| |
| |
| |
| |
| Memory |
| address |
| space |
| |
| |
| |
| |
| |
| |
| |
| |
+-----------------------+
0xF0000000 00000000| PDC address space |
0xF1000000 00000000| |
+-----------------------+
| I/O Register |
| address |
| space |
+.......................+
| Central bus |
| address space |
+.......................+
0xFFFFFFFF FFFC0000| Broadcast address |
0xFFFFFFFF FFFFFFFF| space (local, global) |
+-----------------------+
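The 32-bit (narrow mode) layout can be expressed as a simple range check. The sketch below is illustrative only: the boundary values are read off the first figure, and the enum and function names are hypothetical rather than kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Region boundaries as read from the 32-bit (narrow mode) figure. */
#define PDC_BASE       0xF0000000u
#define IO_BASE        0xF1000000u
#define BROADCAST_BASE 0xFFFC0000u

enum phys_region { REGION_MEMORY, REGION_PDC, REGION_IO, REGION_BROADCAST };

/* Hypothetical helper: name the region that holds a 32-bit physical address. */
static enum phys_region classify_phys32(uint32_t paddr)
{
    if (paddr < PDC_BASE)       return REGION_MEMORY;
    if (paddr < IO_BASE)        return REGION_PDC;
    if (paddr < BROADCAST_BASE) return REGION_IO;  /* I/O registers and central bus */
    return REGION_BROADCAST;
}
```

Because memory address space ends at 0xF0000000, it covers 15 of the 16 equal 256 MB slices of the 4 GB range, which matches the 15/16 figure given earlier.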
Pages kept in memory for the lifetime of a process by means of a system
call (such as mlock, plock, or shmctl)
are termed locked memory.
Locked memory cannot be paged and processes with locked memory
cannot be deactivated. Typically, locked memory holds frequently
accessed programs or data structures, such as critical sections of
application code. Keeping them memory-resident improves application
performance.
The lockable_mem variable tracks how much memory can be locked.
Available memory is a portion of physical memory, minus the amount of
space required for the kernel and its data structures. The initial value of
lockable_mem is the available memory on the system after boot-up,
minus the value of the system parameter, unlockable_mem.
The value of lockable memory depends on several factors:
unlockable_mem is a kernel tunable
parameter. Changing the value of unlockable_mem alters the
initial value of lockable_mem also.
HP-UX places no explicit limits on the amount of available memory you may lock down; instead, HP-UX restricts how much memory cannot be locked.
Other kernel resources that use memory (such as the dynamic buffer cache) can cause changes.
As the amount of memory that has been locked down increases, existing
processes compete for a smaller and smaller pool of usable memory. If
the number of pages in this remaining pool of memory falls below the
paging threshold called lotsfree, the system activates its paging
mechanism by scheduling vhand, in an attempt to keep a reasonable
amount of memory free for general system use.
Care must be taken to allow sufficient space for processes to make forward progress; otherwise, the system is forced into paging and deactivating processes constantly, to keep a reasonable amount of memory free.
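The two relationships described above (the initial value of lockable_mem, and the lotsfree trigger for paging) reduce to simple arithmetic. The following is a minimal sketch under the assumption that all quantities are counted in pages; the function names are hypothetical and do not correspond to kernel routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Initial lockable_mem: available memory after boot minus unlockable_mem.
 * All quantities are assumed to be page counts. */
static uint64_t initial_lockable_mem(uint64_t available_pages, uint64_t unlockable_mem)
{
    assert(unlockable_mem <= available_pages);
    return available_pages - unlockable_mem;
}

/* Paging is activated when the free pool drops below the lotsfree threshold. */
static bool paging_needed(uint64_t free_pages, uint64_t lotsfree)
{
    return free_pages < lotsfree;
}
```

Raising unlockable_mem lowers the first result, which is how the tunable limits locking indirectly rather than through an explicit cap on locked pages.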
Data is removed to secondary storage if the system is short of main memory. The data is typically stored on disks accessible either via system buses or network to make room for active processes.
Swap refers to a physical memory management strategy (predating UNIX) where entire processes are moved between main memory and secondary storage. Modern virtual memory systems no longer swap entire processes, but rather use a paging scheme, where individual pages of data and instructions can be paged in from secondary storage as needed, or paged out again to free up memory for other uses. This is backed up by a deactivation scheme that allows whole processes to be pushed out if the system is desperately short of memory. However, the secondary storage dedicated to storing paged out data is still referred to as "swap space".
Device swap can take the form of an entire disk or LVM(1)
logical volume
of a disk. A file system can be configured to offer free space for swap; this
is termed file-system swap. If more swap space is required, it can be
added dynamically to a running system, as either device swap or
file-system swap.
The swapon command is used to allocate disk space or
a directory in a file system for swap.
(1) Logical Volume Manager (LVM) is a set of commands and underlying software to handle disk storage resources with more flexibility than offered by traditional disk partitions.
A computer has a finite amount of RAM available, but each 32-bit HP-UX process has a 4 GB virtual address space apportioned in four one-gigabyte quadrants. (64-bit HP-UX processes have an even larger virtual address space, though they cannot actually use the full (16 exabyte) range of virtual addresses addressable with 64 bits. It too is divided into four equal-sized quadrants.) This is termed virtual memory.
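For a 32-bit process, the quadrant that contains an address follows directly from the top two bits of the 32-bit offset, since each quadrant spans 1 GB (2^30 bytes). The sketch below illustrates this; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QUADRANT_SHIFT 30u /* 1 GB = 2^30 bytes per quadrant */

/* Quadrant (0..3) containing a 32-bit virtual offset. Hypothetical helper. */
static unsigned quadrant_of(uint32_t vaddr)
{
    return vaddr >> QUADRANT_SHIFT;
}

/* Byte offset of the address within its quadrant. */
static uint32_t quadrant_offset(uint32_t vaddr)
{
    return vaddr & ((UINT32_C(1) << QUADRANT_SHIFT) - 1u);
}
```

For example, the address 0x40001000 falls in quadrant 1 at offset 0x1000.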
Virtual memory is the software construct that allows each process sufficient computational space in which to execute. It is accomplished with hardware support.
As software is compiled and run, it generates virtual addresses that provide programmers with memory space many times larger than physical memory alone.
HP-UX is a Shared Address Space (SAS) operating system. A given virtual address (including space ID) refers to the same page of memory for all processes; translations are not changed when the process context changes.
Thus, the number of bits available for the space ID (segment) and offset (often simply called "virtual address") determines the ultimate size of the total virtual address space available to the kernel and all processes together.
As PA-RISC evolved, the number of bits usable for space and offset has increased. On PA-RISC 2.0, the space ID is 32 bits (18 bits actually used in HP-UX 11.11) and the offset is effectively 42 bits (though stored in a 64-bit field). (PA-RISC 1.1 systems, and PA-RISC 2.0 running in narrow (32-bit) mode, have a smaller offset.)
NOTE: Understand, however, that a single process has significant
limitations on the virtual address space it is allowed to access.
For example, a 32-bit SHARE_MAGIC executable text is limited
to 1 GB and data is limited to 1 GB.
Also, the total amount of shared virtual address
space in the system is limited to much less than theoretically addressable;
without using memory windows, the total shared space on a wide mode
(64-bit) system is limited
to approximately 8 TB (i.e. 2 64-bit quadrants).
A physical address points to a page in memory that represents 4096 bytes of data. The physical address also contains an offset into this page. Thus, the complete physical address is composed of a physical page number (PPN) and page offset. The PPN is the 20 or 52 most significant bits of the physical address where the page is located. These bits are concatenated with a 12-bit page offset to form the 32 or 64-bit physical address.
     Page Number       Page Offset
+--------------------+------------+
|00000000000000000100|100001110011|
+--------------------+------------+
 0                19  20        31

                    Page Number                       Page Offset
+---------------------------------------------------+------------+
|000000000000000000000000000000000000000000000000100|100001110011|
+---------------------------------------------------+------------+
 0                                               51  52        63
To handle the translation of the virtual address to a physical address, the virtual address also needs to be looked at as a virtual page number (VPN) and page offset. Since the page size is 4096 bytes, the low order 12 bits of the offset are assumed to be the offset into the page. The space ID and the high order bits of the offset are the VPN.
For any given address you can determine the page number by discarding the least significant 12 bits. What remains is the virtual page number for a virtual address or the physical page number for the physical address.
The next figure shows the bit layout of a 32-bit virtual address of 0x0.4873.
32-bit Space ID 32-bit Offset
+--------------------------------+--------------------+------------+
|00000000000000000000000000000000|00000000000000000100|100001110011|
+--------------------------------+--------------------+------------+
| | | |
+----------------------------------------------------+ +-----------+
| |
VPN = 0x4 Page Offset
0x873
The virtual page number must be translated to obtain the associated physical page number; the page offset, 0x873, is carried over unchanged.
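The split between page number and page offset can be sketched in a few lines of C. This is an illustration only: the function names are hypothetical, and the sketch handles just the offset portion of a virtual address (a full VPN also includes the space ID).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                               /* 4096-byte pages */
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1u)

/* Discard the low 12 bits to obtain the page number. */
static uint64_t page_number(uint64_t addr) { return addr >> PAGE_SHIFT; }

/* Keep only the low 12 bits to obtain the offset within the page. */
static uint64_t page_offset(uint64_t addr) { return addr & PAGE_MASK; }

/* Concatenate a page number and a page offset into an address. */
static uint64_t make_address(uint64_t pn, uint64_t offset)
{
    assert(offset <= PAGE_MASK);
    return (pn << PAGE_SHIFT) | offset;
}
```

Applied to the offset 0x4873 from the figure, page_number yields 0x4 and page_offset yields 0x873; make_address(0x4, 0x873) reassembles 0x4873.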
+---------------------------------------------------+
| +--------------------+ |
| | Central Processing | |
| | Unit (CPU) | +-------------------+ |
| +--------------------+ | Floating Point | |
| |-------------->| Coprocessor | |
| | +-------------------+ |
| |------------------------+ |
| | | |
| V V |
| +--------------------+ +-------------------+ |
| | | | Translation | |
| | Cache | | Lookaside Buffer | |
| | | | (TLB) | |
| +--------------------+ +-------------------+ |
| | | |
| |<-----------------------+ |
| +--------------------+ |
| | System Interface | |
|    | Unit (SIU)         |                         |
| +--------------------+ |
| | |
+------------V--------------------------------------+
| Central Bus
==================================================================
The figure above and the table that follows name the principal processor components. Of them, registers, the translation lookaside buffer, and cache are crucial to memory management and are discussed in greater detail following the table.
| Component | Purpose |
|---|---|
| Central Processing Unit (CPU) | The main component responsible for reading program and data from memory, and executing the program instructions. Within the CPU are the following: |
| Instruction and Data Cache | The cache is a portion of high-speed memory used by the CPU for quick access to data and instructions. The most recently accessed data is kept in the cache. |
| Translation Lookaside Buffer (TLB) | The processor component that enables the CPU to access data through virtual address space by: |
| Floating Point Coprocessor | An assist processor that carries out specialized tasks for the CPU. |
| System Interface Unit (SIU) | Bus circuitry that allows the CPU to communicate with the central (native) bus. |
The translation lookaside buffer (TLB) translates virtual addresses to physical addresses.
+---------------------------+\
|                           | \
|                           |  \
|                           |   \
|                           |    \
|                           |     \
|                           |      \
|                           |             +--------+
|          Virtual          |    +---+    |Physical|
|          address          |<-->|TLB|<-->|address |
|          space            |    +---+    |space   |
|                           |             +--------+
|                           |      /
|                           |     /
|                           |    /
|                           |   /
|                           |  /
|                           | /
+---------------------------+/
Address translation is handled from the top of the memory hierarchy
hitting the fastest components first (such as the TLB on the processor)
and then moving on to the page directory table
(pdir in main memory)
and lastly to secondary storage.
The TLB looks up the translation for the virtual page numbers (VPNs) and gets the physical page numbers (PPNs) used to reference physical memory.
Virtual address Main Memory
+-------------------+-----------+ +--------+
|Virtual Page Number|Byte Offset| | 0 |
+-------------------+-----------+ | |
| | | |
| +-------------------+ | |
V | | |
VPN PPN Rights ID O U T D P | | |
+------------+-------+----+---+-+-+-+-+-+ | | |
| | | | | | | | | | | +------>[] |
+------------+-------+----+---+-+-+-+-+-+ | PPN | | |
T| | | | | | | | | | | + | | |
L+------------+-------+----+---+-+-+-+-+-+ | Offset| | |
B| | | | | | | | | | | | | |
+------------+-------+----+---+-+-+-+-+-+ | | | |
| | | | |
V Physical address V | | |
+--------------------+-----------+ | | |
|Physical Page Number|Byte Offset|---+ |physmem |
+--------------------+-----------+ +--------+
Ideally the TLB would be large enough to hold translations for every page of physical memory; however this is prohibitively expensive; instead the TLB holds a subset of entries from the page directory table (PDIR) in memory. The TLB speeds up the process of examining the PDIR by caching copies of its most recently utilized translations.
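The lookup performed by the TLB can be modeled in software as a search of a small table keyed by virtual page number. The sketch below is a simplified illustration, not a description of any PA-RISC implementation: the table size, the round-robin replacement, and all struct and function names are assumptions, and access rights and flag bits are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12u
#define TLB_ENTRIES 4u   /* tiny, for illustration only */

struct tlb_entry {
    bool     valid;
    uint32_t space;   /* space ID part of the VPN */
    uint64_t vpage;   /* offset >> PAGE_SHIFT */
    uint64_t ppn;     /* physical page number */
};

struct tlb { struct tlb_entry e[TLB_ENTRIES]; size_t next; };

/* Install a translation, replacing entries round-robin (an assumed policy). */
static void tlb_insert(struct tlb *t, uint32_t space, uint64_t vpage, uint64_t ppn)
{
    t->e[t->next] = (struct tlb_entry){ true, space, vpage, ppn };
    t->next = (t->next + 1) % TLB_ENTRIES;
}

/* On a hit, store the physical address in *paddr and return true. */
static bool tlb_translate(const struct tlb *t, uint32_t space,
                          uint64_t offset, uint64_t *paddr)
{
    uint64_t vpage = offset >> PAGE_SHIFT;
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        const struct tlb_entry *e = &t->e[i];
        if (e->valid && e->space == space && e->vpage == vpage) {
            uint64_t byte_off = offset & ((UINT64_C(1) << PAGE_SHIFT) - 1u);
            *paddr = (e->ppn << PAGE_SHIFT) | byte_off;
            return true;
        }
    }
    return false; /* TLB miss: the PDIR must be consulted */
}
```

A hit replaces the virtual page number with the cached physical page number and leaves the byte offset untouched, as in the figure; a miss means the translation has to be fetched from the PDIR.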
Because the purpose of the TLB is to satisfy virtual to physical address translation, the TLB is only searched when memory is accessed while in virtual mode. This condition is indicated by the D-bit in the PSW (or the I-bit for instruction access).
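Depending on model, the TLB may be organized on the processor in one of two ways: as a single unified TLB that holds both instruction and data translations, or as separate instruction and data TLBs.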
The advantage of having a split Data TLB (DTLB) and Instruction TLB (ITLB), is that it is possible to account for the different characteristics of data and instruction locality and type of access (frequent random access of data versus relatively sequential single usage of instructions).
Because TLB size is limited, it is desirable to use as few entries as possible to translate the largest possible amount of memory. PA-RISC 2.0 processors provide a variable page size, and memory is organized to use large page sizes wherever this is reasonable. In particular, the memory initially allocated for the kernel at boot time is mapped with the largest possible page size that fits it. (Other memory will be mapped with large pages if possible, but there are tradeoffs that may make this impractical, especially on small memory systems.)
PA-RISC processors before PA-RISC 2.0 do not support a general purpose variable page size. Instead, they may provide a block TLB. The block TLB is quite small, but its entries can map more than a single 4K page (i.e. multiple hpdes). Block TLB entries are used to reference kernel memory that remains resident. (Memory referenced by a block TLB entry cannot be paged out.) The block TLB is typically used for graphics, because graphics data is accessed in huge chunks. It is also used for mapping other static areas such as kernel text and data.
Since the TLB translates virtual to physical addresses, each entry contains both the Virtual Page Number (VPN) and the Physical Page Number (PPN). Entries also contain Access Rights, an Access Identifier, and the flags described in the following table.
| Flag | Name | Meaning |
|---|---|---|
| O | Ordered | Accesses to data for load and store are ranked by strength -- strongly ordered, ordered, and weakly ordered. (See PA-RISC 2.0 specifications for model and definitions.) |
| U | Uncacheable | Determines whether data references to a page from memory address space may be moved into the cache. Typically set to 1 for data references to a page that maps to the I/O address space or for memory address space that must not be moved into cache. |
| T(1) | Page Reference Trap | If set, any access to this page causes a reference trap to be handled either by hardware or software trap handlers |
| D | Dirty | When set, this bit indicates that the associated page in memory differs from the same page on disk. The page must be flushed before being invalidated. |
| B | Break | This bit causes a trap on any instruction that is capable of writing to this page |
| P | Prediction method for branching | Optional, used for performance tuning. |
(1) The T, D, and B flags are only present in data or unified TLBs.
In PA 1.x architecture, an E bit (or "valid" bit) indicates that the TLB entry reflects the current attributes of the physical page in memory.
The operating system maintains a table in memory called the Page Directory (PDIR) which keeps track of all virtual pages currently in memory. When a page is mapped in some virtual address space, it is allocated an entry in the PDIR. The PDIR is what links a virtual address to a physical page in memory.
The PDIR is implemented as a memory-resident table of software structures called hashed page directory entries (HPDEs), which contain virtual and physical addresses. When the processor needs to find a physical page not indexed in the TLB, it can search the PDIR with a virtual address to find the matching address.
The PDIR table is a hash table with collision chains. The virtual address is used to hash into one of the buckets in the hash table and the corresponding chain is searched until a chain entry with a matching virtual address is found.
Note that the page table is not a purely software construct. On systems that provide hardware for TLB miss handling, this is the table examined by the hardware to attempt to find an appropriate translation to insert in the TLB when resolving a TLB miss fault.
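The hash-with-collision-chains organization described above can be sketched as follows. This is a minimal model under stated assumptions: the bucket count, the hash function, and the simplified struct and field names are placeholders chosen for illustration, and neither the real PA-RISC hash function nor the actual hpde layout is reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8u  /* illustrative; must be a power of two */

struct pde {
    uint32_t space;      /* space ID */
    uint64_t vpage;      /* virtual page number within the space */
    uint64_t ppn;        /* physical page number */
    struct pde *next;    /* next entry on the collision chain, or NULL */
};

struct pdir { struct pde *bucket[NBUCKETS]; };

/* Placeholder hash; an assumption, not the hardware's hash function. */
static size_t pdir_hash(uint32_t space, uint64_t vpage)
{
    return (size_t)((space ^ vpage) & (NBUCKETS - 1u));
}

/* Link an entry at the head of its bucket's chain. */
static void pdir_insert(struct pdir *d, struct pde *e)
{
    size_t h = pdir_hash(e->space, e->vpage);
    e->next = d->bucket[h];
    d->bucket[h] = e;
}

/* Walk the chain until an entry with a matching virtual address is found. */
static const struct pde *pdir_lookup(const struct pdir *d,
                                     uint32_t space, uint64_t vpage)
{
    const struct pde *e = d->bucket[pdir_hash(space, vpage)];
    for (; e != NULL; e = e->next)
        if (e->space == space && e->vpage == vpage)
            return e;
    return NULL; /* no translation in the PDIR */
}
```

Two virtual pages that hash to the same bucket simply share a chain; the lookup distinguishes them by comparing the full space and virtual page number.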
A TLB miss trap occurs when a translation is missing from the translation lookaside buffer (TLB). If the processor can find the missing translation in the PDIR, it installs it in the TLB and allows execution to continue. If not, a page fault occurs.
A page fault is a trap taken when the address needed by a process is missing from main memory. This occurrence is also known as a PDIR miss. A PDIR miss indicates that the page is either on the free list, in the page cache, or on disk; the memory management system must then find the requested page on the swap device or in the file system and bring it into main memory.
Conversely, a PDIR hit indicates that a translation for the virtual address exists in the PDIR.
Hashed Page Directory Entry (hpde and hpde2_0) Structure
Each PDE contains information on the virtual-to-physical address translation, along with other information necessary for the management of each page of virtual memory.
PA-RISC 1.1 and PA-RISC 2.0 systems use different hashed page directory
entry structures, with mostly similar field names and purposes.
The following table combines the structural elements of the PA-RISC 1.1
hashed page directory entry (struct hpde) and the PA-RISC 2.0
hashed page directory entry (struct hpde2_0).
struct hpde and struct hpde2_0, the Hashed Page Directory
| Element | PA-RISC Version | Meaning |
|---|---|---|
| pde_valid | PA-RISC 1.1 | Flag set by the kernel to indicate a valid pde entry. |
| pde_invalid | PA-RISC 2.0 | Flag set by the kernel to indicate an invalid pde entry. |
| pde_vpage | both | Virtual page - the virtual offset divided by 4096. |
| pde_space | both | Contains the complete virtual space ID. |
| pde_rtrap | both | Data reference trap enable bit; when set, any access to the page causes a page reference trap interruption. |
| pde_dirty | both | Dirty bit; marked if the page differs in memory from what is on disk. |
| pde_dbrk | both | Data break; used by the TLB. |
| pde_ar | both | Access rights; used by the TLB.(1) |
| pde_uncache | both | Uncache bit. |
| pde_order | PA-RISC 2.0 | Strong ordering bit. |
| pde_br_predict | PA-RISC 2.0 | Branch prediction bit. |
| pde_ref_trickle | both | Trickle-up bit for references. Used with pde_ref on systems whose hardware can search the htbl directly. |
| pde_block_mapped | both | Block mapping flag; indicates page is mapped by block TLB and cannot be aliased. |
| pde_executed | both | Used by the stingy cache flush algorithm to indicate that page is referenced as text(2). |
| pde_ref | both | Reference bit set by the kernel when it receives certain interrupts; used by vhand to tell if a page has been used recently. |
| pde_accessed | both | Used by the stingy cache flush algorithm to indicate that the page may be in data cache. |
| pde_modified | both | Indicator to the high-level virtual memory routines as to whether the page has been modified since last written to a swap device. |
| pde_uip | both | Lock flag used by trap-handling code. |
| pde_protid | both | Protection ID, used by the TLB. |
| pde_os | PA-RISC 2.0 | Entry in use. |
| pde_alias | both | Virtual alias field. If set, the pde has been allocated from elsewhere in kernel memory, rather than as a member of the sparse PDIR. |
| pde_wx_demote | PA-RISC 2.0 (64-bit kernels only) | User space fic. |
| pde_phys | PA-RISC 1.1 | Physical page number; the physical memory address divided by the page size (4096 bytes). |
| pde_phys_u | PA-RISC 2.0 | Physical page number: most significant 25 bits. |
| pde_phys | PA-RISC 2.0 | Physical page number: least significant 27 bits address divided by the page size. |
| var_page | PA-RISC 2.0 | Page size. |
| pde_next | both | Pointer to next entry, or null if end of list. |
(1) For detailed information on access rights, see the
PA-RISC 2.0 Architectural reference,
chapter 3, "Addressing and Access Control."
For information about how programs can manipulate this field, see
mmap(2) and mprotect(2) manpages.
(2) Stingy cache flush is a performance enhancement by which the kernel recognizes whether or not to flush the cache.
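On PA-RISC 2.0, the table lists the physical page number as two fields, pde_phys_u (25 bits) and pde_phys (27 bits), which together give the 52-bit page number of a 64-bit physical address. The sketch below shows one way the two halves could be combined. It assumes simple concatenation with pde_phys_u as the upper part, which is inferred from the field descriptions rather than taken from the kernel source; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PDE_PHYS_BITS   27u /* width of pde_phys on PA-RISC 2.0, per the table */
#define PDE_PHYS_U_BITS 25u /* width of pde_phys_u, per the table */

/* Assumed composition: upper 25 bits followed by lower 27 bits = 52-bit PPN. */
static uint64_t ppn_from_fields(uint32_t phys_u, uint32_t phys)
{
    assert(phys_u < (UINT32_C(1) << PDE_PHYS_U_BITS));
    assert(phys   < (UINT32_C(1) << PDE_PHYS_BITS));
    return ((uint64_t)phys_u << PDE_PHYS_BITS) | phys;
}
```

Shifting the result left by 12 bits and adding the page offset would then give the full 64-bit physical address.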
Cache is fast, associative memory on the processor module that stores recently accessed instructions and data. From it, the processor learns whether it has immediate access to data or needs to go out to (slower) main memory for it.
Cacheable data going to the CPU from main memory passes through the cache. Conversely, the cache serves as the means by which the CPU passes data to and from main memory. Cache reduces the time required for the CPU to access data by maintaining a copy of the data and instructions most recently requested.
A cache improves system performance because most memory accesses are to addresses that are very close to or the same as previously accessed addresses. The cache takes advantage of this property by bringing into cache a block of data whenever the CPU requests an address. Although the hit rate depends on the size of the cache, its associativity, and the workload, performance measurements show that the vast majority of the time the cache already holds the data the next time it is requested.
Depending on model, PA-RISC processors are equipped with either a unified cache or separate caches for instructions and data (for better locality and faster performance). In multiprocessing systems, each processor has its own cache, and a cache controller maintains consistency.
Cache memory itself is organized as follows:
Cache Tag
+---------------------------+-+-+--------------------+ /|\
| |v|d| | |
| |a|i| | |
|Physical Page Number (PPN) |l|r| Tag Parity Bits | |
| |i|t| | |
| |d|y| | |
+---------------------------+-+-+--------------------+ |
|Cache
Cache Line |entry
+----------------------------------+-----------------+ |
| | | |
| | | |
| Data words |Data parity bits | |
| | | |
| | | |
+----------------------------------+-----------------+ \|/
When a process executes, it stores its code (text) and data in processor registers for referencing. If the data or code is not present in the registers, the CPU supplies the virtual address of the desired data to the TLB and to the cache controller. Depending on implementation, caches can be direct mapped, set associative, or fully associative. Recent PA implementations use direct-mapped caches and fully associative TLBs. Because the cache is virtually indexed, virtual addresses can be sent to the TLB and the cache in parallel.
A physical page may not be referenced by more than one virtual page, and a virtual address cannot translate to two different physical addresses; that is, PA-RISC does not support hardware address aliasing, although HP-UX implements software address aliasing for text only in EXEC_MAGIC executables.
The cache controller uses the low-order bits of the virtual address to index into the direct-mapped cache. Each index in the cache finds a cache tag containing a physical page number (PPN) and a cache line of data. If the cache controller finds an entry at the cache location, the cache line is checked to see whether it is the right one by looking at the PPN in the cache tag and the one returned by the TLB, because blocks from many different locations in main memory can be mapped legitimately to a given cache location. If the data is not in cache but the page is translated, the resultant data cache miss is handled completely by the hardware. A TLB miss occurs if the page is not translated in the TLB; if the translation is also not in the PDIR, HP-UX uses the page fault code to fault it in. If not in RAM, the data and code might have to be paged from disk, in which case the disk-to-memory transaction must be performed.
+---------------------------------+
|+-------+ processor |
|| CPU | |
|+-------+ |
| | : virtual address | +------------------+
| | :..................... | | RAM |
| | V V | | |
|+-------+ +-------+ | |page directory |
|| CPU | | TLB | | | +-----+ |
|+-------+ +-------+ | | |-----| |
| | : : | | |-----| |
| | : PPN PPN : | | +-----+ |
| | ....> <.... | | |
+---|-----------------------------+ +------------------+
| bus
===============================================================
|
+--------+
| disk |
+--------+
On a more detailed level, the next figure demonstrates the mapping of virtual and physical address components.
Virtual address
+-----------+-------------+
+--------------------| virtual | offset in |-----------------+
| | page # | page | |
| +-----------+-------------+ |
| |
| Address translation in |
| Translation Lookaside Buffer Physical address in Cache |
| +-------------+-------------+ +-------------+---------+ |
+->| Virtual | Physical |----->| Physical | Offset |<-+
| page number | page number | +->| page number | in page |<-+
+-------------+-------------+ | +-------------+---------+ |
| |
| Physical address in RAM |
| +-------------+---------+ |
+->| Physical | Offset |<-+
| page number | in page |
+-------------+---------+
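The split shown in the figure can be sketched in C as follows. This is a minimal illustration that assumes the 4096-byte page size mentioned earlier and ignores the space ID portion of a PA-RISC virtual address; the function names are hypothetical and are not kernel routines.

```c
#include <stdint.h>

#define PAGE_SIZE  4096u   /* page size stated in the text */
#define PAGE_SHIFT 12      /* log2(PAGE_SIZE) */

/* Page number: the address with the in-page offset stripped off. */
static uint64_t page_number(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Offset within the page; translation leaves it unchanged. */
static uint64_t page_offset(uint64_t addr)
{
    return addr & (PAGE_SIZE - 1);
}

/* Rebuild a physical address from a translated page number and the offset. */
static uint64_t make_physical(uint64_t ppn, uint64_t offset)
{
    return (ppn << PAGE_SHIFT) | offset;
}
```

Only the page number passes through the TLB; the offset is carried over unchanged from the virtual to the physical address.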
The sequence followed by the processor as it validates addresses is one of "hit or miss".
In addition to assisting in virtual address translation, the translation lookaside buffer (TLB) serves a security function on behalf of the processor, by controlling access and ensuring that a user process sees only data for which it has privilege rights.
The TLB contains access rights and protection identifiers. PA-RISC 2.0 allows up to eight protection IDs to be associated with each process. These IDs are held in control registers CR-8, CR-9, CR-12, and CR-13 (2 per register). (PA-RISC 1.1 only allows four protection IDs to be associated with each process.)
| Security check | Purpose |
|---|---|
| Protection Checks | The P-bit (Protection ID Validation Enable bit) of the Processor Status Word (PSW) is checked: |
| Access Rights Check | Access Rights are stored in a seven-bit field containing permissible access type and two privilege levels affecting the executing instruction: |
The following figure shows the two checks that control access to a page of data through the TLB: the protection check and the access rights check. If both checks pass, access is granted to the page referenced by the TLB.
Control Registers
+-----------------+
| | TLB Entry
CR 8|Protection ID 1,2|-+ +---------------+
CR 9|Protection ID 3,4| | | |
| | +-+ +---------------+
CR 12|Protection ID 5,6| | | +----------------| Access ID |
CR 13|Protection ID 7,8|-+ | | +---------------+
| | | | +--| Access Rights |
+-----------------+ | | | +---------------+
| | Type of | | |
PSW | | Access | +---------------+
+-------+-+---+ | | | |
| |P| | | | / \ | IA Queue
+-------+-+---+ | | / \ | +------+--+
| | | / \ | | +---------+--+
+---------------+ | | | | | +-| | |
V V V V V V +---------+--+
+------------+ +--------+ |
| | | Access |<-------------+
| Protection | | Rights |
| Check | | Check |
+------------+ +--------+
| |
+---+ +---+
| |
V V
+-------------+
| Both Checks |
| Passed? |
+-------------+
|
V
Access Granted
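The two checkpoints in the figure can be approximated by the following C sketch. It is a simplification under stated assumptions, not the hardware algorithm: the protection check is assumed to pass automatically when the PSW P-bit is clear, the seven-bit access rights field is reduced to two hypothetical booleans (may_read, may_write), and details such as privilege levels are left out. All structure and function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PIDS 8   /* PA-RISC 2.0: eight protection IDs in CR8, CR9, CR12, CR13 */

struct tlb_entry {
    uint32_t access_id;   /* compared against the process's protection IDs */
    bool     may_read;    /* hypothetical stand-in for the access rights field */
    bool     may_write;
};

/* Protection check: the access ID must match one of the protection IDs. */
static bool protection_check(bool psw_p_bit, const uint32_t pids[NUM_PIDS],
                             const struct tlb_entry *t)
{
    if (!psw_p_bit)
        return true;                 /* assumption: validation disabled */
    for (int i = 0; i < NUM_PIDS; i++)
        if (pids[i] == t->access_id)
            return true;
    return false;
}

/* Access rights check, reduced here to read/write permission. */
static bool access_rights_check(const struct tlb_entry *t, bool is_write)
{
    return is_write ? t->may_write : t->may_read;
}

/* Access is granted only if both checks pass. */
static bool access_granted(bool psw_p_bit, const uint32_t pids[NUM_PIDS],
                           const struct tlb_entry *t, bool is_write)
{
    return protection_check(psw_p_bit, pids, t) &&
           access_rights_check(t, is_write);
}
```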
If the two PPNs do not match (assuming a TLB hit), the bytes referenced on the virtual page are not yet in the cache, so the cache line is loaded. The time it takes to service a cache miss depends on whether the line already present in the cache is clean or dirty. If the line is clean (that is, not modified), it does not have to be written back to main memory before the new contents are read in. If the line is dirty, the old contents must first be written out to memory, so the penalty is more instruction cycles than for a clean line.
Page found in PDIR (deposit in TLB)
+-----------+-------------------------+
V | +--|---+
+--------+ V | | |
+->| hashes | +-----+ TLB miss | [ ] | Not Found
| +--------+ | |------------------>| |-----------+
| /| | | TLB | TLB Hit +------+ |
| | VPN------>| |-----------+ PDIR V
| | | +-----+ |PPN s/w
| | | | (cache line) handler
| +- | -----------+------------ | -------------------+
| | | | |
| | | V |
| | V / \ |
CPU | | +-------+ PPN / \ No/Cache Miss +-----+
requests| +------->| Cache |------> =? -------------->| |
virtual | +-------+ \ / +-----+
address | \ / RAM
| |Yes/cache hit
| Return data to |
+-----+ CPU from cache |
| CPU |<-----------------------------+
+-----+
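The hit-or-miss decision in the figure can be sketched as follows, assuming a direct-mapped, virtually indexed cache whose tag holds the PPN. The line size, the number of entries, and all names are hypothetical choices for illustration; real PA-RISC implementations differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 5        /* hypothetical 32-byte cache line */
#define NUM_LINES  1024u    /* hypothetical number of cache entries */

struct cache_entry {
    uint64_t ppn;     /* physical page number held in the cache tag */
    bool     valid;
    bool     dirty;
};

enum lookup_result { CACHE_HIT, MISS_CLEAN, MISS_DIRTY };

/* The low-order virtual address bits select the cache entry. */
static unsigned cache_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> LINE_SHIFT) % NUM_LINES);
}

/* Compare the tag's PPN with the PPN returned by the TLB. */
static enum lookup_result cache_lookup(const struct cache_entry *cache,
                                       uint64_t vaddr, uint64_t tlb_ppn)
{
    const struct cache_entry *e = &cache[cache_index(vaddr)];

    if (e->valid && e->ppn == tlb_ppn)
        return CACHE_HIT;        /* return data to the CPU from cache */
    if (e->valid && e->dirty)
        return MISS_DIRTY;       /* old line must be written back first */
    return MISS_CLEAN;           /* line can simply be replaced */
}
```

The distinction between MISS_CLEAN and MISS_DIRTY mirrors the difference in miss penalty described above.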
Registers, high-speed memory in the processor's CPU, are used by the software as storage elements that hold data for instruction control flow, computations, interruption processing, protection mechanisms, and virtual memory management.
All computations are performed between registers or between a register and a constant (embedded in an instruction), which minimizes the need to access main memory or code. This register-intensive approach accelerates performance of a PA-RISC system. Register storage is much faster than conventional main memory, but it is also much more expensive, and is therefore reserved for processor-specific purposes.
Registers are classified as privileged or non-privileged, depending on the privilege level of the instruction being executed.
| Type of Register | Purpose |
|---|---|
| 32 General Registers, each 64 bits in size (non-privileged) | Used to hold immediate results or data that is accessed frequently, such as the passing of parameters. Listed are those with uses specified by PA-RISC or HP-UX. |
| 7 Shadow Registers (privileged) | Store contents of GR1,8,9,16,17,24, and 25 on interrupt, so that they can be restored on return from interrupt. Numbered SHR0-SHR6. |
| 8 Space Registers (SR5-SR7 are privileged) | Hold the space IDs for the current running process. |
| 25 Control Registers (numbered CR0, and CR8 through CR31), each 64 bits. (Most are privileged.) | Used to reflect different states of the system, many related primarily to interrupt handling. |
| 32 Floating Point Registers, 64-bits each, or 64, 32-bits each | Data registers used to hold computations. |
| 2 Instruction Address Queues | Two queues 2 elements deep. The front elements of the queues (IASQ_Front and IAOQ_Front) form the virtual address of the current instruction, while the back elements (IASQ_Back and IAOQ_Back) contain the address of the following instruction. |
| 1 Processor Status Word (PSW), 64 bits (privileged) | Contains the current processor state. When an interruption occurs, the PSW is saved into the Interrupt Processor Status Word (IPSW), to be restored later. Low-order five bits of the PSW are the system mask, and are defined as mask/unmask or enable/disable. Interrupts disabled by PSW bit are ignored by the processor; interrupts masked remain pending until unmasked. |
uarea vas
+-------+ +---------------->+-----+
| | | +--------->| |<--------------+
+-------+ proc | | pregion +-----+ |
|u_procp|---->+-----+ | +->+-----+<->+--+<->+--+<->+--+<-+
+-------+ | | | | | | | | | | |
| | +-----+ | +-----+ +--+ +--+ +--+
+-------+ |p_vas|-+ |p_reg|--+
+-----+ +-----+ |
| | | | |
+-----+ +-----+ |
Process resources |
=========================================|============================
System resources | region
+--->+------+
| |
+------+ broot
|r_root|--->+------+
+------+ | |
chunk | | +------+
+-----+<----+ +------+ +-|b_root|
| | | | +------+
+-----+ | +--+<---------+ | |
hpde RAM <--| vfd | | B-tree | | +------+
+--------+ | dbd | | +--+
| | /|\ +-----+ | / | \
+--------+ | | | | V V \|
|pde_phys|--+ | | | +--+ +--+ +--+
+--------+ +-----+ | | | | | | |
| | | +--+ +--+ +--+
+--------+ | / | \
| |/ V \|
| +--+ +--+ +--+
+---| | | | | |
+--+ +--+ +--+
Process management uses kernel structures down to the pregions to
execute the threads of a process.
The uarea, proc structure,
vas, and pregion are
per-process resources, because each process has its own unique copies of
these structures, which are not shared among multiple processes.
Below the pregion level are the systemwide resources. These
structures can be shared among multiple processes (although they are
not required to be shared).
Memory management kernel structures map pregions to physical memory and provide support for the processor's ability to translate virtual addresses to physical memory. The table that follows introduces the structures involved in memory management; these are discussed later in detail.
| Kernel structure | Purpose |
|---|---|
| vas | Keeps track of the structural elements associated with a process in memory. One vas maintained per process. |
| pregion | A per-process resource that describes the regions attached to the process. |
| region | A memory-resident system resource that can be shared among processes. Points to the process's B-tree, vnode, pregions. |
| B-tree | Balanced tree that stores pairs of page indices and chunk addresses. At the root of a B-tree of VFDs and DBDs is struct broot. |
| hpde | Contains information for virtual to physical translation (that is, from VFD to physical memory). |
Virtual Address Space (vas)
The vas represents the virtual address space of a process
and serves as
the head of a doubly linked list of process region data structures called
pregions.
The vas data structure is always memory resident.
When a process is created, the system allocates a vas structure and puts
its address in p_vas,
a field in the proc structure.
The virtual address space of a process is broken down into logical chunks
of virtually contiguous pages. (See the Process Management white paper
for a table of vas entries.)
pregion
Each pregion represents a process's view
of a particular portion of
its virtual address space and information on getting to those pages.
The pregion points to
the region data structure
that describes the pages' physical locations in
memory or in secondary storage.
The pregion also contains the virtual
addresses to which the process's pages are mapped, the page usage (text,
data, stack, and so forth), and page protections (read, write, execute, and
so on).
pregion
+---------+
+------------->| vas |<-------------+
| +---------- |
| /| |\ |
| / \ |
V |/ \| V
+---------+ +---------+ +---------+ +---------+
| pregion |<->| pregion |<->| pregion |<->| pregion |
+---------+ +---------+ +---------+ +---------+
/|\
|
V
+---------+
| region |
+---------+
The following elements of a per-process pregion structure are
important to the virtual memory subsystem.
struct pregion
| Element | Purpose |
|---|---|
| p_type | Type of pregion. |
| *p_reg | Pointer to the region attached by the pregion. |
| p_space, p_vaddr | Virtual address of the pregion, based on virtual space and virtual offset. |
| p_off | Offset into the region, specified in pages. |
| p_count | Number of pages mapped by the pregion. |
| p_ageremain, p_agescan, p_stealscan, p_bestnice | Used in the vhand algorithm to age and steal pages of memory (discussed later). |
| *p_vas | Pointer to the vas to which the pregion is linked. |
| p_forw, p_back | The doubly-linked list, used by vhand to walk the active pregions. |
| p_deactsleep | The address at which a deactivated process is sleeping. |
| p_pagein | Size of an I/O, used for scheduling when moving data into memory. |
| p_strength, p_nextfault | Used to track the ratio between sequential and random faults; used to adjust p_pagein. |
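To illustrate how p_space, p_vaddr, and p_count together describe the range of addresses a pregion covers, here is a minimal C sketch that finds the pregion containing a given virtual address. The structure is a cut-down stand-in rather than the kernel's struct pregion: the next field and the find_pregion() helper are hypothetical, and the list is assumed to be NULL-terminated for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Simplified stand-in for a pregion; only the fields needed here. */
struct pregion_sketch {
    struct pregion_sketch *next;   /* hypothetical link to the next pregion */
    uint32_t p_space;              /* virtual space */
    uint64_t p_vaddr;              /* starting virtual offset */
    uint64_t p_count;              /* number of pages mapped */
};

/* Return the pregion whose address range contains (space, vaddr), or NULL. */
static struct pregion_sketch *find_pregion(struct pregion_sketch *first,
                                           uint32_t space, uint64_t vaddr)
{
    for (struct pregion_sketch *p = first; p != NULL; p = p->next) {
        uint64_t end = p->p_vaddr + p->p_count * PAGE_SIZE;  /* exclusive */
        if (p->p_space == space && vaddr >= p->p_vaddr && vaddr < end)
            return p;
    }
    return NULL;
}
```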
The region is a system-wide kernel data structure that associates groups
of pages with a given process. Regions can be one of two types, private
(used by a single process) or shared (able to be used by more than one
process). Space for a region data structure is allocated as needed. The
region structure is never written to a swap device,
although its B-tree
may be.
Regions are pointed to by pregions, which are a per-process resource.
Regions point to the vnode
where the blocks of data reside when not in
memory.
Principal Entries in struct region
| Element | Meaning |
|---|---|
| r_flags | Region flags (enumerated shortly). |
| r_type | |
| r_pgsz | Size of region in pages (not just those presently in memory). |
| r_nvalid | Number of valid pages in region. This equals the number of valid vfds in the B-tree or b_chunk. |
| r_dnvalid | Number of pages in swapped region. If the system swaps the entire process, the value of r_nvalid is copied here to later calculate how many pages the process will need when it faults back in. This information is used to decide which process to reactivate. |
| r_swalloc | Total number of pages reserved and allocated for this region on the swap device. Does not account for swap space allocated for vfd/dbd pairs. |
| r_swapmem, r_vfd_swapmem | Memory reserved for pseudo-swap or vfd swap. |
| r_lockmem | Number of pages currently allocated to the region for lockable memory, including lockable memory allocated for vfd/dbd pairs. |
| r_pswapf, r_pswapb | Forward and backward pointers to list of regions using pseudo-swap pages (pswaplist). |
| r_refcnt | Number of pregions pointing at the region. |
| r_zomb | Set to indicate modified text. If an executing a.out file on a remote system has changed, the pages are flushed from the processor's cache, causing the next attempted access to fault. The fault handler finds that r_zomb is non-zero, prints the message "Pid %d killed due to text modification or page I/O error", and sends the process a SIGKILL. |
| r_off | Offset into the page-aligned vnode, specified in pages; valid only if RF_UNALIGNED is not set. Page r_off of the vnode is referenced by the first entry of the first chunk of the region's B-tree. |
| r_incore | Number of pregions sharing the region whose associated processes have the SLOAD flag set. |
| r_dbd | Disk block descriptor for B-tree pages written to a swap device. Specifies the location of the first page; the pages are stored together in a contiguous area of swap space. |
| r_fstore, r_bstore | Pointers to vnode of origin and destination of block. This data depends on the type of pregion above the region. In most cases, r_bstore is set to the paging system vnode, the global swapdev_vp that is initialized at system startup. |
| r_forw, r_back | Pointers to linked list of all active regions. |
| r_lock | Region lock structure used to get read or read/write locks to modify the region structure. |
| r_mlock | Lock used to serialize mlock operations on this region. |
| r_poip | Number of page I/Os in progress. |
| r_root | Root of B-tree; if referencing more than one chunk, r_key is set to DONTUSE_IDX. |
| r_key, r_chunk | Used instead of B-tree search (r_root) if only a single chunk of vfddbds is needed (referencing 32 or fewer pages on a 32-bit kernel, or 64 or fewer pages on a 64-bit kernel). |
| r_next, r_prev | Circularly linked list of all regions sharing vnode. |
| r_preg_un | pregion(s) pointing to the region. |
| r_excproc | Pointer to the proc table entry, if the process has RF_EXCLUSIVE set in r_flags. |
| r_lchain | Linked list of memory lock ranges. |
| r_mlockswap | Swap reserved to cover locks. |
| r_pgszhint | Page size hint. |
| r_hdl | Hardware-dependent layer structure. |
a.out Support for Unaligned Pages
Text and data of most executables start on a four-kilobyte page boundary. HP-UX can treat these as memory-mapped files, because a page in the file maps directly to a page in memory.
In addition to the fields shown above,
struct region has fields to support
executables compiled on older versions of HP-UX whose text and data do
not align on a (4 KB) page boundary. These executables are referenced by
regions whose r_flags has RF_UNALIGNED set.
a.out Support by Regions
| Element | Meaning |
|---|---|
| r_byte, r_bytelen | Offset into the a.out file and length of its text. |
| r_hchain | Hash list of unaligned regions. |
Here are some of the possible flag values for r_flags:
| Region Flag | Meaning |
|---|---|
| RF_ALLOC | Always set because HP-UX regions are allocated and freed on demand; there is no free list. |
| RF_UNALIGNED | Set if text of an executable does not start on a page boundary. In this case, the text is read through the buffer cache to align it, and the vfds are pointed at the buffer cache pages. |
| RF_WANTLOCK | Set if a thread wanted to lock a vfd of this region (to do I/O on the page), but found it already locked and went to sleep. After the vfd is unlocked, this flag ensures that wakeup() is called so the waiting thread(s) can proceed. |
| RF_HASHED | The text is unaligned (RF_UNALIGNED) and thus is on a hash chain. The region is hashed with r_fstore and r_byte; the head of each hash chain is in texts[]. The RF_UNALIGNED flag may be set without the RF_HASHED flag (if the system tries to get the hashed region but it is locked, the system will create a private one), but the RF_HASHED flag will never be set without the RF_UNALIGNED flag. |
| RF_EVERSWP, RF_NOWSWP | Set if the B-tree has ever been or is now written to a swap device. These flags are used for debugging. |
| RF_IOMAP | This region was created with an iomap() system call, and thus requires special handling when calling exit(). |
| RF_LOCAL | Remote file using local swap space. |
| RF_EXCLUSIVE | The mapping process is allowed exclusive access to the region. This flag is set, and r_excproc is set to the proc table pointer. |
| RF_STATIC_PREDICT | Text object uses static branch prediction for compiler optimization. |
| RF_ALL_MLOCKED | Entire region is memory locked. |
| RF_SWAPMEM | Region is using pseudo-swap; that is, a portion of memory is being held for swap use. |
| RF_LOCKED_LARGE | Region is locked using large pages. |
| RF_SUPERPAGE_TEXT | Text region using large pages. |
| RF_FLIPPER_DISABLE | Disable kernel assist prediction; a flag used for performance profiling. |
| RF_MPROTECTED | Some part of the region is subject to the system call mprotect, which is performed on a memory-mapped file. |
r_key, r_chunk,
and r_root are used to find information
about the individual pages of a region.
Each page is represented by a vfd (if it's in memory)
or dbd (if it's on disk).
For each page, the vfd and dbd are grouped
together into a struct vfddbd.
By definition, if the vfd's pg_v bit is set,
the vfd is used; if not, the
dbd is used.
Since information is typically needed about groups of (rather than individual) pages, pages are grouped into chunks. A chunk contains 32 pairs (on a 32-bit kernel) or 64 pairs (on a 64-bit kernel) of virtual frame descriptors and disk block descriptors.
Virtual Frame Descriptor (vfd)
A one-word structure called a virtual frame descriptor enables processes
to reference pages of memory.
The vfd is used when the process is in
memory, and can be used to refer to the page of physical memory described in
the pfdat table (pfdat_ptr[],
described
below).
Virtual frame descriptor (vfd)
+----------+---------------------+
| flags | page frame number |
+----------+---------------------+
11 31
Virtual frame descriptor (struct vfd)
| Element | Meaning |
|---|---|
| pg_v | Valid flag. If set, this page of memory contains valid data and pg_pfnum is valid. If not set, the page's valid data is on a swap device. |
| pg_cw | Copy-on-write flag. If set, a write to the page causes a data protection fault, at which time the system copies the page. |
| pg_lock | Lock flag. If set, raw I/O is occurring on this page. Either the data is being transferred between the page and the disk, or data is being transferred between two memory pages. The kernel sleeps waiting for completion of I/O before launching further raw I/O to or from this page. Nothing can read the page while it is being written to disk. |
| pg_mlock | If set, the page is locked in memory and cannot be paged out. |
| pg_pfnum (aliased as pg_pfn) | Page frame number, from which can be accessed the correct pfdat entry for this page. |
Disk Block Descriptor (dbd)
When the pg_v bit in a vfd is not set,
the vfd is invalid and the page of
data is not in memory but on disk. In this case, the disk block descriptor
(dbd) gives valid reference to the data.
Like the vfd structure, the dbd
is one word long.
Disk block descriptor (dbd)
+----+---------------------------+
|type|           data            |
+----+---------------------------+
0   3                          31
Disk block descriptor (struct dbd)
| Element | Meaning |
|---|---|
| dbd_type | Type of data: |
| dbd_data | vnode type (jfs, nfs, ufs, swap space) specific data. Used by the file system (or swap space management) code to find the data in a file pointed to by a vnode. |
(1) When the dbd_type is DBD_FSTORE, the page of data resides in the file pointed to by r_fstore (typically a file system). When the dbd_type is DBD_BSTORE, the page of data resides in the file or device pointed to by r_bstore (typically a swap device).
A one-to-one correspondence is maintained between vfd
and dbd
through the vfddbd structure,
which simply contains one vfd (c_vfd)
and one dbd (c_dbd).
[Figure: chunks of vfddbds; each vfddbd in a chunk pairs one vfd with one dbd]
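The pairing of vfd and dbd, and the rule that pg_v selects which of the two is used, can be sketched in C as follows. The bit-field widths follow the figures (a 21-bit page frame number, a 4-bit type, and 28 bits of data), but the exact bit layout is an assumption, only some of the flags are shown, and locate_page() is a hypothetical helper rather than a kernel routine.

```c
/* Simplified one-word descriptors; bit order is compiler dependent. */
struct vfd {
    unsigned pg_v     : 1;    /* valid: page is in memory */
    unsigned pg_cw    : 1;    /* copy-on-write */
    unsigned pg_lock  : 1;    /* raw I/O in progress */
    unsigned pg_mlock : 1;    /* locked in memory */
    unsigned pg_pfnum : 21;   /* page frame number */
};

struct dbd {
    unsigned dbd_type : 4;    /* type of data */
    unsigned dbd_data : 28;   /* vnode-type-specific data */
};

/* One vfd and one dbd for each page. */
struct vfddbd {
    struct vfd c_vfd;
    struct dbd c_dbd;
};

enum page_location { PAGE_IN_MEMORY, PAGE_ON_DISK };

/* If pg_v is set the vfd is used; otherwise the dbd is used. */
static enum page_location locate_page(const struct vfddbd *vd, unsigned *where)
{
    if (vd->c_vfd.pg_v) {
        *where = vd->c_vfd.pg_pfnum;   /* page frame number in memory */
        return PAGE_IN_MEMORY;
    }
    *where = vd->c_dbd.dbd_data;       /* data used to find the page on disk */
    return PAGE_ON_DISK;
}
```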
HP-UX regions use chunks of vfds and dbds to keep track of page ownership. Each region contains either a single array of vfddbds (a chunk) or a pointer to a B-tree.
The structure called a B-tree allows for quick searches and efficient storage of sparse data. A bnode is the same size as a chunk; both can be obtained from the same source of memory. The region's B-tree stores pairs of page indices and chunk addresses. HP-UX uses an order 29 B-tree.
A B-tree is searched with a key and yields a value. In the region B-tree, the key is the page number in the region divided by the number of vfddbds in a chunk.
B-tree (order = 3, depth = 3)
++-+-+-+-++
||9| | | ||
+++++++++++
| | | | | |
+-+-+-+-+-+
| |
+-----+ +-----+
| |
V V
++-+-+-+-++ ++-+--+-+-++
||4|7| | || ||9|11| | ||
+++++++++++ +++++-++++++
| | | | | | | | | | | |
+-+-+-+-+-+ +-+-+--+-+-+
| | | | |
+-------------------+ | | | |
| +---------+ | | +---------+
| | | | |
V V V V V
++-+-+-+-++ ++-+-+-+-++ ++-+-+-+-++ ++-+--+-+-++ ++--+--+-+-++
||1|3| | || ||4|6| | || ||7|8| | || ||9|10| | || ||11|12| | ||
+++++++++++ +++++++++++ +++++++++++ +++++-++++++ +++-++-++++++
| |G|H| | | | |D|E| | | | |J|I| | | | |F| B| | | | | C| A| | |
+-+-+-+-+-+ +-+-+-+-+-+ +-+-+-+-+-+ +-+-+--+-+-+ +-+--+--+-+-+
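The key computation described before the figure amounts to an integer division. The sketch below assumes 32 vfddbds per chunk (the 32-bit kernel case); the use of the remainder as the position within the chunk is an assumption made for illustration, and both function names are hypothetical.

```c
#define CHUNK_ENTRIES 32u   /* vfddbds per chunk on a 32-bit kernel; 64 on 64-bit */

/* B-tree key: page number in the region divided by the chunk size. */
static unsigned chunk_key(unsigned page_index)
{
    return page_index / CHUNK_ENTRIES;
}

/* Assumed position of the page's vfddbd within its chunk. */
static unsigned chunk_slot(unsigned page_index)
{
    return page_index % CHUNK_ENTRIES;
}
```

For example, page 70 of a region falls in the chunk with key 2, at slot 6 of that chunk.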
Each node of a B-tree contains room for order+1 keys (or index
numbers) and order+2 values. If a node grows to contain more than
order keys, it is split into two nodes; half of the pairs are kept in the
original node and the other half are copied to the new node.
The B-tree
node data also includes the number of valid elements contained in that
node.
B-tree Node Description (struct bnode)
| Element | Meaning |
|---|---|
| b_key[B_SIZE] | The array of keys used for each page index of the bnode. |
| b_nelem | Number of valid keys/values in the bnode. |
| b_down[B_SIZE+1] | The array of values in the bnode, either pointers to another bnode (if this is an interior bnode) or pointers to chunks (if this is a leaf bnode). |
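A search over bnodes can be sketched as below. This is an illustration inferred from the figure, not the kernel's search routine: it assumes that b_down[0] leads to keys smaller than b_key[0], that b_down[i+1] corresponds to b_key[i], and that the tree depth is known (as recorded in b_depth of struct broot). B_SIZE is set to a small arbitrary value and btree_find() is a hypothetical name.

```c
#include <stddef.h>

#define B_SIZE 4   /* illustrative only; the kernel's value is not given here */

struct bnode {
    int   b_key[B_SIZE];        /* keys (page index / chunk size) */
    int   b_nelem;              /* number of valid keys */
    void *b_down[B_SIZE + 1];   /* child bnodes, or chunks in a leaf */
};

/* Return the chunk stored under key, or NULL if the key is absent. */
static void *btree_find(const struct bnode *node, int depth, int key)
{
    while (node != NULL) {
        int i = 0;

        /* Count the keys that are less than or equal to the search key. */
        while (i < node->b_nelem && key >= node->b_key[i])
            i++;

        if (depth == 1) {                        /* leaf level */
            if (i == 0 || node->b_key[i - 1] != key)
                return NULL;                     /* no chunk for this key */
            return node->b_down[i];
        }
        node = node->b_down[i];                  /* descend one level */
        depth--;
    }
    return NULL;
}
```

With the tree in the figure, a search for key 6 would go left at the root (6 is less than 9), take the middle pointer of the node holding 4 and 7, and find chunk E in the leaf holding 4 and 6.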
struct broot points to the start of the B-tree.
struct broot
| Element | Meaning |
|---|---|
| b_root | Pointer to the initial point of the B-tree. |
| b_depth | Number of levels in the B-tree. |
| b_npages | Pages used to construct the B-tree, counting both pages used for chunks and bnodes. |
| b_rpages | Number of swap pages reserved for the B-tree by the kernel, using the routine grow_vfdpgs(). Amount of swap allocated for the vfd/dbd pairs in the B-tree structure. |
| b_list | Pointer to a linked list of memory pages used for bnodes or chunks in this region. The first page in this list usually has free space available (if b_nfrag is non-zero). New bnodes or chunks can be allocated from here and added to the B-tree. |
| b_nfrag | Number of chunks available (not yet allocated) in b_list. Since chunks are allocated from the end of the page, this is also the index of the most recently allocated chunk in the page (decrement it to get the next available one). |
| b_rp | Pointer to the region using the B-tree. |
| b_protoidx, b_proto1, b_proto2 | Two prototype dbd values, and the page index at which we switch from b_proto1 to b_proto2. This is used to minimize time and memory costs when allocating chunk space. |
| b_vproto | List of page ranges which are copy on write. This allows pages to be set copy on write without having to immediately allocate the actual B-tree entries. This is used to determine the vfd prototype. (See "vfd Prototypes" below.) |
| b_key_cache[], b_val_cache[] | Caches of most recently used keys and pointers to chunks associated with the keys; checked first when looking for a particular struct vfddbd (before searching the B-tree). |
vfd Prototypes
The
When a file is opened as an a.out or shared library, the easiest way to
keep track of the region is to create a
The
All
Even after all processes using the
The page frame data (
If physical memory addresses always started with page zero, and increased
in a continuous sequence, it would be implemented as a single level
array. (Indeed, it was implemented this way in older HPUX releases,
as the hardware they ran on had such a continuous address range.)
However, some recent systems have huge gaps in their physical addresses
(e.g. one might have memory from page 0 to page 0x1000, and then from page
0x20000 to 0x21000); a table that represented all addresses would be
much larger than actually needed.
Consequently the first layer (
The
(1) Hashing is done on the tuple
The
The PA-RISC hardware attempts to convert a virtual address to a
physical address by looking in the TLB. If it cannot resolve the
address, it generates a page fault (interrupt type 6 for an instruction
TLB miss fault; interrupt type 15 for a data TLB miss fault). The kernel
must then handle this fault.
PA-RISC uses a hashed page table (
See "The Page Table or PDIR"
above for additional discussion of this table,
and
"The Hashed Page Directory (
NOTE: For historical reasons, the entries of this table
can be referred to as
To find an address in the
b_vproto field of the .
The list is sorted
by struct broot
contains a list of ranges of pages to be treated as copy-on-write.
This allows pages to be set copy-on-write without their B-tree
entries being allocated immediately.
It is of type struct vfdcw.
When creating vfds, the prototype is determined by checking whether the
page is present in this list to dertmine which prototype
Table 15
struct vfdcw
Element Meaning v_start[MAXVPROTO]
Page that indexes start of copy-on-write range; set to -1 if unused.
v_end[MAXVPROTO]
End of copy-on-write range
pseudo-vas for Text
and Shared Library pregionspseudo-vas
the first time the file
is opened as an executable. This is done by calling mapvnode()
and
storing the vas pointer
in the vnode's v_vas element. On subsequent
opens of the file as an executable,
the non-NULL value in v_vas aids in
finding the region to which the virtual address space is being attached.
pseudo-vas is type PT_MMAP,
and the associated pregion has
PF_PSEUDO set in p_flags.
This pregion is attached to the region for
this vnode.
All the processes that use this executable or shared library
(non-pseudo pregions) then attach to the region with type PT_TEXT
(a.out) or PT_MMAP (shared library).
The number of processes using a
particular vnode as an executable is kept
in the pseudo-vas in
va_refcnt.
pregions associated with a region are connected with a
doubly-linked list that begins with the region element r_pregs,
and is
defined in the pregions by p_prpnext and p_off, the pregion's offset into the region,
and is NULL-terminated.
a.out or shared library exit,
the
handle to the region remains; its pages can be disposed of at that time.
Figure 21 Mapping the
pseudo-vas Structures
a.out shlib
vnode vnode
+-----+ +---->+-------+ +-----+ +---->+-------+
| | | |pseudo | | | | |pseudo |
+-----+ | +>| vas |<+ +-----+ | +>| vas |<+
|v_vas|-+ | +-------+ | |v_vas|-+ | +-------+ |
+-----+ | | +-----+ | |
| | | +-------+ | | | | +-------+ |
+-----+ +>| MMAP |<+ +-----+ +>| MMAP |<+
.............|pregion| ................|pregion|
+-----------------| | : | |--------+
| : +-------+ : +-------+ |
| : : |
| : proc[n].p_vas--+ : |
| : V : V
| : +-------+ : +-------+
| : | vas | +----------------------------->| MMAP |
| : +--------->| |<-----------+ |region |
| : | +-------+ | : | +-------+
| V V V V V /|\
| +-------+ +-------+ +-------+ +-------+ |
| | TEXT |<->| |<->| MMAP |<->| | proc[m].pvas |
| |pregion| | | |pregion| | | | |
| +-------+ +-------+ +-------+ +-------+ | |
| : | :............. V |
| : | : +-------+ |
| : | r_prpnext +------------------->| vas |<---+ |
| :...|............. | : | | | |
| | : | : +-------+ | |
| V V V V V |
| +-------+ +-------+ +-------+ +-------+ +-------+ |
+->| TEXT | | TEXT |<->| |<->| MMAP |<->| | |
|region |<-------|pregion| | | |pregion| | | |
+-------+ +-------+ +-------+ +-------+ +-------+ |
| |
+---------------+
Hardware-Independent Page Information Table (
pfdat) pfdat) table is a two level table
which represents all reallocatable pages of physical memory. (Memory
premanently allocated at kernel boot time is not represented.)
Conceptually it may be imagined as a giant array indexed by the
page frame number (pfn, i.e. the physical page number).
pfdat_ptr)
is basically an array of pointers to sub-tables.
Each pointer represents PFN_CONTIGUOUS_PAGES (0x1000) pages
of possible physical address space, but the pointers are NULL
unless there's actual physical memory in that range.
(As a memory-saving optimization, memory allocated permanently at boot
is treated as nonexistent for purposes of this table.)
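The two-level lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the kernel's code: pfdat_sketch, pfdat_ptr_sketch, TOP_LEVEL_ENTRIES, pfdat_install(), and pfdat_lookup() are hypothetical names, and the sketch assumes that the page frame number is split by simple division and remainder. Only PFN_CONTIGUOUS_PAGES (0x1000) and the NULL convention come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define PFN_CONTIGUOUS_PAGES 0x1000  /* pages covered by one top-level pointer (from the text) */
#define TOP_LEVEL_ENTRIES    16      /* hypothetical size of the top-level array */

/* Hypothetical, cut-down stand-in for struct pfdat. */
struct pfdat_sketch {
    unsigned long pf_pfn;            /* physical page frame number */
};

/* Top level: one pointer per PFN_CONTIGUOUS_PAGES range of physical
 * address space; NULL means there is no memory in that range. */
static struct pfdat_sketch *pfdat_ptr_sketch[TOP_LEVEL_ENTRIES];

/* Attach a sub-table that describes one populated range. */
static void pfdat_install(unsigned long top, struct pfdat_sketch *sub_table)
{
    assert(top < TOP_LEVEL_ENTRIES);
    pfdat_ptr_sketch[top] = sub_table;
}

/* Return the entry for pfn, or NULL if no sub-table covers it. */
static struct pfdat_sketch *pfdat_lookup(unsigned long pfn)
{
    unsigned long top = pfn / PFN_CONTIGUOUS_PAGES;  /* which sub-table */
    unsigned long sub = pfn % PFN_CONTIGUOUS_PAGES;  /* slot within it */

    if (top >= TOP_LEVEL_ENTRIES || pfdat_ptr_sketch[top] == NULL)
        return NULL;
    return &pfdat_ptr_sketch[top][sub];
}
```

Because unpopulated ranges cost only one NULL pointer each, a sparse physical address space does not require a sub-table for every possible page.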
pfdat structures are used for several purposes.
pfdats.
Table 16 Principal Entries in struct pfdat (Page Frame Data)

Element             Meaning
pf_hchain           Hash chain link.
pf_devvp(1)         vnode for device.
pf_next, pf_prev    Next and previous free pfdat entries.
pf_vnext, pf_vprev  Links for linked list of pages associated with the
                    same vnode.
pf_lock             Lock pfdat entry (beta semaphore), used to lock the
                    page while modifying the pde (physical-to-virtual
                    translation, access rights, or protection ID).
pf_pfn              Physical page frame number.
pf_use              Number of regions sharing the page; when pf_use
                    drops to zero, the page can be placed on the free
                    linked list.
pf_cache_waiting    If set, this element means that a thread is waiting
                    to grab the pf_lock on that page. Required for
                    synchronization.
pf_data             Disk block number or other data to uniquely
                    identify this page within pf_devvp.
pf_sizeidx          Identifies the page size for the base page of a
                    large page in a physical memory free list. That
                    size determines which free list it's placed in.
pf_size             Page size of a variable sized page that's in use.
pf_flags            Page frame data flags (shown in the next table).
pf_hdl              Hardware dependent layer elements (see hdlpfdat
                    discussion, shortly).
(pf_devvp, pf_data).
Flags Showing the Status of the Page
Table 17 Principal pf_flags Values

Flag            Meaning
P_FREE          Page is free (available for allocation).
P_BAD           Page is marked as bad by the memory deallocation
                subsystem.
P_HASH          Page is on a hash queue.
P_SYS           Page is being used by the kernel rather than by a user
                process. Pages marked with this flag include dynamic
                buffer cache pages, B-tree pages and the results of
                dynamic kernel memory allocation.
P_DMEM          Page is locked by the memory diagnostics subsystem; set
                and cleared with an ioctl() call to the dmem driver.
P_LCOW          Page is being remapped by copy-on-write.
P_UAREA         Page is used by a pregion of type PT_UAREA.
P_KERN_DYNAMIC  Page is used for kernel dynamic memory. (Subset of
                P_SYS.) This includes pages in the kernel dynamic
                memory free lists.
P_KERN_NO_LGPG  Page is allocated (as kernel dynamic memory) by a user
                who intends to remap it. (Thus, it cannot be part of a
                large page.) Subset of P_KERN_DYNAMIC.
P_SP_POOL       Page is in kernel dynamic memory allocator's superpage
                pool free list. (Subset of P_KERN_DYNAMIC.)
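The "subset" notes in the table can be read as implications between flags. The sketch below checks them under that reading, which is an assumption of this example: a page carrying a subset flag is expected to carry its parent flag as well. The bit values are hypothetical placeholders (the real values are defined in the kernel headers), and pf_flags_consistent() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define P_SYS           0x01u
#define P_KERN_DYNAMIC  0x02u
#define P_KERN_NO_LGPG  0x04u
#define P_SP_POOL       0x08u

/* Return true if every "subset" flag that is set is accompanied
 * by the parent flag named in the table. */
static bool pf_flags_consistent(unsigned int pf_flags)
{
    if ((pf_flags & P_KERN_DYNAMIC) && !(pf_flags & P_SYS))
        return false;   /* P_KERN_DYNAMIC is a subset of P_SYS */
    if ((pf_flags & P_KERN_NO_LGPG) && !(pf_flags & P_KERN_DYNAMIC))
        return false;   /* P_KERN_NO_LGPG is a subset of P_KERN_DYNAMIC */
    if ((pf_flags & P_SP_POOL) && !(pf_flags & P_KERN_DYNAMIC))
        return false;   /* P_SP_POOL is a subset of P_KERN_DYNAMIC */
    return true;
}
```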
Hardware-Dependent Layer Page Frame Data Entry
The pf_hdl field of struct pfdat
contains hardware-dependent information associated with each page.
It is of type struct hdlpfdat, defined in hdl_pfdat.h.
Table 18 struct hdlpfdat

Element         Meaning
hdlpf_flags     Flags that show the HDL status of the page. Values
                include:
                HDLPF_TRANS: A virtual address translation exists for
                this page.
                HDLPF_PROTECT: Page is protected from user access. If
                this flag is set, the saved values (below) are valid
                unless HDLPF_STEAL is also set.
                HDLPF_STEAL: Virtual translation should be removed
                when pending I/O is complete.
                HDLPF_MOD: Analogous to changing the pde_modified flag
                in the hpde.
                pde_ref flag in the hpde.
                HDLPF_READA: Read-ahead page in transit; used to
                indicate to the hdl_pfault() routine that it should
                start the next I/O request before waiting for the
                current I/O request to complete.
hdlpf_savear    Saved page access rights.
hdlpf_saveprot  Saved page protection ID.
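The rule stated for HDLPF_PROTECT and HDLPF_STEAL can be expressed as a small predicate. This is only a sketch: the bit values are hypothetical placeholders (the real ones are in hdl_pfdat.h), and hdl_saved_values_valid() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define HDLPF_PROTECT 0x1u
#define HDLPF_STEAL   0x2u

/* The saved access rights (hdlpf_savear) and protection ID
 * (hdlpf_saveprot) are meaningful only when HDLPF_PROTECT is set
 * and HDLPF_STEAL is not. */
static bool hdl_saved_values_valid(unsigned int hdlpf_flags)
{
    return (hdlpf_flags & HDLPF_PROTECT) != 0u &&
           (hdlpf_flags & HDLPF_STEAL) == 0u;
}
```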
MAPPING VIRTUAL TO PHYSICAL MEMORY
The hash table (htbl)
of page directory entries (hpdes)
is used to pinpoint an address in the
enormous virtual address space.
Control register 25 (CR25) contains the hash table address
(see reg.h).
hpde and hpde2_0)
Structure" above for details of the contents of each table entry.
pdes, hpdes,
of pdirs.
htbl:
htbl
index.
htbl.
Each entry in the table is referred to as a pde (page directory
entry), and is of type struct hpde.
pde to verify the entry.
pde to
complete the translation from virtual address to physical address.
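The general shape of a hashed lookup of this kind (hash the space and offset to an htbl index, verify the pde found there, then use it to form the physical address) can be sketched as follows. Everything concrete here is an assumption made for illustration: the XOR hash is a placeholder and not the hash the hardware or kernel uses, hpde_sketch is a cut-down stand-in for struct hpde with invented field names, the sketch assumes 4 KB base pages, and collision handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NHTBL      8u    /* hypothetical table size (a power of two in this sketch) */
#define PAGE_SHIFT 12u   /* assumes 4 KB base pages */

/* Hypothetical, cut-down stand-in for struct hpde. */
struct hpde_sketch {
    int      valid;      /* entry holds a translation */
    uint32_t space;      /* tag: space ID */
    uint32_t vpage;      /* tag: virtual page number within the space */
    uint32_t pfn;        /* physical page frame number */
};

static struct hpde_sketch htbl_sketch[NHTBL];

/* Placeholder hash: NOT the real hashing function. */
static uint32_t htbl_index(uint32_t space, uint32_t offset)
{
    return (space ^ (offset >> PAGE_SHIFT)) & (NHTBL - 1u);
}

/* Translate (space, offset) to a physical address.
 * Returns 0 on success, -1 if the pde does not match. */
static int translate(uint32_t space, uint32_t offset, uint64_t *paddr)
{
    const struct hpde_sketch *pde = &htbl_sketch[htbl_index(space, offset)];

    /* Verify the entry by comparing its tag with the virtual address. */
    if (!pde->valid || pde->space != space ||
        pde->vpage != (offset >> PAGE_SHIFT))
        return -1;

    /* Combine the page frame number with the offset within the page. */
    *paddr = ((uint64_t)pde->pfn << PAGE_SHIFT) |
             (offset & ((1u << PAGE_SHIFT) - 1u));
    return 0;
}
```

The verification step matters because different virtual addresses can hash to the same index; a tag mismatch means the entry belongs to some other page.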
Figure 22 Mapping from the htbl Entry to the Page Directory Entry
htbl +-----+
| |
| |
| |
+-----+ +------+ | |
|Space| |Offset| | |
+-----+ +------+ | |
\ / | |
\ / | |
\ / | |
_/ \_ | |
----------- | |
\ hash / | |
\ / | |
| | |
V | |
+----------+ +-----+
|htbl index|------> htbl[n] | pde | ----> RAM
+----------+ +-----+
| |
| |
| |
+-----+
htbl[nhtbl-1] | pde |