Note: I, JAK, did not write most of the material below - Paul Wilson did! I mention this because a couple of people have asked me questions about this page to which I did not know the answers, apparently believing that I wrote all of this.
Re: naive questions, docs, etc.

To: Rik van Riel
Subject: Re: naive questions, docs, etc.
From: "Paul R. Wilson"
Date: Mon, 4 Jan 1999 18:31:25 -0600
Cc: linux-mm@kvack.org, wilson@cs.utexas.edu
Sender: owner-linux-mm@kvack.org

Here's my first batch of notes on the VM system. It's mostly introductory, overall-picture kinds of things, but I need feedback on it before I can write nitty-gritty stuff.

---------------------------------------------------------------------------

THE GNU/LINUX 2.2 VIRTUAL MEMORY SYSTEM, PART I

Comments welcome---send to wilson@cs.utexas.edu (Paul Wilson).

I. INTRODUCTION

This document introduces basic aspects of the design of the Linux virtual memory system, circa version 2.2. I assume people understand basic VM concepts (like page tables and clock algorithms) and basically how the Linux kernel works (system calls, etc.).

---

To understand the caching of pages, we must make a three-way distinction between virtual pages, logical pages, and page frames.

 * A virtual page is a page as seen through an address mapping (e.g., in a page table) of a process. That is, it's the page that lives at a particular page-sized, page-aligned range of addresses, as seen by the process.

 * A logical page is an abstract unit of storage, holding a page full of data. It may be mapped as a particular virtual page from the point of view of one process, but as a different virtual page from the point of view of another process, or as a page-sized hunk of data when viewed through the file system (e.g., in the case of memory-mapped files).

 * A page frame is a physical page of main memory (typically RAM).

There's really another category of page, which is a disk page, but disk pages are managed in several ways, so I'll defer discussing them until later. Disk pages and page frames are actual units of storage. Virtual pages are units of addressing (how pages are "named" by a process).

---

The important thing to recognize is that logical pages are what a program sees, but it sees them *as* virtual pages at particular places in its address space. Virtual and logical pages aren't the same thing, and the virtual memory system has two main jobs:

 1. keeping the active (recently-touched) logical pages in page frames (RAM pages). This is the job of the replacement mechanism, which evicts logical pages which haven't been touched recently from main memory, in favor of logical pages which have been touched recently.

 2. making sure that the right logical pages show up as the right virtual pages in any particular process's virtual memory mapping. This is the job of the address-mapping mechanisms, including page tables and "vm areas".

The virtual memory system also supports copy-on-write sharing of pages, so that large regions of memory can be logically copied without actually copying the pages until it's necessary. In the case of copy-on-write sharing, a given physical page may represent multiple logical pages, which were "logically" copied when the page was shared from one process to another. (For example, a fork() logically copies the entire address space of a process, but the actual copying occurs a page at a time, as attempts to modify pages cause write faults that transparently copy the pages rather than letting them be modified.) In the case of a copy-on-write shared page, the same physical page (RAM page frame or disk page) represents more than one logical page of storage, until and unless the actual copying is done.
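The lazy copying is visible (by its effects) from user space. Here is a minimal, self-contained C demonstration: fork() logically copies the whole address space, but each physical copy happens only when a write fault forces it, so the child's write never disturbs the parent's view.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int shared_val = 42;   /* sits in a data page that fork() shares copy-on-write */

int main(void)
{
        if (fork() == 0) {
                /* Child: this store write-faults; the kernel transparently
                   copies the page, and the store goes to the child's copy. */
                shared_val = 99;
                _exit(0);
        }
        wait(NULL);
        /* Parent: prints 42---the child modified only its own copy. */
        printf("parent still sees %d\n", shared_val);
        return 0;
}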
2. KERNEL MODE AND MAIN MEMORY ALLOCATION
=========================================

Kernel Mode and Physical Addressing
-----------------------------------

The Linux kernel sees memory at its actual physical addresses, either by executing in "real mode" on some processors (like the early x86 family processors), or by using a page table (or tables?) that maps virtual memory addresses to the corresponding physical addresses (e.g., on the Alphas). One way or another, the kernel sees main memory "as it is."

[ Question 0: Is this right? I've seen several conflicting descriptions. Does the kernel execute in real mode on x86's, or use a kernel page table that's an identity function, or something else? I actually have a bunch of questions related to this, and cache/TLB interactions, and NT-like NEITHER_IO, but I'll put off asking them. ]

[ Answer 0: Not quite. The kernel sees all of physical memory mapped starting at PAGE_OFFSET, which is typically 3GB on x86. -- jak ]

In contrast, normal processes see memory via page tables that give them their own virtual address spaces, and the virtual address of a logical page has little to do with the physical address it resides at---it's wherever the kernel decided to put it.

On x86 processors, [the top?] 1 GB of the address space is reserved for the kernel, and user processes cannot see that part of the address space at all---no pages are mapped there in user-process page tables. On entry to the kernel, the addressing mode is changed, and the kernel can see physical RAM there.

[ Question 0b: Can the kernel also see the user space of the process it's executing on behalf of, through the page table of that user process? That's how some OS's do it, so that kernel-mode drivers can copy straight from user space to kernel space and vice versa, and can address memory either by user virtual address or physical address. ]

[ Answer 0b: Yes. All user page directories share the >=PAGE_OFFSET kernel page tables. get_pgd_slow() copies the relevant parts of swapper_pg_dir when a new page directory is allocated. -- jak ]

Kernel Memory Allocation
------------------------

[ Comment from jak: this section is not accurate wrt 2.4 kernels. See the kernel allocators page. ]

The kernel manages the allocation of physical memory using several memory allocators, the main one being kmalloc. Kmalloc keeps track of which regions of main memory are in use, and divides memory up into blocks which are power-of-two multiples of the virtual memory page size. A physical page of memory may be used to cache a virtual memory page or file page, or as raw memory for DMA I/O, or to hold page tables used by the VM system, or to hold data structures belonging to the kernel itself, etc.

kmalloc

The main kernel memory allocator is kmalloc, which uses the binary buddy system, and only manages blocks of memory whose sizes are a power-of-two multiple of the page size. Kmalloc manages physical memory by physical address---the blocks it returns are contiguous in physical memory, and kmalloc itself knows nothing about virtual memory. This is important for supporting certain kinds of devices which require large blocks of contiguous, physically-addressed memory, such as DMA devices that bypass the CPU (and its address-translation hardware).

The binary buddy system is a simple and relatively fast allocator, but it is a poor allocator in many respects.
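For flavor, here is roughly how 2.2-era kernel code allocates and frees a physically contiguous, DMA-capable buffer with kmalloc. This is an illustrative sketch, not code from any particular driver; the buffer size and function names here are made up, but kmalloc(), kfree(), GFP_KERNEL, and GFP_DMA are the real interface.

/* Illustrative sketch of the kmalloc interface, 2.2-era style. */
#include <linux/malloc.h>   /* kmalloc(), kfree() (2.2-era header) */

static char *dma_buf;

int grab_dma_buffer(void)
{
        /* GFP_DMA requests memory a DMA device can reach; since the
         * allocator hands out power-of-two blocks, this 8192-byte
         * request costs exactly two contiguous page frames on a
         * 4KB-page machine.  GFP_KERNEL allows the caller to sleep. */
        dma_buf = kmalloc(8192, GFP_KERNEL | GFP_DMA);
        if (!dma_buf)
                return -1;      /* no contiguous block available */
        return 0;
}

void release_dma_buffer(void)
{
        kfree(dma_buf);
        dma_buf = NULL;
}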
The fact that the buddy system can only manage block sizes that are powers of two means that using it straightforwardly requires rounding the requested block sizes up to a power of two, which can incur a large cost in internal fragmentation, i.e., wasted space within blocks. Linux therefore provides two other allocators for allocating memory within the kernel. The "slab" allocator gets largish hunks of memory from kmalloc and carves them into smaller pieces as needed, and vmalloc provides areas of contiguous virtual memory that aren't necessarily contiguous in physical memory.

The Slab Allocator

[ Is this basically the same design as the "slab" allocator from a USENIX paper a few years back? (McKusick & Karels?) ]

vmalloc

[ But looking at it, it seems to allocate contiguous physical memory by default. This seems like a mistake, and could be one cause of the fragmentation that was a problem before the slab allocator was added. ]

3. PAGE FRAMES, CORRESPONDING PAGE STRUCTS, AND LOGICAL PAGES
=============================================================

All page frames (physical RAM pages) have basic metadata stored in an array named mem_map, which parallels the physical address ordering of page frames in physical memory. (E.g., the 2nd record of mem_map describes the 2nd page frame.) The elements of this array are of type mem_map_t, better known by the typedef name "page". (See the header file mm.h and the comments there.)

(A page struct is not a page, and is not _part_ of a page---it's the entry in the mem_map array that describes a page frame and its current contents. It acts as a kind of header, but the header is "off to the side" in the mem_map array, rather than at the beginning of the page.)

A page frame may be used for other purposes than caching pages; e.g., it may be part of an area holding the kernel's own data structures.

A page struct contains several fields, including a flags field that holds flags describing how the page frame is being used and what is stored in it. It also holds several link fields, for linking the page frame into a doubly-linked list and/or a hash table. These allow page frames to be linked together into pools of page frames that are used in particular ways.

The page struct has several flags used for synchronization purposes and for the replacement mechanism. It also has a "count" field indicating how many mappings of the page frame currently exist, e.g., how many page table entries from (possibly) different processes currently point to it.

---

An important role of the page struct is to serve as a temporary identifier of a logical page. The virtual memory system ensures that a given logical page is cached in at most one page frame at any given time. While it's there, a pointer to the corresponding page struct can act as an identifier of the logical page, as well as of the page frame holding it.

Page Frames Holding Different Kinds of Logical Pages
----------------------------------------------------

A logical page of storage (which may be physically located in a page frame or on disk) is commonly a page of a process's private virtual address space, but may be a page of a (System V IPC) shm (shared memory) segment shared between processes, or a page of a memory-mapped file. (An mmapped page may also be shared, if multiple processes map the same page(s) of a file into their address spaces.) Different kinds of logical pages are identified in different ways.
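For orientation, here is a heavily condensed sketch of the 2.2-era page struct just described. The real definition is in include/linux/mm.h; take the exact field set and ordering here as approximate. The inode and offset fields are the identification fields discussed next.

/* Condensed, approximate sketch of 2.2's struct page (mem_map_t);
 * see include/linux/mm.h for the authoritative definition. */
typedef struct page {
        struct page *next, *prev;     /* list links for the pool this
                                         frame currently belongs to    */
        struct inode *inode;          /* file whose page is cached here */
        unsigned long offset;         /* offset of that page in the file */
        struct page *next_hash;       /* page-cache hash chain           */
        atomic_t count;               /* how many users/mappings exist   */
        unsigned long flags;          /* PG_referenced, PG_locked, ...   */
        struct buffer_head *buffers;  /* sub-page file buffers, if any   */
        unsigned long map_nr;         /* this frame's index in mem_map   */
} mem_map_t;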
File pages

If a page frame is used to cache a (logical) page of a normal file, the corresponding page struct's "inode" and "offset" fields hold a pointer to the inode struct that the kernel uses to represent the file, and the offset of the page within that file. The combination of the inode and the offset in the file is sufficient to uniquely identify a logical page.

A page frame may be subdivided, however, and used to buffer several file blocks that are smaller than the page size. In this case, the corresponding page struct's "buffers" field points to a circular list of "buffer head" structs representing those buffers.

[ QUESTION 1: I haven't looked at this very closely, or not with much comprehension. Do sub-page buffers like this have to hold consecutive file blocks, or are they managed arbitrarily by the buffer cache replacement policy? How does this interact with the main clock mechanism? ]

Private pages

A page frame may be used to cache an ordinary page of a process's virtual address space, which is not part of a file or a shm segment (see below). [ AS I UNDERSTAND IT ] the page struct corresponding to such a page does not store the identity of the page whose contents it stores---the only record of what's there is in the page tables of the processes.

[ QUESTION 2: Is that right? If so, it seems to me it'd be nicer to store a pointer to the task, or to a special kind of inode object belonging to the task. This would be much more symmetrical, and would make the swapper more modular because it wouldn't have to invent identities for the pages it caches. It'd also be helpful in debugging and integrity-checking the kernel, and could be valuable information to a smarter swapper/clusterer. (On a COW, the new page frame would get a pointer to the process that made the copy.) Are page frames caching new private pages registered in any particular data structure? ]

shm (IPC shared memory) pages

(This is a bit confusing because there are different senses of the phrase "shared memory.") Logical pages of memory-mapped files may be shared, but that's not what "shared memory" means in the Linux VM system. (I'll try to use the term "shm" when it's not obvious.) Linux supports System V interprocess communication, which manages regions of logical storage. These are different from files, and very different from regions of virtual storage.

[ QUESTION 2B: hmmm again... it seems like it would be good to put the id of the shm segment in the page struct... or is that done? ]

4. ADDRESS MAPPING: VM_OPERATIONS, VM_AREAS, PAGE TABLES, PTE's
===============================================================

Linux uses several kinds of data structures and a hardware cache (the Page Table Entry cache, a.k.a. TLB for "Translation Lookaside Buffer") to present the appropriate logical pages as the right virtual pages, from a process's point of view.

The virtual memory mappings of a process are represented by two data structures, one high-level (vm_area_struct's), and the other low-level (page tables). The high-level data structure describes regions of the address space, but does not specify everything about the state of individual pages. The low-level data structure (the page table) provides detailed information about individual pages, such as whether a page is in main memory, and if so, where.
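The next two subsections describe these structures in turn. As a reference point, here is a heavily condensed sketch of their shape, trimmed from 2.2's include/linux/mm.h; several fields and operations are omitted and the signatures should be taken as approximate.

/* Condensed sketch; see include/linux/mm.h in 2.2 for the full
 * definitions---many fields and operations are omitted here. */
struct vm_area_struct {
        struct mm_struct *vm_mm;         /* the address space we belong to */
        unsigned long vm_start;          /* first address in the region    */
        unsigned long vm_end;            /* first address past the region  */
        struct vm_area_struct *vm_next;  /* next region, in address order  */
        struct vm_operations_struct *vm_ops;  /* how to handle this region */
        /* ... protections, flags, backing file and offset, etc. ... */
};

struct vm_operations_struct {
        void (*open)(struct vm_area_struct *area);
        void (*close)(struct vm_area_struct *area);
        /* fault on a page with no valid pte: produce the page */
        unsigned long (*nopage)(struct vm_area_struct *area,
                                unsigned long address, int write_access);
        /* evict a page from this region */
        int (*swapout)(struct vm_area_struct *area, unsigned long address,
                       pte_t *page_table);
        /* ... sync, protect, wppage, swapin, etc. ... */
};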
VM AREAS
--------

The high-level structure is a list of vm_area structs that summarizes a process's use of its address space, with an entry for each contiguous region of the address space (a range of pages) that is treated differently from the preceding or following pages. The list is ordered by address (virtual page number).

For example, when a process maps a file into a previously unoccupied part of its address space, the mmap system call creates a vm_area struct recording the virtual page range it is mapped to and how the kernel should handle operations like page faults, copy-on-write page copies, and pageouts in that region. These operations are specified by a pointer from the vm_area_struct to a vm_operations_struct, which is simply a struct whose fields contain function pointers, pointing to the functions for those operations.

When the kernel needs to perform a given operation on a given page of a process's address space, it can search the list of vm_area_struct's for the one describing the region containing that page. It can then extract the pointer to the vm_operations struct, and from that extract the function pointer to perform that operation. (This is essentially a manual implementation of virtual function tables as in C++; the vm_area_struct is an instance of the class described by its vm_operations_struct, which provides the code to manipulate the instance.)

[ QUESTION 3: Why isn't this actually done consistently? Some of this stuff is hardcoded, and it uglifies the code. There are several ugly code sequences that test whether a vm_area has a non-null vm_ops, and if so whether it has a function pointer for a particular operation, with hardcoded defaults for null values. Would it be easy to replace the null fields with simple wrappers (to either do nothing, or call the appropriate existing procedure)? Or is there deep magic that would break because this stuff isn't actually modular? I am a little afraid to touch it. ]

[ QUESTION 4: There used to be AVL trees of vm_area_struct's. What happened to them? Are they going to come back? Linear lists are bad news for some apps, including our page-faulting persistent object store, and process checkpointing. It's also potentially bad for generational GC's that use page protections to implement the generational write barrier. ]

MULTILEVEL PAGE TABLES
----------------------

[ Terminology warning: for a few paragraphs, I'll use the standard term "page table" to mean a whole (e.g., multilevel) page table that describes a whole address space. Then I'll introduce the Linux terminology, in which a "page table" is only a lowest-level unit in a hierarchical multilevel page table. Bleah. ]

The page table of a process provides detailed information about the pages visible to the process through its virtual address space, and is used by the CPU and/or trap handlers to translate the process's virtual addresses into physical addresses, so that the process can access actual memory.

Conceptually, a page table is simply a very large array of "page table entries," or pte's---one per page of the virtual address space. Each element in this array corresponds to one page in the address space, with the ordering of the page table entries corresponding to the ordering of the virtual pages they describe. The PTE for a given virtual page records whether the page is in main memory, and if so, which page of physical memory (the physical address of the virtual page).
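To make "which page of physical memory" concrete, here is the layout of a 32-bit x86 pte with 4KB pages. This is a sketch; the kernel's official bit definitions live in include/asm-i386/pgtable.h, and the helper function here is hypothetical.

/* Sketch of the x86 (32-bit, 4KB-page) pte layout; the kernel's
 * official definitions are in include/asm-i386/pgtable.h. */
#define PTE_PRESENT   0x001        /* page is in main memory           */
#define PTE_RW        0x002        /* writes allowed                   */
#define PTE_USER      0x004        /* user mode may access the page    */
#define PTE_ACCESSED  0x020        /* set by hardware on any access    */
#define PTE_DIRTY     0x040        /* set by hardware on a write       */
#define PTE_FRAME     0xfffff000   /* physical page-frame address      */

/* E.g., extracting the physical address a (present) pte maps to: */
static unsigned long pte_phys(unsigned long pte)
{
        return pte & PTE_FRAME;
}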
The PTE also records certain auxiliary information, such as whether the page has been touched recently, and whether it has been dirtied (written to) recently---these bits of information are used by the replacement policy that manages the main-memory caching of VM pages. It also records protections that have been set on a page---whether the process is allowed to read from or write to the page, and perhaps whether it's allowed to execute code residing in that page. Protections are used to ensure that processes do not modify things they're not supposed to (such as a memory-mapped file mapped read-only), and are also used to implement copy-on-write lazy copying of regions of memory.

Page tables are not actually represented as very large contiguous arrays with one entry per virtual page. Instead, they are represented sparsely, as a simple kind of multiway tree with a rigidly fixed geometry, which typically has many missing subtrees. (This is the classic multilevel page table structure you'll see in any introductory OS textbook.)

Cross-architecture 3-level Abstraction

Actual computers of different types use different kinds of page tables, with different numbers of levels. The architecture-independent parts of the Linux kernel source code are written as though there were always three levels of the page table structure, which there actually are in Alpha Linux. On the x86 processors, the multilevel page tables are actually only two levels deep, which is sufficient for addressing a 4GB address space (32-bit addressing) with 4KB pages. The code that traverses the "middle levels" of page tables does nothing on the x86 architecture---it gets preprocessed and compiled down to essentially nothing via platform-specific #ifdefs. This allows other code to be written as though all machines had three-level page tables.

On x86 processors, the actual (two-level) geometry of page tables is fixed in the hardware---the CPU actually traverses the page tables without running any software. On some other machines, there is no hardware support for multilevel page tables at all; these machines may use either all-software page table searches, or something called an "inverted page table," which isn't really a page table in the normal sense at all. An inverted page table is really a kind of secondary cache of page table entries, stored in main memory, rather than a whole page table.

[ PPC does this, right?---it has a hardware probe of the inverted page table, and traps to software on a miss, IIRC. Does Alpha or ARM, or any MIPS machines Linux runs on? ]

Linux terminology---names for levels

In Linux (Intel-derived?) terminology, the term "page table" is not generally used for an entire multilevel page table. Rather, the term "page table" is used to mean only the lowest-level node of the multiway tree that implements the whole "page table" in the normal sense. The abstract three-level page table has a name for each kind of node:

 * A top-level node is called a PAGE GLOBAL DIRECTORY, or "pgd", because it acts as the index to all the pages belonging to the process.

 * A middle-level node is called a PAGE MIDDLE DIRECTORY, or "pmd".

 * A bottom-level node is called a "page table," because it holds actual pte's describing particular (virtual) pages.
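Putting the level names together, architecture-independent code locates a pte roughly as follows. This is a sketch: pgd_offset(), pmd_offset(), and pte_offset() are the kernel's real macros (and on x86 the pmd step compiles away to nothing), but the wrapper function is hypothetical and locking is omitted.

/* Sketch: walking the three-level abstraction to find the pte
 * for a virtual address in the address space "mm". */
static pte_t *find_pte(struct mm_struct *mm, unsigned long address)
{
        pgd_t *pgd = pgd_offset(mm, address);  /* top level    */
        pmd_t *pmd;

        if (pgd_none(*pgd))
                return NULL;                   /* no middle node here */
        pmd = pmd_offset(pgd, address);        /* middle level */
        if (pmd_none(*pmd))
                return NULL;                   /* no page table here  */
        return pte_offset(pmd, address);       /* bottom level */
}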
The PTE Cache (a.k.a. TLB)
--------------------------

All modern computers designed for virtual memory incorporate a special hardware cache called a PTE cache or TLB (Translation Lookaside Buffer), which caches page table entries in the CPU, so that the CPU usually doesn't have to probe the page table to find a PTE that lets it translate an address.

The PTE cache is the magic gadget that makes virtual memory practical. Without it, the CPU would have to do extra main-memory reads for every read or write instruction executed by the running program, just to look up the PTE that lets it translate a virtual address into a physical one. (Amazingly, IBM actually built such a horrible beast at one time. It was intended to be a compatible and cheap version of one of their expensive mainframes, and was intentionally so slow that it didn't significantly cut into their market for the expensive fast ones.)

Rather than looking up a PTE in the page table each time it needs to translate an address, the CPU looks in its page table entry cache to find the right page table entry. If it's there already, it reuses it without actually traversing the page table. Occasionally, the PTE cache doesn't hold the PTE it needs, so the CPU loads the needed entry from the page table, and caches that.

Note that a PTE cache does not cache normal data---it only caches address translation information from the page table. A page table entry is very small, and the PTE cache only caches a relatively small number of them (depending on the CPU, usually somewhere between 32 and 1024 of them). This means that PTE cache misses are a couple of orders of magnitude more common than page faults---any time you touch a page you haven't touched fairly recently, you're likely to miss the PTE cache. This isn't usually a big deal, because PTE cache misses are many orders of magnitude cheaper than page faults---you only need to fetch a pte from main memory, not fetch a page from disk.

A PTE cache is very fast on a hit, and is able to translate addresses in a fraction of an instruction cycle. This translation can generally be overlapped with other parts of instruction setup, so the PTE hardware gives you virtual memory support at essentially zero time cost.

---

(Most of the following can be skipped without loss, for most kernel-hacking purposes.)

On some machines, such as x86's, the CPU automatically loads page table entries from the multilevel page table whenever a PTE cache miss occurs. On a few machines, a PTE cache miss causes a trap to software, and the OS does the translation by running a simple routine.

[ Old MIPS machines used to do this. Do any new machines? ]

[ Yes, Hitachi SuperH chips work this way. -- jak ]

On other machines, the CPU automatically probes an inverted page table when the on-chip PTE cache misses. The inverted page table is really just an off-chip PTE cache, much larger than the on-chip one, and probing it usually just takes a memory load. If this off-chip cache misses too, the CPU traps to the kernel to do the real page-table lookup. (Such machines are designed to be able to do entirely without conventional multilevel page tables. Rather than probing a multilevel page table, information about all in-memory pages is kept in the inverted page table, and information about out-of-memory pages can be kept in other data structures; e.g., Linux's vm_area lists could hold it.)
Linux does NOT work this way, however---on such machines, Linux manages the three-level page tables in software, but only needs to probe them on the very occasional inverted page table miss.

5. VM CACHING AND REPLACEMENT
=============================

VM caching in Linux is rather complex, and the code is a little difficult to understand at first, for several reasons. One is that some of the names of variables and procedures are less than intuitive and/or less than consistent. Another is that the coding style takes a little getting used to---lots of one-branch if's, with the following code being an implicit else. This is designed to generate excellent straight-line usual-case code, given GCC's optimization heuristics and some lame branch predictors in some CPU's that Linux runs on. Here I'll try to give a basic sketch of what's going on, to show the forest, not the trees. The trees will be discussed later.

The Main Clock Algorithm
------------------------

The main component of the VM replacement mechanism is a clock algorithm. Clock algorithms are commonly used because they provide a passable approximation of LRU replacement and are cheap to implement. (All common general-purpose CPU's have hardware support for clock algorithms, in the form of the reference bit maintained by the PTE cache. This hardware support is very simple and fast, which is why all designers of modern general-purpose CPU's put it in.)

A little refresher on the general idea of clock algorithms: a clock algorithm cycles slowly through the pages that are in RAM, checking to see whether they have been touched (and perhaps dirtied) lately. For this, the hardware-supported reference and dirty bits of the page table entries are used. The reference bit is automatically set by the PTE cache hardware whenever the page is touched---a flag bit is set in the page table entry, and if the PTE is evicted from the PTE cache, it is written back to its home position in the page table. The clock algorithm can therefore examine the reference bits in page table entries to "examine" the corresponding pages.

The basic idea of the clock algorithm is that a slow incremental sweep repeatedly cycles through all of the cached (in-RAM) pages, noticing whether each page has been touched (and perhaps dirtied) since the last time it was examined. If a page's reference bit is set, the clock algorithm doesn't consider it for eviction at this cycle, and continues its sweep, looking for a better candidate for eviction. Before continuing its sweep, however, it resets the reference bit in the page table entry. Resetting the reference bit ensures that the next time the page is reached in the cyclic sweep, it will indicate whether the page was touched since _this_ time. Visiting all of the pages cyclically ensures that a page is only considered for eviction if it hasn't been touched for at least a whole cycle.

The clock algorithm proceeds in increments, usually sweeping a small fraction of the in-memory pages at a time, and keeps a record of its current position between increments of sweeping. This allows it to resume its sweeping from that page at the next increment.

Technically, this simple clock scheme is known as a "second chance" algorithm, because it gives a page a second chance to stay in memory---one more sweep cycle. More refined versions of the clock algorithm may keep multiple bits, recording whether a page has been touched in the last two cycles, or even three or four. Only one hardware-supported bit is needed for this, however.
Rather than just testing the hardware-supported bit, the clock hand records the current value of the bit before resetting it, for use next time around. Intuitively, it would seem that the more bits are used, the more precise an approximation of LRU we'd get, but that's usually not the case. Once two bits are used, clock algorithms don't generally get much better, due to fundamental weaknesses of clock algorithms. [ I can elaborate on that if anyone's interested. ]

Linux uses a simple second-chance (one-bit clock) algorithm, sort of, but with several elaborations and complications. The main clock algorithm is implemented by the kernel swap demon, a kernel thread that runs the procedure kswapd(). kswapd is an infinite loop, which incrementally scans all the normal VM pages subject to paging, then starts over. kswapd generally does its clock sweeping in increments, and sleeps in between increments so that normal processes may run. (See the file mm/vmscan.c.)

kswapd isn't the only replacement algorithm, however---there are actually several interacting replacement algorithms in the Linux memory management system.

 * Pages that are part of (System V IPC) shared memory segments are swept by a different clock algorithm, implemented by shm_swap().

 * Page frames managed by the file buffering system are managed differently, for several reasons. (For example, file blocks may be smaller than the VM page size, and filesystem metadata are flushed to disk more often than normal blocks.) Code for file buffering is in the file fs/buffer.c, including bdflush(), which is run as a kernel thread (like kswapd()) to manage buffer flushing.

 * (Is there another one? There's the VFS metadata cache, and the slab allocator caches, but they're pretty different...)

[ I'm pretty confused about the exact relationships between all of the caches and caching algorithms. There seem to be significant changes since the last time I studied the kernel source, and there are very few big-picture comments.

QUESTION 5: what exactly does and doesn't go in the page cache?

QUESTION 6: what are the interactions between shm caching, other page caching, and file buffer caching? I think the source needs some very clear comments of the form "The X cache holds Y's and consists of the set of page frames entered into the data structure Z." It needs these for the page cache, the buffer cache, and the swap cache at least. It also needs comments about how these things basically interact. (Are these caches distinct? What gets moved between them, and when and how?) ]

Equally important, there is another clock algorithm, implemented by calling shrink_mmap() from the kernel swap demon (kswapd()). This clock algorithm is different from the others, in that it's a "back end" algorithm for the main clock implemented by kswapd(). That is, pages may be evicted from the normal clock but not actually evicted from memory. An important distinction between the main kswapd clock and the auxiliary shrink_mmap clock is that the main clock sweeps over *virtual* pages, while the shrink_mmap clock sweeps over page *frames* (RAM pages). The front-end clock uses the hardware-supported PTE reference bits, and consolidates the information they hold into the per-page-frame reference bit held in the PG_referenced bit (of the "flags" field) in the page structs that represent page frames. [ QUESTION 7: is this right? ]

The relationships between these two clocks are fairly subtle. [ At least, they're far from obvious to me... ]
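Before looking at how Linux layers these clocks, here is a small, self-contained user-space simulation of the basic one-bit second-chance sweep described in the refresher above. It is illustrative only, and not modeled on kswapd's code; all names are made up.

#include <stdio.h>

#define NPAGES 8

struct frame {
        int page;         /* which logical page is cached here   */
        int referenced;   /* stands in for the hardware pte bit  */
};

static struct frame frames[NPAGES];
static int hand;          /* saved clock position between sweeps */

/* Sweep from the saved hand position until a frame whose
 * reference bit is clear is found, clearing bits as we pass,
 * so each page gets one more full cycle to prove itself. */
static int pick_victim(void)
{
        for (;;) {
                struct frame *f = &frames[hand];
                hand = (hand + 1) % NPAGES;
                if (!f->referenced)
                        return (int)(f - frames);  /* untouched for a cycle */
                f->referenced = 0;                 /* second chance */
        }
}

int main(void)
{
        int i;
        for (i = 0; i < NPAGES; i++)
                frames[i] = (struct frame){ .page = i, .referenced = 1 };
        frames[3].referenced = 0;   /* pretend page 3 went untouched */
        printf("victim frame: %d\n", pick_victim());   /* prints 3 */
        return 0;
}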
A Clock Over Virtual Pages
--------------------------

Notice that when the PTE cache sets the reference bit on a page, it is setting a bit for a page table entry, i.e., for a *virtual* page, _not_ a page frame or the logical page it holds. (A logical page may be cached in a page frame, but be mapped by more than one page table entry, likely in different processes' page tables.) The same logical page therefore has as many reference bits as there are mappings of that page as virtual pages in address spaces, each one recording whether one process has touched the page "lately." If a logical page has been touched by *any* process recently, it is a poor candidate for eviction. We therefore need something like the OR of the reference bits for different mappings of the page.

(It might seem as though we'd want to take the sum, and prefer to keep in memory those pages that have been touched the most. This is not a good idea. In general, a single touch by one process is just as important as a large number of touches by a large number of processes, if they all occur in a short amount of time. The replacement policy's real job is not to predict how often a page will be touched, and keep the most-touched pages in RAM. It is to predict how *soon* pages will be touched *next*, and keep the soonest-to-be-touched pages in RAM. This is a basic fact that too few people really understand.)

In general, the replacement mechanism's job is to cache LOGICAL pages, not virtual pages. But the hardware gives reference information about virtual pages, not logical ones---or at least, not directly. There are two basic approaches to solving this problem. The obvious one is to perform a traditional clock sweep over the page frames (RAM pages) holding the (logical) pages in question, and use the reference bits from all of the pte's that map each page frame into any page table. The straightforward way of doing this would be to maintain an index, recording which virtual pages of which processes are mappings of each page frame (and the logical page it holds).

[ QUESTION 8: is this what the phrases "pte chaining" and/or "pte lists" refer to? I've seen those terms in old mails in the archive, but I'm not clear on exactly what was being discussed, what got implemented, why it's (apparently) not used, etc. ]

The Linux replacement policy does not do the obvious thing, for several reasons. Linux performs a clock sweep over the *virtual* pages, by cycling through each process's pages in address order. For this it uses the vm_area mappings and page tables of the processes, so that it can scan the pages of each process sequentially. Rather than sweeping through all of the pages of an entire process before switching to another, the main clock tries to evict a batch of pages from a process, and then moves on to another process. It visits all of the (pageable) processes and then repeats. The effect of this is that there is a large number of distinct clock sweeps, one per pageable process, and the overall clock sweep advances each of these smaller sweeps periodically.

[ This used to be a round-robin through the processes, which made sense to me. I don't understand the current thing with the two passes over the task list. I don't see how it's fair, or how it's unfair in just the right way. ]

Several motivations led to this design:

 1. Related pages should be paged out together, to increase locality in the paging store (so-called swap files or swap partitions).
    By evicting a moderate number of virtual pages from a given process, in virtual address order, the sweep through virtual address space tends to group related pages together in the paging store.

 2. By alternating between processes at a coarser granularity, the clock avoids evicting a large number of pages from a given victim process---after it's evicted a reasonable number of pages from a particular victim, it moves on to another, to provide some semblance of fairness between the processes.

 3. The use of a main clock over processes and virtual-address pages, and a secondary clock over page frames, provides a way of combining the hardware-supported virtual page reference bits to get recency-of-touch information about logical pages stored in page frames.

 4. The secondary clock (and the use of a separate per-page-frame PG_referenced bit maintained in software) can act as an additional "aging" period for pages that have been evicted from the main clock. A page can be held in the "swap cache" after being evicted from the main clock, and allowed to age a while before being evicted from RAM.

[ QUESTION 9: Are all of these things right? Is there something else that's important and should be listed? I got #4 from something someone said in email, but I may have completely misunderstood, and from the 2.2 pre1 sources I don't see how this effect is achieved. ]

The last two points require some explanation, and involve something called the "swap cache."

The Swap Cache
--------------

The swap cache is just a set of page frames holding logical pages that have been evicted from the main clock, but whose contents have not yet been discarded. The contents of page frames need not be copied to "move" them into the swap cache---rather, the page frame is simply marked as "swap cached" by the main clock algorithm, and linked into a hash table that holds all of the page frames that currently constitute the swap cache. (This is done by add_to_swap_cache() in mm/swap_state.c.)

There is some subtlety here. When the main clock algorithm (over virtual pages) finds a page whose (hardware-supported) reference bit (in the pte) IS set, indicating that the page has been touched recently, it sets the PG_referenced bit in the page struct for the page frame holding the logical page. On the other hand, if the virtual page's pte reference bit is NOT set, the page can be evicted from the main clock pool, but may be retained in the swap cache.

[ QUESTION 10: I'm fairly unclear on all of this. Can somebody write a high-level description of what's going on in try_to_swap_out() after a page has been established to be present and pageable? It seems to me that there ought to be a check not only of the pte's reference bit, but of the page frame's PG_referenced bit---if the page has been referenced recently via another mapping, you should NOT page it out.

It also seems to me that if there's an aging grace period after eviction from the main clock, then there needs to be an additional referenced bit per page frame. As I interpret the 2.2 pre1 sources, it looks like try_to_swap_out() immediately copies a set pte reference bit to the page struct, so the single bit in the page struct only tells you whether the main clock has noticed a set reference bit lately. This may or may not give you a grace period, depending on the vagaries of the two clocks, which don't seem to be coordinated.

It seems to me that pages evicted from the first clock ought to be entered into an LRU list for aging, and actual paging to disk would be from the other end of that list.
This would mostly decouple the actual paging from the front-end clock, make the LRU approximation more precise, and give a flexible framework for tweaks. Am I all wrong about this? ]
Questions and comments to Joe Knapka