My biggest problem when approaching the MM system was understanding
how the kernel manages physical memory. Once that's understood,
everything else is basically details of policy. Those details get
very complicated, though.
Kernel Mapping Whys and Wherefores
Apparently kernel virtual memory must map all
of physical memory. Why? Must be so that kernel code can access any
page. OK, so why is PAGE_OFFSET so large? Why not just make it 4K?
Hypothesis: all processes are going to share (at least in kernel mode)
the mapping starting at PAGE_OFFSET. That means that if user processes
are going to have address space starting at 0, PAGE_OFFSET has to be
big enough to allow some room for user-mode addresses. There's no
reason user processes couldn't have a totally separate address mapping
from the kernel, and both could start at low addresses; but that would
make it a pain for the kernel to access user-space memory, and would
mean that entry into the kernel would require icky page-table
manipulation. I think the scheme used makes things very easy: user
segments map up to PAGE_OFFSET, and kernel segments map PAGE_OFFSET
and onward. The kernel segments always refer to the same memory
(kernel mem), shared between all processes via shared page tables;
while the user segments refer to user mem for each process by using
non-shared page tables (well, some shared and some private pages,
perhaps). The kernel segments are privileged and can only be accessed
in kernel mode. This hypothesis predicts that the >=PAGE_OFFSET part
of swapper_pg_dir should be copied for each new process, so that
transitions to kernel mode can continue to use the user-space page
dir. [Hypothesis confirmed: get_pgd_slow(), used indirectly in
new_page_tables(), copies the kernel part of the pgd for a new task.]
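Roughly, that copy amounts to the following (paraphrased from memory,
not the verbatim 2.4 source; USER_PTRS_PER_PGD, PTRS_PER_PGD and
swapper_pg_dir are the identifiers as I remember them):

    static pgd_t *new_pgd_sketch(void)
    {
        /* A fresh page directory: empty user slots, plus the kernel
         * slots (those covering PAGE_OFFSET and up) copied from
         * swapper_pg_dir, so every process shares one kernel mapping. */
        pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);

        if (pgd) {
            memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
            memcpy(pgd + USER_PTRS_PER_PGD,
                   swapper_pg_dir + USER_PTRS_PER_PGD,
                   (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
        }
        return pgd;
    }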
Pagetables are accessed by the x86 hardware in physical memory
(naturally). The kernel uses the __va(phys_addr) and __pa(virt_addr)
macros to convert from physical to virtual and vice versa; these
merely add or subtract the PAGE_OFFSET. So to map a virtual address V
to a particular physical address P, the kernel finds the proper
page-directory entry containing V, allocates a page-table page (if
necessary), puts the physical address of the pagetable page in the
pgdir (if necessary), converts the page-table page's address to
virtual using
__va(), finds the right page-table entry for the page V to be mapped,
and writes P to that entry, along with some accounting info in the
lower 12 bits. (Only the top 20 bits of a pagetable entry are
significant in address resolution, since a page is 4K on Intel.)
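A hand-written sketch of that walk, for plain two-level (non-PAE) x86
paging; pa()/va() just spell out what __pa()/__va() boil down to, and
alloc_pagetable_page() is a made-up helper, not a kernel function:

    #define PAGE_OFFSET 0xC0000000UL  /* the usual x86 value */

    /* The essence of __pa()/__va(): the kernel's direct mapping is
     * simply physical address plus PAGE_OFFSET. */
    static inline unsigned long pa(void *vaddr)
    {
        return (unsigned long)vaddr - PAGE_OFFSET;
    }
    static inline void *va(unsigned long paddr)
    {
        return (void *)(paddr + PAGE_OFFSET);
    }

    /* Hypothetical helper: returns the PHYSICAL address of a zeroed page. */
    unsigned long alloc_pagetable_page(void);

    /* Map virtual address V to physical address P, as described above. */
    static void map_page(unsigned long *pgdir, unsigned long V,
                         unsigned long P)
    {
        unsigned long pgd_idx = (V >> 22) & 0x3ff; /* top 10 bits */
        unsigned long pte_idx = (V >> 12) & 0x3ff; /* next 10 bits */
        unsigned long *pte;

        if (!(pgdir[pgd_idx] & 1))  /* no page table here yet */
            pgdir[pgd_idx] = alloc_pagetable_page() | 0x3; /* present+rw */

        /* the directory entry holds a PHYSICAL address; go through
         * va() to touch the page table from C */
        pte = (unsigned long *)va(pgdir[pgd_idx] & ~0xfffUL);
        pte[pte_idx] = (P & ~0xfffUL) | 0x3; /* top 20 bits of P + flags */
    }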
The page struct is the kernel structure representing a (range of)
physical pages.
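An abridged, from-memory picture of the 2.4 struct page; the exact
field set and layout vary between 2.4.x releases, so treat this as a
reminder of the interesting fields rather than a faithful copy:

    struct page {
        struct list_head     list;     /* free-area / mapping linkage */
        struct address_space *mapping; /* non-NULL for page-cache pages */
        unsigned long        index;    /* offset within that mapping */
        atomic_t             count;    /* the reference count */
        unsigned long        flags;    /* PG_referenced, PG_dirty, ... */
        struct list_head     lru;      /* active/inactive list linkage */
        struct buffer_head   *buffers; /* buffer heads, if any */
        void                 *virtual; /* kernel vaddr; NULL if unmapped highmem */
        struct zone_struct   *zone;    /* which zone this frame belongs to */
    };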
A physical page is free when nothing references it any longer: its use
count is zero and it is sitting on the buddy allocator's free lists.
Kernel 2.2:
When a process is first created by fork(), it shares the memory
mapping of its parent, with writeable pages marked
"copy-on-write". When either process writes to a copy-on-write page,
that process gets its own copy of the page. Thus, many processes can
share the same page tables.
Kernel 2.4: Page tables are never shared (except kernel ones,
I hope! - Yep, get_pgd_slow() still copies the kernel mapping). fork()
calls dup_mmap() for the new process, which copies all the page
tables. The writable PTEs are write-protected in both processes to
implement copy-on-write.
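In spirit, the per-PTE copy looks something like this (hand-written,
not verbatim copy_page_range(); the real code also deals with swap
entries, shared mappings, and so on):

    static void copy_cow_pte(pte_t *src, pte_t *dst)
    {
        pte_t pte = *src;

        if (pte_present(pte) && pte_write(pte)) {
            pte = pte_wrprotect(pte); /* read-only until a write fault */
            set_pte(src, pte);        /* the parent loses write access too */
        }
        get_page(pte_page(pte));      /* both mappings now hold a reference */
        set_pte(dst, pte);
    }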
When a process does an exec, it gets its own MM context into which the
new executable is mapped.
But there are some jump instructions referring to those labels prior
to paging-enable. How do those jumps work? Short jumps are encoded in
x86 machine code as relative jumps (a signed 8-bit displacement from
the next instruction), so they contain no absolute address; and since
all the jumps in head.S prior to paging-enable are short jumps, the
code in head.S never refers directly to an absolute address until
after paging is
enabled! (It appears that everything prior to the call to the code in
head.S takes place in real mode and is part of boot-time magic; I'm
not concerned here with anything that takes place prior to entry into
head.S.)
The top 128MB of kernel VIRTUAL space is reserved, and will not be
included in the low zone. It's reserved for vmalloc() and, presumably,
fixmaps - among which (though not really fixed) we find the kmap
range. kmap() lets us map small areas in and out of the very top of
kernel space as necessary, but it looks like those mappings must never
persist for long - in fact, they are usually used and released
entirely between calls to schedule().
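Assuming the usual x86 PAGE_OFFSET of 0xC0000000 (a 3GB/1GB
user/kernel split; my assumption, not stated above), the arithmetic
for how much RAM can be permanently direct-mapped works out like this:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long page_offset   = 0xC0000000ULL;
        unsigned long long kernel_vspace = (1ULL << 32) - page_offset;
        unsigned long long reserved      = 128ULL << 20; /* vmalloc, fixmaps, kmap */
        unsigned long long direct_map    = kernel_vspace - reserved;

        printf("kernel virtual space: %llu MB\n", kernel_vspace >> 20);      /* 1024 */
        printf("permanently mappable (low) RAM: %llu MB\n", direct_map >> 20); /* 896 */
        return 0;
    }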
kmap() is only relevant for KERNEL pages. User space pagetables can
freely map highmem pages for as long as they like.
OK, *anyone* who does alloc_pages() with appropriate GFP_MASK might
get a high page in return. So why doesn't everyone who does
alloc_pages() also kmap() the page? Could we be counting on
page->virtual==0 for free highmem pages? Eek, it seems so! But that's
not necessarily going to be true: you could kmap() a high page, give
it a vaddr, kunmap() it (which doesn't alter the vaddr), free it, and
then __get_free_pages() could return that page and return the
no-longer-mapped vaddr. This seems bad. We seem to be just trusting
people never to call __get_free_pages() with __GFP_HIGHMEM, since
__get_free_pages() doesn't check.
OK, you'd have to be pretty blind to accidentally call
__get_free_pages() or alloc_pages() with __GFP_HIGHMEM or
GFP_HIGHUSER, and places that
use it seem to always kmap()/kunmap() the returned page. So now I
think this is all clear to me. The only thing I'm a bit hazy on is how
we prevent vmalloc() from overlapping the fixmap addresses, but that's
for tomorrow.
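So the safe pattern seems to be: ask for a struct page, and kmap() it
only while the kernel actually needs to touch it; keep
__get_free_pages() for lowmem, since it hands back a kernel-virtual
address directly. A hand-written illustration (real 2.4 entry points,
but not copied from any particular caller):

    static void highmem_vs_lowmem(void)
    {
        /* Highmem-capable: get a struct page, map it transiently. */
        struct page *page = alloc_page(GFP_HIGHUSER);
        if (page) {
            char *kva = kmap(page);    /* temporary kernel mapping */
            memset(kva, 0, PAGE_SIZE); /* ... use it ... */
            kunmap(page);              /* drop the mapping promptly */
            __free_page(page);
        }

        /* Lowmem-only: a kernel-virtual address comes straight back,
         * which is exactly why __GFP_HIGHMEM must not be passed here. */
        unsigned long addr = __get_free_pages(GFP_KERNEL, 0);
        if (addr)
            free_pages(addr, 0);
    }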
In kernel version 2.4, some major changes to __get_free_pages() et al
have been made. It seems the old buddy allocator had problems with
memory fragmentation, and had no simple way to distinguish between
various kinds of memory (DMA, cacheable, slow). For this reason, the
notion of "zones" is introduced in 2.4. The zone allocator carves up the physical
address space into a number of zones, and allocates certain types of
memory objects preferentially from appropriate zones. Thus, user
memory (that is, memory to be mapped into a process address space) is
allocated preferentially from the "cacheable" zone, and only from the
DMA or slow zones if no cached pages are available; requests for DMA
are filled exclusively from the "DMA" zone; and certain other requests
(eg pagetables) are filled preferentially from the "slow" zone.
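A toy model of that preference-plus-fallback idea; the zone names and
helpers are invented for the example (and the fallback order is just
one plausible choice), so this is not __alloc_pages():

    #include <stdio.h>

    struct zone {
        const char *name;
        unsigned long free_pages;
    };

    /* Take one page from a zone, or fail. */
    static int take_page(struct zone *z)
    {
        if (z->free_pages == 0)
            return 0;
        z->free_pages--;
        return 1;
    }

    /* Walk the zones in preference order; fall back only when the
     * preferred zone is exhausted. */
    static struct zone *alloc_from(struct zone **zonelist)
    {
        for (; *zonelist; zonelist++)
            if (take_page(*zonelist))
                return *zonelist;
        return NULL; /* every permitted zone is empty */
    }

    int main(void)
    {
        struct zone cacheable = { "cacheable", 0 }; /* preferred, but empty */
        struct zone slow      = { "slow",      2 };
        struct zone dma       = { "dma",       1 };
        struct zone *user_order[] = { &cacheable, &slow, &dma, NULL };

        struct zone *got = alloc_from(user_order);
        printf("user page came from the %s zone\n",
               got ? got->name : "(no)");
        return 0;
    }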
The zone allocator still uses the buddy system internally. The free area lists
and buddy bitmaps are maintained on a per-zone basis rather than globally.
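The buddy pairing itself is just bit arithmetic: a free block of
2^order pages has its buddy at the index with bit number 'order'
flipped, and two free buddies merge into a block of order+1. A tiny
standalone illustration (not kernel code):

    #include <stdio.h>

    static unsigned long buddy_index(unsigned long idx, unsigned int order)
    {
        return idx ^ (1UL << order);
    }

    int main(void)
    {
        /* the order-2 (4-page) block at index 8 pairs with index 12 */
        printf("%lu\n", buddy_index(8, 2));  /* 12 */
        printf("%lu\n", buddy_index(12, 2)); /* 8  */
        return 0;
    }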
In 2.2 and earlier kernels, the physical allocator did have a means of
distinguishing DMA memory from other memory, but it was convoluted and ugly, and
involved maintaining separate global freelists for DMA and normal RAM. The zone
allocator is a major refinement.
Q: Why have an "inactive_clean" list? How are "inactive clean"
pages different from "free" pages? If those pages are treated
differently from free pages, won't that contribute to memory
fragmentation within zones?
Possible answer: inactive_clean pages contain data that has
recently been used somewhere, and we may need it again. That's what
LRU is all about, after all: evicting and reusing the page that has
gone unreferenced for the longest time.
There are tests about page_count when deactivating pages. It
has to be <=1 unless there are buffers associated with the page,
in which case <=2. I need to understand exactly how page reference
counting works. So. Where is page_count incremented?
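The shape of that test, written out with toy types (the real checks
live in the 2.4 vmscan/page_launder code and use struct page,
page_count() and page->buffers):

    struct page_like {
        int  count;    /* reference count */
        void *buffers; /* non-NULL if buffer heads are attached */
    };

    /* Deactivation only makes sense when the page lists (plus the
     * buffers, if any) hold the only references to the page. */
    static int deactivation_candidate(const struct page_like *page)
    {
        int allowed = page->buffers ? 2 : 1;
        return page->count <= allowed;
    }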
OK, pages on any of the lists are checked for being referenced
at various points, so it must be that they can be mapped. [No, that's
just paranoia. They really, truly won't be mapped.]
reclaim_page() succeeds if the chosen page is either (a) in the
swap cache, or (b) has a page->mapping (eg it is part of a file
mapping). (a) and (b) can't both be true, if the code is any guide,
and if neither is true it seems that we have a bug.
OK, it seems that active pages are certainly mapped, and inactive
ones are not. Well, maybe inactive_clean ones are?
try_to_swap_out() attempts to get rid of a PTE, and then do the
Right Thing with the underlying page. If the page is clean, it gets
freed by page_cache_release()! Yay! I knew it had to happen
somewhere...
The comments in try_to_swap_out() are old and crusty. It really
returns 1 to mean "stop trying to page out the current process", and 0
to mean "keep trying."
You only have to be "not recently used" to be
deactivated and moved to the inactive_dirty list (though
there are some not-very-strict checks about reference count
there). However, to be "inactive_clean" you basically have
to be 100% freeable. page_launder() has strict criteria for allowing a
page to be inactive_clean - it tries to write the buffer out, and
ensures that the page has only one user. (This seems bad though, since
that user might eg fork() and have the page be mapped again... have to
look that code over more.)
The "cache" counts as a page user. I assume the page cache
is just the collection of lists (active, inactive_dirty,
inactive_clean). But where is this reference acquired? Not
in "add_page_*".
"A day in the life of a user page" would be a nice section.
It looks to me like the zone allocator is going out of its way to
not act in a "zoney" fashion. I thought the point was to
ensure that eg there's DMA around when we need it, not have user pages
spread all willy-nilly around physical memory. But it seems that
__alloc_pages() is more interested in balancing allocations from the
different zones than in ensuring that less-preferred zones are only
chosen as a last resort. It seems to me that the "Right Thing" would
be to try like hell to fulfill an allocation from the most preferred
zone, and only if that fails to try the other zones on the
list.
Answer: Paranoia on Rik's part accounts for the weirdness in
reclaim_page(). Inactive pages are not mapped by process VM. This is
certain to be true because a page is only moved onto the inactive
lists after its PTEs have been removed (that is what
try_to_swap_out() and the deactivation path do), so by the time
reclaim_page() sees the page, no process mapping can still exist.
It does not seem possible to understand the VM system in a modular
way. The interfaces between, say, the zone allocator and the swap
policy are many and varied, and you can't just look at one part and
say, "OK, I understand that, now let's look at the layer above, or
below." What a nightmare.
What's up with PG_referenced? It's set only in getblk(), tested only
in the vmscan code. So it propagates page usage information from the
buffer cache into the page cache, I guess.
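The pattern, sketched with stand-in types (not kernel code): the
buffer-cache side marks the page referenced when it is used, and the
scanner later test-and-clears the bit to see whether the page was
touched since the last pass:

    #include <stdbool.h>

    struct page_like {
        bool referenced;
    };

    /* e.g. what getblk() does, in effect */
    static void mark_used(struct page_like *page)
    {
        page->referenced = true;
    }

    /* e.g. what the vmscan side does when aging a page */
    static bool test_and_clear_referenced(struct page_like *page)
    {
        bool was = page->referenced;
        page->referenced = false;
        return was;
    }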
When Is A Page Free?
Free physical pages have exactly one virtual mapping: in the kernel
page tables, at PAGE_OFFSET+physical_page_address.
VM General Shape Hypothesis
This looks basically LRU-ish. Instead of a plain LRU list, the
unreferenced ("older") end of the list is split off and maintained
separately, and furthermore it's split again into dirty and clean
lists (presumably so page_launder() doesn't have to wade through
clean pages looking for ones to write to disk? Yes, that, and also so
we can keep a supply of more-or-less untainted inactive-clean pages
with which to satisfy new allocation requests.)