My biggest problem when approaching the MM system was understanding
how the kernel manages physical memory. Once that's understood,
everything else is basically details of policy. Those details get
very complicated, though.
Kernel Mapping Whys and Wherefores
Apparently kernel virtual memory must map all
of physical memory. Why? Must be so that kernel code can access any
page. OK, so why is PAGE_OFFSET so large? Why not just make it 4K?
Hypothesis: all processes are going to share (at least in kernel mode)
the mapping starting at PAGE_OFFSET. That means that if user processes
are going to have address space starting at 0, PAGE_OFFSET has to be
big enough to allow some room for user-mode addresses. There's no
reason user processes couldn't have a totally separate address mapping
from the kernel, and both could start at low addresses; but that would
make it a pain for the kernel to access user-space memory, and would
mean that entry into the kernel would require icky page-table
manipulation. I think the scheme used makes things very easy: user
segments map up to PAGE_OFFSET, and kernel segments map PAGE_OFFSET
and onward. The kernel segments always refer to the same memory
(kernel mem), shared between all processes via shared page tables;
while the user segments refer to user mem for each process by using
non-shared page tables (well, some shared and some private pages,
perhaps). The kernel segments are privileged and can only be accessed
in kernel mode. This hypothesis predicts that the >=PAGE_OFFSET part
of swapper_pg_dir should be copied for each new process, so that
transitions to kernel mode can continue to use the user-space page
dir. [Hypothesis confirmed: get_pgd_slow(), used indirectly in
new_page_tables(), copies the kernel part of the pgd for a new task.]
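Roughly, that copy amounts to the following (paraphrased from memory,
not the verbatim 2.4 source; USER_PTRS_PER_PGD, PTRS_PER_PGD and
swapper_pg_dir are the identifiers as I remember them):

    static pgd_t *new_pgd_sketch(void)
    {
        /* A fresh page directory: empty user slots, plus the kernel
         * slots (those covering PAGE_OFFSET and up) copied from
         * swapper_pg_dir, so every process shares one kernel mapping. */
        pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);

        if (pgd) {
            memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
            memcpy(pgd + USER_PTRS_PER_PGD,
                   swapper_pg_dir + USER_PTRS_PER_PGD,
                   (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
        }
        return pgd;
    }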
Pagetables are accessed by the x86 hardware in physical memory
(naturally). The kernel uses the __va(phys_addr) and __pa(virt_addr)
macros to convert from physical to virtual and vice versa; these
merely add or subtract the PAGE_OFFSET. So to map a virtual address V
to a particular physical address P, the kernel finds the proper
page-directory entry containing V, allocates a page-table page (if
necessary), puts the physical address of the pagetable page in the
pgdir (if necessary), converts the page-table page's address to
virtual using
__va(), finds the right page-table entry for the page V to be mapped,
and writes P to that entry, along with some accounting info in the
lower 12 bits. (Only the top 20 bits of a pagetable entry are
significant in address resolution, since a page is 4K on Intel.)
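A hand-written sketch of that walk, for plain two-level (non-PAE) x86
paging; pa()/va() just spell out what __pa()/__va() boil down to, and
alloc_pagetable_page() is a made-up helper, not a kernel function:

    #define PAGE_OFFSET 0xC0000000UL  /* the usual x86 value */

    /* The essence of __pa()/__va(): the kernel's direct mapping is
     * simply physical address plus PAGE_OFFSET. */
    static inline unsigned long pa(void *vaddr)
    {
        return (unsigned long)vaddr - PAGE_OFFSET;
    }
    static inline void *va(unsigned long paddr)
    {
        return (void *)(paddr + PAGE_OFFSET);
    }

    /* Hypothetical helper: returns the PHYSICAL address of a zeroed page. */
    unsigned long alloc_pagetable_page(void);

    /* Map virtual address V to physical address P, as described above. */
    static void map_page(unsigned long *pgdir, unsigned long V,
                         unsigned long P)
    {
        unsigned long pgd_idx = (V >> 22) & 0x3ff; /* top 10 bits */
        unsigned long pte_idx = (V >> 12) & 0x3ff; /* next 10 bits */
        unsigned long *pte;

        if (!(pgdir[pgd_idx] & 1))  /* no page table here yet */
            pgdir[pgd_idx] = alloc_pagetable_page() | 0x3; /* present+rw */

        /* the directory entry holds a PHYSICAL address; go through
         * va() to touch the page table from C */
        pte = (unsigned long *)va(pgdir[pgd_idx] & ~0xfffUL);
        pte[pte_idx] = (P & ~0xfffUL) | 0x3; /* top 20 bits of P + flags */
    }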
The page struct is the kernel structure representing a (range of)
physical pages.
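An abridged, from-memory picture of the 2.4 struct page; the exact
field set and layout vary between 2.4.x releases, so treat this as a
reminder of the interesting fields rather than a faithful copy:

    struct page {
        struct list_head     list;     /* free-area / mapping linkage */
        struct address_space *mapping; /* non-NULL for page-cache pages */
        unsigned long        index;    /* offset within that mapping */
        atomic_t             count;    /* the reference count */
        unsigned long        flags;    /* PG_referenced, PG_dirty, ... */
        struct list_head     lru;      /* active/inactive list linkage */
        struct buffer_head   *buffers; /* buffer heads, if any */
        void                 *virtual; /* kernel vaddr; NULL if unmapped highmem */
        struct zone_struct   *zone;    /* which zone this frame belongs to */
    };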
A physical page is free when nothing references it any longer: its use
count is zero and it is sitting on the buddy allocator's free lists.
Kernel 2.2:
When a process is first created by fork(), it shares the memory
mapping of its parent, with writeable pages marked
"copy-on-write". When either process writes to a copy-on-write page,
that process gets its own copy of the page. Thus, many processes can
share the same page tables.
Kernel 2.4: Page tables are never shared (except kernel ones,
I hope! - Yep, get_pgd_slow() still copies the kernel mapping). fork()
calls dup_mmap() for the new process, which copies all the page
tables. The writable PTEs are write-protected in both processes to
implement copy-on-write.
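In spirit, the per-PTE copy looks something like this (hand-written,
not verbatim copy_page_range(); the real code also deals with swap
entries, shared mappings, and so on):

    static void copy_cow_pte(pte_t *src, pte_t *dst)
    {
        pte_t pte = *src;

        if (pte_present(pte) && pte_write(pte)) {
            pte = pte_wrprotect(pte); /* read-only until a write fault */
            set_pte(src, pte);        /* the parent loses write access too */
        }
        get_page(pte_page(pte));      /* both mappings now hold a reference */
        set_pte(dst, pte);
    }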
When a process does an exec, it gets its own MM context into which the
new executable is mapped.
But there are some jump instructions referring to those labels prior
to paging-enable. How do those jumps work? Short jumps are encoded in
x86 machine code as relative jumps (a signed 8-bit displacement from
the next instruction), so they contain no absolute address; and since
all the jumps in head.S prior to paging-enable are short jumps, the
code in head.S never refers directly to an absolute address until
after paging is
enabled! (It appears that everything prior to the call to the code in
head.S takes place in real mode and is part of boot-time magic; I'm
not concerned here with anything that takes place prior to entry into
head.S.)
The top 128MB of kernel VIRTUAL space is reserved, and will not be
included in the low zone. It's reserved for vmalloc() and, presumably,
fixmaps - among which (though not really fixed) we find the kmap
range. kmap() lets us map small areas in and out of the very top of
kernel space as necessary, but it looks like those mappings must never
persist for long - in fact, they are usually used and released
entirely between calls to schedule().
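Assuming the usual x86 PAGE_OFFSET of 0xC0000000 (a 3GB/1GB
user/kernel split; my assumption, not stated above), the arithmetic
for how much RAM can be permanently direct-mapped works out like this:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long page_offset   = 0xC0000000ULL;
        unsigned long long kernel_vspace = (1ULL << 32) - page_offset;
        unsigned long long reserved      = 128ULL << 20; /* vmalloc, fixmaps, kmap */
        unsigned long long direct_map    = kernel_vspace - reserved;

        printf("kernel virtual space: %llu MB\n", kernel_vspace >> 20);      /* 1024 */
        printf("permanently mappable (low) RAM: %llu MB\n", direct_map >> 20); /* 896 */
        return 0;
    }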
kmap() is only relevant for KERNEL pages. User space pagetables can
freely map highmem pages for as long as they like.
OK, *anyone* who does alloc_pages() with appropriate GFP_MASK might
get a high page in return. So why doesn't everyone who does
alloc_pages() also kmap() the page? Could we be counting on
page->virtual==0 for free highmem pages? Eek, it seems so! But that's
not necessarily going to be true: you could kmap() a high page, give
it a vaddr, kunmap() it (which doesn't alter the vaddr), free it, and
then __get_free_pages() could return that page and return the
no-longer-mapped vaddr. This seems bad. We seem to be just trusting
people never to call __get_free_pages() with __GFP_HIGHMEM, since
__get_free_pages() doesn't check.
OK, you'd have to be pretty blind to accidentally call
__get_free_pages() or alloc_pages() with __GFP_HIGHMEM or
GFP_HIGHUSER, and places that
use it seem to always kmap()/kunmap() the returned page. So now I
think this is all clear to me. The only thing I'm a bit hazy on is how
we prevent vmalloc() from overlapping the fixmap addresses, but that's
for tomorrow.
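So the safe pattern seems to be: ask for a struct page, and kmap() it
only while the kernel actually needs to touch it; keep
__get_free_pages() for lowmem, since it hands back a kernel-virtual
address directly. A hand-written illustration (real 2.4 entry points,
but not copied from any particular caller):

    static void highmem_vs_lowmem(void)
    {
        /* Highmem-capable: get a struct page, map it transiently. */
        struct page *page = alloc_page(GFP_HIGHUSER);
        if (page) {
            char *kva = kmap(page);    /* temporary kernel mapping */
            memset(kva, 0, PAGE_SIZE); /* ... use it ... */
            kunmap(page);              /* drop the mapping promptly */
            __free_page(page);
        }

        /* Lowmem-only: a kernel-virtual address comes straight back,
         * which is exactly why __GFP_HIGHMEM must not be passed here. */
        unsigned long addr = __get_free_pages(GFP_KERNEL, 0);
        if (addr)
            free_pages(addr, 0);
    }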
In kernel version 2.4, some major changes to __get_free_pages() et al
have been made. It seems the old buddy allocator had problems with
memory fragmentation, and had no simple way to distinguish between
various kinds of memory (DMA, cacheable, slow). For this reason, the
notion of "zones" is introduced in 2.4. The zone allocator carves up the physical
address space into a number of zones, and allocates certain types of
memory objects preferentially from appropriate zones. Thus, user
memory (that is, memory to be mapped into a process address space) is
allocated preferentially from the "cacheable" zone, and only from the
DMA or slow zones if no cached pages are available; requests for DMA
are filled exclusively from the "DMA" zone; and certain other requests
(eg pagetables) are filled preferentially from the "slow" zone.
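A toy model of that preference-plus-fallback idea; the zone names and
helpers are invented for the example (and the fallback order is just
one plausible choice), so this is not __alloc_pages():

    #include <stdio.h>

    struct zone {
        const char *name;
        unsigned long free_pages;
    };

    /* Take one page from a zone, or fail. */
    static int take_page(struct zone *z)
    {
        if (z->free_pages == 0)
            return 0;
        z->free_pages--;
        return 1;
    }

    /* Walk the zones in preference order; fall back only when the
     * preferred zone is exhausted. */
    static struct zone *alloc_from(struct zone **zonelist)
    {
        for (; *zonelist; zonelist++)
            if (take_page(*zonelist))
                return *zonelist;
        return NULL; /* every permitted zone is empty */
    }

    int main(void)
    {
        struct zone cacheable = { "cacheable", 0 }; /* preferred, but empty */
        struct zone slow      = { "slow",      2 };
        struct zone dma       = { "dma",       1 };
        struct zone *user_order[] = { &cacheable, &slow, &dma, NULL };

        struct zone *got = alloc_from(user_order);
        printf("user page came from the %s zone\n",
               got ? got->name : "(no)");
        return 0;
    }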
The zone allocator still uses the buddy system internally. The free area lists
and buddy bitmaps are maintained on a per-zone basis rather than globally.
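The buddy pairing itself is just bit arithmetic: a free block of
2^order pages has its buddy at the index with bit number 'order'
flipped, and two free buddies merge into a block of order+1. A tiny
standalone illustration (not kernel code):

    #include <stdio.h>

    static unsigned long buddy_index(unsigned long idx, unsigned int order)
    {
        return idx ^ (1UL << order);
    }

    int main(void)
    {
        /* the order-2 (4-page) block at index 8 pairs with index 12 */
        printf("%lu\n", buddy_index(8, 2));  /* 12 */
        printf("%lu\n", buddy_index(12, 2)); /* 8  */
        return 0;
    }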
In 2.2 and earlier kernels, the physical allocator did have a means of
distinguishing DMA memory from other memory, but it was convoluted and ugly, and
involved maintaining separate global freelists for DMA and normal RAM. The zone
allocator is a major refinement.
Q: Why have an "inactive_clean" list? How are "inactive clean"
pages different from "free" pages? If those pages are treated
differently from free pages, won't that contribute to memory
fragmentation within zones?
Possible answer: inactive_clean pages contain data that has
recently been used somewhere, and we may need it again. That's what
LRU is all about, after all: evicting and reusing the page that has
gone unreferenced for the longest time.
There are tests about page_count when deactivating pages. It
has to be <=1 unless there are buffers associated with the page,
in which case <=2. I need to understand exactly how page reference
counting works. So. Where is page_count incremented?
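The shape of that test, written out with toy types (the real checks
live in the 2.4 vmscan/page_launder code and use struct page,
page_count() and page->buffers):

    struct page_like {
        int  count;    /* reference count */
        void *buffers; /* non-NULL if buffer heads are attached */
    };

    /* Deactivation only makes sense when the page lists (plus the
     * buffers, if any) hold the only references to the page. */
    static int deactivation_candidate(const struct page_like *page)
    {
        int allowed = page->buffers ? 2 : 1;
        return page->count <= allowed;
    }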
OK, pages on any of the lists are checked for being referenced
at various points, so it must be that they can be mapped. [No, that's
just paranoia. They really, truly won't be mapped.]
reclaim_page() succeeds if the chosen page is either (a) in the
swap cache, or (b) has a page->mapping (eg it is part of a file
mapping). (a) and (b) can't both be true, if the code is any guide,
and if neither is true it seems that we have a bug.
OK, it seems that active pages are certainly mapped, and inactive
ones are not. Well, maybe inactive_clean ones are?
try_to_swap_out() attempts to get rid of a PTE, and then do the
Right Thing with the underlying page. If the page is clean, it gets
freed by page_cache_release()! Yay! I knew it had to happen
somewhere...
The comments in try_to_swap_out() are old and crusty. It really
returns 1 to mean "stop trying to page out the current process", and 0
to mean "keep trying."
You only have to be "not recently used" to be
deactivated and moved to the inactive_dirty list (though
there are some not-very-strict checks about reference count
there). However, to be "inactive_clean" you basically have
to be 100% freeable. page_launder() has strict criteria for allowing a
page to be inactive_clean - it tries to write the buffer out, and
ensures that the page has only one user. (This seems bad though, since
that user might eg fork() and have the page be mapped again... have to
look that code over more.)
The "cache" counts as a page user. I assume the page cache
is just the collection of lists (active, inactive_dirty,
inactive_clean). But where is this reference acquired? Not
in "add_page_*".
"A day in the life of a user page" would be a nice section.
It looks to me like the zone allocator is going out of its way to
not act in a "zoney" fashion. I thought the point was to
ensure that eg there's DMA around when we need it, not have user pages
spread all willy-nilly around physical memory. But it seems that
__alloc_pages() is more interested in balancing allocations from the
different zones than in ensuring that less-preferred zones are only
chosen as a last resort. It seems to me that the "Right Thing" would
be to try like hell to fulfill an allocation from the most preferred
zone, and only if that fails to try the other zones on the
list.
Answer: Paranoia on Rik's part accounts for the weirdness in
reclaim_page(). Inactive pages are not mapped by process VM. This is
certain to be true because a page is only moved onto the inactive
lists after its PTEs have been removed (that is what
try_to_swap_out() and the deactivation path do), so by the time
reclaim_page() sees the page, no process mapping can still exist.
It does not seem possible to understand the VM system in a modular
way. The interfaces between, say, the zone allocator and the swap
policy are many and varied, and you can't just look at one part and
say, "OK, I understand that, now let's look at the layer above, or
below." What a nightmare.
What's up with PG_referenced? It's set only in getblk(), tested only
in the vmscan code. So it propagates page usage information from the
buffer cache into the page cache, I guess.
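The pattern, sketched with stand-in types (not kernel code): the
buffer-cache side marks the page referenced when it is used, and the
scanner later test-and-clears the bit to see whether the page was
touched since the last pass:

    #include <stdbool.h>

    struct page_like {
        bool referenced;
    };

    /* e.g. what getblk() does, in effect */
    static void mark_used(struct page_like *page)
    {
        page->referenced = true;
    }

    /* e.g. what the vmscan side does when aging a page */
    static bool test_and_clear_referenced(struct page_like *page)
    {
        bool was = page->referenced;
        page->referenced = false;
        return was;
    }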
When Is A Page Free?
Free physical pages have exactly one virtual mapping: in the kernel
page tables, at PAGE_OFFSET+physical_page_address.
VM General Shape Hypothesis
This looks basically LRU-ish. Instead of a plain LRU list, the
unreferenced ("older") end of the list is split off and maintained
separately, and furthermore it's split again into dirty and clean
lists (presumably so page_launder() doesn't have to wade through
clean pages looking for ones to write to disk? Yes, that, and also so
we can keep a supply of more-or-less untainted inactive-clean pages
with which to satisfy new allocation requests.)