Swapping and the Page Cache
Pretty much every page of user process RAM is kept in either the
page cache or the swap cache (the swap cache is just
the part of the page cache associated with swap devices). Most user
pages are added to the cache when they are initially mapped into
process VM, and remain there until they are reclaimed for use by
either another process or by the kernel itself. The purpose of the
cache is simply to keep as much useful data in memory as possible, so
that page faults may be serviced quickly. Pages in different parts of
their life cycle must be managed in different ways by the VM system,
and likewise pages that are mapped into process VM in different ways.
The cache is a layer between the kernel memory management code and
the disk I/O code. When the kernel swaps pages out of a task, they do
not get written immediately to disk, but rather are added to the
cache. The kernel then writes the cache pages out to disk as necessary
in order to create free memory.
The kernel maintains a number of page lists which collectively
comprise the page cache. The active_list, the
inactive_dirty_list, and the inactive_clean_list are
used to maintain a more-or-less least-recently-used sorting of user
pages (the page replacement policy actually implemented is something
like "not recently used", since Linux is fairly unconcerned about
keeping a strict LRU ordering of pages). Furthermore, each page of an
executable image or mmap()ed file is associated with a per-inode
cache, allowing the disk file to be used as backing storage for the
page. Finally, anonymous pages (those without a disk file to serve as
backing storage - pages of malloc()'d memory, for example) are
assigned an entry in the system swapfile, and those pages are
maintained in the swap cache.
Note that anonymous pages don't get added to the swap cache - and
don't have swap space reserved - until the first time they are evicted
from a process's memory map, whereas pages mapped from files begin
life in the page cache. Thus, the character of the swap cache is
different than that of the page cache, and it makes sense to make the
distinction. However, the cache code is mostly generic, and I won't be
too concerned about the differences between mapped pages and swap
pages here.
The general characteristics of pages on the LRU page lists are as
follows:
- active_list: pages on the active list have page->age > 0, may
be clean or dirty, and may be (but are not necessarily) mapped by
process page-table entries.
- inactive_dirty_list: pages on this list have page->age == 0,
may be clean or dirty, and are not mapped by any process
PTE.
- inactive_clean_list: each zone has its own inactive_clean_list,
which contains clean pages with age == 0, not mapped by any
process PTE.
During page-fault handling, the kernel looks for the faulting page
in the page cache. If it's found, it can be moved to the active_list
(if it's not already there) and used immediately to service the fault.
Life Cycle of a User Page
I'll present the common case of a page (call it P) that's part of an
mmap()ed data file. (Executable text pages have a similar life cycle,
except they never get dirtied and thus are never written out to disk.)
- The page is read from disk into memory and added to the page cache. This
can happen in a number of different ways:
- Process A attempts to access page P; it is then read in by
the page-fault handler for the process's VM area corresponding
to the mapped file and added to the page cache, and to the
process page tables. The page starts its life on the inode cache
for the file's inode, and on the active_list of the LRU,
where it remains while it is actively being used.
Or:
- Page P is read in during a swap readahead operation, and
added to the page cache. In this case, the reason the page is
read in is simply that it is part of the cluster of blocks on
disk that happens to be easy to read; we don't necessarily know
the page will be needed, but it's cheap to read a bunch of pages
that are sequential on the disk - and the cost of
throwing away those pages if it turns out we don't need them is
trivial, since they can be immediately reclaimed if they're
never referenced. Such pages start life in the swap cache, and
on the active_list. (Actually this will never be the case for an
mmapped page - such pages are never written to swap.)
Or...
- Page P is read in during a mmap cluster readahead
operation, in which case a sequence of adjacent pages following
the faulting page in an mmapped file is read. Such pages start
their life in the page cache associated with the mmapped file,
and on the active list.
- P is written by the process, and thus dirtied. At this point P
is still on the active_list.
- P is not used for a while. Periodic invocations of the kernel
swap daemon kswapd() will gradually reduce the page->age
count. kswapd() wakes up more frequently as memory pressure
increases. P's age will gradually decay to 0 if it is not
referenced, due to periodic invocations of refill_inactive() by
kswapd.
- If memory is tight, swap_out() will eventually be called by
kswapd() to try to evict pages from Process A's virtual address
space. Since page P hasn't been referenced and has age 0, the PTE
will be dropped, and the only remaining reference to P is the one
resulting from its presence in the page cache (assuming, of course,
that no other process has mapped the file in the meantime).
swap_out() does not actually swap the page out; rather, it simply
removes the process's reference to the page, and depends upon the
page cache and swap machinery to ensure the page gets written to
disk if necessary. (If a PTE has been referenced when swap_out()
examines it, the mapped page is aged up - made younger -
rather than being unmapped.)
- Time passes... a little or a lot, depending on memory demand.
- refill_inactive_scan() comes along, trying to find pages that
can be moved to the inactive_dirty list. Since P is not mapped by
any process and has age 0, it is moved from the active_list to the
inactive_dirty list.
- Process A attempts to access P, but it's not present in the
process VM since the PTE has been cleared by swap_out(). The fault
handler calls __find_page_nolock() to try to locate P in the page
cache, and lo and behold, it's there, so the PTE can be immediately
restored, and P is moved to the active_list, where it remains as
long as it is actively used by the process.
- More time passes... swap_out() clears Process A's PTE for page
P, refill_inactive_scan() deactivates P, moving it to the
inactive_dirty list.
- More time passes... memory gets low.
- page_launder() is invoked to clean some dirty pages. It finds P
on the inactive_dirty_list, notices that it's actually dirty, and
attempts to write it out to the disk. When the page has been
written, it can then be moved to the inactive_clean_list. The
following sequence of events occurs when page_launder() actually
decides to write out a page:
- Lock the page.
- We determine the page needs to be written, so we call the
writepage method of the page's mapping. That call invokes
some filesystem-specific code to perform an asynchronous write to
disk with the page locked. At that point, page_launder() is
finished with the page: it remains on the inactive_dirty_list, and
will be unlocked once the async write completes.
- Next time page_launder() is called it will find the page
clean and move it to the inactive_clean_list, assuming no process
has found it in the pagecache and started using it in the
meantime.
- page_launder() runs again, finds the page unused and clean, and
moves it to the inactive_clean_list of the page's zone.
- An attempt is made by someone to allocate a single free page
from P's zone. Since the request is for a single page, it can be
satisfied by reclaiming an inactive_clean page; P is chosen for
reclamation. reclaim_page() removes P from the page cache (thereby
ensuring that no other process will be able to gain a reference to
it during page fault handling), and it is given to the caller as a
free page.
Or:
kreclaimd comes along trying to create free memory. It reclaims
P and then frees it.
Note that this is only one possible sequence of events: a page can live in the
page cache for a long time, aging, being deactivated, being recovered by
processes during page fault handling and thereby reactivated, aging, being
deactivated, being laundered, being recovered and reactivated...
Pages can be recovered from the inactive_clean and active lists as
well as from the inactive_dirty list. Read-only pages, naturally, are
never dirty, so page_launder() can move them from the
inactive_dirty_list to the inactive_clean_list "for free," so to
speak.
Pages on the inactive_clean list are periodically examined by the
kreclaimd kernel thread and freed. The purpose of this is to try to
produce larger contiguous free memory blocks, which are needed in some
situations.
Finally, note that P is in essence a logical page, though of course
it is instantiated by some particular physical page.
Usage Notes
When a page is in the page cache, the cache holds a reference to the
page. That is, any code that adds a page to the page cache must
increment page->count, and any code that removes a page from the page
cache must decrement page->count. Failure to honor these rules will
cause Bad Thingstm to happen, since the page reclamation
code expects cached pages to have a reference count of exactly 1 (or 2
if the page is also in the buffer cache) in order to be
reclaimed.
The existing public interface to the page cache
(add_to_page_cache() etc) already handles page reference counting
properly. You shouldn't attempt to add pages to the cache in any other
way.
Page Cache Data
The basic data structures involved in the page cache are
- The LRU list of page structs:
- The active_list, containing active pages (mapped by process page
tables).
- The inactive_dirty_list, containing pages that are not mapped
by any process page tables, but which may be dirty.
- The per-zone inactive_clean_list, containing freeable pages
that are clean and not mapped by any process page tables. There is
one inactive_clean_list per zone, so that allocations from each
zone can be fulfilled by reclaiming appropriate inactive_clean pages.
- The struct address_space, which represents the virtual memory
image of a file. Each mmap()ed file, including executables and files
which are explicitly mmap()ed by a program, is represented by an
address_space struct which holds on to the inode for the file, a
list of all the VM mappings of the file, and a collection of
pointers to functions needed to handle various VM tasks on the file,
such as reading in and writing out pages. No matter how many
processes have mapped a particular file, there is only one
address_space struct for the file. Swap files are also represented
by address_space structs, which allows the page cache code to be
generic for anonymous, swap-backed pages and mmapped file pages.
- The ubiquitous page struct - these are the items that make up
the LRU lists and inode caches.
- The page hash queues - lists of pages with the same hash
values. The hash queues are used when looking for mapped pages in
the page cache based on the file and the offset of the page within
the file.
Page Cache Code
add_to_page_cache()
void add_to_page_cache( page struct * page, struct address_space *
mapping, unsigned long offset ) is the public interface by which pages
are added to the page cache. It adds the given page to the inode and
hash queues for the specified address_space, and to the LRU lists.
The actual cache-add operation is performed by the helper function
__add_to_page_cache(), in filemap.c on line 500
- Line 506 if the page is locked, it's a bug. (The page may, for
example, be in the process of being written out to disk.)
- Lines 509 and 510: clear the referenced, error, dirty,
uptodate, and arch (architecture-specific?) bits, and set the
locked bit in the page frame.
- Line 511 take a reference to the page. This is the cache's
reference, and is critical for the proper operation of other
page-cache code (notably reclaim_page()).
- The remainder of the code in the function adds the page to the
address_space mapping, to the page hash queue, and to the LRU list.
add_to_page_cache_unique()
This is the same as add_to_page_cache(), except that it first checks
to be sure the page being added isn't already in the cache.
__find_page_nolock()
page struct * __find_page_nolock(struct address_space *mapping,
unsigned long offset, page struct *page) is the basic cache-search
function. It looks at the pages in the hash queue given by the
page argument, examining only those that are associated with
the given mapping (address_space), and returns the one with the
matching offset, if such exists. The code is "interesting", but pretty
straightforward.
As an example of how __find_page_nolock() is used, consider the
kernel's activity when handling a page fault on an executable text
page:
- The fault handler uses the fault address to find the
vm_area_struct corresponding to the fault address. It calls the
*nopage member of the vm_area_struct's vm_ops, which in the case of
a text area will be filemap_nopage().
- filemap_nopage() tracks down the address_space via the
vm_area_struct->file member. It computes the page's offset into the
file using the vm_area_struct's base address and the fault
address. It computes the hash value of the faulting page, and calls
__find_get_page() to search the page cache, giving it the
address_space mapping, offset, and hash queue.
- __find_get_page() calls __find_page_nolock() to search the hash
queue for the page at the given offset in the proper mapping. If
found, it takes an additional reference to the page on behalf of the
faulting process, and returns the page. If the page isn't found,
filemap_nopage() reads the page into the page cache and starts over
from the beginning.
page_cache_read()
int page_cache_read(struct file * file, unsigned long offset) checks
to see if a page corresponding to the given file and offset is already
in the page cache. If it's not, it reads in the page and adds it to
the cache.
Well, actually the other way around:
- Line 557 is the page in the cache? If so, do nothing.
- Line 562 allocate a page (page_cache_alloc() ==
get_free_page()).
- Line 566 add the page to the page cache, associating it with
the specified file and offset.
- Line 567 do whatever is necessary to asychronously read the
page. Note that adding the page to the page cache has locked it;
anyone who wants to use the page must wait for it to be unlocked
using wait_on_page(). The page will be unlocked when its read
completes. [How does this happen?]
wait_on_page()
void wait_on_page(page struct *page) simply checks to see if the given
page is locked, and if it is, waits for it to be unlocked. The actual
waiting is done in ___wait_on_page() in filemap.c:
- Line 609 declare a waitqueue entry and add it to the page's
wait queue. We're declaring the waitqueue entry on the stack; that's
OK, since this instance of ___wait_on_page() won't return until the
page is unlocked.
- The loop on line 612 is executed while the page is locked.
- On line 613 we try to force any I/O in progress on the page
to complete.
- We then set the task state to uninterruptible sleep, check to
see if the page has been unlocked in the meantime, and if not, run
the disk I/O task queue and schedule another process. [sync_page()
seems to normally do run_task_queue(&tq_disk), via
block_sync_page(). So we're forcing I/O twice?]
- We'll wake up and continue from schedule() as soon as the
page gets unlocked and the wait queue is awoken.
__block_write_full_page()
When we need to write a page from the page cache into backing storage,
in page_launder() for example, we invoke the writepage member of the
page mapping's address_space_operations structure. In most cases this
will ultimately invoke __block_write_full_page(struct inode *inode,
page struct *page, get_block_t *get_block), defined in buffer.c on
line 1492 __b_w_f_p() writes out the given page via the buffer
cache. Essentially, all we do here is create buffers for the page if
it hasn't been done yet, map them to disk blocks, and submit them to
the disk I/O subsystem to be physically written out asynchronously. We
set the I/O completion handler to end_buffer_io_async(), which is a
special handler that knows how to deal with page-cache read and write
completions.
kswapd()
kswapd() is the kernel swap thread. It simply loops, deactivating and
laundering pages as necessary. If memory is tight, it is more
aggressive, but it will always trickle a few inactive pages per minute
back into the free pool, if there are any inactive pages to be
trickled.
When there's nothing to do, kswapd() sleeps. It can be woken
explicitly by tasks that need memory using wakeup_kswapd(); a task
that wakes kswapd() can either let it carry on asynchronously, or wait
for it to free some pages and go to sleep again.
try_to_free_pages() does almost the same thing that kswapd() does,
but it does it synchronously without invoking kswapd. Some tasks might
deadlock with kswapd() if, for example, they are holding kernel locks
that kswapd() needs in order to operate; such processes call
try_to_free_pages() instead of wakeup_kswapd(). [An example would be
good here...]
Most of kswapd's work is done by do_try_to_free_pages().
do_try_to_free_pages()
int do_try_to_free_pages(unsigned int gfp_mask, int user) takes a
number of actions in order to try to create free (or freeable) pages.
- Line 921 move clean inactive_dirty pages to the inactive_clean
list, and if necessary, launder some inactive_dirty pages by
starting writes to the disk.
- Line 931 attempt to move some active pages to the
inactive_dirty list.
- Line 935 free unused slab memory.
refill_inactive()
int refill_inactive(unsigned int gfp_mask, int user) is invoked
to try to deactivate active pages. It tries to create enough freeable
pages to erase any existing free memory shortage.
- The loop beginning on line 849 loops over priority
until it reaches 0. Each time through the loop we decrement the
priority; lower priority indicates higher memory pressure, and will
cause us to take more aggressive action to try to free pages.
- On line 857 we attempt to refill the inactive_dirty list by
deactivating as many unused active pages as possible, using
refill_inactive_scan(). If this works well, it's all we need, and
the loop will exit.
- On line 868 we shrink the inode cache and the dentry cache, if
possible.
- Then, on line 874 we attempt to forcibly remove process page
mappings. The swap_out() function is seriously mis-named: it doesn't
actually do any swapping out. Rather, it tries to evict pages from
process PTE mappings. If it can eliminate all process mappings of
the page, the page becomes eligible for deactivation and
laundering.
- If we've eliminated the memory shortage, exit the loop. If we
haven't been able to free any pages, do the loop again, more
aggressively. Note that the loop is certain to exit eventually by
either eliminating the memory shortage, or failing to make progress
even at priority 0. (Unless you're one of those lucky souls with an
infinite number of freeable pages on your system.)
refill_inactive_scan()
refill_inactive_scan() looks at each page on the active list, aging
the pages up if they've been referenced since last time and down if
they haven't (this is counterintuitive - older pages have lower age
values).
If a page's age is 0, and the only remaining references to the
page are those of the buffer cache and/or page cache, then the
page is deactivated (moved to the inactive_dirty_list).
There is some misleading commentary in this function. It doesn't
actually use age_page_down() to age down the page and deactivate it;
rather, it ages the page down, and then additionally checks the
reference count to be sure it is really unreferenced by process PTEs
before actually deactivating the page.
swap_out
int swap_out(unsigned int priority, int gfp_mask) scans process page
tables and attempts to unmap pages from process VM. It computes the
per-task swap_cnt value - basically a "kick me" rating, higher values
make a process more likely to be victimized - and uses it to decide
which process to try to "swap out" first. Larger processes which have
not been swapped out in a while have the best chance of being chosen.
The actual unmapping is done by swap_out_mm(), which tries to
unmap pages from a process's memory map; swap_out_vma(), which tries
to unmap pages from a particular VM area within a process by walking
the part of the page directory corresponding to the VM area looking
for page middle directory entries to unmap; swap_out_pgd() which walks
page middle directory entries looking for page tables to unmap;
swap_out_pmd() which walks page tables looking for page table entries
to unmap; and ultimately, try_to_swap_out(), where all the interesting
stuff happens.
try_to_swap_out()
int try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma,
unsigned long address, pte_t * page_table, int gfp_mask) attempts to
unmap the page indicated by the given pte_t from the given
vm_area_struct. It returns 0 if the higher-level swap-out code should
continue to scan the current process, 1 if the current scan should be
aborted and another swap victim chosen. It's a bit complicated:
- Line 46 if the given page isn't present, return 0 to continue
the scan.
- Do some paranoia checks to be sure the page is in a sane state.
- Line 52 if the current process's memory map has been
sufficiently victimized by the swap code, return 1 to force a new
process to be chosen. Otherwise, reduce the process's "good victim"
rating.
- Line 57 check if the page is active.
- Line 59 age the page up (make it younger - someone
didn't have enough caffeine when they wrote the page-aging code :-)
if it's been referenced recently.
- Line 65 if the page is not active, age the page down,
but do not move it to the inactive_dirty_list because (by
virtue of us looking at the page in try_to_swap_out()) it's still
mapped and thus not freeable. That's the meaning of the mysterious
comment on line 64
- Line 72 if the page is young, don't unmap it. Note that under
high memory load, both swap_out() and refill_inactive_scan() will be
called more frequently (often by processes attempting to allocate
free pages), thus pages will age more quickly, and we have a
reasonable chance to find young pages in swap_out().
- Line 75 lock the page frame so nothing unexpected happens to
it while we're molesting it. If we can't lock it, continue to scan
the VMA.
- Line 83 clear the PTE, flush the TLB (so the CPU doesn't
continue to believe the page is mapped).
- Line 94 if the page is already in the swap cache (that is,
already in the page cache and associated with swap space rather than
with a regular file mapping), we just replace the pte with the
swapfile entry that will allow us to swap the page in again when
required. We also mark the page dirty, if necessary, to force it to
be written out next time page_launder() looks at it.
- Line 102 - drop_pte: we jump to this label from a number of
places in order to finish the unmapping operation. Unlock the page,
attempt to deactivate it (this will only succeed if no other process
is using it), and drop the process's reference. Return 0 to continue
the scan.
- Line 124 We get here if the page wasn't already in the swap
cache (it's either an anonymous page that's never been swapped out
before, or else it's an mmapped page with backing storage on a
regular file and a valid page->mapping). If the page is clean we
just goto drop_pte, since the page can be recovered by paging it in
from its backing store, if any. Note that if it's a clean anonymous
page, and it's not in the swap cache, then no process has ever
written data to the page - so it can be trivially replaced, if
necessary, by simply allocating a free page.
- Line 133 the page is definitely dirty. If the page has a
mapping, we're cool, because page_launder() will write it out if
necessary, so we can just goto drop_pte and be done.
- Line 144 The page had no mapping, so we need to allocate a
swap page for it and add it to the swap cache. (The swap cache is
just that part of the page cache that's associated with the
swapper_space mapping rather than with a regular file mapping.) If
the attempt to allocate a swap page fails, we must restore the PTE
by jumping to out_unlock_restore.
- Line 149 swap space allocated successfully. We add the page to
the page cache, set it dirty so page_launder() knows to write it
out, and goto set_swap_pte to replace the PTE with the swap entry,
deactivate the page, and continue the scan.
Simple, eh?
page_launder()
int page_launder(int gfp_mask, int sync) is responsible for moving
inactive_dirty pages to the inactive_clean list, writing them out to
backing disk if necessary. It's pretty straightforward. It makes at
most two loops over the inactive_dirty_list. On the first loop it
simply moves as many clean pages as it can to the
inactive_clean_list. On the second loop it starts asynchronous writes
for any dirty pages it finds, after locking the pages; but it leaves
those pages on the inactive_dirty list. Eventually the writes will
complete and the pages will be unlocked, at which point future
invocations of page_launder() can move them to the
inactive_clean_list.
page_launder() may also do synchronous writes if it's
invoked by a user process that's trying to free up enough memory to
complete an allocation. In this case we start the async writes as
above (vmscan.c line 561), but then wait for them to complete, by
calling try_to_free_buffers() with wait > 1 (vmscan.c, line 595 - line
603). Starting the page write causes the page to be added to the
buffer cache; try_to_free_buffers() attempts to remove the page from
the buffer cache by writing all its buffers to disk, if necessary.
reclaim_page()
page struct * reclaim_page(zone_t * zone) attempts to reclaim an
inactive_clean page from the given zone. It examines each page in the
zone's inactive_clean_list, moving referenced pages to the active
list, and dirty pages to the inactive_dirty list. (Both of those tests
are basically paranoid and should never happen, since code that remaps
an inactive page in the cache also moves it to the active list.) When
it finds a genuine clean page, which should happen immediately, it
removes the page from the page cache and returns it as a free page.
Note the code from line 431 to line 439 every inactive_clean page
is either in the swap cache (if it was an anonymous page) or else it's
in the page cache associated with an mmapped file (page->mapping &&
!PageSwapCache(page)). Exactly one of those conditions will be true.
Note also line 455 this is just another paranoia check. Pages on
the inactive_clean list are no longer in the buffer cache (since
they've been written out and their buffers freed in page_launder), so
every reclaimed page will have a reference count of exactly 1 (the
page cache's reference), in the absence of broken kernel modules or
drivers.
deactivate_page()
deactivate_page() is called within try_to_swap_out() to try to
deactivate a page. It uses deactivate_page_nolock() for the
interesting bits. It checks that the page is unused by anyone except
the caller (which presumably is about to release a reference) and the
page and buffer caches, then makes the page old by setting its age to
0 and moves it to the inactive_dirty_list, but only if it was on
the active list. We never want to deactivate a page that's not in
the page cache, since we have no idea what else such pages may be
being used for, and the page cache makes some pretty strong
assumptions about what it can do with the pages under its control.
Questions and comments to Joe Knapka
The LXR links in this page were produced by lxrreplace.tcl, which is available for free.
Credits