The kernel page tables map as much of physical memory as possible into the address range starting at PAGE_OFFSET. The physical pages occupied by the kernel code and data are reserved and will never be allocated to any other purpose; all other physical pages are allocated to process virtual memory, the buffer cache, additional kernel virtual memory, and so forth, as necessary. In order to do this, we must have a way of keeping track of which physical pages are in use, by whom.
The zone allocator manages physical memory. Any kernel code can call
upon the zone allocator via __alloc_pages() and be given a block of
2^n pages aligned on a 2^n-page
boundary, for small values of n (up to 8 or so). We can also
hand 2^n-page-aligned blocks back to the zone
allocator via __free_pages() to free them. The n exponent is
called the "order" of the allocation. Blocks don't necessarily need to
be freed in the same way that they're allocated, but they must be
freed with the proper alignment, and the initial page frame of the
block must have a nonzero reference count. For example, you can
request an 8-page, 8-page-aligned block, and then free a 2-page,
2-page-aligned block within the larger block, but you must first
increment the reference count of the first page in the smaller
block. The reason for that is that when you allocate an 8-page block,
only the first page of the block has its reference count incremented
-- but the page-freeing code will not free a page with reference count
0; the reference count for the other pages must be handled
manually. Incrementing the refcounts for all pages in a block would
add overhead to the page allocation code that would rarely be
necessary.
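To make the reference-count rule concrete, here is a small sketch of that 8-page/2-page scenario. alloc_pages(), __free_pages() and get_page() are the real 2.4-era interfaces, but the example() function itself is hypothetical and only illustrates the bookkeeping described above:

    #include <linux/mm.h>

    static void example(void)
    {
            /* 8 pages, aligned on an 8-page boundary.  Only the first
             * page of the block has its reference count set by the
             * allocator; the other seven counts stay at zero. */
            struct page *block = alloc_pages(GFP_KERNEL, 3);
            if (!block)
                    return;

            /* Before freeing the 2-page sub-block that starts 4 pages
             * in, give its first page a nonzero reference count... */
            get_page(block + 4);

            /* ...then hand it back at order 1 (2 pages, 2-page aligned). */
            __free_pages(block + 4, 1);

            /* The remaining six pages are still ours, and must
             * eventually be freed with the same care. */
    }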
Zones
Different ranges of physical pages may have different properties, for
the kernel's purposes. For example, Direct Memory Access, which allows
peripherals to read and write data directly to RAM without the CPU's
intervention, may only work for physical addresses less than
16MB. Some systems have more physical RAM than can be mapped between
PAGE_OFFSET and 4GB; those physical pages are not directly accessible
to the kernel, and so must be treated differently. The zone allocator handles such
differences by dividing memory into a number of zones and treating
each zone as a unit for allocation purposes. Any particular allocation
request utilizes a list of zones from which the allocation may be
attempted, in most-preferred to least-preferred order. For example, a
request for a user page should be filled first from the "normal" zone;
if that fails, from the HIGHMEM zone; and if that fails, from the DMA
zone. Thus, the zonelist for such allocations consists of the normal,
HIGHMEM, and DMA zones, in that order. On the other hand, a request
for a DMA page may only be fulfilled from the DMA zone, so
the zonelist for such requests contains only the DMA zone.
The root data structure maintained by the zone allocator is the
zone_struct, which gathers all the data relevant for managing a
particular zone. The zonelist_struct consists of an array of
zone_struct pointers and a gfp_mask indicating which kinds of
allocations can use the zonelist. The zone_struct->offset indicates
the offset in physical memory of the beginning of the zone.
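For reference, here are the two structures roughly as they appear in the 2.4-era <linux/mmzone.h>, abridged to the fields discussed here (the exact field list varies a little between 2.4 releases):

    typedef struct zone_struct {
            spinlock_t      lock;
            unsigned long   offset;          /* where the zone begins (see above) */
            unsigned long   free_pages;
            unsigned long   inactive_clean_pages;
            unsigned long   pages_min, pages_low, pages_high;
            free_area_t     free_area[MAX_ORDER];  /* buddy freelists and bitmaps */
            char            *name;
            unsigned long   size;            /* zone size, in pages */
            struct page     *zone_mem_map;   /* page structs for this zone */
            /* ... other fields omitted ... */
    } zone_t;

    typedef struct zonelist_struct {
            zone_t  *zones[MAX_NR_ZONES+1];  /* NULL-terminated, most preferred first */
            int     gfp_mask;
    } zonelist_t;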
The Memory Map
Each physical page in the system is represented by a page struct
in the zone_struct->zone_mem_map array, which is an array of page
structs that parallels the physical block of memory represented by the
zone: zone_mem_map[0] represents the first page in the zone,
zone_mem_map[1] represents the second page, etc. Each page struct
records pertinent properties of the associated page, such as the
reference count. I'll discuss each member of the page struct as it
becomes relevant to the discussion.
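The page struct itself (struct page, also typedef'd mem_map_t) looks roughly like this in 2.4, again abridged to the members that matter for this discussion:

    typedef struct page {
            struct list_head   list;    /* freelist linkage while the page is free;
                                         * used for other lists otherwise */
            atomic_t           count;   /* reference count */
            unsigned long      flags;   /* PG_reserved, PG_dirty, ... */
            struct zone_struct *zone;   /* the zone this page belongs to */
            /* ... many other fields omitted ... */
    } mem_map_t;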
The Buddy System
Within each zone, the buddy system is used to manage physical
pages.
Most page allocation and freeing is done when allocating and
de-allocating pages for process virtual address space. Since such
allocations usually occur during page-fault processing, the majority
of page allocations are for a single page. However, some kernel
entities, particularly device drivers, have special needs with respect
to physical memory. One might, for example, need to allocate 4
contiguous pages of DMA-capable physical RAM. The buddy system allows
such allocations to be done quickly and efficiently.
Essentially, the buddy system treats physical RAM as a collection of
2^n-page-sized blocks aligned on 2^n-page
boundaries, and merges adjacent free blocks into single higher-order
blocks. Each 2^n-page block is the "buddy" of the other half
of the 2^(n+1)-page block containing it. The allocator keeps
lists of all the free one-page blocks, all the free two-page, two-page-aligned
blocks, all the free four-page, four-page-aligned blocks, etc. To allocate a
block of a given order, we check the freelist of the specified order and all
higher orders. If a block is found at the specified order, it is allocated
immediately. If a block of higher order must be used, then we divide the larger
block into two blocks of the next lower order, add the lower half to the
appropriate freelist, and allocate the memory from the upper half, executing
this step recursively if necessary. When freeing memory, we check whether the
block being freed has a free buddy block; if so, we combine the two blocks into
a single free block by removing the adjacent block from its freelist and then
freeing the larger block; again, this process is performed recursively if
necessary. There are two data structures used within each zone to keep track of
all this: the freelists and the buddy bitmaps. The freelist
head and the buddy bitmap for each order are gathered together into a free_area_struct; the zone_struct
maintains one free_area_struct for each order.
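free_area_struct itself is tiny; in the 2.4-era <linux/mmzone.h> it is essentially just the freelist head plus the bitmap pointer:

    typedef struct free_area_struct {
            struct list_head   free_list;  /* first pages of free blocks of this order */
            unsigned int       *map;       /* buddy bitmap for this order */
    } free_area_t;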
The freelists are maintained using the page structs of the zone's
memory map; the list_head pointers in the free_area_struct point to
the first and last pages in the freelist. Free page structs for each
order are linked together using the page->list.next and
page->list.prev pointers (those fields are used for other purposes in
non-free pages). All the blocks of a given order are linked using the
first page of the block; so for example, a four-page free block's
first page will be on the order-2 freelist, and the three high pages
will not be on a freelist at all - they are implicitly free by being
part of a free 4-page-aligned block.
The buddy bitmap for each order is maintained in the
zone->free_area[order].map array. The bitmap for each order records the
fragmentation status of each aligned pair of blocks of that order. A
pair is fragmented if one of its halves is at least partially used
while the other half is completely free; each buddy bitmap describes the
entire zone in this manner. Thus, the memory region addressed by
zone->free_area[order].map contains
pages_in_zone / 2^(order+1) bits. For example, if the
machine had a total of 16 pages of RAM, 16-page aligned, with the
fourth page (page 3) in use and the others all free, then there would
be 4 buddy bitmaps:
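Working through the fragmentation rule above for that layout (pages numbered 0 through 15, page 3 in use, everything else free), my sketch of those four bitmaps comes out like this:

    order 0 (one bit per 2-page pair):    0 1 0 0 0 0 0 0   (only the pair {2,3} is split between used and free)
    order 1 (one bit per 4-page block):   1 0 0 0           (pages 0-1 free, pages 2-3 partially used)
    order 2 (one bit per 8-page block):   1 0               (pages 0-3 partially used, pages 4-7 free)
    order 3 (one bit per 16-page block):  1                 (pages 0-7 partially used, pages 8-15 free)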
Now, if the situation were reversed (all pages except page 3 in use),
the order-0 bitmap would come out exactly the same, while the
higher-order bitmaps would be all zeroes, since no half of any larger
block would be completely free. The buddy maps don't directly tell us
anything about the used vs. free status of a page; that's the
freelists' job. The buddy maps' purpose is to make it quick and easy to
check whether a block being freed has an unused buddy or not. If we're
freeing a block B, and the buddy map tells us that the next-higher-order
block containing B is fragmented, then B's buddy must be free (since B
itself is in use and the block is fragmented); so we should merge B with
its buddy and free the higher-order block instead.
The Zone Allocator Code
free_area_init_core()
The memory map is built, and the freelists and buddy bitmaps initialized, in
free_area_init_core(). There is a lot of stuff that is only relevant to kernels
with CONFIG_DISCONTIGMEM set. With CONFIG_DISCONTIGMEM, physical memory is
divided into a number of "nodes", each with its own mem_map. I'm going to mostly
ignore that for now and concentrate on the
contiguous-starting-from-physical-address-0 case. The node data (in the
pg_data_t structs) is initialized by the bootmem
allocator; in the non-CONFIG_DISCONTIGMEM case, there is one node,
represented by the global contig_page_data variable.
free_area_init_core() takes seven arguments. One of them, gmap, is an
output parameter that receives the address of the page struct array
allocated within free_area_init_core(); in the non-CONFIG_DISCONTIGMEM
case, the global mem_map receives this value.
Here's how things happen:
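Roughly, free_area_init_core() proceeds like this (a paraphrase of the 2.4-era code in mm/page_alloc.c, not a line-by-line account; details differ between 2.4 releases):

    - compute the node's total page count from the zones_size[] array;
    - allocate the node's array of page structs with the bootmem allocator
      (unless one was passed in) and store its address through gmap;
    - initialize every page struct: reference count 0, PG_reserved set,
      empty wait queue and list linkage;
    - for each zone (DMA, normal, highmem), fill in the zone_struct fields
      (size, name, offset, zone_mem_map, and the pages_min / pages_low /
      pages_high watermarks) and point each page struct at its zone;
    - for each order within each zone, initialize an empty freelist and
      allocate the (zeroed) buddy bitmap, again from the bootmem allocator.

The usable pages themselves only enter the buddy freelists later, when the bootmem allocator is retired at the end of initialization.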
build_zonelists()
build_zonelists() constructs a table that maps each possible GFP mask
to a zonelist. The GFP mask is a bitmask that tells the allocation
logic what kind of memory we want to allocate, and gives some
information about how we want to do it (for example, whether it is
acceptable to sleep waiting for a free page or not).
Beginning on line 712, we loop over all possible GFP masks. Within the
loop, things go like this:
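The body of that loop looks roughly like this (a simplified sketch of the 2.4-era code, not a verbatim quote). Note how the switch deliberately falls through, so a __GFP_HIGHMEM request gets the zonelist {HIGHMEM, NORMAL, DMA}, an ordinary request gets {NORMAL, DMA}, and a __GFP_DMA request gets {DMA} alone:

    /* i is the loop index, i.e. the GFP mask value this zonelist is for. */
    zonelist = pgdat->node_zonelists + i;
    memset(zonelist, 0, sizeof(*zonelist));
    zonelist->gfp_mask = i;

    j = 0;
    k = ZONE_NORMAL;
    if (i & __GFP_HIGHMEM)
            k = ZONE_HIGHMEM;
    if (i & __GFP_DMA)
            k = ZONE_DMA;

    switch (k) {
    case ZONE_HIGHMEM:
            zone = pgdat->node_zones + ZONE_HIGHMEM;
            if (zone->size)
                    zonelist->zones[j++] = zone;
            /* fall through */
    case ZONE_NORMAL:
            zone = pgdat->node_zones + ZONE_NORMAL;
            if (zone->size)
                    zonelist->zones[j++] = zone;
            /* fall through */
    case ZONE_DMA:
            zone = pgdat->node_zones + ZONE_DMA;
            if (zone->size)
                    zonelist->zones[j++] = zone;
    }
    zonelist->zones[j] = NULL;      /* the list is NULL-terminated */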
__alloc_pages()
The interface to __alloc_pages() is wrapped up by the
__get_free_pages() and alloc_pages() functions, but __alloc_pages() is
really the central physical-page allocation function in the
kernel.
It attempts to allocate a free block of the given order from each of
the zones in the zonelist in turn. It is comparatively well-commented,
so I won't go into too much detail here. Essentially, __alloc_pages()
looks at the zones in the given zonelist in order, trying to allocate
a block of 2^order pages using the buddy
allocator. It wakes up the swapper and/or bdflush if there are not
enough free pages; those tasks (both kernel threads) attempt to free
some pages by writing dirty pages
to disk.
There are some complicated interactions between __alloc_pages() and
the higher-level VM code. Each zone keeps track of pages that have
recently been mapped into some task's VM, and we may decide to reclaim
some of those pages rather than allocating actual free pages.
If there are a lot ( >= zone->pages_low) of free pages in any of the
zones in the zonelist, we attempt to immediately allocate a block of
the requested order at line 332 with a call to rmqueue(). Otherwise, the actual allocation
attempts are delegated to __alloc_pages_limit(). Basically, we are
trying to allocate something first from a not-heavily-used zone, and
second from a more-preferred zone. That is, if one zone in
the zonelist has substantially more free pages, we are likely to
allocate from it, even if it's not the preferred zone for the
allocation. But when all the zones are more-or-less equally allocated,
we will allocate from the most-preferred zone that can support the
allocation request.
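In outline, the initial "plenty of free pages" pass over the zonelist looks something like this (heavily simplified from the 2.4 source; the wakeups, watermark bookkeeping and slower fallback paths are left out):

    zone_t **zonep = zonelist->zones;
    zone_t *z;
    struct page *page;

    while ((z = *zonep++) != NULL) {
            /* If this zone still has a comfortable number of free
             * pages, grab a block from its buddy freelists right away. */
            if (z->free_pages >= z->pages_low) {
                    page = rmqueue(z, order);
                    if (page)
                            return page;
            }
    }
    /* Otherwise fall back to __alloc_pages_limit(), called with
     * progressively less cautious watermarks, and eventually try to
     * reclaim or swap pages out. */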
__alloc_pages_limit()
__alloc_pages_limit(zonelist_t *zonelist, unsigned long order, int
limit, int direct_reclaim) does the actual work of interrogating the
zone_structs and determining whether an allocation from any zone is
possible within the specified constraints. It's called when
__alloc_pages() can't immediately find a mostly-free zone from which
to do the allocation.
The first two arguments are obvious, the last two less so.
limit is an enum that tells us how to figure out how many
pages must be free in a zone before we use that zone for
allocation. zone_struct->pages_min is the minimum number of free+inactive_clean (F+IC) pages the
zone should keep available; zone_struct->pages_low is the number of
F+IC pages at which the zone is considered to be low on free pages;
and zone_struct->pages_high is the number of F+IC pages considered to
be "lots". If we find a zone with a suitable number of F+IC pages, we
call rmqueue() to try to perform the
allocation.
direct_reclaim is 1 if the allocation request is one which can
possibly be fulfilled by reclaiming a page from the inactive_clean
list. This flag is set in __alloc_pages() on line 297. (The PF_MEMALLOC
flag is set if the allocation is a recursive one - that is, if, while
trying to get a free block, we end up needing some memory for some
reason and enter __alloc_pages() again, PF_MEMALLOC will be set.)
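Putting the two parameters together, the per-zone check inside __alloc_pages_limit() is roughly the following (paraphrased from the 2.4 source; PAGES_MIN, PAGES_LOW and PAGES_HIGH are the real values the limit argument takes, and the real code has a couple of extra conditions around when it will attempt direct reclaim):

    unsigned long water_mark;
    struct page *page = NULL;

    switch (limit) {
    default:
    case PAGES_MIN:
            water_mark = z->pages_min;
            break;
    case PAGES_LOW:
            water_mark = z->pages_low;
            break;
    case PAGES_HIGH:
            water_mark = z->pages_high;
            break;
    }

    if (z->free_pages + z->inactive_clean_pages > water_mark) {
            /* If the caller can cope with a reclaimed page, try to pull
             * one straight off the zone's inactive_clean list... */
            if (direct_reclaim)
                    page = reclaim_page(z);
            /* ...otherwise, or if that failed, take a fresh block from
             * the buddy freelists. */
            if (!page)
                    page = rmqueue(z, order);
            if (page)
                    return page;
    }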
Note: the comment beginning on line 498 doesn't make sense to
me. direct_reclaim is set only once, it's a local, and the loop
referred to in the comment doesn't change it, so it seems to me like
direct_reclaim is possible under exactly the conditions expressed on
line 297. Unless there's stuff going on inside reclaim_pages() that's
sensitive to PF_MEMALLOC. And there doesn't appear to be anything like
that.
[working on it...]
rmqueue()
rmqueue() does the actual work of removing a suitably-sized block from a
zone's buddy freelists, splitting a higher-order block via expand() when
necessary. Here's how it works:
I must say, this is a whole lot clearer than it was in 2.2!
expand()
expand() is called from rmqueue() when the only free block available is of a higher order than the one requested. What we're going to do in that case is return the top 2^low pages of the allocated block to the requester (low being the requested order), and put the remainder of the pages back on the appropriate freelists, managing the buddy bitmaps as appropriate. For example, let's say we've requested a 2-page block, but we had to allocate it from the order-3 (8-page) freelist, and we happen to have chosen physical page 800 as the base of the 8-page block. On the first trip through the loop we'd give back the 4-page (order-2) block starting at page 800; on the second trip, the 2-page (order-1) block starting at page 804; and then we would return the 2-page block at page 806 to the caller.
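For reference, here is expand() itself, roughly as it appears in the 2.4 source, lightly abridged (the BAD_RANGE() sanity checks are dropped). low is the order the caller asked for, and high is the order of the block rmqueue() actually took off the freelists:

    static inline struct page *expand(zone_t *zone, struct page *page,
                                      unsigned long index, int low,
                                      int high, free_area_t *area)
    {
            unsigned long size = 1 << high;

            while (high > low) {
                    area--;
                    high--;
                    size >>= 1;
                    /* Put the lower half of the block on the freelist one
                     * order down and toggle the buddy bit for that pair... */
                    memlist_add_head(&page->list, &area->free_list);
                    MARK_USED(index, high, area);
                    /* ...then keep splitting the upper half. */
                    index += size;
                    page += size;
            }
            return page;
    }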
__free_pages(struct page *page, unsigned long order) frees the block of size 2^order pages starting at the given page. It is very slightly tricky. The entire function looks like this:

    void __free_pages(struct page *page, unsigned long order)
    {
            if (!PageReserved(page) && put_page_testzero(page))
                    __free_pages_ok(page, order);
    }

The tricky part is the call to put_page_testzero(), which decrements the page's reference count and returns 1 if the reference count is 0 after the decrement. Thus, if the caller is not the last user of the page, the block will not actually be freed.
A great deal of the higher-level VM logic depends upon this test being done properly; see for example try_to_swap_out().
If the caller is the last user of the page block, then __free_pages() goes on to call __free_pages_ok(), which links the page block back into the freelists and manages the buddy maps, coalescing buddy blocks as necessary. This is essentially the inverse operation to expand(), and therefore I will not discuss it further here.
I'm suspending work on the zone-allocator documentation for the moment and moving on to higher-level VM stuff. I need to understand how direct_reclaim() works, which means understanding exactly how the page lists are managed. I'll get back to this at some point, but if you understand how __alloc_pages() works and the nature of the data structures involved, figuring out other zone allocator functions (eg __free_pages_ok() ) is not too hard, if you ignore the interactions between the zone allocator and the page-aging logic.