The kernel page tables map as much of physical memory as possible into the address range starting at PAGE_OFFSET. The physical pages occupied by the kernel code and data are reserved and will never be allocated to any other purpose; all other physical pages are allocated to process virtual memory, the buffer cache, additional kernel virtual memory, and so forth, as necessary. In order to do this, we must have a way of keeping track of which physical pages are in use, by whom.
The zone allocator manages physical memory. Any kernel code can call
upon the zone allocator via __alloc_pages() and be given a block of
2^n pages aligned on a 2^n-page
boundary, for small values of n (up to 8 or so). We can also
hand 2^n-page-aligned blocks back to the zone
allocator via __free_pages() to free them. The n exponent is
called the "order" of the allocation. Blocks don't necessarily need to
be freed in the same way that they're allocated, but they must be
freed with the proper alignment, and the initial page frame of the
block must have a nonzero reference count. For example, you can
request an 8-page, 8-page-aligned block, and then free a 2-page,
2-page-aligned block within the larger block, but you must first
increment the reference count of the first page in the smaller
block. The reason for that is that when you allocate an 8-page block,
only the first page of the block has its reference count incremented
-- but the page-freeing code will not free a page with reference count
0; the reference count for the other pages must be handled
manually. Incrementing the refcounts for all pages in a block would
add overhead to the page allocation code that would rarely be
necessary.
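To make the reference-count rule concrete, here is a small sketch of that 8-page/2-page scenario. alloc_pages(), __free_pages() and get_page() are the real 2.4-era interfaces, but the example() function itself is hypothetical and only illustrates the bookkeeping described above:

    #include <linux/mm.h>

    static void example(void)
    {
            /* 8 pages, aligned on an 8-page boundary.  Only the first
             * page of the block has its reference count set by the
             * allocator; the other seven counts stay at zero. */
            struct page *block = alloc_pages(GFP_KERNEL, 3);
            if (!block)
                    return;

            /* Before freeing the 2-page sub-block that starts 4 pages
             * in, give its first page a nonzero reference count... */
            get_page(block + 4);

            /* ...then hand it back at order 1 (2 pages, 2-page aligned). */
            __free_pages(block + 4, 1);

            /* The remaining six pages are still ours, and must
             * eventually be freed with the same care. */
    }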
Zones
Different ranges of physical pages may have different properties, for
the kernel's purposes. For example, Direct Memory Access, which allows
peripherals to read and write data directly to RAM without the CPU's
intervention, may only work for physical addresses less than
16MB. Some systems have more physical RAM than can be mapped between
PAGE_OFFSET and 4GB; those physical pages are not directly accessible
to the kernel, and so must be treated differently. The zone allocator handles such
differences by dividing memory into a number of zones and treating
each zone as a unit for allocation purposes. Any particular allocation
request utilizes a list of zones from which the allocation may be
attempted, in most-preferred to least-preferred order. For example, a
request for a user page should be filled first from the "normal" zone;
if that fails, from the HIGHMEM zone; and if that fails, from the DMA
zone. Thus, the zonelist for such allocations consists of the normal,
HIGHMEM, and DMA zones, in that order. On the other hand, a request
for a DMA page may only be fulfilled from the DMA zone, so
the zonelist for such requests contains only the DMA zone.
The root data structure maintained by the zone allocator is the
zone_struct, which gathers all the data relevant for managing a
particular zone. The zonelist_struct consists of an array of
zone_struct pointers and a gfp_mask indicating which kinds of
allocations can use the zonelist. The zone_struct->offset indicates
the offset in physical memory of the beginning of the zone.
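For reference, here are the two structures roughly as they appear in the 2.4-era <linux/mmzone.h>, abridged to the fields discussed here (the exact field list varies a little between 2.4 releases):

    typedef struct zone_struct {
            spinlock_t      lock;
            unsigned long   offset;          /* where the zone begins (see above) */
            unsigned long   free_pages;
            unsigned long   inactive_clean_pages;
            unsigned long   pages_min, pages_low, pages_high;
            free_area_t     free_area[MAX_ORDER];  /* buddy freelists and bitmaps */
            char            *name;
            unsigned long   size;            /* zone size, in pages */
            struct page     *zone_mem_map;   /* page structs for this zone */
            /* ... other fields omitted ... */
    } zone_t;

    typedef struct zonelist_struct {
            zone_t  *zones[MAX_NR_ZONES+1];  /* NULL-terminated, most preferred first */
            int     gfp_mask;
    } zonelist_t;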
The Memory Map
Each physical page in the system is represented by a page struct
in the zone_struct->zone_mem_map array, which is an array of page
structs that parallels the physical block of memory represented by the
zone: zone_mem_map[0] represents the first page in the zone,
zone_mem_map[1] represents the second page, etc. Each page struct
records pertinent properties of the associated page, such as the
reference count. I'll discuss each member of the page struct as it
becomes relevant to the discussion.
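The page struct itself (struct page, also typedef'd mem_map_t) looks roughly like this in 2.4, again abridged to the members that matter for this discussion:

    typedef struct page {
            struct list_head   list;    /* freelist linkage while the page is free;
                                         * used for other lists otherwise */
            atomic_t           count;   /* reference count */
            unsigned long      flags;   /* PG_reserved, PG_dirty, ... */
            struct zone_struct *zone;   /* the zone this page belongs to */
            /* ... many other fields omitted ... */
    } mem_map_t;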
The Buddy System
Within each zone, the buddy system is used to manage physical
pages.
Most page allocation and freeing is done when allocating and
de-allocating pages for process virtual address space. Since such
allocations usually occur during page-fault processing, the majority
of page allocations are for a single page. However, some kernel
entities, particularly device drivers, have special needs with respect
to physical memory. One might, for example, need to allocate 4
contiguous pages of DMA-capable physical RAM. The buddy system allows
such allocations to be done quickly and efficiently.
Essentially, the buddy system treats physical RAM as a collection of
2^n-page-sized blocks aligned on 2^n-page
boundaries, and merges adjacent free blocks into single higher-order
blocks. Each 2^n-page block is the "buddy" of the other half
of the 2^(n+1)-page block containing it. The allocator keeps
lists of all the free one-page blocks, all the free two-page, two-page-aligned
blocks, all the free four-page, four-page-aligned blocks, etc. To allocate a
block of a given order, we check the freelist of the specified order and all
higher orders. If a block is found at the specified order, it is allocated
immediately. If a block of higher order must be used, then we divide the larger
block into two blocks of the next lower order, add the lower half to the
appropriate freelist, and allocate the memory from the upper half, executing
this step recursively if necessary. When freeing memory, we check whether the
block being freed has a free buddy block; if so, we combine the two blocks into
a single free block by removing the adjacent block from its freelist and then
freeing the larger block; again, this process is performed recursively if
necessary. There are two data structures used within each zone to keep track of
all this: the freelists and the buddy bitmaps. The freelist
head and the buddy bitmap for each order are gathered together into a free_area_struct; the zone_struct
maintains one free_area_struct for each order.
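free_area_struct itself is tiny; in the 2.4-era <linux/mmzone.h> it is essentially just the freelist head plus the bitmap pointer:

    typedef struct free_area_struct {
            struct list_head   free_list;  /* first pages of free blocks of this order */
            unsigned int       *map;       /* buddy bitmap for this order */
    } free_area_t;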
The freelists are maintained using the page structs of the zone's
memory map; the list_head pointers in the free_area_struct point to
the first and last pages in the freelist. Free page structs for each
order are linked together using the page->list.next and
page->list.prev pointers (those fields are used for other purposes in
non-free pages). All the blocks of a given order are linked using the
first page of the block; so for example, a four-page free block's
first page will be on the order-2 freelist, and the three high pages
will not be on a freelist at all - they are implicitly free by being
part of a free 4-page-aligned block.
The buddy bitmap for each order is maintained in the
zone->free_area[order].map array. The bitmap for each order records the
fragmentation status of each aligned pair of blocks of that order. A
pair is fragmented if one of its halves is at least partially used
while the other half is completely free; each buddy bitmap describes the
entire zone in this manner. Thus, the memory region addressed by
zone->free_area[order].map contains
pages_in_zone / 2^(order+1) bits. For example, if the
machine had a total of 16 pages of RAM, 16-page aligned, with the
fourth page (page 3) in use and the others all free, then there would
be 4 buddy bitmaps:
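Working through the fragmentation rule above for that layout (pages numbered 0 through 15, page 3 in use, everything else free), my sketch of those four bitmaps comes out like this:

    order 0 (one bit per 2-page pair):    0 1 0 0 0 0 0 0   (only the pair {2,3} is split between used and free)
    order 1 (one bit per 4-page block):   1 0 0 0           (pages 0-1 free, pages 2-3 partially used)
    order 2 (one bit per 8-page block):   1 0               (pages 0-3 partially used, pages 4-7 free)
    order 3 (one bit per 16-page block):  1                 (pages 0-7 partially used, pages 8-15 free)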
Now, if the situation were reversed (all pages except page 3 in use),
the order-0 bitmap would come out exactly the same, while the
higher-order bitmaps would be all zeroes, since no half of any larger
block would be completely free. The buddy maps don't directly tell us
anything about the used vs. free status of a page; that's the
freelists' job. The buddy maps' purpose is to make it quick and easy to
check whether a block being freed has an unused buddy or not. If we're
freeing a block B, and the buddy map tells us that the next-higher-order
block containing B is fragmented, then B's buddy must be free (since B
itself is in use and the block is fragmented); so we should merge B with
its buddy and free the higher-order block instead.
The Zone Allocator Code
free_area_init_core()
The memory map is built, and the freelists and buddy bitmaps initialized, in
free_area_init_core(). There is a lot of stuff that is only relevant to kernels
with CONFIG_DISCONTIGMEM set. With CONFIG_DISCONTIGMEM, physical memory is
divided into a number of "nodes", each with its own mem_map. I'm going to mostly
ignore that for now and concentrate on the
contiguous-starting-from-physical-address-0 case. The node data (in the
pg_data_t structs) is initialized by the bootmem
allocator; in the non-CONFIG_DISCONTIGMEM case, there is one node,
represented by the global contig_page_data variable.
free_area_init_core() takes seven arguments. One of them, gmap, is an
output parameter that receives the address of the page struct array
allocated within free_area_init_core(); in the non-CONFIG_DISCONTIGMEM
case, the global mem_map receives this value.
Here's how things happen:
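Roughly, free_area_init_core() proceeds like this (a paraphrase of the 2.4-era code in mm/page_alloc.c, not a line-by-line account; details differ between 2.4 releases):

    - compute the node's total page count from the zones_size[] array;
    - allocate the node's array of page structs with the bootmem allocator
      (unless one was passed in) and store its address through gmap;
    - initialize every page struct: reference count 0, PG_reserved set,
      empty wait queue and list linkage;
    - for each zone (DMA, normal, highmem), fill in the zone_struct fields
      (size, name, offset, zone_mem_map, and the pages_min / pages_low /
      pages_high watermarks) and point each page struct at its zone;
    - for each order within each zone, initialize an empty freelist and
      allocate the (zeroed) buddy bitmap, again from the bootmem allocator.

The usable pages themselves only enter the buddy freelists later, when the bootmem allocator is retired at the end of initialization.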
build_zonelists()
build_zonelists() constructs a table that maps each possible GFP mask
to a zonelist. The GFP mask is a bitmask that tells the allocation
logic what kind of memory we want to allocate, and gives some
information about how we want to do it (for example, whether it is
acceptable to sleep waiting for a free page or not).
Beginning on line 712, we loop over all possible GFP masks. Within the
loop, things go like this:
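The body of that loop looks roughly like this (a simplified sketch of the 2.4-era code, not a verbatim quote). Note how the switch deliberately falls through, so a __GFP_HIGHMEM request gets the zonelist {HIGHMEM, NORMAL, DMA}, an ordinary request gets {NORMAL, DMA}, and a __GFP_DMA request gets {DMA} alone:

    /* i is the loop index, i.e. the GFP mask value this zonelist is for. */
    zonelist = pgdat->node_zonelists + i;
    memset(zonelist, 0, sizeof(*zonelist));
    zonelist->gfp_mask = i;

    j = 0;
    k = ZONE_NORMAL;
    if (i & __GFP_HIGHMEM)
            k = ZONE_HIGHMEM;
    if (i & __GFP_DMA)
            k = ZONE_DMA;

    switch (k) {
    case ZONE_HIGHMEM:
            zone = pgdat->node_zones + ZONE_HIGHMEM;
            if (zone->size)
                    zonelist->zones[j++] = zone;
            /* fall through */
    case ZONE_NORMAL:
            zone = pgdat->node_zones + ZONE_NORMAL;
            if (zone->size)
                    zonelist->zones[j++] = zone;
            /* fall through */
    case ZONE_DMA:
            zone = pgdat->node_zones + ZONE_DMA;
            if (zone->size)
                    zonelist->zones[j++] = zone;
    }
    zonelist->zones[j] = NULL;      /* the list is NULL-terminated */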
__alloc_pages()
The interface to __alloc_pages() is wrapped up by the
__get_free_pages() and alloc_pages() functions, but __alloc_pages() is
really the central physical-page allocation function in the
kernel.
It attempts to allocate a free block of the given order from each of
the zones in the zonelist in turn. It is comparatively well-commented,
so I won't go into too much detail here. Essentially, __alloc_pages()
looks at the zones in the given zonelist in order, trying to allocate
a block of 2^order pages using the buddy
allocator. It wakes up the swapper and/or bdflush if there are not
enough free pages; those tasks (both kernel threads) attempt to free
some pages by writing dirty pages
to disk.
There are some complicated interactions between __alloc_pages() and
the higher-level VM code. Each zone keeps track of pages that have
recently been mapped into some task's VM, and we may decide to reclaim
some of those pages rather than allocating actual free pages.
If there are a lot ( >= zone->pages_low) of free pages in any of the
zones in the zonelist, we attempt to immediately allocate a block of
the requested order at line 332 with a call to rmqueue(). Otherwise, the actual allocation
attempts are delegated to __alloc_pages_limit(). Basically, we are
trying to allocate something first from a not-heavily-used zone, and
second from a more-preferred zone. That is, if one zone in
the zonelist has substantially more free pages, we are likely to
allocate from it, even if it's not the preferred zone for the
allocation. But when all the zones are more-or-less equally allocated,
we will allocate from the most-preferred zone that can support the
allocation request.
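In outline, the initial "plenty of free pages" pass over the zonelist looks something like this (heavily simplified from the 2.4 source; the wakeups, watermark bookkeeping and slower fallback paths are left out):

    zone_t **zonep = zonelist->zones;
    zone_t *z;
    struct page *page;

    while ((z = *zonep++) != NULL) {
            /* If this zone still has a comfortable number of free
             * pages, grab a block from its buddy freelists right away. */
            if (z->free_pages >= z->pages_low) {
                    page = rmqueue(z, order);
                    if (page)
                            return page;
            }
    }
    /* Otherwise fall back to __alloc_pages_limit(), called with
     * progressively less cautious watermarks, and eventually try to
     * reclaim or swap pages out. */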
__alloc_pages_limit()
__alloc_pages_limit(zonelist_t *zonelist, unsigned long order, int
limit, int direct_reclaim) does the actual work of interrogating the
zone_structs and determining whether an allocation from any zone is
possible within the specified constraints. It's called when
__alloc_pages() can't immediately find a mostly-free zone from which
to do the allocation.
The first two arguments are obvious, the last two less so.
limit is an enum that tells us how to figure out how many
pages must be free in a zone before we use that zone for
allocation. zone_struct->pages_min is the minimum number of free+inactive_clean (F+IC) pages the
zone should keep available; zone_struct->pages_low is the number of
F+IC pages at which the zone is considered to be low on free pages;
and zone_struct->pages_high is the number of F+IC pages considered to
be "lots". If we find a zone with a suitable number of F+IC pages, we
call rmqueue() to try to perform the
allocation.
direct_reclaim is 1 if the allocation request is one which can
possibly be fulfilled by reclaiming a page from the inactive_clean
list. This flag is set in __alloc_pages() on line 297. (The PF_MEMALLOC
flag is set if the allocation is a recursive one - that is, if, while
trying to get a free block, we end up needing some memory for some
reason and enter __alloc_pages() again, PF_MEMALLOC will be set.)
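Putting the two parameters together, the per-zone check inside __alloc_pages_limit() is roughly the following (paraphrased from the 2.4 source; PAGES_MIN, PAGES_LOW and PAGES_HIGH are the real values the limit argument takes, and the real code has a couple of extra conditions around when it will attempt direct reclaim):

    unsigned long water_mark;
    struct page *page = NULL;

    switch (limit) {
    default:
    case PAGES_MIN:
            water_mark = z->pages_min;
            break;
    case PAGES_LOW:
            water_mark = z->pages_low;
            break;
    case PAGES_HIGH:
            water_mark = z->pages_high;
            break;
    }

    if (z->free_pages + z->inactive_clean_pages > water_mark) {
            /* If the caller can cope with a reclaimed page, try to pull
             * one straight off the zone's inactive_clean list... */
            if (direct_reclaim)
                    page = reclaim_page(z);
            /* ...otherwise, or if that failed, take a fresh block from
             * the buddy freelists. */
            if (!page)
                    page = rmqueue(z, order);
            if (page)
                    return page;
    }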
Note: the comment beginning on line 498 doesn't make sense to
me. direct_reclaim is set only once, it's a local, and the loop
referred to in the comment doesn't change it, so it seems to me like
direct_reclaim is possible under exactly the conditions expressed on
line 297. Unless there's stuff going on inside reclaim_pages() that's
sensitive to PF_MEMALLOC. And there doesn't appear to be anything like
that.
[working on it...]
rmqueue()
rmqueue() does the actual work of removing a suitably-sized block from a
zone's buddy freelists, splitting a higher-order block via expand() when
necessary. Here's how it works:
I must say, this is a whole lot clearer than it was in 2.2!
expand()
expand() is called from rmqueue() when the only free block available is of a higher order than the one requested. What we're going to do in that case is return the top 2^low pages of the allocated block to the requester (low being the requested order), and put the remainder of the pages back on the appropriate freelists, managing the buddy bitmaps as appropriate. For example, let's say we've requested a 2-page block, but we had to allocate it from the order-3 (8-page) freelist, and we happen to have chosen physical page 800 as the base of the 8-page block. On the first trip through the loop we'd give back the 4-page (order-2) block starting at page 800; on the second trip, the 2-page (order-1) block starting at page 804; and then we would return the 2-page block at page 806 to the caller.
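For reference, here is expand() itself, roughly as it appears in the 2.4 source, lightly abridged (the BAD_RANGE() sanity checks are dropped). low is the order the caller asked for, and high is the order of the block rmqueue() actually took off the freelists:

    static inline struct page *expand(zone_t *zone, struct page *page,
                                      unsigned long index, int low,
                                      int high, free_area_t *area)
    {
            unsigned long size = 1 << high;

            while (high > low) {
                    area--;
                    high--;
                    size >>= 1;
                    /* Put the lower half of the block on the freelist one
                     * order down and toggle the buddy bit for that pair... */
                    memlist_add_head(&page->list, &area->free_list);
                    MARK_USED(index, high, area);
                    /* ...then keep splitting the upper half. */
                    index += size;
                    page += size;
            }
            return page;
    }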
__free_pages(struct page *page, unsigned long order) frees the block of size 2^order pages starting at the given page. It is very slightly tricky. The entire function looks like this:

    void __free_pages(struct page *page, unsigned long order)
    {
            if (!PageReserved(page) && put_page_testzero(page))
                    __free_pages_ok(page, order);
    }

The tricky part is the call to put_page_testzero(), which decrements the page's reference count and returns 1 if the reference count is 0 after the decrement. Thus, if the caller is not the last user of the page, the block will not actually be freed.
A great deal of the higher-level VM logic depends upon this test being done properly; see for example try_to_swap_out().
If the caller is the last user of the page block, then __free_pages() goes on to call __free_pages_ok(), which links the page block back into the freelists and manages the buddy maps, coalescing buddy blocks as necessary. This is essentially the inverse operation to expand(), and therefore I will not discuss it further here.
I'm suspending work on the zone-allocator documentation for the moment and moving on to higher-level VM stuff. I need to understand how direct_reclaim() works, which means understanding exactly how the page lists are managed. I'll get back to this at some point, but if you understand how __alloc_pages() works and the nature of the data structures involved, figuring out other zone allocator functions (eg __free_pages_ok() ) is not too hard, if you ignore the interactions between the zone allocator and the page-aging logic.