The Boot-time Allocator

The bootmem allocator is used during the boot process to allocate memory before the kernel MM subsystem is usable. It is quite simple, though not as simple as its predecessor. The bootmem allocator is used, for example, to allocate the mem_map - the array of page structs used by the VM subsystem to keep track of the disposition of physical pages.

The bootmem allocator lives only until the kernel has set up the data structures necessary to support the zone allocator.

The 2.2 Method

The symbol _end represents the end of the loaded kernel data - that is, the next usable byte after the kernel code loaded by the bootloader. (_end, along with a number of other important addresses, is defined in the linker script arch/i386/vmlinux.lds.) This address is in the kernel's virtual memory space at address PAGE_OFFSET+physical_kernel_end. Pages < _end are, naturally, reserved for the kernel's use, and are never used by the VM subsystem.

In the 2.2 kernel, the kernel would reserve memory at boot time by simply incrementing _end as required, in PAGE_SIZE chunks. This was somewhat inefficient, as it wasn't often the case that the size of the data in question was a multiple of PAGE_SIZE; thus many sub-page chunks were unrecoverably lost during the boot process.
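
To make the waste concrete, here is a minimal user-space sketch (not kernel code) of page-granular bump allocation. The function name bump_alloc_pages, the cursor and wasted parameters, and the 4096-byte page size are all assumptions made for illustration.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL   /* assumed page size (i386) */

/* Hypothetical stand-in for the 2.2 scheme: advance an "end of kernel"
 * cursor by whole pages, however small the request is. */
static unsigned long bump_alloc_pages(unsigned long *cursor, unsigned long size,
                                      unsigned long *wasted)
{
    unsigned long start = *cursor;
    unsigned long rounded = (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);

    assert(size != 0);
    *cursor += rounded;
    *wasted += rounded - size;   /* the tail of the last page is lost for good */
    return start;
}
```

In this model a 100-byte request consumes a full page and strands the remaining 3996 bytes.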

The 2.4 Method

2.4 uses a dramatically refined version of the same basic idea. The bootmem allocator in the 2.4 kernel is capable both of performing sub-page-size allocations efficiently, and (when appropriate) of reclaiming pages for the zone allocator after boot. For example, the bootmem allocator's data itself is not used after the zone allocator is initialized, so those pages can be released and given to the zone allocator as free pages.

Bootmem Allocator Data

Memory is organized into "nodes", each node being a (more or less) contiguous chunk of physical RAM; on normal, non-NUMA machines, there is only one node. Each node is represented by a pg_data_t structure, and of course on non-NUMA machines, there is only one of these, contig_page_data. Each pg_data_t has a member bdata of type bootmem_data_t that contains the bootmem allocator's data.

The bootmem_data_t struct contains a pointer (node_bootmem_map) to a bitmap representing all the pages in the node, one bit per page. If a page's boot-map bit is 1, the page is reserved and will not be touched by the VM system during normal system operations; otherwise the page will be given to the zone allocator as a free page when early boot is complete.

bootmem_data_t->node_boot_start is the physical address at which the node starts. bootmem_data_t->node_low_pfn is the highest low-memory page frame number on the node -- the bootmem allocator never allocates high memory. The last_pos member is the index of the last-allocated page, and last_offset is the offset of the next free address within that page.
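
The members described above can be pictured roughly as follows. The member names come from the text; the member types, their order, and the helper bootmap_bytes() are assumptions for illustration rather than a copy of the kernel header.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumed: 4 KB pages, as on i386 */

typedef struct bootmem_data {
    unsigned long node_boot_start;  /* physical address where the node starts */
    unsigned long node_low_pfn;     /* highest low-memory page frame number */
    void *node_bootmem_map;         /* bitmap, one bit per page; 1 = reserved */
    unsigned long last_offset;      /* next free byte within the last-allocated page */
    unsigned long last_pos;         /* index of the last-allocated page */
} bootmem_data_t;

/* Hypothetical helper: bytes needed for a one-bit-per-page map of the node. */
static unsigned long bootmap_bytes(const bootmem_data_t *bdata)
{
    unsigned long start_pfn = bdata->node_boot_start >> PAGE_SHIFT;

    assert(bdata->node_low_pfn >= start_pfn);
    return (bdata->node_low_pfn - start_pfn + 7) / 8;
}
```

Under these assumptions, a node with 128 MB of low memory (32768 pages) needs a 4096-byte bitmap, that is, a single page.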

Overview of Bootmem Allocator Operation

When the system BIOS memory map is interrogated by the kernel in setup_arch(), all nodes are given to the bootmem allocator as "reserved" memory; subsequently, those pages which correspond to real, usable RAM are bootmem-freed and are available to kernel subsystems that need to do boot-time allocations before the zone allocator is enabled. Those allocations are done either by returning an address within a bootmem-reserved page that is not fully utilized, or by reserving new pages as necessary in the bootmem bitmaps. Bootmem-allocated memory can also be freed as long as it is done before the zone allocator is enabled. Once the zone allocator is functional, all non-reserved bootmem pages are given to it as free memory, and the bootmem allocator is henceforth unusable. The bootmem code is in the __init link area, so it is itself released to the zone allocator when system boot is complete.

Note that there may be "holes" in a node's address space; for example, there is frequently a 384K hole between (physical address) 640K and 1024K on PCs. In such cases, setup_arch() simply doesn't free the bootmem pages associated with the holes. This ensures that the nonexistent pages remain reserved, both at boot time and during normal system operation.
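
The "reserve everything, then free what is real" sequence, including the treatment of holes, can be sketched in user space as below. The bitmap array, the 16 MB node size, and the function names are hypothetical; the real setup_arch() derives the usable ranges from the BIOS memory map.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SHIFT 12
#define NPAGES 4096UL                       /* assumed: a 16 MB node */

static unsigned char bootmap[NPAGES / 8];   /* one bit per page, 1 = reserved */

static int page_reserved(unsigned long pfn)
{
    assert(pfn < NPAGES);
    return (bootmap[pfn / 8] >> (pfn % 8)) & 1;
}

/* Clear the bits for every page in the physical range [start, end). */
static void mark_usable(unsigned long start, unsigned long end)
{
    unsigned long pfn;

    for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++)
        bootmap[pfn / 8] &= (unsigned char)~(1u << (pfn % 8));
}

static void setup_sketch(void)
{
    memset(bootmap, 0xff, sizeof(bootmap));   /* everything starts reserved */
    mark_usable(0x00000000UL, 0x000A0000UL);  /* 0 to 640K */
    mark_usable(0x00100000UL, 0x01000000UL);  /* 1M to 16M */
    /* 640K to 1M is never freed, so the hole stays reserved. */
}
```

Because the hole is simply never freed, no special case is needed for it later.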

Bootmem Allocator Code

We first meet the bootmem allocator (on Intel platforms) in arch/i386/kernel/setup.c in the setup_arch() function. Here, the size of the low and high memory areas is computed, and init_bootmem() is called in order to initialize the bootmem data. On non-NUMA machines, init_bootmem() just calls init_bootmem_core() passing &contig_page_data as the pg_data_t.

init_bootmem_core()

The first argument to init_bootmem_core(pg_data_t *pgdat, unsigned long mapstart, unsigned long start, unsigned long end) is the pg_data_t pointer whose contents are to be initialized. The second argument is the address of the page struct within the system memory map corresponding to the first page of the node. The last two arguments are the node's start and end addresses.

Line 47 - line 56: basic bookkeeping. Note that in the normal, non-NUMA case, pgdat == &contig_page_data, mapstart == _end, start == 0, and end == the maximum low page frame index on the system.
Line 62 - reserve all pages on the node.

free_bootmem_core()

free_bootmem_core(bootmem_data_t *bdata, unsigned long addr, unsigned long size) is called by setup_arch() in order to register free RAM areas with the bootmem allocator, and by other kernel entities to free bootmem that they no longer need. While allocations can be done in power-of-two-byte-sized chunks, frees can only be done with page granularity -- any page that is even partially used by permanent kernel data is considered reserved.

The arguments to free_bootmem_core() are a bootmem_data_t describing the node, and the address and size of the block to free. The function converts the start address to a page frame index (rounding up), converts the end address to a page frame index (rounding down), and zeros the corresponding bits in the bootmem bitmap for the node.
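
The rounding rule is what guarantees that a partially used page is never freed. A minimal sketch of the index arithmetic follows; it assumes a node that starts at physical address 0 and 4 KB pages, and the type pfn_range and the function freeable_pages() are hypothetical names.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct pfn_range { unsigned long first, end; };   /* half-open: [first, end) */

/* Which whole pages lie entirely inside the byte range [addr, addr + size)? */
static struct pfn_range freeable_pages(unsigned long addr, unsigned long size)
{
    struct pfn_range r;

    assert(size != 0);
    r.first = (addr + PAGE_SIZE - 1) >> PAGE_SHIFT;  /* round the start up */
    r.end   = (addr + size) >> PAGE_SHIFT;           /* round the end down */
    if (r.end < r.first)
        r.end = r.first;      /* block lies inside one page: nothing to free */
    return r;
}
```

For example, freeing 0x2000 bytes at address 0x1800 releases only page 2, since pages 1 and 3 are each only partly covered.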

__alloc_bootmem_core()

__alloc_bootmem_core(bootmem_data_t *bdata, unsigned long size, unsigned long align, unsigned long goal) is used to allocate boot memory on a particular node. The more-generic interface __alloc_bootmem() simply tries to allocate from each of the extant nodes until it succeeds; we'll ignore multiple-node systems for the moment.

The arguments are the bootmem_data_t struct describing the node, the size of the requested block, the byte alignment requirement (which must be a power of 2), and a "goal" address. The allocator will return an address > the goal if it can [why?].

Line 141 computes the index of the highest page frame on the node.
Lines 150 to 158 decide whether the goal is reasonable, and set preferred to either the page frame number of the goal, or 0. Line 158 uses a GNU C-ism, the "omitted-middle" form of the ?: operator: x?:y is equivalent to x?x:y. incr is the number of pages to skip forward when an attempt to allocate at a particular page fails. It is equal to the byte alignment requirement/PAGE_SIZE, or 1 if the alignment requirement is less than PAGE_SIZE.
We will make at most two passes over the node's pages. If we start with a goal in mind, and can't allocate at an address > goal, we restart the allocation attempt at the first page of the node (line 175). The loop starting on line 161 iterates over page frame indices within the node. It ends when either (a) we find a free area of sufficient size (line 172), or (b) we scan the entire node without finding a suitable free block.
We get to line 188 if we find a suitable free block. Here we attempt to use a free fragment of the last-allocated page, if the alignment requirement permits us to do so. bdata->last_offset tells us the next usable address within the last-used page. On line 190 we adjust that offset upwards to match the requested byte alignment. On line 194 we check whether sufficient space remains in the last-used page to fulfill the request, and if so we will simply return the address within that page.
Otherwise, we reach line 201, and reserve additional pages as necessary to fulfill the request. We will still return the address computed above.
If the alignment requirement is greater than or equal to PAGE_SIZE, we reach line 210 and simply reserve the proper number of pages.
The actual reservation of the pages in the bootmem bitmap is done in the loop on line 217. We then helpfully zero the returned area and then hand it back to the caller.
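
The steps above can be condensed into a simplified user-space model. This is a sketch under stated assumptions, not the kernel function: struct boot_node, alloc_sketch(), the 64-page node, and the use of byte offsets in place of virtual addresses are all hypothetical, the final zeroing of the block is omitted, and failure returns ALLOC_FAIL instead of halting.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define NPAGES     64UL                  /* assumed: a tiny 256 KB node */
#define ALLOC_FAIL ((unsigned long)-1)

struct boot_node {
    unsigned char map[NPAGES / 8];  /* one bit per page, 1 = reserved */
    unsigned long last_pos;         /* index of the last-allocated page */
    unsigned long last_offset;      /* next free byte in that page, 0 if none */
};

static int test_page(const struct boot_node *n, unsigned long i)
{
    return (n->map[i / 8] >> (i % 8)) & 1;
}

static void set_page(struct boot_node *n, unsigned long i)
{
    n->map[i / 8] |= (unsigned char)(1u << (i % 8));
}

/* Returns a byte offset from the start of the node, or ALLOC_FAIL. */
static unsigned long alloc_sketch(struct boot_node *n, unsigned long size,
                                  unsigned long align, unsigned long goal)
{
    unsigned long areasize = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    unsigned long preferred, incr, i, j, start, ret;

    assert(size != 0 && align != 0 && (align & (align - 1)) == 0);

    preferred = (goal < NPAGES * PAGE_SIZE) ? goal : 0;
    preferred = ((preferred + align - 1) & ~(align - 1)) >> PAGE_SHIFT;
    /* Standard-C spelling of the GNU "align >> PAGE_SHIFT ?: 1". */
    incr = (align >> PAGE_SHIFT) ? (align >> PAGE_SHIFT) : 1;

    for (;;) {                           /* at most two passes */
        for (i = preferred; i < NPAGES; i += incr) {
            for (j = i; j < i + areasize; j++)
                if (j >= NPAGES || test_page(n, j))
                    break;
            if (j == i + areasize)
                goto found;              /* pages i .. i+areasize-1 are free */
        }
        if (!preferred)
            return ALLOC_FAIL;           /* scanned the whole node in vain */
        preferred = 0;                   /* retry from the first page */
    }
found:
    start = i;
    if (align < PAGE_SIZE && n->last_offset && n->last_pos + 1 == start) {
        /* Continue in the partly used page just before 'start'. */
        unsigned long offset = (n->last_offset + align - 1) & ~(align - 1);
        unsigned long remaining = PAGE_SIZE - offset;

        ret = n->last_pos * PAGE_SIZE + offset;
        if (size < remaining) {
            areasize = 0;                /* fits: no new pages needed */
            n->last_offset = offset + size;
        } else {
            unsigned long spill = size - remaining;

            areasize = (spill + PAGE_SIZE - 1) / PAGE_SIZE;
            n->last_pos = start + areasize - 1;
            n->last_offset = spill & (PAGE_SIZE - 1);
        }
    } else {
        ret = start * PAGE_SIZE;
        n->last_pos = start + areasize - 1;
        n->last_offset = size & (PAGE_SIZE - 1);
    }
    for (i = start; i < start + areasize; i++)
        set_page(n, i);                  /* reserve in the boot bitmap */
    return ret;
}
```

In this model, a 100-byte request followed by a 200-byte request (both 16-byte aligned) share page 0: the second one lands at offset 112 and reserves no new page.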

free_all_bootmem_core()

free_all_bootmem_core(pg_data_t* pgdat) is called when the bootmem allocator is being torn down after early boot. It releases all the non-reserved bootmem pages to the zone allocator.

The loop starting on line 235 loops over the page frames of the node's memory map. It simply un-reserves each page, then calls free_page() to hand the page to the zone allocator.
The loop starting on line 251 loops over the page frames used to store the bootmem bitmap for the node, releasing them to the zone allocator as described above.
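
The two loops can be modelled in user space as below. This is only a sketch: release_free_pages(), bitmap_pages(), and the give_page callback (standing in for the hand-off to the zone allocator) are hypothetical names, and 4 KB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL   /* assumed page size (i386) */

/* First loop: hand over every page whose boot-map bit is 0.
 * Returns the number of pages released. */
static unsigned long release_free_pages(const unsigned char *map,
                                        unsigned long npages,
                                        void (*give_page)(unsigned long pfn))
{
    unsigned long pfn, total = 0;

    assert(map != NULL);
    for (pfn = 0; pfn < npages; pfn++) {
        if (!((map[pfn / 8] >> (pfn % 8)) & 1)) {
            if (give_page)
                give_page(pfn);   /* stand-in for releasing the page */
            total++;
        }
    }
    return total;
}

/* Second loop: the number of pages that held the bitmap itself,
 * which are released once the bitmap is no longer needed. */
static unsigned long bitmap_pages(unsigned long npages)
{
    unsigned long bytes = (npages + 7) / 8;

    return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

Pages whose bit is still 1 are skipped, which is how reserved memory and holes stay out of the zone allocator's hands.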

Questions and comments to Joe Knapka

The LXR links in this page were produced by lxrreplace.tcl, which is available for free.