The bootmem allocator is used during the boot process to allocate memory before the kernel's MM subsystem is usable. It is quite simple, though not as simple as its predecessor. The bootmem allocator is used, for example, to allocate mem_map, the array of page structs that the VM subsystem uses to keep track of the disposition of physical pages.
The bootmem allocator lives only until the kernel has set up the data
structures necessary to support the zone
allocator.
The 2.2 Method
In the 2.2 kernel, the kernel would reserve memory at boot time by
simply incrementing _end as required, in PAGE_SIZE chunks. This was
somewhat inefficient, as it wasn't often the case that the size of the
data in question was a multiple of PAGE_SIZE; thus many sub-page
chunks were unrecoverably lost during the boot process.
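To make the cost concrete, the following user-space sketch compares page-granular reservation with back-to-back packing. It is an illustration only, not kernel code: the 4096-byte page size is assumed (the i386 value), and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL  /* assumption: i386 page size */

/* Bytes consumed when every request is rounded up to whole pages,
 * as in the 2.2 scheme described above. */
unsigned long toy_page_granular_cost(const unsigned long *sizes, size_t n)
{
    unsigned long total = 0;

    assert(sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned long pages = (sizes[i] + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
        total += pages * TOY_PAGE_SIZE;
    }
    return total;
}

/* Bytes consumed when requests are packed back to back
 * (alignment padding is ignored to keep the sketch short). */
unsigned long toy_packed_cost(const unsigned long *sizes, size_t n)
{
    unsigned long total = 0;

    assert(sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += sizes[i];
    return total;
}
```

For requests of 100, 5000 and 4096 bytes, the page-granular scheme consumes four pages (16384 bytes) where the data itself occupies 9196 bytes.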
The 2.4 Method
2.4 uses a dramatically refined version of the same basic idea.
The bootmem allocator in the 2.4 kernel is capable both of performing
sub-page-size allocations efficiently, and (when appropriate) of
reclaiming pages for the zone allocator after boot. For example, the
bootmem allocator's data itself is not used after the zone allocator
is initialized, so those pages can be released and given to the zone
allocator as free pages.
Bootmem Allocator Data
Memory is organized into "nodes", each node being a (more or less) contiguous chunk of physical RAM; on normal, non-NUMA machines, there is only one node. Each node is represented by a pg_data_t structure, and of course on non-NUMA machines, there is only one of these, contig_page_data. Each pg_data_t has a member bdata of type bootmem_data_t that contains the bootmem allocator's data.
The bootmem_data_t struct contains a pointer (node_bootmem_map) to a bitmap representing all the pages in the node, one bit per page. If a page's boot-map bit is 1, the page is reserved and will not be touched by the VM system during normal system operations; otherwise the page will be given to the zone allocator as a free page when early boot is complete.
bootmem_data_t->node_boot_start is the physical address of the start
of the node. bootmem_data_t->node_low_pfn is the maximum low page
frame number on the node -- the bootmem allocator never allocates
high memory. The last_pos member is the page offset (the index within
the node) of the last-allocated page, and last_offset is the offset of
the next free address within that page.
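The fields described above can be pictured with a small user-space model. This is a sketch, not the kernel's definition: the member names follow the text, but the types and the toy_ helper are assumptions chosen for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Member names follow the text; the types are a plausible guess and
 * are not copied from the kernel headers. */
typedef struct toy_bootmem_data {
    unsigned long node_boot_start;   /* physical address where the node starts */
    unsigned long node_low_pfn;      /* maximum low page frame number */
    unsigned char *node_bootmem_map; /* one bit per page; 1 means reserved */
    unsigned long last_pos;          /* index of the last-allocated page */
    unsigned long last_offset;       /* next free byte within that page */
} toy_bootmem_data_t;

/* Returns 1 if page number idx (relative to the node start) is reserved,
 * 0 if it would be handed to the zone allocator after boot. */
int toy_page_is_reserved(const toy_bootmem_data_t *bdata, unsigned long idx)
{
    assert(bdata != 0 && bdata->node_bootmem_map != 0);
    return (bdata->node_bootmem_map[idx / CHAR_BIT] >> (idx % CHAR_BIT)) & 1;
}
```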
Overview of Bootmem Allocator Operation
When the system BIOS memory map is interrogated by the kernel in setup_arch(), all nodes are given to the bootmem allocator as "reserved" memory; subsequently, those pages which correspond to real, usable RAM are bootmem-freed and are available to kernel subsystems that need to do boot-time allocations before the zone allocator is enabled. Those allocations are done either by returning an address within a bootmem-reserved page that is not fully utilized, or by reserving new pages as necessary in the bootmem bitmaps. Bootmem-allocated memory can also be freed as long as it is done before the zone allocator is enabled. Once the zone allocator is functional, all non-reserved bootmem pages are given to it as free memory, and the bootmem allocator is henceforth unusable. The bootmem code is in the __init link area, so it is itself released to the zone allocator when system boot is complete.
Note that there may be "holes" in a node's address space; for example,
there is frequently a 384K hole between (physical address) 640K and
1024K on PCs. In such cases, setup_arch() simply doesn't free the
bootmem pages associated with the holes. This ensures that the
nonexistent pages remain reserved, both at boot time and during normal
system operation.
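The reserve-everything-then-free-what-is-real sequence can be sketched in user space as follows. The model covers only the first 2 MB of a hypothetical PC with 4 KB pages; the toy_ names and the memory map are assumptions for illustration, not the setup_arch() code.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SHIFT 12     /* assumption: 4 KB pages */
#define TOY_NPAGES     512UL  /* models the first 2 MB of a node */

static unsigned char toy_map[TOY_NPAGES / 8];

/* Mark every page reserved, as when the node is first handed over. */
void toy_init(void)
{
    memset(toy_map, 0xff, sizeof toy_map);
}

/* Clear the bits of all pages lying wholly inside [addr, addr + size). */
void toy_free_range(unsigned long addr, unsigned long size)
{
    unsigned long first = (addr + (1UL << TOY_PAGE_SHIFT) - 1) >> TOY_PAGE_SHIFT;
    unsigned long end = (addr + size) >> TOY_PAGE_SHIFT;

    assert(end <= TOY_NPAGES);
    for (unsigned long i = first; i < end; i++)
        toy_map[i / 8] &= (unsigned char)~(1u << (i % 8));
}

int toy_reserved(unsigned long pfn)
{
    return (toy_map[pfn / 8] >> (pfn % 8)) & 1;
}

/* Register the two usable RAM ranges; the 640K-1024K hole is never
 * freed, so its pages simply stay reserved. */
void toy_setup(void)
{
    toy_init();
    toy_free_range(0, 640UL * 1024);
    toy_free_range(1024UL * 1024, 1024UL * 1024);
}
```

After toy_setup(), pages 0 through 159 and 256 through 511 are free, while pages 160 through 255 (the hole) remain reserved without any special-case code.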
Bootmem Allocator Code
We first meet the bootmem allocator (on Intel platforms) in arch/i386/kernel/setup.c in the setup_arch() function. Here, the size of the low and high memory areas is computed, and init_bootmem() is called in order to initialize the bootmem data. On non-NUMA machines, init_bootmem() just calls init_bootmem_core() passing &contig_page_data as the pg_data_t.
The first argument to init_bootmem_core(pg_data_t *pgdat, unsigned long mapstart, unsigned long start, unsigned long end) is the pg_data_t pointer whose contents are to be initialized. The second argument is the address of the page struct within the system memory map corresponding to the first page of the node. The last two arguments are the node's start and end addresses.
free_bootmem_core(bootmem_data_t *bdata, unsigned long addr, unsigned long size) is called by setup_arch() in order to register free RAM areas with the bootmem allocator, and by other kernel entities to free bootmem that they no longer need. While allocations can be done in power-of-two-byte-sized chunks, frees can only be done with page granularity -- any page that is even partially used by permanent kernel data is considered reserved.
The arguments to free_bootmem_core() are a bootmem_data_t describing the node, and the address and size of the block to free. The function converts the start address to a page frame index (rounding up), converts the end address to a page frame index (rounding down), and zeros the corresponding bits in the bootmem bitmap for the node.
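The rounding rule means that a partially covered page is never freed. A minimal sketch of the index arithmetic, assuming 4 KB pages and using hypothetical names (it illustrates the rule and is not the kernel function):

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL  /* assumption: 4 KB pages */

/* Half-open range [first, end) of page indexes relative to the node start. */
typedef struct {
    unsigned long first;
    unsigned long end;
} toy_pfn_range;

/* Compute which page indexes a free of [addr, addr + size) would clear. */
toy_pfn_range toy_free_span(unsigned long node_start,
                            unsigned long addr, unsigned long size)
{
    toy_pfn_range r;

    assert(addr >= node_start);
    /* Start rounds up: a page only partly inside the block stays reserved. */
    r.first = (addr - node_start + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
    /* End rounds down, for the same reason. */
    r.end = (addr + size - node_start) / TOY_PAGE_SIZE;
    /* A block that covers no whole page frees nothing. */
    if (r.end < r.first)
        r.end = r.first;
    return r;
}
```

For example, freeing 0x2000 bytes at address 0x1800 covers pages 1 and 3 only partially, so just page 2 is cleared.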
__alloc_bootmem_core(bootmem_data_t *bdata, unsigned long size, unsigned long align, unsigned long goal) is used to allocate boot memory on a particular node. The more-generic interface __alloc_bootmem() simply tries to allocate from each of the extant nodes until it succeeds; we'll ignore multiple-node systems for the moment.
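The arguments are the bootmem_data_t struct describing the node, the size of the requested block, the byte alignment requirement (which must be a power of 2), and a "goal" address. The allocator will return an address at or above the goal if it can, and falls back to lower addresses otherwise. The goal lets a caller steer boot-time allocations away from low memory that is better kept free; the standard alloc_bootmem() wrapper, for instance, passes __pa(MAX_DMA_ADDRESS) so that DMA-capable memory is not consumed unnecessarily.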
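The interplay of size, alignment, goal, and the last_pos/last_offset bookkeeping can be sketched with a simplified user-space model. This is a sketch under stated assumptions, not the kernel's implementation: it uses one byte per page instead of a bitmap, supports only alignments up to one page, returns byte offsets from the node start, and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_NPAGES    64UL

typedef struct {
    unsigned char reserved[TOY_NPAGES]; /* one byte per page; the real map uses bits */
    unsigned long last_pos;             /* index of the last-allocated page */
    unsigned long last_offset;          /* next free byte in that page, 0 if none */
} toy_bootmem;

/* Start with every page free and no partly used page. */
void toy_bootmem_init(toy_bootmem *b)
{
    memset(b, 0, sizeof *b);
}

/* Find count contiguous free pages at or after from; TOY_NPAGES on failure. */
static unsigned long toy_find_run(const toy_bootmem *b,
                                  unsigned long from, unsigned long count)
{
    for (unsigned long start = from; start + count <= TOY_NPAGES; start++) {
        unsigned long i = 0;
        while (i < count && !b->reserved[start + i])
            i++;
        if (i == count)
            return start;
    }
    return TOY_NPAGES;
}

/* Returns a byte offset from the node start, or -1UL if nothing fits. */
unsigned long toy_alloc(toy_bootmem *b, unsigned long size,
                        unsigned long align, unsigned long goal)
{
    assert(size > 0);
    assert(align > 0 && (align & (align - 1)) == 0 && align <= TOY_PAGE_SIZE);

    unsigned long pages = (size + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
    unsigned long preferred = goal / TOY_PAGE_SIZE;
    unsigned long start = TOY_NPAGES;
    unsigned long ret;

    if (preferred < TOY_NPAGES)
        start = toy_find_run(b, preferred, pages); /* try at or above the goal */
    if (start == TOY_NPAGES)
        start = toy_find_run(b, 0, pages);         /* fall back to the whole node */
    if (start == TOY_NPAGES)
        return -1UL;

    if (b->last_offset && b->last_pos + 1 == start) {
        /* The previous allocation left a partly used page just before
         * start: continue packing into it. */
        unsigned long offset = (b->last_offset + align - 1) & ~(align - 1);
        unsigned long room = TOY_PAGE_SIZE - offset;

        ret = b->last_pos * TOY_PAGE_SIZE + offset;
        if (size < room) {
            pages = 0;                              /* fits; no new page needed */
            b->last_offset = offset + size;
        } else {
            unsigned long spill = size - room;      /* bytes that go past the page */
            pages = (spill + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
            b->last_pos = start + pages - 1;
            b->last_offset = spill % TOY_PAGE_SIZE;
        }
    } else {
        ret = start * TOY_PAGE_SIZE;
        b->last_pos = start + pages - 1;
        b->last_offset = size % TOY_PAGE_SIZE;
    }
    for (unsigned long i = start; i < start + pages; i++)
        b->reserved[i] = 1;
    return ret;
}
```

In this model, a 100-byte request followed by a 200-byte request with 64-byte alignment both land in page 0 (at offsets 0 and 128), and no second page is reserved until a request spills past the end of the page.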
free_all_bootmem_core(pg_data_t* pgdat) is called when the bootmem allocator is being torn down after early boot. It releases all the non-reserved bootmem pages to the zone allocator.
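In outline, the teardown is a walk over the bitmap that hands over every page whose bit is clear. The sketch below models only that walk in user space; recording page numbers in an array stands in for releasing pages to the zone allocator, and the function name and signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Walk a one-bit-per-page map and collect every non-reserved page.
 * out may be NULL if only the count is wanted; otherwise it must have
 * room for npages entries. Returns the number of pages released. */
unsigned long toy_free_all(const unsigned char *map, unsigned long npages,
                           unsigned long *out)
{
    unsigned long released = 0;

    assert(map != NULL);
    for (unsigned long i = 0; i < npages; i++) {
        int reserved = (map[i / 8] >> (i % 8)) & 1;
        if (!reserved) {
            if (out != NULL)
                out[released] = i; /* stand-in for freeing page i */
            released++;
        }
    }
    return released;
}
```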
Questions and comments to Joe Knapka