There are three essential stages in the MM initialization process:
The kernel code is loaded at physical address 0x100000 (1MB), and is remapped to PAGE_OFFSET+0x100000 when paging is turned on. This is done using compiled-in page tables (in arch/i386/kernel/head.S) that map the physical range 0-8MB both to itself and to PAGE_OFFSET...PAGE_OFFSET+8MB. We then jump to start_kernel() in init/main.c, which is located at PAGE_OFFSET+some_address. This is a bit tricky: it is critical that the code that turns on paging in head.S do so in such a way that the address space it is executing out of remains valid; hence the 0-8MB identity mapping. start_kernel() is not called until paging is turned on, and assumes it is running at PAGE_OFFSET+whatever. Thus the page tables in head.S must also map the addresses used by the kernel code for the jump to start_kernel() to succeed; hence the PAGE_OFFSET mapping.
There is some magical code right after paging is enabled in head.S:
/*
 * Enable paging
 */
3:
	movl $swapper_pg_dir-__PAGE_OFFSET,%eax
	movl %eax,%cr3		/* set the page table pointer.. */
	movl %cr0,%eax
	orl $0x80000000,%eax
	movl %eax,%cr0		/* ..and set paging (PG) bit */
	jmp 1f			/* flush the prefetch-queue */
1:
	movl $1f,%eax
	jmp *%eax		/* make sure eip is relocated */
1:

The code between the two 1: labels loads the address of the second label 1: into EAX and jumps there. At this point the instruction pointer EIP is pointing to physical location 1MB+something, but the labels are all in kernel virtual space (PAGE_OFFSET+something), so this code effectively relocates the instruction pointer from physical to virtual space.
The start_kernel() function initializes all kernel data and then starts the "init" kernel thread. One of the first things that happens in start_kernel() is a call to setup_arch(), an architecture-specific setup function which handles low-level initialization details. For x86 platforms, that function lives in arch/i386/kernel/setup.c.
The first memory-related thing setup_arch() does is compute the number of low-memory and high-memory pages available; the first and last high-memory page frame numbers get stored in the global variables highstart_pfn and highend_pfn, respectively. High memory is memory not directly mappable into kernel VM; this is discussed further below.
Next, setup_arch() calls init_bootmem() to initialize the boot-time memory
allocator. The bootmem allocator is used only during
boot, to allocate pages for permanent kernel data. We will not be too much
concerned with it henceforth. The important thing to remember is that the
bootmem allocator provides pages for kernel initialization, and those pages are
permanently reserved for kernel purposes, almost as if they were loaded with the
kernel image; they do not participate in any MM activity after boot.
Thereafter, setup_arch() calls paging_init() in arch/i386/mm/init.c.
This function does several things. First, it calls pagetable_init() to map the
entire physical memory, or as much of it as will fit between PAGE_OFFSET and
4GB, starting at PAGE_OFFSET.
In pagetable_init(), we actually build the kernel page tables in swapper_pg_dir
that map the entire physical memory range to PAGE_OFFSET. This is simply a
matter of doing the arithmetic and stuffing the correct values into the page directory and page tables. This
mapping is created in swapper_pg_dir, the kernel page directory; this is also
the page directory used to initiate paging. (Virtual addresses up to the next
4MB boundary past the end of memory are actually mapped here when using 4MB
pages, but "that's OK as we won't use that memory anyway"). If there is
physical memory left unmapped here - that is, memory with physical address
greater than 4GB-PAGE_OFFSET - that memory is unusable unless the CONFIG_HIGHMEM
option is set.
Near the end of pagetable_init() we call fixrange_init() to reserve pagetables (but
not populate them) for compile-time-fixed virtual-memory mappings. These tables
map virtual addresses that are hard-coded into the kernel, but which are not
part of the loaded kernel data. The fixmap tables are mapped to physical pages
allocated at run time, using the set_fixmap() call.
After initializing the fixmaps, if CONFIG_HIGHMEM is set, we also allocate some
pagetables for the kmap() allocator. kmap() allows the
kernel to map any page of physical memory into the kernel virtual address space
for temporary use. It's used, for example, to provide mappings on an
as-needed basis for physical pages that aren't directly mappable during
pagetable_init().
The fixmap and kmap pagetables occupy a
portion of the top of kernel virtual space - addresses which therefore
cannot be used to permanently map physical pages in the PAGE_OFFSET
mapping. For this reason, 128MB at the top of kernel VM is reserved
(the vmalloc allocator also uses
addresses in this range). Any physical pages that would otherwise be
mapped into the PAGE_OFFSET mapping in the 4GB-128MB range are instead
(if CONFIG_HIGHMEM is specified) included in the high memory zone,
accessible to the kernel only via kmap(). If
CONFIG_HIGHMEM is not true, those pages are completely unusable. This
becomes an issue only on machines with a large amount of RAM (900-odd
MB or more). For example, if PAGE_OFFSET==3GB and the machine has 2GB
of RAM, only the first physical 1GB-128MB can be mapped between
PAGE_OFFSET and the beginning of the fixmap/kmap address range. The
remaining pages are still usable - in fact for user-process mappings
they act the same as direct-mapped pages - but the kernel cannot
access them directly.
Back in paging_init(), we possibly initialize the kmap() system
further by calling kmap_init(), which simply caches the first kmap
pagetable (in the kmap_pte variable, so later kmap() calls need not
walk the page tables). Then, we initialize the zone allocator by computing the zone sizes
and calling free_area_init() to build the mem_map and initialize the
freelists. All freelists are initialized empty and all pages are
marked "reserved" (not accessible to the VM system); this situation is
rectified later.
When paging_init() completes, we have in physical memory the layout
shown in the list below, under "Initializing the Kernel Page Tables"
[note - this is not quite right for 2.4].
Here we are back in start_kernel(). After paging_init() completes, we do some
additional setup of other kernel subsystems, some of which allocate additional
kernel memory using the bootmem allocator. Important
among these, from the MM point of view, is kmem_cache_init(), which initializes
the slab allocator data.
Shortly after kmem_cache_init() is called, we call mem_init(). This function
completes the freelist initialization begun in free_area_init() by clearing the
PG_RESERVED bit in the zone data for free physical pages; clearing the PG_DMA
bit for pages that can't be used for DMA; and freeing all usable pages into
their respective zones. That last step, done in free_all_bootmem_core() in
bootmem.c, is interesting: it builds the buddy bitmaps and freelists describing
all existing non-reserved pages by simply freeing them and letting free_pages_ok() do the right thing. Once
mem_init() is called, the bootmem allocator is no longer usable, since all its
pages have been freed into the zone allocator's world.
Segments are just used to carve up the linear address space into arbitrary
chunks. The linear space is what's managed by the VM subsystem. The x86
architecture supports segmentation in hardware: you can specify addresses as
offsets into a particular segment, where a segment is defined as a range of
linear (virtual) addresses with particular characteristics such as
protection. In fact, you must use the segmentation mechanism on x86
machines; so we set up four flat segments covering the full 4GB linear
space, whose descriptors are shown under "Segmentation" below.
Questions:
Where is the GDT defined, and where is the GDT register loaded?
Answer: the Global Descriptor Table is defined in head.S at
line 450. The GDT register is loaded to point at the GDT on line 250.
How do the kernel segments differ from the user segments?
Answer: the properties of the kernel and user segments differ, as a
comparison of the descriptor values quoted below shows.
Also notice that the high nibble of the third high-order byte differs
in the kernel and user cases: in the kernel case, the Descriptor
Privilege Level is 0 (most privileged), while the user segment
descriptors' DPL is 3 (least privileged). If you read the Intel
documentation, you will be able to figure out exactly what all this
means, but since x86 segment protection does not figure much in the
Linux kernel, I won't discuss it any further here.
Initializing the Kernel Page Tables
0x00000000: 0-page
0x00100000: kernel-text
0x????????: kernel-data
0x????????=_end: whole-mem pagetables
0x????????: fixmap pagetables
0x????????: zone data (mem_map, zone_structs, freelists &c)
0x????????=start_mem: free pages
This chunk of memory is mapped by swapper_pg_dir and the whole-mem-pagetables to
address PAGE_OFFSET.
Further VM Subsystem Initialization Tasks
Segmentation
Thus, we effectively allow access to the entire virtual address space
using any of the available segment selectors.
Thanks to Andrea Russo for clearing up this Intel segmentation business.
.quad 0x00cf9a000000ffff /* 0x10 kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* 0x18 kernel 4GB data at 0x00000000 */
.quad 0x00cffa000000ffff /* 0x23 user 4GB code at 0x00000000 */
.quad 0x00cff2000000ffff /* 0x2B user 4GB data at 0x00000000 */
The segment registers (CS, DS, etc.) contain a 13-bit index
into the descriptor table; the descriptor at that index tells the CPU
the properties of the selected segment. The 3 low-order bits of a
segment selector are not used to index the descriptor table; rather,
they contain the descriptor-type (global or local) and the requested
privilege level. Thus the kernel segment selectors 0x10 and 0x18 use
RPL 0, while the user selectors 0x23 and 0x2B use RPL 3, the
least-privileged level.
Questions and comments to Joe Knapka
The LXR links in this page were produced by lxrreplace.tcl, which is available for free.