There are three essential stages in the MM initialization process:
The kernel code is loaded at physical address 0x100000 (1MB), and is remapped to PAGE_OFFSET+0x100000 when paging is turned on. This is done using compiled-in page tables (in arch/i386/kernel/head.S) that map the physical range 0-8MB both to itself and to PAGE_OFFSET...PAGE_OFFSET+8MB. We then jump to start_kernel() in init/main.c, which is located at PAGE_OFFSET+some_address. This is a bit tricky: it is critical that the code that turns on paging in head.S do so in such a way that the address space it is executing out of remains valid; hence the 0-8MB identity mapping. start_kernel() is not called until paging is turned on, and assumes it is running at PAGE_OFFSET+whatever. Thus the page tables in head.S must also map the addresses used by the kernel code for the jump to start_kernel() to succeed; hence the PAGE_OFFSET mapping.
There is some magical code right after paging is enabled in head.S:
/*
 * Enable paging
 */
3:
	movl $swapper_pg_dir-__PAGE_OFFSET,%eax
	movl %eax,%cr3		/* set the page table pointer.. */
	movl %cr0,%eax
	orl $0x80000000,%eax
	movl %eax,%cr0		/* ..and set paging (PG) bit */
	jmp 1f			/* flush the prefetch-queue */
1:
	movl $1f,%eax
	jmp *%eax		/* make sure eip is relocated */
1:

The code between the two 1: labels loads the address of the second label 1: into EAX and jumps there. At this point the instruction pointer EIP is pointing to physical location 1MB+something, but the labels are all in kernel virtual space (PAGE_OFFSET+something), so this code effectively relocates the instruction pointer from physical to virtual space.
The start_kernel() function initializes all kernel data and then starts the "init" kernel thread. One of the first things that happens in start_kernel() is a call to setup_arch(), an architecture-specific setup function which handles low-level initialization details. For x86 platforms, that function lives in arch/i386/kernel/setup.c.
The first memory-related thing setup_arch() does is compute the number of low-memory and high-memory pages available; the first and last high-memory page frame numbers get stored in the global variables highstart_pfn and highend_pfn, respectively. High memory is memory not directly mappable into kernel VM; this is discussed further below.
Next, setup_arch() calls init_bootmem() to initialize the boot-time memory
allocator. The bootmem allocator is used only during
boot, to allocate pages for permanent kernel data. We will not be too much
concerned with it henceforth. The important thing to remember is that the
bootmem allocator provides pages for kernel initialization, and those pages are
permanently reserved for kernel purposes, almost as if they were loaded with the
kernel image; they do not participate in any MM activity after boot.
Thereafter, setup_arch() calls paging_init() in arch/i386/mm/init.c.
This function does several things. First, it calls pagetable_init() to map the
entire physical memory, or as much of it as will fit between PAGE_OFFSET and
4GB, starting at PAGE_OFFSET.
In pagetable_init(), we actually build the kernel page tables in swapper_pg_dir
that map the entire physical memory range to PAGE_OFFSET. This is simply a
matter of doing the arithmetic and stuffing the correct values into the page directory and page tables. This
mapping is created in swapper_pg_dir, the kernel page directory; this is also
the page directory used to initiate paging. (Virtual addresses up to the next
4MB boundary past the end of memory are actually mapped here when using 4MB
pages, but "that's OK as we won't use that memory anyway"). If there is
physical memory left unmapped here - that is, memory with physical address
greater than 4GB-PAGE_OFFSET - that memory is unusable unless the CONFIG_HIGHMEM
option is set.
Near the end of pagetable_init() we call fixrange_init() to reserve pagetables (but
not populate them) for compile-time-fixed virtual-memory mappings. These tables
map virtual addresses that are hard-coded into the kernel, but which are not
part of the loaded kernel data. The fixmap tables are mapped to physical pages
allocated at run time, using the set_fixmap() call.
After initializing the fixmaps, if CONFIG_HIGHMEM is set, we also allocate some
pagetables for the kmap() allocator. kmap() allows the
kernel to map any page of physical memory into the kernel virtual address space
for temporary use. It's used, for example, to provide mappings on an
as-needed basis for physical pages that aren't directly mappable during
pagetable_init().
The fixmap and kmap pagetables occupy a
portion of the top of kernel virtual space - addresses which therefore
cannot be used to permanently map physical pages in the PAGE_OFFSET
mapping. For this reason, 128MB at the top of kernel VM is reserved
(the vmalloc allocator also uses
addresses in this range). Any physical pages that would otherwise be
mapped into the PAGE_OFFSET mapping in the 4GB-128MB range are instead
(if CONFIG_HIGHMEM is specified) included in the high memory zone,
accessible to the kernel only via kmap(). If
CONFIG_HIGHMEM is not true, those pages are completely unusable. This
becomes an issue only on machines with a large amount of RAM (900-odd
MB or more). For example, if PAGE_OFFSET==3GB and the machine has 2GB
of RAM, only the first physical 1GB-128MB can be mapped between
PAGE_OFFSET and the beginning of the fixmap/kmap address range. The
remaining pages are still usable - in fact for user-process mappings
they act the same as direct-mapped pages - but the kernel cannot
access them directly.
Back in paging_init(), we possibly initialize the kmap() system
further by calling kmap_init(), which simply caches the first kmap
pagetable (in the kmap_pte variable, so later kmap() calls need not
walk the page tables). Then, we initialize the zone allocator by computing the zone sizes
and calling free_area_init() to build the mem_map and initialize the
freelists. All freelists are initialized empty and all pages are
marked "reserved" (not accessible to the VM system); this situation is
rectified later.
When paging_init() completes, we have in physical memory the layout
shown in the list below, under "Initializing the Kernel Page Tables"
[note - this is not quite right for 2.4].
Here we are back in start_kernel(). After paging_init() completes, we do some
additional setup of other kernel subsystems, some of which allocate additional
kernel memory using the bootmem allocator. Important
among these, from the MM point of view, is kmem_cache_init(), which initializes
the slab allocator data.
Shortly after kmem_cache_init() is called, we call mem_init(). This function
completes the freelist initialization begun in free_area_init() by clearing the
PG_RESERVED bit in the zone data for free physical pages; clearing the PG_DMA
bit for pages that can't be used for DMA; and freeing all usable pages into
their respective zones. That last step, done in free_all_bootmem_core() in
bootmem.c, is interesting: it builds the buddy bitmaps and freelists describing
all existing non-reserved pages by simply freeing them and letting free_pages_ok() do the right thing. Once
mem_init() is called, the bootmem allocator is no longer usable, since all its
pages have been freed into the zone allocator's world.
Segments are just used to carve up the linear address space into arbitrary
chunks. The linear space is what's managed by the VM subsystem. The x86
architecture supports segmentation in hardware: you can specify addresses as
offsets into a particular segment, where a segment is defined as a range of
linear (virtual) addresses with particular characteristics such as
protection. In fact, you must use the segmentation mechanism on x86
machines; so we set up four flat segments covering the full 4GB linear
space, whose descriptors are shown under "Segmentation" below.
Questions:
Where is the GDT defined, and where is the GDT register loaded?
Answer: the Global Descriptor Table is defined in head.S at
line 450. The GDT register is loaded to point at the GDT on line 250.
How do the kernel segments differ from the user segments?
Answer: the properties of the kernel and user segments differ, as a
comparison of the descriptor values quoted below shows.
Also notice that the high nibble of the third high-order byte differs
in the kernel and user cases: in the kernel case, the Descriptor
Privilege Level is 0 (most privileged), while the user segment
descriptors' DPL is 3 (least privileged). If you read the Intel
documentation, you will be able to figure out exactly what all this
means, but since x86 segment protection does not figure much in the
Linux kernel, I won't discuss it any further here.
Initializing the Kernel Page Tables
0x00000000: 0-page
0x00100000: kernel-text
0x????????: kernel-data
0x????????=_end: whole-mem pagetables
0x????????: fixmap pagetables
0x????????: zone data (mem_map, zone_structs, freelists &c)
0x????????=start_mem: free pages
This chunk of memory is mapped by swapper_pg_dir and the whole-mem-pagetables to
address PAGE_OFFSET.
Further VM Subsystem Initialization Tasks
Segmentation
Thus, we effectively allow access to the entire virtual address space
using any of the available segment selectors.
Thanks to Andrea Russo for clearing up this Intel segmentation business.
.quad 0x00cf9a000000ffff /* 0x10 kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* 0x18 kernel 4GB data at 0x00000000 */
.quad 0x00cffa000000ffff /* 0x23 user 4GB code at 0x00000000 */
.quad 0x00cff2000000ffff /* 0x2B user 4GB data at 0x00000000 */
The segment registers (CS, DS, etc.) contain a 13-bit index
into the descriptor table; the descriptor at that index tells the CPU
the properties of the selected segment. The 3 low-order bits of a
segment selector are not used to index the descriptor table; rather,
they contain the descriptor-type (global or local) and the requested
privilege level. Thus the kernel segment selectors 0x10 and 0x18 use
RPL 0, while the user selectors 0x23 and 0x2B use RPL 3, the
least-privileged level.
Questions and comments to Joe Knapka
The LXR links in this page were produced by lxrreplace.tcl, which is available for free.