BIOS and Open Firmware

When the computer is powered on, the BIOS runs first; it later hands control to the code that loads the kernel.

The bootloader is a program that resides in the boot sector of the boot device. The boot device is usually the master disk. The bootloader is called by the BIOS (x86) or by the firmware (PPC).

On x86, the BIOS lets you configure the order of boot devices by hand; a boot device can be a floppy, a flash drive, a CD, or a hard disk. When a disk is partitioned with the fdisk command, a Master Boot Record (MBR) is created in the first sector of the boot device (cylinder 0, head 0, sector 1 in CHS numbering, which is LBA sector 0). The MBR contains a small program and a partition table with four entries. The boot sector is 512 bytes long, and its last 2 bytes hold the signature 0xAA55. Table 8.1 shows the layout of the MBR.

Table 8.1. MBR Components

Offset   Length   Purpose
0x000    0x1BE    MBR program code
0x1BE    0x40     Partition table
0x1FE    0x2      Hex marker or signature

The MBR partition table holds information about the partitions. Table 8.2 shows the layout of each 16-byte entry:

Table 8.2. MBR 16-byte Entries

Offset   Length   Purpose
0x00     1        Active Boot Partition Flag
0x01     3        Starting Cylinder/Head/Sector of boot partition
0x04     1        Partition Type (Linux uses 0x83, PPC PReP uses 0x41)
0x05     3        Ending Cylinder/Head/Sector of boot partition
0x08     4        Partition starting sector number
0x0C     4        Partition length (in sectors)
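The layout of the boot sector and of a 16-byte partition entry can be sketched in C. This is a minimal illustration, not code from any bootloader: the struct and function names are hypothetical, the offsets 0x1BE and 510 follow the conventional MBR layout, and 0x80 is assumed as the value of the active flag.

```c
#include <stdint.h>

/* A decoded 16-byte partition entry (names are illustrative only). */
struct part_entry {
    uint8_t  active;       /* offset 0x00: 0x80 marks the active partition */
    uint8_t  type;         /* offset 0x04: for example 0x83 for Linux */
    uint32_t start_lba;    /* offset 0x08: partition starting sector number */
    uint32_t num_sectors;  /* offset 0x0C: partition length in sectors */
};

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if the 512-byte sector ends with the 0xAA55 signature
 * (stored on disk as the bytes 0x55, 0xAA). */
int mbr_is_valid(const uint8_t *sector)
{
    return sector[510] == 0x55 && sector[511] == 0xAA;
}

/* Decode partition entry idx (0..3) from the table at offset 0x1BE. */
struct part_entry mbr_entry(const uint8_t *sector, int idx)
{
    const uint8_t *e = sector + 0x1BE + 16 * idx;
    struct part_entry pe;

    pe.active = e[0];
    pe.type = e[4];
    pe.start_lba = le32(e + 8);
    pe.num_sectors = le32(e + 12);
    return pe;
}

/* Index of the first entry flagged active, or -1 if there is none. */
int mbr_find_active(const uint8_t *sector)
{
    for (int i = 0; i < 4; i++)
        if (sector[0x1BE + 16 * i] == 0x80)
            return i;
    return -1;
}
```

Given a 512-byte buffer read from the first sector of a disk, mbr_is_valid() checks the signature and mbr_find_active() picks the entry whose code the MBR program would load next.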

The BIOS determines the boot device.

The MBR is then copied into memory at address 0x7c00 and executed. The MBR partition table holds the address of the active partition; code is copied into memory from that address and executed. This code is usually the operating system bootloader (GRUB or LILO), which then runs and loads the operating system itself. Figure 8.2 shows what memory looks like at boot time.

Figure 8.2. View of Memory at Bootup Time


The Grand Unified Bootloader (GRUB) is an x86-based bootloader that loads Linux. At the time this book was written, GRUB 2 had been ported to PPC. More information can be found in the GRUB documentation. GRUB understands the filesystem of the boot device, so the kernel is loaded by file name. GRUB is a two-stage bootloader.[1] Stage 1 is called by the BIOS and resides in the MBR. Stage 2 is then loaded by Stage 1. The sequence can be outlined as follows:

Stage 1

  1. Initializes.

  2. Detects the boot device.

  3. Loads the first sector of Stage 2.

  4. Jumps to Stage 2.

Stage 2

  1. Loads the rest of Stage 2.

  2. Jumps to the loaded code.

GRUB can be accessed interactively through its command line. A fragment of a GRUB configuration file looks like this:

----------------------------------------------------------------------
/boot/menu.lst
...
title Kernel 2.6.7, test kernel
root (hd0,0)
kernel /boot/bzImage-2.6.7-mytestkernel root=/dev/hda1 ro [2]
...
----------------------------------------------------------------------

[2] This line supplies the list of kernel parameters. Details can be found at

The options are title, which is simply a label; root, which sets the current root device to hd0, partition 0; and kernel, which names the boot image. The remaining arguments on the kernel line are passed to the kernel when it boots.

Where the kernel image is placed in memory can be seen in the code, in arch/i386/boot/setup.S for x86:

 INITSEG = DEF_INITSEG     # 0x9000, we move boot here, out of the way
 SYSSEG = DEF_SYSSEG       # 0x1000, system loaded at 0x10000 (65536).
 SETUPSEG = DEF_SETUPSEG   # 0x9020, this is the current segment

These values are segment addresses. The boot sector is placed at segment 0x9000 (linear address 0x90000), after which the setup code at segment 0x9020 (linear address 0x90200) is run. The kernel image is loaded at linear address 0x10000, decompressed, and started.

GRUB is based on the Multiboot Specification.

Multiboot Specification

The Multiboot Specification describes an interface between any potential bootloader and any potential operating system. The Multiboot Specification does not say how a bootloader should work, but how it must interface with the operating system being loaded. Currently targeted at x86 architectures and free 32-bit operating systems, it provides a standard means for a bootloader to pass configuration information to an operating system. The OS image can be of any type (ELF or special), but must contain a multiboot header in the first 8K of the image, as well as the magic number 0x1BADB002. The multiboot-compliant loader should also provide a method for auxiliary boot modules or drivers to be used by the OS at boot time as certain OSes do not load all the programs necessary for operation into the bootable kernel image. This is often done to modularize boot kernels and keep the boot kernel to a manageable size.
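As a rough sketch of how a loader might locate the header just described, the following C function scans the first 8K of an image for the magic number. It relies on two details taken from the Multiboot Specification rather than from the text above: the header is 4-byte aligned, and its first three 32-bit fields (magic, flags, checksum) sum to zero. The function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MB_HEADER_MAGIC 0x1BADB002u
#define MB_SEARCH_LIMIT 8192u   /* the header must lie in the first 8K */

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the byte offset of the multiboot header, or -1 if none is found.
 * Assumes the header is 4-byte aligned and that
 * magic + flags + checksum == 0 (mod 2^32). */
long multiboot_find_header(const uint8_t *image, size_t len)
{
    size_t limit = len < MB_SEARCH_LIMIT ? len : MB_SEARCH_LIMIT;

    for (size_t off = 0; off + 12 <= limit; off += 4) {
        uint32_t magic = rd32le(image + off);
        uint32_t flags = rd32le(image + off + 4);
        uint32_t checksum = rd32le(image + off + 8);

        if (magic == MB_HEADER_MAGIC &&
            (uint32_t)(magic + flags + checksum) == 0)
            return (long)off;
    }
    return -1;
}
```

Checking the checksum as well as the magic number avoids treating a stray 0x1BADB002 in code or data as a header.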

The Multiboot Specification dictates that, when the bootloader invokes the OS, the system must be in a specific 32-bit protected-mode state (with paging disabled), and that the rest of the machine state be left such that the OS can successfully make calls back into BIOS if desired. Finally, the bootloader must present the OS with a data structure filled with essential machine data. We now look at the multiboot information data structure.

 typedef struct multiboot_info {
  ulong flags;              // indicates which of the following fields are valid
  ulong mem_lower;          // if flags[0], amount of memory < 1M
  ulong mem_upper;          // if flags[0], amount of memory > 1M
  ulong boot_device;        // if flags[1], drive, part1, part2, part3
  ulong cmdline;            // if flags[2], address of command line
  ulong mods_count;         // if flags[3], number of boot modules
  ulong mods_addr;          // if flags[3], address of first boot module
  union {
   aout_symbol_table_t aout_sym;        // if flags[4], symbol table from a.out kernel image
   elf_section_header_table_t elf_sec;  // if flags[5], section header table from ELF kernel
  } u;
  ulong mmap_length;        // if flags[6], BIOS memory map length
  ulong mmap_addr;          // if flags[6], BIOS memory map address
  ulong drives_length;      // if flags[7], size of BIOS drive info structs
  ulong drives_addr;        // if flags[7], address of first BIOS drive info struct
  ulong config_table;       // if flags[8], ROM config table
  ulong boot_loader_name;   // if flags[9], address of bootloader name string
  ulong apm_table;          // if flags[10], address of APM info table
  ulong vbe_control_info;   // if flags[11], video mode settings
  ulong vbe_mode_info;
  ulong vbe_mode;
  ulong vbe_interface_seg;
  ulong vbe_interface_off;
  ulong vbe_interface_len;
 } multiboot_info_t;

A pointer to this structure is passed in EBX when control is passed to the OS. The first field, flags, indicates which of the following fields are valid. Unused fields must be 0. You can learn more about the Multiboot Specification at
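To illustrate how an OS might consume this structure, here is a minimal sketch that tests bits of flags before trusting the corresponding fields. It declares only the leading fields, uses uint32_t in place of ulong, and assumes (per the Multiboot Specification) that mem_lower and mem_upper are given in kilobytes. The struct and function names are hypothetical.

```c
#include <stdint.h>

#define MBI_FLAG_MEM     (1u << 0)   /* mem_lower and mem_upper are valid */
#define MBI_FLAG_CMDLINE (1u << 2)   /* cmdline is valid */

/* Leading fields of the multiboot information structure only. */
struct mbi_head {
    uint32_t flags;
    uint32_t mem_lower;    /* memory below 1M, assumed to be in KB */
    uint32_t mem_upper;    /* memory above 1M, assumed to be in KB */
    uint32_t boot_device;
    uint32_t cmdline;      /* physical address of the command line */
};

/* Copies the memory sizes out if flags[0] says they are valid.
 * Returns 1 on success, 0 if the loader did not supply them. */
int mbi_get_mem(const struct mbi_head *mbi,
                uint32_t *lower_kb, uint32_t *upper_kb)
{
    if (!(mbi->flags & MBI_FLAG_MEM))
        return 0;
    *lower_kb = mbi->mem_lower;
    *upper_kb = mbi->mem_upper;
    return 1;
}

/* Returns 1 if the loader passed a command line (flags[2] set). */
int mbi_has_cmdline(const struct mbi_head *mbi)
{
    return (mbi->flags & MBI_FLAG_CMDLINE) != 0;
}
```

In a real kernel, the pointer handed over in EBX would be cast to the full structure; the point here is only that every field access is guarded by its flag bit.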

8.2.2. LILO

The LInux LOader (LILO) has been used for years as an x86 loader for Linux. It was one of the earliest boot-loading programs available to assist in the configuration and loading of the Linux kernel. LILO is similar to GRUB in the sense that it is a two-stage bootloader. LILO uses a configuration file and does not have a command-line interface.

Again, we start with BIOS initializing the system and loading the MBR (Stage 1) into memory and transferring control to it. The breakdown of the events occurring in each of LILO's stages is as follows:

Stage 1

  1. Begins execution and displays "L."

  2. Detects disk geometry and displays "I."

  3. Loads Stage 2 code.

Stage 2

  1. Begins execution and displays "L."

  2. Locates boot data and OS and displays "O."

  3. Determines which OS to start and jumps to it.

A stanza from the LILO configuration file looks like this:

 label=Kernel 2.6.7, my test kernel

The parameters are image, which indicates the pathname of the kernel; label, which is a string describing the configuration; root, which indicates the partition where the root filesystem resides; and read-only, which indicates that the root partition cannot be altered during boot.

Here is a list of the differences between GRUB and LILO:

  • LILO stores configuration information in the MBR. If any changes are made, /sbin/lilo must be run to update the MBR.

  • LILO cannot read various filesystems.

  • LILO has no interactive command-line interface.

Let's review what happens when LILO is the bootloader. First, the MBR (which contains LILO) is copied to 0x7c00 and begins execution. LILO begins by copying the kernel image referenced in /etc/lilo.conf from the hard drive. This image, created by build.c, is made up of the init sector (loaded at 0x90000), the setup sector (loaded at 0x90200), and the compressed image (loaded at 0x10000). LILO then jumps to the label start_of_setup at address 0x90200.

8.2.3. PowerPC and Yaboot

Yaboot is a bootloader based on the Open Firmware (OF) of New World PowerPC machines. Similar to LILO and GRUB, Yaboot uses a configuration file and a utility such as ybin or ybootconfig to set up a bootstrap partition containing Yaboot. Similar to the x86 BIOS, OF allows configuration of the boot device. However, in the OF case, the procedure varies by system. OF settings can usually be reached by pressing Command+Option (Alt)+O+F.

Yaboot uses the following steps to boot:

  1. Yaboot gets called by OF.

  2. Finds boot device, boot path, and opens boot partition.

  3. Opens /etc/yaboot.conf or command shell.

  4. Loads image or kernel and initrd.

  5. Executes image.

As you can see, the kernel-loading stanza for Yaboot is similar to LILO and GRUB:


As in LILO, ybin installs Yaboot to the boot partition. Any updates/changes to the Yaboot configuration require rerunning ybin.

Documentation on Yaboot can be found at

8.3. Architecture-Dependent Memory Initialization

We now take a moment to discuss hardware management features in PPC and x86. Both x86 and PowerPC architectures have hardware memory-management features to support real and virtual addressing environments. As in all operating systems, Linux Memory Management depends on the underlying hardware architecture. This section describes the hardware initialization of both architectures. Because the initialization of memory management is extremely hardware dependent, the hardware specifications need to be understood in order to follow the initialization process. Memory management is one of the first subsystems to be initialized and begins prior to the execution of start_kernel() because of its highly architecture-dependent nature.

8.3.1. PowerPC Hardware Memory Management

This section describes the hardware-supported features of address translation specific to the PowerPC architecture, known in the PowerPC world as "storage control." We follow up with a discussion on how Linux uses (or disregards, for the sake of portability) these features from system power-on through kernel initialization.

Real Addressing Mode

From embedded up to high performance, all PowerPC processors come out of hardware reset in real mode.[3] PowerPC real-addressing mode is defined as having the processor in a state of disabled address translation. Address translation is controlled by the instruction relocate (IR) and data relocate (DR) bits in the Machine State Register (MSR). For instruction fetches, if the IR bit is 0, the effective address (EA) is the same as the real address. For load and store instructions, the DR bit in the MSR plays the same role.

[3] Even the 440 series of processors, which technically have no real mode, start with a "shadow" TLB that maps linear addresses to physical addresses.

The MSR, which is illustrated in Figure 8.3, is a 64- or 32-bit register that describes the current state of the processor. On a 32-bit implementation, the IR and DR are bits 26 and 27.

Figure 8.3. PowerPC Machine State Register (MSR)

Because address translation in Linux is a combination of hardware and software structures, real mode is fundamental to the boot process of initializing the memory subsystem and the memory-management structures of Linux. The need to enable address translation is exemplified by the inherent limitations of real mode. Real mode is only capable of addressing the implemented address width; this is 64- or 32-bit in most applications. The two major limitations are as follows:

  • There is no hardware protection for load/store operations.

  • Any access (instruction or data) to or from an address that does not have a device physically attached to the bus might cause a Machine Check (also known as a Checkstop), which in most cases, is unrecoverable.

Address Translation

The lack of address translation is real addressing. Address translation opens the door to virtual addressing where every possible address is not physically available at any given instance, but through the clever use of hardware and software, every possible address can be made virtually available when accessed.

With address translation enabled, the PowerPC architecture translates an EA by one of two methods: Segmented Address Translation or Block Address Translation (see Figure 8.4). If the EA can be translated by both methods, Block Address Translation takes precedence. Address translation is said to be enabled when MSR[IR]=1, or MSR[DR]=1, or both. Segmented Address Translation breaks virtual memory into segments, which are divided into 4KB pages, each representing physical memory. Block Address Translation breaks memory into regions ranging from 128KB to 256MB.

Figure 8.4. 32-Bit Address Translation

Memory Addressing Terminology

When we reference memory, we really only have two distinct methodologies or modes: real addressing, where each increment of the address specifies a specific base unit (usually a byte) in physical memory; and virtual addressing, where the address is a computation in hardware and/or software. Here are some example terms used for each:

  • Real addressing. Physical, bus

  • Virtual addressing. Effective, protected, and translated

In PowerPC, effective address space is considered a subset of virtual address space.

Terms such as linear, flat, and logical can apply to both modes.

Segmented Address Translation: Direct Store Segment (T=1)

The next level of translation is determined by the T bit, which is located in the Segment Register. Bits 0:3 of the EA select one of 16 segment registers (SRs) in the PowerPC 7xx series. Figure 8.5 illustrates the segment register.

Figure 8.5. Segment Register

With the T bit set, the segment is deemed a direct store segment to an I/O device, and there is no reference to hardware page tables. The I/O address is made up of a permission bit, the BUID, the controller-specific field, and bits 4:31 of the EA. Linux does not use direct store segmentation.

Segmented Address Translation: Ordinary Segment (T=0)

When the T bit is not set, the virtual segment ID (VSID) field is used.

Referring to Figure 8.6, a 52-bit virtual address (VA) is formed by concatenating bits 20:31 of the EA (the offset within a given page), bits 4:19 of the EA, and bits 8:31 of the selected segment register VSID field. The most significant 40 bits of the VA make up the virtual page number (VPN). The PowerPC architecture uses a Hashed Page Table to map VPNs to real page numbers (the real address of a desired page in memory). The hash function uses the VPN and the value in Storage Description Register 1 (SDR1) to store and retrieve a Page Table Entry (PTE). The PTE, which is illustrated in Figure 8.7, is an 8-byte structure that contains all the necessary attributes of a page in memory.

Figure 8.6. Segment Translation

Figure 8.7. Page Table Entry
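The formation of the 52-bit virtual address can be sketched in C as follows. This is an illustration of the bit arithmetic only, assuming a 32-bit implementation with 16 segment registers and PowerPC bit numbering (bit 0 is the most significant bit); the function names are hypothetical and the hash step is not modeled.

```c
#include <stdint.h>

/* EA bits 0:3 select one of the 16 segment registers. */
unsigned ppc_sr_index(uint32_t ea)
{
    return ea >> 28;
}

/* Build the 52-bit virtual address for an ordinary (T=0) segment:
 * VSID (24 bits) | page index (16 bits) | byte offset (12 bits). */
uint64_t ppc_virtual_address(uint32_t ea, const uint32_t *sr)
{
    uint64_t vsid = sr[ppc_sr_index(ea)] & 0x00FFFFFFu; /* SR bits 8:31  */
    uint64_t page = (ea >> 12) & 0xFFFFu;               /* EA bits 4:19  */
    uint64_t off  = ea & 0xFFFu;                        /* EA bits 20:31 */

    return (vsid << 28) | (page << 12) | off;
}

/* The most significant 40 bits of the VA form the virtual page number. */
uint64_t ppc_vpn(uint64_t va)
{
    return va >> 12;
}
```

For example, with segment register 12 holding a VSID of 0xABCDEF, the EA 0xC0123456 yields the VA 0xABCDEF0123456 and the VPN 0xABCDEF0123, which is the value the hardware would feed to the hash function.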

Block Address Translation

As its name implies, Block Address Translation (BAT) is an addressing mechanism that allows for mapping blocks of contiguous memory from 128KB to 256MB. BAT registers are privileged special purpose registers (SPRs) in the PowerPC architecture. Figure 8.8 illustrates the BAT register.

Figure 8.8. BAT Register

The formation of a real address from a BAT register can be seen in Figure 8.9. Four Instruction BAT (IBAT) registers and four Data BAT (DBAT) registers can be read or written using mtspr and mfspr PPC instructions.[4]

[4] Block Address Translation is not implemented on all PowerPC processors. Notably, it was not implemented on G4 or G5. It is implemented in the 4xx-embedded processors.

Figure 8.9. BAT Real

Translation Lookaside Buffers

The Translation Lookaside Buffers (TLBs) can be thought of as a hardware cache with hardware protection for the paging system. The TLB varies in length with PowerPC architectures and contains an index of the most recently used PTEs. The paging software must be sure to keep the TLBs in sync with the page table. When the processor cannot find a page in the hash table,[5] the Linux page tables are then searched. If the page is still not found, a normal page fault is generated. Information on optimization of the synchronization between the Linux page tables and PPC hash tables can be found in the document, "Low Level Optimizations in the PowerPC/Linux Kernels," by Paul Mackerras.

[5] Hash tables are not implemented on all PowerPC processors. They are absent in the 4xx- and 8xx-embedded systems where a TLB miss generates an exception in the hardware and the paging software, and then brings the page in.

Storage Access Mode Control

When address translation is enabled (MSR[IR]=1, or MSR[DR]=1, or both) and accomplished by way of Segmented Address Translation or Block Address Translation, the storage mode is determined by four control bits: W, I, M, and G. For Segmented Address Translation, they are bits 25:28 of the second word of a PTE; for Block Address Translation, they are the same bits of the second SPR of the DBAT. (The G bit is reserved in the IBAT.) Two more bits, Referenced (R) and Changed (C), which are located in the PTE, are available for Segmented Address Translation. The R and C bits are set by hardware or software. (See the following sidebar for a discussion of the W, I, M, G, R, and C bits.)
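Because PowerPC numbers bits from the most significant end, "bits 25:28" of a 32-bit word correspond to the masks 0x40 through 0x08. The following sketch makes that conversion explicit and decodes the four storage control bits from the second word of a PTE. The macro, struct, and function names are hypothetical.

```c
#include <stdint.h>

/* PowerPC bit numbering: bit 0 is the most significant bit of the word. */
#define PPC_BIT32(n) (1u << (31 - (n)))

#define PTE_W PPC_BIT32(25)   /* write through   */
#define PTE_I PPC_BIT32(26)   /* cache inhibit   */
#define PTE_M PPC_BIT32(27)   /* memory coherence */
#define PTE_G PPC_BIT32(28)   /* guarded         */

/* Decoded storage control bits. */
struct wimg {
    unsigned w : 1;
    unsigned i : 1;
    unsigned m : 1;
    unsigned g : 1;
};

/* Extract W, I, M, and G from the second word of a PTE. */
struct wimg pte_wimg(uint32_t pte_word2)
{
    struct wimg x;

    x.w = (pte_word2 & PTE_W) != 0;
    x.i = (pte_word2 & PTE_I) != 0;
    x.m = (pte_word2 & PTE_M) != 0;
    x.g = (pte_word2 & PTE_G) != 0;
    return x;
}
```

A word whose low byte is 0x50, for instance, decodes to W=1 and M=1 with I and G clear.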

Control Bits

The W, I, M, G, R, and C bits control how the processor accesses the cache and main memory:

  • W (Write Through). If data is in the cache and a store operation is performed on it, if W=1, the copy in main memory must also be updated.

  • I (Cache Inhibit). Updates bypass the cache and go straight through to main memory.

  • M (Memory Coherence). When M=1, hardware memory coherency is enforced.

  • G (Guarded). When G=1, speculative execution is suppressed.

  • R (Referenced). When R=1, the Page Table entry has been referenced.

  • C (Changed). When C=1, the Page Table entry has been changed.

How Linux Uses PPC Address Translation

We now look at the code that influences memory management in PPC.

The following code is the first in the kernel distribution to get control. This routine calls back into the Firmware for allocation of temporary regions by using the claim() function. The kernel is then decompressed into its proper location:

 40  void boot(int a1, int a2, void *prom)
 54  claim(initrd_start, RAM_END - initrd_start, 0);
 55  printf("initial ramdisk moving 0x%x <- 0x%p (%x bytes)\n\r",
 56   initrd_start, (char *)(&__ramdisk_begin), initrd_size);
 57  memcpy((char *)initrd_start, (char *)(&__ramdisk_begin), initrd_size);
 63  /* claim 3MB starting at PROG_START */
 64   claim(PROG_START, PROG_SIZE, 0);
 65   dst = (void *) PROG_START;
 66   if (im[0] == 0x1f && im[1] == 0x8b) {
 67  /* claim some memory for scratch space */
 68  avail_ram = (char *) claim(0, SCRATCH_SIZE, 0x10);
 69  begin_avail = avail_high = avail_ram;
 70  end_avail = avail_ram + SCRATCH_SIZE;
 71  printf("heap at 0x%p\n", avail_ram);
 72  printf("gunzipping (0x%p <- 0x%p:0x%p)...", dst, im, im+len);
 73  gunzip(dst, PROG_SIZE, im, &len);
 74  printf("done %u bytes\n", len);
 75  printf("%u bytes of heap consumed, max in use %u\n",
 76   avail_high - begin_avail, heap_max);
 86  sa = (unsigned long)PROG_START;
 87   printf("start address = 0x%x\n", sa);
 89   (*(kernel_start_t)sa)(a1, a2, prom);

Line 40

Entry point to this file is the function boot(a1, a2, *prom).

Line 54

Function claim() is called to allocate memory just below 1M and ramdisk is copied into that memory.

Line 64

Function claim() is called to allocate 3M of memory, starting at 0x1_0000 for the image.

Line 68

Function claim() is called to allocate 8K of memory starting at 0x00 for scratch/heap.

Line 73

The image is gunzipped to address 0x1_0000 (PROG_START).

Line 89

Jump to 0x1_0000 by calling (*(kernel_start_t)sa) with parameters (a1, a2, and prom), where a1 holds the value in r3 (equal to the boot ramdisk start), a2 holds the value in r4 (equal to the boot ramdisk size or 0xdeadbeef in the case of no ramdisk), and prom holds the value in r5 (code stored in system ROM).

The next code block readies the hardware memory-management features of the various PowerPC processors. The first 16M of RAM is mapped to 0xc0000000:

 131  __start:
 150  bl  early_init  in <arch/ppc/kernel/setup.c> (283)
 170  bl  mmu_off
 171   RFI: SRR0=>IP, SRR1=>MSR
 172  #ifndef CONFIG_POWER4
 173   bl  clear_bats
 174   bl  flush_tlbs
 176   bl  initial_bats
 177  #if !defined(CONFIG_APUS) && defined(CONFIG_BOOTX_TEXT)
 178   bl  setup_disp_bat
 179  #endif
 180  #else /* CONFIG_POWER4 */
 181   bl  reloc_offset
 182   bl  initial_mm_power4
 183  #endif /* CONFIG_POWER4 */
 185  /*
 186  * Call setup_cpu for CPU 0 and initialize 6xx Idle
 187  */
 188   bl  reloc_offset
 189   li  r24,0    /* cpu# */
 190   bl  call_setup_cpu   /* Call setup_cpu for this CPU */
 195  #ifdef CONFIG_POWER4
 196   bl  reloc_offset
 197  bl  init_idle_power4
 198  #endif /* CONFIG_POWER4 */
 210  bl  reloc_offset
 211  mr  r26,r3
 212  addis  r4,r3,KERNELBASE@h  /* current address of _start */
 213  cmpwi  0,r4,0    /* are we already running at 0? */
 214  bne  relocate_kernel
 224  turn_on_mmu:
 225  mfmsr  r0
 226  ori  r0,r0,MSR_DR|MSR_IR
 227  mtspr  SRR1,r0
 228  lis  r0,start_here@h
 229  ori  r0,r0,start_here@l
 230  mtspr  SRR0,r0
 231  SYNC
 232  RFI     /* enables MMU */

Line 131

This is the entry point to this code. Get minimal mmu environment set up. (Note that APUS stands for Amiga Power Up System.)

Line 150

There might be a difference between where the kernel is loaded and where it is linked. The function early_init returns the physical address of the current code.

Line 170

Shut off memory-management unit of PPC. If both IR and DR are enabled, leave them on; otherwise, shut off relocation.

Lines 173-176

If not power4 or G5, clear the BAT registers, flush TLBs, and set up BATs to map the first 16M of RAM to 0xc0000000.

Note the various labels for kernel memory used throughout the kernel:




Lines 181-182

By using segmentation, set up kernel memory for power4 and G5.

Lines 188-198

setup_cpu() initializes the kernel and user features, such as cache configuration, or whether an FPU or MMU exists. (Note that at this writing, init_idle_power4 is a no-op.)

Line 210

Relocate kernel to KERNELBASE or 0x00, depending on the platform.

Lines 224-232

Turn on the MMU (if it is not already) by enabling IR and DR in MSR. Then, execute an RFI instruction causing a jump to the label start_here:. (Note: The RFI instruction loads the MSR with the contents of SRR1 and branches to the address in SRR0.)

The following code is where the kernel starts. It sets up all memory in the system based on the command line:

 1337  start_here:
 1364  bl  machine_init  
 1365  bl  MMU_init
 1385  lis  r4,2f@h
 1386  ori  r4,r4,2f@l
 1387  tophys(r4,r4)
 1388  li  r3,MSR_KERNEL & ~(MSR_IR|MSR_DR)
 1389  FIX_SRR1(r3,r5)
 1390  mtspr  SRR0,r4
 1391  mtspr  SRR1,r3
 1392  SYNC
 1393  RFI
 1394  /* Load up the kernel context */
 1395  2:  bl  load_up_mmu
 1411  /* Now turn on the MMU for real! */
 1412  li  r4,MSR_KERNEL
 1413  FIX_SRR1(r4,r5)
 1414  lis  r3,start_kernel@h
 1415  ori  r3,r3,start_kernel@l
 1416  mtspr  SRR0,r3
 1417  mtspr  SRR1,r4
 1418  SYNC
 1419  RFI

Line 1337

This line is the entry point to this section.

Line 1364

machine_init() (see the file arch/ppc/kernel/setup.c, line 532) sets up machine-specific information, such as NVRAM, L2, CPU cache line size, debugging, and so on.

Line 1365

MMU_init() (see file arch/ppc/mm/init.c, line 234) discovers the total memory size for highmem and lowmem. It then initializes the MMU hardware (MMU_init_hw(), line 267), sets up the Hash Page Table (arch/ppc/mm/hashtable.S), maps all RAM starting at KERNELBASE (mapin_ram(), line 272), maps all I/O (setup_io_mappings(), line 285), and initializes context management (mmu_context_init(), line 288).

Line 1385

Shut off IR and DR to set up SDR1. This holds the real address of the Page Table and how many bits from the hash are used in the Page Table Index.

Line 1395

Clear TLBs, load SDR1 (hash table base and size), set up segmentation, and, depending on the particular PPC platform, initialize the BAT registers.

Lines 1412-1419

Turn on IR and DR, and RFI to start_kernel in /init/main.c. Note that at interrupt time in the PowerPC architecture, the Instruction Address Register (IAR) holds the address the processor must return to after servicing the interrupt. This value is saved in the Save Restore Register 0 (SRR0). The Machine State Register is in turn saved in the Save Restore Register 1 (SRR1). In shorthand, at interrupt time:

  • IAR->SRR0

  • MSR->SRR1

The RFI instruction, which is normally executed at the end of an interrupt routine, is the inverse of this procedure, where SRR0 is restored to the IAR and SRR1 is restored to the MSR. In shorthand:

  • SRR0->IAR

  • SRR1->MSR

The code in lines 1385-1419 uses this methodology to turn memory management on and off by this three-step process:

  1. Sets the desired bits for the MSR (refer to Figure 8.3) in SRR1.

  2. Sets the desired address we want to jump to in SRR0.

  3. Executes the RFI instruction.
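The three-step process can be modeled with a toy register file in C. This is a deliberately simplified sketch, not kernel code: the struct and function names are hypothetical, and real hardware restores only a subset of the MSR bits from SRR1. The values of MSR_IR and MSR_DR follow from IR and DR being bits 26 and 27 of a 32-bit MSR in PowerPC bit numbering.

```c
#include <stdint.h>

#define MSR_IR 0x20u   /* instruction relocate: bit 26, 1 << (31 - 26) */
#define MSR_DR 0x10u   /* data relocate:        bit 27, 1 << (31 - 27) */

/* Toy model of the registers involved (not a real CPU description). */
struct ppc_cpu {
    uint32_t iar;    /* instruction address register */
    uint32_t msr;    /* machine state register */
    uint32_t srr0;   /* save/restore register 0 */
    uint32_t srr1;   /* save/restore register 1 */
};

/* RFI: SRR1 -> MSR and SRR0 -> IAR, modeled as one atomic step. */
void ppc_rfi(struct ppc_cpu *cpu)
{
    cpu->msr = cpu->srr1;
    cpu->iar = cpu->srr0;
}

/* The three-step sequence: desired MSR into SRR1, target into SRR0, RFI. */
void ppc_jump_with_msr(struct ppc_cpu *cpu, uint32_t target, uint32_t new_msr)
{
    cpu->srr1 = new_msr;   /* step 1 */
    cpu->srr0 = target;    /* step 2 */
    ppc_rfi(cpu);          /* step 3 */
}
```

Turning translation on is then ppc_jump_with_msr(&cpu, target, cpu.msr | MSR_IR | MSR_DR), and turning it off passes cpu.msr & ~(MSR_IR | MSR_DR) instead. The MSR change and the branch take effect together, which is why the kernel uses RFI rather than writing the MSR directly.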

8.3.2. x86 Intel-Based Hardware Memory Management

At power-on, all Intel processors are in real address mode. Real addressing is a compatibility mode for the early Intel processors: as processors grew more complex, there was always legacy code in use that newer processors still needed to run. In real address mode, the processor can execute a program written for the 8086 and 8088 processors using the same instructions and, more importantly, the same method of addressing or address translation. The end result of address translation is how the processor accesses the system memory. The early Intel processors had a 20-bit address bus, which could address 1MB of memory through segments of up to 64K bytes each. This is the limitation put on the early code in the system. In real address mode, the linear address is the same as the physical address. As we move through the code that initializes memory management, we see more of the features of the later processors being used in the hardware and more complex structures added to the software.

The code in setup.S performs several important functions with respect to memory initialization:

 307  #define SMAP 0x534d4150
 309  meme820:
 310   xorl  %ebx, %ebx    # continuation counter
 311   movw  $E820MAP, %di    # point into the whitelist
 312         # so we can have the bios
 313         # directly write into it.
 315  jmpe820:
 316   movl  $0x0000e820, %eax   # e820, upper word zeroed
 317   movl  $SMAP, %edx    # ascii 'SMAP'
 318   movl  $20, %ecx    # size of the e820rec
 319   pushw  %ds     # data record.
 320   popw  %es
 321   int  $0x15     # make the call
 322   jc  bail820     # fall to e801 if it fails
 324   cmpl  $SMAP, %eax    # check the return is 'SMAP'
 325   jne  bail820     # fall to e801 if it fails
 333  good820:
 334   movb  (E820NR), %al    # up to 32 entries
 335   cmpb  $E820MAX, %al
 336   jnl  bail820
 338   incb  (E820NR)
 339   movw  %di, %ax
 340   addw  $20, %ax
 341   movw  %ax, %di
 342  again820:
 343   cmpl  $0, %ebx    # check to see if
 344   jne  jmpe820     # %ebx is set to EOF
 345  bail820:

Lines 307-345

Looking at the code segment, we first see (on line 321) a call to the BIOS int15h function with ax= 0xe820. This returns the addresses and lengths of the many different types of memory of which BIOS is aware. This simple memory map represents the basic pool from which all the pages of memory in Linux are obtained. As seen from further studying of the code, the memory map can be obtained by three methods: 0xe820, 0xe801, and 0x88. All three methods have to do with compatibility with existing BIOS distributions and their platforms.
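To make the shape of this memory map concrete, the following C sketch walks a list of e820-style records and totals the usable RAM. It assumes the conventional record contents (a 64-bit base address, a 64-bit length, and a 32-bit type, 20 bytes on the wire) and that type 1 denotes usable RAM. The struct and function names are hypothetical, and the in-memory struct here is not packed the way the BIOS buffer is.

```c
#include <stddef.h>
#include <stdint.h>

#define E820_TYPE_RAM 1u   /* usable memory; other types are reserved, ACPI, ... */

/* One record of the BIOS memory map, already decoded from the raw buffer. */
struct e820_entry {
    uint64_t addr;   /* start of the region */
    uint64_t size;   /* length of the region in bytes */
    uint32_t type;   /* kind of memory */
};

/* Sum the sizes of all regions the BIOS reports as usable RAM. */
uint64_t e820_usable_bytes(const struct e820_entry *map, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        if (map[i].type == E820_TYPE_RAM)
            total += map[i].size;
    return total;
}
```

A typical small map has one usable region below 640K, a reserved region up to 1M, and a large usable region starting at 1M; only the first and last contribute to the total.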

----------------------------------------------------------------------
arch/i386/boot/setup.S
 595  # Now we move the system to its rightful place ... but we check if we have a
 596  # big-kernel. In that case we *must* not move it ...
 597   testb  $LOADED_HIGH, %cs:loadflags
 598   jz  do_move0     # .. then we have a normal low
 599          # loaded zImage
 600          # .. or else we have a high
 601          # loaded bzImage
 602   jmp  end_move     # ... and we skip moving
 603
 604  do_move0:
 605   movw  $0x100, %ax    # start of destination segment
 606   movw  %cs, %bp     # aka SETUPSEG
 607   subw  $DELTA_INITSEG, %bp   # aka INITSEG
 608   movw  %cs:start_sys_seg, %bx  # start of source segment
 609   cld
 610  do_move:
 611   movw  %ax, %es     # destination segment
 612   incb  %ah      # instead of add ax,#0x100
 613   movw  %bx, %ds     # source segment
 614   addw  $0x100, %bx
 615   subw  %di, %di
 616   subw  %si, %si
 617   movw  $0x800, %cx
 618   rep
 619   movsw
 620   cmpw  %bp, %bx     # assume start_sys_seg > 0x200,
 621          # so we will perhaps read one
 622          # page more than needed, but
 623          # never overwrite INITSEG
 624          # because destination is a
 625          # minimum one page below source
 626   jb  do_move
 627
 628  end_move:
----------------------------------------------------------------------

Lines 595-628

This code moves the kernel image that was created by build.c and loaded by LILO. The image is made up of the init sector (at address 0x90000), the setup sector (at address 0x90200), and the compressed image. The compressed image is originally loaded at address 0x10000. If it is large (>0x7FF, a bzImage with LOADED_HIGH set in loadflags), it is left in place; otherwise (a zImage), it is moved down to 0x1000.

 723   # Try enabling A20 through the keyboard controller
 724  #endif /* CONFIG_X86_VOYAGER */
 725  a20_kbc:
 726   call  empty_8042
 728  #ifndef CONFIG_X86_VOYAGER
 729   call  a20_test    # Just in case the BIOS worked
 730   jnz  a20_done    # but had a delayed reaction.
 731  #endif
 733   movb  $0xD1, %al    # command write
 734   outb  %al, $0x64
 735   call  empty_8042
 737   movb  $0xDF, %al    # A20 on
 738   outb  %al, $0x60
 739   call  empty_8042

Forming the 20-bit Physical Address in Intel Real Address Mode

The Intel 8088 processor in the original IBM PC had only 20 address lines [0...19]. This allowed the system to access up to 1 megabyte plus approximately 64K bytes of memory (0 to 0x10_FFEF) internally, but physically (on the bus) the last 64K of addressable memory was actually the first 64K of real memory!

Internal to the processor, a 20-bit address is formed from a 16-bit segment selector and a 16-bit segment offset. The selector is shifted left 4 bits and added to the offset, which is extended by 4 bits. The sum of these registers is the physical address seen on the bus.

For example:

To obtain the highest address, we load a segment selector (CS, DS, ES, and so on) with a value of 0xFFFF and an index register (SI, DI, and so on) with a value of 0xFFFF. Internal to the processor, the segment selector is shifted left 4 bits and added to the offset.

0xFFFF shifted left 4 bits     0x0FFFF0
Add the offset                 0x00FFFF
Internal sum                   0x10FFEF
External Physical Address      0x00FFEF

This resulting Physical Address is the same as a segment selector with the value of 0x0000 and an offset value of 0xFFEF (0000:FFEF).

Accessing the highest address and above would wrap back into low memory at 0xFFEF. Certain programs written for this processor would depend on this 20-bit wrap-around behavior. The introduction of the Intel 286 and later processors with wider address busses incorporated Real Addressing to maintain backward compatibility with 8088 and 8086. Real Addressing mode did not take into account legacy software that depended on the 20-bit wrap-around. The A20M# signal pin was added to mimic this "feature" of the earlier processors. Asserting this signal would mask off the A20 signal allowing the low memory to be accessed once again.

A logic gate was used to enable or disable the memory bus A20 signal. The original design to assert this gate was to use an extra I/O signal from the keyboard controller that was controlled by I/O ports 0x60 and 0x64. A "Fast Gate A20" method was later developed which used I/O port 0x92 designed into the system board. Since all x86 processors come out of reset in Real Address mode, it is wise for boot code to make certain address line A20 is enabled by one or both of these methods.
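The address arithmetic in this sidebar is small enough to express directly in C. The sketch below computes the internal sum for a segment:offset pair and then shows what reaches the bus with A20 masked or enabled. The function names are hypothetical, and the model simply forces address bit 20 to zero when the gate is closed.

```c
#include <stdint.h>

/* Internal address: the selector shifted left 4 bits plus the offset. */
uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* Address seen on the bus: with A20 masked, address bit 20 is forced
 * to 0, which reproduces the 8088 wrap-around into low memory. */
uint32_t bus_address(uint32_t linear, int a20_enabled)
{
    return a20_enabled ? linear : (linear & ~(1u << 20));
}
```

For FFFF:FFFF, real_mode_linear() gives 0x10FFEF; with A20 masked the bus sees 0xFFEF, and with A20 enabled the access reaches the memory just above 1 megabyte.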

Lines 723-739

This code is a fascinating throwback to the early Intel processors. This is a mere nuisance in the setup of Memory Management.

 790  # set up gdt and idt
 791  lidt  idt_48     # load idt with 0,0
 792  xorl  %eax, %eax    # Compute gdt_base
 793  movw  %ds, %ax    # (Convert %ds:gdt to a linear ptr)
 794  shll  $4, %eax
 795  addl  $gdt, %eax
 796  movl  %eax, (gdt_48+2)
 797  lgdt  gdt_48     # load gdt with whatever is
 798        # appropriate
 981  gdt:
 982   .fill GDT_ENTRY_BOOT_CS,8,0
 984   .word  0xFFFF     # 4Gb - (0x100000*0x1000 = 4Gb)
 985   .word  0     # base address = 0
 986   .word  0x9A00     # code read/exec
 987   .word  0x00CF     # granularity = 4096, 386
 988         # (+5th nibble of limit)
 990   .word  0xFFFF     # 4Gb - (0x100000*0x1000 = 4Gb)
 991   .word  0     # base address = 0
 992   .word  0x9200     # data read/write
 993   .word  0x00CF     # granularity = 4096, 386
 994         # (+5th nibble of limit)
 995  gdt_end:
 996   .align  4
 998   .word  0     # alignment byte
 999  idt_48:
 1000   .word  0     # idt limit = 0
 1001  .word  0, 0     # idt base = 0L
 1003   .word  0     # alignment byte
 1004  gdt_48:
 1005   .word  gdt_end - gdt - 1   # gdt limit
 1006   .word  0, 0     # gdt base (filled in later)

Lines 790-797

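These lines load the provisional IDT and GDT with the lidt and lgdt instructions. The base of the GDT is computed at run time by converting %ds:gdt to a linear address before lgdt is executed. The structures and data for the provisional GDT and IDT are compiled into the end of setup.S. These tables are implemented in their simplest form.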

Lines 981-1006

These lines are the compiled-in values for the provisional GDT. The GDT has a code and data descriptor, each representing 4GB of memory starting at 0x00. The IDT is left initialized to 0x00 and is filled in later.
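To see why the four .word values of each descriptor describe 4GB starting at 0, we can decode them by hand. The following C sketch unpacks a descriptor given as four 16-bit words in the order they appear in the listing; the struct and function names are hypothetical, and only the fields discussed here are extracted.

```c
#include <assert.h>
#include <stdint.h>

struct seg_desc {
    uint32_t base;     /* 32-bit segment base address */
    uint32_t limit;    /* 20-bit limit field */
    uint8_t  access;   /* access byte (0x9A code, 0x92 data in the listing) */
    int      gran_4k;  /* G bit: limit is counted in 4KB units */
    int      is_32bit; /* D/B bit: 32-bit default size */
};

/* w[0] = limit 15:0, w[1] = base 15:0,
 * w[2] = access byte (high) and base 23:16 (low),
 * w[3] = base 31:24 (high), flags and limit 19:16 (low). */
struct seg_desc decode_descriptor(const uint16_t w[4])
{
    struct seg_desc d;

    d.limit    = w[0] | ((uint32_t)(w[3] & 0x000F) << 16);
    d.base     = w[1] | ((uint32_t)(w[2] & 0x00FF) << 16)
                      | ((uint32_t)(w[3] & 0xFF00) << 16);
    d.access   = (uint8_t)(w[2] >> 8);
    d.gran_4k  = (w[3] >> 7) & 1;
    d.is_32bit = (w[3] >> 6) & 1;
    return d;
}

/* Size of the segment in bytes: (limit + 1) units of 1 byte or 4KB. */
uint64_t segment_size_bytes(const struct seg_desc *d)
{
    uint64_t units = (uint64_t)d->limit + 1;

    return d->gran_4k ? units * 4096 : units;
}
```

Decoding 0xFFFF, 0, 0x9A00, 0x00CF yields a base of 0, a limit of 0xFFFFF, and 4KB granularity, which is 0x100000 pages of 0x1000 bytes, matching the comment in the listing.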

As far as memory management on an Intel platform is concerned, entering protected mode is one of the most important phases. At this point, the hardware begins to build a virtual address space for the operating system.

Protected Mode

The Intel method of memory management is called protected mode. The protection refers to multiple independent segmented address spaces that are protected from each other. The other half of Intel memory management is paging or page translation. System programmers can make use of various combinations of segmentation and paging, but Linux uses a flat model where segmentation is all but eliminated. In the flat model, each process has access to its entire 32-bit address space (4GB).

 830  movw  $1, %ax     # protected mode (PE) bit
 831  lmsw  %ax     # This is it!
 832  jmp  flush_instr
 834  flush_instr:
 835   xorw  %bx, %bx    # Flag to indicate a boot
 836   xorl  %esi, %esi    # Pointer to real-mode code
 837   movw  %cs, %si
 838   subw  $DELTA_INITSEG, %si
 839   shll  $4, %esi  

Lines 830-831

Set the PE bit in the Machine Status Word to enter protected mode. The jmp instruction begins executing in protected mode.

Lines 834-839

Save a 32-bit pointer to the real-mode code in %esi; it is used later in startup_32() when decompressing and loading the kernel.

Recall that in real addressing mode, code is executed by using 16-bit instructions. The current file is compiled using the .code16 assembler directive, which enforces this mode; this is also known as a 16-bit module in the Intel Programmer's Reference. To jump from a 16-bit module to a 32-bit module, the Intel architecture (and assembler magic) allows us to build a 32-bit instruction in a 16-bit module.

Build and execute the 32-bit jump:

 841  # jump to startup_32 in arch/i386/kernel/head.S
 842  #
 843  # NOTE: For high loaded big kernels we need a
 844  #  jmpi 0x100000,__BOOT_CS
 845  #
 846  #  but we haven't yet reloaded the CS register, so the default size 
 847  #  of the target offset still is 16 bit.
 848  #  However, using an operand prefix (0x66), the CPU will properly
 849  #  take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
 850  #  Manual, Mixing 16-bit and 32-bit code, page 16-6)
 852   .byte 0x66, 0xea    # prefix + jmpi-opcode
 853  code32:  .long  0x1000     # will be set to 0x100000
 854         # for big kernels
 855   .word  __BOOT_CS

Line 852

This line builds the 32-bit jump instruction.
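The bytes emitted by lines 852 through 855 can be reproduced with a small C sketch. It writes the operand-size prefix, the far-jump opcode, a 32-bit offset, and a 16-bit selector in little-endian order. The function name is hypothetical, and the selector is passed in as a parameter rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode "prefix + jmpi offset32, selector16" into out[0..7].
 * Returns the number of bytes written (always 8). */
size_t encode_far_jump32(uint8_t out[8], uint32_t offset, uint16_t selector)
{
    size_t n = 0;

    out[n++] = 0x66;                     /* operand-size prefix */
    out[n++] = 0xEA;                     /* far jump opcode */
    for (int i = 0; i < 4; i++)          /* 32-bit offset, low byte first */
        out[n++] = (uint8_t)(offset >> (8 * i));
    for (int i = 0; i < 2; i++)          /* 16-bit selector, low byte first */
        out[n++] = (uint8_t)(selector >> (8 * i));
    assert(n == 8);
    return n;
}
```

Together, the offset and the selector form the 48-bit far pointer mentioned in the source comment.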

After this jump is executed, the system uses the provisional GDT and the code is executing in 32-bit protected mode, starting at the label startup_32 in arch/i386/kernel/head.S line 57.

Until this point, the discussion has been how to get the Intel system ready to set up paging. As we trace through the code in head.S, we see what initialization needs to take place and how Linux uses the x86-based protected mode paging system. This is the final code before the kernel is started in main.c. For complete information on the many possible modes and settings that relate to memory initialization and Intel processors, look at the Intel Architecture Software Developers Manual, Volume 3.

 057  ENTRY(startup_32)
 059  /*
 060  * Set segments to known values.
 061  */
 062   cld
 063   lgdt boot_gdt_descr - __PAGE_OFFSET
 064   movl $(__BOOT_DS),%eax
 065   movl %eax,%ds
 066   movl %eax,%es
 067   movl %eax,%fs
 068   movl %eax,%gs
 081  /*
 082  * Initialize page tables. This creates a PDE and a set of page
 083  * tables, which are located immediately beyond _end. The variable
 084  * init_pg_tables_end is set up to point to the first "safe" location.
 085  * Mappings are created both at virtual address 0 (identity mapping)
 086  * and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
 087  *
 088  * Warning: don't use %esi or the stack in this code. However, %esp
 089  * can be used as a GPR if you really need it... 
 090  */
 091  page_pde_offset = (__PAGE_OFFSET >> 20);
 093   movl $(pg0 - __PAGE_OFFSET), %edi
 094   movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
 095   movl $0x007, %eax    /* 0x007 = PRESENT+RW+USER */
 096  10:
 097    leal 0x007(%edi),%ecx    /* Create PDE entry */
 098   movl %ecx,(%edx)    /* Store identity PDE entry */
 099   movl %ecx,page_pde_offset(%edx)   /* Store kernel PDE entry */
 100   addl $4,%edx
 101   movl $1024, %ecx
 102  11:
 103   stosl
 104   addl $0x1000,%eax
 105   loop 11b
 106   /* End condition: we must map up to and including INIT_MAP_BEYOND_END */
 107   /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
 108  leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
 109   cmpl %ebp,%eax
 110   jb 10b
 111  movl %edi,(init_pg_tables_end - __PAGE_OFFSET)
 113  #ifdef CONFIG_SMP
 156  3:
 157  #endif /* CONFIG_SMP */
 159  /*
 160  * Enable paging
 161  */
 162   movl $swapper_pg_dir-__PAGE_OFFSET,%eax
 163   movl %eax,%cr3   /* set the page table pointer.. */
 164   movl %cr0,%eax
 165   orl $0x80000000,%eax
 166   movl %eax,%cr0   /* ..and set paging (PG) bit */
 167   ljmp $__BOOT_CS,$1f  /* Clear prefetch and normalize %eip */
 168  1:
 169   /* Set up the stack pointer */
 170   lss stack_start,%esp
 177   pushl $0
 178   popfl
 180  #ifdef CONFIG_SMP
 181   andl %ebx,%ebx
 182   jz 1f     /* Initial CPU cleans BSS */
 183   jmp checkCPUtype
 184  1:
 185  #endif /* CONFIG_SMP */
 187  /*
 188  * start system 32-bit setup. We need to re-do some of the things done
 189  * in 16-bit mode for the "real" operations.
 190  */
 191   call setup_idt
 193  /*
 194  * Copy bootup parameters out of the way.
 195  * Note: %esi still has the pointer to the real-mode data.
 196  */
 197   movl $boot_params,%edi
 198   movl $(PARAM_SIZE/4),%ecx
 199   cld
 200   rep
 201   movsl
 202   movl boot_params+NEW_CL_POINTER,%esi
 203   andl %esi,%esi
 204   jnz 2f    # New command line protocol
 206   jne 1f
 207   movzwl OLD_CL_OFFSET,%esi
 208   addl $(OLD_CL_BASE_ADDR),%esi
 209  2:
 210   movl $saved_command_line,%edi
 211   movl $(COMMAND_LINE_SIZE/4),%ecx
 212   rep
 213   movsl
 214  1:
 215  checkCPUtype:
 279   lgdt cpu_gdt_descr
 280   lidt idt_descr
 303   call start_kernel

Line 57

This line is the 32-bit protected mode entry point for the kernel code. Currently, the code uses the provisional GDT.

Line 63

This code initializes the GDTR with the base address of the boot GDT. This boot GDT is the same as the provisional GDT used in setup.S (4GB code and data starting at address 0x00000000) and is used only by this boot code.

Lines 64-68

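Initialize the remaining segment registers with __BOOT_DS, which resolves to 24 (see include/asm-i386/segment.h). A selector holds the descriptor index in its upper 13 bits, so the value 24 (0x18) selects descriptor index 3 (24 >> 3, counting from 0), with a table indicator and requested privilege level of 0. At this point, the index refers to the data descriptor in the boot GDT; the final GDT is loaded later in this code.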
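The layout of an x86 segment selector can be made concrete with a short C sketch. The struct and function names are ours; the field positions (index in bits 15:3, table indicator in bit 2, requested privilege level in bits 1:0) follow the x86 selector format.

```c
#include <assert.h>
#include <stdint.h>

struct selector_fields {
    unsigned index;            /* descriptor index within the table */
    unsigned table_indicator;  /* 0 = GDT, 1 = LDT */
    unsigned rpl;              /* requested privilege level, 0..3 */
};

struct selector_fields decode_selector(uint16_t selector)
{
    struct selector_fields f;

    f.index           = selector >> 3;
    f.table_indicator = (selector >> 2) & 1;
    f.rpl             = selector & 3;
    return f;
}
```

Because each descriptor is 8 bytes long, a selector with the low three bits clear is also the byte offset of its descriptor from the start of the GDT.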

Lines 91-111

Create a page directory entry (PDE) in swapper_pg_dir that references a page table (pg0) with 0 based (identity) entries and duplicate PAGE_OFFSET (kernel memory) entries.
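The loop on lines 93 through 110 can be paraphrased in C as follows. This is a simplified sketch, not the kernel's code: it fills a fixed number of page tables instead of testing against INIT_MAP_BEYOND_END, it assumes __PAGE_OFFSET is 0xC0000000, and the physical address of the first page table is passed in as a hypothetical parameter.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES      1024u        /* entries per page directory or table */
#define PAGE_SIZE_4K 0x1000u
#define PAGE_ATTR    0x007u       /* PRESENT+RW+USER, as in the listing */
#define KERNEL_BASE  0xC0000000u  /* assumed value of __PAGE_OFFSET */

/* Fill ntables page tables with identity mappings starting at physical
 * address 0, and store each table in the directory twice: once for the
 * identity mapping and once at the kernel's virtual base. */
void build_boot_page_tables(uint32_t pgdir[ENTRIES],
                            uint32_t tables[][ENTRIES],
                            unsigned ntables, uint32_t tables_phys)
{
    uint32_t pte = PAGE_ATTR;                  /* maps physical page 0 */
    unsigned kernel_slot = KERNEL_BASE >> 22;  /* directory index 768 */

    assert(kernel_slot + ntables <= ENTRIES);
    for (unsigned t = 0; t < ntables; t++) {
        uint32_t pde = (tables_phys + t * PAGE_SIZE_4K) | PAGE_ATTR;

        pgdir[t] = pde;                        /* identity PDE entry */
        pgdir[kernel_slot + t] = pde;          /* kernel PDE entry */
        for (unsigned i = 0; i < ENTRIES; i++) {
            tables[t][i] = pte;                /* like stosl */
            pte += PAGE_SIZE_4K;               /* like addl $0x1000,%eax */
        }
    }
}
```

Each page table covers 4MB (1024 entries of 4KB), so two tables are enough to map 8MB.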

Lines 113-157

This code block initializes secondary (non-boot) processors to the page tables. For this discussion, we focus on the boot processor.

Lines 162-164

The cr3 register is the entry point for x86 hardware paging. This register is initialized to point to the base of the Page Directory, which in this case, is swapper_pg_dir.

Lines 165-168

Set the PG (paging) bit in cr0 of the boot processor. The PG bit enables the paging mechanism in the x86 architecture. The jump instruction (on line 167) is recommended when changing the PG bit to ensure that all instructions within the processor are serialized at the moment of entering or exiting paging mode.

Line 170

Initialize the stack to the start of the data segment (see also lines 401-403).

Lines 177-178

The eflags register is a read/write system register that contains the status of interrupts, modes, and permissions. This register is cleared by pushing a 0 onto the stack and directly popping it into the register with the popfl instruction.

Lines 180-185

The general-purpose register ebx is used as a flag to indicate whether the processor running this code is the boot processor. Because we are tracing this code as the boot processor, ebx has been cleared (0), and we jump to the call to setup_idt.

Line 191

The routine setup_idt initializes an Interrupt Descriptor Table (IDT) where each entry points to a dummy handler. The IDT, discussed in Chapter 7, "Scheduling and Kernel Synchronization," is a table of functions (or handlers) that are called when the processor needs to immediately execute time-critical code.

Lines 197-214

The user can pass certain parameters to Linux at boot time. They are stored here for later use.

Lines 215-303

The code listed on these lines does a large amount of necessary (but tedious) x86 processor-version checking and some minor initialization. By way of the cpuid instruction (or lack thereof), certain bits are set in the eflags register and cr0. One notable setting in cr0 is bit 4, the extension type (ET). This bit indicates the support of math-coprocessor instructions in older x86 processors. The most important lines of code in this block are lines 279-280. This is where the IDT and the GDT are loaded (by way of the lidt and lgdt instructions) into the idtr and gdtr registers. Finally, on line 303, we call the routine start_kernel().

With the code in head.S, the system can now map a logical address to a linear address and finally to a physical address (see Figure 8.10). Starting with a logical address, the selector (in the CS, DS, ES, etc., registers) references one of the descriptors in the GDT. The offset is the flat address that we seek. The base address from the descriptor and the offset are combined to form the linear address.

Figure 8.10. Boot-Time Paging

In the code walkthrough, we saw how the Page Directory (swapper_pg_dir) and Page Table (pg0) were created and that cr3 was initialized to point to the Page Directory. As previously discussed, the processor learns where to look for the paging components from the setting of cr3, and setting the PG bit in cr0 tells the processor to start using them. In the linear address, bits 31:22 select the Page Directory Entry (PDE), bits 21:12 select the Page Table Entry (PTE), and bits 11:0 give the offset into the physical page (4KB in this example).
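The split of a 32-bit address into its three paging fields can be sketched in C. This assumes the 4KB, two-level paging scheme described here; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct linear_parts {
    unsigned pde_index;  /* bits 31:22, index into the Page Directory */
    unsigned pte_index;  /* bits 21:12, index into the Page Table */
    unsigned offset;     /* bits 11:0, byte offset within the 4KB page */
};

struct linear_parts split_linear(uint32_t linear)
{
    struct linear_parts p;

    p.pde_index = (linear >> 22) & 0x3FF;
    p.pte_index = (linear >> 12) & 0x3FF;
    p.offset    = linear & 0xFFF;
    return p;
}
```

For example, the address 0xC0100ABC splits into directory index 768, table index 256, and offset 0xABC.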

The system now has 8MB of memory mapped out using a provisional paging system. The next step is to call the function start_kernel() in init/main.c.

8.3.3. PowerPC and x86 Code Convergence

Notice that both the PowerPC code and the x86 code have now converged on start_kernel() in init/main.c. This routine, which is located in the architecture-independent section of the code, calls architecture-specific routines to finish memory initialization.

The first function called in this file is setup_arch() in arch/i386/kernel/setup.c, which then calls paging_init() in arch/i386/mm/init.c, which then calls pagetable_init() in the same file. The remainder of system memory is allocated here to produce the final page tables.

In the PowerPC world, much has already been done. The setup_arch() function in arch/ppc/kernel/setup.c then calls paging_init() in arch/ppc/mm/init.c. The one notable function performed in paging_init() for PPC is to set all pages to be in the DMA zone.

8.4. Initial RAM Disk

LILO, GRUB, and Yaboot support the loading of the initial RAM disk (initrd). initrd acts as a root filesystem before the final root filesystem is loaded and initialized. We refer to the loading of the final root filesystem as pivoting the root.

This initial step allows Linux to initially come up with certain modules precompiled and then dynamically load other modules and drivers from initrd. The major difference to the bootloader is that it loads a minimal kernel and the RAM disk during Stage 2. The kernel initializes using the RAM disk, mounts the final root filesystem, and then removes the initrd.

initrd allows for

  • Configuring a kernel at boot time

  • Keeping a small general-purpose kernel

  • Having one kernel for several hardware configurations

The previously referenced stanzas are the most common for loading Linux with Yaboot, GRUB, and LILO. Each bootloader has a rich set of commands for its configuration file. For a customized or special function boot process, a quick Web search on GRUB and LILO configuration files yields good information on the subject.

Now that we have seen how the kernel is loaded and how memory initialization starts, let's look at the process of kernel initialization.

8.5. The Beginning: start_kernel()

This discussion begins with the jump to the start_kernel() (init/main.c) function, the first architecture-independent part of the code to be called.

With the jump to start_kernel(), we execute Process 0, which is otherwise known as the root thread. Process 0 spawns off Process 1, known as the init process. Process 0 then becomes the idle thread for the CPU. When /sbin/init is called, we have only those two processes running:

 396  asmlinkage void __init start_kernel(void)
 397  {
 398   char * command_line;
 399   extern char saved_command_line[];
 400   extern struct kernel_param __start___param[], __stop___param[];
 405   lock_kernel();
 406   page_address_init();
 407   printk(linux_banner);
 408   setup_arch(&command_line);
 409   setup_per_cpu_areas();
 415   smp_prepare_boot_cpu();
 422   sched_init();
 424   build_all_zonelists();
 425   page_alloc_init();
 426   printk("Kernel command line: %s\n", saved_command_line);
 427   parse_args("Booting kernel", command_line, __start___param,
 428     __stop___param - __start___param,
 429     &unknown_bootoption);
 430   sort_main_extable();
 431   trap_init();
 432   rcu_init();
 433   init_IRQ();
 434   pidhash_init();
 435   init_timers();
 436   softirq_init();
 437   time_init();
 444   console_init();
 445   if (panic_later)
 446    panic(panic_later, panic_param) ;
 447   profile_init();
 448   local_irq_enable();
 450   if (initrd_start && !initrd_below_start_ok &&
 451     initrd_start < min_low_pfn << PAGE_SHIFT) {
 452    printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
 453     "disabling it.\n",initrd_start,min_low_pfn << PAGE_SHIFT);
 454    initrd_start = 0;
 455   }
 456  #endif
 457   mem_init();
 458   kmem_cache_init();
 459   if (late_time_init)
 460    late_time_init();
 461   calibrate_delay();
 462   pidmap_init();
 463   pgtable_cache_init();
 464   prio_tree_init();
 465   anon_vma_init();
 466  #ifdef CONFIG_X86
 467   if (efi_enabled)
 468    efi_enter_virtual_mode();
 469  #endif
 470   fork_init(num_physpages);
 471   proc_caches_init();
 472   buffer_init();
 473   unnamed_dev_init();
 474   security_scaffolding_startup();
 475   vfs_caches_init(num_physpages);
 476   radix_tree_init();
 477   signals_init();
 478   /* rootfs populating might need page-writeback */
 479   page_writeback_init();
 480  #ifdef CONFIG_PROC_FS
 481   proc_root_init();
 482  #endif
 483   check_bugs();
 490   init_idle(current, smp_processor_id());
 493   rest_init();
 494  }

8.5.1. The Call to lock_kernel()

Line 405

In the 2.6 Linux kernel, the default configuration is to have a preemptible kernel. A preemptible kernel means that the kernel itself can be interrupted by a higher priority task, such as a hardware interrupt, and control is passed to the higher priority task. The kernel must save enough state so that it can return to executing when the higher priority task finishes.

Early versions of Linux implemented kernel preemption and SMP locking by using the Big Kernel Lock (BKL). Later versions of Linux correctly abstracted preemption into various calls, such as preempt_disable(). The BKL is still with us in the initialization process. It is a recursive spinlock that can be taken several times by a given CPU. A side effect of using the BKL is that it disables preemption, which is an important side effect during initialization.

Locking the kernel prevents it from being interrupted or preempted by any other task. Linux uses the BKL to do this. When the kernel is locked, no other process can execute. This is the antithesis of a preemptible kernel that can be interrupted at any point. In the 2.6 Linux kernel, we use the BKL to lock the kernel upon startup and initialize the various kernel objects without fear of being interrupted. The kernel is unlocked on line 493 within the rest_init() function. Thus, all of start_kernel() occurs with the kernel locked. Let's look at what happens in lock_kernel():

 42 static inline void lock_kernel(void)
 43 {
 44   int depth = current->lock_depth+1;
 45   if (likely(!depth))
 46     get_kernel_lock();
 47   current->lock_depth = depth;
 48 }

Lines 44-48

The init task has a special lock_depth of -1. This ensures that in multi-processor systems, different CPUs do not attempt to simultaneously grab the kernel lock. Because only one CPU runs the init task, only it can grab the big kernel lock because depth is 0 only for init (otherwise, depth is greater than 0). A similar trick is used in unlock_kernel() where we test (--current->lock_depth < 0). Let's see what happens in get_kernel_lock():
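The depth-counting trick can be modeled in user space with a short C sketch. It is not the kernel's implementation: the spinlock is replaced by a plain flag, there is only one task, and all names are hypothetical. It shows only that nested calls take the underlying lock once and release it once.

```c
#include <assert.h>

struct task {
    int lock_depth;          /* -1 means the task does not hold the lock */
};

int kernel_flag_held;        /* stand-in for the kernel_flag spinlock */
int acquire_count;           /* how many times the real lock was taken */

void sketch_lock_kernel(struct task *t)
{
    int depth = t->lock_depth + 1;

    if (depth == 0) {        /* outermost call: take the real lock */
        assert(!kernel_flag_held);
        kernel_flag_held = 1;
        acquire_count++;
    }
    t->lock_depth = depth;   /* nested calls only bump the counter */
}

void sketch_unlock_kernel(struct task *t)
{
    assert(t->lock_depth >= 0);
    if (--t->lock_depth < 0) /* outermost unlock: release the real lock */
        kernel_flag_held = 0;
}
```

Starting from a lock_depth of -1, two calls to sketch_lock_kernel() leave the depth at 1 with the lock taken once, and two matching unlock calls return the depth to -1 and release the lock.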

 10 extern spinlock_t kernel_flag;
 12 #define kernel_locked()   (current->lock_depth >= 0)
 14 #define get_kernel_lock()  spin_lock(&kernel_flag)
 15 #define put_kernel_lock()  spin_unlock(&kernel_flag)
 59 #define lock_kernel()       do { } while(0)
 60 #define unlock_kernel()       do { } while(0)
 61 #define release_kernel_lock(task)    do { } while(0)
 62 #define reacquire_kernel_lock(task)    do { } while(0)
 63 #define kernel_locked()       1  

Lines 10-15

These macros describe the big kernel locks that use standard spinlock routines. In multiprocessor systems, it is possible that two CPUs might try to access the same data structure. Spinlocks, which are explained in Chapter 7, prevent this kind of contention.

Lines 59-63

In the case where the kernel is not preemptible and not operating over multiple CPUs, we simply do nothing for lock_kernel() because nothing can interrupt us anyway.

The kernel has now seized the BKL and will not let go of it until the end of start_kernel(); as a result, all the following commands cannot be preempted.

8.5.2. The Call to page_address_init()

Line 406

page_address_init() is the first function called that is involved with the initialization of the memory subsystem in this architecture-independent portion of the code. The definition of page_address_init() varies according to three different compile-time parameter definitions. The first two result in page_address_init() being stubbed out to do nothing by defining the body of the function to be do { } while (0), as shown in the following code. The third is the operation we explore here in more detail. Let's look at the different definitions and discuss when they are enabled:

 376 #if defined(WANT_PAGE_VIRTUAL)
 382 #define page_address_init() do { } while(0)
 385 #if defined(HASHED_PAGE_VIRTUAL)
 388 void page_address_init(void);
 391 #if !defined(HASHED_PAGE_VIRTUAL) && !defined(WANT_PAGE_VIRTUAL)
 394 #define page_address_init() do { } while(0)

The #define for WANT_PAGE_VIRTUAL is set when the system has direct memory mapping, in which case simply calculating the virtual address of the memory location is sufficient to access the memory location. In cases where all of RAM is not mapped into the kernel address space (as is often the case when highmem is configured), we need a more involved way to acquire the memory address. This is why the initialization of page addressing is defined only in the case where HASHED_PAGE_VIRTUAL is set.

We now look at the case where the kernel has been told to use HASHED_PAGE_VIRTUAL and where we need to initialize the virtual memory that the kernel is using. Keep in mind that this happens only if highmem has been configured; that is, the amount of RAM the kernel can access is larger than that mapped by the kernel address space (generally 4GB).

In the process of following the function definition, various kernel objects are introduced and revisited. Table 8.2 shows the kernel objects introduced during the process of exploring page_address_init().

Table 8.2. Objects Introduced During the Call to page_address_init()

Object Name            Description
page_address_map       Struct
page_address_pool      Global variable
page_address_maps      Global variable
page_address_htable    Global variable

 510 static struct page_address_slot {
 511  struct list_head lh;    
 512 spinlock_t lock;    
 513 } ____cacheline_aligned_in_smp page_address_htable[1<<PA_HASH_ORDER]; 
 591 static struct page_address_map page_address_maps[LAST_PKMAP];
 593 void __init page_address_init(void)
 594 {
 595   int i;
 597   INIT_LIST_HEAD(&page_address_pool);
 598   for (i = 0; i < ARRAY_SIZE(page_address_maps); i++)    
 599     list_add(&page_address_maps[i].list, &page_address_pool)  ;
 600   for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
 601     INIT_LIST_HEAD(&page_address_htable[i].lh);
 602     spin_lock_init(&page_address_htable[i].lock);
 603   }
 604   spin_lock_init(&pool_lock);
 605 }

Line 597

The main purpose of this line is to initialize the page_address_pool global variable, which is a struct of type list_head and points to a list of free pages allocated from page_address_maps (line 591). Figure 8.11 illustrates page_address_pool.

Figure 8.11. Data Structures Surrounding the Page Address Map Pool

Lines 598-599

We add each list of pages in page_address_maps to the doubly linked list headed by page_address_pool. We describe the page_address_map structure in detail next.

Lines 600-603

We initialize each page address hash table's list_head and spinlock. The page_address_htable variable holds the list of entries that hash to the same bucket. Figure 8.12 illustrates the page address hash table.

Figure 8.12. Page Address Hash Table

Line 604

We initialize the page_address_pool's spinlock.

Let's look at the page_address_map structure to better understand the lists we just saw initialized. This structure's main purpose is to maintain the association between a page and its virtual address. This would be wasteful if the page had a linear association with its virtual address. It becomes necessary only if the addressing is hashed:

 490 struct page_address_map {
 491   struct page *page;
 492   void *virtual;
 493   struct list_head list;
 494 };

As you can see, the object keeps a pointer to the page structure that's associated with this page, a pointer to the virtual address, and a list_head struct to maintain its position in the doubly linked list of the page address list it is in.
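To illustrate how a pool of free map entries and a hash table of buckets work together, here is a simplified user-space sketch in C. It is not the kernel's code: it uses singly linked lists instead of list_head, omits all locking, and uses a made-up hash function. Every name in it (map_entry, map_set, map_lookup, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_ORDER 7                 /* hypothetical: 128 buckets */
#define NBUCKETS   (1u << HASH_ORDER)
#define POOL_SIZE  16                /* hypothetical number of map entries */

struct page { int unused; };         /* placeholder for a page descriptor */

struct map_entry {
    const struct page *page;         /* the page being tracked */
    void *virtual_addr;              /* where that page is mapped */
    struct map_entry *next;          /* next entry in pool or bucket */
};

struct map_entry pool_entries[POOL_SIZE];
struct map_entry *free_list;         /* plays the role of the pool */
struct map_entry *buckets[NBUCKETS]; /* plays the role of the hash table */

/* Put every entry on the free list and empty every bucket. */
void map_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        pool_entries[i].next = free_list;
        free_list = &pool_entries[i];
    }
    for (size_t i = 0; i < NBUCKETS; i++)
        buckets[i] = NULL;
}

static unsigned bucket_of(const struct page *page)
{
    return (unsigned)(((uintptr_t)page >> 4) & (NBUCKETS - 1));
}

/* Take an entry from the pool and hash it in. Returns -1 if the pool
 * is exhausted, 0 on success. */
int map_set(const struct page *page, void *virtual_addr)
{
    struct map_entry *e = free_list;
    unsigned b;

    assert(page != NULL);
    if (e == NULL)
        return -1;
    free_list = e->next;
    e->page = page;
    e->virtual_addr = virtual_addr;
    b = bucket_of(page);
    e->next = buckets[b];
    buckets[b] = e;
    return 0;
}

/* Walk the page's bucket looking for its entry. */
void *map_lookup(const struct page *page)
{
    for (struct map_entry *e = buckets[bucket_of(page)]; e; e = e->next)
        if (e->page == page)
            return e->virtual_addr;
    return NULL;
}
```

The pool bounds the number of simultaneous mappings, while the buckets keep each lookup short.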

8.5.3. The Call to printk(linux_banner)

Line 407

This call is responsible for the first console output made by the Linux kernel. This introduces the global variable linux_banner:

 31  const char *linux_banner = 
 32   "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"

The version.c file defines linux_banner as just shown. This string provides the user with a reference of the Linux kernel version, the gcc version it was compiled with, and the release.
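The banner is assembled at compile time: adjacent string literals, including those produced by macro expansion, are concatenated by the C compiler into one string. The sketch below shows the mechanism with made-up values; the macro values and EXAMPLE_HOST are hypothetical stand-ins, not what a real kernel build generates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for build-generated macros. */
#define UTS_RELEASE      "2.6.0-example"
#define LINUX_COMPILE_BY "builder"
#define EXAMPLE_HOST     "buildhost"

/* The compiler joins all adjacent literals into a single string. */
const char *example_banner =
    "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
    EXAMPLE_HOST ")\n";

size_t example_banner_length(void)
{
    return strlen(example_banner);
}
```

No run-time formatting is involved, which is why the string can be handed to printk() directly.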

8.5.4. The Call to setup_arch

Line 408

The setup_arch() function in arch/i386/kernel/setup.c is marked with __init (refer to Chapter 2 for a description of __init), so it runs only once, at system initialization time. The setup_arch() function takes in a pointer to any Linux command-line data entered at boot time and initializes many of the architecture-specific subsystems, such as memory, I/O, processors, and consoles:

 1083  void __init setup_arch(char **cmdline_p)
 1084  {
 1085   unsigned long max_low_pfn;
 1087   memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
 1088   pre_setup_arch_hook();
 1089   early_cpu_init();
 1091   /*
 1092   * FIXME: This isn't an official loader_type right
 1093   * now but does currently work with elilo.
 1094   * If we were configured as an EFI kernel, check to make
 1095   * sure that we were loaded correctly from elilo and that
 1096   * the system table is valid. If not, then initialize normally.
 1097   */
 1098  #ifdef CONFIG_EFI
 1099   if ((LOADER_TYPE == 0x50) && EFI_SYSTAB)
 1100    efi_enabled = 1;
 1101  #endif
 1103   ROOT_DEV = old_decode_dev(ORIG_ROOT_DEV);
 1104   drive_info = DRIVE_INFO;
 1105   screen_info = SCREEN_INFO;
 1106   edid_info = EDID_INFO;
 1107   apm_info.bios = APM_BIOS_INFO;
 1108   ist_info = IST_INFO;
 1109   saved_videomode = VIDEO_MODE;
 1110   if( SYS_DESC_TABLE.length != 0 ) {
 1111    MCA_bus = SYS_DESC_TABLE.table[3] &0x2;
 1112    machine_id = SYS_DESC_TABLE.table[0];
 1113    machine_submodel_id = SYS_DESC_TABLE.table[1];
 1114    BIOS_revision = SYS_DESC_TABLE.table[2];
 1115   }
 1116   aux_device_present = AUX_DEVICE_INFO; 
 1118  #ifdef CONFIG_BLK_DEV_RAM
 1119   rd_image_start = RAMDISK_FLAGS & RAMDISK_IMAGE_START_MASK;
 1120   rd_prompt = ((RAMDISK_FLAGS & RAMDISK_PROMPT_FLAG) != 0);
 1121   rd_doload = ((RAMDISK_FLAGS & RAMDISK_LOAD_FLAG) != 0);
 1122  #endif
 1123   ARCH_SETUP
 1124   if (efi_enabled)
 1125    efi_init();
 1126   else
 1127    setup_memory_region();
 1129   copy_edd();
 1131   if (!MOUNT_ROOT_RDONLY)
 1132    root_mountflags &= ~MS_RDONLY;
 1133   init_mm.start_code = (unsigned long) _text;
 1134   init_mm.end_code = (unsigned long) _etext;
 1135   init_mm.end_data = (unsigned long) _edata;
 1136   init_mm.brk = init_pg_tables_end + PAGE_OFFSET;
 1138   code_resource.start = virt_to_phys(_text);
 1139   code_resource.end = virt_to_phys(_etext)-1;
 1140   data_resource.start = virt_to_phys(_etext);
 1141   data_resource.end = virt_to_phys(_edata)-1;
 1143   parse_cmdline_early(cmdline_p);
 1145   max_low_pfn = setup_memory();
 1147   /*
 1148   * NOTE: before this point _nobody_ is allowed to allocate
 1149   * any memory using the bootmem allocator.
 1150   */
 1152  #ifdef CONFIG_SMP
 1153   smp_alloc_memory(); /* AP processor realmode stacks in low memory*/
 1154  #endif
 1155   paging_init();
 1158   {
 1159    char *s = strstr(*cmdline_p, "earlyprintk=");
 1160    if (s) {
 1161     extern void setup_early_printk(char *);
 1163     setup_early_printk(s);
 1164     printk("early console enabled\n");
 1165    }
 1166   }
 1167  #endif
 1170   dmi_scan_machine();
 1172  #ifdef CONFIG_X86_GENERICARCH
 1173   generic_apic_probe(*cmdline_p);
 1174  #endif  
 1175   if (efi_enabled)
 1176    efi_map_memmap();
 1178   /*
 1179   * Parse the ACPI tables for possible boot-time SMP configuration.
 1180   */
 1181   acpi_boot_init();
 1183  #ifdef CONFIG_X86_LOCAL_APIC
 1184   if (smp_found_config)
 1185    get_smp_config();
 1186  #endif
 1188  register_memory(max_low_pfn);
 1190  #ifdef CONFIG_VT
 1191  #if defined(CONFIG_VGA_CONSOLE)
 1192   if (!efi_enabled || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY))
 1193    conswitchp = &vga_con;
 1194  #elif defined(CONFIG_DUMMY_CONSOLE)
 1195   conswitchp = &dummy_con;
 1196  #endif
 1197  #endif
 1198  }

Line 1087

Copy new_cpu_data into boot_cpu_data, the cpuinfo_x86 struct filled in at boot time. This is similar for PPC.

Line 1088

Activate any machine-specific identification routines. This can be found in arch/xxx/machine-default/setup.c.

Line 1089

Identify the specific processor.

Lines 1103-1116

Get the system boot parameters.

Lines 1118-1122

Get RAM disk if set in arch/<arch>/defconfig.

Lines 1124-1127

Initialize Extensible Firmware Interface (if set in /defconfig) or just print out the BIOS memory map.

Line 1129

Save off Enhanced Disk Drive parameters from boot time.

Lines 1133-1141

Initialize memory-management structs from the BIOS-provided memory map.

Line 1143

Begin parsing out the Linux command line. (See arch/<arch>/kernel/setup.c.)

Line 1145

Initializes/reserves boot memory. (See arch/i386/kernel/setup.c.)

Lines 1153-1155

Get a page for SMP initialization (when CONFIG_SMP is set), then initialize paging beyond the 8MB that's already initialized in head.S. (See arch/i386/mm/init.c.)

Lines 1157-1167

Get printk() running even though the console is not fully initialized.

Line 1170

This line calls dmi_scan_machine(), which uses the Desktop Management Interface (DMI) to gather information about the specific system-hardware configuration from BIOS. (See arch/i386/kernel/dmi_scan.c.)

Lines 1172-1174

If the configuration calls for it, look for the APIC given on the command line. (See arch/i386/machine-generic/probe.c.)

Lines 1175-1176

If using Extensible Firmware Interface, remap the EFI memory map. (See arch/i386/kernel/efi.c.)

Line 1181

Look for local and I/O APICs. (See arch/i386/kernel/acpi/boot.c.) Locate and checksum System Description Tables. (See drivers/acpi/tables.c.) For a better understanding of ACPI, go to the ACPI4LINUX project on the Web.

Lines 1183-1186

Scan for SMP configuration. (See arch/i386/kernel/mpparse.c.) This section can also use ACPI for configuration information.

Line 1188

Request I/O and memory space for standard resources. (See arch/i386/kernel/std_resources.c for an idea of how resources are registered.)

Lines 1190-1197

Set up the VGA console switch structure. (See drivers/video/console/vgacon.c.)

A similar but shorter version of setup_arch() can be found in arch/ppc/kernel/setup.c for the PowerPC. This function initializes a large part of the ppc_md structure. A call to pmac_feature_init() in arch/ppc/platforms/pmac_feature.c does an initial probe and initialization of the pmac hardware.

8.5.5. The Call to setup_per_cpu_areas()

Line 409

The routine setup_per_cpu_areas() exists for the setup of a multiprocessing environment. If the Linux kernel is compiled without SMP support, setup_per_cpu_areas() is stubbed out to do nothing, as follows:

 317  static inline void setup_per_cpu_areas(void) { }

If the Linux kernel is compiled with SMP support, setup_per_cpu_areas() is defined as follows:

 327 static void __init setup_per_cpu_areas(void)
 328 {
 329   unsigned long size, i;
 330   char *ptr;
 331   /* Created by linker magic */
 332   extern char __per_cpu_start[], __per_cpu_end[];
 334   /* Copy section for each CPU (we discard the original) */
 335   size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
 336 #ifdef CONFIG_MODULES
 337   if (size < PERCPU_ENOUGH_ROOM)
 338     size = PERCPU_ENOUGH_ROOM;
 339 #endif
 341   ptr = alloc_bootmem(size * NR_CPUS);
 343   for (i = 0; i < NR_CPUS; i++, ptr += size) {
 344     __per_cpu_offset[i] = ptr - __per_cpu_start;
 345     memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
 346   }
 347 }

Lines 329-332

The variables for managing a consecutive block of memory are initialized. The "linker magic" variables are defined during linking in the appropriate architecture's kernel directory (for example, arch/i386/kernel/

Lines 334-341

We determine the size of memory a single CPU requires and allocate that memory for each CPU in the system as a single contiguous block of memory.

Lines 343-346

We cycle through the newly allocated memory, initializing each CPU's chunk of memory. Conceptually, we have taken a chunk of data that's valid for a single CPU (__per_cpu_start to __per_cpu_end) and copied it for each CPU on the system. This way, each CPU has its own data with which to play.
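
The replication step can be tried out in user space. The following is a minimal sketch, not the kernel's code: malloc() stands in for alloc_bootmem(), and every demo_ name, the CPU count, and the alignment value are assumptions made for illustration. Unlike the kernel listing, the sketch records each offset relative to the start of the allocated block, because subtracting pointers into unrelated objects is not portable C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NR_CPUS 4   /* hypothetical CPU count */
#define DEMO_ALIGN   32  /* hypothetical cache-line size, a power of two */

/* Byte offset of each CPU's private copy within the allocated block. */
static ptrdiff_t demo_offset[DEMO_NR_CPUS];

/* Round x up to the next multiple of a (a must be a power of two). */
static size_t demo_align(size_t x, size_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/*
 * Copy the template once per CPU into one contiguous block and record
 * where each copy starts. Returns the block (the caller frees it) or
 * NULL if the allocation fails.
 */
static char *demo_setup_per_cpu(const char *tmpl, size_t len)
{
    size_t size = demo_align(len, DEMO_ALIGN);
    char *block, *ptr;
    int i;

    assert(tmpl != NULL && len > 0);
    block = malloc(size * DEMO_NR_CPUS);
    if (block == NULL)
        return NULL;
    for (i = 0, ptr = block; i < DEMO_NR_CPUS; i++, ptr += size) {
        demo_offset[i] = ptr - block;
        memcpy(ptr, tmpl, len);
    }
    return block;
}

/* Return the start of the given CPU's private copy. */
static char *demo_per_cpu(char *block, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    return block + demo_offset[cpu];
}
```

With a 40-byte template, each copy is padded to 64 bytes, so CPU 1's copy starts 64 bytes into the block, and a write through demo_per_cpu(block, 0) leaves every other CPU's copy untouched.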

8.5.6. The Call to smp_prepare_boot_cpu()

Line 415

Similar to setup_per_cpu_areas(), smp_prepare_boot_cpu() is stubbed out when the Linux kernel does not support SMP:

 106 #define smp_prepare_boot_cpu()     do {} while (0)

However, if the Linux kernel is compiled with SMP support, we need to allow the booting CPU to access its console drivers and the per-CPU storage that we just initialized. Marking CPU bitmasks achieves this.

A CPU bitmask is defined as follows:

 10 #if NR_CPUS > BITS_PER_LONG && NR_CPUS != 1
 13 struct cpumask
 14 {
 15   unsigned long mask[CPU_ARRAY_SIZE];
 16 };

This means that we have a platform-independent bitmask that contains the same number of bits as the system has CPUs.

smp_prepare_boot_cpu() is implemented in the architecture-dependent section of the Linux kernel but, as the i386 code and then the PPC code show, the two versions are nearly identical:

 66 /* bitmap of online cpus */
 67 cpumask_t cpu_online_map;
 70 cpumask_t cpu_callout_map;
 1341 void __devinit smp_prepare_boot_cpu(void)
 1342 {
 1343   cpu_set(smp_processor_id(), cpu_online_map);
 1344   cpu_set(smp_processor_id(), cpu_callout_map);
 1345 }

 49 cpumask_t cpu_online_map;
 50 cpumask_t cpu_possible_map;
 331 void __devinit smp_prepare_boot_cpu(void)
 332 {
 333   cpu_set(smp_processor_id(), cpu_online_map);
 334   cpu_set(smp_processor_id(), cpu_possible_map);
 335 }

In both these functions, cpu_set() simply sets the bit smp_processor_id() in the cpumask_t bitmap. Setting a bit implies that the value of the set bit is 1.
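
To make the bit manipulation concrete, here is a minimal user-space sketch of a multiword CPU bitmask. It is not the kernel's cpumask implementation; the demo_ names and the CPU count of 130 are assumptions chosen so that the mask spans more than one unsigned long.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define DEMO_NR_CPUS 130  /* hypothetical; large enough to need several words */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_MASK_WORDS \
    ((DEMO_NR_CPUS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

struct demo_cpumask {
    unsigned long mask[DEMO_MASK_WORDS];
};

/* Clear every bit: no CPU is marked. */
static void demo_cpus_clear(struct demo_cpumask *m)
{
    memset(m, 0, sizeof(*m));
}

/* Set the bit that represents the given CPU. */
static void demo_cpu_set(unsigned int cpu, struct demo_cpumask *m)
{
    assert(cpu < DEMO_NR_CPUS);
    m->mask[cpu / DEMO_BITS_PER_LONG] |= 1UL << (cpu % DEMO_BITS_PER_LONG);
}

/* Return nonzero if the bit that represents the given CPU is set. */
static int demo_cpu_isset(unsigned int cpu, const struct demo_cpumask *m)
{
    assert(cpu < DEMO_NR_CPUS);
    return (int)((m->mask[cpu / DEMO_BITS_PER_LONG]
                  >> (cpu % DEMO_BITS_PER_LONG)) & 1UL);
}
```

The word index is the CPU number divided by the word width, and the bit index is the remainder, so marking CPU 129 touches only the last word of the array.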

8.5.7. The Call to sched_init()

Line 422

The call to sched_init() marks the initialization of all objects that the scheduler manipulates to manage the assignment of CPU time among the system's processes. Keep in mind that, at this point, only one process exists: the init process that currently executes sched_init():

 3896 void __init sched_init(void)
 3897 {
 3898   runqueue_t *rq;
 3899   int i, j, k;
 3919   for (i = 0; i < NR_CPUS; i++) {
 3920     prio_array_t *array;
 3922     rq = cpu_rq(i);
 3923     spin_lock_init(&rq->lock);
 3924     rq->active = rq->arrays;
 3925     rq->expired = rq->arrays + 1;
 3926     rq->best_expired_prio = MAX_PRIO;
 3938     for (j = 0; j < 2; j++) {
 3939       array = rq->arrays + j;
 3940       for (k = 0; k < MAX_PRIO; k++) {
 3941         INIT_LIST_HEAD(array->queue + k);
 3942         __clear_bit(k, array->bitmap);
 3943       }
 3944       // delimiter for bitsearch
 3945       __set_bit(MAX_PRIO, array->bitmap);
 3946     }
 3947   }
 3948   /*
 3949   * We have to do a little magic to get the first
 3950   * thread right in SMP mode.
 3951   */
 3952   rq = this_rq();
 3953   rq->curr = current;
 3954   rq->idle = current;
 3955   set_task_cpu(current, smp_processor_id());
 3956   wake_up_forked_process(current);
 3958   /*
 3959   * The boot idle thread does lazy MMU switching as well:
 3960   */
 3961   atomic_inc(&init_mm.mm_count);
 3962   enter_lazy_tlb(&init_mm, current);
 3963 }

Lines 3919-3926

Each CPU's run queue is initialized: the active queue, expired queue, and spinlock are all initialized in this segment. Recall from Chapter 7 that spin_lock_init() sets the spinlock to 1, which indicates that the data object is unlocked.

Figure 8.13 illustrates the initialized run queue.

Figure 8.13. Initialized Run Queue rq

Lines 3938-3947

For each possible priority, we initialize the list associated with the priority and clear all bits in the bitmap to show that no process is on that queue. (If all this is confusing, refer to Figure 8.14. Also, see Chapter 7 for an overview of how the scheduler manages its run queues.) This code chunk just ensures that everything is ready for the introduction of a process. As of line 3947, the scheduler is in the position to know that no processes exist; it ignores the current and idle processes for now.

Figure 8.14. rq->arrays
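
The role of the delimiter bit is easier to see in a small user-space sketch. The code below is an illustration under stated assumptions, not the scheduler's code: the demo_ names are hypothetical, a plain counter stands in for each list head, and the value 140 for the number of priority levels is assumed.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define DEMO_MAX_PRIO 140  /* assumed number of priority levels */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* One extra bit holds the delimiter at index DEMO_MAX_PRIO. */
#define DEMO_BITMAP_WORDS \
    ((DEMO_MAX_PRIO + 1 + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

struct demo_prio_array {
    int nr_queued[DEMO_MAX_PRIO];  /* stand-in for the per-priority lists */
    unsigned long bitmap[DEMO_BITMAP_WORDS];
};

static void demo_set_bit(unsigned int nr, unsigned long *map)
{
    map[nr / DEMO_BITS_PER_LONG] |= 1UL << (nr % DEMO_BITS_PER_LONG);
}

/* Empty every queue, clear every priority bit, then set the delimiter. */
static void demo_prio_array_init(struct demo_prio_array *a)
{
    memset(a, 0, sizeof(*a));
    demo_set_bit(DEMO_MAX_PRIO, a->bitmap);
}

/* Record one runnable task at the given priority. */
static void demo_enqueue(struct demo_prio_array *a, unsigned int prio)
{
    assert(prio < DEMO_MAX_PRIO);
    a->nr_queued[prio]++;
    demo_set_bit(prio, a->bitmap);
}

/* Index of the lowest set bit; the delimiter guarantees the scan stops. */
static unsigned int demo_first_prio(const struct demo_prio_array *a)
{
    unsigned int nr = 0;

    while (!((a->bitmap[nr / DEMO_BITS_PER_LONG]
              >> (nr % DEMO_BITS_PER_LONG)) & 1UL))
        nr++;
    return nr;
}
```

On a freshly initialized array, the search returns DEMO_MAX_PRIO, which a caller can read as "no runnable task." After a task is queued at priority 120 and another at priority 5, the search returns 5.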

Lines 3952-3956

We add the current process to the current CPU's run queue and call wake_up_forked_process() on ourselves to initialize current into the scheduler. Now, the scheduler knows that exactly one process exists: the init process.

Lines 3961-3962

Lazy MMU switching, when enabled, allows a multiprocessor Linux system to perform context switches at a faster rate. A TLB is a translation lookaside buffer that contains recent page address translations. Flushing the TLB takes a long time, so we avoid flushing it when possible. enter_lazy_tlb() ensures that the mm_struct init_mm isn't being used across multiple CPUs and can be lazily switched. On a uniprocessor system, this becomes a NULL function.

The sections that were omitted in the previous code deal with initialization of SMP machines. As a quick overview, those sections bootstrap each CPU to the default settings necessary to allow for load balancing, group scheduling, and thread migration. They are omitted here for clarity and brevity.

8.5.8. The Call to build_all_zonelists()

Line 424

The build_all_zonelists() function splits up the memory according to the zone types ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. As mentioned in Chapter 4, zones are linear separations of physical memory that are used mainly to address hardware limitations. Suffice it to say that this is the function where these memory zones are built. After the zones are built, pages are stored in page frames that fall within zones.

The call to build_all_zonelists() introduces numnodes and NODE_DATA. The global variable numnodes holds the number of nodes (or partitions) of physical memory.

The partitions are determined according to CPU access time. Note that, at this point, the page tables have already been fully set up:

 1345  void __init build_all_zonelists(void)
 1346  {
 1347   int i;
 1349   for(i = 0 ; i < numnodes ; i++)
 1350    build_zonelists(NODE_DATA(i));
 1351   printk("Built %i zonelists\n", numnodes);
 1352  }

build_all_zonelists() calls build_zonelists() once for each node and finishes by printing out the number of zonelists created. This book does not go into more detail regarding nodes. Suffice it to say that, in our one-CPU example, numnodes is equal to 1, and each node can have all three types of zones. The NODE_DATA macro returns the node's descriptor from the node descriptor list.

8.5.9. The Call to page_alloc_init

Line 425

The function page_alloc_init() simply registers a function in a notifier chain.[6] The function-registered page_alloc_cpu_notify() is a page-draining function[7] associated with dynamic CPU configuration.

[6] Chapter 2 discusses notifier chains.

[7] Page draining refers to removing pages that are in use by a CPU that will no longer be used.

Dynamic CPU configuration refers to bringing up and down CPUs during the running of the Linux system, an event referred to as "hotplugging the CPU." Although technically, CPUs are not physically inserted and removed during machine operation, they can be turned on and off in some systems, such as the IBM p-Series 690. Let's look at the function:

 1787  #ifdef CONFIG_HOTPLUG_CPU
 1788  static int page_alloc_cpu_notify(struct notifier_block *self,
 1789      unsigned long action, void *hcpu)
 1790  {
 1791   int cpu = (unsigned long)hcpu;
 1792   long *count;
 1794   if (action == CPU_DEAD) {
 1796    count = &per_cpu(nr_pagecache_local, cpu);
 1797    atomic_add(*count, &nr_pagecache);
 1798    *count = 0;
 1799    local_irq_disable();
 1800    __drain_pages(cpu);
 1801    local_irq_enable();
 1802   }
 1803   return NOTIFY_OK;
 1804  }
 1805  #endif /* CONFIG_HOTPLUG_CPU */
 1807  void __init page_alloc_init(void)
 1808  {
 1809   hotcpu_notifier(page_alloc_cpu_notify, 0);
 1810  }

Line 1809

This line registers the page_alloc_cpu_notify() routine in the CPU hotplug notifier chain. The hotcpu_notifier() routine creates a notifier_block that points to the page_alloc_cpu_notify() function with a priority of 0 and then registers that object in the cpu_chain notifier chain (kernel/cpu.c).

Line 1788

page_alloc_cpu_notify() has the parameters that correspond to a notifier call, as Chapter 2 explained. The system-specific pointer points to an integer that specifies the CPU number.

Lines 1794-1802

If the CPU is dead, free up its pages. The variable action is set to CPU_DEAD when a CPU is brought down. (See drain_pages() in this same file.)
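
The registration and callback pattern can be sketched outside the kernel. The following is a minimal illustration, not the kernel's notifier API: all demo_ names, the event code, and the counter values are hypothetical, and keeping the chain sorted by descending priority is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CPU_DEAD  1UL  /* hypothetical event code */
#define DEMO_NOTIFY_OK 0

struct demo_notifier {
    int (*call)(struct demo_notifier *self, unsigned long action, void *data);
    int priority;
    struct demo_notifier *next;
};

/* Insert nb so that the chain stays sorted by descending priority. */
static void demo_notifier_register(struct demo_notifier **chain,
                                   struct demo_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    while (*chain != NULL && (*chain)->priority >= nb->priority)
        chain = &(*chain)->next;
    nb->next = *chain;
    *chain = nb;
}

/* Invoke every registered callback in chain order. */
static void demo_notifier_call(struct demo_notifier *chain,
                               unsigned long action, void *data)
{
    for (; chain != NULL; chain = chain->next)
        chain->call(chain, action, data);
}

static long demo_local_count[4] = { 7, 3, 0, 5 };  /* per-CPU counters */
static long demo_global_count;

/* On CPU_DEAD, fold the dead CPU's local count into the global count. */
static int demo_cpu_notify(struct demo_notifier *self,
                           unsigned long action, void *hcpu)
{
    int cpu = (int)(uintptr_t)hcpu;  /* the CPU number travels in the pointer */

    (void)self;
    if (action == DEMO_CPU_DEAD) {
        demo_global_count += demo_local_count[cpu];
        demo_local_count[cpu] = 0;
    }
    return DEMO_NOTIFY_OK;
}
```

Registering demo_cpu_notify with priority 0 and then delivering DEMO_CPU_DEAD for CPU 1 moves that CPU's count of 3 into the global counter and leaves the other CPUs' counters alone.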

8.5.10. The Call to parse_args()

Line 427

The parse_args() function parses the arguments passed to the Linux kernel.

For example, nfsroot is a kernel parameter that sets the NFS root filesystem for systems without disks. You can find a complete list of kernel parameters in Documentation/kernel-parameters.txt:

 116 int parse_args(const char *name,
 117    char *args,
 118    struct kernel_param *params,
 119    unsigned num,
 120    int (*unknown)(char *param, char *val))
 121 {
 122   char *param, *val;
 124   DEBUGP("Parsing ARGS: %s\n", args);
 126   while (*args) {
 127     int ret;
 129     args = next_arg(args, &param, &val);
 130     ret = parse_one(param, val, params, num, unknown);
 131     switch (ret) {
 132     case -ENOENT:
 133       printk(KERN_ERR "%s: Unknown parameter '%s'\n",
 134        name, param);
 135       return ret;
 136     case -ENOSPC:
 137       printk(KERN_ERR
 138        "%s: '%s' too large for parameter '%s'\n",
 139        name, val ?: "", param);
 140       return ret;
 141     case 0:
 142       break;
 143     default:
 144       printk(KERN_ERR
 145        "%s: '%s' invalid for parameter '%s'\n",
 146        name, val ?: "", param);
 147       return ret;
 148     }
 149   }
 151   /* All parsed OK. */
 152   return 0;
 153 }

Lines 116-125

The parameters passed to parse_args() are the following:

  • name. A character string to be displayed if any errors occur while the kernel attempts to parse the kernel parameter arguments. In standard operation, this means that an error message, "Booting Kernel: Unknown parameter X," is displayed.

  • args. The kernel parameter list of form foo=bar,bar2 baz=fuz wix.

  • params. Points to the kernel parameter structure that holds all the valid parameters for the specific kernel. Depending on how a kernel was compiled, some parameters might exist and others might not.

  • num. The number of kernel parameters in this specific kernel, not the number of arguments in args.

  • unknown. Points to a function to call if a kernel parameter is specified that is not recognized.

Lines 126-153

We loop through the string args, set param to point to the first parameter, and set val to the first value (if any; val could be null). This is done via next_arg(). For example, on the first call to next_arg() with args being foo=bar,bar2 baz=fuz wix, we set param to foo and val to bar,bar2. The space after bar2 is overwritten with a \0, and args is set to point at the beginning character of baz.

We pass our pointers param and val into parse_one(), which does the work of setting the actual kernel parameter data structures:

 46 static int parse_one(char *param,
 47      char *val,
 48      struct kernel_param *params,
 49      unsigned num_params,
 50      int (*handle_unknown)(char *param, char *val))
 51 {
 52   unsigned int i;
 54   /* Find parameter */
 55   for (i = 0; i < num_params; i++) {
 56     if (parameq(param, params[i].name)) {
 57       DEBUGP("They are equal! Calling %p\n",
 58        params[i].set);
 59       return params[i].set(val, &params[i]);
 60     }
 61   }
 63   if (handle_unknown) {
 64     DEBUGP("Unknown argument: calling %p\n", handle_unknown);
 65     return handle_unknown(param, val);
 66   }
 68   DEBUGP("Unknown argument '%s'\n", param);
 69   return -ENOENT;
 70 }

Lines 46-54

These parameters are the same as those described under parse_args() with param and val pointing to a subsection of args.

Lines 55-61

We loop through the defined kernel parameters to see if any match param. If we find a match, we use val to call the associated set function. Thus, the set function handles multiple, or null, arguments.

Lines 62-66

If the kernel parameter was not found, we call the handle_unknown() function that was passed in via parse_args().

After parse_one() is called for each parameter-value combination specified in args, we have set the kernel parameters and are ready to continue starting the Linux kernel.
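
The whole tokenize-and-dispatch cycle can be reproduced in a few lines of user-space C. This is a simplified sketch, not the kernel's implementation: the demo_ names are hypothetical, the tokenizer does not attempt any quoting rules, names are compared with strcmp(), and -1 stands in for -ENOENT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_param {
    const char *name;
    int (*set)(const char *val, struct demo_param *kp);
    const char *value;  /* last value stored by the set function */
};

/* A set function that simply remembers the value (which may be NULL). */
static int demo_set_string(const char *val, struct demo_param *kp)
{
    kp->value = val;
    return 0;
}

/*
 * Split off the first space-separated token of args in place. *param
 * points at the name and *val at the text after the first '=' (NULL if
 * there is none). Returns a pointer to the start of the next token.
 */
static char *demo_next_arg(char *args, char **param, char **val)
{
    char *end = args + strcspn(args, " ");
    char *eq;

    if (*end != '\0') {
        *end++ = '\0';            /* overwrite the separating space */
        end += strspn(end, " ");  /* skip any further spaces */
    }
    eq = strchr(args, '=');
    *param = args;
    *val = NULL;
    if (eq != NULL) {
        *eq = '\0';
        *val = eq + 1;
    }
    return end;
}

/* Look up param in the table and call its set function; -1 if unknown. */
static int demo_parse_one(const char *param, const char *val,
                          struct demo_param *params, unsigned int num)
{
    unsigned int i;

    for (i = 0; i < num; i++)
        if (strcmp(param, params[i].name) == 0)
            return params[i].set(val, &params[i]);
    return -1;
}

/* Parse every token of args; stop at the first error. */
static int demo_parse_args(char *args, struct demo_param *params,
                           unsigned int num)
{
    char *param, *val;

    assert(args != NULL && params != NULL);
    args += strspn(args, " ");
    while (*args) {
        int ret;

        args = demo_next_arg(args, &param, &val);
        ret = demo_parse_one(param, val, params, num);
        if (ret != 0)
            return ret;
    }
    return 0;
}
```

Given the string foo=bar,bar2 baz=fuz wix and a table that lists foo, baz, and wix, the sketch stores bar,bar2 for foo, fuz for baz, and a null value for wix. A name that is missing from the table makes demo_parse_args() return -1.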

8.5.11. The Call to trap_init()

Line 431

In Chapter 3, we introduced exceptions and interrupts. The function trap_init() is specific to the handling of interrupts in the x86 architecture. Briefly, this function initializes a table referenced by the x86 hardware. Each element in the table has a function to handle kernel or user-related issues, such as an invalid instruction or reference to a page not currently in memory. Although the PowerPC can have these same issues, its architecture handles them in a somewhat different manner. (Again, all this is discussed in Chapter 3.)

8.5.12. The Call to rcu_init()

Line 432

The rcu_init() function initializes the Read-Copy-Update (RCU) subsystem of the Linux 2.6 kernel. RCU controls access to critical sections of code and enforces mutual exclusion in systems where the cost of acquiring locks becomes significant in comparison to the chip speed. The Linux implementation of RCU is beyond the scope of this book. We occasionally mention calls to the RCU subsystem in our code analysis, but the specifics are left out. For more information on the Linux RCU subsystem, consult the Linux Scalability Effort pages at

 297 void __init rcu_init(void)
 298 {
 299   rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
 300       (void *)(long)smp_processor_id());
 301   /* Register notifier for non-boot CPUs */
 302   register_cpu_notifier(&rcu_nb);
 303 }

8.5.13. The Call to init_IRQ()

Line 433

The function init_IRQ() in arch/i386/kernel/i8259.c initializes the hardware interrupt controller, the interrupt vector table and, if on x86, the system timer. Chapter 3 includes a thorough discussion of interrupts for both x86 and PPC, where the Real-Time Clock is used as an interrupt example:

 410 void __init init_IRQ(void)
 411 {
 412  int i;
 422  for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) {
 423   int vector = FIRST_EXTERNAL_VECTOR + i;
 424   if (i >= NR_IRQS)
 425    break;
 430   if (vector != SYSCALL_VECTOR) 
 431    set_intr_gate(vector, interrupt[i]);
 432  }
 437  intr_init_hook();
 443  setup_timer();
 449  if (boot_cpu_data.hard_math && !cpu_has_fpu)
 450   setup_irq(FPU_IRQ, &fpu_irq);
 451 }

Lines 422-432

Initialize the interrupt vectors. This associates the x86 (hardware) IRQs with the appropriate handling code.

Line 437

Set up machine-specific IRQs, such as the Advanced Programmable Interrupt Controller (APIC).

Line 443

Initialize the timer clock.

Lines 449-450

Set up for FPU if needed.

The following code is the PPC implementation of init_IRQ():

 700  void __init init_IRQ(void)
 701  {
 702   int i;
 704   for (i = 0; i < NR_IRQS; ++i)
 705    irq_affinity[i] = DEFAULT_CPU_AFFINITY;
 707   ppc_md.init_IRQ();
 708  }

Line 704

In multiprocessor systems, an interrupt can have an affinity for a specific processor.

Line 707

For a PowerMac platform, this routine is found in arch/ppc/platforms/pmac_pic.c. It sets up the Programmable Interrupt Controller (PIC) portion of the I/O controller.

8.5.14. The Call to softirq_init()

Line 436

The softirq_init() function prepares the boot CPU to accept notifications from tasklets. Let's look at the internals of softirq_init():

 317 void __init softirq_init(void)
 318 {
 319   open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
 320   open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
 321 }

 327 void __init softirq_init(void)
 328 {
 329   open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
 330   open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
 331   tasklet_cpu_notify(&tasklet_nb, (unsigned long)CPU_UP_PREPARE,
 332         (void *)(long)smp_processor_id());
 333   register_cpu_notifier(&tasklet_nb);
 334 }

Lines 319-320

We initialize the actions to take when we get a TASKLET_SOFTIRQ or HI_SOFTIRQ interrupt. As we pass in NULL, we are telling the Linux kernel to call tasklet_action(NULL) and tasklet_hi_action(NULL) (in the cases of Line 319 and Line 320, respectively). The following implementation of open_softirq() shows how the Linux kernel stores the tasklet initialization information:

 177 void open_softirq(int nr, void (*action)(struct softirq_action*),
 void * data)
 178 {
 179   softirq_vec[nr].data = data;
 180   softirq_vec[nr].action = action;
 181 }
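
To see how such a table might be consumed, consider the following user-space sketch of a softirq-style vector with a pending bitmask. It is an illustration only: the demo_ names and the softirq numbering are hypothetical, and the dispatch side (which the listing above does not show) is an assumption of the sketch, including the choice to hand each handler a pointer to its own table entry.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_HI_SOFTIRQ, DEMO_TASKLET_SOFTIRQ, DEMO_NR_SOFTIRQS };

struct demo_softirq_action {
    void (*action)(struct demo_softirq_action *);
    void *data;
};

static struct demo_softirq_action demo_softirq_vec[DEMO_NR_SOFTIRQS];
static unsigned int demo_pending;          /* bit n set: softirq n is raised */
static int demo_runs[DEMO_NR_SOFTIRQS];    /* how often each handler ran */

/* Store the handler and its data in the table slot for softirq nr. */
static void demo_open_softirq(int nr,
                              void (*action)(struct demo_softirq_action *),
                              void *data)
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    demo_softirq_vec[nr].data = data;
    demo_softirq_vec[nr].action = action;
}

/* Mark softirq nr as pending. */
static void demo_raise_softirq(int nr)
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    demo_pending |= 1U << nr;
}

/* Run the handler of every pending softirq, lowest number first. */
static void demo_do_softirq(void)
{
    unsigned int pending = demo_pending;
    struct demo_softirq_action *h = demo_softirq_vec;

    demo_pending = 0;
    for (; pending != 0; pending >>= 1, h++)
        if ((pending & 1U) && h->action != NULL)
            h->action(h);
}

/* Example handler: count how often its own slot was dispatched. */
static void demo_count_action(struct demo_softirq_action *a)
{
    demo_runs[a - demo_softirq_vec]++;
}
```

After both slots are opened with demo_count_action and only DEMO_TASKLET_SOFTIRQ is raised, one call to demo_do_softirq() runs that handler exactly once; a second call finds nothing pending.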

8.5.15. The Call to time_init()

Line 437

The function time_init() selects and initializes the system timer. This function, like trap_init(), is very architecture dependent; Chapter 3 covered this when we explored timer interrupts. The system timer gives Linux its temporal view of the world, which allows it to schedule when a task should run and for how long. The High Performance Event Timer (HPET) from Intel will be the successor to the 8254 PIT and RTC hardware. The HPET uses memory-mapped I/O, which means that the HPET control registers are accessed as if they were memory locations. Memory must be configured properly to access I/O regions. If the HPET is set in arch/i386/defconfig.h, time_init() needs to be delayed until after mem_init() has set up memory regions. See the following code:

 376 void __init time_init(void)
 377 {
 378 #ifdef CONFIG_HPET_TIMER
 379  if (is_hpet_capable()) {
 380   late_time_init = hpet_time_init;
 381   return;
 382  }
 387 #endif
 388  xtime.tv_sec = get_cmos_time();
 389  wall_to_monotonic.tv_sec = -xtime.tv_sec;
 390  xtime.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
 391  wall_to_monotonic.tv_nsec = -xtime.tv_nsec;
 393  cur_timer = select_timer();
 394  printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name);
 396  time_init_hook();
 397 }  

Lines 379-387

If the HPET is configured, time_init() must run after memory has been initialized. The code for late_time_init() (on lines 358-373) is the same as time_init().

Lines 388-391

Initialize the xtime time structure used for holding the time of day.

Line 393

Select the first timer that initializes. This can be overridden. (See arch/i386/kernel/timers/timer.c.)

8.5.16. The Call to console_init()

Line 444

A computer console is a device to which the kernel (and other parts of a system) output messages. It also has login capabilities. Depending on the system, the console can be on the monitor or through a serial port. The function console_init() is an early call to initialize the console device, which allows for boot-time reporting of status:

 2347 void __init console_init(void)
 2348 {
 2349  initcall_t *call;
 2352  (void) tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY);
 2358 #ifdef CONFIG_EARLY_PRINTK
 2359   disable_early_printk();
 2360 #endif
 2366  call = &__con_initcall_start;
 2367  while (call < &__con_initcall_end) {
 2368   (*call)();
 2369   call++;
 2370  }
 2371 }  

Line 2352

Set up the line discipline.

Line 2359

Keep the early printk support if desired. Early printk support allows the system to report status during the boot process before the system console is fully initialized. It specifically initializes a serial port (ttyS0, for example) or the system's VGA to a minimum functionality. Early printk support is started in setup_arch(). (For more information, see the code discussion on line 408 in this section and the files /kernel/printk.c and /arch/i386/kernel/early_printk.c.)

Line 2366

Initialize the console.

8.5.17. The Call to profile_init()

Line 447

profile_init() allocates memory for the kernel to store profiling data in. Profiling is the term used in computer science to describe data collection during program execution. Profiling data is used to analyze performance and otherwise study the program being executed (in our case, the Linux kernel itself):

 30 void __init profile_init(void)
 31 {
 32   unsigned int size;
 34   if (!prof_on)
 35     return;
 37   /* only text is profiled */
 38   prof_len = _etext - _stext;
 39   prof_len >>= prof_shift;
 41   size = prof_len * sizeof(unsigned int) + PAGE_SIZE - 1;
 42   prof_buffer = (unsigned int *) alloc_bootmem(size);
 43 }

Lines 34-35

Don't do anything if kernel profiling is not enabled.

Lines 38-39

_etext and _stext are defined in kernel/head.S. We determine the profile length as delimited by _etext and _stext and then shift the value by prof_shift, which was defined as a kernel parameter.

Lines 41-42

We allocate a contiguous block of memory for storing profiling data of the size requested by the kernel parameters.
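
The arithmetic is worth working through once. The sketch below mirrors the formulas in the listing in user space; the demo_ names, the 4KB page size, and the slot lookup helper (which is not part of profile_init()) are assumptions added for illustration.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL  /* assumed page size */

/* Number of counters needed to cover [stext, etext) at 2^shift bytes each. */
static unsigned long demo_prof_len(unsigned long stext, unsigned long etext,
                                   unsigned int shift)
{
    assert(etext >= stext);
    return (etext - stext) >> shift;
}

/* Bytes requested for the counter array, following the listing's formula. */
static unsigned long demo_prof_bytes(unsigned long prof_len)
{
    return prof_len * sizeof(unsigned int) + DEMO_PAGE_SIZE - 1;
}

/*
 * Counter index for a sampled program counter, or prof_len if the sample
 * falls outside the profiled text range.
 */
static unsigned long demo_prof_slot(unsigned long pc, unsigned long stext,
                                    unsigned long prof_len, unsigned int shift)
{
    unsigned long slot;

    if (pc < stext)
        return prof_len;
    slot = (pc - stext) >> shift;
    return slot < prof_len ? slot : prof_len;
}
```

For a 2MB text section starting at 0x100000 and a shift of 2, the sketch yields 0x80000 counters, and a sample taken 16 bytes into the text lands in slot 4. A larger shift trades resolution for a smaller buffer.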

8.5.18. The Call to local_irq_enable()

Line 448

The function local_irq_enable() allows interrupts on the current CPU. It is usually paired with local_irq_disable(). In previous kernel versions, the sti() and cli() pair was used for this purpose. Although these macros still resolve to sti() and cli(), the keyword to note here is local. These affect only the currently running processor:

 446  #define local_irq_disable()  __asm__ __volatile__("cli": : :"memory")
 447  #define local_irq_enable()  __asm__ __volatile__("sti": : :"memory")

Lines 446-447

Referring to the "Inline Assembly" section in Chapter 2, the item in the quotes is the assembly instruction and memory is on the clobber list.

8.5.19. initrd Configuration

Lines 449-456

This #ifdef statement is a sanity check on initrd, the initial RAM disk.

A system using initrd loads the kernel and mounts the initial RAM disk as the root filesystem. Programs can run from this RAM disk and, when the time comes, a new root filesystem, such as the one on a hard drive, can be mounted and the initial RAM disk unmounted.

This operation simply checks to ensure that the initial RAM disk specified is valid. If it isn't, we set initrd_start to 0, which tells the kernel to not use an initial RAM disk.[8]

[8] For more information, refer to Documentation/initrd.txt.

8.5.20. The Call to mem_init()

Line 457

For both x86 and PPC, the call to mem_init() finds all free pages and sends that information to the console. Recall from Chapter 4 that the Linux kernel breaks available memory into zones. Currently, Linux has three zones:

  • ZONE_DMA. Memory less than 16MB.

  • ZONE_NORMAL. Memory starting at 16MB but less than 896MB. (The kernel uses the last 128MB.)

  • ZONE_HIGHMEM. Memory at 896MB and above.

The function mem_init() finds the total number of free page frames in all the memory zones. This function prints out informational kernel messages regarding the beginning state of the memory. This function is architecture dependent because it manages early memory allocation data. Each architecture supplies its own function, although they all perform the same tasks. We first look at how x86 does it and follow it up with PPC:

----------------------------------------------------------------------
arch/i386/mm/init
 445 void __init mem_init(void)
 446 {
 447   extern int ppro_with_ram_bug(void);
 448   int codesize, reservedpages, datasize, initsize;
 449   int tmp;
 450   int bad_ppro;
 ...
 459 #ifdef CONFIG_HIGHMEM
 460   if (PKMAP_BASE+LAST_PKMAP*PAGE_SIZE >= FIXADDR_START) {
 461     printk(KERN_ERR "fixmap and kmap areas overlap - this will crash\n");
 462     printk(KERN_ERR "pkstart: %lxh pkend:%lxh fixstart %lxh\n",
 463       PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE, FIXADDR_START);
 464     BUG();
 465   }
 466 #endif
 467
 468   set_max_mapnr_init();
 ...
 476   /* this will put all low memory onto the freelists */
 477   totalram_pages += __free_all_bootmem();
 478
 479
 480   reservedpages = 0;
 481   for (tmp = 0; tmp < max_low_pfn; tmp++)
 ...
 485     if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))
 486       reservedpages++;
 487
 488   set_highmem_pages_init(bad_ppro);
 490   codesize = (unsigned long) &_etext - (unsigned long) &_text;
 491   datasize = (unsigned long) &_edata - (unsigned long) &_etext;
 492   initsize = (unsigned long) &__init_end - (unsigned long) &__init_begin;
 493
 494   kclist_add(&kcore_mem, __va(0), max_low_pfn << PAGE_SHIFT);
 495   kclist_add(&kcore_vmalloc, (void *)VMALLOC_START,
 496     VMALLOC_END-VMALLOC_START);
 497
 498   printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init, %ldk highmem)\n",
 499     (unsigned long) nr_free_pages() << (PAGE_SHIFT-10),
 500     num_physpages << (PAGE_SHIFT-10),
 501     codesize >> 10,
 502     reservedpages << (PAGE_SHIFT-10),
 503     datasize >> 10,
 504     initsize >> 10,
 505     (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10))
 506     );
 ...
 521 #ifndef CONFIG_SMP
 522   zap_low_mappings();
 523 #endif
 524 }
----------------------------------------------------------------------

Line 459

This line is a straightforward error check so that fixed map and kernel map do not overlap.

Line 468

The function set_max_mapnr_init() (arch/i386/mm/init.c) simply sets the value of num_physpages, which is a global variable (defined in mm/memory.c) that holds the number of available page frames.

Line 477

The call to __free_all_bootmem() marks the freeing up of all low-memory pages. During boot time, all pages are reserved. At this late point in the bootstrapping phase, the available low-memory pages are released. The flow of the function calls is shown in Figure 8.15.

Figure 8.15. __free_all_bootmem() Call Hierarchy

Let's look at the core portion of free_all_bootmem_core() to understand what is happening:

----------------------------------------------------------------------
mm/bootmem.c
 257 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
 258 {
 259   struct page *page;
 260   bootmem_data_t *bdata = pgdat->bdata;
 261   unsigned long i, count, total = 0;
 ...
 295   page = virt_to_page(bdata->node_bootmem_map);
 296   count = 0;
 297   for (i = 0; i < ((bdata->node_low_pfn-(bdata->node_boot_start >> PAGE_SHIFT))/8 + PAGE_SIZE-1)/PAGE_SIZE; i++,page++) {
 298     count++;
 299     ClearPageReserved(page);
 300     set_page_count(page, 1);
 301     __free_page(page);
 302   }
 303   total += count;
 304   bdata->node_bootmem_map = NULL;
 305
 306   return total;
 307 }
----------------------------------------------------------------------

For all the available low-memory pages, we clear the PG_reserved flag[9] in the flags field of the page struct. Next, we set the count field of the page struct to 1 to indicate that it is in use and call __free_page(), thus passing it to the buddy allocator. Recall from Chapter 4's explanation of the buddy system that this function releases a page and adds it to a free list.

[9] Recall from Chapter 6 that this flag is set in pages that are to be pinned in memory and that it is set for low memory during early bootstrapping.

The function __free_all_bootmem() returns the number of low memory pages available, which is added to the running count of totalram_pages (an unsigned long defined in mm/page_alloc.c).

Lines 480-486

These lines update the count of reserved pages.

Line 488

The call to set_highmem_pages_init() marks the initialization of high-memory pages. Figure 8.16 illustrates the calling hierarchy of set_highmem_pages_init().

Figure 8.16. highmem_pages_init Calling Hierarchy

Let's look at the bulk of the code performed in one_highpage_init():

 253  void __init one_highpage_init(struct page *page, int pfn, int bad_ppro)
 254  {
 255    if (page_is_ram(pfn) && !(bad_ppro && page_kills_ppro(pfn))) {
 256      ClearPageReserved(page);
 257      set_bit(PG_highmem, &page->flags);
 258      set_page_count(page, 1);
 259      __free_page(page);
 260      totalhigh_pages++;
 261    } else
 262      SetPageReserved(page);
 263  }

Much like __free_all_bootmem(), all high-memory pages have their page struct flags field cleared of the PG_reserved flag, have PG_highmem set, and have their count field set to 1. __free_page() is also called to add these pages to the free lists and the totalhigh_pages counter is incremented.

Lines 490-506

This code block gathers and prints out information regarding the size of memory areas and the number of available pages.
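
The shifts in the printk() arguments convert units: a page count shifted left by PAGE_SHIFT-10 becomes kilobytes, and a byte count shifted right by 10 becomes kilobytes. The following user-space sketch shows the same conversions; the demo_ names, the 4KB page size, and the shortened message format are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_PAGE_SHIFT 12  /* assumed: 4KB pages */

/* Convert a page count to kilobytes: pages * 2^PAGE_SHIFT / 2^10. */
static unsigned long demo_pages_to_kb(unsigned long pages)
{
    return pages << (DEMO_PAGE_SHIFT - 10);
}

/* Convert a byte count to kilobytes, rounding down. */
static unsigned long demo_bytes_to_kb(unsigned long bytes)
{
    return bytes >> 10;
}

/* Format a shortened version of the boot-time "Memory:" line into buf. */
static int demo_format_memory(char *buf, size_t len,
                              unsigned long free_pages,
                              unsigned long phys_pages,
                              unsigned long code_bytes)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "Memory: %luk/%luk available (%luk kernel code)",
                    demo_pages_to_kb(free_pages),
                    demo_pages_to_kb(phys_pages),
                    demo_bytes_to_kb(code_bytes));
}
```

With 4KB pages, 65,536 page frames correspond to 262,144KB (256MB), and a 2MB kernel text section is reported as 2,048KB.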

Lines 521-523

The function zap_low_mappings() flushes the initial TLBs and PGDs in low memory.

The function mem_init() marks the end of the boot phase of memory allocation and the beginning of the memory allocation that will be used throughout the system's life.

The PPC code for mem_init() finds and initializes all pages for all zones:

 393  void __init mem_init(void)
 394   {
 395   unsigned long addr;
 396   int codepages = 0;
 397   int datapages = 0;
 398   int initpages = 0;
 399   #ifdef CONFIG_HIGHMEM
 400   unsigned long highmem_mapnr;
 402   highmem_mapnr = total_lowmem >> PAGE_SHIFT;
 403   highmem_start_page = mem_map + highmem_mapnr;
 404  #endif /* CONFIG_HIGHMEM */
 405   max_mapnr = total_memory >> PAGE_SHIFT;
 407   high_memory = (void *) __va(PPC_MEMSTART + total_lowmem);
 408   num_physpages = max_mapnr;  /* RAM is assumed contiguous */
 410   totalram_pages += free_all_bootmem();
 412  #ifdef CONFIG_BLK_DEV_INITRD
 413   /* if we are booted from BootX with an initial ramdisk,
 414    make sure the ramdisk pages aren't reserved. */
 415   if (initrd_start) {
 416  for (addr = initrd_start; addr < initrd_end; addr += PAGE_SIZE)
 417    ClearPageReserved(virt_to_page(addr));
 418  }
 419  #endif /* CONFIG_BLK_DEV_INITRD */
 421  #ifdef CONFIG_PPC_OF
 422   /* mark the RTAS pages as reserved */
 423   if ( rtas_data )
 424    for (addr = (ulong)__va(rtas_data);
 425     addr < PAGE_ALIGN((ulong)__va(rtas_data)+rtas_size) ;
 426     addr += PAGE_SIZE)
 427     SetPageReserved(virt_to_page(addr));
 428  #endif
 429  #ifdef CONFIG_PPC_PMAC
 430   if (agp_special_page)
 431    SetPageReserved(virt_to_page(agp_special_page));
 432  #endif
 433   if ( sysmap )
 434    for (addr = (unsigned long)sysmap;
 435     addr < PAGE_ALIGN((unsigned long)sysmap+sysmap_size) ;
 436     addr += PAGE_SIZE)
 437     SetPageReserved(virt_to_page(addr));
 439   for (addr = PAGE_OFFSET; addr < (unsigned long)high_memory;
 440    addr += PAGE_SIZE) {
 441    if (!PageReserved(virt_to_page(addr)))
 442     continue;
 443    if (addr < (ulong) etext)
 444     codepages++;
 445    else if (addr >= (unsigned long)&__init_begin
 446      && addr < (unsigned long)&__init_end)
 447     initpages++;
 448    else if (addr < (ulong) klimit)
 449     datapages++;
 450   }
 452  #ifdef CONFIG_HIGHMEM
 453   {
 454    unsigned long pfn;
 456   for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) {
 457     struct page *page = mem_map + pfn;
 459     ClearPageReserved(page);
 460     set_bit(PG_highmem, &page->flags);
 461     set_page_count(page, 1);
 462     __free_page(page);
 463     totalhigh_pages++;
 464    }
 465    totalram_pages += totalhigh_pages;
 466   }
 467  #endif /* CONFIG_HIGHMEM */
 469  printk("Memory: %luk available (%dk kernel code, %dk data, %dk init, %ldk highmem)\n",
 470     (unsigned long)nr_free_pages()<< (PAGE_SHIFT-10),
 471     codepages<< (PAGE_SHIFT-10), datapages<< (PAGE_SHIFT-10),
 472     initpages<< (PAGE_SHIFT-10),
 473     (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10)));
 474   if (sysmap)
 475    printk(" loaded at 0x%08x for debugger, size: %ld bytes\n",
 476     (unsigned int)sysmap, sysmap_size);
 477  #ifdef CONFIG_PPC_PMAC
 478   if (agp_special_page)
 479    printk(KERN_INFO "AGP special page: 0x%08lx\n", agp_special_page); 
 480  #endif
 482   /* Make sure all our pagetable pages have page->mapping
 483    and page->index set correctly. */
 484   for (addr = KERNELBASE; addr != 0; addr += PGDIR_SIZE) {
 485    struct page *pg;
 486    pmd_t *pmd = pmd_offset(pgd_offset_k(addr), addr);
 487    if (pmd_present(*pmd)) {
 488     pg = pmd_page(*pmd);
 489     pg->mapping = (void *) &init_mm;
 490     pg->index = addr;
 491    }
 492   }
 493   mem_init_done = 1;
 494  }

Lines 399-410

These lines find the amount of memory available. If HIGHMEM is used, those pages are also counted. The global variable totalram_pages is modified to reflect this.

Lines 412-419

If used, clear any pages that the boot RAM disk used.

Lines 421-432

Depending on the boot environment, reserve pages for the Real-Time Abstraction Services and AGP (video), if needed.

Lines 433-450

If required, reserve some pages for system map.

Lines 452-467

If using HIGHMEM, clear any reserved pages and modify the global variable totalram_pages.

Lines 469-480

Print memory information to the console.
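The shifts by (PAGE_SHIFT-10) in the printk() arguments convert page counts into kilobytes: a page holds 2^PAGE_SHIFT bytes and a kilobyte is 2^10 bytes. The following is a minimal user-space sketch of that conversion. pages_to_kb() is a hypothetical helper, not a kernel function, and the page shift is passed in as a parameter rather than taken from the kernel headers.

```c
/* Hypothetical helper: convert a page count to kilobytes.
 * page_shift is log2 of the page size in bytes and must be >= 10. */
unsigned long pages_to_kb(unsigned long pages, unsigned int page_shift)
{
    /* (2^page_shift bytes per page) / (2^10 bytes per KB) */
    return pages << (page_shift - 10);
}
```

Assuming 4KB pages (a page shift of 12), pages_to_kb(1000, 12) yields 4000, which is the form in which the "Memory: ...k available" line reports its figures.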

Lines 482-492

Loop through the page directory and, for each page-table page that is present, set its mapping (to init_mm) and its index (to the address it maps).

8.5.21. The Call to late_time_init()

Lines 459-460

The function late_time_init() uses HPET (refer to the discussion in the section "The Call to time_init()"). This function is used only with the Intel architecture and HPET. It has essentially the same code as time_init(); it is just called after memory initialization to allow the HPET to be mapped into physical memory.

8.5.22. The Call to calibrate_delay()

Line 461

The function calibrate_delay() in init/main.c calculates and prints the value of the much celebrated "BogoMips," a measurement that indicates the number of delay() iterations your processor can perform in a clock tick. calibrate_delay() allows delays to be approximately the same across processors of different speeds. The resulting value (at most an indicator of how fast a processor is running) is stored in loops_per_jiffy, and the udelay() and mdelay() functions use it to set the number of delay() iterations to perform:

 void __init calibrate_delay(void)
 {
   unsigned long ticks, loopbit;
   int lps_precision = LPS_PREC;
 186   loops_per_jiffy = (1<<12);
   printk("Calibrating delay loop... ");
 189   while (loops_per_jiffy <<= 1) {
    /* wait for "start of" clock tick */
    ticks = jiffies;
    while (ticks == jiffies)
     /* nothing */;
    /* Go .. */
    ticks = jiffies;
    __delay(loops_per_jiffy);
    ticks = jiffies - ticks;
    if (ticks)
     break;
 200   }
 /* Do a binary approximation to get loops_per_jiffy set to equal one clock
  (up to lps_precision bits) */
 204   loops_per_jiffy >>= 1;
   loopbit = loops_per_jiffy;
 206   while ( lps_precision-- && (loopbit >>= 1) ) {
    loops_per_jiffy |= loopbit;
    ticks = jiffies;
    while (ticks == jiffies);
    ticks = jiffies;
    __delay(loops_per_jiffy);
    if (jiffies != ticks)  /* longer than 1 tick */
     loops_per_jiffy &= ~loopbit;
 214   }
 /* Round the value and print it */
 217   printk("%lu.%02lu BogoMIPS\n",
    loops_per_jiffy/(500000/HZ),
 219    (loops_per_jiffy/(5000/HZ)) % 100);
 }

Line 186

Start loops_per_jiffy at 1 << 12 (0x1000).

Lines 189-200

Keep doubling loops_per_jiffy until the function delay(loops_per_jiffy) takes longer than one jiffy.

Line 204

Divide loops_per_jiffy by 2.

Lines 206-214

Successively add descending powers of 2 to loops_per_jiffy, keeping each one only if the resulting delay still completes within one jiffy.

Lines 217-219

Print the value out as if it were a float.
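The two phases of the calibration (doubling, then a bit-by-bit refinement) can be tried outside the kernel. The sketch below is an illustration under stated assumptions, not the kernel's code: the real function times __delay() against the jiffies counter, whereas here the hypothetical predicate runs_past_tick() compares the loop count with a made-up per-tick capacity. SKETCH_LPS_PREC and SKETCH_HZ are assumed values chosen for the example.

```c
#define SKETCH_LPS_PREC 8    /* assumed refinement precision, in bits */
#define SKETCH_HZ       1000 /* assumed tick rate */

/* Hypothetical stand-in for timing a delay against the clock:
 * a delay "runs past the tick" when it needs more loops than fit. */
static int runs_past_tick(unsigned long loops, unsigned long capacity)
{
    return loops > capacity;
}

/* Returns the calibrated loops-per-tick estimate for a given capacity. */
unsigned long calibrate(unsigned long capacity)
{
    unsigned long lpj = 1UL << 12;
    unsigned long loopbit;
    int lps_precision = SKETCH_LPS_PREC;

    /* Phase 1: double until a single delay no longer fits in one tick. */
    while (lpj <<= 1) {
        if (runs_past_tick(lpj, capacity))
            break;
    }

    /* Phase 2: step back one power of two, then try each lower bit,
     * keeping it only if the delay still fits in one tick. */
    lpj >>= 1;
    loopbit = lpj;
    while (lps_precision-- && (loopbit >>= 1)) {
        lpj |= loopbit;
        if (runs_past_tick(lpj, capacity))
            lpj &= ~loopbit;
    }
    return lpj;
}

/* Integer and two-digit fractional parts, as in the final printk(). */
unsigned long bogomips_whole(unsigned long lpj)
{
    return lpj / (500000 / SKETCH_HZ);
}

unsigned long bogomips_frac(unsigned long lpj)
{
    return (lpj / (5000 / SKETCH_HZ)) % 100;
}
```

Because only eight bits below the leading one are refined, the estimate stays slightly under the true capacity. Splitting the value into an integer part and a two-digit remainder is what lets the kernel print something that looks like a float without using floating-point arithmetic.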

8.5.23. The Call to pgtable_cache_init()

Line 463

The key function in this x86 code block is the system function kmem_cache_create(). This function creates a named cache. The first parameter is a string used to identify it in /proc/slabinfo:

 529 kmem_cache_t *pgd_cache;
 530 kmem_cache_t *pmd_cache;
 532 void __init pgtable_cache_init(void)
 533 {
 534   if (PTRS_PER_PMD > 1) {
 535     pmd_cache = kmem_cache_create("pmd",
 536           PTRS_PER_PMD*sizeof(pmd_t),
 537           0,
 538           SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
 539           pmd_ctor,
 540           NULL);
 541     if (!pmd_cache)
 542       panic("pgtable_cache_init(): cannot create pmd cache");
 543   }
 544   pgd_cache = kmem_cache_create("pgd",
 545         PTRS_PER_PGD*sizeof(pgd_t),
 546         0,
 547         SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
 548         pgd_ctor,
 549         PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
 550   if (!pgd_cache)
 551     panic("pgtable_cache_init(): Cannot create pgd cache");
 552 }

 976 void pgtable_cache_init(void)
 977 {
 978   zero_cache = kmem_cache_create("zero",
 979         PAGE_SIZE,
 980         0,
 981         SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
 982         zero_ctor,
 983         NULL);
 984   if (!zero_cache)
 985     panic("pgtable_cache_init(): could not create zero_cache!\n");
 986 }

Lines 532-542

Create the pmd cache.

Lines 544-551

Create the pgd cache.

On the PPC, which has hardware-assisted hashing, pgtable_cache_init() is a no-op:

 685  #define pgtable_cache_init()  do { } while (0)

8.5.24. The Call to buffer_init()

Line 472

The buffer_init() function in fs/buffer.c sets up the buffer cache, which holds data from filesystem devices:

 3031  void __init buffer_init(void)
 {
   int i;
   int nrpages;
 3036   bh_cachep = kmem_cache_create("buffer_head",
     sizeof(struct buffer_head), 0,
     0, init_buffer_head, NULL);
 3039   for (i = 0; i < ARRAY_SIZE(bh_wait_queue_heads); i++)
    init_waitqueue_head(&bh_wait_queue_heads[i].wqh);
 3044   nrpages = (nr_free_buffer_pages() * 10) / 100;
   max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
   hotcpu_notifier(buffer_cpu_notify, 0);
 3048  }

Line 3036

Create the SLAB cache from which buffer_head structures are allocated.

Line 3039

Initialize the table of buffer wait queue heads.

Line 3044

Limit low-memory occupancy to 10 percent.
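The 10 percent limit translates into a cap on the number of buffer heads. The arithmetic can be sketched in user space as follows. max_buffer_heads_for() is a hypothetical helper, and the page size and buffer_head size are passed in as parameters because their real values depend on the architecture and kernel configuration.

```c
/* Hypothetical helper mirroring the arithmetic at line 3044 and the
 * line after it: take 10% of the free buffer pages, then count how
 * many buffer_head structures fit into that many pages. */
unsigned long max_buffer_heads_for(unsigned long free_buffer_pages,
                                   unsigned long page_size,
                                   unsigned long bh_size)
{
    unsigned long nrpages = (free_buffer_pages * 10) / 100;

    /* Whole structures per page; any remainder of the page is unused. */
    return nrpages * (page_size / bh_size);
}
```

For example, with 100,000 free buffer pages, 4,096-byte pages, and an assumed 48-byte buffer_head, 10,000 pages are set aside and each holds 85 structures, giving a cap of 850,000.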

8.5.25. The Call to security_scaffolding_startup()

Line 474

The 2.6 Linux kernel contains code for loading kernel modules that implement various security features. security_scaffolding_startup() simply verifies that a security operations object exists, and if it does, calls the security module's initialization functions.

How security modules can be created and what kind of issues a writer might face are beyond the scope of this text. For more information, consult the Linux Security Modules project and the Linux-security-module mailing list.

8.5.26. The Call to vfs_caches_init()

Line 475

The VFS subsystem depends on memory caches, called SLAB caches, to hold the structures it manages. Chapter 4 discusses SLAB caches in detail. The vfs_caches_init() function initializes the SLAB caches that the subsystem uses. Figure 8.17 shows the overview of the main function hierarchy called from vfs_caches_init(). We explore in detail each function included in this call hierarchy. You can refer to this hierarchy to keep track of the functions as we look at each of them.

Figure 8.17. vfs_caches_init() Call Hierarchy

Table 8.3 summarizes the objects introduced by the vfs_caches_init() function or by one of the functions it calls.

 1623  void __init vfs_caches_init(unsigned long mempages)
 1624  {
 1625   names_cachep = kmem_cache_create("names_cache",
 1626     PATH_MAX, 0,
 1627     SLAB_HWCACHE_ALIGN, NULL, NULL);
 1628   if (!names_cachep)
 1629    panic("Cannot create names SLAB cache");
 1631   filp_cachep = kmem_cache_create("filp", 
 1632     sizeof(struct file), 0,
 1633     SLAB_HWCACHE_ALIGN, filp_ctor, filp_dtor);
 1634   if(!filp_cachep)
 1635    panic("Cannot create filp SLAB cache");
 1637   dcache_init(mempages);
 1638   inode_init(mempages);
 1639   files_init(mempages); 
 1640   mnt_init(mempages);
 1641   bdev_cache_init();
 1642   chrdev_init();
 1643  }

Table 8.3. Objects Introduced by vfs_caches_init

[Table body not recoverable: the Object Name column is missing; the surviving column lists eight entries marked "Global variable" and one marked "Struct (discussed in Chapter 6)".]

Line 1623

The routine takes in the global variable num_physpages (whose value is calculated during mem_init()) as a parameter that holds the number of physical pages available in the system's memory. This number influences the creation of SLAB caches, as we see later.

Lines 1625-1629

The next step is to create the names_cachep memory area. Chapter 4 describes the kmem_cache_create() function in detail. This memory area holds objects of size PATH_MAX, which is the maximum number of characters a pathname is allowed to have. (This value is set in linux/limits.h as 4,096.) At this point, the cache has been created but holds no objects, that is, no memory areas of size PATH_MAX. The actual memory areas are allocated upon the first and potentially subsequent calls to getname().

As discussed in Chapter 6, the getname() routine is called at the beginning of some of the file-related system calls (for example, sys_open()) to read the file pathname from the process address space. Objects are freed from the cache with the putname() routine.

If the names_cache cache cannot be created, the kernel jumps to the panic routine, exiting the function's flow of control.

Lines 1631-1635

The filp_cachep cache is created next, with objects the size of the file structure. The object holding the file structure is allocated by the get_empty_filp() (fs/file_table.c) routine, which is called, for example, upon creation of a pipe or the opening of a file. The file descriptor object is deallocated by a call to the file_free() (fs/file_table.c) routine.

Line 1637

The dcache_init() (fs/dcache.c) routine creates the SLAB cache that holds dentry descriptors.[10] The cache itself is called the dentry_cache. The dentry descriptors themselves are created for each hierarchical component in pathnames referred to by processes when accessing a file or directory. The structure associates the file or directory component with the inode that represents it, so that later requests for that component can be matched to its inode more quickly.

[10] Recall that dentry is short for directory entry.

Line 1638

The inode_init() (fs/inode.c) routine initializes the inode hash table and the wait queue head array used for storing hashed inodes that the kernel wants to lock. The wait queue heads (wait_queue_head_t) for hashed inodes are stored in an array called i_wait_queue_heads. This array gets initialized at this point of the system's startup process.

The inode_hashtable gets created at this point. This table speeds up searches for inodes. The last thing that occurs is that the SLAB cache used to hold inode objects gets created. It is called inode_cache. The memory areas for this cache are allocated upon calls to alloc_inode() (fs/inode.c) and freed upon calls to destroy_inode() (fs/inode.c).

Line 1639

The files_init() routine is called to determine the maximum number of file structures that may be allocated system-wide. The max_files field of the files_stat structure is set. This is then referenced upon file creation to determine whether another file can be opened. Let's look at this routine:

 292  void __init files_init(unsigned long mempages)
 293  { 
 294   int n; 
 299   n = (mempages * (PAGE_SIZE / 1024)) / 10;
 300   files_stat.max_files = n; 
 301   if (files_stat.max_files < NR_FILE)
 302    files_stat.max_files = NR_FILE;
 303  }

Line 299

The page size is divided by the amount of space that a file (along with associated inode and cache) will roughly occupy (in this case, 1K). This value is then multiplied by the number of pages to get the total amount of "blocks" that can be used for files. The division by 10 shows that the default is to limit the memory usage for files to no more than 10 percent of the available memory.

Lines 301-302

If the computed limit is smaller than NR_FILE, max_files is raised to NR_FILE. NR_FILE (include/linux/fs.h) is set to 8,192.
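The sizing rule of files_init() is easy to check by hand. The sketch below reproduces it in user space under stated assumptions: max_files_for() and SKETCH_NR_FILE are hypothetical names, and the page size is a parameter instead of the kernel's PAGE_SIZE.

```c
#define SKETCH_NR_FILE 8192 /* floor taken from the text (NR_FILE) */

/* Hypothetical helper: about 1K per file, capped at 10% of memory,
 * but never below the SKETCH_NR_FILE floor. */
int max_files_for(unsigned long mempages, unsigned long page_size)
{
    int n = (int)((mempages * (page_size / 1024)) / 10);

    if (n < SKETCH_NR_FILE)
        n = SKETCH_NR_FILE;
    return n;
}
```

With 4KB pages, a machine with 65,536 pages (256MB) gets a limit of 26,214 files, while a machine with only 8,192 pages (32MB) would compute 3,276 and is therefore raised to the 8,192 floor.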

Line 1640

The next routine, called mnt_init(), creates the cache that will hold the vfsmount objects the VFS uses for mounting filesystems. The cache is called mnt_cache. The routine also creates the mount_hashtable array, which stores references to objects in mnt_cache for faster access. It then issues calls to initialize the sysfs filesystem and mounts the root filesystem. Let's look closely at the creation of the hash table:

 fs/namespace.c
 1137  void __init mnt_init(unsigned long mempages)
 {
 1139   struct list_head *d;
 1140   unsigned long order;
 1141   unsigned int nr_hash;
 1142   int i;
 ...
 1149   order = 0;
 1150   mount_hashtable = (struct list_head *)
 1151    __get_free_pages(GFP_ATOMIC, order);
 1152
 1153   if (!mount_hashtable)
 1154    panic("Failed to allocate mount hash table\n");
 ...
 1161   nr_hash = (1UL << order) * PAGE_SIZE / sizeof(struct list_head);
 1162   hash_bits = 0;
 1163   do {
 1164    hash_bits++;
 1165   } while ((nr_hash >> hash_bits) != 0);
 1166   hash_bits--;
 ...
 1172   nr_hash = 1UL << hash_bits;
 1173   hash_mask = nr_hash-1;
 1174
 1175   printk("Mount-cache hash table entries: %d (order: %ld, %ld bytes)\n",
    nr_hash, order, (PAGE_SIZE << order));
 ...
 1179   d = mount_hashtable;
 1180   i = nr_hash;
 1181   do {
 1182    INIT_LIST_HEAD(d);
 1183    d++;
 1184    i--;
 1185   } while (i);
 ...
 1189  }

Lines 1139-1144

The hash table array consists of a full page of memory. Chapter 4 explains in detail how the routine __get_free_pages() works. In a nutshell, this routine returns a pointer to a memory area of size 2^order pages. In this case, order is 0, so we allocate one page to hold the hash table.

Lines 1161-1173

The next step is to determine the number of entries in the table. nr_hash is first set to the number of list heads that can fit into the table. hash_bits is then calculated as the position of the highest set bit in nr_hash (the base-2 logarithm of nr_hash, rounded down). Line 1172 redefines nr_hash as being composed of that single leftmost bit, which rounds it down to a power of two. The bitmask can then be calculated from the new nr_hash value.

Lines 1179-1185

Finally, we initialize the hash table through a call to the INIT_LIST_HEAD macro, which takes in a pointer to the memory area where a new list head is to be initialized. We do this nr_hash times (or the number of entries that the table can hold).
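The initialization loop can be reproduced with a minimal doubly linked list head. This is a sketch for illustration: init_list_head(), init_hash_table(), and list_is_empty() are hypothetical user-space names standing in for the kernel's INIT_LIST_HEAD macro and its surrounding loop.

```c
/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head whose next and prev point back at itself. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Initialize nr_hash buckets; nr_hash must be at least 1 because the
 * do/while loop always runs once, as in mnt_init(). */
void init_hash_table(struct list_head *table, unsigned int nr_hash)
{
    struct list_head *d = table;
    unsigned int i = nr_hash;

    do {
        init_list_head(d);
        d++;
        i--;
    } while (i);
}

int list_is_empty(const struct list_head *head)
{
    return head->next == head;
}
```

After the loop, every bucket is an empty list, so a later lookup that hashes to any bucket can walk it safely even before anything has been mounted.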

Let's walk through an example: We assume a PAGE_SIZE of 4KB (4,096 bytes) and a struct list_head of 8 bytes. Because order is equal to 0, the value of nr_hash becomes 512; that is, up to 512 list_head structs can fit in one 4KB table. The (1UL << order) becomes the number of pages that have been allocated. For example, if the order had been 1 (meaning we had requested 2^1 pages allocated to the hash table), 0000 0001 bit-shifted once to the left becomes 0000 0010 (or 2 in decimal notation). Next, we calculate the number of bits the hash key will need. Walking through each iteration of the loop, we get the following:

Beginning values are hash_bits = 0 and nr_hash = 512 (0010 0000 0000).

  • Iteration 1: hash_bits = 1, and (512 >> 1) != 0

    (0010 0000 0000 >> 1) = 0001 0000 0000

  • Iteration 2: hash_bits = 2, and (512 >> 2) != 0

    (0010 0000 0000 >> 2) = 0000 1000 0000

  • Iteration 3: hash_bits = 3, and (512 >> 3) != 0

    (0010 0000 0000 >> 3) = 0000 0100 0000

  • Iteration 4: hash_bits = 4, and (512 >> 4) != 0

    (0010 0000 0000 >> 4) = 0000 0010 0000

  • Iteration 5: hash_bits = 5, and (512 >> 5) != 0

    (0010 0000 0000 >> 5) = 0000 0001 0000

  • Iteration 6: hash_bits = 6, and (512 >> 6) != 0

    (0010 0000 0000 >> 6) = 0000 0000 1000

  • Iteration 7: hash_bits = 7, and (512 >> 7) != 0

    (0010 0000 0000 >> 7) = 0000 0000 0100

  • Iteration 8: hash_bits = 8, and (512 >> 8) != 0

    (0010 0000 0000 >> 8) = 0000 0000 0010

  • Iteration 9: hash_bits = 9, and (512 >> 9) != 0

    (0010 0000 0000 >> 9) = 0000 0000 0001

  • Iteration 10: hash_bits = 10, and (512 >> 10) == 0

    (0010 0000 0000 >> 10) = 0000 0000 0000

After breaking out of the while loop, hash_bits is decremented to 9, nr_hash is set to 0010 0000 0000 (512), and the hash_mask is set to 0001 1111 1111 (511).
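The same computation can be run in user space to check any starting value. The sketch below is an illustration under stated assumptions: struct hash_params and hash_sizing() are hypothetical names, and the initial entry count is passed in directly instead of being derived from PAGE_SIZE and sizeof(struct list_head).

```c
#include <limits.h>

/* Hypothetical container for the three values mnt_init() derives. */
struct hash_params {
    unsigned int hash_bits;
    unsigned int nr_hash;
    unsigned int hash_mask;
};

/* initial_entries is the raw number of list heads that fit in the
 * table; it must be at least 1, as it always is in mnt_init(). */
struct hash_params hash_sizing(unsigned int initial_entries)
{
    const unsigned int width = sizeof(unsigned int) * CHAR_BIT;
    struct hash_params p;
    unsigned int hash_bits = 0;

    /* Count shifts until nothing is left; the width check guards
     * against shifting by the full width of the type. */
    do {
        hash_bits++;
    } while (hash_bits < width && (initial_entries >> hash_bits) != 0);
    hash_bits--;

    p.hash_bits = hash_bits;
    p.nr_hash = 1U << hash_bits;  /* round down to a power of two */
    p.hash_mask = p.nr_hash - 1;  /* all lower bits set */
    return p;
}
```

A starting count that is already a power of two, such as 512, is left unchanged (9 bits, mask 511), whereas a count such as 500 is rounded down to 256 (8 bits, mask 255). Rounding to a power of two is what allows a hash value to be reduced to a bucket index with a single AND against hash_mask.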

After the mnt_init() routine initializes mount_hashtable and creates mnt_cache, it issues three calls:

 1189   sysfs_init();
 1190   init_rootfs();
 1191   init_mount_tree();
 1192  }

sysfs_init() is responsible for the creation of the sysfs filesystem. init_rootfs() and init_mount_tree() are together responsible for mounting the root filesystem. We look closely at each routine in turn.

 218  static struct file_system_type rootfs_fs_type = {
 219   .name   = "rootfs",
 220   .get_sb  = rootfs_get_sb,
 221   .kill_sb  = kill_litter_super,
 222  };
 237  int __init init_rootfs(void)
 238  {
 239  return register_filesystem(&rootfs_fs_type);
 240  }

The rootfs filesystem is an initial filesystem the kernel mounts. It is a simple and quite empty directory that becomes overmounted by the real filesystem at a later point in the kernel boot-up process.

Lines 218-222

This code block is the declaration of the rootfs_fs_type file_system_type struct. Only the two methods for getting and killing the associated superblock are defined.

Lines 237-240

The init_rootfs() routine merely registers this rootfs with the kernel. This makes available all the information regarding the type of filesystem (information stored in the file_system_type struct) within the kernel.

 1107  static void __init init_mount_tree(void)
 1108  {
 1109   struct vfsmount *mnt;
 1110   struct namespace *namespace;
 1111   struct task_struct *g, *p;
 1113   mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
 1114   if (IS_ERR(mnt))
 1115    panic("Can't create rootfs");
 1116   namespace = kmalloc(sizeof(*namespace), GFP_KERNEL);
 1117   if (!namespace)
 1118    panic("Can't allocate initial namespace");
 1119   atomic_set(&namespace->count, 1);
 1120   INIT_LIST_HEAD(&namespace->list);
 1121   init_rwsem(&namespace->sem);
 1122   list_add(&mnt->mnt_list, &namespace->list);
 1123   namespace->root = mnt;
 1125   init_task.namespace = namespace;
 1126   read_lock(&tasklist_lock);
 1127   do_each_thread(g, p) {
 1128    get_namespace(namespace);
 1129    p->namespace = namespace;
 1130   } while_each_thread(g, p);
 1131   read_unlock(&tasklist_lock);
 1133  set_fs_pwd(current->fs, namespace->root, namespace->root->mnt_root);
 1134  set_fs_root(current->fs, namespace->root, namespace->root->mnt_root);
 1135  }

Lines 1116-1123

Initialize the process namespace. This structure keeps pointers to the mount tree-related structures and the corresponding dentry. The namespace object is allocated, the count set to 1, the list field of type list_head is initialized, the semaphore that locks the namespace (and the mount tree) is initialized, and the root field corresponding to the vfsmount structure is set to point to our newly allocated vfsmount.

Line 1125

The current task's (the init task's) process descriptor namespace field is set to point at the namespace object we just allocated and initialized. (The current process is Process 0.)

Lines 1133-1134

The following two routines set the values of four fields in the fs_struct associated with our process. fs_struct holds fields for the root and current working directory entries set by these two routines.

We just finished exploring what happens in the mnt_init() function. Let's continue exploring vfs_caches_init().

 1641 bdev_cache_init()
 290  void __init bdev_cache_init(void)
 291  {
 292   int err;
 293   bdev_cachep = kmem_cache_create("bdev_cache",
 294     sizeof(struct bdev_inode),
 295     0,
 296     SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT,
 297     init_once,
 298     NULL);
 299   if (!bdev_cachep)
 300    panic("Cannot create bdev_cache SLAB cache");
 301   err = register_filesystem(&bd_type);
 302   if (err)
 303    panic("Cannot register bdev pseudo-fs");
 304   bd_mnt = kern_mount(&bd_type);
 305   err = PTR_ERR(bd_mnt);
 306   if (IS_ERR(bd_mnt))
 307    panic("Cannot create bdev pseudo-fs");
 308   blockdev_superblock = bd_mnt->mnt_sb;  /* For writeback */
 309  }

Lines 293-298

Create the bdev_cache SLAB cache, which holds bdev_inodes.

Line 301

Register the bdev special filesystem. It has been defined as follows:

 294  static struct file_system_type bd_type = {
 295   .name   = "bdev",
 296   .get_sb  = bd_get_sb,
 297   .kill_sb  = kill_anon_super,
 298  };

As you can see, the file_system_type struct of the bdev special filesystem has only two routines defined: one for fetching the filesystem's superblock and the other for removing/freeing the superblock. At this point, you might wonder why block devices are registered as filesystems. In Chapter 6, we saw that systems that are not technically filesystems can use filesystem kernel structures; that is, they do not have mount points but can make use of the VFS kernel structures that support filesystems. Block devices are one instance of a pseudo filesystem that makes use of the VFS filesystem kernel structures. As with bdev, these special filesystems generally define only a limited number of fields because not all of them make sense for the particular application.

Lines 304-308

The call to kern_mount() sets up all the mount-related VFS structures and returns the vfsmount structure. (See Chapter 6 for more information on setting the global variables bd_mnt to point to the vfsmount structure and blockdev_superblock to point to the vfsmount superblock.)

The chrdev_init() function initializes the character device objects that surround the driver model:

 1642 chrdev_init()
 void __init chrdev_init(void)
 {
 433   subsystem_init(&cdev_subsys);
 434   cdev_map = kobj_map_init(base_probe, &cdev_subsys);
 435  }

8.5.27. The Call to radix_tree_init()

Line 476

The 2.6 Linux kernel uses a radix tree to manage pages within the page cache. Here, we simply create the SLAB cache from which the nodes of the page cache radix tree are allocated:

 798 void __init radix_tree_init(void)
 799 {
 800   radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 801       sizeof(struct radix_tree_node), 0,
 802       SLAB_PANIC, radix_tree_node_ctor, NULL);
 803   radix_tree_init_maxindex();
 804   hotcpu_notifier(radix_tree_callback, 0);
 805 }

 768 static __init void radix_tree_init_maxindex(void)
 769 {
 770   unsigned int i;
 772   for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
 773     height_to_maxindex[i] = __maxindex(i);
 774 }

Notice how radix_tree_init() creates the SLAB cache for radix tree nodes and radix_tree_init_maxindex() fills in the radix tree lookup table, height_to_maxindex[], which records the largest index a tree of each height can hold.
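To make the role of height_to_maxindex[] concrete, the sketch below computes the largest index a tree of a given height can address. It is a simplified illustration, not the kernel's __maxindex(): max_index_for_height() is a hypothetical name, and SKETCH_MAP_SHIFT assumes that each tree level resolves 6 bits of the index (64 slots per node).

```c
#include <limits.h>

#define SKETCH_MAP_SHIFT 6 /* assumed: 64 slots per node */

/* Largest index representable by a tree of the given height.
 * Each level contributes SKETCH_MAP_SHIFT bits; once the bit count
 * reaches the width of unsigned long, every index is representable. */
unsigned long max_index_for_height(unsigned int height)
{
    unsigned int bits = height * SKETCH_MAP_SHIFT;

    if (bits >= sizeof(unsigned long) * CHAR_BIT)
        return ULONG_MAX;
    return (1UL << bits) - 1;
}
```

Under these assumptions a tree of height 1 covers indices 0 through 63 and a tree of height 2 covers 0 through 4,095. Precomputing these limits at boot lets an insertion decide with one table lookup whether the tree must grow taller.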

hotcpu_notifier() (on line 804) refers to Linux 2.6's capability to hotswap CPUs. When a CPU is hotswapped, the kernel calls radix_tree_callback(), which attempts to cleanly free the parts of the page cache that were linked to the hotswapped CPU.

8.5.28. The Call to signals_init()

Line 477

The signals_init() function in kernel/signal.c initializes the kernel signal queue:

 2565  void __init signals_init(void)
 2566  {
 2567  sigqueue_cachep =
 2568     kmem_cache_create("sigqueue",
 2569       sizeof(struct sigqueue),
 2570       __alignof__(struct sigqueue),
 2571       0, NULL, NULL);
 2572   if (!sigqueue_cachep)
 2573    panic("signals_init(): cannot create sigqueue SLAB cache");
 2574  }  

Lines 2567-2571

Allocate SLAB memory for sigqueue.

8.5.29. The Call to page_writeback_init()

Line 479

The page_writeback_init() function initializes the values controlling when a dirty page is written back to disk. Dirty pages are not immediately written back to disk; they are written after a certain amount of time passes or a certain number or percent of the pages in memory are marked as dirty. This init function attempts to determine the optimum number of pages that must be dirty before triggering a background write and a dedicated write. Background dirty-page writes take up much less processing power than dedicated dirty-page writes:

 488 /*
 489 * If the machine has a large highmem:lowmem ratio then scale back the default
 490 * dirty memory thresholds: allowing too much dirty highmem pins an excessive
 491 * number of buffer_heads.
 492 */
 493 void __init page_writeback_init(void)
 494 {
 495   long buffer_pages = nr_free_buffer_pages();
 496   long correction;
 498   total_pages = nr_free_pagecache_pages();
 500   correction = (100 * 4 * buffer_pages) / total_pages;
 502   if (correction < 100) {
 503     dirty_background_ratio *= correction;
 504     dirty_background_ratio /= 100;
 505     vm_dirty_ratio *= correction;
 506     vm_dirty_ratio /= 100;
 507   }
 508   mod_timer(&wb_timer, jiffies + (dirty_writeback_centisecs * HZ) / 100);
 509   set_ratelimit();
 510   register_cpu_notifier(&ratelimit_nb);
 511 }

Lines 495-507

If we are operating on a machine with a large page cache compared to the number of buffer pages, we lower the dirty-page writeback thresholds, which raises the frequency of writebacks. If we chose not to lower the thresholds, the dirty pages allowed to accumulate would pin an inordinate number of buffer_heads. (This is the meaning of the comment before page_writeback_init().)

The default background writeback, dirty_background_ratio, starts when 10 percent of the pages are dirty. A dedicated writeback, vm_dirty_ratio, starts when 40 percent of the pages are dirty.
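The scaling performed on lines 500-507 can be reproduced with plain integer arithmetic. The following sketch assumes the default ratios of 10 and 40 percent mentioned above; struct dirty_ratios and scale_dirty_ratios() are hypothetical names, and the page counts are parameters instead of the results of nr_free_buffer_pages() and nr_free_pagecache_pages().

```c
/* Hypothetical pair of thresholds, in percent of memory. */
struct dirty_ratios {
    long background; /* background writeback threshold */
    long dirty;      /* dedicated writeback threshold */
};

/* total_pages must be greater than zero. */
struct dirty_ratios scale_dirty_ratios(long buffer_pages, long total_pages)
{
    struct dirty_ratios r = { 10, 40 }; /* defaults from the text */
    long correction = (100 * 4 * buffer_pages) / total_pages;

    /* Only scale down: when buffer pages are at least a quarter of
     * all page cache pages, correction >= 100 and nothing changes. */
    if (correction < 100) {
        r.background = r.background * correction / 100;
        r.dirty = r.dirty * correction / 100;
    }
    return r;
}
```

For instance, with 200,000 buffer pages out of 1,000,000 page cache pages, the correction is 80, so the thresholds drop to 8 and 32 percent. When the two counts are equal, the correction is 400 and the defaults are kept.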

Line 508

We modify the writeback timer, wb_timer, to be triggered periodically (every 5 seconds by default).

Line 509

set_ratelimit() is called, which is documented excellently. I defer to these inline comments:

 450 /*
 451 * If ratelimit_pages is too high then we can get into dirty-data overload
 452 * if a large number of processes all perform writes at the same time.
 453 * If it is too low then SMP machines will call the (expensive)
 454 * get_writeback_state too often.
 455 *
 456 * Here we set ratelimit_pages to a level which ensures that when all CPUs are
 457 * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
 458 * thresholds before writeback cuts in.
 459 *
 460 * But the limit should not be set too high. Because it also controls the
 461 * amount of memory which the balance_dirty_pages() caller has to write back.
 462 * If this is too large then the caller will block on the IO queue all the
 463 * time. So limit it to four megabytes - the balance_dirty_pages() caller
 464 * will write six megabyte chunks, max.
 465 */
 467 static void set_ratelimit(void)
 468 {
 469   ratelimit_pages = total_pages / (num_online_cpus() * 32);
 470   if (ratelimit_pages < 16)
 471     ratelimit_pages = 16;
 472   if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
 473     ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
 474 }
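The two clamps in set_ratelimit() can be exercised in user space. This is a sketch under stated assumptions: ratelimit_pages_for() is a hypothetical name, and the page count, number of online CPUs, and page cache page size are parameters instead of kernel globals.

```c
/* Hypothetical helper mirroring set_ratelimit(): divide the pages
 * among the CPUs (1/32 each), then clamp the result to at least
 * 16 pages and at most 4MB worth of pages. online_cpus must be >= 1. */
long ratelimit_pages_for(long total_pages, int online_cpus,
                         long page_cache_size)
{
    long ratelimit_pages = total_pages / (online_cpus * 32);

    if (ratelimit_pages < 16)
        ratelimit_pages = 16;
    if (ratelimit_pages * page_cache_size > 4096 * 1024)
        ratelimit_pages = (4096 * 1024) / page_cache_size;
    return ratelimit_pages;
}
```

With 4KB pages, a two-CPU machine with 1,000,000 pages hits the upper clamp (1,024 pages, or 4MB), while a four-CPU machine with only 1,000 pages hits the lower clamp of 16.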

Line 510

The final command of page_writeback_init() registers the ratelimit notifier block, ratelimit_nb, with the CPU notifier. The ratelimit notifier block calls ratelimit_handler() when notified, which in turn, calls set_ratelimit(). The purpose of this is to recalculate ratelimit_pages when the number of online CPUs changes:

 483 static struct notifier_block ratelimit_nb = {
 484   .notifier_call = ratelimit_handler,
 485   .next   = NULL,
 486 };

Finally, we need to examine what happens when the wb_timer (from Line 508) goes off and calls wb_timer_fn():

 414 static void wb_timer_fn(unsigned long unused)
 415 {
 416   if (pdflush_operation(wb_kupdate, 0) < 0)
 417     mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
 418 }

Lines 416-417

When the timer goes off, the kernel triggers pdflush_operation(), which awakens one of the pdflush threads to perform the actual writeback of dirty pages to disk. If pdflush_operation() cannot awaken any pdflush thread, it tells the writeback timer to trigger again in 1 second to retry awakening a pdflush thread. See Chapter 9, "Building the Linux Kernel," for more information on pdflush.

8.5.30. The Call to proc_root_init()

Lines 480-482

As Chapter 2 explained, the CONFIG_* #define refers to a compile-time variable. If, at compile time, the proc filesystem is selected, the next step in initialization is the call to proc_root_init():

 40  void __init proc_root_init(void)
 41  {
 42   int err = proc_init_inodecache();
 43   if (err)
 44    return;
 45   err = register_filesystem(&proc_fs_type);
 46   if (err)
 47    return;
 48   proc_mnt = kern_mount(&proc_fs_type);
 49   err = PTR_ERR(proc_mnt);
 50   if (IS_ERR(proc_mnt)) {
 51    unregister_filesystem(&proc_fs_type);
 52    return;
 53   }
 54   proc_misc_init();
 55   proc_net = proc_mkdir("net", 0);
 56  #ifdef CONFIG_SYSVIPC
 57   proc_mkdir("sysvipc", 0);
 58  #endif
 59  #ifdef CONFIG_SYSCTL
 60   proc_sys_root = proc_mkdir("sys", 0);
 61  #endif
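 62  #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 63   proc_mkdir("sys/fs", 0);
 64   proc_mkdir("sys/fs/binfmt_misc", 0);
 65  #endif
 66   proc_root_fs = proc_mkdir("fs", 0);
 67   proc_root_driver = proc_mkdir("driver", 0);
 68   proc_mkdir("fs/nfsd", 0); /* somewhere for the nfsd filesystem to be mounted */
 69  #if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
 70   /* just give it a mountpoint */
 71   proc_mkdir("openprom", 0);
 72  #endif
 73   proc_tty_init();
 74  #ifdef CONFIG_PROC_DEVICETREE
 75   proc_device_tree_init();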
 76  #endif
 77   proc_bus = proc_mkdir("bus", 0);
 78  } 

Line 42

This line initializes the inode cache that holds the inodes for this filesystem.

Line 45

The file_system_type structure proc_fs_type is registered with the kernel. Let's look closely at the structure:

 33  static struct file_system_type proc_fs_type = {
 34   .name   = "proc",
 35   .get_sb  = proc_get_sb,
 36   .kill_sb  = kill_anon_super,
 37  };

The file_system_type structure, which defines the filesystem's name simply as proc, has the routines for retrieving and freeing the superblock structures.

Line 48

We mount the proc filesystem. See the sidebar on kern_mount for more details as to what happens here.

Lines 54-78

The call to proc_misc_init() is what creates most of the entries you see in the /proc filesystem. It creates entries with calls to create_proc_read_entry(), create_proc_entry(), and create_proc_seq_entry(). The remainder of the code block consists of calls to proc_mkdir for the creation of directories under /proc/, the call to the proc_tty_init() routine to create the tree under /proc/tty, and, if the config time value of CONFIG_PROC_DEVICETREE is set, then the call to the proc_device_tree_init() routine to create the /proc/device-tree subtree.

8.5.31. The Call to init_idle()

Line 490

init_idle() is called near the end of start_kernel() with parameters current and smp_processor_id() to prepare start_kernel() for rescheduling:

 2643 void __init init_idle(task_t *idle, int cpu)
 2644 {
 2645   runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle));
 2646   unsigned long flags;
 2648   local_irq_save(flags);
 2649   double_rq_lock(idle_rq, rq);
 2651   idle_rq->curr = idle_rq->idle = idle;
 2652   deactivate_task(idle, rq);
 2653   idle->array = NULL;
 2654   idle->prio = MAX_PRIO;
 2655   idle->state = TASK_RUNNING;
 2656   set_task_cpu(idle, cpu);
 2657   double_rq_unlock(idle_rq, rq);
 2658   set_tsk_need_resched(idle);
 2659   local_irq_restore(flags);
 2661   /* Set the preempt count _outside_ the spinlocks! */
 2662 #ifdef CONFIG_PREEMPT
 2663   idle->thread_info->preempt_count = (idle->lock_depth >= 0);
 2664 #else
 2665   idle->thread_info->preempt_count = 0;
 2666 #endif
 2667 }

Line 2645

We store the run queue of the CPU that we're on and the run queue of the CPU that the given task idle is on. In our case, with current and smp_processor_id(), these run queues will be equal.

Lines 2648-2649

We save the IRQ flags and obtain the lock on both run queues.

Line 2651

We set the current task of the run queue of the CPU that we're on to the task idle.

Lines 2652-2656

These statements remove the task idle from its run queue and move it to the run queue of cpu.

Lines 2657-2659

We release the locks on the run queues that we previously locked. Then, we mark task idle for rescheduling and restore the IRQs that we previously saved. We finally set the preemption counter if kernel preemption is configured.

8.5.32. The Call to rest_init()

Line 493

The rest_init() routine is fairly straightforward. It essentially creates what we call the init thread, removes the initialization kernel lock, and calls the idle thread:

 388  static void noinline rest_init(void)
 389  {
 390   kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
 391   unlock_kernel();
 392   cpu_idle();
 393  }

Line 388

You might have noticed that this is the first routine start_kernel() calls that is not __init. If you recall from Chapter 2, we said that when a function is preceded by __init, all the memory used to maintain the function variables and the like is cleared/freed once initialization nears completion. This is done through a call to free_initmem(), which we see in a moment when we explore what happens in init(). rest_init() is not an __init function because it creates the init thread before its own completion (meaning the call to cpu_idle()). Because the init thread executes the call to free_initmem(), a race condition could occur whereby free_initmem() is called before rest_init() (or the root thread) is finished.

Line 390

This line creates the init thread, which is also referred to as the init process or process 1. For brevity, all we say here is that this thread shares all kernel data structures with the calling process. The kernel thread calls the init() function, which we look at in the next section.

Line 391

The unlock_kernel() routine does nothing if only a single processor exists. Otherwise, it releases the BKL.

Line 392

The call to cpu_idle() is what turns the root thread into the idle thread. This routine yields the processor to the scheduler and is returned to when the scheduler has no other pending process to run.

At this point, we have completed the bulk of the Linux kernel initialization. We now briefly look at what happens in the call to init().

8.6. The init Thread (or Process 1)

We now explore the init thread. Note that we skip over all SMP-related routines for brevity:

 601  static int init(void * unused)
 602  {
 603   lock_kernel(); 
 612   child_reaper = current;
 627   populate_rootfs();
 629   do_basic_setup();
 635   if (sys_access((const char __user *) "/init", 0) == 0)
 636    execute_command = "/init";
 637   else
 638    prepare_namespace();
 645   free_initmem();
 646   unlock_kernel();
 647   system_state = SYSTEM_RUNNING;
 649   if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
 650    printk("Warning: unable to open an initial console.\n");
 652   (void) sys_dup(0);
 653   (void) sys_dup(0);
 662   if (execute_command)
 663    run_init_process(execute_command);
 665   run_init_process("/sbin/init");
 666   run_init_process("/etc/init");
 667   run_init_process("/bin/init");
 668   run_init_process("/bin/sh");
 670   panic("No init found. Try passing init= option to kernel.");
 671  }

Line 612

The init thread is set to reap any thread whose parent has died. The child_reaper variable is a global pointer to a task_struct and is defined in init/main.c. This variable comes into play in "reparenting functions" and is used as a reference to the thread that should become the new parent. We refer to functions such as reparent_to_init() (kernel/exit.c), choose_new_parent() (kernel/exit.c), and forget_original_parent() (kernel/exit.c) because they use child_reaper to reset the calling thread's parent.
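The role of child_reaper can be illustrated with a small user-space sketch. It is not the kernel's code: the struct task type and the reparent_children() helper are hypothetical, simplified stand-ins, and only the idea (orphans get child_reaper as their new parent) comes from the text above.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's task_struct. */
struct task {
    int pid;
    struct task *parent;
};

/* Stand-in for the global child_reaper pointer that init() sets on line 612. */
static struct task *child_reaper;

/*
 * Hand every task whose parent is `dying` over to child_reaper.
 * Returns the number of tasks that were reparented.
 */
static int reparent_children(struct task *tasks, size_t n, struct task *dying)
{
    int moved = 0;

    for (size_t i = 0; i < n; i++) {
        if (tasks[i].parent == dying && &tasks[i] != child_reaper) {
            tasks[i].parent = child_reaper;
            moved++;
        }
    }
    return moved;
}
```

In this sketch, a table holding process 1, a child of process 1, and two grandchildren would see both grandchildren handed to process 1 when the child exits.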

Line 629

The do_basic_setup() function initializes the driver model, the sysctl interface, the network socket interface, and work queue support:

 551  static void __init do_basic_setup(void)
 552  {
 553   driver_init();
 555  #ifdef CONFIG_SYSCTL
 556    sysctl_init();
 557  #endif
 560   sock_init();
 562   init_workqueues();
 563   do_initcalls();
 564  }

Line 553

The driver_init() (drivers/base/init.c) function initializes all the subsystems involved in driver support. This is the first part of device driver initializations. The second comes on line 563 with the call to do_initcalls().

Lines 555–557

The sysctl interface provides support for dynamic alteration of kernel parameters. This means that the kernel parameters that sysctl supports can be modified at runtime without the need for recompiling and rebooting the kernel. sysctl_init() (kernel/sysctl.c) initializes the interface. For more information on sysctl, read the man page (man sysctl).

Line 560

If the kernel is configured without network support, sock_init() is a dummy function that contains a simple printk; in this case, it is defined in net/nonet.c. If network support is configured, sock_init() is defined in net/socket.c; it initializes the memory caches to be used for network support and registers the filesystem that supports networking.

Line 562

The call to init_workqueues sets up the work queue notifier chain. Chapter 10, "Adding Your Code to the Kernel," discusses work queues.

Line 563

The do_initcalls() (init/main.c) function constitutes the second part of device driver initialization. This function sequentially calls the entries in an array of function pointers that correspond to built-in device initialization functions.[11]

[11] Refer to for an excellent distillation of the initcall mechanism by Trevor Woerner.
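The idea of walking an array of function pointers can be sketched in ordinary C. This is an illustration under stated assumptions, not the kernel's do_initcalls(): the three driver functions, the hand-written table, and the run_initcalls() helper are hypothetical, and the sketch assumes that a failing entry does not stop the walk. In the kernel, the table is assembled at link time rather than written by hand.

```c
#include <stddef.h>

/* Same shape as an initcall: a function that takes nothing and returns int. */
typedef int (*initcall_t)(void);

static int calls_made;   /* illustrative bookkeeping only */

/* Hypothetical built-in driver initialization functions. */
static int serial_init(void) { calls_made++; return 0; }
static int disk_init(void)   { calls_made++; return -1; /* simulated failure */ }
static int net_init(void)    { calls_made++; return 0; }

/* Hand-written table standing in for the one the linker builds. */
static initcall_t initcall_table[] = { serial_init, disk_init, net_init };

/*
 * Call every entry between start and end, in order.
 * Returns the number of entries that reported an error.
 */
static int run_initcalls(const initcall_t *start, const initcall_t *end)
{
    int failures = 0;

    for (const initcall_t *call = start; call < end; call++) {
        if ((*call)() != 0)
            failures++;
    }
    return failures;
}
```

Passing the start and one-past-the-end addresses of the table mirrors the way a start and end symbol can bracket a linker-generated section.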

Lines 635–638

If an early user space init program (/init) exists, the kernel does not prepare the namespace itself; it leaves that job to the early init program. Otherwise, the call to prepare_namespace() is made. A namespace refers to the mount point of a filesystem hierarchy:

 383  void __init prepare_namespace(void)
 384  {
 385   int is_floppy;
 387   mount_devfs();
 391   if (saved_root_name[0]) {
 392    root_device_name = saved_root_name;
 393    ROOT_DEV = name_to_dev_t(root_device_name);
 394    if (strncmp(root_device_name, "/dev/", 5) == 0)
 395     root_device_name += 5;
 396   }
 398   is_floppy = MAJOR(ROOT_DEV) == FLOPPY_MAJOR;  
 400   if (initrd_load())
 401    goto out;
 403   if (is_floppy && rd_doload && rd_load_disk(0))
 404    ROOT_DEV = Root_RAM0;
 406   mount_root();
 407  out:
 408   umount_devfs("/dev");
 409   sys_mount(".", "/", NULL, MS_MOVE, NULL);
 410   sys_chroot(".");
 411   security_sb_post_mountroot();
 412   mount_devfs_fs ();
 413  }

Line 387

The mount_devfs() function creates the /dev mount-related structures. We need to mount /dev because we use it to refer to the root device name.

Lines 391–396

This code block sets the global variable ROOT_DEV to the indicated root device as passed in through kernel boot-time parameters.
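The prefix handling on lines 394 and 395 is easy to try outside the kernel. The helper below is a hypothetical user-space sketch (strip_dev_prefix() is not a kernel function); it only reproduces the strncmp() test and pointer adjustment shown in the listing.

```c
#include <string.h>

/*
 * Return the device name without a leading "/dev/".
 * The returned pointer aliases the caller's string; nothing is copied.
 */
static const char *strip_dev_prefix(const char *name)
{
    if (strncmp(name, "/dev/", 5) == 0)
        name += 5;   /* skip the five characters of "/dev/" */
    return name;
}
```

With this sketch, a name such as "/dev/hda1" yields "hda1", and a name without the prefix is returned unchanged.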

Line 398

A simple comparison of major numbers indicates whether the root device is a floppy.
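A rough sketch of that comparison follows. It assumes the 2.6-era device number layout from include/linux/kdev_t.h (a 20-bit minor number in the low bits) and the value 2 for FLOPPY_MAJOR from include/linux/major.h. The dev_sketch_t type and the root_is_floppy() helper are hypothetical names used only for this illustration.

```c
/* Assumed layout: the low 20 bits hold the minor number, the rest the major. */
#define MINORBITS     20
#define MINORMASK     ((1U << MINORBITS) - 1)
#define MAJOR(dev)    ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev)    ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

#define FLOPPY_MAJOR  2   /* assumed value, see include/linux/major.h */

typedef unsigned int dev_sketch_t;   /* hypothetical stand-in for dev_t */

/* Nonzero when the device number's major part matches the floppy driver. */
static int root_is_floppy(dev_sketch_t root_dev)
{
    return MAJOR(root_dev) == FLOPPY_MAJOR;
}
```

The minor number plays no part in the test, so every floppy drive and format variant is recognized by the same comparison.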

Lines 400–401

The call to initrd_load() mounts the RAM disk if a RAM disk has been indicated as the kernel's root filesystem. If this is the case, it returns a 1 and executes the jump to the out label, which undoes all we've done in preparation of a root filesystem from a device.

Line 406

The call to mount_root() does the majority of the root-filesystem mounting. Let's look closely at this function:

 353  void __init mount_root(void)
 354  {
 355  #ifdef CONFIG_ROOT_NFS
 356   if (MAJOR(ROOT_DEV) == UNNAMED_MAJOR) {
 357    if (mount_nfs_root())
 358     return;
 360    printk(KERN_ERR "VFS: Unable to mount root fs via NFS, trying floppy.\n");
 361    ROOT_DEV = Root_FD0;
 362   }
 363  #endif
 364  #ifdef CONFIG_BLK_DEV_FD
 365   if (MAJOR(ROOT_DEV) == FLOPPY_MAJOR) {
 367    if (rd_doload==2) {
 368     if (rd_load_disk(1)) {
 369      ROOT_DEV = Root_RAM1;
 370      root_device_name = NULL;
 371     }
 372    } else
 373     change_floppy("root floppy");
 374   }
 375  #endif
 376   create_dev("/dev/root", ROOT_DEV, root_device_name);
 377   mount_block_root("/dev/root", root_mountflags);
 378  }

Lines 355–358

If the kernel has been configured to mount an NFS filesystem, we execute mount_nfs_root(). If the NFS mount fails, the kernel prints out the appropriate message and then proceeds to try to mount the floppy as the root filesystem.

Lines 364–375

In this code block, the kernel tries to mount the root floppy.[12]

[12] A note on rd_doload: This global variable holds a value of 0 if no RAM disk is to be loaded, a value of 1 if a RAM disk is to be loaded, and a value of 2 for a "dual initrd/ramload setup."

Line 377

The mount_block_root() function performs the bulk of the root device mounting. We now return to init().

Line 645

The call to free_initmem() frees all the memory segments occupied by the routines marked with __init. This marks our exit from pure kernel space; from here, we begin to set up user mode data.

Lines 649–650

These lines open the initial console, /dev/console, and print a warning if it cannot be opened.

Lines 662–668

The execute_command variable is set in init_setup() and holds the value of a boot-time parameter that contains the name of the init program to call if we do not want the default /sbin/init to be called. If an init program name is passed, it takes priority over the usual /sbin/init. Note that a successful call to run_init_process() (init/main.c) does not return because it ends with a call to execve(). Thus, the first init program that runs successfully is the only one run. If none of the init programs is found, the kernel falls back to starting the shell, /bin/sh.
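The order of the attempts can be modeled with a short sketch. It is an illustration only: choose_init() and the can_run predicates are hypothetical, and a predicate that returns nonzero stands in for an execve() that succeeds. The real kernel code never needs to return a path, because a successful execve() replaces the running program.

```c
#include <stddef.h>
#include <string.h>

/* Default candidates, in the order tried on lines 665 through 668. */
static const char *const default_inits[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh"
};

/* Example predicates: pretend that only some programs can be executed. */
static int only_shell(const char *path) { return strcmp(path, "/bin/sh") == 0; }
static int anything(const char *path)   { (void)path; return 1; }
static int nothing(const char *path)    { (void)path; return 0; }

/*
 * Return the first program that can_run accepts, trying the boot-time
 * override (if any) before the defaults. A NULL result corresponds to
 * reaching the panic() on line 670.
 */
static const char *choose_init(const char *execute_command,
                               int (*can_run)(const char *path))
{
    size_t count = sizeof(default_inits) / sizeof(default_inits[0]);

    if (execute_command && can_run(execute_command))
        return execute_command;
    for (size_t i = 0; i < count; i++) {
        if (can_run(default_inits[i]))
            return default_inits[i];
    }
    return NULL;
}
```

In this model, an override that cannot be executed does not end the search; the defaults are still tried in order.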

Line 670

This panic statement should be reached only if every attempt to execute an init program fails.

This concludes kernel initialization. From here on out, the init process involves itself with system initialization and starting all the necessary processes and daemon support required for user login and support.

