
Lectures

This is the online version of the course Advanced Operating Systems On-Line Collection.
The author is Professor Douglas W. Jones of the University of Iowa.

Introduction

The hardware consists of several subsystems:

       |-------|
       |  CPU  | CPU   CPU     -- CPUs (may have more than one)
       |   |   |   \   /
       |  MEM  |    MEM        -- main memory units
       |   |   |     |
       |  IOP  |    IOP        -- I/O processors
       |   |\  |     |\
       |   | \_______|_\____   -- a network
       |   |         |
       | other |   other       -- other I/O devices (disks, mice
       |  I/O  |    I/O                              displays,
       |-------|                                     keyboards,
      Traditional                                    printers, etc.)
      Single--CPU
       Operating
        System.
      
      \_________  _________/
                \/
              Modern
         Operating System!
 
 

A CPU consists of a control unit, an arithmetic unit, and a set of registers, including the program counter:

      Control      Registers
        Unit         _______
       _______      |  PC   | The program counter
      |_______|     |       |
            |       |       | Some operand registers
        ALU |       |  etc  | (details vary immensely
       _____|_      |       |  from machine to machine)
      |_______|<--->|_______|
 

One of the most important operations performed on these registers is the context switch. It saves the state of the register set and restores it at the next context switch.

Context switching is done differently on different processors. Some provide a so-called ``exchange jump'' instruction, in which a single instruction does all the work.

Others push the registers onto a stack and then switch the stack pointer and the program counter.

Inside the operating system there must be a special data structure that stores the context of each particular process.

In operating systems the context switch is often written in assembly language, because the operation is time-critical. If it is implemented in C, the routine that performs the switch takes two parameters: pointers to the two context data structures being switched.

Exercise: Design a context data structure and a context switching routine for any machine for which you can get adequate documentation. A worked example for the fictional Hawk architecture is included in http://www.cs.uiowa.edu/~jones/arch/hawk/trap.html
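
As a rough illustration of the two-parameter switching routine described above, here is a minimal sketch in C built on the portable ucontext interface (getcontext/swapcontext); the structure name and layout are invented for this example and are not the Hawk code referenced in the exercise.

      #include <ucontext.h>

      /* Hypothetical per-process context record: here it simply wraps the
         ucontext_t that holds the program counter, stack pointer and the
         other registers.  A real system would add scheduling fields. */
      struct context {
          ucontext_t regs;
      };

      /* Save the state of the outgoing process and resume the incoming one.
         swapcontext() stores the current register set in current->regs and
         loads next->regs in a single call. */
      void context_switch(struct context *current, struct context *next)
      {
          swapcontext(&current->regs, &next->regs);
      }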

Memory appears to be a simple resource, but that is a misconception. Compiler writers depend heavily on the addressing scheme. On some systems addressing is byte-oriented; on others it is mixed. For example, most CISC architectures of the 1970s (IBM 370, DEC VAX, Motorola 680x0, Intel 80x86) offered straightforward byte addressing. On modern RISC systems byte addressing plays a much smaller role.

Memory management and virtual memory are among the most difficult parts of modern operating systems.

      Virtual Memory     Lens       Real Memory
       _________                    _________
       |         |       ----       |         |
       |         |       \  /       |         |
       |         |        )(        |         |
       |         |       /  \       |         |
       |_________|       ----       |_________|
          CPU's         Memory      What a Logic
        Viewpoint     Management   Analyzer on the
 	                Unit       bus would see
 

The Memory Management Unit (MMU) is an integral part of modern computers, either built into the processor or implemented as a separate unit. The Motorola 680x0 and Intel 80x86 have rather complex MMUs, while on modern RISC architectures such as the PowerPC, SGI MIPS and HP PA things are much simpler.

The MMU contains several registers. If they are included in the context switch, switching takes longer. If they are not, then both of the contexts being switched must share the same address space.

In this sense there are two approaches to implementing multithreading. In the first, the threads share a single address space. In the second, each thread has its own address space. The second variant is harder to implement.

Register input-output is the simplest form of I/O. It can be described by procedure calls in a high-level language:

  * send( data, device )          -- output
  * data = get( device )          -- input
  * repeat until ready( device )  -- wait for transfer complete
 

Simple devices usually have a pair of I/O registers: a data register and a control register. send() writes data into the data register, and get() reads data from it.

The control register contains a bit that is set according to whether the operation was a send() or a get(); it may also contain other bits, for example for error handling.

On some machines the I/O registers are mapped to specific memory addresses; on others there are special opcodes for accessing them. These registers are usually one machine word wide, but they may be smaller.

Programming I/O devices is usually machine-dependent and is often done in assembly language. Even when it is written in a high-level language, the code usually does not port to other architectures.

Exercise: Write a routine to read a character from the keyboard input port or from an asynchronous input port of some machine -- do this with no use of operating system routines.
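
For concreteness, here is one possible shape of such a routine, written in C against a hypothetical memory-mapped device; the register addresses and the ready-bit position are invented for illustration and will differ on any real machine.

      #include <stdint.h>

      /* Hypothetical memory-mapped device registers (addresses invented). */
      #define DEV_DATA    ((volatile uint8_t *)0xFFFF0000)  /* data register     */
      #define DEV_CONTROL ((volatile uint8_t *)0xFFFF0001)  /* control/status    */
      #define DEV_READY   0x01                              /* assumed ready bit */

      /* Busy-wait until the device reports a character, then read it. */
      uint8_t get_char(void)
      {
          while ((*DEV_CONTROL & DEV_READY) == 0)
              ;                      /* repeat until ready( device ) */
          return *DEV_DATA;          /* data = get( device )         */
      }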

Direct memory access input-output is more complex: here the device accesses memory directly. The simplest DMA device:

        _____   MEMORY BUS   ______
        |     | ____________ |      |
        | CPU |<_____   ____>|Memory|
        |_____|      | |     |______|
                     | |
                    __v___
                   |      |
                   | DMA  |
                   |Device|
                   |______|
 
Simple video interfaces are one example. Consider a simple video interface with a resolution of 640 by 480. The following algorithm describes how such a device works:
	repeat -- one iteration per frame
 	    ptr = start
 	    output interframe gap in video signal
 	    repeat 480 times
 		output interline gap in video signal
 		repeat 640 times
 		    wait until it's time for the next pixel
 		    pixel = M[ptr]
 		    ptr = ptr + 1
 		    output grey value of pixel
 		end of loop
 	    end of loop
 	forever
 

Modern video interfaces are more complex than this scheme; they have many registers. They look roughly like this:

        _____   MEMORY BUS   ______
        |     | ____________ |      |
        | CPU |<_____   ____>|Memory|
        |_____|      | |     |______|
           ^         | |
           |        __v___
           |       |      |
            ------>| DMA  |
        Register   |Device|
          I/O      |______|
 

A single DMA disk controller can serve several devices of different types:

        _____    MAIN BUS    ______
        |     | ____________ |      |
        | CPU |<_____   ____>|Memory|
        |_____|      | |     |______|
                     | |
                    __v__       SCSI BUS
                   |     | ___________________
                   |SCSI |<____   ______   ___>
                   |_____|     | |      | |
                               | |      | |
                              __v__    __v__
                             |     |  |     |
                             | DISK|  | TAPE|
                             |_____|  |_____|
 

A typical DMA I/O controller performs the following steps:

  1. select the device
  2. determine the address of the data on the device
  3. allocate a buffer in memory
  4. start the data transfer in one direction or the other
  5. wait for the operation to complete
If an I/O device has direct access to memory, data is transferred to and from memory in blocks, usually in parallel with execution of the controlling program.
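
A hedged sketch of the five steps above for a hypothetical DMA controller with memory-mapped registers; the register layout, names and bit values are invented for illustration only.

      #include <stdint.h>

      /* Hypothetical DMA controller registers (layout invented). */
      struct dma_regs {
          volatile uint32_t device;    /* 1. which device on the controller */
          volatile uint32_t devaddr;   /* 2. block address on the device    */
          volatile uint32_t memaddr;   /* 3. buffer address in main memory  */
          volatile uint32_t count;     /*    transfer length in bytes       */
          volatile uint32_t command;   /* 4. start and direction bits       */
          volatile uint32_t status;    /* 5. done and error bits            */
      };

      #define DMA_START 0x1
      #define DMA_READ  0x2            /* device-to-memory transfer */
      #define DMA_DONE  0x1

      /* Read one block from the device into buf, polling for completion. */
      void dma_read_block(struct dma_regs *dma, uint32_t dev,
                          uint32_t blockaddr, void *buf, uint32_t len)
      {
          dma->device  = dev;                      /* 1. select the device   */
          dma->devaddr = blockaddr;                /* 2. address on device   */
          dma->memaddr = (uint32_t)(uintptr_t)buf; /* 3. buffer in memory    */
          dma->count   = len;
          dma->command = DMA_START | DMA_READ;     /* 4. start the transfer  */
          while ((dma->status & DMA_DONE) == 0)
              ;                                    /* 5. wait for completion */
      }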

Memory buses have their own hierarchy, which is usually hidden and implicit. For example, the EISA bus must be distinguished from the CPU's main bus.

The User Interface

Ordinary users working with an operating system deal with the following facilities:

  • File systems
  • Window managers
  • Command languages
  • Memory resources
  • Protection resources

The word ``file'' is used in several senses:

  • An object to which an I/O operation can be applied.
    For example, read(f,buffer,len), where f is a file.

  • A named object in the directory tree.
    For example ~jones/ftpdir/pdp8/README

  • An area on disk.
    For example, the space allocated for ~jones/ftpdir/pdp8/README is identified with I-node 14897 on the file server.

The notion of a command language can vary over a wide range.

Since Unix, the command language has been an integral part of the operating system, and each command typically corresponds to running some utility program or to making a particular system call.

In Unix it is called the shell. It has many variants: sh, csh, tcsh, bash.

exercise: Write a C program under Unix that executes the standard ls program (the program that runs when you type the ls command at a Unix shell prompt). The easy way to do this is to use the system() library routine; doing it without this requires use of an appropriate version of exec call (there have been many versions of this through the evolution of UNIX; today, execve() is the primary interface to this function).
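
A minimal sketch of the harder variant mentioned in the exercise, using fork(), execve() and wait(); error handling is omitted and the path /bin/ls is assumed.

      #include <unistd.h>
      #include <sys/wait.h>

      int main(void)
      {
          char *argv[] = { "ls", (char *)0 };
          char *envp[] = { (char *)0 };

          if (fork() == 0) {
              /* child: replace this process image with the ls program */
              execve("/bin/ls", argv, envp);
              _exit(1);                /* reached only if execve failed */
          }
          wait((int *)0);              /* parent: wait for ls to finish */
          return 0;
      }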

Resource Management

An operating system is a resource manager that must resolve the conflicts that arise between users. These resources include:

Memory
The memory manager handles dynamic memory allocation, virtual address space management, and the page fault handlers.

The processor
Processor time is divided among threads or processes.

Input/Output
The number of input or output operations that can be in progress at once is usually limited, so scheduling becomes necessary.

The file system
Disk access and the directory hierarchy.

Memory

Memory may be allocated in the form of pages or segments.

  • A page is a fixed-size block of memory with an aligned address (an address arithmetic sketch appears after this list).
     A page descriptor      ______________________________
                            |___________________|__________|
                            |  address of page  |0000000000|
     

    A page descriptor is simply the address of the page: because pages are aligned, its low-order n bits are zero, and a page holds 2 to the power n bytes or words.

  • A segment is a region of memory described by a base address and a length.
     A segment descriptor  ______________________________
                     base  |______________________________|
                   length  |______________________________|
     

    The base field of a segment descriptor gives the address at which the segment begins.

    The Intel 8088 has byte addressing, so the length of a segment can be given in bytes, but a segment must begin at an address that is a multiple of 16 bytes.

  • A segment may itself consist of many pages; the memory is then said to be paged-segmented. In this case the segment descriptor is a table of page addresses.

    Early Macs used 64 KB pages, with one stack page and one code page allocated to each application.
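
    A small sketch in C of the address arithmetic implied by these descriptors, assuming a hypothetical page size of 4 KB (n = 12); the names are illustrative and not part of the original notes.

        #include <stdint.h>

        #define PAGE_BITS 12                    /* assumed: 4 KB pages, n = 12 */
        #define PAGE_SIZE (1u << PAGE_BITS)

        /* Split a virtual address into a page number and an offset in the page. */
        uint32_t page_number(uint32_t addr) { return addr >> PAGE_BITS; }
        uint32_t page_offset(uint32_t addr) { return addr & (PAGE_SIZE - 1); }

        /* A page descriptor is just the aligned page address: the low n bits
           are zero, as in the diagram above. */
        uint32_t page_descriptor(uint32_t frame) { return frame << PAGE_BITS; }

        /* A segment descriptor as drawn above: a base address and a length. */
        struct segment_descriptor {
            uint32_t base;     /* address of the first byte of the segment */
            uint32_t length;   /* size of the segment                      */
        };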

    The Processor

    Efficient use of processor resources depends more on the logic of the processes than on the physical processor itself. Processor time-sharing techniques were developed back in the 1960s, and little has changed since then.

    The terms process and task are synonyms, and from the point of view of processor allocation there is no fundamental difference between a thread and a process either. The terms timesharing and multitasking appeared in the 1960s and mean supporting many processes on a minimal number of processors. The terms multiprocessing and multitasking are not synonyms; multiprocessing refers to a system with many CPUs.

    A Unix user at a remote terminal back in 1970 could afford the luxury of running no more than four processes at once.

    In later window-oriented systems, one or more processes are attached to each window.

    But on most personal computers, from the 1970s until the mid-1990s, multitasking remained rudimentary.

    Multitasking is supported by the CPU scheduler. The scheduler maintains a queue of processes, and each time a process is selected from this queue a context switch takes place.

    Input/Output

    When several requests arrive at one device at the same time, as happens with a network card, the system must schedule them.

    For a single-user system a first-in-first-out I/O policy is good enough. On multi-user servers, the smallest flaw in the scheduler can bring the system down.

    Disk Space

    Disk space is one of the most important resources managed by the system. The basic function of a file system is to allocate space for storing files. The mechanism for allocating disk space differs fundamentally from the mechanism for allocating memory: disk devices are block structured, and they impose their own constraints on read and write operations.

    Robustness and Reliability

    A protection mechanism is a necessary component of an operating system, although some systems get by without one.

    Security means the system's ability to withstand external attempts to break in.

    Reliability is the system's ability to prevent damage to itself, whether external or internal in origin.

    A user needs resource protection wherever break-ins are possible at all.

    Every user in the system must have protected access to his or her own resources: memory, the file system, and so on.

    Contention arises whenever several users or processes need access to shared resources.

    Reliability implies protecting the system both from user applications and from any other system-level components.

    In the early days of operating systems a single error could crash the whole system. In the newest multiprocessor systems, when one processor fails, the rest of the system keeps running.

    In large-scale systems, the occurrence of failures is the norm.

    exercise: Consider the internet, with around one million machines. Assuming each machine has a mean-time between failures of one month, what is the rate of failures, per unit time?

    Process Management

    The states of a process:

                 ---  Waiting  <---
         signal /                    \ wait
               /                      \
               V   <---preempt---      |
             Ready                  Running
               ^    ---schedule--->    |
               \                      /
         create \                    / terminate
                  -----  Dead <-----
     
    Running
    The running process may wait on events or be preempted by another process. At any moment only one process is running.

    Ready
    A ready process becomes the running process whenever the scheduler selects it.

    Waiting
    All other processes are in the waiting state.

    Dead
    A process that has terminated.

    If there is no runnable process in the system, there is still the null process, the so-called idle process.

    This process can be kept busy with system chores: diagnosing memory, collecting garbage, and so on.

    Processes that are waiting to run are held in a queue structure. This can be a linked-list FIFO queue supporting two operations, in which all processes have the same priority. If processes are assigned priorities, the queue is no longer FIFO.

    Most real-time systems assign each process its own priority. In Unix a process may have its priority raised, for example when it performs I/O, or lowered, for example after it has been preempted.
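
    As an illustration only (this is not code from the notes), a minimal linked-list FIFO queue of processes in C might look like this; the process record holds just the fields needed here.

        #include <stddef.h>

        /* Minimal process record: a real system also stores saved registers,
           state and so on. */
        struct process {
            struct process *next;      /* link used by the queue      */
            int priority;              /* unused in a pure FIFO queue */
        };

        struct queue {
            struct process *head, *tail;
        };

        /* Append a process at the tail of the queue. */
        void enqueue_process(struct queue *q, struct process *p)
        {
            p->next = NULL;
            if (q->tail != NULL)
                q->tail->next = p;
            else
                q->head = p;
            q->tail = p;
        }

        /* Remove and return the process at the head (NULL if the queue is empty). */
        struct process *dequeue_process(struct queue *q)
        {
            struct process *p = q->head;
            if (p != NULL) {
                q->head = p->next;
                if (q->head == NULL)
                    q->tail = NULL;
            }
            return p;
        }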

    Preemption Mechanism

    The running process can be preempted by another process. This can be triggered, for example, by a real-time-clock interrupt at the end of a time slice, as the following pseudocode shows:

       Clock interrupt service routine:
           begin
              Disable Interrupts (if not already disabled)
     
              -- Save process state of interrupted process
              current.pc := interrupt-return.pc
              current.registers := interrupt-return.registers
     
              -- Change interrupted process state to ready
              Enqueue( current, ready queue )
     
              -- Make new process into the current process
              current := Dequeue( ready queue )
     
              -- Restore process state of new process
              interrupt-return.pc := current.pc
              interrupt-return.registers := current.registers
     
              Enable Interrupts (if necessary)
              return
           end
     

    In older systems the timer fired at a fixed rate, for example 60 times a second. In modern systems the scheduler can change the time slice (the interval between interrupts) according to the behavior of the processes.

    Interrupts must be disabled while the running process is being switched.

    Semaphores

    The basic classical scheduler supports two main process states: ready and running.

    A separate question arises: what should be done with a process that must wait for some condition? It could be put into a loop:

          loop while desired condition not yet met
              relinquish
           endloop
     
    Semaphores were introduced to solve this problem (E. Dijkstra, Hierarchical Ordering of Sequential Processes, Acta Informatica Vol. 1, 1971, pages 115-138). Semaphores appeared in System V Unix.

    A semaphore is an object used to coordinate processes. A process needs a separate semaphore for each condition it may wait on. A process waiting for an event blocks on the semaphore attached to that event until some other process signals that semaphore.

    A semaphore can be described by an integer:

         wait( sem ) or P( sem ) or sem.wait or sem.P:
              while sem <= 0 do nothing
              sem = sem - 1
              return
     
          signal( sem ) or V( sem ) or sem.signal or sem.V:
              sem = sem + 1
     

    P and V are the wait and signal operations; P can be remembered as ``pause''.
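
    The same two operations are available directly on modern Unix systems as POSIX semaphores; the following sketch is only a modern equivalent, not part of the original notes (error checking omitted).

        #include <semaphore.h>

        sem_t sem;

        void example(void)
        {
            sem_init(&sem, 0, 1);   /* count starts at 1; 0 = not shared across processes */
            sem_wait(&sem);         /* the P operation: may block                         */
            /* ... work guarded or signalled by the semaphore goes here ... */
            sem_post(&sem);         /* the V operation: wakes one waiter                  */
        }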

    A more complete representation of a semaphore:

          semaphore representation
              queue: a queue of processes blocked on this semaphore
              count: an integer
     
    If the count is zero, the queue holds the list of processes blocked on the semaphore; if the count is nonzero, the queue is empty. A typical implementation of semaphores:
         wait( sem ) or P( sem ) or sem.wait or sem.P:
              Disable Interrupts
              if sem.counter > 0
     	    sem.counter = sem.counter - 1
              else
                 -- Save process state
                 current.pc = proc-return.pc
                 current.regs = proc-return.regs
        
                 -- Change state to waiting
     	    enqueue( current, sem.queue )
        
                 -- Make new process current process
                 current = Dequeue( ready queue )
        
                 -- Restore process state
                 proc-return.pc = current.pc
                 proc-return.regs = current.regs
              endif
              Enable Interrupts
              return
     
          signal( sem ) or V( sem ) or sem.signal or sem.V:
              Disable Interrupts
              if empty( sem.queue )
                 sem.count := sem.count + 1
              else
                 temp := dequeue( sem.queue );
                 enqueue( temp, ready queue )
              endif
              Enable Interrupts
              return
     
    This scheme is correct as long as all processes have the same priority. Otherwise low-priority processes may run while high-priority processes sit in the queue. This is called priority inversion.

    Exercise: Fix signal so it does not cause priority inversions when processes have an integer attribute called priority, and assuming that all queues of processes honor this priority field.

    Petri Nets

    In 1961 Conway, thinking about multiprocessor systems, devised the following notation for describing parallelism in application programs. It is presented here in terms of Petri nets:

           ___            ___
           |   |     |    |   |
           | A |---->|----| B |   A occurs before B
           |___|     |    |___|
     
            ___            ___
           |   |     |    |   |
           | A |---->|----| B |
           |___|     |    |___|
             |
           __V__     A occurs before either B or C, but
             |       not both; A could be a conditional
             _|_      test, with control flowing to either
           |   |     B or C depending on the condition.
           | C |           ___
           |___|          |   |
                          | A |
                          |___|
                            |
                          __V__
                            |    C will occur after A
            ___            _V_   and C will occur after B
           |   |     |    |   |  entirely independently.
           | B |---->|----| C |
           |___|     |    |___|
     

    A Petri net is a graph made of boxes and vertical bars (transitions). We place a token at the starting position and then move it from box to box through the transitions of the net.

    Consider a parallel program:

                ___
                |   |
                | A |   The fork operation:
                |___|   A occurs before both
                  |     B and C in parallel.
           _______V_______
             |         |
            _V_       _V_
           |   |     |   |
           | B |     | C |
           |___|     |___|
     
            ___       ___
           |   |     |   |
           | A |     | B |
           |___|     |___|
             |         |
           __V_________V__
                  |
                 _V_    The join operation:
                 |   |   A and B must both
                | C |   occur before C happens.
                |___|   
     

    These are examples of forks and joins; they generalize to N-way forks and N-way joins.

    Parallelism in a Program

    Consider a program consisting of four statements: A, B, C and D. We want A to be executed first, then B and C in parallel, and then D. Conway used the following graphical notation:

                ___
                |   |
                | A |   The fork operation:
                |___|   A occurs before both
                  |     B and C in parallel.
           _______V_______
             |         |
            _V_       _V_
           |   |     |   |
           | B |     | C |
           |___|     |___|
             |         |
           __V_________V__
                  |
                 _V_
                |   |
                | D |
                |___|
     

    We will use Algol-style begin-end blocks. Everything inside such a block is executed sequentially. For parallel execution we use a cobegin-coend block:

         A;
          cobegin
            B;
            C;
          coend
          D;
     

    The intent is that this parallel execution takes place on a multiprocessor system.

    The same thing can be written in another notation:

         A;
          B, C;
          D;
     

    Implementing Parallelism

    There are two processes, a parent and a child. The parent executes the branch A-B-D, and the child executes C.

         Parent process:
                                   Child process:
                A;
                fork(L);               L: C
                B;                        join(V);
                await-join(V);
                D;
     
    In Conway's scheme the two processes have a common shared memory.

    Conway introduced a special shared variable to synchronize the two processes. By today's standards this variable looks like a semaphore.

    The following UNIX code solves the problem posed above:

         A;
          if (fork() == 0) {
               B;               /* child */
               exit();          /* kill child, signal parent */
          } else {
               C;               /* parent */
               wait();          /* await signal from child */
          }
          D;
     
    Here A, C and D are executed in the parent process, while fork() creates a child process in which B is executed. In Unix each process has its own address space: statement B sees all the changes made by A, but the changes made by B are not visible to C or D.
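
    To make the fragment above concrete, here is a complete, runnable version in which A, B, C and D are stand-in print statements; it only illustrates the pattern and is not part of the original notes.

        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/wait.h>

        int main(void)
        {
            printf("A\n");
            fflush(stdout);            /* so the child does not re-flush A's output */
            if (fork() == 0) {
                printf("B\n");         /* child: B                        */
                exit(0);               /* child exits, parent is notified */
            } else {
                printf("C\n");         /* parent: C                       */
                wait(NULL);            /* wait for the child to exit      */
            }
            printf("D\n");             /* D runs after both B and C       */
            return 0;
        }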

    Exercise: Look up the fork(), exit() and wait() primitives in the Unix Programmer's Reference Manual. They are in section 2 of the manual (system calls); the Unix man command under the Unix shell searches this manual. Try man man for help on this command.

    Exercise: A; cobegin B; C; D; coend; E; Write this in C to run under Unix. There are two classes of solutions, differing in whether one parent has two children, or whether the parent has one child that in turn has a child (the grandchild of the parent). Read the manual carefully! One of these variants is harder to program than the other.

    Mutual Exclusion

    Consider the problem of counting the nonzero elements of a 1,000,000 element array. Here is a parallel solution, expressed informally using two processes:

         cobegin
             -- process A
             for i := 1 to 500,000 do
               if a[i] <> 0
                 then count := count + 1;
          
             -- process B
             for j := 500,001 to 1,000,000 do
               if a[j] <> 0
                 then count := count + 1;
          coend
     
    After processes A and B have completed, you'd expect that count would correctly reflect the desired quantity, but there is a significant chance that it won't, because the operation of incrementing count in each of these processes will most likely be accomplished by some sequence of operations such as the following:
         load register with count
          increment register
          store register to count
     
    Unfortunately, even on a uniprocessor, it is possible that both processes A and B will attempt to execute this same instruction sequence at the same time, and one of the possible instruction sequences will be something like the following:
                process A           process B
              
            load rA with count
            preempt process A
                                 load rB with count
                                 increment rB
                                 store rB in count
                                 ...
                                 preempt process B
            increment rA
            store rA to count
     
    It is not hard to see that each time something like this occurs, one or more increment operations will be miscounted. On a uniprocessor using preemptive scheduling once per timeslice, once the CPU is preempted from process A and given to B as shown above, one full time-slice of computation by process B will be lost when process A regains control!

    To prevent this, it is essential that only one process at a time attempt to increment count. Therefore, we say that this sequence of instructions is a critical section, or that access by processors to count must be mutually exclusive. In general, whenever there is a shared variable, code that accesses that shared variable will potentially involve critical sections, and each of these must be guarded!

    Simple Mutual Exclusion on Uniprocessors

    On a uniprocessor, mutual exclusion may be provided by disabling interrupts on entry to a critical section and reenabling them on exit.

         disable interrupts   -- begin critical section
          load register with count
          increment register
          store register to count
          enable interrupts    -- end critical section
     
    Disabling interrupts prevents preemption, and so long as the code within the critical section includes no relinquish or signal operations, we have a guarantee that no other processes will run. This approach is very common in systems such as small microcontrollers that have no operating system and it is very common inside the kernel of an operating system, but there are two problems that limit its use:

    First, the instructions to enable and disable interrupts are usually privileged, which is to say, the operating system kernel and hardware cooperate to prevent users from using these instructions. If this were not done, a user could execute the disable interrupt instruction and then go into an infinite loop, shutting down the system. Therefore, this approach cannot be used by user programs on protected systems.

    Second, this approach is not selective. If one process is incrementing a particular shared variable, it only needs to exclude other processes that are intent on modifying the same variable. If there are many programs that operate on shared variables, and if each of them shuts down all scheduler activity when it enters a critical section, the net result can be serious performance problems.

    Exercise: The code given in the previous lecture for implementing a scheduler relied on exactly this method! Find the critical sections in that code.

    Simple Mutual Exclusion on Multiprocessors

    Many CPUs designed for use in multiprocessors, particularly in the 1960's and 1970's, included instructions such as exchange(r,m) or test-and-set(m). These are atomic, which means that, if one CPU is executing one of these operations, no other CPU may intervene. The exchange operation swaps the contents of a register with the contents of a location in memory, while the test-and-set operation sets a memory location to zero while testing its previous value and reporting the results in the condition codes.

    Given one or another of these operations, the variable count used in our running example may be protected with a new variable called a spin lock. As a convention, the spin lock protecting a variable, for example, count will be called count-lock. Given this, we can implement our critical section as:

          lock( count-lock )    -- begin critical section
           load register with count
           increment register
           store register to count
           unlock( count-lock )  -- end critical section
     
    The two states of a spin lock are locked and unlocked; any two values may be used to represent these states, however, we will use zero to represent locked and one to represent unlocked because of the way these values generalize to the use of semaphores. Initially, the lock should be one; lock sets it to zero, and unlock resets it to one. Lock and unlock can be implemented as follows:
          lock( v )
              register := 0;
              repeat
                 exchange( register, v );
              until register = 1;
     
           unlock( v )
              v := 1;
     
    Spin locks have the disadvantage that they involve busy waiting. Nonetheless, they are the preferred mutual exclusion mechanism for very short critical sections on shared-memory multiprocessors. Specifically, if the expected waiting time for entry to the critical section in the event of contention is shorter than the time it would take to do two context switches (switching the waiting process out of the CPU and back in again, under the control of the scheduler), this is the preferred method.
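
    As a concrete rendering of the lock and unlock operations above, here is a hedged sketch using C11 atomics, with atomic_exchange playing the role of the atomic exchange instruction; it is an illustration, not the notes' own code.

        #include <stdatomic.h>

        typedef atomic_int spinlock;   /* 1 = unlocked, 0 = locked; initialize to 1 */

        void lock(spinlock *v)
        {
            /* spin, atomically swapping in 0, until we read back a 1 */
            while (atomic_exchange(v, 0) != 1)
                ;
        }

        void unlock(spinlock *v)
        {
            atomic_store(v, 1);
        }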

    On some uniprocessor operating systems, including much of the older code in UNIX, the poor design of the scheduler or compatibility with older versions of the system forces the use of spin-locks. Typically, in high level low-priority code, a relinquish operation of some type is inserted in the busy waiting loop in order to let other ready processes use the CPU while one process waits for the lock.

    In modern computer architectures, those designed in the 1980's or later, both the test-and-set and exchange operations are uncommon. Instead, it is common to find what is known as a snooping mechanism in each CPU, along with a pair of special instructions, load-locked and store-conditional. The load-locked instruction loads from a memory location to a register and sets the snooping mechanism to monitor the bus and note any attempt to change the contents of that location. The store-conditional instruction stores from register to memory, but only if the snooping mechanism is currently monitoring the destination address and there has been no change to that address since the most recent load-locked on this CPU. The store-conditional instruction always reports its success or failure, for example, using a condition code.

    Exercise: Write code for the lock and unlock operations using test-and-set.

    Exercise: Write code for the lock and unlock operations using load-locked and store-conditional.

    Mutual Exclusion Without (much) Help from Hardware

    The problem of mutual exclusion on multiprocessors where there are no atomic exchange or test-and-set operations has been the subject of years of study, during which many ingenious solutions to the critical section problem have been found. Dekker's solution is the oldest of these, solving the problem for 2 processes with ID's 0 and 1, and making no assumption about special hardware support other than the fact that the operations of reading or writing a variable are atomic.

    Dekker represented each spin-lock by three variables:

          state[0..1] -- process i sets state[i] to enter section
           turn        -- indicates whose turn it is to try next
     

    Given this, and that all variables are initially zero, the code for the lock and unlock operations is as follows, assuming that each process passes its process ID to the lock and unlock operations:

          other(pid)
               return( 1 - pid );
     
           lock(pid):
               state[pid] = 1;
               while state[other(pid)] = 1 do
                   state[pid] = 0;
                   while turn <> pid do nothing;
                   state[pid] = 1;
               end loop;
               return;
     
           unlock(pid):
               state[pid] = 0;
               turn = other(pid);
               return;
     

    When A is inside the critical section, state[A] is one and state[B] is zero. If both try to enter at the same time, so both state variables are set to one, they both enter the loop waiting for the other to reset their state to zero. One will immediately reassert their claim, while the other is blocked awaiting its turn. Furthermore, on exit from the critical section, each process always sets turn to give priority to the other process, preventing one or the other from hogging the critical section.

    This code is difficult to understand! Furthermore, it is difficult to generalize to more than two processes. One interesting generalization uses a binary tree of processes, where processes enter the tree at the leaves and use a different process ID at each internal node, for example, using 0 if they came to that node from the left branch and 1 if they came to that node from the right branch. There are many more solutions to this problem, some of which we will discuss as the semester continues.
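
    For reference, a direct C11 rendering of the two-process code above; sequentially consistent atomics stand in for the assumption that reads and writes of the shared variables are atomic. This sketch is not from the original notes.

        #include <stdatomic.h>

        static atomic_int state[2];   /* process i sets state[i] to enter */
        static atomic_int turn;       /* whose turn it is to try next     */

        static int other(int pid) { return 1 - pid; }

        void lock(int pid)
        {
            atomic_store(&state[pid], 1);
            while (atomic_load(&state[other(pid)]) == 1) {
                atomic_store(&state[pid], 0);
                while (atomic_load(&turn) != pid)
                    ;                         /* wait until it is our turn */
                atomic_store(&state[pid], 1);
            }
        }

        void unlock(int pid)
        {
            atomic_store(&state[pid], 0);
            atomic_store(&turn, other(pid));
        }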

    A General Mutual Exclusion Mechanism

    Semaphores can be used to provide mutual exclusion! The variable count used in our running example may be protected with a semaphore; we'll call this count-sem. Given this, we can implement our example critical section as:

         P( count-sem )    -- begin critical section
          load register with count
          increment register
          store register to count
          V( count-sem )    -- end critical section
     
    Initially, the semaphore should have a count of one; the P operation sets it to zero, and the V operation resets it to one.

    The parallel between this and the spin-lock solution should be obvious! In fact, spin-locks are considered to be one implementation of a special case of semaphores called binary semaphores. Binary semaphores are those that never take on values other than zero and one.

    Semaphores implemented using the process scheduler are the preferred implementation of user-level and some system-level mutual exclusion on uniprocessors, and they are the preferred implementation on multiprocessors for those critical sections where the expected waiting time for entry to the section in the event of contention is greater than the time taken for two context switches.

    Exercise: Suggest a rule for system programmers on uniprocessors: When should the system programmer use semaphores to guard a critical section, and when should the system programmer simply disable interrupts?

    Interprocess Communication

    Producers and Consumers

    Mutual exclusion is not enough! All it provides is the illusion that entire critical sections are atomic, but effective parallel programming rests on more than this.

    Consider a situation where one process produces data to be consumed by another:

          Producer --> Consumer
     

    The producer can easily fill a shared variable, then let the consumer empty the variable, and obviously, both the filling and emptying of the shared variable should appear atomic from the perspective of other processes. Mutual exclusion is sufficient for this, but it is not enough: there must be some way for the producer to signal to the consumer that data is available!

    Signalling as an Abstraction

    One way to view this signalling problem is to think purely in terms of the sequence of computations. We want the producer to produce data, and only when the producer has produced do we want to allow the consumer to consume. If the producer produces, but the consumer is not yet ready, we do not want to make the producer wait, but if the consumer is ready and the producer has not yet produced, we must block the consumer. Using Petri-net notation, we can express these constraints as follows:

       Process A             Process B
            |                     |
       _____V_____                |
      |  Produce  |               |
      | some data |               |
      |___________|               |
            |          X          |
         ___V______________       |
            |          |          |
            |       ___V___       |
            |      |       |      |
            |      |   S   |      |
            |      |_______|      |
            |          |          |
            |       ___V__________V___
            |                     |
            |          Y     _____V_____
             |               |  Consume  |
            |               |  the data |
            |               |___________|
            |                     |
     

    This is just a Petri-net fragment because it shows nothing of the producer or consumer control flow other than the actual produce and consume steps. If the two transitions in this net fragment are viewed as fork and join transitions, then we have not only a producer and consumer process, but also a short process from X to Y with only one place, S, and no internal transitions. This is not a productive view of the place S!

    For a better understanding of S, notice that, after each time the producer performs the produce step, a token is added to the place S, and before each time the consumer performs the consume, a token is removed from the place S. The count of tokens in S is therefore a count of the number of items that have been produced but not yet consumed. Therefore, we can model S in a program by a semaphore, and we can model the transition X as a signal operation on this semaphore, while we model the transition Y as a wait operation on this semaphore.

    In fact, this model generalizes. All semaphores can be modeled as places in a Petri net, with each signal operation modeled by a fork that deposits a token in that place, and each wait operation modeled by a join that consumes a token from that place. This understanding of synchronization as being theoretically equivalent to control flow is a relatively new insight!

    Bounded Buffers

    The simplest general implementation of a FIFO queue between a producer and a consumer is called a bounded buffer. A bounded buffer has a fixed maximum data capacity, but at any time, it may be empty, it may contain fewer items than its capacity, or it may be full. The items in the queue are stored in an array, the buffer, between the head and tail pointers, as follows:

       ________________________________
       |  |  |\\|\\|\\|\\|\\|\\|  |  |  |
       |__|__|\\|\\|\\|\\|\\|\\|__|__|__|
        Empty |     Data        | Empty
              |                 |
            Head               Tail
     

    In the illustration above, the head pointer refers to the first item in the queue, the item that will be returned by the next call to the dequeue operation, and the tail pointer refers to the free space that will be filled by the next call to enqueue. This arrangement assumes that the pointers are incremented after each use. A second possibility is to increment before use, in which case, the head pointer would reference the free location from which data was most recently dequeued, and the tail pointer would reference the data most recently enqueued.

    The operations on this structure are simple. If we ignore the problem of preventing enqueueing data in a full queue or dequeueing data from an empty queue, they are:

          enque(d)
               buffer[tail] = d
               tail = tail + 1 (modulo queue size)
     
           dequeue(d)
               d = buffer[head]
               head = head + 1 (modulo queue size)
     

    We can detect that the queue is empty or full fairly easily:

          empty := (head = tail)
           full := (head = tail)
     

    Unfortunately, these conditions are the same, and we can only distinguish them by noting the history. If, after enqueue, the head and tail pointers are the same, the queue is full. If, after dequeue, this happens, the queue is empty.

    In applications where busy waiting is appropriate, the full and empty conditions are sometimes distinguished by the simple expedient of wasting one slot in the buffer; in effect, we declare the queue to be full when there is actually one free spot remaining:

          full := (tail + 1 = head) (modulo queue size)
     

    Exercise: Propose other solutions to detecting the full and empty conditions in a circular buffer, perhaps by counting the data elements in the buffer. Compare your solution with the one proposed above! Does it take more or less space for a bounded buffer of 256 characters? How about a bounded buffer of 256 floating point numbers? Does your alternative have a higher or lower run-time cost than the approach outlined above for detecting the queue full and empty conditions?
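
    For comparison with the exercise, here is a small sketch of the wasted-slot convention just described, for a single producer and a single consumer with no blocking; the buffer size and element type are arbitrary.

        #define QSIZE 256

        static char buffer[QSIZE];
        static unsigned head = 0, tail = 0;   /* indices into the buffer */

        int empty(void) { return head == tail; }
        int full(void)  { return (tail + 1) % QSIZE == head; }

        /* Callers must test full() and empty() first in this non-blocking sketch. */
        void enque(char d)
        {
            buffer[tail] = d;
            tail = (tail + 1) % QSIZE;
        }

        char dequeue(void)
        {
            char d = buffer[head];
            head = (head + 1) % QSIZE;
            return d;
        }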

    Shared Queues

    When a queue implemented as a bounded buffer is shared between multiple processes, there are two different problems. One is assurance of mutual exclusion, so that, for instance, no two processes try to enqueue an item of data in the same queue slot, and the other is keeping track of the number of available data items and free spaces. All of these can be accomplished using semaphores:

          Binary semaphores used to provide mutual exclusion:
               headlock -- protects the head pointer
               taillock -- protects the tail pointer
     
           General semaphores used to count available resources:
               free -- counts number of free spaces in the buffer
               data -- counts number of data items in the buffer
     
           Initially,
               headlock.count = 1
               taillock.count = 1
               free.count = buffer capacity
               data.count = 0
     

    Given these semaphores, the operations on a shared bounded buffer queue can be implemented as follows:

          enque(d)
               P(free)     -- wait for free space to put data
               P(taillock) -- wait for exclusive use of tail
               buffer[tail] = d
               tail = tail + 1 (modulo queue size)
               V(taillock) -- release the tail
               V(data)     -- signal that data is available
               return
     
           dequeue(d)
               P(data)     -- wait for data to be available
               P(headlock) -- wait for exclusive use of head
               d = buffer[head]
               head = head + 1 (modulo queue size)
               V(headlock) -- release the head
               V(free)     -- signal free space available
               return
     
    This implementation of shared FIFO queues is universally applicable whenever queues are used to communicate between processes that run under the process manager. There are some variations worth noting:

    First, the mutual exclusion locks are only needed if there are multiple producers or consumers. When there is only one producer process, the producer is the only process that will inspect and manipulate the tail pointer, so no mutual exclusion semaphore is needed to protect the tail. The same holds for the head.

    Second, the critical sections in this code are very brief, so it is unlikely that any process will wait long for entry. As a result, in system-level code on a uniprocessor, these can be protected by disabling interrupts, and in a multiprocessor setting, spin-locks are appropriate.

    It is also worth noting that if queue items are very large, we do not need to read or write the item from the queue inside the critical section; all we need to do is sample and increment the head or tail pointer in the section, and then do the data copy outside, as follows:

          enque(d)
               local variable t
               P(free)     -- wait for free space to put data
               P(taillock) -- wait for exclusive use of tail
               t = tail
               tail = tail + 1 (modulo queue size)
               V(taillock) -- release the tail
     
               buffer[t] = d  -- copy data here!
     
               V(data)     -- signal that data is available
               return
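
    For concreteness, the same shared queue can be written with POSIX counting semaphores, where sem_init, sem_wait and sem_post correspond to initialization, P and V; the capacity and element type are arbitrary and error checking is omitted.

        #include <semaphore.h>

        #define CAPACITY 64

        static int   buffer[CAPACITY];
        static int   head = 0, tail = 0;
        static sem_t headlock, taillock;   /* binary semaphores for mutual exclusion */
        static sem_t freespace, data;      /* general semaphores counting resources  */

        void queue_init(void)
        {
            sem_init(&headlock, 0, 1);
            sem_init(&taillock, 0, 1);
            sem_init(&freespace, 0, CAPACITY);
            sem_init(&data, 0, 0);
        }

        void enque(int d)
        {
            sem_wait(&freespace);          /* P(free): wait for a free slot  */
            sem_wait(&taillock);           /* P(taillock)                    */
            buffer[tail] = d;
            tail = (tail + 1) % CAPACITY;
            sem_post(&taillock);           /* V(taillock)                    */
            sem_post(&data);               /* V(data): an item is available  */
        }

        int dequeue(void)
        {
            int d;
            sem_wait(&data);               /* P(data): wait for an item      */
            sem_wait(&headlock);           /* P(headlock)                    */
            d = buffer[head];
            head = (head + 1) % CAPACITY;
            sem_post(&headlock);           /* V(headlock)                    */
            sem_post(&freespace);          /* V(free): a slot is now free    */
            return d;
        }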
     

    Variations on Queues

    If the buffer has a capacity of one, the head and tail pointers are superfluous, as are the locks protecting them. In this case, the producer and consumer processes would be expected to operate in lock-step, producing one data item, then consuming it before the next is produced.

    Exercise: Write the code for this limiting case for a bounded buffer queue. How many semaphores are required? What purposes do the remaining semaphores serve, and what range of values do they take on?

    Lock step operation of the producer and consumer is frequently inefficient because unless the computational requirements of the two processes are identical, either the producer will spend time waiting for the consumer, or vice versa. Nonetheless, this one-item queue allows the actual computations of the producer and consumer to overlap, thus allowing parallelism that would not have been possible in a purely sequential setting.

    From elementary queueing theory, we can predict the following: If the average time taken to produce an item is the same as the average time taken to consume an item, and the times vary randomly, the performance of the system will be maximized with an infinite queue capacity and an infinite average number of items in the queue. Furthermore, as the size of the bounded buffer is decreased, the percentage of the time that the producer spends waiting for space or the consumer spends waiting for data will increase smoothly. Put briefly, so long as there is any available memory, performance can be increased by using this memory to increase the queue capacity.

    The bounded buffer with a capacity of two is a commonly used special case, particularly in 1960's vintage input/output subsystems. In that context, it is frequently described as "Double Buffered I/O".

    Exercise: Write the code for this common case for a bounded buffer queue. How many semaphores are required if you assume a single consumer and a single producer? What purposes do the remaining semaphores serve, and what range of values do they take on?

    When multiple FIFO queues are implemented using linked lists instead of arrays, it is common to have a single shared pool of free list items instead of a bounded supply of list items for each queue. For a given amount of memory, this can markedly improve performance, but it adds slightly to the complexity of the code. Specifically, instead of having a separate space semaphore to count the amount of free space in each queue, there is a single free space semaphore.

    Exercise: Write the code for linked list queues in the general case where there are multiple queues, where each queue may serve multiple producers and multiple consumers. Assume that there is only one type of object enqueued in any of the many queues -- call it a buffer, and assume that each buffer has a data field and however many links are required for the list structuring scheme you use. How many semaphores are required per queue, and how many additional semaphores are shared by all queues? Identify the purpose of each!

    Other problems

    Do not be misled into thinking that the mutual exclusion and producer-consumer problems are the only problems in parallel programming. They are the most important and most common problems, but there are many others:

    The readers-writers problem is a common problem in file-system design. A file may either be closed, open for read access by any number of users, or open for read-write access by exactly one user.

    Problem: Implement the synchronization required for the open-for-read, open-for-write and close operations under the constraints of the readers-writers problem, using semaphores and shared variables.

    Deadlock, the situation that arises when multiple processes are blocked, each awaiting some action by another blocked process, is a common failing in parallel programs. There are complex algorithms for both deadlock avoidance and deadlock detection. We will talk about these later.

    Memory Management

    Key Issues

    The key questions for any resource management system are:

    • WHO requested the resource?
    • WHAT resource was requested?
    • WHERE should the resource be allocated?
    • HOW should the resource be allocated?
    • WHY should the resource be allocated?
    • WHEN should the resource be allocated?

    WHO requested the resource?

    This is not trivial! On most systems, resources are allocated to processes, but this is not universal!

    If the process is the object to which all allocation is attributed, then the process becomes a very heavy object, and changing processes becomes expensive. As a result, on systems where this is done, the system's notion of process is frequently augmented with lightweight processes, to which no resources other than the CPU are allocated. These are called threads.

    On the old Berkeley Timesharing System, in the late 1960's, resources were allocated to interactive users. Each interactive user had a single virtual address space containing up to 64 pages of memory, and the user could have any number of processes running in this address space, where at any time, each process could directly address 8 pages of this 64-page address space. Berkeley Timesharing System processes were very lightweight objects, with a total process description contained in only 8 words of memory! Using today's terminology, we would say that a Berkeley Timesharing System user was a process, while the processes in this system correspond to threads in modern terminology. In fact, the Berkeley system was the first to incorporate this modern idea of multithreaded processes.

    Aside: The word process originally always meant a locus of control, or some entity in a system that could execute a sequential program. Another good definition (though not usually used) is that each process runs on a virtual CPU created by the scheduler.

    This definition has been largely forgotten in the UNIX community, where it is common to use the word process to refer to the unit of activity to which allocated resources are bound, even if that unit contains many lightweight threads, each of which is technically a process in the older sense of the word. For now, we must live with the resulting confusion.

    WHAT resource was requested?

    Memory can be allocated by the byte, the word, the page, or the segment. In any particular context, the system typically only provides one unit of allocation, and it is up to the user or the applications library to manufacture the desired units of memory out of those allocated by the system.

    Bytes and Words are typically the smallest addressable units of memory. Allocating by the byte or word is quite rare, except in systems where the word is sufficiently large to accommodate two pointers. In that case, randomly allocated words are sufficient to build arbitrary binary-linked data structures, which in turn, can be used to mimic any other data structure. This is the foundation of the LISP worldview.

    For those unfamiliar with LISP, the world (containing both programs and data structures) is represented entirely with S-expressions. The simplest S-expressions are known as atoms; atoms include numbers (represented in 1 word) and textual atoms (represented using a special word that includes the property list for the atom; one property in this list is the printable name of the atom). Complex S-expressions may be formed by composing S-expressions into lists. The compose operator (CONS) takes 2 S-expressions as parameters and returns a new S-expression that joins these. The new S-expression is represented by a word holding 2 pointers, one to the left component (the CAR) and one to the right component (the CDR).

    Pages are fixed sized blocks of memory, with the size typically determined by some characteristic of the addressing hardware on the machine. Because the page size is determined by the hardware and not by the application, it is generally necessary to construct large logical objects out of multiple pages, or to divide single pages into multiple logical objects. Typically, the memory management unit hardware allows multiple pages to occupy logically contiguous addresses independent of their physical locations in memory, so that large objects can easily span many pages.

    Segments are variable sized blocks of memory, with the size typically determined by some characteristic of the application. Segments are typically supported by hardware, while otherwise identical blocks of memory that are purely software artifacts are usually referred to simply as Blocks. Each block or segment typically contains one logical object, but sometimes, the economics of allocation are such that the application sub-allocates a block or segment.

    WHERE should the memory be allocated?

    A naive user typically assumes that blocks of memory requested by an application are allocated directly in main memory, so that the application directly uses main memory addresses. This was true in first-generation systems, including both the mainframe systems of the 1950's and the first generation of microcomputer operating systems in the 1970's and 1980's, but by 1970, most mainframe vendors were moving away from this view, and by 1995, the operating systems on microcomputers were well on their way to abandoning this simple view.

    Most interesting operating systems only use physical memory addresses for internal data structures of the operating system itself. All user data structures are allocated in memory segments allocated to the users by the system. This allows the system, by means of the memory management unit, to protect itself from errant user programs, and it allows user programs to be isolated from each other, with the only permitted communication being through authorized channels.

    The question of where memory is allocated has another interpretation: Within any given address space from which memory may be allocated, where should it be taken? The answer is trivial if all but the requested amount has already been allocated. Another trivial case is when each block of allocatable memory is identical and interchangeable with all others, as is the case when the unit of allocation is the page or any other fixed size block, for example, the LISP S-expression node.

    A final aspect of this question arises in machines with non-uniform memory access times (so-called NUMA architectures). On such machines, it is important to try to allocate memory "near" the CPU that is the primary user of that memory, where distance is measured in terms of expected access time.

    HOW should the memory be allocated?

    Allocation for all time may be different from temporary allocation, and some systems allow preemption of memory, either with the cooperation of the user to which the memory was allocated, or without notice.

    Almost all preemptable resources are used for implementing virtual resources. Preemptive schedulers create one virtual CPU per process. Demand paged storage managers preempt page frames with every page fault in order to create virtual address spaces. Window managers preempt regions on the screen to support different overlapped virtual screens (windows).

    Non preemptable resources exist at the bottom level in each such virtual resource hierarchy. The lowest level backing store in a virtual memory environment cannot easily be preempted. The (possibly virtual) memory used to implement a window manager is similar, as is the memory used to hold the states of the ready and waiting processes under a process manager.

    Note that there are many different types of virtual resources in the above example, but all are implemented in terms of the same underlying non preemptable resource. This is a common pattern.

    WHY should the resource be allocated?

    Why not! It is up to the user to decide why they want a resource. Operating systems have no basis for asking this question, although it should be asked by program designers.

    WHEN should the resource be allocated?

    Can the user wait? Can the user give advance notice? Must the user give notice? If we attempt to allocate physical resources on demand, and if we block users who ask for resources that are currently unavailable, we run the risk that one process may be blocked waiting for a resource currently held by another process, while the other process is blocked waiting for a resource held by the first. This is one example of the more general problem of deadlock.

    With preemptable memory resources, such as page frames, demand allocation is common, but performance can frequently be improved either by forcing some users to wait before their demands are met, or by asking users to provide advance notice of their expected demand. Sometimes the system can anticipate a user's allocation request and allocate resources before they are needed.

    Memory management algorithms

    Consider the simple case of allocating blocks of physical memory to system or user code on a system with no protection mechanisms. With small changes, the same mechanism may be used to allocate blocks from the virtual memory segments of a user program. The interface to such a memory manager is frequently specified by the programming language.

    The definition of Pascal, for example, includes the special procedure new(p) that sets the pointer p to point to a new block of memory sized to fit an object of the type referenced by p. Most implementations of Pascal include the nonstandard procedure dispose(p) that deallocates the block referenced by p and sets p to nil.

    The C standard library, similarly, includes the function malloc(s) that returns a pointer to a new block of s bytes of memory, and the function free(p) that deallocates the block referenced by p, if that block was previously allocated by malloc. C is a weakly typed language, so the programmer is responsible for making sure that the size, in bytes, passed to malloc matches the size of the object required.
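
    For example (a sketch, not from the original text), the usual C idiom uses sizeof so that the byte count passed to malloc matches the type of the object being allocated:

        #include <stdlib.h>

        struct point { double x, y; };

        void example( void ) {
            /* the size passed to malloc must match the object's type */
            struct point *p = malloc( sizeof(struct point) );
            if (p != NULL) {
                p->x = 1.0;
                p->y = 2.0;
                free( p );    /* return the block to the heap */
            }
        }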

    It is easy to see that the C and Pascal memory management interfaces are basically the same. Both allocate memory from a memory space called the heap. It is noteworthy that C++ was originally implemented by a preprocessor that converted the C++ program to an equivalent C program. When the C++ preprocessor encounters code requiring the instantiation of a new object, it generates C code calling malloc to create the structure representing the object.

    There are a huge number of algorithms for implementing the basic allocate and deallocate operations. Two representative examples will be given here, the binary buddy system and a boundary tag method.

    Note that memory management was once a central problem in operating systems, when user processes shared real memory with each other. In modern uniprocessors, it may seem far less critical because user code runs in virtual memory, and allocation and preemption of page frames is relatively trivial.

    The use of virtual memory may seem to push the burden of memory management into the user processes, where it shows up either as an application programming problem or as part of the language run-time system, for example, in the implementation of new and dispose or malloc and free in Pascal and C, or as the garbage collector underlying an implementation of LISP, Simula, Smalltalk or Java.

    Of course, systems must manage the backing store, but this is typically allocated in units of interchangeable disk tracks or sectors. There are performance penalties for allocating disk space poorly, but the problem is one of performance, not absolute necessity. Simple allocation heuristics such as allocating that free sector closest to the other allocated sectors of the same file (or virtual address space) can give excellent results.

    In fact, much of the message of the above paragraphs is incorrect. When an application chooses the wrong allocation algorithm, it can paralyze the system. On some UNIX systems from the late 1980's, for example, the system needed to be periodically rebooted because the X-windows window manager, an applications program, consumed all of the available virtual memory. The problem was severe external fragmentation within the memory allocated to the window manager by the system.

    Problems faced in memory allocation:

    Fragmentation is the term used to describe how the free space in a pool from which memory is allocated is slowly broken up into many small pieces. There may be hundreds of words of free space, and yet a request to allocate a ten word block of memory may fail because there are no ten consecutive free words!

    Overhead is the term used to describe additional memory that must be allocated, above and beyond that requested by the users, in order to provide for management of the pool of memory. For example, it may require a few words of memory to describe the location of each allocated memory block.

    The consequence of fragmentation and overhead is that, although the total user demand for memory may be less than the number of words available, the system may be unable to meet that demand because the largest available block of free memory is smaller than the smallest user demand that has yet to be satisfied.

    Fragments of free space, and fragmentation in general, come from two sources:

    Internal fragments are fragments of free space that are allocated to users even though the user does not request them. For example, on many modern processors, byte addressing is quite expensive, so each allocated block of memory must be aligned on a word boundary. If a user wants an odd number of bytes, this will require rounding up the user request to the next word boundary, allocating space to that user that the user does not want.

    External fragments are fragments of free space between allocated blocks. External fragmentation can be eliminated by a compacting garbage collector; one that can move blocks of memory around in the heap in order to consolidate the fragments, but these are uncommon outside the specialized area of run-time support for Java.

    A trivial memory management algorithm

    Allocate fixed size blocks with a size greater than the maximum possible user request.

    This leads to awful internal fragmentation unless all user requests are for the same sized block. Allocation and deallocation are very fast, requiring only a singly linked free list to keep track of free blocks.
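
    A hedged sketch of such a fixed-size-block allocator in C is shown below; the pool and block sizes are arbitrary illustrative values, and a free block simply stores the pointer to the next free block in its first word:

        #include <stddef.h>

        #define BLOCK_WORDS  8       /* illustrative block size, in pointer-sized words */
        #define POOL_BLOCKS  1024    /* illustrative pool size */

        /* the pool is declared as an array of pointers so that every block is
           properly aligned to hold the free-list link in its first word */
        static void *pool[ POOL_BLOCKS * BLOCK_WORDS ];
        static void *free_list = NULL;

        static void init_pool( void ) {
            for (int i = 0; i < POOL_BLOCKS; i++) {
                void **block = &pool[ i * BLOCK_WORDS ];
                *block = free_list;          /* link the block onto the free list */
                free_list = block;
            }
        }

        static void *alloc_block( void ) {
            void *block = free_list;
            if (block != NULL)
                free_list = *(void **)block; /* unlink the first free block */
            return block;
        }

        static void free_block( void *block ) {
            *(void **)block = free_list;     /* push the block back on the list */
            free_list = block;
        }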

    If there are a variety of block sizes being requested, this algorithm can be improved by using a separate free list for each popular block size. The problem with this is that it introduces a new fragmentation problem, there may be plenty of free space, but only in sizes that don't match the user's request. If blocks of one size can be combined to make larger blocks or broken up to make smaller blocks, this problem is solved. The result is a class of algorithms known as buddy systems.

    The Binary Buddy System

    In the simplest of the buddy systems, all block sizes are powers of two. Any large block may be split in half to make two smaller blocks, and any two blocks that were originally split from the same parent may be combined to reconstruct that parent.

    The central data structure used in all buddy systems is an array of free-lists:

          Array of
           free-list
           heads               ______       ______
            ________     ---->|_____o----->|______|
      128K |________|   |     |  64k |     |  64k |
      64K  |_______o----      |      |     |      |
      32K  |_______o------    |______|     |______|
      16K  |________|     |    ______       ______
      8K   |_______o----   -->|_____o----->|_____o---
      4K   |________|   |     |  32k |     |  32k |  |
      2K   |________|   |     |______|     |______|  |
      1K   |________|   |      ______       ______   |
      512  |________|   |     |______|<-----o_____|<-
      256  |________|   |     |  32k |     |  32k |
           (indexed     |     |______|     |______|
            by block    |      ______       ______
            size)        ---->|_____o----->|______|
                              |___8k_|     |___8k_|
     
     
    Note that the above picture shows the state of a buddy system storage allocator with an array of free lists that allows for blocks of size up to 128k bytes. The system currently has two blocks of size 64k bytes, four blocks of size 32k bytes, and two blocks of size 8k bytes. When a block is in use, the block is entirely available to the user. When the block is free, the first few bytes of the block are used to hold a pointer to the next free block of the same size.

    The minimum block size is determined by the size of a pointer. If a user asks for less than this, there will be internal fragmentation because we cannot form blocks into free lists if we cannot store one pointer in each block.

    The number of table entries required is determined by the maximum and minimum block sizes. In the above example, the maximum was 128k (2^17), and the minimum on a 32 bit machine would typically be 4 (2^2); in this case, the table itself requires 17-2+1 or 16 entries. This is the storage overhead of the binary buddy system.

    Binary Buddy System Algorithms

    to allocate a block of size n:
       pick i such that 2^(i-1) < n <= 2^i
       if freelist[i] not empty
         return the first element of freelist[i] to the user
         and set freelist[i] to point to the next element
       else
         allocate a block of size 2^(i+1) (this is recursive)
         and split it in half.  Return half to the user
         and put the other half on freelist[i].
       end
     

    The allocation code outlined above is recursive! Furthermore, because we are constrained to allocate powers of two, if user requests are for uniformly distributed random block sizes, we expect that the typical request will only use 3/4 of each block, resulting in an average loss to internal fragmentation of 1/4 of the total memory capacity.
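
    A short calculation (not in the original notes) makes the 1/4 estimate plausible. If request sizes n are uniformly distributed over the interval served by blocks of size 2^i, then:

        E[\,n \mid 2^{i-1} < n \le 2^i\,] \approx \tfrac{3}{4}\cdot 2^i
        \qquad\Longrightarrow\qquad
        \text{expected waste} \approx 2^i - \tfrac{3}{4}\cdot 2^i = \tfrac{1}{4}\cdot 2^i

    so, averaged over many requests, about one quarter of each allocated block goes unused.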

    to deallocate a block of size n:
       pick i such that 2^(i-1) < n <= 2^i
       compute b, the buddy of the deallocated block.
       if b is already in freelist[i]
         remove b from freelist[i].
         combine b and its buddy, and deallocate the
         resulting block of size 2^(i+1) (this is recursive)
       else
         add the deallocated block to freelist[i].
     

    Again, this code is expressed in recursive form. It should be noted that this version of deallocate uses what is referred to as prompt merger of free blocks. That is, free blocks are merged with their buddies as soon as possible after deallocation. The result of prompt merger is that there will always be the best possible supply of large free blocks, but if most of the traffic is in one particular block size, there will be unnecessary computation involved in the repeated merger and splitting of blocks.

    Prompt merger is a good solution for real-time systems, where our concern is a bounded worst-case time per operation. If, instead, we are more worried about minimizing the total computation time, we should use what is called lazy merger. That is, we should avoid merging free blocks for as long as possible, letting the free lists for popular block sizes grow as long as possible and exhausting the free lists for unpopular block sizes.

    In lazy merger, free blocks are only merged when an allocation request comes in for a block size where the free lists for that size and all larger sizes are empty. At that point, the allocate routine begins performing mergers of smaller free blocks until it builds a block large enough to satisfy the request. This means that there may be a huge number of mergers in response to one allocation request.

    Exercise: For a total heap size of N words, what is the worst case number of splits and mergers for allocate and deallocate? Do this first for the prompt version, and then for the lazy version.

    Finding the buddy of a block i of size s is easy in the binary buddy system:

      buddy(i) = i xor s
     
    Note that the above rule violates most people's ideas of strong type checking, since we compute the pointer to the buddy of a block by exclusive-oring the size of the block (an integer) with the pointer to the block!
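
    In C, that computation might be written as follows; this is only a sketch, and it assumes block addresses are taken as byte offsets from an assumed heap_base pointer so that the exclusive-or works no matter where the heap sits in memory:

        #include <stddef.h>
        #include <stdint.h>

        /* buddy(i) = i xor s: given the address p of a block of size s (a power
           of two) and an assumed heap_base pointer to the start of the managed
           heap, compute the address of the block's buddy */
        static void *buddy( void *p, size_t s, void *heap_base ) {
            uintptr_t offset = (uintptr_t)p - (uintptr_t)heap_base;
            return (void *)((uintptr_t)heap_base + (offset ^ s));
        }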

    The Fibonacci Buddy System

    The Fibonacci buddy system is an alternative where block sizes are Fibonacci numbers. This reduces internal fragmentation by providing for a larger variety of free block sizes.

    The Fibonacci numbers are 1 1 2 3 5 8 13 21 ...

    Each Fibonacci number is the sum of its two predecessors. The problem with this system is that there is no trivial way to compute the address of the buddy of a block given the address of that block.

    Exercise: Implement the Fibonacci buddy system.

    Boundary Tag Storage Allocation

    Boundary tags are data structures on the boundary between blocks in the heap from which storage is allocated. The use of such tags allows blocks of arbitrary size to be used -- the tag describes, for each block, how big it is, its status, and its neighbors:

            HEAP________
                |________| tag
                |  used  |
                |        |
                |        |
                |________|
                |________| tag
                |  used  |
                |________|
                |________| tag
                |  free  |
                |        |
                |________|
                |________| tag
                |  used  |
                |________|
                |________| tag
                |  free  |
                |________|
                |________| tag
     
    The consequence of allocating arbitrary sized blocks on the heap is that the bookkeeping needed to split blocks, merge adjacent free blocks, find the right size block, and find the neighbors of a block all become more complex. The boundary tag data structure permits this, but we pay a price, the overhead of the tag itself.

    The decision to use a boundary tag method poses many choices. Among these are the lazy versus prompt merger decision that was present with the buddy systems. Another important choice is that between best fit and first fit allocation: When searching a collection of free blocks, should we take the first block we find that is big enough to satisfy a user's demand, or should we do a more careful search, looking for the best fit?

    Intuition suggests that best-fit allocation should reduce fragmentation, but in fact, if an exact fit is unavailable and there is a wide variety of commonly requested block sizes, empirical evidence shows that first fit generally works better because it produces larger fragments that are more likely to prove useful at a later time. The problem with best-fit allocation is that it produces large numbers of very small fragments.

    It is very important to note that in most heap management environments, an exact fit is very likely to be available! Most applications repeatedly allocate and deallocate objects of only a few sizes, and a good storage manager should take advantage of this. On the other hand, it is important to remember that there are applications that have genuinely random sequences of allocations. One example of such a random sequence of requests comes when opening a succession of windows to view images, where each image was hand cropped (trimmed to size) to eliminate distracting or irrelevant content. Some web pages are full of such randomly sized images, and viewing such pages can lead to significant heap fragmentation.

    Much of the early research in the area of heap management was done by the Burroughs Corporation in the early 1960's, but research in this area continues. Some of the best recent papers in the area were published in the Proceedings of the ACM SIGPLAN International Symposium on Memory Management, published as SIGPLAN Notices, Vol 34, No 3, March 1999. The most relevant paper in this proceedings is "The Memory Fragmentation Problem: Solved?" by Johnstone and Wilson, pages 26-48.

    Boundary Tags

            |_______|
             |_______| <--
             |       |    |-- Block boundary tags
             |       |    |   give
             | block |    |    1) Block status
             |       |    |        (used or free)
             |       |    |    2) Block size
             |_______|    |    3) Addresses of neighbors
             |_______| <--
             |       |        (2 & 3 are correlated!)
     
    With the buddy system, the heap manager needed no record of the blocks allocated to users, but the arbitrary sized blocks used in boundary tag methods require that the manager keep a record of each block in the heap, whether it is allocated or free. These records are easily maintained in the heap itself, in the form of a prefix or suffix on each block. From the user's point of view, these prefixes and suffixes are not part of the block -- they are part of the overhead of the storage manager.

    When the heap is viewed as a sequence of blocks, the suffix on one block is adjacent to the prefix on the next block, and every boundary between blocks, whether they are used blocks or free blocks, contains a prefix-suffix pair. This pair is the boundary tag from which the entire class of storage allocation methods takes its name.

    For each block, the boundary tag must contain the status of that block so that, at the appropriate time, it can be determined whether the block is free, available for allocation or merger with another free block, or in use, and therefore unavailable. Another piece of necessary information is the size of the block, so that the allocation algorithm can determine whether the block is big enough to satisfy some allocation request. Finally, when it is necessary to merge free blocks, it must be possible to find the neighbors of a free block in order to see whether they are free.

    The size of a block and the addresses of the neighbors are redundant, since if you know the address of the following block and the size of the boundary tag fields, you can compute the size of a block, and if you know the size of a block, you can compute the address of either end if you are given the address of the other end.

    Boundary Tag Alternatives

       First Alternative     Second Alternative
     
        |    |        | ^        |________|
        |    |        | |        |__size__|
         ->->|________| |        |        |
          |  |_status_| |        |  block |
          |  |__next_o--         |________|
          |  |__prev_o----       |__size__|
          |  |        |   |      |_status_|
          |  |        |   v      |        |
     
    There are two popular ways of storing the information needed in the boundary tags. In the left method shown above, the boundary tags are arranged into a doubly linked list. Each tag contains a pointer to the next tag and to the previous tag, and each tag contains the status of the block it immediately precedes. As shown in the illustrations, all pointers point to the first word of a block in the heap, so positive offsets from the pointer are indices into the block and negative offsets are indices into the boundary tag. This convention is convenient but not essential.
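
    A hedged C sketch of this first alternative might look like the following; the field names are illustrative only, and the block_size() helper shows how the size is recovered from adjacent tag addresses rather than stored:

        #include <stddef.h>

        /* one boundary tag, stored immediately before the block it describes */
        struct tag {
            struct tag *next;    /* tag of the following block in the heap */
            struct tag *prev;    /* tag of the preceding block in the heap */
            int         status;  /* FREE or USED, describing the block that follows */
        };

        /* the size of the block following tag t is implied by the tag addresses */
        static size_t block_size( struct tag *t ) {
            return (size_t)((char *)t->next - ((char *)t + sizeof(struct tag)));
        }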

    The right method shown above records the size of each block as a prefix and suffix on that block, and includes the status in either the prefix or suffix or, on some systems, both (above, it is shown in the prefix).

    Boundary Tag Free Lists

           ^ |        |      In the worst case,
           | |  free  |      free blocks must
           | |________|      be arranged into
            --o_______|      a doubly-linked
         -----o_______|      free list.  This
        |  ^ |////////| tag  allows fast
        |  | |        |      allocation and eager
        |  | |  used  |      merger.
        |  | |________|
        |  | |////////| tag  For some other
        |  | |        |      choices of allocation
        |  | |  free  |      method, a simpler
        |  | |________|      free list structure
        |   --o_______|      will suffice.
         -> --o_______|
           | |////////| tag
           v |        |
     
     

    The free list is maintained in the free blocks themselves, not the boundary tags, so the minimum free block size depends on the number of pointers per list element, two if the list is doubly linked, typically one word each. The boundary tag itself requires two pointers or two block sizes (typically, the block size is the same size as a pointer), and one status bit (typically a whole word), so in order to allocate a two word block on the heap, it is typically necessary to find a total of five free words.

    The above argument implies that boundary tag methods cannot eliminate internal fragmentation! When a user requests a block of size X and the only block available is size X + E (E is the extra part), then the available block must be split into a block of size exactly X and a block of size E - T (where T is the size of a boundary tag). If E - T is smaller than needed to store the pointers needed to link it into the free-list, then the split cannot be done. For the example data structure shown above, E must be at least 5 words, so there will be occasional internal fragmentation involving fragments of 4 or fewer words.

    In the extreme case, no free list is needed. When a user wishes to do allocation, a linear search over the list of boundary tags will always manage to find a free block, but if the heap is largely full, this search will be unnecessarily slow as it examines all of the allocated blocks in the heap in the search for a free block of the right size.

    If free block merger is lazy, then the merger can be done by a linear scan of the entire heap, rebuilding the free list as free blocks are found and merged. In this case, the free list can be singly linked.

    If free block merger is done on an eager basis (at deallocate time) or on a lazy basis (at allocate time) during the scan of the free list, where the free list is randomly linked with no attention to the addresses of blocks in the heap, then double linking is needed because when two adjacent free blocks are merged, one has to be removed from the free list, and without a double link, its predecessor in the list cannot be found.

    The so-called fastest fit storage allocation algorithm uses a simple heap manager with a singly linked free list as its foundation and lazy block mergers; free blocks of popular sizes are stored in secondary free lists, one for each popular block size. Allocations for commonly used sizes check the appropriate secondary free list first, before using the underlying heap manager. Deallocations of common block sizes simply place the blocks in the secondary list (uncommon sizes can be returned to the main free list).

    Only when an allocation request to the underlying heap manager cannot be satisfied are free blocks merged -- this is done by scanning the entire heap and rebuilding the main free list, leaving all the secondary free lists empty.
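
    The shape of this scheme can be sketched in C as follows; heap_alloc, heap_free and merge_all_free_blocks stand in for the underlying first-fit heap manager and are assumptions, not real library routines:

        #include <stddef.h>

        #define MAX_QUICK 256   /* sizes up to this are "popular" (illustrative) */

        static void *quick_list[ MAX_QUICK + 1 ];        /* one secondary list per size */

        extern void *heap_alloc( size_t size );          /* assumed first-fit manager    */
        extern void  heap_free( void *block );
        extern void  merge_all_free_blocks( void );      /* rebuilds the main free list  */

        void *fastest_fit_alloc( size_t size ) {
            void *block;
            if (size <= MAX_QUICK && quick_list[size] != NULL) {
                block = quick_list[size];                /* exact fit, no search */
                quick_list[size] = *(void **)block;
                return block;
            }
            block = heap_alloc( size );                  /* fall back to the main heap */
            if (block == NULL) {
                merge_all_free_blocks();                 /* lazy merger: scan whole heap */
                block = heap_alloc( size );
            }
            return block;
        }

        void fastest_fit_free( void *block, size_t size ) {
            if (size <= MAX_QUICK) {
                *(void **)block = quick_list[size];      /* just push on secondary list */
                quick_list[size] = block;
            } else {
                heap_free( block );                      /* uncommon size: main free list */
            }
        }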

    The fastest fit algorithm gives what may be the best average-case performance, but its worst case performance is not good. For the worst case, first-fit is better! This is one of many examples of situations where the best algorithm for real-time use is not the same as the best algorithm for minimizing the total run-time!

    Exercise: Write code for a boundary tag storage manager, using prompt merger.

    Virtual Memory

    What if the aggregate memory requirements of the system plus applications exceed the available main memory? If there is auxiliary memory available, for example, on disk, there are at least three ways of using this as a secondary memory or backing storage:

    • Overlays
    • Swapping
    • Paging

    Overlays

    Overlays are sections of an application's code or data that are stored on secondary memory when not in use. It is up to the application to read overlays into main memory when they are needed, and it is up to the application to write overlays back to secondary memory, if necessary, when they are unneeded.

    Typically, each overlay contains a procedure or function. The main program must load the overlay before calling the routine, and the routine in the overlay may call any routines that were loaded with the main program. In addition, an overlay may contain secondary routines called only from within that overlay, and it may load other overlays.

    In the most primitive overlay systems, memory to hold the overlays is allocated statically. In more advanced overlay systems, memory to hold overlays is allocated dynamically. In such a system, a call to a routine contained in an overlay might be coded as follows:

    	{ /* load overlay X and then call function x within it */
     		int X;       /* file descriptor */
     		int Xsiz;    /* size of file */
     		void * Xpt;  /* pointer to overlay */
     
     		/* first find out about the file */
     		X = open( "X", O_RDONLY );
     		Xsiz = lseek( X, 0, SEEK_END );
     
     		/* second, make space for the contents */
     		Xpt = malloc( Xsiz );
     
     		/* third, load the file into memory */
     		lseek( X, 0, SEEK_SET );
     		read( X, Xpt, Xsiz );
     
     		/* then, call the overlay code through a function pointer cast */
     		((void (*)()) Xpt)( params );
     
     		/* finally, dispose of the memory it occupied */
     		free( Xpt );
     	}
     

    The above C code assumes that the overlay can be loaded into memory by a simple call to read(). This will be true if the file contains position independent code, but a more complicated loader is required if any relocation or linkage needs to be performed.

    A common form of overlay is the load-on-call procedure. The purpose of this is to hide the details of overlay management from the user of the overlay, packaging these details in a special procedure called an overlay stub. The only part of the procedure initially loaded in main memory is the load-on-call stub; a call to this loads the procedure into main memory and then transfers control to it. Thus, the act of calling the procedure causes it to be loaded.

    The word stub is widely used in discussions of program development, dynamic linkage and distributed software, so it is important to make it clear that this word is not just technical jargon, but also common English.

    A stub is the remainder of a branch or limb that has been cut or broken off of a tree or animal. In the case of software, the tree that is of interest is the abstract tree describing the history of procedure and function calls during the execution of a program. The root of this tree is the main program, and each routine called from the root corresponds to a branch. A routine that calls no other routines is a leaf of the tree, and if some branch of the tree is run on a different machine, or loaded by a different loader, or involves code that has not yet been written, the code at the base of that branch is referred to as a stub.

    Consider the procedure X. In a program calling X where the memory is not sufficient for all routines needed by that program, a call to X might be linked to a stub instead of being linked to X. The following stub is designed to mimic the external interface of X, and when it is called, it loads X into memory, calls it, returns the results of X, and then deallocates the memory used so that it may be used for other routines.
    	Xstub(params)
     		Xpt = malloc( sizeof(X) )
     		load( X, Xpt )
     		(*Xpt)(params)
     		free( Xpt )
     		return
     

    Here, sizeof(f) returns the size of the memory region needed by the object file f, and load(f,p) is a system routine that loads object file f into the memory region pointed to by p.

    More intelligent overlay management stubs can swap in a routine only if that routine was not already in memory. Consider the following stub for X that shares memory with routines Y and Z:

            Xstub(params)
                 if Xpt is null then
                     Xpt = malloc( sizeof(X) )
                     if Xpt is null then -- the allocate failed
                         if Ypt is not null then
                             free(Ypt)
                         else if Zpt is not null then
                             free(Zpt)
                         endif
                         Xpt = malloc( sizeof(X) )
                     endif
                     load( X, Xpt )
                 endif
                 (*Xpt)(params)
                 return
     

    Here, when X is first called, the stub will allocate space and load the code. On subsequent calls, the code will remain loaded so no new allocation is needed. If, however, space is needed for routines Y or Z, X may be deallocated. If X is needed and space is unavailable, the stub for X will deallocate Y or Z to make space.

    The above scheme rests on the assumption that the stubs for X, Y and Z cooperate to share memory, and it rests on the assumption that the code for X can be deallocated safely when Y or Z need the space! This assumption can be met if no procedure called from X calls Y or Z, no procedure called from Y calls X or Z, and no procedure called from Z calls X or Y.

    By the late 1960's, most computer manufacturers offered elaborate linkage editors that would analyze the procedure call graph of a program to determine what procedures could operate as peers in overlay management schemes such as the above. Today, such overlay managers are almost obsolete because of the widespread use of virtual memory.

    Overlays allow a single program to run in a main memory region smaller than that program. The next technique we will discuss allows multiple processes to run in a memory that cannot hold all of them at once. Note, however, that it is almost always possible to break a single program into multiple communicating processes, where the purpose of the breakup is not to achieve parallelism, but merely to reduce the size of the address space needed by any single component of the original program. Therefore, we can also use the next idea as an alternative to overlays!

    Swapping

    Swapping involves the process scheduler. The term was originally based on the common English verb to swap meaning to exchange, because the basic idea involved exchanging a process (or process image) stored on disk for one stored in main memory. When swapping is used, memory assigned to ready processes is considered preemptable. If memory is needed, a ready process is selected to be swapped out. Swapping a process out of main memory involves copying the data segment(s) of that process to a swapping device, usually a disk. When a process is selected to be run, if that process is swapped out, the system must first find or make space in main memory, then swap that process in, before running it.

    This leads to a new process state swapped. Ready processes become swapped when their memory resources are moved to secondary memory, and swapped processes become ready when their memory resources are moved back to primary memory. Swapped processes cannot be run because the CPU is unable to execute instructions from secondary memory and unable to load or store operands of machine instructions in secondary memory!

             preempt
                or         swap
            relinquish     out
              ---->--     ---->-
            /         \ /        \
        RUNNING      READY     SWAPPED
            \         / \        /
              --<----     -<----
             schedule      swap
                            in
     

    Swapping systems are at their best if main memory can be divided into at least two partitions, each able to hold one process. In this case, the process in one partition can be running while processes in other partitions are being copied to or from disk. The classic IBM mainframe operating systems of the 1960's operated this way, as did early timesharing systems for the DEC PDP-8 and PDP-10 (a minicomputer and a mainframe, respectively). The first versions of UNIX, on the DEC PDP-9 and PDP-11 were also based on this idea.

    It is worth noting that, on these systems of the 1960's, swapping was frequently done to a drum memory, so early system descriptions frequently speak of the swapping drum.

    If memory on a uniprocessor was divided into N partitions, there could be at most N-1 ready processes and one running process. If the total number of processes that could be run was greater than N, there would typically be N-2 ready processes, because at any given instant the process in one of the partitions was in transit to or from disk.

    Paging

    Virtual memory based on paging and the related technique of segmenting involves hardware address translation mechanisms that allow individual pages of the main memory used by a program to be preemptively claimed by the system for use by other programs.

    The subject of virtual memory lies on one of the most important interfaces between operating systems and computer architecture, because it depends on the nature of the memory management unit hardware, and the memory management unit hardware exists primarily to support virtual memory. The memory management unit sits between the processor and main memory:

             _____               data               _____
             |     |  |<------------------------>|  |main |
             | CPU |--|          _____           |--| RAM |
             |_____|  |-------->|     |--------->|  |_____|
                        virtual | MMU | physical
                        address |_____| address
     

    Addresses issued by the CPU are considered to be virtual addresses. The memory management unit translates these to physical addresses or it complains (with a trap request) that the translation is impossible or that the proposed access is not allowed.

    Two primary models of address translation have been explored by the designers of memory management units: Paged memory management units operate in terms of fixed-size pages of memory, while segmented memory management units operate in terms of variable-size segments of memory. Mixed models are common in modern hardware; most of these are based on variable sized segments where each segment is composed of one or more fixed size pages.

    We will concentrate on paged memory, but there are many segmented machines. Notably, the 80x86 family of microprocessors supports a segmented memory model that is most clearly expressed in the 80286. This was not effectively exploited by most operating systems developed for the x86 family.

    Pages are typically used without regard to the logical arrangement of memory -- that is, data structures are not typically arranged in any particular way with respect to page boundaries. Programmers on paged systems do not usually pay any attention to the presence of pages. The exception to this is typically the heap manager; better heap managers know of the presence of pages in the heap and enlarge the heap by requesting extra pages when it fills, and also sometimes relinquish unneeded pages in the middle of large free blocks.

    In a segmented system, each data structure is typically assigned to a particular segment. Thus, a segment might be dedicated to each procedure or to each activation record on the stack in a fine-grained segmented system. In a coarse-grained use of segmenting, one segment might be reserved for the stack, another for the code, and a third for the heap and global data of a program.

    On paged systems, users generally view the virtual address as a single integer that indexes a word (or byte) in a single linear virtual address space. In contrast, on segmented systems, the virtual address space may be a single linear space, or it may be divided into separate spaces. The 80x86 is an example of the latter, where the segment being addressed is determined by the context. By default, PUSH and POP reference the stack segment, most loads and stores reference the data segment, and instruction fetches reference the code segment, but this default behavior may be overridden by special instruction prefixes.

    Paged Address Translation

    Here is a diagram of the data flow through the memory management unit hardware used to translate a virtual address to a physical address in a simple paged virtual address space.

        Virtual Address from CPU
          ______________________
         |__________|___________|   Word in Page
              | Page      |___________
              | ___ _________         |
              |  | | |       |        |
         array|__| |  Page   |        |
         index   | |   Table |        |
                 V |         |        |
                -->| |       |        |
                   |_|_______|        |
                    |    |Frame       |
                  __|    |__          |
                 |          |         |
                 v     _____v_________v______
              Access  |__________|___________|
              Control
                      Physical Address to Memory
     
     

    The picture here shows the logical function of the memory management unit, but there are many possible physical organizations -- if the page table is large, it is not practical to store it entirely in a table within the MMU! However the page table is stored, the virtual address is divided into two fixed size fields, where the most significant field gives the page number in the address space, and the least significant gives the word number in that page (or byte-in-page, for byte addressable memory). The physical address is similarly divided into a frame-number field and a word-in-frame field. The translation function, done by a lookup in the page table, only deals with translating the page number to a frame number; the least significant bits of the address are not changed.
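
    In rough C terms, and assuming for illustration 4096-byte pages (12 bits of word-in-page) and a page_table[] array that simply maps page numbers to frame numbers, the translation looks like this:

        #include <stdint.h>

        #define PAGE_BITS   12                    /* illustrative: 4096-byte pages */
        #define PAGE_SIZE   (1u << PAGE_BITS)
        #define PAGE_MASK   (PAGE_SIZE - 1)

        extern uint32_t page_table[];             /* assumed: frame number per page */

        uint32_t translate( uint32_t virtual_address ) {
            uint32_t page   = virtual_address >> PAGE_BITS;   /* high field */
            uint32_t offset = virtual_address &  PAGE_MASK;   /* low field, unchanged */
            uint32_t frame  = page_table[ page ];
            return (frame << PAGE_BITS) | offset;
        }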

    The Contents of a Page Table Entry

    The format of one entry in the page table will typically be something similar to the following:

        _ _ _  _______ ______________________
       | _ _ _ |_______|______________________|
           |       |              |
           |       |            Frame -- where is the
           |       |                     page stored.
           |     Access
           |     Control ------- what actions are allowed,
           |                     by hardware on this page:
         Soft                         || invalid
         fields -- used by software   || read-only
                   to record other    || read-write
       (optional)  attributes of the  || execute-only
                   page.              || dirty
     
     
    Some hardware allows extra bits in each page-table entry where those bits have no hardware function; these bits are available for use by software for any purpose the software may have, so they are called the soft fields of the page table entry. If the software needs extra space beyond what the hardware may have provided, it is always possible to allocate this in a parallel array unknown to the MMU hardware but also indexed by the page number.
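
    One way to picture such an entry in C is with bit-fields; the field widths below are purely illustrative and do not correspond to any particular MMU:

        #include <stdint.h>

        struct page_table_entry {
            uint32_t frame  : 20;  /* where the page is stored */
            uint32_t access :  3;  /* invalid / read-only / read-write / ... */
            uint32_t dirty  :  1;  /* set by hardware on a write to the page */
            uint32_t soft   :  8;  /* ignored by hardware, free for the OS to use */
        };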

    Page Table Storage Options

    On systems with a small virtual address space, the page table can be stored entirely in hardware registers within the memory management unit.

    Consider the MODCOMP IV computer, a minicomputer first sold in around 1973. The MODCOMP Classic series of computers descended from the MODCOMP IV remained in production for many years, and in fact, www.modcomp.com still exists and still offers some support for their Classic line of processors. The MODCOMP IV had a 16-bit virtual address, 8 bits giving the page number and 8 bits giving the 16-bit word in page. The address map for this machine therefore occupied 256 internal registers in the MMU. This allowed very fast address translation, at the cost of slow context switching. Loading or storing one address map took as much as 256 memory cycles! The designers of this machine built the MMU with 1K internal registers, so that the MMU could hold 4 complete address maps. As a result, so long as the number of ready processes remained small, it was rarely necessary to load and store maps when context switching between processes. All that was needed was to change a small map-select register.

    If the page table is too large to store in special purpose registers inside the MMU, it can be stored in main memory. In this case, the MMU needs to contain a page-table base register plus a memory data register. Context switching between virtual address spaces can be very fast: all that is needed is a change to the page-table base register. Unfortunately, this has its cost! For each address to be translated, the MMU must read a page table entry from main memory into the internal memory data register, and then use this plus information from the virtual address to compute a physical address. Thus, each memory cycle of the CPU takes two memory cycles, one to translate the address and one to do useful work.

    Despite this overhead, this solution is very common! The key to efficient implementation of this is to add a special cache memory called the translation lookaside buffer. This is a cache memory dedicated to memory address translation and integrated into the memory management unit. So long as the address being translated is in this cache, no extra memory cycles are needed and the MMU operates as quickly as a simple MMU holding the entire page table in internal registers. From a software perspective, context switching is simple, just change the page-table base register in the MMU, but this has a run-time cost because each time this register is changed, the entire contents of the cache are invalidated by hardware.

    The Ferranti Atlas computer, introduced in 1960, was the very first machine to support paged address translation, and the MMU scheme it used was remarkably close to a TLB based scheme, although the MMU on the Atlas was actually implemented in a distributed manner, distributed over all of the physical memory modules of the machine.

    Each entry in the TLB of a typical modern machine has the following form:

    Cache Entry
        _______________________________
       |____________|__________________|
     	| Page         | Page Table
     	| Number       | Entry
     	|              | 
        Subject to     Returned by
        Associative    Successful
        Search         Search
     
    In its simplest form, the associative search involved in searching the cache involves a simple parallel comparison of the page number in each cache entry with the page number issued by the CPU. This can be done very quickly with appropriate hardware. Cache hardware to support fully associative searches is expensive, but the alternatives are the subject of an architecture course and will not be discussed here.

    If the cache size is the same as the number of page frames in main memory, a cache miss signals a page fault, and the software can be put in charge of replacing the cache entry as it replaces the page in the page frame. This is what was done on the Atlas. This approach results in the simplest possible cache hardware, since the hardware need not manage cache misses and replacement. Several modern RISC processors use variations on this simple model, but most MMU designs from the 1970's and early 1980's (including that of the Pentium) rest on hardware cache management.
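
    Functionally, the fully associative search described above is equivalent to the following loop, although real hardware performs all the comparisons in parallel; the entry layout and table size here are assumptions for illustration:

        #include <stdint.h>

        #define TLB_ENTRIES 64                    /* illustrative size */

        struct tlb_entry {
            uint32_t page;                        /* subject to associative search */
            uint32_t pte;                         /* page table entry returned on a hit */
            int      valid;
        };

        static struct tlb_entry tlb[ TLB_ENTRIES ];

        /* returns nonzero and sets *pte on a hit; zero means a TLB miss */
        int tlb_lookup( uint32_t page, uint32_t *pte ) {
            for (int i = 0; i < TLB_ENTRIES; i++) {
                if (tlb[i].valid && tlb[i].page == page) {
                    *pte = tlb[i].pte;
                    return 1;
                }
            }
            return 0;
        }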

    The Page-Fault Trap Service Routine

    The page-fault trap service routine gains control when the MMU detects an attempt to use a page-table entry that is invalid or an attempt to perform an operation that is not permitted by the access-control bits in the page-table entry.

    A trap is very similar to an interrupt. The traditional distinction is that interrupts are caused by asynchronous events, while traps are caused by the actions of the CPU. Other traps include such things as arithmetic exception traps (caused by the values of operands to arithmetic instructions), and bus-fault traps (caused by attempts to access physical addresses to which no physical memory is attached).

    On some machines, there is an additional convention -- When an interrupt is requested, the current instruction is allowed to complete before the PC is saved and the interrupt service routine is entered. When a trap occurs, the current instruction is aborted and the trap service routine is entered directly.

    For it to be possible to serve a page fault, the instruction must be aborted before it has any side effects, or (as was done on the DEC PDP-11) all side effects must be recorded so that they can be undone by the trap service routine prior to any attempt to restart the instruction.

    The exception to the above rule is on machines such as the Intel 80x86 family, where some instructions are designed to be interruptable -- typically, this applies to block copy instructions. These instructions, if interrupted, save their state in CPU registers in such a way that they can be restarted from the point of interruption instead of restarting from the beginning.

    The memory management unit hardware requests a page-fault trap when an untranslatable virtual address is encountered. The response to this trap is entirely under the control of the operating system software, specifically the page fault trap service routine.

    For all interrupt and trap service routines, the hardware and software cooperate to save the program counter and other key registers as control is transferred to the trap service routine. The details of how this is done vary considerably from machine to machine. On completion of the entry to the page-fault trap service routine, the saved PC is the address of the instruction that caused the trap.

    Before return, the trap service routine must adjust the page table so that re-executing the instruction that caused the trap will not cause the same page fault. On return, the state of the CPU is returned exactly as it was prior to executing the instruction that caused the trap. As a result, from the perspective of the program, it will be as if there was no page fault trap!

    The PDP-11, sold from 1970 into the 1980's, had a complex instruction set where some instructions could have several side effects before they computed a memory address that could cause a trap. The memory management unit on this machine included a special register that recorded these side effects (what registers had been incremented or decremented by how much); on entry to the page-fault service routine, the first computation required was to undo these side effects, restoring the CPU state to what it had been before the instruction that caused the trap.

    A single instruction can potentially cause many page faults. Consider a memory to memory move double word instruction on a machine like the PDP-11 or the Motorola M68000. The instruction itself will typically be more than one word long, so the instruction may span a page boundary. Furthermore, both the source and destination addresses may reference double words that span page boundaries. Thus, an attempt to execute this single instruction could cause as many as 6 page faults!

    Clearly, if there are fewer than 6 page frames on the machine, this instruction cannot be executed. In general, the instruction set of the machine must be examined to determine the maximum number of pages a single instruction can reference, and then that number of page frames must be guaranteed.

    Each entry to the page fault trap service routine must bring in one of the pages needed to complete an instruction. It need not bring in all of the required pages, since it is frequently faster to simply retry the instruction again and again so that the CPU can compute the successive addresses and let the MMU check each to see if it causes a fault.

    Paged Memory

    Page Faults

    When a program references a page that the memory management unit is unable to handle, or when the reference requires access rights that are not allowed by the memory management unit, the memory management unit reports a page fault by forcing the CPU to enter a trap service routine, usually a part of the operating system, but there are systems such as Mach where page faults may be handled by application-level code.

    The instruction that was executing at the time of the trap is aborted! On entry to any trap service routine or interrupt service routine, a properly designed CPU will cooperate with the prologue (the introductory sequence of instructions) of the service routine to save the values of all of the registers of the interrupted program, including the PC. In the case of trap handlers, the saved values should be the values that were in effect immediately prior to the instruction that caused the trap.

    Note: On some machines, the values for some saved registers include changes made by the instruction that caused the trap. For example, the program counter may have been incremented as a side effect of the instruction fetch prior to the memory cycle that caused a page fault, or the stack pointer may have been decremented prior to the memory cycle that caused the fault. If such a machine is to be used to support demand-paged virtual memory, there must be a way to determine what registers have been changed so that the prologue on the page fault trap service routine can undo these changes!

    In theory, given the PC and registers prior to the instruction that caused the trap, the trap service routine can simulate the execution of the next instruction to determine all of the memory addresses it might use; for each address, it can check to see which of these might have caused the trap. This is simple enough that on some machines, the memory management units do not save the actual virtual address that was used to cause the trap; in this case, the address must be computed by the prologue to the trap service routine. The software is simplified, however, if the MMU saves the virtual address that caused the trap, thus saving the trap service routine from having to simulate an instruction cycle.

    Here, in outline form, is a generic page fault service routine:

    page fault service routine:
         -- prologue
         save the state prior to the instruction that caused the trap
     
         -- find necessary information about cause of fault
         va = virtual address that was referenced
         op = operation that was attempted (read or write)
     
         -- is this an access violation?
         if map[va.pagenumber].rights = read_only
         and if op = write
             handle as user access violation (bus trap!)
         else
     
             -- find a free page frame
             if there are free page frames
                 pa = free frame address
             else
     
                 -- page replacement policy
                 pa = address of page to replace
                 va' = virtual address of page in that frame
                 write( pa, diskaddress( va' ))
                 update virtual address map, etc to reflect change
             endif
     
             -- bring in the required page
             read( pa, diskaddress( va ))
             update virtual address map, etc to reflect change
     
             -- epilogue
             restore saved state
             return from page fault service routine
         endif
     
    This outline leaves us with two important choices; what page(s) should be copied into main memory from disk, and, assuming that main memory is full, what page(s) should be copied out from main memory to disk in order to make space.

    There is also the question of what disk address should be mapped to each virtual address. Some systems simply set aside a disk device or a partition of a disk device (a virtual disk) to hold the entire range of legal virtual addresses. On these systems, if we assume that the page size equals the size of a disk sector, we can use the page number directly as a sector number on the disk or disk partition. Another approach is to set aside a disk file in the file system for each virtual address space. This makes sense in systems that support multiple virtual address spaces; when this is done, we can use the virtual address of a page as the offset into the file for reading and writing the saved copy of each page.
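
    Under the first arrangement, the diskaddress() function used in the outline above could be as simple as the following sketch, assuming one page per disk sector, 4096-byte pages, and an assumed constant PARTITION_BASE giving the first sector of the paging partition:

        #define PAGE_BITS       12   /* illustrative: 4096-byte pages */
        #define PARTITION_BASE  0    /* assumed first sector of the paging partition */

        /* map a virtual address to the sector holding its page */
        unsigned long diskaddress( unsigned long virtual_address ) {
            unsigned long page = virtual_address >> PAGE_BITS;
            return PARTITION_BASE + page;
        }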

    What page should be copied into main memory?

    When a page fault occurs because a program tried to access a specific virtual address that was not currently mapped into main memory, we must bring that page into memory. If this is the only way pages come into main memory, we say that our virtual memory system uses pure demand paging. That is, pages are only brought into memory on demand.

    If, on the other hand, our system attempts to guess the identity of pages that will be needed in the near future and brings these into main memory along with the pages that were demanded, we say it uses anticipatory paging. That is, the system tries to anticipate pages that will be needed soon.

    A typical anticipation scheme is based on the observation that most instructions are executed sequentially, and many operand references are located in sequential memory locations. Therefore, we can estimate the likelihood that a memory reference to page X will be followed in the near term by a reference to page X+1. If the likelihood is under 1/2, it is obvious we should not bring in page X+1; if the likelihood is sufficiently greater than half, there is a good chance that we will gain by an anticipatory transfer.

    If we study program execution patterns to find the expected size of a function, call this S, we can use this as an estimate of the likelihood that executing location X will predict the execution of locations up to X+S in the near future. More generally, we can study the sequences of operand addresses issued by running programs to get an estimate of the likelihood of references to X+S within the next T time units, for all S and all T, and we can use this as a basis for a sound anticipatory paging policy.

    Of course, not all programs behave typically. Experience on systems that support anticipatory paging shows that it does not always pay off. Furthermore, it is sometimes possible to predict, as you write a program, that the program will behave poorly under an anticipatory paging system, for example, when traversing linked lists that were constructed at random. One way of dealing with this is to allow applications programs to turn off anticipation when the programmer believes it will behave poorly, and turn it back on again afterward.

    What page in main memory should be replaced?

    What page in main memory should be replaced? We refer to this as the question of page replacement policy. In the above outline of a page-fault service routine, this issue was glossed over by asking for the address of the page to replace. There are a number of page replacement policies that have widely accepted names:

    Random page replacement:
    Random replacement requires no intelligence and works moderately well. When a page frame is needed, the random algorithm simply selects any page frame at random, dumps the page from that frame onto disk, and then uses it. If the memory reference pattern of the application program is very irregular, for example, as occurs during some list processing applications, it may even be the best solution.

    FIFO page replacement
    First-in first-out replacement is an obvious policy; here, the page that has been in main memory the longest is the one replaced. For most applications, this turns out to be as bad as random replacement in practice. The fact that a page has been in memory for a long time does not generally imply that it will not be needed soon!

    LIFO page replacement
    Last-in first-out replacement is included in this list only for the sake of completeness. This policy is among the worst possible policies! The page most recently brought into main memory is frequently among those that will be used very soon, and sending it back to disk is almost always a very bad idea.

    LRU page replacement
    The least-recently-used replacement policy is usually the best choice, if appropriate hardware is available to support it. This policy selects for replacement the page in main memory that was least-recently used. That is, the page that has not been used for the longest time. If a page has not been recently used, it is reasonably likely it will not be needed anytime soon. Unfortunately, LRU replacement requires that the MMU perform some fairly complex record keeping, for example, by keeping a constant record of when each page was most recently accessed.

    Clock page replacement
    Clock replacement is almost as good as LRU replacement. This scheme divides the set of pages in main memory into those that have recently been used, and those that are less recently used, and selects for replacement one of the less recently used pages using an algorithm to be discussed later. Where the pure LRU replacement scheme needs several bits per page frame to select the page for replacement, for example, enough bits to store a timestamp, the clock algorithm needs only one bit to distinguish between those pages that have been recently used and those that have not. Therefore, the hardware support for clock replacement is relatively simple, and in fact, special hardware can be entirely dispensed with.

    Belady's optimal page replacement policy
    Belady, working at IBM in the late 1960's, developed an optimal page replacement policy. This policy cannot be used in practice, but because the number of page faults required by it can be worked out retrospectively, it is possible to determine how close the practical policies come to optimality. The optimal policy is simple: replace the page that will not be needed for the longest time in the future!

    LRU Replacement

    Required Hardware

    The best of the practical page replacement policies is to replace the least recently used page, that is, the page which has not been referenced by the CPU for the longest time in the past. This requires special hardware to track every memory access. One way to do this would be to have the memory management unit manage a table of page frames, where each page frame contains a timestamp that is updated with each reference to the page in that frame.

    This nearly doubles the complexity of the memory management unit, but if we move the timestamp to the page table entries, recognizing that this field will only have use for those pages that are currently in frames in main memory, we can simplify the hardware. The resulting page table structure might be as follows:

          A Page Table Entry
            _________________________________
           |__________|______|_______________|
           |   Time   |Rights      Frame     |
           |  Stamp   |                      |
           |          |                      |
           |   added  |   original fields    |
     
     

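    As a rough sketch of the record keeping this implies, the structure below adds a timestamp to each page table entry and scans for the oldest resident page. The field names and the PTE_RESIDENT flag are hypothetical, not those of any real MMU.

      #define PTE_RESIDENT 0x1            /* hypothetical "page is in a frame" flag */

      struct lru_pte {
          unsigned int  frame;            /* frame number, valid only if resident */
          unsigned int  rights;           /* access rights and status flags */
          unsigned long timestamp;        /* updated by the MMU on every reference */
      };

      /* Return the index of the resident page with the oldest timestamp,
         or -1 if no page is resident; this is the LRU victim. */
      int select_lru_victim(struct lru_pte table[], int n) {
          int victim = -1;
          unsigned long oldest = ~0UL;
          for (int i = 0; i < n; i++) {
              if ((table[i].rights & PTE_RESIDENT) && table[i].timestamp < oldest) {
                  oldest = table[i].timestamp;
                  victim = i;
              }
          }
          return victim;
      }
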
    Why Does LRU Replacement Work Well?

    Demand paged virtual memory works, that is, it offers access times comparable to main memory at a price comparable to secondary memory, because programs exhibit what is called a working set. The working set of a program is the set of pages it needs in order to make progress at some instant in time. If enough page frames are available in main memory to store the entire working set of a program, that program will execute almost as quickly as it would if it were executing out of main memory with no paging.

    In practice, most programs exhibit a well defined working set that changes only slowly with time. That is, they exhibit strong locality of reference. The locality property involves two components: In programs that exhibit strong temporal locality, the fact that a memory location has recently been referenced strongly suggests that it will soon be referenced again. In programs that exhibit strong spatial locality, the fact that some memory location has recently been referenced strongly suggests that nearby locations will soon be needed. Sequential execution of programs and sequential processing of arrays lead to spatial locality, while iterative control structures lead to temporal locality for both code and data.

    The LRU policy directly addresses the issue of temporal locality, and the organization of memory into pages that contain more than one consecutive memory location addresses spatial locality. Programs with good locality have well defined working sets, while programs with poor locality have poorly defined working sets.

    Belady's optimal page replacement algorithm is optimal because it examines the future to make page replacement decisions. The LRU policy, in contrast, uses the predictions it can make by assuming that the program has good locality, and usually, these are good predictions. LRU is usually considered to be optimal among practical algorithms, although there are contexts where some anticipatory paging can lead to marginal improvements.

    MULTICS on the GE 600 used true LRU replacement, back in the late 1960's, but few systems since that time have used true LRU. The MULTICS implementation used a special high-speed core memory to implement its cache of all currently available page frames. The following sequential algorithm was used by the address translation unit, implemented in fast hardware:

        I = 0
        Fetch entry(I) into temp
        while I < cache_size - 1
        and temp.page /= virtual_address.page do
            I = I + 1
            Exchange temp with entry(I)
        endloop
        if temp.page = virtual_address.page
            Store temp into entry(0)
        else
            Report a page fault for virtual_address.page
            Report that temp.frame is the least recently used page frame
            Copy the map entry from main memory into cache entry(0)
     

    This keeps the entries in the cache ordered by recency of use. This ordering guarantees that the cache will be fast for references to recently used frames (those that are likely to be frequently used) while it is slow for frames that weren't recently used (those that are unlikely to be frequently used).

    The MULTICS scheme is an elegant historical curiosity, of only limited practical value today because it is predicated on core memory technology.

    MULTICS was the first design for a large-scale timesharing system, developed with federal funds at Project MAC at MIT starting in the mid 1960's. The original development involved a consortium involving MIT, GE and Bell Labs, but Bell Labs dropped out of the project, and Honeywell bought GE's computer business. In the 1970's, MULTICS became one of two standard operating systems offered on Honeywell mainframes.

    MULTICS was originally conceived of as a computer utility. The idea was that such a utility would be operated in much the same way as the telephone and power utilities, but instead of selling power or audio communication services, the computer utility would sell CPU time and disk storage to its customers. Today, this isn't a radical idea -- there are many computer utilities, ranging from national services like AOL to smaller local providers. The concept was a radical one in the mid 1960's, and this idea drove a number of technical developments.

    Clock page replacement

    The clock algorithm allows a good approximation of LRU replacement, although if there are too few available page frames, it degenerates to FIFO replacement. Clock replacement requires one bit per page frame, the mark bit, to record whether the page in that frame has been recently referenced. The mark bit is set by the memory management unit hardware whenever the page in some frame is referenced, and it is reset by the page-fault service routine in the process of deciding what page frame to replace.

    To select a page frame to replace, the page fault service routine software sweeps through the page frames, resetting the referenced bit on each page frame and stopping when it finds a page frame with this bit already reset. The page in this frame is the one that is replaced (copied out to disk to make an empty frame).

    The clock algorithm is usually pictured with a clock face, where each minute on the clock face corresponds to a page frame. The clock hand remains fixed except during the execution of the page fault service routine. When a page fault occurs, the clock hand begins its sweep. The hand stops as soon as it sees an unmarked page frame. The clock hand removes the marks from each page frame it sweeps over.

    Data Structures and Code for the Clock Algorithm

    The Clock Algorithm requires a frame table, that is, a table indexed by physical memory page frame number and containing, for each page frame, the virtual address or page number of the page in that frame, along with at least one status bit, the mark bit. The mark bit can be moved to the page table, at the expense of a level of indirection in each test of the mark bit by software in the replacement algorithm. This is typically done in order to simplify the MMU by having it deal only with the page table.

                              Frame Table
                             ________________
                            |_X__|___________|
                            |____|___________|
       Clock Hand -         |____|___________|
                    \       |____|___________|
                      ----->|____|___________|
                         |  |_X__|___________|
          Sweep Direction|  |_X__|___________|
                         |  |____|___________|
                         V  |_X__|___________|
                            |mark| what page |
                            |bit | is in this|
                            |    | frame?    |
     
     
     	find page to replace:
                loop while frame_table[clock_hand].mark is set
                   reset frame_table[clock_hand].mark
                   clock_hand = (clock_hand + 1) mod table_size
                endloop
                -- now, frame_table[clock_hand].what_page
                -- is the page number to be written back to disk,
                -- and clock_hand is the frame number to use with
                -- the page being brought in from disk.
     

    Clock replacement has been widely used on many computer systems, including the IBM System 370 and its descendants in the IBM mainframe family, later versions of the Multics system, and many others.

    Modified Clock Replacement

    It turns out that the clock algorithm can be modified to work without any special hardware support. The first large-scale use of this modification was in the DEC VAX family of computers introduced in the late 1970's, but many modern RISC systems also take advantage of this to simplify their memory management units.

    As the clock hand sweeps by a page frame, all page table entries referencing that frame are marked as invalid. In fact, this page is still in main memory, but by marking it as invalid, we force the memory management unit to raise a page-fault trap the next time the application references that page. We treat the combination page-in-main-memory but marked-as-invalid in the page table as equivalent to mark-bit-reset in the original clock algorithm, while page-in-main-memory and marked-as-valid in the page table means mark-bit-set.

    When the CPU attempts to reference a page marked as invalid in the page table, there is a page fault. If the page is actually in memory, this is called a soft page fault. The page table entry is simply changed from invalid to valid when this occurs, so it is a very simple fault to handle.

    Of course, we must add at least one new bit to each page table entry to distinguish between hard and soft page faults. This bit is never tested or set by the hardware, but is only used by the page-fault service routine. Therefore, it is one of the soft bits in the page table entry.

    One of the most interesting models for representing the soft bits in the page table uses a complete copy of the access-rights field in the soft bits. The soft rights field gives the real access rights the user has to the page, while the hardware access-rights field gives the rights that the memory management unit will enforce, requesting a page-fault interrupt if they are violated.

           _____________________________
           |______|______|_______________|
           | Soft |Rights      Frame     |
           |Rights|                      |
           |      |                      |
           | soft |   hardware fields    |
           | bits |                      |
     

    The state of the page, from the point of view of the modified clock algorithm, is represented by the relationship between the hardware rights field and the soft rights field:

    Soft Rights = Rights = accessible

    Marked. This page is not a candidate for clock replacement; it has been recently used. Legal references to the page by the application program will take place at full speed. Illegal references will cause page faults that should be interpreted as access rights violations.

    Soft Rights = accessible; Rights = inaccessible

    Not Marked. A candidate for replacement. When there is a page fault involving this page, it is a soft page fault and the only action is to change the hardware rights field to match the soft rights field, changing the state to marked.

    Soft Rights = Rights = inaccessible

    Paged out! Faults involving this page are "hard" -- page replacement and disk I/O are needed to resolve this fault.

    Soft Rights = inaccessible; Rights = accessible

    Strange! This combination should never occur.
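
    The page-fault service routine can classify a fault by comparing the two rights fields. The following C sketch is an illustration of the state table above, not any actual system's code; the field names and the helper routines (handle_hard_fault, signal_access_violation, panic) are assumptions.

      #define NO_ACCESS 0

      struct clock_pte {
          unsigned int soft_rights;   /* the rights the user really has */
          unsigned int hard_rights;   /* the rights the MMU currently enforces */
          unsigned int frame;         /* frame number, if the page is resident */
      };

      extern void handle_hard_fault(struct clock_pte *);
      extern void signal_access_violation(struct clock_pte *);
      extern void panic(const char *);

      void classify_fault(struct clock_pte *e) {
          if (e->hard_rights == e->soft_rights) {
              if (e->soft_rights == NO_ACCESS)
                  handle_hard_fault(e);        /* paged out: replacement and disk I/O */
              else
                  signal_access_violation(e);  /* marked page: a genuine rights violation */
          } else if (e->hard_rights == NO_ACCESS) {
              e->hard_rights = e->soft_rights; /* unmarked page: soft fault, just re-mark it */
          } else {
              /* soft rights below hardware rights: the "should never occur" state */
              panic("inconsistent page table entry");
          }
      }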

    Dirty Pages

    A dirty page is a page that has been modified since it was copied from disk to main memory. Dirty pages must be copied back to disk when they are replaced so that the frames that hold them may be used to hold other pages. Clean pages, those that are not dirty, need not be copied back to disk if the original copy of the page on disk is still available. This can cut the time needed to service some page faults in half.

    Tracking clean and dirty pages requires one bit per page frame; typically, we assume that the hardware sets this bit to mark the page as dirty whenever there is a write operation on that page, and we let the software reset this bit whenever it writes the contents of a page frame to disk. The latter operation is called cleaning the page frame.

    This can be combined with the clock algorithm as follows: As the clock hand sweeps by dirty page frames they are scheduled for cleaning; the actual write operations to disk take place in parallel with computation and are typically done in whatever order is convenient for the disk interface hardware and software. The clock hand only stops its sweep when it encounters a page frame that is both unreferenced and clean.
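
    A sweep combining the mark bit and cleaning might look like the following C sketch. The frame table layout and the schedule_cleaning routine are assumptions for illustration, and a real driver would block for a write-back to finish if the hand swept all the way around without finding a clean, unreferenced frame.

      struct frame {
          int page;    /* which virtual page occupies this frame */
          int mark;    /* set by the MMU on each reference, cleared by the sweep */
          int dirty;   /* set by the MMU on each write, cleared when written back */
      };

      extern void schedule_cleaning(int frame_number);

      static int clock_hand = 0;

      int find_frame_to_replace(struct frame ft[], int nframes) {
          for (;;) {
              struct frame *f = &ft[clock_hand];
              if (!f->mark && !f->dirty)
                  return clock_hand;             /* unreferenced and clean: replace it */
              if (f->dirty)
                  schedule_cleaning(clock_hand); /* write-back proceeds in parallel */
              f->mark = 0;                       /* clear the reference mark in passing */
              clock_hand = (clock_hand + 1) % nframes;
          }
      }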

    This scheme (modified clock replacement) is typical of modern virtual memory operating systems. This scheme requires minimal hardware support and yet delivers performance comparable to LRU replacement.

    If the computer does not include hardware to record the fact that a page is dirty, this may be done by software by marking clean pages as read-only. When a write operation is made to a read-only page, the resulting soft page fault marks the page as read-write, which is interpreted to mean that it is a dirty page.

    Some implementations of such schemes are enhanced with anticipatory paging -- for example, when there is a page fault for page i, they might bring in pages i, i+1 and i+2. This assumes that a program that references page i is highly likely to use the following pages. Since this assumption is not always a good one, such systems usually provide a way to turn off the anticipation logic in the page replacement algorithm. Anticipatory paging works very well for programs that use numerous large arrays and it provides few benefits for programs dominated by complex linked lists jumbled into the heap.

    Exercise: Outline a procedure to be called from the page-fault service routine that will search the set of page frames and select a page to replace. Do this for the simple clock algorithm, the clock algorithm with clean-dirty page management, and each of the above with memory management unit hardware that does not mark used and dirty pages. Before you do this, you may want to solve the following:

    Subsidiary Exercise: What are the states of a page frame under the modified (software-only) clock algorithm with clean-dirty replacement?

    Subsidiary Exercise: The clock algorithm cyclically sweeps through the table of all page frames. This is not the same as the address map of the virtual address space! Describe the relationship between these two data structures.

    A digression into architecture

    Memory management units are fairly simple to design only if the entire page table is held in the MMU and if the MMU does not support the mark and dirty bits needed by the clock algorithm. The latter fact explains why, starting with the DEC VAX architecture, many modern systems, mostly associated with RISC architectures, have omitted hardware support for the mark and dirty bits, forcing the use of software solutions to these problems.

    If the entire page table cannot fit into the MMU, it is common for the MMU to use a cache memory, so that the most recently used page table entries are in the MMU while the entire page table is in main memory. This is quite effective, except that cache design is itself a fairly messy business. As a result, some MMU designs in use today take this one step further; the MMU cache, called the translation lookaside buffer, is simplified so that, if the desired page table entry is not in the TLB, the MMU issues a page fault! Thus, the page-fault service routine is responsible for both MMU TLB management and page replacement in main memory.

    This introduces a new kind of soft page fault, where the desired page is in main memory but the desired page table entry is not in the TLB, so the page fault service routine merely loads the page table entry into the MMU and returns, with no actual paging activity required.

    You can find an example MMU specification illustrating this extremity in the fictional Hawk architecture, described in

    http://www.cs.uiowa.edu/~jones/arch/hawk/overview.html#mmu

    Page Tables and Frame Tables

    Note that many implementations of paged virtual memory rely on two distinct data structures, the page table and the frame table.

    The page table describes the state of each page in the virtual address space. There is one entry in the page table for every addressable page in the virtual address space. This table frequently grows very large for large address spaces.

    The page table is indexed by the virtual address of the page, and each page-table entry has a field that describes where that page is currently stored -- for pages currently not in main memory, this would typically be the disk address where the page is stored; for pages in main memory, this would be the page frame holding the page.

    The frame table is indexed by the physical address of the frame, and the entries describe the state of each page frame in main memory. For those frames currently in use, it describes what pages are in them. There is one entry for each page frame in the physical address space. At any instant, the frame table can typically be derived from the page table, but this derivation is computationally expensive.

    The minimum frame table is just an inverted index, holding the index of the page table entry holding the page currently stored in each page frame. On some systems, more information is stored in the frame table; such information as whether the frame has been recently used and whether the contents of the frame have been changed by the CPU are all as useful if stored in the frame table as they are if stored in the page table.

                    Page table                Frame table
     Page Number     __________    Frame       ___________
          |         |_|________|   number     |_|_________|
          |         |_|________|      |       |_|_________|
          |         |_|________|       ------>|_|_________|
           -------->|_|________|              |_|_________|
                    |_|________|              |_|_________|
                    |_|________|               |     |
                    |_|________|      <--------       ------->
                     |     |            mark        page number
          <----------       -------->
         rights               frame number
     
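    In C, the two tables might be declared along the following lines; the sizes and field names are purely illustrative.

      struct page_entry {
          unsigned int rights;      /* access rights and a valid/resident bit */
          unsigned int where;       /* frame number if resident, else disk address */
      };

      struct frame_entry {
          unsigned int page;        /* which virtual page currently occupies this frame */
          unsigned int mark : 1;    /* recently used? (clock algorithm) */
          unsigned int dirty : 1;   /* modified since it was brought in? */
      };

      struct page_entry  page_table[1 << 20];   /* one entry per virtual page */
      struct frame_entry frame_table[1 << 14];  /* one entry per physical page frame */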

    Address translation problems

    Recall that the easiest to understand virtual to physical address translation hardware holds the entire page table in the memory management unit:

                      Word in page
        Virt addr  |----------------------->| Phys Addr
        ---------->|  Page   ______  Frame  |--------->
                   |---     |      |   ---->|
                       |    | Page |  |
                       |    |Table |  |
                        --->|      |  |
             address        |______|  |
             invalid         |  |     |
           <-----------------    -----
     
    For large page tables, this is not implementable. For example, on the Intel 80x86 (for x > 2), the virtual address is 32 bits, of which 12 bits specify byte-in-page. Thus, the page table must contain a million entries. The entries themselves turn out to be 32 bits each, so this approach to implementing the memory management unit would require that the memory management unit contain 4 megabytes of special purpose memory for the page table, and a context switch from one virtual address space to another would require loading and storing this entire special memory area!

    The original Atlas virtual memory system used an alternative approach to translation. The address translation hardware held a copy of the frame table. When a virtual address was presented to the hardware, it compared the desired page number with the page number stored in each frame table entry. If there was no match, a page fault was reported; if there was a match, the index of the frame table entry holding the match was the frame number. This comparison was done using fast parallel hardware, in what we would now call an associative memory, so the time taken for the search did not depend on the number of page frames in the system.

    The Atlas scheme only works if the number of frames is itself relatively small. On a modern machine with many megabytes of memory, the number of page frames may itself grow into the tens of thousands, so neither the entire frame table nor the entire page table can be held in the memory management unit registers.

    The usual solution to this problem is to hold both the page table and the frame table in main memory, so only those entries that reference recently used pages reside in registers within the memory management unit. In this case, we say that the memory management unit contains a virtual address translation cache or an address translation lookaside buffer. From the viewpoint of system software, we then view the address translation process as if it was done through an address translation table in main memory.

    The problem of large page tables

    A typical modern machine, whether an Intel Pentium, or an IBM RS6000 has a virtual address space of 4 gigabytes, but the main memory is only measured in megabytes. The typical virtual address map needed to describe such an address space contains over 1 million map entries, and requires on the order of 4 megabytes to store. Users with 4 megabytes of main memory (today, an admittedly small size) would be quite unhappy if the system required the entire memory just to store the address map!

    The solution to this problem is to break up the virtual address map and store most of it in secondary memory. In effect, the map itself is stored in virtual memory, and only those pages of the address map that contain map entries for recently used pages occupy page frames in main memory at any given time. Each page of the virtual address map is called a segment of the map, so the data structure describing the virtual address space of the program can be pictured as follows:

                                _ _ _ _
                  Segment Table |_|_|_|_|
                  _______________| | | |________
       Page      |              ___| X          |
      Tables  _ _|_ _       _ _|_ _          _ _|_ _
       for   |_|_|_|_|     |_|_|_|_|        |_|_|_|_|
     Segments_| | | |_     _| | | |_        _| | | |_ 
            |  |   |  |   |  |   |  |      |  |   |  |
     Pages |_||_| |_||_| |_||_| |_||_|    |_||_| |_||_|
     

    The page table for a segment may itself be moved to disk when all pages in that segment have been moved to disk, and once this is done, a reference to any page in that segment requires that the page table for that segment be brought back into main memory first, prior to bringing in the referenced page.

    When such a system is used, virtual addresses are typically broken up as follows:

     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
     |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
     |                   |                   |                       |
            segment         page in segment        byte in page
     
    The segment field specifies what segment of the address map contains the desired map entry, while the page field specifies what entry in that segment of the map should be used. The sizes of the fields shown above are typical of many 32-bit machines. With 12 bits for the byte-in-page field, pages are 4K bytes or 1K words. With 10 bits in the segment and page-in-segment fields, both the root segment table and the page table for each segment have 1K entries, and if each page-table entry occupies 1 word, each segment of the page table occupies exactly one page-frame of main memory.
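
    The field boundaries above translate directly into shifts and masks; a C sketch for the 10/10/12 split might read as follows.

      #define BYTE_BITS 12                  /* 4K bytes per page */
      #define PAGE_BITS 10                  /* 1K pages per segment */

      unsigned int va_segment(unsigned int va) {
          return va >> (PAGE_BITS + BYTE_BITS);                /* top 10 bits */
      }
      unsigned int va_page(unsigned int va) {
          return (va >> BYTE_BITS) & ((1u << PAGE_BITS) - 1);  /* middle 10 bits */
      }
      unsigned int va_byte(unsigned int va) {
          return va & ((1u << BYTE_BITS) - 1);                 /* low 12 bits */
      }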

    The term segment is used here deliberately because, on many machines that use paged-segmented virtual memory, the designers intended that each segment of the page table correspond to a single logical segment of the address space. For example, a program might have a stack segment, a code segment and a heap segment, each composed of from 1 to 1024 (2^10) pages, while other segments might be set aside for any data structures too large to fit in the heap, secondary parts of the program or other purposes. Today, many systems divide the virtual address space into segments that are much larger than the segments of the page table established by the hardware.

    Assuming that the virtual address translation hardware uses a cache (translation lookaside buffer), there are now four different outcomes possible for each address translation operation:

    Hit: The desired page table entry was in the hardware address translation cache. Address translation takes place as expected.

    Miss: The desired page table entry is not in the hardware address translation cache. The address translation unit performs one memory cycle to fetch the page table entry for the desired page. Address translation takes place as expected by the software, but it is slowed by the additional hardware activity.

    Segment Fault: A miss occurs, but the desired page table entry cannot be fetched because the desired segment of the page table is not in main memory. The address translation unit therefore requests a fault, aborting the current instruction and allowing the operating system segment fault trap service routine to bring in the desired segment of page table.

    Page Fault: A miss occurs, the desired page table entry is available, but the page is marked as invalid. The address translation unit requests a fault, aborting the current instruction and allowing the page fault trap service routine to bring in the desired page.

    The Software View

    With a paged segmented virtual address space, the software view of address translation is summarized by the following picture:

                     ___  ____
                  Seg  |  |    | Segment
     Virt Addr  |----->|  |    | Table
     ---------->|---   |  |____|       ____
                |-  | --->|____|----->|    | Page
                  | |     |____|   |  |    | Table
                  | | Page         |  |    |      
                  |  ------------->|  |____|       ____
                  |               --->|____|----->|    | Page
                  |                   |____|   |  |    |
                  | Byte in Page               |  |    |
                   --------------------------->|  |____|
                                 physical addr--->|____|
                                                  |____|
     

    To translate one virtual address to a physical address, first a word must be fetched from the segment table; this holds the address of the page table for that segment. Next, a word must be fetched from the page table for that segment; this holds the frame number of the desired page, from which the physical address may be computed.

    A cache is always included in the address translation hardware so that these extra memory accesses are only required when there is a cache miss. In fact, if the cache entries may be directly manipulated by software, the cache itself can be entirely managed by software, in response to page faults, and the memory management unit need never actually initiate a memory cycle to fetch page table entries into the unit. In this case, all cache misses initiate page faults, but some faults, called soft page faults, can be serviced by merely bringing the indicated page-table-entry into the memory management unit with no need for I/O activity.
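
    In software terms, the two-level walk sketched in the picture can be written roughly as follows; the table declarations, the PTE_VALID flag and the fault routines are hypothetical stand-ins for whatever the real system provides.

      #define PTE_VALID 0x1

      struct pt_entry { unsigned int rights, frame; };

      extern struct pt_entry *segment_table[1 << 10];   /* NULL if the segment's page table is not resident */
      extern unsigned int segment_fault(unsigned int va);
      extern unsigned int page_fault(unsigned int va);

      unsigned int translate(unsigned int va) {
          unsigned int seg  = va >> 22;                  /* top 10 bits */
          unsigned int page = (va >> 12) & 0x3ff;        /* middle 10 bits */
          unsigned int byte = va & 0xfff;                /* low 12 bits */

          struct pt_entry *segment = segment_table[seg];
          if (segment == NULL)
              return segment_fault(va);                  /* page table segment not in memory */
          if (!(segment[page].rights & PTE_VALID))
              return page_fault(va);                     /* the page itself not in memory */
          return (segment[page].frame << 12) | byte;     /* physical address */
      }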

    Address Translation and Protection

    Virtual address translation hardware does more than just translate addresses. It can also serve a central role in protection. A protection domain describes, for one logical resource user, for example, a process or a person, what resources that user should be able to access and how. Each logical resource user typically operates in a different domain.

    When there is one big virtual address space, shared by all users, a change of protection domains can be accomplished by changing the markings on the page table entries. Those table entries describing pages that the new user should not access can be marked inaccessible; those entries describing pages the user should not be able to modify can be marked read-only, and so on.

    Editing address maps can be expensive, particularly if they are large, so it is more common to assign a separate virtual address space to each domain. In this case, all that needs to be done when the domain changes is to change the pointer to the currently active page table. The page table then becomes the primary record of what pages are in each domain and what access each domain allows to each page.

    With one virtual address space per domain, it is common to find that most domains are fairly small; if a full page table occupies a few megabytes, this can lead to a significant waste of resources, so many virtual addressing systems provide some mechanism to eliminate storage of irrelevant parts of the page table. For example, entire segments of the page table may be marked as invalid in the segment table, so that no space must be allocated to the corresponding page tables, or there may be a limit register that describes the size of the page table, or segments of the page table may be variably sized.

    With one virtual address space per domain, sharing of pages between domains is accomplished by replicating the page table entries for the shared pages between the page tables for each domain that shares those pages. When the page table is segmented, shared segments are sometimes used instead of shared pages.

    Locality and the Working Set

    In all cases, virtual memory rests on the fundamental empirical observation that most programs exhibit a property called locality of reference. A program exhibits strong temporal locality if the set of addresses most recently referenced is a strong predictor of the set of addresses that will be referenced soon. The fact that most programs are iterative leads naturally to temporal locality.

    A program exhibits strong spatial locality if most memory references are to locations near other referenced locations. The fact that instructions are usually executed sequentially and the fact that sequential search is common both naturally lead to spatial locality. It is important to remember that locality is an empirical property, and that there are a few programs that exhibit very bad locality.

    As a consequence of the locality of programs, we can talk about the working set of a process or group of processes. This is the set of pages that process or group of processes needs in order to make progress. When a process comes into execution in a demand paged environment, there will typically be a sequence of page faults as its working set is brought into memory. Once the working set is in memory, that process will typically run steadily, with very few page faults; these page faults typically reflect changes in the working set as the program enters different phases of its computation.

    When the number of page frames on a computer is smaller than the number of pages in the working set of the set of ready and running processes on a system, performance is typically very bad. A large fraction of all memory accesses on such a system cause page faults, and with each page fault, some page that will be needed in the near future is forced out to disk, guaranteeing that there will be yet another page fault very soon. A system with this behavior is said to be thrashing.

    The application or mix of applications determines the working set size. For any particular application, a graph of page-faults per second versus number of page frames made available to that application will tend to look like the following:

           |
            |______    Working set size
            |      --_ |
      page  |         -|
     faults |          _
      per   |Thrashing   CPU bound
     second |    or    |-
            |I/O bound | _ 
            |          |  -__
            |          W     -------
          --+------------------------
            |  page frames available
     
    In this graph, the number of page frames in the working set is empirically determined by the location of the transition from normal CPU bound operation to the thrashing state.

    Some applications exhibit a sharp thrashing transition -- these are programs with high locality and a well defined working set. Others exhibit a gentle transition to thrashing -- these are programs with poor locality and an ill defined working set.

    There are two cures for thrashing: First, we can buy more memory in order to enlarge the set of available page frames. Second, if there are multiple processes, we can shrink the working set by stopping some of these processes. If running both processes A and B in parallel causes the system to thrash, so long as A and B are independent, we are better off finishing A before we start B (or vice versa). Many Unix systems include a process called the swapper that monitors the system's performance and stops low priority processes when it detects thrashing, allowing the page replacement algorithm to reclaim the page frames belonging to those processes and use them to complete the higher priority processes.

    Input/Output

    Review

    This lecture focuses on a review of three different aspects of input output:

    • User level I/O requests
    • Virtual devices -- the abstractions provided by the system
    • I/O drivers -- the implementation of those abstractions
    The term virtual device is not standard, but it is realistic! The device interface presented to applications running under an operating system is usually quite different from the actual device interface, at least as different as the virtual memory abstraction is from the actual physical memory resources, and therefore, it seems quite reasonable to use the adjective virtual to distinguish the view of the device supported by the system from the actual resources provided by the device.

    The Physical Device Interfaces

    The physical device interface of a real system typically consists of a number of device registers. For example, on a typical IBM PC compatible, parallel port zero is presented to the software by the hardware as follows:

                    __ __ __ __ __ __ __ __ 
       0378 data    |D7|D6|D5|D4|D3|D2|D1|D0|
                    |__|__|__|__|__|__|__|__|
                     __ __ __ __ __ __ __ __ 
       0379 status  |BS|AK|PE|OL|ER|x |x |x |
                    |__|__|__|__|__|__|__|__|
     	BS = 1 when printer not busy
     	AK = 1 when data transfer in progress
     	PE = 1 when printer is out of paper
     	OL = 1 when printer is on line
     	ER = 1 when there is no error
                     __ __ __ __ __ __ __ __ 
       037A control |x |x |x |IE|SL|IN|LF|ST|
                    |__|__|__|__|__|__|__|__|
             IE = 1 to enable printer interrupt
             SL = 1 to select printer
             IN = 1 for normal operation (zero causes printer reset)
             LF = 1 to force printer to advance paper
             ST = 1 to transfer data into printer
     

    The protocol for printing one character using this interface was established by Centronics Corp for their line printers in the early 1970's, significantly before the IBM PC came to market. To print one character, the software must load the character in the parallel port data register and then send out a positive pulse on the ST (strobe) line. The printer acknowledges the strobe pulse with a positive pulse on the AK (acknowledge) line and a negative pulse on the BS (busy) line. The contents of the data register must remain valid until the end of the AK pulse, and no new transfer is legal until the end of the BS pulse. Usually, these two pulses will be coincident.

    If interrupts are enabled (IE = 1), the printer interface will request an interrupt when AK (acknowledge) goes from 1 to zero. The particular interrupt requested may depend on the settings of jumpers or option switches on the parallel port interface, but it is normally IRQ7, the lowest priority hardware interrupt supported on the PC family of machines.

    The SL, IN, LF and ST lines from the control register correspond directly to corresponding output pins in the parallel port connector, and all of the status lines correspond to input pins in the parallel port connector. Some of these lines are inverted, so the voltage levels corresponding to 1 and 0 in the above documentation aren't always as expected. Aside from the IE (interrupt enable) bit and the interrupt logic attached to AK (acknowledge), the parallel port hardware makes no assumptions about the use made of the control, status and data lines. The meanings of those lines are established by the hardware connected to the printer port.
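
    As an illustration, polled output of one character through this interface might look like the following C sketch, using the x86 port I/O routines inb()/outb() from <sys/io.h> (which require ioperm() permission on Linux); the bit masks follow the register description above.

      #include <sys/io.h>

      #define LPT_DATA    0x378
      #define LPT_STATUS  0x379
      #define LPT_CONTROL 0x37A

      #define STATUS_BS   0x80     /* 1 when the printer is not busy */
      #define CTRL_ST     0x01     /* strobe */
      #define CTRL_IN     0x04     /* 1 for normal operation */
      #define CTRL_SL     0x08     /* select printer */

      void lpt_putc(unsigned char c) {
          while (!(inb(LPT_STATUS) & STATUS_BS))
              ;                                            /* wait until the printer is not busy */
          outb(c, LPT_DATA);                               /* present the data */
          outb(CTRL_SL | CTRL_IN | CTRL_ST, LPT_CONTROL);  /* raise ST: strobe pulse begins */
          outb(CTRL_SL | CTRL_IN, LPT_CONTROL);            /* drop ST: strobe pulse ends */
      }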

    The PC parallel printer port interface is typical of the interfaces to low performance devices where each byte of input or output is transferred via explicit execution of instructions executed by the CPU. High performance devices frequently use direct-memory access interfaces. The simplest direct memory access interface must have at least the following device registers:

                    _________________________________
        DMA address |                                 |
                    |_________________________________|
                     _________________________________
        DMA count   |                                 |
                    |_________________________________|
                     ________ __ __ ________
            control |        |IE|  |  COM   |
                    |________|__|__|________|
             IE     interrupt enable on DMA transfer complete
             COM    command field
                     ________ __ ___________ 
            status  |        |DN|           |
                    |________|__|___________|
     	DN     indication of transfer complete
     

    The direct-memory access processor idles until the program running on the CPU loads the control register with a command requesting an input or output transfer. On receipt of this command, the DMA processor repeatedly transfers data between memory and the device, in a direction indicated by the command. Each transfer addresses the memory location pointed to by the DMA address register, and after each transfer, the DMA address register is incremented by the size of the transfer and the DMA count register is decremented, typically also by the size. Some direct memory access devices transfer one byte at a time, while others transfer larger blocks, up to the word size of memory. When the DMA count register reaches zero, the DMA processor sets the done bit in the status register; if the interrupt enable bit is set, this also results in an interrupt request to the CPU.

    The interface described above would be sufficient to support direct memory access transfers to simple devices such as the parallel port, but some devices require other device control registers. Consider, for example, a disk drive. Typical disks allow random access to any sector of any track of any cylinder of the disk, and we can speak of the specification of the sector, track and cylinder as the address, on the disk itself, of the data we are referencing. Before a transfer to or from disk, therefore, we must somehow tell the disk drive what disk address to use. More generally, for random-access devices, we speak of the device address and there must be a device address register somewhere in the interface for each such device:

                    _________________________________
     device address |                                 |
                    |_________________________________|
     

    Note that the device address register may have multiple fields, for example, one for the sector number, one for the track number and one for the cylinder number on disk. When these cannot all be packed into one register, the device may have separate registers for these fields.

    Prior to any transfer to or from a random access device, the program running on the CPU must load the device address register appropriately. Floppy disks are low performance devices, so the data transfer to and from a floppy disk is frequently done without benefit of direct memory access mechanisms, but direct memory access is required when high-speed hard drives are used. Therefore, the software controlling a typical disk interface must generally load the device address register, the DMA address and count registers and finally the device command register in order to initiate the reading or writing of one sector of data.
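
    The sequence just described might be coded as in the following sketch, where the register layout is the hypothetical one pictured above, mapped into memory as a volatile structure; a real controller would also need error checking and a physical (not virtual) buffer address.

      struct disk_regs {
          volatile unsigned int device_address;  /* sector/track/cylinder, packed */
          volatile unsigned int dma_address;     /* buffer address in main memory */
          volatile unsigned int dma_count;       /* number of bytes to transfer */
          volatile unsigned int control;         /* command and interrupt enable */
          volatile unsigned int status;          /* done and error indications */
      };

      #define CMD_READ 0x1
      #define CTL_IE   0x100    /* interrupt when the transfer completes */

      void start_read(struct disk_regs *d, unsigned int sector,
                      unsigned long buffer_phys, unsigned int bytes) {
          d->device_address = sector;                      /* where on the device */
          d->dma_address    = (unsigned int) buffer_phys;  /* where in memory */
          d->dma_count      = bytes;                       /* how much to move */
          d->control        = CMD_READ | CTL_IE;           /* start the transfer */
      }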

    The Stream Abstraction

    The stream is a typical (but by no means universal) input/output abstraction. Typical stream operations are get to read a character from a stream and put to write a character to a stream. In the C standard library, for example, the fgetc(s) function returns the next character read from the stream s, and the fputc(c,s) function writes the character c to the stream s.

    Whether these are packaged in procedural or object oriented form, these functions are usually hidden under layers of middleware that allow reading and writing higher-level data types such as character strings or integers. We should consider the conversion routines between these higher level types and the character level stream to be attributes of those data types, so we will not discuss such conversions here.

    Streams are abstractions in the sense that they don't necessarily map directly to the actual input/output operations on any particular physical device. Some devices are inherently stream oriented, for example, the simple parallel port discussed above, as well as asynchronous serial communications lines and keyboards, while others are not, for example, disks, high performance network interfaces and graphics display devices.

    The UNIX stream abstraction is somewhat more complex than the simple put and get operations suggested above. In UNIX, the operations are:

    	len = read( f, buf, buflen )
     	len = write( f, buf, buflen )
     

    In both of these f names the operating-system level stream to be read or written, and data is transferred to or from the buffer buf. This buffer is of size buflen; on output, this many characters will be written unless there is a problem; on input, this many characters will be read unless some other condition terminates input sooner. Both of these return len, the number of characters actually read or written.

    Note that read(f,&ch,1) is therefore equivalent, in a simpleminded way, to ch=fgetc(s). There are two differences worth noting: First, the parameter f to read is a file descriptor, the integer index of an open file in the open file table for the process, while the parameter s to fgetc is a pointer to a stream object (confusingly enough, this is of type FILE*). Second, each call to read involves a relatively expensive context switch across the user/kernel protection boundary, while the call to fgetc is a simple function call. If the time taken by a function call is too much, the standard C include files also provide getc, a macro that expands as inline code to extract a character from the stream. Typical C stream implementations include a stream buffer with a capacity of around 4K bytes, and only when this is empty does fgetc call read to refill the buffer.
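
    The buffering that getc and fgetc rely on can be sketched in a few lines of C; the 4K buffer size and the names below are illustrative, not the actual library internals.

      #include <unistd.h>

      #define STREAM_BUFSIZE 4096

      static char stream_buf[STREAM_BUFSIZE];
      static int stream_pos = 0, stream_count = 0;

      /* Return the next character from file descriptor fd, calling read()
         only when the private buffer has been exhausted; -1 means end of file. */
      int buffered_getc(int fd) {
          if (stream_pos >= stream_count) {
              stream_count = (int) read(fd, stream_buf, STREAM_BUFSIZE);
              stream_pos = 0;
              if (stream_count <= 0)
                  return -1;
          }
          return (unsigned char) stream_buf[stream_pos++];
      }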

    Also note that there are undefined and device dependent characteristics associated with kernel-level Unix files: For example, in reading a stream attached to the keyboard, it is possible to configure the system to force read to terminate when the user types some key (typically return), or it is possible to configure it so that read terminates after a timeout if the user types nothing.

    It is important to emphasize that both the Unix file and the C stream abstractions are conceptually objects with methods to operate on them. The fact that the UNIX system call interface and the C standard library interfaces are procedural can hide this, but logically, the operations on an open UNIX file should be thought of as references to methods of a file object. Thus, for example, it is constructive to think of read(f,buf,len) as entirely equivalent to f.read(buf,len).

    Block Oriented Abstractions

    A typical abstraction used for input/output supports block-oriented random access to input/output devices. In this case, typical primitives are:

    	Read( device, memory_buffer, device_address )
     	Write( device, memory_buffer, device_address )
     
    Here, the buffer is a block of consecutive memory locations in the caller's address space, and the device address names a block of consecutive locations in the random-access I/O abstraction, perhaps a relative address in a disk drive or disk file. The block size in this scheme may be fixed system wide, fixed by the particular device, or passed as an explicit parameter.

    This abstraction maps naturally to disk devices in the case where the block size is one sector and the device_address is the sequential sector number on the device. It maps poorly to almost all other real devices, but disks are so important that it is frequently used.

    The file abstraction of the Unix kernel is a generally successful attempt to generalize streams to include both sequential and random-access block I/O. The operations are:

    	len = read( f, buffer, length )
     	len = write( f, buffer, length )
       	lseek( f, offset, base )
     
    The lseek() operation is used to position the read-write pointer in the indicated open file. The offset may be relative to the start of the stream, to the end of the stream, or to the current position in the stream, depending on the value given for base. Obviously, seeking in some streams is meaningless (consider seeking relative to the end of the stream of characters being read from a communications line), and seek operations are not always supported (seeking in the stream from the keyboard would require storing the entire history of what was typed, and this is not worth doing), but on those file types where seeking is useful and can be efficiently implemented, for example, on disks and tapes, this interface works quite well.

    With a buffer size of one byte and no seek operations, this interface exactly emulates a pure stream. With a buffer size of one disk sector, with a seek offset that is a multiple of the sector size prior to each read or write, this interface exactly emulates a conventional random-access model. If the sector size is not matched to the buffer size or if seeks are made to device addresses that are not sector aligned, intermediate buffers and possibly extra input-output transfers must be used to reblock the data between the user and the device.
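
    For example, reading sector n of an open file or raw device with this interface takes one seek and one read, as in the following sketch (error handling omitted; the 512-byte sector size is an assumption).

      #include <unistd.h>

      #define SECTOR_SIZE 512

      ssize_t read_sector(int fd, long n, char *buffer) {
          lseek(fd, (off_t) n * SECTOR_SIZE, SEEK_SET);  /* position at the sector boundary */
          return read(fd, buffer, SECTOR_SIZE);          /* transfer exactly one sector */
      }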

    Of course, if the underlying device is incapable of a seek operation, for example, if it is a printer or a keyboard, the seek operation in the user interface is meaningless.

    Class Hierarchy for Streams?

    One of the very first uses of object oriented programming can be found in Christopher Strachey's stream implementation in his OS-6 project in the late 1960's. Strachey, the primary developer of the CPL programming language, naturally chose to write this system in BCPL (Basic CPL, the ancestor of C), but he wrote his code in clearly object oriented form, and today, we would say that he used a class hierarchy something like the following:

                    Write Only Stream
                          write(c)
                             |
                     Read/Write Stream
                           read(c)
                          write(c)
                             |
                    Random-Access Stream
                           read(c)
                          write(c)
                           seek(c)
     

    In fact, the whole notion of having the class hierarchy have any relationship to the implementation hierarchy fails for I/O, because, for some devices, disks for example, random block access is the primitive, while stream access is a higher level abstraction. For other devices, such as asynchronous serial ports, the situation is reversed; stream access is primitive, and blocks must be introduced as a higher level abstraction. Thus, one of the original inspirations for object oriented programming remains a good test of the methodology!

    Implementation of Streams

    A device driver is the component of an operating system that supports input or output operations to some device. Conceptually, this bridges the gap between the abstractions presented to the user and the concrete interface of the hardware itself, but in systems such as Unix, the front-end interface to user code, for example, the read(), write() and lseek() kernel functions of Unix, include standard code that applies to all devices and is therefore above the level of the input/output drivers.

    Stream input/output provides an excellent example of the implementation of an input/output driver. The driver is usually subdivided into two halves, one half deals with the hardware interface and includes the interrupt service routines for the device, and the other half deals with the user interface. It is quite common to include a FIFO queue between the two in order to allow computation and input/output activity to overlap. The following diagram illustrates this overall structure for a parallel port printer handler under an operating system for the IBM PC:

                ||                                ||
          write(f,buf,len)  ---> FIFO  ---> Interrupt Handler
           when f refers         queue             ||
           to the parallel                         ||
                port                               ||
                 ||                                ||
           user  ||         device driver          ||  hardware
                 ||   (in the operating system)    ||
     
    Note that the user interface suggested here is that of Unix, but this is only for the sake of example. What is important is that this user interface routine enqueues the successive characters passed by the user in the queue, waiting for space when the queue is full, but otherwise returning to the user as soon as possible. The interrupt handler, in turn, dequeues one character from the queue for each interrupt and transfers that character to the hardware.
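
    The FIFO in the middle of this picture can be sketched as a simple circular buffer shared between the write-side routine and the interrupt handler; the enable/disable, wait and send_to_printer routines below are hypothetical placeholders, and a real driver would also mask interrupts around the shared counters.

      #define QUEUE_SIZE 256

      static char queue[QUEUE_SIZE];
      static int head = 0, tail = 0, used = 0;

      extern void wait_for_space(void);
      extern void enable_printer_interrupt(void);
      extern void disable_printer_interrupt(void);
      extern void send_to_printer(char c);

      /* Called (indirectly) from write(): enqueue characters, waiting when full. */
      void driver_write(const char *buf, int len) {
          for (int i = 0; i < len; i++) {
              while (used == QUEUE_SIZE)
                  wait_for_space();              /* block until the interrupt side drains */
              queue[tail] = buf[i];
              tail = (tail + 1) % QUEUE_SIZE;
              used++;
          }
          enable_printer_interrupt();            /* make sure output is in progress */
      }

      /* Called once per printer interrupt: transfer one character to the hardware. */
      void printer_interrupt(void) {
          if (used > 0) {
              send_to_printer(queue[head]);
              head = (head + 1) % QUEUE_SIZE;
              used--;
          } else {
              disable_printer_interrupt();       /* queue empty, nothing left to send */
          }
      }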

    For input stream devices, for example, the input driver for a unidirectional input-only communication line interface, we can draw a similar picture:

                ||                              ||
          read(f,buf,len) <--- FIFO  <--- Interrupt Handler
           when f refers                         ||
           to this input                         ||
                port                             ||
                 ||                              ||
           user  ||        device driver         ||  hardware
                 ||  (in the operating system)   ||
     
    The only difference between this example and the output example above is that the queue is now fed by the device and emptied by the user. When the user calls read, characters are dequeued from the queue for return to the user. With each dequeue, the user may be forced to wait until data arrives, but if data has arrived before the user calls read, the user does not wait.

    On first generation operating systems, whether on mainframes of the 1950's or on microcomputers of the 1970's, programs were limited to synchronous I/O, which is to say, neither interrupts nor queues were used in the I/O drivers, and user programs were therefore blocked until each I/O operation was completed. With no overlap of user computation with I/O, any input that arrived while the user program was busy was simply lost unless the hardware contained holding buffers.

    The IBM PC BIOS and most of the old MS-DOS operating systems were the last widespread survivors of this first generation of operating systems. Modern operating systems for PC compatibles generally only use the BIOS routines to load the operating system and get descriptions of the devices for which drivers should be loaded. Once the system is running, the drivers in the system typically work directly with the hardware device interfaces.

    Echoing of Keyboard Input

    Keyboard input is somewhat more complex than other input devices because users have come to expect everything they type to be echoed on some kind of display. Teletype terminals, based on hardware dating from the 1920's, always offered two operating modes, full duplex and half duplex. In full duplex mode, everything typed on the keyboard was simply transmitted over the asynchronous line, and the printer printed everything that was received. In effect, the Teletype acted as two independent devices, one input only, and one output only. In full duplex mode, it was up to the remote system to echo keypresses so that everything typed would appear on the printer. In half duplex mode, the printer was connected directly to the keyboard; everything typed was therefore printed locally as it was transmitted to the remote system. Half duplex operation simplified the design of modems and it reduced the complexity of remote software, but it was less flexible.

    By 1970, most interactive computer use was done using full duplex Teletypes as terminals, but there were still several open options for echoing what the user typed. The responsibility for echoing could be left to the application, or it could be handled by the system, and if handled by the system, it could be done in several ways. These are illustrated below:

               ||                              ||
               Read <---  FIFO <---  Interrupt Handler
     	   ||   |   ___________/    |      ||
                ||  A|  /     B        FIFO C   ||
     	   ||   | |                 |      ||
     	   ||   V V                 V      ||
               Write ---> FIFO  ---> Interrupt Handler
                ||                              ||
           user ||        device driver         ||  hardware
     

    In path A above, the read routine enqueues each character to be echoed as it dequeues that character from the input queue. To the interactive user, this is indistinguishable from echoing done by the application itself. As a result, input typed while the application is in the midst of generating output will not be mixed with that output but will instead wait in the input FIFO queue until it is read by the application.

    In path B above, the input interrupt routine handles echoing. As a result, characters are echoed soon after they are typed, independent of when they are read by the application. Because they are enqueued in the same FIFO as other output, echoing may not be immediate when other output is in that queue. Most Unix interactive I/O drivers use this model.

    In path C above, an additional (short) queue is introduced. The output interrupt driver must check both queues, with the new queue of characters to be echoed taking priority over the normal output queue. This leads to the fastest possible echoing of input, but in addition to allowing input to be mixed arbitrarily with output, it also makes the output driver more complex.

    On Unix, an application connected to an interactive terminal may select the echoing mode using some of the less common system calls, stty() and ioctl(). These calls allow nonstandard and device dependent options to be set, and the set of options they allow for use with the terminal I/O driver is huge, including the ability to turn echoing on and off.

    Block I/O Implementation

    As with stream I/O, there are two interfaces that concern us, and the system provides a mapping between these, as well as providing for buffering to improve performance.

                ||                                   ||
                Write             Queue               ||
                 ||  or      --->  of   --->   Interrupt Handler
                Read            Requests              ||
                 ||                                   ||
           user  ||           device driver           ||  hardware
                 ||     (in the operating system)     ||
     

    Read and Write operate on the same queue, placing requests for block I/O in it, instead of data. Each time the device issues an interrupt, it signals that it is ready to begin a new transfer, at which point the driver removes an entry from the queue and gives it to the device, starting the next transfer.

    Each entry in the queue contains at least the following fields:

    • The device address -- what disk sector, or the equivalent for other devices.
    • The buffer address -- where in main memory is the buffer.
    • The transfer direction -- read from disk or write to disk.

    In a Unix-like system, the user interface assumes that the user wishes to wait for the I/O operation to complete. Therefore the read or write routine can only return to the user when the I/O transfer is actually done. This happens after the I/O request is dequeued by the interrupt service routine, when the interrupt comes in indicating the completion of the last transfer from the buffer. Therefore, the queue entry must contain the synchronization information needed to return control to the user.

    Exercise: Consider the alternatives for solving this problem. You could put a pointer to the user process itself in the queue entry, so that the interrupt handler puts the process back in the ready list when it finishes the operation. Alternatively, you could put a flag in the queue entry; with this scheme, the user can test the flag to see if the I/O is done. Finally, you could put a semaphore in the queue entry; when the I/O is complete, the interrupt service routine signals this semaphore, and when the user wants to see if it is complete, the user can test this semaphore. Which of these schemes is easiest to use, which is easiest to implement, and which maintains the best isolation between the scheduler and the I/O code?

    The simple structure suggested above assumes that each buffer provided by the user holds exactly one block of data to be transferred to or from disk. Most devices have a fixed hardware-defined block size, so this requires that the user code be written in terms of this block size. The Unix model allows random access to an arbitrary byte in the stream, and it allows user buffers to be of any size. Some Unix device drivers can do I/O directly to or from the user buffer if the user happens to request exactly one disk block, but all Unix drivers allow non-aligned I/O transfers, where the user buffer contains more or less than one block and begins at any byte position. This is handled by an additional software layer between the interface described here and the user-level interface to the kernel.

    Exercise: Design an appropriate intermediate layer allowing seeking to any byte in a file and reading or writing any number of bytes starting at that byte, assuming that all file I/O must be done in 512 byte blocks that are aligned on 512 byte boundaries. Is your solution efficient?

    Input/Output Latency

    For most input/output devices, we must recognize that some of the time taken to transfer data to or from the device is spent waiting. For example, with disk devices, we must move the heads to the appropriate track and then wait for the disk to rotate until the desired data sector is under the heads. During the time the heads are moving and while we wait for the right sector, no input/output is taking place.

    In general, we use the term latency to describe the time between the initiation of an operation and the actual start of the operation. Latency is a general English word, so for example, we can speak of a latent infection during the time between the actual contact with some germ and the first symptoms of that infection.

    For disks, CD-ROM devices and similar rotating memory devices, we use the term rotational latency to refer to the delay caused by the rotation of the storage medium. For example, if a disk rotates at 100 revolutions per second, we might have to wait up to 1/100 of a second before a particular bit of data on that disk is rotated under the heads. It is worth noting that we use the term rotational latency to refer not only to delays caused by physical rotation, but also to delays caused by logical rotation. Thus, we can speak of the rotational latency of a shift-register memory, where nothing physical rotates, but the data circulates around a shift register as if the data was on a rotating belt.

    We also speak of head positioning latency on devices where there is only one head that may need to be moved to the appropriate storage track before data on that track can be read or written. Disks have been built with one head permanently mounted above each track, but most disks have only one head per recording surface, and this head must be moved to the appropriate track prior to any use of that track.

    Rotational and head positioning latency are both examples of mechanical latency (unless we are speaking of the rotational latency of a shift-register memory device). Some devices and device interfaces also have latency characteristics that arise from electronic or logical requirements. We may speak, for example, of the head selection latency on a multi-head disk drive, although this is usually negligible. Some network protocols lead to a delay before a machine is allowed to send a message; we can describe these as network protocol latency.

    Disk Input/Output

    Disks

    Rotating magnetic memory devices date back to the early 1950's. The first such devices were drums, with one head fixed over every track. The drum on ILLIAC I, for example, had 64 tracks, with an 8 by 8 array of heads. (This drum is still owned by the U of Illinois computer science department).

    In the late 1950's and early 1960's, workers at IBM figured out how to make a record/playback head "fly" over the flat surface of a spinning disk. The result was a new technology, the moving head disk, where one head was positioned over each recording surface on the disk spindle and moved in or out to the appropriate track.

    For large-capacity, high performance disks, whether in 1965 or 2000, a number of disk platters are rigidly attached to each disk spindle. The number of platters on one spindle has varied greatly over the years. IBM's original disk drive had over 20 platters, while it was common to see disks with 12 platters on a common spindle in the early 1970's. Usually, both sides of each platter are used for data storage; today's larger capacity disks typically have between 4 and 8 recording surfaces.

    When there are multiple recording surfaces, there is typically only one head per surface, with all heads sharing a common head-positioning mechanism, as illustrated below:

         spindle
            ||  v------||    __________
       =====[]=====    ||___|          |
            ||  ^------||   |   head   |
            ||  v------||___|positioner|
       =====[]=====    ||   |__________|
          __||__^------||
         |      |
         | motor|
         |______|
     

    It's fun to note that the tiny disks used in today's laptop computers frequently have the motor literally inside the hub of the disk, while the disk drives of the 1960's had motors like those found in washing machines, with a belt-drive from motor to disk spindle.

    Moving head disks did not displace older technologies immediately. Drums continued to be made until the mid 1960's, and fixed head disks, disks with one head fixed over every track, persisted into the 1970's. Drums and fixed-head disks were particularly popular as paging and swapping devices because their lower latency led to reduced page-fault service times. At the other extreme, moving head disks have been made with more than one head per recording surface, and they have been made with more than one head positioner per disk drive. Such eccentric designs are uncommon and will not be discussed here!

    The first production moving-head disks with one head per surface date from the very early 1960's. The IBM 1401 computer, introduced in 1961, came with the IBM 1505 disk drive (the combination was sold under the name RAMAC -- The Random Access Method of Accounting Control, named after its initial business application). This disk drive had 25 platters, 200 tracks per surface, 5 sectors per track and 200 6-bit bytes per sector. 5 and 10 platter disks were common in the late 1960's and early 1970's, with platters about 14 inches in diameter. Some early disks used hydraulic head positioning mechanisms, but the dominant head positioning technology for high performance disks from the late 1960's to the present has been the voice-coil servo drive.

    Reynold B. Johnson, 1906-1998, led the IBM lab in San Jose that developed the RAMAC; he also developed one of the first mark-sense readers and the 1/2-inch video-tape. (Professor Lindquist of the University of Iowa developed another early mark-sense reader to grade his college entrance exam, the ACT.)

    While the speed and capacity of rotating storage media have increased immensely since the early days of computing, and sizes have fallen from large free-standing cabinets to palm-sized enclosures, the basic physics has remained unchanged, and thus, the basic issues with which the software must deal are the same now as they were in the early 1950's.

    Disk Latency

    The rotational speed of all high performance disks is fixed; typical disks of the early 1970's ran at 20 revolutions per second, while modern high performance disks may rotate much faster. The stored kinetic energy in the rotating disk is sufficient that it may take a few seconds to bring it up to speed on power-up, and any speed change after that is prohibitively expensive. This is as true of today's miniature disk drives with flea-sized motors as it was in the days when the disk assembly weighed 20 pounds and was spun by a washing-machine motor.

    Some floppy disk drives, notably the old Apple Macintosh 800K drives, used a variable speed scheme to pack a bit more data onto a disk than they could pack onto a fixed speed disk. This worked only because the floppy disk itself is very lightweight and rotates relatively slowly; most of the motor power is needed to overcome friction and it takes only a small amount of power to accelerate or decelerate the disk. It is easier to use a variable speed clock to solve this problem. While variable speed clocks are easy to build at the low data rates typical of floppy disks, variable speed clocking for high performance hard drives is difficult and rare. The problem is that the highest performance drives move data to and from the disk at speeds close to the limit of the digital logic itself.

    Magnetic disk recording surfaces are divided into numerous circular tracks; typically, each track is subdivided into sectors, and the basic operation on the disk is to read or write one sector of data. On a disk or drum with one head fixed over each track, the time delay between the issue of a read or write and the completion of that operation can be divided into two components, the rotational latency time and the transfer time.

    The rotational latency time is the time it takes for the rotation of the disk to bring the start of the desired sector into position under the head. This may vary from one full revolution, if the start of the desired sector had just passed under the head at the time the transfer was requested, to nothing, if the start of the desired sector is just about to pass under the head at the time the transfer is requested.

    The transfer time is the time it takes for the rotation of the disk to rotate one sector past the head.

    With the advent of moving head disk technology in the early 1960's, a new component was added to the time taken for a single read or write operation, the head positioning latency time or seek time. (The latter term derives from the common use of the verb `to seek' to describe the act of moving the heads to the desired track.) This is the time it takes for the disk head positioning mechanism to move the heads to the desired track; it may vary from nothing, in the case where the head was already on the desired track, to an appreciable fraction of a second.

    Typical head positioning hardware smoothly accelerates the head from the starting track to the end track, and a bit of simple physics leads to the result that such hardware will give track-to-track head movement times that grow roughly as the square root of the distance moved. That is, if a move of one track takes one time unit, a move of four tracks will take two time units and a move of nine tracks will take three time units.
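
    A minimal C sketch of such a seek-time estimate, using the same abstract time units as the example above:

        #include <math.h>

        /* Seek time in the text's abstract time units: a move of n tracks
           takes sqrt(n) units (1 -> 1, 4 -> 2, 9 -> 3).  Real drives add a
           constant settling time on top of this curve.                    */
        double seek_time(int tracks_moved)
        {
            if (tracks_moved <= 0)
                return 0.0;
            return sqrt((double) tracks_moved);
        }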

    It is worth noting that the standard CD format, being designed for audio recording, uses a single spiral track. Like a magnetic disk, this track is divided into sectors, and random access to the recorded data, whether to play a specific track of an audio CD or to read a random sector of a CD-ROM on a computer, typically involves two steps, first moving the head blindly to the estimated radial distance of the desired sector, and then reading the first sector header encountered in order to see if the move was far enough. As a result, CD-ROM drives have head motion latency and rotational latency not too different from conventional magnetic disks.

    Note on Data Formats

    A typical sector on a disk is organized as follows:

     __________________________________________________________
     |________|______________________________________|__________|
       prolog                  data                     epilog
     

    Typically, the prolog includes synchronizing marks that allow the clock used for reading the sector to be synchronized with the data and that allow the hardware to recognize the start of a disk sector. Following this, the prolog typically includes the full disk address of the sector.

    The disk address is included in the prolog for many reasons. First, it allows verification that the head positioning hardware and disk surface selection logic is working correctly, or at least, that it is working the same way it was when the disk was formatted. Second, on some disks, the head positioning servo actually uses the track or cylinder numbers read off the disk to move the heads in or out to the right track, and finally, the addressing logic of many disk systems relies on the sector number recorded at the start of each sector to determine whether the sector is the one the user wishes to access.

    The end of the prologue is typically a checksum on the prologue. Once the prologue is recorded by the disk formatter, it is usually never written again. When a write operation is requested, the hardware begins by reading and verifying the prologue. Only then does it begin to write the data.

    The epilogue typically consists of a checksum. All read and write operations must process the entire sector, either writing a full sector and then writing the checksum, or reading the full sector and then verifying the checksum. If the disk controller allows partial sector reads or writes, this is an illusion. A partial sector write almost always overwrites the full sector, appending arbitrary data to the user's data in order to achieve this. A partial sector read only reports a fraction of the sector's data to the user, but the entire sector must be read by the controller in order to verify the checksum.

    Formatting a disk

    When a "raw" disk is mounted on a disk drive, it must be formatted. Typically, this involves writing predetermined data in each sector of each track of the disk. The important part of the formatting operation from this perspective is not the data recorded but rather that the formatter puts valid prologs and epilogs on all of the sectors that are written.

    Usually, the disk formatting software also verifies that each sector is readable and writable, gathers the good sectors into a free-space data structure, and creates a file system of some type on the disk. The latter subject will be discussed in more detail later, and in fact, close examination of most formatting programs shows that these two jobs are largely independent.

    When the disk interface is configured for disk formatting, the hardware generally allows the software to write out arbitrary data to fill entire sectors, including the prolog and epilog. This gives the software the option of laying down sectors in arbitrary order on each track of the disk. If they are put down in consecutive ascending order, it may be impossible to read or write consecutive sectors at full speed. This is because, by the time the interrupt marking the end of one read or write has been processed, it may be too late to start work on the next sector without waiting a full revolution.

    To avoid this problem, many systems arrange the sectors on a disk in interleaved order, so that logically consecutive sectors are separated by one or more other sectors. The following example illustrates the interleaving of 16 sectors, using consecutive letters of the alphabet to label logically consecutive sectors.

         A B C D E F G H I J K L M N O P  -- no interleaving
          A I B J C K D L E M F N G O H P  -- 2-way interleaving, almost
          A G L B H M C I N D J O E K P F  -- 3-way interleaving
          A E I M B F J N C G K O D H L P  -- 4-way interleaving, almost
          A N K H E B O L I F C P M J G D  -- 5-way interleaving
     

    The 2-way and 4-way interleaving patterns shown above are imperfect because 2 and 4 evenly divide 16. The 3 and 5-way interleaving patterns are perfect. Interleaving can be done in hardware, or it can be done in software.

    When the physical addresses recorded in the sector prologs on the disk are not interleaved, the software may interleave them. In general, if there are n sectors, numbered consecutively from 0 to n-1 on each track, and if we want m-way interleaving, then so long as n and m are relatively prime (they have no common factors) the software can generate interleaved sector numbers using the formula:

    interleave(i) = (m * i) mod n
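
    A short C sketch of this computation follows; it builds the physical layout of one track for n = 16 and m = 3, values chosen only to match the size of the example above. The letter pattern it prints is a valid 3-way interleaving, though it differs in detail from the hand-constructed table shown earlier.

        #include <stdio.h>

        int main(void)
        {
            const int n = 16;       /* sectors per track (example)      */
            const int m = 3;        /* interleave factor, coprime to n  */
            char layout[16];
            int i, p;

            /* logical sector i is recorded at physical position (m*i) mod n */
            for (i = 0; i < n; i++)
                layout[(m * i) % n] = (char) ('A' + i);

            for (p = 0; p < n; p++) /* print the track in physical order */
                printf("%c ", layout[p]);
            printf("\n");
            return 0;
        }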
     

    The software interface

    To read or write one sector of a disk, the software must:

    1. Specify the address of a buffer in memory.

    2. Specify one of the several disk drives that may be attached to a single controller.

    3. Specify how to position the heads on the drive in question. This is usually called the cylinder portion of the disk address.

    4. Specify which recording head or recording surface is to be used on the selected drive.

    5. Specify which sector of the track selected by the cylinder and surface is to be used.

    6. Specify the data transfer direction.

    In simple disk controllers, each of the above might appear as a separate register in the controller, writable by software. Typically, the actual transfer begins only when the final command, the transfer direction, is issued; it is the act of loading a value into this final register that actually starts the controller in operation. Thus, the loading of a memory address or a field of the disk address has no effect until the software actually specifies a read or write operation, at which point, the values previously loaded are used to initiate a seek on the appropriate drive, await the right sector of the selected track, and then transfer the data.
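
    A sketch of such a register-level interface in C is shown below. The register layout, the controller address and the command encoding are all invented for illustration; the only point being made is that loading the final register is what starts the operation.

        #include <stdint.h>

        /* Hypothetical memory-mapped disk controller registers. */
        struct disk_regs {
            volatile uint32_t buffer_address;  /* main memory buffer address     */
            volatile uint32_t drive;           /* which drive on this controller */
            volatile uint32_t cylinder;        /* head positioning               */
            volatile uint32_t surface;         /* recording surface (head)       */
            volatile uint32_t sector;          /* sector within the track        */
            volatile uint32_t command;         /* 1 = read, 2 = write; writing
                                                  this register starts the transfer */
            volatile uint32_t status;          /* done and error bits            */
        };

        #define DISK ((struct disk_regs *) 0xFFFF0100)  /* made-up address */

        void start_transfer(uint32_t buf, uint32_t drv, uint32_t cyl,
                            uint32_t surf, uint32_t sect, uint32_t cmd)
        {
            DISK->buffer_address = buf;
            DISK->drive    = drv;
            DISK->cylinder = cyl;
            DISK->surface  = surf;
            DISK->sector   = sect;
            DISK->command  = cmd;   /* last: this store actually starts the disk */
        }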

    When a data transfer is completed, a status register typically displays this fact, along with an indication of any errors detected. In addition, the done condition usually results in an interrupt request.

    Added Complexity

    Interface busses such as the SCSI bus separate the responsibility for each operation into two halves, as illustrated below:

               __________            __________   ____
     CPU ----->|   DMA    |          |   DISK   |-|disk|
               |controller|<========>|controller|-|disk|
     Memory <=>|__________|          |__________|-|____|
     

    This design adds significant complexity, and it is noteworthy that such hardware has been common since the 1960s, where the IBM 360 channel architecture serves as the archetypal example. There, the DMA controller was called a channel processor. Today's examples include the SCSI, USB and Firewire controllers; as with the IBM 360, most of these allow multiple disks (and their controllers) to be attached to one such bus.

    Of course, everything is smaller today than in the 1960's. Today's disk controllers are made of a few chips mounted on a single printed circuit board attached to each disk drive, and today's channel processors are just as small, and in many cases they are built into the motherboard of the host computer. In the 1960's, each of these occupied a free-standing cabinet full of electronics.

    Then and now, commands from the CPU to the channel controller, perhaps a SCSI or Firewire controller, direct it to transfer data to or from some memory buffer and direct it to forward certain additional commands to the disk controller. All of the usual I/O operations are possible with such systems, but most such systems also allow many nonsensical operations such as directing the channel controller to read data from the disk to memory, while directing the disk controller to write data from the memory to disk.

    Another complicating factor is introduced by the fact that it is frequently useful to start one disk seeking while another reads or writes data. This is particularly important in large servers. To support this, many disk controllers allow a seek operation to be initiated on some disk without initiating either a read or write. Software that actually uses this feature is common in high performance operating systems and is only slowly becoming common on personal computers as these migrate from single-user operating systems to scaled-back versions of general purpose operating systems.

    Virtual memory adds another layer of complexity. Generally, the user's I/O buffer is not forced to lie entirely within one page of the user's address space, but instead, it may span several pages. While these may be consecutive in the user's virtual address space, they might be anywhere at all in the physical address space. DMA controllers that understand the structure of page tables have been built, but for most simple DMA controllers, the I/O software must first gather data from the fragments of the user's buffer into a physically contiguous buffer, and then do a DMA write from that buffer to disk, and similarly, after a DMA read from disk, the system must scatter the data from the system's buffer to the fragments of the user buffer.

    Many high end DMA controllers support scatter read and gather write operations. For these controllers, there is not a single base/count register pair describing the user's buffer, but rather, a list of base/count pairs describing the fragments of the user's buffer. Usually, this list is in main memory, and the DMA controller has a register that points to the next pair to process when the current pair is completed. Only when the list is exhausted does the transfer stop. In fact, this is an old feature! The IBM 360 channel controllers supported an even more complex variation on this through what were called channel programs.
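
    The descriptor list for such a scatter/gather controller might be sketched as follows in C; the field names and the end-of-list convention are assumptions made for illustration, not any particular controller's format.

        #include <stdint.h>

        /* One physically contiguous fragment of the user's buffer. */
        struct sg_entry {
            uint64_t base;   /* physical address of this fragment         */
            uint32_t count;  /* number of bytes in this fragment          */
            uint32_t last;   /* nonzero marks the final entry in the list */
        };

        /* The controller is assumed to hold a pointer to the current entry,
           advancing to the next entry after count bytes have been moved and
           stopping after the entry marked last.                            */
        static const struct sg_entry example_list[] = {
            { 0x0001F000, 4096, 0 },   /* made-up physical page addresses */
            { 0x00031000, 4096, 0 },
            { 0x0004A000, 1024, 1 },   /* final, partial page             */
        };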

    Some disk controllers allow multiple sectors to be read in sequence, so if the buffer is larger than one sector, it can be entirely read in one disk transfer, so long as the desired sectors are consecutively numbered on disk. Thus, depending on the controller, it is sometimes possible to read or write as much as an entire track or even a cylinder to or from the disk in one transfer.

    Finally, at the other extreme of performance, some modest and low performance disk controllers have been built without the full generality of DMA. Instead, the disk controllers are built with an internal buffer able to hold one sector or one track, and all disk I/O is to or from this buffer. Prior to writing the disk, software running on the CPU must copy the data from the application's memory area to the disk buffer, and after a read, it is up to software to copy from the disk buffer to the user's buffer. The disk buffer in such controllers may or may not be addressable as a normal memory area; in the lowest performance disks, data is transferred between the CPU and this buffer one byte at a time.

    Disk Request Queues

    Typically, assuming a simple DMA disk interface, without the added features discussed above, a modern operating system will maintain a queue of disk I/O requests. Each request is a record of a disk input or output operation that has been requested by a user but not yet carried out by the disk interface. A typical disk I/O queue will contain records with a format such as the following:

         Disk Request Record:
              Buffer Address in main memory.
              Disk Address:
                 cylinder.
                 surface.
                 sector.
              transfer direction
              done semaphore
     

    A user process, or rather, the file system or page fault handler acting on behalf of that user, might pose a disk request as follows:

        low-level user disk I/O request routine
              enqueue( request, disk-queue )
              enable( disk_interrupt )
              P( request.semaphore )  -- wait for I/O complete
              return
     

    The interrupt service routine might operate as follows:

        disk interrupt service routine
              disable( disk_interrupt )
              if request not null then V( request.semaphore )
                 -- the above lets a waiting user continue
              if not empty( disk-queue ) then
                 request := dequeue( disk-queue )
                 buffer-address-register := request.buffer-address
                 disk-address-registers := request.disk-address
                 disk-command := request.transfer-direction
                 enable( disk_interrupt )
                 return
              else
                 request := null
                 return
     

    Disk interrupts signal the completion of the previously requested operation, so the first thing the interrupt service routine does is signal the semaphore on which the process requesting the previous service may have blocked. Having done this, the interrupt service routine continues by starting a new disk transfer, if one is pending in the disk request queue.

    Note that, if the queue is empty, the disk routine returns, leaving the interrupt disabled but doing nothing to make the controller drop its interrupt request. When a user enqueues a new request, the user code enables the interrupt, causing an immediate interrupt and allowing the controller to immediately start the newly requested transfer.

    Exercise: The above pseudocode for the disk interrupt service routine and for enqueueing disk requests assumes that disk I/O is being requested for typical user code. Imagine the case where some disk requests are enqueued by the page-fault service routine. In that case, it may be awkward for the service routine to block on a semaphore (some models of interrupt service prohibit service routines from blocking on anything), so it may be better to have the disk interrupt service routine finish the processing of the page fault. Propose an augmented data structure for the disk request queue and pseudocode for the page-fault handler and disk interrupt service routine to do this.

    Improving Disk Performance, Disk Caches

    One way to improve the performance of a disk system is to use the queue of pending disk requests as a cache. New read requests are only posted to the queue if no write is pending to the same sector, and new write requests cause previously posted write requests addressed to the same sector to be deleted. The following version of the user's code to post a new disk I/O request to the disk request queue describes this:

        low-level user disk I/O request routine
            if request.disk-address matches prior.disk-address
               where prior is a request already in the disk-queue
               if request.direction = in and prior.direction = out
                  copy prior.buffer into request.buffer
                  return
               if request.direction = out and prior.direction = out
                  delete prior from disk-queue
                  V( prior.semaphore )
                  -- prior's write is superseded, so its process may continue;
                  -- fall through and enqueue the new write below
            enqueue( request, disk-queue )
            enable( disk_interrupt )
            P( request.semaphore )
            return
     

    The result of this logic is that there will never be more than one write queued for any disk sector, and if a write is queued for a sector, there will be no reads queued for that sector.

    Exercise: Multiple reads from the same sector can also be combined. Propose additions to the disk request queue data structure and to the pseudocode given here that would do this.

    The use of a disk cache poses risks, whether the cache is based on the request queue, as outlined above, or whether it is based on specifically reserved buffer space in main memory. In either case, there is the possibility that some write may never be recorded to disk! If some software keeps writing some sector, each such write will cancel the write already in the queue, and the result is that no write will ever be done. So long as there are no system failures, this is not a problem, but if there is a power failure, this could cause the loss of very important data. Therefore, it is common for systems that support disk cache schemes to include, in the kernel, a specific operation to force all pending writes to a disk to be completed. In the UNIX system, the sync() kernel call does this.

    Improving Disk Performance, Disk Schedulers

    Disk scheduler software can markedly improve the performance of disk I/O for any disk that serves multiple independent sources of I/O requests. This is typical of disks in large timesharing systems and of disks in network file servers. The key is to note that, if there are multiple requests in the disk I/O queue, the time taken to process those requests depends on the order in which they are processed. This is because the latency time depends on the order in which they are accessed.

    For example, consider a request queue containing 2 requests for operations on track 1 and two requests for operations on track 400. Clearly, if these are done in the order (1, 400, 1, 400) the result will be quite a bit slower than if they are done in the order (1, 1, 400, 400). The former requires three long seeks, while the latter requires only one long seek.

    Thus, if disk requests are scheduled in the appropriate order, disk performance can be markedly improved. It is generally impossible to point to a separate procedure in the operating system and say "this is the disk scheduler". Instead, the disk scheduler is broken between the two ends of the disk request queue, and the queue data structure is itself the structure used to organize the scheduler.

    Typically, the queue for each disk is broken into separate pieces for each cylinder, and the enqueue routine on the user side of the disk request queue is made responsible for enqueueing the request in the appropriate cylinder's queue. The scheduling policy is then the responsibility of the disk interrupt service routine.

    Typical scheduling policies are:

    1. FIFO, the base case, is inherently fair to all users, but it does nothing to improve throughput. (A fun exercise is to quantify fairness and then prove that the FIFO policy is optimal).

    2. Shortest seek time first. After processing all requests for the current cylinder, the next cylinder to be processed is the one with pending requests that is closest to the current cylinder. This policy produces good throughput (the number of I/O requests processed per second) but it is unfair; it can cause some users to wait indefinitely. (The optimality of this method depends on how the seek time varies as a function of the number of cylinders moved. For a fun exercise, figure out the class of functions for which this is optimal.)

    3. The bidirectional elevator algorithm. After processing all requests for the current cylinder, the next cylinder to be processed is the one closest to the current cylinder in a fixed direction, for example, towards the center of the disk. Only if no work is pending in any cylinder in that direction does the preferred direction reverse so that future seeks will be in the opposite direction. This technique offers a good throughput, but it discriminates against disk requests near either extreme of the range of valid cylinder numbers.

    4. The unidirectional elevator algorithm. After processing all requests for the current cylinder, the next cylinder to be processed is the one closest to the current cylinder in a fixed direction, for example, towards the center of the disk. Only if no work is pending in any cylinder in that direction does a seek in the opposite direction take place. In that case, the most distant available cylinder with pending work is selected, and the preferred direction remains unchanged. This technique offers equal service to all cylinders while only slightly reducing the throughput of the bidirectional elevator.

    The elevator algorithms are based on the model used by elevators in tall buildings. The elevator does not visit floors in FIFO order, that is, going from floor to floor in the order those floors were requested by users pushing buttons requesting visits to various levels of the building. Instead, the elevator sweeps systematically up the building, visiting those floors for which it has pending requests and stopping when it reaches the highest requested floor. At that point, it begins a downward sweep, stopping at the lowest requested floor. This is exactly the bidirectional elevator algorithm, with floor numbers substituted for cylinder numbers.

                     Queues for each
                       disk cylinder
     
                         ---------   |
                           00000 |   |
                         ---------   |
                             000 |   |
         Source          ---------   |
           of    ---->      0000 |   |
        requests         --------- /---\  Cylinder
                                     00|  currently
                         --------- \---/  being
                               0 |        served
                         ---------
     

    The conflict between throughput and fairness illustrated by the differences between the unidirectional and bidirectional elevator algorithms is typical of a number of scheduling issues in operating systems. Attempts to optimize against a single goal are fairly straightforward, but there is no solution which is simultaneously optimal against all criteria!

    The following pseudocode implements the elevator algorithm; in this code, the disk request queue has been replaced with an array of queues, one per cylinder:

        low-level user disk I/O request routine
            enqueue( request, disk-queue[ request.cylinder ] )
            enable( disk_interrupt )
            P( request.semaphore )
     
    The interrupt service routine consumes requests from the queues as follows:
        disk interrupt service routine
            disable( disk_interrupt )
            if request not null then V( request.semaphore )
               -- the above lets a waiting user continue
            if empty( current-queue ) then
               -- time to move the elevator to a new floor
               i := current-cylinder
                repeat
                   i := (i + 1) mod cylinders
                until i = current-cylinder or not empty( disk-queue[i] )
                current-queue := disk-queue[i]
                disk-queue[i] := empty
                current-cylinder := i
             end if
             if not empty( current-queue )
                request := dequeue( current-queue )
                buffer-address-register := request.buffer-address
                disk-address-registers := request.disk-address
                disk-command := request.transfer-direction
                enable( disk_interrupt )
                return
             else
                request := null
                return
     

    Note that the disk interrupt service routine uses a separate queue, current-queue to hold the requests for the cylinder it is currently servicing. This prevents late arrivals at that cylinder from being served until the next visit to that cylinder; if this was not done, and if some process continuously maintained a supply of pending disk addresses for one cylinder, no other cylinder would ever be served.

    Another detail to note is that the search for the next cylinder to process need not be a sequential search, as outlined above. If the disk queue array is maintained as a circular sparse array (a linked list structure), so that empty queues are not represented at all, the next queue will always be adjacent to the current one and the computation shown as a search above can be recoded as a simple pointer assignment. (for a pleasant exercise, write the necessary code.)

    Device Independence

    The desire for device independent input/output was one of the original motives behind the development of polymorphic object-oriented programming methodologies. Typically, we ask that each device support some set of input/output operations, for example:

         Read( ch )                  Arbitrary software
          Write( ch )        <----->  that performs actions
          Read( buf, add )            that look like I/O to
          Write( buf, add )           the user, and may even
     			         involve I/O
          Used for disk files,
          windows, network ports,
          and similar abstractions.
     
    The advantages of using a standard software interface for every device, whether disk or terminal or printer or network port have been known since the development of the FORTRAN programming language in the late 1950's. To achieve this, it is necessary to have a user-level device interface that is sufficiently flexible to operate with both random access and sequential devices, supporting both stream and block modes of operation on any device. The low level Unix interface (read, write, seek) is one good example of such an interface.

    This interface may be considered a virtual device interface, with the input output software of the operating system serving the function of virtual to physical input output operation translation.

    The virtual device may also be considered as an object, in the sense used in object-oriented programming. In fact, the first clear description of how objects in this sense can be implemented came from the field of operating systems, in a paper by Christopher Strachey in his OS 6 system from the mid 1960's. In this system, he used the following implementation:

           An open file descriptor
                     ________
      F ----------->|     o--|--------> Read( F, ch )
                    |     o--|--------> Write( F, ch )
                    |________|
      Data specific |        |  Pointers to code
      to this file  |        |  to do the device dependent
      and device    |        |  part of the I/O operation
                    |________|


       Read( F, ch ) <------ the device independent read
       {                     routine called by the user.
          F^.Read( F, ch )
       }                     It simply calls the device
                             specific routine, passing
                             the pointer to the data
                             needed to do the operation.
     

    This was originally implemented in BCPL (the Basic Combined Programming Language), a predecessor of C based on CPL. Strachey invented CPL, which stood either for Christopher's Programming Language, the Combined Programming Language, or the Cambridge Programming Language, depending on who you talk to.

    In systems with no protection, a file variable can be as simple as a pointer to the descriptor for the open file. This descriptor is typically created by the Open system service. If there is protection, the descriptor must be protected from the user program, so F, the file variable, is more complex than just a pointer. On Unix, for example, F is an integer index into a system maintained array of pointers to descriptors (the array is called the file table of the current process).

    In object oriented terms, each of Strachey's open files is an object, an instance of the class file. The class file is amenable to multiple implementations, each requiring different code; such a type is called a polymorphic type, and one requirement of the operating system context is that multiple implementations of the abstraction coexist, so that, for example, a user can have one open file refer to a file on disk, while another represents the keyboard.

    In object oriented programming languages, the same approach described here is commonly used to implement polymorphic objects. The data structure representing the object contains pointers to the action routines that implement the abstract operations on the object.
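
    In C, the same idea can be sketched with a structure of function pointers; the names below are invented for this illustration and are not Strachey's:

        #include <stdio.h>

        /* One open "file": device-dependent data plus pointers to the
           device-dependent code that implements the abstract operations. */
        struct file {
            void *data;
            int  (*read )(struct file *f, char *ch);
            int  (*write)(struct file *f, char ch);
        };

        /* The device-independent routines simply indirect through the descriptor. */
        int file_read (struct file *f, char *ch) { return f->read (f, ch); }
        int file_write(struct file *f, char ch)  { return f->write(f, ch); }

        /* A trivial "device" bound to the C standard input and output streams. */
        static int tty_write(struct file *f, char ch)  { (void) f; return putchar(ch); }
        static int tty_read (struct file *f, char *ch) { (void) f; *ch = (char) getchar(); return 0; }

        int main(void)
        {
            struct file console = { NULL, tty_read, tty_write };
            file_write(&console, 'h');
            file_write(&console, 'i');
            file_write(&console, '\n');
            return 0;
        }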

    File Systems

    An open disk file is an example of a virtual device just as much as a disk or a printer interface is, when presented through a device independent interface. Thus, we can extend our polymorphic class hierarchy for open files to include not only open files which refer directly to physical devices such as printer ports, serial communications lines, and disk drives, but also open files that refer to files on the file system, windows under the window manager, and other virtual I/O devices.

    Therefore, the underlying data structure for an open disk file under a modern file system might resemble that shown in the following diagram:

                    ________
              F --->|     o--|--------> File System 
                    |     o--|--------> Code
                    |________|    
                    |        |              ________
                    |     o--|------------>|     o--|------> Disk I/O
                    |        |             |     o--|------> Code
                    |________|             |________|
                  Descriptor of            |        |
                  open disk file           |        | Data needed to
                                           |        | access the disk
                                           |________|
                                          Descriptor of
                                          the disk itself.
     

            Code to read and write an open disk file
     
             * May convert block size to that of disk.
             * Translates address of block in file
               to disk address on the real disk.
     
             The latter is Virtual Address Translation
     

    The descriptor of the open disk file will typically contain buffers for block size conversions, and it will contain the data structure needed to convert addresses from file-relative form to absolute disk addresses.

    The descriptor of the disk device itself will typically contain the information needed for disk scheduling, the queue needed to communicate with the disk interrupt service routines, and other information specific to the actual disk being used.

    The read and write routines for the open disk file can be written in such a way that they call the standard, device independent read and write routines of the disk itself -- in effect, the read and write routines that implement the open disk file can be written as if they were user-level code.
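
    As a sketch of that layering in C, assume (purely for illustration) a dev_read_block() routine exported by the disk layer and a file that happens to occupy consecutive blocks on the disk; a real file system would consult an index structure instead of doing the simple addition shown here.

        struct device;                        /* descriptor of the disk itself */
        int dev_read_block(struct device *d,  /* assumed disk-layer interface  */
                           unsigned block, void *buf);

        struct open_disk_file {
            struct device *disk;          /* the underlying disk               */
            unsigned       first_block;   /* where this file starts on disk    */
            unsigned       block_count;   /* how many blocks the file occupies */
        };

        /* Translate a file-relative block number into a disk address and pass
           the request down, exactly as user-level code would call the disk.  */
        int file_read_block(struct open_disk_file *f, unsigned block, void *buf)
        {
            if (block >= f->block_count)
                return -1;                /* out of range */
            return dev_read_block(f->disk, f->first_block + block, buf);
        }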

    Real time Clocks

    Timer hardware

    Before discussing file systems, it's useful to spend some time on a simpler class of devices: timers. In a sense, a timer is the simplest of all I/O devices, because it performs no I/O. There are many kinds of timers; the simplest is the periodic timer. All it does is request a periodic interrupt.

    Most modern operating systems, Unix, Windows and Mac/OS included, use periodic timers. In the case of early Unix the timer fired 60 times per second, a frequency derived from the power supply. Modern Unix systems almost universally have the system clock fire 100 times per second. In the case of the DOS and Windows environments, the interval has classically been 18.2 ticks per second. Most modern systems use hardware that is able to provide far more general timer service, but commercially available operating systems (excluding those sold as real-time systems) frequently ignore the generality of the hardware in favor of using it to provide a simple periodic interrupt.

    As an example of a very simple hardware interface, consider a timer that fires every millisecond. This might present the following interface to the software:

    expired -- once a millisecond, the timer sets this single bit interface register. When this bit is set, the timer also requests an interrupt. The timer interrupt service software must reset this bit before returning from or re-enabling interrupts in order to turn off the interrupt request.

    This simple timer will be used for the remainder of this lecture, but note that far more complex timers are common. For example, interval timers contain a register that is decremented at a constant rate, for example, once per microsecond. The timer requests an interrupt when the register reaches zero, and the interrupt service routine may load a new count in the register to request the next interrupt.
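
    For an interval timer of that kind, the interrupt service routine might look like the following C sketch; the register address and the microsecond rate are assumptions made only for the example.

        #include <stdint.h>

        /* Hypothetical count register, decremented once per microsecond;
           storing a value into it schedules the next interrupt.          */
        #define TIMER_COUNT (*(volatile uint32_t *) 0xFFFF0200)

        #define TICK_US 1000u   /* re-arm for one millisecond */

        void timer_interrupt(void)
        {
            TIMER_COUNT = TICK_US;  /* request the next interrupt */
            /* ... followed by the timer bookkeeping discussed below ... */
        }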

    Timer Software

    It is rare to find application software that requires a periodic interrupt at the same rate as the hardware period of the timer. When the application requires a periodic interrupt at longer intervals, however, this can be synthesized from the basic periodic interrupt of the hardware clock as follows:

        periodic-interrupt-service-routine:
            reset( expired )
            countdown := countdown - 1;
            if countdown = 0 call user-service
            return
     
    Here, the procedure user-service is called at the end of an interval defined by the periodic decrementing of the variable countdown; user-service should reset countdown each time it is called. If user-service always sets countdown to the same value, the user routine will be called periodically; if user-service sets different values each time it is called, it will be aperiodic. This creates, in software, a programmable interval timer, but most modern real-time-clock hardware directly supports this function, with a countdown register decrementing at several megahertz and the rule that the hardware will request an interrupt when the register reaches zero.

    Application programs rarely want just one interval timer. What they typically want is a service such as the following, callable from within any user process:

    delay( t ) -- when a process calls this operating system service, that process will be delayed t time units (the particular time unit depends on the system; we will use milliseconds for the examples presented here).

    Exercise: How can you implement the example delay() service in terms of the Unix kernel's timer primitives -- see the man page for setitimer() for documentation.

    On a system with only one process and an interval timer, this service would be trivial to implement. On a system with more than one process and a periodic timer, this service is far more interesting. In such a context, many processes may be awaiting different times, and there may easily be more than one process waiting for the same time instant.

    Thus, there must be a list of waiting processes, with a record, for each process, of the time it is awaiting. When a timer expires, it must somehow search this list and wake up each process waiting for the current instant.

    The need to block processes and then wake them up suggests that each timer record will have a semaphore on which the process blocks. This leads to the following data structure:

        Timer record:
            link fields, as needed for list management
            delay -- the requested time delay
            sem -- semaphore used to block process
     
         Timer list:
            a linked list of timer records
     

    The delay service seen by an application program would then be as follows:

        delay( t )
            get a timer record r
            initialize r.sem to a count of zero
            r.delay := t
            append( r, timer-list )
            P( r.sem )
     

    A single timer record can be permanently attached to each process. The interrupt routine needed to support this service is as follows:

        periodic-interrupt-service-routine:
            reset( expired )
            for each timer record r in timer-list
               r.delay := r.delay - 1
               if r.delay <= 0 then
                  remove( r, timer-list )
                  V( r.sem )
               endif
            endloop
            return
     

    Code written along the above lines is actually used on some systems, but it begins to behave very badly on systems with large numbers of waiting processes. The problem is that interrupt service routines should never perform open ended computations such as scanning of linked lists. This work should be moved to an application process if at all possible.

    Nonetheless, the above code implements what can properly be called a virtual interval timer available to each applications process. This notion of virtual timers is an important example of virtual device interfaces; its primary importance is that timers, as a class of device, have the simplest of semantics.

    Exercise: Explore the documentation for the UNIX timer services and write a critical essay on the deficiencies of these services. There are many!

    Improving Timer Performance

    The key to avoiding complex interrupt processing in the implementation of timer services is to rely on a sorted list of timer records, each holding the time being awaited. This requires the following data structure:

        Timer record:
            link fields, as needed for list management
            time -- the requested time at which the timer should fire
            sem -- semaphore used to block process
     
         Timer queue:
            a priority queue of timer records, sorted by time
     
         Clock:
            the current time
     
    There are many priority queue implementations; only the simplest, based on a linear list, is suggested by the figure below:
     _____________     ______     ______     ______
     | Timer Queue |-->| Link |-->| Link |-->| Link |--X
     |_____________|   |------|   |------|   |------|
                       | time |   | time |   | time |
                     / |------|   |------|   |------|
      Sort by order /  | sem  |   | sem  |   | sem  |
      of time fields!  |______|   |______|   |______|
     

    If the system is to run for long periods, the variables used to hold clock and the time fields of timer records must have sufficient precision that overflows are not likely during the life of the system. A year's worth of milliseconds will overflow a 32 bit counter, so realistic systems should use 48 or 64 bit counters.

    Some systems maintain all time values in, for example, milliseconds since some arbitrary dawn of time. Others maintain time values in more expensive forms, with fields divided out for year, day, hour, minute, second and millisecond. There is little reason to do the latter, however, since these divisions are usually only needed for the output of dates, and internal time and interval arithmetic is simplified by holding time values in a simple form.

    Exercise: The classic Unix representation of the date uses a signed 32 bit count of the seconds since Midnight January 1, 1970, GMT. When will the UNIX date overflow? (It is interesting to note that MacOS and UNIX were both designed without Y2K errors in their kernels.)

    Assuming that the time fields are absolute times, in milliseconds, and that the timer interrupt service routine maintains a global variable called clock, measuring milliseconds from the beginning of time, the delay service can be coded as follows:

        delay( t )
            get a timer record r
            initialize r.sem to a count of zero
            r.time := clock + t
            enqueue( r, timer-queue )
            P( r.sem )
     
    The interrupt routine needed to support this service is as follows:
        periodic-interrupt-service-routine:
            reset( expired )
            clock := clock + 1   -- count time
            while not empty( timer-queue ) and head( timer-queue ).time <= clock
               r := dequeue( timer-queue )
               V( r.sem )
            end loop
            return
     

    This code rests on a priority queue. The simplest such data structure is a sorted linked list, where the enqueue service sorts new items into the list in time order, the head of the list is always available, and the dequeue service simply removes the head element from the list.
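
    A C sketch of that sorted-list implementation follows. The record matches the fields listed above, with POSIX sem_t standing in for whatever semaphore type the kernel provides; for brevity, the interrupt disabling needed to protect the list from concurrent access is omitted.

        #include <semaphore.h>          /* sem_t stands in for a kernel semaphore */

        struct timer_rec {
            struct timer_rec *next;     /* link field                            */
            unsigned long     time;     /* absolute time at which to fire        */
            sem_t             sem;      /* initialized to zero when the record is
                                           created; the waiting process blocks here */
        };

        static struct timer_rec *timer_queue = 0;   /* earliest time first */

        /* enqueue: sorted insert, walking past records with earlier times */
        void timer_enqueue(struct timer_rec *r)
        {
            struct timer_rec **p = &timer_queue;
            while (*p != 0 && (*p)->time <= r->time)
                p = &(*p)->next;
            r->next = *p;
            *p = r;
        }

        /* called by the periodic interrupt handler after it increments clock */
        void timer_expire(unsigned long clock)
        {
            while (timer_queue != 0 && timer_queue->time <= clock) {
                struct timer_rec *r = timer_queue;
                timer_queue = r->next;
                sem_post(&r->sem);      /* the V operation in the pseudocode */
            }
        }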

    This simple implementation moves most of the work to the enqueue service, where it belongs, but if the list is long, the enqueue service may take some time. There are faster algorithms that use binary trees or heap-ordered trees to achieve O(log n) operation times on the queue. You can get more information on this subject from "An Empirical Comparison of Priority Queue and Event Set Implementations" in Communications of the ACM, 29, April 1986, pages 300-311. Code for interesting priority queue algorithms is available to WWW users.

    Adapting the above code to a programmable interval timer poses some problems. Primary among these is the problem of maintaining the clock. Every time the interval timer expires, the clock can be updated to reflect the interval that has just passed, but for inquiries about the clock value between updates, the running count (countdown, in the example given earlier) must be inspected to get the current value.

    Date Overflow Considerations

    The above code relies on a global variable "clock" that can be incremented once per timer interrupt. This is a source of significant problems in real systems. For example, consider a timer that increments once per millisecond. This will increment by 1000 per second (so about 10 bits of timer precision are devoted to counting seconds). Counting 60 seconds per minute takes roughly another 6 bits of precision, and counting 60 minutes per hour takes another 6 bits, leaving 10 bits out of a typical 32 bit word available for counting hours. 1000 hours is just 40 days of 25 hours each, so we can quickly estimate that our 32 bit counter will overflow in just 6 weeks! If we replace rough estimates such as those used above with accurate arithmetic, we get a clock lifetime of 49.71027 days.

    Using the above algorithm, we would expect our system to run correctly for a month, and then we would expect a brief burst of irregular behavior as the clock rolled over, after which it would deliver correct performance for another month. This is hardly acceptable!

    Therefore, we seek a better solution! One approach is to simply use a higher precision counter. For example, we could replace our 32 bit counter with a 64 bit counter; with this, our system would last for millennia instead of mere months. We pay a price, however, in slow arithmetic, unless we're running on a machine with a 64 bit ALU. Actually, the situation is not that bad. On most machines where a register to register add takes one instruction, a double-word register to register add can be done in two or three instructions.
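
    As a small, hedged illustration (assuming a C compiler with the standard stdint.h types), a 64 bit millisecond clock removes the overflow problem for any practical system lifetime:

        #include <stdint.h>

        static uint64_t clock_ms = 0;      /* milliseconds since system start */

        void clock_tick( void )            /* called once per timer interrupt */
        {
            clock_ms += 1;                 /* 2^64 ms is roughly 580 million
                                              years, so overflow is a non-issue;
                                              on a 32 bit machine this compiles
                                              to an add and an add-with-carry  */
        }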

    Another approach is to note that, while we may well want our system to run for longer than one month, most of the timed delays requested by users are likely to be on the order of seconds. We can easily rewrite the timer data structures to support this by replacing the time field in each timer record with a relative delay field:

        Timer record:
            link fields, as needed for list management
            delay -- the delay between the previous timer and this
            sem -- semaphore used to block process
     
         Timer queue:
            a sequential queue of timer records, sorted by expiration time
     
         Clock:
            the current time
     
     _____________     ______     ______     ______
     | Timer Queue |-->| Link |-->| Link |-->| Link |--X
     |_____________|   |------|   |------|   |------|
                       | delay|   | delay|   | delay|
      Relative delay / |------|   |------|   |------|
      from one record  | sem  |   | sem  |   | sem  |
      to the next.     |______|   |______|   |______|
     
    This data structure adds only marginally to the complexity of the timer services, while it allows very fast arithmetic to be used for typical short delays. A C sketch of this delta-list variant follows the summary below.
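
    A hedged C sketch of the delta-list version is given here; as before, the type and function names are illustrative assumptions, and locking against the interrupt routine is omitted:

        /* Each record stores only its delay beyond the previous record, so
           the interrupt routine touches a single small counter per tick.   */

        struct semaphore;                      /* assumed kernel type       */
        void V( struct semaphore *s );         /* assumed kernel primitive  */

        struct delta_rec {
            struct delta_rec *next;
            unsigned int      delay;           /* ms beyond the previous record */
            struct semaphore *sem;
        };

        static struct delta_rec *delta_queue = 0;

        void delta_enqueue( struct delta_rec *r, unsigned int t )
        {
            struct delta_rec **p = &delta_queue;
            while (*p != 0 && (*p)->delay <= t) {     /* consume earlier delays */
                t -= (*p)->delay;
                p = &(*p)->next;
            }
            r->delay = t;
            if (*p != 0) (*p)->delay -= t;            /* successor now relative to r */
            r->next = *p;
            *p = r;
        }

        void delta_tick( void )                       /* one timer interrupt */
        {
            if (delta_queue != 0 && delta_queue->delay > 0)
                delta_queue->delay -= 1;
            while (delta_queue != 0 && delta_queue->delay == 0) {
                struct delta_rec *r = delta_queue;    /* this timer has expired */
                delta_queue = r->next;
                V( r->sem );
            }
        }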

    Summary: The above notes suggest a number of ways to implement a large number of virtual timers for use by user processes, when given only a single hardware timer. There are many more ways of doing this! This lesson applies to any of a number of virtual resources created by operating systems, so, for example, a file system may create many virtual disks from one physical disk in many different ways, and a window manager may create many virtual display screens within the confines of one physical display screen in many different ways.

    Real Time Considerations

    The semantics of timers is not always immediately obvious. When a process says delay s seconds, this does not mean to wait exactly s seconds and then continue. Rather, it means to wait at least s seconds. If there are other ready processes at the end of the wait, we generally make no attempt to predict which process will be running at the end of the wait; we merely insist that the waiting process become ready when the timer expires.

    This is consistent with the usual weak assumptions made about parallel processes -- that they advance unpredictably, so that we can make no assumptions about how CPU resources are divided between them. Usually, proof of correctness under these weak assumptions is easier than proof under more complex assumptions, so we use these weak assumptions even when stronger assumptions are reasonable.

    Note that the presence of a timed-delay or equivalent service is one of the defining characteristics of a real-time operating system! Real-time applications demand stronger assumptions. Typically, the defining characteristic of a real-time system is the presence of deadlines by which the software must complete some actions.

    When parallel programming is applied to real-time systems, we must ask if there is a feasible schedule. That is, we ask not only if there is enough CPU time overall to complete the entire computation, but we also ask if each task can accomplish the necessary computation before each deadline.

    If there is no feasible schedule, the system cannot succeed, but if there is a feasible schedule, we must also ask whether the scheduling algorithm is able to find the feasible schedule among the infinite set of possible schedules. Two classes of real-time scheduling algorithms have been widely studied.

    Rate monotonic priority scheduling will always find feasible schedules if there is approximately 1/3 idle CPU capacity. In this scheme, processes are scheduled with priorities related to the intervals between deadlines. Processes dealing with deadlines that have a short interval are given high priorities, while those dealing with deadlines that have a long interval are given relatively lower priorities.

    Usually, rate monotonic scheduling is discussed under the assumption that intervals are periodic, in which case, the priority of a process is fixed proportionally to the rate at which it must meet deadlines. In fact, there is no reason to limit this scheme to periodic deadlines, but introducing aperiodic deadlines does force the system to use dynamically varying priorities.

    Deadline based scheduling uses the time of the next deadline to be met by a process as its priority. Thus, if a process must finish some job by time t, the priority of that job is t. Distant future deadlines have low priority, while imminent deadlines have high priority. Deadline based scheduling is somewhat uncommon in practice, but it can find a feasible schedule even when there is no excess CPU capacity.
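
    For illustration only, a deadline based (earliest-deadline-first) dispatcher can be sketched in C as a scan of the ready list for the most imminent deadline; the process structure shown here is an assumption, not the layout of any particular system:

        struct process {
            struct process *next;
            unsigned long   deadline;   /* absolute time of the next deadline */
            /* registers, state, ... */
        };

        struct process *pick_next( struct process *ready_list )
        {
            struct process *best = ready_list;
            struct process *p;
            for (p = ready_list; p != 0; p = p->next)
                if (p->deadline < best->deadline)  /* imminent deadline = high priority */
                    best = p;
            return best;                           /* 0 if the ready list is empty */
        }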

    The File System

    File System Layers

    A well designed file system (as opposed, for example, to the file systems used on most commercially successful operating systems) will typically be built in many layers, for example, with a structure such as the following:

    1. The device driver
      • Interrupt service routine
      • Disk request scheduler
      • Low level disk-request queue interface
    2. Low-level file system
    3. Buffered streams
    4. Named files

    The low-level file system offers a device interface quite similar to that offered by the device driver, and possibly identical to it. The difference is that the device driver enqueues requests for I/O to a real device, while the low-level file system accepts requests for I/O to virtual devices that are each implemented by one or more real devices.

    Higher level layers of the file system create the illusion of buffered stream I/O, and independence from the actual block sizes of the underlying device, and create a system of directories and file names for the files implemented by the low-level file system. Here, we will focus on how a decent low level file system can be built.

    A standard low-level interface

    If you assume a standard device interface, shown pictorially as:

            _______
            | READ  |  -- device.read( buffer, device-address )
            |_______|
            | WRITE |  -- device.write( buffer, device-address )
            |_______|
     

    It is clear that this can be supported by the low-level interface to the device drivers of a large variety of random-access storage devices, ranging from hard and floppy disks to software emulation of such devices using fast RAM for data storage. For the purpose of this section, we will assume one fixed buffer size, equal to the sector size of the device, and we will assume that device addresses are linearized over a range such as 0 to Max-Address instead of being constructed from components such as sector, cylinder and surface numbers.

    To go beyond these simplifying assumptions, we could add a service to our standard low-level random-access device interface:

            _______
            | READ  |  -- device.read( buffer, device-address )
            |_______|
            | WRITE |  -- device.write( buffer, device-address )
            |_______|
            | INFO  |  -- device.info( infoblock )
            |_______|
     

    The new info service for each device would be expected to return information about that device's disk geometry, answering questions such as: What is Max-Address for this device? Is the sector size fixed, and if so, to what value? Is the number of sectors per track fixed and if so, to what number, and is this device divided into multiple surfaces and cylinders, and if so, how many of each? The answers to these questions may vary from device to device, creating serious problems for higher level software.
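
    In C, this standard interface can be sketched as a record of function pointers, so that physical drivers and virtual devices alike are instances of one abstract type; all of the names below (struct device, dev_addr_t, and so on) are illustrative assumptions, not an existing API:

        typedef unsigned long dev_addr_t;     /* linearized sector number */

        struct dev_info {
            dev_addr_t   max_address;         /* highest legal sector number */
            unsigned int sector_size;         /* bytes per sector, if fixed  */
            /* geometry fields (cylinders, surfaces, ...) could follow */
        };

        struct device {
            void (*read)(  struct device *d, void *buffer, dev_addr_t addr );
            void (*write)( struct device *d, const void *buffer, dev_addr_t addr );
            void (*info)(  struct device *d, struct dev_info *out );
            void *state;                      /* private data of the implementation */
        };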

    Higher level I/O primitives, such as the standard random-access stream primitives of Unix or the even higher level stream primitives of C, can easily be implemented on top of this layer, if needed, and these primitives can hide most of the complexity exposed by the addition of the info service suggested above.

    In addition to constructing physical device drivers that use this interface, we can also construct virtual devices that use it. For example, consider the following picture:

                                 ___
             _______ ____________|  _______
            | READ  |            | | READ  |  
            |_______|    Disk    | |_______|
            | WRITE |   Cache    | | WRITE |
            |_______|____________| |_______|
                                 |___
     

    This shows a disk cache implementation that assumes the availability of an underlying device supporting this interface, where the cache itself supports exactly the same interface as the underlying device. In object oriented terms, both the device interface itself and the disk cache are subclasses of the same abstract polymorphic class, and all instances of the cache subclass are implemented in terms of some instance of the same abstract class. Mathematical type theorists say this is an illegal use of the type theory on which object oriented programming rests, but it is a commonplace one!

    If interposed between any physical device and a user of that device, this disk cache can improve the apparent speed of the disk, from the user's viewpoint, by caching recently used sectors in some area of main memory, for example, using an LRU replacement algorithm. The disk cache might, for example, maintain a set of buffers in memory that hold copies of the most recently accessed disk sectors; on read, if the desired sector is buffered, no actual disk I/O takes place. On write, a copy of the data written can be held in the cache for later reading, and cache buffers can be assigned to sectors on an LRU basis, with all of this done in software.
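
    A minimal write-through cache along these lines might look like the sketch below, reusing the struct device and dev_addr_t names from the interface sketch above; the buffer count, sector size and the counter-based approximation of LRU are all arbitrary illustrative choices:

        #include <string.h>       /* memcpy */
        /* struct device and dev_addr_t as sketched earlier */

        #define NBUF    64
        #define SECTOR  512

        struct cache_buf {
            int        valid;
            dev_addr_t addr;
            unsigned   last_use;
            char       data[SECTOR];
        };

        struct cache_state {
            struct device   *disk;            /* underlying device */
            struct cache_buf buf[NBUF];
            unsigned         tick;
        };

        static struct cache_buf *cache_find( struct cache_state *c, dev_addr_t a )
        {
            int i;
            for (i = 0; i < NBUF; i++)
                if (c->buf[i].valid && c->buf[i].addr == a) return &c->buf[i];
            return 0;
        }

        static struct cache_buf *cache_victim( struct cache_state *c )
        {
            int i, v = 0;                     /* least recently used buffer */
            for (i = 1; i < NBUF; i++)
                if (c->buf[i].last_use < c->buf[v].last_use) v = i;
            return &c->buf[v];
        }

        void cache_read( struct device *d, void *buffer, dev_addr_t a )
        {
            struct cache_state *c = d->state;
            struct cache_buf *b = cache_find( c, a );
            if (b == 0) {                     /* miss: fetch from the real device */
                b = cache_victim( c );
                c->disk->read( c->disk, b->data, a );
                b->valid = 1;  b->addr = a;
            }
            b->last_use = ++c->tick;
            memcpy( buffer, b->data, SECTOR );
        }

        void cache_write( struct device *d, const void *buffer, dev_addr_t a )
        {
            struct cache_state *c = d->state;
            struct cache_buf *b = cache_find( c, a );
            if (b == 0) { b = cache_victim( c ); b->valid = 1; b->addr = a; }
            b->last_use = ++c->tick;
            memcpy( b->data, buffer, SECTOR );
            c->disk->write( c->disk, b->data, a );   /* write-through */
        }

        /* A cache instance is then just a struct device whose read and write
           fields point at cache_read and cache_write, and whose state field
           points at a struct cache_state naming the underlying disk.        */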

    Our focus here is on file systems. In effect, our lowest level model of an open file is a virtual device that also supports this standard device interface, although perhaps with a noticeably smaller value of Max Address. Thus, the following picture applies:

                                 ___
             _______ ____________|  _______ _____
            | READ  |            | | READ  |  
            |_______|   Opened   | |_______| Physical
            | WRITE |    File    | | WRITE |  Disk
            |_______|____________| |_______|_____
                                 |___
     

    The implementation of an open file must map read and write requests from the user into read and write requests on the underlying file. Of course, the file system need not be implemented directly on a physical device. The following structure is quite possible:

                            ___                    ___
        _______ ____________|  _______ ____________|  _______ _____
       | READ  |            | | READ  |            | | READ  |
       |_______|   Opened   | |_______|    Disk    | |_______| Physical
       | WRITE |    File    | | WRITE |   Cache    | | WRITE |  Disk
       |_______|____________| |_______|____________| |_______|_____
                            |___                   |___
     

    A file system consists of two essentially separate parts, one that creates opened files in response to open requests, and one that implements opened files. We will focus on the former first.

    Opened File Semantics

    Just like the underlying disk or cached disk on which it is implemented, an opened file supports our two basic operations, read and write. The fact that it is an open disk file does nothing to change the data that flows to or from disk with these operations. What it does do is change the disk addresses used. Thus, the top-level code of the read and write operations on an open file can be expressed as:

    open-file representation and access methods:
     
        device -- identity of device used to implement this file
     
        description -- some kind of description of this file
           serving to distinguish it from other files on the same
           underlying device.
        
        read( buffer, file-address )
           device.read( buffer, disk_address( file_address, description ))
     
        write( buffer, file-address )
           device.write( buffer, disk_address( file_address, description ))
     

    A user of a bare disk or a disk cache uses real disk addresses. A user of an opened file uses addresses relative to that file. In effect, these can be considered to be virtual disk addresses, and all of the address mapping mechanisms applicable to virtual address translation are equally applicable to translating file addresses to disk addresses.

    Additive linear mapping is the simplest way to translate virtual to physical addresses. This is illustrated below:

    disk_address( file_address, description ) =
         if file_address < description.size
            then return file_address + description.base
            else error
     

    In this case, the description of an open file consists of just two fields, a base and a limit, where both are sector numbers.
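
    Expressed in C (reusing the illustrative dev_addr_t type from the device interface sketch), the additive linear mapping and its two-field description are simply:

        struct lin_file { dev_addr_t base; dev_addr_t size; };

        /* returns 0 on success, -1 if the address lies beyond the file */
        int lin_disk_address( dev_addr_t file_addr, const struct lin_file *f,
                              dev_addr_t *disk_addr )
        {
            if (file_addr >= f->size) return -1;
            *disk_addr = f->base + file_addr;
            return 0;
        }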

    In old minicomputer and early microcomputer operating systems, file systems were frequently constructed this way. A directory entry consisted of a file name, a base address and a file size. If this is the only form of file mapping on a system, the primary weakness shows up when there are large numbers of small files. As the file system evolves, the disk space grows progressively more fragmented, until sufficiently large free blocks cannot be found to allow new files to be created. At this point, the file system must be compacted, sliding all existing files together and consolidating the free space into one large free block.

    On modern systems, such simple additive linear mapping is commonly called partitioning, and it is quite common for large disks to be divided into many smaller virtual disks, each called a partition, and each supporting an independent file system.

    Partitioning is done for a number of reasons:

    to allow partial backups
    It is common to divide a disk into a user partition and a system partition because the system is only changed occasionally, and therefore only needs to be backed up occasionally.

    to support an old file system
    This is an unfortunate reason for partitioning a disk! As disk drives get larger, they have frequently exceeded the maximum size anticipated by the designers of older file systems. Many of today's operating systems were originally designed in an era when disks with capacities over a few hundred megabytes were unavailable, and the designs frequently failed to anticipate the availability of larger drives. Small virtual disk drives can easily support the old file system with minimal changes!

    to control resource contention
    If each subcommunity of users allocates their files in independent partitions, the system management can control the impact of misuse of disk space. Well designed multi-user file systems generally have far more powerful tools for solving this problem, but partitioning remains in common use for this purpose.

    to allow coexisting file systems

    When one hardware system must support multiple operating systems, it is common to partition at least one of the disk drives, reserving a partition for each of the operating systems. Thus, we find PC systems with Linux partitions and DOS partitions. The coexisting operating systems must cooperate to support this by supporting compatible disk partitioning schemes; if one file system partitions the disk by cylinders while the other partitions the disk by recording surface, coexistence will be difficult!

    Table lookup mapping is a common method - this is exactly the method used with virtual memory address maps, except that it is done entirely in file system software instead of involving a memory management unit:

    disk_address( file_address, description ) = description[ file_address ]
     

    Here, the description of each file is an array, indexed by file address, giving the corresponding disk addresses. Each file has its own mapping table; the difficulty with this is that most files are quite small, while the table size is determined by the largest legal file. In such a case, most table entries would contain error indicators. A more reasonable implementation would involve storing only the non-error entries in the table:

    disk_address( file_address, description ) =
         if file_address < description.size
            then return description.table[ file_address ]
            else error
     

    Here, the description of an open file is a record containing the size of the mapping array, along with that array. This allows a dynamically allocated array of size just large enough for the file.
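
    A C sketch of this in-memory table lookup mapping might look as follows; again, the names (and the reuse of dev_addr_t from the earlier sketch) are illustrative:

        struct mapped_file {
            dev_addr_t  size;      /* number of sectors in the file       */
            dev_addr_t *table;     /* table[i] = disk address of sector i */
        };

        int map_disk_address( dev_addr_t file_addr, const struct mapped_file *f,
                              dev_addr_t *disk_addr )
        {
            if (file_addr >= f->size) return -1;   /* beyond end of file */
            *disk_addr = f->table[ file_addr ];
            return 0;
        }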

    For large files, it is not sensible to keep the entire mapping table in main memory! The obvious place to put it is on disk, and one way to do this is to put the mapping table in another file. This is recursive, but it isn't infinite regress if there is a minimum threshold size used for this method. In that case, the translation code might look something like:

    disk_address( file_address, description ) =
         if file_address >= description.size
            then error
            else if description.size = 1
               then return description.address
               else
                  sector = file_address div slots_per_sector
                  slot = file_address mod slots_per_sector
                  description.file.read( buffer, sector )
                  return buffer[ slot ]
     

    Here, an open file description has 3 fields, the size of the file, the disk address of the sector holding that file, if it is only one sector, or the file holding the table of sectors of that file, if it is larger than one sector. Each address translation for a large file involves reading a sector of the description file to get the required address.

    Note that this approach is only efficient if a disk cache sits between the file system and the disk -- if not, an extra disk access would be required for every sector of a large file read or written. Because it is highly likely that consecutive sectors will be accessed or that one sector will be accessed multiple times, perhaps with a read and then a write, the locality principle operates here, and the use of an appropriate disk cache will eliminate most of the extra I/O operations.

    In fact, the scheme outlined above can also be viewed as storing disk files as tree structures:

                          _ _ _
     root (1 sector)      |_|_|/|
                         __| |__
                       _|_ _   _|_ _
     description file |_|_|_| |_|/|/|
                     __| | |__ |___
                   _|_  _|_  _|_  _|_
     data sectors |___||___||___||___|
     

    In the above, both the root file and each sector of the intermediate level description file have been assumed to hold three disk addresses each. It is far more common to have upwards of 128 disk addresses per sector of the description file (or files), and instead of defining tiny files as those of just 1 sector, the base level data structure can contain the addresses of up to 8 or 10 sectors.

    This kind of tree structure does not impose any physical organization on the sectors of a file, and this is both an asset and a liability. It is an asset because it means that a file system organized using this kind of indexing is not subject to serious external fragmentation problems. It is a liability because it means that sequential access to files on such a system may be subject to serious latency problems imposed by poor organization of the file sectors.

    A good file system using this type of organization will typically maintain free-space data structures in such a way that space for files can be allocated in the same cylinder as the space used to index the sectors of those files, and so that consecutive sectors of a file are stored in the same cylinder whenever possible.

    Tree structured file systems are quite common, but they are rarely implemented in the straightforward way suggested above. Instead, they are usually implemented using fairly awkward special purpose code.

    Unix I-nodes

    The widely used Unix file system is an example. There, each open file is represented in memory by a data structure called an I-node. The I usually stands for either the words information or index. The classic version of the I-node data structure contains:

    • The access rights.

    • Owner, date of creation, etc.

    • The disk addresses of the first 8 sectors. This allows fast access to the first few sectors of a file.

    • The disk address of the sector containing pointers to the next 128 sectors of the file.

    • The disk address of the sector containing pointers to the sectors containing pointers to the next 128x128 sectors of the file.

    Given that the classic Unix sector size is 512 bytes, this allowed accessing files up to about 8 megabytes on classic Unix systems. This was sufficient for machines with 16 bit words, and it was sufficient for the disk technology available in the early 1970's, but by the late 1970's, it was obviously too small.
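
    A rough C rendering of this classic layout is sketched below; the field names are illustrative rather than the historical Unix declarations. The arithmetic behind the figure above: (8 + 128 + 128*128) = 16,520 sectors of 512 bytes each, or roughly 8.4 million bytes, consistent with "about 8 megabytes."

        struct classic_inode {
            unsigned short mode;              /* file type and access rights      */
            unsigned short nlink;             /* count of incoming links          */
            unsigned short uid;               /* owner                            */
            unsigned long  size;              /* file size in bytes               */
            unsigned long  direct[8];         /* addresses of the first 8 sectors */
            unsigned long  indirect;          /* sector holding 128 more addresses */
            unsigned long  double_indirect;   /* sector of 128 indirect sectors   */
            /* creation and modification dates, etc. */
        };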

    Modern Unix systems overcome the deficiency of the data structure outlined above by adding one more disk address to each I-node that supports a three-level tree, the so-called large-model file system. This extends naturally to what came to be called the huge model, with a 4-level tree.

    The complexity of the Unix I-node data structure is a result of the memory limited context of early Unix systems, combined with a desire to make the first few sectors of even the largest files easy to access. Fast access to the first sector of a file is particularly important under Unix and related systems because, when a file is executed, the "magic number" included in the first bytes of the first sector determines whether the file is object code or is intended for execution by some interpreter.

    Fundamental Problems

    The two fundamental problems that a well engineered file system must address are as follows:

    Most files are small. Small shell scripts, bits of E-mail, single procedures and other bits and pieces make up the majority of the files on most systems. Many of these are significantly smaller than a single sector. Therefore, a well engineered file system should be able to store a large number of tiny files.

    Most accesses to files made by running programs are to large files. Image files, sound files, gigantic databases, and similar applications are important! Therefore, a well engineered file system should support efficient access to very large files.

    File Names and Directory Managers

    The primary job of a directory manager is to maintain a mapping between textual file names, as presented to the open function, and the information needed to build an open file description record.

    The directory itself is the data structure on disk used by the directory manager in performing this mapping.

    The simplest approach to directory management was used on many early computer systems, including most early floppy-disk based systems. On such systems, a fixed part of each disk was reserved for the directory, which took the form of a flat table, for example:

    Directory = array [1..64] of record
                   name: packed array [0..7] of char;
                   start-sector: disk-address;
                   sectors: integer;
                 end record;
     

    Here, as in many crude file systems, a linear additive mapping scheme is used to describe the disk space occupied by a file. It is trivial to extend such a directory to include such features as the date of file creation, the length of the file, in bytes, and similar details.

    Hierarchically structured file names

    In general, users don't like flat name spaces, and they quickly start building hierarchic spaces using simple conventions such as the now almost universal

    filename.extension
     

    format. In fact, the name of this file,

    /csf/jones/.public-html/opsys/notes/13.html
     

    could as easily be treated as a single long character string in a flat name space as it can be treated as a description of a path through a tree. Flat directory structures become unwieldy, though, so most modern systems support some variant on the tree-structured directory.

    Tree-structured directories appear to have originated with proposals put forth by workers at project Mac for what eventually became the Multics system. These ideas were published in a paper by Dennis and Van Horn, in Communications of the ACM, March 1966.

    The key idea is to store directory data structures such as that outlined for a simple flat directory system in files. One bit attached to each file determines whether that file contains data or a directory. Whether it is a data file or a directory, all other details of storage allocation for the file are typically the same. Only the interpretation and the set of allowed operations on the file's contents changes.

    The Unix file system

    The Unix file system is a variant on the theme of hierarchically structured directories. Underlying the user-level directory manager is a primitive flat file system using integers as file names. The integers are simply indices into an array of I-nodes, the I-table, allocated out near the center of the (linearized) range of disk addresses:

     ___________________________________________________
     |super|                |         |                  |
     |block|                | I-table |                  |
     |_____|________________|_________|__________________|
     

    The superblock at the start of the disk contains the disk address of the I-table and its size, along with free-space data structures. In order to provide a degree of crash protection, the superblock is actually stored redundantly. The I-table is in the middle of the disk in order to minimize seek times in moving the heads from I-table to data and back.

    Each I-node has one bit indicating whether the associated file is a directory or a data file. Other than that, all other attributes such as how storage is allocated for files are the same, whether it's a directory or a data file.

    Under the Unix system, each directory is a sequence of ordered pairs <file-name, I-number>, where file-name is a variable-length string and I-number is an integer. Each directory contains two special entries, one called "." (dot), which contains the I-number of that directory, and one called ".." (dot dot), containing the I-number of the parent directory. Other than these two special entries, the directories of a Unix system are strictly tree structured.
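
    For illustration, the classic (Seventh Edition era) directory entry was simply a fixed-size <I-number, name> record; the sketch below captures the idea, though the exact historical declaration differed in detail, and modern Unix systems use variable-length entries:

        struct dir_entry {
            unsigned short d_ino;        /* I-number; 0 marks an unused slot */
            char           d_name[14];   /* file name, NUL padded            */
        };                               /* 16 bytes per entry               */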

    Unix does allow data files to be referenced from multiple directories. Thus, the same file may appear under different names in two or more directories.

    The redundancy of the Unix directory structure (self-links and back links) allows a program called fsck (file system check) to reconstruct the directory structure in the face of many possible failures. On most Unix systems, fsck is routinely run as part of the startup script. The redundancy of the Unix directory structure was not designed with support for fsck in mind; other systems, notably the Xerox Distributed File System (XDFS), have provided far greater redundancy, so that the entire directory system can be reconstructed from the data files themselves.

    Under XDFS, the prefix on each sector of each file contains the sector number of that file relative to the file it is in, along with the file number of the file (analogous to a Unix I-node number). As a result, the entire data structure describing mappings from [sector-number,file-number] pairs to actual files on disk can be reconstructed by scanning the headers on all the sectors on disk. In Unix terms, the I-table and all subsidiary data structures can be reconstructed from the representations of the files themselves.

    Under XDFS, sector zero of each file contains the file number of the directory that references that file, as well as the name of the file and any related information for the file. Thus, by finding sector zero of each file on the system (by scanning the entire disk) a strictly tree-structured directory for the file system can be reconstructed. The net result is that if some sectors on a disk are damaged, the only possible data loss is to the data in those sectors. No localized damage will ever destroy the entire file system. Sadly, few modern file systems have incorporated this degree of redundancy in order to allow a similar degree of damage survival, so today, it is common to lose entire files or directories because of very localized disk failures.

    Under Unix, the prohibition on directory structures other than a tree with back links (..) and self-links (.) is imposed in order to avoid problems with reclamation of storage when links are deleted. Each I-node contains a count of the number of incoming links; this is incremented when a link to that I-node is added, and it is decremented when a link to that I-node is removed. If the count reaches zero, the file is deleted.

    The problem with this scheme is that it can't detect when a directory becomes inaccessible, because the circular and self-links prevent the count for the directory from going to zero. As a result, Unix uses one command to remove links to data files (rm) and another command to delete directories from the directory tree (rmdir). The Cambridge File System (see Wilkes and Needham, The Cambridge Cap Computer and its Operating System) eliminates this problem by using dynamic garbage collection to reclaim unreachable files or directories. Again, this solution, while well proven, has not been adopted by any commercial file systems in widespread use.

    Performance Problems

    A successful file system must support large numbers of small files while supporting fast access to large files. Supporting large numbers of small files requires allocation of files in relatively small units such as the sector, and this encourages use of translation tables, N-ary trees, or other similar data structures to hold the data needed to translate between addresses within a file and physical disk addresses.

    To support fast access to large files, contiguous allocation of disk space is essential. Many file systems do this, preferentially selecting among available free blocks so that, when a file is enlarged, new sectors are accessible with minimum latency. This requires that, when a file is enlarged, free sectors in the same cylinder are used, if they are available, and if not, free sectors in adjacent cylinders are used.

    Protection

    Memory Protection

    Memory protection is a key element in protecting the resources of a single-CPU or shared memory machine. It is essential that users not be able to arbitrarily manipulate the variables or code of the system; if I/O is memory-mapped, memory protection can even be used to control access to I/O devices.

    If access to the memory management unit is protected so that only the system can change the memory management registers, then the memory management registers can be considered to be a protected part of the user's process state.

    This trivially allows each user to be given access to a disjoint region of the system's virtual memory, and it is noteworthy that this trivial policy can be supported with almost any MMU hardware. Here's an illustration, using a paged system:

            Trivial policy
                              USER 1           USER 2
                               ____             ____
             MMU mapping      |  o-|-----      |  o-|-----
             registers        |  o-|---  |     |  o-|---  |
                              |  o-|-  | |     |  o-|-  | |
                              |____| | | |     |____| | | |
                                     | | |            | | |
                           _______   | | |  _______   | | |
             Pages on     |       |<-  | | |       |<-  | |
             disk or in   |_______|    | | |_______|    | |
             main memory     _______   | |    _______   | |
                            |       |<-  |   |       |<-  |
                            |_______|    |   |_______|    |
                               _______   |      _______   |
                              |       |<-      |       |<-
                              |_______|        |_______|
     

    To switch between the two users shown here, the system must not only change their CPU registers (Accumulator, Program counter, etc), but it must change the contents of the MMU registers.

    If the user address space doesn't contain any system code, then the MMU must have a second set of mapping registers that apply in system state, or alternately, the MMU must be turned off in system state. The latter is simple, on machines that allow it, but it eliminates the possibility of having parts of the system run in virtual memory. The former (with two sets of mapping registers) is easy -- but it may make it difficult for system code to address arguments to system calls, since they're in the wrong address space. The alternative of allowing system and user code to both address the same memory requires that there be some mechanism that changes the addressability of system pages when system state is entered -- for example, the MMU could declare that access to system pages is legal only when the program is in system state.

    Sharing Memory Between Users

    A paged memory management unit can do much more than provide each user with a distinct address space. For example, consider the following:

                             USER 1            USER 2
                               ____              ____
             MMU mapping      |  o-|-----       |  o-|-----
             registers        |  o-|---  |      |  o-|---  |
                              |  o-|-  | |      |  o-|-  | |
                              |____| | | |      |____| | | |
                                     | | |             | | |
                           _______   | | |   _______   | | |
             Pages on     |       |<-  |  ->|Shared!|<-  | |
             disk or in   |_______|    |    |_______|    | |
             main memory     _______   |       _______   | |
                            |       |<-       |       |<-  |
                            |_______|         |_______|    |
                                                 _______   |
                                                |       |<-
                                                |_______|
     

    How do users arrange to share pages (or segments)? What if shared pages are not at the same virtual address? This simple picture does not answer these questions!

    In most Unix systems, read-only code segments are always shared between all processes that are executing the same program. Shared read-write segments are allowed under System V Unix, using a scheme borrowed from the 1966 vintage Berkeley Timesharing System. This required the following two system calls:

    	Create shared segment ( pages, name, security )
     
     	Insert shared segment ( pages, name, security, address )
     

    When a shared segment is created, it is given a name. Unfortunately, this name is in a flat name-space (it is unstructured) and there are no tools provided to prevent users from selecting names that conflict.

    Sharable segments have a size, an identifying name, and some security attributes (on the Berkeley system, a password; on Unix, a short access control list with the same format as that used for files).

    When a process needs to address a shared segment, the user specifies the segment, the number of pages desired (which may be less than the size), and the virtual address where the segment should appear in the user's address space.

    If two different processes insert the same segment at different virtual addresses, then pointers that refer to part of the segment from within the address space of one process will not refer to the same place from within the virtual address of the other process.

    The Unix System V shmat() kernel call attaches a shared segment using this model. Under System V, shared segments have 32-bit names, usually interpreted as integers, selected from a global name space. As a result, programmers must generally allocate names dynamically because of the possibility that some other program has already created a segment with some particular identifier.
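
    For concreteness, a minimal user-level use of these System V calls is sketched below; the key value 0x1234 is an arbitrary example, and the fact that nothing stops an unrelated program from picking the same key is exactly the flat name space problem just described:

        #include <sys/ipc.h>
        #include <sys/shm.h>

        int main( void )
        {
            int id = shmget( (key_t)0x1234, 4096, IPC_CREAT | 0600 );
            if (id < 0) return 1;

            char *p = shmat( id, (void *)0, 0 );  /* attach at a system-chosen address */
            if (p != (char *)-1)
                p[0] = 'x';     /* visible to every process attaching this segment */
            return 0;
        }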

    Alternative Approaches to Sharing Memory - share with children

    On the now discontinued Encore Multimax computer under the UMAX operating system, a variant of Unix, there was only one system service used for memory sharing:

        Share( range of addresses )
           mark the pages holding the indicated virtual addresses as shared.
     

    Marking a page as shared had no immediate effect. Instead, the effect was delayed until the process executed the fork() system call. At that point, all non-shared pages were copied (at least, that was the user's impression of what happened), while pages marked as shared remained accessible by both the parent and child.

    This scheme allowed pages to be shared between processes only if the common ancestor of those processes established the sharing. This was sufficient to allow multiple processes to cooperate in solving some problem on behalf of a single user, but it didn't allow sharing between users.

    This scheme guarantees that a shared page will occur in the same place in all address spaces that share it.

    On demand paged Unix systems, the trivial implementation of fork is expensive -- copying all non-shared pages of a process requires that all pages be pulled into memory and copied, a job that may require many disk accesses in a row. To avoid this, most virtual memory Unix systems use what is called lazy fork or copy-on-write semantics. After a fork, all pages are physically shared but marked as read-only. When a user tries to write on such a page, the page-fault trap service routine replicates the page on behalf of the process that tried to change it, making a private read-write copy for that process. This is yet another type of soft page fault. The result of this design is that the expense of the fork operation is distributed over some number of page faults, and the only pages that get copied are those that are actually modified.

    Alternative Approaches to Sharing Memory -- share pages of files

    The Multics system had only one operation that supported sharing:

         OPEN( disk file name, virtual address )
     	Map the pages of the indicated file into the process's
     	address space at the indicated address.  Once a disk
             file is open, the entire file becomes addressable as
             if it had been read into a giant buffer in memory.
     

    This relies on the file system to determine what users are allowed to open which files.

    The Multics system was built on the GE 600, a 36 bit machine, by a consortium of MIT, Bell Labs and GE. Honeywell bought out GE's computer division and continued selling Multics systems through the 1970's. There may still be a few Multics systems running, and other manufacturers, notably Prime, have copied most of the good ideas from Multics.

    The name Unix is a pun on Multics. Unix was written at Bell Labs after Bell dropped out of the Multics consortium.

    Multics used a paged-segmented model of virtual memory, and each segment could address one file. Mapping a file into the virtual address space of a process was the only way to read or write that file. (Sequential I/O to terminals, magnetic tape and other non-random-access devices was done through different mechanisms.)

    If two users opened the same file at different virtual addresses (different segment numbers), then major problems would arise with data structures containing pointers in the segment they shared.

    Shared memory under Windows NT is based on the Multics model, in the sense that all memory sharing under Windows NT is done by mapping shared files into the virtual memory address spaces of the processes that wish to share memory. The BSD Unix mmap() system call supports this model, and this has since been incorporated into System V and other more modern standards for Unix.
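
    A minimal illustration of this file-mapping model, using the mmap() call mentioned above, might look like the following; the file name shared.dat and the 4096 byte length are arbitrary examples:

        #include <sys/mman.h>
        #include <fcntl.h>
        #include <unistd.h>

        int main( void )
        {
            int fd = open( "shared.dat", O_RDWR );
            if (fd < 0) return 1;

            char *p = mmap( (void *)0, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0 );
            if (p != MAP_FAILED)
                p[0] = 'x';   /* the change is seen by every process mapping the file */
            return 0;
        }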

    Protection Theory

    From a theoretical point of view, we can talk about any kind of object and any kind of user. It is worth asking if users are themselves a kind of object, but what matters is that users act on objects using operations, where the type of each object may determine the set of applicable operations.

    Here, we are not concerned with the nature of objects, other than the fact that objects may be operated on by users using operations selected from a finite set. If we adopt the terminology of object-oriented programming, the set of operations applicable to an object is the set of methods defined by the object's class. Files fit this model, with the operations read and write. Directories fit this model, with the operations make-directory-entry, lookup-directory-entry and delete-directory-entry. Memory pages or segments also fit this model, with the operations load and store.

    We are also not concerned with the nature of users. From the point of view of the theory, users may be humans, programs, processes or anything else that may evoke operations on objects.

    For example purposes, it is useful to think of files as the objects and humans as the users, but it is equally useful to think of pages or segments as the objects and processes as the users.

    Static Protection Models

    The access matrix is a universal tool for expressing models of the state of a protection system at some instant in time. It shows, for that instant, all users, all objects, and the set of operations each user is allowed to apply to each object in the system. Pictorially, access matrices are generally presented as follows:

                            Objects
     
     	            | Cake | Tea   |
     	      ------+------+-------|
     	       Fred | Eat  | Drink |
     	Users ------+------+-------|
     	       Lucy | Bake | Drink |
     	            |      | Brew  |
     	      ---------------------   Allowed
                                          Operations
     

    The above example illustrates the (rather stereotypical) interaction of two users, Fred and Lucy, centered around two objects, Cake and Tea. The operations Eat and Bake apply to cake; Fred can Eat Cake, but not Bake it, and Lucy can Bake Cake, but not Eat it. The operations Drink and Brew apply to Tea. From the point of view of the access matrix, neither the names of users, the names of objects, nor the names of operations have any meaning other than the explicit relations between users, objects and operations detailed in the matrix.

    Access matrices may be used to control access to any resource, by any kind of user. Their use is not limited to computers; for example, they have been used in the Russian Gulag to control the access of prisoner/laborers to rooms in a prison/factory.

    The access matrix describes allowed actions, but it does not describe an enforcement mechanism. In the case of the Russian prison, the rules of the matrix were enforced by guards with guns.

    On a computer, one could store the access matrix inside the operating system, and have the system check the matrix whenever a user tried to access a resource. The problem with this is that the matrix is usually quite sparse -- that is, on most systems, most matrix entries will say "no access".

    It is important to note that the access matrix is a static description of who can do what operations on which resources. The model must be extended to describe how access rights are changed, and these extensions frequently differ markedly depending on how the access matrix is implemented.

    Access Control Lists

    There are two practical ways to implement the access matrix, access control lists and capability lists. Each of these allows efficient storage of the information in the access matrix.

    The access control list implementation of the access matrix given above is:

            Access Control Lists
                       --------       --------
             Objects  |  Cake  |     |  Tea   |
                       --------       --------
                          |              |
                       --------       --------
             Access   | Fred   |     | Fred   |
              Control |    Eat |     |  Drink |
               Lists  |--------|     |--------|
                      | Lucy   |     | Lucy   |
                      |   Bake |     |  Drink |
                       --------      |  Brew  |
                                      --------
     

    An access control list or ACL is what results when the access matrix is stored column-wise: each column of the matrix is stored with the protected object it describes. Each entry in an access control list specifies a user and the operations that user may perform on that object.

    One problem with access control lists is that they may be quite large. It is easy to imagine the access control lists for some widely used files occupying more storage than the data in the file. The usual solution to this is to introduce named sets of users -- for example, OTHERS to specify the rights given to all users not explicitly mentioned in the list.

    Example: the Multics file system

    Under the classic Multics system, which was also used by the PRIMOS operating system, each file has an ACL, with access rights such as the following:

    • Read
    • Write
    • Traverse -- for directories
    • Execute -- for data
    • EditACL

    The read and write rights should be obvious.

    The traverse right is used in the context of hierarchical directory structures. A user who has no right to read or write a directory can mention that directory in the path name of a file as long as the user has traverse rights to that directory -- the user is just passing through, but may not open the directory to list its contents.

    The right to edit an access control list illustrates one approach to allowing the contents of the access control matrix to be modified. In a sense, all users with the right to edit the access control list of an object can be thought of as co-owners of the object. If one user (the legitimate owner of an object) grants the right to edit that object's ACL to another user, it is like granting the other user a power of attorney over the object.

    Example: the Unix file system

    Under Unix, each file has an ACL, but the ACL is strictly limited in structure. Each Unix ACL has 3 fixed entries:

    • An entry for the owner of the file
    • An entry for others in the same group as the owner
    • An entry for the general public

    The rights associated with each entry are:

    • Read
    • Write
    • Execute/Traverse

    The Multics project originated as a joint project between MIT, GE and Bell Labs in the mid 1960's. In the late 1960's, Bell pulled out of the Multics project, largely because the technical staff of the computing group had concluded that Multics was getting too complicated. Unix was born from this frustration, and the name Unix is a pun on Multics -- suggesting a scaled down system designed to meet a scaled down set of goals.

    The simple to implement ACL scheme of Unix is an example of this. Unlike the Multics ACL, it is not general, but it meets the needs of most users of small computer systems. As a result, every Unix ACL fits exactly in 9 bits of the I-node of the file it describes!
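
    To make the 9 bit encoding concrete, here is a simplified access check in C; the real kernel also considers the superuser, the set-uid/set-gid bits and supplementary groups, so this is only a sketch:

        #define R 4
        #define W 2
        #define X 1

        int permits( unsigned short mode,        /* e.g. 0644 = rw-r--r--   */
                     int is_owner, int in_group, int want )
        {
            int bits;
            if (is_owner)      bits = (mode >> 6) & 7;   /* owner field  */
            else if (in_group) bits = (mode >> 3) & 7;   /* group field  */
            else               bits = mode & 7;          /* public field */
            return (bits & want) == want;
        }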

    Multics was intended to meet the needs of a general information utility, and as such, it had to deal with the problems of large numbers of users, including users from competing organizations. The generality of the Multics access control list is well justified by this environment, and the success of PRIMOS in meeting the needs of companies like Compuserve or The Source (modern examples of computer utilities) amply illustrates the appropriateness of the Multics design.

    When Unix systems are extended beyond single departments, the short ACL of Unix becomes a problem. Distributed versions of Unix such as the Apollo Domain operating system and its successors have been forced to add general ACLs back onto the Unix file model in order to meet the demands of the large user base that such a system can serve.

    Capability Lists

    As mentioned above, capability lists are the alternative to access control lists. The capability list implementation of the sample access matrix given above is as follows:

             ------   --------------------
             | Fred |-| Tea     | Cake     |
              ------  |   Drink |      Eat |
                       --------------------
     
              ------   --------------------
             | Lucy |-| Tea     | Cake     |
              ------  |   Drink |     Bake |
                      |   Brew  |          |
                       --------------------
              Users    C-lists
     

    A Capability List is the dual (in the topological sense) of an access control list. Instead of storing one column of the access matrix with a protected object, a row of the access matrix is stored with the user.

    A capability list entry, known as a capability, consists of the name of an object and the rights one user has to that object.

    This does not change the rights a user has to a resource, and it does not change the set of policies that can be implemented, but it does change the economics of certain operations.

    With a capability-list based protection mechanism, it is clear that users should not be granted the right to arbitrarily edit their own capability lists! Instead, it is common to consider the capability list itself to be an object to which rights can be granted. Thus, Fred in the above example might have the right to edit Lucy's capability list. Typically, users might have the right to move capabilities from any list they can read to any list they can write, but users would not be able to modify a capability, except by removing access rights from it.

    Capability Based Addressing

    The most important application of capability lists in modern computer systems is to addressing. With capability-based addressing, each address is formed from two parts, as follows:

             ADDRESS
              __________________
             |_______|__________|
                 |         |
                 |       Index into an addressed object
                 |
               Index into a capability list
               used to select an object
     

    Capability based addressing, as such, was invented for the Chicago Magic Number computer, a paper machine that was described by people at the University of Chicago in the late 1960's. Capability lists were invented by people involved with the Multics project, although Multics was not itself thought of as being based on this idea.

    Capability based addressing solves some interesting problems. For example, on Unix, users can frequently name objects that they may not access, and users may find themselves in the odd situation where they own a file (and pay for the storage it occupies) yet they can't address it because they have no rights to the directory where that file is listed.

    Capability based addressing solves this problem by associating the right to access an object with the ability to name it. Users' names for objects are always relative to some capability list, and any global absolute name for an object is hidden inside the capability and is never visible to the user.

    Examples

    The most common use of capability based addressing is in paged or segmented virtual memory address translation. Each entry in a page table clearly indicates both the object to which that entry grants access and the access rights granted, and the page table associated with each user is the page table for that user.

          VIRTUAL ADDRESS
           __________________
          |_______|__________|
              |         |
              |       Word in page
              |
            Index into the process's C-list for pages
            the objects referenced by capabilities are pages
     

    Note that it is difficult to imagine an efficient implementation of paged virtual address translation using one access control list per page. This would require either an associative memory or a search of some kind with each memory reference. In contrast, a simple table lookup suffices to check the access rights in a conventional (capability list based) virtual address translation system.
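
    A hedged C sketch of this capability-style translation follows; the page size, the entry layout and the rights encoding are all illustrative assumptions:

        #define PT_READ   1
        #define PT_WRITE  2

        struct pte { unsigned long frame; unsigned char rights; };

        struct pcb { struct pte *page_table; unsigned int pages; };

        /* returns the physical address, or -1 if the access is not permitted */
        long translate( struct pcb *p, unsigned long vaddr, int want )
        {
            unsigned long page   = vaddr >> 12;          /* 4096 byte pages */
            unsigned long offset = vaddr & 0xfff;
            if (page >= p->pages) return -1;
            if ((p->page_table[page].rights & want) != want) return -1;
            return (long)((p->page_table[page].frame << 12) | offset);
        }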

    Another common example of the use of a capability list is in accessing open files. Even when access control lists are used to control the right to open a file, once the file is open, it is common to simply enter the file description into a list of open files attached to a process. Files in this list are referenced not by name, but by their position in the list, and the entries in this list indicate not only what file is involved but how it was opened -- that is, what access rights apply. Clearly, the list of open files is a capability list, as sketched in the code below and pictured in the diagram that follows it.
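
    The same idea in C: the open file table of a process is a small array of capabilities, and the file handle f passed to read is nothing more than an index into it. The struct device type is reused from the earlier interface sketch as the "opened file" object, and all other names are illustrative:

        #define OPEN_MAX  16
        #define MAY_READ   1
        #define MAY_WRITE  2

        struct open_file { struct device *file; int rights; };

        struct open_file open_files[OPEN_MAX];      /* one table per process in reality */

        int file_read( int f, void *buffer, dev_addr_t address )
        {
            if (f < 0 || f >= OPEN_MAX || open_files[f].file == 0) return -1;
            if ((open_files[f].rights & MAY_READ) == 0) return -1;  /* no capability */
            open_files[f].file->read( open_files[f].file, buffer, address );
            return 0;
        }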

         Read( f, buffer, address )
                |            |
                |          address of a block in the file
                |
              index into the process's open file C-list
             the objects referenced by capabilities are files
     

    These two examples illustrate a problem with the way capabilities are used in modern systems. Instead of a single capability list per user, where that list describes all the resources the user is allowed to access, we tend to give users a collection of lists, one list for each type of resource. This is awkward.

    Some systems have been built that solve this problem, notably the IBM AS/400 (and System 38). Another commercial system in this class is the Plessey System 250. Experimental systems that have solved this problem include the Cambridge CAP system and the Carnegie-Mellon Hydra and C.mmp systems. These will be discussed later.

    It is worth noting that, on Multics, when a file is opened, the sectors of that file become visible as a segment in the virtual address space of the process that requested the open operation. Thus, for Multics, there is no distinction between the capability list describing the segments a process may access and the capability list describing the files the process may access.

    Domains and Domain Switching

    The domain of a process is the set of all objects that process may manipulate. In a simple capability based system,

    	  Domain = C-list
     

    In the abstract, domain switching involves a change of the current working C-list. Domain switching is very important! It is very common, and systems must control it.

    For example, when a trap occurs, the system typically changes from a user domain to the system kernel's domain, and on return from trap, the system switches back to the user's domain.

    The entire reason for introducing two states in order to build a simple protected system is to create two domains, a system domain, that includes all objects, and a user domain, that includes only some objects.

    One of the most complex domain switching mechanisms was introduced on the Multics system. On this system, each process has a level between 0 and 16. In addition, each segment has a level, and when a process attempts a memory access, the hardware checks the process level against the level of the accessed segment.

    A Multics memory access is legal if the process is running at a level at least as high as that of the segment (that is, at a numerically equal or lower level number), and illegal if the process level is below the segment level.

    The Multics hierarchy of domains is a generalization of the two domain model of the primitive systems discussed previously. Multics allowed the operating system itself to be subdivided, so that only the process manager operated at the highest level, level 0. The virtual memory manager was at level 1, accounting was at a more restricted level, things like line-printer spoolers were even more restricted, leaving a variety of levels available to users.

    A user might run undebugged code at level 15, so that bugs in the code wouldn't damage other parts of the user's program.

    To further complicate the Multics model, some pages can be marked as gateways. A procedure call to address zero in a gateway page is the only legal memory reference from a process running at a numerically higher level.

    This causes the current level of the process to be pushed on the stack along with the return address. The called procedure then runs at the level of the segment containing it, and the level is reset to the original level when the procedure returns.

    A Multics gateway is the generalization of a trap in a two-level security system. In addition to allowing system calls to run at the system level, it allows users to divide their code into more secure and less secure segments.

    For example, consider a system where Joe, one of the users, owns a private database (say Chem Abstracts), and wishes to sell access to that database on a per-use basis. This means that Joe can't allow direct access to the database, he only wants other users to access it through his access code (which produces an appropriate billing record every time someone calls it).

    Joe would simply arrange for his access code to run at a low numbered level (also called an inner ring), and he would require that users wishing to run his code do so from a higher numbered level (or outer ring), crossing a gateway as they call his code.

    The above kind of consideration was a natural outcome of thinking about a computer utility -- the motivation behind Multics.

    Note, however, that the strictly hierarchic scheme of Multics does not solve some problems. For example, it fails to solve the problems of two mutually suspicious users. Either one or the other must be given greater access rights if code from both users must share an address space.

    Deadlock

    Introduction

    Deadlock occurs when two or more processes are blocked, where each process is blocked awaiting some action by another blocked process.

    A process might be blocked awaiting a message from another process, or it might be blocked awaiting a resource currently held by another process. If all the processes in a deadlocked group of processes are blocked awaiting messages, the deadlock is described as a communication deadlock; if all processes in a deadlocked group are awaiting resources, it is a resource deadlock. Mixed cases where some processes in the group are awaiting resources and others are awaiting messages are quite possible.

    In either case, whether resources or messages (or both) are involved, the key element in any deadlock is that there is a cyclic wait. In some older texts, the possibility of message passing is completely ignored, and deadlocks are defined entirely in terms of resource contention.

    Resource Deadlocks

    Deadlocks resulting from resource allocation have been widely studied since the late 1960's. Three distinct issues have been raised in this context:

    • Deadlock Detection
    • Deadlock Resolution
    • Deadlock Avoidance

    Once a deadlock is detected, it must be somehow resolved. Deadlock resolution is frequently destructive -- only fault tolerant applications can survive it. Therefore, if it is possible, deadlock avoidance is better.

    Deadlock Detection

    Since a deadlock involves a cycle of processes waiting for something, the standard algorithms for deadlock detection convert the problem into graph theoretic terms and then use various cycle finding algorithms on the resulting graph.

    The following notation is traditionally used in graph theoretic formulations of the deadlock detection problem:

       Allocated        Process that
        resource         has the resource
           _                _
          |_|------------->(_)
     
        Waiting          Resource the
        process          process waits for
           _                _
          (_)------------->|_|
     
        Waiting          Process that
        process          should send message
           _                _
          (_)------------->(_)
     

    The graph has vertices for each process (round) and resource (square). The graph has edges representing the waits-for relation. A process can wait for a resource, a resource can wait to be released by a process, and (in message passing systems) a process can wait for a message from some other process. The latter is a modern extension to this old notation.

    Here is an example "wait for graph" describing 4 processes, A, B, C and D, and 4 resources, X, Y, Z and W. A is waiting for X, but B has both X and Z. C has W and is waiting for Y, and D has Y and is waiting for W. Does a deadlock exist?

                Processes     Resources
                     _            _
                   A(_)--------->|_|X
                              __/
                             /
                     _      /     _
                   B(_)<===    ->|_|Y
                            \/   /
                            /\  /  
                     _ ____/  \/  _
                   C(_)<-     /\_|_|Z
                          \  /
                           \/
                     _     /\____ _
                   D(_)<--    -->|_|W
                      \_____/
     

    There is a deadlock involving processes C and D and resources W and Y.

    Detecting a cycle in a graph involves traversing the cycle! Thus, the cost of deadlock detection is related to the number of vertices in the cycle, which could be as large as the number of vertices in the graph. As a result, deadlock detection algorithms scale poorly, growing in expense as the system grows in size.

    As long as the only wait operations involve processes waiting on resources, the deadlock graph is formally a bipartite graph, that is, a graph where there are two sets of vertices (round and square, here), and where every edge connects a vertex in one set with a vertex in the other. If processes may wait for messages, the graph is not bipartite, but this has no effect on the relevant algorithms.

    A Review of Elementary Graph Algorithms

    Basic graph algorithms are a standard part of undergraduate courses on data structures!

    A graph consists of a set of vertices (processes and resources, in our example), and a set of edges (the arrows connecting processes and resources in our example). In general, edges need not have direction associated with them, but in our example application, we are only interested in directed graphs, where edges have direction.

    Many of the basic questions we will need to resolve about graphs rest on the same basic meta-algorithm for graph traversal:

        Initially, all vertices and edges are unmarked
         R is some vertex, to be used as the root
         Mark R
         S = { R }
         Repeat
             Remove a vertex V from S
             For each V' directly reachable from V,
                 If V' is not marked
                     mark V'
                     put V' in S
                     mark the edge <V,V'>
                 endif
             endfor
         Until S is empty
     

    At any point during the execution of this meta-algorithm, the set S contains the vertices "on the frontier" of the graph being explored. Marked vertices that are no-longer in S have been fully explored, while unmarked vertices not yet in S remain to be visited.

    The nature of the markings put on vertices and edges leads to variations in the results obtained from this meta algorithm. The order in which vertices are removed from S is the other primary variable. If S is managed on a LIFO basis, this algorithm performs a depth-first traversal of the graph, following the longest possible path from the root before returning to visit other parts of the graph. If S is managed on a FIFO basis, this algorithm performs a breadth-first traversal of the graph, visiting all vertices reachable from vertices already in S before visiting others beyond them. It is also possible to manage S as a priority queue, using, for example, the value of the marking on each vertex as a priority.
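
    The following C sketch is one way to realize the meta-algorithm, assuming a small adjacency-matrix graph; the worklist S is managed as a LIFO stack, so the traversal is depth-first (managing it as a FIFO queue would make it breadth-first).

      #include <stdbool.h>

      #define NV 8                        /* number of vertices (assumption)   */

      bool edge[NV][NV];                  /* edge[v][w]: w directly reachable from v */
      bool marked[NV];

      void traverse(int root)
      {
          int stack[NV], top = 0;         /* the worklist S, managed LIFO      */

          marked[root] = true;            /* Mark R                            */
          stack[top++] = root;            /* S = { R }                         */

          while (top > 0) {               /* Repeat ... Until S is empty       */
              int v = stack[--top];       /* Remove a vertex V from S          */
              for (int w = 0; w < NV; w++) {
                  if (edge[v][w] && !marked[w]) {  /* V' reachable and unmarked */
                      marked[w] = true;            /* mark V'                   */
                      stack[top++] = w;            /* put V' in S               */
                      /* <v,w> is an edge of the resulting spanning tree        */
                  }
              }
          }
      }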

    The set of marked edges produced by this algorithm is always a tree that spans every vertex in that part of the graph reachable from the root vertex R. That is, it is a spanning tree of the reachable part of the graph. Not all parts of all graphs are reachable from any vertex, so this is not always a spanning tree of the whole graph. Note that, depending on the traversal scheme used (that is, depending on the management of the set S) many different spanning trees may be produced by the same meta-algorithm (unless, of course, the graph is itself tree structured, in which case there is never more than one possible spanning tree, and that is the graph itself).

    NOTE: Other variants on the above meta-algorithm will show up repeatedly in the remainder of this course! The subject of spanning trees will also show up again and again!

    Given a directed graph rooted at vertex R, it is not hard to modify the depth-first recursive tree traversal algorithm to traverse the graph and search for cycles. The modified algorithm only marks those edges and vertices on the path from the root to the leaf currently being examined -- thus, as it follows a path through the tree, it marks edges and vertices, and as it backtracks in its traversal, it unmarks edges and vertices. If this modified algorithm ever encounters a marked vertex as it walks deeper into the graph (as opposed to encountering such a vertex on backtracking), it has found a cycle reachable from the root R.
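
    A C sketch of this modified traversal, under the same adjacency-matrix assumption as above: vertices are marked on the way down and unmarked while backtracking, and meeting a marked vertex on the way down signals a cycle.

      #include <stdbool.h>

      #define NV 8                        /* number of vertices (assumption)   */

      bool edge[NV][NV];                  /* edge[v][w]: edge from v to w      */
      bool on_path[NV];                   /* marks on the current root-to-leaf path */

      /* Returns true if some cycle is reachable from v. */
      bool cycle_from(int v)
      {
          on_path[v] = true;              /* mark on the way down              */
          for (int w = 0; w < NV; w++) {
              if (!edge[v][w])
                  continue;
              if (on_path[w])             /* walked into a marked vertex       */
                  return true;
              if (cycle_from(w))
                  return true;
          }
          on_path[v] = false;             /* unmark while backtracking         */
          return false;
      }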

    Deadlock Graphs

    It is worth noting that the details of the system being studied determine the nature of the graph algorithm needed for deadlock detection! In a resource model, where processes wait for resources, if no process may wait for more than one resource at a time, the number of outgoing edges from any vertex will never exceed one, so the deadlock graph has a trivial structure -- it consists of a number of chains or loops. This special case is called the single resource model.

       Running
        process
           _
          (_)                   Running
                                process
        Waiting                that owns 
        process    Resource     resource
           _           _           _
          (_)-------->|_|-------->(_)
     
                                Waiting                 Running
                                process                 process
        Waiting                that owns    Another     everyone
        process    Resource     resource    resource   waits for
           _           _           _           _           _
          (_)-------->|_|-------->(_)-------->|_|-------->(_)
     

    In this simple case, deadlock detection doesn't involve a general graph algorithm. The constraint that no process waits for more than one resource at a time translates to the constraint that no process vertex in the graph has more than one outgoing edge. Similarly, the constraint that no resource is owned by more than one process translates to the constraint that no resource vertex has more than one outgoing edge. As a result, we can detect deadlocks by following the single chain of outgoing edges from a blocked process, marking vertices as we go; if the chain has an open end, the process is not deadlocked; if we encounter a marked vertex, there is a deadlock. In any case, we always finish by unmarking all the vertices we marked.
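
    A C sketch of this chain-following check, assuming the single resource model (each process waits for at most one resource, each resource has at most one owner); the structures and the would_deadlock name are hypothetical.

      #include <stdbool.h>
      #include <stddef.h>

      struct resource;                    /* forward declaration               */

      struct process {
          struct resource *waits_for;     /* NULL if the process is running    */
          bool             marked;
      };

      struct resource {
          struct process *owner;          /* NULL if the resource is free      */
      };

      /* Would blocking p on r close a cycle?  Follow the single chain of
         outgoing edges, marking process vertices as we go.                    */
      bool would_deadlock(struct process *p, struct resource *r)
      {
          bool deadlock = false;
          struct resource *chain = r;

          p->marked = true;
          while (chain != NULL && chain->owner != NULL) {
              struct process *q = chain->owner;
              if (q->marked) {            /* reached a marked vertex: a cycle  */
                  deadlock = true;
                  break;
              }
              q->marked = true;
              chain = q->waits_for;       /* an open end means no deadlock     */
          }

          /* we always finish by unmarking everything we marked */
          p->marked = false;
          chain = r;
          while (chain != NULL && chain->owner != NULL && chain->owner->marked) {
              chain->owner->marked = false;
              chain = chain->owner->waits_for;
          }
          return deadlock;
      }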

    Note that the unconstrained resource model allows a process to wait for a number of resources, for example, using the operation block P until all resources in set S can be claimed. If the set S contains resources X and Y, we can replace this general wait operation by block P until X can be claimed followed by block P until Y can be claimed. This allows many more general deadlock problems to be translated into single-resource deadlock problems where deadlocks can be detected using this trivial chain-following deadlock detection algorithm.

    Unfortunately, in some systems of processes, the straightforward translation from the more general primitive to the single-resource primitive will sometimes convert a system that is deadlock free into a system that is prone to deadlock. The reason for this is that the block P until all resources in set S can be claimed primitive claims none of the resources in S until it can claim all of them, while the sequential claim one at a time approach will have some states in which a process holds a claim on some subset of the resources in S. If two processes are competing for the same set of resources and they claim them in different orders, there will be a deadlock under the single resource model that would not have been present under the more general model.

    The most common application of the single resource model in real systems is in database systems, where resources correspond to database records that must be claimed before some update can be performed. Some database systems use exactly this trivial chain following algorithm to detect deadlocks.

    Deadlock Resolution

    Once a deadlock is detected, an important question must be answered: what should be done about it? The classic solution for resource models is to break the cycle of the deadlock by aborting one of the processes and freeing its resources. This is only applicable to systems using a resource model, and of course, it is only applicable in systems where we can afford to abort processes!

    What process should be aborted? Ideally, the process to be aborted should be the one where that act will cause the least damage, that is, the least critical process. Few systems provide information sufficient to determine how critical a process is, however. The common alternative of aborting the process with the lowest priority approximates this, since high priority processes are frequently more critical than low priority processes, but this isn't necessarily so. Priority depends on real-time constraints! The audio entertainment process must meet tight real time constraints but it isn't very critical compared to monitoring the level of fuel in the fuel tank, something that is only done occasionally but is very important.

    An alternative is to delete the process that closed the cycle, the last process to block in the group of processes involved in the deadlock. Unfortunately, just because a process was the one that completed the cycle does not mean that it is the least critical or the most appropriate victim.

    If a process can be coded to anticipate the possibility of allocation failures, for example, by having it deallocate resources it has already claimed and begin again, we can have the resource claiming operation return a failure code if that operation would have closed the cycle in the resource graph, leading to deadlock. This approach is taken in several database managers.

    The classic solution of aborting a process is clearly not oriented towards communication deadlocks! Aborting a process that is waiting for a message does nothing to eliminate the deadlock, since the only way to resolve the deadlock may be for that same process to send a message. Aborting a process is also very undesirable in critical applications, where all processes are expected to run forever or until the power fails.

    Deadlock Avoidance

    If deadlock recovery is difficult, we are far better off if we can construct our applications in such a way that we can prove that they are deadlock free. This may be done statically, or it may be done, in some more complex settings, by dynamic algorithms.

    There is a trivial solution to the problem of deadlock avoidance in a pure resource allocation context: All you need to do is prohibit processes from incrementally allocating resources.

    This trivial solution has been widely used, particularly back in the days of batch operating systems taking input from a stream of punched cards. These systems typically required each process to declare all resources it might want as a field of the job card -- the head card on the card deck. This is clumsy, and it requires that the process hold all resources it might need for the duration of its run even if it only needs some of them briefly.

    Claiming Resources in a Strict Order Prevents Deadlock

    A less trivial solution to the deadlock avoidance problem is to require that all resources be acquired in a strictly predetermined order. For example, any process that requires both the modem and the laser printer must acquire the modem first and the printer second. As long as this rule is enforced, it will be possible to guarantee that the system is deadlock free with regard to these two resources because it will be impossible for one process to claim the modem and then request the printer while the other claims the printer and then requests the modem.

    It is always possible to impose a total ordering over all resources. Any ordering will suffice, so long as all processes adhere to it. For example, resources may be ordered in alphabetical order by resource class, and then in numerical order, within each resource class, by resource address. Many applications can be rewritten to comply with such an ordering constraint, but many others cannot.
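
    A minimal sketch of the ordering rule using POSIX mutexes: every process that needs both resources claims them in the same fixed order (here, simply by lock address), so the circular wait can never form. The helper names are illustrative, not a standard API.

      #include <pthread.h>

      pthread_mutex_t modem   = PTHREAD_MUTEX_INITIALIZER;
      pthread_mutex_t printer = PTHREAD_MUTEX_INITIALIZER;

      /* Claim two locks, always in the same globally agreed order. */
      void lock_both(pthread_mutex_t *a, pthread_mutex_t *b)
      {
          if (a > b) {                    /* sort by address: any fixed total  */
              pthread_mutex_t *t = a;     /* order will do, as long as every   */
              a = b;                      /* process uses the same one         */
              b = t;
          }
          pthread_mutex_lock(a);
          pthread_mutex_lock(b);
      }

      void unlock_both(pthread_mutex_t *a, pthread_mutex_t *b)
      {
          pthread_mutex_unlock(a);        /* release order does not matter     */
          pthread_mutex_unlock(b);
      }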

    Atomic transactions

    Introduction

    The usual assumption surrounding such assignments as

              A := 0; A := -1
     

    is that assignment is atomic, that is, it either happens or it doesn't. If the above two statements are executed in parallel, or if one statement is interrupted in order to complete the other, a programmer will typically expect only two possible outcomes. These are expressed by the following postcondition:

              (A = 0) or (A = -1)
     

    What if the variable A is 64 bits long on a machine where all loads and stores are in terms of 32 bit quantities? Or, on a small microcontroller, what if the variable A is 16 bits long, on a machine where all load and store operations operate on 8 bit quantities? Suddenly, there are a number of other possible outcomes! Each of the following outcomes is now possible on the 8/16 bit microcontroller:

              (A = 0) or (A = -1)
               or (A = 255)
               or (A = -256)
     

    If the low half of the word A is A0 and the high half is A1, the following sequence of assignments will give the value 255:

          Process 1    Process 2
           A0 := 0;
                        A0 := 255;
                        A1 := 255;
           A1 := 0;
     

    The usual solution to this is to use some kind of mutual exclusion mechanism, so that the assignments to A0 and A1 are done in a critical section, but this cannot prevent failures. Consider the following:

          Process 1    Process 2
           A0 := 0;
           A1 := 0;
                        A0 := 255;
                      < A1 := 255; >  -- fails
     

    Such a failure, affecting one process but not the other, is easily imagined on a multiprocessor, but even on a uniprocessor, if two different users are running code that shares a memory segment, and then one user hits control-C to kill a process, this kind of consequence is possible.

    The consequence of this partial failure is that the illegal value 255 still appears in the shared variable, so that, even if the mutual exclusion mechanism can recover from the failure of one process inside the critical section, and grant entry to another, the shared data is corrupted.

    This problem shows up whenever the value of a shared variable can only be updated one piece at a time; for example, when A is a logical record on disk that consists of multiple physical records. The problem may occur not only on parallel computers, but also on purely sequential machines when there is asynchronous context switching.

    Analogous problems can occur in the absence of context switching, if a sequence of assignments is interrupted by a failure. Consider:

              A0 := 0;
               A1 := 0;
                 ...      -- some time later
               A0 := 255;
             < A1 := 255; > -- fails
     

    Here, if the data being updated is on disk or in "persistent RAM" such as RAM with battery backup or old-fashioned core memory, the failure can leave an illegal value in memory, a value that was only partially updated.

    Pointer Assignment

    One way to assure that a shared variable is updated on an all or none basis is to perform the updates "off line". This requires a way to atomically move the updated copy into public view. Consider the following scheme:

    The publicly visible copy of a composite shared variable is pointed to by a pointer; a process wishing to update the composite variable updates a copy, and then atomically updates the pointer. We assume that updates to pointers are atomic -- that is, that pointers occupy exactly one hardware word, where a word is defined as the unit of atomic update in memory.

         P a pointer to the shared value
     
          Inspect(V):
             V := P^;   -- return the value pointed to by P
     
          Update(V):
             new( temp );
             temp^ := V;
             P := temp;  -- make P point to the desired value
     

    The above code uses a newly allocated cell from the heap for each update to the atomically updated variable. This generality is unnecessary. If N processes compete for access to the shared variable, it must be possible for each process to independently update its own private copy of the variable; thus, the atomically updatable pointer must have at least N possible values, corresponding to the private variables of each process.

    Therefore, if the machine allows atomic updates of n-bit quantities in memory, this scheme allows up to 2^n processes to compete.

    In fact, we do not need a completely general heap! If each process has two cells able to hold a single value each, and if each process uses these cells in turn, updating the least recently updated cell each time an update is required, and finishing the update with a pointer assignment, this scheme will work. Therefore, pointer assignment schemes can support atomic update on a 4-bit machine with as many as 8 processes, or on an 8-bit machine with as many as 128 processes.
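
    A sketch of the two-cells-per-writer idea using C11 atomics: readers always see a completely written copy because only the word-sized pointer changes atomically. Only one writing process is shown; in the full scheme each of the N writers would own its own pair of cells, and, as in the scheme described above, a reader is assumed to finish copying the value before the writer reuses that cell.

      #include <stdatomic.h>

      struct value { int a, b, c, d; };       /* a composite shared variable   */

      static struct value cells[2];           /* this writer's two private cells */
      static struct value *_Atomic shared = &cells[0];  /* the public pointer  */

      void inspect(struct value *out)
      {
          *out = *atomic_load(&shared);       /* copy whatever is public now   */
      }

      void update(const struct value *v)
      {
          /* write into the cell that is NOT currently public ...              */
          struct value *priv =
              (atomic_load(&shared) == &cells[0]) ? &cells[1] : &cells[0];
          *priv = *v;
          atomic_store(&shared, priv);        /* ... then flip the pointer     */
      }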

    Stable Storage

    How do you make an assignment to a multi-byte object look atomic without using pointer assignment and without using memory proportional to the number of processes? This is easy if there is no problem with failure, but it is harder when failures are possible!

    Leslie Lamport developed algorithms for updating a shared variable in the face of failures. This assumes that a mutual exclusion algorithm guarantees that only one process tries to update the variable at a time, and in this context, it guarantees that the result of the update will be correct, in the face of failure.

    The basic operations offered by Lamport's stable storage algorithms are:

        inspect( V ) returns value   (or V.inspect()        )
         update( V, value )           (   V.update( value )  )
     

    Lamport's stable storage rests on a new and redundant representation for the stored value and it rests on two procedures (or protocols) for inspecting and updating the stored value.

    Conceptually, it is reasonable to use a client-server world view and imagine the inspect and update procedures as being the services offered by the server to clients. If the server fails, we can easily start a new server, as long as the variable itself is stored in such a way that it survives the failure.

    A stable variable V is represented as a record where every field is duplicated:

                Copy1  -- the value stored
                 Time1  -- the time it was stored
                 Sum1   -- a checksum over the value and time
     
                 Copy2
                 Time2
                 Sum2
     
    There are two copies of the value, and for each copy, there is a record of the time at which the value was last updated and a record of the checksum computed as of the last update.
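
    In C, the representation might look like the sketch below, with a fixed-size value and a simple additive checksum standing in for whatever checksum a real system would use; the field and type names are assumptions for the example.

      #include <stdint.h>

      #define VALUE_WORDS 16

      struct copy {
          uint32_t value[VALUE_WORDS];    /* the value stored                  */
          uint32_t time;                  /* the time it was stored            */
          uint32_t sum;                   /* checksum over the value and time  */
      };

      struct stable_var {
          struct copy copy1;              /* ideally kept on a different device */
          struct copy copy2;              /* or memory module than copy1        */
      };

      uint32_t checksum(const uint32_t value[VALUE_WORDS], uint32_t time)
      {
          uint32_t s = time;              /* a trivial additive checksum       */
          for (int i = 0; i < VALUE_WORDS; i++)
              s += value[i];
          return s;
      }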

    The fault tolerance of Lamport's scheme improves if the two (or more) copies of the <value, time, checksum> tuple are stored in such a way that failures only destroy one copy at a time. Thus, they should be in different memory modules, or on different disk drives. If the system is physically distributed, they should be geographically separated and connected to different computer systems.

    The update and inspect operations must be performed inside critical sections, and if failure is to be survived, these critical sections must use some algorithm that can detect failure of a process inside a critical section and release any associated mutual exclusion semaphores.

            Procedure Update( V, value )
             begin
                update time
     
                V.Copy1 := value
                V.Time1 := time
                V.Sum1  := Checksum( value, time )
     
                -- wait for the assignment to really finish
     
                V.Copy2 := value
                V.Time2 := time
                V.Sum2  := Checksum( value, time )
     
                -- wait for the assignments to really finish
             end
     

    The utility of this code relies on keeping two (or more) copies of the value, where no failure will corrupt more than one copy at a time. Waiting for the writes to one copy to really finish before starting to write the other copy is very important in this regard.

    The assignment statements shown above will not, typically, execute instantaneously. On a cache machine, for example, there may be a delay before the values are written back to main memory. If disks are involved, there may be a considerable time between the issuing of a write request and the actual write to disk. If disk cache software is involved, it is even possible that the write to disk will never happen.

    This illustrates something that was pointed out earlier, in the context of the discussion of disk caches. In general, fancy caching algorithms can improve performance, but they can get in the way when fault tolerance is the goal! We must have a way to force cached values out into the real memory before we can rely on this stable storage algorithm. In systems derived from UNIX, the kernel primitive that does this is fsync().
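
    For data kept in a file, a minimal sketch of "write one copy and make sure it is really out there" might look like this, using only the standard POSIX calls lseek(), write() and fsync(); the helper name and layout are assumptions.

      #include <sys/types.h>
      #include <unistd.h>

      /* Write one copy at a given offset and force it out before returning. */
      int write_copy(int fd, const void *buf, size_t len, off_t where)
      {
          if (lseek(fd, where, SEEK_SET) < 0)
              return -1;
          if (write(fd, buf, len) != (ssize_t)len)
              return -1;
          return fsync(fd);               /* do not start on the second copy
                                             until this one is really on disk  */
      }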

            Procedure Inspect( V )
             begin
                -- optionally start by reading all fields from disk
     
                if V.Sum1 = checksum( V.Copy1, V.Time1 )
                   if V.Sum2 = checksum( V.Copy2, V.Time2 )
                      if V.Time1 > V.Time2
                         value = V.Copy1
                      else
                         value = V.Copy2
                      endif
                   else
                      value = V.Copy1
                   endif
                else
                   if V.Sum2 = checksum( V.Copy2, V.Time2 )
                      value = V.Copy2
                   else
                      -- failure --
                   endif
                endif
                return value
             end
     

    This code is fairly simple -- there are four cases to consider:

    1) There were no errors. In this case, the checksums on both copies will be valid, both copies will be the same, and they will have the same timestamp; either copy may be returned.

    2) There was a failure such that update managed to write one updated copy of the data, but it didn't get to start writing the other updated copy. In this case, the checksums on both copies will be valid, but their timestamps will differ. Return the copy with the most recent timestamp.

    3) There was a failure during one of the updates, or there was a failure that corrupted one of the stored values between updates. In this case, the checksum on the copy in question will be invalid. The other copy should still be OK and can be returned.

    If the failure occurs during the write of the second copy, it will be as if the write was successful because the first copy will have already been written and will be available for use. If the failure occurs during the write of the first copy, it will be as if the write never occurred.

    4) A failure or failures destroy all copies. There is nothing that can be done about this, but it takes multiple failures to cause this kind of damage, and the probability of this multiple failure can be made arbitrarily low by storing more copies in more widely distributed locations.

    Note that the stable storage update procedure may be made to be even more reliable if it does a write-read-verify cycle on each copy it writes out, that is, it writes each to memory and then reads it back and verifies correct storage before continuing.

    Also, note that no checksum algorithm can detect all possible errors in the stored data. Checksums can be made arbitrarily error resistant, but they cannot be made perfect. For any checksum algorithm, there is some combination of errors that will defeat it!

    Transactions

    Atomic assignment and inspection are not enough to guarantee that a shared variable is used correctly through possible fault sequences! For example, consider the problem of transferring money from one checking account to another:

          Jones pays Smith some Amount:
              Jones := Jones - Amount
              Smith := Smith + Amount
     

    To avoid accidentally creating or destroying money, either both updates must be made before an error occurs, or neither must be made! In the general case, there may be three machines here, one belonging to a bank where Smith has an account, one belonging to a bank where Jones has an account, and a third machine that mediates the transaction.

    The term transaction comes from the financial world, as is suggested by the above example. Transactions can be quite complex: For example, if I deposit a check in a bank, the transaction involves debiting the account of the check writer, crediting my account, and crediting the sum representing the current day's checks received.

    Each transaction typically involves three sets of variables:

      V       -- the set of variables
       I in V  -- the subset of V inspected
       U in V  -- the subset of V updated
     

    In the following, we will make the slightly stronger assumption:

      U in I  -- updated variables are all inspected
     

    Typically, we assume that every variable is associated with a lock, a binary mutual exclusion semaphore; this allows processes to claim exclusive use of that variable. If a variable is to be updated but does not need inspection, we lock it as if we needed to inspect it first.

    In one sense, atomic transactions are trivial to implement. The entire database needed for the transaction could be stored as a single variable using Lamport's stable storage. A single mutual exclusion semaphore suffices to guard access to this variable.

    This trivial solution is unacceptable! Most interesting databases cannot be claimed totally for each update. Instead, the process performing an update must claim only part of the structure, locking only those variables needed while allowing other processes to operate on other parts of the structure at the same time.

    Consider, for example, the Bank of America. This is a large bank with branches all over California. It would be impossible to operate such a bank if each transaction required giving the teller in charge of that transaction exclusive use of the bank's entire data base for the duration of the transaction.

    The sets I (inspected variables) and U (updated variables) are not known in advance. Typically, some subset of I must be inspected in order to determine the other variables in I and U. This leads to the following two-phase model of transactions.

         Phase 1
             Claim locks and inspect variables
             until all needed variables are locked.
     

    All variables must be locked prior to inspection. If multiple variables are locked, deadlock is possible. Deadlocks can be resolved by aborting and restarting transactions because they always occur in phase 1, before the transaction has modified any shared variables.

    Deadlocks can be avoided by claiming locks in a standard order, but this is not always practical, in the general case. For example, if each database record contains pointers to other records that may need to be locked and inspected, you must claim the record containing the pointer in order to follow the pointer, and any cycles in the pointer structure are potential sources of deadlock.

         Phase 2
             Update variables and release locks!
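
    A minimal sketch of the two-phase discipline for the Jones-pays-Smith transfer, using one POSIX mutex per account: phase 1 claims every lock that will be needed (in a fixed order, to sidestep the deadlock issue mentioned above), and phase 2 updates the variables and releases the locks. The account structure and names are assumptions for the example.

      #include <pthread.h>

      struct account {
          pthread_mutex_t lock;           /* the lock associated with the variable */
          long            balance;
      };

      void transfer(struct account *jones, struct account *smith, long amount)
      {
          /* Phase 1: claim the locks on all variables that will be touched.   */
          struct account *first  = (jones < smith) ? jones : smith;
          struct account *second = (jones < smith) ? smith : jones;
          pthread_mutex_lock(&first->lock);
          pthread_mutex_lock(&second->lock);

          /* Phase 2: update the variables and release the locks.              */
          jones->balance -= amount;
          smith->balance += amount;
          pthread_mutex_unlock(&second->lock);
          pthread_mutex_unlock(&first->lock);
      }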
     

    The 2 phase model can be improved by using more interesting locks. Instead of 2-state locks, where the two states are free and in-use, 3-state locks can be used, for example, with the following states:

                free
                 read-only
                 in-use (read-write)
     

    If a process only needs to read a variable, it may claim the variable by setting the lock to read-only. This is legal if the previous state of the lock was free or read-only. If a process may need to write a variable, it must claim the lock as in-use, a claim that can only be made if the lock was free. Thus, a variable may have multiple readers. Locks that support this are said to solve the readers-writers problem.
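
    POSIX readers-writer locks provide exactly this three-state behavior; a short sketch:

      #include <pthread.h>

      pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
      int shared_value;

      int read_value(void)
      {
          pthread_rwlock_rdlock(&lock);   /* read-only: legal if free or read-only */
          int v = shared_value;
          pthread_rwlock_unlock(&lock);
          return v;
      }

      void write_value(int v)
      {
          pthread_rwlock_wrlock(&lock);   /* in-use: legal only if the lock was free */
          shared_value = v;
          pthread_rwlock_unlock(&lock);
      }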

    The 2 phase model guarantees that things will work, but it is overly restrictive. There are specific problems where a greater degree of concurrent access can be allowed by violation of the two-phase model.

    For example, consider a shared lexicographic binary tree. The two-phase model requires that a user of any node in this tree must lock all nodes on the path from the root to the node in question before releasing any of the locks.

    In fact, the data structure will work just as well if the user only holds a few locks at a time. Specifically, the user first claims the lock on the root, and then claims the lock on the child of the root that is on the path to the target node in the tree. As soon as it is determined that the root is not the direct parent of the target, the root can be released and the process can be continued recursively. In the end, the process holds a lock on the target node and on the parent of the target node.
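
    This hand-over-hand locking can be sketched as follows, assuming one mutex per tree node; the node layout and the find_locked name are assumptions, and the caller is responsible for unlocking the nodes it is handed back.

      #include <pthread.h>
      #include <stddef.h>

      struct node {
          pthread_mutex_t lock;
          int             key;
          struct node    *left, *right;
      };

      /* Search for key, holding locks only on the current node and its parent. */
      struct node *find_locked(struct node *root, int key, struct node **parent)
      {
          *parent = NULL;
          if (root == NULL)
              return NULL;
          pthread_mutex_lock(&root->lock);

          struct node *cur = root;
          while (cur != NULL && cur->key != key) {
              struct node *next = (key < cur->key) ? cur->left : cur->right;
              if (next != NULL)
                  pthread_mutex_lock(&next->lock);       /* claim the child ...  */
              if (*parent != NULL)
                  pthread_mutex_unlock(&(*parent)->lock); /* ... then release the
                                                             grandparent's lock  */
              *parent = cur;
              cur = next;
          }
          return cur;    /* caller unlocks cur (if found) and *parent when done  */
      }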

    Note: The above definitions of the two-phase model of transactions do not guarantee atomicity. They only describe the mutual exclusion model that atomic transaction systems provide for the user. The implementation of atomicity in this model will be the subject of the next lecture!