Annotation of 43BSDReno/share/doc/ps2/04.implement/implement, revision 1.1.1.1

1.1       root        1: .\"    @(#)implement   6.2 (Berkeley) 6/17/86
                      2: .\"
                      3: .EH 'PS2:4-%''UNIX Implementation'
                      4: .OH 'UNIX Implementation''PS2:4-%'
                      5: .de P1
                      6: .DS
                      7: ..
                      8: .de P2
                      9: .DE
                     10: ..
                     11: .de UL
                     12: .lg 0
                     13: .if n .ul
                     14: \%\&\\$3\f3\\$1\fR\&\\$2
                     15: .lg
                     16: ..
                     17: .de UC
                     18: \&\\$3\s-1\\$1\\s0\&\\$2
                     19: ..
                     20: .de IT
                     21: .lg 0
                     22: .if n .ul
                     23: \%\&\\$3\f2\\$1\fR\&\\$2
                     24: .lg
                     25: ..
                     26: .de SP
                     27: .sp \\$1
                     28: ..
                     29: .hw device
                     30: .TL
                     31: UNIX Implementation
                     32: .AU "MH 2C-523" 2394
                     33: K. Thompson
                     34: .AI
                     35: .MH
                     36: .AB
                     37: This paper describes in high-level terms the
                     38: implementation of the resident
                     39: .UX
                     40: kernel.
                     41: This discussion is broken into three parts.
                     42: The first part describes
                     43: how the
                     44: .UX
                     45: system views processes, users, and programs.
                     46: The second part describes the I/O system.
                     47: The last part describes the
                     48: .UX
                     49: file system.
                     50: .AE
                     51: .NH
                     52: INTRODUCTION
                     53: .PP
                     54: The
                     55: .UX
                     56: kernel consists of about 10,000
                     57: lines of C code and about 1,000 lines of assembly code.
                     58: The assembly code can be further broken down into
                     59: 200 lines included for
                     60: the sake of efficiency
                     61: (they could have been written in C)
                     62: and 800 lines to perform hardware
                     63: functions not possible in C.
                     64: .PP
                     65: This code represents 5 to 10 percent of what has
                     66: been lumped into the broad expression
                     67: ``the
                     68: .UX
                     69: operating system.''
                     70: The kernel is the only
                     71: .UX
                     72: code that
                     73: cannot be substituted by a user to his
                     74: own liking.
                     75: For this reason,
                     76: the kernel should make as few real
                     77: decisions as possible.
                     78: This does not mean to allow the user
                     79: a million options to do the same thing.
                     80: Rather, it means to allow only one way to
                     81: do one thing,
                     82: but have that way be the least-common divisor
                     83: of all the options that might have been provided.
                     84: .PP
                     85: What is or is not implemented in the kernel
                     86: represents both a great responsibility and a great power.
                     87: It is a soap-box platform on
                     88: ``the way things should be done.''
                     89: Even so, if
                     90: ``the way'' is too radical,
                     91: no one will follow it.
                     92: Every important decision was weighed
                     93: carefully.
                     94: Throughout,
                     95: simplicity has been substituted for efficiency.
                     96: Complex algorithms are used only if
                     97: their complexity can be localized.
                     98: .NH
                     99: PROCESS CONTROL
                    100: .PP
                    101: In the
                    102: .UX
                    103: system,
                    104: a user executes programs in an
                    105: environment called a user process.
                    106: When a system function is required,
                    107: the user process calls the system
                    108: as a subroutine.
                    109: At some point in this call,
                    110: there is a distinct switch of environments.
                    111: After this,
                    112: the process is said to be a system process.
                    113: In the normal definition of processes,
                    114: the user and system processes are different
                    115: phases of the same process
                    116: (they never execute simultaneously).
                    117: For protection,
                    118: each system process has its own stack.
                    119: .PP
                    120: The user process may execute
                    121: from a read-only text segment,
                    122: which is shared by all processes
                    123: executing the same code.
                    124: There is no
                    125: .IT functional
                    126: benefit
                    127: from shared-text segments.
                    128: An
                    129: .IT efficiency
                    130: benefit comes from the fact
                    131: that there is no need to swap read-only
                    132: segments out because the original
                    133: copy on secondary memory is still current.
                    134: This is a great benefit to interactive
                    135: programs that tend to be swapped while
                    136: waiting for terminal input.
                    137: Furthermore,
                    138: if two processes are
                    139: executing
                    140: simultaneously
                    141: from the same copy of a read-only segment,
                    142: only one copy needs to reside in
                    143: primary memory.
                    144: This is a secondary effect,
                    145: because
                    146: simultaneous execution of a program
                    147: is not common.
                    148: It is ironic that this effect,
                    149: which reduces the use of primary memory,
                    150: only comes into play when there is
                    151: an overabundance of primary memory,
                    152: that is,
                    153: when there is enough memory
                    154: to keep waiting processes loaded.
                    155: .PP
                    156: All current read-only text segments in the
                    157: system are maintained from the
                    158: .IT "text table" .
                    159: A text table entry holds the location of the
                    160: text segment on secondary memory.
                    161: If the segment is loaded,
                    162: that table also holds the primary memory location
                    163: and the count of the number of processes
                    164: sharing this entry.
                    165: When this count is reduced to zero,
                    166: the entry is freed along with any
                    167: primary and secondary memory holding the segment.
                    168: When a process first executes a shared-text segment,
                    169: a text table entry is allocated and the
                    170: segment is loaded onto secondary memory.
                    171: If a second process executes a text segment
                    172: that is already allocated,
                    173: the entry reference count is simply incremented.
                    174: .PP
                    175: A user process has some strictly private
                    176: read-write data
                    177: contained in its
                    178: data segment.
                    179: As far as possible,
                    180: the system does not use the user's
                    181: data segment to hold system data.
                    182: In particular,
                    183: there are no I/O buffers in the
                    184: user address space.
                    185: .PP
                    186: The user data segment has two growing boundaries.
                    187: One, increased automatically by the system
                    188: as a result of memory faults,
                    189: is used for a stack.
                    190: The second boundary is only grown (or shrunk) by
                    191: explicit requests.
                    192: The contents of newly allocated primary memory
                    193: is initialized to zero.
                    194: .PP
                    195: Also associated and swapped with
                    196: a process is a small fixed-size
                    197: system data segment.
                    198: This segment contains all
                    199: the data about the process
                    200: that the system needs only when the
                    201: process is active.
                    202: Examples of the kind of data contained
                    203: in the system data segment are:
                    204: saved central processor registers,
                    205: open file descriptors,
                    206: accounting information,
                    207: scratch data area,
                    208: and the stack for the system phase
                    209: of the process.
                    210: The system data segment is not
                    211: addressable from the user process
                    212: and is therefore protected.
                    213: .PP
                    214: Last,
                    215: there is a process table with
                    216: one entry per process.
                    217: This entry contains all the data
                    218: needed by the system when the process
                    219: is
                    220: .IT not
                    221: active.
                    222: Examples are
                    223: the process's name,
                    224: the location of the other segments,
                    225: and scheduling information.
                    226: The process table entry is allocated
                    227: when the process is created, and freed
                    228: when the process terminates.
                    229: This process entry is always directly
                    230: addressable by the kernel.
                    231: .PP
                    232: Figure 1 shows the relationships
                    233: between the various process control
                    234: data.
                    235: In a sense,
                    236: the process table is the
                    237: definition of all processes,
                    238: because
                    239: all the data associated with a process
                    240: may be accessed
                    241: starting from the process table entry.
                    242: .KF
                    243: .in .375i
                    244: .so fig1.pic
                    245: .in -.375i
                    246: .sp 2v
                    247: .ce
                    248: Fig. 1\(emProcess control data structure.
                    249: .KE
                    250: .NH 2
                    251: Process creation and program execution
                    252: .PP
                    253: Processes are created by the system primitive
                    254: .UL fork .
                    255: The newly created process (child) is a copy of the original process (parent).
                    256: There is no detectable sharing of primary memory between the two processes.
                    257: (Of course,
                    258: if the parent process was executing from a read-only
                    259: text segment,
                    260: the child will share the text segment.)
                    261: Copies of all writable data segments
                    262: are made for the child process.
                    263: Files that were open before the
                    264: .UL fork
                    265: are
                    266: truly shared after the
                    267: .UL fork .
                    268: The processes are informed as to their part in the
                    269: relationship to
                    270: allow them to select their own
                    271: (usually non-identical)
                    272: destiny.
                    273: The parent may
                    274: .UL wait
                    275: for the termination of
                    276: any of its children.
                    277: .PP
                    278: A process may
                    279: .UL exec
                    280: a file.
                    281: This consists of exchanging the current text and data
                    282: segments of the process for new text and data
                    283: segments specified in the file.
                    284: The old segments are lost.
                    285: Doing an
                    286: .UL exec
                    287: does
                    288: .IT not
                    289: change processes;
                    290: the process that did the
                    291: .UL exec
                    292: persists,
                    293: but
                    294: after the
                    295: .UL exec
                    296: it is executing a different program.
                    297: Files that were open
                    298: before the
                    299: .UL exec
                    300: remain open after the
                    301: .UL exec .
                    302: .PP
                    303: If a program,
                    304: say the first pass of a compiler,
                    305: wishes to overlay itself with another program,
                    306: say the second pass,
                    307: then it simply
                    308: .UL exec s
                    309: the second program.
                    310: This is analogous
                    311: to a ``goto.''
                    312: If a program wishes to regain control
                    313: after
                    314: .UL exec ing
                    315: a second program,
                    316: it should
                    317: .UL fork
                    318: a child process,
                    319: have the child
                    320: .UL exec
                    321: the second program, and
                    322: have the parent
                    323: .UL wait
                    324: for the child.
                    325: This is analogous to a ``call.''
                    326: Breaking up the call into a binding followed by
                    327: a transfer is similar to the subroutine linkage in
                    328: SL-5.
                    329: .[
                    330: griswold hanson sl5 overview
                    331: .]
                    332: .NH 2
                    333: Swapping
                    334: .PP
                    335: The major data associated with a process
                    336: (the user data segment,
                    337: the system data segment, and
                    338: the text segment)
                    339: are swapped to and from secondary
                    340: memory, as needed.
                    341: The user data segment and the system data segment
                    342: are kept in contiguous primary memory to reduce
                    343: swapping latency.
                    344: (When low-latency devices, such as bubbles,
                    345: .UC CCD s,
                    346: or scatter/gather devices,
                    347: are used,
                    348: this decision will have to be reconsidered.)
                    349: Allocation of both primary
                    350: and secondary memory is performed
                    351: by the same simple first-fit algorithm.
                    352: When a process grows,
                    353: a new piece of primary memory is allocated.
                    354: The contents of the old memory is copied to the new memory.
                    355: The old memory is freed
                    356: and the tables are updated.
                    357: If there is not enough primary memory,
                    358: secondary memory is allocated instead.
                    359: The process is swapped out onto the
                    360: secondary memory,
                    361: ready to be swapped in with
                    362: its new size.
                    363: .PP
                    364: One separate process in the kernel,
                    365: the swapping process,
                    366: simply swaps the other
                    367: processes in and out of primary memory.
                    368: It examines the
                    369: process table looking for a process
                    370: that is swapped out and is
                    371: ready to run.
                    372: It allocates primary memory for that
                    373: process and
                    374: reads its segments into
                    375: primary memory, where that process competes for the
                    376: central processor with other loaded processes.
                    377: If no primary memory is available,
                    378: the swapping process makes memory available
                    379: by examining the process table for processes
                    380: that can be swapped out.
                    381: It selects a process to swap out,
                    382: writes it to secondary memory,
                    383: frees the primary memory,
                    384: and then goes back to look for a process
                    385: to swap in.
                    386: .PP
                    387: Thus there are two specific algorithms
                    388: to the swapping process.
                    389: Which of the possibly many processes that
                    390: are swapped out is to be swapped in?
                    391: This is decided by secondary storage residence
                    392: time.
                    393: The one with the longest time out is swapped in first.
                    394: There is a slight penalty for larger processes.
                    395: Which of the possibly many processes that
                    396: are loaded is to be swapped out?
                    397: Processes that are waiting for slow events
                    398: (i.e., not currently running or waiting for
                    399: disk I/O)
                    400: are picked first,
                    401: by age in primary memory,
                    402: again with size penalties.
                    403: The other processes are examined
                    404: by the same age algorithm,
                    405: but are not taken out unless they are
                    406: at least of some age.
                    407: This adds
                    408: hysteresis to the swapping and
                    409: prevents total thrashing.
                    410: .PP
                    411: These swapping algorithms are the
                    412: most suspect in the system.
                    413: With limited primary memory,
                    414: these algorithms cause total swapping.
                    415: This is not bad in itself, because
                    416: the swapping does not impact the
                    417: execution of the resident processes.
                    418: However, if the swapping device must
                    419: also be used for file storage,
                    420: the swapping traffic severely
                    421: impacts the file system traffic.
                    422: It is exactly these small systems
                    423: that tend to double usage of limited disk
                    424: resources.
                    425: .NH 2
                    426: Synchronization and scheduling
                    427: .PP
                    428: Process synchronization is accomplished by having processes
                    429: wait for events.
                    430: Events are represented by arbitrary integers.
                    431: By convention,
                    432: events are chosen to be addresses of
                    433: tables associated with those events.
                    434: For example, a process that is waiting for
                    435: any of its children to terminate will wait
                    436: for an event that is the address of
                    437: its own process table entry.
                    438: When a process terminates,
                    439: it signals the event represented by
                    440: its parent's process table entry.
                    441: Signaling an event on which no process
                    442: is waiting has no effect.
                    443: Similarly,
                    444: signaling an event on which many processes
                    445: are waiting will wake all of them up.
                    446: This differs considerably from
                    447: Dijkstra's P and V
                    448: synchronization operations,
                    449: .[
                    450: dijkstra sequential processes 1968
                    451: .]
                    452: in that
                    453: no memory is associated with events.
                    454: Thus there need be no allocation of events
                    455: prior to their use.
                    456: Events exist simply by being used.
                    457: .PP
                    458: On the negative side,
                    459: because there is no memory associated with events,
                    460: no notion of ``how much''
                    461: can be signaled via the event mechanism.
                    462: For example,
                    463: processes that want memory might
                    464: wait on an event associated with
                    465: memory allocation.
                    466: When any amount of memory becomes available,
                    467: the event would be signaled.
                    468: All the competing processes would then wake
                    469: up to fight over the new memory.
                    470: (In reality,
                    471: the swapping process is the only process
                    472: that waits for primary memory to become available.)
                    473: .PP
                    474: If an event occurs
                    475: between the time a process decides
                    476: to wait for that event and the
                    477: time that process enters the wait state,
                    478: then
                    479: the process will wait on an event that has
                    480: already happened (and may never happen again).
                    481: This race condition happens because there is no memory associated with
                    482: the event to indicate that the event has occurred;
                    483: the only action of an event is to change a set of processes
                    484: from wait state to run state.
                    485: This problem is relieved largely
                    486: by the fact that process switching can
                    487: only occur in the kernel by explicit calls
                    488: to the event-wait mechanism.
                    489: If the event in question is signaled by another
                    490: process,
                    491: then there is no problem.
                    492: But if the event is signaled by a hardware
                    493: interrupt,
                    494: then special care must be taken.
                    495: These synchronization races pose the biggest
                    496: problem when
                    497: .UX
                    498: is adapted to multiple-processor configurations.
                    499: .[
                    500: hawley meyer multiprocessing unix
                    501: .]
                    502: .PP
                    503: The event-wait code in the kernel
                    504: is like a co-routine linkage.
                    505: At any time,
                    506: all but one of the processes has called event-wait.
                    507: The remaining process is the one currently executing.
                    508: When it calls event-wait,
                    509: a process whose event has been signaled
                    510: is selected and that process
                    511: returns from its call to event-wait.
                    512: .PP
                    513: Which of the runable processes is to run next?
                    514: Associated with each process is a priority.
                    515: The priority of a system process is assigned by the code
                    516: issuing the wait on an event.
                    517: This is roughly equivalent to the response
                    518: that one would expect on such an event.
                    519: Disk events have high priority,
                    520: teletype events are low,
                    521: and time-of-day events are very low.
                    522: (From observation,
                    523: the difference in system process priorities
                    524: has little or no performance impact.)
                    525: All user-process priorities are lower than the
                    526: lowest system priority.
                    527: User-process priorities are assigned
                    528: by an algorithm based on the
                    529: recent ratio of the amount of compute time to real time consumed
                    530: by the process.
                    531: A process that has used a lot of
                    532: compute time in the last real-time
                    533: unit is assigned a low user priority.
                    534: Because interactive processes are characterized
                    535: by low ratios of compute to real time,
                    536: interactive response is maintained without any
                    537: special arrangements.
                    538: .PP
                    539: The scheduling algorithm simply picks
                    540: the process with the highest priority,
                    541: thus
                    542: picking all system processes first and
                    543: user processes second.
                    544: The compute-to-real-time ratio is updated
                    545: every second.
                    546: Thus,
                    547: all other things being equal,
                    548: looping user processes will be
                    549: scheduled round-robin with a
                    550: 1-second quantum.
                    551: A high-priority process waking up will
                    552: preempt a running, low-priority process.
                    553: The scheduling algorithm has a very desirable
                    554: negative feedback character.
                    555: If a process uses its high priority
                    556: to hog the computer,
                    557: its priority will drop.
                    558: At the same time, if a low-priority
                    559: process is ignored for a long time,
                    560: its priority will rise.
                    561: .NH
                    562: I/O SYSTEM
                    563: .PP
                    564: The I/O system
                    565: is broken into two completely separate systems:
                    566: the block I/O system and the character I/O system.
                    567: In retrospect,
                    568: the names should have been ``structured I/O''
                    569: and ``unstructured I/O,'' respectively;
                    570: while the term ``block I/O'' has some meaning,
                    571: ``character I/O'' is a complete misnomer.
                    572: .PP
                    573: Devices are characterized by a major device number,
                    574: a minor device number, and
                    575: a class (block or character).
                    576: For each class,
                    577: there is an array of entry points into the device drivers.
                    578: The major device number is used to index the array
                    579: when calling the code for a particular device driver.
                    580: The minor device number is passed to the
                    581: device driver as an argument.
                    582: The minor number has no significance other
                    583: than that attributed to it by the driver.
                    584: Usually,
                    585: the driver uses the minor number to access
                    586: one of several identical physical devices.
                    587: .PP
                    588: The use of the array of entry points
                    589: (configuration table)
                    590: as the only connection between the
                    591: system code and the device drivers is
                    592: very important.
                    593: Early versions of the system had a much
                    594: less formal connection with the drivers,
                    595: so that it was extremely hard to handcraft
                    596: differently configured systems.
                    597: Now it is possible to create new
                    598: device drivers in an average of a few hours.
                    599: The configuration table in most cases
                    600: is created automatically by a program
                    601: that reads the system's parts list.
                    602: .NH 2
                    603: Block I/O system
                    604: .PP
                    605: The model block I/O device consists
                    606: of randomly addressed, secondary
                    607: memory blocks of 512 bytes each.
                    608: The blocks are uniformly addressed
                    609: 0, 1, .\|.\|. up to the size of the device.
                    610: The block device driver has the job of
                    611: emulating this model on a
                    612: physical device.
                    613: .PP
                    614: The block I/O devices are accessed
                    615: through a layer of buffering software.
                    616: The system maintains a list of buffers
                    617: (typically between 10 and 70)
                    618: each assigned a device name and
                    619: a device address.
                    620: This buffer pool constitutes a data cache
                    621: for the block devices.
                    622: On a read request,
                    623: the cache is searched for the desired block.
                    624: If the block is found,
                    625: the data are made available to the
                    626: requester without any physical I/O.
                    627: If the block is not in the cache,
                    628: the least recently used block in the cache is renamed,
                    629: the correct device driver is called to
                    630: fill up the renamed buffer, and then the
                    631: data are made available.
                    632: Write requests are handled in an analogous manner.
                    633: The correct buffer is found
                    634: and relabeled if necessary.
                    635: The write is performed simply by marking
                    636: the buffer as ``dirty.''
                    637: The physical I/O is then deferred until
                    638: the buffer is renamed.
                    639: .PP
                    640: The benefits in reduction of physical I/O
                    641: of this scheme are substantial,
                    642: especially considering the file system implementation.
                    643: There are,
                    644: however,
                    645: some drawbacks.
                    646: The asynchronous nature of the
                    647: algorithm makes error reporting
                    648: and meaningful user error handling
                    649: almost impossible.
                    650: The cavalier approach to I/O error
                    651: handling in the
                    652: .UX
                    653: system is partly due to the asynchronous
                    654: nature of the block I/O system.
                    655: A second problem is in the delayed writes.
                    656: If the system stops unexpectedly,
                    657: it is almost certain that there is a
                    658: lot of logically complete,
                    659: but physically incomplete,
                    660: I/O in the buffers.
                    661: There is a system primitive to
                    662: flush all outstanding I/O activity
                    663: from the buffers.
                    664: Periodic use of this primitive helps,
                    665: but does not solve, the problem.
                    666: Finally,
                    667: the associativity in the buffers
                    668: can alter the physical I/O sequence
                    669: from that of the logical I/O sequence.
                    670: This means that there are times
                    671: when data structures on disk are inconsistent,
                    672: even though the software is careful
                    673: to perform I/O in the correct order.
                    674: On non-random devices,
                    675: notably magnetic tape,
                    676: the inversions of writes can be disastrous.
                    677: The problem with magnetic tapes is ``cured'' by
                    678: allowing only one outstanding write request
                    679: per drive.
                    680: .NH 2
                    681: Character I/O system
                    682: .PP
                    683: The character I/O system consists of all
                    684: devices that do not fall into the block I/O model.
                    685: This includes the ``classical'' character devices
                    686: such as communications lines, paper tape, and
                    687: line printers.
                    688: It also includes magnetic tape and disks when
                    689: they are not used in a stereotyped way,
                    690: for example, 80-byte physical records on tape
                    691: and track-at-a-time disk copies.
                    692: In short,
                    693: the character I/O interface
                    694: means ``everything other than block.''
                    695: I/O requests from the user are sent to the
                    696: device driver essentially unaltered.
                    697: The implementation of these requests is, of course,
                    698: up to the device driver.
                    699: There are guidelines and conventions
                    700: to help the implementation of
                    701: certain types of device drivers.
                    702: .NH 3
                    703: Disk drivers
                    704: .PP
                    705: Disk drivers are implemented
                    706: with a queue of transaction records.
                    707: Each record holds a read/write flag,
                    708: a primary memory address,
                    709: a secondary memory address, and
                    710: a transfer byte count.
                    711: Swapping is accomplished by passing
                    712: such a record to the swapping device driver.
                    713: The block I/O interface is implemented by
                    714: passing such records with requests to
                    715: fill and empty system buffers.
                    716: The character I/O interface to the disk
                    717: drivers create a transaction record that
                    718: points directly into the user area.
                    719: The routine that creates this record also insures
                    720: that the user is not swapped during this
                    721: I/O transaction.
                    722: Thus by implementing the general disk driver,
                    723: it is possible to use the disk
                    724: as a block device,
                    725: a character device, and a swap device.
                    726: The only really disk-specific code in normal
                    727: disk drivers is the pre-sort of transactions to
                    728: minimize latency for a particular device, and
                    729: the actual issuing of the I/O request.
                    730: .NH 3
                    731: Character lists
                    732: .PP
                    733: Real character-oriented devices may
                    734: be implemented using the common
                    735: code to handle character lists.
                    736: A character list is a queue of characters.
                    737: One routine puts a character on a queue.
                    738: Another gets a character from a queue.
                    739: It is also possible to ask how many
                    740: characters are currently on a queue.
                    741: Storage for all queues in the system comes
                    742: from a single common pool.
                    743: Putting a character on a queue will allocate
                    744: space from the common pool and link the
                    745: character onto the data structure defining the queue.
                    746: Getting a character from a queue returns
                    747: the corresponding space to the pool.
                    748: .PP
                    749: A typical character-output device
                    750: (paper tape punch, for example)
                    751: is implemented by passing characters
                    752: from the user onto a character queue until
                    753: some maximum number of characters is on the queue.
                    754: The I/O is prodded to start as
                    755: soon as there is anything on the queue
                    756: and, once started,
                    757: it is sustained by hardware completion interrupts.
                    758: Each time there is a completion interrupt,
                    759: the driver gets the next character from the queue
                    760: and sends it to the hardware.
                    761: The number of characters on the queue is checked and,
                    762: as the count falls through some intermediate level,
                    763: an event (the queue address) is signaled.
                    764: The process that is passing characters from
                    765: the user to the queue can be waiting on the event, and
                    766: refill the queue to its maximum
                    767: when the event occurs.
                    768: .PP
                    769: A typical character input device
                    770: (for example, a paper tape reader)
                    771: is handled in a very similar manner.
                    772: .PP
                    773: Another class of character devices is the terminals.
                    774: A terminal is represented by three
                    775: character queues.
                    776: There are two input queues (raw and canonical)
                    777: and an output queue.
                    778: Characters going to the output of a terminal
                    779: are handled by common code exactly as described
                    780: above.
                    781: The main difference is that there is also code
                    782: to interpret the output stream as
                    783: .UC  ASCII
                    784: characters and to perform some translations,
                    785: e.g., escapes for deficient terminals.
                    786: Another common aspect of terminals is code
                    787: to insert real-time delay after certain control characters.
                    788: .PP
                    789: Input on terminals is a little different.
                    790: Characters are collected from the terminal and
                    791: placed on a raw input queue.
                    792: Some device-dependent code conversion and
                    793: escape interpretation is handled here.
                    794: When a line is complete in the raw queue,
                    795: an event is signaled.
                    796: The code catching this signal then copies a
                    797: line from the raw queue to a canonical queue
                    798: performing the character erase and line kill editing.
                    799: User read requests on terminals can be
                    800: directed at either the raw or canonical queues.
                    801: .NH 3
                    802: Other character devices
                    803: .PP
                    804: Finally,
                    805: there are devices that fit no general category.
                    806: These devices are set up as character I/O drivers.
                    807: An example is a driver that reads and writes
                    808: unmapped primary memory as an I/O device.
                    809: Some devices are too
                    810: fast to be treated a character at time,
                    811: but do not fit the disk I/O mold.
                    812: Examples are fast communications lines and
                    813: fast line printers.
                    814: These devices either have their own buffers
                    815: or ``borrow'' block I/O buffers for a while and
                    816: then give them back.
                    817: .NH
                    818: THE FILE SYSTEM
                    819: .PP
                    820: In the
                    821: .UX
                    822: system,
                    823: a file is a (one-dimensional) array of bytes.
                    824: No other structure of files is implied by the
                    825: system.
                    826: Files are attached anywhere
                    827: (and possibly multiply)
                    828: onto a hierarchy of directories.
                    829: Directories are simply files that
                    830: users cannot write.
                    831: For a further discussion
                    832: of the external view of files and directories,
                    833: see Ref.\0
                    834: .[
                    835: ritchie thompson unix bstj 1978
                    836: %Q This issue
                    837: .].
                    838: .PP
                    839: The
                    840: .UX
                    841: file system is a disk data structure
                    842: accessed completely through
                    843: the block I/O system.
                    844: As stated before,
                    845: the canonical view of a ``disk'' is
                    846: a randomly addressable array of
                    847: 512-byte blocks.
                    848: A file system breaks the disk into
                    849: four self-identifying regions.
                    850: The first block (address 0)
                    851: is unused by the file system.
                    852: It is left aside for booting procedures.
                    853: The second block (address 1)
                    854: contains the so-called ``super-block.''
                    855: This block,
                    856: among other things,
                    857: contains the size of the disk and
                    858: the boundaries of the other regions.
                    859: Next comes the i-list,
                    860: a list of file definitions.
                    861: Each file definition is
                    862: a 64-byte structure, called an i-node.
                    863: The offset of a particular i-node
                    864: within the i-list is called its i-number.
                    865: The combination of device name
                    866: (major and minor numbers) and i-number
                    867: serves to uniquely name a particular file.
                    868: After the i-list,
                    869: and to the end of the disk,
                    870: come free storage blocks that
                    871: are available for the contents of files.
                    872: .PP
                    873: The free space on a disk is maintained
                    874: by a linked list of available disk blocks.
                    875: Every block in this chain contains a disk address
                    876: of the next block in the chain.
                    877: The remaining space contains the address of up to
                    878: 50 disk blocks that are also free.
                    879: Thus with one I/O operation,
                    880: the system obtains 50 free blocks and a
                    881: pointer where to find more.
                    882: The disk allocation algorithms are
                    883: very straightforward.
                    884: Since all allocation is in fixed-size
                    885: blocks and there is strict accounting of
                    886: space,
                    887: there is no need to compact or garbage collect.
                    888: However,
                    889: as disk space becomes dispersed,
                    890: latency gradually increases.
                    891: Some installations choose to occasionally compact
                    892: disk space to reduce latency.
                    893: .PP
                    894: An i-node contains 13 disk addresses.
                    895: The first 10 of these addresses point directly at
                    896: the first 10 blocks of a file.
                    897: If a file is larger than 10 blocks (5,120 bytes),
                    898: then the eleventh address points at a block
                    899: that contains the addresses of the next 128 blocks of the file.
                    900: If the file is still larger than this
                    901: (70,656 bytes),
                    902: then the twelfth block points at up to 128 blocks,
                    903: each pointing to 128 blocks of the file.
                    904: Files yet larger
                    905: (8,459,264 bytes)
                    906: use the thirteenth address for a ``triple indirect'' address.
                    907: The algorithm ends here with the maximum file size
                    908: of 1,082,201,087 bytes.
                    909: .PP
                    910: A logical directory hierarchy is added
                    911: to this flat physical structure simply
                    912: by adding a new type of file, the directory.
                    913: A directory is accessed exactly as an ordinary file.
                    914: It contains 16-byte entries consisting of
                    915: a 14-byte name and an i-number.
                    916: The root of the hierarchy is at a known i-number
                    917: (\fIviz.,\fR 2).
                    918: The file system structure allows an arbitrary, directed graph
                    919: of directories with regular files linked in
                    920: at arbitrary places in this graph.
                    921: In fact,
                    922: very early
                    923: .UX
                    924: systems used such a structure.
                    925: Administration of such a structure became so
                    926: chaotic that later systems were restricted
                    927: to a directory tree.
                    928: Even now,
                    929: with regular files linked multiply
                    930: into arbitrary places in the tree,
                    931: accounting for space has become a problem.
                    932: It may become necessary to restrict the entire
                    933: structure to a tree,
                    934: and allow a new form of linking that
                    935: is subservient to the tree structure.
                    936: .PP
                    937: The file system allows
                    938: easy creation,
                    939: easy removal,
                    940: easy random accessing,
                    941: and very easy space allocation.
                    942: With most physical addresses confined
                    943: to a small contiguous section of disk,
                    944: it is also easy to dump, restore, and
                    945: check the consistency of the file system.
                    946: Large files suffer from indirect addressing,
                    947: but the cache prevents most of the implied physical I/O
                    948: without adding much execution.
                    949: The space overhead properties of this scheme are quite good.
                    950: For example,
                    951: on one particular file system,
                    952: there are 25,000 files containing 130M bytes of data-file content.
                    953: The overhead (i-node, indirect blocks, and last block breakage)
                    954: is about 11.5M bytes.
                    955: The directory structure to support these files
                    956: has about 1,500 directories containing 0.6M bytes of directory content
                    957: and about 0.5M bytes of overhead in accessing the directories.
                    958: Added up any way,
                    959: this comes out to less than a 10 percent overhead for actual
                    960: stored data.
                    961: Most systems have this much overhead in
                    962: padded trailing blanks alone.
                    963: .NH 2
                    964: File system implementation
                    965: .PP
                    966: Because the i-node defines a file,
                    967: the implementation of the file system centers
                    968: around access to the i-node.
                    969: The system maintains a table of all active
                    970: i-nodes.
                    971: As a new file is accessed,
                    972: the system locates the corresponding i-node,
                    973: allocates an i-node table entry, and reads
                    974: the i-node into primary memory.
                    975: As in the buffer cache,
                    976: the table entry is considered to be the current
                    977: version of the i-node.
                    978: Modifications to the i-node are made to
                    979: the table entry.
                    980: When the last access to the i-node goes
                    981: away,
                    982: the table entry is copied back to the
                    983: secondary store i-list and the table entry is freed.
                    984: .KF
                    985: .in .25i
                    986: .so fig2.pic
                    987: .in -.25i
                    988: .sp 2v
                    989: .ce
                    990: Fig. 2\(emFile system data structure.
                    991: .sp
                    992: .KE
                    993: .PP
                    994: All I/O operations on files are carried out
                    995: with the aid of the corresponding i-node table entry.
                    996: The accessing of a file is a straightforward
                    997: implementation of the algorithms mentioned previously.
                    998: The user is not aware of i-nodes and i-numbers.
                    999: References to the file system are made in terms of
                   1000: path names of the directory tree.
                   1001: Converting a path name into an i-node table entry
                   1002: is also straightforward.
                   1003: Starting at some known i-node
                   1004: (the root or the current directory of some process),
                   1005: the next component of the path name is
                   1006: searched by reading the directory.
                   1007: This gives an i-number and an implied device
                   1008: (that of the directory).
                   1009: Thus the next i-node table entry can be accessed.
                   1010: If that was the last component of the path name,
                   1011: then this i-node is the result.
                   1012: If not,
                   1013: this i-node is the directory needed to look up
                   1014: the next component of the path name, and the
                   1015: algorithm is repeated.
                   1016: .PP
                   1017: The user process accesses the file system with
                   1018: certain primitives.
                   1019: The most common of these are
                   1020: .UL open ,
                   1021: .UL create ,
                   1022: .UL read ,
                   1023: .UL write ,
                   1024: .UL seek ,
                   1025: and
                   1026: .UL close .
                   1027: The data structures maintained are shown in Fig. 2.
                   1028: In the system data segment associated with a user,
                   1029: there is room for some (usually between 10 and 50) open files.
                   1030: This open file table consists of pointers that can be used to access
                   1031: corresponding i-node table entries.
                   1032: Associated with each of these open files is
                   1033: a current I/O pointer.
                   1034: This is a byte offset of
                   1035: the next read/write operation on the file.
                   1036: The system treats each read/write request
                   1037: as random with an implied seek to the
                   1038: I/O pointer.
                   1039: The user usually thinks of the file as
                   1040: sequential with the I/O pointer
                   1041: automatically counting the number of bytes
                   1042: that have been read/written from the file.
                   1043: The user may,
                   1044: of course,
                   1045: perform random I/O by setting the I/O pointer
                   1046: before reads/writes.
                   1047: .PP
                   1048: With file sharing,
                   1049: it is necessary to allow related
                   1050: processes to share a common I/O pointer
                   1051: and yet have separate I/O pointers
                   1052: for independent processes
                   1053: that access the same file.
                   1054: With these two conditions,
                   1055: the I/O pointer cannot reside
                   1056: in the i-node table nor can
                   1057: it reside in the list of
                   1058: open files for the process.
                   1059: A new table
                   1060: (the open file table)
                   1061: was invented for the sole purpose
                   1062: of holding the I/O pointer.
                   1063: Processes that share the same open
                   1064: file
                   1065: (the result of
                   1066: .UL fork s)
                   1067: share a common open file table entry.
                   1068: A separate open of the same file will
                   1069: only share the i-node table entry,
                   1070: but will have distinct open file table entries.
                   1071: .PP
                   1072: The main file system primitives are implemented as follows.
                   1073: .UL \&open
                   1074: converts a file system path name into an i-node
                   1075: table entry.
                   1076: A pointer to the i-node table entry is placed in a
                   1077: newly created open file table entry.
                   1078: A pointer to the file table entry is placed in the
                   1079: system data segment for the process.
                   1080: .UL \&create
                   1081: first creates a new i-node entry,
                   1082: writes the i-number into a directory, and
                   1083: then builds the same structure as for an
                   1084: .UL open .
                   1085: .UL \&read
                   1086: and
                   1087: .UL write
                   1088: just access the i-node entry as described above.
                   1089: .UL \&seek
                   1090: simply manipulates the I/O pointer.
                   1091: No physical seeking is done.
                   1092: .UL \&close
                   1093: just frees the structures built by
                   1094: .UL open
                   1095: and
                   1096: .UL create .
                   1097: Reference counts are kept on the open file table entries and
                   1098: the i-node table entries to free these structures after
                   1099: the last reference goes away.
                   1100: .UL \&unlink
                   1101: simply decrements the count of the
                   1102: number of directories pointing at the given i-node.
                   1103: When the last reference to an i-node table entry
                   1104: goes away,
                   1105: if the i-node has no directories pointing to it,
                   1106: then the file is removed and the i-node is freed.
                   1107: This delayed removal of files prevents
                   1108: problems arising from removing active files.
                   1109: A file may be removed while still open.
                   1110: The resulting unnamed file vanishes
                   1111: when the file is closed.
                   1112: This is a method of obtaining temporary files.
                   1113: .PP
                   1114: There is a type of unnamed
                   1115: .UC  FIFO
                   1116: file called a
                   1117: .UL pipe.
                   1118: Implementation of
                   1119: .UL pipe s
                   1120: consists of implied
                   1121: .UL seek s
                   1122: before each
                   1123: .UL read
                   1124: or
                   1125: .UL write
                   1126: in order to implement
                   1127: first-in-first-out.
                   1128: There are also checks and synchronization
                   1129: to prevent the
                   1130: writer from grossly outproducing the
                   1131: reader and to prevent the reader from
                   1132: overtaking the writer.
                   1133: .NH 2
                   1134: Mounted file systems
                   1135: .PP
                   1136: The file system of a
                   1137: .UX
                   1138: system
                   1139: starts with some designated block device
                   1140: formatted as described above to contain
                   1141: a hierarchy.
                   1142: The root of this structure is the root of
                   1143: the
                   1144: .UX
                   1145: file system.
                   1146: A second formatted block device may be
                   1147: mounted
                   1148: at any leaf of
                   1149: the current hierarchy.
                   1150: This logically extends the current hierarchy.
                   1151: The implementation of
                   1152: mounting
                   1153: is trivial.
                   1154: A mount table is maintained containing
                   1155: pairs of designated leaf i-nodes and
                   1156: block devices.
                   1157: When converting a path name into an i-node,
                   1158: a check is made to see if the new i-node is a
                   1159: designated leaf.
                   1160: If it is,
                   1161: the i-node of the root
                   1162: of the block device replaces it.
                   1163: .PP
                   1164: Allocation of space for a file is taken
                   1165: from the free pool on the device on which the
                   1166: file lives.
                   1167: Thus a file system consisting of many
                   1168: mounted devices does not have a common pool of
                   1169: free secondary storage space.
                   1170: This separation of space on different
                   1171: devices is necessary to allow easy
                   1172: unmounting
                   1173: of a device.
                   1174: .NH 2
                   1175: Other system functions
                   1176: .PP
                   1177: There are some other things that the system
                   1178: does for the user\-a
                   1179: little accounting,
                   1180: a little tracing/debugging,
                   1181: and a little access protection.
                   1182: Most of these things are not very
                   1183: well developed
                   1184: because our use of the system in computing science research
                   1185: does not need them.
                   1186: There are some features that are missed in some
                   1187: applications, for example, better inter-process communication.
                   1188: .PP
                   1189: The
                   1190: .UX
                   1191: kernel is an I/O multiplexer more than
                   1192: a complete operating system.
                   1193: This is as it should be.
                   1194: Because of this outlook,
                   1195: many features are
                   1196: found in most
                   1197: other operating systems that are missing from the
                   1198: .UX
                   1199: kernel.
                   1200: For example,
                   1201: the
                   1202: .UX
                   1203: kernel does not support
                   1204: file access methods,
                   1205: file disposition,
                   1206: file formats,
                   1207: file maximum size,
                   1208: spooling,
                   1209: command language,
                   1210: logical records,
                   1211: physical records,
                   1212: assignment of logical file names,
                   1213: logical file names,
                   1214: more than one character set,
                   1215: an operator's console,
                   1216: an operator,
                   1217: log-in,
                   1218: or log-out.
                   1219: Many of these things are symptoms rather than features.
                   1220: Many of these things are implemented
                   1221: in user software
                   1222: using the kernel as a tool.
                   1223: A good example of this is the command language.
                   1224: .[
                   1225: bourne shell 1978 bstj
                   1226: %Q This issue
                   1227: .]
                   1228: Each user may have his own command language.
                   1229: Maintenance of such code is as easy as
                   1230: maintaining user code.
                   1231: The idea of implementing ``system'' code with general
                   1232: user primitives
                   1233: comes directly from
                   1234: .UC  MULTICS .
                   1235: .[
                   1236: organick multics 1972
                   1237: .]
                   1238: .LP
                   1239: .[
                   1240: $LIST$
                   1241: .]

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.