43BSDReno/share/doc/ps2/04.implement/implement - annotate

Return to implement CVS log
Up to [CSRG BSD Unix] / 43BSDReno / share / doc / ps2 / 04.implement
Annotation of 43BSDReno/share/doc/ps2/04.implement/implement, revision 1.1

1.1     ! root        1: .\"    @(#)implement   6.2 (Berkeley) 6/17/86
        !             2: .\"
        !             3: .EH 'PS2:4-%''UNIX Implementation'
        !             4: .OH 'UNIX Implementation''PS2:4-%'
        !             5: .de P1
        !             6: .DS
        !             7: ..
        !             8: .de P2
        !             9: .DE
        !            10: ..
        !            11: .de UL
        !            12: .lg 0
        !            13: .if n .ul
        !            14: \%\&\\$3\f3\\$1\fR\&\\$2
        !            15: .lg
        !            16: ..
        !            17: .de UC
        !            18: \&\\$3\s-1\\$1\\s0\&\\$2
        !            19: ..
        !            20: .de IT
        !            21: .lg 0
        !            22: .if n .ul
        !            23: \%\&\\$3\f2\\$1\fR\&\\$2
        !            24: .lg
        !            25: ..
        !            26: .de SP
        !            27: .sp \\$1
        !            28: ..
        !            29: .hw device
        !            30: .TL
        !            31: UNIX Implementation
        !            32: .AU "MH 2C-523" 2394
        !            33: K. Thompson
        !            34: .AI
        !            35: .MH
        !            36: .AB
        !            37: This paper describes in high-level terms the
        !            38: implementation of the resident
        !            39: .UX
        !            40: kernel.
        !            41: This discussion is broken into three parts.
        !            42: The first part describes
        !            43: how the
        !            44: .UX
        !            45: system views processes, users, and programs.
        !            46: The second part describes the I/O system.
        !            47: The last part describes the
        !            48: .UX
        !            49: file system.
        !            50: .AE
        !            51: .NH
        !            52: INTRODUCTION
        !            53: .PP
        !            54: The
        !            55: .UX
        !            56: kernel consists of about 10,000
        !            57: lines of C code and about 1,000 lines of assembly code.
        !            58: The assembly code can be further broken down into
        !            59: 200 lines included for
        !            60: the sake of efficiency
        !            61: (they could have been written in C)
        !            62: and 800 lines to perform hardware
        !            63: functions not possible in C.
        !            64: .PP
        !            65: This code represents 5 to 10 percent of what has
        !            66: been lumped into the broad expression
        !            67: ``the
        !            68: .UX
        !            69: operating system.''
        !            70: The kernel is the only
        !            71: .UX
        !            72: code that
        !            73: cannot be substituted by a user to his
        !            74: own liking.
        !            75: For this reason,
        !            76: the kernel should make as few real
        !            77: decisions as possible.
        !            78: This does not mean to allow the user
        !            79: a million options to do the same thing.
        !            80: Rather, it means to allow only one way to
        !            81: do one thing,
        !            82: but have that way be the least-common divisor
        !            83: of all the options that might have been provided.
        !            84: .PP
        !            85: What is or is not implemented in the kernel
        !            86: represents both a great responsibility and a great power.
        !            87: It is a soap-box platform on
        !            88: ``the way things should be done.''
        !            89: Even so, if
        !            90: ``the way'' is too radical,
        !            91: no one will follow it.
        !            92: Every important decision was weighed
        !            93: carefully.
        !            94: Throughout,
        !            95: simplicity has been substituted for efficiency.
        !            96: Complex algorithms are used only if
        !            97: their complexity can be localized.
        !            98: .NH
        !            99: PROCESS CONTROL
        !           100: .PP
        !           101: In the
        !           102: .UX
        !           103: system,
        !           104: a user executes programs in an
        !           105: environment called a user process.
        !           106: When a system function is required,
        !           107: the user process calls the system
        !           108: as a subroutine.
        !           109: At some point in this call,
        !           110: there is a distinct switch of environments.
        !           111: After this,
        !           112: the process is said to be a system process.
        !           113: In the normal definition of processes,
        !           114: the user and system processes are different
        !           115: phases of the same process
        !           116: (they never execute simultaneously).
        !           117: For protection,
        !           118: each system process has its own stack.
        !           119: .PP
        !           120: The user process may execute
        !           121: from a read-only text segment,
        !           122: which is shared by all processes
        !           123: executing the same code.
        !           124: There is no
        !           125: .IT functional
        !           126: benefit
        !           127: from shared-text segments.
        !           128: An
        !           129: .IT efficiency
        !           130: benefit comes from the fact
        !           131: that there is no need to swap read-only
        !           132: segments out because the original
        !           133: copy on secondary memory is still current.
        !           134: This is a great benefit to interactive
        !           135: programs that tend to be swapped while
        !           136: waiting for terminal input.
        !           137: Furthermore,
        !           138: if two processes are
        !           139: executing
        !           140: simultaneously
        !           141: from the same copy of a read-only segment,
        !           142: only one copy needs to reside in
        !           143: primary memory.
        !           144: This is a secondary effect,
        !           145: because
        !           146: simultaneous execution of a program
        !           147: is not common.
        !           148: It is ironic that this effect,
        !           149: which reduces the use of primary memory,
        !           150: only comes into play when there is
        !           151: an overabundance of primary memory,
        !           152: that is,
        !           153: when there is enough memory
        !           154: to keep waiting processes loaded.
        !           155: .PP
        !           156: All current read-only text segments in the
        !           157: system are maintained from the
        !           158: .IT "text table" .
        !           159: A text table entry holds the location of the
        !           160: text segment on secondary memory.
        !           161: If the segment is loaded,
        !           162: that table also holds the primary memory location
        !           163: and the count of the number of processes
        !           164: sharing this entry.
        !           165: When this count is reduced to zero,
        !           166: the entry is freed along with any
        !           167: primary and secondary memory holding the segment.
        !           168: When a process first executes a shared-text segment,
        !           169: a text table entry is allocated and the
        !           170: segment is loaded onto secondary memory.
        !           171: If a second process executes a text segment
        !           172: that is already allocated,
        !           173: the entry reference count is simply incremented.
        !           174: .PP
        !           175: A user process has some strictly private
        !           176: read-write data
        !           177: contained in its
        !           178: data segment.
        !           179: As far as possible,
        !           180: the system does not use the user's
        !           181: data segment to hold system data.
        !           182: In particular,
        !           183: there are no I/O buffers in the
        !           184: user address space.
        !           185: .PP
        !           186: The user data segment has two growing boundaries.
        !           187: One, increased automatically by the system
        !           188: as a result of memory faults,
        !           189: is used for a stack.
        !           190: The second boundary is only grown (or shrunk) by
        !           191: explicit requests.
        !           192: The contents of newly allocated primary memory
        !           193: is initialized to zero.
        !           194: .PP
        !           195: Also associated and swapped with
        !           196: a process is a small fixed-size
        !           197: system data segment.
        !           198: This segment contains all
        !           199: the data about the process
        !           200: that the system needs only when the
        !           201: process is active.
        !           202: Examples of the kind of data contained
        !           203: in the system data segment are:
        !           204: saved central processor registers,
        !           205: open file descriptors,
        !           206: accounting information,
        !           207: scratch data area,
        !           208: and the stack for the system phase
        !           209: of the process.
        !           210: The system data segment is not
        !           211: addressable from the user process
        !           212: and is therefore protected.
        !           213: .PP
        !           214: Last,
        !           215: there is a process table with
        !           216: one entry per process.
        !           217: This entry contains all the data
        !           218: needed by the system when the process
        !           219: is
        !           220: .IT not
        !           221: active.
        !           222: Examples are
        !           223: the process's name,
        !           224: the location of the other segments,
        !           225: and scheduling information.
        !           226: The process table entry is allocated
        !           227: when the process is created, and freed
        !           228: when the process terminates.
        !           229: This process entry is always directly
        !           230: addressable by the kernel.
        !           231: .PP
        !           232: Figure 1 shows the relationships
        !           233: between the various process control
        !           234: data.
        !           235: In a sense,
        !           236: the process table is the
        !           237: definition of all processes,
        !           238: because
        !           239: all the data associated with a process
        !           240: may be accessed
        !           241: starting from the process table entry.
        !           242: .KF
        !           243: .in .375i
        !           244: .so fig1.pic
        !           245: .in -.375i
        !           246: .sp 2v
        !           247: .ce
        !           248: Fig. 1\(emProcess control data structure.
        !           249: .KE
        !           250: .NH 2
        !           251: Process creation and program execution
        !           252: .PP
        !           253: Processes are created by the system primitive
        !           254: .UL fork .
        !           255: The newly created process (child) is a copy of the original process (parent).
        !           256: There is no detectable sharing of primary memory between the two processes.
        !           257: (Of course,
        !           258: if the parent process was executing from a read-only
        !           259: text segment,
        !           260: the child will share the text segment.)
        !           261: Copies of all writable data segments
        !           262: are made for the child process.
        !           263: Files that were open before the
        !           264: .UL fork
        !           265: are
        !           266: truly shared after the
        !           267: .UL fork .
        !           268: The processes are informed as to their part in the
        !           269: relationship to
        !           270: allow them to select their own
        !           271: (usually non-identical)
        !           272: destiny.
        !           273: The parent may
        !           274: .UL wait
        !           275: for the termination of
        !           276: any of its children.
        !           277: .PP
        !           278: A process may
        !           279: .UL exec
        !           280: a file.
        !           281: This consists of exchanging the current text and data
        !           282: segments of the process for new text and data
        !           283: segments specified in the file.
        !           284: The old segments are lost.
        !           285: Doing an
        !           286: .UL exec
        !           287: does
        !           288: .IT not
        !           289: change processes;
        !           290: the process that did the
        !           291: .UL exec
        !           292: persists,
        !           293: but
        !           294: after the
        !           295: .UL exec
        !           296: it is executing a different program.
        !           297: Files that were open
        !           298: before the
        !           299: .UL exec
        !           300: remain open after the
        !           301: .UL exec .
        !           302: .PP
        !           303: If a program,
        !           304: say the first pass of a compiler,
        !           305: wishes to overlay itself with another program,
        !           306: say the second pass,
        !           307: then it simply
        !           308: .UL exec s
        !           309: the second program.
        !           310: This is analogous
        !           311: to a ``goto.''
        !           312: If a program wishes to regain control
        !           313: after
        !           314: .UL exec ing
        !           315: a second program,
        !           316: it should
        !           317: .UL fork
        !           318: a child process,
        !           319: have the child
        !           320: .UL exec
        !           321: the second program, and
        !           322: have the parent
        !           323: .UL wait
        !           324: for the child.
        !           325: This is analogous to a ``call.''
        !           326: Breaking up the call into a binding followed by
        !           327: a transfer is similar to the subroutine linkage in
        !           328: SL-5.
        !           329: .[
        !           330: griswold hanson sl5 overview
        !           331: .]
        !           332: .NH 2
        !           333: Swapping
        !           334: .PP
        !           335: The major data associated with a process
        !           336: (the user data segment,
        !           337: the system data segment, and
        !           338: the text segment)
        !           339: are swapped to and from secondary
        !           340: memory, as needed.
        !           341: The user data segment and the system data segment
        !           342: are kept in contiguous primary memory to reduce
        !           343: swapping latency.
        !           344: (When low-latency devices, such as bubbles,
        !           345: .UC CCD s,
        !           346: or scatter/gather devices,
        !           347: are used,
        !           348: this decision will have to be reconsidered.)
        !           349: Allocation of both primary
        !           350: and secondary memory is performed
        !           351: by the same simple first-fit algorithm.
        !           352: When a process grows,
        !           353: a new piece of primary memory is allocated.
        !           354: The contents of the old memory is copied to the new memory.
        !           355: The old memory is freed
        !           356: and the tables are updated.
        !           357: If there is not enough primary memory,
        !           358: secondary memory is allocated instead.
        !           359: The process is swapped out onto the
        !           360: secondary memory,
        !           361: ready to be swapped in with
        !           362: its new size.
        !           363: .PP
        !           364: One separate process in the kernel,
        !           365: the swapping process,
        !           366: simply swaps the other
        !           367: processes in and out of primary memory.
        !           368: It examines the
        !           369: process table looking for a process
        !           370: that is swapped out and is
        !           371: ready to run.
        !           372: It allocates primary memory for that
        !           373: process and
        !           374: reads its segments into
        !           375: primary memory, where that process competes for the
        !           376: central processor with other loaded processes.
        !           377: If no primary memory is available,
        !           378: the swapping process makes memory available
        !           379: by examining the process table for processes
        !           380: that can be swapped out.
        !           381: It selects a process to swap out,
        !           382: writes it to secondary memory,
        !           383: frees the primary memory,
        !           384: and then goes back to look for a process
        !           385: to swap in.
        !           386: .PP
        !           387: Thus there are two specific algorithms
        !           388: to the swapping process.
        !           389: Which of the possibly many processes that
        !           390: are swapped out is to be swapped in?
        !           391: This is decided by secondary storage residence
        !           392: time.
        !           393: The one with the longest time out is swapped in first.
        !           394: There is a slight penalty for larger processes.
        !           395: Which of the possibly many processes that
        !           396: are loaded is to be swapped out?
        !           397: Processes that are waiting for slow events
        !           398: (i.e., not currently running or waiting for
        !           399: disk I/O)
        !           400: are picked first,
        !           401: by age in primary memory,
        !           402: again with size penalties.
        !           403: The other processes are examined
        !           404: by the same age algorithm,
        !           405: but are not taken out unless they are
        !           406: at least of some age.
        !           407: This adds
        !           408: hysteresis to the swapping and
        !           409: prevents total thrashing.
        !           410: .PP
        !           411: These swapping algorithms are the
        !           412: most suspect in the system.
        !           413: With limited primary memory,
        !           414: these algorithms cause total swapping.
        !           415: This is not bad in itself, because
        !           416: the swapping does not impact the
        !           417: execution of the resident processes.
        !           418: However, if the swapping device must
        !           419: also be used for file storage,
        !           420: the swapping traffic severely
        !           421: impacts the file system traffic.
        !           422: It is exactly these small systems
        !           423: that tend to double usage of limited disk
        !           424: resources.
        !           425: .NH 2
        !           426: Synchronization and scheduling
        !           427: .PP
        !           428: Process synchronization is accomplished by having processes
        !           429: wait for events.
        !           430: Events are represented by arbitrary integers.
        !           431: By convention,
        !           432: events are chosen to be addresses of
        !           433: tables associated with those events.
        !           434: For example, a process that is waiting for
        !           435: any of its children to terminate will wait
        !           436: for an event that is the address of
        !           437: its own process table entry.
        !           438: When a process terminates,
        !           439: it signals the event represented by
        !           440: its parent's process table entry.
        !           441: Signaling an event on which no process
        !           442: is waiting has no effect.
        !           443: Similarly,
        !           444: signaling an event on which many processes
        !           445: are waiting will wake all of them up.
        !           446: This differs considerably from
        !           447: Dijkstra's P and V
        !           448: synchronization operations,
        !           449: .[
        !           450: dijkstra sequential processes 1968
        !           451: .]
        !           452: in that
        !           453: no memory is associated with events.
        !           454: Thus there need be no allocation of events
        !           455: prior to their use.
        !           456: Events exist simply by being used.
        !           457: .PP
        !           458: On the negative side,
        !           459: because there is no memory associated with events,
        !           460: no notion of ``how much''
        !           461: can be signaled via the event mechanism.
        !           462: For example,
        !           463: processes that want memory might
        !           464: wait on an event associated with
        !           465: memory allocation.
        !           466: When any amount of memory becomes available,
        !           467: the event would be signaled.
        !           468: All the competing processes would then wake
        !           469: up to fight over the new memory.
        !           470: (In reality,
        !           471: the swapping process is the only process
        !           472: that waits for primary memory to become available.)
        !           473: .PP
        !           474: If an event occurs
        !           475: between the time a process decides
        !           476: to wait for that event and the
        !           477: time that process enters the wait state,
        !           478: then
        !           479: the process will wait on an event that has
        !           480: already happened (and may never happen again).
        !           481: This race condition happens because there is no memory associated with
        !           482: the event to indicate that the event has occurred;
        !           483: the only action of an event is to change a set of processes
        !           484: from wait state to run state.
        !           485: This problem is relieved largely
        !           486: by the fact that process switching can
        !           487: only occur in the kernel by explicit calls
        !           488: to the event-wait mechanism.
        !           489: If the event in question is signaled by another
        !           490: process,
        !           491: then there is no problem.
        !           492: But if the event is signaled by a hardware
        !           493: interrupt,
        !           494: then special care must be taken.
        !           495: These synchronization races pose the biggest
        !           496: problem when
        !           497: .UX
        !           498: is adapted to multiple-processor configurations.
        !           499: .[
        !           500: hawley meyer multiprocessing unix
        !           501: .]
        !           502: .PP
        !           503: The event-wait code in the kernel
        !           504: is like a co-routine linkage.
        !           505: At any time,
        !           506: all but one of the processes has called event-wait.
        !           507: The remaining process is the one currently executing.
        !           508: When it calls event-wait,
        !           509: a process whose event has been signaled
        !           510: is selected and that process
        !           511: returns from its call to event-wait.
        !           512: .PP
        !           513: Which of the runable processes is to run next?
        !           514: Associated with each process is a priority.
        !           515: The priority of a system process is assigned by the code
        !           516: issuing the wait on an event.
        !           517: This is roughly equivalent to the response
        !           518: that one would expect on such an event.
        !           519: Disk events have high priority,
        !           520: teletype events are low,
        !           521: and time-of-day events are very low.
        !           522: (From observation,
        !           523: the difference in system process priorities
        !           524: has little or no performance impact.)
        !           525: All user-process priorities are lower than the
        !           526: lowest system priority.
        !           527: User-process priorities are assigned
        !           528: by an algorithm based on the
        !           529: recent ratio of the amount of compute time to real time consumed
        !           530: by the process.
        !           531: A process that has used a lot of
        !           532: compute time in the last real-time
        !           533: unit is assigned a low user priority.
        !           534: Because interactive processes are characterized
        !           535: by low ratios of compute to real time,
        !           536: interactive response is maintained without any
        !           537: special arrangements.
        !           538: .PP
        !           539: The scheduling algorithm simply picks
        !           540: the process with the highest priority,
        !           541: thus
        !           542: picking all system processes first and
        !           543: user processes second.
        !           544: The compute-to-real-time ratio is updated
        !           545: every second.
        !           546: Thus,
        !           547: all other things being equal,
        !           548: looping user processes will be
        !           549: scheduled round-robin with a
        !           550: 1-second quantum.
        !           551: A high-priority process waking up will
        !           552: preempt a running, low-priority process.
        !           553: The scheduling algorithm has a very desirable
        !           554: negative feedback character.
        !           555: If a process uses its high priority
        !           556: to hog the computer,
        !           557: its priority will drop.
        !           558: At the same time, if a low-priority
        !           559: process is ignored for a long time,
        !           560: its priority will rise.
        !           561: .NH
        !           562: I/O SYSTEM
        !           563: .PP
        !           564: The I/O system
        !           565: is broken into two completely separate systems:
        !           566: the block I/O system and the character I/O system.
        !           567: In retrospect,
        !           568: the names should have been ``structured I/O''
        !           569: and ``unstructured I/O,'' respectively;
        !           570: while the term ``block I/O'' has some meaning,
        !           571: ``character I/O'' is a complete misnomer.
        !           572: .PP
        !           573: Devices are characterized by a major device number,
        !           574: a minor device number, and
        !           575: a class (block or character).
        !           576: For each class,
        !           577: there is an array of entry points into the device drivers.
        !           578: The major device number is used to index the array
        !           579: when calling the code for a particular device driver.
        !           580: The minor device number is passed to the
        !           581: device driver as an argument.
        !           582: The minor number has no significance other
        !           583: than that attributed to it by the driver.
        !           584: Usually,
        !           585: the driver uses the minor number to access
        !           586: one of several identical physical devices.
        !           587: .PP
        !           588: The use of the array of entry points
        !           589: (configuration table)
        !           590: as the only connection between the
        !           591: system code and the device drivers is
        !           592: very important.
        !           593: Early versions of the system had a much
        !           594: less formal connection with the drivers,
        !           595: so that it was extremely hard to handcraft
        !           596: differently configured systems.
        !           597: Now it is possible to create new
        !           598: device drivers in an average of a few hours.
        !           599: The configuration table in most cases
        !           600: is created automatically by a program
        !           601: that reads the system's parts list.
        !           602: .NH 2
        !           603: Block I/O system
        !           604: .PP
        !           605: The model block I/O device consists
        !           606: of randomly addressed, secondary
        !           607: memory blocks of 512 bytes each.
        !           608: The blocks are uniformly addressed
        !           609: 0, 1, .\|.\|. up to the size of the device.
        !           610: The block device driver has the job of
        !           611: emulating this model on a
        !           612: physical device.
        !           613: .PP
        !           614: The block I/O devices are accessed
        !           615: through a layer of buffering software.
        !           616: The system maintains a list of buffers
        !           617: (typically between 10 and 70)
        !           618: each assigned a device name and
        !           619: a device address.
        !           620: This buffer pool constitutes a data cache
        !           621: for the block devices.
        !           622: On a read request,
        !           623: the cache is searched for the desired block.
        !           624: If the block is found,
        !           625: the data are made available to the
        !           626: requester without any physical I/O.
        !           627: If the block is not in the cache,
        !           628: the least recently used block in the cache is renamed,
        !           629: the correct device driver is called to
        !           630: fill up the renamed buffer, and then the
        !           631: data are made available.
        !           632: Write requests are handled in an analogous manner.
        !           633: The correct buffer is found
        !           634: and relabeled if necessary.
        !           635: The write is performed simply by marking
        !           636: the buffer as ``dirty.''
        !           637: The physical I/O is then deferred until
        !           638: the buffer is renamed.
        !           639: .PP
        !           640: The benefits in reduction of physical I/O
        !           641: of this scheme are substantial,
        !           642: especially considering the file system implementation.
        !           643: There are,
        !           644: however,
        !           645: some drawbacks.
        !           646: The asynchronous nature of the
        !           647: algorithm makes error reporting
        !           648: and meaningful user error handling
        !           649: almost impossible.
        !           650: The cavalier approach to I/O error
        !           651: handling in the
        !           652: .UX
        !           653: system is partly due to the asynchronous
        !           654: nature of the block I/O system.
        !           655: A second problem is in the delayed writes.
        !           656: If the system stops unexpectedly,
        !           657: it is almost certain that there is a
        !           658: lot of logically complete,
        !           659: but physically incomplete,
        !           660: I/O in the buffers.
        !           661: There is a system primitive to
        !           662: flush all outstanding I/O activity
        !           663: from the buffers.
        !           664: Periodic use of this primitive helps,
        !           665: but does not solve, the problem.
        !           666: Finally,
        !           667: the associativity in the buffers
        !           668: can alter the physical I/O sequence
        !           669: from that of the logical I/O sequence.
        !           670: This means that there are times
        !           671: when data structures on disk are inconsistent,
        !           672: even though the software is careful
        !           673: to perform I/O in the correct order.
        !           674: On non-random devices,
        !           675: notably magnetic tape,
        !           676: the inversions of writes can be disastrous.
        !           677: The problem with magnetic tapes is ``cured'' by
        !           678: allowing only one outstanding write request
        !           679: per drive.
        !           680: .NH 2
        !           681: Character I/O system
        !           682: .PP
        !           683: The character I/O system consists of all
        !           684: devices that do not fall into the block I/O model.
        !           685: This includes the ``classical'' character devices
        !           686: such as communications lines, paper tape, and
        !           687: line printers.
        !           688: It also includes magnetic tape and disks when
        !           689: they are not used in a stereotyped way,
        !           690: for example, 80-byte physical records on tape
        !           691: and track-at-a-time disk copies.
        !           692: In short,
        !           693: the character I/O interface
        !           694: means ``everything other than block.''
        !           695: I/O requests from the user are sent to the
        !           696: device driver essentially unaltered.
        !           697: The implementation of these requests is, of course,
        !           698: up to the device driver.
        !           699: There are guidelines and conventions
        !           700: to help the implementation of
        !           701: certain types of device drivers.
        !           702: .NH 3
        !           703: Disk drivers
        !           704: .PP
        !           705: Disk drivers are implemented
        !           706: with a queue of transaction records.
        !           707: Each record holds a read/write flag,
        !           708: a primary memory address,
        !           709: a secondary memory address, and
        !           710: a transfer byte count.
        !           711: Swapping is accomplished by passing
        !           712: such a record to the swapping device driver.
        !           713: The block I/O interface is implemented by
        !           714: passing such records with requests to
        !           715: fill and empty system buffers.
        !           716: The character I/O interface to the disk
        !           717: drivers create a transaction record that
        !           718: points directly into the user area.
        !           719: The routine that creates this record also insures
        !           720: that the user is not swapped during this
        !           721: I/O transaction.
        !           722: Thus by implementing the general disk driver,
        !           723: it is possible to use the disk
        !           724: as a block device,
        !           725: a character device, and a swap device.
        !           726: The only really disk-specific code in normal
        !           727: disk drivers is the pre-sort of transactions to
        !           728: minimize latency for a particular device, and
        !           729: the actual issuing of the I/O request.
        !           730: .NH 3
        !           731: Character lists
        !           732: .PP
        !           733: Real character-oriented devices may
        !           734: be implemented using the common
        !           735: code to handle character lists.
        !           736: A character list is a queue of characters.
        !           737: One routine puts a character on a queue.
        !           738: Another gets a character from a queue.
        !           739: It is also possible to ask how many
        !           740: characters are currently on a queue.
        !           741: Storage for all queues in the system comes
        !           742: from a single common pool.
        !           743: Putting a character on a queue will allocate
        !           744: space from the common pool and link the
        !           745: character onto the data structure defining the queue.
        !           746: Getting a character from a queue returns
        !           747: the corresponding space to the pool.
        !           748: .PP
        !           749: A typical character-output device
        !           750: (paper tape punch, for example)
        !           751: is implemented by passing characters
        !           752: from the user onto a character queue until
        !           753: some maximum number of characters is on the queue.
        !           754: The I/O is prodded to start as
        !           755: soon as there is anything on the queue
        !           756: and, once started,
        !           757: it is sustained by hardware completion interrupts.
        !           758: Each time there is a completion interrupt,
        !           759: the driver gets the next character from the queue
        !           760: and sends it to the hardware.
        !           761: The number of characters on the queue is checked and,
        !           762: as the count falls through some intermediate level,
        !           763: an event (the queue address) is signaled.
        !           764: The process that is passing characters from
        !           765: the user to the queue can be waiting on the event, and
        !           766: refill the queue to its maximum
        !           767: when the event occurs.
        !           768: .PP
        !           769: A typical character input device
        !           770: (for example, a paper tape reader)
        !           771: is handled in a very similar manner.
        !           772: .PP
        !           773: Another class of character devices is the terminals.
        !           774: A terminal is represented by three
        !           775: character queues.
        !           776: There are two input queues (raw and canonical)
        !           777: and an output queue.
        !           778: Characters going to the output of a terminal
        !           779: are handled by common code exactly as described
        !           780: above.
        !           781: The main difference is that there is also code
        !           782: to interpret the output stream as
        !           783: .UC  ASCII
        !           784: characters and to perform some translations,
        !           785: e.g., escapes for deficient terminals.
        !           786: Another common aspect of terminals is code
        !           787: to insert real-time delay after certain control characters.
        !           788: .PP
        !           789: Input on terminals is a little different.
        !           790: Characters are collected from the terminal and
        !           791: placed on a raw input queue.
        !           792: Some device-dependent code conversion and
        !           793: escape interpretation is handled here.
        !           794: When a line is complete in the raw queue,
        !           795: an event is signaled.
        !           796: The code catching this signal then copies a
        !           797: line from the raw queue to a canonical queue
        !           798: performing the character erase and line kill editing.
        !           799: User read requests on terminals can be
        !           800: directed at either the raw or canonical queues.
        !           801: .NH 3
        !           802: Other character devices
        !           803: .PP
        !           804: Finally,
        !           805: there are devices that fit no general category.
        !           806: These devices are set up as character I/O drivers.
        !           807: An example is a driver that reads and writes
        !           808: unmapped primary memory as an I/O device.
        !           809: Some devices are too
        !           810: fast to be treated a character at time,
        !           811: but do not fit the disk I/O mold.
        !           812: Examples are fast communications lines and
        !           813: fast line printers.
        !           814: These devices either have their own buffers
        !           815: or ``borrow'' block I/O buffers for a while and
        !           816: then give them back.
        !           817: .NH
        !           818: THE FILE SYSTEM
        !           819: .PP
        !           820: In the
        !           821: .UX
        !           822: system,
        !           823: a file is a (one-dimensional) array of bytes.
        !           824: No other structure of files is implied by the
        !           825: system.
        !           826: Files are attached anywhere
        !           827: (and possibly multiply)
        !           828: onto a hierarchy of directories.
        !           829: Directories are simply files that
        !           830: users cannot write.
        !           831: For a further discussion
        !           832: of the external view of files and directories,
        !           833: see Ref.\0
        !           834: .[
        !           835: ritchie thompson unix bstj 1978
        !           836: %Q This issue
        !           837: .].
        !           838: .PP
        !           839: The
        !           840: .UX
        !           841: file system is a disk data structure
        !           842: accessed completely through
        !           843: the block I/O system.
        !           844: As stated before,
        !           845: the canonical view of a ``disk'' is
        !           846: a randomly addressable array of
        !           847: 512-byte blocks.
        !           848: A file system breaks the disk into
        !           849: four self-identifying regions.
        !           850: The first block (address 0)
        !           851: is unused by the file system.
        !           852: It is left aside for booting procedures.
        !           853: The second block (address 1)
        !           854: contains the so-called ``super-block.''
        !           855: This block,
        !           856: among other things,
        !           857: contains the size of the disk and
        !           858: the boundaries of the other regions.
        !           859: Next comes the i-list,
        !           860: a list of file definitions.
        !           861: Each file definition is
        !           862: a 64-byte structure, called an i-node.
        !           863: The offset of a particular i-node
        !           864: within the i-list is called its i-number.
        !           865: The combination of device name
        !           866: (major and minor numbers) and i-number
        !           867: serves to uniquely name a particular file.
        !           868: After the i-list,
        !           869: and to the end of the disk,
        !           870: come free storage blocks that
        !           871: are available for the contents of files.
        !           872: .PP
        !           873: The free space on a disk is maintained
        !           874: by a linked list of available disk blocks.
        !           875: Every block in this chain contains a disk address
        !           876: of the next block in the chain.
        !           877: The remaining space contains the address of up to
        !           878: 50 disk blocks that are also free.
        !           879: Thus with one I/O operation,
        !           880: the system obtains 50 free blocks and a
        !           881: pointer where to find more.
        !           882: The disk allocation algorithms are
        !           883: very straightforward.
        !           884: Since all allocation is in fixed-size
        !           885: blocks and there is strict accounting of
        !           886: space,
        !           887: there is no need to compact or garbage collect.
        !           888: However,
        !           889: as disk space becomes dispersed,
        !           890: latency gradually increases.
        !           891: Some installations choose to occasionally compact
        !           892: disk space to reduce latency.
        !           893: .PP
        !           894: An i-node contains 13 disk addresses.
        !           895: The first 10 of these addresses point directly at
        !           896: the first 10 blocks of a file.
        !           897: If a file is larger than 10 blocks (5,120 bytes),
        !           898: then the eleventh address points at a block
        !           899: that contains the addresses of the next 128 blocks of the file.
        !           900: If the file is still larger than this
        !           901: (70,656 bytes),
        !           902: then the twelfth block points at up to 128 blocks,
        !           903: each pointing to 128 blocks of the file.
        !           904: Files yet larger
        !           905: (8,459,264 bytes)
        !           906: use the thirteenth address for a ``triple indirect'' address.
        !           907: The algorithm ends here with the maximum file size
        !           908: of 1,082,201,087 bytes.
        !           909: .PP
        !           910: A logical directory hierarchy is added
        !           911: to this flat physical structure simply
        !           912: by adding a new type of file, the directory.
        !           913: A directory is accessed exactly as an ordinary file.
        !           914: It contains 16-byte entries consisting of
        !           915: a 14-byte name and an i-number.
        !           916: The root of the hierarchy is at a known i-number
        !           917: (\fIviz.,\fR 2).
        !           918: The file system structure allows an arbitrary, directed graph
        !           919: of directories with regular files linked in
        !           920: at arbitrary places in this graph.
        !           921: In fact,
        !           922: very early
        !           923: .UX
        !           924: systems used such a structure.
        !           925: Administration of such a structure became so
        !           926: chaotic that later systems were restricted
        !           927: to a directory tree.
        !           928: Even now,
        !           929: with regular files linked multiply
        !           930: into arbitrary places in the tree,
        !           931: accounting for space has become a problem.
        !           932: It may become necessary to restrict the entire
        !           933: structure to a tree,
        !           934: and allow a new form of linking that
        !           935: is subservient to the tree structure.
        !           936: .PP
        !           937: The file system allows
        !           938: easy creation,
        !           939: easy removal,
        !           940: easy random accessing,
        !           941: and very easy space allocation.
        !           942: With most physical addresses confined
        !           943: to a small contiguous section of disk,
        !           944: it is also easy to dump, restore, and
        !           945: check the consistency of the file system.
        !           946: Large files suffer from indirect addressing,
        !           947: but the cache prevents most of the implied physical I/O
        !           948: without adding much execution.
        !           949: The space overhead properties of this scheme are quite good.
        !           950: For example,
        !           951: on one particular file system,
        !           952: there are 25,000 files containing 130M bytes of data-file content.
        !           953: The overhead (i-node, indirect blocks, and last block breakage)
        !           954: is about 11.5M bytes.
        !           955: The directory structure to support these files
        !           956: has about 1,500 directories containing 0.6M bytes of directory content
        !           957: and about 0.5M bytes of overhead in accessing the directories.
        !           958: Added up any way,
        !           959: this comes out to less than a 10 percent overhead for actual
        !           960: stored data.
        !           961: Most systems have this much overhead in
        !           962: padded trailing blanks alone.
        !           963: .NH 2
        !           964: File system implementation
        !           965: .PP
        !           966: Because the i-node defines a file,
        !           967: the implementation of the file system centers
        !           968: around access to the i-node.
        !           969: The system maintains a table of all active
        !           970: i-nodes.
        !           971: As a new file is accessed,
        !           972: the system locates the corresponding i-node,
        !           973: allocates an i-node table entry, and reads
        !           974: the i-node into primary memory.
        !           975: As in the buffer cache,
        !           976: the table entry is considered to be the current
        !           977: version of the i-node.
        !           978: Modifications to the i-node are made to
        !           979: the table entry.
        !           980: When the last access to the i-node goes
        !           981: away,
        !           982: the table entry is copied back to the
        !           983: secondary store i-list and the table entry is freed.
        !           984: .KF
        !           985: .in .25i
        !           986: .so fig2.pic
        !           987: .in -.25i
        !           988: .sp 2v
        !           989: .ce
        !           990: Fig. 2\(emFile system data structure.
        !           991: .sp
        !           992: .KE
        !           993: .PP
        !           994: All I/O operations on files are carried out
        !           995: with the aid of the corresponding i-node table entry.
        !           996: The accessing of a file is a straightforward
        !           997: implementation of the algorithms mentioned previously.
        !           998: The user is not aware of i-nodes and i-numbers.
        !           999: References to the file system are made in terms of
        !          1000: path names of the directory tree.
        !          1001: Converting a path name into an i-node table entry
        !          1002: is also straightforward.
        !          1003: Starting at some known i-node
        !          1004: (the root or the current directory of some process),
        !          1005: the next component of the path name is
        !          1006: searched by reading the directory.
        !          1007: This gives an i-number and an implied device
        !          1008: (that of the directory).
        !          1009: Thus the next i-node table entry can be accessed.
        !          1010: If that was the last component of the path name,
        !          1011: then this i-node is the result.
        !          1012: If not,
        !          1013: this i-node is the directory needed to look up
        !          1014: the next component of the path name, and the
        !          1015: algorithm is repeated.
        !          1016: .PP
        !          1017: The user process accesses the file system with
        !          1018: certain primitives.
        !          1019: The most common of these are
        !          1020: .UL open ,
        !          1021: .UL create ,
        !          1022: .UL read ,
        !          1023: .UL write ,
        !          1024: .UL seek ,
        !          1025: and
        !          1026: .UL close .
        !          1027: The data structures maintained are shown in Fig. 2.
        !          1028: In the system data segment associated with a user,
        !          1029: there is room for some (usually between 10 and 50) open files.
        !          1030: This open file table consists of pointers that can be used to access
        !          1031: corresponding i-node table entries.
        !          1032: Associated with each of these open files is
        !          1033: a current I/O pointer.
        !          1034: This is a byte offset of
        !          1035: the next read/write operation on the file.
        !          1036: The system treats each read/write request
        !          1037: as random with an implied seek to the
        !          1038: I/O pointer.
        !          1039: The user usually thinks of the file as
        !          1040: sequential with the I/O pointer
        !          1041: automatically counting the number of bytes
        !          1042: that have been read/written from the file.
        !          1043: The user may,
        !          1044: of course,
        !          1045: perform random I/O by setting the I/O pointer
        !          1046: before reads/writes.
        !          1047: .PP
        !          1048: With file sharing,
        !          1049: it is necessary to allow related
        !          1050: processes to share a common I/O pointer
        !          1051: and yet have separate I/O pointers
        !          1052: for independent processes
        !          1053: that access the same file.
        !          1054: With these two conditions,
        !          1055: the I/O pointer cannot reside
        !          1056: in the i-node table nor can
        !          1057: it reside in the list of
        !          1058: open files for the process.
        !          1059: A new table
        !          1060: (the open file table)
        !          1061: was invented for the sole purpose
        !          1062: of holding the I/O pointer.
        !          1063: Processes that share the same open
        !          1064: file
        !          1065: (the result of
        !          1066: .UL fork s)
        !          1067: share a common open file table entry.
        !          1068: A separate open of the same file will
        !          1069: only share the i-node table entry,
        !          1070: but will have distinct open file table entries.
        !          1071: .PP
        !          1072: The main file system primitives are implemented as follows.
        !          1073: .UL \&open
        !          1074: converts a file system path name into an i-node
        !          1075: table entry.
        !          1076: A pointer to the i-node table entry is placed in a
        !          1077: newly created open file table entry.
        !          1078: A pointer to the file table entry is placed in the
        !          1079: system data segment for the process.
        !          1080: .UL \&create
        !          1081: first creates a new i-node entry,
        !          1082: writes the i-number into a directory, and
        !          1083: then builds the same structure as for an
        !          1084: .UL open .
        !          1085: .UL \&read
        !          1086: and
        !          1087: .UL write
        !          1088: just access the i-node entry as described above.
        !          1089: .UL \&seek
        !          1090: simply manipulates the I/O pointer.
        !          1091: No physical seeking is done.
        !          1092: .UL \&close
        !          1093: just frees the structures built by
        !          1094: .UL open
        !          1095: and
        !          1096: .UL create .
        !          1097: Reference counts are kept on the open file table entries and
        !          1098: the i-node table entries to free these structures after
        !          1099: the last reference goes away.
        !          1100: .UL \&unlink
        !          1101: simply decrements the count of the
        !          1102: number of directories pointing at the given i-node.
        !          1103: When the last reference to an i-node table entry
        !          1104: goes away,
        !          1105: if the i-node has no directories pointing to it,
        !          1106: then the file is removed and the i-node is freed.
        !          1107: This delayed removal of files prevents
        !          1108: problems arising from removing active files.
        !          1109: A file may be removed while still open.
        !          1110: The resulting unnamed file vanishes
        !          1111: when the file is closed.
        !          1112: This is a method of obtaining temporary files.
        !          1113: .PP
        !          1114: There is a type of unnamed
        !          1115: .UC  FIFO
        !          1116: file called a
        !          1117: .UL pipe.
        !          1118: Implementation of
        !          1119: .UL pipe s
        !          1120: consists of implied
        !          1121: .UL seek s
        !          1122: before each
        !          1123: .UL read
        !          1124: or
        !          1125: .UL write
        !          1126: in order to implement
        !          1127: first-in-first-out.
        !          1128: There are also checks and synchronization
        !          1129: to prevent the
        !          1130: writer from grossly outproducing the
        !          1131: reader and to prevent the reader from
        !          1132: overtaking the writer.
        !          1133: .NH 2
        !          1134: Mounted file systems
        !          1135: .PP
        !          1136: The file system of a
        !          1137: .UX
        !          1138: system
        !          1139: starts with some designated block device
        !          1140: formatted as described above to contain
        !          1141: a hierarchy.
        !          1142: The root of this structure is the root of
        !          1143: the
        !          1144: .UX
        !          1145: file system.
        !          1146: A second formatted block device may be
        !          1147: mounted
        !          1148: at any leaf of
        !          1149: the current hierarchy.
        !          1150: This logically extends the current hierarchy.
        !          1151: The implementation of
        !          1152: mounting
        !          1153: is trivial.
        !          1154: A mount table is maintained containing
        !          1155: pairs of designated leaf i-nodes and
        !          1156: block devices.
        !          1157: When converting a path name into an i-node,
        !          1158: a check is made to see if the new i-node is a
        !          1159: designated leaf.
        !          1160: If it is,
        !          1161: the i-node of the root
        !          1162: of the block device replaces it.
        !          1163: .PP
        !          1164: Allocation of space for a file is taken
        !          1165: from the free pool on the device on which the
        !          1166: file lives.
        !          1167: Thus a file system consisting of many
        !          1168: mounted devices does not have a common pool of
        !          1169: free secondary storage space.
        !          1170: This separation of space on different
        !          1171: devices is necessary to allow easy
        !          1172: unmounting
        !          1173: of a device.
        !          1174: .NH 2
        !          1175: Other system functions
        !          1176: .PP
        !          1177: There are some other things that the system
        !          1178: does for the user\-a
        !          1179: little accounting,
        !          1180: a little tracing/debugging,
        !          1181: and a little access protection.
        !          1182: Most of these things are not very
        !          1183: well developed
        !          1184: because our use of the system in computing science research
        !          1185: does not need them.
        !          1186: There are some features that are missed in some
        !          1187: applications, for example, better inter-process communication.
        !          1188: .PP
        !          1189: The
        !          1190: .UX
        !          1191: kernel is an I/O multiplexer more than
        !          1192: a complete operating system.
        !          1193: This is as it should be.
        !          1194: Because of this outlook,
        !          1195: many features are
        !          1196: found in most
        !          1197: other operating systems that are missing from the
        !          1198: .UX
        !          1199: kernel.
        !          1200: For example,
        !          1201: the
        !          1202: .UX
        !          1203: kernel does not support
        !          1204: file access methods,
        !          1205: file disposition,
        !          1206: file formats,
        !          1207: file maximum size,
        !          1208: spooling,
        !          1209: command language,
        !          1210: logical records,
        !          1211: physical records,
        !          1212: assignment of logical file names,
        !          1213: logical file names,
        !          1214: more than one character set,
        !          1215: an operator's console,
        !          1216: an operator,
        !          1217: log-in,
        !          1218: or log-out.
        !          1219: Many of these things are symptoms rather than features.
        !          1220: Many of these things are implemented
        !          1221: in user software
        !          1222: using the kernel as a tool.
        !          1223: A good example of this is the command language.
        !          1224: .[
        !          1225: bourne shell 1978 bstj
        !          1226: %Q This issue
        !          1227: .]
        !          1228: Each user may have his own command language.
        !          1229: Maintenance of such code is as easy as
        !          1230: maintaining user code.
        !          1231: The idea of implementing ``system'' code with general
        !          1232: user primitives
        !          1233: comes directly from
        !          1234: .UC  MULTICS .
        !          1235: .[
        !          1236: organick multics 1972
        !          1237: .]
        !          1238: .LP
        !          1239: .[
        !          1240: $LIST$
        !          1241: .]
unix.superglobalmegacorp.com
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.