Annotation of 43BSDReno/share/doc/ps2/05.iosys/iosys, revision 1.1.1.1

1.1       root        1: .\"    @(#)iosys       6.1 (Berkeley) 4/29/86
                      2: .\"
                      3: .EH 'PS2:5-%''The UNIX I/O System'
                      4: .OH 'The UNIX I/O System''PS2:5-%'
                      5: .TL
                      6: The UNIX I/O System
                      7: .AU
                      8: Dennis M. Ritchie
                      9: .AI
                     10: .MH
                     11: .PP
                     12: This paper gives an overview of the workings of the UNIX\(dg
                     13: .FS
                     14: \(dgUNIX is a Trademark of Bell Laboratories.
                     15: .FE
                     16: I/O system.
                     17: It was written with an eye toward providing
                     18: guidance to writers of device driver routines,
                     19: and is oriented more toward describing the environment
                     20: and nature of device drivers than the implementation
                     21: of that part of the file system which deals with
                     22: ordinary files.
                     23: .PP
                     24: It is assumed that the reader has a good knowledge
                     25: of the overall structure of the file system as discussed
                     26: in the paper ``The UNIX Time-sharing System.''
                     27: A more detailed discussion
                     28: appears in
                     29: ``UNIX Implementation;''
                     30: the current document restates parts of that one,
                     31: but is still more detailed.
                     32: It is most useful in
                     33: conjunction with a copy of the system code,
                     34: since it is basically an exegesis of that code.
                     35: .SH
                     36: Device Classes
                     37: .PP
                     38: There are two classes of device:
                     39: .I block
                     40: and
                     41: .I character.
                     42: The block interface is suitable for devices
                     43: like disks, tapes, and DECtape
                     44: which work, or can work, with addressable 512-byte blocks.
                     45: Ordinary magnetic tape just barely fits in this category,
                     46: since by use of forward
                     47: and
                     48: backward spacing any block can be read, even though
                     49: blocks can be written only at the end of the tape.
                     50: Block devices can at least potentially contain a mounted
                     51: file system.
                     52: The interface to block devices is very highly structured;
                     53: the drivers for these devices share a great many routines
                     54: as well as a pool of buffers.
                     55: .PP
                     56: Character-type devices have a much
                     57: more straightforward interface, although
                     58: more work must be done by the driver itself.
                     59: .PP
                     60: Devices of both types are named by a
                     61: .I major
                     62: and a
                     63: .I minor
                     64: device number.
                     65: These numbers are generally stored as an integer
                     66: with the minor device number
                     67: in the low-order 8 bits and the major device number
                     68: in the next-higher 8 bits;
                     69: macros
                     70: .I major
                     71: and
                     72: .I minor
                     73: are available to access these numbers.
                     74: The major device number selects which driver will deal with
                     75: the device; the minor device number is not used
                     76: by the rest of the system but is passed to the
                     77: driver at appropriate times.
                     78: Typically the minor number
                     79: selects a subdevice attached to
                     80: a given controller, or one of
                     81: several similar hardware interfaces.
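.PP
The packing described above can be sketched as follows.
This is only an illustration: the names SK_MAJOR, SK_MINOR and SK_MAKEDEV
are hypothetical stand-ins, not the system's own macros,
and the layout assumed is exactly the one stated in the text
(minor number in the low-order 8 bits, major number in the next-higher 8 bits).

```c
/* Sketch only: hypothetical stand-ins for the system's major and minor macros. */
#define SK_MAJOR(dev)        (((dev) >> 8) & 0377)   /* next-higher 8 bits */
#define SK_MINOR(dev)        ((dev) & 0377)          /* low-order 8 bits */

/* Hypothetical helper that packs a major/minor pair into one integer. */
#define SK_MAKEDEV(maj, min) ((((maj) & 0377) << 8) | ((min) & 0377))
```

With this layout, major device 3 and minor device 5 pack into the value 0x0305.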
                     82: .PP
                     83: The major device numbers for block and character devices
                     84: are used as indices in separate tables;
                     85: they both start at 0 and therefore overlap.
                     86: .SH
                     87: Overview of I/O
                     88: .PP
                     89: The purpose of
                     90: the
                     91: .I open
                     92: and
                     93: .I creat
                     94: system calls is to set up entries in three separate
                     95: system tables.
                     96: The first of these is the
                     97: .I u_ofile
                     98: table,
                     99: which is stored in the system's per-process
                    100: data area
                    101: .I u.
                    102: This table is indexed by
                    103: the file descriptor returned by the
                    104: .I open
                    105: or
                    106: .I creat,
                    107: and is accessed during
                    108: a
                    109: .I read,
                    110: .I write,
                    111: or other operation on the open file.
                    112: An entry contains only
                    113: a pointer to the corresponding
                    114: entry of the
                    115: .I file
                    116: table,
                    117: which is a per-system data base.
                    118: There is one entry in the
                    119: .I file
                    120: table for each
                    121: instance of
                    122: .I open
                    123: or
                    124: .I creat.
                    125: This table is per-system because the same instance
                    126: of an open file must be shared among the several processes
                    127: which can result from
                    128: .I forks
                    129: after the file is opened.
                    130: A
                    131: .I file
                    132: table entry contains
                    133: flags which indicate whether the file
                    134: was open for reading or writing or is a pipe, and
                    135: a count which is used to decide when all processes
                    136: using the entry have terminated or closed the file
                    137: (so the entry can be abandoned).
                    138: There is also a 32-bit file offset
                    139: which is used to indicate where in the file the next read
                    140: or write will take place.
                    141: Finally, there is a pointer to the
                    142: entry for the file in the
                    143: .I inode
                    144: table,
                    145: which contains a copy of the file's i-node.
                    146: .PP
                    147: Certain open files can be designated ``multiplexed''
                    148: files, and several other flags apply to such
                    149: channels.
                    150: In such a case, instead of an offset,
                    151: there is a pointer to an associated multiplex channel table.
                    152: Multiplex channels will not be discussed here.
                    153: .PP
                    154: An entry in the
                    155: .I file
                    156: table corresponds precisely to an instance of
                    157: .I open
                    158: or
                    159: .I creat;
                    160: if the same file is opened several times,
                    161: it will have several
                    162: entries in this table.
                    163: However,
                    164: there is at most one entry
                    165: in the
                    166: .I inode
                    167: table for a given file.
                    168: Also, a file may enter the
                    169: .I inode
                    170: table not only because it is open,
                    171: but also because it is the current directory
                    172: of some process or because it
                    173: is a special file containing a currently-mounted
                    174: file system.
                    175: .PP
                    176: An entry in the
                    177: .I inode
                    178: table differs somewhat from the
                    179: corresponding i-node as stored on the disk;
                    180: the modified and accessed times are not stored,
                    181: and the entry is augmented
                    182: by a flag word containing information about the entry,
                    183: a count used to determine when it may be
                    184: allowed to disappear,
                    185: and the device and i-number
                    186: whence the entry came.
                    187: Also, the several block numbers that give addressing
                    188: information for the file are expanded from
                    189: the 3-byte, compressed format used on the disk to full
                    190: .I long
                    191: quantities.
                    192: .PP
                    193: During the processing of an
                    194: .I open
                    195: or
                    196: .I creat
                    197: call for a special file,
                    198: the system always calls the device's
                    199: .I open
                    200: routine to allow for any special processing
                    201: required (rewinding a tape, turning on
                    202: the data-terminal-ready lead of a modem, etc.).
                    203: However,
                    204: the
                    205: .I close
                    206: routine is called only when the last
                    207: process closes a file,
                    208: that is, when the i-node table entry
                    209: is being deallocated.
                    210: Thus it is not feasible
                    211: for a device to maintain, or depend on,
                    212: a count of its users, although it is quite
                    213: possible to
                    214: implement an exclusive-use device which cannot
                    215: be reopened until it has been closed.
                    216: .PP
                    217: When a
                    218: .I read
                    219: or
                    220: .I write
                    221: takes place,
                    222: the user's arguments
                    223: and the
                    224: .I file
                    225: table entry are used to set up the
                    226: variables
                    227: .I u.u_base,
                    228: .I u.u_count,
                    229: and
                    230: .I u.u_offset
                    231: which respectively contain the (user) address
                    232: of the I/O target area, the byte-count for the transfer,
                    233: and the current location in the file.
                    234: If the file referred to is
                    235: a character-type special file, the appropriate read
                    236: or write routine is called; it is responsible
                    237: for transferring data and updating the
                    238: count and current location appropriately
                    239: as discussed below.
                    240: Otherwise, the current location is used to calculate
                    241: a logical block number in the file.
                    242: If the file is an ordinary file the logical block
                    243: number must be mapped (possibly using indirect blocks)
                    244: to a physical block number; a block-type
                    245: special file need not be mapped.
                    246: This mapping is performed by the
                    247: .I bmap
                    248: routine.
                    249: In any event, the resulting physical block number
                    250: is used, as discussed below, to
                    251: read or write the appropriate device.
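.PP
As a rough illustration of the first step, the logical block number
and the offset within that block might be derived from the current location
as shown here.
The 512-byte block size is taken from the description of block devices above
and is assumed to apply here as well;
the names SK_BSIZE, sk_lblkno and sk_blkoff are hypothetical.
The mapping to a physical block number is not shown.

```c
#define SK_BSIZE 512L   /* assumed block size, in bytes */

/* Logical block number containing the given byte offset. */
long sk_lblkno(long offset)
{
    return offset / SK_BSIZE;
}

/* Byte position of the offset within its block. */
long sk_blkoff(long offset)
{
    return offset % SK_BSIZE;
}
```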
                    252: .SH
                    253: Character Device Drivers
                    254: .PP
                    255: The
                    256: .I cdevsw
                    257: table specifies the interface routines present for
                    258: character devices.
                    259: Each device provides five routines:
                    260: open, close, read, write, and special-function
                    261: (to implement the
                    262: .I ioctl
                    263: system call).
                    264: Any of these may be missing.
                    265: If a call on the routine
                    266: should be ignored,
                    267: (e.g.
                    268: .I open
                    269: on non-exclusive devices that require no setup)
                    270: the
                    271: .I cdevsw
                    272: entry can be given as
                    273: .I nulldev;
                    274: if it should be considered an error,
                    275: (e.g.
                    276: .I write
                    277: on read-only devices)
                    278: .I nodev
                    279: is used.
                    280: For terminals,
                    281: the
                    282: .I cdevsw
                    283: structure also contains a pointer to the
                    284: .I tty
                    285: structure associated with the terminal.
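.PP
A much simplified model of such a switch table is sketched here.
It is not the system's declaration:
every routine is given the same argument list for brevity,
the error cell sk_error and the value SK_ENODEV are assumptions,
and all names beginning with sk_ are hypothetical.
The single entry describes an imagined read-only device
that needs no setup on open.

```c
#define SK_ENODEV  19   /* assumed value for a "no such device" error */
#define SK_NCHRDEV 1    /* number of entries in this sketch table */

int sk_error;           /* hypothetical stand-in for the per-process error cell */

/* Simplified switch entry; the real routines take differing arguments. */
struct sk_cdevsw {
    int (*d_open)(int dev, int flag);
    int (*d_close)(int dev, int flag);
    int (*d_read)(int dev, int flag);
    int (*d_write)(int dev, int flag);
    int (*d_ioctl)(int dev, int flag);
};

/* The call is accepted and ignored. */
int sk_nulldev(int dev, int flag)
{
    (void)dev; (void)flag;
    return 0;
}

/* The call is considered an error. */
int sk_nodev(int dev, int flag)
{
    (void)dev; (void)flag;
    sk_error = SK_ENODEV;
    return -1;
}

/* Read routine of the imagined device; it transfers nothing in this sketch. */
int sk_ro_read(int dev, int flag)
{
    (void)dev; (void)flag;
    return 0;
}

struct sk_cdevsw sk_table[SK_NCHRDEV] = {
    /* open        close       read        write     ioctl */
    { sk_nulldev, sk_nulldev, sk_ro_read, sk_nodev, sk_nodev },
};

/* Dispatch a write through the table, indexed by major device number. */
int sk_write(int dev)
{
    int maj = (dev >> 8) & 0377;

    sk_error = 0;
    if (maj >= SK_NCHRDEV)
        return sk_nodev(dev, 1);
    return (*sk_table[maj].d_write)(dev, 1);
}
```

A write to device 0 is rejected and leaves SK_ENODEV in sk_error,
while an open of the same device succeeds without doing anything.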
                    286: .PP
                    287: The
                    288: .I open
                    289: routine is called each time the file
                    290: is opened with the full device number as argument.
                    291: The second argument is a flag which is
                    292: non-zero only if the device is to be written upon.
                    293: .PP
                    294: The
                    295: .I close
                    296: routine is called only when the file
                    297: is closed for the last time,
                    298: that is, when the very last process in
                    299: which the file is open closes it.
                    300: This means it is not possible for the driver to
                    301: maintain its own count of its users.
                    302: The first argument is the device number;
                    303: the second is a flag which is non-zero
                    304: if the file was open for writing in the process which
                    305: performs the final
                    306: .I close.
                    307: .PP
                    308: When
                    309: .I write
                    310: is called, it is supplied the device
                    311: as argument.
                    312: The per-user variable
                    313: .I u.u_count
                    314: has been set to
                    315: the number of characters indicated by the user;
                    316: for character devices, this number may be 0
                    317: initially.
                    318: .I u.u_base
                    319: is the address supplied by the user from which to start
                    320: taking characters.
                    321: The system may call the
                    322: routine internally, so the
                    323: flag
                    324: .I u.u_segflg
                    325: is supplied that indicates,
                    326: if
                    327: .I on,
                    328: that
                    329: .I u.u_base
                    330: refers to the system address space instead of
                    331: the user's.
                    332: .PP
                    333: The
                    334: .I write
                    335: routine
                    336: should copy up to
                    337: .I u.u_count
                    338: characters from the user's buffer to the device,
                    339: decrementing
                    340: .I u.u_count
                    341: for each character passed.
                    342: For most drivers, which work one character at a time,
                    343: the routine
                    344: .I "cpass( )"
                    345: is used to pick up characters
                    346: from the user's buffer.
                    347: Successive calls on it return
                    348: the characters to be written until
                    349: .I u.u_count
                    350: goes to 0 or an error occurs,
                    351: when it returns \(mi1.
                    352: .I Cpass
                    353: takes care of interrogating
                    354: .I u.u_segflg
                    355: and updating
                    356: .I u.u_count.
                    357: .PP
                    358: Write routines which want to transfer
                    359: a probably large number of characters into an internal
                    360: buffer may also use the routine
                    361: .I "iomove(buffer, offset, count, flag)"
                    362: which is faster when many characters must be moved.
                    363: .I Iomove
                    364: transfers up to
                    365: .I count
                    366: characters into the
                    367: .I buffer
                    368: starting
                    369: .I offset
                    370: bytes from the start of the buffer;
                    371: .I flag
                    372: should be
                    373: .I B_WRITE
                    374: (which is 0) in the write case.
                    375: Caution:
                    376: the caller is responsible for making sure
                    377: the count is not too large and is non-zero.
                    378: As an efficiency note,
                    379: .I iomove
                    380: is much slower if any of
                    381: .I "buffer+offset, count"
                    382: or
                    383: .I u.u_base
                    384: is odd.
                    385: .PP
                    386: The device's
                    387: .I read
                    388: routine is called under conditions similar to
                    389: .I write,
                    390: except that
                    391: .I u.u_count
                    392: is guaranteed to be non-zero.
                    393: To return characters to the user, the routine
                    394: .I "passc(c)"
                    395: is available; it takes care of housekeeping
                    396: like
                    397: .I cpass
                    398: and returns \(mi1 as the last character
                    399: specified by
                    400: .I u.u_count
                    401: is returned to the user;
                    402: before that time, 0 is returned.
                    403: .I Iomove
                    404: is also usable as with
                    405: .I write;
                    406: the flag should be
                    407: .I B_READ
                    408: but the same cautions apply.
                    409: .PP
                    410: The ``special-function'' routine
                    411: is invoked by the
                    412: .I stty
                    413: and
                    414: .I gtty
                    415: system calls as follows:
                    416: .I "(*p) (dev, v)"
                    417: where
                    418: .I p
                    419: is a pointer to the device's routine,
                    420: .I dev
                    421: is the device number,
                    422: and
                    423: .I v
                    424: is a vector.
                    425: In the
                    426: .I gtty
                    427: case,
                    428: the device is supposed to place up to 3 words of status information
                    429: into the vector; this will be returned to the caller.
                    430: In the
                    431: .I stty
                    432: case,
                    433: .I v
                    434: is 0;
                    435: the device should take up to 3 words of
                    436: control information from
                    437: the array
                    438: .I "u.u_arg[0...2]."
                    439: .PP
                    440: Finally, each device should have appropriate interrupt-time
                    441: routines.
                    442: When an interrupt occurs, it is turned into a C-compatible call
                    443: on the device's interrupt routine.
                    444: The interrupt-catching mechanism makes
                    445: the low-order four bits of the ``new PS'' word in the
                    446: trap vector for the interrupt available
                    447: to the interrupt handler.
                    448: This is conventionally used by drivers
                    449: which deal with multiple similar devices
                    450: to encode the minor device number.
                    451: After the interrupt has been processed,
                    452: a return from the interrupt handler will
                    453: return from the interrupt itself.
                    454: .PP
                    455: A number of subroutines are available which are useful
                    456: to character device drivers.
                    457: Most of these handlers, for example, need a place
                    458: to buffer characters in the internal interface
                    459: between their ``top half'' (read/write)
                    460: and ``bottom half'' (interrupt) routines.
                    461: For relatively low data-rate devices, the best mechanism
                    462: is the character queue maintained by the
                    463: routines
                    464: .I getc
                    465: and
                    466: .I putc.
                    467: A queue header has the structure
                    468: .DS
                    469: struct {
                    470:        int     c_cc;   /* character count */
                    471:        char    *c_cf;  /* first character */
                    472:        char    *c_cl;  /* last character */
                    473: } queue;
                    474: .DE
                    475: A character is placed on the end of a queue by
                    476: .I "putc(c, &queue)"
                    477: where
                    478: .I c
                    479: is the character and
                    480: .I queue
                    481: is the queue header.
                    482: The routine returns \(mi1 if there is no space
                    483: to put the character, 0 otherwise.
                    484: The first character on the queue may be retrieved
                    485: by
                    486: .I "getc(&queue)"
                    487: which returns either the (non-negative) character
                    488: or \(mi1 if the queue is empty.
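.PP
The calling conventions just described can be modelled by the sketch below.
It is a deliberate simplification:
the storage is a small private array of assumed size SK_QSIZE
rather than anything shared among devices,
the header uses an index in place of the two character pointers,
and the names sk_queue, sk_putc and sk_getc are hypothetical,
chosen to avoid confusion with the real routines.
Only the return values follow the text.

```c
#define SK_QSIZE 16   /* assumed capacity of this private queue */

struct sk_queue {
    int  c_cc;              /* character count */
    int  c_head;            /* index of the first character */
    char c_buf[SK_QSIZE];   /* private storage (a simplification) */
};

/* Place c on the end of the queue; return -1 if there is no space, 0 otherwise. */
int sk_putc(int c, struct sk_queue *q)
{
    if (q->c_cc >= SK_QSIZE)
        return -1;
    q->c_buf[(q->c_head + q->c_cc) % SK_QSIZE] = (char)c;
    q->c_cc++;
    return 0;
}

/* Remove the first character; return it (non-negative), or -1 if the queue is empty. */
int sk_getc(struct sk_queue *q)
{
    int c;

    if (q->c_cc == 0)
        return -1;
    c = q->c_buf[q->c_head] & 0377;   /* mask so the result is never negative */
    q->c_head = (q->c_head + 1) % SK_QSIZE;
    q->c_cc--;
    return c;
}
```

A header with all fields zero represents an empty queue.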
                    489: .PP
                    490: Notice that the space for characters in queues is
                    491: shared among all devices in the system
                    492: and in the standard system there are only some 600
                    493: character slots available.
                    494: Thus device handlers,
                    495: especially write routines, must take
                    496: care to avoid gobbling up excessive numbers of characters.
                    497: .PP
                    498: The other major help available
                    499: to device handlers is the sleep-wakeup mechanism.
                    500: The call
                    501: .I "sleep(event, priority)"
                    502: causes the process to wait (allowing other processes to run)
                    503: until the
                    504: .I event
                    505: occurs;
                    506: at that time, the process is marked ready-to-run
                    507: and the call will return when there is no
                    508: process with higher
                    509: .I priority.
                    510: .PP
                    511: The call
                    512: .I "wakeup(event)"
                    513: indicates that the
                    514: .I event
                    515: has happened, that is, causes processes sleeping
                    516: on the event to be awakened.
                    517: The
                    518: .I event
                    519: is an arbitrary quantity agreed upon
                    520: by the sleeper and the waker-up.
                    521: By convention, it is the address of some data area used
                    522: by the driver, which guarantees that events
                    523: are unique.
                    524: .PP
                    525: Processes sleeping on an event should not assume
                    526: that the event has really happened;
                    527: they should check that the conditions which
                    528: caused them to sleep no longer hold.
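.PP
The usual way to follow this advice is to call sleep inside a loop
that tests the condition again each time the call returns.
The sketch below shows only the shape of that loop.
It cannot block, so sk_sleep is a hypothetical stand-in that returns at once
and makes the condition true only on its third call,
which imitates two returns from sleep before the condition really holds.
The flag sk_data_ready and the priority value 10 are likewise assumptions.

```c
int sk_data_ready;    /* condition the top half is waiting for */
int sk_sleep_calls;   /* number of times the stand-in sleep was called */

/*
 * Stand-in for sleep(event, priority).  A real kernel would suspend the
 * process here; this stub returns immediately, and the awaited condition
 * becomes true only on the third call.
 */
void sk_sleep(void *event, int priority)
{
    (void)event; (void)priority;
    if (++sk_sleep_calls >= 3)
        sk_data_ready = 1;
}

/* Top-half idiom: test the condition again every time sleep returns. */
void sk_wait_for_data(void)
{
    while (!sk_data_ready)
        sk_sleep(&sk_data_ready, 10);   /* event: address of driver data */
}
```

A single test with if in place of the while loop would return
after the first call to sk_sleep with the flag still clear.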
                    529: .PP
                    530: Priorities can range from 0 to 127;
                    531: a higher numerical value indicates a less-favored
                    532: scheduling situation.
                    533: A distinction is made between processes sleeping
                    534: at priority less than the parameter
                    535: .I PZERO
                    536: and those at numerically larger priorities.
                    537: The former cannot
                    538: be interrupted by signals, although it
                    539: is conceivable that they may be swapped out.
                    540: Thus it is a bad idea to sleep with
                    541: priority less than PZERO on an event which might never occur.
                    542: On the other hand, calls to
                    543: .I sleep
                    544: with larger priority
                    545: may never return if the process is terminated by
                    546: some signal in the meantime.
                    547: Incidentally, it is a gross error to call
                    548: .I sleep
                    549: in a routine called at interrupt time, since the process
                    550: which is running is almost certainly not the
                    551: process which should go to sleep.
                    552: Likewise, none of the variables in the user area
                    553: ``\fIu\fB.\fR''
                    554: should be touched, let alone changed, by an interrupt routine.
                    555: .PP
                    556: If a device driver
                    557: wishes to wait for some event for which it is inconvenient
                    558: or impossible to supply a
                    559: .I wakeup,
                    560: (for example, a device going on-line, which does not
                    561: generally cause an interrupt),
                    562: the call
                    563: .I "sleep(&lbolt, priority)"
                    564: may be given.
                    565: .I Lbolt
                    566: is an external cell whose address is awakened once every 4 seconds
                    567: by the clock interrupt routine.
                    568: .PP
                    569: The routines
                    570: .I "spl4( ), spl5( ), spl6( ), spl7( )"
                    571: are available to
                    572: set the processor priority level as indicated to avoid
                    573: inconvenient interrupts from the device.
                    574: .PP
                    575: If a device needs to know about real-time intervals,
                    576: then
                    577: .I "timeout(func, arg, interval)"
                    578: will be useful.
                    579: This routine arranges that after
                    580: .I interval
                    581: sixtieths of a second, the
                    582: .I func
                    583: will be called with
                    584: .I arg
                    585: as argument, in the style
                    586: .I "(*func)(arg)."
                    587: Timeouts are used, for example,
                    588: to provide real-time delays after function characters
                    589: like new-line and tab in typewriter output,
                    590: and to terminate an attempt to
                    591: read the 201 Dataphone
                    592: .I dp
                    593: if there is no response within a specified number
                    594: of seconds.
                    595: Notice that the number of sixtieths of a second is limited to 32767,
                    596: since it must appear to be positive,
                    597: and that only a bounded number of timeouts
                    598: can be going on at once.
                    599: Also, the specified
                    600: .I func
                    601: is called at clock-interrupt time, so it should
                    602: conform to the requirements of interrupt routines
                    603: in general.
                    604: .SH
                    605: The Block-device Interface
                    606: .PP
                    607: Handling of block devices is mediated by a collection
                    608: of routines that manage a set of buffers containing
                    609: the images of blocks of data on the various devices.
                    610: The most important purpose of these routines is to assure
                    611: that several processes that access the same block of the same
                    612: device in multiprogrammed fashion maintain a consistent
                    613: view of the data in the block.
                    614: A secondary but still important purpose is to increase
                    615: the efficiency of the system by
                    616: keeping in-core copies of blocks that are being
                    617: accessed frequently.
                    618: The main data base for this mechanism is the
                    619: table of buffers
                    620: .I buf.
                    621: Each buffer header contains a pair of pointers
                    622: .I "(b_forw, b_back)"
                    623: which maintain a doubly-linked list
                    624: of the buffers associated with a particular
                    625: block device, and a
                    626: pair of pointers
                    627: .I "(av_forw, av_back)"
                    628: which generally maintain a doubly-linked list of blocks
                    629: which are ``free,'' that is,
                    630: eligible to be reallocated for another transaction.
                    631: Buffers that have I/O in progress
                    632: or are busy for other purposes do not appear in this list.
                    633: The buffer header
                    634: also contains the device and block number to which the
                    635: buffer refers, and a pointer to the actual storage associated with
                    636: the buffer.
                    637: There is a word count
                    638: which is the negative of the number of words
                    639: to be transferred to or from the buffer;
                    640: there is also an error byte and a residual word
                    641: count used to communicate information
                    642: from an I/O routine to its caller.
                    643: Finally, there is a flag word
                    644: with bits indicating the status of the buffer.
                    645: These flags will be discussed below.
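.PP
As a rough illustration of the description above, a buffer header
might be modelled in C as follows.
Only the pointer names b_forw, b_back, av_forw, av_back and the
error byte b_error come from the text; the other field names, all
field types, and the helpers buf_setup and buf_words are assumptions
made for this sketch, not the system's actual declarations.

```c
#include <stddef.h>

/* Hypothetical user-space model of a buffer header. */
struct buf {
    int         b_flags;   /* status bits, discussed later in the text */
    struct buf *b_forw;    /* next buffer on the same device's list */
    struct buf *b_back;    /* previous buffer on the same device's list */
    struct buf *av_forw;   /* next buffer on the free list */
    struct buf *av_back;   /* previous buffer on the free list */
    int         b_dev;     /* assumed name: device the buffer refers to */
    long        b_blkno;   /* assumed name: block number on that device */
    char       *b_addr;    /* assumed name: storage holding the data */
    int         b_wcount;  /* assumed name: NEGATIVE count of words to move */
    char        b_error;   /* error byte set by the I/O routine */
    int         b_resid;   /* assumed name: residual word count */
};

/* Fill in a header for a transfer of nwords words. */
void buf_setup(struct buf *bp, int dev, long blkno, char *addr, int nwords)
{
    bp->b_flags = 0;
    bp->b_forw = bp->b_back = bp;     /* self-linked: on no device list yet */
    bp->av_forw = bp->av_back = bp;   /* self-linked: on no free list yet */
    bp->b_dev = dev;
    bp->b_blkno = blkno;
    bp->b_addr = addr;
    bp->b_wcount = -nwords;           /* stored as the negative of the count */
    bp->b_error = 0;
    bp->b_resid = 0;
}

/* Recover the positive number of words requested. */
int buf_words(const struct buf *bp)
{
    return -bp->b_wcount;
}
```

The count is kept negated simply because the text says so; the sketch
makes no claim about why the hardware prefers it that way.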
                    646: .PP
                    647: Seven routines constitute
                    648: the most important part of the interface with the
                    649: rest of the system.
                    650: Given a device and block number,
                    651: both
                    652: .I bread
                    653: and
                    654: .I getblk
                    655: return a pointer to a buffer header for the block;
                    656: the difference is that
                    657: .I bread
                    658: is guaranteed to return a buffer actually containing the
                    659: current data for the block,
                    660: while
                    661: .I getblk
                    662: returns a buffer which contains the data in the
                    663: block only if it is already in core (whether it is
                    664: or not is indicated by the
                    665: .I B_DONE
                    666: bit; see below).
                    667: In either case the buffer, and the corresponding
                    668: device block, is made ``busy,''
                    669: so that other processes referring to it
                    670: are obliged to wait until it becomes free.
                    671: .I Getblk
                    672: is used, for example,
                    673: when a block is about to be totally rewritten,
                    674: so that its previous contents are
                    675: not useful;
                    676: still, no other process can be allowed to refer to the block
                    677: until the new data is placed into it.
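.PP
The difference between the two routines can be seen in a toy,
single-threaded model.
Everything here is an assumption made for illustration: the
four-entry cache, the fake device in dev_read, the bit values,
and especially the return of NULL where the real routines would
sleep until the buffer became free.
The brelse used to give a buffer back is the routine described
further on.

```c
#include <stddef.h>
#include <string.h>

#define B_DONE 02     /* illustrative bit values */
#define B_BUSY 010
#define NBUF   4
#define BSIZE  16

struct buf {
    int  b_flags;
    int  b_dev;
    long b_blkno;
    char b_data[BSIZE];
};

struct buf cache[NBUF];

/* Mark every buffer as idle and attached to no device. */
void cache_init(void)
{
    int i;
    for (i = 0; i < NBUF; i++) {
        cache[i].b_flags = 0;
        cache[i].b_dev = -1;
        cache[i].b_blkno = 0;
    }
}

/* Stand-in for a device read: fill the block with a letter. */
static void dev_read(int dev, long blkno, char *data)
{
    (void)dev;
    memset(data, 'A' + (int)(blkno % 26), BSIZE);
}

/* Return the buffer for (dev, blkno), busy; contents valid only if B_DONE. */
struct buf *getblk(int dev, long blkno)
{
    struct buf *bp, *idle = NULL;

    for (bp = cache; bp < cache + NBUF; bp++) {
        if (bp->b_dev == dev && bp->b_blkno == blkno) {
            if (bp->b_flags & B_BUSY)
                return NULL;            /* the kernel would sleep here */
            bp->b_flags |= B_BUSY;
            return bp;                  /* B_DONE left as it was */
        }
        if (idle == NULL && !(bp->b_flags & B_BUSY))
            idle = bp;
    }
    if (idle == NULL)
        return NULL;                    /* no free buffer: would sleep */
    idle->b_dev = dev;
    idle->b_blkno = blkno;
    idle->b_flags = B_BUSY;             /* old contents invalid: B_DONE off */
    return idle;
}

/* Like getblk, but guarantees the data is present. */
struct buf *bread(int dev, long blkno)
{
    struct buf *bp = getblk(dev, blkno);

    if (bp != NULL && !(bp->b_flags & B_DONE)) {
        dev_read(dev, blkno, bp->b_data);
        bp->b_flags |= B_DONE;
    }
    return bp;
}

/* Give the buffer back so that others may claim it. */
void brelse(struct buf *bp)
{
    bp->b_flags &= ~B_BUSY;
}
```

A caller about to overwrite a whole block would use getblk and skip
the read; one that needs the old contents would use bread.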
                    678: .PP
                    679: The
                    680: .I breada
                    681: routine is used to implement read-ahead.
                    682: It is logically similar to
                    683: .I bread,
                    684: but takes as an additional argument the number of
                    685: a block (on the same device) to be read asynchronously
                    686: after the specifically requested block is available.
                    687: .PP
                    688: Given a pointer to a buffer,
                    689: the
                    690: .I brelse
                    691: routine
                    692: makes the buffer again available to other processes.
                    693: It is called, for example, after
                    694: data has been extracted following a
                    695: .I bread.
                    696: There are three subtly-different write routines,
                    697: all of which take a buffer pointer as argument,
                    698: and all of which logically release the buffer for
                    699: use by others and place it on the free list.
                    700: .I Bwrite
                    701: puts the
                    702: buffer on the appropriate device queue,
                    703: waits for the write to be done,
                    704: and sets the user's error flag if required.
                    705: .I Bawrite
                    706: places the buffer on the device's queue, but does not wait
                    707: for completion, so that errors cannot be reflected directly to
                    708: the user.
                    709: .I Bdwrite
                    710: does not start any I/O operation at all,
                    711: but merely marks
                    712: the buffer so that if it happens
                    713: to be grabbed from the free list to contain
                    714: data from some other block, the data in it will
                    715: first be written
                    716: out.
                    717: .PP
                    718: .I Bwrite
                    719: is used when one wants to be sure that
                    720: I/O takes place correctly, and that
                    721: errors are reflected to the proper user;
                    722: it is used, for example, when updating i-nodes.
                    723: .I Bawrite
                    724: is useful when more overlap is desired
                    725: (because no wait is required for I/O to finish)
                    726: but when it is reasonably certain that the
                    727: write is really required.
                    728: .I Bdwrite
                    729: is used when there is doubt that the write is
                    730: needed at the moment.
                    731: For example,
                    732: .I bdwrite
                    733: is called when the last byte of a
                    734: .I write
                    735: system call falls short of the end of a
                    736: block, on the assumption that
                    737: another
                    738: .I write
                    739: will be given soon which will re-use the same block.
                    740: On the other hand,
                    741: as the end of a block is passed,
                    742: .I bawrite
                    743: is called, since probably the block will
                    744: not be accessed again soon and one might as
                    745: well start the writing process as soon as possible.
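.PP
The choice just described can be sketched as a small planning
function.
It is only a model of the decision, under the assumption that a
write is carried out one block-sized piece at a time and that
offsets are non-negative; the name plan_write and its arguments
are invented for the example.
For each piece it records 'a' where bawrite would be called
(the piece ends exactly on a block boundary) and 'd' where
bdwrite would be called (the piece stops short of the end of
the block).

```c
#include <stddef.h>

/*
 * Split a write of n bytes at byte offset off into per-block pieces.
 * out[i] becomes 'a' (bawrite) or 'd' (bdwrite) for the i-th piece.
 * Returns the number of pieces recorded (at most max).
 */
size_t plan_write(long off, long n, long bsize, char *out, size_t max)
{
    size_t k = 0;

    while (n > 0 && k < max) {
        long room = bsize - off % bsize;    /* bytes left in this block */
        long chunk = n < room ? n : room;   /* bytes written into it */

        off += chunk;
        n -= chunk;
        /* End of block passed: start the write now. Otherwise delay it. */
        out[k++] = (off % bsize == 0) ? 'a' : 'd';
    }
    return k;
}
```

With 512-byte blocks, a 1300-byte write starting at offset 0 yields
two pieces that fill their blocks and one that does not.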
                    746: .PP
                    747: In any event, notice that the routines
                    748: .I "getblk"
                    749: and
                    750: .I bread
                    751: dedicate the given block exclusively to the
                    752: use of the caller, and make others wait,
                    753: while one of
                    754: .I "brelse, bwrite, bawrite,"
                    755: or
                    756: .I bdwrite
                    757: must eventually be called to free the block for use by others.
                    758: .PP
                    759: As mentioned, each buffer header contains a flag
                    760: word which indicates the status of the buffer.
                    761: Since they provide
                    762: one important channel for information between the drivers and the
                    763: block I/O system, it is important to understand these flags.
                    764: The following names are manifest constants which
                    765: select the associated flag bits.
                    766: .IP B_READ 10
                    767: This bit is set when the buffer is handed to the device strategy routine
                    768: (see below) to indicate a read operation.
                    769: The symbol
                    770: .I B_WRITE
                    771: is defined as 0 and does not define a flag; it is provided
                    772: as a mnemonic convenience to callers of routines like
                    773: .I swap
                    774: which have a separate argument
                    775: which indicates read or write.
                    776: .IP B_DONE 10
                    777: This bit is set
                    778: to 0 when a block is handed to the device strategy
                    779: routine and is turned on when the operation completes,
                    780: whether normally or as the result of an error.
                    781: It is also used as part of the return argument of
                    782: .I getblk
                    783: to indicate, if 1, that the returned
                    784: buffer actually contains the data in the requested block.
                    785: .IP B_ERROR 10
                    786: This bit may be set to 1 when
                    787: .I B_DONE
                    788: is set to indicate that an I/O or other error occurred.
                    789: If it is set the
                    790: .I b_error
                    791: byte of the buffer header may contain an error code
                    792: if it is non-zero.
                    793: If
                    794: .I b_error
                    795: is 0 the nature of the error is not specified.
                    796: Actually no driver at present sets
                    797: .I b_error;
                    798: the latter is provided for a future improvement
                    799: whereby a more detailed error-reporting
                    800: scheme may be implemented.
                    801: .IP B_BUSY 10
                    802: This bit indicates that the buffer header is not on
                    803: the free list, i.e. is
                    804: dedicated to someone's exclusive use.
                    805: The buffer still remains attached to the list of
                    806: blocks associated with its device, however.
                    807: When
                    808: .I getblk
                    809: (or
                    810: .I bread,
                    811: which calls it) searches the buffer list
                    812: for a given device and finds the requested
                    813: block with this bit on, it sleeps until the bit
                    814: clears.
                    815: .IP B_PHYS 10
                    816: This bit is set for raw I/O transactions that
                    817: need to allocate the Unibus map on an 11/70.
                    818: .IP B_MAP 10
                    819: This bit is set on buffers that have the Unibus map allocated,
                    820: so that the
                    821: .I iodone
                    822: routine knows to deallocate the map.
                    823: .IP B_WANTED 10
                    824: This flag is used in conjunction with the
                    825: .I B_BUSY
                    826: bit.
                    827: Before sleeping as described
                    828: just above,
                    829: .I getblk
                    830: sets this flag.
                    831: Conversely, when the block is freed and the busy bit
                    832: goes down (in
                    833: .I brelse)
                    834: a
                    835: .I wakeup
                    836: is given for the block header whenever
                    837: .I B_WANTED
                    838: is on.
                    839: This stratagem avoids the overhead
                    840: of having to call
                    841: .I wakeup
                    842: every time a buffer is freed on the chance that someone
                    843: might want it.
                    844: .IP B_AGE 10
                    845: This bit may be set on buffers just before releasing them; if it
                    846: is on,
                    847: the buffer is placed at the head of the free list, rather than at the
                    848: tail.
                    849: It is a performance heuristic
                    850: used when the caller judges that the same block will not soon be used again.
                    851: .IP B_ASYNC 10
                    852: This bit is set by
                    853: .I bawrite
                    854: to indicate to the appropriate device driver
                    855: that the buffer should be released when the
                    856: write has been finished, usually at interrupt time.
                    857: The difference between
                    858: .I bwrite
                    859: and
                    860: .I bawrite
                    861: is that the former starts I/O, waits until it is done, and
                    862: frees the buffer.
                    863: The latter merely sets this bit and starts I/O.
                    864: The bit indicates that
                    865: .I brelse
                    866: should be called for the buffer on completion.
                    867: .IP B_DELWRI 10
                    868: This bit is set by
                    869: .I bdwrite
                    870: before releasing the buffer.
                    871: When
                    872: .I getblk,
                    873: while searching for a free block,
                    874: discovers the bit is 1 in a buffer it would otherwise grab,
                    875: it causes the block to be written out before reusing it.
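.PP
To make the interplay of B_BUSY, B_WANTED and B_AGE concrete, here
is a minimal sketch of a release routine acting on a free list.
The flag names come from the list above, but the numeric values,
the list head bfreelist, the trimmed-down header, and the counting
stand-in for wakeup are all assumptions made for the example.

```c
#include <stddef.h>

/* Illustrative bit values; only the names are taken from the text. */
#define B_WRITE   0
#define B_READ    01
#define B_DONE    02
#define B_ERROR   04
#define B_BUSY    010
#define B_WANTED  0100
#define B_AGE     0200
#define B_ASYNC   0400
#define B_DELWRI  01000

struct buf {
    int         b_flags;
    struct buf *av_forw;   /* free-list links */
    struct buf *av_back;
};

/* Empty circular free list: the head points at itself. */
struct buf bfreelist = { 0, &bfreelist, &bfreelist };

int wakeups;               /* how many times the stand-in wakeup ran */

void wakeup(struct buf *bp)
{
    (void)bp;
    wakeups++;
}

/* Release a busy buffer onto the free list. */
void brelse(struct buf *bp)
{
    struct buf *after;

    /* Wake sleepers only if someone said they were waiting. */
    if (bp->b_flags & B_WANTED)
        wakeup(bp);

    /* B_AGE: insert at the head (reused first); otherwise at the tail. */
    after = (bp->b_flags & B_AGE) ? &bfreelist : bfreelist.av_back;
    bp->av_forw = after->av_forw;
    bp->av_back = after;
    after->av_forw->av_back = bp;
    after->av_forw = bp;

    bp->b_flags &= ~(B_BUSY | B_WANTED | B_AGE | B_ASYNC);
}
```

Releasing three buffers, the second with B_WANTED and the third with
B_AGE, gives exactly one wakeup and leaves the third buffer at the
head of the list.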
                    876: .SH
                    877: Block Device Drivers
                    878: .PP
                    879: The
                    880: .I bdevsw
                    881: table contains the names of the interface routines
                    882: and that of a table for each block device.
                    883: .PP
                    884: Just as for character devices, block device drivers may supply
                    885: an
                    886: .I open
                    887: and a
                    888: .I close
                    889: routine
                    890: called respectively on each open and on the final close
                    891: of the device.
                    892: Instead of separate read and write routines,
                    893: each block device driver has a
                    894: .I strategy
                    895: routine which is called with a pointer to a buffer
                    896: header as argument.
                    897: As discussed, the buffer header contains
                    898: a read/write flag, the core address,
                    899: the block number, a (negative) word count,
                    900: and the major and minor device number.
                    901: The role of the strategy routine
                    902: is to carry out the operation as requested by the
                    903: information in the buffer header.
                    904: When the transaction is complete the
                    905: .I B_DONE
                    906: (and possibly the
                    907: .I B_ERROR)
                    908: bits should be set.
                    909: Then if the
                    910: .I B_ASYNC
                    911: bit is set,
                    912: .I brelse
                    913: should be called;
                    914: otherwise,
                    915: .I wakeup.
                    916: In cases where the device
                    917: is capable, under error-free operation,
                    918: of transferring fewer words than requested,
                    919: the device's word-count register should be placed
                    920: in the residual count slot of
                    921: the buffer header;
                    922: otherwise, the residual count should be set to 0.
                    923: This particular mechanism is really for the benefit
                    924: of the magtape driver;
                    925: when reading this device
                    926: records shorter than requested are quite normal,
                    927: and the user should be told the actual length of the record.
                    928: .PP
                    929: Although the most usual argument
                    930: to the strategy routines
                    931: is a genuine buffer header allocated as discussed above,
                    932: all that is actually required
                    933: is that the argument be a pointer to a place containing the
                    934: appropriate information.
                    935: For example the
                    936: .I swap
                    937: routine, which manages movement
                    938: of core images to and from the swapping device,
                    939: uses the strategy routine
                    940: for this device.
                    941: Care has to be taken that
                    942: no extraneous bits get turned on in the
                    943: flag word.
                    944: .PP
                    945: The device's table specified by
                    946: .I bdevsw
                    947: has a
                    948: byte to contain an active flag and an error count,
                    949: a pair of links which constitute the
                    950: head of the chain of buffers for the device
                    951: .I "(b_forw, b_back),"
                    952: and a first and last pointer for a device queue.
                    953: Of these things, all are used solely by the device driver
                    954: itself
                    955: except for the buffer-chain pointers.
                    956: Typically the flag encodes the state of the
                    957: device, and is used at a minimum to
                    958: indicate that the device is currently engaged in
                    959: transferring information and no new command should be issued.
                    960: The error count is useful for counting retries
                    961: when errors occur.
                    962: The device queue is used to remember stacked requests;
                    963: in the simplest case it may be maintained as a first-in
                    964: first-out list.
                    965: Since buffers which have been handed over to
                    966: the strategy routines are never
                    967: on the list of free buffers,
                    968: the pointers in the buffer which maintain the free list
                    969: .I "(av_forw, av_back)"
                    970: are also used to contain the pointers
                    971: which maintain the device queues.
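.PP
The simplest queueing discipline mentioned above, first-in
first-out, might look like the following sketch.
The structure name devtab, the member names d_active, d_errcnt,
d_actf and d_actl, and the routines dq_put and dq_get are
assumptions for illustration; the text fixes only b_forw, b_back
and the reuse of av_forw as the queue link.

```c
#include <stddef.h>

struct buf {
    struct buf *av_forw;   /* free-list link, reused as device-queue link */
    long        b_blkno;   /* assumed name: block number of the request */
};

/* Hypothetical per-device table. */
struct devtab {
    char        d_active;  /* non-zero while a transfer is in progress */
    char        d_errcnt;  /* retry counter */
    struct buf *b_forw;    /* head of the chain of buffers for the device */
    struct buf *b_back;
    struct buf *d_actf;    /* first request in the device queue */
    struct buf *d_actl;    /* last request in the device queue */
};

/* Append a request to the tail of the device queue. */
void dq_put(struct devtab *dp, struct buf *bp)
{
    bp->av_forw = NULL;
    if (dp->d_actf == NULL)
        dp->d_actf = bp;
    else
        dp->d_actl->av_forw = bp;
    dp->d_actl = bp;
}

/* Remove and return the oldest request, or NULL if the queue is empty. */
struct buf *dq_get(struct devtab *dp)
{
    struct buf *bp = dp->d_actf;

    if (bp != NULL)
        dp->d_actf = bp->av_forw;
    return bp;
}
```

A strategy routine would typically call something like dq_put and
start the device if it is idle; the interrupt routine would call
something like dq_get to find the next request.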
                    972: .PP
                    973: A couple of routines
                    974: are provided which are useful to block device drivers.
                    975: .I "iodone(bp)"
                    976: arranges that the buffer to which
                    977: .I bp
                    978: points be released or awakened,
                    979: as appropriate,
                    980: when the
                    981: strategy module has finished with the buffer,
                    982: either normally or after an error.
                    983: (In the latter case the
                    984: .I B_ERROR
                    985: bit has presumably been set.)
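.PP
In outline, the completion step might be written as below.
This is a sketch of the behaviour the text describes, not the
system's code: the bit values are illustrative, the header is
reduced to its flag word, and brelse and wakeup are replaced by
stand-ins that merely count their calls.

```c
/* Illustrative bit values; only the names are taken from the text. */
#define B_DONE   02
#define B_ERROR  04
#define B_ASYNC  0400

struct buf {
    int b_flags;
};

int released;   /* calls to the stand-in brelse */
int awakened;   /* calls to the stand-in wakeup */

void brelse(struct buf *bp)
{
    (void)bp;
    released++;
}

void wakeup(struct buf *bp)
{
    (void)bp;
    awakened++;
}

/* Mark the transaction finished and release or awaken as appropriate. */
void iodone(struct buf *bp)
{
    bp->b_flags |= B_DONE;
    if (bp->b_flags & B_ASYNC)
        brelse(bp);     /* nobody is waiting: free the buffer now */
    else
        wakeup(bp);     /* a process is asleep awaiting completion */
}
```

A driver that detected an error would set B_ERROR in the flag word
before making the call.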
                    986: .PP
                    987: The routine
                    988: .I "geterror(bp)"
                    989: can be used to examine the error bit in a buffer header
                    990: and arrange that any error indication found therein is
                    991: reflected to the user.
                    992: It may be called only in the non-interrupt
                    993: part of a driver when I/O has completed
                    994: .I (B_DONE
                    995: has been set).
                    996: .SH
                    997: Raw Block-device I/O
                    998: .PP
                    999: A scheme has been set up whereby block device drivers may
                   1000: provide the ability to transfer information
                   1001: directly between the user's core image and the device
                   1002: without the use of buffers and in blocks as large as
                   1003: the caller requests.
                   1004: The method involves setting up a character-type special file
                   1005: corresponding to the raw device
                   1006: and providing
                   1007: .I read
                   1008: and
                   1009: .I write
                   1010: routines which set up what is usually a private,
                   1011: non-shared buffer header with the appropriate information
                   1012: and call the device's strategy routine.
                   1013: If desired, separate
                   1014: .I open
                   1015: and
                   1016: .I close
                   1017: routines may be provided but this is usually unnecessary.
                   1018: A special-function routine might come in handy, especially for
                   1019: magtape.
                   1020: .PP
                   1021: A great deal of work has to be done to generate the
                   1022: ``appropriate information''
                   1023: to put in the argument buffer for
                   1024: the strategy module;
                   1025: the worst part is to map relocated user addresses to physical addresses.
                   1026: Most of this work is done by
                   1027: .I "physio(strat, bp, dev, rw)"
                   1028: whose arguments are the name of the
                   1029: strategy routine
                   1030: .I strat,
                   1031: the buffer pointer
                   1032: .I bp,
                   1033: the device number
                   1034: .I dev,
                   1035: and a read-write flag
                   1036: .I rw
                   1037: whose value is either
                   1038: .I B_READ
                   1039: or
                   1040: .I B_WRITE.
                   1041: .I Physio
                   1042: makes sure that the user's base address and count are
                   1043: even (because most devices work in words)
                   1044: and that the core area affected is contiguous
                   1045: in physical space;
                   1046: it delays until the buffer is not busy, and makes it
                   1047: busy while the operation is in progress;
                   1048: and it sets up user error return information.
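.PP
A much-reduced model of these steps follows.
It departs from the real routine in ways that should be kept in
mind: the user's base address and byte count are passed as two
extra arguments (base and count) rather than found in per-process
data, a word is taken to be two bytes, the contiguity check and
address mapping are omitted, and where the kernel would sleep the
sketch either fails or simply carries on.
The bit values, the field names, and fake_strategy are likewise
assumptions.

```c
#include <stdint.h>

/* Illustrative bit values; only the names are taken from the text. */
#define B_WRITE 0
#define B_READ  01
#define B_DONE  02
#define B_BUSY  010
#define B_PHYS  020

struct buf {
    int       b_flags;
    int       b_dev;
    uintptr_t b_addr;     /* assumed name: core address of the transfer */
    long      b_wcount;   /* assumed name: negative word count */
};

typedef void (*strategy_t)(struct buf *);

long last_wcount;         /* what the fake device was asked to move */

/* Stand-in strategy routine: completes at once. */
void fake_strategy(struct buf *bp)
{
    last_wcount = bp->b_wcount;
    bp->b_flags |= B_DONE;
}

/* Returns 0 on success, -1 on a bad argument or a busy buffer. */
int physio(strategy_t strat, struct buf *bp, int dev, int rw,
           uintptr_t base, unsigned long count)
{
    if ((base & 1) || (count & 1))
        return -1;                    /* address and count must be even */
    if (bp->b_flags & B_BUSY)
        return -1;                    /* the kernel would sleep instead */

    bp->b_flags = B_BUSY | B_PHYS | rw;
    bp->b_dev = dev;
    bp->b_addr = base;
    bp->b_wcount = -(long)(count / 2);

    strat(bp);
    /* The kernel would sleep here until B_DONE is set. */

    bp->b_flags &= ~(B_BUSY | B_PHYS);
    return 0;
}
```

A raw read routine for a device would then amount to little more
than a call of physio with the device's strategy routine, a private
buffer header, and B_READ.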
