1: .\" @(#)iosys 6.1 (Berkeley) 4/29/86
2: .\"
3: .EH 'PS2:5-%''The UNIX I/O System'
4: .OH 'The UNIX I/O System''PS2:5-%'
5: .TL
6: The UNIX I/O System
7: .AU
8: Dennis M. Ritchie
9: .AI
10: .MH
11: .PP
12: This paper gives an overview of the workings of the UNIX\(dg
13: .FS
14: \(dgUNIX is a Trademark of Bell Laboratories.
15: .FE
16: I/O system.
17: It was written with an eye toward providing
18: guidance to writers of device driver routines,
19: and is oriented more toward describing the environment
20: and nature of device drivers than the implementation
21: of that part of the file system which deals with
22: ordinary files.
23: .PP
24: It is assumed that the reader has a good knowledge
25: of the overall structure of the file system as discussed
26: in the paper ``The UNIX Time-sharing System.''
27: A more detailed discussion
28: appears in
29: ``UNIX Implementation;''
30: the current document restates parts of that one,
31: but is still more detailed.
32: It is most useful in
33: conjunction with a copy of the system code,
34: since it is basically an exegesis of that code.
35: .SH
36: Device Classes
37: .PP
38: There are two classes of device:
39: .I block
40: and
41: .I character.
42: The block interface is suitable for devices
43: like disks, tapes, and DECtape
44: which work, or can work, with addressable 512-byte blocks.
45: Ordinary magnetic tape just barely fits in this category,
46: since by use of forward
47: and
48: backward spacing any block can be read, even though
49: blocks can be written only at the end of the tape.
50: Block devices can at least potentially contain a mounted
51: file system.
52: The interface to block devices is very highly structured;
53: the drivers for these devices share a great many routines
54: as well as a pool of buffers.
55: .PP
56: Character-type devices have a much
57: more straightforward interface, although
58: more work must be done by the driver itself.
59: .PP
60: Devices of both types are named by a
61: .I major
62: and a
63: .I minor
64: device number.
65: These numbers are generally stored as an integer
66: with the minor device number
67: in the low-order 8 bits and the major device number
68: in the next-higher 8 bits;
69: macros
70: .I major
71: and
72: .I minor
73: are available to access these numbers.
74: The major device number selects which driver will deal with
75: the device; the minor device number is not used
76: by the rest of the system but is passed to the
77: driver at appropriate times.
78: Typically the minor number
79: selects a subdevice attached to
80: a given controller, or one of
81: several similar hardware interfaces.
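.PP
The packing just described can be sketched as follows.
This is only an illustration, assuming a 16-bit device number held in an
.I int;
the names
.I dev_major,
.I dev_minor,
and
.I dev_make
are hypothetical stand-ins for the system's own macros,
chosen to avoid clashing with any definitions on the host system.

```c
/* Sketch only: minor number in the low-order 8 bits, major in the next 8. */
#define dev_major(dev)      (((dev) >> 8) & 0377)
#define dev_minor(dev)      ((dev) & 0377)

/* Hypothetical helper that packs a major and minor number together. */
#define dev_make(maj, min)  ((((maj) & 0377) << 8) | ((min) & 0377))
```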
82: .PP
83: The major device numbers for block and character devices
84: are used as indices in separate tables;
85: they both start at 0 and therefore overlap.
86: .SH
87: Overview of I/O
88: .PP
89: The purpose of
90: the
91: .I open
92: and
93: .I creat
94: system calls is to set up entries in three separate
95: system tables.
96: The first of these is the
97: .I u_ofile
98: table,
99: which is stored in the system's per-process
100: data area
101: .I u.
102: This table is indexed by
103: the file descriptor returned by the
104: .I open
105: or
106: .I creat,
107: and is accessed during
108: a
109: .I read,
110: .I write,
111: or other operation on the open file.
112: An entry contains only
113: a pointer to the corresponding
114: entry of the
115: .I file
116: table,
117: which is a per-system data base.
118: There is one entry in the
119: .I file
120: table for each
121: instance of
122: .I open
123: or
124: .I creat.
125: This table is per-system because the same instance
126: of an open file must be shared among the several processes
127: which can result from
128: .I forks
129: after the file is opened.
130: A
131: .I file
132: table entry contains
133: flags which indicate whether the file
134: was open for reading or writing or is a pipe, and
135: a count which is used to decide when all processes
136: using the entry have terminated or closed the file
137: (so the entry can be abandoned).
138: There is also a 32-bit file offset
139: which is used to indicate where in the file the next read
140: or write will take place.
141: Finally, there is a pointer to the
142: entry for the file in the
143: .I inode
144: table,
145: which contains a copy of the file's i-node.
146: .PP
147: Certain open files can be designated ``multiplexed''
148: files, and several other flags apply to such
149: channels.
150: In such a case, instead of an offset,
151: there is a pointer to an associated multiplex channel table.
152: Multiplex channels will not be discussed here.
153: .PP
154: An entry in the
155: .I file
156: table corresponds precisely to an instance of
157: .I open
158: or
159: .I creat;
160: if the same file is opened several times,
161: it will have several
162: entries in this table.
163: However,
164: there is at most one entry
165: in the
166: .I inode
167: table for a given file.
168: Also, a file may enter the
169: .I inode
170: table not only because it is open,
171: but also because it is the current directory
172: of some process or because it
173: is a special file containing a currently-mounted
174: file system.
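.PP
The relationship among the three tables can be sketched as follows.
This is a simplified model, not the system's code:
the table sizes, the field names, and the routines
.I sketch_open
and
.I sketch_fork
are assumptions made for illustration,
and flags, error handling, and closing are omitted.

```c
#include <stddef.h>

#define NOFILE 20   /* assumed size of the per-process table */
#define NFILE  8    /* assumed size of the system file table */

struct inode_ent {              /* at most one entry per file in use */
    int i_count;                /* file table entries pointing here */
};

struct file_ent {               /* one entry per open or creat */
    int f_count;                /* descriptors sharing this entry */
    long f_offset;              /* where the next read or write starts */
    struct inode_ent *f_inode;
};

struct user {                   /* per-process data area */
    struct file_ent *u_ofile[NOFILE];
};

static struct file_ent file_table[NFILE];

/* Allocate a fresh file table entry; return a descriptor, or -1. */
int sketch_open(struct user *u, struct inode_ent *ip)
{
    int fd, i;

    for (fd = 0; fd < NOFILE && u->u_ofile[fd] != NULL; fd++)
        ;
    if (fd == NOFILE)
        return -1;
    for (i = 0; i < NFILE; i++) {
        if (file_table[i].f_count == 0) {
            file_table[i].f_count = 1;
            file_table[i].f_offset = 0;
            file_table[i].f_inode = ip;
            ip->i_count++;
            u->u_ofile[fd] = &file_table[i];
            return fd;
        }
    }
    return -1;
}

/* A fork copies the descriptor table; the file entries are shared. */
void sketch_fork(const struct user *parent, struct user *child)
{
    int fd;

    for (fd = 0; fd < NOFILE; fd++) {
        child->u_ofile[fd] = parent->u_ofile[fd];
        if (child->u_ofile[fd] != NULL)
            child->u_ofile[fd]->f_count++;
    }
}
```

In this model, opening the same file twice yields two file table entries
with independent offsets but a single i-node entry,
while a fork leaves parent and child sharing one offset.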
175: .PP
176: An entry in the
177: .I inode
178: table differs somewhat from the
179: corresponding i-node as stored on the disk;
180: the modified and accessed times are not stored,
181: and the entry is augmented
182: by a flag word containing information about the entry,
183: a count used to determine when it may be
184: allowed to disappear,
185: and the device and i-number
186: whence the entry came.
187: Also, the several block numbers that give addressing
188: information for the file are expanded from
189: the 3-byte, compressed format used on the disk to full
190: .I long
191: quantities.
192: .PP
193: During the processing of an
194: .I open
195: or
196: .I creat
197: call for a special file,
198: the system always calls the device's
199: .I open
200: routine to allow for any special processing
201: required (rewinding a tape, turning on
202: the data-terminal-ready lead of a modem, etc.).
203: However,
204: the
205: .I close
206: routine is called only when the last
207: process closes a file,
208: that is, when the i-node table entry
209: is being deallocated.
210: Thus it is not feasible
211: for a device to maintain, or depend on,
212: a count of its users, although it is quite
213: possible to
214: implement an exclusive-use device which cannot
215: be reopened until it has been closed.
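.PP
An exclusive-use device of the kind just mentioned might be sketched as follows.
The driver prefix
.I xx,
the error value, and the convention of returning the error
from the routine are assumptions made for this sketch;
a real driver reports errors through the per-process data area.

```c
#define XX_EBUSY 1          /* hypothetical error code for the sketch */

static int xx_inuse;        /* non-zero while the device is open */

/* Called on every open: refuse if the device is already in use. */
int xx_open(int dev, int flag)
{
    (void)dev;
    (void)flag;
    if (xx_inuse)
        return XX_EBUSY;
    xx_inuse = 1;
    return 0;
}

/* Called only on the final close, so a simple flag is enough here,
   whereas a count of opens could never be brought back to zero. */
int xx_close(int dev, int flag)
{
    (void)dev;
    (void)flag;
    xx_inuse = 0;
    return 0;
}
```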
216: .PP
217: When a
218: .I read
219: or
220: .I write
221: takes place,
222: the user's arguments
223: and the
224: .I file
225: table entry are used to set up the
226: variables
227: .I u.u_base,
228: .I u.u_count,
229: and
230: .I u.u_offset
231: which respectively contain the (user) address
232: of the I/O target area, the byte-count for the transfer,
233: and the current location in the file.
234: If the file referred to is
235: a character-type special file, the appropriate read
236: or write routine is called; it is responsible
237: for transferring data and updating the
238: count and current location appropriately
239: as discussed below.
240: Otherwise, the current location is used to calculate
241: a logical block number in the file.
242: If the file is an ordinary file the logical block
243: number must be mapped (possibly using indirect blocks)
244: to a physical block number; a block-type
245: special file need not be mapped.
246: This mapping is performed by the
247: .I bmap
248: routine.
249: In any event, the resulting physical block number
250: is used, as discussed below, to
251: read or write the appropriate device.
252: .SH
253: Character Device Drivers
254: .PP
255: The
256: .I cdevsw
257: table specifies the interface routines present for
258: character devices.
259: Each device provides five routines:
260: open, close, read, write, and special-function
261: (to implement the
262: .I ioctl
263: system call).
264: Any of these may be missing.
265: If a call on the routine
266: should be ignored,
267: (e.g.
268: .I open
269: on non-exclusive devices that require no setup)
270: the
271: .I cdevsw
272: entry can be given as
273: .I nulldev;
274: if it should be considered an error,
275: (e.g.
276: .I write
277: on read-only devices)
278: .I nodev
279: is used.
280: For terminals,
281: the
282: .I cdevsw
283: structure also contains a pointer to the
284: .I tty
285: structure associated with the terminal.
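.PP
A switch table of this kind can be sketched as an array of structures
holding function pointers.
The sketch below is a simplification under stated assumptions:
every routine is given the same two-argument signature,
errors are returned rather than recorded in the per-process area,
the error value is invented,
and the names
.I cdevsw_ent,
.I cdevsw_tab,
.I lp_write,
.I dev_read,
and
.I dev_write
are hypothetical.

```c
#define SK_ENODEV 19    /* hypothetical error value for the sketch */

typedef int (*devfn)(int dev, int arg);

struct cdevsw_ent {
    devfn d_open, d_close, d_read, d_write, d_sgtty;
};

/* A call that should be silently ignored. */
int nulldev(int dev, int arg) { (void)dev; (void)arg; return 0; }

/* A call that should be treated as an error. */
int nodev(int dev, int arg) { (void)dev; (void)arg; return SK_ENODEV; }

static int lp_writes;   /* counts calls, so the dispatch can be observed */

int lp_write(int dev, int arg) { (void)dev; (void)arg; lp_writes++; return 0; }

struct cdevsw_ent cdevsw_tab[] = {
    /* major 0: a write-only device that needs no setup on open */
    { nulldev, nulldev, nodev, lp_write, nodev },
};

/* The major number, held in the high-order byte, selects the table entry. */
int dev_read(int dev)  { return (*cdevsw_tab[(dev >> 8) & 0377].d_read)(dev, 0); }
int dev_write(int dev) { return (*cdevsw_tab[(dev >> 8) & 0377].d_write)(dev, 0); }
```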
286: .PP
287: The
288: .I open
289: routine is called each time the file
290: is opened with the full device number as argument.
291: The second argument is a flag which is
292: non-zero only if the device is to be written upon.
293: .PP
294: The
295: .I close
296: routine is called only when the file
297: is closed for the last time,
298: that is, when the very last process in
299: which the file is open closes it.
300: This means it is not possible for the driver to
301: maintain its own count of its users.
302: The first argument is the device number;
303: the second is a flag which is non-zero
304: if the file was open for writing in the process which
305: performs the final
306: .I close.
307: .PP
308: When
309: .I write
310: is called, it is supplied the device
311: as argument.
312: The per-user variable
313: .I u.u_count
314: has been set to
315: the number of characters indicated by the user;
316: for character devices, this number may be 0
317: initially.
318: .I u.u_base
319: is the address supplied by the user from which to start
320: taking characters.
321: The system may call the
322: routine internally, so the
323: flag
324: .I u.u_segflg
325: is supplied that indicates,
326: if
327: .I on,
328: that
329: .I u.u_base
330: refers to the system address space instead of
331: the user's.
332: .PP
333: The
334: .I write
335: routine
336: should copy up to
337: .I u.u_count
338: characters from the user's buffer to the device,
339: decrementing
340: .I u.u_count
341: for each character passed.
342: For most drivers, which work one character at a time,
343: the routine
344: .I "cpass( )"
345: is used to pick up characters
346: from the user's buffer.
347: Successive calls on it return
348: the characters to be written until
349: .I u.u_count
350: goes to 0 or an error occurs,
351: when it returns \(mi1.
352: .I Cpass
353: takes care of interrogating
354: .I u.u_segflg
355: and updating
356: .I u.u_count.
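.PP
The character-at-a-time loop can be sketched as follows.
So that the fragment stands alone,
.I cpass
is replaced by a stand-in,
.I cpass_sk,
that reads from an ordinary array and ignores
.I u.u_segflg;
the device is a hypothetical one, with prefix
.I xx,
that merely collects its output in a small buffer.

```c
struct user_sk {
    const char *u_base;     /* next character to take */
    int u_count;            /* characters still to transfer */
};

static struct user_sk u;

/* Stand-in for cpass: next character, or -1 once u_count reaches 0. */
int cpass_sk(void)
{
    if (u.u_count <= 0)
        return -1;
    u.u_count--;
    return *u.u_base++ & 0377;
}

static char xx_out[64];     /* pretend device: collects output */
static int xx_nout;

/* Hypothetical output routine; silently drops characters when full. */
void xx_output(int c)
{
    if (xx_nout < (int)sizeof xx_out)
        xx_out[xx_nout++] = (char)c;
}

/* Write routine: pass characters one at a time until cpass runs dry. */
void xx_write(int dev)
{
    int c;

    (void)dev;
    while ((c = cpass_sk()) >= 0)
        xx_output(c);
}
```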
357: .PP
358: Write routines which want to transfer
359: a probably large number of characters into an internal
360: buffer may also use the routine
361: .I "iomove(buffer, offset, count, flag)"
362: which is faster when many characters must be moved.
363: .I Iomove
364: transfers up to
365: .I count
366: characters into the
367: .I buffer
368: starting
369: .I offset
370: bytes from the start of the buffer;
371: .I flag
372: should be
373: .I B_WRITE
374: (which is 0) in the write case.
375: Caution:
376: the caller is responsible for making sure
377: the count is not too large and is non-zero.
378: As an efficiency note,
379: .I iomove
380: is much slower if any of
381: .I "buffer+offset, count"
382: or
383: .I u.u_base
384: is odd.
385: .PP
386: The device's
387: .I read
388: routine is called under conditions similar to
389: .I write,
390: except that
391: .I u.u_count
392: is guaranteed to be non-zero.
393: To return characters to the user, the routine
394: .I "passc(c)"
395: is available; it takes care of housekeeping
396: like
397: .I cpass
398: and returns \(mi1 as the last character
399: specified by
400: .I u.u_count
401: is returned to the user;
402: before that time, 0 is returned.
403: .I Iomove
404: is also usable as with
405: .I write;
406: the flag should be
407: .I B_READ
408: but the same cautions apply.
409: .PP
410: The ``special-functions'' routine
411: is invoked by the
412: .I stty
413: and
414: .I gtty
415: system calls as follows:
416: .I "(*p) (dev, v)"
417: where
418: .I p
419: is a pointer to the device's routine,
420: .I dev
421: is the device number,
422: and
423: .I v
424: is a vector.
425: In the
426: .I gtty
427: case,
428: the device is supposed to place up to 3 words of status information
429: into the vector; this will be returned to the caller.
430: In the
431: .I stty
432: case,
433: .I v
434: is 0;
435: the device should take up to 3 words of
436: control information from
437: the array
438: .I "u.u_arg[0...2]."
439: .PP
440: Finally, each device should have appropriate interrupt-time
441: routines.
442: When an interrupt occurs, it is turned into a C-compatible call
443: on the device's interrupt routine.
444: The interrupt-catching mechanism makes
445: the low-order four bits of the ``new PS'' word in the
446: trap vector for the interrupt available
447: to the interrupt handler.
448: This is conventionally used by drivers
449: which deal with multiple similar devices
450: to encode the minor device number.
451: After the interrupt has been processed,
452: a return from the interrupt handler will
453: return from the interrupt itself.
454: .PP
455: A number of subroutines are available which are useful
456: to character device drivers.
457: Most of these handlers, for example, need a place
458: to buffer characters in the internal interface
459: between their ``top half'' (read/write)
460: and ``bottom half'' (interrupt) routines.
461: For relatively low data-rate devices, the best mechanism
462: is the character queue maintained by the
463: routines
464: .I getc
465: and
466: .I putc.
467: A queue header has the structure
468: .DS
469: struct {
470: int c_cc; /* character count */
471: char *c_cf; /* first character */
472: char *c_cl; /* last character */
473: } queue;
474: .DE
475: A character is placed on the end of a queue by
476: .I "putc(c, &queue)"
477: where
478: .I c
479: is the character and
480: .I queue
481: is the queue header.
482: The routine returns \(mi1 if there is no space
483: to put the character, 0 otherwise.
484: The first character on the queue may be retrieved
485: by
486: .I "getc(&queue)"
487: which returns either the (non-negative) character
488: or \(mi1 if the queue is empty.
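.PP
The behavior of the two routines, as seen by a driver, can be sketched as follows.
This is not the system's implementation, which draws its storage
from a pool shared by all queues;
the sketch assumes a small fixed array per queue,
and the names
.I q_putc,
.I q_getc,
and
.I cqueue
are chosen to avoid clashing with the standard I/O library.

```c
#define QSIZE 8     /* assumed capacity of one queue in this sketch */

struct cqueue {
    int c_cc;               /* character count */
    int c_head;             /* index of the first character */
    char c_buf[QSIZE];
};

/* Place c at the end of the queue; -1 if there is no space, else 0. */
int q_putc(int c, struct cqueue *q)
{
    if (q->c_cc >= QSIZE)
        return -1;
    q->c_buf[(q->c_head + q->c_cc) % QSIZE] = (char)c;
    q->c_cc++;
    return 0;
}

/* Remove and return the first character, or -1 if the queue is empty. */
int q_getc(struct cqueue *q)
{
    int c;

    if (q->c_cc == 0)
        return -1;
    c = q->c_buf[q->c_head] & 0377;
    q->c_head = (q->c_head + 1) % QSIZE;
    q->c_cc--;
    return c;
}
```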
489: .PP
490: Notice that the space for characters in queues is
491: shared among all devices in the system
492: and in the standard system there are only some 600
493: character slots available.
494: Thus device handlers,
495: especially write routines, must take
496: care to avoid gobbling up excessive numbers of characters.
497: .PP
498: The other major help available
499: to device handlers is the sleep-wakeup mechanism.
500: The call
501: .I "sleep(event, priority)"
502: causes the process to wait (allowing other processes to run)
503: until the
504: .I event
505: occurs;
506: at that time, the process is marked ready-to-run
507: and the call will return when there is no
508: process with higher
509: .I priority.
510: .PP
511: The call
512: .I "wakeup(event)"
513: indicates that the
514: .I event
515: has happened, that is, causes processes sleeping
516: on the event to be awakened.
517: The
518: .I event
519: is an arbitrary quantity agreed upon
520: by the sleeper and the waker-up.
521: By convention, it is the address of some data area used
522: by the driver, which guarantees that events
523: are unique.
524: .PP
525: Processes sleeping on an event should not assume
526: that the event has really happened;
527: they should check that the conditions which
528: caused them to sleep no longer hold.
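.PP
In code, this advice amounts to calling
.I sleep
inside a loop that re-tests the condition.
The sketch below cannot call the real routine,
so it substitutes a stub,
.I sleep_sk,
whose first return models a wakeup given for some other reason;
the flag
.I xx_busy
and the priority value 20 are assumptions made for illustration.

```c
static int xx_busy = 1;     /* the condition being waited on */
static int sleep_calls;     /* how many times the stub was entered */

/* Stub for sleep(event, priority): the first return leaves the
   condition unchanged; the second models the genuine wakeup. */
void sleep_sk(void *event, int pri)
{
    (void)event;
    (void)pri;
    if (++sleep_calls >= 2)
        xx_busy = 0;
}

/* Re-test the condition after every return from sleep. */
void xx_wait(void)
{
    while (xx_busy)
        sleep_sk(&xx_busy, 20);
}
```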
529: .PP
530: Priorities can range from 0 to 127;
531: a higher numerical value indicates a less-favored
532: scheduling situation.
533: A distinction is made between processes sleeping
534: at priority less than the parameter
535: .I PZERO
536: and those at numerically larger priorities.
537: The former cannot
538: be interrupted by signals, although it
539: is conceivable that they may be swapped out.
540: Thus it is a bad idea to sleep with
541: priority less than PZERO on an event which might never occur.
542: On the other hand, calls to
543: .I sleep
544: with larger priority
545: may never return if the process is terminated by
546: some signal in the meantime.
547: Incidentally, it is a gross error to call
548: .I sleep
549: in a routine called at interrupt time, since the process
550: which is running is almost certainly not the
551: process which should go to sleep.
552: Likewise, none of the variables in the user area
553: ``\fIu\fB.\fR''
554: should be touched, let alone changed, by an interrupt routine.
555: .PP
556: If a device driver
557: wishes to wait for some event for which it is inconvenient
558: or impossible to supply a
559: .I wakeup,
560: (for example, a device going on-line, which does not
561: generally cause an interrupt),
562: the call
563: .I "sleep(&lbolt, priority)"
564: may be given.
565: .I Lbolt
566: is an external cell whose address is awakened once every 4 seconds
567: by the clock interrupt routine.
568: .PP
569: The routines
570: .I "spl4( ), spl5( ), spl6( ), spl7( )"
571: are available to
572: set the processor priority level as indicated to avoid
573: inconvenient interrupts from the device.
574: .PP
575: If a device needs to know about real-time intervals,
576: then
577: .I "timeout(func, arg, interval)"
578: will be useful.
579: This routine arranges that after
580: .I interval
581: sixtieths of a second, the
582: .I func
583: will be called with
584: .I arg
585: as argument, in the style
586: .I "(*func)(arg)."
587: Timeouts are used, for example,
588: to provide real-time delays after function characters
589: like new-line and tab in typewriter output,
590: and to terminate an attempt to
591: read the 201 Dataphone
592: .I dp
593: if there is no response within a specified number
594: of seconds.
595: Notice that the number of sixtieths of a second is limited to 32767,
596: since it must appear to be positive,
597: and that only a bounded number of timeouts
598: can be going on at once.
599: Also, the specified
600: .I func
601: is called at clock-interrupt time, so it should
602: conform to the requirements of interrupt routines
603: in general.
604: .SH
605: The Block-device Interface
606: .PP
607: Handling of block devices is mediated by a collection
608: of routines that manage a set of buffers containing
609: the images of blocks of data on the various devices.
610: The most important purpose of these routines is to assure
611: that several processes that access the same block of the same
612: device in multiprogrammed fashion maintain a consistent
613: view of the data in the block.
614: A secondary but still important purpose is to increase
615: the efficiency of the system by
616: keeping in-core copies of blocks that are being
617: accessed frequently.
618: The main data base for this mechanism is the
619: table of buffers
620: .I buf.
621: Each buffer header contains a pair of pointers
622: .I "(b_forw, b_back)"
623: which maintain a doubly-linked list
624: of the buffers associated with a particular
625: block device, and a
626: pair of pointers
627: .I "(av_forw, av_back)"
628: which generally maintain a doubly-linked list of blocks
629: which are ``free,'' that is,
630: eligible to be reallocated for another transaction.
631: Buffers that have I/O in progress
632: or are busy for other purposes do not appear in this list.
633: The buffer header
634: also contains the device and block number to which the
635: buffer refers, and a pointer to the actual storage associated with
636: the buffer.
637: There is a word count
638: which is the negative of the number of words
639: to be transferred to or from the buffer;
640: there is also an error byte and a residual word
641: count used to communicate information
642: from an I/O routine to its caller.
643: Finally, there is a flag word
644: with bits indicating the status of the buffer.
645: These flags will be discussed below.
646: .PP
647: Seven routines constitute
648: the most important part of the interface with the
649: rest of the system.
650: Given a device and block number,
651: both
652: .I bread
653: and
654: .I getblk
655: return a pointer to a buffer header for the block;
656: the difference is that
657: .I bread
658: is guaranteed to return a buffer actually containing the
659: current data for the block,
660: while
661: .I getblk
662: returns a buffer which contains the data in the
663: block only if it is already in core (whether it is
664: or not is indicated by the
665: .I B_DONE
666: bit; see below).
667: In either case the buffer, and the corresponding
668: device block, is made ``busy,''
669: so that other processes referring to it
670: are obliged to wait until it becomes free.
671: .I Getblk
672: is used, for example,
673: when a block is about to be totally rewritten,
674: so that its previous contents are
675: not useful;
676: still, no other process can be allowed to refer to the block
677: until the new data is placed into it.
678: .PP
679: The
680: .I breada
681: routine is used to implement read-ahead.
682: It is logically similar to
683: .I bread,
684: but takes as an additional argument the number of
685: a block (on the same device) to be read asynchronously
686: after the specifically requested block is available.
687: .PP
688: Given a pointer to a buffer,
689: the
690: .I brelse
691: routine
692: makes the buffer again available to other processes.
693: It is called, for example, after
694: data has been extracted following a
695: .I bread.
696: There are three subtly-different write routines,
697: all of which take a buffer pointer as argument,
698: and all of which logically release the buffer for
699: use by others and place it on the free list.
700: .I Bwrite
701: puts the
702: buffer on the appropriate device queue,
703: waits for the write to be done,
704: and sets the user's error flag if required.
705: .I Bawrite
706: places the buffer on the device's queue, but does not wait
707: for completion, so that errors cannot be reflected directly to
708: the user.
709: .I Bdwrite
710: does not start any I/O operation at all,
711: but merely marks
712: the buffer so that if it happens
713: to be grabbed from the free list to contain
714: data from some other block, the data in it will
715: first be written
716: out.
717: .PP
718: .I Bwrite
719: is used when one wants to be sure that
720: I/O takes place correctly, and that
721: errors are reflected to the proper user;
722: it is used, for example, when updating i-nodes.
723: .I Bawrite
724: is useful when more overlap is desired
725: (because no wait is required for I/O to finish)
726: but when it is reasonably certain that the
727: write is really required.
728: .I Bdwrite
729: is used when there is doubt that the write is
730: needed at the moment.
731: For example,
732: .I bdwrite
733: is called when the last byte of a
734: .I write
735: system call falls short of the end of a
736: block, on the assumption that
737: another
738: .I write
739: will be given soon which will re-use the same block.
740: On the other hand,
741: as the end of a block is passed,
742: .I bawrite
743: is called, since probably the block will
744: not be accessed again soon and one might as
745: well start the writing process as soon as possible.
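.PP
The choice between the two routines can be sketched as a test on where
the data just written ends.
The function
.I choose_write
and its enumeration are hypothetical,
and the sketch assumes it is applied to one piece of a
.I write
that lies within a single 512-byte block.

```c
#define BSIZE 512   /* block size stated in the text */

enum wkind { DELAYED_WRITE, ASYNC_WRITE };

/* offset: file offset at which this piece starts;
   n: number of bytes placed in the block (n > 0). */
enum wkind choose_write(long offset, int n)
{
    long end = offset + n;

    /* Reached the end of the block: start the write now (bawrite).
       Otherwise another write will probably follow (bdwrite). */
    return (end % BSIZE == 0) ? ASYNC_WRITE : DELAYED_WRITE;
}
```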
746: .PP
747: In any event, notice that the routines
748: .I "getblk"
749: and
750: .I bread
751: dedicate the given block exclusively to the
752: use of the caller, and make others wait,
753: while one of
754: .I "brelse, bwrite, bawrite,"
755: or
756: .I bdwrite
757: must eventually be called to free the block for use by others.
758: .PP
759: As mentioned, each buffer header contains a flag
760: word which indicates the status of the buffer.
761: Since they provide
762: one important channel for information between the drivers and the
763: block I/O system, it is important to understand these flags.
764: The following names are manifest constants which
765: select the associated flag bits.
766: .IP B_READ 10
767: This bit is set when the buffer is handed to the device strategy routine
768: (see below) to indicate a read operation.
769: The symbol
770: .I B_WRITE
771: is defined as 0 and does not define a flag; it is provided
772: as a mnemonic convenience to callers of routines like
773: .I swap
774: which have a separate argument
775: which indicates read or write.
776: .IP B_DONE 10
777: This bit is set
778: to 0 when a block is handed to the device strategy
779: routine and is turned on when the operation completes,
780: whether normally or as the result of an error.
781: It is also used as part of the return argument of
782: .I getblk
783: to indicate, if 1, that the returned
784: buffer actually contains the data in the requested block.
785: .IP B_ERROR 10
786: This bit may be set to 1 when
787: .I B_DONE
788: is set to indicate that an I/O or other error occurred.
789: If it is set the
790: .I b_error
791: byte of the buffer header may contain an error code
792: if it is non-zero.
793: If
794: .I b_error
795: is 0 the nature of the error is not specified.
796: Actually no driver at present sets
797: .I b_error;
798: the latter is provided for a future improvement
799: whereby a more detailed error-reporting
800: scheme may be implemented.
801: .IP B_BUSY 10
802: This bit indicates that the buffer header is not on
803: the free list, i.e. is
804: dedicated to someone's exclusive use.
805: The buffer still remains attached to the list of
806: blocks associated with its device, however.
807: When
808: .I getblk
809: (or
810: .I bread,
811: which calls it) searches the buffer list
812: for a given device and finds the requested
813: block with this bit on, it sleeps until the bit
814: clears.
815: .IP B_PHYS 10
816: This bit is set for raw I/O transactions that
817: need to allocate the Unibus map on an 11/70.
818: .IP B_MAP 10
819: This bit is set on buffers that have the Unibus map allocated,
820: so that the
821: .I iodone
822: routine knows to deallocate the map.
823: .IP B_WANTED 10
824: This flag is used in conjunction with the
825: .I B_BUSY
826: bit.
827: Before sleeping as described
828: just above,
829: .I getblk
830: sets this flag.
831: Conversely, when the block is freed and the busy bit
832: goes down (in
833: .I brelse)
834: a
835: .I wakeup
836: is given for the block header whenever
837: .I B_WANTED
838: is on.
839: This stratagem avoids the overhead
840: of having to call
841: .I wakeup
842: every time a buffer is freed on the chance that someone
843: might want it.
844: .IP B_AGE 10
845: This bit may be set on buffers just before releasing them; if it
846: is on,
847: the buffer is placed at the head of the free list, rather than at the
848: tail.
849: It is a performance heuristic
850: used when the caller judges that the same block will not soon be used again.
851: .IP B_ASYNC 10
852: This bit is set by
853: .I bawrite
854: to indicate to the appropriate device driver
855: that the buffer should be released when the
856: write has been finished, usually at interrupt time.
857: The difference between
858: .I bwrite
859: and
860: .I bawrite
861: is that the former starts I/O, waits until it is done, and
862: frees the buffer.
863: The latter merely sets this bit and starts I/O.
864: The bit indicates that
865: .I brelse
866: should be called for the buffer on completion.
867: .IP B_DELWRI 10
868: This bit is set by
869: .I bdwrite
870: before releasing the buffer.
871: When
872: .I getblk,
873: while searching for a free block,
874: discovers the bit is 1 in a buffer it would otherwise grab,
875: it causes the block to be written out before reusing it.
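.PP
The interplay of
.I B_BUSY
and
.I B_WANTED
described above can be sketched as follows.
The bit values are assumed for illustration,
the routines carry an
.I _sk
suffix to mark them as stand-ins,
and
.I wakeup_sk
merely counts its calls so that the saving can be observed.

```c
#define B_BUSY   010    /* assumed bit values, for illustration only */
#define B_WANTED 0100

struct buf_sk {
    int b_flags;
};

static int wakeups;     /* number of wakeups actually issued */

void wakeup_sk(void *event) { (void)event; wakeups++; }

/* What getblk does on finding the block busy, just before sleeping. */
void mark_wanted(struct buf_sk *bp)
{
    bp->b_flags |= B_WANTED;
}

/* Release a buffer: issue a wakeup only if someone asked for one. */
void brelse_sk(struct buf_sk *bp)
{
    if (bp->b_flags & B_WANTED)
        wakeup_sk(bp);
    bp->b_flags &= ~(B_WANTED | B_BUSY);
}
```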
876: .SH
877: Block Device Drivers
878: .PP
879: The
880: .I bdevsw
881: table contains the names of the interface routines
882: and that of a table for each block device.
883: .PP
884: Just as for character devices, block device drivers may supply
885: an
886: .I open
887: and a
888: .I close
889: routine
890: called respectively on each open and on the final close
891: of the device.
892: Instead of separate read and write routines,
893: each block device driver has a
894: .I strategy
895: routine which is called with a pointer to a buffer
896: header as argument.
897: As discussed, the buffer header contains
898: a read/write flag, the core address,
899: the block number, a (negative) word count,
900: and the major and minor device number.
901: The role of the strategy routine
902: is to carry out the operation as requested by the
903: information in the buffer header.
904: When the transaction is complete the
905: .I B_DONE
906: (and possibly the
907: .I B_ERROR)
908: bits should be set.
909: Then if the
910: .I B_ASYNC
911: bit is set,
912: .I brelse
913: should be called;
914: otherwise,
915: .I wakeup.
916: In cases where the device
917: is capable, under error-free operation,
918: of transferring fewer words than requested,
919: the device's word-count register should be placed
920: in the residual count slot of
921: the buffer header;
922: otherwise, the residual count should be set to 0.
923: This particular mechanism is really for the benefit
924: of the magtape driver;
925: when reading this device
926: records shorter than requested are quite normal,
927: and the user should be told the actual length of the record.
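.PP
The completion steps just described can be sketched as follows.
The routine
.I xx_complete,
the structure
.I buf_sk,
and the bit values are assumptions made for illustration;
.I brelse_sk
and
.I wakeup_sk
are stubs that only count their calls.

```c
#define B_DONE  02      /* assumed bit values, for illustration only */
#define B_ERROR 04
#define B_ASYNC 0400

struct buf_sk {
    int b_flags;
    int b_resid;        /* residual count reported to the caller */
};

static int released, awakened;

void brelse_sk(struct buf_sk *bp) { (void)bp; released++; }
void wakeup_sk(void *event) { (void)event; awakened++; }

/* Called when a transfer finishes.  error is non-zero on failure;
   resid is the amount left untransferred, or 0 if none. */
void xx_complete(struct buf_sk *bp, int error, int resid)
{
    bp->b_flags |= B_DONE;
    if (error)
        bp->b_flags |= B_ERROR;
    bp->b_resid = resid;
    if (bp->b_flags & B_ASYNC)
        brelse_sk(bp);      /* nobody is waiting: just release it */
    else
        wakeup_sk(bp);      /* a process is sleeping on this buffer */
}
```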
928: .PP
929: Although the most usual argument
930: to the strategy routines
931: is a genuine buffer header allocated as discussed above,
932: all that is actually required
933: is that the argument be a pointer to a place containing the
934: appropriate information.
935: For example, the
936: .I swap
937: routine, which manages movement
938: of core images to and from the swapping device,
939: uses the strategy routine
940: for this device.
941: Care has to be taken that
942: no extraneous bits get turned on in the
943: flag word.
944: .PP
945: The device's table specified by
946: .I bdevsw
947: has a
948: byte to contain an active flag and an error count,
949: a pair of links which constitute the
950: head of the chain of buffers for the device
951: .I "(b_forw, b_back),"
952: and a first and last pointer for a device queue.
953: Of these things, all are used solely by the device driver
954: itself
955: except for the buffer-chain pointers.
956: Typically the flag encodes the state of the
957: device, and is used at a minimum to
958: indicate that the device is currently engaged in
959: transferring information and no new command should be issued.
960: The error count is useful for counting retries
961: when errors occur.
962: The device queue is used to remember stacked requests;
963: in the simplest case it may be maintained as a first-in
964: first-out list.
965: Since buffers which have been handed over to
966: the strategy routines are never
967: on the list of free buffers,
968: the pointers in the buffer which maintain the free list
969: .I "(av_forw, av_back)"
970: are also used to contain the pointers
971: which maintain the device queues.
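.PP
A first-in first-out device queue threaded through
.I av_forw
can be sketched as follows.
The structure
.I devtab_sk,
its field names, and the routines
.I dq_append
and
.I dq_remove
are assumptions made for illustration;
a real driver would also guard these operations against interrupts.

```c
#include <stddef.h>

struct buf_sk {
    struct buf_sk *av_forw;     /* doubles as the device-queue link */
    int b_blkno;
};

struct devtab_sk {
    int d_active;               /* non-zero while a transfer is under way */
    struct buf_sk *d_actf;      /* first request on the queue */
    struct buf_sk *d_actl;      /* last request on the queue */
};

/* Add a request at the tail of the device queue. */
void dq_append(struct devtab_sk *dp, struct buf_sk *bp)
{
    bp->av_forw = NULL;
    if (dp->d_actf == NULL)
        dp->d_actf = bp;
    else
        dp->d_actl->av_forw = bp;
    dp->d_actl = bp;
}

/* Remove and return the request at the head, or NULL if none. */
struct buf_sk *dq_remove(struct devtab_sk *dp)
{
    struct buf_sk *bp = dp->d_actf;

    if (bp != NULL)
        dp->d_actf = bp->av_forw;
    return bp;
}
```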
972: .PP
973: A couple of routines
974: are provided which are useful to block device drivers.
975: .I "iodone(bp)"
976: arranges that the buffer to which
977: .I bp
978: points be released or awakened,
979: as appropriate,
980: when the
981: strategy module has finished with the buffer,
982: either normally or after an error.
983: (In the latter case the
984: .I B_ERROR
985: bit has presumably been set.)
986: .PP
987: The routine
988: .I "geterror(bp)"
989: can be used to examine the error bit in a buffer header
990: and arrange that any error indication found therein is
991: reflected to the user.
992: It may be called only in the non-interrupt
993: part of a driver when I/O has completed
994: .I (B_DONE
995: has been set).
996: .SH
997: Raw Block-device I/O
998: .PP
999: A scheme has been set up whereby block device drivers may
1000: provide the ability to transfer information
1001: directly between the user's core image and the device
1002: without the use of buffers and in blocks as large as
1003: the caller requests.
1004: The method involves setting up a character-type special file
1005: corresponding to the raw device
1006: and providing
1007: .I read
1008: and
1009: .I write
1010: routines which set up what is usually a private,
1011: non-shared buffer header with the appropriate information
1012: and call the device's strategy routine.
1013: If desired, separate
1014: .I open
1015: and
1016: .I close
1017: routines may be provided but this is usually unnecessary.
1018: A special-function routine might come in handy, especially for
1019: magtape.
1020: .PP
1021: A great deal of work has to be done to generate the
1022: ``appropriate information''
1023: to put in the argument buffer for
1024: the strategy module;
1025: the worst part is to map relocated user addresses to physical addresses.
1026: Most of this work is done by
1027: .I "physio(strat, bp, dev, rw)"
1028: whose arguments are the name of the
1029: strategy routine
1030: .I strat,
1031: the buffer pointer
1032: .I bp,
1033: the device number
1034: .I dev,
1035: and a read-write flag
1036: .I rw
1037: whose value is either
1038: .I B_READ
1039: or
1040: .I B_WRITE.
1041: .I Physio
1042: makes sure that the user's base address and count are
1043: even (because most devices work in words)
1044: and that the core area affected is contiguous
1045: in physical space;
1046: it delays until the buffer is not busy, and makes it
1047: busy while the operation is in progress;
1048: and it sets up user error return information.