|
|
1.1 root 1: .\" @(#)implement 6.2 (Berkeley) 6/17/86
2: .\"
3: .EH 'PS2:4-%''UNIX Implementation'
4: .OH 'UNIX Implementation''PS2:4-%'
5: .de P1
6: .DS
7: ..
8: .de P2
9: .DE
10: ..
11: .de UL
12: .lg 0
13: .if n .ul
14: \%\&\\$3\f3\\$1\fR\&\\$2
15: .lg
16: ..
17: .de UC
18: \&\\$3\s-1\\$1\\s0\&\\$2
19: ..
20: .de IT
21: .lg 0
22: .if n .ul
23: \%\&\\$3\f2\\$1\fR\&\\$2
24: .lg
25: ..
26: .de SP
27: .sp \\$1
28: ..
29: .hw device
30: .TL
31: UNIX Implementation
32: .AU "MH 2C-523" 2394
33: K. Thompson
34: .AI
35: .MH
36: .AB
37: This paper describes in high-level terms the
38: implementation of the resident
39: .UX
40: kernel.
41: This discussion is broken into three parts.
42: The first part describes
43: how the
44: .UX
45: system views processes, users, and programs.
46: The second part describes the I/O system.
47: The last part describes the
48: .UX
49: file system.
50: .AE
51: .NH
52: INTRODUCTION
53: .PP
54: The
55: .UX
56: kernel consists of about 10,000
57: lines of C code and about 1,000 lines of assembly code.
58: The assembly code can be further broken down into
59: 200 lines included for
60: the sake of efficiency
61: (they could have been written in C)
62: and 800 lines to perform hardware
63: functions not possible in C.
64: .PP
65: This code represents 5 to 10 percent of what has
66: been lumped into the broad expression
67: ``the
68: .UX
69: operating system.''
70: The kernel is the only
71: .UX
72: code that
73: cannot be substituted by a user to his
74: own liking.
75: For this reason,
76: the kernel should make as few real
77: decisions as possible.
78: This does not mean to allow the user
79: a million options to do the same thing.
80: Rather, it means to allow only one way to
81: do one thing,
82: but have that way be the least-common divisor
83: of all the options that might have been provided.
84: .PP
85: What is or is not implemented in the kernel
86: represents both a great responsibility and a great power.
87: It is a soap-box platform on
88: ``the way things should be done.''
89: Even so, if
90: ``the way'' is too radical,
91: no one will follow it.
92: Every important decision was weighed
93: carefully.
94: Throughout,
95: simplicity has been substituted for efficiency.
96: Complex algorithms are used only if
97: their complexity can be localized.
98: .NH
99: PROCESS CONTROL
100: .PP
101: In the
102: .UX
103: system,
104: a user executes programs in an
105: environment called a user process.
106: When a system function is required,
107: the user process calls the system
108: as a subroutine.
109: At some point in this call,
110: there is a distinct switch of environments.
111: After this,
112: the process is said to be a system process.
113: In the normal definition of processes,
114: the user and system processes are different
115: phases of the same process
116: (they never execute simultaneously).
117: For protection,
118: each system process has its own stack.
119: .PP
120: The user process may execute
121: from a read-only text segment,
122: which is shared by all processes
123: executing the same code.
124: There is no
125: .IT functional
126: benefit
127: from shared-text segments.
128: An
129: .IT efficiency
130: benefit comes from the fact
131: that there is no need to swap read-only
132: segments out because the original
133: copy on secondary memory is still current.
134: This is a great benefit to interactive
135: programs that tend to be swapped while
136: waiting for terminal input.
137: Furthermore,
138: if two processes are
139: executing
140: simultaneously
141: from the same copy of a read-only segment,
142: only one copy needs to reside in
143: primary memory.
144: This is a secondary effect,
145: because
146: simultaneous execution of a program
147: is not common.
148: It is ironic that this effect,
149: which reduces the use of primary memory,
150: only comes into play when there is
151: an overabundance of primary memory,
152: that is,
153: when there is enough memory
154: to keep waiting processes loaded.
155: .PP
156: All current read-only text segments in the
157: system are maintained from the
158: .IT "text table" .
159: A text table entry holds the location of the
160: text segment on secondary memory.
161: If the segment is loaded,
162: that table also holds the primary memory location
163: and the count of the number of processes
164: sharing this entry.
165: When this count is reduced to zero,
166: the entry is freed along with any
167: primary and secondary memory holding the segment.
168: When a process first executes a shared-text segment,
169: a text table entry is allocated and the
170: segment is loaded onto secondary memory.
171: If a second process executes a text segment
172: that is already allocated,
173: the entry reference count is simply incremented.
174: .PP
175: A user process has some strictly private
176: read-write data
177: contained in its
178: data segment.
179: As far as possible,
180: the system does not use the user's
181: data segment to hold system data.
182: In particular,
183: there are no I/O buffers in the
184: user address space.
185: .PP
186: The user data segment has two growing boundaries.
187: One, increased automatically by the system
188: as a result of memory faults,
189: is used for a stack.
190: The second boundary is only grown (or shrunk) by
191: explicit requests.
192: The contents of newly allocated primary memory
193: is initialized to zero.
194: .PP
195: Also associated and swapped with
196: a process is a small fixed-size
197: system data segment.
198: This segment contains all
199: the data about the process
200: that the system needs only when the
201: process is active.
202: Examples of the kind of data contained
203: in the system data segment are:
204: saved central processor registers,
205: open file descriptors,
206: accounting information,
207: scratch data area,
208: and the stack for the system phase
209: of the process.
210: The system data segment is not
211: addressable from the user process
212: and is therefore protected.
213: .PP
214: Last,
215: there is a process table with
216: one entry per process.
217: This entry contains all the data
218: needed by the system when the process
219: is
220: .IT not
221: active.
222: Examples are
223: the process's name,
224: the location of the other segments,
225: and scheduling information.
226: The process table entry is allocated
227: when the process is created, and freed
228: when the process terminates.
229: This process entry is always directly
230: addressable by the kernel.
231: .PP
232: Figure 1 shows the relationships
233: between the various process control
234: data.
235: In a sense,
236: the process table is the
237: definition of all processes,
238: because
239: all the data associated with a process
240: may be accessed
241: starting from the process table entry.
242: .KF
243: .in .375i
244: .so fig1.pic
245: .in -.375i
246: .sp 2v
247: .ce
248: Fig. 1\(emProcess control data structure.
249: .KE
250: .NH 2
251: Process creation and program execution
252: .PP
253: Processes are created by the system primitive
254: .UL fork .
255: The newly created process (child) is a copy of the original process (parent).
256: There is no detectable sharing of primary memory between the two processes.
257: (Of course,
258: if the parent process was executing from a read-only
259: text segment,
260: the child will share the text segment.)
261: Copies of all writable data segments
262: are made for the child process.
263: Files that were open before the
264: .UL fork
265: are
266: truly shared after the
267: .UL fork .
268: The processes are informed as to their part in the
269: relationship to
270: allow them to select their own
271: (usually non-identical)
272: destiny.
273: The parent may
274: .UL wait
275: for the termination of
276: any of its children.
277: .PP
278: A process may
279: .UL exec
280: a file.
281: This consists of exchanging the current text and data
282: segments of the process for new text and data
283: segments specified in the file.
284: The old segments are lost.
285: Doing an
286: .UL exec
287: does
288: .IT not
289: change processes;
290: the process that did the
291: .UL exec
292: persists,
293: but
294: after the
295: .UL exec
296: it is executing a different program.
297: Files that were open
298: before the
299: .UL exec
300: remain open after the
301: .UL exec .
302: .PP
303: If a program,
304: say the first pass of a compiler,
305: wishes to overlay itself with another program,
306: say the second pass,
307: then it simply
308: .UL exec s
309: the second program.
310: This is analogous
311: to a ``goto.''
312: If a program wishes to regain control
313: after
314: .UL exec ing
315: a second program,
316: it should
317: .UL fork
318: a child process,
319: have the child
320: .UL exec
321: the second program, and
322: have the parent
323: .UL wait
324: for the child.
325: This is analogous to a ``call.''
326: Breaking up the call into a binding followed by
327: a transfer is similar to the subroutine linkage in
328: SL-5.
329: .[
330: griswold hanson sl5 overview
331: .]
332: .NH 2
333: Swapping
334: .PP
335: The major data associated with a process
336: (the user data segment,
337: the system data segment, and
338: the text segment)
339: are swapped to and from secondary
340: memory, as needed.
341: The user data segment and the system data segment
342: are kept in contiguous primary memory to reduce
343: swapping latency.
344: (When low-latency devices, such as bubbles,
345: .UC CCD s,
346: or scatter/gather devices,
347: are used,
348: this decision will have to be reconsidered.)
349: Allocation of both primary
350: and secondary memory is performed
351: by the same simple first-fit algorithm.
352: When a process grows,
353: a new piece of primary memory is allocated.
354: The contents of the old memory is copied to the new memory.
355: The old memory is freed
356: and the tables are updated.
357: If there is not enough primary memory,
358: secondary memory is allocated instead.
359: The process is swapped out onto the
360: secondary memory,
361: ready to be swapped in with
362: its new size.
363: .PP
364: One separate process in the kernel,
365: the swapping process,
366: simply swaps the other
367: processes in and out of primary memory.
368: It examines the
369: process table looking for a process
370: that is swapped out and is
371: ready to run.
372: It allocates primary memory for that
373: process and
374: reads its segments into
375: primary memory, where that process competes for the
376: central processor with other loaded processes.
377: If no primary memory is available,
378: the swapping process makes memory available
379: by examining the process table for processes
380: that can be swapped out.
381: It selects a process to swap out,
382: writes it to secondary memory,
383: frees the primary memory,
384: and then goes back to look for a process
385: to swap in.
386: .PP
387: Thus there are two specific algorithms
388: to the swapping process.
389: Which of the possibly many processes that
390: are swapped out is to be swapped in?
391: This is decided by secondary storage residence
392: time.
393: The one with the longest time out is swapped in first.
394: There is a slight penalty for larger processes.
395: Which of the possibly many processes that
396: are loaded is to be swapped out?
397: Processes that are waiting for slow events
398: (i.e., not currently running or waiting for
399: disk I/O)
400: are picked first,
401: by age in primary memory,
402: again with size penalties.
403: The other processes are examined
404: by the same age algorithm,
405: but are not taken out unless they are
406: at least of some age.
407: This adds
408: hysteresis to the swapping and
409: prevents total thrashing.
410: .PP
411: These swapping algorithms are the
412: most suspect in the system.
413: With limited primary memory,
414: these algorithms cause total swapping.
415: This is not bad in itself, because
416: the swapping does not impact the
417: execution of the resident processes.
418: However, if the swapping device must
419: also be used for file storage,
420: the swapping traffic severely
421: impacts the file system traffic.
422: It is exactly these small systems
423: that tend to double usage of limited disk
424: resources.
425: .NH 2
426: Synchronization and scheduling
427: .PP
428: Process synchronization is accomplished by having processes
429: wait for events.
430: Events are represented by arbitrary integers.
431: By convention,
432: events are chosen to be addresses of
433: tables associated with those events.
434: For example, a process that is waiting for
435: any of its children to terminate will wait
436: for an event that is the address of
437: its own process table entry.
438: When a process terminates,
439: it signals the event represented by
440: its parent's process table entry.
441: Signaling an event on which no process
442: is waiting has no effect.
443: Similarly,
444: signaling an event on which many processes
445: are waiting will wake all of them up.
446: This differs considerably from
447: Dijkstra's P and V
448: synchronization operations,
449: .[
450: dijkstra sequential processes 1968
451: .]
452: in that
453: no memory is associated with events.
454: Thus there need be no allocation of events
455: prior to their use.
456: Events exist simply by being used.
457: .PP
458: On the negative side,
459: because there is no memory associated with events,
460: no notion of ``how much''
461: can be signaled via the event mechanism.
462: For example,
463: processes that want memory might
464: wait on an event associated with
465: memory allocation.
466: When any amount of memory becomes available,
467: the event would be signaled.
468: All the competing processes would then wake
469: up to fight over the new memory.
470: (In reality,
471: the swapping process is the only process
472: that waits for primary memory to become available.)
473: .PP
474: If an event occurs
475: between the time a process decides
476: to wait for that event and the
477: time that process enters the wait state,
478: then
479: the process will wait on an event that has
480: already happened (and may never happen again).
481: This race condition happens because there is no memory associated with
482: the event to indicate that the event has occurred;
483: the only action of an event is to change a set of processes
484: from wait state to run state.
485: This problem is relieved largely
486: by the fact that process switching can
487: only occur in the kernel by explicit calls
488: to the event-wait mechanism.
489: If the event in question is signaled by another
490: process,
491: then there is no problem.
492: But if the event is signaled by a hardware
493: interrupt,
494: then special care must be taken.
495: These synchronization races pose the biggest
496: problem when
497: .UX
498: is adapted to multiple-processor configurations.
499: .[
500: hawley meyer multiprocessing unix
501: .]
502: .PP
503: The event-wait code in the kernel
504: is like a co-routine linkage.
505: At any time,
506: all but one of the processes has called event-wait.
507: The remaining process is the one currently executing.
508: When it calls event-wait,
509: a process whose event has been signaled
510: is selected and that process
511: returns from its call to event-wait.
512: .PP
513: Which of the runable processes is to run next?
514: Associated with each process is a priority.
515: The priority of a system process is assigned by the code
516: issuing the wait on an event.
517: This is roughly equivalent to the response
518: that one would expect on such an event.
519: Disk events have high priority,
520: teletype events are low,
521: and time-of-day events are very low.
522: (From observation,
523: the difference in system process priorities
524: has little or no performance impact.)
525: All user-process priorities are lower than the
526: lowest system priority.
527: User-process priorities are assigned
528: by an algorithm based on the
529: recent ratio of the amount of compute time to real time consumed
530: by the process.
531: A process that has used a lot of
532: compute time in the last real-time
533: unit is assigned a low user priority.
534: Because interactive processes are characterized
535: by low ratios of compute to real time,
536: interactive response is maintained without any
537: special arrangements.
538: .PP
539: The scheduling algorithm simply picks
540: the process with the highest priority,
541: thus
542: picking all system processes first and
543: user processes second.
544: The compute-to-real-time ratio is updated
545: every second.
546: Thus,
547: all other things being equal,
548: looping user processes will be
549: scheduled round-robin with a
550: 1-second quantum.
551: A high-priority process waking up will
552: preempt a running, low-priority process.
553: The scheduling algorithm has a very desirable
554: negative feedback character.
555: If a process uses its high priority
556: to hog the computer,
557: its priority will drop.
558: At the same time, if a low-priority
559: process is ignored for a long time,
560: its priority will rise.
561: .NH
562: I/O SYSTEM
563: .PP
564: The I/O system
565: is broken into two completely separate systems:
566: the block I/O system and the character I/O system.
567: In retrospect,
568: the names should have been ``structured I/O''
569: and ``unstructured I/O,'' respectively;
570: while the term ``block I/O'' has some meaning,
571: ``character I/O'' is a complete misnomer.
572: .PP
573: Devices are characterized by a major device number,
574: a minor device number, and
575: a class (block or character).
576: For each class,
577: there is an array of entry points into the device drivers.
578: The major device number is used to index the array
579: when calling the code for a particular device driver.
580: The minor device number is passed to the
581: device driver as an argument.
582: The minor number has no significance other
583: than that attributed to it by the driver.
584: Usually,
585: the driver uses the minor number to access
586: one of several identical physical devices.
587: .PP
588: The use of the array of entry points
589: (configuration table)
590: as the only connection between the
591: system code and the device drivers is
592: very important.
593: Early versions of the system had a much
594: less formal connection with the drivers,
595: so that it was extremely hard to handcraft
596: differently configured systems.
597: Now it is possible to create new
598: device drivers in an average of a few hours.
599: The configuration table in most cases
600: is created automatically by a program
601: that reads the system's parts list.
602: .NH 2
603: Block I/O system
604: .PP
605: The model block I/O device consists
606: of randomly addressed, secondary
607: memory blocks of 512 bytes each.
608: The blocks are uniformly addressed
609: 0, 1, .\|.\|. up to the size of the device.
610: The block device driver has the job of
611: emulating this model on a
612: physical device.
613: .PP
614: The block I/O devices are accessed
615: through a layer of buffering software.
616: The system maintains a list of buffers
617: (typically between 10 and 70)
618: each assigned a device name and
619: a device address.
620: This buffer pool constitutes a data cache
621: for the block devices.
622: On a read request,
623: the cache is searched for the desired block.
624: If the block is found,
625: the data are made available to the
626: requester without any physical I/O.
627: If the block is not in the cache,
628: the least recently used block in the cache is renamed,
629: the correct device driver is called to
630: fill up the renamed buffer, and then the
631: data are made available.
632: Write requests are handled in an analogous manner.
633: The correct buffer is found
634: and relabeled if necessary.
635: The write is performed simply by marking
636: the buffer as ``dirty.''
637: The physical I/O is then deferred until
638: the buffer is renamed.
639: .PP
640: The benefits in reduction of physical I/O
641: of this scheme are substantial,
642: especially considering the file system implementation.
643: There are,
644: however,
645: some drawbacks.
646: The asynchronous nature of the
647: algorithm makes error reporting
648: and meaningful user error handling
649: almost impossible.
650: The cavalier approach to I/O error
651: handling in the
652: .UX
653: system is partly due to the asynchronous
654: nature of the block I/O system.
655: A second problem is in the delayed writes.
656: If the system stops unexpectedly,
657: it is almost certain that there is a
658: lot of logically complete,
659: but physically incomplete,
660: I/O in the buffers.
661: There is a system primitive to
662: flush all outstanding I/O activity
663: from the buffers.
664: Periodic use of this primitive helps,
665: but does not solve, the problem.
666: Finally,
667: the associativity in the buffers
668: can alter the physical I/O sequence
669: from that of the logical I/O sequence.
670: This means that there are times
671: when data structures on disk are inconsistent,
672: even though the software is careful
673: to perform I/O in the correct order.
674: On non-random devices,
675: notably magnetic tape,
676: the inversions of writes can be disastrous.
677: The problem with magnetic tapes is ``cured'' by
678: allowing only one outstanding write request
679: per drive.
680: .NH 2
681: Character I/O system
682: .PP
683: The character I/O system consists of all
684: devices that do not fall into the block I/O model.
685: This includes the ``classical'' character devices
686: such as communications lines, paper tape, and
687: line printers.
688: It also includes magnetic tape and disks when
689: they are not used in a stereotyped way,
690: for example, 80-byte physical records on tape
691: and track-at-a-time disk copies.
692: In short,
693: the character I/O interface
694: means ``everything other than block.''
695: I/O requests from the user are sent to the
696: device driver essentially unaltered.
697: The implementation of these requests is, of course,
698: up to the device driver.
699: There are guidelines and conventions
700: to help the implementation of
701: certain types of device drivers.
702: .NH 3
703: Disk drivers
704: .PP
705: Disk drivers are implemented
706: with a queue of transaction records.
707: Each record holds a read/write flag,
708: a primary memory address,
709: a secondary memory address, and
710: a transfer byte count.
711: Swapping is accomplished by passing
712: such a record to the swapping device driver.
713: The block I/O interface is implemented by
714: passing such records with requests to
715: fill and empty system buffers.
716: The character I/O interface to the disk
717: drivers create a transaction record that
718: points directly into the user area.
719: The routine that creates this record also insures
720: that the user is not swapped during this
721: I/O transaction.
722: Thus by implementing the general disk driver,
723: it is possible to use the disk
724: as a block device,
725: a character device, and a swap device.
726: The only really disk-specific code in normal
727: disk drivers is the pre-sort of transactions to
728: minimize latency for a particular device, and
729: the actual issuing of the I/O request.
730: .NH 3
731: Character lists
732: .PP
733: Real character-oriented devices may
734: be implemented using the common
735: code to handle character lists.
736: A character list is a queue of characters.
737: One routine puts a character on a queue.
738: Another gets a character from a queue.
739: It is also possible to ask how many
740: characters are currently on a queue.
741: Storage for all queues in the system comes
742: from a single common pool.
743: Putting a character on a queue will allocate
744: space from the common pool and link the
745: character onto the data structure defining the queue.
746: Getting a character from a queue returns
747: the corresponding space to the pool.
748: .PP
749: A typical character-output device
750: (paper tape punch, for example)
751: is implemented by passing characters
752: from the user onto a character queue until
753: some maximum number of characters is on the queue.
754: The I/O is prodded to start as
755: soon as there is anything on the queue
756: and, once started,
757: it is sustained by hardware completion interrupts.
758: Each time there is a completion interrupt,
759: the driver gets the next character from the queue
760: and sends it to the hardware.
761: The number of characters on the queue is checked and,
762: as the count falls through some intermediate level,
763: an event (the queue address) is signaled.
764: The process that is passing characters from
765: the user to the queue can be waiting on the event, and
766: refill the queue to its maximum
767: when the event occurs.
768: .PP
769: A typical character input device
770: (for example, a paper tape reader)
771: is handled in a very similar manner.
772: .PP
773: Another class of character devices is the terminals.
774: A terminal is represented by three
775: character queues.
776: There are two input queues (raw and canonical)
777: and an output queue.
778: Characters going to the output of a terminal
779: are handled by common code exactly as described
780: above.
781: The main difference is that there is also code
782: to interpret the output stream as
783: .UC ASCII
784: characters and to perform some translations,
785: e.g., escapes for deficient terminals.
786: Another common aspect of terminals is code
787: to insert real-time delay after certain control characters.
788: .PP
789: Input on terminals is a little different.
790: Characters are collected from the terminal and
791: placed on a raw input queue.
792: Some device-dependent code conversion and
793: escape interpretation is handled here.
794: When a line is complete in the raw queue,
795: an event is signaled.
796: The code catching this signal then copies a
797: line from the raw queue to a canonical queue
798: performing the character erase and line kill editing.
799: User read requests on terminals can be
800: directed at either the raw or canonical queues.
801: .NH 3
802: Other character devices
803: .PP
804: Finally,
805: there are devices that fit no general category.
806: These devices are set up as character I/O drivers.
807: An example is a driver that reads and writes
808: unmapped primary memory as an I/O device.
809: Some devices are too
810: fast to be treated a character at time,
811: but do not fit the disk I/O mold.
812: Examples are fast communications lines and
813: fast line printers.
814: These devices either have their own buffers
815: or ``borrow'' block I/O buffers for a while and
816: then give them back.
817: .NH
818: THE FILE SYSTEM
819: .PP
820: In the
821: .UX
822: system,
823: a file is a (one-dimensional) array of bytes.
824: No other structure of files is implied by the
825: system.
826: Files are attached anywhere
827: (and possibly multiply)
828: onto a hierarchy of directories.
829: Directories are simply files that
830: users cannot write.
831: For a further discussion
832: of the external view of files and directories,
833: see Ref.\0
834: .[
835: ritchie thompson unix bstj 1978
836: %Q This issue
837: .].
838: .PP
839: The
840: .UX
841: file system is a disk data structure
842: accessed completely through
843: the block I/O system.
844: As stated before,
845: the canonical view of a ``disk'' is
846: a randomly addressable array of
847: 512-byte blocks.
848: A file system breaks the disk into
849: four self-identifying regions.
850: The first block (address 0)
851: is unused by the file system.
852: It is left aside for booting procedures.
853: The second block (address 1)
854: contains the so-called ``super-block.''
855: This block,
856: among other things,
857: contains the size of the disk and
858: the boundaries of the other regions.
859: Next comes the i-list,
860: a list of file definitions.
861: Each file definition is
862: a 64-byte structure, called an i-node.
863: The offset of a particular i-node
864: within the i-list is called its i-number.
865: The combination of device name
866: (major and minor numbers) and i-number
867: serves to uniquely name a particular file.
868: After the i-list,
869: and to the end of the disk,
870: come free storage blocks that
871: are available for the contents of files.
872: .PP
873: The free space on a disk is maintained
874: by a linked list of available disk blocks.
875: Every block in this chain contains a disk address
876: of the next block in the chain.
877: The remaining space contains the address of up to
878: 50 disk blocks that are also free.
879: Thus with one I/O operation,
880: the system obtains 50 free blocks and a
881: pointer where to find more.
882: The disk allocation algorithms are
883: very straightforward.
884: Since all allocation is in fixed-size
885: blocks and there is strict accounting of
886: space,
887: there is no need to compact or garbage collect.
888: However,
889: as disk space becomes dispersed,
890: latency gradually increases.
891: Some installations choose to occasionally compact
892: disk space to reduce latency.
893: .PP
894: An i-node contains 13 disk addresses.
895: The first 10 of these addresses point directly at
896: the first 10 blocks of a file.
897: If a file is larger than 10 blocks (5,120 bytes),
898: then the eleventh address points at a block
899: that contains the addresses of the next 128 blocks of the file.
900: If the file is still larger than this
901: (70,656 bytes),
902: then the twelfth block points at up to 128 blocks,
903: each pointing to 128 blocks of the file.
904: Files yet larger
905: (8,459,264 bytes)
906: use the thirteenth address for a ``triple indirect'' address.
907: The algorithm ends here with the maximum file size
908: of 1,082,201,087 bytes.
909: .PP
910: A logical directory hierarchy is added
911: to this flat physical structure simply
912: by adding a new type of file, the directory.
913: A directory is accessed exactly as an ordinary file.
914: It contains 16-byte entries consisting of
915: a 14-byte name and an i-number.
916: The root of the hierarchy is at a known i-number
917: (\fIviz.,\fR 2).
918: The file system structure allows an arbitrary, directed graph
919: of directories with regular files linked in
920: at arbitrary places in this graph.
921: In fact,
922: very early
923: .UX
924: systems used such a structure.
925: Administration of such a structure became so
926: chaotic that later systems were restricted
927: to a directory tree.
928: Even now,
929: with regular files linked multiply
930: into arbitrary places in the tree,
931: accounting for space has become a problem.
932: It may become necessary to restrict the entire
933: structure to a tree,
934: and allow a new form of linking that
935: is subservient to the tree structure.
936: .PP
937: The file system allows
938: easy creation,
939: easy removal,
940: easy random accessing,
941: and very easy space allocation.
942: With most physical addresses confined
943: to a small contiguous section of disk,
944: it is also easy to dump, restore, and
945: check the consistency of the file system.
946: Large files suffer from indirect addressing,
947: but the cache prevents most of the implied physical I/O
948: without adding much execution.
949: The space overhead properties of this scheme are quite good.
950: For example,
951: on one particular file system,
952: there are 25,000 files containing 130M bytes of data-file content.
953: The overhead (i-node, indirect blocks, and last block breakage)
954: is about 11.5M bytes.
955: The directory structure to support these files
956: has about 1,500 directories containing 0.6M bytes of directory content
957: and about 0.5M bytes of overhead in accessing the directories.
958: Added up any way,
959: this comes out to less than a 10 percent overhead for actual
960: stored data.
961: Most systems have this much overhead in
962: padded trailing blanks alone.
963: .NH 2
964: File system implementation
965: .PP
966: Because the i-node defines a file,
967: the implementation of the file system centers
968: around access to the i-node.
969: The system maintains a table of all active
970: i-nodes.
971: As a new file is accessed,
972: the system locates the corresponding i-node,
973: allocates an i-node table entry, and reads
974: the i-node into primary memory.
975: As in the buffer cache,
976: the table entry is considered to be the current
977: version of the i-node.
978: Modifications to the i-node are made to
979: the table entry.
980: When the last access to the i-node goes
981: away,
982: the table entry is copied back to the
983: secondary store i-list and the table entry is freed.
984: .KF
985: .in .25i
986: .so fig2.pic
987: .in -.25i
988: .sp 2v
989: .ce
990: Fig. 2\(emFile system data structure.
991: .sp
992: .KE
993: .PP
994: All I/O operations on files are carried out
995: with the aid of the corresponding i-node table entry.
996: The accessing of a file is a straightforward
997: implementation of the algorithms mentioned previously.
998: The user is not aware of i-nodes and i-numbers.
999: References to the file system are made in terms of
1000: path names of the directory tree.
1001: Converting a path name into an i-node table entry
1002: is also straightforward.
1003: Starting at some known i-node
1004: (the root or the current directory of some process),
1005: the next component of the path name is
1006: searched by reading the directory.
1007: This gives an i-number and an implied device
1008: (that of the directory).
1009: Thus the next i-node table entry can be accessed.
1010: If that was the last component of the path name,
1011: then this i-node is the result.
1012: If not,
1013: this i-node is the directory needed to look up
1014: the next component of the path name, and the
1015: algorithm is repeated.
1016: .PP
1017: The user process accesses the file system with
1018: certain primitives.
1019: The most common of these are
1020: .UL open ,
1021: .UL create ,
1022: .UL read ,
1023: .UL write ,
1024: .UL seek ,
1025: and
1026: .UL close .
1027: The data structures maintained are shown in Fig. 2.
1028: In the system data segment associated with a user,
1029: there is room for some (usually between 10 and 50) open files.
1030: This open file table consists of pointers that can be used to access
1031: corresponding i-node table entries.
1032: Associated with each of these open files is
1033: a current I/O pointer.
1034: This is a byte offset of
1035: the next read/write operation on the file.
1036: The system treats each read/write request
1037: as random with an implied seek to the
1038: I/O pointer.
1039: The user usually thinks of the file as
1040: sequential with the I/O pointer
1041: automatically counting the number of bytes
1042: that have been read/written from the file.
1043: The user may,
1044: of course,
1045: perform random I/O by setting the I/O pointer
1046: before reads/writes.
1047: .PP
1048: With file sharing,
1049: it is necessary to allow related
1050: processes to share a common I/O pointer
1051: and yet have separate I/O pointers
1052: for independent processes
1053: that access the same file.
1054: With these two conditions,
1055: the I/O pointer cannot reside
1056: in the i-node table nor can
1057: it reside in the list of
1058: open files for the process.
1059: A new table
1060: (the open file table)
1061: was invented for the sole purpose
1062: of holding the I/O pointer.
1063: Processes that share the same open
1064: file
1065: (the result of
1066: .UL fork s)
1067: share a common open file table entry.
1068: A separate open of the same file will
1069: only share the i-node table entry,
1070: but will have distinct open file table entries.
1071: .PP
1072: The main file system primitives are implemented as follows.
1073: .UL \&open
1074: converts a file system path name into an i-node
1075: table entry.
1076: A pointer to the i-node table entry is placed in a
1077: newly created open file table entry.
1078: A pointer to the file table entry is placed in the
1079: system data segment for the process.
1080: .UL \&create
1081: first creates a new i-node entry,
1082: writes the i-number into a directory, and
1083: then builds the same structure as for an
1084: .UL open .
1085: .UL \&read
1086: and
1087: .UL write
1088: just access the i-node entry as described above.
1089: .UL \&seek
1090: simply manipulates the I/O pointer.
1091: No physical seeking is done.
1092: .UL \&close
1093: just frees the structures built by
1094: .UL open
1095: and
1096: .UL create .
1097: Reference counts are kept on the open file table entries and
1098: the i-node table entries to free these structures after
1099: the last reference goes away.
1100: .UL \&unlink
1101: simply decrements the count of the
1102: number of directories pointing at the given i-node.
1103: When the last reference to an i-node table entry
1104: goes away,
1105: if the i-node has no directories pointing to it,
1106: then the file is removed and the i-node is freed.
1107: This delayed removal of files prevents
1108: problems arising from removing active files.
1109: A file may be removed while still open.
1110: The resulting unnamed file vanishes
1111: when the file is closed.
1112: This is a method of obtaining temporary files.
1113: .PP
1114: There is a type of unnamed
1115: .UC FIFO
1116: file called a
1117: .UL pipe.
1118: Implementation of
1119: .UL pipe s
1120: consists of implied
1121: .UL seek s
1122: before each
1123: .UL read
1124: or
1125: .UL write
1126: in order to implement
1127: first-in-first-out.
1128: There are also checks and synchronization
1129: to prevent the
1130: writer from grossly outproducing the
1131: reader and to prevent the reader from
1132: overtaking the writer.
1133: .NH 2
1134: Mounted file systems
1135: .PP
1136: The file system of a
1137: .UX
1138: system
1139: starts with some designated block device
1140: formatted as described above to contain
1141: a hierarchy.
1142: The root of this structure is the root of
1143: the
1144: .UX
1145: file system.
1146: A second formatted block device may be
1147: mounted
1148: at any leaf of
1149: the current hierarchy.
1150: This logically extends the current hierarchy.
1151: The implementation of
1152: mounting
1153: is trivial.
1154: A mount table is maintained containing
1155: pairs of designated leaf i-nodes and
1156: block devices.
1157: When converting a path name into an i-node,
1158: a check is made to see if the new i-node is a
1159: designated leaf.
1160: If it is,
1161: the i-node of the root
1162: of the block device replaces it.
1163: .PP
1164: Allocation of space for a file is taken
1165: from the free pool on the device on which the
1166: file lives.
1167: Thus a file system consisting of many
1168: mounted devices does not have a common pool of
1169: free secondary storage space.
1170: This separation of space on different
1171: devices is necessary to allow easy
1172: unmounting
1173: of a device.
1174: .NH 2
1175: Other system functions
1176: .PP
1177: There are some other things that the system
1178: does for the user\-a
1179: little accounting,
1180: a little tracing/debugging,
1181: and a little access protection.
1182: Most of these things are not very
1183: well developed
1184: because our use of the system in computing science research
1185: does not need them.
1186: There are some features that are missed in some
1187: applications, for example, better inter-process communication.
1188: .PP
1189: The
1190: .UX
1191: kernel is an I/O multiplexer more than
1192: a complete operating system.
1193: This is as it should be.
1194: Because of this outlook,
1195: many features are
1196: found in most
1197: other operating systems that are missing from the
1198: .UX
1199: kernel.
1200: For example,
1201: the
1202: .UX
1203: kernel does not support
1204: file access methods,
1205: file disposition,
1206: file formats,
1207: file maximum size,
1208: spooling,
1209: command language,
1210: logical records,
1211: physical records,
1212: assignment of logical file names,
1213: logical file names,
1214: more than one character set,
1215: an operator's console,
1216: an operator,
1217: log-in,
1218: or log-out.
1219: Many of these things are symptoms rather than features.
1220: Many of these things are implemented
1221: in user software
1222: using the kernel as a tool.
1223: A good example of this is the command language.
1224: .[
1225: bourne shell 1978 bstj
1226: %Q This issue
1227: .]
1228: Each user may have his own command language.
1229: Maintenance of such code is as easy as
1230: maintaining user code.
1231: The idea of implementing ``system'' code with general
1232: user primitives
1233: comes directly from
1234: .UC MULTICS .
1235: .[
1236: organick multics 1972
1237: .]
1238: .LP
1239: .[
1240: $LIST$
1241: .]
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.