|
|
1.1 ! root 1: .\" @(#)implement 6.2 (Berkeley) 6/17/86 ! 2: .\" ! 3: .EH 'PS2:4-%''UNIX Implementation' ! 4: .OH 'UNIX Implementation''PS2:4-%' ! 5: .de P1 ! 6: .DS ! 7: .. ! 8: .de P2 ! 9: .DE ! 10: .. ! 11: .de UL ! 12: .lg 0 ! 13: .if n .ul ! 14: \%\&\\$3\f3\\$1\fR\&\\$2 ! 15: .lg ! 16: .. ! 17: .de UC ! 18: \&\\$3\s-1\\$1\\s0\&\\$2 ! 19: .. ! 20: .de IT ! 21: .lg 0 ! 22: .if n .ul ! 23: \%\&\\$3\f2\\$1\fR\&\\$2 ! 24: .lg ! 25: .. ! 26: .de SP ! 27: .sp \\$1 ! 28: .. ! 29: .hw device ! 30: .TL ! 31: UNIX Implementation ! 32: .AU "MH 2C-523" 2394 ! 33: K. Thompson ! 34: .AI ! 35: .MH ! 36: .AB ! 37: This paper describes in high-level terms the ! 38: implementation of the resident ! 39: .UX ! 40: kernel. ! 41: This discussion is broken into three parts. ! 42: The first part describes ! 43: how the ! 44: .UX ! 45: system views processes, users, and programs. ! 46: The second part describes the I/O system. ! 47: The last part describes the ! 48: .UX ! 49: file system. ! 50: .AE ! 51: .NH ! 52: INTRODUCTION ! 53: .PP ! 54: The ! 55: .UX ! 56: kernel consists of about 10,000 ! 57: lines of C code and about 1,000 lines of assembly code. ! 58: The assembly code can be further broken down into ! 59: 200 lines included for ! 60: the sake of efficiency ! 61: (they could have been written in C) ! 62: and 800 lines to perform hardware ! 63: functions not possible in C. ! 64: .PP ! 65: This code represents 5 to 10 percent of what has ! 66: been lumped into the broad expression ! 67: ``the ! 68: .UX ! 69: operating system.'' ! 70: The kernel is the only ! 71: .UX ! 72: code that ! 73: cannot be substituted by a user to his ! 74: own liking. ! 75: For this reason, ! 76: the kernel should make as few real ! 77: decisions as possible. ! 78: This does not mean to allow the user ! 79: a million options to do the same thing. ! 80: Rather, it means to allow only one way to ! 81: do one thing, ! 82: but have that way be the least-common divisor ! 83: of all the options that might have been provided. ! 84: .PP ! 85: What is or is not implemented in the kernel ! 86: represents both a great responsibility and a great power. ! 87: It is a soap-box platform on ! 88: ``the way things should be done.'' ! 89: Even so, if ! 90: ``the way'' is too radical, ! 91: no one will follow it. ! 92: Every important decision was weighed ! 93: carefully. ! 94: Throughout, ! 95: simplicity has been substituted for efficiency. ! 96: Complex algorithms are used only if ! 97: their complexity can be localized. ! 98: .NH ! 99: PROCESS CONTROL ! 100: .PP ! 101: In the ! 102: .UX ! 103: system, ! 104: a user executes programs in an ! 105: environment called a user process. ! 106: When a system function is required, ! 107: the user process calls the system ! 108: as a subroutine. ! 109: At some point in this call, ! 110: there is a distinct switch of environments. ! 111: After this, ! 112: the process is said to be a system process. ! 113: In the normal definition of processes, ! 114: the user and system processes are different ! 115: phases of the same process ! 116: (they never execute simultaneously). ! 117: For protection, ! 118: each system process has its own stack. ! 119: .PP ! 120: The user process may execute ! 121: from a read-only text segment, ! 122: which is shared by all processes ! 123: executing the same code. ! 124: There is no ! 125: .IT functional ! 126: benefit ! 127: from shared-text segments. ! 128: An ! 129: .IT efficiency ! 130: benefit comes from the fact ! 131: that there is no need to swap read-only ! 132: segments out because the original ! 133: copy on secondary memory is still current. ! 134: This is a great benefit to interactive ! 135: programs that tend to be swapped while ! 136: waiting for terminal input. ! 137: Furthermore, ! 138: if two processes are ! 139: executing ! 140: simultaneously ! 141: from the same copy of a read-only segment, ! 142: only one copy needs to reside in ! 143: primary memory. ! 144: This is a secondary effect, ! 145: because ! 146: simultaneous execution of a program ! 147: is not common. ! 148: It is ironic that this effect, ! 149: which reduces the use of primary memory, ! 150: only comes into play when there is ! 151: an overabundance of primary memory, ! 152: that is, ! 153: when there is enough memory ! 154: to keep waiting processes loaded. ! 155: .PP ! 156: All current read-only text segments in the ! 157: system are maintained from the ! 158: .IT "text table" . ! 159: A text table entry holds the location of the ! 160: text segment on secondary memory. ! 161: If the segment is loaded, ! 162: that table also holds the primary memory location ! 163: and the count of the number of processes ! 164: sharing this entry. ! 165: When this count is reduced to zero, ! 166: the entry is freed along with any ! 167: primary and secondary memory holding the segment. ! 168: When a process first executes a shared-text segment, ! 169: a text table entry is allocated and the ! 170: segment is loaded onto secondary memory. ! 171: If a second process executes a text segment ! 172: that is already allocated, ! 173: the entry reference count is simply incremented. ! 174: .PP ! 175: A user process has some strictly private ! 176: read-write data ! 177: contained in its ! 178: data segment. ! 179: As far as possible, ! 180: the system does not use the user's ! 181: data segment to hold system data. ! 182: In particular, ! 183: there are no I/O buffers in the ! 184: user address space. ! 185: .PP ! 186: The user data segment has two growing boundaries. ! 187: One, increased automatically by the system ! 188: as a result of memory faults, ! 189: is used for a stack. ! 190: The second boundary is only grown (or shrunk) by ! 191: explicit requests. ! 192: The contents of newly allocated primary memory ! 193: is initialized to zero. ! 194: .PP ! 195: Also associated and swapped with ! 196: a process is a small fixed-size ! 197: system data segment. ! 198: This segment contains all ! 199: the data about the process ! 200: that the system needs only when the ! 201: process is active. ! 202: Examples of the kind of data contained ! 203: in the system data segment are: ! 204: saved central processor registers, ! 205: open file descriptors, ! 206: accounting information, ! 207: scratch data area, ! 208: and the stack for the system phase ! 209: of the process. ! 210: The system data segment is not ! 211: addressable from the user process ! 212: and is therefore protected. ! 213: .PP ! 214: Last, ! 215: there is a process table with ! 216: one entry per process. ! 217: This entry contains all the data ! 218: needed by the system when the process ! 219: is ! 220: .IT not ! 221: active. ! 222: Examples are ! 223: the process's name, ! 224: the location of the other segments, ! 225: and scheduling information. ! 226: The process table entry is allocated ! 227: when the process is created, and freed ! 228: when the process terminates. ! 229: This process entry is always directly ! 230: addressable by the kernel. ! 231: .PP ! 232: Figure 1 shows the relationships ! 233: between the various process control ! 234: data. ! 235: In a sense, ! 236: the process table is the ! 237: definition of all processes, ! 238: because ! 239: all the data associated with a process ! 240: may be accessed ! 241: starting from the process table entry. ! 242: .KF ! 243: .in .375i ! 244: .so fig1.pic ! 245: .in -.375i ! 246: .sp 2v ! 247: .ce ! 248: Fig. 1\(emProcess control data structure. ! 249: .KE ! 250: .NH 2 ! 251: Process creation and program execution ! 252: .PP ! 253: Processes are created by the system primitive ! 254: .UL fork . ! 255: The newly created process (child) is a copy of the original process (parent). ! 256: There is no detectable sharing of primary memory between the two processes. ! 257: (Of course, ! 258: if the parent process was executing from a read-only ! 259: text segment, ! 260: the child will share the text segment.) ! 261: Copies of all writable data segments ! 262: are made for the child process. ! 263: Files that were open before the ! 264: .UL fork ! 265: are ! 266: truly shared after the ! 267: .UL fork . ! 268: The processes are informed as to their part in the ! 269: relationship to ! 270: allow them to select their own ! 271: (usually non-identical) ! 272: destiny. ! 273: The parent may ! 274: .UL wait ! 275: for the termination of ! 276: any of its children. ! 277: .PP ! 278: A process may ! 279: .UL exec ! 280: a file. ! 281: This consists of exchanging the current text and data ! 282: segments of the process for new text and data ! 283: segments specified in the file. ! 284: The old segments are lost. ! 285: Doing an ! 286: .UL exec ! 287: does ! 288: .IT not ! 289: change processes; ! 290: the process that did the ! 291: .UL exec ! 292: persists, ! 293: but ! 294: after the ! 295: .UL exec ! 296: it is executing a different program. ! 297: Files that were open ! 298: before the ! 299: .UL exec ! 300: remain open after the ! 301: .UL exec . ! 302: .PP ! 303: If a program, ! 304: say the first pass of a compiler, ! 305: wishes to overlay itself with another program, ! 306: say the second pass, ! 307: then it simply ! 308: .UL exec s ! 309: the second program. ! 310: This is analogous ! 311: to a ``goto.'' ! 312: If a program wishes to regain control ! 313: after ! 314: .UL exec ing ! 315: a second program, ! 316: it should ! 317: .UL fork ! 318: a child process, ! 319: have the child ! 320: .UL exec ! 321: the second program, and ! 322: have the parent ! 323: .UL wait ! 324: for the child. ! 325: This is analogous to a ``call.'' ! 326: Breaking up the call into a binding followed by ! 327: a transfer is similar to the subroutine linkage in ! 328: SL-5. ! 329: .[ ! 330: griswold hanson sl5 overview ! 331: .] ! 332: .NH 2 ! 333: Swapping ! 334: .PP ! 335: The major data associated with a process ! 336: (the user data segment, ! 337: the system data segment, and ! 338: the text segment) ! 339: are swapped to and from secondary ! 340: memory, as needed. ! 341: The user data segment and the system data segment ! 342: are kept in contiguous primary memory to reduce ! 343: swapping latency. ! 344: (When low-latency devices, such as bubbles, ! 345: .UC CCD s, ! 346: or scatter/gather devices, ! 347: are used, ! 348: this decision will have to be reconsidered.) ! 349: Allocation of both primary ! 350: and secondary memory is performed ! 351: by the same simple first-fit algorithm. ! 352: When a process grows, ! 353: a new piece of primary memory is allocated. ! 354: The contents of the old memory is copied to the new memory. ! 355: The old memory is freed ! 356: and the tables are updated. ! 357: If there is not enough primary memory, ! 358: secondary memory is allocated instead. ! 359: The process is swapped out onto the ! 360: secondary memory, ! 361: ready to be swapped in with ! 362: its new size. ! 363: .PP ! 364: One separate process in the kernel, ! 365: the swapping process, ! 366: simply swaps the other ! 367: processes in and out of primary memory. ! 368: It examines the ! 369: process table looking for a process ! 370: that is swapped out and is ! 371: ready to run. ! 372: It allocates primary memory for that ! 373: process and ! 374: reads its segments into ! 375: primary memory, where that process competes for the ! 376: central processor with other loaded processes. ! 377: If no primary memory is available, ! 378: the swapping process makes memory available ! 379: by examining the process table for processes ! 380: that can be swapped out. ! 381: It selects a process to swap out, ! 382: writes it to secondary memory, ! 383: frees the primary memory, ! 384: and then goes back to look for a process ! 385: to swap in. ! 386: .PP ! 387: Thus there are two specific algorithms ! 388: to the swapping process. ! 389: Which of the possibly many processes that ! 390: are swapped out is to be swapped in? ! 391: This is decided by secondary storage residence ! 392: time. ! 393: The one with the longest time out is swapped in first. ! 394: There is a slight penalty for larger processes. ! 395: Which of the possibly many processes that ! 396: are loaded is to be swapped out? ! 397: Processes that are waiting for slow events ! 398: (i.e., not currently running or waiting for ! 399: disk I/O) ! 400: are picked first, ! 401: by age in primary memory, ! 402: again with size penalties. ! 403: The other processes are examined ! 404: by the same age algorithm, ! 405: but are not taken out unless they are ! 406: at least of some age. ! 407: This adds ! 408: hysteresis to the swapping and ! 409: prevents total thrashing. ! 410: .PP ! 411: These swapping algorithms are the ! 412: most suspect in the system. ! 413: With limited primary memory, ! 414: these algorithms cause total swapping. ! 415: This is not bad in itself, because ! 416: the swapping does not impact the ! 417: execution of the resident processes. ! 418: However, if the swapping device must ! 419: also be used for file storage, ! 420: the swapping traffic severely ! 421: impacts the file system traffic. ! 422: It is exactly these small systems ! 423: that tend to double usage of limited disk ! 424: resources. ! 425: .NH 2 ! 426: Synchronization and scheduling ! 427: .PP ! 428: Process synchronization is accomplished by having processes ! 429: wait for events. ! 430: Events are represented by arbitrary integers. ! 431: By convention, ! 432: events are chosen to be addresses of ! 433: tables associated with those events. ! 434: For example, a process that is waiting for ! 435: any of its children to terminate will wait ! 436: for an event that is the address of ! 437: its own process table entry. ! 438: When a process terminates, ! 439: it signals the event represented by ! 440: its parent's process table entry. ! 441: Signaling an event on which no process ! 442: is waiting has no effect. ! 443: Similarly, ! 444: signaling an event on which many processes ! 445: are waiting will wake all of them up. ! 446: This differs considerably from ! 447: Dijkstra's P and V ! 448: synchronization operations, ! 449: .[ ! 450: dijkstra sequential processes 1968 ! 451: .] ! 452: in that ! 453: no memory is associated with events. ! 454: Thus there need be no allocation of events ! 455: prior to their use. ! 456: Events exist simply by being used. ! 457: .PP ! 458: On the negative side, ! 459: because there is no memory associated with events, ! 460: no notion of ``how much'' ! 461: can be signaled via the event mechanism. ! 462: For example, ! 463: processes that want memory might ! 464: wait on an event associated with ! 465: memory allocation. ! 466: When any amount of memory becomes available, ! 467: the event would be signaled. ! 468: All the competing processes would then wake ! 469: up to fight over the new memory. ! 470: (In reality, ! 471: the swapping process is the only process ! 472: that waits for primary memory to become available.) ! 473: .PP ! 474: If an event occurs ! 475: between the time a process decides ! 476: to wait for that event and the ! 477: time that process enters the wait state, ! 478: then ! 479: the process will wait on an event that has ! 480: already happened (and may never happen again). ! 481: This race condition happens because there is no memory associated with ! 482: the event to indicate that the event has occurred; ! 483: the only action of an event is to change a set of processes ! 484: from wait state to run state. ! 485: This problem is relieved largely ! 486: by the fact that process switching can ! 487: only occur in the kernel by explicit calls ! 488: to the event-wait mechanism. ! 489: If the event in question is signaled by another ! 490: process, ! 491: then there is no problem. ! 492: But if the event is signaled by a hardware ! 493: interrupt, ! 494: then special care must be taken. ! 495: These synchronization races pose the biggest ! 496: problem when ! 497: .UX ! 498: is adapted to multiple-processor configurations. ! 499: .[ ! 500: hawley meyer multiprocessing unix ! 501: .] ! 502: .PP ! 503: The event-wait code in the kernel ! 504: is like a co-routine linkage. ! 505: At any time, ! 506: all but one of the processes has called event-wait. ! 507: The remaining process is the one currently executing. ! 508: When it calls event-wait, ! 509: a process whose event has been signaled ! 510: is selected and that process ! 511: returns from its call to event-wait. ! 512: .PP ! 513: Which of the runable processes is to run next? ! 514: Associated with each process is a priority. ! 515: The priority of a system process is assigned by the code ! 516: issuing the wait on an event. ! 517: This is roughly equivalent to the response ! 518: that one would expect on such an event. ! 519: Disk events have high priority, ! 520: teletype events are low, ! 521: and time-of-day events are very low. ! 522: (From observation, ! 523: the difference in system process priorities ! 524: has little or no performance impact.) ! 525: All user-process priorities are lower than the ! 526: lowest system priority. ! 527: User-process priorities are assigned ! 528: by an algorithm based on the ! 529: recent ratio of the amount of compute time to real time consumed ! 530: by the process. ! 531: A process that has used a lot of ! 532: compute time in the last real-time ! 533: unit is assigned a low user priority. ! 534: Because interactive processes are characterized ! 535: by low ratios of compute to real time, ! 536: interactive response is maintained without any ! 537: special arrangements. ! 538: .PP ! 539: The scheduling algorithm simply picks ! 540: the process with the highest priority, ! 541: thus ! 542: picking all system processes first and ! 543: user processes second. ! 544: The compute-to-real-time ratio is updated ! 545: every second. ! 546: Thus, ! 547: all other things being equal, ! 548: looping user processes will be ! 549: scheduled round-robin with a ! 550: 1-second quantum. ! 551: A high-priority process waking up will ! 552: preempt a running, low-priority process. ! 553: The scheduling algorithm has a very desirable ! 554: negative feedback character. ! 555: If a process uses its high priority ! 556: to hog the computer, ! 557: its priority will drop. ! 558: At the same time, if a low-priority ! 559: process is ignored for a long time, ! 560: its priority will rise. ! 561: .NH ! 562: I/O SYSTEM ! 563: .PP ! 564: The I/O system ! 565: is broken into two completely separate systems: ! 566: the block I/O system and the character I/O system. ! 567: In retrospect, ! 568: the names should have been ``structured I/O'' ! 569: and ``unstructured I/O,'' respectively; ! 570: while the term ``block I/O'' has some meaning, ! 571: ``character I/O'' is a complete misnomer. ! 572: .PP ! 573: Devices are characterized by a major device number, ! 574: a minor device number, and ! 575: a class (block or character). ! 576: For each class, ! 577: there is an array of entry points into the device drivers. ! 578: The major device number is used to index the array ! 579: when calling the code for a particular device driver. ! 580: The minor device number is passed to the ! 581: device driver as an argument. ! 582: The minor number has no significance other ! 583: than that attributed to it by the driver. ! 584: Usually, ! 585: the driver uses the minor number to access ! 586: one of several identical physical devices. ! 587: .PP ! 588: The use of the array of entry points ! 589: (configuration table) ! 590: as the only connection between the ! 591: system code and the device drivers is ! 592: very important. ! 593: Early versions of the system had a much ! 594: less formal connection with the drivers, ! 595: so that it was extremely hard to handcraft ! 596: differently configured systems. ! 597: Now it is possible to create new ! 598: device drivers in an average of a few hours. ! 599: The configuration table in most cases ! 600: is created automatically by a program ! 601: that reads the system's parts list. ! 602: .NH 2 ! 603: Block I/O system ! 604: .PP ! 605: The model block I/O device consists ! 606: of randomly addressed, secondary ! 607: memory blocks of 512 bytes each. ! 608: The blocks are uniformly addressed ! 609: 0, 1, .\|.\|. up to the size of the device. ! 610: The block device driver has the job of ! 611: emulating this model on a ! 612: physical device. ! 613: .PP ! 614: The block I/O devices are accessed ! 615: through a layer of buffering software. ! 616: The system maintains a list of buffers ! 617: (typically between 10 and 70) ! 618: each assigned a device name and ! 619: a device address. ! 620: This buffer pool constitutes a data cache ! 621: for the block devices. ! 622: On a read request, ! 623: the cache is searched for the desired block. ! 624: If the block is found, ! 625: the data are made available to the ! 626: requester without any physical I/O. ! 627: If the block is not in the cache, ! 628: the least recently used block in the cache is renamed, ! 629: the correct device driver is called to ! 630: fill up the renamed buffer, and then the ! 631: data are made available. ! 632: Write requests are handled in an analogous manner. ! 633: The correct buffer is found ! 634: and relabeled if necessary. ! 635: The write is performed simply by marking ! 636: the buffer as ``dirty.'' ! 637: The physical I/O is then deferred until ! 638: the buffer is renamed. ! 639: .PP ! 640: The benefits in reduction of physical I/O ! 641: of this scheme are substantial, ! 642: especially considering the file system implementation. ! 643: There are, ! 644: however, ! 645: some drawbacks. ! 646: The asynchronous nature of the ! 647: algorithm makes error reporting ! 648: and meaningful user error handling ! 649: almost impossible. ! 650: The cavalier approach to I/O error ! 651: handling in the ! 652: .UX ! 653: system is partly due to the asynchronous ! 654: nature of the block I/O system. ! 655: A second problem is in the delayed writes. ! 656: If the system stops unexpectedly, ! 657: it is almost certain that there is a ! 658: lot of logically complete, ! 659: but physically incomplete, ! 660: I/O in the buffers. ! 661: There is a system primitive to ! 662: flush all outstanding I/O activity ! 663: from the buffers. ! 664: Periodic use of this primitive helps, ! 665: but does not solve, the problem. ! 666: Finally, ! 667: the associativity in the buffers ! 668: can alter the physical I/O sequence ! 669: from that of the logical I/O sequence. ! 670: This means that there are times ! 671: when data structures on disk are inconsistent, ! 672: even though the software is careful ! 673: to perform I/O in the correct order. ! 674: On non-random devices, ! 675: notably magnetic tape, ! 676: the inversions of writes can be disastrous. ! 677: The problem with magnetic tapes is ``cured'' by ! 678: allowing only one outstanding write request ! 679: per drive. ! 680: .NH 2 ! 681: Character I/O system ! 682: .PP ! 683: The character I/O system consists of all ! 684: devices that do not fall into the block I/O model. ! 685: This includes the ``classical'' character devices ! 686: such as communications lines, paper tape, and ! 687: line printers. ! 688: It also includes magnetic tape and disks when ! 689: they are not used in a stereotyped way, ! 690: for example, 80-byte physical records on tape ! 691: and track-at-a-time disk copies. ! 692: In short, ! 693: the character I/O interface ! 694: means ``everything other than block.'' ! 695: I/O requests from the user are sent to the ! 696: device driver essentially unaltered. ! 697: The implementation of these requests is, of course, ! 698: up to the device driver. ! 699: There are guidelines and conventions ! 700: to help the implementation of ! 701: certain types of device drivers. ! 702: .NH 3 ! 703: Disk drivers ! 704: .PP ! 705: Disk drivers are implemented ! 706: with a queue of transaction records. ! 707: Each record holds a read/write flag, ! 708: a primary memory address, ! 709: a secondary memory address, and ! 710: a transfer byte count. ! 711: Swapping is accomplished by passing ! 712: such a record to the swapping device driver. ! 713: The block I/O interface is implemented by ! 714: passing such records with requests to ! 715: fill and empty system buffers. ! 716: The character I/O interface to the disk ! 717: drivers create a transaction record that ! 718: points directly into the user area. ! 719: The routine that creates this record also insures ! 720: that the user is not swapped during this ! 721: I/O transaction. ! 722: Thus by implementing the general disk driver, ! 723: it is possible to use the disk ! 724: as a block device, ! 725: a character device, and a swap device. ! 726: The only really disk-specific code in normal ! 727: disk drivers is the pre-sort of transactions to ! 728: minimize latency for a particular device, and ! 729: the actual issuing of the I/O request. ! 730: .NH 3 ! 731: Character lists ! 732: .PP ! 733: Real character-oriented devices may ! 734: be implemented using the common ! 735: code to handle character lists. ! 736: A character list is a queue of characters. ! 737: One routine puts a character on a queue. ! 738: Another gets a character from a queue. ! 739: It is also possible to ask how many ! 740: characters are currently on a queue. ! 741: Storage for all queues in the system comes ! 742: from a single common pool. ! 743: Putting a character on a queue will allocate ! 744: space from the common pool and link the ! 745: character onto the data structure defining the queue. ! 746: Getting a character from a queue returns ! 747: the corresponding space to the pool. ! 748: .PP ! 749: A typical character-output device ! 750: (paper tape punch, for example) ! 751: is implemented by passing characters ! 752: from the user onto a character queue until ! 753: some maximum number of characters is on the queue. ! 754: The I/O is prodded to start as ! 755: soon as there is anything on the queue ! 756: and, once started, ! 757: it is sustained by hardware completion interrupts. ! 758: Each time there is a completion interrupt, ! 759: the driver gets the next character from the queue ! 760: and sends it to the hardware. ! 761: The number of characters on the queue is checked and, ! 762: as the count falls through some intermediate level, ! 763: an event (the queue address) is signaled. ! 764: The process that is passing characters from ! 765: the user to the queue can be waiting on the event, and ! 766: refill the queue to its maximum ! 767: when the event occurs. ! 768: .PP ! 769: A typical character input device ! 770: (for example, a paper tape reader) ! 771: is handled in a very similar manner. ! 772: .PP ! 773: Another class of character devices is the terminals. ! 774: A terminal is represented by three ! 775: character queues. ! 776: There are two input queues (raw and canonical) ! 777: and an output queue. ! 778: Characters going to the output of a terminal ! 779: are handled by common code exactly as described ! 780: above. ! 781: The main difference is that there is also code ! 782: to interpret the output stream as ! 783: .UC ASCII ! 784: characters and to perform some translations, ! 785: e.g., escapes for deficient terminals. ! 786: Another common aspect of terminals is code ! 787: to insert real-time delay after certain control characters. ! 788: .PP ! 789: Input on terminals is a little different. ! 790: Characters are collected from the terminal and ! 791: placed on a raw input queue. ! 792: Some device-dependent code conversion and ! 793: escape interpretation is handled here. ! 794: When a line is complete in the raw queue, ! 795: an event is signaled. ! 796: The code catching this signal then copies a ! 797: line from the raw queue to a canonical queue ! 798: performing the character erase and line kill editing. ! 799: User read requests on terminals can be ! 800: directed at either the raw or canonical queues. ! 801: .NH 3 ! 802: Other character devices ! 803: .PP ! 804: Finally, ! 805: there are devices that fit no general category. ! 806: These devices are set up as character I/O drivers. ! 807: An example is a driver that reads and writes ! 808: unmapped primary memory as an I/O device. ! 809: Some devices are too ! 810: fast to be treated a character at time, ! 811: but do not fit the disk I/O mold. ! 812: Examples are fast communications lines and ! 813: fast line printers. ! 814: These devices either have their own buffers ! 815: or ``borrow'' block I/O buffers for a while and ! 816: then give them back. ! 817: .NH ! 818: THE FILE SYSTEM ! 819: .PP ! 820: In the ! 821: .UX ! 822: system, ! 823: a file is a (one-dimensional) array of bytes. ! 824: No other structure of files is implied by the ! 825: system. ! 826: Files are attached anywhere ! 827: (and possibly multiply) ! 828: onto a hierarchy of directories. ! 829: Directories are simply files that ! 830: users cannot write. ! 831: For a further discussion ! 832: of the external view of files and directories, ! 833: see Ref.\0 ! 834: .[ ! 835: ritchie thompson unix bstj 1978 ! 836: %Q This issue ! 837: .]. ! 838: .PP ! 839: The ! 840: .UX ! 841: file system is a disk data structure ! 842: accessed completely through ! 843: the block I/O system. ! 844: As stated before, ! 845: the canonical view of a ``disk'' is ! 846: a randomly addressable array of ! 847: 512-byte blocks. ! 848: A file system breaks the disk into ! 849: four self-identifying regions. ! 850: The first block (address 0) ! 851: is unused by the file system. ! 852: It is left aside for booting procedures. ! 853: The second block (address 1) ! 854: contains the so-called ``super-block.'' ! 855: This block, ! 856: among other things, ! 857: contains the size of the disk and ! 858: the boundaries of the other regions. ! 859: Next comes the i-list, ! 860: a list of file definitions. ! 861: Each file definition is ! 862: a 64-byte structure, called an i-node. ! 863: The offset of a particular i-node ! 864: within the i-list is called its i-number. ! 865: The combination of device name ! 866: (major and minor numbers) and i-number ! 867: serves to uniquely name a particular file. ! 868: After the i-list, ! 869: and to the end of the disk, ! 870: come free storage blocks that ! 871: are available for the contents of files. ! 872: .PP ! 873: The free space on a disk is maintained ! 874: by a linked list of available disk blocks. ! 875: Every block in this chain contains a disk address ! 876: of the next block in the chain. ! 877: The remaining space contains the address of up to ! 878: 50 disk blocks that are also free. ! 879: Thus with one I/O operation, ! 880: the system obtains 50 free blocks and a ! 881: pointer where to find more. ! 882: The disk allocation algorithms are ! 883: very straightforward. ! 884: Since all allocation is in fixed-size ! 885: blocks and there is strict accounting of ! 886: space, ! 887: there is no need to compact or garbage collect. ! 888: However, ! 889: as disk space becomes dispersed, ! 890: latency gradually increases. ! 891: Some installations choose to occasionally compact ! 892: disk space to reduce latency. ! 893: .PP ! 894: An i-node contains 13 disk addresses. ! 895: The first 10 of these addresses point directly at ! 896: the first 10 blocks of a file. ! 897: If a file is larger than 10 blocks (5,120 bytes), ! 898: then the eleventh address points at a block ! 899: that contains the addresses of the next 128 blocks of the file. ! 900: If the file is still larger than this ! 901: (70,656 bytes), ! 902: then the twelfth block points at up to 128 blocks, ! 903: each pointing to 128 blocks of the file. ! 904: Files yet larger ! 905: (8,459,264 bytes) ! 906: use the thirteenth address for a ``triple indirect'' address. ! 907: The algorithm ends here with the maximum file size ! 908: of 1,082,201,087 bytes. ! 909: .PP ! 910: A logical directory hierarchy is added ! 911: to this flat physical structure simply ! 912: by adding a new type of file, the directory. ! 913: A directory is accessed exactly as an ordinary file. ! 914: It contains 16-byte entries consisting of ! 915: a 14-byte name and an i-number. ! 916: The root of the hierarchy is at a known i-number ! 917: (\fIviz.,\fR 2). ! 918: The file system structure allows an arbitrary, directed graph ! 919: of directories with regular files linked in ! 920: at arbitrary places in this graph. ! 921: In fact, ! 922: very early ! 923: .UX ! 924: systems used such a structure. ! 925: Administration of such a structure became so ! 926: chaotic that later systems were restricted ! 927: to a directory tree. ! 928: Even now, ! 929: with regular files linked multiply ! 930: into arbitrary places in the tree, ! 931: accounting for space has become a problem. ! 932: It may become necessary to restrict the entire ! 933: structure to a tree, ! 934: and allow a new form of linking that ! 935: is subservient to the tree structure. ! 936: .PP ! 937: The file system allows ! 938: easy creation, ! 939: easy removal, ! 940: easy random accessing, ! 941: and very easy space allocation. ! 942: With most physical addresses confined ! 943: to a small contiguous section of disk, ! 944: it is also easy to dump, restore, and ! 945: check the consistency of the file system. ! 946: Large files suffer from indirect addressing, ! 947: but the cache prevents most of the implied physical I/O ! 948: without adding much execution. ! 949: The space overhead properties of this scheme are quite good. ! 950: For example, ! 951: on one particular file system, ! 952: there are 25,000 files containing 130M bytes of data-file content. ! 953: The overhead (i-node, indirect blocks, and last block breakage) ! 954: is about 11.5M bytes. ! 955: The directory structure to support these files ! 956: has about 1,500 directories containing 0.6M bytes of directory content ! 957: and about 0.5M bytes of overhead in accessing the directories. ! 958: Added up any way, ! 959: this comes out to less than a 10 percent overhead for actual ! 960: stored data. ! 961: Most systems have this much overhead in ! 962: padded trailing blanks alone. ! 963: .NH 2 ! 964: File system implementation ! 965: .PP ! 966: Because the i-node defines a file, ! 967: the implementation of the file system centers ! 968: around access to the i-node. ! 969: The system maintains a table of all active ! 970: i-nodes. ! 971: As a new file is accessed, ! 972: the system locates the corresponding i-node, ! 973: allocates an i-node table entry, and reads ! 974: the i-node into primary memory. ! 975: As in the buffer cache, ! 976: the table entry is considered to be the current ! 977: version of the i-node. ! 978: Modifications to the i-node are made to ! 979: the table entry. ! 980: When the last access to the i-node goes ! 981: away, ! 982: the table entry is copied back to the ! 983: secondary store i-list and the table entry is freed. ! 984: .KF ! 985: .in .25i ! 986: .so fig2.pic ! 987: .in -.25i ! 988: .sp 2v ! 989: .ce ! 990: Fig. 2\(emFile system data structure. ! 991: .sp ! 992: .KE ! 993: .PP ! 994: All I/O operations on files are carried out ! 995: with the aid of the corresponding i-node table entry. ! 996: The accessing of a file is a straightforward ! 997: implementation of the algorithms mentioned previously. ! 998: The user is not aware of i-nodes and i-numbers. ! 999: References to the file system are made in terms of ! 1000: path names of the directory tree. ! 1001: Converting a path name into an i-node table entry ! 1002: is also straightforward. ! 1003: Starting at some known i-node ! 1004: (the root or the current directory of some process), ! 1005: the next component of the path name is ! 1006: searched by reading the directory. ! 1007: This gives an i-number and an implied device ! 1008: (that of the directory). ! 1009: Thus the next i-node table entry can be accessed. ! 1010: If that was the last component of the path name, ! 1011: then this i-node is the result. ! 1012: If not, ! 1013: this i-node is the directory needed to look up ! 1014: the next component of the path name, and the ! 1015: algorithm is repeated. ! 1016: .PP ! 1017: The user process accesses the file system with ! 1018: certain primitives. ! 1019: The most common of these are ! 1020: .UL open , ! 1021: .UL create , ! 1022: .UL read , ! 1023: .UL write , ! 1024: .UL seek , ! 1025: and ! 1026: .UL close . ! 1027: The data structures maintained are shown in Fig. 2. ! 1028: In the system data segment associated with a user, ! 1029: there is room for some (usually between 10 and 50) open files. ! 1030: This open file table consists of pointers that can be used to access ! 1031: corresponding i-node table entries. ! 1032: Associated with each of these open files is ! 1033: a current I/O pointer. ! 1034: This is a byte offset of ! 1035: the next read/write operation on the file. ! 1036: The system treats each read/write request ! 1037: as random with an implied seek to the ! 1038: I/O pointer. ! 1039: The user usually thinks of the file as ! 1040: sequential with the I/O pointer ! 1041: automatically counting the number of bytes ! 1042: that have been read/written from the file. ! 1043: The user may, ! 1044: of course, ! 1045: perform random I/O by setting the I/O pointer ! 1046: before reads/writes. ! 1047: .PP ! 1048: With file sharing, ! 1049: it is necessary to allow related ! 1050: processes to share a common I/O pointer ! 1051: and yet have separate I/O pointers ! 1052: for independent processes ! 1053: that access the same file. ! 1054: With these two conditions, ! 1055: the I/O pointer cannot reside ! 1056: in the i-node table nor can ! 1057: it reside in the list of ! 1058: open files for the process. ! 1059: A new table ! 1060: (the open file table) ! 1061: was invented for the sole purpose ! 1062: of holding the I/O pointer. ! 1063: Processes that share the same open ! 1064: file ! 1065: (the result of ! 1066: .UL fork s) ! 1067: share a common open file table entry. ! 1068: A separate open of the same file will ! 1069: only share the i-node table entry, ! 1070: but will have distinct open file table entries. ! 1071: .PP ! 1072: The main file system primitives are implemented as follows. ! 1073: .UL \&open ! 1074: converts a file system path name into an i-node ! 1075: table entry. ! 1076: A pointer to the i-node table entry is placed in a ! 1077: newly created open file table entry. ! 1078: A pointer to the file table entry is placed in the ! 1079: system data segment for the process. ! 1080: .UL \&create ! 1081: first creates a new i-node entry, ! 1082: writes the i-number into a directory, and ! 1083: then builds the same structure as for an ! 1084: .UL open . ! 1085: .UL \&read ! 1086: and ! 1087: .UL write ! 1088: just access the i-node entry as described above. ! 1089: .UL \&seek ! 1090: simply manipulates the I/O pointer. ! 1091: No physical seeking is done. ! 1092: .UL \&close ! 1093: just frees the structures built by ! 1094: .UL open ! 1095: and ! 1096: .UL create . ! 1097: Reference counts are kept on the open file table entries and ! 1098: the i-node table entries to free these structures after ! 1099: the last reference goes away. ! 1100: .UL \&unlink ! 1101: simply decrements the count of the ! 1102: number of directories pointing at the given i-node. ! 1103: When the last reference to an i-node table entry ! 1104: goes away, ! 1105: if the i-node has no directories pointing to it, ! 1106: then the file is removed and the i-node is freed. ! 1107: This delayed removal of files prevents ! 1108: problems arising from removing active files. ! 1109: A file may be removed while still open. ! 1110: The resulting unnamed file vanishes ! 1111: when the file is closed. ! 1112: This is a method of obtaining temporary files. ! 1113: .PP ! 1114: There is a type of unnamed ! 1115: .UC FIFO ! 1116: file called a ! 1117: .UL pipe. ! 1118: Implementation of ! 1119: .UL pipe s ! 1120: consists of implied ! 1121: .UL seek s ! 1122: before each ! 1123: .UL read ! 1124: or ! 1125: .UL write ! 1126: in order to implement ! 1127: first-in-first-out. ! 1128: There are also checks and synchronization ! 1129: to prevent the ! 1130: writer from grossly outproducing the ! 1131: reader and to prevent the reader from ! 1132: overtaking the writer. ! 1133: .NH 2 ! 1134: Mounted file systems ! 1135: .PP ! 1136: The file system of a ! 1137: .UX ! 1138: system ! 1139: starts with some designated block device ! 1140: formatted as described above to contain ! 1141: a hierarchy. ! 1142: The root of this structure is the root of ! 1143: the ! 1144: .UX ! 1145: file system. ! 1146: A second formatted block device may be ! 1147: mounted ! 1148: at any leaf of ! 1149: the current hierarchy. ! 1150: This logically extends the current hierarchy. ! 1151: The implementation of ! 1152: mounting ! 1153: is trivial. ! 1154: A mount table is maintained containing ! 1155: pairs of designated leaf i-nodes and ! 1156: block devices. ! 1157: When converting a path name into an i-node, ! 1158: a check is made to see if the new i-node is a ! 1159: designated leaf. ! 1160: If it is, ! 1161: the i-node of the root ! 1162: of the block device replaces it. ! 1163: .PP ! 1164: Allocation of space for a file is taken ! 1165: from the free pool on the device on which the ! 1166: file lives. ! 1167: Thus a file system consisting of many ! 1168: mounted devices does not have a common pool of ! 1169: free secondary storage space. ! 1170: This separation of space on different ! 1171: devices is necessary to allow easy ! 1172: unmounting ! 1173: of a device. ! 1174: .NH 2 ! 1175: Other system functions ! 1176: .PP ! 1177: There are some other things that the system ! 1178: does for the user\-a ! 1179: little accounting, ! 1180: a little tracing/debugging, ! 1181: and a little access protection. ! 1182: Most of these things are not very ! 1183: well developed ! 1184: because our use of the system in computing science research ! 1185: does not need them. ! 1186: There are some features that are missed in some ! 1187: applications, for example, better inter-process communication. ! 1188: .PP ! 1189: The ! 1190: .UX ! 1191: kernel is an I/O multiplexer more than ! 1192: a complete operating system. ! 1193: This is as it should be. ! 1194: Because of this outlook, ! 1195: many features are ! 1196: found in most ! 1197: other operating systems that are missing from the ! 1198: .UX ! 1199: kernel. ! 1200: For example, ! 1201: the ! 1202: .UX ! 1203: kernel does not support ! 1204: file access methods, ! 1205: file disposition, ! 1206: file formats, ! 1207: file maximum size, ! 1208: spooling, ! 1209: command language, ! 1210: logical records, ! 1211: physical records, ! 1212: assignment of logical file names, ! 1213: logical file names, ! 1214: more than one character set, ! 1215: an operator's console, ! 1216: an operator, ! 1217: log-in, ! 1218: or log-out. ! 1219: Many of these things are symptoms rather than features. ! 1220: Many of these things are implemented ! 1221: in user software ! 1222: using the kernel as a tool. ! 1223: A good example of this is the command language. ! 1224: .[ ! 1225: bourne shell 1978 bstj ! 1226: %Q This issue ! 1227: .] ! 1228: Each user may have his own command language. ! 1229: Maintenance of such code is as easy as ! 1230: maintaining user code. ! 1231: The idea of implementing ``system'' code with general ! 1232: user primitives ! 1233: comes directly from ! 1234: .UC MULTICS . ! 1235: .[ ! 1236: organick multics 1972 ! 1237: .] ! 1238: .LP ! 1239: .[ ! 1240: $LIST$ ! 1241: .]
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.