|
|
1.1 ! root 1: .so ../ADM/mac ! 2: .XX sam 399 "The Text Editor Sam" ! 3: .nr dP 2 ! 4: .nr dV 3p ! 5: .de Cs ! 6: .br ! 7: .fi ! 8: .ft 2 ! 9: .ps -2 ! 10: .vs -2 ! 11: .. ! 12: .de Ce ! 13: .br ! 14: .nf ! 15: .ft 1 ! 16: .ps ! 17: .vs ! 18: .sp ! 19: .. ! 20: .TL ! 21: The Text Editor Sam\(dg ! 22: .AU ! 23: Rob Pike ! 24: .AI ! 25: .MH ! 26: .AB ! 27: .PP ! 28: .I Sam ! 29: is an interactive multi-file text editor intended for ! 30: bitmap displays. ! 31: A textual command language ! 32: supplements the mouse-driven, cut-and-paste interface ! 33: to make complex or ! 34: repetitive editing tasks easy to specify. ! 35: The language is characterized by the composition of regular expressions ! 36: to describe the structure of the text being modified. ! 37: The treatment of files as a database, with changes logged ! 38: as atomic transactions, guides the implementation and ! 39: makes a general `undo' mechanism straightforward. ! 40: .PP ! 41: .I Sam ! 42: is implemented as two processes connected by a low-bandwidth stream, ! 43: one process handling the display and the other the editing ! 44: algorithms. Therefore it can run with the display process ! 45: in a bitmap terminal and the editor on a local host, ! 46: with both processes on a bitmap-equipped host, or with ! 47: the display process in the terminal and the editor in a ! 48: remote host. ! 49: By suppressing the display process, ! 50: it can even run without a bitmap terminal. ! 51: .AE ! 52: .2C ! 53: .FS ! 54: \(dg This article originally appeared as |reference(spe sam pike). ! 55: It is reprinted by permission of John Wiley & Sons, Ltd. ! 56: .FE ! 57: .NH ! 58: Introduction ! 59: .LP ! 60: .I Sam ! 61: is an interactive text editor that combines cut-and-paste interactive editing with ! 62: an unusual command language based on the composition of regular expressions. ! 63: It is written as two programs: one, the `host part,' runs on a ! 64: .UX ! 65: system ! 66: and implements the command language and provides file access; the other, the ! 67: `terminal part,' runs asynchronously ! 68: on a machine with a mouse and bitmap display ! 69: and supports the display and interactive editing. ! 70: The host part may be even run in isolation on an ordinary terminal ! 71: to edit text using the command ! 72: language, much like a traditional line editor, ! 73: without assistance from a mouse or display. ! 74: Most often, ! 75: the terminal part runs on a Blit|reference(blit bstj) terminal ! 76: (actually on a Teletype DMD 5620, the production version of the Blit), whose ! 77: host connection is an ordinary 9600 bps RS-232 link; ! 78: on the Sun computer the host and display processes run on a single machine, ! 79: connected by a pipe. ! 80: .PP ! 81: .I Sam ! 82: edits uninterpreted ! 83: ASCII text. ! 84: It has no facilities for multiple fonts, graphics or tables, ! 85: unlike MacWrite,|reference(macwrite) Bravo,|reference(bravo) ! 86: Tioga|reference(tioga) ! 87: or Lara.|reference(lara) ! 88: Also unlike them, it has a rich command language. ! 89: (Throughout this paper, the phrase ! 90: .I ! 91: command language ! 92: .R ! 93: refers to ! 94: textual commands; commands activated from the mouse form the ! 95: .I mouse ! 96: .I language. ) ! 97: .I Sam ! 98: developed as an editor for use by programmers, and tries to join ! 99: the styles of the ! 100: .UX ! 101: text editor ! 102: .I ed ! 103: |reference(v7manual)|reference(kernighan pike) ! 104: with that of interactive cut-and-paste editors by ! 105: providing a comfortable mouse-driven interface ! 106: to a program with a solid command language driven by regular expressions. ! 107: The command language developed more than the mouse language, and ! 108: acquired a notation for describing the structure of files ! 109: more richly than as a sequence of lines, ! 110: using a dataflow-like syntax for specifying changes. ! 111: .PP ! 112: The interactive style was influenced by ! 113: .I jim , ! 114: |reference(bstj blit) ! 115: an early cut-and-paste editor for the Blit, and by ! 116: .I mux , ! 117: |reference(latest volume1) ! 118: the Blit window system. ! 119: .I Mux ! 120: merges the original Blit window system, ! 121: .I mpx , ! 122: |reference(bstj blit) ! 123: with cut-and-paste editing, forming something like a ! 124: multiplexed version of ! 125: .I jim ! 126: that edits the output of (and input to) command sessions rather than files. ! 127: .PP ! 128: The first part of this paper describes the command language, then the mouse ! 129: language, and explains how they interact. ! 130: That is followed by a description of the implementation, ! 131: first of the host part, then of the terminal part. ! 132: A principle that influenced the design of ! 133: .I sam ! 134: is that it should have no explicit limits, such as upper limits on ! 135: file size or line length. ! 136: A secondary consideration is that it be efficient. ! 137: To honor these two goals together requires a method for efficiently ! 138: manipulating ! 139: huge strings (files) without breaking them into lines, ! 140: perhaps while making thousands of changes ! 141: under control of the command language. ! 142: .I Sam 's ! 143: method is to ! 144: treat the file as a transaction database, implementing changes as atomic ! 145: updates. These updates may be unwound easily to `undo' changes. ! 146: Efficiency is achieved through a collection of caches that minimizes ! 147: disc traffic and data motion, both within the two parts of the program ! 148: and between them. ! 149: .PP ! 150: The terminal part of ! 151: .I sam ! 152: is fairly straightforward. ! 153: More interesting is how the two halves of the editor stay ! 154: synchronized when either half may initiate a change. ! 155: This is achieved through a data structure that organizes the ! 156: communications and is maintained in parallel by both halves. ! 157: .PP ! 158: The last part of the paper chronicles the writing of ! 159: .I sam ! 160: and discusses the lessons that were learned through its development and use. ! 161: .PP ! 162: The paper is long, but is composed largely of two papers of reasonable length: ! 163: a description of the user interface of ! 164: .I sam ! 165: and a discussion of its implementation. ! 166: They are combined because the implementation is strongly influenced by ! 167: the user interface, and vice versa. ! 168: .NH ! 169: The Interface ! 170: .LP ! 171: .I Sam ! 172: is a text editor for multiple files. ! 173: File names may be provided when it is invoked: ! 174: .P1 2n ! 175: sam file1 file2 ... ! 176: .P2 ! 177: and there are commands ! 178: to add new files and discard unneeded ones. ! 179: Files are not read until necessary ! 180: to complete some command. ! 181: Editing operations apply to an internal copy ! 182: made when the file is read; the ! 183: .UX ! 184: file associated with the copy ! 185: is changed only by an explicit command. ! 186: To simplify the discussion, the internal copy is here called a ! 187: .I file , ! 188: while the disc-resident original is called a ! 189: .I ! 190: disc file. ! 191: .R ! 192: .PP ! 193: .I Sam ! 194: is usually connected to a bitmap display that presents a cut-and-paste ! 195: editor driven by the mouse. ! 196: In this mode, the command language is still available: ! 197: text typed in a special window, called the ! 198: .I sam ! 199: .I window, ! 200: is interpreted ! 201: as commands to be executed in the current file. ! 202: Cut-and-paste editing may be used in any window \(em even in the ! 203: .I sam ! 204: window to construct commands. ! 205: The other mode of operation, invoked by starting ! 206: .I sam ! 207: with the option ! 208: .CW -d ! 209: (for `no download'), ! 210: does not use the mouse or bitmap display, but still permits ! 211: editing using the textual command language, even on an ordinary terminal, ! 212: interactively or from a script. ! 213: .PP ! 214: The following sections describe first the command language (under ! 215: .CW sam\ -d ! 216: and in the ! 217: .I sam ! 218: window), and then the mouse interface. ! 219: These two languages are nearly independent, but connect through the ! 220: .I current ! 221: .I text, ! 222: described below. ! 223: .NH 2 ! 224: The Command Language ! 225: .LP ! 226: A file consists of its contents, which are an array of characters ! 227: (that is, a string); the ! 228: .I name ! 229: of the associated disc file; the ! 230: .I ! 231: modified bit ! 232: .R ! 233: that states whether the contents match those of ! 234: the disc file; ! 235: and a substring of the contents, called the ! 236: .I ! 237: current text ! 238: .R ! 239: or ! 240: .I dot ! 241: (see Figures 1 and 2). ! 242: If the current text is a null string, dot falls between characters. ! 243: The ! 244: .I value ! 245: of dot is the location of the current text; the ! 246: .I contents ! 247: of dot are the characters it contains. ! 248: .I Sam ! 249: imparts to the text no two-dimensional interpretation such as columns ! 250: or fields; text is always one-dimensional. ! 251: Even the idea of a `line' of text as understood by most ! 252: .UX ! 253: programs ! 254: \(em a sequence of characters terminated by a newline character \(em ! 255: is only weakly supported. ! 256: .PP ! 257: The ! 258: .I ! 259: current file ! 260: .R ! 261: is the file to which editing commands refer. ! 262: The current text is therefore dot in the current file. ! 263: If a command doesn't explicitly name a particular file or piece of text, ! 264: the command is assumed to apply to the current text. ! 265: For the moment, ignore the presence of multiple files and consider ! 266: editing a single file. ! 267: .1C ! 268: .KF L ! 269: .IB fig1.ps 3.10i 5.65i c ! 270: .Cs ! 271: Figure 1. A typical ! 272: .I sam ! 273: screen, with the editing menu presented. ! 274: The ! 275: .I sam ! 276: (command language) window is in the middle, with file windows above and below. ! 277: (The user interface makes it easy to create these abutting windows.) ! 278: The partially obscured window is a third file window. ! 279: The uppermost window is that to which typing and mouse operations apply, ! 280: as indicated by its heavy border. ! 281: Each window has its current text highlighted in reverse video. ! 282: The ! 283: .I sam ! 284: window's current text is the null string on the last visible line, ! 285: indicated by a vertical bar. ! 286: See also Figure 2. ! 287: .Ce ! 288: .KE ! 289: .2C ! 290: .PP ! 291: Commands have one-letter names. ! 292: Except for non-editing commands such as writing ! 293: the file to disc, most commands make some change ! 294: to the text in dot and leave dot set to the text resulting from the change. ! 295: For example, the delete command, ! 296: .CW d , ! 297: deletes the text in dot, replacing it by the null string and setting dot ! 298: to the result. ! 299: The change command, ! 300: .CW c , ! 301: replaces dot by text delimited by an arbitrary punctuation character, ! 302: conventionally ! 303: a slash. Thus, ! 304: .P1 2n ! 305: c/Peter/ ! 306: .P2 ! 307: replaces the text in dot by the string ! 308: .CW Peter . ! 309: Similarly, ! 310: .P1 2n ! 311: a/Peter/ ! 312: .P2 ! 313: (append) adds the string after dot, and ! 314: .P1 2n ! 315: i/Peter/ ! 316: .P2 ! 317: (insert) inserts before dot. ! 318: All three leave dot set to the new text, ! 319: .CW Peter . ! 320: .PP ! 321: Newlines are part of the syntax of commands: ! 322: the newline character lexically terminates a command. ! 323: Within the inserted text, however, newlines are never implicit. ! 324: But since it is often convenient to insert multiple lines of text, ! 325: .I sam ! 326: has a special ! 327: syntax for that case: ! 328: .P1 2n ! 329: a ! 330: some lines of text ! 331: to be inserted in the file, ! 332: terminated by a period ! 333: on a line by itself ! 334: \&. ! 335: .P2 ! 336: In the one-line syntax, a newline character may be specified by a C-like ! 337: escape, so ! 338: .P1 2n ! 339: c/\en/ ! 340: .P2 ! 341: replaces dot by a single newline character. ! 342: .PP ! 343: .I Sam ! 344: also has a substitute command, ! 345: .CW s : ! 346: .P1 2n ! 347: s/\f2expression\fP/\f2replacement\fP/ ! 348: .P2 ! 349: substitutes the replacement text for the first match, in dot, ! 350: of the regular expression. ! 351: Thus, if dot is the string ! 352: .CW Peter , ! 353: the command ! 354: .P1 2n ! 355: s/t/st/ ! 356: .P2 ! 357: changes it to ! 358: .CW Pester . ! 359: In general, ! 360: .CW s ! 361: is unnecessary, but it was inherited from ! 362: .I ed ! 363: and it has some convenient variations. ! 364: For instance, the replacement text may include the matched text, ! 365: specified by ! 366: .CW & : ! 367: .P1 2n ! 368: s/Peter/Oh, &, &, &, &!/ ! 369: .P2 ! 370: .PP ! 371: There are also three commands that apply programs ! 372: to text: ! 373: .P1 2n ! 374: < \f2Unix program\fP ! 375: .P2 ! 376: replaces dot by the output of the ! 377: .UX ! 378: program. ! 379: Similarly, the ! 380: .CW > ! 381: command ! 382: runs the program with dot as its standard input, and ! 383: .CW | ! 384: does both. For example, ! 385: .P1 2n ! 386: | sort ! 387: .P2 ! 388: replaces dot by the result of applying the standard sorting utility to it. ! 389: Again, newlines have no special significance for these ! 390: .I sam ! 391: commands. ! 392: The text acted upon and resulting from these commands is not necessarily ! 393: bounded by newlines, although for connection with ! 394: .UX ! 395: programs, ! 396: newlines may be necessary to obey conventions. ! 397: .PP ! 398: One more command: ! 399: .CW p ! 400: prints the contents of dot. ! 401: Table I summarizes ! 402: .I sam 's ! 403: commands. ! 404: .1C ! 405: .KF ! 406: .Tm Table of Commands s ! 407: .TS ! 408: center; ! 409: c s ! 410: lfCW l. ! 411: Table I. \fISam\fP commands ! 412: .sp .4 ! 413: .ft CW ! 414: _ ! 415: .ft ! 416: .sp .4 ! 417: \f1Text commands\fP ! 418: .sp .4 ! 419: _ ! 420: .sp .4 ! 421: a/\f2text\fP/ Append text after dot ! 422: c/\f2text\fP/ Change text in dot ! 423: i/\f2text\fP/ Insert text before dot ! 424: d Delete text in dot ! 425: s/\f2regexp\fP/\f2text\fP/ Substitute text for match of regular expression in dot ! 426: m \f2address\fP Move text in dot after address ! 427: t \f2address\fP Copy text in dot after address ! 428: .sp .4 ! 429: _ ! 430: .sp .4 ! 431: \f1Display commands\fP ! 432: .sp .4 ! 433: _ ! 434: .sp .2 ! 435: p Print contents of dot ! 436: \&= Print value (line numbers and character numbers) of dot ! 437: .sp .4 ! 438: _ ! 439: .sp .4 ! 440: \f1File commands\fP ! 441: .sp .4 ! 442: _ ! 443: .sp .2 ! 444: b \f2file-list\fP Set current file to first file in list that \fIsam\fP has in menu ! 445: B \f2file-list\fP Same as \f(CWb\fP, but load new files ! 446: n Print menu lines of all files ! 447: D \f2file-list\fP Delete named files from \fIsam\fP ! 448: .sp .4 ! 449: _ ! 450: .sp .4 ! 451: \f1I/O commands\fP ! 452: .sp .4 ! 453: _ ! 454: .sp .2 ! 455: e \f2filename\fP Replace file with named disc file ! 456: r \f2filename\fP Replace dot by contents of named disc file ! 457: w \f2filename\fP Write file to named disc file ! 458: f \f2filename\fP Set file name and print new menu line ! 459: < \f2Unix-command\fP Replace dot by standard output of command ! 460: > \f2Unix-command\fP Send dot to standard input of command ! 461: | \f2Unix-command\fP Replace dot by result of command applied to dot ! 462: ! \f2Unix-command\fP Run the command ! 463: .sp .4 ! 464: _ ! 465: .sp .4 ! 466: \f1Loops and conditionals\fP ! 467: .sp .4 ! 468: _ ! 469: .sp .2 ! 470: x/\f2regexp\fP/ \f2command\fP For each match of regexp, set dot and run command ! 471: y/\f2regexp\fP/ \f2command\fP Between adjacent matches of regexp, set dot and run command ! 472: X/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line matches regexp ! 473: Y/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line does not match ! 474: g/\f2regexp\fP/ \f2command\fP If dot contains a match of regexp, run command ! 475: v/\f2regexp\fP/ \f2command\fP If dot does not contain a match of regexp, run command ! 476: .sp .4 ! 477: _ ! 478: .sp .4 ! 479: \f1Miscellany\fP ! 480: .sp .4 ! 481: _ ! 482: .sp .2 ! 483: k Set address mark to value of dot ! 484: q Quit ! 485: u \f2n\fP Undo last \f2n\fP (default 1) changes ! 486: { } Braces group commands ! 487: .sp .3 ! 488: .ft CW ! 489: _ ! 490: .ft ! 491: .TE ! 492: .sp ! 493: .Tm n S ! 494: .Tm k S ! 495: .Tm q S ! 496: .Tm p S ! 497: .KE ! 498: .2C ! 499: .PP ! 500: The value of dot may be changed by ! 501: specifying an ! 502: .I address ! 503: for the command. ! 504: The simplest address is a line number: ! 505: .P1 2n ! 506: 3 ! 507: .P2 ! 508: refers to the third line of the file, so ! 509: .P1 2n ! 510: 3d ! 511: .P2 ! 512: deletes the third line of the file, and implicitly renumbers ! 513: the lines so the old line 4 is now numbered 3. ! 514: (This is one of the few places where ! 515: .I sam ! 516: deals with lines directly.) ! 517: Line ! 518: .CW 0 ! 519: is the null string at the beginning of the file. ! 520: If a command consists of only an address, a ! 521: .CW p ! 522: command is assumed, so typing an unadorned ! 523: .CW 3 ! 524: prints line 3 on the terminal. ! 525: There are a couple of other basic addresses: ! 526: a period addresses dot itself; and ! 527: a dollar sign ! 528: .CW $ ) ( ! 529: addresses the null string at the end of the file. ! 530: .PP ! 531: An address is always a single substring of the file. ! 532: Thus, the address ! 533: .CW 3 ! 534: addresses the characters ! 535: after the second newline of ! 536: the file through the third newline of the file. ! 537: A ! 538: .I ! 539: compound address ! 540: .R ! 541: is constructed by the comma operator ! 542: .P1 2n ! 543: \f2address1\fP,\f2address2\fP ! 544: .P2 ! 545: and addresses the substring of the file from the beginning of ! 546: .I address1 ! 547: to the end of ! 548: .I address2 . ! 549: For example, the command ! 550: .CW 3,5p ! 551: prints the third through fifth lines of the file and ! 552: .CW .,$d ! 553: deletes the text from the beginning of dot to the end of the file. ! 554: .PP ! 555: These addresses are all absolute positions in the file, but ! 556: .I sam ! 557: also has relative addresses, indicated by ! 558: .CW + ! 559: or ! 560: .CW - . ! 561: For example, ! 562: .P1 2n ! 563: $-3 ! 564: .P2 ! 565: is the third line before the end of the file and ! 566: .P1 2n ! 567: \&.+1 ! 568: .P2 ! 569: is the line after dot. ! 570: If no address appears to the left of the ! 571: .CW + ! 572: or ! 573: .CW - , ! 574: dot is assumed; ! 575: if nothing appears to the right, ! 576: .CW 1 ! 577: is assumed. ! 578: Therefore, ! 579: .CW .+1 ! 580: may be abbreviated to just a plus sign. ! 581: .PP ! 582: The ! 583: .CW + ! 584: operator acts relative to the end of its first argument, while the ! 585: .CW - ! 586: operator acts relative to the beginning. Thus ! 587: .CW .+1 ! 588: addresses the first line after dot, ! 589: .CW .- ! 590: addresses the first line before dot, and ! 591: .CW +- ! 592: refers to the line containing the end of dot. (Dot may span multiple lines, and ! 593: .CW + ! 594: selects the line after the end of dot, then ! 595: .CW - ! 596: backs up one line.) ! 597: .PP ! 598: The final type of address is a regular expression, which addresses the ! 599: text matched by the expression. The expression is enclosed in slashes, as in ! 600: .P1 2n ! 601: /\f2expression\fP/ ! 602: .P2 ! 603: The expressions are the same as those in the ! 604: .UX ! 605: program ! 606: .I egrep , ! 607: |reference(v7manual)|reference(kernighan pike) ! 608: and include closures, alternations, and so on. ! 609: They find the ! 610: .I ! 611: leftmost longest ! 612: .R ! 613: string that matches the expression, that is, ! 614: the first match after the point where the search is started, ! 615: and if more than one match begins at the same spot, the longest such match. ! 616: (I assume familiarity with the syntax for regular expressions in ! 617: .UX ! 618: programs.|reference(bsdmanual)) ! 619: For example, ! 620: .P1 2n ! 621: /x/ ! 622: .P2 ! 623: matches the next ! 624: .CW x ! 625: character in the file, ! 626: .P1 2n ! 627: /xx*/ ! 628: .P2 ! 629: matches the next run of one or more ! 630: .CW x 's, ! 631: and ! 632: .P1 2n ! 633: /x|Peter/ ! 634: .P2 ! 635: matches the next ! 636: .CW x ! 637: or ! 638: .CW Peter . ! 639: For compatibility with other ! 640: .UX ! 641: programs, the `any character' operator, ! 642: a period, ! 643: does not match a newline, so ! 644: .P1 2n ! 645: /.*/ ! 646: .P2 ! 647: matches the text from dot to the end of the line, but excludes the newline ! 648: and so will not match across ! 649: the line boundary. ! 650: .PP ! 651: Regular expressions are always relative addresses. ! 652: The direction is forwards by default, ! 653: so ! 654: .CW /Peter/ ! 655: is really an abbreviation for ! 656: .CW +/Peter/ . ! 657: The search can be reversed with a minus sign, so ! 658: .P1 2n ! 659: .CW -/Peter/ ! 660: .P2 ! 661: finds the first ! 662: .CW Peter ! 663: before dot. ! 664: Regular expressions may be used with other address forms, so ! 665: .CW 0+/Peter/ ! 666: finds the first ! 667: .CW Peter ! 668: in the file and ! 669: .CW $-/Peter/ ! 670: finds the last. ! 671: Table II summarizes ! 672: .I sam 's ! 673: addresses. ! 674: .1C ! 675: .KF ! 676: .TS ! 677: center; ! 678: c s ! 679: lfCW l. ! 680: Table II. \fISam\fP addresses ! 681: .sp .4 ! 682: .ft CW ! 683: _ ! 684: .ft ! 685: .sp .4 ! 686: \f1Simple addresses\fP ! 687: .sp .4 ! 688: _ ! 689: .sp .2 ! 690: #\f2n\fP The empty string after character \f2n\fP ! 691: \f2n\fP Line \f2n\fP. ! 692: /\f2regexp\fP/ The first following match of the regular expression ! 693: -/\f2regexp\fP/ The first previous match of the regular expression ! 694: $ The null string at the end of the file ! 695: \&. Dot ! 696: \&' The address mark, set by \f(CWk\fP command ! 697: "\f2regexp\fP" Dot in the file whose menu line matches regexp ! 698: .sp .4 ! 699: _ ! 700: .sp .4 ! 701: \f1Compound addresses\fP ! 702: .sp .4 ! 703: _ ! 704: .sp .2 ! 705: \f2a1\fP+\f2a2\fP The address \f2a2\fP evaluated starting at right of \f2a1\fP ! 706: \f2a1\fP-\f2a2\fP \f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP ! 707: \f2a1\fP,\f2a2\fP From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP) ! 708: \f2a1\fP;\f2a2\fP Like \f(CW,\fP but sets dot after evaluating \f2a1\fP ! 709: .sp .4 ! 710: _ ! 711: .sp .4 ! 712: .T& ! 713: c s. ! 714: T{ ! 715: The operators ! 716: .CW + ! 717: and ! 718: .CW - ! 719: are high precedence, while ! 720: .CW , ! 721: and ! 722: .CW ; ! 723: are low precedence. ! 724: In both ! 725: .CW + ! 726: and ! 727: .CW - ! 728: forms, ! 729: .I a2 ! 730: defaults to 1 and ! 731: .I a1 ! 732: defaults to dot. ! 733: If both ! 734: .I a1 ! 735: and ! 736: .I a2 ! 737: are present, ! 738: .CW + ! 739: may be elided. ! 740: T} ! 741: .sp .5 ! 742: .ft CW ! 743: _ ! 744: .ft ! 745: .TE ! 746: .sp ! 747: .KE ! 748: .2C ! 749: .PP ! 750: The language discussed so far will not seem novel ! 751: to people who use ! 752: .UX ! 753: text editors ! 754: such as ! 755: .I ed ! 756: or ! 757: .I vi . ! 758: |reference(bsdmanual) ! 759: Moreover, the kinds of editing operations these commands allow, with the exception ! 760: of regular expressions and line numbers, ! 761: are clearly more conveniently handled by a mouse-based interface. ! 762: Indeed, ! 763: .I sam 's ! 764: mouse language (discussed at length below) is the means by which ! 765: simple changes are usually made. ! 766: For large or repetitive changes, however, a textual language ! 767: outperforms a manual interface. ! 768: .PP ! 769: Imagine that, instead of deleting just one occurrence of the string ! 770: .CW Peter , ! 771: we wanted to eliminate every ! 772: .CW Peter . ! 773: What's needed is an iterator that runs a command for each occurrence of some ! 774: text. ! 775: .I Sam 's ! 776: iterator is called ! 777: .CW x , ! 778: for extract: ! 779: .P1 2n ! 780: x/\f2expression\fP/ \f2command\fP ! 781: .P2 ! 782: finds all matches in dot of the specified expression, and for each ! 783: such match, sets dot to the text matched and runs the command. ! 784: So to delete all the ! 785: .CW Peters: ! 786: .P1 2n ! 787: 0,$ x/Peter/ d ! 788: .P2 ! 789: (Blanks in these examples are to improve readability; ! 790: .I sam ! 791: neither requires nor interprets them.) ! 792: This searches the entire file ! 793: .CW 0,$ ) ( ! 794: for occurrences of the string ! 795: .CW Peter , ! 796: and runs the ! 797: .CW d ! 798: command with dot set to each such occurrence. ! 799: (By contrast, the comparable ! 800: .I ed ! 801: command would delete all ! 802: .I lines ! 803: containing ! 804: .CW Peter ; ! 805: .I sam ! 806: deletes only the ! 807: .CW Peters .) ! 808: The address ! 809: .CW 0,$ ! 810: is commonly used, and may be abbreviated to just a comma. ! 811: As another example, ! 812: .P1 2n ! 813: , x/Peter/ p ! 814: .P2 ! 815: prints a list of ! 816: .CW Peters, ! 817: one for each appearance in the file, with no intervening text (not even newlines ! 818: to separate the instances). ! 819: .PP ! 820: Of course, the text extracted by ! 821: .CW x ! 822: may be selected by a regular expression, ! 823: which complicates deciding what set of matches is chosen \(em ! 824: matches may overlap. This is resolved by generating the matches ! 825: starting from the beginning of dot using the leftmost-longest rule, ! 826: and searching for each match starting from the end of the previous one. ! 827: Regular expressions may also match null strings, but a null match ! 828: adjacent to a non-null match is never selected; at least one character ! 829: must intervene. ! 830: For example, ! 831: .P1 2n ! 832: , c/AAA/ ! 833: x/B*/ c/-/ ! 834: , p ! 835: .P2 ! 836: produces as output ! 837: .P1 2n ! 838: -A-A-A- ! 839: .P2 ! 840: because the pattern ! 841: .CW B* ! 842: matches the null strings separating the ! 843: .CW A 's. ! 844: .PP ! 845: The ! 846: .CW x ! 847: command has a complement, ! 848: .CW y , ! 849: with similar syntax, that executes the command with dot set to the text ! 850: .I between ! 851: the matches of the expression. ! 852: For example, ! 853: .P1 2n ! 854: , c/AAA/ ! 855: y/A/ c/-/ ! 856: , p ! 857: .P2 ! 858: produces the same result as the example above. ! 859: .PP ! 860: The ! 861: .CW x ! 862: and ! 863: .CW y ! 864: commands are looping constructs, and ! 865: .I sam ! 866: has a pair of conditional commands to go with them. ! 867: They have similar syntax: ! 868: .P1 2n ! 869: g/\f2expression\fP/ \f2command\fP ! 870: .P2 ! 871: (guard) ! 872: runs the command exactly once if dot contains a match of the expression. ! 873: This is different from ! 874: .CW x , ! 875: which runs the command for ! 876: .I each ! 877: match: ! 878: .CW x ! 879: loops; ! 880: .CW g ! 881: merely tests, without changing the value of dot. ! 882: Thus, ! 883: .P1 2n ! 884: , x/Peter/ d ! 885: .P2 ! 886: deletes all occurrences of ! 887: .CW Peter , ! 888: but ! 889: .P1 2n ! 890: , g/Peter/ d ! 891: .P2 ! 892: deletes the whole file (reduces it to a null string) if ! 893: .CW Peter ! 894: occurs anywhere in the text. ! 895: The complementary conditional is ! 896: .CW v , ! 897: which runs the command if there is ! 898: .I no ! 899: match of the expression. ! 900: .PP ! 901: These control-structure-like commands may be composed to construct more ! 902: involved operations. For example, to print those lines of text that ! 903: contain the string ! 904: .CW Peter : ! 905: .P1 2n ! 906: , x/.*\en/ g/Peter/ p ! 907: .P2 ! 908: The ! 909: .CW x ! 910: breaks the file into lines, the ! 911: .CW g ! 912: selects those lines containing ! 913: .CW Peter , ! 914: and the ! 915: .CW p ! 916: prints them. ! 917: This command gives an address for the ! 918: .CW x ! 919: command (the whole file), but because ! 920: .CW g ! 921: does not have an explicit address, it applies to the value of ! 922: dot produced by the ! 923: .CW x ! 924: command, that is, to each line. ! 925: All commands in ! 926: .I sam ! 927: except for the command to write a file to disc use dot for the ! 928: default address. ! 929: .PP ! 930: Composition may be continued indefinitely. ! 931: .P1 2n ! 932: , x/.*\en/ g/Peter/ v/SaltPeter/ p ! 933: .P2 ! 934: prints those lines containing ! 935: .CW Peter ! 936: but ! 937: .I not ! 938: those containing ! 939: .CW SaltPeter . ! 940: .NH 2 ! 941: Structural Regular Expressions ! 942: .LP ! 943: Unlike other ! 944: .UX ! 945: text editors, ! 946: including the non-interactive ones such as ! 947: .I sed ! 948: and ! 949: .I awk , ! 950: |reference(kernighan pike) ! 951: .I sam ! 952: is good for manipulating files with multi-line `records.' ! 953: An example is an on-line phone book composed of records, ! 954: separated by blank lines, of the form ! 955: .P1 2n ! 956: Herbert Tic ! 957: 44 Turnip Ave., Endive, NJ ! 958: 201-5555642 ! 959: ! 960: Norbert Twinge ! 961: 16 Potato St., Cabbagetown, NJ ! 962: 201-5553145 ! 963: ! 964: \&... ! 965: .P2 ! 966: The format may be encoded as a regular expression: ! 967: .P1 2n ! 968: (.+\en)+ ! 969: .P2 ! 970: that is, a sequence of one or more non-blank lines. ! 971: The command to print Mr. Tic's entire record is then ! 972: .P1 2n ! 973: , x/(.+\en)+/ g/^Herbert Tic$/ p ! 974: .P2 ! 975: and that to extract just the phone number is ! 976: .P1 2n ! 977: , x/(.+\en)+/ g/^Herbert Tic$/ ! 978: x/^[0-9]*-[0-9]*\en/ p ! 979: .P2 ! 980: The latter command breaks the file into records, ! 981: chooses Mr. Tic's record, ! 982: extracts the phone number from the record, ! 983: and finally prints the number. ! 984: .PP ! 985: A more involved problem is that of ! 986: renaming a particular variable, say ! 987: .CW n , ! 988: to ! 989: .CW num ! 990: in a C program. ! 991: The obvious first attempt, ! 992: .P1 2n ! 993: , x/n/ c/num/ ! 994: .P2 ! 995: is badly flawed: it changes not only the variable ! 996: .CW n ! 997: but any letter ! 998: .CW n ! 999: that appears. ! 1000: We need to extract all the variables, and select those that match ! 1001: .CW n ! 1002: and only ! 1003: .CW n : ! 1004: .P1 2n ! 1005: , x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ ! 1006: .P2 ! 1007: The pattern ! 1008: .CW [A-Za-z_][A-Za-z_0-9]* ! 1009: matches C identifiers. ! 1010: Next ! 1011: .CW g/n/ ! 1012: selects those containing an ! 1013: .CW n . ! 1014: Then ! 1015: .CW v/../ ! 1016: rejects those containing two (or more) characters, and finally ! 1017: .CW c/num/ ! 1018: changes the remainder (identifiers ! 1019: .CW n ) ! 1020: to ! 1021: .CW num . ! 1022: This version clearly works much better, but there may still be problems. ! 1023: For example, in C character and string constants, the sequence ! 1024: .CW \en ! 1025: is interpreted as a newline character, and we don't want to change it to ! 1026: .CW \enum. ! 1027: This problem can be forestalled with a ! 1028: .CW y ! 1029: command ! 1030: (the following two examples are split onto multiple lines for readability): ! 1031: .P1 2n ! 1032: , y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ ! 1033: g/n/ v/../ c/num/ ! 1034: .P2 ! 1035: (the second ! 1036: .CW \e ! 1037: is necessary because of lexical conventions in regular expressions), ! 1038: or we could even reject character constants and strings outright: ! 1039: .P1 2n ! 1040: , y/'[^']*'/ y/"[^"]*"/ ! 1041: x/[A-Za-z_][A-Za-z_0-9]*/ ! 1042: g/n/ v/../ c/num/ ! 1043: .P2 ! 1044: The ! 1045: .CW y ! 1046: commands in this version exclude from consideration all character constants ! 1047: and strings. ! 1048: The only remaining problem is to deal with the possible occurrence of ! 1049: .CW \e' ! 1050: or ! 1051: .CW \e" ! 1052: within these sequences, but it's easy to see how to resolve this difficulty. ! 1053: .PP ! 1054: The point of these composed commands is successive refinement. ! 1055: A simple version of the command is tried, and if it's not good enough, ! 1056: it can be honed by adding a clause or two. ! 1057: (Mistakes can be undone; see below. ! 1058: Also, the mouse language makes it unnecessary to retype the command each time.) ! 1059: The resulting chains of commands are somewhat reminiscent of ! 1060: shell pipelines.|reference(kernighan pike) ! 1061: Unlike pipelines, though, which pass along modified ! 1062: .I data , ! 1063: .I sam ! 1064: commands pass a ! 1065: .I view ! 1066: of the data. ! 1067: The text at each step of the command is the same, but which pieces ! 1068: are selected is refined step by step until the correct piece is ! 1069: available to the final step of the command line, which ultimately makes the change. ! 1070: .PP ! 1071: In other ! 1072: .UX ! 1073: programs, regular expressions are used only for selection, ! 1074: as in the ! 1075: .I sam ! 1076: .CW g ! 1077: command, never for extraction as in the ! 1078: .CW x ! 1079: or ! 1080: .CW y ! 1081: command. ! 1082: For example, patterns in ! 1083: .I awk ! 1084: |reference(kernighan pike) ! 1085: are used to select lines to be operated on, but cannot be used ! 1086: to describe the format of the input text, or to handle newline-free text. ! 1087: The use of regular expressions to describe the structure of a piece ! 1088: of text rather than its contents, as in the ! 1089: .CW x ! 1090: command, ! 1091: has been given a name: ! 1092: .I ! 1093: structural regular expressions. ! 1094: .R ! 1095: When they are composed, as in the above example, ! 1096: they are pleasantly expressive. ! 1097: Their use is discussed at greater length elsewhere.|reference(pike structural) ! 1098: .NH 2 ! 1099: Multiple files ! 1100: .LP ! 1101: .I Sam ! 1102: has a few other commands, mostly relating to input and output. ! 1103: .P1 2n ! 1104: e discfilename ! 1105: .P2 ! 1106: replaces the contents and name of the current file with those of the named ! 1107: disc file; ! 1108: .P1 2n ! 1109: w discfilename ! 1110: .P2 ! 1111: writes the contents to the named disc file; and ! 1112: .P1 2n ! 1113: r discfilename ! 1114: .P2 ! 1115: replaces dot with the contents of the named disc file. ! 1116: All these commands use the current file's name if none is specified. ! 1117: Finally, ! 1118: .P1 2n ! 1119: f discfilename ! 1120: .P2 ! 1121: changes the name associated with the file and displays the result: ! 1122: .P1 2n ! 1123: \&'-. discfilename ! 1124: .P2 ! 1125: This output is called the file's ! 1126: .I ! 1127: menu line, ! 1128: .R ! 1129: because it is the contents of the file's line in the button 3 menu (described ! 1130: in the ! 1131: next section). ! 1132: The first three characters are a concise notation for the state of the file. ! 1133: The apostrophe signifies that the file is modified. ! 1134: The minus sign indicates the number of windows ! 1135: open on the file (see the next section): ! 1136: .CW - ! 1137: means none, ! 1138: .CW + ! 1139: means one, and ! 1140: .CW * ! 1141: means more than one. ! 1142: Finally, the period indicates that this is the current file. ! 1143: These characters are useful for controlling the ! 1144: .CW X ! 1145: command, described shortly. ! 1146: .PP ! 1147: .I Sam ! 1148: may be started with a set of disc files (such as all the source for ! 1149: a program) by invoking it with a list of file names as arguments, and ! 1150: more may be added or deleted on demand. ! 1151: .P1 2n ! 1152: B discfile1 discfile2 ... ! 1153: .P2 ! 1154: adds the named files to ! 1155: .I sam 's ! 1156: list, and ! 1157: .P1 2n ! 1158: D discfile1 discfile2 ... ! 1159: .P2 ! 1160: removes them from ! 1161: .I sam 's ! 1162: memory (without effect on associated disc files). ! 1163: Both these commands have a syntax for using the shell|reference(kernighan pike) ! 1164: (the ! 1165: .UX ! 1166: command interpreter) to generate the lists: ! 1167: .P1 2n ! 1168: B <echo *.c ! 1169: .P2 ! 1170: will add all C source files, and ! 1171: .P1 2n ! 1172: B <grep -l variable *.c ! 1173: .P2 ! 1174: will add all C source files referencing a particular variable ! 1175: (the ! 1176: .UX ! 1177: command ! 1178: .CW grep\ -l ! 1179: lists all files in its arguments that contain matches of ! 1180: the specified regular expression). ! 1181: Finally, ! 1182: .CW D ! 1183: without arguments deletes the current file. ! 1184: .PP ! 1185: There are two ways to change which file is current: ! 1186: .P1 2n ! 1187: b filename ! 1188: .P2 ! 1189: makes the named file current. ! 1190: The ! 1191: .CW B ! 1192: command ! 1193: does the same, but also adds any new files to ! 1194: .I sam 's ! 1195: list. ! 1196: (In practice, of course, the current file ! 1197: is usually chosen by mouse actions, not by textual commands.) ! 1198: The other way is to use a form of address that refers to files: ! 1199: .P1 2n ! 1200: "\f2expression\fP" \f2address\fP ! 1201: .P2 ! 1202: refers to the address evaluated in the file whose menu line ! 1203: matches the expression (there must be exactly one match). ! 1204: For example, ! 1205: .P1 2n ! 1206: "peter.c" 3 ! 1207: .P2 ! 1208: refers to the third line of the file whose name matches ! 1209: .CW peter.c . ! 1210: This is most useful in the move ! 1211: .CW m ) ( ! 1212: and copy ! 1213: .CW t ) ( ! 1214: commands: ! 1215: .P1 2n ! 1216: 0,$ t "peter.c" 0 ! 1217: .P2 ! 1218: makes a copy of the current file at the beginning of ! 1219: .CW peter.c . ! 1220: .PP ! 1221: The ! 1222: .CW X ! 1223: command ! 1224: is a looping construct, like ! 1225: .CW x , ! 1226: that refers to files instead of strings: ! 1227: .P1 2n ! 1228: X/\f2expression\fP/ \f2command\fP ! 1229: .P2 ! 1230: runs the command in all ! 1231: files whose menu lines match the expression. The best example is ! 1232: .P1 2n ! 1233: X/'/ w ! 1234: .P2 ! 1235: which writes to disc all modified files. ! 1236: .CW Y ! 1237: is the complement of ! 1238: .CW X : ! 1239: it runs the command on all files whose menu lines don't match the expression: ! 1240: .P1 2n ! 1241: Y/\e.c/ D ! 1242: .P2 ! 1243: deletes all files that don't have ! 1244: .CW \&.c ! 1245: in their names, that is, it keeps all C source files and deletes the rest. ! 1246: .PP ! 1247: Braces allow commands to be grouped, so ! 1248: .P1 2n ! 1249: { ! 1250: \f2command1\fP ! 1251: \f2command2\fP ! 1252: } ! 1253: .P2 ! 1254: is syntactically a single command that runs two commands. ! 1255: Thus, ! 1256: .P1 2n ! 1257: X/\e.c/ ,g/variable/ { ! 1258: f ! 1259: , x/.*\en/ g/variable/ p ! 1260: } ! 1261: .P2 ! 1262: finds all occurrences of ! 1263: .CW variable ! 1264: in C source files, and prints ! 1265: out the file names and lines of each match. ! 1266: The precise semantics of compound operations is discussed in the implementation ! 1267: sections below. ! 1268: .PP ! 1269: Finally, ! 1270: the undo command, ! 1271: .CW u , ! 1272: undoes the last command, ! 1273: no matter how many files were affected. ! 1274: Multiple undo operations move further back in time, so ! 1275: .P1 2n ! 1276: u ! 1277: u ! 1278: .P2 ! 1279: (which may be abbreviated ! 1280: .CW u2 ) ! 1281: undoes the last two commands. An undo may not be undone, however, nor ! 1282: may any command that adds or deletes files. ! 1283: Everything else is undoable, though, including for example ! 1284: .CW e ! 1285: commands: ! 1286: .P1 2n ! 1287: e filename ! 1288: u ! 1289: .P2 ! 1290: restores the state of the file completely, including its name, dot, ! 1291: and modified bit. Because of the undo, potentially dangerous commands ! 1292: are not guarded by confirmations. Only ! 1293: .CW D , ! 1294: which destroys the information necessary to restore itself, is protected. ! 1295: It will not delete a modified file, but a second ! 1296: .CW D ! 1297: of the same file will succeed regardless. ! 1298: The ! 1299: .CW q ! 1300: command, which exits ! 1301: .I sam , ! 1302: is similarly guarded. ! 1303: .NH 2 ! 1304: Mouse Interface ! 1305: .LP ! 1306: .I Sam ! 1307: is most commonly run ! 1308: connected to a bitmap display and mouse for interactive editing. ! 1309: The only difference in the command language ! 1310: between regular, mouse-driven ! 1311: .I sam ! 1312: and ! 1313: .CW sam\ -d ! 1314: is that if an address ! 1315: is provided without a command, ! 1316: .CW sam\ -d ! 1317: will print the text referenced by the address, but ! 1318: regular ! 1319: .I sam ! 1320: will highlight it on the screen \(em in fact, ! 1321: dot is always highlighted (see Figure 2). ! 1322: .WS 1 ! 1323: .1C ! 1324: .KF ! 1325: .IB fig2.ps 1.47i 4.61i c ! 1326: .Cs ! 1327: Figure 2. A ! 1328: .I sam ! 1329: window. The scroll bar down the left ! 1330: represents the file, with the bubble showing the fraction ! 1331: visible in the window. ! 1332: The scroll bar may be manipulated by the mouse for convenient browsing. ! 1333: The current text, ! 1334: which is highlighted, need not fit on a line. Here it consists of one partial ! 1335: line, one complete line, and a final partial line. ! 1336: .Ce ! 1337: .KE ! 1338: .2C ! 1339: .PP ! 1340: Each file may have zero or more windows open on the display. ! 1341: At any time, only one window in all of ! 1342: .I sam ! 1343: is the ! 1344: .I ! 1345: current window, ! 1346: .R ! 1347: that is, the window to which typing and mouse actions refer; ! 1348: this may be the ! 1349: .I sam ! 1350: window (that in which commands may be typed) ! 1351: or one of the file windows. ! 1352: When a file has multiple windows, the image of the file in each window ! 1353: is always kept up to date. ! 1354: The current file is the last file affected by a command, ! 1355: so if the ! 1356: .I sam ! 1357: window is current, ! 1358: the current window is not a window on the current file. ! 1359: However, each window on a file has its own value of dot, ! 1360: and when switching between windows on a single file, ! 1361: the file's value of dot is changed to that of the window. ! 1362: Thus, flipping between windows behaves in the obvious, convenient way. ! 1363: .PP ! 1364: The mouse on the Blit has three buttons, numbered left to right. ! 1365: Button 3 has a list of commands to manipulate windows, ! 1366: followed by a list of `menu lines' exactly as printed by the ! 1367: .CW f ! 1368: command, one per file (not one per window). ! 1369: These menu lines are sorted by file name. ! 1370: If the list is long, the Blit menu software will make it more manageable ! 1371: by generating a scrolling menu instead of an unwieldy long list. ! 1372: Using the menu to select a file from the list makes that file the current ! 1373: file, and the most recently current window in that file the current window. ! 1374: But if that file is already current, selecting it in the menu cycles through ! 1375: the windows on the file; this simple trick avoids a special menu to ! 1376: choose windows on a file. ! 1377: If there is no window open on the file, ! 1378: .I sam ! 1379: changes the mouse cursor to prompt the user to create one. ! 1380: .PP ! 1381: The commands on the button 3 menu are straightforward (see Figure 3), and ! 1382: are like the commands to manipulate windows in ! 1383: .I mux , ! 1384: |reference(latest volume1) ! 1385: the Blit's window system. ! 1386: .CW new ! 1387: makes a new file, and gives it one empty window, whose size is determined ! 1388: by a rectangle swept by the mouse. ! 1389: .CW xerox ! 1390: prompts for a window to be selected, and ! 1391: makes a clone of that window; this is how multiple windows are created on one file. ! 1392: .CW reshape ! 1393: changes the size of the indicated window, and ! 1394: .CW close ! 1395: deletes it. If that is the last window open on the file, ! 1396: .CW close ! 1397: first does a ! 1398: .CW D ! 1399: command on the file. ! 1400: .CW write ! 1401: is identical to a ! 1402: .CW w ! 1403: command on the file; it is in the menu purely for convenience. ! 1404: Finally, ! 1405: .CW ~~sam~~ ! 1406: is a menu item that appears between the commands and the file names. ! 1407: Selecting it makes the ! 1408: .I sam ! 1409: window the current window, ! 1410: causing subsequent typing to be interpreted as commands. ! 1411: .KF ! 1412: .IB fig3.ps 1.98i 1.39i c ! 1413: .Cs ! 1414: Figure 3. The menu on button 3. ! 1415: The black rectangle on the left is a scroll bar; the menu is limited to ! 1416: the length shown to prevent its becoming unwieldy. ! 1417: Above the ! 1418: .CW ~~sam~~ ! 1419: line is a list of commands; ! 1420: beneath it is a list of files, presented exactly as with the ! 1421: .CW f ! 1422: command. ! 1423: .Ce ! 1424: .KE ! 1425: .PP ! 1426: When ! 1427: .I sam ! 1428: requests that a window be swept, in response to ! 1429: .CW new , ! 1430: .CW xerox ! 1431: or ! 1432: .CW reshape , ! 1433: it changes the mouse cursor from the usual arrow to a box with ! 1434: a small arrow. ! 1435: In this state, the mouse may be used to indicate an arbitrary rectangle by ! 1436: pressing button 3 at one corner and releasing it at the opposite corner. ! 1437: More conveniently, ! 1438: button 3 may simply be clicked, ! 1439: whereupon ! 1440: .I sam ! 1441: creates the maximal rectangle that contains the cursor ! 1442: and abuts the ! 1443: .I sam ! 1444: window. ! 1445: By placing the ! 1446: .I sam ! 1447: window in the middle of the screen, the user can define two regions (one above, ! 1448: one below) in which stacked fully-overlapping ! 1449: windows can be created with minimal fuss (see Figure 1). ! 1450: This simple user interface trick makes window creation noticeably easier. ! 1451: .PP ! 1452: The cut-and-paste editor is essentially the same as that in Smalltalk-80. ! 1453: |reference(smalltalk80 goldberg) ! 1454: The text in dot is always highlighted on the screen. ! 1455: When a character is typed it replaces dot, and sets dot to the null ! 1456: string after the character. Thus, ordinary typing inserts text. ! 1457: Button 1 is used for selection: ! 1458: pressing the button, moving the mouse, and lifting the button ! 1459: selects (sets dot to) the text between the points where the ! 1460: button was pressed and released. ! 1461: Pressing and releasing at the same point selects a null string; this ! 1462: is called clicking. Clicking twice quickly, or ! 1463: .I ! 1464: double clicking, ! 1465: .R ! 1466: selects larger objects; ! 1467: for example, double clicking in a word selects the word, ! 1468: double clicking just inside an opening bracket selects the text ! 1469: contained in the brackets (handling nested brackets correctly), ! 1470: and similarly for ! 1471: parentheses, quotes, and so on. ! 1472: The double-clicking rules reflect a bias toward ! 1473: programmers. ! 1474: If ! 1475: .I sam ! 1476: were intended more for word processing, double-clicks would probably ! 1477: select linguistic structures such as sentences. ! 1478: .PP ! 1479: If button 1 is pressed outside the current window, it makes the indicated ! 1480: window current. ! 1481: This is the easiest way to switch between windows and files. ! 1482: .PP ! 1483: Pressing button 2 brings up a menu of editing functions (see Figure 4). ! 1484: These mostly apply to the selected text: ! 1485: .CW cut ! 1486: deletes the selected text, and remembers it in a hidden buffer called the ! 1487: .I ! 1488: snarf buffer, ! 1489: .R ! 1490: .CW paste ! 1491: replaces the selected text by the contents of the snarf buffer, ! 1492: .CW snarf ! 1493: just copies the selected text to the snarf buffer, ! 1494: .CW look ! 1495: searches forward for the next literal occurrence of the selected text, and ! 1496: .CW <mux> ! 1497: exchanges snarf buffers with the window system in which ! 1498: .I sam ! 1499: is running. ! 1500: Finally, the last regular expression used appears as a menu entry ! 1501: to search ! 1502: forward for the next occurrence of a match for the expression. ! 1503: .WS 1 ! 1504: .KF ! 1505: .IB fig4.ps 0.87i 0.81i c ! 1506: .Cs ! 1507: Figure 4. The menu on button 2. ! 1508: The bottom entry tracks the most recently used regular expression, which may ! 1509: be literal text. ! 1510: .Ce ! 1511: .KE ! 1512: .PP ! 1513: The relationship between the command language and the mouse language is ! 1514: entirely due to the equality of dot and the selected text chosen ! 1515: with button 1 on the mouse. ! 1516: For example, to make a set of changes in a C subroutine, dot can be ! 1517: set by double clicking on the left brace that begins the subroutine, ! 1518: which sets dot for the command language. ! 1519: An address-free command then typed in the ! 1520: .I sam ! 1521: window will apply only to the text between the opening and closing ! 1522: braces of the function. ! 1523: The idea is to select what you want, and then say what you want ! 1524: to do with it, whether invoked by a menu selection or by a typed command. ! 1525: And of course, the value of dot is highlighted on ! 1526: the display after the command completes. ! 1527: This relationship between mouse interface and command language ! 1528: is clumsy to explain, but comfortable, even natural, in practice. ! 1529: .NH ! 1530: The Implementation ! 1531: .LP ! 1532: The next few sections describe how ! 1533: .I sam ! 1534: is put together, first the host part, ! 1535: then the inter-component communication, ! 1536: then the terminal part. ! 1537: After explaining how the command language is implemented, ! 1538: the discussion follows (roughly) the path of a character ! 1539: from the temporary file on disc to the screen. ! 1540: The presentation centers on the data structures, ! 1541: because that is how the program was designed and because ! 1542: the algorithms are easy to provide, given the right data ! 1543: structures. ! 1544: ........ ! 1545: .NH 2 ! 1546: Parsing and execution ! 1547: .LP ! 1548: The command language is interpreted by parsing each command with a ! 1549: table-driven recursive ! 1550: descent parser, and when a complete command is assembled, invoking a top-down ! 1551: executor. ! 1552: Most editors instead employ a simple character-at-a-time ! 1553: lexical scanner. ! 1554: Use of a parser makes it ! 1555: easy and unambiguous to detect when a command is complete, ! 1556: which has two advantages. ! 1557: First, escape conventions such as backslashes to quote ! 1558: multiple-line commands are unnecessary; if the command isn't finished, ! 1559: the parser keeps reading. For example, a multiple-line append driven by an ! 1560: .CW x ! 1561: command is straightforward: ! 1562: .P1 2n ! 1563: x/.*\en/ g/Peter/ a ! 1564: one line about Peter ! 1565: another line about Peter ! 1566: \&. ! 1567: .P2 ! 1568: Other ! 1569: .UX ! 1570: editors would require a backslash after all but the last line. ! 1571: .PP ! 1572: The other advantage is specific to the two-process structure of ! 1573: .I sam . ! 1574: The host process must decide when a command is completed so the ! 1575: command interpreter can be called. This problem is easily resolved ! 1576: by having the lexical analyzer read the single stream of events from the ! 1577: terminal, directly executing all typing and mouse commands, ! 1578: but passing to the parser characters typed to the ! 1579: .I sam ! 1580: command window. ! 1581: This scheme is slightly complicated by the availability of cut-and-paste ! 1582: editing in the ! 1583: .I sam ! 1584: window, but that difficulty is resolved by applying the rules ! 1585: used in ! 1586: .I mux : ! 1587: when a newline is typed to the ! 1588: .I sam ! 1589: window, all text between the newline and the previously typed newline ! 1590: is made available to the parser. ! 1591: This permits arbitrary editing to be done to a command before ! 1592: typing newline and thereby requesting execution. ! 1593: .PP ! 1594: The parser is driven by a table because the syntax of addresses ! 1595: and commands is regular enough ! 1596: to be encoded compactly. There are few special cases, such as the ! 1597: replacement text in a substitution, so the syntax of almost all commands ! 1598: can be encoded with a few flags. ! 1599: These include whether the command allows an address (for example, ! 1600: .CW e ! 1601: does not), whether it takes a regular expression (as in ! 1602: .CW x ! 1603: and ! 1604: .CW s ), ! 1605: whether it takes replacement text (as in ! 1606: .CW c ! 1607: or ! 1608: .CW i ), ! 1609: which may be multi-line, and so on. ! 1610: The internal syntax of regular expressions is handled by a separate ! 1611: parser; a regular expression is a leaf of the command parse tree. ! 1612: Regular expressions are discussed fully in the next section. ! 1613: .PP ! 1614: The parser table also has information about defaults, so the interpreter ! 1615: is always called with a complete tree. For example, the parser fills in ! 1616: the implicit ! 1617: .CW 0 ! 1618: and ! 1619: .CW $ ! 1620: in the abbreviated address ! 1621: .CW , ! 1622: (comma), ! 1623: inserts a ! 1624: .CW + ! 1625: to the left of an unadorned regular expression in an address, ! 1626: and provides the usual default address ! 1627: .CW . ! 1628: (dot) for commands that expect an address but are not given one. ! 1629: .PP ! 1630: Once a complete command is parsed, the evaluation is easy. ! 1631: The address is evaluated left-to-right starting from the value of dot, ! 1632: with a mostly ordinary expression evaluator. ! 1633: Addresses, like many of the data structures in ! 1634: .I sam , ! 1635: are held in a C structure and passed around by value: ! 1636: .P1 2n ! 1637: ! 1638: typedef long Posn; /* file position */ ! 1639: typedef struct Range{ ! 1640: Posn p1, p2; ! 1641: }Range; ! 1642: typedef struct Address{ ! 1643: Range r; ! 1644: File *f; ! 1645: }Address; ! 1646: .P2 ! 1647: .1C ! 1648: .KF top ! 1649: .SP 1 ! 1650: .P1 10n ! 1651: Address ! 1652: address(ap, a, sign) ! 1653: Addrtree *ap; ! 1654: Address a; ! 1655: int sign; ! 1656: { ! 1657: Address a2; ! 1658: do ! 1659: switch(ap->type){ ! 1660: case '.': ! 1661: a=a.f->dot; ! 1662: break; ! 1663: case '$': ! 1664: a.r.p1=a.r.p2=a.f->nbytes; ! 1665: break; ! 1666: case '"': ! 1667: a=matchfile(a, ap->aregexp)->dot; ! 1668: break; ! 1669: case ',': ! 1670: a2=address(ap->right, a, 0); ! 1671: a=address(ap->left, a, 0); ! 1672: if(a.f!=a2.f || a2.r.p2<a.r.p1) ! 1673: error(Eorder); ! 1674: a.r.p2=a2.r.p2; ! 1675: return a; ! 1676: /* and so on */ ! 1677: } ! 1678: while((ap=ap->right)!=0); ! 1679: return a; ! 1680: } ! 1681: .P2 ! 1682: .ce ! 1683: \fBDisplay 1.\fP The address interpreter ! 1684: .KE ! 1685: .2C ! 1686: An address is encoded as a substring (character positions ! 1687: .CW p1 ! 1688: to ! 1689: .CW p2 ) ! 1690: in a file ! 1691: .CW f . ! 1692: (The data type ! 1693: .CW File ! 1694: is described in detail below.) ! 1695: .PP ! 1696: The address interpreter (shown in Display 1) is an ! 1697: .CW Address -valued ! 1698: function that traverses the parse tree describing an address (the ! 1699: parse tree for the address has type ! 1700: .CW Addrtree ). ! 1701: .PP ! 1702: Throughout, errors are handled by a non-local ! 1703: .CW goto ! 1704: (a ! 1705: .CW setjmp/longjmp ! 1706: in C terminology) ! 1707: hidden in a routine called ! 1708: .CW error ! 1709: that immediately aborts the execution, retracts any ! 1710: partially made changes (see the section below on `undoing'), and ! 1711: returns to the top level of the parser. ! 1712: The argument to ! 1713: .CW error ! 1714: is an enumeration type that ! 1715: is translated to a terse but possibly helpful ! 1716: message such as ! 1717: .CW "?addresses out of order" . ! 1718: Very common messages are kept short; for example the message for ! 1719: a failed regular expression search is ! 1720: .CW ?search . ! 1721: .PP ! 1722: Character addresses such as ! 1723: .CW #3 ! 1724: are trivial to implement, as the ! 1725: .CW File ! 1726: data structure is accessible by character number. ! 1727: However, ! 1728: .I sam ! 1729: keeps no information about the position of newlines \(em it is too ! 1730: expensive to track dynamically \(em so line addresses are computed by reading ! 1731: the file, counting newlines. Except in very large files, this has proven ! 1732: acceptable: file access is fast enough to make the technique practical, ! 1733: and lines are not central to the structure of the command language. ! 1734: .PP ! 1735: The command interpreter, called ! 1736: .CW cmdexec , ! 1737: is also straightforward. The parse table includes a ! 1738: function to call to interpret a particular command. That function ! 1739: receives as arguments ! 1740: the calculated address ! 1741: for the command ! 1742: and the command tree (of type ! 1743: .CW Cmdtree ), ! 1744: which may contain information such as the subtree for compound commands. ! 1745: For example, the function for the ! 1746: .CW g ! 1747: and ! 1748: .CW v ! 1749: commands is shown in Display 2. ! 1750: .CW Compile "" ( ! 1751: and ! 1752: .CW execute ! 1753: are part of the regular expression code, described in the next section.) ! 1754: Because the parser and the ! 1755: .CW File ! 1756: data structure do most of the work, most commands ! 1757: are similarly brief. ! 1758: .1C ! 1759: .KF bottom ! 1760: .P1 10n ! 1761: int ! 1762: g_cmd(a, cp) ! 1763: Address a; ! 1764: Cmdtree *cp; ! 1765: { ! 1766: compile(cp->regexp); ! 1767: if(execute(a.f, a.r.p1, a.r.p2) != (cp->cmdchar=='v')){ ! 1768: a.f->dot=a; ! 1769: return cmdexec(a, cp->subcmd); ! 1770: } ! 1771: return TRUE; /* indicate that execution is to continue */ ! 1772: } ! 1773: .P2 ! 1774: .ce ! 1775: \fBDisplay 2. \f(CWg\fR and \f(CWv\fP commands ! 1776: .SP ! 1777: .KE ! 1778: .2C ! 1779: .NH 2 ! 1780: Regular expressions ! 1781: .LP ! 1782: The regular expression code in ! 1783: .I sam ! 1784: is an interpreted, rather than compiled on-the-fly, implementation of Thompson's ! 1785: non-deterministic finite automaton algorithm.|reference(thompson regular expression) ! 1786: The syntax and semantics of the expressions are as in the ! 1787: .UX ! 1788: program ! 1789: .I egrep , ! 1790: including alternation, closures, character classes, and so on. ! 1791: The only changes in the notation are two additions: ! 1792: .CW \en ! 1793: is translated to, and matches, a newline character, and ! 1794: .CW @ ! 1795: matches any character. In ! 1796: .I egrep , ! 1797: the character ! 1798: .CW \&. ! 1799: matches any character except newline, and in ! 1800: .I sam ! 1801: the same rule seemed safest, to prevent idioms like ! 1802: .CW \&.* ! 1803: from spanning newlines. ! 1804: .I Egrep ! 1805: expressions are arguably too complicated for an interactive editor \(em ! 1806: certainly it would make sense if all the special characters were two-character ! 1807: sequences, so that most of the punctuation characters wouldn't have ! 1808: peculiar meanings \(em but for an interesting command language, full ! 1809: regular expressions are necessary, and ! 1810: .I egrep ! 1811: defines the full regular expression syntax for ! 1812: .UX ! 1813: programs. ! 1814: Also, it seemed superfluous to define a new syntax, since various ! 1815: .UX ! 1816: programs ! 1817: .I ed , ( ! 1818: .I egrep ! 1819: and ! 1820: .I vi ) ! 1821: define too many already. ! 1822: .PP ! 1823: The expressions are compiled by a routine, ! 1824: .CW compile , ! 1825: that generates the description of the non-deterministic finite state machine. ! 1826: A second routine, ! 1827: .CW execute , ! 1828: interprets the machine to generate the leftmost-longest match of the ! 1829: expression in a substring of the file. ! 1830: The algorithm is described elsewhere.|reference(thompson regular expression)|reference(aho hopcroft ullman) ! 1831: .CW Execute ! 1832: reports ! 1833: whether a match was found, and sets a global variable, ! 1834: of type ! 1835: .CW Range , ! 1836: to the substring matched. ! 1837: .PP ! 1838: A trick is required to evaluate the expression in reverse, such as when ! 1839: searching backwards for an expression. ! 1840: For example, ! 1841: .P1 2n ! 1842: -/P.*r/ ! 1843: .P2 ! 1844: looks backwards through the file for a match of the expression. ! 1845: The expression, however, is defined for a forward search. ! 1846: The solution is to construct a machine identical to the machine ! 1847: for a forward search except for a reversal of all the concatenation ! 1848: operators (the other operators are symmetric under direction reversal), ! 1849: to exchange the meaning of the operators ! 1850: .CW ^ ! 1851: and ! 1852: .CW $ , ! 1853: and then to read the file backwards, looking for the ! 1854: usual earliest longest match. ! 1855: .PP ! 1856: .CW Execute ! 1857: generates only one match each time it is called. ! 1858: To interpret looping constructs such as the ! 1859: .CW x ! 1860: command, ! 1861: .I sam ! 1862: must therefore synchronize between ! 1863: calls of ! 1864: .CW execute ! 1865: to avoid ! 1866: problems with null matches. ! 1867: For example, even given the leftmost-longest rule, ! 1868: the expression ! 1869: .CW a* ! 1870: matches three times in the string ! 1871: .CW ab ! 1872: (the character ! 1873: .CW a , ! 1874: the null string between the ! 1875: .CW a ! 1876: and ! 1877: .CW b , ! 1878: and the final null string). ! 1879: After returning a match for the ! 1880: .CW a , ! 1881: .I sam ! 1882: must not match the null string before the ! 1883: .CW b . ! 1884: The algorithm starts ! 1885: .CW execute ! 1886: at the end of its previous match, and ! 1887: if the match it returns ! 1888: is null and abuts the previous match, rejects the match and advances ! 1889: the initial position one character. ! 1890: .NH 2 ! 1891: Memory allocation ! 1892: .LP ! 1893: The C language has no memory allocation primitives, although a standard ! 1894: library routine, ! 1895: .I malloc , ! 1896: provides adequate service for simple programs. ! 1897: For specific uses, however, ! 1898: it can be better to write a custom allocator. ! 1899: The allocator (or rather, pair of allocators) described here ! 1900: work in both the terminal and host parts of ! 1901: .I sam . ! 1902: They are designed for efficient manipulation of strings, ! 1903: which are allocated and freed frequently and vary in length from essentially ! 1904: zero to 32 Kbytes (very large strings are written to disc). ! 1905: More important, strings may be large and change size often, ! 1906: so to minimize memory usage it is helpful to reclaim and to coalesce the ! 1907: unused portions of strings when they are truncated. ! 1908: .PP ! 1909: Objects to be allocated in ! 1910: .I sam ! 1911: are of two flavors: ! 1912: the first is C ! 1913: .CW structs , ! 1914: which are small and often addressed by pointer variables; ! 1915: the second is variable-sized arrays of characters ! 1916: or integers whose ! 1917: base pointer is always used to access them. ! 1918: The memory allocator in ! 1919: .I sam ! 1920: is therefore in two parts: ! 1921: first, a traditional first-fit allocator that provides fixed storage for ! 1922: .CW structs ; ! 1923: and second, a garbage-compacting allocator that reduces storage ! 1924: overhead for variable-sized objects, at the cost of some bookkeeping. ! 1925: The two types of objects are allocated from adjoining arenas, with ! 1926: the garbage-compacting allocator controlling the arena with higher addresses. ! 1927: Separating into two arenas simplifies compaction and prevents fragmentation due ! 1928: to immovable objects. ! 1929: The access rules for garbage-compactable objects ! 1930: (discussed in the next paragraph) allow them to be relocated, so when ! 1931: the first-fit arena needs space, it moves the garbage-compacted arena ! 1932: to higher addresses to make room. Storage is therefore created only ! 1933: at successively higher addresses, either when more garbage-compacted ! 1934: space is needed or when the first-fit arena pushes up the other arena. ! 1935: .PP ! 1936: Objects that may be compacted declare to the ! 1937: allocator a cell that is guaranteed to be the sole repository of the ! 1938: address of the object whenever a compaction can occur. ! 1939: The compactor can then update the address when the object is moved. ! 1940: For example, the implementation of type ! 1941: .CW List ! 1942: (really a variable-length array) ! 1943: is: ! 1944: .P1 2n ! 1945: typedef struct List{ ! 1946: int nused; ! 1947: long *ptr; ! 1948: }List; ! 1949: .P2 ! 1950: The ! 1951: .CW ptr ! 1952: cell must always be used directly, and never copied. When a ! 1953: .CW List ! 1954: is to be created the ! 1955: .CW List ! 1956: structure is allocated in the ordinary first-fit arena ! 1957: and its ! 1958: .CW ptr ! 1959: is allocated in the garbage-compacted arena. ! 1960: A similar data type for strings, called ! 1961: .CW String , ! 1962: stores variable-length character arrays of up to 32767 elements. ! 1963: .PP ! 1964: A related matter of programming style: ! 1965: .I sam ! 1966: frequently passes structures by value, which ! 1967: simplifies the code. ! 1968: Traditionally, C programs have ! 1969: passed structures by reference, but implicit allocation on ! 1970: the stack is easier to use. ! 1971: Structure passing is a relatively new feature of C ! 1972: (it is not in the ! 1973: standard reference manual for C|reference(cbook)), ! 1974: and is poorly supported in most ! 1975: commercial C compilers. ! 1976: It's convenient and expressive, though, ! 1977: and simplifies memory management by ! 1978: avoiding the allocator altogether ! 1979: and eliminating pointer aliases. ! 1980: ..... ! 1981: .NH 2 ! 1982: Data structures for manipulating files ! 1983: .LP ! 1984: Experience with ! 1985: .I jim ! 1986: showed that the requirements ! 1987: of the file data structure were few, but strict. ! 1988: First, files need to be read and written quickly; ! 1989: adding a fresh file must be painless. ! 1990: Second, the implementation must place no arbitrary upper limit on ! 1991: the number or sizes of files. (It should be practical to edit many files, ! 1992: and files up to megabytes in length should be handled gracefully.) ! 1993: This implies that files be stored on disc, not in main memory. ! 1994: (Aficionados of virtual memory may argue otherwise, but the ! 1995: implementation of virtual ! 1996: memory in our system is not something to depend on ! 1997: for good performance.) ! 1998: Third, changes to files need be made by only two primitives: ! 1999: deletion and insertion. ! 2000: These are inverses of each other, ! 2001: which simplifies the implementation of the undo operation. ! 2002: Finally, ! 2003: it must be easy and efficient to access the file, either ! 2004: forwards or backwards, a byte at a time. ! 2005: .PP ! 2006: The ! 2007: .CW File ! 2008: data type is constructed from three simpler data structures that hold arrays ! 2009: of characters. ! 2010: Each of these types has an insertion and deletion operator, and the ! 2011: insertion and deletion operators of the ! 2012: .CW File ! 2013: type itself are constructed from them. ! 2014: .PP ! 2015: The simplest type is the ! 2016: .CW String , ! 2017: which is used to hold strings in main memory. ! 2018: The code that manages ! 2019: .CW Strings ! 2020: guarantees that they will never be longer ! 2021: than some moderate size, and in practice they are rarely larger than 8 Kbytes. ! 2022: .CW Strings ! 2023: have two purposes: they hold short strings like file names with little overhead, ! 2024: and because they are deliberately small, they are efficient to modify. ! 2025: They are therefore used as the data structure for in-memory caches. ! 2026: .PP ! 2027: The disc copy of the file is managed by a data structure called a ! 2028: .CW Disc , ! 2029: which corresponds to a temporary file. A ! 2030: .CW Disc ! 2031: has no storage in main memory other than bookkeeping information; ! 2032: the actual data being held is all on the disc. ! 2033: To reduce the number of open files needed, ! 2034: .I sam ! 2035: opens a dozen temporary ! 2036: .UX ! 2037: files and multiplexes the ! 2038: .CW Discs ! 2039: upon them. ! 2040: This permits many files to ! 2041: be edited; the entire ! 2042: .I sam ! 2043: source (48 files) may be edited comfortably with a single ! 2044: instance of ! 2045: .I sam . ! 2046: Allocating one temporary file per ! 2047: .CW Disc ! 2048: would strain the operating system's limit on the number of open files. ! 2049: Also, spreading the traffic among temporary files keeps the files shorter, ! 2050: and shorter files are more efficiently implemented by the ! 2051: .UX ! 2052: I/O subsystem. ! 2053: .PP ! 2054: A ! 2055: .CW Disc ! 2056: is an array of fixed-length blocks, each of which contains ! 2057: between 1 and 4096 characters of active data. ! 2058: (The block size of our ! 2059: .UX ! 2060: file system is 4096 bytes.) ! 2061: The block addresses within the temporary file and the length of each ! 2062: block are stored in a ! 2063: .CW List . ! 2064: When changes are made the live part of blocks may change size. ! 2065: Blocks are created and coalesced when necessary to try to keep the sizes ! 2066: between 2048 and 4096 bytes. ! 2067: An actively changing part of the ! 2068: .CW Disc ! 2069: therefore typically has about a kilobyte of slop that can be ! 2070: inserted or deleted ! 2071: without changing more than one block or affecting the block order. ! 2072: When an insertion would overflow a block, the block is split, a new one ! 2073: is allocated to receive the overflow, and the memory-resident list of blocks ! 2074: is rearranged to reflect the insertion of the new block. ! 2075: .PP ! 2076: Obviously, going to the disc for every modification to the file is ! 2077: prohibitively expensive. ! 2078: The data type ! 2079: .CW Buffer ! 2080: consists of a ! 2081: .CW Disc ! 2082: to hold the data and a ! 2083: .CW String ! 2084: that acts as a cache. ! 2085: This is the first of a series of caches throughout the data structures in ! 2086: .I sam. ! 2087: The caches not only improve performance, they provide a way to organize ! 2088: the flow of data, particularly in the communication between the host ! 2089: and terminal. ! 2090: This idea is developed below, in the section on communications. ! 2091: .PP ! 2092: To reduce disc traffic, changes to a ! 2093: .CW Buffer ! 2094: are mediated by a variable-length string, in memory, that acts as a cache. ! 2095: When an insertion or deletion is made to a ! 2096: .CW Buffer , ! 2097: if the change can be accommodated by the cache, it is done there. ! 2098: If the cache becomes bigger than a block because of an insertion, ! 2099: some of it is written to the ! 2100: .CW Disc ! 2101: and deleted from the cache. ! 2102: If the change does not intersect the cache, the cache is flushed. ! 2103: The cache is only loaded at the new position if the change is smaller than a block; ! 2104: otherwise, it is sent directly to the ! 2105: .CW Disc . ! 2106: This is because ! 2107: large changes are typically sequential, ! 2108: whereupon the next change is unlikely to overlap the current one. ! 2109: .PP ! 2110: A ! 2111: .CW File ! 2112: comprises a ! 2113: .CW String ! 2114: to hold the file name and some ancillary data such as dot and the modified bit. ! 2115: The most important components, though, are a pair of ! 2116: .CW Buffers , ! 2117: one called the transcript and the other the contents. ! 2118: Their use is described in the next section. ! 2119: .PP ! 2120: The overall structure is shown in Figure 5. ! 2121: Although it may seem that the data is touched many times on its ! 2122: way from the ! 2123: .CW Disc , ! 2124: it is read (by one ! 2125: .UX ! 2126: system call) directly into the cache of the ! 2127: associated ! 2128: .CW Buffer ; ! 2129: no extra copy is done. ! 2130: Similarly, when flushing the cache, the text is written ! 2131: directly from the cache to disc. ! 2132: Most operations act directly on the text in the cache. ! 2133: A principle applied throughout ! 2134: .I sam ! 2135: is that the fewer times the data is copied, the faster the program will run ! 2136: (see also the paper by Waite|reference(waite lexical analysis)). ! 2137: .1C ! 2138: .KF ! 2139: .PS ! 2140: copy "fig5.pic" ! 2141: .PE ! 2142: .Cs ! 2143: Figure 5. File data structures. ! 2144: The temporary files are stored in the standard repository for such files ! 2145: on the host system. ! 2146: .Ce ! 2147: .KE ! 2148: .2C ! 2149: .PP ! 2150: The contents of a ! 2151: .CW File ! 2152: are accessed by a routine that ! 2153: copies to a buffer a substring of a file starting at a specified offset. ! 2154: To read a byte at a time, a ! 2155: .CW File "" per- ! 2156: array is loaded starting from a specified initial position, ! 2157: and bytes may then be read from the array. ! 2158: The implementation is done by a macro similar to the C standard I/O ! 2159: .I getc ! 2160: macro.|reference(cbook) ! 2161: Because the reading may be done at any address, a minor change to the ! 2162: macro allows the file to be read backwards. ! 2163: This array is read-only; there is no ! 2164: .I putc . ! 2165: .NH 2 ! 2166: Doing and undoing ! 2167: .LP ! 2168: .I Sam ! 2169: has an unusual method for managing changes to files. ! 2170: The command language makes it easy to specify multiple variable-length changes ! 2171: to a file millions of bytes long, and such changes ! 2172: must be made efficiently if the editor is to be practical. ! 2173: The usual techniques for inserting and deleting strings ! 2174: are inadequate under these conditions. ! 2175: The ! 2176: .CW Buffer ! 2177: and ! 2178: .CW Disc ! 2179: data structures are designed for efficient random access to long strings, ! 2180: but care must be taken to avoid super-linear behavior when making ! 2181: many changes simultaneously. ! 2182: .PP ! 2183: .I Sam ! 2184: uses a two-pass algorithm for making changes, and treats each file as a database ! 2185: against which transactions are registered. ! 2186: Changes are not made directly to the contents. ! 2187: Instead, when a command is started, a `mark' containing ! 2188: a sequence number is placed in the transcript ! 2189: .CW Buffer , ! 2190: and each change made to the file, either an insertion or deletion ! 2191: or a change to the file name, ! 2192: is appended to the end of the transcript. ! 2193: When the command is complete, the transcript is rewound to the ! 2194: mark and applied to the contents. ! 2195: .PP ! 2196: One reason for separating evaluation from ! 2197: application in this way is to simplify tracking the addresses of changes ! 2198: made in the middle of a long sequence. ! 2199: The two-pass algorithm also allows all changes to apply to the ! 2200: .I original ! 2201: data: no change can affect another change made in the same command. ! 2202: This is particularly important when evaluating an ! 2203: .CW x ! 2204: command because it prevents regular expression matches ! 2205: from stumbling over changes made earlier in the execution. ! 2206: Also, the two-pass ! 2207: algorithm is cleaner than the way other ! 2208: .UX ! 2209: editors allow changes to ! 2210: affect each other; ! 2211: for example, ! 2212: .I ed 's ! 2213: idioms to do things like delete every other line ! 2214: depend critically on the implementation. ! 2215: Instead, ! 2216: .I sam 's ! 2217: simple model, in which all changes in a command occur effectively ! 2218: simultaneously, is easy to explain and to understand. ! 2219: .PP ! 2220: The records in the transcript are of the form ``delete substring from ! 2221: locations ! 2222: 123 to 456'' and ``insert 11 characters `hello there' at location 789.'' ! 2223: (It is an error if the changes are not at monotonically greater ! 2224: positions through the file.) ! 2225: While the update is occurring, these numbers must be ! 2226: offset by earlier changes, but that is straightforward and ! 2227: local to the update routine; ! 2228: moreover, all the numbers have been computed ! 2229: before the first is examined. ! 2230: .PP ! 2231: Treating the file as a transaction system has another advantage: ! 2232: undo is trivial. ! 2233: All it takes is to invert the transcript after it has been ! 2234: implemented, converting insertions ! 2235: into deletions and vice versa, and saving them in a holding ! 2236: .CW Buffer . ! 2237: The `do' transcript can then be deleted from ! 2238: the transcript ! 2239: .CW Buffer ! 2240: and replaced by the `undo' transcript. ! 2241: If an undo is requested, the transcript is rewound and the undo transcript ! 2242: executed. ! 2243: Because the transcript ! 2244: .CW Buffer ! 2245: is not truncated after each command, it accumulates ! 2246: successive changes. ! 2247: A sequence of undo commands ! 2248: can therefore back up the file arbitrarily, ! 2249: which is more helpful than the more commonly implemented self-inverse form of undo. ! 2250: .I Sam "" ( ! 2251: provides no way to undo an undo, but if it were desired, ! 2252: it would be easy to provide by re-interpreting the `do' transcript.) ! 2253: Each mark in the transcript contains a sequence number and the offset into ! 2254: the transcript of the previous mark, to aid in unwinding the transcript. ! 2255: Marks also contain the value of dot and the modified bit so these can be ! 2256: restored easily. ! 2257: Undoing multiple files is easy; it merely demands undoing all files whose ! 2258: latest change has the same sequence number as the current file. ! 2259: .PP ! 2260: Another benefit of having a transcript is that errors encountered in the middle ! 2261: of a complicated command need not leave the files in an intermediate state. ! 2262: By rewinding the transcript to the mark beginning the command, ! 2263: the partial command can be trivially undone. ! 2264: .PP ! 2265: When the update algorithm was first implemented, it was unacceptably slow, ! 2266: so a cache was added to coalesce nearby changes, ! 2267: replacing multiple small changes by a single larger one. ! 2268: This reduced the number ! 2269: of insertions into the transaction ! 2270: .CW Buffer , ! 2271: and made a dramatic improvement in performance, ! 2272: but made it impossible ! 2273: to handle changes in non-monotonic order in the file; the caching method ! 2274: only works if changes don't overlap. ! 2275: Before the cache was added, the transaction could in principle be sorted ! 2276: if the changes were out of order, although ! 2277: this was never done. ! 2278: The current status is therefore acceptable performance with a minor ! 2279: restriction on global changes, which is sometimes, but rarely, an annoyance. ! 2280: .PP ! 2281: The update algorithm obviously paws the data more than simpler ! 2282: algorithms, but it is not prohibitively expensive; ! 2283: the caches help. ! 2284: (The principle of avoiding copying the data is still honored here, ! 2285: although not as piously: ! 2286: the data is moved from contents' cache to ! 2287: the transcript's all at once and through only one internal buffer.) ! 2288: Performance figures confirm the efficiency. ! 2289: To read from a dead start a hundred kilobyte file on a VAX-11/750 ! 2290: takes 1.4 seconds of user time, 2.5 seconds of system time, ! 2291: and 5 seconds of real time. ! 2292: Reading the same file in ! 2293: .I ed ! 2294: takes 6.0 seconds of user time, 1.7 seconds of system time, ! 2295: and 8 seconds of real time. ! 2296: .I Sam ! 2297: uses about half the CPU time. ! 2298: A more interesting example is the one stated above: ! 2299: inserting a character between every pair of characters in the file. ! 2300: The ! 2301: .I sam ! 2302: command is ! 2303: .P1 2n ! 2304: ,y/@/ a/x/ ! 2305: .P2 ! 2306: and takes 3 CPU seconds per kilobyte of input file, of which ! 2307: about a third is spent in the regular expression code. ! 2308: This translates to about 500 changes per second. ! 2309: .I Ed ! 2310: takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines), ! 2311: but cannot undo it. ! 2312: The same example in ! 2313: .I ex ,|reference(bsdmanual) ! 2314: a variant of ! 2315: .I ed ! 2316: done at the University of California at Berkeley, ! 2317: which allows one level of undoing, again takes 3 seconds. ! 2318: In summary, ! 2319: .I sam 's ! 2320: performance is comparable to that of other ! 2321: .UX ! 2322: editors, although it solves ! 2323: a harder problem. ! 2324: .NH 2 ! 2325: Communications ! 2326: .LP ! 2327: The discussion so far has described the implementation of the host part of ! 2328: .I sam ; ! 2329: the next few sections explain how a machine with mouse and bitmap display ! 2330: can be engaged to improve interaction. ! 2331: .I Sam ! 2332: is not the first editor to be written as two processes,|reference(fraser text editor) ! 2333: but its implementation ! 2334: has some unusual aspects. ! 2335: .PP ! 2336: There are several ways ! 2337: .I sam 's ! 2338: host and terminal parts may be connected. ! 2339: The first and simplest is to forgo the terminal part and use the host ! 2340: part's command language to edit text on an ordinary terminal. ! 2341: This mode is invoked by starting ! 2342: .I sam ! 2343: with the ! 2344: .CW -d ! 2345: option. ! 2346: With no options, ! 2347: .I sam ! 2348: runs separate host and terminal programs, ! 2349: communicating with a message protocol over the physical ! 2350: connection that joins them. ! 2351: Typically, the connection is an RS-232 link between a Blit ! 2352: (the prototypical display for ! 2353: .I sam ) ! 2354: and a host running ! 2355: the Research Edition of the ! 2356: .UX ! 2357: operating system.|reference(latest volume1) ! 2358: (This is the version of the system used in the Computing Sciences Research ! 2359: Center at AT&T Bell Laboratories, where I work. Its relevant ! 2360: aspects are discussed in the Blit paper.|reference(bstj blit)) ! 2361: The implementation of ! 2362: .I sam ! 2363: for the Sun computer runs both processes on the same machine and ! 2364: connects them by a pipe. ! 2365: .PP ! 2366: The low bandwidth of an RS-232 link ! 2367: necessitated the split between ! 2368: the two programs. ! 2369: The division is a mixed blessing: ! 2370: a program in two parts is much harder to write and to debug ! 2371: than a self-contained one, ! 2372: but the split makes several unusual configurations possible. ! 2373: The terminal may be physically separated from the host, allowing the conveniences ! 2374: of a mouse and bitmap display to be taken home while leaving the files at work. ! 2375: It is also possible to run the host part on a remote machine: ! 2376: .P1 2n ! 2377: sam -r host ! 2378: .P2 ! 2379: connects to the terminal in the usual way, and then makes a call ! 2380: across the network to establish the host part of ! 2381: .I sam ! 2382: on the named machine. ! 2383: Finally, it cross-connects the I/O to join the two parts. ! 2384: This allows ! 2385: .I sam ! 2386: to be run on machines that do not support bitmap displays; ! 2387: for example, ! 2388: .I sam ! 2389: is the editor of choice on our Cray X-MP/24. ! 2390: .I Sam ! 2391: .CW -r ! 2392: involves ! 2393: .I three ! 2394: machines: the remote host, the terminal, and the local host. ! 2395: The local host's job is simple but vital: it passes the data ! 2396: between the remote host and terminal. ! 2397: .PP ! 2398: The host and terminal exchange messages asynchronously ! 2399: (rather than, say, as remote procedure calls) but there is no ! 2400: error detection or correction ! 2401: because, whatever the configuration, the connection is reliable. ! 2402: Because the terminal handles mundane interaction tasks such as ! 2403: popping up menus and interpreting the responses, the messages are about ! 2404: data, not actions. ! 2405: For example, the host knows nothing about what is displayed on the screen, ! 2406: and when the user types a character, the message sent to the host says ! 2407: ``insert a one-byte string at location 123 in file 7,'' not ``a character ! 2408: was typed at the current position in the current file.'' ! 2409: In other words, the messages look very much like the transaction records ! 2410: in the transcripts. ! 2411: .PP ! 2412: Either the host or terminal part of ! 2413: .I sam ! 2414: may initiate a change to a file. ! 2415: The command language operates on the host, while typing and some ! 2416: mouse operations are executed directly in the terminal to optimize response. ! 2417: Changes initiated by the host program must be transmitted to the terminal, ! 2418: and ! 2419: vice versa. ! 2420: (A token is exchanged to determine which end is in control, ! 2421: which means that characters typed while a time-consuming command runs ! 2422: must be buffered and do not appear until the command is complete.) ! 2423: To maintain consistent information, ! 2424: the host and terminal track changes through a per-file ! 2425: data structure that records what portions of the file ! 2426: the terminal has received. ! 2427: The data structure, called a ! 2428: .CW Rasp ! 2429: (a weak pun: it's a file with holes) ! 2430: is held and updated by both the host and terminal. ! 2431: A ! 2432: .CW Rasp ! 2433: is a list of ! 2434: .CW Strings ! 2435: holding those parts of the file known to the terminal, ! 2436: separated by counts of the number of bytes in the interstices. ! 2437: Of course, the host doesn't keep a separate copy of the data (it only needs ! 2438: the lengths of the various pieces), ! 2439: but the structure is the same on both ends. ! 2440: .PP ! 2441: The ! 2442: .CW Rasp ! 2443: in the terminal doubles as a cache. ! 2444: Since the terminal keeps the text for portions of the file it has displayed, ! 2445: it need not request data from the host when revisiting old parts of the file ! 2446: or redrawing obscured windows, which speeds things up considerably ! 2447: over low-speed links. ! 2448: .PP ! 2449: It's trivial for the terminal to maintain its ! 2450: .CW Rasp , ! 2451: because all changes made on the terminal apply to parts of the file ! 2452: already loaded there. ! 2453: Changes made by the host are compared against the ! 2454: .CW Rasp ! 2455: during the update sequence after each command. ! 2456: Small changes to pieces of the file loaded in the terminal ! 2457: are sent in their entirety. ! 2458: Larger changes, and changes that fall entirely in the holes, ! 2459: are transmitted as messages without literal data: ! 2460: only the lengths of the deleted and inserted strings are transmitted. ! 2461: When a command is completed, the terminal examines its visible ! 2462: windows to see if any holes in their ! 2463: .CW Rasps ! 2464: intersect the visible portion of the file. ! 2465: It then requests the missing data from the host, ! 2466: along with up to 512 bytes of surrounding data, to minimize ! 2467: the number of messages when visiting a new portion of the file. ! 2468: This technique provides a kind of two-level lazy evaluation for the terminal. ! 2469: The first level sends a minimum of information about ! 2470: parts of the file not being edited interactively; ! 2471: the second level waits until a change is displayed before ! 2472: transmitting the new data. ! 2473: Of course, ! 2474: performance is also helped by having the terminal respond immediately to typing ! 2475: and simple mouse requests. ! 2476: Except for small changes to active pieces of the file, which are ! 2477: transmitted to the terminal without negotiation, ! 2478: the terminal is wholly responsible for deciding what is displayed; ! 2479: the host uses the ! 2480: .CW Rasp ! 2481: only to tell the terminal what might be relevant. ! 2482: .PP ! 2483: When a change is initiated by the host, ! 2484: the messages to the terminal describing the change ! 2485: are generated by the routine that applies the transcript of the changes ! 2486: to the contents of the ! 2487: .CW File . ! 2488: Since changes are undone by the same update routine, ! 2489: undoing requires ! 2490: no extra code in the communications; ! 2491: the usual messages describing changes to the file are sufficient ! 2492: to back up the screen image. ! 2493: .PP ! 2494: The ! 2495: .CW Rasp ! 2496: is a particularly good example of the way caches are used in ! 2497: .I sam . ! 2498: First, it facilitates access to the active portion of the text by placing ! 2499: the busy text in main memory. ! 2500: In so doing, it provides efficient access ! 2501: to a large data structure that does not fit in memory. ! 2502: Since the form of data is to be imposed by the user, not by the program, ! 2503: and because characters will frequently be scanned sequentially, ! 2504: files are stored as flat objects. ! 2505: Caches help keep performance good and linear when working with such ! 2506: data. ! 2507: .PP ! 2508: Second, the ! 2509: .CW Rasp ! 2510: and several of the other caches have some ! 2511: .I read-ahead; ! 2512: that is, the cache is loaded with more information than is needed for ! 2513: the job immediately at hand. ! 2514: When manipulating linear structures, the accesses are usually sequential, ! 2515: and read-ahead can significantly reduce the average time to access the ! 2516: next element of the object. ! 2517: Sequential access is a common mode for people as well as programs; ! 2518: consider scrolling through a document while looking for something. ! 2519: .PP ! 2520: Finally, like any good data structure, ! 2521: the cache guides the algorithm, or at least the implementation. ! 2522: The ! 2523: .CW Rasp ! 2524: was actually invented to control the communications between the host and ! 2525: terminal parts, but I realized very early that it was also a form of ! 2526: cache. Other caches were more explicitly intended to serve a double ! 2527: purpose: for example, the caches in ! 2528: .CW Files ! 2529: that coalesce updates not only reduce traffic to the ! 2530: transcript and contents ! 2531: .CW Buffers , ! 2532: they also clump screen updates so that complicated changes to the ! 2533: screen are achieved in ! 2534: just a few messages to the terminal. ! 2535: This saved me considerable work: I did not need to write special ! 2536: code to optimize the message traffic to the ! 2537: terminal. ! 2538: Caches pay off in surprising ways. ! 2539: Also, they tend to be independent, so their performance improvements ! 2540: are multiplicative. ! 2541: .NH 2 ! 2542: Data structures in the terminal ! 2543: .LP ! 2544: The terminal's job is to display and to maintain a consistent image of ! 2545: pieces of the files being edited. ! 2546: Because the text is always in memory, the data structures are ! 2547: considerably simpler than those in the host part. ! 2548: .PP ! 2549: .I Sam ! 2550: typically has far more windows than does ! 2551: .I mux , ! 2552: the window system within which its Blit implementation runs. ! 2553: .I Mux ! 2554: has a fairly small number of asynchronously updated windows; ! 2555: .I sam ! 2556: needs a large number of synchronously updated windows that are ! 2557: usually static and often fully obscured. ! 2558: The different tradeoffs guided ! 2559: .I sam ! 2560: away from the memory-intensive implementation of windows, called ! 2561: .CW Layers ,|reference(pike overlapping) ! 2562: used in ! 2563: .I mux. ! 2564: Rather than depending on a complete bitmap image of the display for each window, ! 2565: .I sam ! 2566: regenerates the image from its in-memory text ! 2567: (stored in the ! 2568: .CW Rasp ) ! 2569: when necessary, although it will use such an image if it is available. ! 2570: Like ! 2571: .CW Layers , ! 2572: though, ! 2573: .I sam ! 2574: uses the screen bitmap as active storage in which to update the image using ! 2575: .I bitblt .|reference(guibas stolfi)|reference(pike locanthi blit bitmap) ! 2576: The resulting organization, pictured in Figure 6, ! 2577: has a global array of windows, called ! 2578: .CW Flayers , ! 2579: each of which holds an image of a piece of text held in a data structure ! 2580: called a ! 2581: .CW Frame , ! 2582: which in turn represents ! 2583: a rectangular window full of text displayed in some ! 2584: .CW Bitmap . ! 2585: Each ! 2586: .CW Flayer ! 2587: appears in a global list that orders them all front-to-back ! 2588: on the display, and simultaneously as an element of a per-file array ! 2589: that holds all the open windows for that file. ! 2590: The complement in the terminal of the ! 2591: .CW File ! 2592: on the host is called a ! 2593: .CW Text ; ! 2594: each connects its ! 2595: .CW Flayers ! 2596: to the associated ! 2597: .CW Rasp . ! 2598: .1C ! 2599: .KF ! 2600: .PS ! 2601: copy "fig6.pic" ! 2602: .PE ! 2603: .Cs ! 2604: Figure 6. Data structures in the terminal. ! 2605: .CW Flayers ! 2606: are also linked together into a front-to-back list. ! 2607: .CW Boxes ! 2608: are discussed in the next section. ! 2609: .Ce ! 2610: .KE ! 2611: .2C ! 2612: .PP ! 2613: The ! 2614: .CW Bitmap ! 2615: for a ! 2616: .CW Frame ! 2617: contains the image of the text. ! 2618: For a fully visible window, the ! 2619: .CW Bitmap ! 2620: will be the screen (or at least the ! 2621: .CW Layer ! 2622: in which ! 2623: .I sam ! 2624: is being run), ! 2625: while for partially obscured windows the ! 2626: .CW Bitmap ! 2627: will be off-screen. ! 2628: If the window is fully obscured, the ! 2629: .CW Bitmap ! 2630: will be null. ! 2631: .PP ! 2632: The ! 2633: .CW Bitmap ! 2634: is a kind of cache. ! 2635: When making changes to the display, most of the original image will ! 2636: look the same in the final image, and the update algorithms exploit this. ! 2637: The ! 2638: .CW Frame ! 2639: software updates the image in the ! 2640: .CW Bitmap ! 2641: incrementally; the ! 2642: .CW Bitmap ! 2643: is not just an image, it is a data structure.|reference(guibas stolfi)|reference(pike locanthi blit bitmap) ! 2644: The job of the software that updates the display is therefore ! 2645: to use as much as possible of the existing image (converting the ! 2646: text from ASCII characters to pixels is expensive) in a sort of two-dimensional ! 2647: string insertion algorithm. ! 2648: The details of this process are described in the next section. ! 2649: .PP ! 2650: The ! 2651: .CW Frame ! 2652: software has no code to support overlapping windows; ! 2653: its job is to keep a single ! 2654: .CW Bitmap ! 2655: up to date. ! 2656: It falls to the ! 2657: .CW Flayer ! 2658: software to multiplex the various ! 2659: .CW Bitmaps ! 2660: onto the screen. ! 2661: The problem of maintaining overlapping ! 2662: .CW Flayers ! 2663: is easier than for ! 2664: .CW Layers |reference(pike overlapping) ! 2665: because changes are made synchronously and because the contents of the window ! 2666: can be reconstructed from the data stored in the ! 2667: .CW Frame ; ! 2668: the ! 2669: .CW Layers ! 2670: software ! 2671: makes no such assumptions. ! 2672: In ! 2673: .I sam , ! 2674: the window being changed is almost always fully visible, because the current ! 2675: window is always fully visible, by construction. ! 2676: However, when multi-file changes are being made, or when ! 2677: more than one window is open on a file, ! 2678: it may be necessary to update partially obscured windows. ! 2679: .PP ! 2680: There are three cases: the window is ! 2681: fully visible, invisible (fully obscured), or partially visible. ! 2682: If fully visible, the ! 2683: .CW Bitmap ! 2684: is part of the screen, so when the ! 2685: .CW Flayer ! 2686: update routine calls the ! 2687: .CW Frame ! 2688: update routine, the screen will be updated directly. ! 2689: If the window is invisible, ! 2690: there is no associated ! 2691: .CW Bitmap , ! 2692: and all that is necessary is to update the ! 2693: .CW Frame ! 2694: data structure, not the image. ! 2695: If the window is partially visible, the ! 2696: .CW Frame ! 2697: routine is called to update the image in the off-screen ! 2698: .CW Bitmap , ! 2699: which may require regenerating it from the text of the window. ! 2700: The ! 2701: .CW Flayer ! 2702: code then clips this ! 2703: .CW Bitmap ! 2704: against the ! 2705: .CW Bitmaps ! 2706: of all ! 2707: .CW Frames ! 2708: in front of the ! 2709: .CW Frame ! 2710: being modified, and the remainder is copied to the display. ! 2711: .PP ! 2712: This is much faster than recreating the image off-screen ! 2713: for every change, or clipping all the changes made to the image ! 2714: during its update. ! 2715: Unfortunately, these caches can also consume prohibitive amounts of ! 2716: memory, so they are freed fairly liberally \(em after every change to the ! 2717: front-to-back order of the ! 2718: .CW Flayers . ! 2719: The result is that ! 2720: the off-screen ! 2721: .CW Bitmaps ! 2722: exist only while multi-window changes are occurring, ! 2723: which is the only time the performance improvement they provide is needed. ! 2724: Also, the user interface causes fully-obscured windows to be the ! 2725: easiest to make \(em ! 2726: creating a canonically sized and placed window requires only a button click ! 2727: \(em which reduces the need for caching still further. ! 2728: .NH 2 ! 2729: Screen update ! 2730: .LP ! 2731: Only two low-level primitives are needed for incremental update: ! 2732: .I bitblt , ! 2733: which copies rectangles of pixels, and ! 2734: .I string ! 2735: (which in turn calls ! 2736: .I bitblt ), ! 2737: which draws a null-terminated character string in a ! 2738: .CW Bitmap . ! 2739: A ! 2740: .CW Frame ! 2741: contains a list of ! 2742: .CW Boxes , ! 2743: each of which defines a horizontal strip of text in the window ! 2744: (see Figure 7). ! 2745: A ! 2746: .CW Box ! 2747: has a character string ! 2748: .CW str , ! 2749: and a ! 2750: .CW Rectangle ! 2751: .CW rect ! 2752: that defines the location of the strip in the window. ! 2753: (The text in ! 2754: .CW str ! 2755: is stored in the ! 2756: .CW Box ! 2757: separately from the ! 2758: .CW Rasp ! 2759: associated with the window's file, so ! 2760: .CW Boxes ! 2761: are self-contained.) ! 2762: The invariant is that ! 2763: the image of the ! 2764: .CW Box ! 2765: can be reproduced by calling ! 2766: .CW string ! 2767: with argument ! 2768: .CW str ! 2769: to draw the string in ! 2770: .CW rect , ! 2771: and the resulting picture fits perfectly within ! 2772: .CW rect . ! 2773: In other words, the ! 2774: .CW Boxes ! 2775: define the tiling of the window. ! 2776: The tiling may be complicated by long lines of text, which ! 2777: are folded onto the next line. ! 2778: Some editors use horizontal scrolling to avoid this complication, ! 2779: but to be comfortable this technique requires that lines not be ! 2780: .I too ! 2781: long; ! 2782: .I sam ! 2783: has no such restriction. ! 2784: Also, and perhaps more importantly, ! 2785: .UX ! 2786: programs and terminals traditionally fold ! 2787: long lines to make their contents fully visible. ! 2788: .PP ! 2789: Two special kinds of ! 2790: .CW Boxes ! 2791: contain a single ! 2792: character: either a newline or a tab. ! 2793: Newlines and tabs are white space. ! 2794: A newline ! 2795: .CW Box ! 2796: always extends to the right edge of the window, ! 2797: forcing the following ! 2798: .CW Box ! 2799: to the next line. ! 2800: The width of a tab depends on where it is located: ! 2801: it forces the next ! 2802: .CW Box ! 2803: to begin at a tab location. ! 2804: Tabs also ! 2805: have a minimum width equivalent to a blank (blanks are ! 2806: drawn by ! 2807: .CW string ! 2808: and are not treated specially); newlines have a minimum width of zero. ! 2809: .1C ! 2810: .KF ! 2811: .PS ! 2812: copy "fig7.pic" ! 2813: .PE ! 2814: .sp .5 ! 2815: .Cs ! 2816: Figure 7. A line of text showing its ! 2817: .CW Boxes . ! 2818: The first two blank ! 2819: .CW Boxes ! 2820: contain tabs; the last contains a newline. ! 2821: Spaces are handled as ordinary characters. ! 2822: .Ce ! 2823: .KE ! 2824: .2C ! 2825: .PP ! 2826: The update algorithms always use the ! 2827: .CW Bitmap ! 2828: image of the text (either the display or cache ! 2829: .CW Bitmap ); ! 2830: they never examine the characters within a ! 2831: .CW Box ! 2832: except when the ! 2833: .CW Box ! 2834: needs to be split in two. ! 2835: Before a change, the window consists of a tiling of ! 2836: .CW Boxes ; ! 2837: after the change the window is tiled differently. ! 2838: The update algorithms rearrange the tiles in place, without ! 2839: backup storage. ! 2840: The algorithms are not strictly optimal \(em for example, they can ! 2841: clear a pixel that is later going to be written upon \(em ! 2842: but they never move a tile that doesn't need to be moved, ! 2843: and they move each tile at most once. ! 2844: .CW Frinsert ! 2845: on a Blit can absorb over a thousand characters a second if the strings ! 2846: being inserted are a few tens of characters long. ! 2847: .PP ! 2848: Consider ! 2849: .CW frdelete . ! 2850: Its job is to delete a substring from a ! 2851: .CW Frame ! 2852: and restore the image of the ! 2853: .CW Frame . ! 2854: The image of a substring has a peculiar shape (see Figure 2) comprising ! 2855: possibly a partial line, ! 2856: zero or more full lines, ! 2857: and possibly a final partial line. ! 2858: For reference, call this the ! 2859: .I ! 2860: Z-shape. ! 2861: .R ! 2862: .CW Frdelete ! 2863: begins by splitting, if necessary, the ! 2864: .CW Boxes ! 2865: containing the ends of ! 2866: the substring so the substring begins and ends on ! 2867: .CW Box ! 2868: boundaries. ! 2869: Because the substring is being deleted, its image is not needed, ! 2870: so the Z-shape is then cleared. ! 2871: Then, tiles (that is, the images of ! 2872: .CW Boxes ) ! 2873: are copied, using ! 2874: .CW bitblt , ! 2875: from immediately after the Z-shape to ! 2876: the beginning of the Z-shape, ! 2877: resulting in a new Z-shape. ! 2878: .CW Boxes "" ( ! 2879: whose contents would span two lines in the new position must first be split.) ! 2880: .PP ! 2881: Copying the remainder of the ! 2882: .CW Frame ! 2883: tile by tile ! 2884: this way will clearly accomplish the deletion but eventually, ! 2885: typically when the copying algorithm encounters a tab or newline, ! 2886: the old and new ! 2887: .CW x ! 2888: coordinates of the tile ! 2889: to be copied are the same. ! 2890: This correspondence implies ! 2891: that the Z-shape has its beginning and ending edges aligned ! 2892: vertically, and a sequence of at most two ! 2893: .I bitblt s ! 2894: can be used to copy the remaining tiles. ! 2895: The last step is to clear out the resulting empty space at the bottom ! 2896: of the window; ! 2897: the number of lines to be cleared is the number of complete lines in the ! 2898: Z-shape closed by the final ! 2899: .I bitblt s. ! 2900: The final step is to merge horizontally adjacent ! 2901: .CW Boxes ! 2902: of plain text. ! 2903: The complete source to ! 2904: .CW frdelete ! 2905: is less than 100 lines of C. ! 2906: .PP ! 2907: .CW frinsert ! 2908: is more complicated because it must do four passes: ! 2909: one to construct the ! 2910: .CW Box ! 2911: list for the inserted string, ! 2912: one to reconnoitre, ! 2913: one to copy (in opposite order to ! 2914: .CW frdelete ) ! 2915: the ! 2916: .CW Boxes ! 2917: to make the hole for the new text, ! 2918: and finally one to copy the new text into place. ! 2919: Overall, though, ! 2920: .CW frinsert ! 2921: has a similar flavor to ! 2922: .CW frdelete , ! 2923: and needn't be described further. ! 2924: .CW Frinsert ! 2925: and its subsidiary routines comprise 211 lines of C. ! 2926: .PP ! 2927: The terminal source code is 3024 lines of C, ! 2928: and the host source is 5797 lines. ! 2929: .NH ! 2930: Discussion ! 2931: .NH 2 ! 2932: History ! 2933: .LP ! 2934: The immediate ancestor of ! 2935: .I sam ! 2936: was the original text editor for the Blit, called ! 2937: .I jim . ! 2938: .I Sam ! 2939: inherited ! 2940: .I jim 's ! 2941: two-process structure and mouse language almost unchanged, but ! 2942: .I jim ! 2943: suffered from several drawbacks that were addressed in the design of ! 2944: .I sam . ! 2945: The most important of these was the lack of a command language. ! 2946: Although ! 2947: .I jim ! 2948: was easy to use for simple editing, it provided no direct help with ! 2949: large or repetitive editing tasks. Instead, it provided a command to pass ! 2950: selected text through a shell pipeline, ! 2951: but this was no more satisfactory than could be expected of a stopgap measure. ! 2952: .PP ! 2953: .I Jim ! 2954: was written primarily as a vehicle for experimenting with a mouse-based ! 2955: interface to text, and the experiment was successful. ! 2956: .I Jim ! 2957: had some spin-offs: ! 2958: .I mux , ! 2959: the second window system for the Blit, is essentially a multiplexed ! 2960: version of the terminal part of ! 2961: .I jim ; ! 2962: and the debugger ! 2963: .I pi 's ! 2964: user interface|reference(cargill feel pi) was closely modeled on ! 2965: .I jim 's. ! 2966: But after a couple of years, ! 2967: .I jim ! 2968: had become difficult to maintain and limiting to use, ! 2969: and its replacement was overdue. ! 2970: .PP ! 2971: I began the design of ! 2972: .I sam ! 2973: by asking ! 2974: .I jim ! 2975: customers what they wanted. ! 2976: This was probably a mistake; the answers were essentially a list of features ! 2977: to be found in other editors, which did not provide any of the ! 2978: guiding principles I was seeking. ! 2979: For instance, one common request was for a ``global substitute,'' ! 2980: but no one suggested how to provide it within a cut-and-paste editor. ! 2981: I was looking for a scheme that would ! 2982: support such specialized features comfortably in the context of some ! 2983: general command language. ! 2984: Ideas were not forthcoming, though, particularly given my insistence ! 2985: on removing all limits on file sizes, line lengths and so on. ! 2986: Even worse, I recognized that, since the mouse could easily ! 2987: indicate a region of the screen that was not an integral number of lines, ! 2988: the command language would best forget about newlines altogether, ! 2989: and that meant the command language had to treat the file as a single ! 2990: string, not an array of lines. ! 2991: .PP ! 2992: Eventually, I decided that thinking was not getting me very far and it was ! 2993: time to try building. ! 2994: I knew that the terminal part could be built easily \(em ! 2995: that part of ! 2996: .I jim ! 2997: behaved acceptably well \(em and that most of the hard work was going ! 2998: to be in the host part: the file interface, command interpreter and so on. ! 2999: Moreover, I had some ideas about how the architecture of ! 3000: .I jim ! 3001: could be improved without destroying its basic structure, which I liked ! 3002: in principle but which hadn't worked out as well as I had hoped. ! 3003: So I began by designing the file data structure, ! 3004: starting with the way ! 3005: .I jim ! 3006: worked \(em comparable to a single structure merging ! 3007: .CW Disc ! 3008: and ! 3009: .CW Buffer , ! 3010: which I split to make the cache more general ! 3011: \(em and thinking about how global substitute could be implemented. ! 3012: The answer was clearly that it had to be done in two passes, ! 3013: and the transcript-oriented implementation fell out naturally. ! 3014: .PP ! 3015: .I Sam ! 3016: was written bottom-up, ! 3017: starting from the data structures and algorithms for manipulating text, ! 3018: through the command language and up to the code for maintaining ! 3019: the display. ! 3020: In retrospect, it turned out well, but this implementation method is ! 3021: not recommended in general. ! 3022: There were several times when I had a large body of interesting code ! 3023: assembled and no clue how to proceed with it. ! 3024: The command language, in particular, took almost a year to figure out, ! 3025: but can be implemented (given what was there at the beginning of that year) ! 3026: in a day or two. Similarly, inventing the ! 3027: .CW Rasp ! 3028: data structure delayed the ! 3029: connection of the host and terminal pieces by another few months. ! 3030: .I Sam ! 3031: took about two years to write, although only about four months were ! 3032: spent actually working on it. ! 3033: .PP ! 3034: Part of the design process was unusual: ! 3035: the subset of the protocol that maintains the ! 3036: .CW Rasp ! 3037: was simulated, debugged ! 3038: and verified by an automatic protocol analyzer,|reference(holzmann atttj) and was bug-free ! 3039: from the start. ! 3040: The rest of the protocol, concerned mostly ! 3041: with keeping menus up to date, ! 3042: was unfortunately too unwieldy for such analysis, ! 3043: and was debugged by more traditional methods, primarily ! 3044: by logging in a file all messages in and out of the host. ! 3045: .NH 2 ! 3046: Reflections ! 3047: .LP ! 3048: .I Sam ! 3049: is essentially the only interactive editor used by the sixty or so members of ! 3050: the computing science research center in which I work. ! 3051: The same could not be said of ! 3052: .I jim ; ! 3053: the lack of a command language kept some people from adopting it. ! 3054: The union of a user interface as comfortable as ! 3055: .I jim 's ! 3056: with a command language as powerful as ! 3057: .CW ed 's\(dg ! 3058: .FS ! 3059: .vs 9 ! 3060: \(dg ! 3061: The people who criticize ! 3062: .I ed ! 3063: as an interactive program often forget that it and its close relative ! 3064: .I sed |reference(kernighan pike) ! 3065: still thrive as programmable editors. The strength of these programs is ! 3066: independent of their convenience for interactive editing. ! 3067: .br ! 3068: .vs ! 3069: .FE ! 3070: is essential to ! 3071: .I sam 's ! 3072: success. ! 3073: When ! 3074: .I sam ! 3075: was first made available to the ! 3076: .I jim ! 3077: community, ! 3078: almost everyone switched to it within two or three days. ! 3079: In the months that followed, even people who had never adopted ! 3080: .I jim ! 3081: started using ! 3082: .I sam ! 3083: exclusively. ! 3084: .PP ! 3085: To be honest, ! 3086: .I ed ! 3087: still gets occasional use, but usually when ! 3088: something quick needs to be done and the overhead of ! 3089: downloading the terminal part of ! 3090: .I sam ! 3091: isn't worth the trouble. ! 3092: Also, as a `line' editor, ! 3093: .CW "sam -d" ! 3094: is a bit odd; ! 3095: when using a good old ASCII terminal, it's comforting to have ! 3096: a true line editor. ! 3097: But it is fair to say that ! 3098: .I sam 's ! 3099: command language has displaced ! 3100: .I ed 's ! 3101: for most of the complicated editing that has kept line editors ! 3102: (that is, command-driven editors) with us. ! 3103: .PP ! 3104: .I Sam 's ! 3105: command language is even fancier than ! 3106: .I ed 's, ! 3107: and most ! 3108: .I sam ! 3109: customers don't come near to using all its capabilities. ! 3110: Does it need to be so sophisticated? ! 3111: I think the answer is yes, for two reasons. ! 3112: .PP ! 3113: First, the ! 3114: .I model ! 3115: for ! 3116: .I sam 's ! 3117: command language is really relatively simple, and certainly simpler than that of ! 3118: .I ed . ! 3119: For instance, there is only one kind of textual loop in ! 3120: .I sam ! 3121: \(em the ! 3122: .CW x ! 3123: command \(em ! 3124: while ! 3125: .I ed ! 3126: has three (the ! 3127: .CW g ! 3128: command, the global flag on substitutions, and the implicit loop over ! 3129: lines in multi-line substitutions). ! 3130: Also, ! 3131: .I ed 's ! 3132: substitute command is necessary to make changes within lines, but in ! 3133: .I sam ! 3134: the ! 3135: .CW s ! 3136: command is more of a familiar convenience than a necessity; ! 3137: .CW c ! 3138: and ! 3139: .CW t ! 3140: can do all the work. ! 3141: .PP ! 3142: Second, ! 3143: given a community that expects an editor to be about as powerful as ! 3144: .I ed , ! 3145: it's hard to see how ! 3146: .I sam ! 3147: could really be much simpler and still satisfy that expectation. ! 3148: People want to do ``global substitutes,'' and most are content ! 3149: to have the recipe for that and a few other fancy changes. ! 3150: The sophistication of the command language is really just a veneer ! 3151: over a design that makes it possible to do global substitutes ! 3152: in a screen editor. ! 3153: Some people will always want something more, however, and it's gratifying to ! 3154: be able to provide it. ! 3155: The real power of ! 3156: .I sam 's ! 3157: command language comes from composability of the operators, which is by ! 3158: nature orthogonal to the underlying model. ! 3159: In other words, ! 3160: .I sam ! 3161: is not itself complex, but it makes complex things possible. ! 3162: If you don't want to do anything complex, you can ignore the ! 3163: complexity altogether, and many people do so. ! 3164: .PP ! 3165: Sometimes I am asked the opposite question: why didn't I just make ! 3166: .I sam ! 3167: a real programmable editor, with macros and variables and so on? ! 3168: The main reason is a matter of taste: I like the editor ! 3169: to be the same every time I use it. ! 3170: There is one technical reason, though: ! 3171: programmability in editors is largely a workaround for insufficient ! 3172: interactivity. ! 3173: Programmable editors are used to make particular, usually short-term, ! 3174: things easy to do, such as by providing shorthands for common actions. ! 3175: If things are generally easy to do in the first place, ! 3176: shorthands are not as helpful. ! 3177: .I Sam ! 3178: makes common editing operations very easy, and the solutions to ! 3179: complex editing problems seem commensurate with the problems themselves. ! 3180: Also, the ability to edit the ! 3181: .I sam ! 3182: window makes it easy to repeat commands \(em it only takes a mouse button click ! 3183: to execute a command again. ! 3184: .NH 2 ! 3185: Pros and cons ! 3186: .LP ! 3187: .I Sam ! 3188: has several other good points, ! 3189: and its share of problems. ! 3190: Among the good things is the idea of ! 3191: structural regular expressions, ! 3192: whose usefulness has only begun to be explored. ! 3193: They were arrived at serendipitously when I attempted to distill the essence of ! 3194: .I ed 's ! 3195: way of doing global substitution and recognized that the looping command in ! 3196: .I ed ! 3197: was implicitly imposing a structure (an array of lines) on the file. ! 3198: .PP ! 3199: Another of ! 3200: .I sam 's ! 3201: good things is its undo capability. ! 3202: I had never before used an editor with a true undo, ! 3203: but I would never go back now. ! 3204: Undo ! 3205: .I must ! 3206: be done well, but if it is, it can be relied on. ! 3207: For example, ! 3208: it's safe to experiment if you're not sure how to write some intricate command, ! 3209: because if you make a mistake, it can be fixed simply and reliably. ! 3210: I learned two things about undo from writing ! 3211: .I sam : ! 3212: first, it's easy to provide if you design it in from the beginning, and ! 3213: second, it's necessary, particularly if the system has some subtle ! 3214: properties that may be unfamiliar or error-prone for users. ! 3215: .PP ! 3216: .I Sam 's ! 3217: lack of internal limits and sizes is a virtue. ! 3218: Because it avoids all fixed-size tables and data structures, ! 3219: .I sam ! 3220: is able to make global changes to files that some of our other ! 3221: tools cannot even read. ! 3222: Moreover, the design keeps the performance linear when doing such ! 3223: operations, although I must admit ! 3224: .I sam ! 3225: does get slow when editing a huge file. ! 3226: .PP ! 3227: Now, the problems. ! 3228: Externally, the most obvious is that it is poorly integrated into the ! 3229: surrounding window system. ! 3230: By design, the user interface in ! 3231: .I sam ! 3232: feels almost identical to that of ! 3233: .I mux , ! 3234: but a thick wall separates text in ! 3235: .I sam ! 3236: from the programs running in ! 3237: .I mux . ! 3238: For instance, the `snarf buffer' in ! 3239: .I sam ! 3240: must be maintained separately from that in ! 3241: .I mux . ! 3242: This is regrettable, but probably necessary given the unusual configuration ! 3243: of the system, with a programmable terminal on the far end of an RS-232 link. ! 3244: .PP ! 3245: .I Sam ! 3246: is reliable; otherwise, people wouldn't use it. ! 3247: But it was written over such a long time, and has so many new (to me) ! 3248: ideas in it, that I would like to see it done over again to clean ! 3249: up the code and remove many of the lingering problems in the implementation. ! 3250: The worst part is in the interconnection of the host and terminal parts, ! 3251: which might even be able to go away in a redesign for a more ! 3252: conventional window system. ! 3253: The program must be split in two to use the terminal effectively, ! 3254: but the low bandwidth of the connection forces the separation to ! 3255: occur in an inconvenient part of the design if performance is to be acceptable. ! 3256: A simple remote procedure call ! 3257: protocol driven by the host, emitting only graphics ! 3258: commands, would be easy to write but wouldn't have nearly the ! 3259: necessary responsiveness. On the other hand, if the terminal were in control ! 3260: and requested much simpler file services from the host, regular expression ! 3261: searches would require that the terminal read the entire file over its RS-232 ! 3262: link, which would be unreasonably slow. ! 3263: A compromise in which either end can take control is necessary. ! 3264: In retrospect, the communications protocol should have been ! 3265: designed and verified formally, although I do not know of any tool ! 3266: that can adequately relate the protocol to ! 3267: its implementation. ! 3268: .PP ! 3269: Not all of ! 3270: .I sam 's ! 3271: users are comfortable with its command language, and few are adept. ! 3272: Some (venerable) people use a sort of ! 3273: .I ed \& `` ! 3274: subset'' of ! 3275: .I sam 's ! 3276: command language, ! 3277: and even ask why ! 3278: .I sam 's ! 3279: command language is not exactly ! 3280: .I ed 's. ! 3281: (The reason, of course, is that ! 3282: .I sam 's ! 3283: model for text does not include newlines, which are central to ! 3284: .I ed . ! 3285: Making the text an array of newlines to the command language would ! 3286: be too much of a break from the seamless model provided by the mouse. ! 3287: Some editors, such as ! 3288: .I vi , ! 3289: are willing to make this break, though.) ! 3290: The difficulty is that ! 3291: .I sam 's ! 3292: syntax is so close to ! 3293: .I ed 's ! 3294: that people believe it ! 3295: .I should ! 3296: be the same. ! 3297: I thought, with some justification in hindsight, ! 3298: that making ! 3299: .I sam ! 3300: similar to ! 3301: .I ed ! 3302: would make it easier to learn and to accept. ! 3303: But I may have overstepped and raised the users' ! 3304: expectations too much. ! 3305: It's hard to decide which way to resolve this problem. ! 3306: .PP ! 3307: Finally, there is a tradeoff in ! 3308: .I sam ! 3309: that was decided by the environment in which it runs: ! 3310: .I sam ! 3311: is a multi-file editor, although in a different system there might instead be ! 3312: multiple single-file editors. ! 3313: The decision was made primarily because starting a new program in a Blit is ! 3314: time-consuming. ! 3315: If the choice could be made freely, however, I would ! 3316: still choose the multi-file architecture, because it allows ! 3317: groups of files to be handled as a unit; ! 3318: the usefulness of the multi-file commands is incontrovertible. ! 3319: It is delightful to have the source to an entire program ! 3320: available at your fingertips. ! 3321: .NH ! 3322: Acknowledgements ! 3323: .LP ! 3324: Tom Cargill suggested the idea behind the ! 3325: .CW Rasp ! 3326: data structure. ! 3327: Norman Wilson and Ken Thompson influenced the command language. ! 3328: This paper was improved by comments from ! 3329: Al Aho, ! 3330: Jon Bentley, ! 3331: Chris Fraser, ! 3332: Gerard Holzmann, ! 3333: Brian Kernighan, ! 3334: Ted Kowalski, ! 3335: Doug McIlroy ! 3336: and ! 3337: Dennis Ritchie. ! 3338: .NH ! 3339: References ! 3340: .LP ! 3341: |reference_placement
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.