Annotation of researchv10dc/vol2/sam/sam.ms, revision 1.1.1.1

1.1       root        1: .so ../ADM/mac
                      2: .XX sam 399 "The Text Editor Sam"
                      3: .nr dP 2
                      4: .nr dV 3p
                      5: .de Cs
                      6: .br
                      7: .fi
                      8: .ft 2
                      9: .ps -2
                     10: .vs -2
                     11: ..
                     12: .de Ce
                     13: .br
                     14: .nf
                     15: .ft 1
                     16: .ps
                     17: .vs
                     18: .sp
                     19: ..
                     20: .TL
                     21: The Text Editor Sam\(dg
                     22: .AU
                     23: Rob Pike
                     24: .AI
                     25: .MH
                     26: .AB
                     27: .PP
                     28: .I Sam
                     29: is an interactive multi-file text editor intended for
                     30: bitmap displays.
                     31: A textual command language
                     32: supplements the mouse-driven, cut-and-paste interface
                     33: to make complex or
                     34: repetitive editing tasks easy to specify.
                     35: The language is characterized by the composition of regular expressions
                     36: to describe the structure of the text being modified.
                     37: The treatment of files as a database, with changes logged
                     38: as atomic transactions, guides the implementation and
                     39: makes a general `undo' mechanism straightforward.
                     40: .PP
                     41: .I Sam
                     42: is implemented as two processes connected by a low-bandwidth stream,
                     43: one process handling the display and the other the editing
                     44: algorithms.  Therefore it can run with the display process
                     45: in a bitmap terminal and the editor on a local host,
                     46: with both processes on a bitmap-equipped host, or with
                     47: the display process in the terminal and the editor in a
                     48: remote host.
                     49: By suppressing the display process,
                     50: it can even run without a bitmap terminal.
                     51: .AE
                     52: .2C
                     53: .FS
                     54: \(dg This article originally appeared as |reference(spe sam pike).
                     55: It is reprinted by permission of John Wiley & Sons, Ltd.
                     56: .FE
                     57: .NH
                     58: Introduction
                     59: .LP
                     60: .I Sam
                     61: is an interactive text editor that combines cut-and-paste interactive editing with
                     62: an unusual command language based on the composition of regular expressions.
                     63: It is written as two programs: one, the `host part,' runs on a
                     64: .UX
                     65: system
                     66: and implements the command language and provides file access; the other, the
                     67: `terminal part,' runs asynchronously
                     68: on a machine with a mouse and bitmap display
                     69: and supports the display and interactive editing.
                     70: The host part may be even run in isolation on an ordinary terminal
                     71: to edit text using the command
                     72: language, much like a traditional line editor,
                     73: without assistance from a mouse or display.
                     74: Most often,
                     75: the terminal part runs on a Blit|reference(blit bstj) terminal
                     76: (actually on a Teletype DMD 5620, the production version of the Blit), whose
                     77: host connection is an ordinary 9600 bps RS-232 link;
                     78: on the Sun computer the host and display processes run on a single machine,
                     79: connected by a pipe.
                     80: .PP
                     81: .I Sam
                     82: edits uninterpreted
                     83: ASCII text.
                     84: It has no facilities for multiple fonts, graphics or tables,
                     85: unlike MacWrite,|reference(macwrite) Bravo,|reference(bravo)
                     86: Tioga|reference(tioga)
                     87: or Lara.|reference(lara)
                     88: Also unlike them, it has a rich command language.
                     89: (Throughout this paper, the phrase
                     90: .I
                     91: command language
                     92: .R
                     93: refers to
                     94: textual commands; commands activated from the mouse form the
                     95: .I mouse
                     96: .I language. )
                     97: .I Sam
                     98: developed as an editor for use by programmers, and tries to join
                     99: the styles of the
                    100: .UX
                    101: text editor
                    102: .I ed
                    103: |reference(v7manual)|reference(kernighan pike)
                    104: with that of interactive cut-and-paste editors by
                    105: providing a comfortable mouse-driven interface
                    106: to a program with a solid command language driven by regular expressions.
                    107: The command language developed more than the mouse language, and
                    108: acquired a notation for describing the structure of files
                    109: more richly than as a sequence of lines,
                    110: using a dataflow-like syntax for specifying changes.
                    111: .PP
                    112: The interactive style was influenced by
                    113: .I jim ,
                    114: |reference(bstj blit)
                    115: an early cut-and-paste editor for the Blit, and by
                    116: .I mux ,
                    117: |reference(latest volume1)
                    118: the Blit window system.
                    119: .I Mux
                    120: merges the original Blit window system,
                    121: .I mpx ,
                    122: |reference(bstj blit)
                    123: with cut-and-paste editing, forming something like a
                    124: multiplexed version of
                    125: .I jim
                    126: that edits the output of (and input to) command sessions rather than files.
                    127: .PP
                    128: The first part of this paper describes the command language, then the mouse
                    129: language, and explains how they interact.
                    130: That is followed by a description of the implementation,
                    131: first of the host part, then of the terminal part.
                    132: A principle that influenced the design of
                    133: .I sam
                    134: is that it should have no explicit limits, such as upper limits on
                    135: file size or line length.
                    136: A secondary consideration is that it be efficient.
                    137: To honor these two goals together requires a method for efficiently
                    138: manipulating
                    139: huge strings (files) without breaking them into lines,
                    140: perhaps while making thousands of changes
                    141: under control of the command language.
                    142: .I Sam 's
                    143: method is to
                    144: treat the file as a transaction database, implementing changes as atomic
                    145: updates.  These updates may be unwound easily to `undo' changes.
                    146: Efficiency is achieved through a collection of caches that minimizes
                    147: disc traffic and data motion, both within the two parts of the program
                    148: and between them.
                    149: .PP
                    150: The terminal part of
                    151: .I sam
                    152: is fairly straightforward.
                    153: More interesting is how the two halves of the editor stay
                    154: synchronized when either half may initiate a change.
                    155: This is achieved through a data structure that organizes the
                    156: communications and is maintained in parallel by both halves.
                    157: .PP
                    158: The last part of the paper chronicles the writing of
                    159: .I sam
                    160: and discusses the lessons that were learned through its development and use.
                    161: .PP
                    162: The paper is long, but is composed largely of two papers of reasonable length:
                    163: a description of the user interface of
                    164: .I sam
                    165: and a discussion of its implementation.
                    166: They are combined because the implementation is strongly influenced by
                    167: the user interface, and vice versa.
                    168: .NH
                    169: The Interface
                    170: .LP
                    171: .I Sam
                    172: is a text editor for multiple files.
                    173: File names may be provided when it is invoked:
                    174: .P1 2n
                    175: sam file1 file2 ...
                    176: .P2
                    177: and there are commands
                    178: to add new files and discard unneeded ones.
                    179: Files are not read until necessary
                    180: to complete some command.
                    181: Editing operations apply to an internal copy
                    182: made when the file is read; the
                    183: .UX
                    184: file associated with the copy
                    185: is changed only by an explicit command.
                    186: To simplify the discussion, the internal copy is here called a
                    187: .I file ,
                    188: while the disc-resident original is called a
                    189: .I
                    190: disc file.
                    191: .R
                    192: .PP
                    193: .I Sam
                    194: is usually connected to a bitmap display that presents a cut-and-paste
                    195: editor driven by the mouse.
                    196: In this mode, the command language is still available:
                    197: text typed in a special window, called the
                    198: .I sam
                    199: .I window,
                    200: is interpreted
                    201: as commands to be executed in the current file.
                    202: Cut-and-paste editing may be used in any window \(em even in the
                    203: .I sam
                    204: window to construct commands.
                    205: The other mode of operation, invoked by starting
                    206: .I sam
                    207: with the option
                    208: .CW -d
                    209: (for `no download'),
                    210: does not use the mouse or bitmap display, but still permits
                    211: editing using the textual command language, even on an ordinary terminal,
                    212: interactively or from a script.
                    213: .PP
                    214: The following sections describe first the command language (under
                    215: .CW sam\ -d
                    216: and in the
                    217: .I sam
                    218: window), and then the mouse interface.
                    219: These two languages are nearly independent, but connect through the
                    220: .I current
                    221: .I text,
                    222: described below.
                    223: .NH 2
                    224: The Command Language
                    225: .LP
                    226: A file consists of its contents, which are an array of characters
                    227: (that is, a string); the
                    228: .I name
                    229: of the associated disc file; the
                    230: .I
                    231: modified bit
                    232: .R
                    233: that states whether the contents match those of
                    234: the disc file;
                    235: and a substring of the contents, called the
                    236: .I
                    237: current text
                    238: .R
                    239: or
                    240: .I dot
                    241: (see Figures 1 and 2).
                    242: If the current text is a null string, dot falls between characters.
                    243: The
                    244: .I value
                    245: of dot is the location of the current text; the
                    246: .I contents
                    247: of dot are the characters it contains.
                    248: .I Sam
                    249: imparts to the text no two-dimensional interpretation such as columns
                    250: or fields; text is always one-dimensional.
                    251: Even the idea of a `line' of text as understood by most
                    252: .UX
                    253: programs
                    254: \(em a sequence of characters terminated by a newline character \(em
                    255: is only weakly supported.
                    256: .PP
                    257: The
                    258: .I
                    259: current file
                    260: .R
                    261: is the file to which editing commands refer.
                    262: The current text is therefore dot in the current file.
                    263: If a command doesn't explicitly name a particular file or piece of text,
                    264: the command is assumed to apply to the current text.
                    265: For the moment, ignore the presence of multiple files and consider
                    266: editing a single file.
                    267: .1C
                    268: .KF L
                    269: .IB fig1.ps 3.10i 5.65i c
                    270: .Cs
                    271: Figure 1. A typical
                    272: .I sam
                    273: screen, with the editing menu presented.
                    274: The
                    275: .I sam
                    276: (command language) window is in the middle, with file windows above and below.
                    277: (The user interface makes it easy to create these abutting windows.)
                    278: The partially obscured window is a third file window.
                    279: The uppermost window is that to which typing and mouse operations apply,
                    280: as indicated by its heavy border.
                    281: Each window has its current text highlighted in reverse video.
                    282: The
                    283: .I sam
                    284: window's current text is the null string on the last visible line,
                    285: indicated by a vertical bar.
                    286: See also Figure 2.
                    287: .Ce
                    288: .KE
                    289: .2C
                    290: .PP
                    291: Commands have one-letter names.
                    292: Except for non-editing commands such as writing
                    293: the file to disc, most commands make some change
                    294: to the text in dot and leave dot set to the text resulting from the change.
                    295: For example, the delete command,
                    296: .CW d ,
                    297: deletes the text in dot, replacing it by the null string and setting dot
                    298: to the result.
                    299: The change command,
                    300: .CW c ,
                    301: replaces dot by text delimited by an arbitrary punctuation character,
                    302: conventionally
                    303: a slash.  Thus,
                    304: .P1 2n
                    305: c/Peter/
                    306: .P2
                    307: replaces the text in dot by the string
                    308: .CW Peter .
                    309: Similarly,
                    310: .P1 2n
                    311: a/Peter/
                    312: .P2
                    313: (append) adds the string after dot, and
                    314: .P1 2n
                    315: i/Peter/
                    316: .P2
                    317: (insert) inserts before dot.
                    318: All three leave dot set to the new text,
                    319: .CW Peter .
                    320: .PP
                    321: Newlines are part of the syntax of commands:
                    322: the newline character lexically terminates a command.
                    323: Within the inserted text, however, newlines are never implicit.
                    324: But since it is often convenient to insert multiple lines of text,
                    325: .I sam
                    326: has a special
                    327: syntax for that case:
                    328: .P1 2n
                    329: a
                    330: some lines of text
                    331: to be inserted in the file,
                    332: terminated by a period
                    333: on a line by itself
                    334: \&.
                    335: .P2
                    336: In the one-line syntax, a newline character may be specified by a C-like
                    337: escape, so
                    338: .P1 2n
                    339: c/\en/
                    340: .P2
                    341: replaces dot by a single newline character.
                    342: .PP
                    343: .I Sam
                    344: also has a substitute command,
                    345: .CW s :
                    346: .P1 2n
                    347: s/\f2expression\fP/\f2replacement\fP/
                    348: .P2
                    349: substitutes the replacement text for the first match, in dot,
                    350: of the regular expression.
                    351: Thus, if dot is the string
                    352: .CW Peter ,
                    353: the command
                    354: .P1 2n
                    355: s/t/st/
                    356: .P2
                    357: changes it to
                    358: .CW Pester .
                    359: In general,
                    360: .CW s
                    361: is unnecessary, but it was inherited from
                    362: .I ed
                    363: and it has some convenient variations.
                    364: For instance, the replacement text may include the matched text,
                    365: specified by
                    366: .CW & :
                    367: .P1 2n
                    368: s/Peter/Oh, &, &, &, &!/
                    369: .P2
                    370: .PP
                    371: There are also three commands that apply programs
                    372: to text:
                    373: .P1 2n
                    374: < \f2Unix program\fP
                    375: .P2
                    376: replaces dot by the output of the
                    377: .UX
                    378: program.
                    379: Similarly, the
                    380: .CW >
                    381: command
                    382: runs the program with dot as its standard input, and
                    383: .CW |
                    384: does both.  For example,
                    385: .P1 2n
                    386: | sort
                    387: .P2
                    388: replaces dot by the result of applying the standard sorting utility to it.
                    389: Again, newlines have no special significance for these
                    390: .I sam
                    391: commands.
                    392: The text acted upon and resulting from these commands is not necessarily
                    393: bounded by newlines, although for connection with
                    394: .UX
                    395: programs,
                    396: newlines may be necessary to obey conventions.
                    397: .PP
                    398: One more command:
                    399: .CW p
                    400: prints the contents of dot.
                    401: Table I summarizes
                    402: .I sam 's
                    403: commands.
                    404: .1C
                    405: .KF
                    406: .Tm Table of Commands  s
                    407: .TS
                    408: center;
                    409: c s
                    410: lfCW l.
                    411: Table I. \fISam\fP commands
                    412: .sp .4
                    413: .ft CW
                    414: _
                    415: .ft
                    416: .sp .4
                    417: \f1Text commands\fP    
                    418: .sp .4
                    419: _
                    420: .sp .4
                    421: a/\f2text\fP/  Append text after dot
                    422: c/\f2text\fP/  Change text in dot
                    423: i/\f2text\fP/  Insert text before dot
                    424: d      Delete text in dot
                    425: s/\f2regexp\fP/\f2text\fP/     Substitute text for match of regular expression in dot
                    426: m \f2address\fP        Move text in dot after address
                    427: t \f2address\fP        Copy text in dot after address
                    428: .sp .4
                    429: _
                    430: .sp .4
                    431: \f1Display commands\fP 
                    432: .sp .4
                    433: _
                    434: .sp .2
                    435: p      Print contents of dot
                    436: \&=    Print value (line numbers and character numbers) of dot
                    437: .sp .4
                    438: _
                    439: .sp .4
                    440: \f1File commands\fP
                    441: .sp .4
                    442: _
                    443: .sp .2
                    444: b \f2file-list\fP      Set current file to first file in list that \fIsam\fP has in menu
                    445: B \f2file-list\fP      Same as \f(CWb\fP, but load new files
                    446: n      Print menu lines of all files
                    447: D \f2file-list\fP      Delete named files from \fIsam\fP
                    448: .sp .4
                    449: _
                    450: .sp .4
                    451: \f1I/O commands\fP     
                    452: .sp .4
                    453: _
                    454: .sp .2
                    455: e \f2filename\fP       Replace file with named disc file
                    456: r \f2filename\fP       Replace dot by contents of named disc file
                    457: w \f2filename\fP       Write file to named disc file
                    458: f \f2filename\fP       Set file name and print new menu line
                    459: < \f2Unix-command\fP   Replace dot by standard output of command
                    460: > \f2Unix-command\fP   Send dot to standard input of command
                    461: | \f2Unix-command\fP   Replace dot by result of command applied to dot
                    462: ! \f2Unix-command\fP   Run the command
                    463: .sp .4
                    464: _
                    465: .sp .4
                    466: \f1Loops and conditionals\fP   
                    467: .sp .4
                    468: _
                    469: .sp .2
                    470: x/\f2regexp\fP/ \f2command\fP  For each match of regexp, set dot and run command
                    471: y/\f2regexp\fP/ \f2command\fP  Between adjacent matches of regexp, set dot and run command
                    472: X/\f2regexp\fP/ \f2command\fP  Run command in each file whose menu line matches regexp
                    473: Y/\f2regexp\fP/ \f2command\fP  Run command in each file whose menu line does not match
                    474: g/\f2regexp\fP/ \f2command\fP  If dot contains a match of regexp, run command
                    475: v/\f2regexp\fP/ \f2command\fP  If dot does not contain a match of regexp, run command
                    476: .sp .4
                    477: _
                    478: .sp .4
                    479: \f1Miscellany\fP       
                    480: .sp .4
                    481: _
                    482: .sp .2
                    483: k      Set address mark to value of dot
                    484: q      Quit
                    485: u \f2n\fP      Undo last \f2n\fP (default 1) changes
                    486: { }    Braces group commands
                    487: .sp .3
                    488: .ft CW
                    489: _
                    490: .ft
                    491: .TE
                    492: .sp
                    493: .Tm  n         S
                    494: .Tm  k         S
                    495: .Tm  q         S
                    496: .Tm  p         S
                    497: .KE
                    498: .2C
                    499: .PP
                    500: The value of dot may be changed by
                    501: specifying an
                    502: .I address
                    503: for the command.
                    504: The simplest address is a line number:
                    505: .P1 2n
                    506: 3
                    507: .P2
                    508: refers to the third line of the file, so
                    509: .P1 2n
                    510: 3d
                    511: .P2
                    512: deletes the third line of the file, and implicitly renumbers
                    513: the lines so the old line 4 is now numbered 3.
                    514: (This is one of the few places where
                    515: .I sam
                    516: deals with lines directly.)
                    517: Line
                    518: .CW 0
                    519: is the null string at the beginning of the file.
                    520: If a command consists of only an address, a
                    521: .CW p
                    522: command is assumed, so typing an unadorned
                    523: .CW 3
                    524: prints line 3 on the terminal.
                    525: There are a couple of other basic addresses:
                    526: a period addresses dot itself; and
                    527: a dollar sign
                    528: .CW $ ) (
                    529: addresses the null string at the end of the file.
                    530: .PP
                    531: An address is always a single substring of the file.
                    532: Thus, the address
                    533: .CW 3
                    534: addresses the characters
                    535: after the second newline of
                    536: the file through the third newline of the file.
                    537: A
                    538: .I
                    539: compound address
                    540: .R
                    541: is constructed by the comma operator
                    542: .P1 2n
                    543: \f2address1\fP,\f2address2\fP
                    544: .P2
                    545: and addresses the substring of the file from the beginning of
                    546: .I address1
                    547: to the end of
                    548: .I address2 .
                    549: For example, the command
                    550: .CW 3,5p
                    551: prints the third through fifth lines of the file and
                    552: .CW .,$d
                    553: deletes the text from the beginning of dot to the end of the file.
                    554: .PP
                    555: These addresses are all absolute positions in the file, but
                    556: .I sam
                    557: also has relative addresses, indicated by
                    558: .CW +
                    559: or
                    560: .CW - .
                    561: For example,
                    562: .P1 2n
                    563: $-3
                    564: .P2
                    565: is the third line before the end of the file and
                    566: .P1 2n
                    567: \&.+1
                    568: .P2
                    569: is the line after dot.
                    570: If no address appears to the left of the
                    571: .CW +
                    572: or
                    573: .CW - ,
                    574: dot is assumed;
                    575: if nothing appears to the right,
                    576: .CW 1
                    577: is assumed.
                    578: Therefore,
                    579: .CW .+1
                    580: may be abbreviated to just a plus sign.
                    581: .PP
                    582: The
                    583: .CW +
                    584: operator acts relative to the end of its first argument, while the
                    585: .CW -
                    586: operator acts relative to the beginning.  Thus
                    587: .CW .+1
                    588: addresses the first line after dot,
                    589: .CW .-
                    590: addresses the first line before dot, and
                    591: .CW +-
                    592: refers to the line containing the end of dot.  (Dot may span multiple lines, and
                    593: .CW +
                    594: selects the line after the end of dot, then
                    595: .CW -
                    596: backs up one line.)
                    597: .PP
                    598: The final type of address is a regular expression, which addresses the
                    599: text matched by the expression.  The expression is enclosed in slashes, as in
                    600: .P1 2n
                    601: /\f2expression\fP/
                    602: .P2
                    603: The expressions are the same as those in the
                    604: .UX
                    605: program
                    606: .I egrep ,
                    607: |reference(v7manual)|reference(kernighan pike)
                    608: and include closures, alternations, and so on.
                    609: They find the
                    610: .I
                    611: leftmost longest
                    612: .R
                    613: string that matches the expression, that is,
                    614: the first match after the point where the search is started,
                    615: and if more than one match begins at the same spot, the longest such match.
                    616: (I assume familiarity with the syntax for regular expressions in
                    617: .UX
                    618: programs.|reference(bsdmanual))
                    619: For example,
                    620: .P1 2n
                    621: /x/
                    622: .P2
                    623: matches the next
                    624: .CW x
                    625: character in the file,
                    626: .P1 2n
                    627: /xx*/
                    628: .P2
                    629: matches the next run of one or more
                    630: .CW x 's,
                    631: and
                    632: .P1 2n
                    633: /x|Peter/
                    634: .P2
                    635: matches the next
                    636: .CW x
                    637: or
                    638: .CW Peter .
                    639: For compatibility with other
                    640: .UX
                    641: programs, the `any character' operator,
                    642: a period,
                    643: does not match a newline, so
                    644: .P1 2n
                    645: /.*/
                    646: .P2
                    647: matches the text from dot to the end of the line, but excludes the newline
                    648: and so will not match across
                    649: the line boundary.
                    650: .PP
                    651: Regular expressions are always relative addresses.
                    652: The direction is forwards by default,
                    653: so
                    654: .CW /Peter/
                    655: is really an abbreviation for
                    656: .CW +/Peter/ .
                    657: The search can be reversed with a minus sign, so
                    658: .P1 2n
                    659: .CW -/Peter/
                    660: .P2
                    661: finds the first
                    662: .CW Peter
                    663: before dot.
                    664: Regular expressions may be used with other address forms, so
                    665: .CW 0+/Peter/
                    666: finds the first
                    667: .CW Peter
                    668: in the file and
                    669: .CW $-/Peter/
                    670: finds the last.
                    671: Table II summarizes
                    672: .I sam 's
                    673: addresses.
                    674: .1C
                    675: .KF
                    676: .TS
                    677: center;
                    678: c s
                    679: lfCW l.
                    680: Table II. \fISam\fP addresses
                    681: .sp .4
                    682: .ft CW
                    683: _
                    684: .ft
                    685: .sp .4
                    686: \f1Simple addresses\fP 
                    687: .sp .4
                    688: _
                    689: .sp .2
                    690: #\f2n\fP       The empty string after character \f2n\fP
                    691: \f2n\fP        Line \f2n\fP.
                    692: /\f2regexp\fP/ The first following match of the regular expression
                    693: -/\f2regexp\fP/        The first previous match of the regular expression
                    694: $      The null string at the end of the file
                    695: \&.    Dot
                    696: \&'    The address mark, set by \f(CWk\fP command
                    697: "\f2regexp\fP" Dot in the file whose menu line matches regexp
                    698: .sp .4
                    699: _
                    700: .sp .4
                    701: \f1Compound addresses\fP       
                    702: .sp .4
                    703: _
                    704: .sp .2
                    705: \f2a1\fP+\f2a2\fP      The address \f2a2\fP evaluated starting at right of \f2a1\fP
                    706: \f2a1\fP-\f2a2\fP      \f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP
                    707: \f2a1\fP,\f2a2\fP      From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP)
                    708: \f2a1\fP;\f2a2\fP      Like \f(CW,\fP but sets dot after evaluating \f2a1\fP
                    709: .sp .4
                    710: _
                    711: .sp .4
                    712: .T&
                    713: c s.
                    714: T{
                    715: The operators
                    716: .CW +
                    717: and
                    718: .CW -
                    719: are high precedence, while
                    720: .CW ,
                    721: and
                    722: .CW ;
                    723: are low precedence.
                    724: In both
                    725: .CW +
                    726: and
                    727: .CW -
                    728: forms,
                    729: .I a2
                    730: defaults to 1 and
                    731: .I a1
                    732: defaults to dot.
                    733: If both
                    734: .I a1
                    735: and
                    736: .I a2
                    737: are present,
                    738: .CW +
                    739: may be elided.
                    740: T}
                    741: .sp .5
                    742: .ft CW
                    743: _
                    744: .ft
                    745: .TE
                    746: .sp
                    747: .KE
                    748: .2C
                    749: .PP
                    750: The language discussed so far will not seem novel
                    751: to people who use
                    752: .UX
                    753: text editors
                    754: such as
                    755: .I ed
                    756: or
                    757: .I vi .
                    758: |reference(bsdmanual)
                    759: Moreover, the kinds of editing operations these commands allow, with the exception
                    760: of regular expressions and line numbers,
                    761: are clearly more conveniently handled by a mouse-based interface.
                    762: Indeed,
                    763: .I sam 's
                    764: mouse language (discussed at length below) is the means by which
                    765: simple changes are usually made.
                    766: For large or repetitive changes, however, a textual language
                    767: outperforms a manual interface.
                    768: .PP
                    769: Imagine that, instead of deleting just one occurrence of the string
                    770: .CW Peter ,
                    771: we wanted to eliminate every
                    772: .CW Peter .
                    773: What's needed is an iterator that runs a command for each occurrence of some
                    774: text.
                    775: .I Sam 's
                    776: iterator is called
                    777: .CW x ,
                    778: for extract:
                    779: .P1 2n
                    780: x/\f2expression\fP/ \f2command\fP
                    781: .P2
                    782: finds all matches in dot of the specified expression, and for each
                    783: such match, sets dot to the text matched and runs the command.
                    784: So to delete all the
                    785: .CW Peters:
                    786: .P1 2n
                    787: 0,$ x/Peter/ d
                    788: .P2
                    789: (Blanks in these examples are to improve readability;
                    790: .I sam
                    791: neither requires nor interprets them.)
                    792: This searches the entire file
                    793: .CW 0,$ ) (
                    794: for occurrences of the string
                    795: .CW Peter ,
                    796: and runs the
                    797: .CW d
                    798: command with dot set to each such occurrence.
                    799: (By contrast, the comparable
                    800: .I ed
                    801: command would delete all
                    802: .I lines
                    803: containing
                    804: .CW Peter ;
                    805: .I sam
                    806: deletes only the
                    807: .CW Peters .)
                    808: The address
                    809: .CW 0,$
                    810: is commonly used, and may be abbreviated to just a comma.
                    811: As another example,
                    812: .P1 2n
                    813: , x/Peter/ p
                    814: .P2
                    815: prints a list of
                    816: .CW Peters,
                    817: one for each appearance in the file, with no intervening text (not even newlines
                    818: to separate the instances).
                    819: .PP
                    820: Of course, the text extracted by
                    821: .CW x
                    822: may be selected by a regular expression,
                    823: which complicates deciding what set of matches is chosen \(em
                    824: matches may overlap.  This is resolved by generating the matches
                    825: starting from the beginning of dot using the leftmost-longest rule,
                    826: and searching for each match starting from the end of the previous one.
                    827: Regular expressions may also match null strings, but a null match
                    828: adjacent to a non-null match is never selected; at least one character
                    829: must intervene.
                    830: For example,
                    831: .P1 2n
                    832: , c/AAA/
                    833: x/B*/ c/-/
                    834: , p
                    835: .P2
                    836: produces as output
                    837: .P1 2n
                    838: -A-A-A-
                    839: .P2
                    840: because the pattern
                    841: .CW B*
                    842: matches the null strings separating the
                    843: .CW A 's.
                    844: .PP
                    845: The
                    846: .CW x
                    847: command has a complement,
                    848: .CW y ,
                    849: with similar syntax, that executes the command with dot set to the text
                    850: .I between
                    851: the matches of the expression.
                    852: For example,
                    853: .P1 2n
                    854: , c/AAA/
                    855: y/A/ c/-/
                    856: , p
                    857: .P2
                    858: produces the same result as the example above.
                    859: .PP
                    860: The
                    861: .CW x
                    862: and
                    863: .CW y
                    864: commands are looping constructs, and
                    865: .I sam
                    866: has a pair of conditional commands to go with them.
                    867: They have similar syntax:
                    868: .P1 2n
                    869: g/\f2expression\fP/ \f2command\fP
                    870: .P2
                    871: (guard)
                    872: runs the command exactly once if dot contains a match of the expression.
                    873: This is different from
                    874: .CW x ,
                    875: which runs the command for
                    876: .I each
                    877: match:
                    878: .CW x
                    879: loops;
                    880: .CW g
                    881: merely tests, without changing the value of dot.
                    882: Thus,
                    883: .P1 2n
                    884: , x/Peter/ d
                    885: .P2
                    886: deletes all occurrences of
                    887: .CW Peter ,
                    888: but
                    889: .P1 2n
                    890: , g/Peter/ d
                    891: .P2
                    892: deletes the whole file (reduces it to a null string) if
                    893: .CW Peter
                    894: occurs anywhere in the text.
                    895: The complementary conditional is
                    896: .CW v ,
                    897: which runs the command if there is
                    898: .I no
                    899: match of the expression.
                    900: .PP
                    901: These control-structure-like commands may be composed to construct more
                    902: involved operations.  For example, to print those lines of text that
                    903: contain the string
                    904: .CW Peter :
                    905: .P1 2n
                    906: , x/.*\en/ g/Peter/ p
                    907: .P2
                    908: The
                    909: .CW x
                    910: breaks the file into lines, the
                    911: .CW g
                    912: selects those lines containing
                    913: .CW Peter ,
                    914: and the
                    915: .CW p
                    916: prints them.
                    917: This command gives an address for the
                    918: .CW x
                    919: command (the whole file), but because
                    920: .CW g
                    921: does not have an explicit address, it applies to the value of
                    922: dot produced by the
                    923: .CW x
                    924: command, that is, to each line.
                    925: All commands in
                    926: .I sam
                    927: except for the command to write a file to disc use dot for the
                    928: default address.
                    929: .PP
                    930: Composition may be continued indefinitely.
                    931: .P1 2n
                    932: , x/.*\en/ g/Peter/ v/SaltPeter/ p
                    933: .P2
                    934: prints those lines containing
                    935: .CW Peter
                    936: but
                    937: .I not
                    938: those containing
                    939: .CW SaltPeter .
                    940: .NH 2
                    941: Structural Regular Expressions
                    942: .LP
                    943: Unlike other
                    944: .UX
                    945: text editors,
                    946: including the non-interactive ones such as
                    947: .I sed
                    948: and
                    949: .I awk ,
                    950: |reference(kernighan pike)
                    951: .I sam
                    952: is good for manipulating files with multi-line `records.'
                    953: An example is an on-line phone book composed of records,
                    954: separated by blank lines, of the form
                    955: .P1 2n
                    956: Herbert Tic
                    957: 44 Turnip Ave., Endive, NJ
                    958: 201-5555642
                    959: 
                    960: Norbert Twinge
                    961: 16 Potato St., Cabbagetown, NJ
                    962: 201-5553145
                    963: 
                    964: \&...
                    965: .P2
                    966: The format may be encoded as a regular expression:
                    967: .P1 2n
                    968: (.+\en)+
                    969: .P2
                    970: that is, a sequence of one or more non-blank lines.
                    971: The command to print Mr. Tic's entire record is then
                    972: .P1 2n
                    973: , x/(.+\en)+/ g/^Herbert Tic$/ p
                    974: .P2
                    975: and that to extract just the phone number is
                    976: .P1 2n
                    977: , x/(.+\en)+/ g/^Herbert Tic$/
                    978:        x/^[0-9]*-[0-9]*\en/ p
                    979: .P2
                    980: The latter command breaks the file into records,
                    981: chooses Mr. Tic's record,
                    982: extracts the phone number from the record,
                    983: and finally prints the number.
                    984: .PP
                    985: A more involved problem is that of
                    986: renaming a particular variable, say
                    987: .CW n ,
                    988: to
                    989: .CW num
                    990: in a C program.
                    991: The obvious first attempt,
                    992: .P1 2n
                    993: , x/n/ c/num/
                    994: .P2
                    995: is badly flawed: it changes not only the variable
                    996: .CW n
                    997: but any letter
                    998: .CW n
                    999: that appears.
                   1000: We need to extract all the variables, and select those that match
                   1001: .CW n
                   1002: and only
                   1003: .CW n :
                   1004: .P1 2n
                   1005: , x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
                   1006: .P2
                   1007: The pattern
                   1008: .CW [A-Za-z_][A-Za-z_0-9]*
                   1009: matches C identifiers.
                   1010: Next
                   1011: .CW g/n/
                   1012: selects those containing an
                   1013: .CW n .
                   1014: Then
                   1015: .CW v/../
                   1016: rejects those containing two (or more) characters, and finally
                   1017: .CW c/num/
                   1018: changes the remainder (identifiers
                   1019: .CW n )
                   1020: to
                   1021: .CW num .
                   1022: This version clearly works much better, but there may still be problems.
                   1023: For example, in C character and string constants, the sequence
                   1024: .CW \en
                   1025: is interpreted as a newline character, and we don't want to change it to
                   1026: .CW \enum.
                   1027: This problem can be forestalled with a
                   1028: .CW y
                   1029: command
                   1030: (the following two examples are split onto multiple lines for readability):
                   1031: .P1 2n
                   1032: , y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/
                   1033:        g/n/ v/../ c/num/
                   1034: .P2
                   1035: (the second
                   1036: .CW \e
                   1037: is necessary because of lexical conventions in regular expressions),
                   1038: or we could even reject character constants and strings outright:
                   1039: .P1 2n
                   1040: , y/'[^']*'/ y/"[^"]*"/
                   1041:        x/[A-Za-z_][A-Za-z_0-9]*/
                   1042:                g/n/ v/../ c/num/
                   1043: .P2
                   1044: The
                   1045: .CW y
                   1046: commands in this version exclude from consideration all character constants
                   1047: and strings.
                   1048: The only remaining problem is to deal with the possible occurrence of
                   1049: .CW \e'
                   1050: or
                   1051: .CW \e"
                   1052: within these sequences, but it's easy to see how to resolve this difficulty.
                   1053: .PP
                   1054: The point of these composed commands is successive refinement.
                   1055: A simple version of the command is tried, and if it's not good enough,
                   1056: it can be honed by adding a clause or two.
                   1057: (Mistakes can be undone; see below.
                   1058: Also, the mouse language makes it unnecessary to retype the command each time.)
                   1059: The resulting chains of commands are somewhat reminiscent of
                   1060: shell pipelines.|reference(kernighan pike)
                   1061: Unlike pipelines, though, which pass along modified
                   1062: .I data ,
                   1063: .I sam
                   1064: commands pass a
                   1065: .I view
                   1066: of the data.
                   1067: The text at each step of the command is the same, but which pieces
                   1068: are selected is refined step by step until the correct piece is
                   1069: available to the final step of the command line, which ultimately makes the change.
                   1070: .PP
                   1071: In other
                   1072: .UX
                   1073: programs, regular expressions are used only for selection,
                   1074: as in the
                   1075: .I sam
                   1076: .CW g
                   1077: command, never for extraction as in the
                   1078: .CW x
                   1079: or
                   1080: .CW y
                   1081: command.
                   1082: For example, patterns in
                   1083: .I awk
                   1084: |reference(kernighan pike)
                   1085: are used to select lines to be operated on, but cannot be used
                   1086: to describe the format of the input text, or to handle newline-free text.
                   1087: The use of regular expressions to describe the structure of a piece
                   1088: of text rather than its contents, as in the
                   1089: .CW x
                   1090: command, 
                   1091: has been given a name:
                   1092: .I
                   1093: structural regular expressions.
                   1094: .R
                   1095: When they are composed, as in the above example,
                   1096: they are pleasantly expressive.
                   1097: Their use is discussed at greater length elsewhere.|reference(pike structural)
                   1098: .NH 2
                   1099: Multiple files
                   1100: .LP
                   1101: .I Sam
                   1102: has a few other commands, mostly relating to input and output.
                   1103: .P1 2n
                   1104: e discfilename
                   1105: .P2
                   1106: replaces the contents and name of the current file with those of the named
                   1107: disc file;
                   1108: .P1 2n
                   1109: w discfilename
                   1110: .P2
                   1111: writes the contents to the named disc file; and
                   1112: .P1 2n
                   1113: r discfilename
                   1114: .P2
                   1115: replaces dot with the contents of the named disc file.
                   1116: All these commands use the current file's name if none is specified.
                   1117: Finally,
                   1118: .P1 2n
                   1119: f discfilename
                   1120: .P2
                   1121: changes the name associated with the file and displays the result:
                   1122: .P1 2n
                   1123: \&'-. discfilename
                   1124: .P2
                   1125: This output is called the file's
                   1126: .I
                   1127: menu line,
                   1128: .R
                   1129: because it is the contents of the file's line in the button 3 menu (described
                   1130: in the
                   1131: next section).
                   1132: The first three characters are a concise notation for the state of the file.
                   1133: The apostrophe signifies that the file is modified.
                   1134: The minus sign indicates the number of windows
                   1135: open on the file (see the next section):
                   1136: .CW -
                   1137: means none,
                   1138: .CW +
                   1139: means one, and
                   1140: .CW *
                   1141: means more than one.
                   1142: Finally, the period indicates that this is the current file.
                   1143: These characters are useful for controlling the
                   1144: .CW X
                   1145: command, described shortly.
                   1146: .PP
                   1147: .I Sam
                   1148: may be started with a set of disc files (such as all the source for
                   1149: a program) by invoking it with a list of file names as arguments, and
                   1150: more may be added or deleted on demand.
                   1151: .P1 2n
                   1152: B discfile1 discfile2 ...
                   1153: .P2
                   1154: adds the named files to
                   1155: .I sam 's
                   1156: list, and
                   1157: .P1 2n
                   1158: D discfile1 discfile2 ...
                   1159: .P2
                   1160: removes them from
                   1161: .I sam 's
                   1162: memory (without effect on associated disc files).
                   1163: Both these commands have a syntax for using the shell|reference(kernighan pike)
                   1164: (the
                   1165: .UX
                   1166: command interpreter) to generate the lists:
                   1167: .P1 2n
                   1168: B <echo *.c
                   1169: .P2
                   1170: will add all C source files, and
                   1171: .P1 2n
                   1172: B <grep -l variable *.c
                   1173: .P2
                   1174: will add all C source files referencing a particular variable
                   1175: (the
                   1176: .UX
                   1177: command
                   1178: .CW grep\ -l
                   1179: lists all files in its arguments that contain matches of
                   1180: the specified regular expression).
                   1181: Finally,
                   1182: .CW D
                   1183: without arguments deletes the current file.
                   1184: .PP
                   1185: There are two ways to change which file is current:
                   1186: .P1 2n
                   1187: b filename
                   1188: .P2
                   1189: makes the named file current.
                   1190: The
                   1191: .CW B
                   1192: command
                   1193: does the same, but also adds any new files to
                   1194: .I sam 's
                   1195: list.
                   1196: (In practice, of course, the current file
                   1197: is usually chosen by mouse actions, not by textual commands.)
                   1198: The other way is to use a form of address that refers to files:
                   1199: .P1 2n
                   1200: "\f2expression\fP" \f2address\fP
                   1201: .P2
                   1202: refers to the address evaluated in the file whose menu line
                   1203: matches the expression (there must be exactly one match).
                   1204: For example,
                   1205: .P1 2n
                   1206: "peter.c" 3
                   1207: .P2
                   1208: refers to the third line of the file whose name matches
                   1209: .CW peter.c .
                   1210: This is most useful in the move
                   1211: .CW m ) (
                   1212: and copy
                   1213: .CW t ) (
                   1214: commands:
                   1215: .P1 2n
                   1216: 0,$ t "peter.c" 0
                   1217: .P2
                   1218: makes a copy of the current file at the beginning of
                   1219: .CW peter.c .
                   1220: .PP
                   1221: The
                   1222: .CW X
                   1223: command
                   1224: is a looping construct, like
                   1225: .CW x ,
                   1226: that refers to files instead of strings:
                   1227: .P1 2n
                   1228: X/\f2expression\fP/ \f2command\fP
                   1229: .P2
                   1230: runs the command in all
                   1231: files whose menu lines match the expression.  The best example is
                   1232: .P1 2n
                   1233: X/'/ w
                   1234: .P2
                   1235: which writes to disc all modified files.
                   1236: .CW Y
                   1237: is the complement of
                   1238: .CW X :
                   1239: it runs the command on all files whose menu lines don't match the expression:
                   1240: .P1 2n
                   1241: Y/\e.c/ D
                   1242: .P2
                   1243: deletes all files that don't have
                   1244: .CW \&.c
                   1245: in their names, that is, it keeps all C source files and deletes the rest.
                   1246: .PP
                   1247: Braces allow commands to be grouped, so
                   1248: .P1 2n
                   1249: {
                   1250:        \f2command1\fP
                   1251:        \f2command2\fP
                   1252: }
                   1253: .P2
                   1254: is syntactically a single command that runs two commands.
                   1255: Thus,
                   1256: .P1 2n
                   1257: X/\e.c/ ,g/variable/ {
                   1258:        f
                   1259:        , x/.*\en/ g/variable/ p
                   1260: }
                   1261: .P2
                   1262: finds all occurrences of
                   1263: .CW variable
                   1264: in C source files, and prints
                   1265: out the file names and lines of each match.
                   1266: The precise semantics of compound operations is discussed in the implementation
                   1267: sections below.
                   1268: .PP
                   1269: Finally,
                   1270: the undo command,
                   1271: .CW u ,
                   1272: undoes the last command,
                   1273: no matter how many files were affected.
                   1274: Multiple undo operations move further back in time, so
                   1275: .P1 2n
                   1276: u
                   1277: u
                   1278: .P2
                   1279: (which may be abbreviated
                   1280: .CW u2 )
                   1281: undoes the last two commands.  An undo may not be undone, however, nor
                   1282: may any command that adds or deletes files.
                   1283: Everything else is undoable, though, including for example
                   1284: .CW e
                   1285: commands:
                   1286: .P1 2n
                   1287: e filename
                   1288: u
                   1289: .P2
                   1290: restores the state of the file completely, including its name, dot,
                   1291: and modified bit.  Because of the undo, potentially dangerous commands
                   1292: are not guarded by confirmations.  Only
                   1293: .CW D ,
                   1294: which destroys the information necessary to restore itself, is protected.
                   1295: It will not delete a modified file, but a second
                   1296: .CW D
                   1297: of the same file will succeed regardless.
                   1298: The
                   1299: .CW q
                   1300: command, which exits
                   1301: .I sam ,
                   1302: is similarly guarded.
                   1303: .NH 2
                   1304: Mouse Interface
                   1305: .LP
                   1306: .I Sam
                   1307: is most commonly run
                   1308: connected to a bitmap display and mouse for interactive editing.
                   1309: The only difference in the command language
                   1310: between regular, mouse-driven
                   1311: .I sam
                   1312: and
                   1313: .CW sam\ -d
                   1314: is that if an address
                   1315: is provided without a command,
                   1316: .CW sam\ -d
                   1317: will print the text referenced by the address, but
                   1318: regular
                   1319: .I sam
                   1320: will highlight it on the screen \(em in fact,
                   1321: dot is always highlighted (see Figure 2).
                   1322: .WS 1
                   1323: .1C
                   1324: .KF
                   1325: .IB fig2.ps 1.47i 4.61i c
                   1326: .Cs
                   1327: Figure 2. A
                   1328: .I sam
                   1329: window.  The scroll bar down the left
                   1330: represents the file, with the bubble showing the fraction
                   1331: visible in the window.
                   1332: The scroll bar may be manipulated by the mouse for convenient browsing.
                   1333: The current text,
                   1334: which is highlighted, need not fit on a line.  Here it consists of one partial
                   1335: line, one complete line, and a final partial line.
                   1336: .Ce
                   1337: .KE
                   1338: .2C
                   1339: .PP
                   1340: Each file may have zero or more windows open on the display.
                   1341: At any time, only one window in all of
                   1342: .I sam
                   1343: is the
                   1344: .I
                   1345: current window,
                   1346: .R
                   1347: that is, the window to which typing and mouse actions refer;
                   1348: this may be the
                   1349: .I sam
                   1350: window (that in which commands may be typed)
                   1351: or one of the file windows.
                   1352: When a file has multiple windows, the image of the file in each window
                   1353: is always kept up to date.
                   1354: The current file is the last file affected by a command,
                   1355: so if the
                   1356: .I sam
                   1357: window is current,
                   1358: the current window is not a window on the current file.
                   1359: However, each window on a file has its own value of dot,
                   1360: and when switching between windows on a single file,
                   1361: the file's value of dot is changed to that of the window.
                   1362: Thus, flipping between windows behaves in the obvious, convenient way.
                   1363: .PP
                   1364: The mouse on the Blit has three buttons, numbered left to right.
                   1365: Button 3 has a list of commands to manipulate windows,
                   1366: followed by a list of `menu lines' exactly as printed by the
                   1367: .CW f
                   1368: command, one per file (not one per window).
                   1369: These menu lines are sorted by file name.
                   1370: If the list is long, the Blit menu software will make it more manageable
                   1371: by generating a scrolling menu instead of an unwieldy long list.
                   1372: Using the menu to select a file from the list makes that file the current
                   1373: file, and the most recently current window in that file the current window.
                   1374: But if that file is already current, selecting it in the menu cycles through
                   1375: the windows on the file; this simple trick avoids a special menu to
                   1376: choose windows on a file.
                   1377: If there is no window open on the file,
                   1378: .I sam
                   1379: changes the mouse cursor to prompt the user to create one.
                   1380: .PP
                   1381: The commands on the button 3 menu are straightforward (see Figure 3), and
                   1382: are like the commands to manipulate windows in
                   1383: .I mux ,
                   1384: |reference(latest volume1)
                   1385: the Blit's window system.
                   1386: .CW new
                   1387: makes a new file, and gives it one empty window, whose size is determined
                   1388: by a rectangle swept by the mouse.
                   1389: .CW xerox
                   1390: prompts for a window to be selected, and
                   1391: makes a clone of that window; this is how multiple windows are created on one file.
                   1392: .CW reshape
                   1393: changes the size of the indicated window, and
                   1394: .CW close
                   1395: deletes it.  If that is the last window open on the file,
                   1396: .CW close
                   1397: first does a
                   1398: .CW D
                   1399: command on the file.
                   1400: .CW write
                   1401: is identical to a
                   1402: .CW w
                   1403: command on the file; it is in the menu purely for convenience.
                   1404: Finally,
                   1405: .CW ~~sam~~
                   1406: is a menu item that appears between the commands and the file names.
                   1407: Selecting it makes the
                   1408: .I sam
                   1409: window the current window,
                   1410: causing subsequent typing to be interpreted as commands.
                   1411: .KF
                   1412: .IB fig3.ps 1.98i 1.39i c
                   1413: .Cs
                   1414: Figure 3. The menu on button 3.
                   1415: The black rectangle on the left is a scroll bar; the menu is limited to
                   1416: the length shown to prevent its becoming unwieldy.
                   1417: Above the
                   1418: .CW ~~sam~~
                   1419: line is a list of commands;
                   1420: beneath it is a list of files, presented exactly as with the
                   1421: .CW f
                   1422: command.
                   1423: .Ce
                   1424: .KE
                   1425: .PP
                   1426: When
                   1427: .I sam
                   1428: requests that a window be swept, in response to
                   1429: .CW new ,
                   1430: .CW xerox
                   1431: or
                   1432: .CW reshape ,
                   1433: it changes the mouse cursor from the usual arrow to a box with
                   1434: a small arrow.
                   1435: In this state, the mouse may be used to indicate an arbitrary rectangle by
                   1436: pressing button 3 at one corner and releasing it at the opposite corner.
                   1437: More conveniently,
                   1438: button 3 may simply be clicked,
                   1439: whereupon
                   1440: .I sam
                   1441: creates the maximal rectangle that contains the cursor
                   1442: and abuts the
                   1443: .I sam
                   1444: window.
                   1445: By placing the
                   1446: .I sam
                   1447: window in the middle of the screen, the user can define two regions (one above,
                   1448: one below) in which stacked fully-overlapping
                   1449: windows can be created with minimal fuss (see Figure 1).
                   1450: This simple user interface trick makes window creation noticeably easier.
                   1451: .PP
                   1452: The cut-and-paste editor is essentially the same as that in Smalltalk-80.
                   1453: |reference(smalltalk80 goldberg)
                   1454: The text in dot is always highlighted on the screen.
                   1455: When a character is typed it replaces dot, and sets dot to the null
                   1456: string after the character.  Thus, ordinary typing inserts text.
                   1457: Button 1 is used for selection:
                   1458: pressing the button, moving the mouse, and lifting the button
                   1459: selects (sets dot to) the text between the points where the
                   1460: button was pressed and released.
                   1461: Pressing and releasing at the same point selects a null string; this
                   1462: is called clicking.  Clicking twice quickly, or
                   1463: .I
                   1464: double clicking,
                   1465: .R
                   1466: selects larger objects;
                   1467: for example, double clicking in a word selects the word,
                   1468: double clicking just inside an opening bracket selects the text
                   1469: contained in the brackets (handling nested brackets correctly),
                   1470: and similarly for
                   1471: parentheses, quotes, and so on.
                   1472: The double-clicking rules reflect a bias toward
                   1473: programmers.
                   1474: If
                   1475: .I sam
                   1476: were intended more for word processing, double-clicks would probably
                   1477: select linguistic structures such as sentences.
                   1478: .PP
                   1479: If button 1 is pressed outside the current window, it makes the indicated
                   1480: window current.
                   1481: This is the easiest way to switch between windows and files.
                   1482: .PP
                   1483: Pressing button 2 brings up a menu of editing functions (see Figure 4).
                   1484: These mostly apply to the selected text:
                   1485: .CW cut
                   1486: deletes the selected text, and remembers it in a hidden buffer called the
                   1487: .I
                   1488: snarf buffer,
                   1489: .R
                   1490: .CW paste
                   1491: replaces the selected text by the contents of the snarf buffer,
                   1492: .CW snarf
                   1493: just copies the selected text to the snarf buffer,
                   1494: .CW look
                   1495: searches forward for the next literal occurrence of the selected text, and
                   1496: .CW <mux>
                   1497: exchanges snarf buffers with the window system in which
                   1498: .I sam
                   1499: is running.
                   1500: Finally, the last regular expression used appears as a menu entry
                   1501: to search
                   1502: forward for the next occurrence of a match for the expression.
                   1503: .WS 1
                   1504: .KF
                   1505: .IB fig4.ps 0.87i 0.81i c
                   1506: .Cs
                   1507: Figure 4. The menu on button 2.
                   1508: The bottom entry tracks the most recently used regular expression, which may
                   1509: be literal text.
                   1510: .Ce
                   1511: .KE
                   1512: .PP
                   1513: The relationship between the command language and the mouse language is
                   1514: entirely due to the equality of dot and the selected text chosen
                   1515: with button 1 on the mouse.
                   1516: For example, to make a set of changes in a C subroutine, dot can be
                   1517: set by double clicking on the left brace that begins the subroutine,
                   1518: which sets dot for the command language.
                   1519: An address-free command then typed in the
                   1520: .I sam
                   1521: window will apply only to the text between the opening and closing
                   1522: braces of the function.
                   1523: The idea is to select what you want, and then say what you want
                   1524: to do with it, whether invoked by a menu selection or by a typed command.
                   1525: And of course, the value of dot is highlighted on
                   1526: the display after the command completes.
                   1527: This relationship between mouse interface and command language
                   1528: is clumsy to explain, but comfortable, even natural, in practice.
                   1529: .NH
                   1530: The Implementation
                   1531: .LP
                   1532: The next few sections describe how
                   1533: .I sam
                   1534: is put together, first the host part,
                   1535: then the inter-component communication,
                   1536: then the terminal part.
                   1537: After explaining how the command language is implemented,
                   1538: the discussion follows (roughly) the path of a character
                   1539: from the temporary file on disc to the screen.
                   1540: The presentation centers on the data structures,
                   1541: because that is how the program was designed and because
                   1542: the algorithms are easy to provide, given the right data
                   1543: structures.
                   1544: ........
                   1545: .NH 2
                   1546: Parsing and execution
                   1547: .LP
                   1548: The command language is interpreted by parsing each command with a
                   1549: table-driven recursive
                   1550: descent parser, and when a complete command is assembled, invoking a top-down
                   1551: executor.
                   1552: Most editors instead employ a simple character-at-a-time
                   1553: lexical scanner.
                   1554: Use of a parser makes it
                   1555: easy and unambiguous to detect when a command is complete,
                   1556: which has two advantages.
                   1557: First, escape conventions such as backslashes to quote
                   1558: multiple-line commands are unnecessary;  if the command isn't finished,
                   1559: the parser keeps reading.  For example, a multiple-line append driven by an
                   1560: .CW x
                   1561: command is straightforward:
                   1562: .P1 2n
                   1563: x/.*\en/ g/Peter/ a
                   1564: one line about Peter
                   1565: another line about Peter
                   1566: \&.
                   1567: .P2
                   1568: Other
                   1569: .UX
                   1570: editors would require a backslash after all but the last line.
                   1571: .PP
                   1572: The other advantage is specific to the two-process structure of
                   1573: .I sam .
                   1574: The host process must decide when a command is completed so the
                   1575: command interpreter can be called.  This problem is easily resolved
                   1576: by having the lexical analyzer read the single stream of events from the
                   1577: terminal, directly executing all typing and mouse commands,
                   1578: but passing to the parser characters typed to the
                   1579: .I sam
                   1580: command window.
                   1581: This scheme is slightly complicated by the availability of cut-and-paste
                   1582: editing in the
                   1583: .I sam
                   1584: window, but that difficulty is resolved by applying the rules
                   1585: used in
                   1586: .I mux :
                   1587: when a newline is typed to the
                   1588: .I sam
                   1589: window, all text between the newline and the previously typed newline
                   1590: is made available to the parser.
                   1591: This permits arbitrary editing to be done to a command before
                   1592: typing newline and thereby requesting execution.
                   1593: .PP
                   1594: The parser is driven by a table because the syntax of addresses
                   1595: and commands is regular enough
                   1596: to be encoded compactly.  There are few special cases, such as the
                   1597: replacement text in a substitution, so the syntax of almost all commands
                   1598: can be encoded with a few flags.
                   1599: These include whether the command allows an address (for example,
                   1600: .CW e
                   1601: does not), whether it takes a regular expression (as in
                   1602: .CW x
                   1603: and
                   1604: .CW s ),
                   1605: whether it takes replacement text (as in
                   1606: .CW c
                   1607: or
                   1608: .CW i ),
                   1609: which may be multi-line, and so on.
                   1610: The internal syntax of regular expressions is handled by a separate
                   1611: parser; a regular expression is a leaf of the command parse tree.
                   1612: Regular expressions are discussed fully in the next section.
                   1613: .PP
                   1614: The parser table also has information about defaults, so the interpreter
                   1615: is always called with a complete tree.  For example, the parser fills in
                   1616: the implicit
                   1617: .CW 0
                   1618: and
                   1619: .CW $
                   1620: in the abbreviated address
                   1621: .CW ,
                   1622: (comma),
                   1623: inserts a
                   1624: .CW +
                   1625: to the left of an unadorned regular expression in an address,
                   1626: and provides the usual default address
                   1627: .CW .
                   1628: (dot) for commands that expect an address but are not given one.
                   1629: .PP
                   1630: Once a complete command is parsed, the evaluation is easy.
                   1631: The address is evaluated left-to-right starting from the value of dot,
                   1632: with a mostly ordinary expression evaluator.
                   1633: Addresses, like many of the data structures in
                   1634: .I sam ,
                   1635: are held in a C structure and passed around by value:
                   1636: .P1 2n
                   1637: 
                   1638: typedef long Posn;     /* file position */
                   1639: typedef struct Range{
                   1640:        Posn p1, p2;
                   1641: }Range;
                   1642: typedef struct Address{
                   1643:        Range r;
                   1644:        File *f;
                   1645: }Address;
                   1646: .P2
                   1647: .1C
                   1648: .KF top
                   1649: .SP 1
                   1650: .P1 10n
                   1651: Address
                   1652: address(ap, a, sign)
                   1653:        Addrtree *ap;
                   1654:        Address a;
                   1655:        int sign;
                   1656: {
                   1657:        Address a2;
                   1658:        do
                   1659:                switch(ap->type){
                   1660:                case '.':
                   1661:                        a=a.f->dot;
                   1662:                        break;
                   1663:                case '$':
                   1664:                        a.r.p1=a.r.p2=a.f->nbytes;
                   1665:                        break;
                   1666:                case '"':       
                   1667:                        a=matchfile(a, ap->aregexp)->dot; 
                   1668:                        break;
                   1669:                case ',':
                   1670:                        a2=address(ap->right, a, 0);
                   1671:                        a=address(ap->left, a, 0);
                   1672:                        if(a.f!=a2.f || a2.r.p2<a.r.p1)
                   1673:                                error(Eorder);
                   1674:                        a.r.p2=a2.r.p2;
                   1675:                        return a;
                   1676:                /* and so on */
                   1677:                }
                   1678:        while((ap=ap->right)!=0);
                   1679:        return a;
                   1680: }
                   1681: .P2
                   1682: .ce
                   1683: \fBDisplay 1.\fP The address interpreter
                   1684: .KE
                   1685: .2C
                   1686: An address is encoded as a substring (character positions
                   1687: .CW p1
                   1688: to
                   1689: .CW p2 )
                   1690: in a file
                   1691: .CW f .
                   1692: (The data type
                   1693: .CW File
                   1694: is described in detail below.)
                   1695: .PP
                   1696: The address interpreter (shown in Display 1) is an
                   1697: .CW Address -valued
                   1698: function that traverses the parse tree describing an address (the
                   1699: parse tree for the address has type
                   1700: .CW Addrtree ).
                   1701: .PP
                   1702: Throughout, errors are handled by a non-local
                   1703: .CW goto
                   1704: (a
                   1705: .CW setjmp/longjmp
                   1706: in C terminology)
                   1707: hidden in a routine called
                   1708: .CW error
                   1709: that immediately aborts the execution, retracts any
                   1710: partially made changes (see the section below on `undoing'), and
                   1711: returns to the top level of the parser.
                   1712: The argument to
                   1713: .CW error
                   1714: is an enumeration type that
                   1715: is translated to a terse but possibly helpful
                   1716: message such as
                   1717: .CW "?addresses out of order" .
                   1718: Very common messages are kept short; for example the message for
                   1719: a failed regular expression search is
                   1720: .CW ?search .
                   1721: .PP
                   1722: Character addresses such as
                   1723: .CW #3
                   1724: are trivial to implement, as the
                   1725: .CW File
                   1726: data structure is accessible by character number.
                   1727: However,
                   1728: .I sam
                   1729: keeps no information about the position of newlines \(em it is too
                   1730: expensive to track dynamically \(em so line addresses are computed by reading
                   1731: the file, counting newlines.  Except in very large files, this has proven
                   1732: acceptable: file access is fast enough to make the technique practical,
                   1733: and lines are not central to the structure of the command language.
                   1734: .PP
                   1735: The command interpreter, called
                   1736: .CW cmdexec ,
                   1737: is also straightforward.  The parse table includes a
                   1738: function to call to interpret a particular command.  That function
                   1739: receives as arguments
                   1740: the calculated address
                   1741: for the command
                   1742: and the command tree (of type
                   1743: .CW Cmdtree ),
                   1744: which may contain information such as the subtree for compound commands.
                   1745: For example, the function for the
                   1746: .CW g
                   1747: and
                   1748: .CW v
                   1749: commands is shown in Display 2.
                   1750: .CW Compile "" (
                   1751: and
                   1752: .CW execute
                   1753: are part of the regular expression code, described in the next section.)
                   1754: Because the parser and the
                   1755: .CW File
                   1756: data structure do most of the work, most commands
                   1757: are similarly brief.
                   1758: .1C
                   1759: .KF bottom
                   1760: .P1 10n
                   1761: int
                   1762: g_cmd(a, cp)
                   1763:        Address a;
                   1764:        Cmdtree *cp;
                   1765: {
                   1766:        compile(cp->regexp);
                   1767:        if(execute(a.f, a.r.p1, a.r.p2) != (cp->cmdchar=='v')){
                   1768:                a.f->dot=a;
                   1769:                return cmdexec(a, cp->subcmd);
                   1770:        }
                   1771:        return TRUE;    /* indicate that execution is to continue */
                   1772: }
                   1773: .P2
                   1774: .ce
                   1775: \fBDisplay 2. \f(CWg\fR and \f(CWv\fP commands
                   1776: .SP
                   1777: .KE
                   1778: .2C
                   1779: .NH 2
                   1780: Regular expressions
                   1781: .LP
                   1782: The regular expression code in
                   1783: .I sam
                   1784: is an interpreted, rather than compiled on-the-fly, implementation of Thompson's
                   1785: non-deterministic finite automaton algorithm.|reference(thompson regular expression)
                   1786: The syntax and semantics of the expressions are as in the
                   1787: .UX
                   1788: program
                   1789: .I egrep ,
                   1790: including alternation, closures, character classes, and so on.
                   1791: The only changes in the notation are two additions:
                   1792: .CW \en
                   1793: is translated to, and matches, a newline character, and
                   1794: .CW @
                   1795: matches any character.  In
                   1796: .I egrep ,
                   1797: the character
                   1798: .CW \&.
                   1799: matches any character except newline, and in
                   1800: .I sam
                   1801: the same rule seemed safest, to prevent idioms like
                   1802: .CW \&.*
                   1803: from spanning newlines.
                   1804: .I Egrep
                   1805: expressions are arguably too complicated for an interactive editor \(em
                   1806: certainly it would make sense if all the special characters were two-character
                   1807: sequences, so that most of the punctuation characters wouldn't have
                   1808: peculiar meanings \(em but for an interesting command language, full
                   1809: regular expressions are necessary, and
                   1810: .I egrep
                   1811: defines the full regular expression syntax for
                   1812: .UX
                   1813: programs.
                   1814: Also, it seemed superfluous to define a new syntax, since various
                   1815: .UX
                   1816: programs
                   1817: .I ed , (
                   1818: .I egrep
                   1819: and
                   1820: .I vi )
                   1821: define too many already.
                   1822: .PP
                   1823: The expressions are compiled by a routine,
                   1824: .CW compile ,
                   1825: that generates the description of the non-deterministic finite state machine.
                   1826: A second routine,
                   1827: .CW execute ,
                   1828: interprets the machine to generate the leftmost-longest match of the
                   1829: expression in a substring of the file.
                   1830: The algorithm is described elsewhere.|reference(thompson regular expression)|reference(aho hopcroft ullman)
                   1831: .CW Execute
                   1832: reports
                   1833: whether a match was found, and sets a global variable,
                   1834: of type
                   1835: .CW Range ,
                   1836: to the substring matched.
                   1837: .PP
                   1838: A trick is required to evaluate the expression in reverse, such as when
                   1839: searching backwards for an expression.
                   1840: For example,
                   1841: .P1 2n
                   1842: -/P.*r/
                   1843: .P2
                   1844: looks backwards through the file for a match of the expression.
                   1845: The expression, however, is defined for a forward search.
                   1846: The solution is to construct a machine identical to the machine
                   1847: for a forward search except for a reversal of all the concatenation
                   1848: operators (the other operators are symmetric under direction reversal),
                   1849: to exchange the meaning of the operators
                   1850: .CW ^
                   1851: and
                   1852: .CW $ ,
                   1853: and then to read the file backwards, looking for the
                   1854: usual earliest longest match.
                   1855: .PP
                   1856: .CW Execute
                   1857: generates only one match each time it is called.
                   1858: To interpret looping constructs such as the
                   1859: .CW x
                   1860: command,
                   1861: .I sam
                   1862: must therefore synchronize between
                   1863: calls of
                   1864: .CW execute
                   1865: to avoid
                   1866: problems with null matches.
                   1867: For example, even given the leftmost-longest rule,
                   1868: the expression
                   1869: .CW a*
                   1870: matches three times in the string
                   1871: .CW ab
                   1872: (the character
                   1873: .CW a ,
                   1874: the null string between the
                   1875: .CW a
                   1876: and
                   1877: .CW b ,
                   1878: and the final null string).
                   1879: After returning a match for the
                   1880: .CW a ,
                   1881: .I sam
                   1882: must not match the null string before the
                   1883: .CW b .
                   1884: The algorithm starts
                   1885: .CW execute
                   1886: at the end of its previous match, and
                   1887: if the match it returns
                   1888: is null and abuts the previous match, rejects the match and advances
                   1889: the initial position one character.
                   1890: .NH 2
                   1891: Memory allocation
                   1892: .LP
                   1893: The C language has no memory allocation primitives, although a standard
                   1894: library routine,
                   1895: .I malloc ,
                   1896: provides adequate service for simple programs.
                   1897: For specific uses, however,
                   1898: it can be better to write a custom allocator.
                   1899: The allocator (or rather, pair of allocators) described here
                   1900: work in both the terminal and host parts of
                   1901: .I sam .
                   1902: They are designed for efficient manipulation of strings,
                   1903: which are allocated and freed frequently and vary in length from essentially
                   1904: zero to 32 Kbytes (very large strings are written to disc).
                   1905: More important, strings may be large and change size often,
                   1906: so to minimize memory usage it is helpful to reclaim and to coalesce the
                   1907: unused portions of strings when they are truncated.
                   1908: .PP
                   1909: Objects to be allocated in
                   1910: .I sam
                   1911: are of two flavors:
                   1912: the first is C
                   1913: .CW structs ,
                   1914: which are small and often addressed by pointer variables;
                   1915: the second is variable-sized arrays of characters
                   1916: or integers whose
                   1917: base pointer is always used to access them.
                   1918: The memory allocator in
                   1919: .I sam
                   1920: is therefore in two parts:
                   1921: first, a traditional first-fit allocator that provides fixed storage for
                   1922: .CW structs ;
                   1923: and second, a garbage-compacting allocator that reduces storage
                   1924: overhead for variable-sized objects, at the cost of some bookkeeping.
                   1925: The two types of objects are allocated from adjoining arenas, with
                   1926: the garbage-compacting allocator controlling the arena with higher addresses.
                   1927: Separating into two arenas simplifies compaction and prevents fragmentation due
                   1928: to immovable objects.
                   1929: The access rules for garbage-compactable objects
                   1930: (discussed in the next paragraph) allow them to be relocated, so when
                   1931: the first-fit arena needs space, it moves the garbage-compacted arena
                   1932: to higher addresses to make room.  Storage is therefore created only
                   1933: at successively higher addresses, either when more garbage-compacted
                   1934: space is needed or when the first-fit arena pushes up the other arena.
                   1935: .PP
                   1936: Objects that may be compacted declare to the
                   1937: allocator a cell that is guaranteed to be the sole repository of the
                   1938: address of the object whenever a compaction can occur.
                   1939: The compactor can then update the address when the object is moved.
                   1940: For example, the implementation of type
                   1941: .CW List
                   1942: (really a variable-length array)
                   1943: is:
                   1944: .P1 2n
                   1945: typedef struct List{
                   1946:        int     nused;
                   1947:        long    *ptr;
                   1948: }List;
                   1949: .P2
                   1950: The
                   1951: .CW ptr
                   1952: cell must always be used directly, and never copied.  When a
                   1953: .CW List
                   1954: is to be created the
                   1955: .CW List
                   1956: structure is allocated in the ordinary first-fit arena
                   1957: and its
                   1958: .CW ptr
                   1959: is allocated in the garbage-compacted arena.
                   1960: A similar data type for strings, called
                   1961: .CW String ,
                   1962: stores variable-length character arrays of up to 32767 elements.
                   1963: .PP
                   1964: A related matter of programming style:
                   1965: .I sam
                   1966: frequently passes structures by value, which
                   1967: simplifies the code.
                   1968: Traditionally, C programs have
                   1969: passed structures by reference, but implicit allocation on
                   1970: the stack is easier to use.
                   1971: Structure passing is a relatively new feature of C
                   1972: (it is not in the 
                   1973: standard reference manual for C|reference(cbook)),
                   1974: and is poorly supported in most
                   1975: commercial C compilers.
                   1976: It's convenient and expressive, though,
                   1977: and simplifies memory management by
                   1978: avoiding the allocator altogether
                   1979: and eliminating pointer aliases.
                   1980: .....
                   1981: .NH 2
                   1982: Data structures for manipulating files
                   1983: .LP
                   1984: Experience with
                   1985: .I jim
                   1986: showed that the requirements
                   1987: of the file data structure were few, but strict.
                   1988: First, files need to be read and written quickly;
                   1989: adding a fresh file must be painless.
                   1990: Second, the implementation must place no arbitrary upper limit on
                   1991: the number or sizes of files.  (It should be practical to edit many files,
                   1992: and files up to megabytes in length should be handled gracefully.)
                   1993: This implies that files be stored on disc, not in main memory.
                   1994: (Aficionados of virtual memory may argue otherwise, but the
                   1995: implementation of virtual
                   1996: memory in our system is not something to depend on
                   1997: for good performance.)
                   1998: Third, changes to files need be made by only two primitives:
                   1999: deletion and insertion.
                   2000: These are inverses of each other,
                   2001: which simplifies the implementation of the undo operation.
                   2002: Finally,
                   2003: it must be easy and efficient to access the file, either
                   2004: forwards or backwards, a byte at a time.
                   2005: .PP
                   2006: The
                   2007: .CW File
                   2008: data type is constructed from three simpler data structures that hold arrays
                   2009: of characters.
                   2010: Each of these types has an insertion and deletion operator, and the
                   2011: insertion and deletion operators of the
                   2012: .CW File
                   2013: type itself are constructed from them.
                   2014: .PP
                   2015: The simplest type is the
                   2016: .CW String ,
                   2017: which is used to hold strings in main memory.
                   2018: The code that manages
                   2019: .CW Strings
                   2020: guarantees that they will never be longer
                   2021: than some moderate size, and in practice they are rarely larger than 8 Kbytes.
                   2022: .CW Strings
                   2023: have two purposes: they hold short strings like file names with little overhead,
                   2024: and because they are deliberately small, they are efficient to modify.
                   2025: They are therefore used as the data structure for in-memory caches.
                   2026: .PP
                   2027: The disc copy of the file is managed by a data structure called a
                   2028: .CW Disc ,
                   2029: which corresponds to a temporary file.  A
                   2030: .CW Disc
                   2031: has no storage in main memory other than bookkeeping information;
                   2032: the actual data being held is all on the disc.
                   2033: To reduce the number of open files needed,
                   2034: .I sam
                   2035: opens a dozen temporary
                   2036: .UX
                   2037: files and multiplexes the
                   2038: .CW Discs
                   2039: upon them.
                   2040: This permits many files to
                   2041: be edited; the entire
                   2042: .I sam
                   2043: source (48 files) may be edited comfortably with a single
                   2044: instance of
                   2045: .I sam .
                   2046: Allocating one temporary file per
                   2047: .CW Disc
                   2048: would strain the operating system's limit on the number of open files.
                   2049: Also, spreading the traffic among temporary files keeps the files shorter,
                   2050: and shorter files are more efficiently implemented by the
                   2051: .UX
                   2052: I/O subsystem.
                   2053: .PP
                   2054: A
                   2055: .CW Disc
                   2056: is an array of fixed-length blocks, each of which contains
                   2057: between 1 and 4096 characters of active data.
                   2058: (The block size of our
                   2059: .UX
                   2060: file system is 4096 bytes.)
                   2061: The block addresses within the temporary file and the length of each
                   2062: block are stored in a
                   2063: .CW List .
                   2064: When changes are made the live part of blocks may change size.
                   2065: Blocks are created and coalesced when necessary to try to keep the sizes
                   2066: between 2048 and 4096 bytes.
                   2067: An actively changing part of the
                   2068: .CW Disc
                   2069: therefore typically has about a kilobyte of slop that can be
                   2070: inserted or deleted
                   2071: without changing more than one block or affecting the block order.
                   2072: When an insertion would overflow a block, the block is split, a new one
                   2073: is allocated to receive the overflow, and the memory-resident list of blocks
                   2074: is rearranged to reflect the insertion of the new block.
                   2075: .PP
                   2076: Obviously, going to the disc for every modification to the file is
                   2077: prohibitively expensive.
                   2078: The data type
                   2079: .CW Buffer
                   2080: consists of a
                   2081: .CW Disc
                   2082: to hold the data and a
                   2083: .CW String
                   2084: that acts as a cache.
                   2085: This is the first of a series of caches throughout the data structures in
                   2086: .I sam.
                   2087: The caches not only improve performance, they provide a way to organize
                   2088: the flow of data, particularly in the communication between the host
                   2089: and terminal.
                   2090: This idea is developed below, in the section on communications.
                   2091: .PP
                   2092: To reduce disc traffic, changes to a
                   2093: .CW Buffer
                   2094: are mediated by a variable-length string, in memory, that acts as a cache.
                   2095: When an insertion or deletion is made to a
                   2096: .CW Buffer ,
                   2097: if the change can be accommodated by the cache, it is done there.
                   2098: If the cache becomes bigger than a block because of an insertion,
                   2099: some of it is written to the
                   2100: .CW Disc
                   2101: and deleted from the cache.
                   2102: If the change does not intersect the cache, the cache is flushed.
                   2103: The cache is only loaded at the new position if the change is smaller than a block;
                   2104: otherwise, it is sent directly to the
                   2105: .CW Disc .
                   2106: This is because
                   2107: large changes are typically sequential,
                   2108: whereupon the next change is unlikely to overlap the current one.
                   2109: .PP
                   2110: A
                   2111: .CW File
                   2112: comprises a
                   2113: .CW String
                   2114: to hold the file name and some ancillary data such as dot and the modified bit.
                   2115: The most important components, though, are a pair of
                   2116: .CW Buffers ,
                   2117: one called the transcript and the other the contents.
                   2118: Their use is described in the next section.
                   2119: .PP
                   2120: The overall structure is shown in Figure 5.
                   2121: Although it may seem that the data is touched many times on its
                   2122: way from the
                   2123: .CW Disc ,
                   2124: it is read (by one
                   2125: .UX
                   2126: system call) directly into the cache of the
                   2127: associated
                   2128: .CW Buffer ;
                   2129: no extra copy is done.
                   2130: Similarly, when flushing the cache, the text is written
                   2131: directly from the cache to disc.
                   2132: Most operations act directly on the text in the cache.
                   2133: A principle applied throughout
                   2134: .I sam
                   2135: is that the fewer times the data is copied, the faster the program will run
                   2136: (see also the paper by Waite|reference(waite lexical analysis)).
                   2137: .1C
                   2138: .KF
                   2139: .PS
                   2140: copy "fig5.pic"
                   2141: .PE
                   2142: .Cs
                   2143: Figure 5. File data structures.
                   2144: The temporary files are stored in the standard repository for such files
                   2145: on the host system.
                   2146: .Ce
                   2147: .KE
                   2148: .2C
                   2149: .PP
                   2150: The contents of a
                   2151: .CW File
                   2152: are accessed by a routine that
                   2153: copies to a buffer a substring of a file starting at a specified offset.
                   2154: To read a byte at a time, a
                   2155: .CW File "" per-
                   2156: array is loaded starting from a specified initial position,
                   2157: and bytes may then be read from the array.
                   2158: The implementation is done by a macro similar to the C standard I/O
                   2159: .I getc
                   2160: macro.|reference(cbook)
                   2161: Because the reading may be done at any address, a minor change to the
                   2162: macro allows the file to be read backwards.
                   2163: This array is read-only; there is no
                   2164: .I putc .
                   2165: .NH 2
                   2166: Doing and undoing
                   2167: .LP
                   2168: .I Sam
                   2169: has an unusual method for managing changes to files.
                   2170: The command language makes it easy to specify multiple variable-length changes
                   2171: to a file millions of bytes long, and such changes
                   2172: must be made efficiently if the editor is to be practical.
                   2173: The usual techniques for inserting and deleting strings
                   2174: are inadequate under these conditions.
                   2175: The
                   2176: .CW Buffer
                   2177: and
                   2178: .CW Disc
                   2179: data structures are designed for efficient random access to long strings,
                   2180: but care must be taken to avoid super-linear behavior when making
                   2181: many changes simultaneously.
                   2182: .PP
                   2183: .I Sam
                   2184: uses a two-pass algorithm for making changes, and treats each file as a database
                   2185: against which transactions are registered.
                   2186: Changes are not made directly to the contents.
                   2187: Instead, when a command is started, a `mark' containing
                   2188: a sequence number is placed in the transcript
                   2189: .CW Buffer ,
                   2190: and each change made to the file, either an insertion or deletion
                   2191: or a change to the file name,
                   2192: is appended to the end of the transcript.
                   2193: When the command is complete, the transcript is rewound to the
                   2194: mark and applied to the contents.
                   2195: .PP
                   2196: One reason for separating evaluation from
                   2197: application in this way is to simplify tracking the addresses of changes
                   2198: made in the middle of a long sequence.
                   2199: The two-pass algorithm also allows all changes to apply to the
                   2200: .I original
                   2201: data: no change can affect another change made in the same command.
                   2202: This is particularly important when evaluating an
                   2203: .CW x
                   2204: command because it prevents regular expression matches
                   2205: from stumbling over changes made earlier in the execution.
                   2206: Also, the two-pass
                   2207: algorithm is cleaner than the way other
                   2208: .UX
                   2209: editors allow changes to
                   2210: affect each other;
                   2211: for example,
                   2212: .I ed 's
                   2213: idioms to do things like delete every other line
                   2214: depend critically on the implementation.
                   2215: Instead,
                   2216: .I sam 's
                   2217: simple model, in which all changes in a command occur effectively
                   2218: simultaneously, is easy to explain and to understand.
                   2219: .PP
                   2220: The records in the transcript are of the form ``delete substring from
                   2221: locations
                   2222: 123 to 456'' and ``insert 11 characters `hello there' at location 789.''
                   2223: (It is an error if the changes are not at monotonically greater
                   2224: positions through the file.)
                   2225: While the update is occurring, these numbers must be
                   2226: offset by earlier changes, but that is straightforward and
                   2227: local to the update routine;
                   2228: moreover, all the numbers have been computed
                   2229: before the first is examined.
                   2230: .PP
                   2231: Treating the file as a transaction system has another advantage:
                   2232: undo is trivial.
                   2233: All it takes is to invert the transcript after it has been
                   2234: implemented, converting insertions
                   2235: into deletions and vice versa, and saving them in a holding
                   2236: .CW Buffer .
                   2237: The `do' transcript can then be deleted from
                   2238: the transcript
                   2239: .CW Buffer
                   2240: and replaced by the `undo' transcript.
                   2241: If an undo is requested, the transcript is rewound and the undo transcript
                   2242: executed.
                   2243: Because the transcript
                   2244: .CW Buffer
                   2245: is not truncated after each command, it accumulates
                   2246: successive changes.
                   2247: A sequence of undo commands
                   2248: can therefore back up the file arbitrarily,
                   2249: which is more helpful than the more commonly implemented self-inverse form of undo.
                   2250: .I Sam "" (
                   2251: provides no way to undo an undo, but if it were desired,
                   2252: it would be easy to provide by re-interpreting the `do' transcript.)
                   2253: Each mark in the transcript contains a sequence number and the offset into
                   2254: the transcript of the previous mark, to aid in unwinding the transcript.
                   2255: Marks also contain the value of dot and the modified bit so these can be
                   2256: restored easily.
                   2257: Undoing multiple files is easy; it merely demands undoing all files whose
                   2258: latest change has the same sequence number as the current file.
                   2259: .PP
                   2260: Another benefit of having a transcript is that errors encountered in the middle
                   2261: of a complicated command need not leave the files in an intermediate state.
                   2262: By rewinding the transcript to the mark beginning the command,
                   2263: the partial command can be trivially undone.
                   2264: .PP
                   2265: When the update algorithm was first implemented, it was unacceptably slow,
                   2266: so a cache was added to coalesce nearby changes,
                   2267: replacing multiple small changes by a single larger one.
                   2268: This reduced the number
                   2269: of insertions into the transaction
                   2270: .CW Buffer ,
                   2271: and made a dramatic improvement in performance,
                   2272: but made it impossible
                   2273: to handle changes in non-monotonic order in the file; the caching method
                   2274: only works if changes don't overlap.
                   2275: Before the cache was added, the transaction could in principle be sorted
                   2276: if the changes were out of order, although
                   2277: this was never done.
                   2278: The current status is therefore acceptable performance with a minor
                   2279: restriction on global changes, which is sometimes, but rarely, an annoyance.
                   2280: .PP
                   2281: The update algorithm obviously paws the data more than simpler
                   2282: algorithms, but it is not prohibitively expensive;
                   2283: the caches help.
                   2284: (The principle of avoiding copying the data is still honored here,
                   2285: although not as piously:
                   2286: the data is moved from contents' cache to
                   2287: the transcript's all at once and through only one internal buffer.)
                   2288: Performance figures confirm the efficiency.
                   2289: To read from a dead start a hundred kilobyte file on a VAX-11/750
                   2290: takes 1.4 seconds of user time, 2.5 seconds of system time,
                   2291: and 5 seconds of real time.
                   2292: Reading the same file in
                   2293: .I ed
                   2294: takes 6.0 seconds of user time, 1.7 seconds of system time,
                   2295: and 8 seconds of real time.
                   2296: .I Sam
                   2297: uses about half the CPU time.
                   2298: A more interesting example is the one stated above:
                   2299: inserting a character between every pair of characters in the file.
                   2300: The
                   2301: .I sam
                   2302: command is
                   2303: .P1 2n
                   2304: ,y/@/ a/x/
                   2305: .P2
                   2306: and takes 3 CPU seconds per kilobyte of input file, of which
                   2307: about a third is spent in the regular expression code.
                   2308: This translates to about 500 changes per second.
                   2309: .I Ed
                   2310: takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines),
                   2311: but cannot undo it.
                   2312: The same example in
                   2313: .I ex ,|reference(bsdmanual)
                   2314: a variant of
                   2315: .I ed
                   2316: done at the University of California at Berkeley,
                   2317: which allows one level of undoing, again takes 3 seconds.
                   2318: In summary,
                   2319: .I sam 's
                   2320: performance is comparable to that of other
                   2321: .UX
                   2322: editors, although it solves
                   2323: a harder problem.
                   2324: .NH 2
                   2325: Communications
                   2326: .LP
                   2327: The discussion so far has described the implementation of the host part of
                   2328: .I sam ;
                   2329: the next few sections explain how a machine with mouse and bitmap display
                   2330: can be engaged to improve interaction.
                   2331: .I Sam
                   2332: is not the first editor to be written as two processes,|reference(fraser text editor)
                   2333: but its implementation
                   2334: has some unusual aspects.
                   2335: .PP
                   2336: There are several ways
                   2337: .I sam 's
                   2338: host and terminal parts may be connected.
                   2339: The first and simplest is to forgo the terminal part and use the host
                   2340: part's command language to edit text on an ordinary terminal.
                   2341: This mode is invoked by starting
                   2342: .I sam
                   2343: with the
                   2344: .CW -d
                   2345: option.
                   2346: With no options,
                   2347: .I sam
                   2348: runs separate host and terminal programs,
                   2349: communicating with a message protocol over the physical
                   2350: connection that joins them.
                   2351: Typically, the connection is an RS-232 link between a Blit
                   2352: (the prototypical display for
                   2353: .I sam )
                   2354: and a host running
                   2355: the Research Edition of the
                   2356: .UX
                   2357: operating system.|reference(latest volume1)
                   2358: (This is the version of the system used in the Computing Sciences Research
                   2359: Center at AT&T Bell Laboratories, where I work.  Its relevant
                   2360: aspects are discussed in the Blit paper.|reference(bstj blit))
                   2361: The implementation of
                   2362: .I sam
                   2363: for the Sun computer runs both processes on the same machine and
                   2364: connects them by a pipe.
                   2365: .PP
                   2366: The low bandwidth of an RS-232 link
                   2367: necessitated the split between
                   2368: the two programs.
                   2369: The division is a mixed blessing:
                   2370: a program in two parts is much harder to write and to debug
                   2371: than a self-contained one,
                   2372: but the split makes several unusual configurations possible.
                   2373: The terminal may be physically separated from the host, allowing the conveniences
                   2374: of a mouse and bitmap display to be taken home while leaving the files at work.
                   2375: It is also possible to run the host part on a remote machine:
                   2376: .P1 2n
                   2377: sam -r host
                   2378: .P2
                   2379: connects to the terminal in the usual way, and then makes a call
                   2380: across the network to establish the host part of
                   2381: .I sam
                   2382: on the named machine.
                   2383: Finally, it cross-connects the I/O to join the two parts.
                   2384: This allows
                   2385: .I sam
                   2386: to be run on machines that do not support bitmap displays;
                   2387: for example,
                   2388: .I sam
                   2389: is the editor of choice on our Cray X-MP/24.
                   2390: .I Sam
                   2391: .CW -r
                   2392: involves
                   2393: .I three
                   2394: machines: the remote host, the terminal, and the local host.
                   2395: The local host's job is simple but vital: it passes the data
                   2396: between the remote host and terminal.
                   2397: .PP
                   2398: The host and terminal exchange messages asynchronously
                   2399: (rather than, say, as remote procedure calls) but there is no
                   2400: error detection or correction
                   2401: because, whatever the configuration, the connection is reliable.
                   2402: Because the terminal handles mundane interaction tasks such as
                   2403: popping up menus and interpreting the responses, the messages are about
                   2404: data, not actions.
                   2405: For example, the host knows nothing about what is displayed on the screen,
                   2406: and when the user types a character, the message sent to the host says
                   2407: ``insert a one-byte string at location 123 in file 7,'' not ``a character
                   2408: was typed at the current position in the current file.''
                   2409: In other words, the messages look very much like the transaction records
                   2410: in the transcripts.
                   2411: .PP
                   2412: Either the host or terminal part of
                   2413: .I sam
                   2414: may initiate a change to a file.
                   2415: The command language operates on the host, while typing and some
                   2416: mouse operations are executed directly in the terminal to optimize response.
                   2417: Changes initiated by the host program must be transmitted to the terminal,
                   2418: and
                   2419: vice versa.
                   2420: (A token is exchanged to determine which end is in control,
                   2421: which means that characters typed while a time-consuming command runs
                   2422: must be buffered and do not appear until the command is complete.)
                   2423: To maintain consistent information,
                   2424: the host and terminal track changes through a per-file
                   2425: data structure that records what portions of the file
                   2426: the terminal has received.
                   2427: The data structure, called a
                   2428: .CW Rasp
                   2429: (a weak pun: it's a file with holes)
                   2430: is held and updated by both the host and terminal.
                   2431: A
                   2432: .CW Rasp
                   2433: is a list of
                   2434: .CW Strings
                   2435: holding those parts of the file known to the terminal,
                   2436: separated by counts of the number of bytes in the interstices.
                   2437: Of course, the host doesn't keep a separate copy of the data (it only needs
                   2438: the lengths of the various pieces),
                   2439: but the structure is the same on both ends.
                   2440: .PP
                   2441: The
                   2442: .CW Rasp
                   2443: in the terminal doubles as a cache.
                   2444: Since the terminal keeps the text for portions of the file it has displayed,
                   2445: it need not request data from the host when revisiting old parts of the file
                   2446: or redrawing obscured windows, which speeds things up considerably
                   2447: over low-speed links.
                   2448: .PP
                   2449: It's trivial for the terminal to maintain its
                   2450: .CW Rasp ,
                   2451: because all changes made on the terminal apply to parts of the file
                   2452: already loaded there.
                   2453: Changes made by the host are compared against the
                   2454: .CW Rasp
                   2455: during the update sequence after each command.
                   2456: Small changes to pieces of the file loaded in the terminal
                   2457: are sent in their entirety.
                   2458: Larger changes, and changes that fall entirely in the holes,
                   2459: are transmitted as messages without literal data:
                   2460: only the lengths of the deleted and inserted strings are transmitted.
                   2461: When a command is completed, the terminal examines its visible
                   2462: windows to see if any holes in their
                   2463: .CW Rasps
                   2464: intersect the visible portion of the file.
                   2465: It then requests the missing data from the host,
                   2466: along with up to 512 bytes of surrounding data, to minimize
                   2467: the number of messages when visiting a new portion of the file.
                   2468: This technique provides a kind of two-level lazy evaluation for the terminal.
                   2469: The first level sends a minimum of information about
                   2470: parts of the file not being edited interactively;
                   2471: the second level waits until a change is displayed before
                   2472: transmitting the new data.
                   2473: Of course,
                   2474: performance is also helped by having the terminal respond immediately to typing
                   2475: and simple mouse requests.
                   2476: Except for small changes to active pieces of the file, which are
                   2477: transmitted to the terminal without negotiation,
                   2478: the terminal is wholly responsible for deciding what is displayed;
                   2479: the host uses the
                   2480: .CW Rasp
                   2481: only to tell the terminal what might be relevant.
                   2482: .PP
                   2483: When a change is initiated by the host,
                   2484: the messages to the terminal describing the change
                   2485: are generated by the routine that applies the transcript of the changes
                   2486: to the contents of the
                   2487: .CW File .
                   2488: Since changes are undone by the same update routine,
                   2489: undoing requires
                   2490: no extra code in the communications;
                   2491: the usual messages describing changes to the file are sufficient
                   2492: to back up the screen image.
                   2493: .PP
                   2494: The
                   2495: .CW Rasp
                   2496: is a particularly good example of the way caches are used in
                   2497: .I sam .
                   2498: First, it facilitates access to the active portion of the text by placing
                   2499: the busy text in main memory.
                   2500: In so doing, it provides efficient access
                   2501: to a large data structure that does not fit in memory.
                   2502: Since the form of data is to be imposed by the user, not by the program,
                   2503: and because characters will frequently be scanned sequentially,
                   2504: files are stored as flat objects.
                   2505: Caches help keep performance good and linear when working with such
                   2506: data.
                   2507: .PP
                   2508: Second, the
                   2509: .CW Rasp
                   2510: and several of the other caches have some
                   2511: .I read-ahead;
                   2512: that is, the cache is loaded with more information than is needed for
                   2513: the job immediately at hand.
                   2514: When manipulating linear structures, the accesses are usually sequential,
                   2515: and read-ahead can significantly reduce the average time to access the
                   2516: next element of the object.
                   2517: Sequential access is a common mode for people as well as programs;
                   2518: consider scrolling through a document while looking for something.
                   2519: .PP
                   2520: Finally, like any good data structure,
                   2521: the cache guides the algorithm, or at least the implementation.
                   2522: The
                   2523: .CW Rasp
                   2524: was actually invented to control the communications between the host and
                   2525: terminal parts, but I realized very early that it was also a form of
                   2526: cache.  Other caches were more explicitly intended to serve a double
                   2527: purpose: for example, the caches in
                   2528: .CW Files
                   2529: that coalesce updates not only reduce traffic to the
                   2530: transcript and contents
                   2531: .CW Buffers ,
                   2532: they also clump screen updates so that complicated changes to the
                   2533: screen are achieved in
                   2534: just a few messages to the terminal.
                   2535: This saved me considerable work: I did not need to write special
                   2536: code to optimize the message traffic to the
                   2537: terminal.
                   2538: Caches pay off in surprising ways.
                   2539: Also, they tend to be independent, so their performance improvements
                   2540: are multiplicative.
                   2541: .NH 2
                   2542: Data structures in the terminal
                   2543: .LP
                   2544: The terminal's job is to display and to maintain a consistent image of
                   2545: pieces of the files being edited.
                   2546: Because the text is always in memory, the data structures are
                   2547: considerably simpler than those in the host part.
                   2548: .PP
                   2549: .I Sam
                   2550: typically has far more windows than does
                   2551: .I mux ,
                   2552: the window system within which its Blit implementation runs.
                   2553: .I Mux
                   2554: has a fairly small number of asynchronously updated windows;
                   2555: .I sam
                   2556: needs a large number of synchronously updated windows that are
                   2557: usually static and often fully obscured.
                   2558: The different tradeoffs guided
                   2559: .I sam
                   2560: away from the memory-intensive implementation of windows, called
                   2561: .CW Layers ,|reference(pike overlapping)
                   2562: used in
                   2563: .I mux.
                   2564: Rather than depending on a complete bitmap image of the display for each window,
                   2565: .I sam
                   2566: regenerates the image from its in-memory text
                   2567: (stored in the
                   2568: .CW Rasp )
                   2569: when necessary, although it will use such an image if it is available.
                   2570: Like
                   2571: .CW Layers ,
                   2572: though,
                   2573: .I sam
                   2574: uses the screen bitmap as active storage in which to update the image using
                   2575: .I bitblt .|reference(guibas stolfi)|reference(pike locanthi blit bitmap)
                   2576: The resulting organization, pictured in Figure 6,
                   2577: has a global array of windows, called
                   2578: .CW Flayers ,
                   2579: each of which holds an image of a piece of text held in a data structure
                   2580: called a
                   2581: .CW Frame ,
                   2582: which in turn represents
                   2583: a rectangular window full of text displayed in some
                   2584: .CW Bitmap .
                   2585: Each
                   2586: .CW Flayer
                   2587: appears in a global list that orders them all front-to-back
                   2588: on the display, and simultaneously as an element of a per-file array
                   2589: that holds all the open windows for that file.
                   2590: The complement in the terminal of the
                   2591: .CW File
                   2592: on the host is called a
                   2593: .CW Text ;
                   2594: each connects its
                   2595: .CW Flayers
                   2596: to the associated
                   2597: .CW Rasp .
                   2598: .1C
                   2599: .KF
                   2600: .PS
                   2601: copy "fig6.pic"
                   2602: .PE
                   2603: .Cs
                   2604: Figure 6. Data structures in the terminal.
                   2605: .CW Flayers
                   2606: are also linked together into a front-to-back list.
                   2607: .CW Boxes
                   2608: are discussed in the next section.
                   2609: .Ce
                   2610: .KE
                   2611: .2C
                   2612: .PP
                   2613: The
                   2614: .CW Bitmap
                   2615: for a
                   2616: .CW Frame
                   2617: contains the image of the text.
                   2618: For a fully visible window, the
                   2619: .CW Bitmap
                   2620: will be the screen (or at least the
                   2621: .CW Layer
                   2622: in which
                   2623: .I sam
                   2624: is being run),
                   2625: while for partially obscured windows the
                   2626: .CW Bitmap
                   2627: will be off-screen.
                   2628: If the window is fully obscured, the
                   2629: .CW Bitmap
                   2630: will be null.
                   2631: .PP
                   2632: The
                   2633: .CW Bitmap
                   2634: is a kind of cache.
                   2635: When making changes to the display, most of the original image will
                   2636: look the same in the final image, and the update algorithms exploit this.
                   2637: The
                   2638: .CW Frame
                   2639: software updates the image in the
                   2640: .CW Bitmap
                   2641: incrementally; the
                   2642: .CW Bitmap
                   2643: is not just an image, it is a data structure.|reference(guibas stolfi)|reference(pike locanthi blit bitmap)
                   2644: The job of the software that updates the display is therefore
                   2645: to use as much as possible of the existing image (converting the
                   2646: text from ASCII characters to pixels is expensive) in a sort of two-dimensional
                   2647: string insertion algorithm.
                   2648: The details of this process are described in the next section.
                   2649: .PP
                   2650: The
                   2651: .CW Frame
                   2652: software has no code to support overlapping windows;
                   2653: its job is to keep a single
                   2654: .CW Bitmap
                   2655: up to date.
                   2656: It falls to the
                   2657: .CW Flayer
                   2658: software to multiplex the various
                   2659: .CW Bitmaps
                   2660: onto the screen.
                   2661: The problem of maintaining overlapping
                   2662: .CW Flayers
                   2663: is easier than for
                   2664: .CW Layers |reference(pike overlapping)
                   2665: because changes are made synchronously and because the contents of the window
                   2666: can be reconstructed from the data stored in the
                   2667: .CW Frame ;
                   2668: the
                   2669: .CW Layers
                   2670: software
                   2671: makes no such assumptions.
                   2672: In
                   2673: .I sam ,
                   2674: the window being changed is almost always fully visible, because the current
                   2675: window is always fully visible, by construction.
                   2676: However, when multi-file changes are being made, or when
                   2677: more than one window is open on a file,
                   2678: it may be necessary to update partially obscured windows.
                   2679: .PP
                   2680: There are three cases: the window is 
                   2681: fully visible, invisible (fully obscured), or partially visible.
                   2682: If fully visible, the
                   2683: .CW Bitmap
                   2684: is part of the screen, so when the
                   2685: .CW Flayer
                   2686: update routine calls the
                   2687: .CW Frame
                   2688: update routine, the screen will be updated directly.
                   2689: If the window is invisible,
                   2690: there is no associated
                   2691: .CW Bitmap ,
                   2692: and all that is necessary is to update the
                   2693: .CW Frame
                   2694: data structure, not the image.
                   2695: If the window is partially visible, the
                   2696: .CW Frame
                   2697: routine is called to update the image in the off-screen
                   2698: .CW Bitmap ,
                   2699: which may require regenerating it from the text of the window.
                   2700: The
                   2701: .CW Flayer
                   2702: code then clips this
                   2703: .CW Bitmap
                   2704: against the
                   2705: .CW Bitmaps
                   2706: of all
                   2707: .CW Frames
                   2708: in front of the
                   2709: .CW Frame
                   2710: being modified, and the remainder is copied to the display.
                   2711: .PP
                   2712: This is much faster than recreating the image off-screen
                   2713: for every change, or clipping all the changes made to the image
                   2714: during its update.
                   2715: Unfortunately, these caches can also consume prohibitive amounts of
                   2716: memory, so they are freed fairly liberally \(em after every change to the
                   2717: front-to-back order of the
                   2718: .CW Flayers .
                   2719: The result is that
                   2720: the off-screen
                   2721: .CW Bitmaps
                   2722: exist only while multi-window changes are occurring,
                   2723: which is the only time the performance improvement they provide is needed.
                   2724: Also, the user interface causes fully-obscured windows to be the
                   2725: easiest to make \(em
                   2726: creating a canonically sized and placed window requires only a button click
                   2727: \(em which reduces the need for caching still further.
                   2728: .NH 2
                   2729: Screen update
                   2730: .LP
                   2731: Only two low-level primitives are needed for incremental update:
                   2732: .I bitblt ,
                   2733: which copies rectangles of pixels, and
                   2734: .I string
                   2735: (which in turn calls
                   2736: .I bitblt ),
                   2737: which draws a null-terminated character string in a
                   2738: .CW Bitmap .
                   2739: A
                   2740: .CW Frame
                   2741: contains a list of
                   2742: .CW Boxes ,
                   2743: each of which defines a horizontal strip of text in the window
                   2744: (see Figure 7).
                   2745: A
                   2746: .CW Box
                   2747: has a character string
                   2748: .CW str ,
                   2749: and a
                   2750: .CW Rectangle
                   2751: .CW rect
                   2752: that defines the location of the strip in the window.
                   2753: (The text in
                   2754: .CW str
                   2755: is stored in the
                   2756: .CW Box
                   2757: separately from the
                   2758: .CW Rasp
                   2759: associated with the window's file, so
                   2760: .CW Boxes
                   2761: are self-contained.)
                   2762: The invariant is that
                   2763: the image of the
                   2764: .CW Box
                   2765: can be reproduced by calling
                   2766: .CW string
                   2767: with argument
                   2768: .CW str
                   2769: to draw the string in
                   2770: .CW rect ,
                   2771: and the resulting picture fits perfectly within
                   2772: .CW rect .
                   2773: In other words, the
                   2774: .CW Boxes
                   2775: define the tiling of the window.
                   2776: The tiling may be complicated by long lines of text, which
                   2777: are folded onto the next line.
                   2778: Some editors use horizontal scrolling to avoid this complication,
                   2779: but to be comfortable this technique requires that lines not be
                   2780: .I too
                   2781: long;
                   2782: .I sam
                   2783: has no such restriction.
                   2784: Also, and perhaps more importantly,
                   2785: .UX
                   2786: programs and terminals traditionally fold
                   2787: long lines to make their contents fully visible.
                   2788: .PP
                   2789: Two special kinds of
                   2790: .CW Boxes
                   2791: contain a single
                   2792: character: either a newline or a tab.
                   2793: Newlines and tabs are white space.
                   2794: A newline
                   2795: .CW Box
                   2796: always extends to the right edge of the window,
                   2797: forcing the following
                   2798: .CW Box
                   2799: to the next line.
                   2800: The width of a tab depends on where it is located:
                   2801: it forces the next
                   2802: .CW Box
                   2803: to begin at a tab location.
                   2804: Tabs also
                   2805: have a minimum width equivalent to a blank (blanks are
                   2806: drawn by
                   2807: .CW string
                   2808: and are not treated specially); newlines have a minimum width of zero.
                   2809: .1C
                   2810: .KF
                   2811: .PS
                   2812: copy "fig7.pic"
                   2813: .PE
                   2814: .sp .5
                   2815: .Cs
                   2816: Figure 7. A line of text showing its
                   2817: .CW Boxes .
                   2818: The first two blank
                   2819: .CW Boxes
                   2820: contain tabs; the last contains a newline.
                   2821: Spaces are handled as ordinary characters.
                   2822: .Ce
                   2823: .KE
                   2824: .2C
                   2825: .PP
                   2826: The update algorithms always use the
                   2827: .CW Bitmap
                   2828: image of the text (either the display or cache
                   2829: .CW Bitmap );
                   2830: they never examine the characters within a
                   2831: .CW Box
                   2832: except when the
                   2833: .CW Box
                   2834: needs to be split in two.
                   2835: Before a change, the window consists of a tiling of
                   2836: .CW Boxes ;
                   2837: after the change the window is tiled differently.
                   2838: The update algorithms rearrange the tiles in place, without
                   2839: backup storage.
                   2840: The algorithms are not strictly optimal \(em for example, they can
                   2841: clear a pixel that is later going to be written upon \(em
                   2842: but they never move a tile that doesn't need to be moved,
                   2843: and they move each tile at most once.
                   2844: .CW Frinsert
                   2845: on a Blit can absorb over a thousand characters a second if the strings
                   2846: being inserted are a few tens of characters long.
                   2847: .PP
                   2848: Consider
                   2849: .CW frdelete .
                   2850: Its job is to delete a substring from a
                   2851: .CW Frame
                   2852: and restore the image of the
                   2853: .CW Frame .
                   2854: The image of a substring has a peculiar shape (see Figure 2) comprising
                   2855: possibly a partial line,
                   2856: zero or more full lines,
                   2857: and possibly a final partial line.
                   2858: For reference, call this the
                   2859: .I
                   2860: Z-shape.
                   2861: .R
                   2862: .CW Frdelete
                   2863: begins by splitting, if necessary, the
                   2864: .CW Boxes
                   2865: containing the ends of
                   2866: the substring so the substring begins and ends on
                   2867: .CW Box
                   2868: boundaries.
                   2869: Because the substring is being deleted, its image is not needed,
                   2870: so the Z-shape is then cleared.
                   2871: Then, tiles (that is, the images of
                   2872: .CW Boxes )
                   2873: are copied, using
                   2874: .CW bitblt ,
                   2875: from immediately after the Z-shape to
                   2876: the beginning of the Z-shape,
                   2877: resulting in a new Z-shape.
                   2878: .CW Boxes "" (
                   2879: whose contents would span two lines in the new position must first be split.)
                   2880: .PP
                   2881: Copying the remainder of the
                   2882: .CW Frame
                   2883: tile by tile
                   2884: this way will clearly accomplish the deletion but eventually,
                   2885: typically when the copying algorithm encounters a tab or newline,
                   2886: the old and new
                   2887: .CW x
                   2888: coordinates of the tile
                   2889: to be copied are the same.
                   2890: This correspondence implies
                   2891: that the Z-shape has its beginning and ending edges aligned
                   2892: vertically, and a sequence of at most two
                   2893: .I bitblt s
                   2894: can be used to copy the remaining tiles.
                   2895: The last step is to clear out the resulting empty space at the bottom
                   2896: of the window;
                   2897: the number of lines to be cleared is the number of complete lines in the
                   2898: Z-shape closed by the final
                   2899: .I bitblt s.
                   2900: The final step is to merge horizontally adjacent
                   2901: .CW Boxes
                   2902: of plain text.
                   2903: The complete source to
                   2904: .CW frdelete
                   2905: is less than 100 lines of C.
                   2906: .PP
                   2907: .CW frinsert
                   2908: is more complicated because it must do four passes:
                   2909: one to construct the
                   2910: .CW Box
                   2911: list for the inserted string,
                   2912: one to reconnoitre,
                   2913: one to copy (in opposite order to
                   2914: .CW frdelete )
                   2915: the
                   2916: .CW Boxes
                   2917: to make the hole for the new text,
                   2918: and finally one to copy the new text into place.
                   2919: Overall, though,
                   2920: .CW frinsert
                   2921: has a similar flavor to
                   2922: .CW frdelete ,
                   2923: and needn't be described further.
                   2924: .CW Frinsert
                   2925: and its subsidiary routines comprise 211 lines of C.
                   2926: .PP
                   2927: The terminal source code is 3024 lines of C,
                   2928: and the host source is 5797 lines.
                   2929: .NH
                   2930: Discussion
                   2931: .NH 2
                   2932: History
                   2933: .LP
                   2934: The immediate ancestor of
                   2935: .I sam
                   2936: was the original text editor for the Blit, called
                   2937: .I jim .
                   2938: .I Sam
                   2939: inherited
                   2940: .I jim 's
                   2941: two-process structure and mouse language almost unchanged, but
                   2942: .I jim
                   2943: suffered from several drawbacks that were addressed in the design of
                   2944: .I sam .
                   2945: The most important of these was the lack of a command language.
                   2946: Although
                   2947: .I jim
                   2948: was easy to use for simple editing, it provided no direct help with
                   2949: large or repetitive editing tasks.  Instead, it provided a command to pass
                   2950: selected text through a shell pipeline,
                   2951: but this was no more satisfactory than could be expected of a stopgap measure.
                   2952: .PP
                   2953: .I Jim
                   2954: was written primarily as a vehicle for experimenting with a mouse-based
                   2955: interface to text, and the experiment was successful.
                   2956: .I Jim
                   2957: had some spin-offs:
                   2958: .I mux ,
                   2959: the second window system for the Blit, is essentially a multiplexed
                   2960: version of the terminal part of
                   2961: .I jim ;
                   2962: and the debugger
                   2963: .I pi 's
                   2964: user interface|reference(cargill feel pi) was closely modeled on
                   2965: .I jim 's.
                   2966: But after a couple of years,
                   2967: .I jim
                   2968: had become difficult to maintain and limiting to use,
                   2969: and its replacement was overdue.
                   2970: .PP
                   2971: I began the design of
                   2972: .I sam
                   2973: by asking
                   2974: .I jim
                   2975: customers what they wanted.
                   2976: This was probably a mistake; the answers were essentially a list of features
                   2977: to be found in other editors, which did not provide any of the
                   2978: guiding principles I was seeking.
                   2979: For instance, one common request was for a ``global substitute,''
                   2980: but no one suggested how to provide it within a cut-and-paste editor.
                   2981: I was looking for a scheme that would
                   2982: support such specialized features comfortably in the context of some
                   2983: general command language.
                   2984: Ideas were not forthcoming, though, particularly given my insistence
                   2985: on removing all limits on file sizes, line lengths and so on.
                   2986: Even worse, I recognized that, since the mouse could easily
                   2987: indicate a region of the screen that was not an integral number of lines,
                   2988: the command language would best forget about newlines altogether,
                   2989: and that meant the command language had to treat the file as a single
                   2990: string, not an array of lines.
                   2991: .PP
                   2992: Eventually, I decided that thinking was not getting me very far and it was
                   2993: time to try building.
                   2994: I knew that the terminal part could be built easily \(em
                   2995: that part of
                   2996: .I jim
                   2997: behaved acceptably well \(em and that most of the hard work was going
                   2998: to be in the host part: the file interface, command interpreter and so on.
                   2999: Moreover, I had some ideas about how the architecture of
                   3000: .I jim
                   3001: could be improved without destroying its basic structure, which I liked
                   3002: in principle but which hadn't worked out as well as I had hoped.
                   3003: So I began by designing the file data structure,
                   3004: starting with the way
                   3005: .I jim
                   3006: worked \(em comparable to a single structure merging
                   3007: .CW Disc
                   3008: and
                   3009: .CW Buffer ,
                   3010: which I split to make the cache more general
                   3011: \(em and thinking about how global substitute could be implemented.
                   3012: The answer was clearly that it had to be done in two passes,
                   3013: and the transcript-oriented implementation fell out naturally.
                   3014: .PP
                   3015: .I Sam
                   3016: was written bottom-up,
                   3017: starting from the data structures and algorithms for manipulating text,
                   3018: through the command language and up to the code for maintaining
                   3019: the display.
                   3020: In retrospect, it turned out well, but this implementation method is
                   3021: not recommended in general.
                   3022: There were several times when I had a large body of interesting code
                   3023: assembled and no clue how to proceed with it.
                   3024: The command language, in particular, took almost a year to figure out,
                   3025: but can be implemented (given what was there at the beginning of that year)
                   3026: in a day or two.  Similarly, inventing the
                   3027: .CW Rasp
                   3028: data structure delayed the
                   3029: connection of the host and terminal pieces by another few months.
                   3030: .I Sam
                   3031: took about two years to write, although only about four months were
                   3032: spent actually working on it.
                   3033: .PP
                   3034: Part of the design process was unusual:
                   3035: the subset of the protocol that maintains the
                   3036: .CW Rasp
                   3037: was simulated, debugged
                   3038: and verified by an automatic protocol analyzer,|reference(holzmann atttj) and was bug-free
                   3039: from the start.
                   3040: The rest of the protocol, concerned mostly
                   3041: with keeping menus up to date,
                   3042: was unfortunately too unwieldy for such analysis,
                   3043: and was debugged by more traditional methods, primarily
                   3044: by logging in a file all messages in and out of the host.
                   3045: .NH 2
                   3046: Reflections
                   3047: .LP
                   3048: .I Sam
                   3049: is essentially the only interactive editor used by the sixty or so members of
                   3050: the computing science research center in which I work.
                   3051: The same could not be said of
                   3052: .I jim ;
                   3053: the lack of a command language kept some people from adopting it.
                   3054: The union of a user interface as comfortable as
                   3055: .I jim 's
                   3056: with a command language as powerful as
                   3057: .CW ed 's\(dg
                   3058: .FS
                   3059: .vs 9
                   3060: \(dg
                   3061: The people who criticize
                   3062: .I ed
                   3063: as an interactive program often forget that it and its close relative
                   3064: .I sed |reference(kernighan pike)
                   3065: still thrive as programmable editors.  The strength of these programs is
                   3066: independent of their convenience for interactive editing.
                   3067: .br
                   3068: .vs
                   3069: .FE
                   3070: is essential to
                   3071: .I sam 's
                   3072: success.
                   3073: When
                   3074: .I sam
                   3075: was first made available to the
                   3076: .I jim
                   3077: community,
                   3078: almost everyone switched to it within two or three days.
                   3079: In the months that followed, even people who had never adopted
                   3080: .I jim
                   3081: started using
                   3082: .I sam
                   3083: exclusively.
                   3084: .PP
                   3085: To be honest,
                   3086: .I ed
                   3087: still gets occasional use, but usually when
                   3088: something quick needs to be done and the overhead of
                   3089: downloading the terminal part of
                   3090: .I sam
                   3091: isn't worth the trouble.
                   3092: Also, as a `line' editor,
                   3093: .CW "sam -d"
                   3094: is a bit odd;
                   3095: when using a good old ASCII terminal, it's comforting to have
                   3096: a true line editor.
                   3097: But it is fair to say that
                   3098: .I sam 's
                   3099: command language has displaced
                   3100: .I ed 's
                   3101: for most of the complicated editing that has kept line editors
                   3102: (that is, command-driven editors) with us.
                   3103: .PP
                   3104: .I Sam 's
                   3105: command language is even fancier than
                   3106: .I ed 's,
                   3107: and most
                   3108: .I sam
                   3109: customers don't come near to using all its capabilities.
                   3110: Does it need to be so sophisticated?
                   3111: I think the answer is yes, for two reasons.
                   3112: .PP
                   3113: First, the
                   3114: .I model
                   3115: for
                   3116: .I sam 's
                   3117: command language is really relatively simple, and certainly simpler than that of
                   3118: .I ed .
                   3119: For instance, there is only one kind of textual loop in
                   3120: .I sam
                   3121: \(em the
                   3122: .CW x
                   3123: command \(em
                   3124: while
                   3125: .I ed
                   3126: has three (the
                   3127: .CW g
                   3128: command, the global flag on substitutions, and the implicit loop over
                   3129: lines in multi-line substitutions).
                   3130: Also,
                   3131: .I ed 's
                   3132: substitute command is necessary to make changes within lines, but in
                   3133: .I sam
                   3134: the
                   3135: .CW s
                   3136: command is more of a familiar convenience than a necessity;
                   3137: .CW c
                   3138: and
                   3139: .CW t
                   3140: can do all the work.
                   3141: .PP
                   3142: Second,
                   3143: given a community that expects an editor to be about as powerful as
                   3144: .I ed ,
                   3145: it's hard to see how
                   3146: .I sam
                   3147: could really be much simpler and still satisfy that expectation.
                   3148: People want to do ``global substitutes,'' and most are content
                   3149: to have the recipe for that and a few other fancy changes.
                   3150: The sophistication of the command language is really just a veneer
                   3151: over a design that makes it possible to do global substitutes
                   3152: in a screen editor.
                   3153: Some people will always want something more, however, and it's gratifying to
                   3154: be able to provide it.
                   3155: The real power of
                   3156: .I sam 's
                   3157: command language comes from composability of the operators, which is by
                   3158: nature orthogonal to the underlying model.
                   3159: In other words,
                   3160: .I sam
                   3161: is not itself complex, but it makes complex things possible.
                   3162: If you don't want to do anything complex, you can ignore the
                   3163: complexity altogether, and many people do so.
                   3164: .PP
                   3165: Sometimes I am asked the opposite question: why didn't I just make
                   3166: .I sam
                   3167: a real programmable editor, with macros and variables and so on?
                   3168: The main reason is a matter of taste: I like the editor
                   3169: to be the same every time I use it.
                   3170: There is one technical reason, though:
                   3171: programmability in editors is largely a workaround for insufficient
                   3172: interactivity.
                   3173: Programmable editors are used to make particular, usually short-term,
                   3174: things easy to do, such as by providing shorthands for common actions.
                   3175: If things are generally easy to do in the first place,
                   3176: shorthands are not as helpful.
                   3177: .I Sam
                   3178: makes common editing operations very easy, and the solutions to
                   3179: complex editing problems seem commensurate with the problems themselves.
                   3180: Also, the ability to edit the
                   3181: .I sam
                   3182: window makes it easy to repeat commands \(em it only takes a mouse button click
                   3183: to execute a command again.
                   3184: .NH 2
                   3185: Pros and cons
                   3186: .LP
                   3187: .I Sam
                   3188: has several other good points,
                   3189: and its share of problems.
                   3190: Among the good things is the idea of
                   3191: structural regular expressions,
                   3192: whose usefulness has only begun to be explored.
                   3193: They were arrived at serendipitously when I attempted to distill the essence of
                   3194: .I ed 's
                   3195: way of doing global substitution and recognized that the looping command in
                   3196: .I ed
                   3197: was implicitly imposing a structure (an array of lines) on the file.
                   3198: .PP
                   3199: Another of
                   3200: .I sam 's
                   3201: good things is its undo capability.
                   3202: I had never before used an editor with a true undo,
                   3203: but I would never go back now.
                   3204: Undo
                   3205: .I must
                   3206: be done well, but if it is, it can be relied on.
                   3207: For example,
                   3208: it's safe to experiment if you're not sure how to write some intricate command,
                   3209: because if you make a mistake, it can be fixed simply and reliably.
                   3210: I learned two things about undo from writing
                   3211: .I sam :
                   3212: first, it's easy to provide if you design it in from the beginning, and
                   3213: second, it's necessary, particularly if the system has some subtle
                   3214: properties that may be unfamiliar or error-prone for users.
                   3215: .PP
                   3216: .I Sam 's
                   3217: lack of internal limits and sizes is a virtue.
                   3218: Because it avoids all fixed-size tables and data structures,
                   3219: .I sam
                   3220: is able to make global changes to files that some of our other
                   3221: tools cannot even read.
                   3222: Moreover, the design keeps the performance linear when doing such
                   3223: operations, although I must admit
                   3224: .I sam
                   3225: does get slow when editing a huge file.
                   3226: .PP
                   3227: Now, the problems.
                   3228: Externally, the most obvious is that it is poorly integrated into the
                   3229: surrounding window system.
                   3230: By design, the user interface in
                   3231: .I sam
                   3232: feels almost identical to that of
                   3233: .I mux ,
                   3234: but a thick wall separates text in
                   3235: .I sam
                   3236: from the programs running in
                   3237: .I mux .
                   3238: For instance, the `snarf buffer' in
                   3239: .I sam
                   3240: must be maintained separately from that in
                   3241: .I mux .
                   3242: This is regrettable, but probably necessary given the unusual configuration
                   3243: of the system, with a programmable terminal on the far end of an RS-232 link.
                   3244: .PP
                   3245: .I Sam
                   3246: is reliable; otherwise, people wouldn't use it.
                   3247: But it was written over such a long time, and has so many new (to me)
                   3248: ideas in it, that I would like to see it done over again to clean
                   3249: up the code and remove many of the lingering problems in the implementation.
                   3250: The worst part is in the interconnection of the host and terminal parts,
                   3251: which might even be able to go away in a redesign for a more
                   3252: conventional window system.
                   3253: The program must be split in two to use the terminal effectively,
                   3254: but the low bandwidth of the connection forces the separation to
                   3255: occur in an inconvenient part of the design if performance is to be acceptable.
                   3256: A simple remote procedure call
                   3257: protocol driven by the host, emitting only graphics
                   3258: commands, would be easy to write but wouldn't have nearly the
                   3259: necessary responsiveness.  On the other hand, if the terminal were in control
                   3260: and requested much simpler file services from the host, regular expression
                   3261: searches would require that the terminal read the entire file over its RS-232
                   3262: link, which would be unreasonably slow.
                   3263: A compromise in which either end can take control is necessary.
                   3264: In retrospect, the communications protocol should have been
                   3265: designed and verified formally, although I do not know of any tool
                   3266: that can adequately relate the protocol to
                   3267: its implementation.
                   3268: .PP
                   3269: Not all of
                   3270: .I sam 's
                   3271: users are comfortable with its command language, and few are adept.
                   3272: Some (venerable) people use a sort of
                   3273: .I ed \& ``
                   3274: subset'' of
                   3275: .I sam 's
                   3276: command language,
                   3277: and even ask why
                   3278: .I sam 's
                   3279: command language is not exactly
                   3280: .I ed 's.
                   3281: (The reason, of course, is that
                   3282: .I sam 's
                   3283: model for text does not include newlines, which are central to
                   3284: .I ed .
                   3285: Making the text an array of newlines to the command language would
                   3286: be too much of a break from the seamless model provided by the mouse.
                   3287: Some editors, such as
                   3288: .I vi ,
                   3289: are willing to make this break, though.)
                   3290: The difficulty is that
                   3291: .I sam 's
                   3292: syntax is so close to
                   3293: .I ed 's
                   3294: that people believe it
                   3295: .I should
                   3296: be the same.
                   3297: I thought, with some justification in hindsight,
                   3298: that making
                   3299: .I sam
                   3300: similar to
                   3301: .I ed
                   3302: would make it easier to learn and to accept.
                   3303: But I may have overstepped and raised the users'
                   3304: expectations too much.
                   3305: It's hard to decide which way to resolve this problem.
                   3306: .PP
                   3307: Finally, there is a tradeoff in
                   3308: .I sam
                   3309: that was decided by the environment in which it runs:
                   3310: .I sam
                   3311: is a multi-file editor, although in a different system there might instead be
                   3312: multiple single-file editors.
                   3313: The decision was made primarily because starting a new program in a Blit is
                   3314: time-consuming.
                   3315: If the choice could be made freely, however, I would
                   3316: still choose the multi-file architecture, because it allows
                   3317: groups of files to be handled as a unit;
                   3318: the usefulness of the multi-file commands is incontrovertible.
                   3319: It is delightful to have the source to an entire program
                   3320: available at your fingertips.
                   3321: .NH
                   3322: Acknowledgements
                   3323: .LP
                   3324: Tom Cargill suggested the idea behind the
                   3325: .CW Rasp
                   3326: data structure.
                   3327: Norman Wilson and Ken Thompson influenced the command language.
                   3328: This paper was improved by comments from
                   3329: Al Aho,
                   3330: Jon Bentley,
                   3331: Chris Fraser,
                   3332: Gerard Holzmann,
                   3333: Brian Kernighan,
                   3334: Ted Kowalski,
                   3335: Doug McIlroy
                   3336: and
                   3337: Dennis Ritchie.
                   3338: .NH
                   3339: References
                   3340: .LP
                   3341: |reference_placement

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.