|
|
1.1 root 1: .so ../ADM/mac
2: .XX sam 399 "The Text Editor Sam"
3: .nr dP 2
4: .nr dV 3p
5: .de Cs
6: .br
7: .fi
8: .ft 2
9: .ps -2
10: .vs -2
11: ..
12: .de Ce
13: .br
14: .nf
15: .ft 1
16: .ps
17: .vs
18: .sp
19: ..
20: .TL
21: The Text Editor Sam\(dg
22: .AU
23: Rob Pike
24: .AI
25: .MH
26: .AB
27: .PP
28: .I Sam
29: is an interactive multi-file text editor intended for
30: bitmap displays.
31: A textual command language
32: supplements the mouse-driven, cut-and-paste interface
33: to make complex or
34: repetitive editing tasks easy to specify.
35: The language is characterized by the composition of regular expressions
36: to describe the structure of the text being modified.
37: The treatment of files as a database, with changes logged
38: as atomic transactions, guides the implementation and
39: makes a general `undo' mechanism straightforward.
40: .PP
41: .I Sam
42: is implemented as two processes connected by a low-bandwidth stream,
43: one process handling the display and the other the editing
44: algorithms. Therefore it can run with the display process
45: in a bitmap terminal and the editor on a local host,
46: with both processes on a bitmap-equipped host, or with
47: the display process in the terminal and the editor in a
48: remote host.
49: By suppressing the display process,
50: it can even run without a bitmap terminal.
51: .AE
52: .2C
53: .FS
54: \(dg This article originally appeared as |reference(spe sam pike).
55: It is reprinted by permission of John Wiley & Sons, Ltd.
56: .FE
57: .NH
58: Introduction
59: .LP
60: .I Sam
61: is an interactive text editor that combines cut-and-paste interactive editing with
62: an unusual command language based on the composition of regular expressions.
63: It is written as two programs: one, the `host part,' runs on a
64: .UX
65: system
66: and implements the command language and provides file access; the other, the
67: `terminal part,' runs asynchronously
68: on a machine with a mouse and bitmap display
69: and supports the display and interactive editing.
70: The host part may be even run in isolation on an ordinary terminal
71: to edit text using the command
72: language, much like a traditional line editor,
73: without assistance from a mouse or display.
74: Most often,
75: the terminal part runs on a Blit|reference(blit bstj) terminal
76: (actually on a Teletype DMD 5620, the production version of the Blit), whose
77: host connection is an ordinary 9600 bps RS-232 link;
78: on the Sun computer the host and display processes run on a single machine,
79: connected by a pipe.
80: .PP
81: .I Sam
82: edits uninterpreted
83: ASCII text.
84: It has no facilities for multiple fonts, graphics or tables,
85: unlike MacWrite,|reference(macwrite) Bravo,|reference(bravo)
86: Tioga|reference(tioga)
87: or Lara.|reference(lara)
88: Also unlike them, it has a rich command language.
89: (Throughout this paper, the phrase
90: .I
91: command language
92: .R
93: refers to
94: textual commands; commands activated from the mouse form the
95: .I mouse
96: .I language. )
97: .I Sam
98: developed as an editor for use by programmers, and tries to join
99: the styles of the
100: .UX
101: text editor
102: .I ed
103: |reference(v7manual)|reference(kernighan pike)
104: with that of interactive cut-and-paste editors by
105: providing a comfortable mouse-driven interface
106: to a program with a solid command language driven by regular expressions.
107: The command language developed more than the mouse language, and
108: acquired a notation for describing the structure of files
109: more richly than as a sequence of lines,
110: using a dataflow-like syntax for specifying changes.
111: .PP
112: The interactive style was influenced by
113: .I jim ,
114: |reference(bstj blit)
115: an early cut-and-paste editor for the Blit, and by
116: .I mux ,
117: |reference(latest volume1)
118: the Blit window system.
119: .I Mux
120: merges the original Blit window system,
121: .I mpx ,
122: |reference(bstj blit)
123: with cut-and-paste editing, forming something like a
124: multiplexed version of
125: .I jim
126: that edits the output of (and input to) command sessions rather than files.
127: .PP
128: The first part of this paper describes the command language, then the mouse
129: language, and explains how they interact.
130: That is followed by a description of the implementation,
131: first of the host part, then of the terminal part.
132: A principle that influenced the design of
133: .I sam
134: is that it should have no explicit limits, such as upper limits on
135: file size or line length.
136: A secondary consideration is that it be efficient.
137: To honor these two goals together requires a method for efficiently
138: manipulating
139: huge strings (files) without breaking them into lines,
140: perhaps while making thousands of changes
141: under control of the command language.
142: .I Sam 's
143: method is to
144: treat the file as a transaction database, implementing changes as atomic
145: updates. These updates may be unwound easily to `undo' changes.
146: Efficiency is achieved through a collection of caches that minimizes
147: disc traffic and data motion, both within the two parts of the program
148: and between them.
149: .PP
150: The terminal part of
151: .I sam
152: is fairly straightforward.
153: More interesting is how the two halves of the editor stay
154: synchronized when either half may initiate a change.
155: This is achieved through a data structure that organizes the
156: communications and is maintained in parallel by both halves.
157: .PP
158: The last part of the paper chronicles the writing of
159: .I sam
160: and discusses the lessons that were learned through its development and use.
161: .PP
162: The paper is long, but is composed largely of two papers of reasonable length:
163: a description of the user interface of
164: .I sam
165: and a discussion of its implementation.
166: They are combined because the implementation is strongly influenced by
167: the user interface, and vice versa.
168: .NH
169: The Interface
170: .LP
171: .I Sam
172: is a text editor for multiple files.
173: File names may be provided when it is invoked:
174: .P1 2n
175: sam file1 file2 ...
176: .P2
177: and there are commands
178: to add new files and discard unneeded ones.
179: Files are not read until necessary
180: to complete some command.
181: Editing operations apply to an internal copy
182: made when the file is read; the
183: .UX
184: file associated with the copy
185: is changed only by an explicit command.
186: To simplify the discussion, the internal copy is here called a
187: .I file ,
188: while the disc-resident original is called a
189: .I
190: disc file.
191: .R
192: .PP
193: .I Sam
194: is usually connected to a bitmap display that presents a cut-and-paste
195: editor driven by the mouse.
196: In this mode, the command language is still available:
197: text typed in a special window, called the
198: .I sam
199: .I window,
200: is interpreted
201: as commands to be executed in the current file.
202: Cut-and-paste editing may be used in any window \(em even in the
203: .I sam
204: window to construct commands.
205: The other mode of operation, invoked by starting
206: .I sam
207: with the option
208: .CW -d
209: (for `no download'),
210: does not use the mouse or bitmap display, but still permits
211: editing using the textual command language, even on an ordinary terminal,
212: interactively or from a script.
213: .PP
214: The following sections describe first the command language (under
215: .CW sam\ -d
216: and in the
217: .I sam
218: window), and then the mouse interface.
219: These two languages are nearly independent, but connect through the
220: .I current
221: .I text,
222: described below.
223: .NH 2
224: The Command Language
225: .LP
226: A file consists of its contents, which are an array of characters
227: (that is, a string); the
228: .I name
229: of the associated disc file; the
230: .I
231: modified bit
232: .R
233: that states whether the contents match those of
234: the disc file;
235: and a substring of the contents, called the
236: .I
237: current text
238: .R
239: or
240: .I dot
241: (see Figures 1 and 2).
242: If the current text is a null string, dot falls between characters.
243: The
244: .I value
245: of dot is the location of the current text; the
246: .I contents
247: of dot are the characters it contains.
248: .I Sam
249: imparts to the text no two-dimensional interpretation such as columns
250: or fields; text is always one-dimensional.
251: Even the idea of a `line' of text as understood by most
252: .UX
253: programs
254: \(em a sequence of characters terminated by a newline character \(em
255: is only weakly supported.
256: .PP
257: The
258: .I
259: current file
260: .R
261: is the file to which editing commands refer.
262: The current text is therefore dot in the current file.
263: If a command doesn't explicitly name a particular file or piece of text,
264: the command is assumed to apply to the current text.
265: For the moment, ignore the presence of multiple files and consider
266: editing a single file.
267: .1C
268: .KF L
269: .IB fig1.ps 3.10i 5.65i c
270: .Cs
271: Figure 1. A typical
272: .I sam
273: screen, with the editing menu presented.
274: The
275: .I sam
276: (command language) window is in the middle, with file windows above and below.
277: (The user interface makes it easy to create these abutting windows.)
278: The partially obscured window is a third file window.
279: The uppermost window is that to which typing and mouse operations apply,
280: as indicated by its heavy border.
281: Each window has its current text highlighted in reverse video.
282: The
283: .I sam
284: window's current text is the null string on the last visible line,
285: indicated by a vertical bar.
286: See also Figure 2.
287: .Ce
288: .KE
289: .2C
290: .PP
291: Commands have one-letter names.
292: Except for non-editing commands such as writing
293: the file to disc, most commands make some change
294: to the text in dot and leave dot set to the text resulting from the change.
295: For example, the delete command,
296: .CW d ,
297: deletes the text in dot, replacing it by the null string and setting dot
298: to the result.
299: The change command,
300: .CW c ,
301: replaces dot by text delimited by an arbitrary punctuation character,
302: conventionally
303: a slash. Thus,
304: .P1 2n
305: c/Peter/
306: .P2
307: replaces the text in dot by the string
308: .CW Peter .
309: Similarly,
310: .P1 2n
311: a/Peter/
312: .P2
313: (append) adds the string after dot, and
314: .P1 2n
315: i/Peter/
316: .P2
317: (insert) inserts before dot.
318: All three leave dot set to the new text,
319: .CW Peter .
320: .PP
321: Newlines are part of the syntax of commands:
322: the newline character lexically terminates a command.
323: Within the inserted text, however, newlines are never implicit.
324: But since it is often convenient to insert multiple lines of text,
325: .I sam
326: has a special
327: syntax for that case:
328: .P1 2n
329: a
330: some lines of text
331: to be inserted in the file,
332: terminated by a period
333: on a line by itself
334: \&.
335: .P2
336: In the one-line syntax, a newline character may be specified by a C-like
337: escape, so
338: .P1 2n
339: c/\en/
340: .P2
341: replaces dot by a single newline character.
342: .PP
343: .I Sam
344: also has a substitute command,
345: .CW s :
346: .P1 2n
347: s/\f2expression\fP/\f2replacement\fP/
348: .P2
349: substitutes the replacement text for the first match, in dot,
350: of the regular expression.
351: Thus, if dot is the string
352: .CW Peter ,
353: the command
354: .P1 2n
355: s/t/st/
356: .P2
357: changes it to
358: .CW Pester .
359: In general,
360: .CW s
361: is unnecessary, but it was inherited from
362: .I ed
363: and it has some convenient variations.
364: For instance, the replacement text may include the matched text,
365: specified by
366: .CW & :
367: .P1 2n
368: s/Peter/Oh, &, &, &, &!/
369: .P2
370: .PP
371: There are also three commands that apply programs
372: to text:
373: .P1 2n
374: < \f2Unix program\fP
375: .P2
376: replaces dot by the output of the
377: .UX
378: program.
379: Similarly, the
380: .CW >
381: command
382: runs the program with dot as its standard input, and
383: .CW |
384: does both. For example,
385: .P1 2n
386: | sort
387: .P2
388: replaces dot by the result of applying the standard sorting utility to it.
389: Again, newlines have no special significance for these
390: .I sam
391: commands.
392: The text acted upon and resulting from these commands is not necessarily
393: bounded by newlines, although for connection with
394: .UX
395: programs,
396: newlines may be necessary to obey conventions.
397: .PP
398: One more command:
399: .CW p
400: prints the contents of dot.
401: Table I summarizes
402: .I sam 's
403: commands.
404: .1C
405: .KF
406: .Tm Table of Commands s
407: .TS
408: center;
409: c s
410: lfCW l.
411: Table I. \fISam\fP commands
412: .sp .4
413: .ft CW
414: _
415: .ft
416: .sp .4
417: \f1Text commands\fP
418: .sp .4
419: _
420: .sp .4
421: a/\f2text\fP/ Append text after dot
422: c/\f2text\fP/ Change text in dot
423: i/\f2text\fP/ Insert text before dot
424: d Delete text in dot
425: s/\f2regexp\fP/\f2text\fP/ Substitute text for match of regular expression in dot
426: m \f2address\fP Move text in dot after address
427: t \f2address\fP Copy text in dot after address
428: .sp .4
429: _
430: .sp .4
431: \f1Display commands\fP
432: .sp .4
433: _
434: .sp .2
435: p Print contents of dot
436: \&= Print value (line numbers and character numbers) of dot
437: .sp .4
438: _
439: .sp .4
440: \f1File commands\fP
441: .sp .4
442: _
443: .sp .2
444: b \f2file-list\fP Set current file to first file in list that \fIsam\fP has in menu
445: B \f2file-list\fP Same as \f(CWb\fP, but load new files
446: n Print menu lines of all files
447: D \f2file-list\fP Delete named files from \fIsam\fP
448: .sp .4
449: _
450: .sp .4
451: \f1I/O commands\fP
452: .sp .4
453: _
454: .sp .2
455: e \f2filename\fP Replace file with named disc file
456: r \f2filename\fP Replace dot by contents of named disc file
457: w \f2filename\fP Write file to named disc file
458: f \f2filename\fP Set file name and print new menu line
459: < \f2Unix-command\fP Replace dot by standard output of command
460: > \f2Unix-command\fP Send dot to standard input of command
461: | \f2Unix-command\fP Replace dot by result of command applied to dot
462: ! \f2Unix-command\fP Run the command
463: .sp .4
464: _
465: .sp .4
466: \f1Loops and conditionals\fP
467: .sp .4
468: _
469: .sp .2
470: x/\f2regexp\fP/ \f2command\fP For each match of regexp, set dot and run command
471: y/\f2regexp\fP/ \f2command\fP Between adjacent matches of regexp, set dot and run command
472: X/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line matches regexp
473: Y/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line does not match
474: g/\f2regexp\fP/ \f2command\fP If dot contains a match of regexp, run command
475: v/\f2regexp\fP/ \f2command\fP If dot does not contain a match of regexp, run command
476: .sp .4
477: _
478: .sp .4
479: \f1Miscellany\fP
480: .sp .4
481: _
482: .sp .2
483: k Set address mark to value of dot
484: q Quit
485: u \f2n\fP Undo last \f2n\fP (default 1) changes
486: { } Braces group commands
487: .sp .3
488: .ft CW
489: _
490: .ft
491: .TE
492: .sp
493: .Tm n S
494: .Tm k S
495: .Tm q S
496: .Tm p S
497: .KE
498: .2C
499: .PP
500: The value of dot may be changed by
501: specifying an
502: .I address
503: for the command.
504: The simplest address is a line number:
505: .P1 2n
506: 3
507: .P2
508: refers to the third line of the file, so
509: .P1 2n
510: 3d
511: .P2
512: deletes the third line of the file, and implicitly renumbers
513: the lines so the old line 4 is now numbered 3.
514: (This is one of the few places where
515: .I sam
516: deals with lines directly.)
517: Line
518: .CW 0
519: is the null string at the beginning of the file.
520: If a command consists of only an address, a
521: .CW p
522: command is assumed, so typing an unadorned
523: .CW 3
524: prints line 3 on the terminal.
525: There are a couple of other basic addresses:
526: a period addresses dot itself; and
527: a dollar sign
528: .CW $ ) (
529: addresses the null string at the end of the file.
530: .PP
531: An address is always a single substring of the file.
532: Thus, the address
533: .CW 3
534: addresses the characters
535: after the second newline of
536: the file through the third newline of the file.
537: A
538: .I
539: compound address
540: .R
541: is constructed by the comma operator
542: .P1 2n
543: \f2address1\fP,\f2address2\fP
544: .P2
545: and addresses the substring of the file from the beginning of
546: .I address1
547: to the end of
548: .I address2 .
549: For example, the command
550: .CW 3,5p
551: prints the third through fifth lines of the file and
552: .CW .,$d
553: deletes the text from the beginning of dot to the end of the file.
554: .PP
555: These addresses are all absolute positions in the file, but
556: .I sam
557: also has relative addresses, indicated by
558: .CW +
559: or
560: .CW - .
561: For example,
562: .P1 2n
563: $-3
564: .P2
565: is the third line before the end of the file and
566: .P1 2n
567: \&.+1
568: .P2
569: is the line after dot.
570: If no address appears to the left of the
571: .CW +
572: or
573: .CW - ,
574: dot is assumed;
575: if nothing appears to the right,
576: .CW 1
577: is assumed.
578: Therefore,
579: .CW .+1
580: may be abbreviated to just a plus sign.
581: .PP
582: The
583: .CW +
584: operator acts relative to the end of its first argument, while the
585: .CW -
586: operator acts relative to the beginning. Thus
587: .CW .+1
588: addresses the first line after dot,
589: .CW .-
590: addresses the first line before dot, and
591: .CW +-
592: refers to the line containing the end of dot. (Dot may span multiple lines, and
593: .CW +
594: selects the line after the end of dot, then
595: .CW -
596: backs up one line.)
597: .PP
598: The final type of address is a regular expression, which addresses the
599: text matched by the expression. The expression is enclosed in slashes, as in
600: .P1 2n
601: /\f2expression\fP/
602: .P2
603: The expressions are the same as those in the
604: .UX
605: program
606: .I egrep ,
607: |reference(v7manual)|reference(kernighan pike)
608: and include closures, alternations, and so on.
609: They find the
610: .I
611: leftmost longest
612: .R
613: string that matches the expression, that is,
614: the first match after the point where the search is started,
615: and if more than one match begins at the same spot, the longest such match.
616: (I assume familiarity with the syntax for regular expressions in
617: .UX
618: programs.|reference(bsdmanual))
619: For example,
620: .P1 2n
621: /x/
622: .P2
623: matches the next
624: .CW x
625: character in the file,
626: .P1 2n
627: /xx*/
628: .P2
629: matches the next run of one or more
630: .CW x 's,
631: and
632: .P1 2n
633: /x|Peter/
634: .P2
635: matches the next
636: .CW x
637: or
638: .CW Peter .
639: For compatibility with other
640: .UX
641: programs, the `any character' operator,
642: a period,
643: does not match a newline, so
644: .P1 2n
645: /.*/
646: .P2
647: matches the text from dot to the end of the line, but excludes the newline
648: and so will not match across
649: the line boundary.
650: .PP
651: Regular expressions are always relative addresses.
652: The direction is forwards by default,
653: so
654: .CW /Peter/
655: is really an abbreviation for
656: .CW +/Peter/ .
657: The search can be reversed with a minus sign, so
658: .P1 2n
659: .CW -/Peter/
660: .P2
661: finds the first
662: .CW Peter
663: before dot.
664: Regular expressions may be used with other address forms, so
665: .CW 0+/Peter/
666: finds the first
667: .CW Peter
668: in the file and
669: .CW $-/Peter/
670: finds the last.
671: Table II summarizes
672: .I sam 's
673: addresses.
674: .1C
675: .KF
676: .TS
677: center;
678: c s
679: lfCW l.
680: Table II. \fISam\fP addresses
681: .sp .4
682: .ft CW
683: _
684: .ft
685: .sp .4
686: \f1Simple addresses\fP
687: .sp .4
688: _
689: .sp .2
690: #\f2n\fP The empty string after character \f2n\fP
691: \f2n\fP Line \f2n\fP.
692: /\f2regexp\fP/ The first following match of the regular expression
693: -/\f2regexp\fP/ The first previous match of the regular expression
694: $ The null string at the end of the file
695: \&. Dot
696: \&' The address mark, set by \f(CWk\fP command
697: "\f2regexp\fP" Dot in the file whose menu line matches regexp
698: .sp .4
699: _
700: .sp .4
701: \f1Compound addresses\fP
702: .sp .4
703: _
704: .sp .2
705: \f2a1\fP+\f2a2\fP The address \f2a2\fP evaluated starting at right of \f2a1\fP
706: \f2a1\fP-\f2a2\fP \f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP
707: \f2a1\fP,\f2a2\fP From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP)
708: \f2a1\fP;\f2a2\fP Like \f(CW,\fP but sets dot after evaluating \f2a1\fP
709: .sp .4
710: _
711: .sp .4
712: .T&
713: c s.
714: T{
715: The operators
716: .CW +
717: and
718: .CW -
719: are high precedence, while
720: .CW ,
721: and
722: .CW ;
723: are low precedence.
724: In both
725: .CW +
726: and
727: .CW -
728: forms,
729: .I a2
730: defaults to 1 and
731: .I a1
732: defaults to dot.
733: If both
734: .I a1
735: and
736: .I a2
737: are present,
738: .CW +
739: may be elided.
740: T}
741: .sp .5
742: .ft CW
743: _
744: .ft
745: .TE
746: .sp
747: .KE
748: .2C
749: .PP
750: The language discussed so far will not seem novel
751: to people who use
752: .UX
753: text editors
754: such as
755: .I ed
756: or
757: .I vi .
758: |reference(bsdmanual)
759: Moreover, the kinds of editing operations these commands allow, with the exception
760: of regular expressions and line numbers,
761: are clearly more conveniently handled by a mouse-based interface.
762: Indeed,
763: .I sam 's
764: mouse language (discussed at length below) is the means by which
765: simple changes are usually made.
766: For large or repetitive changes, however, a textual language
767: outperforms a manual interface.
768: .PP
769: Imagine that, instead of deleting just one occurrence of the string
770: .CW Peter ,
771: we wanted to eliminate every
772: .CW Peter .
773: What's needed is an iterator that runs a command for each occurrence of some
774: text.
775: .I Sam 's
776: iterator is called
777: .CW x ,
778: for extract:
779: .P1 2n
780: x/\f2expression\fP/ \f2command\fP
781: .P2
782: finds all matches in dot of the specified expression, and for each
783: such match, sets dot to the text matched and runs the command.
784: So to delete all the
785: .CW Peters:
786: .P1 2n
787: 0,$ x/Peter/ d
788: .P2
789: (Blanks in these examples are to improve readability;
790: .I sam
791: neither requires nor interprets them.)
792: This searches the entire file
793: .CW 0,$ ) (
794: for occurrences of the string
795: .CW Peter ,
796: and runs the
797: .CW d
798: command with dot set to each such occurrence.
799: (By contrast, the comparable
800: .I ed
801: command would delete all
802: .I lines
803: containing
804: .CW Peter ;
805: .I sam
806: deletes only the
807: .CW Peters .)
808: The address
809: .CW 0,$
810: is commonly used, and may be abbreviated to just a comma.
811: As another example,
812: .P1 2n
813: , x/Peter/ p
814: .P2
815: prints a list of
816: .CW Peters,
817: one for each appearance in the file, with no intervening text (not even newlines
818: to separate the instances).
819: .PP
820: Of course, the text extracted by
821: .CW x
822: may be selected by a regular expression,
823: which complicates deciding what set of matches is chosen \(em
824: matches may overlap. This is resolved by generating the matches
825: starting from the beginning of dot using the leftmost-longest rule,
826: and searching for each match starting from the end of the previous one.
827: Regular expressions may also match null strings, but a null match
828: adjacent to a non-null match is never selected; at least one character
829: must intervene.
830: For example,
831: .P1 2n
832: , c/AAA/
833: x/B*/ c/-/
834: , p
835: .P2
836: produces as output
837: .P1 2n
838: -A-A-A-
839: .P2
840: because the pattern
841: .CW B*
842: matches the null strings separating the
843: .CW A 's.
844: .PP
845: The
846: .CW x
847: command has a complement,
848: .CW y ,
849: with similar syntax, that executes the command with dot set to the text
850: .I between
851: the matches of the expression.
852: For example,
853: .P1 2n
854: , c/AAA/
855: y/A/ c/-/
856: , p
857: .P2
858: produces the same result as the example above.
859: .PP
860: The
861: .CW x
862: and
863: .CW y
864: commands are looping constructs, and
865: .I sam
866: has a pair of conditional commands to go with them.
867: They have similar syntax:
868: .P1 2n
869: g/\f2expression\fP/ \f2command\fP
870: .P2
871: (guard)
872: runs the command exactly once if dot contains a match of the expression.
873: This is different from
874: .CW x ,
875: which runs the command for
876: .I each
877: match:
878: .CW x
879: loops;
880: .CW g
881: merely tests, without changing the value of dot.
882: Thus,
883: .P1 2n
884: , x/Peter/ d
885: .P2
886: deletes all occurrences of
887: .CW Peter ,
888: but
889: .P1 2n
890: , g/Peter/ d
891: .P2
892: deletes the whole file (reduces it to a null string) if
893: .CW Peter
894: occurs anywhere in the text.
895: The complementary conditional is
896: .CW v ,
897: which runs the command if there is
898: .I no
899: match of the expression.
900: .PP
901: These control-structure-like commands may be composed to construct more
902: involved operations. For example, to print those lines of text that
903: contain the string
904: .CW Peter :
905: .P1 2n
906: , x/.*\en/ g/Peter/ p
907: .P2
908: The
909: .CW x
910: breaks the file into lines, the
911: .CW g
912: selects those lines containing
913: .CW Peter ,
914: and the
915: .CW p
916: prints them.
917: This command gives an address for the
918: .CW x
919: command (the whole file), but because
920: .CW g
921: does not have an explicit address, it applies to the value of
922: dot produced by the
923: .CW x
924: command, that is, to each line.
925: All commands in
926: .I sam
927: except for the command to write a file to disc use dot for the
928: default address.
929: .PP
930: Composition may be continued indefinitely.
931: .P1 2n
932: , x/.*\en/ g/Peter/ v/SaltPeter/ p
933: .P2
934: prints those lines containing
935: .CW Peter
936: but
937: .I not
938: those containing
939: .CW SaltPeter .
940: .NH 2
941: Structural Regular Expressions
942: .LP
943: Unlike other
944: .UX
945: text editors,
946: including the non-interactive ones such as
947: .I sed
948: and
949: .I awk ,
950: |reference(kernighan pike)
951: .I sam
952: is good for manipulating files with multi-line `records.'
953: An example is an on-line phone book composed of records,
954: separated by blank lines, of the form
955: .P1 2n
956: Herbert Tic
957: 44 Turnip Ave., Endive, NJ
958: 201-5555642
959:
960: Norbert Twinge
961: 16 Potato St., Cabbagetown, NJ
962: 201-5553145
963:
964: \&...
965: .P2
966: The format may be encoded as a regular expression:
967: .P1 2n
968: (.+\en)+
969: .P2
970: that is, a sequence of one or more non-blank lines.
971: The command to print Mr. Tic's entire record is then
972: .P1 2n
973: , x/(.+\en)+/ g/^Herbert Tic$/ p
974: .P2
975: and that to extract just the phone number is
976: .P1 2n
977: , x/(.+\en)+/ g/^Herbert Tic$/
978: x/^[0-9]*-[0-9]*\en/ p
979: .P2
980: The latter command breaks the file into records,
981: chooses Mr. Tic's record,
982: extracts the phone number from the record,
983: and finally prints the number.
984: .PP
985: A more involved problem is that of
986: renaming a particular variable, say
987: .CW n ,
988: to
989: .CW num
990: in a C program.
991: The obvious first attempt,
992: .P1 2n
993: , x/n/ c/num/
994: .P2
995: is badly flawed: it changes not only the variable
996: .CW n
997: but any letter
998: .CW n
999: that appears.
1000: We need to extract all the variables, and select those that match
1001: .CW n
1002: and only
1003: .CW n :
1004: .P1 2n
1005: , x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
1006: .P2
1007: The pattern
1008: .CW [A-Za-z_][A-Za-z_0-9]*
1009: matches C identifiers.
1010: Next
1011: .CW g/n/
1012: selects those containing an
1013: .CW n .
1014: Then
1015: .CW v/../
1016: rejects those containing two (or more) characters, and finally
1017: .CW c/num/
1018: changes the remainder (identifiers
1019: .CW n )
1020: to
1021: .CW num .
1022: This version clearly works much better, but there may still be problems.
1023: For example, in C character and string constants, the sequence
1024: .CW \en
1025: is interpreted as a newline character, and we don't want to change it to
1026: .CW \enum.
1027: This problem can be forestalled with a
1028: .CW y
1029: command
1030: (the following two examples are split onto multiple lines for readability):
1031: .P1 2n
1032: , y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/
1033: g/n/ v/../ c/num/
1034: .P2
1035: (the second
1036: .CW \e
1037: is necessary because of lexical conventions in regular expressions),
1038: or we could even reject character constants and strings outright:
1039: .P1 2n
1040: , y/'[^']*'/ y/"[^"]*"/
1041: x/[A-Za-z_][A-Za-z_0-9]*/
1042: g/n/ v/../ c/num/
1043: .P2
1044: The
1045: .CW y
1046: commands in this version exclude from consideration all character constants
1047: and strings.
1048: The only remaining problem is to deal with the possible occurrence of
1049: .CW \e'
1050: or
1051: .CW \e"
1052: within these sequences, but it's easy to see how to resolve this difficulty.
1053: .PP
1054: The point of these composed commands is successive refinement.
1055: A simple version of the command is tried, and if it's not good enough,
1056: it can be honed by adding a clause or two.
1057: (Mistakes can be undone; see below.
1058: Also, the mouse language makes it unnecessary to retype the command each time.)
1059: The resulting chains of commands are somewhat reminiscent of
1060: shell pipelines.|reference(kernighan pike)
1061: Unlike pipelines, though, which pass along modified
1062: .I data ,
1063: .I sam
1064: commands pass a
1065: .I view
1066: of the data.
1067: The text at each step of the command is the same, but which pieces
1068: are selected is refined step by step until the correct piece is
1069: available to the final step of the command line, which ultimately makes the change.
1070: .PP
1071: In other
1072: .UX
1073: programs, regular expressions are used only for selection,
1074: as in the
1075: .I sam
1076: .CW g
1077: command, never for extraction as in the
1078: .CW x
1079: or
1080: .CW y
1081: command.
1082: For example, patterns in
1083: .I awk
1084: |reference(kernighan pike)
1085: are used to select lines to be operated on, but cannot be used
1086: to describe the format of the input text, or to handle newline-free text.
1087: The use of regular expressions to describe the structure of a piece
1088: of text rather than its contents, as in the
1089: .CW x
1090: command,
1091: has been given a name:
1092: .I
1093: structural regular expressions.
1094: .R
1095: When they are composed, as in the above example,
1096: they are pleasantly expressive.
1097: Their use is discussed at greater length elsewhere.|reference(pike structural)
1098: .NH 2
1099: Multiple files
1100: .LP
1101: .I Sam
1102: has a few other commands, mostly relating to input and output.
1103: .P1 2n
1104: e discfilename
1105: .P2
1106: replaces the contents and name of the current file with those of the named
1107: disc file;
1108: .P1 2n
1109: w discfilename
1110: .P2
1111: writes the contents to the named disc file; and
1112: .P1 2n
1113: r discfilename
1114: .P2
1115: replaces dot with the contents of the named disc file.
1116: All these commands use the current file's name if none is specified.
1117: Finally,
1118: .P1 2n
1119: f discfilename
1120: .P2
1121: changes the name associated with the file and displays the result:
1122: .P1 2n
1123: \&'-. discfilename
1124: .P2
1125: This output is called the file's
1126: .I
1127: menu line,
1128: .R
1129: because it is the contents of the file's line in the button 3 menu (described
1130: in the
1131: next section).
1132: The first three characters are a concise notation for the state of the file.
1133: The apostrophe signifies that the file is modified.
1134: The minus sign indicates the number of windows
1135: open on the file (see the next section):
1136: .CW -
1137: means none,
1138: .CW +
1139: means one, and
1140: .CW *
1141: means more than one.
1142: Finally, the period indicates that this is the current file.
1143: These characters are useful for controlling the
1144: .CW X
1145: command, described shortly.
1146: .PP
1147: .I Sam
1148: may be started with a set of disc files (such as all the source for
1149: a program) by invoking it with a list of file names as arguments, and
1150: more may be added or deleted on demand.
1151: .P1 2n
1152: B discfile1 discfile2 ...
1153: .P2
1154: adds the named files to
1155: .I sam 's
1156: list, and
1157: .P1 2n
1158: D discfile1 discfile2 ...
1159: .P2
1160: removes them from
1161: .I sam 's
1162: memory (without effect on associated disc files).
1163: Both these commands have a syntax for using the shell|reference(kernighan pike)
1164: (the
1165: .UX
1166: command interpreter) to generate the lists:
1167: .P1 2n
1168: B <echo *.c
1169: .P2
1170: will add all C source files, and
1171: .P1 2n
1172: B <grep -l variable *.c
1173: .P2
1174: will add all C source files referencing a particular variable
1175: (the
1176: .UX
1177: command
1178: .CW grep\ -l
1179: lists all files in its arguments that contain matches of
1180: the specified regular expression).
1181: Finally,
1182: .CW D
1183: without arguments deletes the current file.
1184: .PP
1185: There are two ways to change which file is current:
1186: .P1 2n
1187: b filename
1188: .P2
1189: makes the named file current.
1190: The
1191: .CW B
1192: command
1193: does the same, but also adds any new files to
1194: .I sam 's
1195: list.
1196: (In practice, of course, the current file
1197: is usually chosen by mouse actions, not by textual commands.)
1198: The other way is to use a form of address that refers to files:
1199: .P1 2n
1200: "\f2expression\fP" \f2address\fP
1201: .P2
1202: refers to the address evaluated in the file whose menu line
1203: matches the expression (there must be exactly one match).
1204: For example,
1205: .P1 2n
1206: "peter.c" 3
1207: .P2
1208: refers to the third line of the file whose name matches
1209: .CW peter.c .
1210: This is most useful in the move
1211: .CW m ) (
1212: and copy
1213: .CW t ) (
1214: commands:
1215: .P1 2n
1216: 0,$ t "peter.c" 0
1217: .P2
1218: makes a copy of the current file at the beginning of
1219: .CW peter.c .
1220: .PP
1221: The
1222: .CW X
1223: command
1224: is a looping construct, like
1225: .CW x ,
1226: that refers to files instead of strings:
1227: .P1 2n
1228: X/\f2expression\fP/ \f2command\fP
1229: .P2
1230: runs the command in all
1231: files whose menu lines match the expression. The best example is
1232: .P1 2n
1233: X/'/ w
1234: .P2
1235: which writes to disc all modified files.
1236: .CW Y
1237: is the complement of
1238: .CW X :
1239: it runs the command on all files whose menu lines don't match the expression:
1240: .P1 2n
1241: Y/\e.c/ D
1242: .P2
1243: deletes all files that don't have
1244: .CW \&.c
1245: in their names, that is, it keeps all C source files and deletes the rest.
1246: .PP
1247: Braces allow commands to be grouped, so
1248: .P1 2n
1249: {
1250: \f2command1\fP
1251: \f2command2\fP
1252: }
1253: .P2
1254: is syntactically a single command that runs two commands.
1255: Thus,
1256: .P1 2n
1257: X/\e.c/ ,g/variable/ {
1258: f
1259: , x/.*\en/ g/variable/ p
1260: }
1261: .P2
1262: finds all occurrences of
1263: .CW variable
1264: in C source files, and prints
1265: out the file names and lines of each match.
1266: The precise semantics of compound operations is discussed in the implementation
1267: sections below.
1268: .PP
1269: Finally,
1270: the undo command,
1271: .CW u ,
1272: undoes the last command,
1273: no matter how many files were affected.
1274: Multiple undo operations move further back in time, so
1275: .P1 2n
1276: u
1277: u
1278: .P2
1279: (which may be abbreviated
1280: .CW u2 )
1281: undoes the last two commands. An undo may not be undone, however, nor
1282: may any command that adds or deletes files.
1283: Everything else is undoable, though, including for example
1284: .CW e
1285: commands:
1286: .P1 2n
1287: e filename
1288: u
1289: .P2
1290: restores the state of the file completely, including its name, dot,
1291: and modified bit. Because of the undo, potentially dangerous commands
1292: are not guarded by confirmations. Only
1293: .CW D ,
1294: which destroys the information necessary to restore itself, is protected.
1295: It will not delete a modified file, but a second
1296: .CW D
1297: of the same file will succeed regardless.
1298: The
1299: .CW q
1300: command, which exits
1301: .I sam ,
1302: is similarly guarded.
1303: .NH 2
1304: Mouse Interface
1305: .LP
1306: .I Sam
1307: is most commonly run
1308: connected to a bitmap display and mouse for interactive editing.
1309: The only difference in the command language
1310: between regular, mouse-driven
1311: .I sam
1312: and
1313: .CW sam\ -d
1314: is that if an address
1315: is provided without a command,
1316: .CW sam\ -d
1317: will print the text referenced by the address, but
1318: regular
1319: .I sam
1320: will highlight it on the screen \(em in fact,
1321: dot is always highlighted (see Figure 2).
1322: .WS 1
1323: .1C
1324: .KF
1325: .IB fig2.ps 1.47i 4.61i c
1326: .Cs
1327: Figure 2. A
1328: .I sam
1329: window. The scroll bar down the left
1330: represents the file, with the bubble showing the fraction
1331: visible in the window.
1332: The scroll bar may be manipulated by the mouse for convenient browsing.
1333: The current text,
1334: which is highlighted, need not fit on a line. Here it consists of one partial
1335: line, one complete line, and a final partial line.
1336: .Ce
1337: .KE
1338: .2C
1339: .PP
1340: Each file may have zero or more windows open on the display.
1341: At any time, only one window in all of
1342: .I sam
1343: is the
1344: .I
1345: current window,
1346: .R
1347: that is, the window to which typing and mouse actions refer;
1348: this may be the
1349: .I sam
1350: window (that in which commands may be typed)
1351: or one of the file windows.
1352: When a file has multiple windows, the image of the file in each window
1353: is always kept up to date.
1354: The current file is the last file affected by a command,
1355: so if the
1356: .I sam
1357: window is current,
1358: the current window is not a window on the current file.
1359: However, each window on a file has its own value of dot,
1360: and when switching between windows on a single file,
1361: the file's value of dot is changed to that of the window.
1362: Thus, flipping between windows behaves in the obvious, convenient way.
1363: .PP
1364: The mouse on the Blit has three buttons, numbered left to right.
1365: Button 3 has a list of commands to manipulate windows,
1366: followed by a list of `menu lines' exactly as printed by the
1367: .CW f
1368: command, one per file (not one per window).
1369: These menu lines are sorted by file name.
1370: If the list is long, the Blit menu software will make it more manageable
1371: by generating a scrolling menu instead of an unwieldy long list.
1372: Using the menu to select a file from the list makes that file the current
1373: file, and the most recently current window in that file the current window.
1374: But if that file is already current, selecting it in the menu cycles through
1375: the windows on the file; this simple trick avoids a special menu to
1376: choose windows on a file.
1377: If there is no window open on the file,
1378: .I sam
1379: changes the mouse cursor to prompt the user to create one.
1380: .PP
1381: The commands on the button 3 menu are straightforward (see Figure 3), and
1382: are like the commands to manipulate windows in
1383: .I mux ,
1384: |reference(latest volume1)
1385: the Blit's window system.
1386: .CW new
1387: makes a new file, and gives it one empty window, whose size is determined
1388: by a rectangle swept by the mouse.
1389: .CW xerox
1390: prompts for a window to be selected, and
1391: makes a clone of that window; this is how multiple windows are created on one file.
1392: .CW reshape
1393: changes the size of the indicated window, and
1394: .CW close
1395: deletes it. If that is the last window open on the file,
1396: .CW close
1397: first does a
1398: .CW D
1399: command on the file.
1400: .CW write
1401: is identical to a
1402: .CW w
1403: command on the file; it is in the menu purely for convenience.
1404: Finally,
1405: .CW ~~sam~~
1406: is a menu item that appears between the commands and the file names.
1407: Selecting it makes the
1408: .I sam
1409: window the current window,
1410: causing subsequent typing to be interpreted as commands.
1411: .KF
1412: .IB fig3.ps 1.98i 1.39i c
1413: .Cs
1414: Figure 3. The menu on button 3.
1415: The black rectangle on the left is a scroll bar; the menu is limited to
1416: the length shown to prevent its becoming unwieldy.
1417: Above the
1418: .CW ~~sam~~
1419: line is a list of commands;
1420: beneath it is a list of files, presented exactly as with the
1421: .CW f
1422: command.
1423: .Ce
1424: .KE
1425: .PP
1426: When
1427: .I sam
1428: requests that a window be swept, in response to
1429: .CW new ,
1430: .CW xerox
1431: or
1432: .CW reshape ,
1433: it changes the mouse cursor from the usual arrow to a box with
1434: a small arrow.
1435: In this state, the mouse may be used to indicate an arbitrary rectangle by
1436: pressing button 3 at one corner and releasing it at the opposite corner.
1437: More conveniently,
1438: button 3 may simply be clicked,
1439: whereupon
1440: .I sam
1441: creates the maximal rectangle that contains the cursor
1442: and abuts the
1443: .I sam
1444: window.
1445: By placing the
1446: .I sam
1447: window in the middle of the screen, the user can define two regions (one above,
1448: one below) in which stacked fully-overlapping
1449: windows can be created with minimal fuss (see Figure 1).
1450: This simple user interface trick makes window creation noticeably easier.
1451: .PP
1452: The cut-and-paste editor is essentially the same as that in Smalltalk-80.
1453: |reference(smalltalk80 goldberg)
1454: The text in dot is always highlighted on the screen.
1455: When a character is typed it replaces dot, and sets dot to the null
1456: string after the character. Thus, ordinary typing inserts text.
1457: Button 1 is used for selection:
1458: pressing the button, moving the mouse, and lifting the button
1459: selects (sets dot to) the text between the points where the
1460: button was pressed and released.
1461: Pressing and releasing at the same point selects a null string; this
1462: is called clicking. Clicking twice quickly, or
1463: .I
1464: double clicking,
1465: .R
1466: selects larger objects;
1467: for example, double clicking in a word selects the word,
1468: double clicking just inside an opening bracket selects the text
1469: contained in the brackets (handling nested brackets correctly),
1470: and similarly for
1471: parentheses, quotes, and so on.
1472: The double-clicking rules reflect a bias toward
1473: programmers.
1474: If
1475: .I sam
1476: were intended more for word processing, double-clicks would probably
1477: select linguistic structures such as sentences.
1478: .PP
1479: If button 1 is pressed outside the current window, it makes the indicated
1480: window current.
1481: This is the easiest way to switch between windows and files.
1482: .PP
1483: Pressing button 2 brings up a menu of editing functions (see Figure 4).
1484: These mostly apply to the selected text:
1485: .CW cut
1486: deletes the selected text, and remembers it in a hidden buffer called the
1487: .I
1488: snarf buffer,
1489: .R
1490: .CW paste
1491: replaces the selected text by the contents of the snarf buffer,
1492: .CW snarf
1493: just copies the selected text to the snarf buffer,
1494: .CW look
1495: searches forward for the next literal occurrence of the selected text, and
1496: .CW <mux>
1497: exchanges snarf buffers with the window system in which
1498: .I sam
1499: is running.
1500: Finally, the last regular expression used appears as a menu entry
1501: to search
1502: forward for the next occurrence of a match for the expression.
1503: .WS 1
1504: .KF
1505: .IB fig4.ps 0.87i 0.81i c
1506: .Cs
1507: Figure 4. The menu on button 2.
1508: The bottom entry tracks the most recently used regular expression, which may
1509: be literal text.
1510: .Ce
1511: .KE
1512: .PP
1513: The relationship between the command language and the mouse language is
1514: entirely due to the equality of dot and the selected text chosen
1515: with button 1 on the mouse.
1516: For example, to make a set of changes in a C subroutine, dot can be
1517: set by double clicking on the left brace that begins the subroutine,
1518: which sets dot for the command language.
1519: An address-free command then typed in the
1520: .I sam
1521: window will apply only to the text between the opening and closing
1522: braces of the function.
1523: The idea is to select what you want, and then say what you want
1524: to do with it, whether invoked by a menu selection or by a typed command.
1525: And of course, the value of dot is highlighted on
1526: the display after the command completes.
1527: This relationship between mouse interface and command language
1528: is clumsy to explain, but comfortable, even natural, in practice.
1529: .NH
1530: The Implementation
1531: .LP
1532: The next few sections describe how
1533: .I sam
1534: is put together, first the host part,
1535: then the inter-component communication,
1536: then the terminal part.
1537: After explaining how the command language is implemented,
1538: the discussion follows (roughly) the path of a character
1539: from the temporary file on disc to the screen.
1540: The presentation centers on the data structures,
1541: because that is how the program was designed and because
1542: the algorithms are easy to provide, given the right data
1543: structures.
1544: ........
1545: .NH 2
1546: Parsing and execution
1547: .LP
1548: The command language is interpreted by parsing each command with a
1549: table-driven recursive
1550: descent parser, and when a complete command is assembled, invoking a top-down
1551: executor.
1552: Most editors instead employ a simple character-at-a-time
1553: lexical scanner.
1554: Use of a parser makes it
1555: easy and unambiguous to detect when a command is complete,
1556: which has two advantages.
1557: First, escape conventions such as backslashes to quote
1558: multiple-line commands are unnecessary; if the command isn't finished,
1559: the parser keeps reading. For example, a multiple-line append driven by an
1560: .CW x
1561: command is straightforward:
1562: .P1 2n
1563: x/.*\en/ g/Peter/ a
1564: one line about Peter
1565: another line about Peter
1566: \&.
1567: .P2
1568: Other
1569: .UX
1570: editors would require a backslash after all but the last line.
1571: .PP
1572: The other advantage is specific to the two-process structure of
1573: .I sam .
1574: The host process must decide when a command is completed so the
1575: command interpreter can be called. This problem is easily resolved
1576: by having the lexical analyzer read the single stream of events from the
1577: terminal, directly executing all typing and mouse commands,
1578: but passing to the parser characters typed to the
1579: .I sam
1580: command window.
1581: This scheme is slightly complicated by the availability of cut-and-paste
1582: editing in the
1583: .I sam
1584: window, but that difficulty is resolved by applying the rules
1585: used in
1586: .I mux :
1587: when a newline is typed to the
1588: .I sam
1589: window, all text between the newline and the previously typed newline
1590: is made available to the parser.
1591: This permits arbitrary editing to be done to a command before
1592: typing newline and thereby requesting execution.
1593: .PP
1594: The parser is driven by a table because the syntax of addresses
1595: and commands is regular enough
1596: to be encoded compactly. There are few special cases, such as the
1597: replacement text in a substitution, so the syntax of almost all commands
1598: can be encoded with a few flags.
1599: These include whether the command allows an address (for example,
1600: .CW e
1601: does not), whether it takes a regular expression (as in
1602: .CW x
1603: and
1604: .CW s ),
1605: whether it takes replacement text (as in
1606: .CW c
1607: or
1608: .CW i ),
1609: which may be multi-line, and so on.
1610: The internal syntax of regular expressions is handled by a separate
1611: parser; a regular expression is a leaf of the command parse tree.
1612: Regular expressions are discussed fully in the next section.
1613: .PP
1614: The parser table also has information about defaults, so the interpreter
1615: is always called with a complete tree. For example, the parser fills in
1616: the implicit
1617: .CW 0
1618: and
1619: .CW $
1620: in the abbreviated address
1621: .CW ,
1622: (comma),
1623: inserts a
1624: .CW +
1625: to the left of an unadorned regular expression in an address,
1626: and provides the usual default address
1627: .CW .
1628: (dot) for commands that expect an address but are not given one.
1629: .PP
1630: Once a complete command is parsed, the evaluation is easy.
1631: The address is evaluated left-to-right starting from the value of dot,
1632: with a mostly ordinary expression evaluator.
1633: Addresses, like many of the data structures in
1634: .I sam ,
1635: are held in a C structure and passed around by value:
1636: .P1 2n
1637:
1638: typedef long Posn; /* file position */
1639: typedef struct Range{
1640: Posn p1, p2;
1641: }Range;
1642: typedef struct Address{
1643: Range r;
1644: File *f;
1645: }Address;
1646: .P2
1647: .1C
1648: .KF top
1649: .SP 1
1650: .P1 10n
1651: Address
1652: address(ap, a, sign)
1653: Addrtree *ap;
1654: Address a;
1655: int sign;
1656: {
1657: Address a2;
1658: do
1659: switch(ap->type){
1660: case '.':
1661: a=a.f->dot;
1662: break;
1663: case '$':
1664: a.r.p1=a.r.p2=a.f->nbytes;
1665: break;
1666: case '"':
1667: a=matchfile(a, ap->aregexp)->dot;
1668: break;
1669: case ',':
1670: a2=address(ap->right, a, 0);
1671: a=address(ap->left, a, 0);
1672: if(a.f!=a2.f || a2.r.p2<a.r.p1)
1673: error(Eorder);
1674: a.r.p2=a2.r.p2;
1675: return a;
1676: /* and so on */
1677: }
1678: while((ap=ap->right)!=0);
1679: return a;
1680: }
1681: .P2
1682: .ce
1683: \fBDisplay 1.\fP The address interpreter
1684: .KE
1685: .2C
1686: An address is encoded as a substring (character positions
1687: .CW p1
1688: to
1689: .CW p2 )
1690: in a file
1691: .CW f .
1692: (The data type
1693: .CW File
1694: is described in detail below.)
1695: .PP
1696: The address interpreter (shown in Display 1) is an
1697: .CW Address -valued
1698: function that traverses the parse tree describing an address (the
1699: parse tree for the address has type
1700: .CW Addrtree ).
1701: .PP
1702: Throughout, errors are handled by a non-local
1703: .CW goto
1704: (a
1705: .CW setjmp/longjmp
1706: in C terminology)
1707: hidden in a routine called
1708: .CW error
1709: that immediately aborts the execution, retracts any
1710: partially made changes (see the section below on `undoing'), and
1711: returns to the top level of the parser.
1712: The argument to
1713: .CW error
1714: is an enumeration type that
1715: is translated to a terse but possibly helpful
1716: message such as
1717: .CW "?addresses out of order" .
1718: Very common messages are kept short; for example the message for
1719: a failed regular expression search is
1720: .CW ?search .
1721: .PP
1722: Character addresses such as
1723: .CW #3
1724: are trivial to implement, as the
1725: .CW File
1726: data structure is accessible by character number.
1727: However,
1728: .I sam
1729: keeps no information about the position of newlines \(em it is too
1730: expensive to track dynamically \(em so line addresses are computed by reading
1731: the file, counting newlines. Except in very large files, this has proven
1732: acceptable: file access is fast enough to make the technique practical,
1733: and lines are not central to the structure of the command language.
1734: .PP
1735: The command interpreter, called
1736: .CW cmdexec ,
1737: is also straightforward. The parse table includes a
1738: function to call to interpret a particular command. That function
1739: receives as arguments
1740: the calculated address
1741: for the command
1742: and the command tree (of type
1743: .CW Cmdtree ),
1744: which may contain information such as the subtree for compound commands.
1745: For example, the function for the
1746: .CW g
1747: and
1748: .CW v
1749: commands is shown in Display 2.
1750: .CW Compile "" (
1751: and
1752: .CW execute
1753: are part of the regular expression code, described in the next section.)
1754: Because the parser and the
1755: .CW File
1756: data structure do most of the work, most commands
1757: are similarly brief.
1758: .1C
1759: .KF bottom
1760: .P1 10n
1761: int
1762: g_cmd(a, cp)
1763: Address a;
1764: Cmdtree *cp;
1765: {
1766: compile(cp->regexp);
1767: if(execute(a.f, a.r.p1, a.r.p2) != (cp->cmdchar=='v')){
1768: a.f->dot=a;
1769: return cmdexec(a, cp->subcmd);
1770: }
1771: return TRUE; /* indicate that execution is to continue */
1772: }
1773: .P2
1774: .ce
1775: \fBDisplay 2. \f(CWg\fR and \f(CWv\fP commands
1776: .SP
1777: .KE
1778: .2C
1779: .NH 2
1780: Regular expressions
1781: .LP
1782: The regular expression code in
1783: .I sam
1784: is an interpreted, rather than compiled on-the-fly, implementation of Thompson's
1785: non-deterministic finite automaton algorithm.|reference(thompson regular expression)
1786: The syntax and semantics of the expressions are as in the
1787: .UX
1788: program
1789: .I egrep ,
1790: including alternation, closures, character classes, and so on.
1791: The only changes in the notation are two additions:
1792: .CW \en
1793: is translated to, and matches, a newline character, and
1794: .CW @
1795: matches any character. In
1796: .I egrep ,
1797: the character
1798: .CW \&.
1799: matches any character except newline, and in
1800: .I sam
1801: the same rule seemed safest, to prevent idioms like
1802: .CW \&.*
1803: from spanning newlines.
1804: .I Egrep
1805: expressions are arguably too complicated for an interactive editor \(em
1806: certainly it would make sense if all the special characters were two-character
1807: sequences, so that most of the punctuation characters wouldn't have
1808: peculiar meanings \(em but for an interesting command language, full
1809: regular expressions are necessary, and
1810: .I egrep
1811: defines the full regular expression syntax for
1812: .UX
1813: programs.
1814: Also, it seemed superfluous to define a new syntax, since various
1815: .UX
1816: programs
1817: .I ed , (
1818: .I egrep
1819: and
1820: .I vi )
1821: define too many already.
1822: .PP
1823: The expressions are compiled by a routine,
1824: .CW compile ,
1825: that generates the description of the non-deterministic finite state machine.
1826: A second routine,
1827: .CW execute ,
1828: interprets the machine to generate the leftmost-longest match of the
1829: expression in a substring of the file.
1830: The algorithm is described elsewhere.|reference(thompson regular expression)|reference(aho hopcroft ullman)
1831: .CW Execute
1832: reports
1833: whether a match was found, and sets a global variable,
1834: of type
1835: .CW Range ,
1836: to the substring matched.
1837: .PP
1838: A trick is required to evaluate the expression in reverse, such as when
1839: searching backwards for an expression.
1840: For example,
1841: .P1 2n
1842: -/P.*r/
1843: .P2
1844: looks backwards through the file for a match of the expression.
1845: The expression, however, is defined for a forward search.
1846: The solution is to construct a machine identical to the machine
1847: for a forward search except for a reversal of all the concatenation
1848: operators (the other operators are symmetric under direction reversal),
1849: to exchange the meaning of the operators
1850: .CW ^
1851: and
1852: .CW $ ,
1853: and then to read the file backwards, looking for the
1854: usual earliest longest match.
1855: .PP
1856: .CW Execute
1857: generates only one match each time it is called.
1858: To interpret looping constructs such as the
1859: .CW x
1860: command,
1861: .I sam
1862: must therefore synchronize between
1863: calls of
1864: .CW execute
1865: to avoid
1866: problems with null matches.
1867: For example, even given the leftmost-longest rule,
1868: the expression
1869: .CW a*
1870: matches three times in the string
1871: .CW ab
1872: (the character
1873: .CW a ,
1874: the null string between the
1875: .CW a
1876: and
1877: .CW b ,
1878: and the final null string).
1879: After returning a match for the
1880: .CW a ,
1881: .I sam
1882: must not match the null string before the
1883: .CW b .
1884: The algorithm starts
1885: .CW execute
1886: at the end of its previous match, and
1887: if the match it returns
1888: is null and abuts the previous match, rejects the match and advances
1889: the initial position one character.
1890: .NH 2
1891: Memory allocation
1892: .LP
1893: The C language has no memory allocation primitives, although a standard
1894: library routine,
1895: .I malloc ,
1896: provides adequate service for simple programs.
1897: For specific uses, however,
1898: it can be better to write a custom allocator.
1899: The allocator (or rather, pair of allocators) described here
1900: work in both the terminal and host parts of
1901: .I sam .
1902: They are designed for efficient manipulation of strings,
1903: which are allocated and freed frequently and vary in length from essentially
1904: zero to 32 Kbytes (very large strings are written to disc).
1905: More important, strings may be large and change size often,
1906: so to minimize memory usage it is helpful to reclaim and to coalesce the
1907: unused portions of strings when they are truncated.
1908: .PP
1909: Objects to be allocated in
1910: .I sam
1911: are of two flavors:
1912: the first is C
1913: .CW structs ,
1914: which are small and often addressed by pointer variables;
1915: the second is variable-sized arrays of characters
1916: or integers whose
1917: base pointer is always used to access them.
1918: The memory allocator in
1919: .I sam
1920: is therefore in two parts:
1921: first, a traditional first-fit allocator that provides fixed storage for
1922: .CW structs ;
1923: and second, a garbage-compacting allocator that reduces storage
1924: overhead for variable-sized objects, at the cost of some bookkeeping.
1925: The two types of objects are allocated from adjoining arenas, with
1926: the garbage-compacting allocator controlling the arena with higher addresses.
1927: Separating into two arenas simplifies compaction and prevents fragmentation due
1928: to immovable objects.
1929: The access rules for garbage-compactable objects
1930: (discussed in the next paragraph) allow them to be relocated, so when
1931: the first-fit arena needs space, it moves the garbage-compacted arena
1932: to higher addresses to make room. Storage is therefore created only
1933: at successively higher addresses, either when more garbage-compacted
1934: space is needed or when the first-fit arena pushes up the other arena.
1935: .PP
1936: Objects that may be compacted declare to the
1937: allocator a cell that is guaranteed to be the sole repository of the
1938: address of the object whenever a compaction can occur.
1939: The compactor can then update the address when the object is moved.
1940: For example, the implementation of type
1941: .CW List
1942: (really a variable-length array)
1943: is:
1944: .P1 2n
1945: typedef struct List{
1946: int nused;
1947: long *ptr;
1948: }List;
1949: .P2
1950: The
1951: .CW ptr
1952: cell must always be used directly, and never copied. When a
1953: .CW List
1954: is to be created the
1955: .CW List
1956: structure is allocated in the ordinary first-fit arena
1957: and its
1958: .CW ptr
1959: is allocated in the garbage-compacted arena.
1960: A similar data type for strings, called
1961: .CW String ,
1962: stores variable-length character arrays of up to 32767 elements.
1963: .PP
1964: A related matter of programming style:
1965: .I sam
1966: frequently passes structures by value, which
1967: simplifies the code.
1968: Traditionally, C programs have
1969: passed structures by reference, but implicit allocation on
1970: the stack is easier to use.
1971: Structure passing is a relatively new feature of C
1972: (it is not in the
1973: standard reference manual for C|reference(cbook)),
1974: and is poorly supported in most
1975: commercial C compilers.
1976: It's convenient and expressive, though,
1977: and simplifies memory management by
1978: avoiding the allocator altogether
1979: and eliminating pointer aliases.
1980: .....
1981: .NH 2
1982: Data structures for manipulating files
1983: .LP
1984: Experience with
1985: .I jim
1986: showed that the requirements
1987: of the file data structure were few, but strict.
1988: First, files need to be read and written quickly;
1989: adding a fresh file must be painless.
1990: Second, the implementation must place no arbitrary upper limit on
1991: the number or sizes of files. (It should be practical to edit many files,
1992: and files up to megabytes in length should be handled gracefully.)
1993: This implies that files be stored on disc, not in main memory.
1994: (Aficionados of virtual memory may argue otherwise, but the
1995: implementation of virtual
1996: memory in our system is not something to depend on
1997: for good performance.)
1998: Third, changes to files need be made by only two primitives:
1999: deletion and insertion.
2000: These are inverses of each other,
2001: which simplifies the implementation of the undo operation.
2002: Finally,
2003: it must be easy and efficient to access the file, either
2004: forwards or backwards, a byte at a time.
2005: .PP
2006: The
2007: .CW File
2008: data type is constructed from three simpler data structures that hold arrays
2009: of characters.
2010: Each of these types has an insertion and deletion operator, and the
2011: insertion and deletion operators of the
2012: .CW File
2013: type itself are constructed from them.
2014: .PP
2015: The simplest type is the
2016: .CW String ,
2017: which is used to hold strings in main memory.
2018: The code that manages
2019: .CW Strings
2020: guarantees that they will never be longer
2021: than some moderate size, and in practice they are rarely larger than 8 Kbytes.
2022: .CW Strings
2023: have two purposes: they hold short strings like file names with little overhead,
2024: and because they are deliberately small, they are efficient to modify.
2025: They are therefore used as the data structure for in-memory caches.
2026: .PP
2027: The disc copy of the file is managed by a data structure called a
2028: .CW Disc ,
2029: which corresponds to a temporary file. A
2030: .CW Disc
2031: has no storage in main memory other than bookkeeping information;
2032: the actual data being held is all on the disc.
2033: To reduce the number of open files needed,
2034: .I sam
2035: opens a dozen temporary
2036: .UX
2037: files and multiplexes the
2038: .CW Discs
2039: upon them.
2040: This permits many files to
2041: be edited; the entire
2042: .I sam
2043: source (48 files) may be edited comfortably with a single
2044: instance of
2045: .I sam .
2046: Allocating one temporary file per
2047: .CW Disc
2048: would strain the operating system's limit on the number of open files.
2049: Also, spreading the traffic among temporary files keeps the files shorter,
2050: and shorter files are more efficiently implemented by the
2051: .UX
2052: I/O subsystem.
2053: .PP
2054: A
2055: .CW Disc
2056: is an array of fixed-length blocks, each of which contains
2057: between 1 and 4096 characters of active data.
2058: (The block size of our
2059: .UX
2060: file system is 4096 bytes.)
2061: The block addresses within the temporary file and the length of each
2062: block are stored in a
2063: .CW List .
2064: When changes are made the live part of blocks may change size.
2065: Blocks are created and coalesced when necessary to try to keep the sizes
2066: between 2048 and 4096 bytes.
2067: An actively changing part of the
2068: .CW Disc
2069: therefore typically has about a kilobyte of slop that can be
2070: inserted or deleted
2071: without changing more than one block or affecting the block order.
2072: When an insertion would overflow a block, the block is split, a new one
2073: is allocated to receive the overflow, and the memory-resident list of blocks
2074: is rearranged to reflect the insertion of the new block.
2075: .PP
2076: Obviously, going to the disc for every modification to the file is
2077: prohibitively expensive.
2078: The data type
2079: .CW Buffer
2080: consists of a
2081: .CW Disc
2082: to hold the data and a
2083: .CW String
2084: that acts as a cache.
2085: This is the first of a series of caches throughout the data structures in
2086: .I sam.
2087: The caches not only improve performance, they provide a way to organize
2088: the flow of data, particularly in the communication between the host
2089: and terminal.
2090: This idea is developed below, in the section on communications.
2091: .PP
2092: To reduce disc traffic, changes to a
2093: .CW Buffer
2094: are mediated by a variable-length string, in memory, that acts as a cache.
2095: When an insertion or deletion is made to a
2096: .CW Buffer ,
2097: if the change can be accommodated by the cache, it is done there.
2098: If the cache becomes bigger than a block because of an insertion,
2099: some of it is written to the
2100: .CW Disc
2101: and deleted from the cache.
2102: If the change does not intersect the cache, the cache is flushed.
2103: The cache is only loaded at the new position if the change is smaller than a block;
2104: otherwise, it is sent directly to the
2105: .CW Disc .
2106: This is because
2107: large changes are typically sequential,
2108: whereupon the next change is unlikely to overlap the current one.
2109: .PP
2110: A
2111: .CW File
2112: comprises a
2113: .CW String
2114: to hold the file name and some ancillary data such as dot and the modified bit.
2115: The most important components, though, are a pair of
2116: .CW Buffers ,
2117: one called the transcript and the other the contents.
2118: Their use is described in the next section.
2119: .PP
2120: The overall structure is shown in Figure 5.
2121: Although it may seem that the data is touched many times on its
2122: way from the
2123: .CW Disc ,
2124: it is read (by one
2125: .UX
2126: system call) directly into the cache of the
2127: associated
2128: .CW Buffer ;
2129: no extra copy is done.
2130: Similarly, when flushing the cache, the text is written
2131: directly from the cache to disc.
2132: Most operations act directly on the text in the cache.
2133: A principle applied throughout
2134: .I sam
2135: is that the fewer times the data is copied, the faster the program will run
2136: (see also the paper by Waite|reference(waite lexical analysis)).
2137: .1C
2138: .KF
2139: .PS
2140: copy "fig5.pic"
2141: .PE
2142: .Cs
2143: Figure 5. File data structures.
2144: The temporary files are stored in the standard repository for such files
2145: on the host system.
2146: .Ce
2147: .KE
2148: .2C
2149: .PP
2150: The contents of a
2151: .CW File
2152: are accessed by a routine that
2153: copies to a buffer a substring of a file starting at a specified offset.
2154: To read a byte at a time, a
2155: .CW File "" per-
2156: array is loaded starting from a specified initial position,
2157: and bytes may then be read from the array.
2158: The implementation is done by a macro similar to the C standard I/O
2159: .I getc
2160: macro.|reference(cbook)
2161: Because the reading may be done at any address, a minor change to the
2162: macro allows the file to be read backwards.
2163: This array is read-only; there is no
2164: .I putc .
2165: .NH 2
2166: Doing and undoing
2167: .LP
2168: .I Sam
2169: has an unusual method for managing changes to files.
2170: The command language makes it easy to specify multiple variable-length changes
2171: to a file millions of bytes long, and such changes
2172: must be made efficiently if the editor is to be practical.
2173: The usual techniques for inserting and deleting strings
2174: are inadequate under these conditions.
2175: The
2176: .CW Buffer
2177: and
2178: .CW Disc
2179: data structures are designed for efficient random access to long strings,
2180: but care must be taken to avoid super-linear behavior when making
2181: many changes simultaneously.
2182: .PP
2183: .I Sam
2184: uses a two-pass algorithm for making changes, and treats each file as a database
2185: against which transactions are registered.
2186: Changes are not made directly to the contents.
2187: Instead, when a command is started, a `mark' containing
2188: a sequence number is placed in the transcript
2189: .CW Buffer ,
2190: and each change made to the file, either an insertion or deletion
2191: or a change to the file name,
2192: is appended to the end of the transcript.
2193: When the command is complete, the transcript is rewound to the
2194: mark and applied to the contents.
2195: .PP
2196: One reason for separating evaluation from
2197: application in this way is to simplify tracking the addresses of changes
2198: made in the middle of a long sequence.
2199: The two-pass algorithm also allows all changes to apply to the
2200: .I original
2201: data: no change can affect another change made in the same command.
2202: This is particularly important when evaluating an
2203: .CW x
2204: command because it prevents regular expression matches
2205: from stumbling over changes made earlier in the execution.
2206: Also, the two-pass
2207: algorithm is cleaner than the way other
2208: .UX
2209: editors allow changes to
2210: affect each other;
2211: for example,
2212: .I ed 's
2213: idioms to do things like delete every other line
2214: depend critically on the implementation.
2215: Instead,
2216: .I sam 's
2217: simple model, in which all changes in a command occur effectively
2218: simultaneously, is easy to explain and to understand.
2219: .PP
2220: The records in the transcript are of the form ``delete substring from
2221: locations
2222: 123 to 456'' and ``insert 11 characters `hello there' at location 789.''
2223: (It is an error if the changes are not at monotonically greater
2224: positions through the file.)
2225: While the update is occurring, these numbers must be
2226: offset by earlier changes, but that is straightforward and
2227: local to the update routine;
2228: moreover, all the numbers have been computed
2229: before the first is examined.
2230: .PP
2231: Treating the file as a transaction system has another advantage:
2232: undo is trivial.
2233: All it takes is to invert the transcript after it has been
2234: implemented, converting insertions
2235: into deletions and vice versa, and saving them in a holding
2236: .CW Buffer .
2237: The `do' transcript can then be deleted from
2238: the transcript
2239: .CW Buffer
2240: and replaced by the `undo' transcript.
2241: If an undo is requested, the transcript is rewound and the undo transcript
2242: executed.
2243: Because the transcript
2244: .CW Buffer
2245: is not truncated after each command, it accumulates
2246: successive changes.
2247: A sequence of undo commands
2248: can therefore back up the file arbitrarily,
2249: which is more helpful than the more commonly implemented self-inverse form of undo.
2250: .I Sam "" (
2251: provides no way to undo an undo, but if it were desired,
2252: it would be easy to provide by re-interpreting the `do' transcript.)
2253: Each mark in the transcript contains a sequence number and the offset into
2254: the transcript of the previous mark, to aid in unwinding the transcript.
2255: Marks also contain the value of dot and the modified bit so these can be
2256: restored easily.
2257: Undoing multiple files is easy; it merely demands undoing all files whose
2258: latest change has the same sequence number as the current file.
2259: .PP
2260: Another benefit of having a transcript is that errors encountered in the middle
2261: of a complicated command need not leave the files in an intermediate state.
2262: By rewinding the transcript to the mark beginning the command,
2263: the partial command can be trivially undone.
2264: .PP
2265: When the update algorithm was first implemented, it was unacceptably slow,
2266: so a cache was added to coalesce nearby changes,
2267: replacing multiple small changes by a single larger one.
2268: This reduced the number
2269: of insertions into the transaction
2270: .CW Buffer ,
2271: and made a dramatic improvement in performance,
2272: but made it impossible
2273: to handle changes in non-monotonic order in the file; the caching method
2274: only works if changes don't overlap.
2275: Before the cache was added, the transaction could in principle be sorted
2276: if the changes were out of order, although
2277: this was never done.
2278: The current status is therefore acceptable performance with a minor
2279: restriction on global changes, which is sometimes, but rarely, an annoyance.
2280: .PP
2281: The update algorithm obviously paws the data more than simpler
2282: algorithms, but it is not prohibitively expensive;
2283: the caches help.
2284: (The principle of avoiding copying the data is still honored here,
2285: although not as piously:
2286: the data is moved from contents' cache to
2287: the transcript's all at once and through only one internal buffer.)
2288: Performance figures confirm the efficiency.
2289: To read from a dead start a hundred kilobyte file on a VAX-11/750
2290: takes 1.4 seconds of user time, 2.5 seconds of system time,
2291: and 5 seconds of real time.
2292: Reading the same file in
2293: .I ed
2294: takes 6.0 seconds of user time, 1.7 seconds of system time,
2295: and 8 seconds of real time.
2296: .I Sam
2297: uses about half the CPU time.
2298: A more interesting example is the one stated above:
2299: inserting a character between every pair of characters in the file.
2300: The
2301: .I sam
2302: command is
2303: .P1 2n
2304: ,y/@/ a/x/
2305: .P2
2306: and takes 3 CPU seconds per kilobyte of input file, of which
2307: about a third is spent in the regular expression code.
2308: This translates to about 500 changes per second.
2309: .I Ed
2310: takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines),
2311: but cannot undo it.
2312: The same example in
2313: .I ex ,|reference(bsdmanual)
2314: a variant of
2315: .I ed
2316: done at the University of California at Berkeley,
2317: which allows one level of undoing, again takes 3 seconds.
2318: In summary,
2319: .I sam 's
2320: performance is comparable to that of other
2321: .UX
2322: editors, although it solves
2323: a harder problem.
2324: .NH 2
2325: Communications
2326: .LP
2327: The discussion so far has described the implementation of the host part of
2328: .I sam ;
2329: the next few sections explain how a machine with mouse and bitmap display
2330: can be engaged to improve interaction.
2331: .I Sam
2332: is not the first editor to be written as two processes,|reference(fraser text editor)
2333: but its implementation
2334: has some unusual aspects.
2335: .PP
2336: There are several ways
2337: .I sam 's
2338: host and terminal parts may be connected.
2339: The first and simplest is to forgo the terminal part and use the host
2340: part's command language to edit text on an ordinary terminal.
2341: This mode is invoked by starting
2342: .I sam
2343: with the
2344: .CW -d
2345: option.
2346: With no options,
2347: .I sam
2348: runs separate host and terminal programs,
2349: communicating with a message protocol over the physical
2350: connection that joins them.
2351: Typically, the connection is an RS-232 link between a Blit
2352: (the prototypical display for
2353: .I sam )
2354: and a host running
2355: the Research Edition of the
2356: .UX
2357: operating system.|reference(latest volume1)
2358: (This is the version of the system used in the Computing Sciences Research
2359: Center at AT&T Bell Laboratories, where I work. Its relevant
2360: aspects are discussed in the Blit paper.|reference(bstj blit))
2361: The implementation of
2362: .I sam
2363: for the Sun computer runs both processes on the same machine and
2364: connects them by a pipe.
2365: .PP
2366: The low bandwidth of an RS-232 link
2367: necessitated the split between
2368: the two programs.
2369: The division is a mixed blessing:
2370: a program in two parts is much harder to write and to debug
2371: than a self-contained one,
2372: but the split makes several unusual configurations possible.
2373: The terminal may be physically separated from the host, allowing the conveniences
2374: of a mouse and bitmap display to be taken home while leaving the files at work.
2375: It is also possible to run the host part on a remote machine:
2376: .P1 2n
2377: sam -r host
2378: .P2
2379: connects to the terminal in the usual way, and then makes a call
2380: across the network to establish the host part of
2381: .I sam
2382: on the named machine.
2383: Finally, it cross-connects the I/O to join the two parts.
2384: This allows
2385: .I sam
2386: to be run on machines that do not support bitmap displays;
2387: for example,
2388: .I sam
2389: is the editor of choice on our Cray X-MP/24.
2390: .I Sam
2391: .CW -r
2392: involves
2393: .I three
2394: machines: the remote host, the terminal, and the local host.
2395: The local host's job is simple but vital: it passes the data
2396: between the remote host and terminal.
2397: .PP
2398: The host and terminal exchange messages asynchronously
2399: (rather than, say, as remote procedure calls) but there is no
2400: error detection or correction
2401: because, whatever the configuration, the connection is reliable.
2402: Because the terminal handles mundane interaction tasks such as
2403: popping up menus and interpreting the responses, the messages are about
2404: data, not actions.
2405: For example, the host knows nothing about what is displayed on the screen,
2406: and when the user types a character, the message sent to the host says
2407: ``insert a one-byte string at location 123 in file 7,'' not ``a character
2408: was typed at the current position in the current file.''
2409: In other words, the messages look very much like the transaction records
2410: in the transcripts.
2411: .PP
2412: Either the host or terminal part of
2413: .I sam
2414: may initiate a change to a file.
2415: The command language operates on the host, while typing and some
2416: mouse operations are executed directly in the terminal to optimize response.
2417: Changes initiated by the host program must be transmitted to the terminal,
2418: and
2419: vice versa.
2420: (A token is exchanged to determine which end is in control,
2421: which means that characters typed while a time-consuming command runs
2422: must be buffered and do not appear until the command is complete.)
2423: To maintain consistent information,
2424: the host and terminal track changes through a per-file
2425: data structure that records what portions of the file
2426: the terminal has received.
2427: The data structure, called a
2428: .CW Rasp
2429: (a weak pun: it's a file with holes)
2430: is held and updated by both the host and terminal.
2431: A
2432: .CW Rasp
2433: is a list of
2434: .CW Strings
2435: holding those parts of the file known to the terminal,
2436: separated by counts of the number of bytes in the interstices.
2437: Of course, the host doesn't keep a separate copy of the data (it only needs
2438: the lengths of the various pieces),
2439: but the structure is the same on both ends.
2440: .PP
2441: The
2442: .CW Rasp
2443: in the terminal doubles as a cache.
2444: Since the terminal keeps the text for portions of the file it has displayed,
2445: it need not request data from the host when revisiting old parts of the file
2446: or redrawing obscured windows, which speeds things up considerably
2447: over low-speed links.
2448: .PP
2449: It's trivial for the terminal to maintain its
2450: .CW Rasp ,
2451: because all changes made on the terminal apply to parts of the file
2452: already loaded there.
2453: Changes made by the host are compared against the
2454: .CW Rasp
2455: during the update sequence after each command.
2456: Small changes to pieces of the file loaded in the terminal
2457: are sent in their entirety.
2458: Larger changes, and changes that fall entirely in the holes,
2459: are transmitted as messages without literal data:
2460: only the lengths of the deleted and inserted strings are transmitted.
2461: When a command is completed, the terminal examines its visible
2462: windows to see if any holes in their
2463: .CW Rasps
2464: intersect the visible portion of the file.
2465: It then requests the missing data from the host,
2466: along with up to 512 bytes of surrounding data, to minimize
2467: the number of messages when visiting a new portion of the file.
2468: This technique provides a kind of two-level lazy evaluation for the terminal.
2469: The first level sends a minimum of information about
2470: parts of the file not being edited interactively;
2471: the second level waits until a change is displayed before
2472: transmitting the new data.
2473: Of course,
2474: performance is also helped by having the terminal respond immediately to typing
2475: and simple mouse requests.
2476: Except for small changes to active pieces of the file, which are
2477: transmitted to the terminal without negotiation,
2478: the terminal is wholly responsible for deciding what is displayed;
2479: the host uses the
2480: .CW Rasp
2481: only to tell the terminal what might be relevant.
2482: .PP
2483: When a change is initiated by the host,
2484: the messages to the terminal describing the change
2485: are generated by the routine that applies the transcript of the changes
2486: to the contents of the
2487: .CW File .
2488: Since changes are undone by the same update routine,
2489: undoing requires
2490: no extra code in the communications;
2491: the usual messages describing changes to the file are sufficient
2492: to back up the screen image.
2493: .PP
2494: The
2495: .CW Rasp
2496: is a particularly good example of the way caches are used in
2497: .I sam .
2498: First, it facilitates access to the active portion of the text by placing
2499: the busy text in main memory.
2500: In so doing, it provides efficient access
2501: to a large data structure that does not fit in memory.
2502: Since the form of data is to be imposed by the user, not by the program,
2503: and because characters will frequently be scanned sequentially,
2504: files are stored as flat objects.
2505: Caches help keep performance good and linear when working with such
2506: data.
2507: .PP
2508: Second, the
2509: .CW Rasp
2510: and several of the other caches have some
2511: .I read-ahead;
2512: that is, the cache is loaded with more information than is needed for
2513: the job immediately at hand.
2514: When manipulating linear structures, the accesses are usually sequential,
2515: and read-ahead can significantly reduce the average time to access the
2516: next element of the object.
2517: Sequential access is a common mode for people as well as programs;
2518: consider scrolling through a document while looking for something.
2519: .PP
2520: Finally, like any good data structure,
2521: the cache guides the algorithm, or at least the implementation.
2522: The
2523: .CW Rasp
2524: was actually invented to control the communications between the host and
2525: terminal parts, but I realized very early that it was also a form of
2526: cache. Other caches were more explicitly intended to serve a double
2527: purpose: for example, the caches in
2528: .CW Files
2529: that coalesce updates not only reduce traffic to the
2530: transcript and contents
2531: .CW Buffers ,
2532: they also clump screen updates so that complicated changes to the
2533: screen are achieved in
2534: just a few messages to the terminal.
2535: This saved me considerable work: I did not need to write special
2536: code to optimize the message traffic to the
2537: terminal.
2538: Caches pay off in surprising ways.
2539: Also, they tend to be independent, so their performance improvements
2540: are multiplicative.
2541: .NH 2
2542: Data structures in the terminal
2543: .LP
2544: The terminal's job is to display and to maintain a consistent image of
2545: pieces of the files being edited.
2546: Because the text is always in memory, the data structures are
2547: considerably simpler than those in the host part.
2548: .PP
2549: .I Sam
2550: typically has far more windows than does
2551: .I mux ,
2552: the window system within which its Blit implementation runs.
2553: .I Mux
2554: has a fairly small number of asynchronously updated windows;
2555: .I sam
2556: needs a large number of synchronously updated windows that are
2557: usually static and often fully obscured.
2558: The different tradeoffs guided
2559: .I sam
2560: away from the memory-intensive implementation of windows, called
2561: .CW Layers ,|reference(pike overlapping)
2562: used in
2563: .I mux.
2564: Rather than depending on a complete bitmap image of the display for each window,
2565: .I sam
2566: regenerates the image from its in-memory text
2567: (stored in the
2568: .CW Rasp )
2569: when necessary, although it will use such an image if it is available.
2570: Like
2571: .CW Layers ,
2572: though,
2573: .I sam
2574: uses the screen bitmap as active storage in which to update the image using
2575: .I bitblt .|reference(guibas stolfi)|reference(pike locanthi blit bitmap)
2576: The resulting organization, pictured in Figure 6,
2577: has a global array of windows, called
2578: .CW Flayers ,
2579: each of which holds an image of a piece of text held in a data structure
2580: called a
2581: .CW Frame ,
2582: which in turn represents
2583: a rectangular window full of text displayed in some
2584: .CW Bitmap .
2585: Each
2586: .CW Flayer
2587: appears in a global list that orders them all front-to-back
2588: on the display, and simultaneously as an element of a per-file array
2589: that holds all the open windows for that file.
2590: The complement in the terminal of the
2591: .CW File
2592: on the host is called a
2593: .CW Text ;
2594: each connects its
2595: .CW Flayers
2596: to the associated
2597: .CW Rasp .
2598: .1C
2599: .KF
2600: .PS
2601: copy "fig6.pic"
2602: .PE
2603: .Cs
2604: Figure 6. Data structures in the terminal.
2605: .CW Flayers
2606: are also linked together into a front-to-back list.
2607: .CW Boxes
2608: are discussed in the next section.
2609: .Ce
2610: .KE
2611: .2C
2612: .PP
2613: The
2614: .CW Bitmap
2615: for a
2616: .CW Frame
2617: contains the image of the text.
2618: For a fully visible window, the
2619: .CW Bitmap
2620: will be the screen (or at least the
2621: .CW Layer
2622: in which
2623: .I sam
2624: is being run),
2625: while for partially obscured windows the
2626: .CW Bitmap
2627: will be off-screen.
2628: If the window is fully obscured, the
2629: .CW Bitmap
2630: will be null.
2631: .PP
2632: The
2633: .CW Bitmap
2634: is a kind of cache.
2635: When making changes to the display, most of the original image will
2636: look the same in the final image, and the update algorithms exploit this.
2637: The
2638: .CW Frame
2639: software updates the image in the
2640: .CW Bitmap
2641: incrementally; the
2642: .CW Bitmap
2643: is not just an image, it is a data structure.|reference(guibas stolfi)|reference(pike locanthi blit bitmap)
2644: The job of the software that updates the display is therefore
2645: to use as much as possible of the existing image (converting the
2646: text from ASCII characters to pixels is expensive) in a sort of two-dimensional
2647: string insertion algorithm.
2648: The details of this process are described in the next section.
2649: .PP
2650: The
2651: .CW Frame
2652: software has no code to support overlapping windows;
2653: its job is to keep a single
2654: .CW Bitmap
2655: up to date.
2656: It falls to the
2657: .CW Flayer
2658: software to multiplex the various
2659: .CW Bitmaps
2660: onto the screen.
2661: The problem of maintaining overlapping
2662: .CW Flayers
2663: is easier than for
2664: .CW Layers |reference(pike overlapping)
2665: because changes are made synchronously and because the contents of the window
2666: can be reconstructed from the data stored in the
2667: .CW Frame ;
2668: the
2669: .CW Layers
2670: software
2671: makes no such assumptions.
2672: In
2673: .I sam ,
2674: the window being changed is almost always fully visible, because the current
2675: window is always fully visible, by construction.
2676: However, when multi-file changes are being made, or when
2677: more than one window is open on a file,
2678: it may be necessary to update partially obscured windows.
2679: .PP
2680: There are three cases: the window is
2681: fully visible, invisible (fully obscured), or partially visible.
2682: If fully visible, the
2683: .CW Bitmap
2684: is part of the screen, so when the
2685: .CW Flayer
2686: update routine calls the
2687: .CW Frame
2688: update routine, the screen will be updated directly.
2689: If the window is invisible,
2690: there is no associated
2691: .CW Bitmap ,
2692: and all that is necessary is to update the
2693: .CW Frame
2694: data structure, not the image.
2695: If the window is partially visible, the
2696: .CW Frame
2697: routine is called to update the image in the off-screen
2698: .CW Bitmap ,
2699: which may require regenerating it from the text of the window.
2700: The
2701: .CW Flayer
2702: code then clips this
2703: .CW Bitmap
2704: against the
2705: .CW Bitmaps
2706: of all
2707: .CW Frames
2708: in front of the
2709: .CW Frame
2710: being modified, and the remainder is copied to the display.
2711: .PP
2712: This is much faster than recreating the image off-screen
2713: for every change, or clipping all the changes made to the image
2714: during its update.
2715: Unfortunately, these caches can also consume prohibitive amounts of
2716: memory, so they are freed fairly liberally \(em after every change to the
2717: front-to-back order of the
2718: .CW Flayers .
2719: The result is that
2720: the off-screen
2721: .CW Bitmaps
2722: exist only while multi-window changes are occurring,
2723: which is the only time the performance improvement they provide is needed.
2724: Also, the user interface causes fully-obscured windows to be the
2725: easiest to make \(em
2726: creating a canonically sized and placed window requires only a button click
2727: \(em which reduces the need for caching still further.
2728: .NH 2
2729: Screen update
2730: .LP
2731: Only two low-level primitives are needed for incremental update:
2732: .I bitblt ,
2733: which copies rectangles of pixels, and
2734: .I string
2735: (which in turn calls
2736: .I bitblt ),
2737: which draws a null-terminated character string in a
2738: .CW Bitmap .
2739: A
2740: .CW Frame
2741: contains a list of
2742: .CW Boxes ,
2743: each of which defines a horizontal strip of text in the window
2744: (see Figure 7).
2745: A
2746: .CW Box
2747: has a character string
2748: .CW str ,
2749: and a
2750: .CW Rectangle
2751: .CW rect
2752: that defines the location of the strip in the window.
2753: (The text in
2754: .CW str
2755: is stored in the
2756: .CW Box
2757: separately from the
2758: .CW Rasp
2759: associated with the window's file, so
2760: .CW Boxes
2761: are self-contained.)
2762: The invariant is that
2763: the image of the
2764: .CW Box
2765: can be reproduced by calling
2766: .CW string
2767: with argument
2768: .CW str
2769: to draw the string in
2770: .CW rect ,
2771: and the resulting picture fits perfectly within
2772: .CW rect .
2773: In other words, the
2774: .CW Boxes
2775: define the tiling of the window.
2776: The tiling may be complicated by long lines of text, which
2777: are folded onto the next line.
2778: Some editors use horizontal scrolling to avoid this complication,
2779: but to be comfortable this technique requires that lines not be
2780: .I too
2781: long;
2782: .I sam
2783: has no such restriction.
2784: Also, and perhaps more importantly,
2785: .UX
2786: programs and terminals traditionally fold
2787: long lines to make their contents fully visible.
2788: .PP
2789: Two special kinds of
2790: .CW Boxes
2791: contain a single
2792: character: either a newline or a tab.
2793: Newlines and tabs are white space.
2794: A newline
2795: .CW Box
2796: always extends to the right edge of the window,
2797: forcing the following
2798: .CW Box
2799: to the next line.
2800: The width of a tab depends on where it is located:
2801: it forces the next
2802: .CW Box
2803: to begin at a tab location.
2804: Tabs also
2805: have a minimum width equivalent to a blank (blanks are
2806: drawn by
2807: .CW string
2808: and are not treated specially); newlines have a minimum width of zero.
2809: .1C
2810: .KF
2811: .PS
2812: copy "fig7.pic"
2813: .PE
2814: .sp .5
2815: .Cs
2816: Figure 7. A line of text showing its
2817: .CW Boxes .
2818: The first two blank
2819: .CW Boxes
2820: contain tabs; the last contains a newline.
2821: Spaces are handled as ordinary characters.
2822: .Ce
2823: .KE
2824: .2C
2825: .PP
2826: The update algorithms always use the
2827: .CW Bitmap
2828: image of the text (either the display or cache
2829: .CW Bitmap );
2830: they never examine the characters within a
2831: .CW Box
2832: except when the
2833: .CW Box
2834: needs to be split in two.
2835: Before a change, the window consists of a tiling of
2836: .CW Boxes ;
2837: after the change the window is tiled differently.
2838: The update algorithms rearrange the tiles in place, without
2839: backup storage.
2840: The algorithms are not strictly optimal \(em for example, they can
2841: clear a pixel that is later going to be written upon \(em
2842: but they never move a tile that doesn't need to be moved,
2843: and they move each tile at most once.
2844: .CW Frinsert
2845: on a Blit can absorb over a thousand characters a second if the strings
2846: being inserted are a few tens of characters long.
2847: .PP
2848: Consider
2849: .CW frdelete .
2850: Its job is to delete a substring from a
2851: .CW Frame
2852: and restore the image of the
2853: .CW Frame .
2854: The image of a substring has a peculiar shape (see Figure 2) comprising
2855: possibly a partial line,
2856: zero or more full lines,
2857: and possibly a final partial line.
2858: For reference, call this the
2859: .I
2860: Z-shape.
2861: .R
2862: .CW Frdelete
2863: begins by splitting, if necessary, the
2864: .CW Boxes
2865: containing the ends of
2866: the substring so the substring begins and ends on
2867: .CW Box
2868: boundaries.
2869: Because the substring is being deleted, its image is not needed,
2870: so the Z-shape is then cleared.
2871: Then, tiles (that is, the images of
2872: .CW Boxes )
2873: are copied, using
2874: .CW bitblt ,
2875: from immediately after the Z-shape to
2876: the beginning of the Z-shape,
2877: resulting in a new Z-shape.
2878: .CW Boxes "" (
2879: whose contents would span two lines in the new position must first be split.)
2880: .PP
2881: Copying the remainder of the
2882: .CW Frame
2883: tile by tile
2884: this way will clearly accomplish the deletion but eventually,
2885: typically when the copying algorithm encounters a tab or newline,
2886: the old and new
2887: .CW x
2888: coordinates of the tile
2889: to be copied are the same.
2890: This correspondence implies
2891: that the Z-shape has its beginning and ending edges aligned
2892: vertically, and a sequence of at most two
2893: .I bitblt s
2894: can be used to copy the remaining tiles.
2895: The last step is to clear out the resulting empty space at the bottom
2896: of the window;
2897: the number of lines to be cleared is the number of complete lines in the
2898: Z-shape closed by the final
2899: .I bitblt s.
2900: The final step is to merge horizontally adjacent
2901: .CW Boxes
2902: of plain text.
2903: The complete source to
2904: .CW frdelete
2905: is less than 100 lines of C.
2906: .PP
2907: .CW frinsert
2908: is more complicated because it must do four passes:
2909: one to construct the
2910: .CW Box
2911: list for the inserted string,
2912: one to reconnoitre,
2913: one to copy (in opposite order to
2914: .CW frdelete )
2915: the
2916: .CW Boxes
2917: to make the hole for the new text,
2918: and finally one to copy the new text into place.
2919: Overall, though,
2920: .CW frinsert
2921: has a similar flavor to
2922: .CW frdelete ,
2923: and needn't be described further.
2924: .CW Frinsert
2925: and its subsidiary routines comprise 211 lines of C.
2926: .PP
2927: The terminal source code is 3024 lines of C,
2928: and the host source is 5797 lines.
2929: .NH
2930: Discussion
2931: .NH 2
2932: History
2933: .LP
2934: The immediate ancestor of
2935: .I sam
2936: was the original text editor for the Blit, called
2937: .I jim .
2938: .I Sam
2939: inherited
2940: .I jim 's
2941: two-process structure and mouse language almost unchanged, but
2942: .I jim
2943: suffered from several drawbacks that were addressed in the design of
2944: .I sam .
2945: The most important of these was the lack of a command language.
2946: Although
2947: .I jim
2948: was easy to use for simple editing, it provided no direct help with
2949: large or repetitive editing tasks. Instead, it provided a command to pass
2950: selected text through a shell pipeline,
2951: but this was no more satisfactory than could be expected of a stopgap measure.
2952: .PP
2953: .I Jim
2954: was written primarily as a vehicle for experimenting with a mouse-based
2955: interface to text, and the experiment was successful.
2956: .I Jim
2957: had some spin-offs:
2958: .I mux ,
2959: the second window system for the Blit, is essentially a multiplexed
2960: version of the terminal part of
2961: .I jim ;
2962: and the debugger
2963: .I pi 's
2964: user interface|reference(cargill feel pi) was closely modeled on
2965: .I jim 's.
2966: But after a couple of years,
2967: .I jim
2968: had become difficult to maintain and limiting to use,
2969: and its replacement was overdue.
2970: .PP
2971: I began the design of
2972: .I sam
2973: by asking
2974: .I jim
2975: customers what they wanted.
2976: This was probably a mistake; the answers were essentially a list of features
2977: to be found in other editors, which did not provide any of the
2978: guiding principles I was seeking.
2979: For instance, one common request was for a ``global substitute,''
2980: but no one suggested how to provide it within a cut-and-paste editor.
2981: I was looking for a scheme that would
2982: support such specialized features comfortably in the context of some
2983: general command language.
2984: Ideas were not forthcoming, though, particularly given my insistence
2985: on removing all limits on file sizes, line lengths and so on.
2986: Even worse, I recognized that, since the mouse could easily
2987: indicate a region of the screen that was not an integral number of lines,
2988: the command language would best forget about newlines altogether,
2989: and that meant the command language had to treat the file as a single
2990: string, not an array of lines.
2991: .PP
2992: Eventually, I decided that thinking was not getting me very far and it was
2993: time to try building.
2994: I knew that the terminal part could be built easily \(em
2995: that part of
2996: .I jim
2997: behaved acceptably well \(em and that most of the hard work was going
2998: to be in the host part: the file interface, command interpreter and so on.
2999: Moreover, I had some ideas about how the architecture of
3000: .I jim
3001: could be improved without destroying its basic structure, which I liked
3002: in principle but which hadn't worked out as well as I had hoped.
3003: So I began by designing the file data structure,
3004: starting with the way
3005: .I jim
3006: worked \(em comparable to a single structure merging
3007: .CW Disc
3008: and
3009: .CW Buffer ,
3010: which I split to make the cache more general
3011: \(em and thinking about how global substitute could be implemented.
3012: The answer was clearly that it had to be done in two passes,
3013: and the transcript-oriented implementation fell out naturally.
3014: .PP
3015: .I Sam
3016: was written bottom-up,
3017: starting from the data structures and algorithms for manipulating text,
3018: through the command language and up to the code for maintaining
3019: the display.
3020: In retrospect, it turned out well, but this implementation method is
3021: not recommended in general.
3022: There were several times when I had a large body of interesting code
3023: assembled and no clue how to proceed with it.
3024: The command language, in particular, took almost a year to figure out,
3025: but can be implemented (given what was there at the beginning of that year)
3026: in a day or two. Similarly, inventing the
3027: .CW Rasp
3028: data structure delayed the
3029: connection of the host and terminal pieces by another few months.
3030: .I Sam
3031: took about two years to write, although only about four months were
3032: spent actually working on it.
3033: .PP
3034: Part of the design process was unusual:
3035: the subset of the protocol that maintains the
3036: .CW Rasp
3037: was simulated, debugged
3038: and verified by an automatic protocol analyzer,|reference(holzmann atttj) and was bug-free
3039: from the start.
3040: The rest of the protocol, concerned mostly
3041: with keeping menus up to date,
3042: was unfortunately too unwieldy for such analysis,
3043: and was debugged by more traditional methods, primarily
3044: by logging in a file all messages in and out of the host.
3045: .NH 2
3046: Reflections
3047: .LP
3048: .I Sam
3049: is essentially the only interactive editor used by the sixty or so members of
3050: the computing science research center in which I work.
3051: The same could not be said of
3052: .I jim ;
3053: the lack of a command language kept some people from adopting it.
3054: The union of a user interface as comfortable as
3055: .I jim 's
3056: with a command language as powerful as
3057: .CW ed 's\(dg
3058: .FS
3059: .vs 9
3060: \(dg
3061: The people who criticize
3062: .I ed
3063: as an interactive program often forget that it and its close relative
3064: .I sed |reference(kernighan pike)
3065: still thrive as programmable editors. The strength of these programs is
3066: independent of their convenience for interactive editing.
3067: .br
3068: .vs
3069: .FE
3070: is essential to
3071: .I sam 's
3072: success.
3073: When
3074: .I sam
3075: was first made available to the
3076: .I jim
3077: community,
3078: almost everyone switched to it within two or three days.
3079: In the months that followed, even people who had never adopted
3080: .I jim
3081: started using
3082: .I sam
3083: exclusively.
3084: .PP
3085: To be honest,
3086: .I ed
3087: still gets occasional use, but usually when
3088: something quick needs to be done and the overhead of
3089: downloading the terminal part of
3090: .I sam
3091: isn't worth the trouble.
3092: Also, as a `line' editor,
3093: .CW "sam -d"
3094: is a bit odd;
3095: when using a good old ASCII terminal, it's comforting to have
3096: a true line editor.
3097: But it is fair to say that
3098: .I sam 's
3099: command language has displaced
3100: .I ed 's
3101: for most of the complicated editing that has kept line editors
3102: (that is, command-driven editors) with us.
3103: .PP
3104: .I Sam 's
3105: command language is even fancier than
3106: .I ed 's,
3107: and most
3108: .I sam
3109: customers don't come near to using all its capabilities.
3110: Does it need to be so sophisticated?
3111: I think the answer is yes, for two reasons.
3112: .PP
3113: First, the
3114: .I model
3115: for
3116: .I sam 's
3117: command language is really relatively simple, and certainly simpler than that of
3118: .I ed .
3119: For instance, there is only one kind of textual loop in
3120: .I sam
3121: \(em the
3122: .CW x
3123: command \(em
3124: while
3125: .I ed
3126: has three (the
3127: .CW g
3128: command, the global flag on substitutions, and the implicit loop over
3129: lines in multi-line substitutions).
3130: Also,
3131: .I ed 's
3132: substitute command is necessary to make changes within lines, but in
3133: .I sam
3134: the
3135: .CW s
3136: command is more of a familiar convenience than a necessity;
3137: .CW c
3138: and
3139: .CW t
3140: can do all the work.
3141: .PP
3142: Second,
3143: given a community that expects an editor to be about as powerful as
3144: .I ed ,
3145: it's hard to see how
3146: .I sam
3147: could really be much simpler and still satisfy that expectation.
3148: People want to do ``global substitutes,'' and most are content
3149: to have the recipe for that and a few other fancy changes.
3150: The sophistication of the command language is really just a veneer
3151: over a design that makes it possible to do global substitutes
3152: in a screen editor.
3153: Some people will always want something more, however, and it's gratifying to
3154: be able to provide it.
3155: The real power of
3156: .I sam 's
3157: command language comes from composability of the operators, which is by
3158: nature orthogonal to the underlying model.
3159: In other words,
3160: .I sam
3161: is not itself complex, but it makes complex things possible.
3162: If you don't want to do anything complex, you can ignore the
3163: complexity altogether, and many people do so.
3164: .PP
3165: Sometimes I am asked the opposite question: why didn't I just make
3166: .I sam
3167: a real programmable editor, with macros and variables and so on?
3168: The main reason is a matter of taste: I like the editor
3169: to be the same every time I use it.
3170: There is one technical reason, though:
3171: programmability in editors is largely a workaround for insufficient
3172: interactivity.
3173: Programmable editors are used to make particular, usually short-term,
3174: things easy to do, such as by providing shorthands for common actions.
3175: If things are generally easy to do in the first place,
3176: shorthands are not as helpful.
3177: .I Sam
3178: makes common editing operations very easy, and the solutions to
3179: complex editing problems seem commensurate with the problems themselves.
3180: Also, the ability to edit the
3181: .I sam
3182: window makes it easy to repeat commands \(em it only takes a mouse button click
3183: to execute a command again.
3184: .NH 2
3185: Pros and cons
3186: .LP
3187: .I Sam
3188: has several other good points,
3189: and its share of problems.
3190: Among the good things is the idea of
3191: structural regular expressions,
3192: whose usefulness has only begun to be explored.
3193: They were arrived at serendipitously when I attempted to distill the essence of
3194: .I ed 's
3195: way of doing global substitution and recognized that the looping command in
3196: .I ed
3197: was implicitly imposing a structure (an array of lines) on the file.
3198: .PP
3199: Another of
3200: .I sam 's
3201: good things is its undo capability.
3202: I had never before used an editor with a true undo,
3203: but I would never go back now.
3204: Undo
3205: .I must
3206: be done well, but if it is, it can be relied on.
3207: For example,
3208: it's safe to experiment if you're not sure how to write some intricate command,
3209: because if you make a mistake, it can be fixed simply and reliably.
3210: I learned two things about undo from writing
3211: .I sam :
3212: first, it's easy to provide if you design it in from the beginning, and
3213: second, it's necessary, particularly if the system has some subtle
3214: properties that may be unfamiliar or error-prone for users.
3215: .PP
3216: .I Sam 's
3217: lack of internal limits and sizes is a virtue.
3218: Because it avoids all fixed-size tables and data structures,
3219: .I sam
3220: is able to make global changes to files that some of our other
3221: tools cannot even read.
3222: Moreover, the design keeps the performance linear when doing such
3223: operations, although I must admit
3224: .I sam
3225: does get slow when editing a huge file.
3226: .PP
3227: Now, the problems.
3228: Externally, the most obvious is that it is poorly integrated into the
3229: surrounding window system.
3230: By design, the user interface in
3231: .I sam
3232: feels almost identical to that of
3233: .I mux ,
3234: but a thick wall separates text in
3235: .I sam
3236: from the programs running in
3237: .I mux .
3238: For instance, the `snarf buffer' in
3239: .I sam
3240: must be maintained separately from that in
3241: .I mux .
3242: This is regrettable, but probably necessary given the unusual configuration
3243: of the system, with a programmable terminal on the far end of an RS-232 link.
3244: .PP
3245: .I Sam
3246: is reliable; otherwise, people wouldn't use it.
3247: But it was written over such a long time, and has so many new (to me)
3248: ideas in it, that I would like to see it done over again to clean
3249: up the code and remove many of the lingering problems in the implementation.
3250: The worst part is in the interconnection of the host and terminal parts,
3251: which might even be able to go away in a redesign for a more
3252: conventional window system.
3253: The program must be split in two to use the terminal effectively,
3254: but the low bandwidth of the connection forces the separation to
3255: occur in an inconvenient part of the design if performance is to be acceptable.
3256: A simple remote procedure call
3257: protocol driven by the host, emitting only graphics
3258: commands, would be easy to write but wouldn't have nearly the
3259: necessary responsiveness. On the other hand, if the terminal were in control
3260: and requested much simpler file services from the host, regular expression
3261: searches would require that the terminal read the entire file over its RS-232
3262: link, which would be unreasonably slow.
3263: A compromise in which either end can take control is necessary.
3264: In retrospect, the communications protocol should have been
3265: designed and verified formally, although I do not know of any tool
3266: that can adequately relate the protocol to
3267: its implementation.
3268: .PP
3269: Not all of
3270: .I sam 's
3271: users are comfortable with its command language, and few are adept.
3272: Some (venerable) people use a sort of
3273: .I ed \& ``
3274: subset'' of
3275: .I sam 's
3276: command language,
3277: and even ask why
3278: .I sam 's
3279: command language is not exactly
3280: .I ed 's.
3281: (The reason, of course, is that
3282: .I sam 's
3283: model for text does not include newlines, which are central to
3284: .I ed .
3285: Making the text an array of newlines to the command language would
3286: be too much of a break from the seamless model provided by the mouse.
3287: Some editors, such as
3288: .I vi ,
3289: are willing to make this break, though.)
3290: The difficulty is that
3291: .I sam 's
3292: syntax is so close to
3293: .I ed 's
3294: that people believe it
3295: .I should
3296: be the same.
3297: I thought, with some justification in hindsight,
3298: that making
3299: .I sam
3300: similar to
3301: .I ed
3302: would make it easier to learn and to accept.
3303: But I may have overstepped and raised the users'
3304: expectations too much.
3305: It's hard to decide which way to resolve this problem.
3306: .PP
3307: Finally, there is a tradeoff in
3308: .I sam
3309: that was decided by the environment in which it runs:
3310: .I sam
3311: is a multi-file editor, although in a different system there might instead be
3312: multiple single-file editors.
3313: The decision was made primarily because starting a new program in a Blit is
3314: time-consuming.
3315: If the choice could be made freely, however, I would
3316: still choose the multi-file architecture, because it allows
3317: groups of files to be handled as a unit;
3318: the usefulness of the multi-file commands is incontrovertible.
3319: It is delightful to have the source to an entire program
3320: available at your fingertips.
3321: .NH
3322: Acknowledgements
3323: .LP
3324: Tom Cargill suggested the idea behind the
3325: .CW Rasp
3326: data structure.
3327: Norman Wilson and Ken Thompson influenced the command language.
3328: This paper was improved by comments from
3329: Al Aho,
3330: Jon Bentley,
3331: Chris Fraser,
3332: Gerard Holzmann,
3333: Brian Kernighan,
3334: Ted Kowalski,
3335: Doug McIlroy
3336: and
3337: Dennis Ritchie.
3338: .NH
3339: References
3340: .LP
3341: |reference_placement
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.