1: .\" @(#)awk 6.1 (Berkeley) 5/22/86
2: .\"
3: .EH 'USD:19-%''Awk \(em A Pattern Scanning and Processing Language'
4: .OH 'Awk \(em A Pattern Scanning and Processing Language''USD:19-%'
5: .\" .fp 3 G no G on APS (use gb) or Dandelion Printer (use CW)
6: .\" the .T is only a ditroff feature...
7: .if '\*.T'dp' .fp 3 El
8: .if '\*.T'aps' .fp 3 gB
9: ....TM "78-1271-12, 78-1273-6" 39199 39199-11
10: .ND "September 1, 1978"
11: ....TR 68
12: .\".RP
13: . \" macros here
14: .tr _\(em
15: .if t .tr ~\(ap
16: .tr |\(or
17: .tr *\(**
18: .de UC
19: \&\\$3\s-1\\$1\\s0\&\\$2
20: ..
21: .de IT
22: .if n .ul
23: \&\\$3\f2\\$1\fP\|\\$2
24: ..
25: .de UL
26: .if n .ul
27: \&\\$3\f3\\$1\fP\&\\$2
28: ..
29: .de P1
30: .DS I 3n
31: .nf
32: .if n .ta 5 10 15 20 25 30 35 40 45 50 55 60
33: .if t .ta .3i .6i .9i 1.2i
34: .if t .tr -\-'\(fm*\(**
35: .if t .tr _\(ul
36: .ft 3
37: .lg 0
38: .ss 18
39: . \"use first argument as indent if present
40: ..
41: .de P2
42: .ps \\n(PS
43: .vs \\n(VSp
44: .ft R
45: .ss 12
46: .if n .ls 2
47: .tr --''``^^!!
48: .if t .tr _\(em
49: .fi
50: .lg
51: .DE
52: ..
53: .hw semi-colon
54: .hy 14
55: . \"2=not last lines; 4= no -xx; 8=no xx-
56: . \"special chars in programs
57: .de WS
58: .sp \\$1
59: ..
60: . \" end of macros
61: .TL
62: Awk \(em A Pattern Scanning and Processing Language
63: .br
64: (Second Edition)
65: .AU "MH 2C-522" 4862
66: Alfred V. Aho
67: .AU "MH 2C-518" 6021
68: Brian W. Kernighan
69: .AU "MH 2C-514" 7214
70: Peter J. Weinberger
71: .AI
72: .MH
73: .AB
74: .IT Awk
75: is a programming language whose
76: basic operation
77: is to search a set of files
78: for patterns, and to perform specified actions upon lines or fields of lines which
79: contain instances of those patterns.
80: .IT Awk
81: makes certain data selection and transformation operations easy to express;
82: for example, the
83: .IT awk
84: program
85: .sp
86: .ce
87: .ft 3
88: length > 72
89: .ft
90: .sp
91: prints all input lines whose length exceeds 72 characters;
92: the program
93: .ce
94: .sp
95: .ft 3
96: NF % 2 == 0
97: .ft R
98: .sp
99: prints all lines with an even number of fields;
100: and the program
101: .ce
102: .sp
103: .ft 3
104: { $1 = log($1); print }
105: .ft R
106: .sp
107: replaces the first field of each line by its logarithm.
108: .PP
109: .IT Awk
110: patterns may include arbitrary boolean combinations of regular expressions
111: and of relational operators on strings, numbers, fields, variables, and array elements.
112: Actions may include the same pattern-matching constructions as in patterns,
113: as well as
114: arithmetic and string expressions and assignments,
115: .UL if-else ,
116: .UL while ,
117: .UL for
118: statements,
119: and multiple output streams.
120: .PP
121: This report contains a user's guide, a discussion of the design and implementation of
122: .IT awk ,
123: and some timing statistics.
124: ....It supersedes TM-77-1271-5, dated September 8, 1977.
125: .AE
126: .CS 6 1 7 0 1 4
127: .if n .ls 2
128: .nr PS 9
129: .nr VS 11
130: .NH
131: Introduction
132: .if t .2C
133: .PP
134: .IT Awk
135: is a programming language designed to make
136: many common
137: information retrieval and text manipulation tasks
138: easy to state and to perform.
139: .PP
140: The basic operation of
141: .IT awk
142: is to scan a set of input lines in order,
143: searching for lines which match any of a set of patterns
144: which the user has specified.
145: For each pattern, an action can be specified;
146: this action will be performed on each line that matches the pattern.
147: .PP
148: Readers familiar with the
149: .UX
150: program
151: .IT grep\|
152: .[
153: unix program manual
154: .]
155: will recognize
156: the approach, although in
157: .IT awk
158: the patterns may be more
159: general than in
160: .IT grep ,
161: and the actions allowed are more involved than merely
162: printing the matching line.
163: For example, the
164: .IT awk
165: program
166: .P1
167: {print $3, $2}
168: .P2
169: prints the third and second columns of a table
170: in that order.
171: The program
172: .P1
173: $2 ~ /A\||B\||C/
174: .P2
175: prints all input lines with an A, B, or C in the second field.
176: The program
177: .P1
178: $1 != prev { print; prev = $1 }
179: .P2
180: prints all lines in which the first field is different
181: from the previous first field.
182: .NH 2
183: Usage
184: .PP
185: The command
186: .P1
187: awk program [files]
188: .P2
189: executes the
190: .IT awk
191: commands in
192: the string
193: .UL program
194: on the set of named files,
195: or on the standard input if there are no files.
196: The statements can also be placed in a file
197: .UL pfile ,
198: and executed by the command
199: .P1
200: awk -f pfile [files]
201: .P2
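.PP
The two invocation forms can be sketched as follows.
This is only an illustration, assuming a POSIX shell and a present-day
.IT awk ;
the file names
.UL data.txt
and
.UL swap.awk
and the sample data are hypothetical.
```shell
# Scratch directory, so nothing in the current directory is touched.
dir=$(mktemp -d)

# Hypothetical input: two lines of three fields each.
echo "a b c" > "$dir/data.txt"
echo "d e f" >> "$dir/data.txt"

# Form 1: the program is a string on the command line.
inline=$(awk '{ print $3, $2 }' "$dir/data.txt")

# Form 2: the same program is read from a file named with -f.
echo '{ print $3, $2 }' > "$dir/swap.awk"
fromfile=$(awk -f "$dir/swap.awk" "$dir/data.txt")

echo "$inline"
echo "$fromfile"
rm -r "$dir"
```
.PP
Both forms run the same program, so the two results should be identical.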
202: .NH 2
203: Program Structure
204: .PP
205: An
206: .IT awk
207: program is a sequence of statements of the form:
208: .P1
209: .ft I
210: pattern { action }
211: pattern { action }
212: ...
213: .ft 3
214: .P2
215: Each line of input
216: is matched against
217: each of the patterns in turn.
218: For each pattern that matches, the associated action
219: is executed.
220: When all the patterns have been tested, the next line
221: is fetched and the matching starts over.
222: .PP
223: Either the pattern or the action may be left out,
224: but not both.
225: If there is no action for a pattern,
226: the matching line is simply
227: copied to the output.
228: (Thus a line which matches several patterns can be printed several times.)
229: If there is no pattern for an action,
230: then the action is performed for every input line.
231: A line which matches no pattern is ignored.
232: .PP
233: Since patterns and actions are both optional,
234: actions must be enclosed in braces
235: to distinguish them from patterns.
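.PP
The matching rules above can be seen in a small sketch,
assuming a POSIX shell and a present-day
.IT awk ;
the three input lines are made up for the purpose.
The first pattern has no action, so matching lines are copied;
the second has an action that labels the line.
```shell
# Line 1 matches both patterns, line 2 matches only the second,
# and line 3 matches neither and is therefore ignored.
matched=$(awk '/a/
/b/ { print "b:", $0 }' <<EOF
ab
b
c
EOF
)
echo "$matched"
```
.PP
The line that matches both patterns appears twice in the output,
once for each pattern.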
236: .NH 2
237: Records and Fields
238: .PP
239: .IT Awk
240: input is divided into
241: ``records'' terminated by a record separator.
242: The default record separator is a newline,
243: so by default
244: .IT awk
245: processes its input a line at a time.
246: The number of the current record is available in a variable
247: named
248: .UL NR .
249: .PP
250: Each input record
251: is considered to be divided into ``fields.''
252: Fields are normally separated by
253: white space \(em blanks or tabs \(em
254: but the input field separator may be changed, as described below.
255: Fields are referred to as
256: .UL "$1, $2,"
257: and so forth,
258: where
259: .UL $1
260: is the first field,
261: and
262: .UL $0
263: is the whole input record itself.
264: Fields may be assigned to.
265: The number of fields in the current record
266: is available in a variable named
267: .UL NF .
268: .PP
269: The variables
270: .UL FS
271: and
272: .UL RS
273: refer to the input field and record separators;
274: they may be changed at any time to any single character.
275: The optional command-line argument
276: \f3\-F\fIc\fR
277: may also be used to set
278: .UL FS
279: to the character
280: .IT c .
281: .PP
282: If the record separator is empty,
283: an empty input line is taken as the record separator,
284: and blanks, tabs and newlines are treated as field separators.
285: .PP
286: The variable
287: .UL FILENAME
288: contains the name of the current input file.
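.PP
As a hedged illustration of field and record separators,
assuming a POSIX shell and a present-day
.IT awk ,
the sketch below first splits hypothetical colon-separated records,
then reads blank-line-separated records by making
.UL RS
empty.
The variable names and the data are invented for the example.
```shell
# -F: sets the field separator to a colon.
byfield=$(awk -F: '{ print NR, NF, $1 }' <<EOF
root:x:0
bwk:x:1001:staff
EOF
)

# An empty RS makes a blank line the record separator,
# and newlines then separate fields as well.
byparagraph=$(awk 'BEGIN { RS = "" } { print NR, NF, $1 }' <<EOF
alpha beta
gamma

delta
EOF
)

echo "$byfield"
echo "$byparagraph"
```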
289: .NH 2
290: Printing
291: .PP
292: An action may have no pattern,
293: in which case the action is executed for
294: all
295: lines.
296: The simplest action is to print some or all of a record;
297: this is accomplished by the
298: .IT awk
299: command
300: .UL print .
301: The
302: .IT awk
303: program
304: .P1
305: { print }
306: .P2
307: prints each record, thus copying the input to the output intact.
308: More useful is to print a field or fields from each record.
309: For instance,
310: .P1
311: print $2, $1
312: .P2
313: prints the first two fields in reverse order.
314: Items separated by a comma in the print statement will be separated by the current output field separator
315: when output.
316: Items not separated by commas will be concatenated,
317: so
318: .P1
319: print $1 $2
320: .P2
321: runs the first and second fields together.
322: .PP
323: The predefined variables
324: .UL NF
325: and
326: .UL NR
327: can be used;
328: for example
329: .P1
330: { print NR, NF, $0 }
331: .P2
332: prints each record preceded by the record number and the number of fields.
333: .PP
334: Output may be diverted to multiple files;
335: the program
336: .P1
337: { print $1 >"foo1"; print $2 >"foo2" }
338: .P2
339: writes the first field,
340: .UL $1 ,
341: on the file
342: .UL foo1 ,
343: and the second field on file
344: .UL foo2 .
345: The
346: .UL >>
347: notation can also be used:
348: .P1
349: print $1 >>"foo"
350: .P2
351: appends the output to the file
352: .UL foo .
353: (In each case,
354: the output files are
355: created if necessary.)
356: The file name can be a variable or a field as well as a constant;
357: for example,
358: .P1
359: print $1 >$2
360: .P2
361: uses the contents of field 2 as a file name.
362: .PP
363: Naturally there is a limit on the number of output files;
364: currently it is 10.
365: .PP
366: Similarly, output can be piped into another process
367: (on
368: .UC UNIX
369: only); for instance,
370: .P1
371: print | "mail bwk"
372: .P2
373: mails the output to
374: .UL bwk .
375: .PP
376: The variables
377: .UL OFS
378: and
379: .UL ORS
380: may be used to change the current
381: output field separator and output
382: record separator.
383: The output record separator is
384: appended to the output of the
385: .UL print
386: statement.
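.PP
The difference between comma-separated and concatenated items
can be sketched like this, assuming a POSIX shell and a present-day
.IT awk ;
the choice of a hyphen for
.UL OFS
is arbitrary.
```shell
# Items separated by a comma are joined with OFS;
# items written side by side are concatenated directly.
joined=$(echo "a b" | awk 'BEGIN { OFS = "-" } { print $2, $1; print $1 $2 }')
echo "$joined"
```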
387: .PP
388: .IT Awk
389: also provides the
390: .UL printf
391: statement for output formatting:
392: .P1
393: printf format expr, expr, ...
394: .P2
395: formats the expressions in the list
396: according to the specification
397: in
398: .UL format
399: and prints them.
400: For example,
401: .P1
402: printf "%8.2f %10ld\en", $1, $2
403: .P2
404: prints
405: .UL $1
406: as a floating point number 8 digits wide,
407: with two after the decimal point,
408: and
409: .UL $2
410: as a 10-digit long decimal number,
411: followed by a newline.
412: No output separators are produced automatically;
413: you must add them yourself,
414: as in this example.
415: The version of
416: .UL printf
417: is identical to that used with C.
418: .[
419: C programm language prentice hall 1978
420: .]
421: .NH 1
422: Patterns
423: .PP
424: A pattern in front of an action acts as a selector
425: that determines whether the action is to be executed.
426: A variety of expressions may be used as patterns:
427: regular expressions,
428: arithmetic relational expressions,
429: string-valued expressions,
430: and arbitrary boolean
431: combinations of these.
432: .NH 2
433: BEGIN and END
434: .PP
435: The special pattern
436: .UL BEGIN
437: matches the beginning of the input,
438: before the first record is read.
439: The pattern
440: .UL END
441: matches the end of the input,
442: after the last record has been processed.
443: .UL BEGIN
444: and
445: .UL END
446: thus provide a way to gain control before and after processing,
447: for initialization and wrapup.
448: .PP
449: As an example, the field separator
450: can be set to a colon by
451: .P1
452: BEGIN { FS = ":" }
453: .ft I
454: \&... rest of program ...
455: .ft 3
456: .P2
457: Or the input lines may be counted by
458: .P1
459: END { print NR }
460: .P2
461: If
462: .UL BEGIN
463: is present, it must be the first pattern;
464: .UL END
465: must be the last if used.
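.PP
A small sketch that combines both special patterns,
assuming a POSIX shell and a present-day
.IT awk ;
the colon-separated data are hypothetical.
.UL BEGIN
sets the field separator before any input is read,
and
.UL END
reports the record count and a running sum.
```shell
total=$(awk 'BEGIN { FS = ":" }
{ sum += $2 }
END { print NR, sum }' <<EOF
apples:3
pears:4
plums:5
EOF
)
echo "$total"
```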
466: .NH 2
467: Regular Expressions
468: .PP
469: The simplest regular expression is a literal string of characters
470: enclosed in slashes,
471: like
472: .P1
473: /smith/
474: .P2
475: This
476: is actually a complete
477: .IT awk
478: program which
479: will print all lines which contain any occurrence
480: of the name ``smith''.
481: If a line contains ``smith''
482: as part of a larger word,
483: it will also be printed, as in
484: .P1
485: blacksmithing
486: .P2
487: .PP
488: .IT Awk
489: regular expressions include the regular expression
490: forms found in
491: the
492: .UC UNIX
493: text editor
494: .IT ed\|
495: .[
496: unix program manual
497: .]
498: and
499: .IT grep
500: (without back-referencing).
501: In addition,
502: .IT awk
503: allows
504: parentheses for grouping, | for alternatives,
505: .UL +
506: for ``one or more'', and
507: .UL ?
508: for ``zero or one'',
509: all as in
510: .IT lex .
511: Character classes
512: may be abbreviated:
513: .UL [a\-zA\-Z0\-9]
514: is the set of all letters and digits.
515: As an example,
516: the
517: .IT awk
518: program
519: .P1
520: /[Aa]ho\||[Ww]einberger\||[Kk]ernighan/
521: .P2
522: will print all lines which contain any of the names
523: ``Aho,'' ``Weinberger'' or ``Kernighan,''
524: whether capitalized or not.
525: .PP
526: Regular expressions
527: (with the extensions listed above)
528: must be enclosed in slashes,
529: just as in
530: .IT ed
531: and
532: .IT sed .
533: Within a regular expression,
534: blanks and the regular expression
535: metacharacters are significant.
536: To turn off the magic meaning
537: of one of the regular expression characters,
538: precede it with a backslash.
539: An example is the pattern
540: .P1
541: /\|\e/\^.\^*\e//
542: .P2
543: which matches any string of characters
544: enclosed in slashes.
545: .PP
546: One can also specify that any field or variable
547: matches
548: a regular expression (or does not match it) with the operators
549: .UL ~
550: and
551: .UL !~ .
552: The program
553: .P1
554: $1 ~ /[jJ]ohn/
555: .P2
556: prints all lines where the first field matches ``john'' or ``John.''
557: Notice that this will also match ``Johnson'', ``St. Johnsbury'', and so on.
558: To restrict it to exactly
559: .UL [jJ]ohn ,
560: use
561: .P1
562: $1 ~ /^[jJ]ohn$/
563: .P2
564: The caret ^ refers to the beginning
565: of a line or field;
566: the dollar sign
567: .UL $
568: refers to the end.
569: .NH 2
570: Relational Expressions
571: .PP
572: An
573: .IT awk
574: pattern can be a relational expression
575: involving the usual relational operators
576: .UL < ,
577: .UL <= ,
578: .UL == ,
579: .UL != ,
580: .UL >= ,
581: and
582: .UL > .
583: An example is
584: .P1
585: $2 > $1 + 100
586: .P2
587: which selects lines where the second field
588: is more than 100 greater than the first field.
589: Similarly,
590: .P1
591: NF % 2 == 0
592: .P2
593: prints lines with an even number of fields.
594: .PP
595: In relational tests, if neither operand is numeric,
596: a string comparison is made;
597: otherwise it is numeric.
598: Thus,
599: .P1
600: $1 >= "s"
601: .P2
602: selects lines that begin with an
603: .UL s ,
604: .UL t ,
605: .UL u ,
606: etc.
607: In the absence of any other information,
608: fields are treated as strings, so
609: the program
610: .P1
611: $1 > $2
612: .P2
613: will perform a string comparison.
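.PP
Implementations have differed on how a comparison between two fields is typed,
so it can be safer to force the intended type.
The sketch below is one way to do so, assuming a POSIX shell and a present-day
.IT awk :
adding zero forces a numeric comparison,
and concatenating the null string forces a string comparison.
The input line is made up for the example.
```shell
# Numerically, 10 is greater than 9.
numeric=$(echo "10 9" | awk '$1 + 0 > $2 + 0 { print "numeric: first is larger" }')

# As strings, "10" sorts before "9" because "1" precedes "9".
string=$(echo "10 9" | awk '($1 "") < ($2 "") { print "string: first sorts earlier" }')

echo "$numeric"
echo "$string"
```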
614: .NH 2
615: Combinations of Patterns
616: .PP
617: A pattern can be any boolean combination of patterns,
618: using the operators
619: .UL \||\||
620: (or),
621: .UL &&
622: (and), and
623: .UL !
624: (not).
625: For example,
626: .P1
627: $1 >= "s" && $1 < "t" && $1 != "smith"
628: .P2
629: selects lines where the first field begins with ``s'', but is not ``smith''.
630: .UL &&
631: and
632: .UL \||\||
633: guarantee that their operands
634: will be evaluated
635: from left to right;
636: evaluation stops as soon as the truth or falsehood
637: is determined.
638: .NH 2
639: Pattern Ranges
640: .PP
641: The ``pattern'' that selects an action may also
642: consist of two patterns separated by a comma, as in
643: .P1
644: pat1, pat2 { ... }
645: .P2
646: In this case, the action is performed for each line between
647: an occurrence of
648: .UL pat1
649: and the next occurrence of
650: .UL pat2
651: (inclusive).
652: For example,
653: .P1
654: /start/, /stop/
655: .P2
656: prints all lines between
657: .UL start
658: and
659: .UL stop ,
660: while
661: .P1
662: NR == 100, NR == 200 { ... }
663: .P2
664: does the action for lines 100 through 200
665: of the input.
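.PP
The inclusive behavior of a range can be checked with a short sketch,
assuming a POSIX shell and a present-day
.IT awk ;
the five input lines are hypothetical.
```shell
# Only the lines from the first match of /start/ through
# the next match of /stop/ are printed, both ends included.
range=$(awk '/start/, /stop/' <<EOF
before
start here
middle
stop here
after
EOF
)
echo "$range"
```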
666: .NH 1
667: Actions
668: .PP
669: An
670: .IT awk
671: action is a sequence of action statements
672: terminated by newlines or semicolons.
673: These action statements can be used to do a variety of
674: bookkeeping and string manipulating tasks.
675: .NH 2
676: Built-in Functions
677: .PP
678: .IT Awk
679: provides a ``length'' function
680: to compute the length of a string of characters.
681: This program prints each record,
682: preceded by its length:
683: .P1
684: {print length, $0}
685: .P2
686: .UL length
687: by itself is a ``pseudo-variable'' which
688: yields the length of the current record;
689: .UL length(argument)
690: is a function which yields the length of its argument,
691: as in
692: the equivalent
693: .P1
694: {print length($0), $0}
695: .P2
696: The argument may be any expression.
697: .PP
698: .IT Awk
699: also
700: provides the arithmetic functions
701: .UL sqrt ,
702: .UL log ,
703: .UL exp ,
704: and
705: .UL int ,
706: for
707: square root,
708: base
709: .IT e
710: logarithm,
711: exponential,
712: and integer part of their respective arguments.
713: .PP
714: The name of one of these built-in functions,
715: without argument or parentheses,
716: stands for the value of the function on the
717: whole record.
718: The program
719: .P1
720: length < 10 || length > 20
721: .P2
722: prints lines whose length
723: is less than 10 or greater
724: than 20.
725: .PP
726: The function
727: .UL substr(s,\ m,\ n)
728: produces the substring of
729: .UL s
730: that begins at position
731: .UL m
732: (origin 1)
733: and is at most
734: .UL n
735: characters long.
736: If
737: .UL n
738: is omitted, the substring goes to the end of
739: .UL s .
740: The function
741: .UL index(s1,\ s2)
742: returns the position where the string
743: .UL s2
744: occurs in
745: .UL s1 ,
746: or zero if it does not.
747: .PP
748: The function
749: .UL sprintf(f,\ e1,\ e2,\ ...)
750: produces the value of the expressions
751: .UL e1 ,
752: .UL e2 ,
753: etc.,
754: in the
755: .UL printf
756: format specified by
757: .UL f .
758: Thus, for example,
759: .P1
760: x = sprintf("%8.2f %10ld", $1, $2)
761: .P2
762: sets
763: .UL x
764: to the string produced by formatting
765: the values of
766: .UL $1
767: and
768: .UL $2 .
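.PP
The string functions described above can be exercised together in a sketch like the following.
It assumes a POSIX shell and a present-day
.IT awk ,
in which a program consisting only of a
.UL BEGIN
action runs without reading input;
the sample string and values are invented.
```shell
result=$(awk 'BEGIN {
    s = "hello world"
    print length(s)           # number of characters in s
    print substr(s, 7, 5)     # five characters starting at position 7
    print substr(s, 7)        # from position 7 to the end
    print index(s, "wor")     # position of the first occurrence
    print index(s, "xyz")     # zero when there is no occurrence
    print sprintf("%5.2f|%3d", 3.14159, 42)
}')
echo "$result"
```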
769: .NH 2
770: Variables, Expressions, and Assignments
771: .PP
772: .IT Awk
773: variables take on numeric (floating point)
774: or string values according to context.
775: For example, in
776: .P1
777: x = 1
778: .P2
779: .UL x
780: is clearly a number, while in
781: .P1
782: x = "smith"
783: .P2
784: it is clearly a string.
785: Strings are converted to numbers and
786: vice versa whenever context demands it.
787: For instance,
788: .P1
789: x = "3" + "4"
790: .P2
791: assigns 7 to
792: .UL x .
793: Strings which cannot be interpreted
794: as numbers in a numerical context
795: will generally have numeric value zero,
796: but it is unwise to count on this behavior.
797: .PP
798: By default, variables (other than built-ins) are initialized to the null string,
799: which has numerical value zero;
800: this eliminates the need for most
801: .UL BEGIN
802: sections.
803: For example, the sums of the first two fields can be computed by
804: .P1
805: { s1 += $1; s2 += $2 }
806: END { print s1, s2 }
807: .P2
808: .PP
809: Arithmetic is done internally in floating point.
810: The arithmetic operators are
811: .UL + ,
812: .UL \- ,
813: .UL \(** ,
814: .UL / ,
815: and
816: .UL %
817: (mod).
818: The C increment
819: .UL ++
820: and
821: decrement
822: .UL \-\-
823: operators are also available,
824: and so are the assignment operators
825: .UL += ,
826: .UL \-= ,
827: .UL *= ,
828: .UL /= ,
829: and
830: .UL %= .
831: These operators may all be used in expressions.
832: .NH 2
833: Field Variables
834: .PP
835: Fields in
836: .IT awk
837: share essentially all of the properties of variables _
838: they may be used in arithmetic or string operations,
839: and may be assigned to.
840: Thus one can
841: replace the first field with a sequence number like this:
842: .P1
843: { $1 = NR; print }
844: .P2
845: or
846: accumulate two fields into a third, like this:
847: .P1
848: { $1 = $2 + $3; print $0 }
849: .P2
850: or assign a string to a field:
851: .P1
852: { if ($3 > 1000)
853: $3 = "too big"
854: print
855: }
856: .P2
857: which replaces the third field by ``too big'' when it is,
858: and in any case prints the record.
859: .PP
860: Field references may be numerical expressions,
861: as in
862: .P1
863: { print $i, $(i+1), $(i+n) }
864: .P2
865: Whether a field is deemed numeric or string depends on context;
866: in ambiguous cases like
867: .P1
868: if ($1 == $2) ...
869: .P2
870: fields are treated as strings.
871: .PP
872: Each input line is split into fields automatically as necessary.
873: It is also possible to split any variable or string
874: into fields:
875: .P1
876: n = split(s, array, sep)
877: .P2
878: splits the
879: string
880: .UL s
881: into
882: .UL array[1] ,
883: \&...,
884: .UL array[n] .
885: The number of elements found is returned.
886: If the
887: .UL sep
888: argument is provided, it is used as the field separator;
889: otherwise
890: .UL FS
891: is used as the separator.
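.PP
A minimal sketch of
.UL split ,
assuming a POSIX shell and a present-day
.IT awk ;
the string and the array name
.UL dir
are hypothetical.
```shell
# Split a colon-separated string and show the count,
# the first element, and the last element.
parts=$(awk 'BEGIN {
    n = split("usr:local:bin", dir, ":")
    print n, dir[1], dir[n]
}')
echo "$parts"
```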
892: .NH 2
893: String Concatenation
894: .PP
895: Strings may be concatenated.
896: For example
897: .P1
898: length($1 $2 $3)
899: .P2
900: returns the length of the first three fields.
901: Or in a
902: .UL print
903: statement,
904: .P1
905: print $1 " is " $2
906: .P2
907: prints
908: the two fields separated by `` is ''.
909: Variables and numeric expressions may also appear in concatenations.
910: .NH 2
911: Arrays
912: .PP
913: Array elements are not declared;
914: they spring into existence by being mentioned.
915: Subscripts may have
916: .ul
917: any
918: non-null
919: value, including non-numeric strings.
920: As an example of a conventional numeric subscript,
921: the statement
922: .P1
923: x[NR] = $0
924: .P2
925: assigns the current input record to
926: the
927: .UL NR -th
928: element of the array
929: .UL x .
930: In fact, it is possible in principle (though perhaps slow)
931: to process the entire input in a random order with the
932: .IT awk
933: program
934: .P1
935: { x[NR] = $0 }
936: END { \fI... program ...\fP }
937: .P2
938: The first action merely records each input line in
939: the array
940: .UL x .
941: .PP
942: Array elements may be named by non-numeric values,
943: which gives
944: .IT awk
945: a capability rather like the associative memory of
946: Snobol tables.
947: Suppose the input contains fields with values like
948: .UL apple ,
949: .UL orange ,
950: etc.
951: Then the program
952: .P1
953: /apple/ { x["apple"]++ }
954: /orange/ { x["orange"]++ }
955: END { print x["apple"], x["orange"] }
956: .P2
957: increments counts for the named array elements,
958: and prints them at the end of the input.
959: .NH 2
960: Flow-of-Control Statements
961: .PP
962: .IT Awk
963: provides the basic flow-of-control statements
964: .UL if-else ,
965: .UL while ,
966: .UL for ,
967: and statement grouping with braces, as in C.
968: We showed the
969: .UL if
970: statement in section 3.3 without describing it.
971: The condition in parentheses is evaluated;
972: if it is true, the statement following the
973: .UL if
974: is done.
975: The
976: .UL else
977: part is optional.
978: .PP
979: The
980: .UL while
981: statement is exactly like that of C.
982: For example, to print all input fields one per line,
983: .P1
984: i = 1
985: while (i <= NF) {
986: print $i
987: ++i
988: }
989: .P2
990: .PP
991: The
992: .UL for
993: statement is also exactly that of C:
994: .P1
995: for (i = 1; i <= NF; i++)
996: print $i
997: .P2
998: does the same job as the
999: .UL while
1000: statement above.
1001: .PP
1002: There is an alternate form of the
1003: .UL for
1004: statement which is suited for accessing the
1005: elements of an associative array:
1006: .P1
1007: for (i in array)
1008: \fIstatement\f3
1009: .P2
1010: does
1011: .ul
1012: statement
1013: with
1014: .UL i
1015: set in turn to each subscript of
1016: .UL array .
1017: The elements are accessed in an apparently random order.
1018: Chaos will ensue if
1019: .UL i
1020: is altered, or if any new elements are
1021: accessed during the loop.
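.PP
As an illustration of this form of the loop,
the sketch below counts how often each first field occurs.
It assumes a POSIX shell, a present-day
.IT awk ,
and the standard
.IT sort
command;
the array name
.UL count
and the data are made up.
Since the order of traversal is not specified, the output is sorted before it is shown.
```shell
counts=$(awk '{ count[$1]++ }
END { for (name in count) print name, count[name] }' <<EOF
apple 1
orange 2
apple 3
EOF
)

# The loop order is unspecified, so sort before displaying.
sorted=$(echo "$counts" | sort)
echo "$sorted"
```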
1022: .PP
1023: The expression in the condition part of an
1024: .UL if ,
1025: .UL while
1026: or
1027: .UL for
1028: can include relational operators like
1029: .UL < ,
1030: .UL <= ,
1031: .UL > ,
1032: .UL >= ,
1033: .UL ==
1034: (``is equal to''),
1035: and
1036: .UL !=
1037: (``not equal to'');
1038: regular expression matches with the match operators
1039: .UL ~
1040: and
1041: .UL !~ ;
1042: the logical operators
1043: .UL \||\|| ,
1044: .UL && ,
1045: and
1046: .UL ! ;
1047: and of course parentheses for grouping.
1048: .PP
1049: The
1050: .UL break
1051: statement causes an immediate exit
1052: from an enclosing
1053: .UL while
1054: or
1055: .UL for ;
1056: the
1057: .UL continue
1058: statement
1059: causes the next iteration to begin.
1060: .PP
1061: The statement
1062: .UL next
1063: causes
1064: .IT awk
1065: to skip immediately to
1066: the next record and begin scanning the patterns from the top.
1067: The statement
1068: .UL exit
1069: causes the program to behave as if the end of the input
1070: had occurred.
1071: .PP
1072: Comments may be placed in
1073: .IT awk
1074: programs:
1075: they begin with the character
1076: .UL #
1077: and end with the end of the line,
1078: as in
1079: .P1
1080: print x, y # this is a comment
1081: .P2
1082: .NH
1083: Design
1084: .PP
1085: The
1086: .UX
1087: system
1088: already provides several programs that
1089: operate by passing input through a
1090: selection mechanism.
1091: .IT Grep ,
1092: the first and simplest, merely prints all lines which
1093: match a single specified pattern.
1094: .IT Egrep
1095: provides more general patterns, i.e., regular expressions
1096: in full generality;
1097: .IT fgrep
1098: searches for a set of keywords with a particularly fast algorithm.
1099: .IT Sed\|
1100: .[
1101: unix programm manual
1102: .]
1103: provides most of the editing facilities of
1104: the editor
1105: .IT ed ,
1106: applied to a stream of input.
1107: None of these programs provides
1108: numeric capabilities,
1109: logical relations,
1110: or variables.
1111: .PP
1112: .IT Lex\|
1113: .[
1114: lesk lexical analyzer cstr
1115: .]
1116: provides general regular expression recognition capabilities,
1117: and, by serving as a C program generator,
1118: is essentially open-ended in its capabilities.
1119: The use of
1120: .IT lex ,
1121: however, requires a knowledge of C programming,
1122: and a
1123: .IT lex
1124: program must be compiled and loaded before use,
1125: which discourages its use for one-shot applications.
1126: .PP
1127: .IT Awk
1128: is an attempt
1129: to fill in another part of the matrix of possibilities.
1130: It
1131: provides general regular expression capabilities
1132: and an implicit input/output loop.
1133: But it also provides convenient numeric processing,
1134: variables,
1135: more general selection,
1136: and control flow in the actions.
1137: It
1138: does not require compilation or a knowledge of C.
1139: Finally,
1140: .IT awk
1141: provides
1142: a convenient way to access fields within lines;
1143: it is unique in this respect.
1144: .PP
1145: .IT Awk
1146: also tries to integrate strings and numbers
1147: completely,
1148: by treating all quantities as both string and numeric,
1149: deciding which representation is appropriate
1150: as late as possible.
1151: In most cases the user can simply ignore the differences.
1152: .PP
1153: Most of the effort in developing
1154: .I awk
1155: went into deciding what
1156: .I awk
1157: should or should not do
1158: (for instance, it doesn't do string substitution)
1159: and what the syntax should be
1160: (no explicit operator for concatenation)
1161: rather
1162: than on writing or debugging the code.
1163: We have tried
1164: to make the syntax powerful
1165: but easy to use and well adapted
1166: to scanning files.
1167: For example,
1168: the absence of declarations, together with implicit initialization,
1169: while probably a bad idea for a general-purpose programming language,
1170: is desirable in a language
1171: that is meant to be used for tiny programs
1172: that may even be composed on the command line.
1173: .PP
1174: In practice,
1175: .IT awk
1176: usage seems to fall into two broad categories.
1177: One is what might be called ``report generation'' \(em
1178: processing an input to extract counts,
1179: sums, sub-totals, etc.
1180: This also includes the writing of trivial
1181: data validation programs,
1182: such as verifying that a field contains only numeric information
1183: or that certain delimiters are properly balanced.
1184: The combination of textual and numeric processing is invaluable here.
1185: .PP
1186: A second area of use is as a data transformer,
1187: converting data from the form produced by one program
1188: into that expected by another.
1189: The simplest examples merely select fields, perhaps with rearrangements.
1190: .NH
1191: Implementation
1192: .PP
1193: The actual implementation of
1194: .IT awk
1195: uses the language development tools available
1196: on the
1197: .UC UNIX
1198: operating system.
1199: The grammar is specified with
1200: .IT yacc ;
1201: .[
1202: yacc johnson cstr
1203: .]
1204: the lexical analysis is done by
1205: .IT lex ;
1206: the regular expression recognizers are
1207: deterministic finite automata
1208: constructed directly from the expressions.
1209: An
1210: .IT awk
1211: program is translated into a
1212: parse tree which is then directly executed
1213: by a simple interpreter.
1214: .PP
1215: .IT Awk
1216: was designed for ease of use rather than processing speed;
1217: the delayed evaluation of variable types
1218: and the necessity to break input
1219: into fields make high speed difficult to achieve in any case.
1220: Nonetheless,
1221: the program has not proven to be unworkably slow.
1222: .PP
1223: Table I below shows the execution (user + system) time
1224: on a PDP-11/70 of
1225: the
1226: .UC UNIX
1227: programs
1228: .IT wc ,
1229: .IT grep ,
1230: .IT egrep ,
1231: .IT fgrep ,
1232: .IT sed ,
1233: .IT lex ,
1234: and
1235: .IT awk
1236: on the following simple tasks:
1237: .IP "\ \ 1."
1238: count the number of lines.
1239: .IP "\ \ 2."
1240: print all lines containing ``doug''.
1241: .IP "\ \ 3."
1242: print all lines containing ``doug'', ``ken'' or ``dmr''.
1243: .IP "\ \ 4."
1244: print the third field of each line.
1245: .IP "\ \ 5."
1246: print the third and second fields of each line, in that order.
1247: .IP "\ \ 6."
1248: append all lines containing ``doug'', ``ken'', and ``dmr''
1249: to files ``jdoug'', ``jken'', and ``jdmr'', respectively.
1250: .IP "\ \ 7."
1251: print each line prefixed by ``line-number\ :\ ''.
1252: .IP "\ \ 8."
1253: sum the fourth column of a table.
1254: .LP
1255: The program
1256: .IT wc
1257: merely counts words, lines and characters in its input;
1258: we have already mentioned the others.
1259: In all cases the input was a file containing
1260: 10,000 lines
1261: as created by the
1262: command
1263: .IT "ls \-l" ;
1264: each line has the form
1265: .P1
1266: -rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx
1267: .P2
1268: The total length of this input is
1269: 452,960 characters.
1270: Times for
1271: .IT lex
1272: do not include compile or load.
1273: .PP
1274: As might be expected,
1275: .IT awk
1276: is not as fast as the specialized tools
1277: .IT wc ,
1278: .IT sed ,
1279: or the programs in the
1280: .IT grep
1281: family,
1282: but
1283: is faster than the more general tool
1284: .IT lex .
1285: In all cases, the tasks were
1286: about as easy to express as
1287: .IT awk
1288: programs
1289: as programs in these other languages;
1290: tasks involving fields were
1291: considerably easier to express as
1292: .IT awk
1293: programs.
1294: Some of the test programs are shown in
1295: .IT awk ,
1296: .IT sed
1297: and
1298: .IT lex .
1299: .[
1300: $LIST$
1301: .]
1302: .1C
1303: .TS
1304: center;
1305: c c c c c c c c c
1306: c c c c c c c c c
1307: c|n|n|n|n|n|n|n|n|.
1308: Task
1309: Program 1 2 3 4 5 6 7 8
1310: _
1311: \fIwc\fR 8.6
1312: \fIgrep\fR 11.7 13.1
1313: \fIegrep\fR 6.2 11.5 11.6
1314: \fIfgrep\fR 7.7 13.8 16.1
1315: \fIsed\fR 10.2 11.6 15.8 29.0 30.5 16.1
1316: \fIlex\fR 65.1 150.1 144.2 67.7 70.3 104.0 81.7 92.8
1317: \fIawk\fR 15.0 25.6 29.9 33.3 38.9 46.4 71.4 31.1
1318: _
1319: .TE
1320: .sp
1321: .ce
1322: \fBTable I.\fR Execution Times of Programs. (Times are in sec.)
1323: .sp 2
1324: .2C
1325: .PP
1326: The programs for some of these jobs are shown below.
1327: The
1328: .IT lex
1329: programs are generally too long to show.
1330: .LP
1331: AWK:
1332: .LP
1333: .P1
1334: 1. END {print NR}
1335: .P2
1336: .P1
1337: 2. /doug/
1338: .P2
1339: .P1
1340: 3. /ken|doug|dmr/
1341: .P2
1342: .P1
1343: 4. {print $3}
1344: .P2
1345: .P1
1346: 5. {print $3, $2}
1347: .P2
1348: .P1
1349: 6. /ken/ {print >"jken"}
1350: /doug/ {print >"jdoug"}
1351: /dmr/ {print >"jdmr"}
1352: .P2
1353: .P1
1354: 7. {print NR ": " $0}
1355: .P2
1356: .P1
1357: 8. {sum = sum + $4}
1358: END {print sum}
1359: .P2
1360: .LP
1361: SED:
1362: .LP
1363: .P1
1364: 1. $=
1365: .P2
1366: .P1
1367: 2. /doug/p
1368: .P2
1369: .P1
1370: 3. /doug/p
1371: /doug/d
1372: /ken/p
1373: /ken/d
1374: /dmr/p
1375: /dmr/d
1376: .P2
1377: .P1
1378: 4. /[^ ]* [ ]*[^ ]* [ ]*\e([^ ]*\e) .*/s//\e1/p
1379: .P2
1380: .P1
1381: 5. /[^ ]* [ ]*\e([^ ]*\e) [ ]*\e([^ ]*\e) .*/s//\e2 \e1/p
1382: .P2
1383: .P1
1384: 6. /ken/w jken
1385: /doug/w jdoug
1386: /dmr/w jdmr
1387: .P2
1388: .LP
1389: LEX:
1390: .LP
1391: .P1
1392: 1. %{
1393: int i;
1394: %}
1395: %%
1396: \en i++;
1397: . ;
1398: %%
1399: yywrap() {
1400: printf("%d\en", i);
1401: }
1402: .P2
1403: .P1
1404: 2. %%
1405: ^.*doug.*$ printf("%s\en", yytext);
1406: . ;
1407: \en ;
1408: .P2