.\" @(#)ss3 6.1 (Berkeley) 5/8/86
.\"
.SH
3: Lexical Analysis
.PP
The user must supply a lexical analyzer to read the input stream and communicate tokens
(with values, if desired) to the parser.
The lexical analyzer is an integer-valued function called
.I yylex .
The function returns an integer, the
.I "token number" ,
representing the kind of token read.
If there is a value associated with that token, it should be assigned
to the external variable
.I yylval .
.PP
The parser and the lexical analyzer must agree on these token numbers in order for
communication between them to take place.
The numbers may be chosen by Yacc, or chosen by the user.
In either case, the ``# define'' mechanism of C is used to allow the lexical analyzer
to return these numbers symbolically.
For example, suppose that the token name DIGIT has been defined in the declarations section of the
Yacc specification file.
The relevant portion of the lexical analyzer might look like:
.DS
int yylex(){
	extern int yylval;	/* value of the token just read */
	int c;
	. . .
	c = getchar();
	. . .
	switch( c ) {
		. . .
	case \'0\':
	case \'1\':
		. . .
	case \'9\':
		yylval = c\-\'0\';	/* numerical value of the digit */
		return( DIGIT );
		. . .
	}
	. . .
.DE
.PP
The intent is to return a token number of DIGIT, and a value equal to the numerical value of the
digit.
Provided that the lexical analyzer code is placed in the programs section of the specification file,
the identifier DIGIT will be defined as the token number associated
with the token DIGIT.
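.PP
As a rough sketch only, the fragment above might be filled out along the following lines.
Several things here are assumptions made for illustration: DIGIT and
yylval are defined by hand (normally the Yacc output would supply them),
the names lex_input and lex_set_input are hypothetical and not part of Yacc,
and the input is taken from a string rather than from getchar so that the
function can be exercised in isolation.

```c
/* Assumption: written by hand here; Yacc would normally emit this
   definition for a token declared in the declarations section. */
#define DIGIT 257

/* Assumption: defined here so the sketch stands alone; in a real
   parser the generated code provides yylval. */
int yylval;

/* Hypothetical input source: a string instead of standard input. */
static const char *lex_input = "";

/* Hypothetical helper: point the analyzer at a new input string. */
void lex_set_input(const char *s)
{
    lex_input = s;
}

int yylex(void)
{
    int c;

    /* Skip blanks and tabs between tokens. */
    while (*lex_input == ' ' || *lex_input == '\t')
        lex_input++;

    /* End of input: return the endmarker. */
    if (*lex_input == '\0')
        return 0;

    c = (unsigned char)*lex_input++;

    if (c >= '0' && c <= '9') {
        yylval = c - '0';   /* value of the digit */
        return DIGIT;       /* kind of token read */
    }

    /* Any other character is returned as a literal token. */
    return c;
}
```

With the input string "7 +", successive calls would be expected to yield DIGIT
(with yylval set to 7), then the literal character +, then 0.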
.PP
This mechanism leads to clear,
easily modified lexical analyzers.
The only pitfall is the need
to avoid token names in the grammar that are reserved
or significant in C or the parser.
For example, the use of
token names
.I if
or
.I while
will almost certainly cause severe
difficulties when the lexical analyzer is compiled.
The token name
.I error
is reserved for error handling, and should not be used naively
(see Section 7).
.PP
As mentioned above, the token numbers may be chosen by Yacc or by the user.
By default, the numbers are chosen by Yacc.
The default token number for a literal
character is the numerical value of the character in the local character set.
Other names are assigned token numbers
starting at 257.
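.PP
To make the default numbering concrete, the following sketch shows what the
resulting definitions might look like.
It assumes a specification that declares the hypothetical token names
DIGIT, NAME, and NUMBER in that order without explicit numbers, and it assumes
that numbers are handed out consecutively from 257; the exact text that a
given Yacc implementation generates may differ.
The helper is_literal_token is likewise hypothetical.

```c
/* Assumption: definitions of the kind Yacc might generate for three
   named tokens declared in this order with no explicit numbers. */
#define DIGIT  257
#define NAME   258
#define NUMBER 259

/* Hypothetical helper: returns nonzero if tok lies in the range used
   by default for literal characters, that is, above the endmarker (0)
   and below the first named token (257). */
int is_literal_token(int tok)
{
    return tok > 0 && tok < 257;
}
```

Under these assumptions a literal such as \'+\' keeps its character value as
its token number, so it can never collide with a named token.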
.PP
To assign a token number to a token (including literals),
the first appearance of the token name or literal
.I
in the declarations section
.R
can be immediately followed by
a nonnegative integer.
This integer is taken to be the token number of the name or literal.
Names and literals not defined by this mechanism retain their default definition.
It is important that all token numbers be distinct.
.PP
For historical reasons, the endmarker must have a token
number that is 0 or negative.
This token number cannot be redefined by the user; thus, all
lexical analyzers should be prepared to return 0 or a negative value as the token number
upon reaching the end of their input.
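.PP
The end-of-input convention can be sketched as follows.
This is only an illustration: the helper token_from_char is a hypothetical
name, and the analyzer shown passes every other character through as a
literal token.
In C the value EOF is itself a negative integer, so returning it unchanged
would also satisfy the rule; the sketch returns 0 explicitly to make the
intent visible.

```c
#include <stdio.h>

/* Hypothetical helper: map one raw result of getchar() to a token
   number. End of input becomes the endmarker, 0. */
int token_from_char(int c)
{
    if (c == EOF)
        return 0;   /* endmarker */
    return c;       /* literal character token */
}

int yylex(void)
{
    return token_from_char(getchar());
}
```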
.PP
A very useful tool for constructing lexical analyzers is
the
.I Lex
program developed by Mike Lesk.
.[
Lesk Lex
.]
The lexical analyzers it produces are designed to work in close
harmony with Yacc parsers.
The specifications for these lexical analyzers
use regular expressions instead of grammar rules.
Lex can be easily used to produce quite complicated lexical analyzers,
but there remain some languages (such as FORTRAN) which do not
fit any theoretical framework, and whose lexical analyzers
must be crafted by hand.