Lex - A Lexical Analyzer Generator

ABSTRACT

Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however, will only discuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with access to this compiler-compiler system.

1. Introduction.

The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators. The program that recognizes the expressions is generated in the general purpose programming language employed for the user's program fragments. Thus, a high level expression language is provided to write the string expressions to be matched while the user's freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write processing programs in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.'' Just as general purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations. At present, the only supported host language is C, although Fortran (in the form of Ratfor [2]) has been available in the past. Lex itself exists on UNIX, GCOS, and OS/370; but the code generated by Lex may be taken anywhere the appropriate compilers exist.

Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected. See Figure 1.

                    +-------+
          Source -> |  Lex  | -> yylex
                    +-------+

                    +-------+
          Input  -> | yylex | -> Output
                    +-------+

                An overview of Lex
                     Figure 1

%%
[ \t]+$   ;

%%
[ \t]+$   ;
[ \t]+    printf(" ");
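To make the effect of the two rules above concrete, here is a minimal hand-written C sketch of the same transformation: a run of blanks and tabs that ends a line is dropped, and any other run is replaced by a single blank. The function name squeeze_blanks is hypothetical and is not part of Lex; the sketch assumes the caller supplies an output buffer at least as large as the input.

```c
#include <stddef.h>

/* Hypothetical helper, not generated by Lex: copy `in` to `out`,
   applying the two whitespace rules. `out` must have room for
   strlen(in) + 1 bytes. */
void squeeze_blanks(const char *in, char *out)
{
    size_t i = 0, o = 0;
    while (in[i] != '\0') {
        if (in[i] == ' ' || in[i] == '\t') {
            /* Longest match: consume the whole run of blanks and tabs. */
            while (in[i] == ' ' || in[i] == '\t')
                i++;
            /* Rule 1: a run immediately followed by newline is dropped.
               Rule 2: any other run becomes one blank. */
            if (in[i] != '\n')
                out[o++] = ' ';
        } else {
            out[o++] = in[i++];   /* default action: copy the character */
        }
    }
    out[o] = '\0';
}
```

The Lex version states the same logic in two lines, which is the point of the table-driven notation.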

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context free grammars, but require a lower level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown in Figure 2. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

                 lexical        grammar
                  rules          rules
                    |              |
                    v              v
               +---------+    +---------+
               |   Lex   |    |  Yacc   |
               +---------+    +---------+
                    |              |
                    v              v
               +---------+    +---------+
      Input -> |  yylex  | -> | yyparse | -> Parsed input
               +---------+    +---------+

                      Lex with Yacc
                        Figure 2

Lex generates a deterministic finite automaton from the regular expressions in the source [4]. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before the remaining input, cdefh. Such backup is more costly than the processing of simpler languages.
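The backup described above can be illustrated with a small C sketch. It is not the automaton Lex generates: it handles literal patterns only and rechecks them at every step, but it shows the same bookkeeping, namely reading ahead while some rule might still match and remembering the last point at which a rule matched completely. The name scan_longest and its parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: return the length of the longest literal in
   `pats` that begins `input`; store in *examined how many characters
   had to be read before the answer was known. */
size_t scan_longest(const char *input, const char *const pats[],
                    size_t npats, size_t *examined)
{
    size_t accepted = 0;   /* length of the last complete match seen */
    size_t n = 0;          /* characters read so far */
    int alive = 1;         /* can any rule still match? */

    while (alive && input[n] != '\0') {
        n++;               /* read one more character */
        alive = 0;
        for (size_t k = 0; k < npats; k++) {
            size_t len = strlen(pats[k]);
            if (len >= n && strncmp(input, pats[k], n) == 0) {
                alive = 1;             /* this rule is still possible */
                if (len == n)
                    accepted = n;      /* this rule matches exactly here */
            }
        }
    }
    *examined = n;
    return accepted;
}
```

With the patterns ab and abcdefg and the input abcdefh, seven characters are examined before the longer rule fails, the match reported has length two, and the five characters in between must be backed up over.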

2. Lex Source.

{definitions}
%%
{rules}
%%
{user subroutines}

%%

In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table, in which the left column contains regular expressions (see section 3) and the right column contains actions, program fragments to be executed when the expressions are recognized. Thus an individual rule might appear

integer printf("found keyword INT");

colour      printf("color");
mechanise   printf("mechanize");
petrol      printf("gas");

3. Lex Regular Expressions.

integer

a57D

Operators. The operator characters are

" \ [ ] ^ - ? . * + | ( ) $ / { } % < >

xyz"++"

"xyz++"

An operator character may also be turned into a text character by preceding it with \ as in

xyz\+\+

Character classes. Classes of characters can be specified using the operator pair []. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored. Only three characters are special: these are \ - and ^. The - character indicates ranges. For example,

[a-z0-9<>_]

[-+0-9]

In character classes, the ^ operator must appear as the first character after the left bracket; it indicates that the resulting string is to be complemented with respect to the computer character set. Thus

[^abc]

[^a-zA-Z]

Arbitrary character. To match almost any character, the operator character . is the class of all characters except newline. Escaping into octal is possible although non-portable:

[\40-\176]

Optional expressions. The operator ? indicates an optional element of an expression. Thus

ab?c

Repeated expressions. Repetitions of classes are indicated by the operators * and +.

a*

a+

[a-z]+

[A-Za-z][A-Za-z0-9]*

Alternation and Grouping. The operator | indicates alternation:

(ab|cd)

ab|cd

(ab|cd+)?(ef)*

Context sensitivity. Lex will recognize a small amount of surrounding context. The two simplest operators for this are ^ and $. If the first character of an expression is ^, the expression will only be matched at the beginning of a line (after a newline character, or at the beginning of the input stream). This can never conflict with the other meaning of ^, complementation of character classes, since that only applies within the [] operators. If the very last character is $, the expression will only be matched at the end of a line (when immediately followed by newline). The latter operator is a special case of the / operator character, which indicates trailing context. The expression

ab/cd

ab$

ab/



<x>

<ONE>

Repetitions and Definitions. The operators {} specify either repetitions (if they enclose numbers) or definition expansion (if they enclose a name). For example

{digit}

a{1,5}

Finally, initial % is special, being the separator for Lex source segments.

4. Lex Actions.

One of the simplest things that can be done is to ignore the input. Specifying a C null statement, ; as an action causes this result. A frequent rule is

[ \t\n]   ;

Another easy way to avoid writing actions is the action character |, which indicates that the action for this rule is the action for the next rule. The previous example could also have been written

" "     |
"\t"    |
"\n"    ;

In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name found, a rule like

[a-z]+ printf("%s", yytext);

[a-z]+ ECHO;

Sometimes it is more convenient to know the end of what has been found; hence Lex also provides a count yyleng of the number of characters matched. To count both the number of words and the number of characters in words in the input, the user might write [a-zA-Z]+ {words++; chars += yyleng;} which accumulates in chars the number of characters in the words recognized. The last character in the string matched can be accessed by

yytext[yyleng-1]
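For comparison, the word-counting rule above can be written out by hand. The following C sketch is only an illustration of what the rule computes; the function count_words is hypothetical, and the local variable len plays the role that yyleng plays in the Lex action.

```c
/* Hypothetical helper: count the words matching [a-zA-Z]+ in `s`
   and the total number of characters in those words. */
void count_words(const char *s, int *words, int *chars)
{
    *words = *chars = 0;
    while (*s != '\0') {
        int len = 0;                      /* plays the role of yyleng */
        while ((s[len] >= 'a' && s[len] <= 'z') ||
               (s[len] >= 'A' && s[len] <= 'Z'))
            len++;
        if (len > 0) {
            (*words)++;
            *chars += len;
            s += len;
        } else {
            s++;                          /* not a letter: move on */
        }
    }
}
```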

Occasionally, a Lex action may decide that a rule has not recognized the correct span of characters. Two routines are provided to aid with this situation. First, yymore() can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext. Second, yyless (n) may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument n indicates the number of characters in yytext to be retained. Further characters previously matched are returned to the input. This provides the same sort of lookahead offered by the / operator, but in a different form.
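The bookkeeping behind these two routines can be pictured with a toy model. The C sketch below is not Lex's implementation: the structure toy_scanner and the functions toy_match and toy_less are hypothetical names, the text buffer is fixed at 64 bytes, and no bounds are checked. It shows only how appending (as after yymore) and truncation (as in yyless) affect the matched text and the input position.

```c
#include <string.h>

/* Toy model: `text` plays the role of yytext, `len` of yyleng,
   and `pos` is the read position within `src`. */
struct toy_scanner {
    const char *src;
    size_t pos;
    char text[64];
    size_t len;
};

/* Record that the next n input characters were matched. If `more`
   is nonzero the match is appended to text, as after yymore();
   otherwise it overwrites the previous contents. */
void toy_match(struct toy_scanner *s, size_t n, int more)
{
    if (!more)
        s->len = 0;
    memcpy(s->text + s->len, s->src + s->pos, n);
    s->len += n;
    s->text[s->len] = '\0';
    s->pos += n;
}

/* Keep the first n characters of text (n <= len) and return the
   remaining matched characters to the input, as yyless(n) does. */
void toy_less(struct toy_scanner *s, size_t n)
{
    s->pos -= s->len - n;
    s->len = n;
    s->text[n] = '\0';
}
```

After matching the three characters of =-a and calling toy_less with len-1, the text is =- and the next match starts again at the letter a.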

Example: Consider a language which defines a string as a set of characters between quotation (") marks, and provides that to include a " in a string it must be preceded by a \. The regular expression which matches that is somewhat confusing, so that it might be preferable to write

\"[^"]*   {
          if (yytext[yyleng-1] == '\\')
               yymore();
          else
               ... normal user processing
          }

The function yyless() might be used to reprocess text in various circumstances. Consider the C problem of distinguishing the ambiguity of ``=-a''. Suppose it is desired to treat this as ``=- a'' but print a message. A rule might be

=-[a-zA-Z]   {
             printf("Op (=-) ambiguous\n");
             yyless(yyleng-1);
             ... action for =- ...
             }

=-[a-zA-Z]   {
             printf("Op (=-) ambiguous\n");
             yyless(yyleng-2);
             ... action for = ...
             }

=-/[A-Za-z]

=/-[A-Za-z]

=-/[^ \t\n]

In addition to these routines, Lex also permits access to the I/O routines it uses. They are:

1) input() which returns the next input character;

2) output(c) which writes the character c on the output; and

3) unput(c) pushes the character c back onto the input stream to be read later by input().

By default these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently. They may be redefined, to cause input or output to be transmitted to or from strange places, including other programs or internal memory; but the character set used must be consistent in all routines; a value of zero returned by input must mean end of file; and the relationship between unput and input must be retained or the Lex lookahead will not work. Lex does not look ahead at all if it does not have to, but every rule ending in + * ? or $ or containing / implies lookahead. Lookahead is also necessary to match an expression that is a prefix of another expression. See below for a discussion of the character set used by Lex. The standard Lex library imposes a 100 character limit on backup.
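As an illustration of the consistency requirements above, here is a minimal C sketch of a matched pair of input and pushback routines that read from internal memory. The names set_source, mem_input and mem_unput are hypothetical; they are written as ordinary functions and do not show how the Lex macros themselves are replaced. The sketch keeps the two properties the text insists on: zero means end of file, and a character pushed back is the next one returned.

```c
#define PUSHBACK_MAX 100   /* mirrors the backup limit quoted above */

static const char *src_text = "";   /* hypothetical in-memory source */
static char pushback[PUSHBACK_MAX];
static int npushed = 0;

/* Start reading from `text` and discard any pushed-back characters. */
void set_source(const char *text)
{
    src_text = text;
    npushed = 0;
}

/* Return the next character, or 0 at end of input. */
int mem_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src_text == '\0')
        return 0;
    return (unsigned char)*src_text++;
}

/* Push c back so that the next mem_input() returns it.
   Characters beyond the limit are silently dropped in this sketch. */
void mem_unput(int c)
{
    if (npushed < PUSHBACK_MAX)
        pushback[npushed++] = (char)c;
}
```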

Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrapup on end of input. Sometimes, however, it is convenient to arrange for more input to arrive from a new source. In this case, the user should provide a yywrap which arranges for new input and returns 0. This instructs Lex to continue processing. The default yywrap always returns 1.

This routine is also a convenient place to print tables, summaries, etc. at the end of a program. Note that it is not possible to write a normal rule which recognizes end-of-file; the only access to this condition is through yywrap. In fact, unless a private version of input() is supplied a file containing nulls cannot be handled, since a value of 0 returned by input is taken to be end-of-file.
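The return convention of yywrap can be sketched as follows. The C fragment below is an assumption-laden illustration, not a drop-in replacement: it uses a hypothetical list of in-memory inputs and the hypothetical names set_inputs, chain_wrap and current_input, and leaves out the connection to the input routine. It shows only the contract described above: return 0 after arranging more input, and 1 when nothing is left.

```c
#include <stddef.h>

static const char *const *pending = NULL;  /* NULL-terminated list of inputs not yet opened */
static const char *current = NULL;         /* text being scanned now */

/* Register the list of inputs to be processed in order. */
void set_inputs(const char *const *list)
{
    pending = list;
    current = NULL;
}

/* Same contract as yywrap: return 0 after arranging more input,
   1 when the input is exhausted and normal wrapup should proceed. */
int chain_wrap(void)
{
    if (pending == NULL || *pending == NULL)
        return 1;
    current = *pending++;
    return 0;
}

/* The input currently selected, or NULL before the first switch. */
const char *current_input(void)
{
    return current;
}
```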

5. Ambiguous Source Rules.

1) The longest match is preferred.

2) Among rules which matched the same number of characters, the rule given first is preferred.

Thus, suppose the rules

integer   keyword action ...;
[a-z]+    identifier action ...;
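The two preference rules can be tried out on this pair of rules with a short C sketch. It is a hand-written stand-in for the generated analyzer, with hypothetical function names; rule 0 is the keyword integer and rule 1 is [a-z]+. A strictly longer match replaces the current choice, so a tie is left with the rule given first.

```c
#include <string.h>

/* Length matched by rule 0, the literal keyword "integer". */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length matched by rule 1, the pattern [a-z]+. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Return the index of the winning rule, or -1 if neither matches;
   *len receives the length of the match. */
int choose_rule(const char *s, size_t *len)
{
    size_t (*const rules[])(const char *) = { match_keyword, match_identifier };
    int best = -1;

    *len = 0;
    for (int k = 0; k < 2; k++) {
        size_t n = rules[k](s);
        if (n > *len) {        /* strictly longer wins; ties keep the earlier rule */
            *len = n;
            best = k;
        }
    }
    return best;
}
```

On the input integers the identifier rule wins with eight characters against seven; on integer followed by a blank both rules match seven characters and the keyword rule, being first, is chosen.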

The principle of preferring the longest match makes rules containing expressions like .* dangerous. For example, '.*' might seem a good way of recognizing a string in single quotes. But it is an invitation for the program to read far ahead, looking for a distant single quote. Presented with the input

'first' quoted string here, 'second' here

'first' quoted string here, 'second'

'[^'\n]*'

Note that Lex is normally partitioning the input stream, not searching for all possible matches of each expression. This means that each character is accounted for once and only once. For example, suppose it is desired to count occurrences of both she and he in an input text. Some Lex rules to do this might be

she   s++;
he    h++;
\n    |
.     ;

Sometimes the user would like to override this choice. The action REJECT means ``go do the next alternative.'' It causes whatever rule was second choice after the current rule to be executed. The position of the input pointer is adjusted accordingly. Suppose the user really wants to count the included instances of he:

she   {s++; REJECT;}
he    {h++; REJECT;}
\n    |
.     ;
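The difference between the two versions of the she/he rules can be checked with a pair of hand-written C functions. These are sketches of the counting behavior only, with hypothetical names; the second one models the net effect of REJECT for these particular rules (every input position ends up being tried) rather than the mechanism Lex uses.

```c
#include <string.h>

/* Partitioning: each character belongs to exactly one match, so the
   "he" inside "she" is never seen. */
void count_partition(const char *s, int *she, int *he)
{
    *she = *he = 0;
    while (*s != '\0') {
        if (strncmp(s, "she", 3) == 0) {
            (*she)++;
            s += 3;
        } else if (strncmp(s, "he", 2) == 0) {
            (*he)++;
            s += 2;
        } else {
            s++;            /* the \n and . rules: skip one character */
        }
    }
}

/* With REJECT on both rules: overlapping and included instances
   are counted as well. */
void count_overlapping(const char *s, int *she, int *he)
{
    *she = *he = 0;
    for (; *s != '\0'; s++) {
        if (strncmp(s, "she", 3) == 0)
            (*she)++;
        if (strncmp(s, "he", 2) == 0)
            (*he)++;
    }
}
```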

Consider the two rules

a[bc]+    { ... ; REJECT;}
a[cd]+    { ... ; REJECT;}

In general, REJECT is useful whenever the purpose of Lex is not to partition the input stream but to detect all examples of some items in the input, and the instances of these items may overlap or include each other. Suppose a digram table of the input is desired; normally the digrams overlap, that is the word the is considered to contain both th and he. Assuming a two-dimensional array named digram to be incremented, the appropriate source is

%%
[a-z][a-z]   {
             digram[yytext[0]][yytext[1]]++;
             REJECT;
             }
.            ;
\n           ;
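What the digram source computes can be stated directly in C. The sketch below uses a hypothetical function count_digrams and, unlike the Lex source, indexes the table by letter position (0 for a through 25 for z) rather than by character code, so that a 26 by 26 array suffices; it assumes a character set in which the lower-case letters are contiguous.

```c
/* Count every adjacent pair of lower-case letters in s.
   table[x][y] is indexed by letter position, 0 for 'a' through
   25 for 'z'. Overlapping pairs are counted, as with REJECT. */
void count_digrams(const char *s, int table[26][26])
{
    for (; s[0] != '\0' && s[1] != '\0'; s++) {
        if (s[0] >= 'a' && s[0] <= 'z' && s[1] >= 'a' && s[1] <= 'z')
            table[s[0] - 'a'][s[1] - 'a']++;
    }
}
```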

6. Lex Source Definitions.

{definitions}
%%
{rules}
%%
{user routines}

Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program. There are three classes of such things.

1) Any line which is not part of a Lex rule or action which begins with a blank or tab is copied into the Lex generated program. Such source input prior to the first %% delimiter will be external to any function in the code; if it appears immediately after the first %%, it appears in an appropriate place for declarations in the function written by Lex which contains the actions. This material must look like program fragments, and should precede the first Lex rule. As a side effect of the above, lines which begin with a blank or tab, and which contain a comment, are passed through to the generated program. This can be used to include comments in either the Lex source or the generated code. The comments should follow the host language convention.

2) Anything included between lines containing only %{ and %} is copied out as above. The delimiters are discarded. This format permits entering text like preprocessor statements that must begin in column 1, or copying lines that do not look like programs.

3) Anything after the second %% delimiter, regardless of formats, etc., is copied out after the Lex output.

Definitions intended for Lex are given before the first %% delimiter. Any line in this section not contained between %{ and %}, and beginning in column 1, is assumed to define Lex substitution strings. The format of such lines is

name translation

and it causes the string given as a translation to be associated with the name. The name and translation must be separated by at least one blank or tab, and the name must begin with a letter. The translation can then be called out by the {name} syntax in a rule. Using {D} for the digits and {E} for an exponent field, for example, might abbreviate rules to recognize numbers:

D                   [0-9]
E                   [DEde][-+]?{D}+
%%
{D}+                printf("integer");
{D}+"."{D}*({E})?   |
{D}*"."{D}+({E})?   |
{D}+{E}             printf("real");

[0-9]+/"."EQ printf("integer");

The definitions section may also contain other commands, including the selection of a host language, a character set table, a list of start conditions, or adjustments to the default size of arrays within Lex itself for larger source programs. These possibilities are discussed below under ``Summary of Source Format,'' section 12.

7. Usage.

The C programs generated by Lex are slightly different on OS/370, because the OS compiler is less powerful than the UNIX or GCOS compilers, and does less at compile time. C programs generated on GCOS and UNIX are the same.

UNIX. The library is accessed by the loader flag -ll. So an appropriate set of commands is

lex source
cc lex.yy.c -ll

The resulting program is placed on the usual file a.out for later execution. To use Lex with Yacc see below. Although the default Lex I/O routines use the C standard library, the Lex automata themselves do not do so; if private versions of input, output and unput are given, the library can be avoided.

8. Lex and Yacc.

return(token);

yacc good
lex better
cc y.tab.c -ly -ll

9. Examples.

%%
        int k;
[0-9]+  {
        k = atoi(yytext);
        if (k%7 == 0)
             printf("%d", k+3);
        else
             printf("%d",k);
        }

%%
                      int k;
-?[0-9]+              {
                      k = atoi(yytext);
                      printf("%d", k%7 == 0 ? k+3 : k);
                      }
-?[0-9.]+             ECHO;
[A-Za-z][A-Za-z0-9]+  ECHO;

For an example of statistics gathering, here is a program which histograms the lengths of words, where a word is defined as a string of letters.

        int lengs[100];
%%
[a-z]+  lengs[yyleng]++;
.       |
\n      ;
%%
int yywrap()
{
int i;
printf("Length  No. words\n");
for(i=0; i<100; i++)
     if (lengs[i] > 0)
          printf("%5d%10d\n",i,lengs[i]);
return(1);
}

As a larger example, here are some parts of a program written by N. L. Schryer to convert double precision Fortran to single precision Fortran. Because Fortran does not distinguish upper and lower case letters, this routine begins by defining a set of classes including both cases of each letter:

a    [aA]
b    [bB]
c    [cC]
...
z    [zZ]

W [ \t]*

{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} { printf(yytext[0]=='d'? "real" : "REAL"); }

^" "[^ 0] ECHO;

[0-9]+{W}{d}{W}[+-]?{W}[0-9]+            |
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+      |
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+      {
     /* convert constants */
     for(p=yytext; *p != 0; p++)
          {
          if (*p == 'd' || *p == 'D')
               *p += 'e' - 'd';
          }
     ECHO;
     }

{d}{s}{i}{n}         |
{d}{c}{o}{s}         |
{d}{s}{q}{r}{t}      |
{d}{a}{t}{a}{n}      |
...
{d}{f}{l}{o}{a}{t}   printf("%s",yytext+1);

{d}{l}{o}{g}     |
{d}{l}{o}{g}10   |
{d}{m}{i}{n}1    |
{d}{m}{a}{x}1    {
                 yytext[0] += 'a' - 'd';
                 ECHO;
                 }

{d}1{m}{a}{c}{h}   {
                   yytext[0] += 'r' - 'd';
                   ECHO;
                   }

[A-Za-z][A-Za-z0-9]*   |
[0-9]+                 |
\n                     |
.                      ECHO;

10. Left Context Sensitivity.

This section describes three means of dealing with different environments: a simple use of flags, when only a few rules change from one environment to another, the use of start conditions on rules, and the possibility of making multiple lexical analyzers all run together. In each case, there are rules which recognize the need to change the environment in which the following input text is analyzed, and set some parameter to reflect the change. This may be a flag explicitly tested by the user's action code; such a flag is the simplest way of dealing with the problem, since Lex is not involved at all. It may be more convenient, however, to have Lex remember the flags as initial conditions on the rules. Any rule may be associated with a start condition. It will only be recognized when Lex is in that start condition. The current start condition may be changed at any time. Finally, if the sets of rules for the different environments are very dissimilar, clarity may be best achieved by writing several distinct lexical analyzers, and switching from one to another as desired.

Consider the following problem: copy the input to the output, changing the word magic to first on every line which began with the letter a, changing magic to second on every line which began with the letter b, and changing magic to third on every line which began with the letter c. All other words and all other lines are left unchanged.

These rules are so simple that the easiest way to do this job is with a flag:

        int flag;
%%
^a      {flag = 'a'; ECHO;}
^b      {flag = 'b'; ECHO;}
^c      {flag = 'c'; ECHO;}
\n      {flag =  0 ; ECHO;}
magic   {
        switch (flag)
        {
        case 'a': printf("first"); break;
        case 'b': printf("second"); break;
        case 'c': printf("third"); break;
        default: ECHO; break;
        }
        }
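For readers who want to see the flag technique without Lex, the same job can be written by hand in C. This is a sketch under stated assumptions: the function replace_magic is hypothetical, it works on strings in memory rather than on the input and output streams, and the caller must supply an output buffer large enough for the result (replacing magic by second lengthens the text by one character per occurrence).

```c
#include <string.h>

/* Hypothetical helper: copy `in` to `out`, replacing the word magic
   according to the first letter of the current line. */
void replace_magic(const char *in, char *out)
{
    int flag = 0;            /* 0, 'a', 'b' or 'c', as in the Lex source */
    int at_line_start = 1;

    while (*in != '\0') {
        if (at_line_start && (*in == 'a' || *in == 'b' || *in == 'c')) {
            flag = *in;                      /* rules ^a, ^b, ^c */
            *out++ = *in++;
            at_line_start = 0;
        } else if (*in == '\n') {
            flag = 0;                        /* the \n rule resets the flag */
            *out++ = *in++;
            at_line_start = 1;
        } else if (strncmp(in, "magic", 5) == 0) {
            const char *word = flag == 'a' ? "first"
                             : flag == 'b' ? "second"
                             : flag == 'c' ? "third" : "magic";
            size_t n = strlen(word);
            memcpy(out, word, n);
            out += n;
            in += 5;
            at_line_start = 0;
        } else {
            *out++ = *in++;                  /* default: copy unchanged */
            at_line_start = 0;
        }
    }
    *out = '\0';
}
```

The hand-written version needs an explicit line-start variable; in the Lex source that bookkeeping is carried by the ^ operator.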

To handle the same problem with start conditions, each start condition must be introduced to Lex in the definitions section with a line reading

%Start name1 name2 ...

<name1>expression

BEGIN name1;

BEGIN 0;

The same example as before can be written:

%START AA BB CC
%%
^a          {ECHO; BEGIN AA;}
^b          {ECHO; BEGIN BB;}
^c          {ECHO; BEGIN CC;}
\n          {ECHO; BEGIN 0;}
<AA>magic   printf("first");
<BB>magic   printf("second");
<CC>magic   printf("third");

11. Character Set.

{integer} {character string}

%T
 1    Aa
 2    Bb
...
26    Zz
27    \n
28    +
29    -
30    0
31    1
...
39    9
%T

Sample character table.

12. Summary of Source Format.

{definitions}
%%
{rules}
%%
{user subroutines}

1) Definitions, in the form ``name space translation''.

2) Included code, in the form ``space code''.

3) Included code, in the form

%{
code
%}

%S name1 name2 ...

%T
number space character-string
...
%T

%x nnn

Letter    Parameter
  p       positions
  n       states
  e       tree nodes
  a       transitions
  k       packed character classes
  o       output array size

Regular expressions in Lex use the following operators:

x        the character "x"
"x"      an "x", even if x is an operator.
\x       an "x", even if x is an operator.
[xy]     the character x or y.
[x-z]    the characters x, y or z.
[^x]     any character but x.
.        any character but newline.
^x       an x at the beginning of a line.
<y>x     an x when Lex is in start condition y.
x$       an x at the end of a line.
x?       an optional x.
x*       0,1,2, ... instances of x.
x+       1,2,3, ... instances of x.
x|y      an x or a y.
(x)      an x.
x/y      an x but only if followed by y.
{xx}     the translation of xx from the definitions section.
x{m,n}   m through n occurrences of x

13. Caveats and Bugs.

REJECT does not rescan the input; instead it remembers the results of the previous scan. This means that if a rule with trailing context is found, and REJECT executed, the user must not have used unput to change the characters forthcoming from the input stream. This is the only restriction on the user's ability to manipulate the not-yet-processed input.

14. Acknowledgments.

The code of the current version of Lex was designed, written, and debugged by Eric Schmidt.

15. References.

1. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, N. J. (1978).

2. B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran, Software Practice and Experience, 5, pp. 395-406 (1975).

3. S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing Science Technical Report No. 32, 1975, Bell Laboratories, Murray Hill, NJ 07974.

4. A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Comm. ACM 18, 333-340 (1975).

5. B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text Editor, Computing Science Technical Report No. 5, 1972, Bell Laboratories, Murray Hill, NJ 07974.

6. D. M. Ritchie, private communication. See also M. E. Lesk, The Portable C Library, Computing Science Technical Report No. 31, Bell Laboratories, Murray Hill, NJ 07974.