Dec 18, 2017
 
File PCLEX.ZIP from The Programmer's Corner in
Category Miscellaneous Language Source Code
Rob Duff's LEX; used to make AWK202.
File Name File Size Zip Size Zip Type
ABC.L 674 295 deflated
BCPL.L 3708 1076 deflated
BIT.L 175 114 deflated
BTOB.L 1076 467 deflated
CAP.L 348 228 deflated
CLEX.L 4819 941 deflated
CMPT201.L 350 187 deflated
CSTOCK.L 5360 1651 deflated
CTOC.L 1269 666 deflated
HELLO.L 162 119 deflated
HWORD.L 2472 1040 deflated
LANDY.L 2364 745 deflated
LEX.DOC 48193 15966 deflated
LEX.EXE 28835 16382 deflated
LEX.HLP 1614 650 deflated
LEX.L 3044 934 deflated
LEX.MAN 1888 816 deflated
LIBL.ZIP 5184 4716 deflated
PAREN.L 102 92 deflated
SHE.L 131 77 deflated
WORD.L 1554 715 deflated

Download File PCLEX.ZIP Here

Contents of the LEX.DOC file








LEX
A Lexical Analyser Generator


by

Charles H. Forsyth

University of Waterloo
Waterloo, Ontario, N2L 3G1
Canada


Revised by

Robert B. Denny & Martin Minow


LEX transforms a regular-expression grammar and associated
action routines into a C function and set of tables, yielding a
table-driven lexical analyser which manages to be compact and
rapid.





Copyright (C) 1978, Charles H. Forsyth
Modifications Copyright (C) 1980, 1982, DECUS


General permission to copy or modify, but not for
profit, is hereby granted, provided that the above
copyright notice is included and reference made to the
fact that reproduction privileges were granted by DECUS.

The information in this document is subject to change
without notice and should not be construed as a
commitment by Digital Equipment Corporation or by DECUS.

Neither Digital Equipment Corporation, DECUS, nor the
authors assume any responsibility for the use or
reliability of this document or the described software.

This software is made available without any support
whatsoever. The person responsible for an
implementation of this system should expect to have to
understand and modify the source code if any problems
are encountered in implementing or maintaining the
compiler or its run-time library. The DECUS 'Structured
Languages Special Interest Group' is the primary focus
for communication among users of this software.



UNIX is a trademark of Bell Telephone Laboratories.




LEX A Lexical Analyser Generator


1.0 Introduction

A computer program often has an input stream which is composed
of small elements, such as a stream of characters, and which it
would like to convert to larger elements in order to process the
data conveniently. A compiler is a common example of such a
program: it reads a stream of characters forming a program, and
would like to turn this sequence of characters into a sequence
of larger items, namely identifiers, numbers, and operators, for
parsing. In a compiler, the procedures which do this are
collectively called the lexical analyser, or scanner; this
terminology will be used more generally here.

It may happen that the speed with which this transformation is
done will noticeably affect the speed at which the rest of the
program operates. It is certainly true that although such code
is rarely difficult to write, writing and debugging it is both
tedious and time-consuming, and one typically would rather
spend the time on the hard parts of the program. It is also
true that while certain transformations are easily thought of,
they are often hard to express succinctly in the usual
general-purpose programming languages (eg, the description of a
floating-point number).

LEX is a program which tries to give a programmer a good deal of
help in this, by writing large parts of the lexical analyser
automatically, based on a description supplied by the programmer
of the items to be recognised (which are known as tokens), as
patterns of the more basic elements of the input stream. The
LEX description is very much a special-purpose language for
writing lexical analysers, and LEX is then simply a translator
for this language. The LEX language is easy to write, and the
resulting processor is compact and fast running.

The purpose of a LEX program is to read an input stream, and
recognise tokens. As the lexical analyser will usually exist as
a subroutine in a larger set of programs, it will return a
"token number", which indicates which token was found, and
possibly a "token value", which provides more detailed
information about the token (eg, a copy of the token itself, or
an index into a symbol table). This need not be the only
possibility; a LEX program is often a good description of the
structure of the whole computation, and in such a case, the
lexical analyser might choose to call other routines to perform
the necessary actions whenever a particular token is recognised,
without reference to its own caller.


2.0 The Lex Language

LEX transforms a regular-expression grammar into a deterministic
finite-state automaton that recognizes that grammar. Each rule
of the grammar is associated with an action which is to be
performed when that rule successfully matches some part of the
input data.

Because of the nature of regular expression grammars, certain
language constructions cannot be recognized by LEX programs.
Specifically, expressions with arbitrarily nested balanced
parentheses cannot be recognized. This means that LEX cannot be
used to recognize all
Fortran keywords as some (DO, IF, and FORMAT, for example)
require elaborate recognition to distinguish between ambiguous
constructions.


2.1 Elementary Things

Strings, characters, sets of characters called character
classes, and operators to form these into patterns, are the
fundamental elements of the LEX language.

A string is a sequence of characters, not including newline.
Within a string, the following escape sequences (which are those
of the C language) allow any 8-bit character to be represented,
including the escape character, quotes, and newlines:

\n NL (012)
\r CR (015)
\b BS (010)
\t TAB (011)
\v VT (013)
\f FF (014)
\" "
\' '
\c c
\\ \
\NNN (NNN)

where NNN is a number in octal, and c is any printable
character. A string may be continued across a line by writing
the escape character "\" before the newline. The following
characters must be quoted with double quotes, apostrophes, or
the escape character:

() {} [] * | = % ^ $ . / ' " -

otherwise they are treated as special characters.

A sequence of characters enclosed by brackets ('[' and ']')
forms a character class, which stands for all those characters
within the brackets. If a circumflex ('^') follows the opening
bracket, then the class will instead stand for all characters
except those inside the brackets. There is one
predefined character class "." or [^\n] which matches any
character except newline. The escapes used in strings may be
used in character classes as well.

Within a character class, the construction "A-B" (where "A" and
"B" are arbitrary characters) stands for the range of characters
between "A" and "B" inclusive.

For example,

"abc" matches "abc"

"[ABC]" matches "A" or "B" or "C"

"[A-Za-z0-9]" matches all letters and digits


2.2 Putting Things Together

Several operators are provided to allow construction of a type
of pattern called a regular expression. Such expressions can be
implemented as finite-state automata (without memory or stacks).
A reference to an "occurrence" of a regular expression is
generally taken to mean an occurrence of any string matched by
that regular expression. The operators are presented in order
of decreasing priority. In all cases, operators work on either
strings or character classes, or on other regular expressions.

Any string or character class forms a regular expression which
matches whatever the string or character class stands for (as
described above).

The operator '*' applied following a regular expression forms a
new regular expression which matches an arbitrary number (ie,
zero or more) of adjacent occurrences of the first regular
expression. The operation is often referred to as (Kleene)
closure.

The operator '+' is similar to the '*' operator except that at
least one occurrence is necessary. The '?' operator specifies an
optional expression.
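
For example:

"[0-9]+" matches one or more digits.

"-?[0-9]+" matches a digit string with an optional
leading minus sign.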

The operation of concatenation of two regular expressions is
expressed simply by writing the regular expressions adjacent to
each other. The resulting regular expression matches any
occurrence of the first regular expression followed directly by
an occurrence of the second regular expression.

The operator '|', alternation, written between two regular
expressions forms a regular expression which matches an
occurrence of the first regular expression or an occurrence of
the second regular expression.

Any regular expression may be enclosed in parentheses to cause
the priority of operators to be overridden in the usual manner.

A few examples should help to make all of this clear:

"[0-9]*" matches a (possibly empty) sequence of
digits.

"[A-Za-z_$][A-Za-z0-9_$]*"
matches a C identifier.

"([A-Za-z_$]|[0-9])*"
matches a C identifier, or a sequence of
digits, or a sequence of letters and
digits intermixed, or nothing.

The special operators '^' and '$' match the beginning and end of
a line respectively. They must be the first or last character in
the expression.

"^hello$" matches the word "hello" on a line by
itself.


2.3 The General Form of Lex Programs

A LEX source input file consists of three sections: a section
containing auxiliary definitions, a section containing
translations, and a section containing programs.

Throughout a LEX program, spaces, tabs, and newlines may be used
freely, and PL/1-style comments:

/* ... anything but '*/' ... */

may be used, and are treated as a space.

The auxiliary definition section must be present, separated from
following sections by the two-character sequence '%%', but may
be empty. This section allows definition of named regular
expressions, which provide the useful ability to use names of
regular expressions in the translation section, in place of
common sub-expressions, or to make that section more readable.

The translation section follows the '%%' sequence, and contains
regular expressions paired with actions which describe what the
lexical analyser should do when it discovers an occurrence of a
given regular expression in its input stream.

The program section may be omitted; if it is present it must be
separated from the translation section by the '%%' sequence. If
present, it may contain anything in general, as it is simply
tacked on to the end of the LEX output file.

The style of this layout will be familiar to users of Yacc. As
LEX is often used with that processor, it seemed reasonable to
keep to a similar format.
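
For example, a complete (if trivial) LEX program using all three
sections might look like this; it prints each run of digits that
it finds:

digit [0-9]
%%
{digit}+ printf("number: %s\n", yytext);
%%
main()
{
        while (yylex())
                ;
}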


2.4 Auxiliary Definitions

Given the set of regular expressions forming a complete syntax,
there are often common sub-expressions. LEX allows these to be
named, defined but once, and referred to by name in any
subsequent regular expression. Note that definition must
precede use. A definition has the form:

expression_name regular_expression

where a name is composed of a letter followed by a sequence of
letters and digits, and where an underscore is
considered a letter. For example,

digit [0-9]
letter [a-zA-Z_]
name {letter}({letter}|{digit})*

The regular expression is terminated by a space, tab or newline.
A regular expression may be continued on the next line by
putting the escape character at the end of the line.

Three auxiliary definitions have special meaning to LEX: "break",
"illegal", and "ignore." They are all defined as character
classes ("break [,.?]", for example) and are used as follows:

break An input token will always terminate if a member
of the "break" class is scanned.

illegal The "illegal" class allows simplification of
error detection, as will be described in a later
section. If this class is defined, and the
lexical analyser stops at a character that
"cannot" occur in its present context, the
analyser will output a suitable error message
and ignore the offender.

ignore This class defines a set of characters that are
ignored by the analyser's input routine.

The special classes may not be used in expressions.
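
For example, a scanner for simple prose might skip blanks and
tabs, and end tokens at common punctuation, with:

ignore [ \t]
break [,.;:?]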

Auxiliary definitions are used in expressions by enclosing the
name in braces:

{name} printf("name: %s\n", yytext);


2.5 Translations

One would like to provide a description of the action to be taken
when a particular sequence of input characters has been matched
by a given regular expression. The kind of action taken might
vary considerably, depending upon the application. In a
compiler, typical actions are: enter an identifier into a symbol
table, read and store a string, or return a particular token to
the parser. In text processing, one might wish to reproduce
most of the input stream on an output stream unchanged, making
substitutions when a particular sequence of characters is found.
In general, it is hard to predict what form the action might
take, and so, in LEX the nature of the action is left to the
user, by allowing specification, for each regular expression of
interest, C-language code to be executed when a string matching
that expression is discovered by the driving program of the
lexical analyser. An action, together with its regular
expression, is called a translation, and has the format:

regular_expression { action }

All of this may be spread across several lines. The action may
be empty. The braces are optional if the action is a single C
statement. The action is separated from the regular expression
by spaces or tabs.

If more than one translation uses the same action, the action
only has to be written once. To signify that the same action is
to take place, a "|" character is used as the action. For
example, in:

{digit}+ |
{digit}+"."{digit}* |
{digit}+E{digit}+ printf("Number: %s\n", yytext);

the same action is performed for all three kinds of numbers.

There are three predefined actions implemented as macros:

REJECT causes the lexical analyser to ignore the current
token and start looking for the next. For example:

' '*\n REJECT;

ignores ends of lines and trailing blanks.

ECHO causes LEX to call the library function yyecho();

BEGIN sets the start condition

"/*" BEGIN COMMENT;

where COMMENT has been declared in a %start line
in the auxiliary section.

Earlier, it was argued that most general-purpose languages are
inappropriate for writing lexical analysers, and it is important
to see that the subsequent use of such a language to form the
actions is not a contradiction. Most languages are fairly good
at expressing the actions described above (symbol table
manipulation, writing character strings, and such). Leaving
this part of the lexical analyser to those languages therefore
not only makes sense, but also ensures that decisions by the
writer of the lexical analyser generator will not unduly cramp
the user's style. However, general-purpose languages do not as
a rule provide inexpensive pattern matching facilities, or input
description formats, appropriate for describing or structuring a
lexical analyser.

Allowing a user to provide his own code is not really enough, as
he will need some help from LEX to obtain access to the current
token. LEX provides variables and functions to access it and to
control the parsing process.


2.5.1 Numbers and Values

Typically, a lexical analyser will return a value to its caller
indicating which token has been found. Within an action, this
is done by writing a C return statement, which returns the
appropriate value:

BEGIN return(T_BEGIN);
{name} { lookup(yytext); return(T_NAME); }
"/" return('/');

Note that function lookup() is provided by the user program.

In many cases, other information must be supplied to its caller
by the scanner. When an identifier is recognised, for example,
both a pointer to a symbol-table entry, and the token number
T_NAME must be returned, yet the C return statement can return
but a single value. Yacc has a similar problem, and so its
lexical analyser sets an external word 'yylval' to the token
value, while the token number is returned by the scanner. LEX
uses the external 'yylval' (to be compatible).

{name} {
        yylval = lookup(yytext);
        return(T_NAME);
        }

Certain token numbers are treated specially. The first is -1,
which is the number returned by the REJECT action. The others
are defined as manifest constants by LEX. There is one such
token number defined at present:

ERROR   This is returned by the lexical analyser
        (function yylex()) when an unrecognizable input
        sequence has been detected. By default, ERROR
        is 256. This is the same as the YACC error value.

To summarise, the token number is set by the action with a return
statement, and the token value (ie, auxiliary information) is
set by assigning this value to the external integer 'yylval'.


2.6 Start conditions

A lexical analyser may have some rules that should be used under
specific conditions. Start conditions are provided for this
purpose. Start conditions are predeclared in the auxiliary
definitions section with a %start line declaring the names of
the start conditions. A translation may have a start condition
flag on it. The program:

%start COMMENT
%%
"/*" { BEGIN COMMENT; }
. REJECT;
\n ECHO;
.*"*/" { BEGIN 0; REJECT; }

will skip comments while keeping the line count unchanged.

Several start conditions may precede any translation regular
expression. The "^" beginning-of-line operator is a special
start condition that is true at the beginning of a file and
whenever a newline was the last character read. If it is used
together with other start conditions, it must come last,
immediately before the regular expression.

All of the start conditions will have their values #defined in
the table file at the end of the auxiliary definitions section.
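
The grammar in Appendix A shows the form of the flag: the
condition name is written between '<' and '>' in front of the
pattern. A sketch of a scanner that collects quoted text (the
save() routine stands for user code, and is not part of LEX):

%start QUOTE
%%
\" { BEGIN QUOTE; }
<QUOTE>\" { BEGIN 0; }
<QUOTE>. { save(yytext[0]); }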


2.7 Declaration Sections

Declarations in the language of the actions may be included in
both the auxiliary definition section and in the translation
section. In the former case, these declarations will be
external to the lexical analyser, and in the latter case, they
will be local to the lexical analyser (ie, static, or automatic
storage). Declaration sections consist of a sequence of
declarations surrounded by the special bracketing sequences '%{'
and '%}' (as in Yacc). The characters within these brackets are
copied unchanged into the appropriate spots in the lexical
analyser program that LEX writes. The examples in appendix A
suggest how these might be used.
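
For example (a sketch): a counter shared with the rest of the
program is declared in the auxiliary definition section, while
one private to the analyser is declared in the translation
section:

%{
int commentct;          /* external: visible to the whole program */
%}
%%
%{
static int depth;       /* local to the lexical analyser */
%}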


3.0 Using Lex from C

The present version of LEX is intended for use with C, and it
is this usage which will be described here.


3.1 The Function yylex()

The structure of LEX programs is influenced by what YACC
requires of its lexical analyser.

To begin with, the lexical analyser must be named 'yylex', must
have no parameters, and is expected to return a token number,
where that number is determined by YACC. The token number for
an ASCII character is its ASCII value (ie, its value as a C
character constant). Named tokens, defined in YACC '%token'
statements, have a number above 256, with the particular number
accessible through a YACC-produced #define of the given token
name as its number. YACC also allows 'yylex' to pass a value to
the YACC action routines, by assigning that value to the
variable 'yylval'.

LEX thus provides a lexical analyser function named 'yylex',
which interprets the tables constructed by the LEX program and
returns the token number set by the actions it performs.

A value of zero is returned by 'yylex' at end-of-file, and in
the absence of a return statement in an action, a non-zero value
is returned. If computation is performed entirely by the
lexical analyser, then a suitable main program would be

main()
{
        while (yylex())
                ;
}


3.2 Serial Re-Use of yylex()

The yylex() function contains several variables which are
statically initialized at compile time. Once yylex() sees an
EOF (-1) input character, it will continue to return zero. If
yylex() is to be used inside a loop which processes multiple
files, it must be re-initialized at the beginning of each new
file with a call to the LEX library routine yyinit(). For
example (slightly extending the previous example):

main()
{
        getfilelist();
        for (file = first; file != last; file = next) {
                yyinit();
                while (yylex())
                        ;
        }
        printf("All files done\n");
}

The call to yyinit() is unnecessary if yylex() is to process
only one file, or is kept from seeing an EOF input character.


3.3 The Lex Table File

In the absence of instructions to the contrary (see below), LEX
reads a given LEX language file and produces a C program file
'lexyy.c' which largely consists of tables which are then
interpreted by 'yylex()' (which is in the LEX library). The
actions supplied by the user in each translation are combined
with a switch statement into a single function, which is called
by the table interpreter when a particular token is found. The
contents of the program section of the LEX file are added at the
end of the output file. LEX also inserts the line

#include <stdio.h>

at the top of the file; this causes declarations required by
the standard I/O library and by LEX to be included when the C
program is compiled.


3.4 Analyzers Which Don't Use "Standard I/O"

LEX supports the generation of analyzers which may be
incorporated into programs which do not use the "standard I/O"
library. The predefined macros "input()" and "output(ch)" may
be #undefed and redefined to call your I/O routines.
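
For example, in a declaration section one might write the
following, where my_getc() and my_putc() stand for the
application's own I/O routines:

%{
#undef input
#undef output
#define input() my_getc()
#define output(ch) my_putc(ch)
%}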


3.5 Operating LEX

LEX normally reads the grammar from the standard input, writing
the C program to the file 'lexyy.c'. It may be further
controlled by using the following flags upon invocation:

-i filename The grammar is read from 'filename'.

-o filename The analyser is written to 'filename'.

-t tablename The default finite-state automaton is
named lextab. The -t switch causes
the internal tables to be named
'tablename'. This is necessary if the
processor-switching capabilities
described in a later section are to
be used. The resulting file will not
contain the input/output macros and
procedures nor the yylval variable.

-v [filename] Internal state information is written
to 'filename.' If not present, state
information is written to file 'lex.out.'

-d Enable various debugging printouts.

-a Ignore non-ASCII characters in input.

-m Enable state minimization. This will
create a smaller table but will take
longer to process.
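
For example, to read the grammar from 'scan.l', minimize the
states, and write the analyser to 'scan.c' (the file names are
arbitrary):

lex -i scan.l -m -o scan.c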


The command line for compilation of the table file should
contain no surprises:

cc lexyy.c;

but when one is producing the running program, one must be
careful to include the necessary libraries.

link userprog+lexyy,,,libl;

If Yacc is used as well, the library 'liby' should be included.


4.0 The Lex Library

All programs using grammars generated by LEX must be linked
together with the LEX library. It contains routines which are
either essential or merely useful to users of LEX. The
essential routines include a routine to obtain a copy of the
current token, and a routine to switch to a different set of
scanning tables. Routines of the second, useful, class perform
functions which might well be written by the user himself, but
are there to save him the bother; they include a routine to
process various forms of comments and a routine to convert
numbers written in arbitrary bases. Both sets of routines are
expected to grow as LEX sees use.

Those functions which produce diagnostics do so by calling
yyerror(), which is called as

yyerror(string, arg)

and is expected to write its argument followed by a newline, on
some output stream. A yyerror() function is included in the LEX
library, but a user is free to include his own. The routine in
the LEX library is specific to standard I/O; it writes to stderr.
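
A minimal replacement, equivalent in effect to the library
version, might look like this sketch:

yyerror(string, arg)
char *string;
{
        fprintf(stderr, string, arg);
        putc('\n', stderr);
}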

4.0.1 yynext -- steal character

yynext()

yynext() returns the next character from the LEX input stream.
(This means that LEX will no longer see it.) LEX uses a
look-ahead buffer to handle complex languages, and this function
takes this into account.


4.0.2 yyback -- put character back on input

yyback(ch)
char ch;

yyback restores a character to the LEX input stream so that it
will be the next character read.


4.0.3 yypeek -- examine character

yypeek()

yypeek() performs a function similar to that of yynext(), but
does not have the side-effect of removing the character from
LEX's view.


4.0.4 yylook -- how many characters are on the lookahead queue

yylook()

yylook() returns the number of characters that have been read
from the input and stored in the look-ahead queue.


4.0.5 yyless -- cut a token back in size

yyless(n)
int n;

yyless() shortens the current token to at most n characters.
If n is larger than the current token length, or negative, no
action is taken.


4.0.6 yymore -- keep token and add next on to it

yymore()

yymore() sets an internal flag that stops LEX from clearing the
token buffer for the next token. The next token gets appended
to the old one.
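
A sketch of the traditional use: collecting a quoted string
whose closing quote may be escaped. Here T_STRING is an
arbitrary token name, and input() is the input macro described
in section 3.4:

\"[^"]* {
        if (yytext[yyleng-1] == '\\')
                yymore();       /* escaped quote: keep collecting */
        else {
                input();        /* discard the closing quote */
                return(T_STRING);
        }
        }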


4.0.7 yyplus -- add character to token

yyplus(c)
char c;

yyplus() adds the character to the end of yytext and increments yyleng.


4.0.8 yyecho -- write token to a file

yyecho();

yyecho() may be called by a LEX action routine to write the
current token to the output. The predefined action ECHO calls
this routine. This routine uses the output() macro, so an
application that does not use standard I/O must redefine that
macro.


4.0.9 yymapc -- Handle C escapes within strings

int yymapc(delim, esc)
char delim;
char esc;

yymapc() is a function which handles C "escape" sequences such
as "\n" and "\nnn". It will scan off the entire escape sequence
and return the equivalent ASCII code as an integer. It is meant
for use with YACC while scanning quoted strings and character
constants.

If it encounters EOF while scanning, it calls yyerror() to print
an error message warning of "Unterminated string". If a normal
character is read, it returns the ASCII value. If "delim"
(usually " or ') is read, it returns EOF. If a newline (ASCII
linefeed) is read, it increments the global "yyline" and calls
itself recursively for the next line of input.

It uses the yynext() and yyback() functions to read and unread
characters.
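
For example, an action for quoted strings might read as follows
(a sketch; T_STRING is an arbitrary token name and save() stands
for a user routine that stores each character):

\" {
        int c;

        while ((c = yymapc('"', '\\')) != EOF)
                save(c);
        return(T_STRING);
        }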


4.0.10 yyswitch -- switch scanning tables

struct lextab *
yyswitch(newtb)
struct lextab *newtb;

yyswitch() is called to cause LEX to use a different scanning
table; it returns a pointer to the one previously in use. This
facility is useful if certain objects of the language (eg,
strings in C) have a fairly complicated structure of their own
which cannot be handled within the translation section of the
LEX description of the larger language. If newtb is 0 then the
default table lextab is used.


4.0.11 yyinit -- reset lexical scanner to start state

yyinit()

yyinit() resets LEX to be ready to start on a new file. It
clears the token and the look-ahead buffer, and switches to the
default table.


4.1 LEX Variables

Several variables are available to the programmer:

int yyline      the current input line number
int yyleng      the length of the current token
char yytext[]   the token string text
YYSTYPE yylval  the auxiliary value of the token
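
For example, an action may use these directly in a diagnostic
(a sketch; T_NAME is an arbitrary token name):

{name} {
        if (yyleng > 31)
                printf("line %d: %s too long\n", yyline, yytext);
        return(T_NAME);
        }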


5.0 Error Detection and Recovery

If a character is detected in the input stream which cannot be
added to the last-matched string, and which cannot start a
string, then that character is considered illegal by LEX. LEX
may be instructed to return a special 'error' token, or to write
a diagnostic with yyerror(). At present, the former is the
default action.

The token ERROR is a special value which is recognised by YACC,
and causes it to start its own error recovery.

Often, it makes more sense to simply type a suitable diagnostic,
and continue by ignoring the offending character. It is fairly
easy to cause LEX to do this, by including the auxiliary
definition:

illegal [\0-\377]

which defines a character class "illegal" which is handled
specially by LEX. If the character that is causing the trouble
is a member of that character class (and in the example, all
characters are), then LEX will write a diagnostic, and ignore
it; otherwise, it will return the special token ERROR.

More comprehensive techniques may be added as they become
apparent.


6.0 Ambiguity and Look-ahead

Many computer languages have ambiguous grammars in that an input
token may represent more than one logical entity. This section
discusses the way in which grammars built by LEX resolve
ambiguous input, as well as a way for the grammar to assign
unique meaning to a token by looking ahead in the input stream.


6.1 Resolving Ambiguities

A LEX program may be ambiguous, in the sense that a particular
input string or strings might be matched by the regular
expression of more than one translation. Consider,

[a-z] { putchar(yytext[0]); }
aaa* { printf("abc"); }

in which the string 'aa' is matched by both regular expressions
(twice by the first, and once by the second). Also, the string
'aaaaaa' may be matched in many different ways. LEX has to
decide somehow which actions should be performed.
(Alternatively, it could produce a diagnostic, and give up. As
it happens, LEX never does this.)

Consider a second example,

letter [a-z]
%%
a{letter}* { return(1); }
ab{letter}* { return(2); }

which attempts to distinguish sequences of letters that begin
with 'a' from similar sequences that begin with 'ab'. These two
examples illustrate two different kinds of ambiguity, and the
following indicates how LEX resolves these.

In the first example, it seems likely that the intent was to
have both 'aa' and 'aaaaaa' perform the second action, while all
single letters 'a' cause the first action to be performed. LEX
does this by ensuring that the longest possible part of the
input stream will be used to determine the match. Thus,

< { return(LESS); }
<= { return(LESSEQ); }

or

{digit}+ { return(NUMBER); }
{letter}({letter}|{digit})* { return(NAME); }

would work as one might expect.

In the second example, the longest-string rule does not help: on
the string "abb9", either action could apply, and so another rule
must be followed. This states that if, after the longest-string
rule has been applied, there remains an ambiguity, then the
action which appears first in the LEX program file is to be
performed. As the second example is written, the second action
will never be performed. It should have been written as:

letter [a-z]
%%
ab{letter}* { return(1); }
a{letter}* { return(2); }

The two rules together then determine a unique action for every
such string.

At present, LEX produces no diagnostic in either case; it
merely applies the rules and proceeds. In the case where
priority is given to the first-appearing rule, it might be a
good idea to produce a diagnostic.


6.2 Look-ahead

Some facility for looking ahead in the input stream is sometimes
required. (This facility might also be regarded as a way for
the programmer to more closely control LEX's ambiguity
resolution process.) For example, in C, a name followed by "("
is to be contextually declared as an external function if it is
otherwise undefined.

In Pascal, look-ahead is required to determine that

123..1234

is an integer 123, followed by the subrange symbol "..",
followed by the integer 1234, and not simply two real numbers
run together.

In both of these cases, the desire is to look ahead in the input
stream far enough to be able to make a decision, but without
losing tokens in the process.

A special form of regular expression is used to indicate
look-ahead:

re1 '/' re2 '{' action '}'

where 're1' and 're2' are regular expressions. The slash is
treated as concatenation for the purposes of matching incoming
characters; thus both 're1' and 're2' must match adjacently for
the action to be performed. 'Re1' indicates that part of the
input string which is the token to be returned, while 're2'
indicates the context. The characters matched by 're2' will be
re-read at the next call to yylex(), and broken into tokens.

Note that you cannot write:

name re1 '/' re2

The look-ahead operator must be part of the rule. It is not
valid in definitions.

The "$" end-of-line character is implemented as a look-ahead for
a newline.

^".PP"$ is equivalent to ^".PP"/"\n"

In the first example, the look-ahead operator would be used as:

{name}/\( {
        if (name undefined)
                declare name a global function;
        }
{name} {
        /* usual processing for identifiers */
        }

In the second example, the range construction would be parsed as
follows:

digit [0-9]
int {digit}+
%%
{int}/".."{int} { /* Start of a range */ }
".."{int} { /* End of a range */ }


Note that right-context is not sufficient to handle certain
types of ambiguity, as is found in several places in the Fortran
language. For example,

do i = 1 Is an assignment statement
do i = 1, 4 Is a DO statement

It is not sufficient to use right-context scanning to look for
the comma, as it may occur within a parenthesized
sub-expression:

do i = j(k,l) Is an assignment statement

In Fortran, similar problems exist for IF and FORMAT statements,
as well as counted (Hollerith) string constants. All of these
require a more powerful grammar than is possible with LEX
regular-expressions.


7.0 Multiple Scanning Tables; Processor Switching

Even a fairly simple syntax may be difficult, or impossible, to
describe and process with a single set of translations. An
example of this may be found in C, where strings, which are part
of the language, have quite a different structure, and in order
to process them, either a function must be called which reads
and parses the input stream for itself, or some mechanism within
LEX must be invoked to cause a (usually massive) change of
state.

LEX does provide such a facility, which is known, after AED, as
'processor switching'. yylex() locates its tables through a
pointer; if one simply changes the pointer to point at a new
set of tables, one will have effected the required change of
state. The LEX library function yyswitch(), which is described
elsewhere in this guide, arranges to do this; it also returns
the old value of the pointer so that it may be restored by a
later call to yyswitch. Thus, scanning environments may be
stacked, or not, as the user requires.


7.1 Creation of a Processor

It should be clear that if all the tables produced by LEX from a
user's translation file have the same name, someone (the loader)
is bound to object. Some method must be provided to change the
name of the table.

This is done by an option flag to the LEX command:

-t name

will cause the scanning table to be declared as

struct lextab name;

so that it may be passed to yyswitch:

yyswitch(&name);

Files generated with the use of this switch will not have the
predefined variables and procedures in the default table. At
least one of the tables must have the default name lextab.
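
As a sketch, suppose strings are scanned by a second grammar
whose table was built with '-t strtab'; the two tables hand
control back and forth on the quote character (the table names
and the save() routine are arbitrary):

/* In the main grammar (default table 'lextab'): */
%{
extern struct lextab strtab;
struct lextab *maintab;
%}
%%
\" { maintab = yyswitch(&strtab); }

/* In the string grammar, built with -t strtab: */
%{
extern struct lextab *maintab;
%}
%%
\" { yyswitch(maintab); }
. { save(yytext[0]); }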


8.0 Conclusion

LEX seems to handle most lexical analysis tasks easily. Indeed,
LEX may be more generally used to write commands of a text
processing nature; an example of such usage may be found in an
appendix. LEX programs are far easier to write than the
equivalent C programs, and generally consume less space
(although there is an initial overhead for the more general
table-interpreter program). The encoding suggested in [4]
achieves a reasonable compromise between table size, and
scanning speed. Certainly lexical analysers are less tedious
and time-consuming to write.

It is expected that most change in the future will be through
additions to the LEX library. The LEX language may change
slightly to accommodate common kinds of processing (eg, break
characters), or to extend its range of application. Neither
kind of change should affect existing LEX programs.

LEX produces tables and programs for the C language. The tables
are in a very simple (and stylised) format, and when LEX copies
the action routines or the program section, the code might as
well be Fortran for all it cares. One could write Unix filters
to translate the very simple C format tables into other
languages, allowing LEX to be used with a larger number of
languages, with little extra development cost. This seems a
likely future addition.

Because of the look-ahead necessary to implement the "longest
string match" rule, LEX is unsuitable for interactive programs
whose overall structure is:

for (;;) {
        prompt_user();
        get_input();
        process();
        print_output();
}

If these are rewritten as LEX-generated grammars, the user will
be confused by the fact that the second input datum must be
entered before the first is processed. It is possible to solve this
dilemma by rewriting the input function to return an
"end-of-line" character until processing is complete for that
line. An example is shown in the Appendix.


9.0 Acknowledgements

LEX is based on a processor of the same name at Bell
Laboratories, which also runs under Unix [3], and, more
distantly, on AED-0 [1]. This version of LEX was based on the
description and suggestions of [4].


10.0 References

1. Johnson, W.L., et al., "Automatic generation of
efficient lexical analysers using finite state
techniques", CACM Vol. 11, No. 12, pp. 805-813, 1968.

2. Johnson, S.C., "Yacc -- Yet Another Compiler-Compiler",
CSTR-32, Bell Telephone Laboratories, Murray Hill, New
Jersey, 1974.

3. Lesk, M.E., "Lex - a lexical analyser generator",
CSTR-39, Bell Telephone Laboratories, Murray Hill, New
Jersey, 1975.

4. Aho, A.V., Ullman, J.D., Principles of Compiler Design,
Addison-Wesley, Don Mills, Ontario, 1977.




APPENDIX A

LEX SOURCE GRAMMAR



The following is a grammar of LEX programs which generally
follows Backus-Naur conventions. In the rules, "||" stands for
alternation (choose one or the other). Other graphic text
stands for itself. Several grammar elements have special
meaning:

<anything>      Any text not including the following
                grammar element (either a literal or
                end-of-file).

<nothing>       Nothing -- used for optional rule
                elements.

<statement>     A C statement terminated by newline.

<block>         A compound statement.

<name>          A variable name.

<char_class>    A character class specifier.

<string>        A string (text).

<white_space>   A number of tabs or spaces.

<EOF>           The end of the input file.


This grammar was abstracted from the Yacc grammar used to
describe LEX.

program ::= aux_section trans_section

aux_section ::= auxiliaries %%
             || %%

auxiliaries ::= auxiliaries aux_def
             || aux_def

aux_def ::= name_def reg_exp
         || %{ <anything> %}
         || %start start_names

start_names ::= start_names start_name
             || start_name

start_name ::= <name>

name_def ::= <name>

reg_exp ::= <string>
         || <char_class>
         || { <name> }
         || reg_exp *
         || reg_exp +
         || reg_exp ?
         || reg_exp | reg_exp
         || reg_exp reg_exp
         || ( reg_exp )

trans_section ::= translations
               || <nothing>

translations ::= translations translation
              || translation

translation ::= start begin pattern end action
             || %{ <anything> %}
             || %% <anything> <EOF>

start ::= conditions
       || <nothing>

conditions ::= conditions condition
            || condition

condition ::= < start_name >

begin ::= ^
       || <nothing>

pattern ::= reg_exp / reg_exp
         || reg_exp

end ::= $
     || <nothing>

action ::= |
        || <statement>
        || { <block> }



APPENDIX B

SOME SMALL EXAMPLES


The following example illustrates the use of the look-ahead
operator, and various other nuances of using LEX.


B.1 A Complete Command

The C programming language has had two different ways of writing
its assignment operators. The original method was to write a
binary operator immediately following the ordinary assignment
operator, forming a compound operator. Thus 'a =+ b' caused the
value of 'a+b' to be assigned to 'a'. Similarly,

=- =/ =% =* =<< =>> =| =& =^

were written for the assignment operators corresponding to
subtraction, division, modulus, multiplication, left shift,
right shift, logical OR, logical AND, and exclusive OR. In the
current version of the language, the binary operator is written
to the left of the assignment operator, to remove potential
ambiguity.

The LEX program "ctoc" is a filter which converts programs
written in the older style into programs written in the newer
style. It uses the look-ahead operator, and the various
dis-ambiguating rules to ensure that sequences like

a==-1 a=++b

remain unchanged.

/*
 * ctoc.l -- Convert old C operators to new C form
 *
 * Adapted from the example in C. Forsyth's LEX manual.
 *
 * NOTE:
 * Forsyth's program put an entire comment into the token
 * buffer. Either define a huge token buffer for my typical
 * monster comments, or filter text within comments as if
 * it were real C code. This is what I did. So =+ inside
 * a comment will get changed to +=, etc. Note that you
 * may use the comment() function in LEXLIB if you want the
 * comments eaten. I wanted 'em in the output.
 * by
 * Bob Denny
 * 31-Feb-81
 */
%{
extern int yylex(void);

main()
{
        while (yylex())
                ;
}
%}
any [\0-\177]
nesc [^\\]
nescquote [^\\"]
nescapost [^\\']
schar "\\"{any}|{nescquote}
cchar "\\"{any}|{nescapost}
string '"'{schar}*'"'
charcon "'"{cchar}*"'"
%%
=("<<"|">>"|"*"|"+"|"-"|"/"|"%"|"&"|"|"|"^") printf("%s=",yytext+1);

[<=>!]=|=[<>] ECHO;

=/("++"|"--") ECHO;

{charcon} ECHO;

{string} ECHO;

.|\n ECHO;

/*
* The following will overflow the token buffer on any but a
* small comment:
*/

"/*"([^*]|"*"[^/])*"*/" ECHO;


B.2 Interactive Lexical Analysis

The following program reads words from the terminal, counting
each as they are entered. The interaction with the operator is
"natural" in the sense that processing for one line is complete
before the next line is input. To implement this program, it
was necessary to include a special version of input() which
returns newline if the current line has been completely
transmitted to the parser. Because the parser must still have
some look-ahead context, it will return the "end-of-line" token
twice at the beginning of processing. This required some
additional tests in the main program.

/*
* Count words -- interactively
*/
white [\t ] /* End of a word */
eol [\n] /* End of input line */
any [!-~] /* All printing char's */
illegal [\0-\377] /* Skip over junk */
%{
#undef input
#define T_EOL 1

char line[133] = "";
char *linep = line;
int is_eof = 0;
int wordct = 0;

main()
{
        register int i;

        while ((i = yylex()) != 0) {
                /*
                 * If the "end-of-line" token is
                 * returned AND we're really at
                 * the end of a line, read the
                 * next line. Note that T_EOL is
                 * returned twice when the program
                 * starts because of the nature of
                 * the look-ahead algorithms.
                 */
                if (i == T_EOL && !is_eof && *linep == 0) {
                        fprintf(stdout, "* ");
                        fflush(stdout);
                        getline();
                }
        }
        printf("%d words\n", wordct);
}
%}
%%

{any}{any}* {
        /*
         * Write each word on a
         * separate output line.
         */
        ECHO;
        output('\n');
        wordct++;
        REJECT;
        }

{eol} return(T_EOL);

{white}{white}* REJECT;

%%
getline()
/*
 * Read a line for input()
 */
{
        is_eof = (fgets(line, sizeof line, stdin) == NULL);
        linep = line;
}

input()
/*
 * Homemade input -- return newline while at the
 * end of an input line or EOF at end of file. If
 * more on this line, return the next byte.
 */
{
        return(is_eof ? EOF : *linep == 0 ? '\n' : *linep++);
}


