Category : Modula II Source Code
Archive   : PATTERN.ZIP
Filename : PATTERN.TXT

 
Output of file : PATTERN.TXT contained in archive : PATTERN.ZIP










The Pattern System





Introduction


The Pattern System is the result of my investigations into lexical
analysis. The major source of information was Chapter 3 ("Lexical
Analyses") of Compilers: Principles, Techniques and Tools by Aho,
Sethi, and Ullman (Addison-Wesley, 1986). In addition, early in the
implementation of the various programs, Chapters 20 ("Pattern Match-
ing") and 21 ("Parsing") of Robert Sedgewick's book: Algorithms
(Addison-Wesley, 1983), were of great value, as was the book Data
Structures and Algorithms by Aho, Hopcroft, and Ullman (Addison-Wes-
ley, 1983).

There are three major sections to this paper. The first will de-
scribe the use of the four translation programs (LEX2NFA, NFA2DFA,
NFA2TXT, and DFA2TXT) to produce the output required to build a lexi-
cal analyzer. The second section describes use of the separately com-
piled modules (FAInput, DFAPattern, and NFAPattern) in programs. Fi-
nally, the third section discusses implementation details of the sys-
tem, should one want to convert the system to another programming lan-
guage.

Some general knowledge of lexical analyses is assumed. Those who
find the present description inadequate are referred to the introduc-
tion and sections one, three and five in Chapter 3 of Compilers: Prin-
ciples, Techniques, and Tools. It should also be pointed out that the
program LEX2NFA is not as powerful as its UNIX counterpart Lex. I
feel, however, that by moving more of the analysis to the run-time
program, it becomes more flexible.


Translation Programs


There are four programs which may be used to create a pattern specifi-
cation. The primary program is LEX2NFA, which will translate a human-
readable lexical specification into a machine-readable file containing
the description of a non-deterministic finite automaton (NFA). The
other programs, which transform NFAs and DFAs (a DFA is a determinis-
tic finite automaton) into various forms, require little human atten-
tion and will be described after LEX2NFA.













LEX2NFA
The program LEX2NFA will translate a lexical specification into an
NFA. While the full grammar of a lexical specification is given at
the end of this paper, it is my intention to given here a (perhaps)
more readable description for those who are unfamiliar with reading
grammars.

A lexical specification consists of three sections, each termi-
nated by a percent sign (%). The sections, in order, are "declaration
of manifest constants," "regular definition of certain common expres-
sions," and "association of manifest constants with regular expres-
sions."


Manifest Constants
The first section, declaration of manifest constants, lists the names
of the patterns which may be recognized. The value returned by Read-
Pattern will correspond to these declarations (starting with zero and
proceeding sequentially). If the implementation language is Pascal or
Modula-2 an enumerated type with elements declared in the same order
will also correspond.


Regular Definitions
The second section consists of zero or more regular definitions. A
regular definition can be used to associate a name with a common ex-
pression, or to add clarity to other expressions by the use of
mnemonic names. A definition consists of an identifier, the equals
sign (=), and an expression (which will be described shortly). It is
terminated by a period (.). While one may use previously stated defi-
nitions, a definition may not refer to itself, or to a definition not
yet stated.


Associations
The final section associates the manifest constants from the first
section with regular expressions which they should match. The associ-
ations have the form of an identifier (which must come from the list
of manifest constants), followed by the greater than symbol (>, which
is supposed to symbolize association) and an expression terminated by
a period. The associations should be made in order of priority. Ex-
ample: in the programming language Modula-2 an identifier is a se-
quence of letters and digits, starting with a letter. A reserved
word, such as "MODULE", also matches this pattern. To make sure the
reserved word is recognized instead of the identifier, make sure the
reserved word occurs before the identifier in the list of the associa-
tions.


Regular Expressions
Regular expressions form the basis of describing patterns to the Pat-
tern System. While regular expressions look substantially different
from arithmetical expressions, they follow many of the same rules
(although with different operators). For those who find the following
discussion difficult to understand, I would recommend trying to follow
the formal grammar at the end of this paper at the same time as read-
ing the following (rather informal) description.


Pattern - 2






An expression is composed of one or more factors combined by var-
ious operators. Factors have the highest priority in regular expres-
sions, and are therefore evaluated first. There are three atomic fac-
tors (corresponding to, say, numbers in an arithmetical expression),
and two special factors (to be described shortly). The atomic factors
include: 1) the literal string, 2) the literal character, and 3) the
character class. A literal string is any sequence of printing ASCII
characters, surrounded by double quotes ("). The literal string will
match precisely that which is between the double quotes (e.g. the lit-
eral string "FOR" would match the letters FOR in the input). The lit-
eral character is designed to circumvent the difficulties of using
certain characters in the lexical specification. A literal character
may take one of two forms. The first is provided mostly to ease the
matching of the double quote, since it cannot be included in a literal
string, but may be used with any ASCII printing character. This form
is specified by preceding the character to be matched by a back-slash
(\). The other form of literal character is for matching non-printing
ASCII characters, although it also may be used for any single charac-
ter. This form is represented by an octal number giving the ASCII
code for the desired character, followed by the letter C.

The last atomic factor is the character class. A character class
is really a short-hand way of saying any of the characters specified
here. Although not really essential for regular expressions, since
their functions could be specified in other manners, they are included
because of their convenience. A character class is represented by one
or more elements (separated by commas) enclosed by brackets. Each el-
ement consists of either a single character (which may be specified as
a literal character, or a literal string of length one), or, to indi-
cate a range of characters, two single characters separated by two
dots (..). If specifying a range, the second character must be higher
(in ASCII order) than the first. Example: all of the upper-case let-
ters may be represented as ["A".."Z"], while all of the vowels may be
represented as ["A","E","I","O","U","a","e","i","o","u"] (remembering
that the case is also significant).

The two "special" factors are 1) definitions from the definition
section of the lexical specification, and 2) parenthetical expres-
sions. A definition may be used by placing its name where one would
place the expression it represents. Definitions are handy for improv-
ing the clarity of the specification and for reducing the amount of
typing necessary. The last type of factor, the parenthetical expres-
sion, merely means that parentheses may be used for grouping.

Factors are then grouped into terms. A term consists a one or
more factors, with optional post-fix operators, juxtaposed in the ex-
pression. The juxtaposition indicates concatenation in the pattern
being matched. Example: \" ["A".."Z","a".."z"] \" indicates all sets
of letters surrounded by double quotes. There are three possible
post-fix operators. Kleene closure, represented by an asterisk (*),
indicates that the previous factor may be repeated zero or more times.
Positive closure, represented by an plus sign (+), indicates that the
previous factor may be repeated one or more times. Lastly, a post-fix






Pattern - 3






question mark (?) indicates that the previous factor is optional (may
occur zero or one times.). Example: assume the definition

digit = ["0".."9"].

the expression for numbers in Pascal would then be

number > digit+ ("." digit+)? ("E" ["+","-"]? digit+)?.

The final thing to say about expressions is that terms may be
separated by the vertical bar (|) to indicate choice. Examples: In
Modula-2, identifiers are one or more letters and digits, starting
with a letter; and the reserved NOT may be represented by the lexemes
"NOT" or "~". If we assume the definitions

letter = ["A".."Z","a".."z"].

digit = ["0".."9"].

expressions for identifiers and NOT would be

Not > "NOT" | "~".

Identifier > letter (letter | digit)*.

(Notice that the pattern for NOT is given first, to indicate that it
should be matched before an identifier with the same lexeme.)


Error detection
Any error in the lexical specification will be reported by the pro-
gram, and will result in no NFA file being written (although the pro-
gram will check as much of the rest of the specification as it can).
The error messages should be quite self-explanatory, so I won't bother
to describe them here. After encountering an error the program can
generally resume checking for other errors quite quickly. However,
certain lexical errors (such as a string terminator not on the same
line as it begins) might cause it to skip a good deal of the specifi-
cation, which means it might miss other errors. (This is merely in-
convenient, as they will appear after one has corrected the lexical
mistake, and require yet more corrections.)


Other Programs
There are three other programs which can be used to convert the NFA
produced by LEX2NFA into a form suitable for use. If the lexical
specification is not yet stable, it should be left in the form of a
file based NFA. This provides the ability to change the specification
without having to re-compile the program which uses it. Although NFAs
are easy to produce, they tend to take quite a while to process the
input. Once the specification has become stable, it may be converted
into a DFA with the program NFA2DFA. This process takes quite a while
(on an 8088 based machine it took an hour to convert the NFA for Mod-
ula-2 into a DFA), so try to be relatively sure the specification
won't change again. If run-time speed is of any importance at all,
the long translation time will be worthwhile. (In a test case, an NFA
took about 220 seconds to process a file, while the DFA took about 5


Pattern - 4






seconds.) Once the program using the pattern matcher is also stable,
it may be advantageous to move the initialization of the finite au-
tomaton into the program itself (instead of having to look for the
initialization file). The programs NFA2TXT and DFA2TXT will mechani-
cally produce a procedure which correctly initializes one of the pat-
tern matchers. Some code will have to be added to their output, how-
ever, since I didn't want to make any assumptions about the way it
would be used. If the output is quite small (for instance, the DFA
for LEX2NFA is about 130 lines long), one might add it to the main
program, while if it is large, it will be necessary to give it a sepa-
rately compiled module of its own (the DFA for Modula-2 was about 1700
lines long, and compiled to about 32 kilobytes).

Any of these translation programs may be used by starting the
program, and giving it the name of the file one wishes it to process.
No further human intervention should be required. Both NFA2TXT and
DFA2TXT are reasonably speedy and produce no output on the screen.
Since NFA2DFA does take quite a while, it will display the transitions
it is writing to the disk. There is no real reason for this, other
than that experience has taught us that watching a computer doing
something unintelligible is less alarming than watching a computer ap-
parently doing nothing.


Pattern Matchers


The two separately compiled modules NFAPattern and DFAPattern are in-
tended to be rather low-level lexical analyzers. They can both be
used in precisely the same fashion, and are for the most part inter-
changeable. I call them low-level because they make no assumptions
about their run-time environment, and do no real error checking. To
initialize one of the pattern matchers, one may either call a mechani-
cally created procedure, or call the appropriate read procedure
(ReadNFAFromDisk or ReadDFAFromDisk) with an initialize file repre-
senting the finite automaton. Before any further processing can take
place, an input file must be specified in the module FAInput.

Once initialized, patterns may be matched by calling the proce-
dure ReadPattern. This procedure will return the number of the mani-
fest constant in the lexical specification, or the result of
MAX (CARDINAL) (currently 0FFFFH) to indicate that it could not match
any pattern. The lexeme actually matched will be found in the vari-
able lastLexeme.

Since it is sometimes necessary to directly manipulate the input
buffer, FAInput provides a few simple operators. The procedure
NextChar returns the character at the current input position, and ad-
vances the position. The current position may be determined and set
by the procedures ReturnPosition and SetPosition. The procedure Back-
TrackPosition provides an easy way to move the given position back one
character, but it will not affect the current position. These opera-
tions should be applied with great care in order to avoid confusing
the pattern matcher.

Because these are really pattern matchers, as opposed to lexical
analyzers, care should be taken in designing the lexical specifica-


Pattern - 5






tion. For example, in the programming language Modula-2, strings may
take the form of characters (except a double quote) enclosed by double
quotes, with the requirement that the string terminate on the same
line as it begins. The specification for this might be (assume that
"eol" has been previously defined as a line terminator):

string > \" [" ".."!","#".."~"]* (\" | eol).

Notice first that the character class avoids including the double
quote, to remove any ambiguity about the termination. Also notice
that it is the responsibility of the calling program to determine what
the terminating character was. When examining the variable lastLexeme
remember that it will contain precisely the lexeme matched, so in this
case, it will start with a double quote and end with another double
quote or the characters making up the end of line.

Another caveat, comments. To quote Niklaus Wirth (from page 16
of Programming in Modula-2, Springer-Verlag, 1985) "They [comments]
are arbitrary sequences of characters enclosed in the comment brackets
(* and *)." According to this definition, the following specification
would be sufficient to lexically recognize comments:

comment > "(*" [0C..177C] "*)".

In fact, this specification would become confused by the nesting of
comments.1 (An easy example is using a comment to temporarily hide
code which already contains comments.) Although I have no direct ex-
perience, I suspect that this problem can occur in any programming
language where comments may extend over many lines. In order to avoid
this problem, don't try to make the pattern matcher recognize the
whole comment at once. Instead, make the left comment bracket and the
right comment bracket each be distinct patterns. Then have the call-
ing program keep track of the level of nesting and not return a token
until it's not in a comment. (One of the reasons for making the pat-
tern matchers simple was to allow flexibility in dealing with situa-
tions like this.)


Implementation Notes


It is not my purpose here to explain the whole theory behind lexical
analysis, but, instead, to describe the choice of algorithms used in
the Pattern System, and to note any implementation anomalies. Those
interested in more detailed descriptions of theory are referred to
chapter 3 of Compilers: Principles, Techniques, and Tools.


LEX2NFA
The LEX2NFA program is a simple recursive descent compiler based on my
own grammar (derived from a description of the UNIX Lex system, and
some odd sense of theoretical purity on my part); the formal grammar
____________________

1 Lest anyone think otherwise, I must point out that in the
formal definition of Modula-2, Prof. Wirth does in fact say that
comments may be nested.


Pattern - 6






it accepts is given at the end of this paper. The lexical analyzer it
uses is a DFA based pattern matching system constructed by various
programs in this system. It does, however, do further processing on
the input to weed out certain lexical errors. The program uses Thomp-
son's construction to parse each regular expression into a NFA. Regu-
lar expressions from the definition section are stored in memory for
future use, while those from the association section are output to a
disk file without intermediate storage. (This assumes that there are
no errors. If any errors do occur, only error checking will con-
tinue.)

An NFA file will consist of zero or more NFA records, followed by
an "end marker" (as defined by NFAPattern). Each NFA record is of
variable but pre-determined length and consists of two bytes which
specify the number of the pattern which the expression recognizes.
(All numeric values will be unsigned. While there shouldn't be much
trouble about reading them into signed variables, it is conceivable
that they could become large enough to cause problems.) Two bytes
specify the offset of the start state; and two bytes indicate the num-
ber of state records which are to follow. A state record consists of
four bytes indicating the offsets of the two possible transitions, and
one byte indicating the transition character. (The format for the
state record takes advantage of the simple nature of NFAs produced by
Thompson's construction and are probably of little value for more gen-
eral NFAs.) The values for certain special conditions are given in
the definition of NFAPattern and need not be repeated here. Two
things must still be pointed out. 1) In the file, each expression is
considered autonomous (with its first state numbered zero, and its
start state given in the NFA record), so it may need to be transposed
when storing it in main memory. 2) There is an assumed start state
for the whole file, which has a transition on the empty string to each
of the individual start states. (This value needs to be computed by
the run-time pattern matcher, after it has all of the expressions.)
Details of run-time operations may be observed in NFAPattern and
NFA2DFA.


NFA2DFA
This program uses subset construction to convert an NFA into a DFA.
The program is divided into three local modules, one of which handles
operations on NFAs, another handles operations on DFAs, and the last
hides the details of an abstract data type for dealing with sets of
NFA states (which is used by both of the other two modules).

The local module NFAHandler is a straightforward NFA simulator,
with the exception that when computing the "e-closure," only important
NFA states are returned. This has the effect of reducing the number
of DFA states produced and speeding the computation of transitions on
a given character. The local module DFAHandler concerns itself with
the mapping of sets of NFA states to DFA states and with the output of
the DFA file. The mapping mechanism is based on compressed versions
of the set of NFA states which it represents. (The set of NFA states
is first changed to a bit-map, and then the significant elements of
this bit-map are stored.) Observation showed that there tended to be
runs of searches for the same set of states (corresponding to a single
DFA state), so a one DFA state cache was added to improve performance.



Pattern - 7






A DFA file consists of zero or more DFA records terminated by an
end marker (0FFFFH). Each DFA record represents the transition from
one state to another based upon an input symbol. The DFA record con-
sists of a DFA state number (represented by two bytes, unsigned), the
DFA state number which is to be transferred to (two bytes), the pat-
tern number, possibly 0FFFFH for no pattern (two bytes), and the char-
acter on which transition is to take place (one byte, ASCII). It
should be pointed out that the file will be output ordered first by
increasing state number, and, in each state number, by increasing
transition characters. This can be taken advantage of when reading
the DFA file (see the section on DFAPattern). While this file format
is not the most compact, it is simple and flexible for the run-time
DFA loader. DFA states start with one (unlike NFA states) and con-
tinue up; a transition to state zero indicates that no more transi-
tions occur. The start state of the DFA is always number one.

The local module StateSetHandler hides the details of the ab-
stract data type used for storing sets of NFA states. The set is ac-
tually represented as a stack, as recommended in Compilers: Princi-
ples, Techniques, and Tools. Other than that there is nothing partic-
ularly noteworthy about it.

Since the original running time of the program was quite long, I
used some optimization tricks which one should be careful of when mak-
ing modifications. First of all, the program is now compiled without
run-time checks for arithmetic overflow or correct array indexing.
This shouldn't really cause any problems since changes can be tested
with these checks, and then they may again be removed. The next point
is more subtle, and therefore more dangerous, but it also produced a
greater increase in run-time speed. The passing of large data struc-
tures by value typically takes longer than passing by reference. I
felt that the amount of time spent in passing information around on
the stack justified passing all of the larger data structures by ref-
erence, unless it was required to be passed by value. While this is
the opposite of the normal policy regarding the passing of parameters,
the speed increase (the program now runs in two-thirds of its original
time) seems to justify it. However, one should be very careful when
making changes to the program. If a StateSet or StateMap is modified,
make sure that the change is intended to go back to the original
value, or that it is a copy of the original value being modified.

The programs NFA2TXT and DFA2TXT are considered trivial, and are
therefore not discussed here.


FAInput
This separately compiled module acts as the input system for both NFA-
Pattern and DFAPattern. It uses a simple two buffer technique, as
recommended in Compilers: Principles, Techniques, and Tools. It
should be pointed out that there is only one input buffer. If both
pattern matchers are active at the same time, they will both draw from
the same input.


NFAPattern
The separately compiled module NFAPattern is again a straightforward
NFA simulator, with the exception that only important NFA states are


Pattern - 8






returned when computing "e-closure." This reduces running time by re-
ducing the number of states which need to be considered on an input
symbol.

The data structure used to hold the NFA is simply an array of
state records (see the section on LEX2NFA) and an array of records
which contain information about the patterns to be recognized. When
reading an NFA from a file, each regular expression is read individu-
ally, transposed to start at the first available space in the array of
state records and stored in that same array. Information about the
pattern being recognized is then stored in the other array. This in-
formation includes the pattern's number, transposed start state, and
accepting state. The values are defined by making calls on DefineCon-
stant and DefineState, therefore mechanically generated code should
produce a similar sequence of calls.


DFAPattern
The separately compiled module DFAPattern is a straightforward imple-
mentation of a DFA simulator. The data structure used to hold the DFA
transition is not the fastest, but it represents a relatively good
time/space trade off. There are again two arrays; one holds the tran-
sitions, and the other is an index into the transition array based
upon state number. To save both time and space, runs of transitions
are stored in a single transition element. A run of transitions is
considered to have the same state number, pattern number, and accept-
ing state, and sequential input symbols. Once again, mechanically
generated code is supposed to take care of all of the details of get-
ting the calls the correct order, etc.

During run-time, the current input symbol is compared to the high
character in the run. When the symbol is in the sequence of transi-
tion characters, a transition takes place.


Any comments or donations would be greatly appreciated, please send
them to:


Narti Kitiyakara
2917 Ursulines Ave.
New Orleans, LA 70119

















Pattern - 9







Formal Grammar


What follows is a description of the language accepted by the program
LEX2NFA, in my own variation of Extended Backus-Naur Form (EBNF). In the
grammar, identifiers starting with upper-case letters (e.g.
"LexSpecification") are syntactic units, which will be defined somewhere in
the grammar. Identifiers which begin with lower-case letters (e.g. char-
Const) and any literal strings, enclosed by double quotes (e.g. ".."), are
lexical elements. (Non-constant lexical elements will be described in
footnotes the first time they are used.) Other symbols used in the grammar
include "::=", which is used to indicate a production; "*", which indicates
Kleene closure (the previous element may be repeated zero or more times);
parentheses, which are used as grouping symbols; "+", which indicates Pos-
itive closure (the previous element may be repeated one or more times);
empty braces ("{}"), which indicates a translation on empty input; and any-
thing enclosed in brackets, which indicates that the enclosed construction
is optional (may be repeated zero or one time).
ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ



LexSpecification ::= ManifestConstants "%" Definition* "%" Equation* "%"

ManifestConstants ::= identifier1 ("," identifier)*

Definition ::= identifier "=" Expression "."

Equation ::= identifier ">" Expression "."

Expression ::= Term ("|" Term)*

Term ::= (Factor OptionalPostFixOp)+

Factor ::= literalString
| CharClass
| "\" character2
| charConst3
| identifier
| "(" Expression ")"

OptionalPostFixOp ::= "*"
| "+"
| "?"
| {}

CharClass ::= "[" Element ("," Element)* "]"
____________________

1 The lexical construction "identifier" consists of a letter followed
by zero or more letters or digits.

2 The lexical construction "character" consists of a single, un-quoted
character.

3 The lexical element "charConst" consists of a literal character (see
description in the text about expressions) or a literal string of length
one.






Element ::= charConst [".." charConst]


  3 Responses to “Category : Modula II Source Code
Archive   : PATTERN.ZIP
Filename : PATTERN.TXT

  1. Very nice! Thank you for this wonderful archive. I wonder why I found it only now. Long live the BBS file archives!

  2. This is so awesome! 😀 I’d be cool if you could download an entire archive of this at once, though.

  3. But one thing that puzzles me is the “mtswslnkmcjklsdlsbdmMICROSOFT” string. There is an article about it here. It is definitely worth a read: http://www.os2museum.com/wp/mtswslnk/