Contents of the RPL.DOC file
TurboPower Utilities Programmer's Manual
Copyright (C) 1985 TurboPower Software.
All Rights Reserved.
RPL is based on the pattern matching algorithms presented in the
book "Software Tools in Pascal". However, we have significantly
extended the capabilities of the tools presented there, and have
also used more of the powerful data structures available in
Pascal to improve the performance of the program.
We are very interested in hearing of any general purpose
match/replace expressions that you develop using RPL.
For a basic introduction to regular expressions or pattern
matching algorithms we direct you to the references given in
RPL uses a record and pointer-based approach to storing regular
expressions. The following type declarations are key to all of
the operations of RPL.
longstring = string;
tokens = (tnil, tlitchar, tccl, tnccl, tclosure, tmaybeone,
tany, tbol, teol, tgroup, tbtag, tetag, tditto);
patptr = ^patrecord;
sptr = ^longstring;
patrecord = RECORD
tok : tokens;
one : Char;
nextok : Boolean;
strptr : sptr;
nestptr, next : patptr;
Each regular expression used by RPL is stored internally as a
linked list of PATRECORDs. The function of each record is
described by the .TOK field of the record. The most
straightforward function for one of these records is the TLITCHAR
token type. This indicates that the record will match a single
literal character, and that character is stored in the .ONE field
of the record. The record fields perform the following functions:
TOK - a scalar type that classifies the function of this
regular expression token.
ONE - a single character, only used when the token matches a
single literal character.
NEXTOK - a boolean flag. If true, indicates that this token is
part of a chain of alternates. A match to any one of the
alternates is acceptable, and once a match is found the
succeeding alternates are skipped over.
STRPTR - a pointer to a text string kept on the heap. The text
string will hold a character class, which is a list of
characters. The record token will be matched if any one
of the characters is found.
NESTPTR - This pointer points to a sub-list of the main
regular expression. RPL uses these sub-lists to support
nesting of regular expressions. Because the sub-lists
themselves are composed of the PATRECORD record type, they
may also have sub-lists and therefore nesting may occur
to arbitrary depth.
NEXT - points to the next record in the linked list of match
The linked list representing a regular expression is terminated
by a NEXT field containing NIL.
The .TOK field can take on any of 13 values. A brief description
of each of these values follows:
TNIL - used as a place holder to start new linked lists, and
occasionally within a list. Performs no matching
function, and is simply skipped over during pattern
TLITCHAR - indicates that the token matches a single literal
character, contained in the .ONE field.
TCCL - denotes a character class. The match token matches any
single character from the string pointed to by the field
TNCCL - denotes a negative character class. Matches any
character NOT contained in the string pointed to by the
TCLOSURE - indicates that the token is a "0 or more" closure.
The preceding token in the linked list is matched as
many times as possible. A "1 or more" closure is created
simply by preceding the TCLOSURE token by two copies of
the token to be matched.
TMAYBEONE - indicates that the token is a "0 or 1" closure.
The preceding token in the linked list is matched either
0 or 1 times.
TANY - the token matches any single character.
TBOL - the token matches only at the beginning of a text line.
TEOL - the token matches only at the end of a text line.
TGROUP - indicates that a nested group of tokens follows. A
sublist started by the .NESTPTR field is processed
before returning to processing of the current list.
TBTAG - indicates that a "tagged match field" is starting.
Performs no matching or grouping function.
TETAG - the end of a "tagged match field."
TDITTO - used in replace expressions only. The .ONE field will
contain an integer from 0 to 9 indicating which of the
tagged match words is to be inserted into the output
In outline, RPL works as follows:
The text version of the match and replace regular expressions is
read from the command line, a specified file, or from a prompt.
These regular expressions are checked for syntax, and linked
lists of the tokens just described are built to represent the
The input and output text files are opened, and the input file is
read one line at a time. Depending on what regular expressions
and options are specified, each input line may be processed up to
three separate times using different regular expressions each
time. The control flow for the various options becomes apparent
in the routine PROCESSLINE.
A match expression processes the line by picking a start location
(the first character) within that line. It then compares the line
character by character against the match linked list. If it makes
it all the way through the linked list without a mismatch, it has
found a match. If not, it will increment its start position by
one character and try the linked list again until the start
position reaches the end of the line. From these operations
alone, the algorithm may need to do order of n*m high level
comparisons per line, where n is the number of characters in the
line and m the number of characters in the pattern.
If the match expression contains a closure (like a * wildcard),
the processing requirements may increase by another factor of n.
The reference by Allen Holub describes this effect quite well.
A replace expression needs to repeat these operations, with some
additional work. When the replace expression matches a character,
it sets a flag for that position of the input text line saying
whether the matched character is part of a tagged match word, and
if so, which one. After completing a line in this manner, it
passes through the line doing its replacements on the basis of
the flag array set on the first pass.
Finally, after the replace expression has built an output line,
that line is written to the standard output.
The operation of RPL can become deeply recursive when using
complex regular expressions, and its performance is quite
sensitive to the combination of expressions and input data. It is
not intended to be used for simple fixed string match and replace
due to the overhead of its generality.
3. Procedure Locator
19 rpl(Input, Output);
120 parsecommand(usepsp : Boolean; cline : longstring);
426 realdiskfile(VAR fname : filestring; VAR useconsole : Boolean) : Boolean;
457 checkpat(pat : patline; VAR patrec : patptr) : Boolean;
605 processline(lin : line);
26 initvid(VAR attribute : Byte; sethivid : Boolean);
46 resetvid(attribute : Byte);
68 defaultextension(extension : filestring; VAR infile : filestring);
85 openfile(fname : filestring; VAR handle : Integer);
101 forcedup(handle, newhandle : Integer);
110 getchunk(VAR l : bufline; VAR count : Integer) : Boolean;
124 breakpressed : Boolean;
158 createfile(fname : filestring; VAR handle : Integer);
175 closefile(handle : Integer);
188 iostat(bit : Integer) : Boolean;
203 time(VAR sec : Real);
211 appends(VAR l1; len1 : Integer; VAR l2; len2 : Integer; VAR lout : line);
239 checkmore(VAR screenline : Integer);
259 putl(l : line);
286 readyesno(default : Boolean) : Boolean;
302 getcom(usepsp : Boolean; inlin : longstring; VAR errstring : message) : Boolean;
317 comchar : Char;
8 foundfile(fname : filestring; VAR pathname : filestring;
33 getenvir(VAR ipath : Integer);
76 parsepath(ipath : Integer; VAR pathtop : Integer);
93 existremote(filename : longstring) : Boolean;
108 exist(VAR f : filetype) : Boolean;
117 openremote(pathname, fname : longstring; VAR f : filetype);
125 getcurrent(VAR cd : longstring; VAR drivenum : Integer);
146 changedrive(drivenum : Integer);
153 changedir(dirname : longstring);
8 match(VAR lin : line; pat : patptr) : Boolean;
14 amatch(VAR lin : line; offset : Integer; pat : patptr) : Integer;
23 omatch(VAR lin : line; VAR i : Integer; pat : patptr) : Boolean;
8 writepat(patrec : patptr);
42 getpat(VAR arg : patline; VAR patlist : patptr) : Boolean;
48 makepat(VAR arg : patline; start : Integer; delim : Char; VAR patlist : patptr) : Integer;
59 addpat(tok : tokens; lastj : patptr; VAR j : patptr; s : longstring);
63 cleanupcase(VAR s : longstring) : longstring;
109 getccl(VAR arg : patline; VAR i : Integer;
116 dodash(delim : Char; VAR arg : patline; VAR i : Integer; VAR s : longstring);
124 addstr(c : Char; VAR j : Integer; VAR s : longstring);
131 isalphanum(c : Char) : Boolean;
9 getrep(VAR arg : patline; VAR patlist : patptr) : Boolean;
13 makerep(VAR arg : patline; start : Integer; delim : Char; VAR patlist : patptr) : Integer;
22 addrep(tok : tokens; lastj : patptr; VAR j : patptr; s : longstring);
94 subline(VAR lin : line; patrec, reprec : patptr; VAR sub : line);
102 amatch(VAR lin : line; VAR flags : flag;
114 omatch(VAR lin : line; VAR flags : flag;
247 writesub(VAR lin : line; VAR flags : flag; reprec : patptr;
255 findtag(VAR lin : line; VAR flags : flag; i, iend, tagnum : Integer;
4. Hierarchy Diagram
resetvid, time, iostat, appends, initvid, setbreak
getenvir, parsepath, existremote, exist
changedir, getcurrent, changedrive
forcedup, defaultextension, foundfile
halt, hivid, lovid
breakhalt, breakpressed, appends
parsecommand, putl, getcom
5. Special Features
A. Long Text Lines
RPL incorporates the same basic techniques that DIFF uses for
reading and writing long text lines (easily up to 32767
characters). See DIFF.DOC for details.
RPL treats long strings as records, with the length of the string
stored in an integer field and the text stored in a character
array. We provide a procedure APPENDS that allows you to append
these long strings and normal strings in any combination (always
creating a long string).
B. Treatment of Newlines
The Unix text matching tools possess an advantage over those that
must run within MSDOS and other operating systems. The end of
each text line in Unix is marked by a single character called
Newline. In MSDOS, each text line is terminated by a pair of
characters, Carriage Return and Line Feed. Although it seems
conceptually simple, matching these two characters as sometimes
one entity and sometimes individually turns into a major
You will find some tricky little pieces of code in the SUBLINE
procedure and in the two OMATCH functions that take care of these
problems. You can search for #13 (ASCII carriage return) to find
them. These code patches work well as long as and are
always found together, but you may be able to find some cases
where in the middle of a text line won't work. To "solve"
this we ignore s during the text read-in procedure and stick
a pair onto the end of each line before pattern matching.
If you come up with more elegant solutions to these issues,
please let us know.
RPL's algorithms are modeled after, but upgraded from, those
presented in the following. The book also presents a somewhat
gentle introduction to regular expressions.
Software Tools in Pascal, by B. Kernighan and P. Plauger,
Addison-Wesley Publishing, 1981, pp. 141-168.
The basic idea for the record structure we use to represent match
tokens came from the following article, which provides a working
(although simplified) version of GREP written in C.
"GREP.C", by Allen Holub, Dr. Dobbs Journal, October 1984, pp.
A more theoretical view of regular expressions and some different
(more powerful, and also more complicated) algorithms for
matching them is found in the following:
Principles of Compiler Design, A. Aho and J. Ullman, Addison-
Wesley Publishing, 1979, pp. 74-118.
A user's introduction to regular expressions and GREP is given in
the following and in numerous other books about Unix:
Introducing the Unix System, by H. McGilton and R. Morgan,
McGraw-Hill Books, 1983, pp. 149-162.
And finally, if you have access to a Unix system, scan the MAN
entries for ED, SED, GREP, EGREP, and FGREP and also see if your
system has the EMACS text editor. Our utility RPL exceeds the
pattern matching capabilities of all of these tools with the
possible exception of EMACS.