Dec 112017
Boyer-Moore search algorithm in TP4. | |||
---|---|---|---|
File Name | File Size | Zip Size | Zip Type |
BOYER.DOC | 3175 | 1217 | deflated |
SEARCHES.DOC | 3559 | 1579 | deflated |
SEARCHES.PAS | 6739 | 2087 | deflated |
Download File SEARCH-4.ZIP Here
Contents of the BOYER.DOC file
W. Vann Hall
Pala Designs
P.O. Box 10804
Arlington, VA 22210
CIS 70346,1144
MCI WVHALL
ABOUT Boyer-MOORE:
Boyer-Moore is an attempt to speed up the usually serial text search.
Traditionally, text searches proceed from the first character onward.
(In these examples, "string" refers to string we're trying to locate,
"text" the array of characters we're trying to find the string in.
These all ignore case sensitivity, matching around line boundaries,
punctuation, etc.)
String to find: STRING
Text to find it in: HOW YOU SEARCH FOR A STRING
Our StringPointer and TextPointer both start at 1. We compare
String[1] and Text[1]. Since "S" and "H" don't match, we increment the
text pointer and compare again. At TextPointer = 9, the "S" of
"STRING" and the "S" of "SEARCH" match, so we increment TextPointer
*and* StringPointer and compare String[2] and Text[10]. "T" and "E"
don't match, so we reset StringPointer and increment TextPointer again.
And so on. You can see that it takes 22 comparisons before we locate
the correct "S" and a full 27 before we know that we have located
"STRING."
Boyer-Moore, on the other hand, works from the end of the string (but
still the beginning of the text). First, it creates a look-up table of
values corresponding to the distance of the first occurence of each
character from the end of the string. For instance, the table for
"STRING" would read:
A=6 B=6 C=6 D=6 E=6 F=6 G=6 H=6 I=2 J=6 K=6 L=6 M=6
N=1 O=6 P=6 Q=6 R=3 S=5 T=4 U=6 V=6 W=6 X=6 Y=6 Z=6
Note that characters not located in the string are set to
Length(String), as it the final character.
Since a 6-character string can't be located entirely within the first 5
characters, we start by comparing the String[6] -- "G" -- with Text[6]:
"O". They don't match, so we increment TextPointer by the value
associated with "O" in the table; that is, by 6. Our next compare is
"G" with the "R" in "SEARCH". We now increment TextPointer by "R"'s
value, 3, and compare "S" to " ". And so on. Here's a tracing of the
progression; the letter to be compared is highlighted by lower casing:
STRINg
HOW YoU SEARCH FOR A STRING
STRINg
HOW YOU SEArCH FOR A STRING
STRINg
HOW YOU SEARCH^FOR A STRING
STRINg
HOW YOU SEARCH FOR A STRING
STRINg
HOW YOU SEARCH FOR A STRINg
STRIng
HOW YOU SEARCH FOR A STRIng
STRing
HOW YOU SEARCH FOR A STRing
STring
HOW YOU SEARCH FOR A STring
String
HOW YOU SEARCH FOR A String
string
HOW YOU SEARCH FOR A string
There; 10 comparisons total. You can see how this would speed
searching a long text -- enough to make up for the additional overhead
of compiling the look-up table.
Pala Designs
P.O. Box 10804
Arlington, VA 22210
CIS 70346,1144
MCI WVHALL
ABOUT Boyer-MOORE:
Boyer-Moore is an attempt to speed up the usually serial text search.
Traditionally, text searches proceed from the first character onward.
(In these examples, "string" refers to string we're trying to locate,
"text" the array of characters we're trying to find the string in.
These all ignore case sensitivity, matching around line boundaries,
punctuation, etc.)
String to find: STRING
Text to find it in: HOW YOU SEARCH FOR A STRING
Our StringPointer and TextPointer both start at 1. We compare
String[1] and Text[1]. Since "S" and "H" don't match, we increment the
text pointer and compare again. At TextPointer = 9, the "S" of
"STRING" and the "S" of "SEARCH" match, so we increment TextPointer
*and* StringPointer and compare String[2] and Text[10]. "T" and "E"
don't match, so we reset StringPointer and increment TextPointer again.
And so on. You can see that it takes 22 comparisons before we locate
the correct "S" and a full 27 before we know that we have located
"STRING."
Boyer-Moore, on the other hand, works from the end of the string (but
still the beginning of the text). First, it creates a look-up table of
values corresponding to the distance of the first occurence of each
character from the end of the string. For instance, the table for
"STRING" would read:
A=6 B=6 C=6 D=6 E=6 F=6 G=6 H=6 I=2 J=6 K=6 L=6 M=6
N=1 O=6 P=6 Q=6 R=3 S=5 T=4 U=6 V=6 W=6 X=6 Y=6 Z=6
Note that characters not located in the string are set to
Length(String), as it the final character.
Since a 6-character string can't be located entirely within the first 5
characters, we start by comparing the String[6] -- "G" -- with Text[6]:
"O". They don't match, so we increment TextPointer by the value
associated with "O" in the table; that is, by 6. Our next compare is
"G" with the "R" in "SEARCH". We now increment TextPointer by "R"'s
value, 3, and compare "S" to " ". And so on. Here's a tracing of the
progression; the letter to be compared is highlighted by lower casing:
STRINg
HOW YoU SEARCH FOR A STRING
STRINg
HOW YOU SEArCH FOR A STRING
STRINg
HOW YOU SEARCH^FOR A STRING
STRINg
HOW YOU SEARCH FOR A STRING
STRINg
HOW YOU SEARCH FOR A STRINg
STRIng
HOW YOU SEARCH FOR A STRIng
STRing
HOW YOU SEARCH FOR A STRing
STring
HOW YOU SEARCH FOR A STring
String
HOW YOU SEARCH FOR A String
string
HOW YOU SEARCH FOR A string
There; 10 comparisons total. You can see how this would speed
searching a long text -- enough to make up for the additional overhead
of compiling the look-up table.
December 11, 2017
Add comments