* qsearch: fast substring matching in forward and backward direction
* 9-Oct-1992, Guido Gronek
Recently I needed a fast substring search algorithm, scanning the
text string from right to left. The result is an ANSI-C implementation
of a function called qsearch(), searching for the leftmost or rightmost
occurrance of a pattern string in a text string.
The algorithm used is 'quick search' by D. M. Sunday, which is a
simple but fast practical method. It's supposed to be even faster
than the well known Boyer-Moore algorithm, see Sunday's original
paper CACM V33 N8, page 132 (for several improvements of the basic
method as well). The reversed text scanning I have realized by a
rather simple variation of the original algorithm.
The fastness of 'quick search' bases on the generation of a shift table
(called 'table delta 1' by Sunday) from the pattern string first. This
permits to perform the actual searching process with a minimal number of
comparisons. There are two separate functions realising these two steps,
namely mktd1() and qsearch(). The shift table has to be declared by the
caller and is passed to mktd1() and to qsearch() as well. The allows a
simultaneous searching for several patterns.
Typically one should generate the shift table once for a given pattern
string and then use it to search for the pattern in a number of text
Another realisation of 'quick search' has been posted to comp.sources.misc
(v14i074) by [email protected]
(Doug Gwyn) to code the usual strstr()
library function. My implementation differs from Gwyn's approach as
1. Before initialising the shift table, the length of the pattern string
is calculated first, see mktd1(). This avoids stepping through the
whole shift table (having 256 elements) twice, as done by the Gwyn
code. Because initialising the shift table is always rather costly,
a quick search based version of the standard strstr() seems useless.
2. A pre-calculation of text string length is done by qsearch() before
the actual matching process begins. The method given by Gwyn avoids
this pre-calculation by recognizing that text characters being compared
with corresponding pattern characters need not be checked against
'\0' too. This results in a better best case and mid case behaviour.
But: pathological combinations of text and pattern strings may produce
extreme overhead. I tried
text : abcdxxxxxxxxxx
to find text characters being tested against '\0' 57 times with a
text length of only 14 !
That's why I arranged things this way: if for some reason the length
of the text string is known, you can pass it to qsearch() via
parameter 'tlen', else simply pass 0.