Professional Documents
Culture Documents
A Fast String Searching Algorithm
A Fast String Searching Algorithm
Introduction
~=o cost(m) * prob(m) (It should be noted that we will not put these formulas
into closed form, but will simply evaluate them to
m=0 prob(m) * ~ k~=l skip(m,k) * k )) verify the validity of our empirical evidence.)
We now perform a similar analysis for deltas; deltas
lets us slide down by k if (a) doing so sets up an
where cost(m) is the cost associated with discovering
alignment of the discovered occurrence of the last m
that only the last m characters of pat match; prob(m) is
characters of pat in string with a plausible reoccurrence
the probability that only the last m characters of pat
of those m characters elsewhere in pat, and (b) no
match; and skip (m, k) is the probability that, supposing
smaller move will set up such an alignment. The
only the last m characters of pat match, we will get to
probability probpr(m, k) that the terminal substring of
slide pat down by k.
pat of length m has a plausible reoccurrence k charac-
U n d e r our assumptions, the probability that only
ters to the left of its first character is:
the last m characters of pat match is:
p r o b p r ( m , k) = if m + k < patlen
prob(m) = pro(1 - p)/(1 - ppatten). t h e n (1 - p ) * p "
else ptaatlen-k
(The denominator is due to the assumption that a
mismatch exists.) Of course, k is just the distance delta2 lets us slide
The probability that we will get to slide pat down provided there is no earlier reoccurrence. We can
by k is determined by analyzing how i is incremented. therefore define the probability probdelta2(m, k) that,
However, note that even though we increment i by the when only the last m characters of pat match, delta2
maximum max of the two deltas, this will actually only will allow us to move down by k recursively as follows:
slide pat down by max - m, since the increment of i
probdelta2(m, k)
also includes the m necessary to shift our attention
back to the end of pat. Thus when we analyze the =probpr(m,k)(1-k~=11Probdelta2(m,n) ) •
contributions of the two deltas we speak of the amount
by which they allow us to slide pat down, rather than We slide down by the maximum allowed by the
the amount by which we increment i. Finally, recall two deltas (taking adequate account of the possibility
that if the mismatched character char occurs in the that delta1 is worthless). If the values of the deltas
already matched final m characters of pat, then deltaa were independent, the probability that' we would ac-
is worthless and we always slide by deltas. The proba- tually slide down by k would just be the sum of the
bility that deltal is worthless is just (1 - (1 - p)m). Let products of the probabilities that one of the deltas
us call this probdelta~worthless(m). allows a move of k while the other allows a move of
The conditions under which delta~ will naturally let less than or equal to k.
us slide forward by k can be broken down into four However, the two moves are not entirely indepen-
cases as follows: (a) delta~ will let us slide down by 1 if dent. In particular, consider the possibility that delta1
char is the (m + 2)-th character from the righthand is worthless. Then the char just fetched occurs in the
end of pat (or else there are no more characters in pat) last m characters of pat and does not match the (m +
and char does not occur to the right of that position 1)-th. But if delta2 gives a slide of 1 it means that
(which has probability (1 - p ) " * (if m + 1 = patlen sliding these m characters to the left by i produces a
then 1 else p)). (b) delta1 allows us to slide down k, match. This implies that all of the last m characters of
where 1 < k < patlen - m, provided the rightmost pat are equal to the character m + 1 from the right.
occurrence of char in pat is m + k characters from the But this character is known not to be char. Thus char
right end of pat (which has probability p * (1 - cannot occur in the last m characters of pat, violating
p)k+m-~). (c) When patlen - m > 1, deltai allows us to the hypothesis that delta~ was worthless. Therefore if
slide past patlen - m characters if char does not occur delta~ is worthless, the probability that delta2 specifies
in pat at all (which has probability (1 - p)paae,-1 given a skip of 1 is 0 and the probability that it specifies one
that we know char is not the (m + 1)-th character from of the larger skips is correspondingly increased.
768 Communications October 1977
of Volume 20
the ACM N u m b e r 10
This interaction between the two deltas is also felt Fig. 3.
(to a lesser extent) for the next m possible delta2's, but 1.0
I I I I I
we ignore these (and in so doing accept that our
analysis may predict slightly worse results than might
be expected since we allow some short delta2 moves
when longer ones would actually occur).
The probability that delta2 will allow us to slide
down by k when only the last m characters of pat
match, assuming that deltai is worthless, is:
probdelta~(m,k) = ifk = 1
then 0
else
probpr(m, k) 1- probdelta'2(m, n) .
k--1
+ ~. probdeltal(m, n) * probdelta2(m, k)
n=l
Indiana University Computing Network. Chm: Collins, Computer Sciences Dept,, University of
Professional Activities Tom Osgood, IU-EAST, 2325 Chester Boulevard, Wisconsin, 1210 W. Dayton Street, Madison WI
Richmond, IN 47374. 53706.
Calendar of Events 12-17 M a r c h 1978 12-16 June 1978
A C M ' s calendar policy is to list open com- Symposium on Computer Simulation of Bulk 7th Triennial I F A C World Congress. Spon-
puter science meetings that are held on a not-for- Matter from Molecular Perspective, Anaheim, sor: IFAC. Contact: I F A C 78 Secretariat, PUB
profit basis. Not included in the calendar are edu- Calif.; part of 1978 Annual Spring Meeting of 192, 00101 Helsinki 10, Finland.
cational seminars institutes, and courses. Sub- American Chemical Society. Sponsor: ACS Div. 19-22 June 1978
Computers in Chemistry. Contact: Peter Lykes, Annual Conference of the American Society
mittals should be substantiated with name of the for Engineering Education (Computers in Edu-
sponsoring organization, fee schedule, and chair- Illinois Institute of Technology, Chicago, IL
60616; 312 567-3430. cation Division Program), University of British
m a n ' s name and full address. Columbia, Vancouver, B.C., Canada. Sponsor:
One telephone number contact for those in- 28-30 March 1978
3rd Symposium on Programming, Paris, ASEE Computers in Education Division. Contact:
terested in attending a meeting will be given when ASEE, Suite 400, One DuPont Circle, Washing-
a number is specified for this purpose in the news France. Sponsor: Centre National de la Recherche
Sciantifique (CNRS) and Universit6 Pierre et ton, D C 20036.
release text or in a direct communication to this 22-23 June 1978
periodical. Marie Curie. Contact S6cretariat du Colloque,
Institut de Programmation, 4, Place Jussieu, • International Conference on the Perform-
All requests for A C M sponsorship or coop- 75230 Paris Cedex 05, France. ance of Computer Installations, Gardone Riviera,
eration should be addressed to Chairman, Con- Lake Garda, Italy. Sponsor: Sperry Univac, Italy,
ferences and Symposia Committee. Dr. W.S. 29-31 March 1978
Conference on Information Sciences and with cooperation of A C M SIGMETRICS,
Dorsey, Dept. 503/504 Rockwell International Systems, Johns Hopkins University, Baltimore, ECOMA, AICA, A C M Italian Chapter. Contact:
Corporation, Anaheim, CA 92803. F o r European Md. Contact: 1978 CISS, Department of Electri- Conference Secretariat, CILEA, Via Raffaello
events, a copy of the request should also be sent cal Engineering, Johns Hopkins University, Balti- Sanzio 4, 20090 Segrate, Milan, Italy.
to the European Regional Representative. Tech- more, MD 21218. 2-4 August 1978
nical Meeting Request Forms for this purpose 3-7 April 1978 • International Conference on Databases: Im-
can be obtained from A C M Headquarters or Fifth International Symposium on Comput- proving Usability and Responsiveness, Technion,
from the European Regional Representative. Lead ing in Literary and Linguistic Research, Univer- Haifa, Israel. Sponsor: Technion in cooperation
time should include 2 months (3 months if for sity of Aston, Birmingham, England. Sponsor: with ACM. Prog. chm: Ben Shneiderman, Dept.
Europe) for processing of the request, plus the Association for Literary and Linguistic Comput- of Information Systems Management, University
necessary months (minimum 2) for any publicity ing. Contact: The Secretary (CLLR), Modern of Maryland, College Park, MD 20742.
to appear in Communications. Languages Dept., University of Aston, Birming- 13-18 August 1978
Events for which ACM or a subunit of ACM ham B4 7ET, England. Symposium on Modeling and Simulation
is a sponsor or collaborator are indicated by • . 4-8 April 1978 Methodology, Weizmann Institute of Science, Re-
Dates precede titles. Second International Conference on Combi- hovot, Israel. Contact: H.J. Highland, State Uni-
In this issue the calendar is given to April natoriaI Mathematics, Barbizon-Plaza Hotel, New versity Technical College Farmingdale, N.Y., or
1978. N e w Listings are shown first; they will ap- York City. Sponsor: New York Academy of Sci- B.P. Zeigler, Dept. of Applied Mathematics,
pear next month as Previous Listings. ences. Contact: Conference Dept., New York Weizmann Institute of Science, Rehovot, Israel.
Academy of Sciences, 2 East 63 St., New York, 30 October-1 November 1978
N E W LISTINGS N Y 10021; 212 838-0230. 1978 S I A M Fall Meeting, Hyatt Regency
16-17 November 1977 15-19 M a y 1978 Hotel, Knoxville, Tenn. Sponsor: SIAM. Contact:
• Workshop on Future Directions in Computer 16th Annual Convention of the Association H.B. Hair, SIAM, 33 South 17th St., Philadelphia,
Architecture, Austin, Tex. Sponsors: A C M SIG- for Educational Data Systems, Atlanta, Ga. PA 19103; 215 564-2929.
A R C H , IEEE-CS TCCA, and University of Texas Sponsor: AEDS. Contact: James E. Eisele, Office
at Austin. Conf. chm: G. Jack Lipovski. Dept. of of Computing Activities, University of Georgia,
EE, University of Texas, Austin, T X 78712. Athens, G A 30602. PREVIOUS LISTINGS
5-9 December 1977 22-25 M a y 1978 17-19 October 1977
Third International Symposium on Com- Sixth International C O D A T A Conference, • A C M 77 Annual Conference, Olympic Ho-
puting Methods in Applied Sciences and Engi- Taormina, Italy. Sponsor: International Council tel. Seattle, Wash. Gen. chin: J a m e s S. Ketchel.
neering, Versailles, France. Organized by IRIA. of Scientific Unions Comm. on D a t a for Science Box 16156, Seattle, W A 98116; 206 935-6776.
Sponsors: AFCET, G A M N I , IFIP WG7.2 Con- and Technology. Contact: C O D A T A Secretariat, 17-21 October 1977
tact: Institut de Recherche D'Informatique et 51, Boulevard de Montmorency, 75016 Paris, Systems 77, Computer Systems and Their
D'Automatique, Domaine de Voincean, Rocquen- France. Application, Munich, Federal Republic of Ger-
court, 78150 Le Chesnay, France. 24-26 M a y 1978 many. Contact: Miinehener Messe- und Ausstel-
13-15 February 1978 • 1978 S I A M National Meeting, University of lungsgesellschaft mbh, Kongresszentrum, Kong-
• Symposium on Computer Network Proto- Wisconsin, Madison, Wis. Sponsor: SIAM in co- ressbiiro Systems 77, Postfach 12 10 09, D-8000
cols, Li6ge, Belgium. Sponsors: I F I P T.C.6 and operation with A C M SIGSAM. Contact: H.B. Mfinchen 12, Federal Republic of Germany.
A C M Belgian Chapter. Contact: A. Danthine, Hair, SIAM, 33 South 17 St., Philadelphia, PA 18-19 October 1977
Symposium on Computer Network Protocols, 19103; 215 564-2929. M S F C / V A H Data Management Symposium,
Avenue des Tilleuls, 49, B-4000, Li6ge, Belgium. 26 M a y 1978 Sheraton Motor Inn, Huntsville, Ala. Sponsors:
3 March 1978 • Computer Algebra Symposium, University of N A S A Marshall Space Flight Center, University
Indiana University Computer Network Con- Wisconsin, Madison, Wis.; part of the 1978 SIAM of A l a b a m a in Huntsville. Contact: General Chair-
ference on Instructional Computing Applications, National Meeting. Sponsor: SIAM in cooperation (Calendar continued on p. 781)
Indiana University-East, Richmond, Ind. Sponsor: with A C M SIGSAM. Syrup. chin: George E.