KMP

We use the notation A[x…y] to describe the subarray of A consisting of A[x] followed by A[x+1…y].

For convenience in this discussion we assume that arrays' first elements arenumbered 1, not 0. Let the T be the array of characters to be searched and P be the array of characters forming the pattern.

Here is the straightforward program that will return “yes” if a subsequence of T that matches P of and return “no” if there is none:

function Matches(P, T) {

let i = 1

let state = “maybe”

while (state == 'maybe') {

let j = 1

state = “maybe_at_i”

while (state == “maybe_at_i”){

if (j > P.length) return “yes”

else if(i+j-1 > T.length)return “no”

else if(P[j]==T[i+j-1]) j=j+1

else state = “maybe”

i = i+1

}

}

}


Theorem 1. Matches can take time proportional to (T.length)(P.length).

Proof: Consider the perverse case where P is all Ws up until the final character, a Z; and T is simply 2(P.length) Ws. So there will be no match. The inner loop will do P.length comparisons failing on the Z repeatedly, until P starts running off the end of T. At that point the program has performed the inner loop P.length times, so it has done

(P.length)(P.length)=(P.length)(T.length)/2

comparisons, which is proportional to (P.length)(T.length).

QED


To get an idea of how to improve the algorithm consider P = XXXAXXXB and T = XXXAXXXAXXXB. The matching process starts off like this, failing on the B at the end of P:

T:XXXAXXXAXXXB

P:XXXAXXXB

Since we know that the first three Xs of P will not match the first A in T and that the first three Xs will match the second three X's we've just matched in T, we know we can immediately slide P forward 4 letters rather than just the 1 which the program calls for. In terms of the program it means we can leave i = 8, poised on the second A in T and set j = 4, poised on the A in P. Furthermore, we could have predicted this ability by analyzing P without even looking at T! This reasoning can be distilled into table, L, that identifies each by segment of P that matches an initial segment of P by pointing from the last character of the later segment to the last character of the initial segment. In general,

L(j)= max{k| k<j && P[1…k]=P[j-k+1…j]}

and in this case

L is 012301234

More generally, define a function MISL(Q, j) to be the Maximum Initial Segment Length of P that matches a segment of Q ending at j.

MISL(Q,j) = max{k| k<j && P[1…k]=Q[i-k+1…j}

Now consider another function, L', which will show is equal to L. Define L' and MiniMatch, assuming 1≤j≤P.length,

L'(j) = (j=1) ? 0 : MiniMatch(P, j, L'(j-1)+1)


MiniMatch(Q, i, j) = (P[j]=Q[i]) ? j : (j=1) ? 0 : MiniMatch(Q, i, L'(j-1)+1)



Theorem 1. L'(j) < j and (j<i implies MiniMatch(Q,i,j)<i) (W)

Proof: by induction on j

For j= 1

L'(j)=0<1 and MiniMatch(Q,i,j) = j or 0, and both are less than i.


Assume W for 1≤n<j


L'(j-1)+1 < j-1+1 = j so L'(j) = MiniMatch(P, j, L'(j-1)+1) < j by W

and

j<i implies L'(j-1)+1<i implies MiniMatch(Q,i,j)= j or 0 or MiniMatch(Q,i,L'(j-1)+1),

all of which are less than i because of W

QED


Now we prove an important property of MiniMatch.


Theorem 2: For j=1,…,min(P.length,i-1), L(j)= MISL(P,j) (X)

implies

MiniMatch(Q,i,MISL(Q,i-1)+1)= MISL(Q,i) (Y)


Proof: Define j’= MISL(Q,i-1)+1 and prove by induction on j’


j’==1 implies L(j’)=0, by the definition of L

max{k| k<i-1 && P[1…k]==Q[j-k…i-1]}

= MISL(Q,i-1) = j’-1 = 0, by the definition of j’

So for k =1,…,i-2, P[1…k]≠Q[i-k…i-1] (Z)


let k’= k+1

So for k’=2,…,i-1,

P[1…k’-1] = P[1…k] ≠ Q[i-k…i-1] = Q[i-k’+1…i-1] by Z

So P[1…k’-1] ≠ Q[i-k’+1…i-1] (W)


Consider MiniMatch(Q,i,1)


If P[1]=Q[i]

MiniMatch(Q,i,1)=1, by the definition of MiniMatch

MISL(Q,i)=max{k’| k’<i && P[1…k’]=Q[i-k’+1…i]}=1

because P[1]=Q[i] and W


If P[1]≠Q[i] & j=1

MiniMatch(Q,i,j)=0

MISL(Q,i) = max{k’| k’<i && P[1…k’]=Q[i-k’+1…i]}==0

because P[1]!=Q[i] and W


P[1]≠Q[i] & j≠1 is always false


So X implies Y for j’=1


Now assume X implies Y is true for 1≤n<j’, and prove it for j’. (V)

Assume X, for j=1,…,min(P.length, i-1), L(j)=MISL(P, j)

If P[j’]=Q[i]

MiniMatch(Q, i, j’)=j’ by definition of MiniMatch j’=MISL(Q, i-1)+1

= max{k+1| | k<i-1 && P[1…k]=Q[i-k…i-1]}

= max {k | k<i && P[1…k-1]=Q[i-k+1…i-1]}

=max {k | k<i && P[1…k-1]=Q[i-k+1…i-1] && P[j’]=Q[i]}

because adding true statement can’t change max value

= max {k | k<i && P[1…k]=Q[i-k+1…i]}

..= MISL(Q, i)


If P[j’]≠Q[i] & j’=1, we’ve already proved X implies Y in the base case. If P[j’]≠P[i] & j’≠1

MiniMatch(Q, i, j’)=MiniMatch(Q, i, L(j’-1)+1)

by the definition of MiniMatch

==MISL(Q, i) since L(j’-1)+1 < j’-1+1=j’ and V

QED


Corollary 1: For j=1,…,P.length L(j)=MISL(P, j)

Proof by induction on j

Base case: j==1

L(j)==0

MISL(P, 1) == {k | ,k<1 && P[1…k]==P[1-k+1…1]} == 0

since P[1…0]==P[2…1]

Induction Hypothesis: For 1≤n<j, L(n)==MISL(P, n)

Choose i=j and Q=P, so Theorem 2 becomes:

For j’=1,…,j-1 L(j’)==MISL(P, j’)

implies

MiniMatch(P, MISL(P, j’-1)+1) = MISL(P, j’) (R)

Now the premise of R is true by the induction hypothesis, so:

MiniMatch(P, MISL(P, j-1)+1) == MISL(P, j) by R

L(j) == MiniMatch(P, L(j-1)+1)) because j!=1

== MiniMatch(P, MISL(P, j-1)+1)) by the induction hypothesis

== MISL(P, j) by Theorem 2

QED


Theorem 3. For j=1,….,P.length, L(j)=MISL(P, j) (U)

Proof by induction on j

For j=1

L(1)=0 by definition

MISL(P, 1)=max{k| k<1 & P[1…k]=P[1-k+1…1]}=0

since 0 is max{k| k<1} and the 0-length sequences, P[1…0] and P[2…1], are equal.

Now assume U is true for all 1≤n<j and we’ll prove it for j

L(j) = MiniMatch(P, j, L(j-1)+1) by definition

= MiniMatch(P, j, MISL(P, j-1)+1) by U

= MISL(P, j) by Theorem 2 because X holds by the induction hypothesis

QED


Now define Match(i) = (i=1) ? 0 : MiniMatch(T, i, Match(i-1)+1)


Theorem 4 Match(i)= MISL(T, i)

Proof by induction on i

If i=1 Match(1)=0 and

MISL(T,1)={k| k <1 && P[1…k]=T[1-k+1…1]}= 0

since P[1…0]=T[2…1]

Assume true for 1≤n<i

Match(i) = MiniMatch(T, i, MISL(T, i-1)+1)

by the induction hypothesis

=MISL(T,i) by Theorem 2

QED


Corollary 2. maxi{Match(i)}= P.length if and only if P is a subsequence of T.

Proof: The maximum cannot be more than P.length.

If it is P.length then P =T[i-P.length…i].

Otherwise it is less than P.length so there is no subsequence of T equal to P

QED

MiniMatch is an example of a recursive program that uses tail recursion, which means that any recursive calls of itself produce the final result without any further computation. Any tail recursive function can be implemented by an iterative program, in this case:

function MM(Q, i, j) {

while (P[j]!==Q[i] && j!=1) {

j = L(j-1)+1

}

return P[j]==Q[i] ? j+1 else 0

}


L can also be implemented by an iterative function that tabulates the values of L in an array, LTable

function MakeL(){

LTable[1] = 0

Let j = 1

Let i = 2

while (i < P.length){

if (P[j]==P[i]) {

LTable[i] = j

i = i+1

j = j+1

} else if (j==1) {

LTable[i] = 0

i = i+1

} else {

j = LTable[j-1]+1

}

}

}

Finally, the computation of maxi{Match(i)} can be done iteratively by:


function MaxMatch()

MaxMatch = 0

i = 1

j = 1

while (j<P.length){

if (i>T.length) {

return MaxMatch

} if (P[j]==T[i]) {

i = i+1

j= j+1

MaxMatch = max(j, MaxMatch)

} if j !=1 {

j = LTable[j-1]+1

} else {

i = i+ 1

}

}

return P.length

}

The section on Hoare Logic shows that some of these programs are equivalent to their recursive versions.

To see that MaxMatch is bounded by a linear function of i consider the difference 2i-j at the beginning of the while loop. If P[j]=T[i] or j=1 then 2i-j increases by 1 or 2. Otherwise, L[j-1]<j-1<j, so j decreases by at least 1 making 2i-j increase by at least 1. But 2i-j cannot exceed 2T.length since i cannot exceed T.length and j cannot be less than 1. Therefore, the number of comparisons is bounded by 2T.length.

The number of comparisons that the link storing algorithm makes is bounded by 2P.length, by a similar analysis, so the entire number of comparisons to do the match is bounded by 2(P.length+T.length).