# KMP

We use the notation A[x…y] to describe the subarray of A consisting of A[x] followed by A[x+1…y].

For convenience in this discussion we assume that arrays' first elements arenumbered 1, not 0. Let the T be the array of characters to be searched and P be the array of characters forming the pattern.

Here is the straightforward program that will return “yes” if a subsequence of T that matches P of and return “no” if there is none:

function Matches(P, T) {

let i = 1

let state = “maybe”

while (state == 'maybe') {

let j = 1

state = “maybe_at_i”

while (state == “maybe_at_i”){

if (j > P.length) return “yes”

else if(i+j-1 > T.length)return “no”

else if(P[j]==T[i+j-1]) j=j+1

else state = “maybe”

i = i+1

}

}

}

**Theorem 1**. Matches can take time proportional to (T.length)(P.length).

Proof: Consider the perverse case where P is all Ws up until the final character, a Z; and T is simply 2(P.length) Ws. So there will be no match. The inner loop will do P.length comparisons failing on the Z repeatedly, until P starts running off the end of T. At that point the program has performed the inner loop P.length times, so it has done

(P.length)(P.length)=(P.length)(T.length)/2

comparisons, which is proportional to (P.length)(T.length).

**QED**

To get an idea of how to improve the algorithm consider P = XXXAXXXB and T = XXXAXXXAXXXB. The matching process starts off like this, failing on the B at the end of P:

T:XXXAXXXAXXXB

P:XXXAXXXB

Since we know that the first three Xs of P will not match the first A in T and that the first three Xs *will* match the second three X's we've just matched in T, we know we can immediately slide P forward 4 letters rather than just the 1 which the program calls for. In terms of the program it means we can leave i = 8, poised on the second A in T and set j = 4, poised on the A in P. Furthermore, we could have predicted this ability by analyzing P without even looking at T! This reasoning can be distilled into table, L, that identifies each by segment of P that matches an initial segment of P by pointing from the last character of the later segment to the last character of the initial segment. In general,

L(j)= max{k| k<j && P[1…k]=P[j-k+1…j]}

and in this case

L is 012301234

More generally, define a function MISL(Q, j) to be the Maximum Initial Segment Length of P that matches a segment of Q ending at j.

MISL(Q,j) = max{k| k<j && P[1…k]=Q[i-k+1…j}

Now consider another function, L', which will show is equal to L. Define L' and MiniMatch, assuming 1≤j≤P.length,

L'(j) = (j=1) ? 0 : MiniMatch(P, j, L'(j-1)+1)

MiniMatch(Q, i, j) = (P[j]=Q[i]) ? j : (j=1) ? 0 : MiniMatch(Q, i, L'(j-1)+1)

**Theorem 1**. L'(j) < j and (j<i implies MiniMatch(Q,i,j)<i) (W)

Proof: by induction on j

For j= 1

L'(j)=0<1 and MiniMatch(Q,i,j) = j or 0, and both are less than i.

Assume W for 1≤n<j

L'(j-1)+1 < j-1+1 = j so L'(j) = MiniMatch(P, j, L'(j-1)+1) < j by W

and

j<i implies L'(j-1)+1<i implies MiniMatch(Q,i,j)= j or 0 or MiniMatch(Q,i,L'(j-1)+1),

all of which are less than i because of W

**QED**

Now we prove an important property of MiniMatch.

**Theorem 2**: For j=1,…,min(P.length,i-1), L(j)= MISL(P,j) (X)

implies

MiniMatch(Q,i,MISL(Q,i-1)+1)= MISL(Q,i) (Y)

Proof: Define j’= MISL(Q,i-1)+1 and prove by induction on j’

j’==1 implies L(j’)=0, by the definition of L

max{k| k<i-1 && P[1…k]==Q[j-k…i-1]}

= MISL(Q,i-1) = j’-1 = 0, by the definition of j’

So for k =1,…,i-2, P[1…k]≠Q[i-k…i-1] (Z)

let k’= k+1

So for k’=2,…,i-1,

P[1…k’-1] = P[1…k] ≠ Q[i-k…i-1] = Q[i-k’+1…i-1] by Z

So P[1…k’-1] ≠ Q[i-k’+1…i-1] (W)

Consider MiniMatch(Q,i,1)

If P[1]=Q[i]

MiniMatch(Q,i,1)=1, by the definition of MiniMatch

MISL(Q,i)=max{k’| k’<i && P[1…k’]=Q[i-k’+1…i]}=1

because P[1]=Q[i] and W

If P[1]≠Q[i] & j=1

MiniMatch(Q,i,j)=0

MISL(Q,i) = max{k’| k’<i && P[1…k’]=Q[i-k’+1…i]}==0

because P[1]!=Q[i] and W

P[1]≠Q[i] & j≠1 is always false

So X implies Y for j’=1

Now assume X implies Y is true for 1≤n<j’, and prove it for j’. (V)

Assume X, for j=1,…,min(P.length, i-1), L(j)=MISL(P, j)

If P[j’]=Q[i]

MiniMatch(Q, i, j’)=j’ by definition of MiniMatch j’=MISL(Q, i-1)+1

= max{k+1| | k<i-1 && P[1…k]=Q[i-k…i-1]}

= max {k | k<i && P[1…k-1]=Q[i-k+1…i-1]}

=max {k | k<i && P[1…k-1]=Q[i-k+1…i-1] && P[j’]=Q[i]}

because adding true statement can’t change max value

= max {k | k<i && P[1…k]=Q[i-k+1…i]}

..= MISL(Q, i)

If P[j’]≠Q[i] & j’=1, we’ve already proved X implies Y in the base case. If P[j’]≠P[i] & j’≠1

MiniMatch(Q, i, j’)=MiniMatch(Q, i, L(j’-1)+1)

by the definition of MiniMatch

==MISL(Q, i) since L(j’-1)+1 < j’-1+1=j’ and V

**QED**

**Corollary 1**: For j=1,…,P.length L(j)=MISL(P, j)

Proof by induction on j

Base case: j==1

L(j)==0

MISL(P, 1) == {k | ,k<1 && P[1…k]==P[1-k+1…1]} == 0

since P[1…0]==P[2…1]

Induction Hypothesis: For 1≤n<j, L(n)==MISL(P, n)

Choose i=j and Q=P, so Theorem 2 becomes:

For j’=1,…,j-1 L(j’)==MISL(P, j’)

implies

MiniMatch(P, MISL(P, j’-1)+1) = MISL(P, j’) (R)

Now the premise of R is true by the induction hypothesis, so:

MiniMatch(P, MISL(P, j-1)+1) == MISL(P, j) by R

L(j) == MiniMatch(P, L(j-1)+1)) because j!=1

== MiniMatch(P, MISL(P, j-1)+1)) by the induction hypothesis

== MISL(P, j) by Theorem 2

**QED**

**Theorem 3**. For j=1,….,P.length, L(j)=MISL(P, j) (U)

Proof by induction on j

For j=1

L(1)=0 by definition

MISL(P, 1)=max{k| k<1 & P[1…k]=P[1-k+1…1]}=0

since 0 is max{k| k<1} and the 0-length sequences, P[1…0] and P[2…1], are equal.

Now assume U is true for all 1≤n<j and we’ll prove it for j

L(j) = MiniMatch(P, j, L(j-1)+1) by definition

= MiniMatch(P, j, MISL(P, j-1)+1) by U

= MISL(P, j) by Theorem 2 because X holds by the induction hypothesis

**QED**

Now define Match(i) = (i=1) ? 0 : MiniMatch(T, i, Match(i-1)+1)

**Theorem 4** Match(i)= MISL(T, i)

Proof by induction on i

If i=1 Match(1)=0 and

MISL(T,1)={k| k <1 && P[1…k]=T[1-k+1…1]}= 0

since P[1…0]=T[2…1]

Assume true for 1≤n<i

Match(i) = MiniMatch(T, i, MISL(T, i-1)+1)

by the induction hypothesis

=MISL(T,i) by Theorem 2

**QED**

**Corollary 2**. max_{i}{Match(i)}= P.length if and only if P is a subsequence of T.

Proof: The maximum cannot be more than P.length.

If it is P.length then P =T[i-P.length…i].

Otherwise it is less than P.length so there is no subsequence of T equal to P

**QED**

MiniMatch is an example of a recursive program that uses *tail recursion*, which means that any recursive calls of itself produce the final result without any further computation. Any tail recursive function can be implemented by an iterative program, in this case:

function MM(Q, i, j) {

while (P[j]!==Q[i] && j!=1) {

j = L(j-1)+1

}

return P[j]==Q[i] ? j+1 else 0

}

L can also be implemented by an iterative function that tabulates the values of L in an array, LTable

function MakeL(){

LTable[1] = 0

Let j = 1

Let i = 2

while (i < P.length){

if (P[j]==P[i]) {

LTable[i] = j

i = i+1

j = j+1

} else if (j==1) {

LTable[i] = 0

i = i+1

} else {

j = LTable[j-1]+1

}

}

}

Finally, the computation of maxi{Match(i)} can be done iteratively by:

function MaxMatch()

MaxMatch = 0

i = 1

j = 1

while (j<P.length){

if (i>T.length) {

return MaxMatch

} if (P[j]==T[i]) {

i = i+1

j= j+1

MaxMatch = max(j, MaxMatch)

} if j !=1 {

j = LTable[j-1]+1

} else {

i = i+ 1

}

}

return P.length

}

The section on Hoare Logic shows that some of these programs are equivalent to their recursive versions.

To see that MaxMatch is bounded by a linear function of i consider the difference 2i-j at the beginning of the while loop. If P[j]=T[i] or j=1 then 2i-j increases by 1 or 2. Otherwise, L[j-1]<j-1<j, so j decreases by at least 1 making 2i-j increase by at least 1. But 2i-j cannot exceed 2T.length since i cannot exceed T.length and j cannot be less than 1. Therefore, the number of comparisons is bounded by 2T.length.

The number of comparisons that the link storing algorithm makes is bounded by 2P.length, by a similar analysis, so the entire number of comparisons to do the match is bounded by 2(P.length+T.length).