String Matching Algo: KMP
Introduction
Nowadays, there is so much information around us. It’s impossible to read all of them. String matching algorithm becomes indispensable. It helps us grab the pieces we want.
There are so many algorithms to do that. I’ve found KMP is the most efficient one and “easiest” to understand. I would like to share it with you.
Background
Before talking about KMP, let’s understand how to get a string match. There are 2 strings, the Main string, and the Pattern string. We need to find the pattern in the main.
We check both of them, character by character. If a mismatch is found, we slide the pattern one character.
a a b c a a b c
 ==>
a b > a b
It’s useful for most cases. However, when we need to find aaab
in aaaaaaaaaaaaaaaaaaaaaaab
. Onecharacterslide becomes inefficient.
The question we should ask is how to slide the pattern can make the string match efficient.
Partial Match Table (PTM)
Let’s look at the pattern when a mismatch occurs.
a b a b a b c

a b a b c
There are 2 pieces of information.
 The suffix
ab
mathes theab
in the main string  The suffix
ab
mathes theab
in the prefix of pattern string
This feature tells us, if we slide 2 characters, the prefix ab
of the pattern MUST match the ab
in the main string. It’s amazing! It improves the efficiency. Now we can slide 2 characters instead of one.
a b a b a b c
> > a b a b c
Back to the question, how to slide the pattern efficiently, we should study the pattern. To be more specifically, we need to know the longest prefix matches the suffix, of each prefix string of the pattern. The results are stored in an array, called PMT(Partial Match Table).
Take ababc
as an exemple,
pattern : a b a b a b c aPTM : 0 0 1 2 3 4 0 1
Explanation:
 PTM[0] = 0 : No prefix to match
 PTM[1] = 0: In the string `ab`, the last `b` doesn’t matche the first `a`. So the matched string length is 0.
 PTM[2] = 1: In the string `aba`, the last `a` matches the first `a`. So the matched string length is 1.
 PTM[3] = 2: In the string `abab`, the last `ab` matches the last `ab`. the matched string length is 2.
 … (same as the case PTM[3] = 2)
 PTM[7] = 0: In the string `abababc`, the last `c` doesn’t matche the first `a`. So the matched string length is 0.
 PTM[8] = 1: In the string `abababca`, the last `a` matches the first `a`, and `ca` doesn’t matches the first `ab`. So the matched string length is 1.
With PTM, we can explain why should slide 2 characters. Since c == pattern[7]
, we should check PTM[0:7]
. PTM[6]
tells us the length of the matched string is `4`, so we need to slide 7–1–4 = 2
characters. 7–1
means the length of the pattern before `c` (`ababab`).
Implement
PTM is the key point of KMP. Since we’ve understood it, we can write codes to implement this algorithm.
There are 2 tricks.
As metioned before, we need to check PTM[6]
when the mismatch happens at pattern[7]
. We push one space back, and create a new array next_arr
. Now we check next_arr[7]
instead of PTM[6]
.
pattern : a b a b a b c aPTM : 0 0 1 2 3 4 0 1next : 1 0 0 1 2 3 4 0 1
The second tricks is fastslowpointers. It use 2 points to traverse instead of one.
Talk is cheap. Let’s code.
The beauty of this algorithm is that the main function and the generation function are almost identical.
Thanks for reading, and see you next time :)