Nowadays, there is so much information around us. It’s impossible to read all of them. String matching algorithm becomes indispensable. It helps us grab the pieces we want.
There are so many algorithms to do that. I’ve found KMP is the most efficient one and “easiest” to understand. I would like to share it with you.
Before talking about KMP, let’s understand how to get a string match. There are 2 strings, the Main string, and the Pattern string. We need to find the pattern in the main.
We check both of them, character by character. If a mismatch is found, we slide the pattern one character.
a a b c a a b c
a b > a b
It’s useful for most cases. However, when we need to find
aaaaaaaaaaaaaaaaaaaaaaab . One-character-slide becomes inefficient.
The question we should ask is how to slide the pattern can make the string match efficient.
Partial Match Table (PTM)
Let’s look at the pattern when a mismatch occurs.
a b a b a b c
a b a b c
There are 2 pieces of information.
- The suffix
abin the main string
- The suffix
abin the prefix of pattern string
This feature tells us, if we slide 2 characters, the prefix
ab of the pattern MUST match the
ab in the main string. It’s amazing! It improves the efficiency. Now we can slide 2 characters instead of one.
a b a b a b c
> > a b a b c
Back to the question, how to slide the pattern efficiently, we should study the pattern. To be more specifically, we need to know the longest prefix matches the suffix, of each prefix string of the pattern. The results are stored in an array, called PMT(Partial Match Table).
ababc as an exemple,
pattern : a b a b a b c aPTM : 0 0 1 2 3 4 0 1
- PTM = 0 : No prefix to match
- PTM = 0: In the string `ab`, the last `b` doesn’t matche the first `a`. So the matched string length is 0.
- PTM = 1: In the string `aba`, the last `a` matches the first `a`. So the matched string length is 1.
- PTM = 2: In the string `abab`, the last `ab` matches the last `ab`. the matched string length is 2.
- … (same as the case PTM = 2)
- PTM = 0: In the string `abababc`, the last `c` doesn’t matche the first `a`. So the matched string length is 0.
- PTM = 1: In the string `abababca`, the last `a` matches the first `a`, and `ca` doesn’t matches the first `ab`. So the matched string length is 1.
With PTM, we can explain why should slide 2 characters. Since
c == pattern, we should check
PTM tells us the length of the matched string is `4`, so we need to slide
7–1–4 = 2characters.
7–1means the length of the pattern before `c` (`ababab`).
PTM is the key point of KMP. Since we’ve understood it, we can write codes to implement this algorithm.
There are 2 tricks.
As metioned before, we need to check
PTM when the mismatch happens at
pattern . We push one space back, and create a new array
next_arr . Now we check
next_arr instead of
pattern : a b a b a b c aPTM : 0 0 1 2 3 4 0 1next : -1 0 0 1 2 3 4 0 1
The second tricks is fast-slow-pointers. It use 2 points to traverse instead of one.
Talk is cheap. Let’s code.
The beauty of this algorithm is that the main function and the generation function are almost identical.
Thanks for reading, and see you next time :)