String Matching Algo: KMP

Introduction

Nowadays, there is so much information around us. It’s impossible to read all of them. String matching algorithm becomes indispensable. It helps us grab the pieces we want.

There are so many algorithms to do that. I’ve found KMP is the most efficient one and “easiest” to understand. I would like to share it with you.

Background

Before talking about KMP, let’s understand how to get a string match. There are 2 strings, the Main string, and the Pattern string. We need to find the pattern in the main.

We check both of them, character by character. If a mismatch is found, we slide the pattern one character.

a a b c      a a b c
| ==>
a b > a b

It’s useful for most cases. However, when we need to find aaab in aaaaaaaaaaaaaaaaaaaaaaab . One-character-slide becomes inefficient.

The question we should ask is how to slide the pattern can make the string match efficient.

Partial Match Table (PTM)

Let’s look at the pattern when a mismatch occurs.

a b a b a b c
|
a b a b c

There are 2 pieces of information.

  • The suffix ab mathes the ab in the main string
  • The suffix ab mathes the ab in the prefix of pattern string

This feature tells us, if we slide 2 characters, the prefix ab of the pattern MUST match the ab in the main string. It’s amazing! It improves the efficiency. Now we can slide 2 characters instead of one.

a b a b a b c
> > a b a b c

Back to the question, how to slide the pattern efficiently, we should study the pattern. To be more specifically, we need to know the longest prefix matches the suffix, of each prefix string of the pattern. The results are stored in an array, called PMT(Partial Match Table).

Take ababc as an exemple,

pattern : a b a b a b c aPTM     : 0 0 1 2 3 4 0 1

Explanation:

  • PTM[0] = 0 : No prefix to match
  • PTM[1] = 0: In the string `ab`, the last `b` doesn’t matche the first `a`. So the matched string length is 0.
  • PTM[2] = 1: In the string `aba`, the last `a` matches the first `a`. So the matched string length is 1.
  • PTM[3] = 2: In the string `abab`, the last `ab` matches the last `ab`. the matched string length is 2.
  • … (same as the case PTM[3] = 2)
  • PTM[7] = 0: In the string `abababc`, the last `c` doesn’t matche the first `a`. So the matched string length is 0.
  • PTM[8] = 1: In the string `abababca`, the last `a` matches the first `a`, and `ca` doesn’t matches the first `ab`. So the matched string length is 1.

With PTM, we can explain why should slide 2 characters. Since c == pattern[7], we should check PTM[0:7]. PTM[6] tells us the length of the matched string is `4`, so we need to slide 7–1–4 = 2characters. 7–1means the length of the pattern before `c` (`ababab`).

Implement

PTM is the key point of KMP. Since we’ve understood it, we can write codes to implement this algorithm.

There are 2 tricks.

As metioned before, we need to check PTM[6] when the mismatch happens at pattern[7] . We push one space back, and create a new array next_arr . Now we check next_arr[7] instead of PTM[6] .

pattern :  a b a b a b c aPTM     :  0 0 1 2 3 4 0 1next    : -1 0 0 1 2 3 4 0 1

The second tricks is fast-slow-pointers. It use 2 points to traverse instead of one.

Talk is cheap. Let’s code.

The beauty of this algorithm is that the main function and the generation function are almost identical.

Thanks for reading, and see you next time :)

Developper in Paris, interested in Big Front-end, Full Stack, Explainable AI. Try to introduce AI into Full Stack workflow

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store