# Introduction

Nowadays, there is so much information around us. It’s impossible to read all of them. String matching algorithm becomes indispensable. It helps us grab the pieces we want.

There are so many algorithms to do that. I’ve found KMP is the most efficient one and “easiest” to understand. I would like to share it with you.

# Background

Before talking about KMP, let’s understand how to get a string match. There are 2 strings, the Main string, and the Pattern string. We need to find the pattern in the main.

We check both of them, character by character. If a mismatch is found, we slide the pattern one character.

`a a b c      a a b c  |      ==>   a b          > a b`

It’s useful for most cases. However, when we need to find `aaab` in `aaaaaaaaaaaaaaaaaaaaaaab` . One-character-slide becomes inefficient.

The question we should ask is how to slide the pattern can make the string match efficient.

# Partial Match Table (PTM)

Let’s look at the pattern when a mismatch occurs.

`a b a b a b c        |a b a b c`

There are 2 pieces of information.

• The suffix `ab` mathes the `ab` in the main string
• The suffix `ab` mathes the `ab` in the prefix of pattern string

This feature tells us, if we slide 2 characters, the prefix `ab` of the pattern MUST match the `ab` in the main string. It’s amazing! It improves the efficiency. Now we can slide 2 characters instead of one.

`a b a b a b c> > a b a b c`

Back to the question, how to slide the pattern efficiently, we should study the pattern. To be more specifically, we need to know the longest prefix matches the suffix, of each prefix string of the pattern. The results are stored in an array, called PMT(Partial Match Table).

Take `ababc` as an exemple,

`pattern : a b a b a b c aPTM     : 0 0 1 2 3 4 0 1`

Explanation:

• PTM = 0 : No prefix to match
• PTM = 0: In the string `ab`, the last `b` doesn’t matche the first `a`. So the matched string length is 0.
• PTM = 1: In the string `aba`, the last `a` matches the first `a`. So the matched string length is 1.
• PTM = 2: In the string `abab`, the last `ab` matches the last `ab`. the matched string length is 2.
• … (same as the case PTM = 2)
• PTM = 0: In the string `abababc`, the last `c` doesn’t matche the first `a`. So the matched string length is 0.
• PTM = 1: In the string `abababca`, the last `a` matches the first `a`, and `ca` doesn’t matches the first `ab`. So the matched string length is 1.

With PTM, we can explain why should slide 2 characters. Since `c == pattern`, we should check `PTM[0:7]`. `PTM` tells us the length of the matched string is `4`, so we need to slide `7–1–4 = 2`characters. `7–1`means the length of the pattern before `c` (`ababab`).

# Implement

PTM is the key point of KMP. Since we’ve understood it, we can write codes to implement this algorithm.

There are 2 tricks.

As metioned before, we need to check `PTM` when the mismatch happens at `pattern` . We push one space back, and create a new array `next_arr` . Now we check `next_arr` instead of `PTM` .

`pattern :  a b a b a b c aPTM     :  0 0 1 2 3 4 0 1next    : -1 0 0 1 2 3 4 0 1`

The second tricks is fast-slow-pointers. It use 2 points to traverse instead of one.

Talk is cheap. Let’s code.

The beauty of this algorithm is that the main function and the generation function are almost identical.

Thanks for reading, and see you next time :)

Developper in Paris, interested in Big Front-end, Full Stack, Explainable AI. Try to introduce AI into Full Stack workflow

## More from Jijie Liu

Developper in Paris, interested in Big Front-end, Full Stack, Explainable AI. Try to introduce AI into Full Stack workflow