Ctrl+F

Working with computers, we get used to shortcut keys/combos to make our day to day operations easier. One such shortcut combination is "Ctrl+F", used to find some text in a document, doesn't matter if you are a software developer, a writer, a business person, you've used that combination before.

Ever wondered how exactly does your application finds a text pattern in a huge document?

One way is to place the provided text pattern over the document one by one and wait, sure you will get the result but what about the time required for such pattern matching, depending on the size of your document it can take hours and no one has that kind of time in this fast-moving world.

To make the matching faster applications use the Knuth–Morris–Pratt string-searching algorithm (or KMP algorithm).

The basic idea behind this is:

Whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid matching the characters that we know will anyway match.

Let's understand by an example:

Matching Overview

Main text = "AAAAABAAABA"

pattern to match = "AAAA"

We compare the first window of Main text with pattern to match

Main text = "AAAAABAAABA"

pattern to match = "AAAA" [Initial position]

In the next step, we compare the next window of the Main text with a pattern to match.

Main text = "AAAAABAAABA"

pattern to match = "AAAA" [Pattern shifted one position]

This is where KMP does optimization over Naive. In this second window, we only compare the fourth A of the pattern with the fourth character of the current window of text to decide whether the current window matches or not. Since we know the first three characters will anyway match, we skipped matching the first three characters.

Now the question arises how do we know how many characters to skip? To know this we preprocess the pattern and prepare an array of integers.

Preprocessing Overview:

  • KMP algorithm preprocesses the pattern and constructs an auxiliary array( denoted by lps[] ) of size m (same as the size of pattern) which is used to skip characters while matching.
  • name lps indicates the longest proper prefix which is also a suffix... A proper prefix is a prefix with a whole string not allowed. For example, prefixes of “ABC” are "“, “A”, “AB” and “ABC”. Proper prefixes are “", “A” and “AB”. Suffixes of the string are “", “C”, “BC” and “ABC”.
  • We search for lps in sub-patterns. More clearly we focus on sub-strings of patterns that are either prefix and suffix.
  • For each sub-pattern pat[0..i] where i = 0 to m-1, lps[i] stores length of the maximum matching proper prefix which is also a suffix of the sub-pattern pat[0..i].

Examples of lps[] construction:

For the pattern “AAAA”,

lps[] is [0, 1, 2, 3]

For the pattern “ABCDE”,

lps[] is [0, 0, 0, 0, 0]

For the pattern “AABAACAABAA”,

lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

After we're done with the preprocessing, next comes searching for the pattern in the main text

Unlike the normal searching where we place the pattern over the text one by one, here we use a value from lps[] to decide the next characters to be matched. The idea is to not match a character that we know will anyway match.

How to use lps[] to decide the next positions (or to know the number of characters to be skipped)?

  • We start the comparison of pat[j] with j = 0 with characters of the current window of text.
  • We keep matching characters txt[i] and pat[j] and keep incrementing i and j while pat[j] and txt[i] keep matching.
  • When we see a mismatch

    • We know that characters pat[0..j-1] match with txt[i-j…i-1] (Note that j starts with 0 and increment it only when there is a match).
      • We also know (from the above definition) that lps[j-1] is the count of characters of pat[0…j-1] that are both proper prefix and suffix.
      • From the above two points, we can conclude that we do not need to match these lps[j-1] characters with txt[i-j…i-1] because we know that these characters will anyway match.

Let us consider the below example to understand this.

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

lps[] = {0, 1, 2, 3}

i = 0, j = 0 txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 1, j = 1

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 2, j = 2

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

pat[i] and pat[j] match, do i++, j++

i = 3, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 4, j = 4

Since j == M, print pattern found and reset j,

j = lps[j-1] = lps[3] = 3

Here unlike the normal searching, we do not match the first three characters of this window. Value of lps[j-1] (in above step) gave us the index of the next character to match.

i = 4, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 5, j = 4

Since j == M, print pattern found and reset j,

j = lps[j-1] = lps[3] = 3

Again unlike the normal searching, we do not match the first three characters of this window. Value of lps[j-1] (in above step) gave us the index of the next character to match.

i = 5, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j-1] = lps[2] = 2

i = 5, j = 2

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j-1] = lps[1] = 1

And so on...

Python code for the above algorithm

def KMPSearch(pat, txt):
    M = len(pat)
    N = len(txt)

    lps = [0]*M
    j = 0 # index for pat[]

    computeLPSArray(pat, M, lps)

    i = 0 
    while i < N:
        if pat[j] == txt[i]:
            i += 1
            j += 1

        if j == M:
            print ("Found pattern at index " + str(i-j))
            j = lps[j-1]

        elif i < N and pat[j] != txt[i]:
            # Do not match lps[0..lps[j-1]] characters,
            # they will match anyway
            if j != 0:
                j = lps[j-1]
            else:
                i += 1

def computeLPSArray(pat, M, lps):
    len = 0 

    lps[0] 
    i = 1

    while i < M:
        if pat[i]== pat[len]:
            len += 1
            lps[i] = len
            i += 1
        else:
            if len != 0:
                len = lps[len-1]
            else:
                lps[i] = 0
                i += 1