2025-03-31

computer_science

Pattern Matching

Basic concepts
Brute Force
- Brute Force Implementation
- Complexity analysis
Boyer-Moore
Knuth-Morris-Pratt (KMP)
Robin-Karp
- Robin-Karp implementation
- Rabin-Karp: Time Complexity

This is a review of the Gatech CS1332 pattern-matching algorithm.

Basic concepts#

Text: the collection of alphabet
String: a sequence of characters
Alphabet: the set of possible characters in a family of strings
n: is the length of the text
m: is the length of the pattern
Pattern Matching is the same as the String Search

Brute Force#

Check for every possible text. The most natural way of thinking about pattern matching

Brute Force Implementation#

// The following method find the first occurance of the pattern
public class BruteForcePatternSearch {
    public static int bruteForceSearch(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();

        // Iterate through the text
        for (int i = 0; i <= n - m; i++) {
            int j;

            // Check for pattern match
            for (j = 0; j < m; j++) {
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break;
                }
            }
            // If the pattern is found
            if (j == m) {
                return i + m;
            }
        }
 
}

Complexity analysis#

Scenario	Best Case Efficiency	Example Best Case	Worst Case Efficiency	Example Worst Case
No Occurrences	`O(n)`	Text: `aaaaaaaaaaa` Pattern: `baa`	`O(mn)`	Text: `aaaaaaaaaaa` Pattern: `aab`
Single Occurrence	`O(m)`	Text: `aaaaaaaaaaaab` Pattern: `aab`	`O(mn)`	Text: `aaaaaaaaaaaab` Pattern: `aab`
All Occurrence	`O(n)`	Text: `aaaaaaaaaaa`	`O(mn)`	Text: `aaaaaaaaaaaab` Pattern: `aab`

Boyer-Moore#

Key idea: produce a Last Occurrence Table that records the index of the last occurrence of the letter (usually a hash map), and the letters not in the alphabet of the pattern are marked as *.

Last Occurrence Table implementation without the alphabet considered.#

public static Map<Character, Integer> buildLastTable(CharSequence pattern) {
    Map<Character, Integer> lastTable = new HashMap<>();
    // Fill the map with the last occurrence of each character
    for (int i = 0; i < pattern.length(); i++) {
        lastTable.put(pattern.charAt(i), i);
    }
    return lastTable;
}

Three main principles in the Boyer-Moore Algorithm:

Pattern found and returned.
The last occurrence has already been scanned, thus just move forward by one.
If the last occurrence has not been reached, then jump directly to the last occurrence.
Boyer-Moore implementation#

public static List<Integer> boyerMoore(CharSequence pattern,
        CharSequence text,
        CharacterComparator comparator) {
    if(pattern == null || pattern.length() == 0) {
        throw new IllegalArgumentException("Pattern cannot be null or empty.");
    }
    if (text == null || comparator == null) {
        throw new IllegalArgumentException("Text or comparator cannot be null.");
    }
    // first generate the last occurrence table
    Map<Character, Integer> lastTable = buildLastTable(pattern);
    List<Integer> matches = new java.util.ArrayList<>();
    // the constants
    int textLength = text.length();
    int patternLength = pattern.length();
    int i = 0;
    while(i<= textLength - patternLength){
        int j = patternLength - 1; // this minus one is turning the size into an index
        while(j >= 0 && comparator.compare(text.charAt(i + j), pattern.charAt(j)) == 0){
            j--;
        }
        if(j < 0) {
            matches.add(i);
            i ++;;
        } else {
            int lastIndex = lastTable.getOrDefault(text.charAt(i + j), -1);
            if(lastIndex < j) {
                i += j - lastIndex;
            } else {
                i ++;
            }
        }
    }
    return matches;
}

Efficiency of Boyer-Moore#

Scenario	Best Case Efficiency	Example Best Case	Worst Case Efficiency	Example Worst Case
No Occurrences	O(m + (n/m))	Pattern: bbb Text: aaaaaaaaaaa	O(mn)	Pattern: baa Text: aaaaaaaaaaa
Single Occurrence	O(m)	Pattern: aaaa Text: aaaaaaaaaaa	O(mn)	Pattern: baa Text: aaaaaaaaaaa
All Occurrences	O(m + (n/m))	Pattern: aab Text: aacaaaaaab	O(mn)	Pattern: baa Text: aaaaaaaaaaa

Knuth-Morris-Pratt (KMP)#

Similar to Boyer-Moore, but, instead of the last occurrence table, it uses a more complicated failure table.

Build the KMP table#

The code follows the following rules:

if i and j do not match, update j++
If i and j matches AND i == 0, update both i and j
If i and j now don’t match, then move i to i -

public static int[] buildFailureTable(String pattern) {
        int m = pattern.length();
        int[] f = new int[m];
        
        // Initialize as per pseudocode
        f[0] = 0;
        int i = 0;
        int j = 1;
        
        while (j < m) {
            if (pattern.charAt(i) == pattern.charAt(j)) {
                // Characters match - update table and advance both indices
                f[j] = i + 1;
                i++;
                j++;
            } else if (pattern.charAt(i) != pattern.charAt(j) && i == 0) {
                // Mismatch at beginning - no prefix to fall back on
                f[j] = 0;
                j++;
            } else {
                // Mismatch after some matches - fall back using the table
                i = f[i - 1];
            }
        }
        
        return f;}

KMP algorithm#

The code also follows three situations and utilizes the Failure table to avoid checking the pattern already checked. The number pointer of the pattern jump to is the number of the equal suffix and prefix since they need to move to the next place of the match -> the best part of this algorithm, which is the last sentence

public static int kmp(String t, String p, int[] f) {
    char[] tChars = t.toCharArray();
    char[] pChars = p.toCharArray();
    int n = tChars.length;
    int m = pChars.length;

    // Step 1: j ← 0; k ← 0
    int j = 0;
    int k = 0;

    // Step 2: while k < n
    while (k < n) {
        // Step 3: if p[j] = t[k]
        if (pChars[j] == tChars[k]) {
            // Step 4: if j = m - 1
            if (j == m - 1) {
                // Step 5: return k    //match has been found
                return k;
            } 
            // Step 6: else
            else {
                // Step 7: increment both j and k
                j++;
                k++;
            }
        } 
        // Step 8: else if p[j] != t[k] and j = 0
        else if (pChars[j] != tChars[k] && j == 0) {
            // Step 9: increment k only
            k++;
        } 
        // Step 10: else
        else {
            // Step 11: j ← f[j - 1]
            j = f[j - 1];
        }
    }

    // Pattern not found
    return -1;
}
}

The time complexity of the KMP#

Scenario	Best Case Efficiency	Worst Case Efficiency
No Occurrences	O(m + n)	O(m + n)
Single Occurrence	O(m)	O(m + n)
All Occurrences	O(m + n)	O(m + n)

Robin-Karp#

This algorithm mainly uses hash comparisons to simplify the comparing process.

If the hash is the same use brute force, if not align more and update more!

Special Hash function: choose a big prime number, and make the ASCII for each character times the BASE^(index) for this polynomial function.

Robin-Karp implementation#

Rabin-Karp: Time Complexity#

Scenario	Best Case Efficiency	Worst Case Efficiency
No Occurrences	O(n + m)	O(nm)
Single Occurrence	O(n + m)	O(nm)
All Occurrences	O(n + m)	O(nm)