Pattern Matching

This is a review of the Gatech CS1332 pattern-matching algorithm.

Basic concepts#

  • Text: the collection of alphabet
  • String: a sequence of characters
  • Alphabet: the set of possible characters in a family of strings
  • n: is the length of the text
  • m: is the length of the pattern
  • Pattern Matching is the same as the String Search

Brute Force#

Check for every possible text. The most natural way of thinking about pattern matching

Brute Force Implementation#

// The following method find the first occurance of the pattern
public class BruteForcePatternSearch {
public static int bruteForceSearch(String text, String pattern) {
int n = text.length();
int m = pattern.length();

// Iterate through the text
for (int i = 0; i <= n - m; i++) {
int j;

// Check for pattern match
for (j = 0; j < m; j++) {
if (text.charAt(i + j) != pattern.charAt(j)) {
break;
}
}
// If the pattern is found
if (j == m) {
return i + m;
}
}

}

Complexity analysis#

Scenario Best Case Efficiency Example Best Case Worst Case Efficiency Example Worst Case
No Occurrences O(n) Text: aaaaaaaaaaa
Pattern: baa
O(mn) Text: aaaaaaaaaaa
Pattern: aab
Single Occurrence O(m) Text: aaaaaaaaaaaab
Pattern: aab
O(mn) Text: aaaaaaaaaaaab
Pattern: aab
All Occurrence O(n) Text: aaaaaaaaaaa O(mn) Text: aaaaaaaaaaaab
Pattern: aab

Boyer-Moore#

Key idea: produce a Last Occurrence Table that records the index of the last occurrence of the letter (usually a hash map), and the letters not in the alphabet of the pattern are marked as *.

Last Occurrence Table implementation without the alphabet considered.#

public static Map<Character, Integer> buildLastTable(CharSequence pattern) {
Map<Character, Integer> lastTable = new HashMap<>();
// Fill the map with the last occurrence of each character
for (int i = 0; i < pattern.length(); i++) {
lastTable.put(pattern.charAt(i), i);
}
return lastTable;
}

Three main principles in the Boyer-Moore Algorithm:

  • Pattern found and returned.
  • The last occurrence has already been scanned, thus just move forward by one.
  • If the last occurrence has not been reached, then jump directly to the last occurrence.

    Boyer-Moore implementation#

public static List<Integer> boyerMoore(CharSequence pattern,
CharSequence text,
CharacterComparator comparator) {
if(pattern == null || pattern.length() == 0) {
throw new IllegalArgumentException("Pattern cannot be null or empty.");
}
if (text == null || comparator == null) {
throw new IllegalArgumentException("Text or comparator cannot be null.");
}
// first generate the last occurrence table
Map<Character, Integer> lastTable = buildLastTable(pattern);
List<Integer> matches = new java.util.ArrayList<>();
// the constants
int textLength = text.length();
int patternLength = pattern.length();
int i = 0;
while(i<= textLength - patternLength){
int j = patternLength - 1; // this minus one is turning the size into an index
while(j >= 0 && comparator.compare(text.charAt(i + j), pattern.charAt(j)) == 0){
j--;
}
if(j < 0) {
matches.add(i);
i ++;;
} else {
int lastIndex = lastTable.getOrDefault(text.charAt(i + j), -1);
if(lastIndex < j) {
i += j - lastIndex;
} else {
i ++;
}
}
}
return matches;
}

Efficiency of Boyer-Moore#

Scenario Best Case Efficiency Example Best Case Worst Case Efficiency Example Worst Case
No Occurrences O(m + (n/m)) Pattern: bbb
Text: aaaaaaaaaaa
O(mn) Pattern: baa
Text: aaaaaaaaaaa
Single Occurrence O(m) Pattern: aaaa
Text: aaaaaaaaaaa
O(mn) Pattern: baa
Text: aaaaaaaaaaa
All Occurrences O(m + (n/m)) Pattern: aab
Text: aacaaaaaab
O(mn) Pattern: baa
Text: aaaaaaaaaaa

Knuth-Morris-Pratt (KMP)#

Similar to Boyer-Moore, but, instead of the last occurrence table, it uses a more complicated failure table.

Build the KMP table#

The code follows the following rules:

  • if i and j do not match, update j++
  • If i and j matches AND i == 0, update both i and j
  • If i and j now don’t match, then move i to i -
public static int[] buildFailureTable(String pattern) {
int m = pattern.length();
int[] f = new int[m];

// Initialize as per pseudocode
f[0] = 0;
int i = 0;
int j = 1;

while (j < m) {
if (pattern.charAt(i) == pattern.charAt(j)) {
// Characters match - update table and advance both indices
f[j] = i + 1;
i++;
j++;
} else if (pattern.charAt(i) != pattern.charAt(j) && i == 0) {
// Mismatch at beginning - no prefix to fall back on
f[j] = 0;
j++;
} else {
// Mismatch after some matches - fall back using the table
i = f[i - 1];
}
}

return f;}

KMP algorithm#

The code also follows three situations and utilizes the Failure table to avoid checking the pattern already checked. The number pointer of the pattern jump to is the number of the equal suffix and prefix since they need to move to the next place of the match -> the best part of this algorithm, which is the last sentence

public static int kmp(String t, String p, int[] f) {
char[] tChars = t.toCharArray();
char[] pChars = p.toCharArray();
int n = tChars.length;
int m = pChars.length;

// Step 1: j ← 0; k ← 0
int j = 0;
int k = 0;

// Step 2: while k < n
while (k < n) {
// Step 3: if p[j] = t[k]
if (pChars[j] == tChars[k]) {
// Step 4: if j = m - 1
if (j == m - 1) {
// Step 5: return k //match has been found
return k;
}
// Step 6: else
else {
// Step 7: increment both j and k
j++;
k++;
}
}
// Step 8: else if p[j] != t[k] and j = 0
else if (pChars[j] != tChars[k] && j == 0) {
// Step 9: increment k only
k++;
}
// Step 10: else
else {
// Step 11: j ← f[j - 1]
j = f[j - 1];
}
}

// Pattern not found
return -1;
}
}

The time complexity of the KMP#

Scenario Best Case Efficiency Worst Case Efficiency
No Occurrences O(m + n) O(m + n)
Single Occurrence O(m) O(m + n)
All Occurrences O(m + n) O(m + n)

Robin-Karp#

This algorithm mainly uses hash comparisons to simplify the comparing process.

If the hash is the same use brute force, if not align more and update more!

Special Hash function: choose a big prime number, and make the ASCII for each character times the BASE^(index) for this polynomial function.

Robin-Karp implementation#


Rabin-Karp: Time Complexity#

Scenario Best Case Efficiency Worst Case Efficiency
No Occurrences O(n + m) O(nm)
Single Occurrence O(n + m) O(nm)
All Occurrences O(n + m) O(nm)