This is a review of the Gatech CS1332 pattern-matching algorithm.
Basic concepts#
- Text: the collection of alphabet
- String: a sequence of characters
- Alphabet: the set of possible characters in a family of strings
- n: is the length of the text
- m: is the length of the pattern
- Pattern Matching is the same as the String Search
Brute Force#
Check for every possible text. The most natural way of thinking about pattern matching
Brute Force Implementation#
|
Complexity analysis#
Scenario | Best Case Efficiency | Example Best Case | Worst Case Efficiency | Example Worst Case |
---|---|---|---|---|
No Occurrences | O(n) |
Text: aaaaaaaaaaa Pattern: baa |
O(mn) |
Text: aaaaaaaaaaa Pattern: aab |
Single Occurrence | O(m) |
Text: aaaaaaaaaaaab Pattern: aab |
O(mn) |
Text: aaaaaaaaaaaab Pattern: aab |
All Occurrence | O(n) |
Text: aaaaaaaaaaa |
O(mn) |
Text: aaaaaaaaaaaab Pattern: aab |
Boyer-Moore#
Key idea: produce a Last Occurrence Table that records the index of the last occurrence of the letter (usually a hash map), and the letters not in the alphabet of the pattern are marked as *.
Last Occurrence Table implementation without the alphabet considered.#
|
Three main principles in the Boyer-Moore Algorithm:
- Pattern found and returned.
- The last occurrence has already been scanned, thus just move forward by one.
- If the last occurrence has not been reached, then jump directly to the last occurrence.
Boyer-Moore implementation#
|
Efficiency of Boyer-Moore#
Scenario | Best Case Efficiency | Example Best Case | Worst Case Efficiency | Example Worst Case |
---|---|---|---|---|
No Occurrences | O(m + (n/m)) | Pattern: bbb Text: aaaaaaaaaaa |
O(mn) | Pattern: baa Text: aaaaaaaaaaa |
Single Occurrence | O(m) | Pattern: aaaa Text: aaaaaaaaaaa |
O(mn) | Pattern: baa Text: aaaaaaaaaaa |
All Occurrences | O(m + (n/m)) | Pattern: aab Text: aacaaaaaab |
O(mn) | Pattern: baa Text: aaaaaaaaaaa |
Knuth-Morris-Pratt (KMP)#
Similar to Boyer-Moore, but, instead of the last occurrence table, it uses a more complicated failure table.
Build the KMP table#
The code follows the following rules:
- if i and j do not match, update j++
- If i and j matches AND i == 0, update both i and j
- If i and j now don’t match, then move i to i -
|
KMP algorithm#
The code also follows three situations and utilizes the Failure table to avoid checking the pattern already checked. The number pointer of the pattern jump to is the number of the equal suffix and prefix since they need to move to the next place of the match -> the best part of this algorithm, which is the last sentence
|
The time complexity of the KMP#
Scenario | Best Case Efficiency | Worst Case Efficiency |
---|---|---|
No Occurrences | O(m + n) | O(m + n) |
Single Occurrence | O(m) | O(m + n) |
All Occurrences | O(m + n) | O(m + n) |
Robin-Karp#
This algorithm mainly uses hash comparisons to simplify the comparing process.
If the hash is the same use brute force, if not align more and update more!
Special Hash function: choose a big prime number, and make the ASCII for each character times the BASE^(index) for this polynomial function.
Robin-Karp implementation#
|
Rabin-Karp: Time Complexity#
Scenario | Best Case Efficiency | Worst Case Efficiency |
---|---|---|
No Occurrences | O(n + m) | O(nm) |
Single Occurrence | O(n + m) | O(nm) |
All Occurrences | O(n + m) | O(nm) |