Recently, I came across a not very difficult task on LeetCode. Within the task, I had to implement an algorithm to search for a substring in a string. While I was trying to solve the task, I realized that I knew very little about how to search for a substring and decided to study this topic in more detail and share the results with you.
As Wikipedia says, "Substring search is one of the simplest information search tasks", but this is not entirely true. Below, I will talk about different algorithms for solving this problem and show examples of their implementation. Let's begin!
"Brute force algorithm" or "Simple search algorithm"
The brute force algorithm is a simple algorithm for searching for a substring, which consists of sequentially iterating over all the characters of a string and checking for a match with the substring being searched for. If no match is found, the algorithm shifts the substring being searched by one character to the right and starts the search again.
const strStr = (haystack, needle) => {
const m = needle.length
const n = haystack.length
for (let windowStart = 0; windowStart <= n - m; windowStart++) {
for (let i = 0; i < m; i++) {
if (needle[i] !== haystack[windowStart + i]) {
break
}
if (i === m - 1) {
return windowStart
}
}
}
return -1
}
The search function consists of a loop that is limited to n - m (length of the search string - length of the string in which we are searching), which allows us to avoid the problem if the string has fewer characters than the search string and avoid unnecessary checks. It also includes a nested loop that has the logic of comparing characters.
If the characters are equal, then we continue to compare characters from the string with the search string until we find non-matching characters. Once we find them, we exit the nested loop and repeat the comparisons again.
The second check is triggered if we reach the end, and all values match, then we return the index of the first character of the search word in the string. Otherwise, we return -1, which means that the string is not found.
Advantages of using:
- Simple implementation and understanding
- Works for any type of data, including text, numeric, etc.
Disadvantages of using:
- Low performance, especially on large strings and when searching for long substrings
- Uses a lot of time and memory.
"Robin-Karp Algorithm"
The Robin-Karp algorithm is one of the efficient search algorithms based on the concept of hashing.
The algorithm works as follows:
- First, we compute the hashes for the first window of the test string and substring.
- Then, we sequentially compare the hashes. If the hashes match, an additional character-by-character comparison is made. If a match is found, the algorithm ends.
- If no match is found, the hash value of the next substring is computed by shifting the window one character to the right and computing the hash value for the new substring.
const strStr = (haystack, needle) => {
const haystackLen = haystack.length
const needleLen = needle.length
const prime = 101
const d = 256
let haystackHash = 0
let needleHash = 0
let fastHash = 1
// loop 1
for (let i = 0; i < needleLen - 1; i++) {
fastHash = (fastHash * d) % prime
}
// loop 2
for (let i = 0; i < needleLen; i++) {
haystackHash = (d * haystackHash + haystack.charCodeAt(i)) % prime
needleHash = (d * needleHash + needle.charCodeAt(i)) % prime
}
// loop 3
for (let i = 0; i <= haystackLen - needleLen; i++) {
if (needleHash === haystackHash) {
let j
for (j = 0; j < needleLen; j++) {
if (haystack[i + j] !== needle[j]) {
break
}
}
if (j === needleLen) {
return i
}
}
if (i < haystackLen - needleLen) {
haystackHash = (d * (haystackHash - haystack.charCodeAt(i) * fastHash) + haystack.charCodeAt(i + needleLen)) % prime
if (haystackHash < 0) {
haystackHash = (haystackHash + prime)
}
}
}
return -1
}
Now let me tell you more about the search function.
The 1st loop is used to calculate a value that is used to accelerate the calculation of hashes, it works according to the algorithm fastHash = d^(needleLen-1) % prime.
The 2nd loop simply calculates the hash for the string and the substring.
In the 3rd loop, the search for matches takes place. If the hash values match, then we check them individually. If a mismatch occurs, then we exit the loop and repeat the search for matches. If all the characters match, then we simply return the index of the first element.
If the hash values do not match, then we calculate the hash value of the new window for the next comparison and repeat the search for matches.
Advantages of using this method:
- Fast search for a substring in the average case, especially for long strings and substrings;
- The ability to efficiently search for multiple substrings at once using the same hash code for all the desired substrings;
- Hashing allows you to skip many unsuitable substrings, reducing the number of necessary comparisons.
Disadvantages of using this method:
- In some cases, for example, when using simple hash functions, false matches are possible, which must be additionally checked character by character;
- The need to choose an efficient hash function that not only quickly calculates hash values, but also evenly distributes them among all possible values;
- If too small or too large hash values are encountered, there may be an overflow problem that needs to be handled.
You can read more about it here: wikipedia.org
Knuth-Morris-Pratt (KMP) Algorithm
At the moment, this algorithm is considered the fastest, as it runs in linear time. It is useful when searching through large amounts of data. Also, I found out that this algorithm is used not only for substring search, but also in:
- Gene searching in DNA sequences;
- Autocompletion in search engines;
- Signal recognition in digital signal processing.
Now let's move on to the algorithm itself. The KMP algorithm builds a prefix table (which is the main part of the algorithm) for the substring and uses it to determine the best position to continue the search when characters do not match.
The prefix table contains information about the largest prefix that is also a suffix (i.e. a substring that starts from the beginning of the string) of that element.
Here's an example implementation of the prefix function:
const prefix = (str) => {
const n = str.length
const p = Array(n).fill(0)
let i = 1, j = 0
while (i < str.length) {
if (str[i] === str[j]) {
p[i] = j + 1
i ++
j ++
} else {
if (j === 0) {
p[i] = 0
i ++
} else {
j = p[j - 1]
}
}
}
return p
}
Example: for the string "abcabcd", the prefix function will look like this:
[0, 0, 0, 1, 2, 3, 0]
, where the first three elements are equal to 0 because no prefix is a suffix of the first three characters; 1, 2, and 3 because "a", "ab", and "abc" are prefixes that are also suffixes; and finally, the last element is equal to 0 because "abcabc" is a prefix of the string "abcabcd" but is not a suffix of the last element "d".
After implementing the prefix function, we can move on to implementing the search algorithm. The essence of the algorithm is to compare the characters of the string and the substring, and in case the characters are not equal, we take the value from the prefix table and use it as an index for the transition, thus allowing us to skip some characters and find the desired substring faster.
const findStr = (str, searchString) => {
const searchStringPrefix = prefix(searchString)
let i = 0, j = 0
while (i < str.length) {
if (str[i] === searchString[j]) {
j ++
i ++
if (j === searchString.length) {
return i - searchString.length
}
} else {
if (j > 0) {
j = searchStringPrefix[j - 1]
} else {
i ++
}
}
}
if (i === str.length && j !== searchString.length) {
return -1
}
}
Pros of using the KMP algorithm:
- Linear time complexity:
O(n+m)
, where n is the length of the string and m is the length of the substring. - Efficiently processes large texts with multiple occurrences of the substring.
Cons of using the KMP algorithm:
- Requires a pre-built prefix table, which can take
O(m)
time andO(m)
memory. - Can be difficult to understand and implement, especially for beginners in programming.
I also want to note that KMP only provides a performance gain if there are positive values in the prefix table. Only in this case will the pointer be shifted by more than one. But even in the case of "all zeros" in the table, the search works fairly well.
No comments yet