Trying to invent a better substring search algorithm
I failed to create something new, but I'll share my thought process, and how it almost led me to reinventing a well-known solution.
Table of contents
- Prelude.
- The naive implementation.
- The thought process.
- The algorithm.
- The Karp, and the Rabin.
- Closing thoughts.
- Homework.
- Extras:
  - Can the algorithm overflow?
  - XOR variation
  - Inputs that trigger the worst-case performance
  - Fun results for tiny inputs
  - Tiny puzzle
  - Code for the needle crate
Prelude
```python
needle = 'needle'
haystack = 'do I have a needle? Yes I do'
assert(needle in haystack)
```
I was once told that Python uses the Boyer–Moore algorithm (created in 1977, optimized in 1979) to search for substrings.
I briefly read about it; here's a summary: you preprocess the needle, building a lookup table, then you use it to skip a lot of checks.
Uh... ok, cool.
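For the curious, here's a rough sketch of that "preprocess, then skip" idea, in its simplified Horspool flavor. This is my illustration of the concept, not the full Boyer–Moore algorithm (and not necessarily what Python ships); `horspool_search` is a hypothetical helper written just for this post:

```rust
// Sketch of the "bad character" rule (Horspool simplification):
// shift the window based on its last byte's position in the needle.
fn horspool_search(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true;
    }
    if needle.len() > haystack.len() {
        return false;
    }
    // Preprocess: default shift is the whole needle length; bytes that
    // occur in the needle (except its last byte) allow smaller shifts.
    let mut shift = [needle.len(); 256];
    for (i, &byte) in needle[..needle.len() - 1].iter().enumerate() {
        shift[byte as usize] = needle.len() - 1 - i;
    }
    let mut pos = 0;
    while pos + needle.len() <= haystack.len() {
        if &haystack[pos..pos + needle.len()] == needle {
            return true;
        }
        // Skip ahead based on the last byte of the current window.
        pos += shift[haystack[pos + needle.len() - 1] as usize];
    }
    false
}
```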
-- 3 years passed by --
Can't we speed this up in a simpler way?
The naive implementation
Instead of explaining the substring search problem, I'll show you a solution:
```rust
// Time complexity: O(H * N)
// Space complexity: O(1)
fn naive_search(haystack: &str, needle: &str) -> bool {
    let [haystack, needle] =
        [haystack, needle].map(str::as_bytes);
    haystack
        .windows(needle.len())
        .any(|haystack_window| haystack_window == needle)
}
```
For each window (with the needle size) in the haystack, check if it matches the needle.
I want this, but vroom vroom (faster).
What does `haystack.windows(...)` do? (→ click me ←)

The `slice::windows` method returns an iterator that slides a window of the provided size through the whole slice.

|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----|---|---|---|---|---|---|---|---|---|---|
| 1st | 0 | 1 | 2 | 3 | 4 | 5 |   |   |   |   |
| 2nd |   | 1 | 2 | 3 | 4 | 5 | 6 |   |   |   |
| 3rd |   |   | 2 | 3 | 4 | 5 | 6 | 7 |   |   |
| 4th |   |   |   | 3 | 4 | 5 | 6 | 7 | 8 |   |
| 5th |   |   |   |   | 4 | 5 | 6 | 7 | 8 | 9 |

Above, each line represents a window in `(0..=9)` with size `6`.
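A minimal example of what the iterator yields:

```rust
let slice = [0, 1, 2, 3];
let mut windows = slice.windows(3);

assert_eq!(windows.next(), Some(&[0, 1, 2][..]));
assert_eq!(windows.next(), Some(&[1, 2, 3][..]));
assert_eq!(windows.next(), None);
```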
The thought process
What I'm looking for:
- For each window, tell if it can be skipped.
- Maximize the amount of skips, so my algorithm runs faster.
- Linear time complexity, and constant space complexity.
I'll start my thought process based on these.
The main question to answer is: "How can I tell if a window can be skipped?"
To develop an idea, I started asking myself simpler questions:
- What insight can I gather about a single window?
- IDK, start somewhere cheap.
- What are the cheapest operations I can do?
- Arithmetic operations.
- What about multiplication?
- Like a hash?
Yeah, by comparing the needle hash with the window hash, I can (hopefully) skip most comparisons.
But I'm constrained by linear time complexity. With this in mind, how do I calculate the hash for each window? I know how to do this in O(N²), but not in O(N).
Ok, gotta try something else (but I'll come back to hashing later).
- Brain, give me another cheap arithmetic operation.
- XOR.
- Mmmmmmmmmmmm... nope.
- Addition?
Yes! Similar to the hash idea, we can compute the sum (of all elements) for both slices: if `sum(needle) != sum(window)`, equality is impossible and the check can be skipped.
And this time, I know how to do this in O(N). Take these windows as an example:
|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----|---|---|---|---|---|---|---|---|---|---|
| 1st | 0 | 1 | 2 | 3 | 4 | 5 |   |   |   |   |
| 2nd |   | 1 | 2 | 3 | 4 | 5 | 6 |   |   |   |
| 3rd |   |   | 2 | 3 | 4 | 5 | 6 | 7 |   |   |
| 4th |   |   |   | 3 | 4 | 5 | 6 | 7 | 8 |   |
| 5th |   |   |   |   | 4 | 5 | 6 | 7 | 8 | 9 |
To start, calculate the sum of the first window naively:

- First sum is 15.

For the rest, reuse the previous sum: subtract the removed element and add the new one:

- Second = First + 6 - 0 = 21.
- Third = Second + 7 - 1 = 27.
- Fourth = Third + 8 - 2 = 33.
- Fifth = Fourth + 9 - 3 = 39.
- And so on...
That's O(N) for all windows, and addition is, of course, fast as hell.
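Here's that update rule as a quick, self-checking sketch over the table above:

```rust
let haystack = [0u64, 1, 2, 3, 4, 5, 6, 7, 8, 9];
let window_size = 6;

// First window, naively.
let mut sum: u64 = haystack[..window_size].iter().sum();
let mut sums = vec![sum];
for new_index in window_size..haystack.len() {
    sum += haystack[new_index]; // add the new element
    sum -= haystack[new_index - window_size]; // subtract the removed one
    sums.push(sum);
}
assert_eq!(sums, [15, 21, 27, 33, 39]);
```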
The algorithm
Let's preprocess the sum of the needle, and calculate the sum for each window, comparing both to perform skips.
```rust
// Time complexity: O(H * N) // same as naive
// Space complexity: O(1)    // same as naive
fn sum_search(haystack: &str, needle: &str) -> bool {
    // Treat corner cases
    if needle.is_empty() {
        return true;
    } else if needle.len() >= haystack.len() {
        return haystack == needle;
    }
    let [haystack, needle] = [haystack, needle].map(str::as_bytes);
    let mut windows = haystack.windows(needle.len());
    // Unwrap Safety:
    // We know that `0 < needle.len() < haystack.len()`, there is at
    // least one window.
    let first_window = windows.next().unwrap();
    let sum_slice = |slice: &[u8]| -> u64 {
        slice.iter().copied().map(u64::from).sum()
    };
    let needle_sum = sum_slice(needle);
    let mut window_sum = sum_slice(first_window);
    // Short-circuit the expensive check to skip it
    if needle_sum == window_sum && first_window == needle {
        return true;
    }
    // Now, for the rest of the windows.
    for (removed_element_index, window) in windows.enumerate() {
        // Unwrap Safety:
        // We know that `needle.len() > 0`, every window is non-empty.
        window_sum += *window.last().unwrap() as u64;
        window_sum -= haystack[removed_element_index] as u64;
        // If the sum doesn't match, skip the check
        if needle_sum != window_sum {
            continue;
        }
        // Check equality (expensive check)
        if window == needle {
            return true;
        }
    }
    false
}
```
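A quick sanity check, reusing the haystack from the prelude (the second needle is just a made-up miss):

```rust
assert!(sum_search("do I have a needle? Yes I do", "needle"));
assert!(!sum_search("do I have a needle? Yes I do", "thread"));
```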
Time to compare it to the naive solution and (one of) the fastest Boyer–Moore crates, `needle`:

(Benchmarked searching for `"Lorem is a ipsum and more"` in a Shakespeare book.)
```text
boyer_moore  =  1.52 ms
sum_search   =  2.43 ms  # mine
naive_search = 13.87 ms
```
Mistake: when writing this, I didn't consider `memchr::memmem`, which may contain a faster Boyer–Moore implementation than `needle`. Sorry, BurntSushi!
Cool! It's slower than Boyer–Moore but faster than the naive algorithm, and it's very simple.
This is curious: there is no `"Lorem"` substring in the text, but there's a lot of `"L"`, `"Lo"` and `"Lor"`, which is probably why `naive_search` is much slower.
Time to see if we're faster than the `std`'s solution:
```rust
fn std_search(haystack: &str, needle: &str) -> bool {
    haystack.contains(needle)
}
```
```text
naive_search = 13.87 ms
sum_search   =  2.43 ms
boyer_moore  =  1.52 ms
std_search   = 127.84 µs  # ...
```
Oh, the needle is so small that the `std` optimizes it to use SIMD instructions instead.
```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
if self.len() <= 32 {
    if let Some(result) = simd_contains(self, haystack) {
        return result;
    }
}
// PR - https://github.com/rust-lang/rust/pull/103779
```
I'll avoid that by expanding the needle to `"Lorem is a ipsum and more, consectetur"` (38 bytes):
```text
naive_search = 14.22 ms
sum_search   =  2.39 ms
std_search   =  1.27 ms  # still very good
boyer_moore  =  1.01 ms
```
The Karp, and the Rabin
After writing my algorithm down, I started researching existing solutions. Do you remember the hash idea? And how I couldn't compute all hashes in O(N)?
Well, it actually has a name: that's the Rabin-Karp algorithm. It skips checks by using a hash; actually, a rolling hash.
A rolling hash is exactly what I was looking for: with it, you're able to calculate the next window's hash based on the previous one, by just peeking at the new (and, sometimes, the removed) elements.
But wait, that's extremely similar to what I did, isn't my sum thing a rolling hash?
Here's the main property of a hash:

```text
a == b -> hash(a) == hash(b)
```

And to be clear, we were relying on its contrapositive to perform the skips:

```text
hash(a) != hash(b) -> a != b  # This has the same meaning as the above
```
Yes, the sequence sum meets this property, so it does look like a rolling hash, but it doesn't comply with the other properties that we usually expect from a hash function (like uniformity, necessary for good bucket distribution).
To settle my doubt, here is a quote from the Rabin-Karp Wikipedia article:

> A trivial (but not very good) rolling hash function just adds the values of each character in the substring.
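For contrast, here's a minimal sketch of a less trivial rolling hash: the polynomial kind usually associated with Rabin-Karp. The `BASE` constant is an arbitrary choice of mine and the modulus is implicitly 2^64 via wrapping arithmetic; real implementations pick both much more carefully:

```rust
const BASE: u64 = 257; // arbitrary choice for this sketch

// hash(b) = b[0]*BASE^(n-1) + b[1]*BASE^(n-2) + ... + b[n-1], mod 2^64.
fn poly_hash(slice: &[u8]) -> u64 {
    slice
        .iter()
        .fold(0u64, |h, &b| h.wrapping_mul(BASE).wrapping_add(b as u64))
}

// Rolling update: remove the old byte (weighted by BASE^(n-1)),
// shift everything up by one power, then add the new byte.
fn roll(hash: u64, removed: u8, added: u8, top_power: u64) -> u64 {
    hash.wrapping_sub((removed as u64).wrapping_mul(top_power))
        .wrapping_mul(BASE)
        .wrapping_add(added as u64)
}

fn main() {
    let haystack = b"do I have a needle? Yes I do";
    let n = 6;
    let top_power = (1..n).fold(1u64, |p, _| p.wrapping_mul(BASE)); // BASE^(n-1)

    let mut hash = poly_hash(&haystack[..n]);
    for i in n..haystack.len() {
        hash = roll(hash, haystack[i - n], haystack[i], top_power);
        // The rolled hash always matches a from-scratch recomputation.
        assert_eq!(hash, poly_hash(&haystack[i + 1 - n..=i]));
    }
}
```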
Closing thoughts
To me, it has become clear that, starting from the problem statement, you can arrive very close to the Rabin-Karp algorithm; well, at the trivial part of it.
I found two crates that use Rabin-Karp:

1. `memchr`
2. `aho-corasick`

(Props to @BurntSushi for the lovely, comprehensive comments in both.)

If you're using `axum`, `futures` or the `regex` crate, then your code is probably using Rabin-Karp, because they depend on 1 or 2.
The hard part of Rabin-Karp is coming up with a decent rolling hash function; the two crates above contain a reference implementation.
It's also clear that I ran away from the deep problem when I gave up on hashing; the ideal thing would have been to explore it until I came up with a good hashing solution.
But what can I expect? That algorithm was all just a shower thought.
Homework
If I managed to spark your interest in this subject, you should read more about it!
So, here are some suggested points for you to think about:
- Check the classification of (substring) search algorithms.
- Why is `std`'s `<&str>::contains` implementation so fast?
- What makes for a good rolling hash function?
  - Check the one used in the `memchr` and `aho-corasick` crates.
- For what inputs does each substring search algorithm shine the most?
- Is it possible to create an adaptive algorithm that collects insights from the input on-the-fly and changes strategy based on them? Is there any gain in falling back to another strategy midway, or can you always make the correct decision upfront?
Extras
Can the algorithm overflow?
It would require a needle of 72_057 TB (roughly 2^64 / 256 bytes, the point where a sum of max-valued bytes exceeds what a u64 holds); in that case, the search algorithm would produce false negatives.
XOR variation
At first, I thought XOR wasn't a solution, so I focused on addition.

However, to quickly recalculate the sum for the new window, we were relying on the invertibility property of addition, that is:

```text
A + B - B == A
```

Guess what:

```text
A ^ B ^ B == A
```

It also applies to XOR. Let's see what happens:
```rust
fn xor_search(haystack: &str, needle: &str) -> bool {
    ...
    let xor_slice = |slice: &[u8]| -> u64 {
        slice.iter().copied().map(u64::from).fold(0, |a, b| a ^ b)
    };
    let needle_xor = xor_slice(needle);
    let mut window_xor = xor_slice(first_window);
    ...
    for (removed_element_index, window) in windows.enumerate() {
        // Unwrap Safety:
        // We know that `needle.len() > 0`, every window is non-empty.
        window_xor ^= *window.last().unwrap() as u64;
        window_xor ^= haystack[removed_element_index] as u64;
        ...
```
```text
naive_search = 14.22 ms
xor_search   =  2.69 ms  # here
sum_search   =  2.39 ms
std_search   =  1.27 ms
boyer_moore  =  1.01 ms
```
It works: it's still faster than the naive, but it's slower than `sum_search` because the XOR of `u8`s is limited to `u8::MAX`, so the chance of collision is higher, in contrast to the `u64` sums used in `sum_search`.
Inputs that trigger the worst-case performance
Here's a bad case for the `sum` solution, but a great one for `xor`:
needle = "22";
haystack = "1313131313131313...";
In contrast, this one is good for `sum` and bad for `xor`:
needle = "223";
haystack = "113113113...";
In "223"
, the 2
s cancel each other out, the same happens to the 1
s in "113"
, the resulting XOR for both sequences is b'3'
.
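A quick check of both claims (byte values: `b'1'` = 49, `b'2'` = 50, `b'3'` = 51):

```rust
// "22" vs. any window of "1313...": the sums always collide,
// while the XORs never do.
assert_eq!(b'1' + b'3', b'2' + b'2'); // 100 == 100
assert_ne!(b'1' ^ b'3', b'2' ^ b'2'); // 2 != 0

// "223" vs. any window of "113113...": the XORs collide
// (both reduce to b'3'), while the sums differ.
assert_eq!(b'2' ^ b'2' ^ b'3', b'1' ^ b'1' ^ b'3');
assert_ne!(b'2' + b'2' + b'3', b'1' + b'1' + b'3'); // 151 != 149
```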
Fun results for tiny inputs
For tiny needles (up to 24 bytes in length), this one runs faster than the naive algorithm:
```rust
fn weird_search(haystack: &str, needle: &str) -> bool {
    let [haystack, needle] =
        [haystack, needle].map(str::as_bytes);
    let sum_slice = |slice: &[u8]| -> u64 {
        slice.iter().copied().map(u64::from).sum()
    };
    let needle_sum = sum_slice(needle);
    haystack.windows(needle.len()).any(|haystack_window| {
        let window_sum = sum_slice(haystack_window);
        window_sum == needle_sum && haystack_window == needle
    })
}
```
It's the naive algorithm plus a sum check. Here are the results:
```text
naive_search = 13.82 ms
weird_search =  8.37 ms  # this one
boyer_moore  =  3.82 ms
xor_search   =  2.68 ms
sum_search   =  2.42 ms
std_search   = 137.29 µs
```
Calculating the sum of a whole window can be faster than comparing two slices (in C's `strcmp` fashion); it's probable that SIMD instructions are contributing to this.
Tiny puzzle
What assumptions can we make about `haystack` after these checks?
```rust
fn sum_search(haystack: &str, needle: &str) -> bool {
    // Treat corner cases
    if needle.is_empty() {
        return true;
    } else if needle.len() >= haystack.len() {
        return haystack == needle;
    }
    ...
}
```
Answer. (→ click me ←)

`haystack.len() >= 2`

(The needle is non-empty and strictly shorter than the haystack, so `haystack.len() > needle.len() >= 1`.)
Code for the `needle` crate
```rust
fn boyer_moore(haystack: &str, needle: &str) -> bool {
    ::needle::BoyerMoore::new(needle.as_bytes())
        .find_first_in(haystack.as_bytes())
        .is_some()
}
```
Comment section
I'd like the comment section to live inside of GitHub; it's an experiment:
Click here for the comment section of this post.