# Trying to invent a better substring search algorithm

I failed to create something new, but I'll share my thought process, and how it almost led me to reinventing a well-known solution.

# Table of contents

- Prelude.
- The naive implementation.
- The thought process.
- The algorithm.
- The Karp, and the Rabin.
- Closing thoughts.
- Homework.
- Extras:

# Prelude

```
needle = 'needle'
haystack = 'do I have a needle? Yes I do'
assert(needle in haystack)
```

I was once told that Python uses the Boyer–Moore algorithm (created in 1977, optimized in 1979) to search for substrings.

I briefly read about it, here's a summary: You preprocess the needle creating a lookup matrix, then, you use it to skip a lot of checks.

Uh... ok, cool.

-- *3 years passed by* --

Can't we speed this up in a simpler way?

# The naive implementation

Instead of explaining the substring search algorithm, I'll show you a solution:

```
// Time complexity: O(H * N)
// Space complexity: O(1)
fn naive_search(haystack: &str, needle: &str) -> bool {
let [haystack, needle] =
[haystack, needle].map(str::as_bytes);
haystack
.windows(needle.len())
.any(|haystack_window| haystack_window == needle)
}
```

For each window (with the needle size) in the haystack, check if it matches the needle.

I want this, but vroom vroom (faster).

##
What does `haystack.windows(...)`

do? *(→ click me ←)*

The

`slice::windows`

method returns an iterator that slides a window with the provided size through the whole slice.

0 1 2 3 4 5 6 7 8 9 1st 0 1 2 3 4 5 2nd 1 2 3 4 5 6 3rd 2 3 4 5 6 7 4th 3 4 5 6 7 8 5th 4 5 6 7 8 9 Above, each line represent a window in

`(0..=9)`

with size`6`

.

# The thought process

What I'm looking for:

- For each window, tell if it can be skipped.
- Maximize the amount of skips, so my algorithm runs faster.
- Linear time complexity, and constant space complexity.

I'll start my thought process based on these.

The main question to answer is: *"How can I tell if a window can be skipped?"*

To develop an idea, I started asking myself simpler questions:

- What insight can I gather about a
**single window**?- IDK, start somewhere
*cheap*.

- IDK, start somewhere
- What are the cheapest operations I can do?
- Arithmetic operations.

- What about multiplication?
- Like a hash?

Yeah, by comparing the needle hash with the window hash, I can (hopefully) skip most comparisons.

*But* I'm constrained by **linear time complexity**, with this in mind, how do I calculate the hash for each window?

I know how to do this in `O(N²)`

, but not in `O(N)`

.

Ok, gotta try something else (but I'll come back to hashing later).

- Brain, give me another cheap arithmetic operation.
- XOR.

- Mmmmmmmmmmmm... nope.
- Addition?

Yes! Similar to the hash idea, we can compute the sum (of all elements) for both slices, if `sum(needle) != sum(haystack)`

, equality is impossible and the check can be skipped.

And this time, I know how to do this in `O(N)`

, take these windows as an example:

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|

1st | 0 | 1 | 2 | 3 | 4 | 5 | ||||

2nd | 1 | 2 | 3 | 4 | 5 | 6 | ||||

3rd | 2 | 3 | 4 | 5 | 6 | 7 | ||||

4th | 3 | 4 | 5 | 6 | 7 | 8 | ||||

5th | 4 | 5 | 6 | 7 | 8 | 9 |

To start, calculate the sum of the first window naively:

- First sum is
`15`

.

For the rest, reuse the previous sum, subtract the removed element and add the new one:

`Second = First + 6 - 0 = 21`

.`Third = Second + 7 - 1 = 27`

.`Fourth = Third + 8 - 2 = 33`

.`Fifth = Fourth + 9 - 3 = 39`

.- And so on...

That's `O(N)`

for all windows, and addition is, of course, fast as hell.

# The algorithm

Let's preprocess the sum of the needle, and calculate the sum for each window, comparing both to perform skips.

```
// Time complexity: O(H * N) // same as naive
// Space complexity: O(1) // same as naive
fn sum_search(haystack: &str, needle: &str) -> bool {
// Treat corner cases
if needle.is_empty() {
return true;
} else if needle.len() >= haystack.len() {
return haystack == needle;
}
let [haystack, needle] = [haystack, needle].map(str::as_bytes);
let mut windows = haystack.windows(needle.len());
// Unwrap Safety:
// We know that `0 < needle.len() < haystack.len()`, there is at
// least one window.
let first_window = windows.next().unwrap();
let sum_slice = |slice: &[u8]| -> u64 {
slice.iter().copied().map(u64::from).sum()
};
let needle_sum = sum_slice(needle);
let mut window_sum = sum_slice(first_window);
// Short-circuit the expensive check to skip it
if needle_sum == window_sum && first_window == needle {
return true;
}
// Now, for the rest of the windows.
for (removed_element_index, window) in windows.enumerate() {
// Unwrap Safety:
// We know that `needle.len() > 0`, every window is non-empty.
window_sum += *window.last().unwrap() as u64;
window_sum -= haystack[removed_element_index] as u64;
// If the sum doesn't match, skip the check
if needle_sum != window_sum {
continue;
}
// Check equality (expensive check)
if window == needle {
return true;
}
}
false
}
```

Time to compare it to the naive solution and the fastest Boyer–Moore crate, `needle`

:

(Benchmarked searching for `"Lorem is a ipsum and more"`

in a Shakespeare book.)

```
boyer_moore = 1.52 ms
sum_search = 2.43 ms # mine
naive_search = 13.87 ms
```

Cool! It's slower than Boyer–Moore but faster than the naive algorithm, and it's very simple.

This is curious, there is no `"Lorem"`

substring in the text, but there's a lot of `"L"`

, `"Lo"`

and `"Lor"`

, that's probably why `naive_search`

is much slower.

Time to see if we're faster than the `std`

's solution:

```
fn std_search(haystack: &str, needle: &str) -> bool {
haystack.contains(needle)
}
```

```
naive_search = 13.87 ms
sum_search = 2.43 ms
boyer_moore = 1.52 ms
std_search = 127.84 µs # ...
```

Oh, the needle is so small that the `std`

optimizes it to use SIMD instructions instead.

```
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
if self.len() <= 32 {
if let Some(result) = simd_contains(self, haystack) {
return result;
}
}
// PR - https://github.com/rust-lang/rust/pull/103779
```

I'll avoid that by expanding the needle to:

`"Lorem is a ipsum and more, consectetur"`

(38 bytes)

```
naive_search = 14.22 ms
sum_search = 2.39 ms
std_search = 1.27 ms # still very good
boyer_moore = 1.01 ms
```

# The Karp, and the Rabin

After writing my algorithm down, I started researching existing solutions. Do you remember the hash idea? And how I couldn't compute all hashes in `O(N)`

?

Well, it actually has a name, that's the `Rabin-Karp`

algorithm, it skips checks by using a hash, actually, a *rolling hash*.

A *rolling hash* is exactly what I was looking for, with it, you're able to calculate the next window hash based on the previous one, by just peeking at the new (and, sometimes the deleted) elements.

But wait, that's extremely similar to what I did, isn't my sum thing a *rolling hash*?

Here's the main property of a hash:

```
a == b -> hash(a) == hash(b)
```

And to be clear, we were relying on the contraposition of it to perform the skips:

```
hash(a) != hash(b) -> a != b # This has the same meaning as the above
```

Yes, the sequence sum meets this property, so it does look like a rolling hash, but it doesn't comply to the other properties that we usually expect from a hash function (like uniformity, necessary for good bucket distribution).

To settle my doubt, here is a quote from the `Rabin-Karp`

Wikipedia article:

“A trivial (but not very good) rolling hash function just adds the values of each character in the substring.”

# Closing thoughts

To me, it has become clear that, from the problem statement, you can arrive very close to the `Rabin-Karp`

algorithm, well, at the trivial part of it.

I found two crates that use `Rabin-Karp`

:

(Props to `@BurntSushi`

for lovely writing the linked comprehensive comments.)

If you're using `axum`

, `futures`

or the `regex`

crate, then your code is probably using `Rabin-Karp`

, because they depend on `1`

or `2`

.

The hard part of `Rabin-Karp`

is coming up with a decent *rolling hash function*, the two links above point to an implementation reference.

It's also clear that I ran away from the deep problem when I gave up on the hashing problem, the ideal thing would be to explore on it until I've come up with a good hashing solution.

But what can I expect? That algorithm was all just a shower thought.

# Homework

If I managed to spark your interest on this subject, you should read more about it!

So, here are some suggested points for you to think about:

- Check the classification of (substring) search algorithms.
- Why is
`std`

's`<&str>::contains`

implementation so fast? - What makes for a good
*rolling hash function*?- Check the one used in
`memchr`

and`aho-corasick`

crates.

- Check the one used in
- For what inputs each substring search algorithm shines the most?
- Is it possible to create an adaptative algorithm that collects insights from the input on-the-fly and changes strategy based on it? Is there any gain on falling back to another strategy mid-way, or can you always take the correct decision upfront?

# Extras

## Can the algorithm overflow?

It would require a needle of `72_057 TB`

, in that case, the search algorithm would produce false-negatives.

## XOR variation

At first, I thought `XOR`

wasn't a solution, so I focused on addition.

However, to quickly recalculate the sum for the new window, we were relying on the invertibility property of addition, that is:

```
A + B - B == A
```

Guess what:

```
A ^ B ^ B == A
```

It also applies to XOR, let's see what happens:

```
fn xor_search(haystack: &str, needle: &str) -> bool {
...
let xor_slice = |slice: &[u8]| -> u64 {
slice.iter().copied().map(u64::from).fold(0, |a, b| a ^ b)
};
let needle_xor = xor_slice(needle);
let mut window_xor = xor_slice(first_window);
...
for (removed_element_index, window) in windows.enumerate() {
// Unwrap Safety:
// We know that `needle.len() > 0`, every window is non-empty.
window_xor ^= *window.last().unwrap() as u64;
window_xor ^= haystack[removed_element_index] as u64;
...
```

```
naive_search = 14.22 ms
xor_search = 2.69 ms # here
sum_search = 2.39 ms
std_search = 1.27 ms
boyer_moore = 1.01 ms
```

It works, it's still faster than the naive, but its slower than `sum_search`

because XOR of `u8`

s is limited to `u8::MAX`

and the chance of collision is higher, in contrast to the `u64`

used in `sum_search`

.

## Inputs that trigger the worst-case performance

Here's a bad case the `sum`

solution, but great for `xor`

:

```
needle = "22";
haystack = "1313131313131313...";
```

In contrast, this one is good for `sum`

and bad for `xor`

:

```
needle = "223";
haystack = "113113113...";
```

In `"223"`

, the `2`

s cancel each other out, the same happens to the `1`

s in `"113"`

, the resulting XOR for both sequences is `b'3'`

.

## Fun results for tiny inputs

For tiny needles (up to 24 bytes of length), this one runs faster than the naive algorithm:

```
fn weird_search(haystack: &str, needle: &str) -> bool {
let [haystack, needle] =
[haystack, needle].map(str::as_bytes);
let sum_slice = |slice: &[u8]| -> u64 {
slice.iter().copied().map(u64::from).sum()
};
let needle_sum = sum_slice(needle);
haystack.windows(needle.len()).any(|haystack_window| {
let window_sum = sum_slice(haystack_window);
window_sum == needle_sum && haystack_window == needle
})
}
```

It's the naive, plus a sum check, here are the results:

```
naive_search = 13.82 ms
weird_search = 8.37 ms # this one
boyer_moore = 3.82 ms
xor_search = 2.68 ms
sum_search = 2.42 ms
std_search = 137.29 µs
```

Calculating the sum for a whole window can be faster than comparing two slices (in `C`

's `strcmp`

fashion), it's probable that `SIMD`

instructions are contributing to this.

## Tiny puzzle

What assumptions can we make about `haystack`

after these checks?

```
fn sum_search(haystack: &str, needle: &str) -> bool {
// Treat corner cases
if needle.is_empty() {
return true;
} else if needle.len() >= haystack.len() {
return haystack == needle;
}
...
}
```

##
Answer. *(→ click me ←)*

`haystack.len() >= 2`

## Code for the `needle`

crate

```
fn boyer_moore(haystack: &str, needle: &str) -> bool {
::needle::BoyerMoore::new(needle.as_bytes())
.find_first_in(haystack.as_bytes())
.is_some()
}
```

## Comment section

I'd like the comment section to live inside of GitHub, it's an experiment:

Click here for the comment section of this post.