I failed to create something new, but I'll share my thought process and how it almost led me to reinvent a well-known solution.

needle = 'needle'
haystack = 'do I have a needle? Yes I do'

assert(needle in haystack)

I was once told that Python uses the Boyer–Moore algorithm (created in 1977, optimized in 1979) to search for substrings.

I briefly read about it; here's a summary: you preprocess the needle to create a lookup matrix, then use it to skip a lot of checks.

Uh... ok, cool.

-- 3 years passed by --

Can't we speed this up in a simpler way?

The naive implementation

Instead of explaining the substring search algorithm, I'll show you a solution:

// Time complexity: O(H * N)
// Space complexity: O(1)
fn naive_search(haystack: &str, needle: &str) -> bool {
    let [haystack, needle] =
        [haystack, needle].map(str::as_bytes);

    haystack
        .windows(needle.len())
        .any(|haystack_window| haystack_window == needle)
}

For each window (with the needle size) in the haystack, check if it matches the needle.

I want this, but vroom vroom (faster).

What does haystack.windows(...) do? (→ click me ←)

The slice::windows method returns an iterator that slides a window with the provided size through the whole slice.


[0, 1, 2, 3, 4, 5]
   [1, 2, 3, 4, 5, 6]
      [2, 3, 4, 5, 6, 7]
         [3, 4, 5, 6, 7, 8]
            [4, 5, 6, 7, 8, 9]

Above, each line represents a window over (0..=9) with size 6.
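A minimal sketch of that behavior in code (the function name is just for illustration):

```rust
// Collect every window of size 6 over [0, 1, ..., 9]
fn windows_demo() -> Vec<Vec<u32>> {
    let slice: Vec<u32> = (0..=9).collect();
    slice
        .windows(6)
        .map(|window| window.to_vec())
        .collect()
}
```

There are 10 - 6 + 1 = 5 such windows.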

The thought process

What I'm looking for:

  1. For each window, tell if it can be skipped.
  2. Maximize the amount of skips, so my algorithm runs faster.
  3. Linear time complexity, and constant space complexity.

I'll start my thought process based on these.

The main question to answer is: "How can I tell if a window can be skipped?"

To develop an idea, I started asking myself simpler questions:

  • What insight can I gather about a single window?
    • IDK, start somewhere cheap.
  • What are the cheapest operations I can do?
    • Arithmetic operations.
  • What about multiplication?
    • Like a hash?

Yeah, by comparing the needle hash with the window hash, I can (hopefully) skip most comparisons.

But I'm constrained to linear time complexity. With that in mind, how do I calculate the hash of each window?

I know how to do this in O(N²), but not in O(N).

Ok, gotta try something else (but I'll come back to hashing later).

  • Brain, give me another cheap arithmetic operation.
    • XOR.
  • Mmmmmmmmmmmm... nope.
    • Addition?

Yes! Similar to the hash idea, we can compute the sum (of all elements) of both slices: if sum(needle) != sum(window), equality is impossible and the check can be skipped.

And this time, I know how to do this in O(N). Take these windows as an example:

[0, 1, 2, 3, 4, 5]
   [1, 2, 3, 4, 5, 6]
      [2, 3, 4, 5, 6, 7]
         [3, 4, 5, 6, 7, 8]
            [4, 5, 6, 7, 8, 9]


To start, calculate the sum of the first window naively:

  1. First sum is 15.

For the rest, reuse the previous sum, subtract the removed element and add the new one:

  1. Second = First + 6 - 0 = 21.
  2. Third = Second + 7 - 1 = 27.
  3. Fourth = Third + 8 - 2 = 33.
  4. Fifth = Fourth + 9 - 3 = 39.
  5. And so on...

That's O(N) for all windows, and addition is, of course, fast as hell.
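The walkthrough above can be sketched as code (`rolling_sums` is an illustrative name, not from the post):

```rust
// Rolling sums of all windows of `size` over `slice`
fn rolling_sums(slice: &[u64], size: usize) -> Vec<u64> {
    // The first sum is computed naively
    let mut sum: u64 = slice[..size].iter().sum();
    let mut sums = vec![sum];

    // For the rest: add the new element, subtract the removed one
    for i in size..slice.len() {
        sum = sum + slice[i] - slice[i - size];
        sums.push(sum);
    }

    sums
}
```

For (0..=9) with size 6, this yields 15, 21, 27, 33, 39, matching the steps above.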

The algorithm

Let's preprocess the sum of the needle, and calculate the sum for each window, comparing both to perform skips.

// Time complexity: O(H * N) // same as naive
// Space complexity: O(1)    // same as naive
fn sum_search(haystack: &str, needle: &str) -> bool {
    // Treat corner cases
    if needle.is_empty() {
        return true;
    } else if needle.len() >= haystack.len() {
        return haystack == needle;
    }

    let [haystack, needle] = [haystack, needle].map(str::as_bytes);

    let mut windows = haystack.windows(needle.len());

    // Unwrap Safety:
    //   We know that `0 < needle.len() < haystack.len()`, so there is at
    //   least one window.
    let first_window = windows.next().unwrap();

    let sum_slice = |slice: &[u8]| -> u64 {
        slice.iter().copied().map(u64::from).sum()
    };

    let needle_sum = sum_slice(needle);
    let mut window_sum = sum_slice(first_window);

    // Short-circuit the expensive check to skip it
    if needle_sum == window_sum && first_window == needle {
        return true;
    }

    // Now, for the rest of the windows.
    for (removed_element_index, window) in windows.enumerate() {
        // Unwrap Safety:
        //   We know that `needle.len() > 0`, so every window is non-empty.
        window_sum += *window.last().unwrap() as u64;
        window_sum -= haystack[removed_element_index] as u64;

        // If the sum doesn't match, skip the check
        if needle_sum != window_sum {
            continue;
        }

        // Check equality (expensive check)
        if window == needle {
            return true;
        }
    }

    false
}

Time to compare it to the naive solution and the fastest Boyer–Moore crate, needle:

(Benchmarked searching for "Lorem is a ipsum and more" in a Shakespeare book.)

boyer_moore  =  1.52 ms
sum_search   =  2.43 ms # mine
naive_search = 13.87 ms

Cool! It's slower than Boyer–Moore but faster than the naive algorithm, and it's very simple.

This is curious: there is no "Lorem" substring in the text, but there are a lot of "L", "Lo", and "Lor" occurrences; that's probably why naive_search is so much slower.

Time to see if we're faster than the std's solution:

fn std_search(haystack: &str, needle: &str) -> bool {
    haystack.contains(needle)
}

naive_search =  13.87 ms
sum_search   =   2.43 ms
boyer_moore  =   1.52 ms
std_search   = 127.84 µs # ...
Image of Tom (the cat from Tom & Jerry) with a judgemental look in his eyes.

Oh, the needle is so small that the std optimizes it to use SIMD instructions instead.

#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
if self.len() <= 32 {
    if let Some(result) = simd_contains(self, haystack) {
        return result;
    }
}
// PR - https://github.com/rust-lang/rust/pull/103779

I'll avoid that by expanding the needle to:

"Lorem is a ipsum and more, consectetur" (38 bytes)

naive_search = 14.22 ms
sum_search   =  2.39 ms
std_search   =  1.27 ms # still very good
boyer_moore  =  1.01 ms

The Karp, and the Rabin

After writing my algorithm down, I started researching existing solutions. Do you remember the hash idea? And how I couldn't compute all hashes in O(N)?

Well, it actually has a name: that's the Rabin-Karp algorithm. It skips checks by using a hash (actually, a rolling hash).

A rolling hash is exactly what I was looking for: with it, you can calculate the next window's hash from the previous one by just peeking at the new (and, sometimes, the removed) elements.

But wait, that's extremely similar to what I did, isn't my sum thing a rolling hash?

Here's the main property of a hash:

a == b -> hash(a) == hash(b)

And to be clear, we were relying on its contrapositive to perform the skips:

hash(a) != hash(b) -> a != b   # This has the same meaning as the above

Yes, the sequence sum meets this property, so it does look like a rolling hash, but it doesn't satisfy the other properties we usually expect from a hash function (like uniformity, necessary for a good bucket distribution).
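To make the distinction concrete, here's the sum as a (weak) hash: equal slices always agree, while distinct slices may collide ("abc"/"cba" is a hypothetical example pair):

```rust
// The sum of all bytes, used as a (weak) hash
fn sum_hash(slice: &[u8]) -> u64 {
    slice.iter().copied().map(u64::from).sum()
}
```

`sum_hash(b"abc") == sum_hash(b"cba")` even though the slices differ, so equal hashes prove nothing; only unequal hashes let us skip.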

To settle my doubt, here is a quote from the Rabin-Karp Wikipedia article:

“A trivial (but not very good) rolling hash function just adds the values of each character in the substring.”

Closing thoughts

To me, it has become clear that, starting from the problem statement, you can arrive very close to the Rabin-Karp algorithm (well, at the trivial part of it).

I found two crates that use Rabin-Karp:

  1. memchr.
  2. aho-corasick.

(Props to @BurntSushi for the lovely, comprehensive comments linked above.)

If you're using axum, futures or the regex crate, then your code is probably using Rabin-Karp, because they depend on 1 or 2.

The hard part of Rabin-Karp is coming up with a decent rolling hash function; the two links above point to an implementation reference.
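For a flavor of it, here's a sketch of the classic polynomial rolling hash. The base 257 and the implicit mod-2^64 (wrapping) arithmetic are choices for this sketch, not necessarily what those crates use:

```rust
// Rabin-Karp with a polynomial rolling hash:
//   hash(s) = s[0]*B^(m-1) + s[1]*B^(m-2) + ... + s[m-1]  (mod 2^64)
fn rabin_karp_contains(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true;
    }
    if needle.len() > haystack.len() {
        return false;
    }

    const B: u64 = 257; // an odd base, so its powers never wrap to zero
    let m = needle.len();

    // B^(m-1) mod 2^64, used to remove the outgoing byte
    let top = (1..m).fold(1u64, |acc, _| acc.wrapping_mul(B));

    let hash = |s: &[u8]| {
        s.iter()
            .fold(0u64, |h, &b| h.wrapping_mul(B).wrapping_add(b as u64))
    };

    let needle_hash = hash(needle);
    let mut window_hash = hash(&haystack[..m]);

    for i in 0..=haystack.len() - m {
        // Hashes match: confirm with a real comparison
        if window_hash == needle_hash && &haystack[i..i + m] == needle {
            return true;
        }
        if i + m < haystack.len() {
            // Roll: drop haystack[i], append haystack[i + m]
            window_hash = window_hash
                .wrapping_sub((haystack[i] as u64).wrapping_mul(top))
                .wrapping_mul(B)
                .wrapping_add(haystack[i + m] as u64);
        }
    }

    false
}
```

Unlike the plain sum, this hash is order-sensitive, so permutations of the same bytes no longer collide.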

It's also clear that I ran away from the deep problem when I gave up on hashing; the ideal would have been to keep exploring until I came up with a good hashing solution.

But what can I expect? That algorithm was all just a shower thought.


If I managed to spark your interest in this subject, you should read more about it!

So, here are some suggested points for you to think about:

  1. Check the classification of (substring) search algorithms.
  2. Why is std's <&str>::contains implementation so fast?
  3. What makes for a good rolling hash function?
    • Check the one used in memchr and aho-corasick crates.
  4. For which inputs does each substring search algorithm shine the most?
    • Is it possible to create an adaptive algorithm that collects insights from the input on-the-fly and changes strategy based on them? Is there any gain in falling back to another strategy mid-way, or can you always make the correct decision upfront?


Can the algorithm overflow?

It would require a needle of 72_057 TB; in that case, the search algorithm would produce false negatives.
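A back-of-the-envelope check of that figure (the function name is just for illustration):

```rust
// Conservative bound: each byte adds less than 2^8 to the sum, so a
// u64 sum cannot overflow before 2^64 / 2^8 = 2^56 bytes are added.
fn overflow_threshold_in_tb() -> u64 {
    let max_bytes = u64::MAX >> 8; // 2^56 - 1 bytes
    max_bytes / 10u64.pow(12)      // in decimal terabytes
}
```

That works out to 72_057 TB.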

XOR variation

At first, I thought XOR wasn't a solution, so I focused on addition.

However, to quickly recalculate the sum for the new window, we were relying on the invertibility property of addition, that is:

A + B - B == A

Guess what:

A ^ B ^ B == A

It also applies to XOR. Let's see what happens:

fn xor_search(haystack: &str, needle: &str) -> bool {
    // ...same corner cases and window setup as `sum_search`...

    let xor_slice = |slice: &[u8]| -> u64 {
        slice.iter().copied().map(u64::from).fold(0, |a, b| a ^ b)
    };

    let needle_xor = xor_slice(needle);
    let mut window_xor = xor_slice(first_window);

    // ...same short-circuited first check as `sum_search`...

    for (removed_element_index, window) in windows.enumerate() {
        // Unwrap Safety:
        //   We know that `needle.len() > 0`, every window is non-empty.
        window_xor ^= *window.last().unwrap() as u64;
        window_xor ^= haystack[removed_element_index] as u64;

        // ...same skip-and-compare as `sum_search`...
    }

    false
}

naive_search = 14.22 ms
xor_search   =  2.69 ms # here
sum_search   =  2.39 ms
std_search   =  1.27 ms
boyer_moore  =  1.01 ms

It works and it's still faster than the naive, but it's slower than sum_search: XOR of u8s never leaves the 0..=u8::MAX range, so the chance of collision is higher than with the u64 sums used in sum_search.

Inputs that trigger the worst-case performance

Here's a case that's bad for the sum solution, but great for XOR:

needle = "22";
haystack = "1313131313131313...";

In contrast, this one is good for sum and bad for xor:

needle = "223";
haystack = "113113113...";

In "223", the 2s cancel each other out, and the same happens to the 1s in "113": the XOR of both sequences ends up as b'3'. In the first pair, it's the sums that collide: b'2' + b'2' == b'1' + b'3' == 100.
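Both claims can be checked directly (recall that b'1' = 49, b'2' = 50, b'3' = 51):

```rust
// Sum of all bytes
fn sum_of(s: &[u8]) -> u64 {
    s.iter().copied().map(u64::from).sum()
}

// XOR of all bytes
fn xor_of(s: &[u8]) -> u64 {
    s.iter().copied().map(u64::from).fold(0, |a, b| a ^ b)
}
```

"22" and "13" collide under sum but not XOR; "223" and "113" collide under XOR but not sum.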

Fun results for tiny inputs

For tiny needles (up to 24 bytes of length), this one runs faster than the naive algorithm:

fn weird_search(haystack: &str, needle: &str) -> bool {
    let [haystack, needle] =
        [haystack, needle].map(str::as_bytes);

    let sum_slice = |slice: &[u8]| -> u64 {
        slice.iter().copied().map(u64::from).sum()
    };

    let needle_sum = sum_slice(needle);

    haystack.windows(needle.len()).any(|haystack_window| {
        let window_sum = sum_slice(haystack_window);
        window_sum == needle_sum && haystack_window == needle
    })
}

It's the naive algorithm plus a sum check; here are the results:

naive_search =  13.82 ms
weird_search =   8.37 ms # this one
boyer_moore  =   3.82 ms
xor_search   =   2.68 ms
sum_search   =   2.42 ms
std_search   = 137.29 µs

Calculating the sum of a whole window can be faster than comparing two slices (in C's memcmp fashion); it's probable that SIMD instructions are contributing to this.

Tiny puzzle

What assumptions can we make about haystack after these checks?

fn sum_search(haystack: &str, needle: &str) -> bool {
    // Treat corner cases
    if needle.is_empty() {
        return true;
    } else if needle.len() >= haystack.len() {
        return haystack == needle;
    }

    // ...

Answer. (→ click me ←)

haystack.len() >= 2

(The needle is non-empty and strictly shorter than the haystack, so haystack.len() > needle.len() >= 1.)

Code for the needle crate

// Assuming the `needle` crate's `BoyerMoore` searcher (check the
// crate's docs for the exact API):
fn boyer_moore(haystack: &str, needle: &str) -> bool {
    use needle::BoyerMoore;

    BoyerMoore::new(needle.as_bytes())
        .find_first_in(haystack.as_bytes())
        .is_some()
}

Comment section

I'd like the comment section to live inside GitHub; it's an experiment:

Click here for the comment section of this post.