Finding substrings within strings using the Boyer-Moore-Horspool Algorithm in C++

Ever wondered what computer algorithm gets employed when using Ctrl-F to search for specific words in a page of text?

A number of potential string searching techniques exist. One possible candidate implementation is Nigel Horspool’s Boyer-Moore-Horspol algorithm – an exceptionally fast technique for finding a ‘needle’ substring within a larger ‘haystack’ string.

Here is an example of it’s usage: you simply feed it the needle/haystack strings/lengths so that it returns a pointer to the first occurence of the ‘needle’ string, or null if it is not found. Use the difference between the result and haystack pointers to determine the numerical position of the needle substring within the haystack string.

Code listing below:

#include <stdio.h>
#include <string.h>
#include <limits.h>


/* Returns a pointer to the first occurrence of "needle"
 * within "haystack", or NULL if not found. Works like
 * memmem().
 */
 
/* Note: In this example needle is a Cstring. The ending
 * 0x00 will be cut off, so you could call this example with
 * boyermoore_horspool_memmem(haystack, hlen, "abc", sizeof("abc"))
 */
const unsigned char* boyermoore_horspool_memmem(const unsigned char* haystack, 
											    size_t hlen,
											    const unsigned char* needle,  
											    size_t nlen)
{
    size_t scan = 0;
    size_t bad_char_skip[UCHAR_MAX + 1]; /* Officially called:
                                          * bad character shift */
 
    /* Sanity checks on the parameters */
    if (nlen <= 0 || !haystack || !needle)
        return NULL;
 
    /* ---- Preprocess ---- */
    /* Initialize the table to default value */
    /* When a character is encountered that does not occur
     * in the needle, we can safely skip ahead for the whole
     * length of the needle.
     */
    for (scan = 0; scan <= UCHAR_MAX; scan = scan + 1) bad_char_skip[scan] = nlen;
 
    /* C arrays have the first byte at [0], therefore:
     * [nlen - 1] is the last byte of the array. */
    size_t last = nlen - 1;
 
    /* Then populate it with the analysis of the needle */
    for (scan = 0; scan < last; scan = scan + 1) bad_char_skip[needle[scan]] = last - scan;
 
    /* ---- Do the matching ---- */
 
    /* Search the haystack, while the needle can still be within it. */
    while (hlen >= nlen)
    {
        /* scan from the end of the needle */
        for (scan = last; haystack[scan] == needle[scan]; scan = scan - 1)
		{
			/* If the first byte matches, we've found it. */
            if (scan == 0) 
			{
					return haystack;
			}
		}
 
        /* otherwise, we need to skip some bytes and start again.
           Note that here we are getting the skip value based on the last byte
           of needle, no matter where we didn't match. So if needle is: "abcd"
           then we are skipping based on 'd' and that value will be 4, and
           for "abcdd" we again skip on 'd' but the value will be only 1.
           The alternative of pretending that the mismatched character was
           the last character is slower in the normal case (Eg. finding
           "abcd" in "...azcd..." gives 4 by using 'd' but only
           4-2==2 using 'z'. */
        hlen     -= bad_char_skip[haystack[last]];
        haystack += bad_char_skip[haystack[last]];
    }
 
    return NULL;
}

int main()
{
	unsigned char needle[]   = "needle";
	unsigned char haystack[] = "try to find needle in a haystack";

	int  needle_length = 6;
	int  haystack_length = 34;	

	const unsigned char* result = boyermoore_horspool_memmem( haystack, 50, needle, 6 );

	int pos = result - haystack ;
}

Leave a Reply