Using Boost regular expressions as word finders

A short demonstration of using the Boost libraries to find the words in a large word list that match given lookup criteria.

Suppose you are wrestling with a cryptic crossword and want to find all seven-letter words whose third letter is ‘Y’ and fifth letter is ‘N’, or better still, run a program that will find these words for you.

The Boost regex_match algorithm determines whether a given regular expression matches all of a given character sequence, which makes it a good fit for this job. The steps are as follows:

1. Define the regular expression

In my simple example, I want to find all seven-letter words, ignoring case, whose third letter is ‘Y’ and fifth letter is ‘N’, so I define a boost::regex to do this. Each '.' matches any single character, and the boost::regex::icase flag makes the match case insensitive:

static const boost::regex ex("..y.n..", boost::regex::icase);
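For other clues the pattern string can be generated rather than typed by hand. The following is a minimal sketch, not part of Boost: buildPattern is a hypothetical helper name, positions are assumed to be 1-based, and the known characters are assumed to be plain letters (regex metacharacters are not escaped).

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Boost function): builds a pattern such as
// "..y.n.." from a word length and a map of 1-based positions to letters.
std::string buildPattern( std::size_t length,
                          const std::map<std::size_t, char> &known )
{
    // Start with one '.' per letter; '.' matches any single character.
    std::string pattern( length, '.' );

    std::map<std::size_t, char>::const_iterator it;
    for ( it = known.begin(); it != known.end(); ++it )
    {
        if ( it->first < 1 || it->first > length )
        {
            throw std::out_of_range( "letter position outside the word" );
        }
        // Convert the 1-based position to a 0-based string index.
        pattern[ it->first - 1 ] = it->second;
    }
    return pattern;
}
```

The resulting string can be passed straight to the boost::regex constructor in place of the literal.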

2. Read the text file into a std::string

Dictionaries and word lists are plentiful online. You can use this text file, about 1MB in size, if you’re having trouble finding one.

Reading the text file into a std::string is straightforward enough:

std::ifstream t( "word_list.txt" );  
std::string text( ( std::istreambuf_iterator<char>( t ) ),  
                    std::istreambuf_iterator<char>() ); 
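If the file cannot be opened, the snippet above quietly produces an empty string and the program finds no matches. A minimal sketch of the same read with an explicit check follows; readFile is a hypothetical helper name, and throwing an exception is just one possible way to report the failure.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole file into a string,
// or throws if the file cannot be opened.
std::string readFile( const std::string &path )
{
    std::ifstream in( path.c_str() );
    if ( !in )
    {
        throw std::runtime_error( "could not open " + path );
    }

    // The extra parentheses around the first argument stop the compiler
    // from parsing this expression as a function declaration.
    return std::string( ( std::istreambuf_iterator<char>( in ) ),
                        std::istreambuf_iterator<char>() );
}
```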

3. Split the complete string into separate words

Use boost::tokenizer to split the string. The string we have read in contains the complete list of words, which we split into separate tokens delimited by newline ('\n') characters:

boost::char_separator< char > sep( "\n" );  
boost::tokenizer< boost::char_separator< char > > tokens( text, sep );

4. Determine the suitable matches

We now find which of the tokenized words satisfy the regular expression. For this I iterate through the tokenized words and test each one with boost::regex_match, passing it the word and the regular expression. If a match is found then we can store and/or display it.

Here is the complete code listing:

#include <iostream>
#include <string>
#include <fstream>
#include <iterator>
#include <vector>

#include <boost/regex.hpp>
#include <boost/foreach.hpp>  
#include <boost/tokenizer.hpp>  

// Returns true if the whole of st matches the regular expression ex
bool testMatch( const boost::regex &ex, const std::string &st )
{
	return boost::regex_match( st, ex );
}

int main(int argc, char *argv[])
{	
	// 1. Define what matches we are looking for
	static const boost::regex ex("..y.n..", boost::regex::icase);

	// 2. Read the word list text file into a string
	std::ifstream t( "word_list.txt" );
	if ( !t )
	{
		std::cerr << "Could not open word_list.txt" << std::endl;
		return 1;
	}
	std::string text( ( std::istreambuf_iterator<char>( t ) ),  
					    std::istreambuf_iterator<char>() ); 

	// 3. Tokenize the words via their newlines
	boost::char_separator< char > sep( "\n" );  
	boost::tokenizer< boost::char_separator< char > > tokens( text, sep );
  
	// 4. Find and display any matches found using the boost foreach loop  	
	std::vector<std::string> matches;
  
	BOOST_FOREACH( std::string val, tokens )  
	{ 
		if ( testMatch( ex, val ) )
		{
			std::cout << val << std::endl;
			matches.push_back( val );
		}				
	}  

	return 0;
}

And here is the list of word matches that it finds for us:

bayonet
beyonds
cryonic
