Counting the Number of Words in a Text File in STL / C++

This post aims to illustrate the power of using STL’s associative arrays as a word counter. It reads the entire contents of the text file, word-by-word, and keeps a running total of the number of occurences of each word. All using just a few lines of code, discounting the bits that output the results.

There is a deceptively simple line of code

counter[ tok ]++;

that looks up the counter value for a given word and increments its value. It basically navigates the red-black tree used by the standard map to find the appropriate tree node, selects the T portion of the pair <Key,T>, and then increments it.

The WordCounter also needs to have the default constructor in order to set the counter value to zero. Without this, when performing the counter[ tok ]++; code for the first time for a particular word, this initial integer value will be set to some arbitrary value contained in the memory.

#include <iostream>
#include <fstream>
#include <sstream>
#include <map>
#include <string>

using namespace std;

class WordCounter
{
public:
	int value;
	WordCounter() : value( 0 ) {}
	
	void operator++ (int) { value++; }
};


ostream& operator<<(ostream& st, WordCounter& wc )
{
	return st << wc.value;
}

const string path = "C:\\Dump\\Hamlet.txt";
//const string path = "/home/andy/NetBeansProjects/WordCount/Hamlet.txt";  //Linux

int main()
{
	map<string, WordCounter> counter;

	ifstream input;
	input.open( path.c_str() );

	if ( !input )
	{
		cout << "Error in opening file Hamlet.txt\n";
		return 0;
	}

	string tok;		
	while ( true )
	{
		input >> tok;
		
		if ( input )
		{
			counter[ tok ]++;
		}
		else break;		
	}

	map< string, WordCounter,less<string> >::iterator it;

	for ( it  = counter.begin();
		  it != counter.end();
		  it++ )
	{
		cout << (*it).first
			 << ", "
			 << (*it).second
			 << endl;
	}

	return 0;
}

As an example, let’s apply this to a sample piece of text, an excerpt from Shakespeare’s Hamlet. (Click here for Richard E. Grant’s superb rendition by Richard E. Grant.)

I have of late – but wherefore I know not – lost all my mirth,
forgone all custom of exercises;
and indeed it goes so heavily with my disposition that this goodly frame,
the earth, seems to me a sterile promontory;
this most excellent canopy, the air, look you, this brave o’erhanging firmament,
this majestical roof fretted with golden fire,
why, it appears no other thing to me than a foul and pestilential congregation of vapours.
What a piece of work is a man! how noble in reason! how infinite in faculty!
in form and moving how express and admirable!
in action how like an angel! in apprehension how like a god!
the beauty of the world! the paragon of animals!
And yet to me, what is this quintessence of dust?
man delights not me: no, nor woman neither.

To get a word count something like this:

And so on…

Further improvement: removing unwanted characters

The output may be further refined by including a kind of filter to strip out any unwanted characters like ‘?’, ‘;’ etc and improve the output formatting by means of the setw routine.

While the strings are being read in and tokenized, I use this

tok.resize( remove_if( tok.begin(), tok.end(), filter) - tok.begin() );

specifically to remove the unwanted characters and resize the string, filter returning true or false depending upon whether the character is ‘normal’ or not. Updated code listing as follows:

#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <algorithm>
#include <map>
#include <string>

using namespace std;

class WordCounter
{
public:
	int value;
	WordCounter() : value( 0 ) {}
	
	void operator++ (int) { value++; }
};

ostream& operator<<(ostream& st, WordCounter& wc )
{
	return st << wc.value;
}

// Remove unwanted characters from a string
bool filter(char c)
{
	return isalpha( c ) == 0;
}

const string path = "C:\\Dump\\Hamlet.txt";
//const string path = "/home/andy/NetBeansProjects/WordCount/Hamlet.txt";  //Linux

int main()
{
	map<string, WordCounter> counter;

	ifstream input;
	input.open( path.c_str() );

	if ( !input )
	{
		cout << "Error in opening file Hamlet.txt\n";
		return 0;
	}

	string tok;		

	while ( true )
	{
		input >> tok;

		// Remove ?, !, , characters etc from string
		tok.resize( remove_if( tok.begin(), tok.end(), filter) - tok.begin() );
		
		if ( input )
		{
			counter[ tok ]++;
		}
		else break;		
	}
	
	map< string, WordCounter,less<string> >::iterator it;

	for ( it  = counter.begin();
		  it != counter.end();
		  it++ )
	{
		cout << std::setw (15) 
			 << (*it).first			 
			 << "\t"
			 << (*it).second
			 << endl;
	}

	return 0;
}

Giving the following output, with extra padding and minus the extraneous characters:

Further improvement: sorting the words and their counts

Now that we have the full set of words and their frequencies, we may wish to make the output more presentable by sorting them. As described on this other post, you cannot just sort a std::map like you can with a std::vector, since a std::map sorts its elements by key. You have to first insert the std::map key-value pairs into a std::vector, and then sort that std::vector with some kind of comparison function or functor, as shown in this next example:


#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using namespace std;  

template <class T>
struct less_second : std::binary_function<T,T,bool>  
{  
    inline bool operator()( const T& lhs, const T& rhs )  
    {  
        return lhs.second.value < rhs.second.value;  
    }  
};  

class WordCounter  
{  
public:  
    int value;  
    WordCounter() : value( 0 ) {}  

    void operator++ (int) { value++; }  
};  

ostream& operator<<(ostream& st, WordCounter& wc )  
{  
    return st << wc.value;  
}  

// Remove unwanted characters from a string  
bool filter(char c)  
{  
    return isalpha( c ) == 0;  
}  

const string path = "C:\\Dump\\Hamlet.txt";  
//const string path = "/home/andy/NetBeansProjects/WordCount/Hamlet.txt";  //Linux

int main()  
{  
    typedef std::pair< string, WordCounter > word_mapping;  

    map<string, WordCounter> counter;  

    ifstream input;  
    input.open( path.c_str() );  

    if ( !input )  
    {  
        cout << "Error in opening file\n";  
        return 0;  
    }  

    string tok;       

    while ( true )  
    {  
        input >> tok;  

        // Remove ?, !, , characters etc from string  
        tok.resize( remove_if( tok.begin(), tok.end(), filter) - tok.begin() );  

        if ( input )  
        {  
            counter[ tok ]++;  
        }  
        else break;       
    }  

    map< string, WordCounter,less<string> >::iterator it = counter.begin();  

    for ( ; it != counter.end(); it++ )  
    {  
        cout << std::setw (15)   
             << (*it).first              
             << "\t"  
             << (*it).second  
             << endl;  
    }  

    // And then sort and display the result:
    std::vector< word_mapping > result( counter.begin(), counter.end() );         
    std::sort( result.begin(), result.end(), less_second<word_mapping>() );    

    for ( std::vector< word_mapping >::const_iterator mit  = result.begin();  
          mit != result.end();  
          mit++ )  
    {  
        cout << std::setw (15)   
             << (*mit).first              
             << "\t"  
             << (*mit).second.value 
             << endl;  
    }  


    return 0;  
}  

Giving the following output, which has the improvements made previously plus the sorting:

WordCountSort

Related post: Mapping Words to Line Numbers in Text Files

`