Word Counting in C++: Computing the Span of a Word

Here is a new episode in the series on word counting! Today we will focus on computing the span of words in code.

As a reminder, word counting consists in counting the occurrences of every term in a piece of code (for example, in a function), and sorting the results by most frequent words. This can reveal at a glance useful information about that piece of code.

Over the past few posts, we’ve been building a word counter in C++. We’re investing time in this project for several reasons:

it is an opportunity to practice with the STL,

it is an opportunity to practice with interface design,

we have a more and more complete word counter to use on our code.

The span of words

Today we add a new feature to our word counter: computing the span of words! The span of a term in a piece of code is the number of lines over which it spreads. For example, consider the following piece of code:

    int i = 42;
    f(i);
    f(i+1);
    std::cout << "hello";
    ++i;

The span of f is 2, the span of i is 5 and the span of cout is 1.
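To make the arithmetic concrete, here is a minimal sketch that derives the span from the line numbers on which a word occurs. The helper name spanOfLines is hypothetical and is not part of the word counter built in this series; line numbers are assumed to be 1-based, as in the example above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only: the span of a word, given the
// line numbers of its occurrences. An empty list of lines gives a span of 0.
std::size_t spanOfLines(std::vector<std::size_t> const& lines)
{
    if (lines.empty())
    {
        return 0;
    }
    // One pass to find both the first and the last line of occurrence.
    auto const bounds = std::minmax_element(lines.begin(), lines.end());
    return *bounds.second - *bounds.first + 1;
}
```

In the example, i occurs on lines 1, 2, 3 and 5, so its span is 5 - 1 + 1 = 5, and f occurs on lines 2 and 3, so its span is 3 - 2 + 1 = 2.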

The span of a word is an interesting measure because it indicates how spread out the word is in a piece of code: are all its usages located in the same area? Is it used throughout the function? Such are the questions that can be answered by measuring the span of that word.

Combined with the count of occurrences of a word (a feature that our word counter already has), the span can measure the density of a term. If a word has a high number of occurrences and a low span, it means that its usages are all crammed into one part of a function.

Knowing such a piece of information brings at least two things:

quickly knowing what a part of the code is about,

suggesting a refactoring task (taking away that part of the code in a separate function).
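The article does not give a formula for density, so the following is only one possible reading of it: the number of occurrences divided by the span. The function name density is an assumption made for this sketch and does not exist in the word counter.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the original word counter:
// average number of occurrences per line over the lines the word spans.
// A span of 0 means the word never occurred, so the density is 0.
double density(std::size_t nbOccurrences, std::size_t span)
{
    if (span == 0)
    {
        return 0.0;
    }
    return static_cast<double>(nbOccurrences) / static_cast<double>(span);
}
```

Under this definition, a word with 6 occurrences and a span of 2 has a density of 3, while a word with 1 occurrence spread over 4 lines has a density of 0.25.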

Computing the span of a word

Let’s pick up the word counter where we left off.

The basic design of our word counter was to extract the successive words in the piece of code, and then to count the number of occurrences of each of those words.

In that first implementation we used standard types, such as string for the extracted words and size_t for their number of occurrences.

To implement the span, we will need to extract and process more information (about line numbers in particular), so this implementation won’t hold. We need to make it more robust, by replacing the raw standard types with dedicated classes.

The data extracted from the code is now called WordData, and the aggregates computed from this data for each word are now called WordStats. At this stage, WordData and WordStats are simple encapsulations of their standard type equivalents:

    class WordData
    {
    public:
        explicit WordData(std::string word);
        std::string const& word() const;
    private:
        std::string word_;
    };

    class WordStats
    {
    public:
        WordStats();
        size_t nbOccurrences() const;
        void addOneOccurrence();
    private:
        size_t nbOccurrences_;
    };

If we didn’t want to go further than this, we could have considered using strong types instead of defining our own classes. But the point here is to add new features to the class, so we’ll stick with regular classes.

Extracting line numbers

Our current code for extracting words from code is this:

    template<typename EndOfWordPredicate>
    std::vector<WordData> getWordDataFromCode(std::string const& code, EndOfWordPredicate isEndOfWord)
    {
        auto words = std::vector<WordData>{};
        auto beginWord = std::find_if_not(begin(code), end(code), isDelimiter);
        while (beginWord != end(code))
        {
            auto const endWord = std::find_if(std::next(beginWord), end(code), isEndOfWord);
            words.emplace_back(std::string(beginWord, endWord));
            beginWord = std::find_if_not(endWord, end(code), isDelimiter);
        }
        return words;
    }

The isEndOfWord predicate detects the end of a word, which can be either a capital letter (for words inside camelCase symbols) or a delimiter (in all cases).

And isDelimiter indicates if a character is not part of a word:

    #include <cctype>

    bool isDelimiter(char c)
    {
        // isalnum has undefined behaviour for negative values, hence the cast
        auto const isAllowedInName = std::isalnum(static_cast<unsigned char>(c)) || c == '_';
        return !isAllowedInName;
    }
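The isEndOfWord predicate itself is not shown in this post. As a rough sketch of what a camelCase version could look like, based only on the description above (a capital letter or a delimiter ends a word), consider the following. The name isEndOfWordInCamelCase is an assumption made for this example, and isDelimiter is repeated so that the snippet compiles on its own.

```cpp
#include <cctype>

// Repeated from the article so that this sketch is self-contained.
bool isDelimiter(char c)
{
    auto const isAllowedInName = std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    return !isAllowedInName;
}

// Hypothetical predicate: in camelCase, a word ends either at a capital
// letter (the start of the next word) or at a delimiter.
bool isEndOfWordInCamelCase(char c)
{
    return std::isupper(static_cast<unsigned char>(c)) || isDelimiter(c);
}
```

Passing such a predicate to getWordDataFromCode would split a symbol like wordCounter into word and Counter.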

So far, getWordDataFromCode extracts the words of the piece of code. We would now like it to also extract the line numbers of those words. We will then be able to compute the span as the distance between the first line and the last line on which a word appears.

A simple way to work out the line number of a given word is to count the line returns from the beginning of the piece of code up to that word. But doing this for each word makes for a quadratic number of reads of the characters of the piece of code. Can we do better than quadratic?

We can if we count the number of line returns since the end of the previous word, and add this to the line number of the previous word. This has a linear complexity, which is much better than quadratic complexity.

We could consider going further by checking every character only once, and finding the beginning of the next word AND the number of line returns until then, all in one single pass. But that would lead to more complex code. So we will settle for the above linear algorithm, even if it makes several reads of the same characters. We keep the code simple until there is a compelling reason not to (for example, poor performance for which profiling indicates that we should go for a more elaborate algorithm).
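For comparison, here is a rough sketch of what the single-pass alternative could look like. It is not the approach retained in this post. To stay short it makes simplifying assumptions: it ignores the camelCase splitting done by isEndOfWord, it returns plain pairs instead of WordData, and the function name wordsAndLinesInOnePass is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-pass extraction: every character is read exactly once.
// Words are maximal runs of characters allowed in a name; lines are 0-based.
std::vector<std::pair<std::string, std::size_t>> wordsAndLinesInOnePass(std::string const& code)
{
    auto result = std::vector<std::pair<std::string, std::size_t>>{};
    auto currentWord = std::string{};
    std::size_t line = 0;

    for (char const c : code)
    {
        bool const isAllowedInName = std::isalnum(static_cast<unsigned char>(c)) || c == '_';
        if (isAllowedInName)
        {
            currentWord += c;
            continue;
        }
        // A delimiter ends the current word, if there is one.
        if (!currentWord.empty())
        {
            result.emplace_back(currentWord, line);
            currentWord.clear();
        }
        if (c == '\n')
        {
            ++line;
        }
    }
    // The code may end in the middle of a word.
    if (!currentWord.empty())
    {
        result.emplace_back(currentWord, line);
    }
    return result;
}
```

The loop has to track the word being built and the current line at the same time, and it needs a special case for the end of the input. That extra state is the kind of complexity the linear algorithm based on standard algorithms avoids.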

Here is the code updated to use the linear algorithm, counting the line returns between consecutive words:

    template<typename EndOfWordPredicate>
    std::vector<WordData> getWordDataFromCode(std::string const& code, EndOfWordPredicate isEndOfWord)
    {
        auto words = std::vector<WordData>{};
        auto endWord = begin(code);
        auto beginWord = std::find_if_not(begin(code), end(code), isDelimiter);
        size_t line = 0;

        while (beginWord != end(code))
        {
            auto const linesBetweenWords = std::count(endWord, beginWord, '\n');
            line += linesBetweenWords;
            endWord = std::find_if(std::next(beginWord), end(code), isEndOfWord);
            words.emplace_back(std::string(beginWord, endWord), line);
            beginWord = std::find_if_not(endWord, end(code), isDelimiter);
        }
        return words;
    }
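The updated WordData class is not shown in this post. Judging from how it is used (constructed from a word and a line, and queried through word() and lineNumber()), it could look something like the following sketch. The member name lineNumber_ and the inline definitions are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Sketch of WordData extended with a line number. The interface is inferred
// from the calls made in the article; the details are assumptions.
class WordData
{
public:
    WordData(std::string word, std::size_t lineNumber)
        : word_(std::move(word))
        , lineNumber_(lineNumber)
    {
    }

    std::string const& word() const { return word_; }
    std::size_t lineNumber() const { return lineNumber_; }

private:
    std::string word_;
    std::size_t lineNumber_;
};
```

Note that the extraction code starts counting lines at 0. This does not affect the span, which only depends on the difference between two line numbers.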

Computing the span

We now have a collection of WordData, each of which contains a word and a line number. We now feed this collection to a std::map<std::string, WordStats>. The code before taking the span into account looked like this:

    std::map<std::string, WordStats> wordStats(std::vector<WordData> const& wordData)
    {
        auto wordStats = std::map<std::string, WordStats>{};
        for (auto const& oneWordData : wordData)
        {
            wordStats[oneWordData.word()].addOneOccurrence();
        }
        return wordStats;
    }

One way to give WordStats the line numbers of the words, so that it can process them, is to pass each line number as an argument to the addOneOccurrence method:

    std::map<std::string, WordStats> wordStats(std::vector<WordData> const& wordData)
    {
        auto wordStats = std::map<std::string, WordStats>{};
        for (auto const& oneWordData : wordData)
        {
            wordStats[oneWordData.word()].addOneOccurrence(oneWordData.lineNumber());
        }
        return wordStats;
    }

WordStats should be able to provide a span in the end, so it needs to remember the smallest and highest line numbers where the word appears. To achieve that, we can keep the smallest (resp. highest) line number encountered so far in the WordStats and replace it with the incoming line number in addOneOccurrence if it is smaller (resp. higher).

But what initial value should we give to the smallest and highest line numbers encountered so far? Before giving any line number, those two bounds are “not set”. To implement this in C++, we can use optional (std::optional in C++17, boost::optional before):

    class WordStats : public Comparable<WordStats>
    {
    public:
        WordStats();
        size_t nbOccurrences() const;
        void addOneOccurrence(size_t lineNumber);
        size_t span() const;
    private:
        size_t nbOccurrences_;
        std::optional<size_t> lowestOccurringLine_;
        std::optional<size_t> highestOccurringLine_;
    };

With this, the implementation of addOneOccurrence can be:

    void WordStats::addOneOccurrence(size_t lineNumber)
    {
        ++nbOccurrences_;
        if (!lowestOccurringLine_) // means that it is the first line number coming in
        {
            lowestOccurringLine_ = lineNumber;
        }
        else
        {
            lowestOccurringLine_ = std::min(*lowestOccurringLine_, lineNumber); // the "min" that we were talking about
        }

        // then same thing for the highest line
        if (!highestOccurringLine_)
        {
            highestOccurringLine_ = lineNumber;
        }
        else
        {
            highestOccurringLine_ = std::max(*highestOccurringLine_, lineNumber);
        }
    }

Then span comes naturally:

    size_t WordStats::span() const
    {
        // both bounds must be set before they can be dereferenced
        if (!lowestOccurringLine_ || !highestOccurringLine_)
        {
            return 0;
        }
        else
        {
            return *highestOccurringLine_ - *lowestOccurringLine_ + 1;
        }
    }
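As a possible variation on this design, and not what the article's code does, one could note that the bounds are "not set" exactly when no occurrence has been recorded yet. The occurrence counter can then play the role of the optional. The class name WordStatsWithoutOptional is made up for this sketch to avoid confusion with the real WordStats.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical variant of WordStats: nbOccurrences_ == 0 means that the
// two line bounds have not been set yet, so no optional is needed.
class WordStatsWithoutOptional
{
public:
    std::size_t nbOccurrences() const { return nbOccurrences_; }

    void addOneOccurrence(std::size_t lineNumber)
    {
        if (nbOccurrences_ == 0)
        {
            // First occurrence: both bounds start at this line.
            lowestOccurringLine_ = lineNumber;
            highestOccurringLine_ = lineNumber;
        }
        else
        {
            lowestOccurringLine_ = std::min(lowestOccurringLine_, lineNumber);
            highestOccurringLine_ = std::max(highestOccurringLine_, lineNumber);
        }
        ++nbOccurrences_;
    }

    std::size_t span() const
    {
        if (nbOccurrences_ == 0)
        {
            return 0;
        }
        return highestOccurringLine_ - lowestOccurringLine_ + 1;
    }

private:
    std::size_t nbOccurrences_ = 0;
    std::size_t lowestOccurringLine_ = 0;
    std::size_t highestOccurringLine_ = 0;
};
```

The trade-off is that the "not set" state becomes implicit: it relies on an invariant between three members rather than being expressed in the types, which is what the optional version makes explicit.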

The feature of span

We have highlighted the main part of the design. If you’d like to have a look at the code in its entirety, and play around with the word counter, you will find all the above in this coliru.

The code produces the span of the words, but I certainly don’t claim that it’s the optimal implementation. Did you see things that you would like to correct in the design, or the implementation?

More generally, do you think that measuring the span of words, as well as their density, is a relevant measure for your code?

