Jumping into C++

January 18th, 2013

I was a professional developer for years, working initially with C in labs and an engineering firm before moving on to other languages and stints as an enterprise developer and consultant. Other languages and tools were more in demand for the projects I worked on, so I gradually moved away from native development and lost touch with the C++ story. It has been years since I developed anything in C or tried using C++.

How hard would it be to write a familiar application in modern C++?

This blog post shares my experience returning to native programming using modern C++. No printf statements were harmed in this effort. A few of my colleagues might have been annoyed by my many questions, so shout out to James McNellis for tutoring and reviewing the code and blog post drafts and to Sumit Kumar and Niklas Gustafsson for deciphering warning messages.

Any errors in the code or this post are mine.

Ground Rules

I set out to write a simple application using modern C++. I imposed the following rules and requirements:

Use only modern C++. I wanted to use streams, iterators, templates and mapping how I did things in C, C# and other languages onto modern C++.

Use best practices. For example, “when in doubt, use vector.” If a value was not meant to change, make it const. Don’t use macros. Go with what makes sense and seems to have the most support.

Share all gotchas! I expected to run into simple problems reflecting my lack of C++ experience. Fortunately, I could wander the halls and ask my colleagues on the Visual C++ team for help/answers/advice and commiseration. Spoiler: one of my gotchas involved precompiled headers. *facepalm*

Keep the project simple. The application was not going to be a Windows Store application. I wanted to read and write some data, do a few loops, mess around with a collection or two, and use std::string, not build a “real” application or explore more advanced concepts. I do hope to build a more sophisticated “real” application in the future.

Use Visual Studio and Visual C++. Probably a no-brainer, but I did have some concerns about finding my way around VS2012. I had read a few complaints about the default editor color palette, but I had no complaints. I also found all the shortcuts I had learned years ago continue to work.

Given these ground rules, I chose a simple project, counting words across a number of files. It requires input and output, a way to map a count to a word, some string manipulation (if desired), and loops. It also works well as a command-line application so I could avoid UI issues. If written well, the code could always be ported to a fancier Windows application in the future.

Requirements

The application needs to count the words in a set of text files provided by the user. It must accept a list of files on the command line, process each file while accumulating word counts and skipping bad/missing files, and then print out the number of files processed and the list of words and counts. For this exercise, words are chunks of non-whitespace characters delimited by one or more whitespace characters—nothing too sophisticated.

For all requirements, I chose the simplest possible solution. For example, the program targets ASCII files and does not deal with wide characters (though I did mess around with wstring and wifstream without running into problems). The program avoids handling certain binary files (test files are from Project Gutenberg). Words include unwanted characters even after simple filtering, and so on.

Would you realistically implement these requirements with C++? You might if you needed to squeeze more performance out of file processing, needed to reuse the code on different platforms, or wanted more control. You might not if you had access to PowerShell or another scripting language and wanted a quick solution. As with all interesting choices, the answer is “it depends.” If you wanted to use AppleSoft Basic on an old ][+, go for it (and share your code in the comments section below)!

The solution itself is short, but it did take a couple of iterations to whip into proper C++ shape. If you want the code, it is attached but should not be used in production code blah blah blah.

Interesting Bits

There were a few interesting bits — interesting mostly because I was new to modern C++ and a bit rusty on using the C/C++ compiler. Some of the issues encountered were fixed using information online (the C++ community and online ecosystem is awesome); others I inflicted on my colleagues.

Processing Command-Line Arguments

Rather than looping through the command line arguments array argv, I went ahead and converted it to a vector<string>.

    int main(int argc, char** argv)
    {
        // bail if no files are specified
        if(0 == argc)
            return 0;

        const vector<string> files(argv, argv + argc);
    }

Iteration is straight-forward, especially if (unlike me) you remember to use const auto&.

for (const auto& file : files) …

We have no intention of modifying the file string. Defaulting to const auto& when writing a loop seems like a safe best practice.

If run from the Windows command line, the first argument is usually the path/name of the executable. Rather than weed out this case when copying from argv, I did it during the file processing loop:

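    // the length check keeps names shorter than ".exe" from matching by accident
    if(file.length() >= 4 && file.rfind(".exe") == (file.length() - 4))
        continue;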

If the file name ends with “.exe”, skip it. It could be more robust, but the basic mechanism is in there.
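For comparison, here is a minimal sketch of the other route: dropping argv[0] while copying, plus a slightly sturdier suffix test. The names ends_with and collect_files are hypothetical helpers invented for this illustration, not part of the attached program, and the sketch assumes argv[0] is always the program name.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true when `name` ends with `suffix`.
bool ends_with(const std::string& name, const std::string& suffix)
{
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: copy only the arguments that follow argv[0].
// Assumes argv[0] holds the program name, which is typical but not guaranteed.
std::vector<std::string> collect_files(int argc, char** argv)
{
    if (argc < 2)
        return std::vector<std::string>();   // no files were specified

    return std::vector<std::string>(argv + 1, argv + argc);
}
```

With this approach the processing loop would not need the “.exe” check at all, at the cost of trusting that the first argument is never a data file.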

Reading a File

Reading a file is as easy as the different C++ tutorials claim: grab an appropriate stream, point it to the file and if it is not bad, read a word at a time until there are no more words to read and then close the file. For this project, the appropriate stream is ifstream though I did have it working with wifstream and wide-characters. If you decide to go that route, use wstring.

My first attempt explicitly checked for a bad file like so:

    ifstream infile(file);
    if(!infile.bad())
    {
        string word;
        while(infile >> word) …

Turns out ifstream::bad() only reports badbit, which is set when an I/O operation fails irrecoverably. Failing to open a file sets failbit instead, so the stream is not “bad” even if the file is non-existent. I needed to use a different strategy, one that avoids the explicit “badness” check:

    ifstream infile(file);
    string word;

    // if we can pull a word (file is good)
    if(infile >> word)
    {
        // process all of the contents
        do
        {
        } while(infile >> word);
    }

This code “primes the pump” by test-reading a word from the file before processing the rest of the file. Per-file code (like counting the number of files actually processed) can be shoved after the test-word is successfully read.

If something goes horribly wrong while processing a file, we don’t try to pick up the pieces. When the infile falls out of scope, the file is closed.
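Another way to catch a missing file is to ask the stream whether the open succeeded. The sketch below is an alternative to the “prime the pump” approach, not the code from the attached program; count_words_in_file is a hypothetical helper name. One difference to be aware of: an empty but readable file counts as processed here, whereas the test-read approach would skip it.

```cpp
#include <fstream>
#include <map>
#include <string>

// Hypothetical helper: adds the words found in `path` to `words`.
// Returns false when the file could not be opened.
bool count_words_in_file(const std::string& path,
                         std::map<std::string, unsigned int>& words)
{
    std::ifstream infile(path.c_str());
    if (!infile.is_open())      // a failed open sets failbit, not badbit
        return false;

    std::string word;
    while (infile >> word)      // stops at end of file or on a read error
        ++words[word];

    return true;                // infile is closed when it goes out of scope
}
```

The return value gives the caller a natural place to count how many files were actually processed.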

Tracking Word Counts

Words are tracked in a map using the word as key and an unsigned integer to keep track of the number of occurrences. If a word is not in the map, operator[] adds it automatically with a count of zero, avoiding the need for extra code.

    map<string, unsigned int> words;
    string word;
    …
    words[word]++;

This code is on the “simple” end of the map complexity continuum; implementations can get ugly quick. I used unsigned int because word counts will not be negative – there are no “anti-words” in this exercise.
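To see the automatic insertion in isolation, here is a small sketch that counts the words in a string rather than a file. The function name count_words and the use of istringstream are choices made for this illustration only.

```cpp
#include <map>
#include <sstream>
#include <string>

// Illustrative helper: counts whitespace-delimited words in `text`.
std::map<std::string, unsigned int> count_words(const std::string& text)
{
    std::map<std::string, unsigned int> words;
    std::istringstream in(text);
    std::string word;

    while (in >> word)
        ++words[word];   // operator[] inserts a zero count the first time a word is seen

    return words;
}
```

Note that operator[] also inserts when you only mean to look something up, so a read-only check for a word is better done with find() or count().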

I made one tweak to the program once it was working. The original version counted contiguous chunks of characters delimited by one or more spaces without worrying about whether a character was punctuation, so dog and ‘dog’ each counted as a unique word. It bugged me so I looked for a way to remove a set of characters from a string. What I found was:

word.erase(remove_if(word.begin(), word.end(), &::isremovable), word.end());

The inner remove_if removes characters the custom function isremovable says should be removed (i.e., returns true), shifting all non-removable characters to the left. When done, remove_if returns an iterator pointing to the new end of the word. The outer word.erase removes characters from the new end to the actual end of the word.

This looked like mumbo jumbo until James explained it. It also helped stepping through example code that split the operations into separate lines. Once I got it, it seemed obvious, an “aha” moment that would help me dissect similar statements in the future. Hopefully!
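Splitting the statement into its two steps looks roughly like the sketch below. The post does not show isremovable, so the version here is an assumed stand-in that treats punctuation as removable; strip_removable is likewise a hypothetical wrapper.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Assumed stand-in for the post's isremovable(): punctuation is removable.
bool isremovable(char c)
{
    return std::ispunct(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical wrapper that performs erase-remove in two visible steps.
std::string strip_removable(std::string word)
{
    // Step 1: shift the characters to keep toward the front.
    // The string keeps its length; new_end marks where the kept part stops.
    std::string::iterator new_end =
        std::remove_if(word.begin(), word.end(), isremovable);

    // Step 2: chop off everything between the new end and the real end.
    word.erase(new_end, word.end());
    return word;
}
```

Stepping through this in the debugger shows that after step 1 the string still has its original length, which is why the erase call is needed.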

If I wanted to go fully modern C++, I’d replace the isremovable with a lambda, but then there would be too much going on in that one statement for this first attempt. :)
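For the curious, a lambda version might look something like this sketch. It assumes “removable” means punctuation, which the post does not confirm, and strip_punctuation is a hypothetical name.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: same erase-remove idiom, with the predicate written inline.
std::string strip_punctuation(std::string word)
{
    word.erase(std::remove_if(word.begin(), word.end(),
                              [](char c) {
                                  // cast first: ispunct expects an unsigned char value
                                  return std::ispunct(static_cast<unsigned char>(c)) != 0;
                              }),
               word.end());
    return word;
}
```

Whether this reads better than a named function is a matter of taste; the named function has the advantage of being reusable and easy to step into.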

Printing Results

My first try at printing to console netted the following:

    for (pair<string, long> c : words)
        cout << c.first << ":" << c.second << endl;

It worked but because the pair declaration is wrong (map keys are const), a temporary variable was created for each pair, making the program less efficient. I updated it using what I had learned working on the file processing code:

    for (const auto& c : words)
        cout << c.first << ":" << c.second << endl;

If I had defined a new type for the word map, I would have been able to use another mechanism but like lambdas, it can wait.

Gotchas!

My “gotchas”:

Selecting the wrong project template for the job. In my first go around, I chose a project type that included precompiled headers and handled Unicode. I had forgotten that includes need to go after the precompiled header include in the source file, resulting in some funky errors whose cause was not immediately obvious to me (I figured my C++ was wrong!). With Unicode came TCHAR, adding complexities around printing and manipulating strings. For small projects, start with an empty C++ project and write everything from scratch. It is easy to extend later.

Forgetting to include the right library. When this happened, I was certain I had the correct includes and so assumed the errors being thrown at me were from bad code. This is part of the learning curve. Double-check includes! Online docs and frequent compiles helped.

Getting buried in complexity. Part of my time was spent reviewing C++ information related to my task, in particular the STL and templates. It did not take too long to go from beginner content to the dragon’s den in an article, discussion thread or a few “related article” clicks. Some STL code cannot be unseen. Understand there is complexity, file it away, and refocus on the immediate goal.

Editor squigglies. On the first version of the project, I used “for each” when looping through files and word counts. The editor “squiggled” the container in each case yet the compiler had no complaints. Turns out “for each” is a Visual Studio extension; when I used “for”, the squiggles went away. The code was technically correct, but there was a better way. Verify the veracity of the squiggle – could the statement be tweaked to get rid of it?

Assuming the requirements and implementation were “simple”. Nothing beats a friendly code review from an expert. James reviewed the code and this post and identified quite a few fundamental “oops,” “d’ohs,” and “ughs”. Getting a program to work is one thing; making sure it is correct (and I could explain why) was much harder. Don’t avoid peer code reviews!

Most of these are pretty basic. Your gotchas may vary (YGMV).

Is C++ Hard?

Not if you start with a small, familiar project, tackle a few new concepts at a time, and use all the wonderful resources in the community (including the community itself).

I hope to tackle a more complex project next time. Stay tuned!

WordCountInFiles.cpp