Translating a Rosetta Code entry from Perl to C++

A. Sinan Unur, June 27, 2015

Several weeks ago, I pondered whether C++ can be used the way one would use a scripting language. A few days ago, Paddy3118 on Reddit commented:

If the latest C++ could improve on this Rosetta Code task entry for C++ then you might want to compare your improved C++ against the other scripting language solutions such as Perl, Python, Ruby, and Tcl. The scripting language solutions seem easier to read than the C++ and some show off their standard libraries by showing how easy it is to download the input data too. I chose the task as it involves word counting like the post but adds filtering the words to count which is a common scripting task. Note that the RC site is about showing idiomatic code for comparing programming languages and golfing is discouraged. (emphasis mine)

The task is:

Using the word list from http://www.puzzlers.org/pub/wordlists/unixdict.txt, check if the two sub-clauses of the phrase are plausible individually:

1. "I before E when not preceded by C"
2. "E before I when preceded by C"

If both sub-phrases are plausible then the original phrase can be said to be plausible. Something is plausible if the number of words having the feature is more than two times the number of words having the opposite feature (where feature is 'ie' or 'ei' preceded or not by 'c' as appropriate).
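The plausibility rule in the task statement boils down to a single comparison. Here is a minimal sketch of it; the name `is_plausible` is made up for illustration and does not come from the task or from any of the entries.

```cpp
// Hypothetical helper (not from the task page): a claim is plausible when the
// words supporting it are more than two times the words against it.
bool is_plausible(long support, long against) {
    // Multiplying instead of dividing keeps the comparison strict and avoids
    // a division by zero when there is no counter-evidence at all.
    return support > 2 * against;
}
```

With this definition, `is_plausible(5, 2)` holds but `is_plausible(4, 2)` does not, since the task asks for "more than two times".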

This morning, I went ahead and looked at the C++ entry. It is not horrible, but I wouldn't want to have to read this person's code for more complicated tasks. As an aside, while the difference between struct and class doesn't amount to much, it is still needlessly cute to define a class in C++ using struct.

Suffice it to say, I wouldn't have written the solution to this task in this manner even in 1995:

    #include <iostream>
    #include <fstream>
    #include <string>
    #include <tuple>
    #include <vector>
    #include <stdexcept>
    #include <boost/regex.hpp>

    struct Claim {
        Claim(const std::string& name)
            : name_(name), pro_(0), against_(0), propats_(), againstpats_() { }

        void add_pro(const std::string& pat) {
            propats_.push_back(std::make_tuple(boost::regex(pat), pat[0] == '^'));
        }
        void add_against(const std::string& pat) {
            againstpats_.push_back(std::make_tuple(boost::regex(pat), pat[0] == '^'));
        }
        bool plausible() const { return pro_ > against_*2; }
        void check(const char * buf, uint32_t len) {
            for (auto i = propats_.begin(), ii = propats_.end(); i != ii; ++i) {
                uint32_t pos = 0;
                boost::cmatch m;
                if (std::get<1>(*i) && pos > 0) continue;
                while (pos < len && boost::regex_search(buf+pos, buf+len, m, std::get<0>(*i))) {
                    ++pro_;
                    if (pos > 0) std::cerr << name_ << " [pro] multiple matches in: " << buf << "\n";
                    pos += m.position() + m.length();
                }
            }
            for (auto i = againstpats_.begin(), ii = againstpats_.end(); i != ii; ++i) {
                uint32_t pos = 0;
                boost::cmatch m;
                if (std::get<1>(*i) && pos > 0) continue;
                while (pos < len && boost::regex_search(buf+pos, buf+len, m, std::get<0>(*i))) {
                    ++against_;
                    if (pos > 0) std::cerr << name_ << " [against] multiple matches in: " << buf << "\n";
                    pos += m.position() + m.length();
                }
            }
        }
        friend std::ostream& operator<<(std::ostream& os, const Claim& c);
    private:
        std::string name_;
        uint32_t pro_;
        uint32_t against_;
        // tuple<regex, begins with ^>
        std::vector<std::tuple<boost::regex, bool>> propats_;
        std::vector<std::tuple<boost::regex, bool>> againstpats_;
    };

    std::ostream& operator<<(std::ostream& os, const Claim& c) {
        os << c.name_ << ": matches: " << c.pro_ << " vs. counter matches: " << c.against_ << ". ";
        os << "Plausibility: " << (c.plausible() ? "yes" : "no") << ".";
        return os;
    }

    int main(int argc, char ** argv) {
        try {
            if (argc < 2) throw std::runtime_error("No input file.");
            std::ifstream is(argv[1]);
            if (! is) throw std::runtime_error("Input file not valid.");

            Claim ieclaim("[^c]ie");
            ieclaim.add_pro("[^c]ie");
            ieclaim.add_pro("^ie");
            ieclaim.add_against("[^c]ei");
            ieclaim.add_against("^ei");

            Claim ceiclaim("cei");
            ceiclaim.add_pro("cei");
            ceiclaim.add_against("cie");

            {
                const uint32_t MAXLEN = 32;
                char buf[MAXLEN];
                uint32_t longest = 0;
                while (is) {
                    is.getline(buf, sizeof(buf));
                    if (is.gcount() <= 0) break;
                    else if (is.gcount() > longest) longest = is.gcount();
                    ieclaim.check(buf, is.gcount());
                    ceiclaim.check(buf, is.gcount());
                }
                if (longest >= MAXLEN) throw std::runtime_error("Buffer too small.");
            }

            std::cout << ieclaim << "\n";
            std::cout << ceiclaim << "\n";
            std::cout << "Overall plausibility: " << (ieclaim.plausible() && ceiclaim.plausible() ? "yes" : "no") << "\n";
        } catch (const std::exception& ex) {
            std::cerr << "*** Error: " << ex.what() << "\n";
            return -1;
        }
        return 0;
    }

In the middle of all the highfalutin jiggery-pokery, the author uses a fixed size char buffer to read the words. What's up with that?

I then looked at the Perl version:

    #!/usr/bin/perl
    use warnings;
    use strict;

    sub result {
        my ($support, $against) = @_;
        my $ratio  = sprintf '%.2f', $support / $against;
        my $result = $ratio >= 2;
        print "$support / $against = $ratio. ", 'NOT ' x !$result, "PLAUSIBLE\n";
        return $result;
    }

    my @keys = qw(ei cei ie cie);
    my %count;

    while (<>) {
        for my $k (@keys) {
            $count{$k}++ if -1 != index $_, $k;
        }
    }

    my ($support, $against, $result);

    print 'I before E when not preceded by C: ';
    $support = $count{ie} - $count{cie};
    $against = $count{ei} - $count{cei};
    $result += result($support, $against);

    print 'E before I when preceded by C: ';
    $support = $count{cei};
    $against = $count{cie};
    $result += result($support, $against);

    print 'Overall: ', 'NOT ' x ($result < 2), "PLAUSIBLE.\n";

There is one difference between the C++ and Perl versions. The C++ code above does account for the possibility that some words may contain a pattern multiple times. That is definitely not mentioned in the requirements, and I am not sure if one is allowed to use the same word as evidence more than once, but that's just a detail.
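To make that difference concrete, here is a small sketch of the two ways of counting. Both function names are invented for illustration. The first mirrors the Perl entry's index() test, and the second only approximates the regex loop in the C++ entry (whose `[^c]ie` pattern also consumes the character before the match).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `word` contains `pat` at least once, so a word
// is used as evidence at most once per pattern.
bool contains(const std::string& word, const std::string& pat) {
    return word.find(pat) != std::string::npos;
}

// Hypothetical helper: counts every non-overlapping occurrence of `pat` in
// `word`, so one word may be used as evidence several times.
std::size_t count_occurrences(const std::string& word, const std::string& pat) {
    if (pat.empty()) return 0;  // guard: an empty pattern would never advance
    std::size_t n = 0;
    for (std::size_t pos = word.find(pat); pos != std::string::npos;
         pos = word.find(pat, pos + pat.size())) {
        ++n;
    }
    return n;
}
```

For a word such as "deficiencies", `contains(word, "cie")` reports a single hit while `count_occurrences(word, "cie")` reports two, which is exactly the kind of case where the two entries can disagree.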

I wondered what would happen if I just rewrote the Perl entry as a C++ program with only the bare minimum of changes. Of course, C++ does not have Perl's magic diamond operator, but just reading from stdin ought to be fine.

It didn't take much time for me to come up with something that seems to work:

    #include <cerrno>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    static bool
    result(int support, int against) {
        auto rat (static_cast<double>(support) / against);
        auto is_plausible (rat >= 2.0);
        std::cout << support << " / " << against << " = " << rat;
        std::cout << (!is_plausible ? " NOT" : "") << " PLAUSIBLE\n";
        return is_plausible;
    }

    int
    main(int argc, char *argv[]) {
        const std::vector<std::string> keys {"ei", "cei", "ie", "cie"};
        std::unordered_map<std::string, int> count {};
        std::string word;

        while (std::cin >> word) {
            for (auto k: keys) {
                if (word.find(k) != std::string::npos) {
                    count[k] += 1;
                }
            }
        }

        int res(0);

        std::cout << "I before E when not preceded by C: " << "\n";
        res += result(count["ie"] - count["cie"], count["ei"] - count["cei"]);

        std::cout << "E before I when preceded by C: " << "\n";
        res += result(count["cei"], count["cie"]);

        std::cout << "Overall:" << ((res < 2) ? " NOT" : "") << " PLAUSIBLE\n";

        return 0;
    }

I don't know if one would call this idiomatic C++, but I think the nice thing about modern C++ is that one can ignore some of the uglier idioms and just write straightforward code quickly to solve a problem. Should solving such a simple problem actually need a lot of work?

Paddy3118 also mentioned that the Python entry uses urllib, and the Ruby entry uses open-uri to download the word list (but see also this caution). Of course, one could adapt my C example using libcurl to do the same in C++, but it is true that there really isn't a "standard" way of doing that from within a C++ program. However, lambdas and the flexibility of std::string should make libcurl much easier to use from C++.

A couple more observations.

I find the Python entry a little odd in that it makes four separate passes over the word list:

    import re
    import urllib.request

    def simple_stats(url='http://www.puzzlers.org/pub/wordlists/unixdict.txt'):
        words = urllib.request.urlopen(url).read().decode().lower().split()
        cie = len({word for word in words if 'cie' in word})
        cei = len({word for word in words if 'cei' in word})
        not_c_ie = len({word for word in words if re.search(r'(^ie|[^c]ie)', word)})
        not_c_ei = len({word for word in words if re.search(r'(^ei|[^c]ei)', word)})
        return cei, cie, not_c_ie, not_c_ei

Its memory footprint is proportional to the size of the word list. This kind of thing can work in toy examples, but sooner or later one will feel the cost of slurping the whole input rather than reading it in small chunks.

Adding another pattern requires scanning the word list once more, as well as returning an additional variable.

The Ruby and C++ entries use regular expressions when a simple search would have sufficed.
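As an illustration of what "a simple search" could look like in place of a pattern such as `(^ie|[^c]ie)`, here is a sketch built on `std::string::find`. The function name is invented, and it assumes each word is a single line of plain lowercase text, as in the word list.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `word` contains `pair` ("ie" or "ei") somewhere
// that is not immediately preceded by 'c'. For pair == "ie" this selects the
// same words as searching for the regex (^ie|[^c]ie).
bool has_pair_not_after_c(const std::string& word, const std::string& pair) {
    for (std::size_t pos = word.find(pair); pos != std::string::npos;
         pos = word.find(pair, pos + 1)) {
        // A match at the very start of the word has nothing before it.
        if (pos == 0 || word[pos - 1] != 'c') return true;
    }
    return false;
}
```

Note that this is not quite the same as the subtraction the Perl entry performs. A word like "societies" contains both "cie" and an "ie" that does not follow a "c", so this function counts it in favor of the first clause, whereas `$count{ie} - $count{cie}` cancels it out.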

I really didn't look at any of the other entries. I just wanted to see what would happen if I basically took a Perl script and rewrote it in standard C++. I am pleased with the ease with which I was able to do that.

PS: You can discuss this post on /r/cpp.