Generate k random samples of n draws from a set of distributions in C++ A. Sinan Unur December 21, 2016

In my quest to improve the tests in Boost.Accumulators, I decided to generate some random samples (i.e. sets of draws) from known distributions and compare the results produced by the P2 algorithm as implemented by a well-respected data analysis software package to the medians calculated using the Boost.Accumulators implementation. It turns out, the well-respected software does not implement the P2 algorithm in the way described in the original paper. It does something proprietary (which, of course, means that the algorithm they implement is no longer the P2 algorithm). The person who responded to my inquiry clung to the observation that their method produced an approximation closer to the actual sample median in the example I provided, but that is beside the point: This creates a situation where two reseachers thinking they are both using the same algorithm may report different results.

I am going to write in more detail about this later, but it's depressing. In the mean time I decided the best way to cheer myself up is to write a quick C++ program to generate the pseudo-random samples I am going to use to compare the two implementations. While I have always liked C++, it has recently become a lot more fun (assuming you have a large scrollback buffer in case you get any template related errors or warnings when compiling your code ;-)

I am merely a human being, so I belong to the camp who thinks the standard <random> facilities are a more than a little convoluted. Luckily for the lazy programmer in me, Melissa O'Neill has put together a really useful header only library which makes it possible for the rest of us to use the standard facilities with much less effort.

So, let's say I have M distributions. For each distribution, I want to generate K pseudo-random samples of various sizes N 1 , N 2 , N 3 … In the example below, I am going to generate samples from the normal, exponential, and uniform distributions. My purpose is to save these samples and compare the original and proprietary implementations of the P2 algorithm. The question I am asking is, for each set of samples of each size from the distributions under consideration, how close are the medians calculated using these implementations to the true sample medians and to each other. In particular, I am not testing any statistical hypotheses.

I hard coded a number things to focus on the essentials of this simple demonstration for a blog post. Slapping some command line arguments is not that hard. The nicest part of randutils.hpp is that it gives one the ability to write:

std::vector<rng_t> rngs(nsamples);

to get a vector of properly seeded pseudo-random number generators. This is helpful in many ways, but it also helps to minimize the memory footprint of the program. Having this facility means the memory footprint of the program depends on the number of samples rather than sample sizes. I can draw the i th observation for each sample, print the vector of observations to a file and move on. For generating millions of draws for each sample, it would probably be better to write 100 or even 1,000 lines at a time to cut down on filesystem accesses, but this will do for this demo.

For each distribution and each sample size, I get a single file containing all K samples in columns. This is how most statistical software expects to see data: Variables in columns, observations in rows.

Here is the main function:

int main() { std::unordered_map<std::string, std::function<double(rng_t&)> > pickers { { "normal_25_10", [] (rng_t& g) { return pick_normal(g, 25.0, 10.0); } }, { "uniform_20_30", [] (rng_t& g) { return pick_uniform(g, 20.0, 30.0); } }, { "exponential_25", [] (rng_t& g) { return pick_exponential(g, std::log(2.0)/25.0); } }, }; for (auto p: pickers) { gen_samples(5, 50, p.first, p.second); } return 0; }

The pick_* functions are small utility functions I wrote to reduce line width and to be able to put together that neat table which associates a label with each specific distribution. For each of those distributions, I chose parameters such that the population median is about 25 (it doesn't matter much, but you've gotta choose something). Thanks to the lambda facility, it is easy to map those labels to simple functions which take a PRNG and return a draw from a given distribution.

Here are the helper functions:

typedef randutils::mt19937_rng rng_t; static inline double pick_normal(rng_t& g, const double mu = 0.0, const double sigma = 1.0) { return g.variate<double, std::normal_distribution>(mu, sigma); } static inline double pick_uniform(rng_t& g, const double lo = 0.0, const double hi = 1.0) { return g.variate<double, std::uniform_real_distribution>(lo, hi); } static inline double pick_exponential(rng_t& g, const double lambda = 1.0) { return g.variate<double, std::exponential_distribution>(lambda); }

where the concise and expressive variate calls exist thanks to randutils.hpp .

The only thing left is the gen_samples function:

static void gen_samples( const std::size_t nsamples, const std::size_t nobs, const std::string& label, const std::function<double(rng_t&)>& f ) { const std::string fname( std::to_string( nobs ) + "-" + std::to_string( nsamples) + "-" + label + ".txt" ); std::ofstream fout( fname ); std::vector<rng_t> rngs(nsamples); for(std::size_t i = 0; i < nobs; ++i) { std::vector<double> obs(nsamples); for(std::size_t j = 0; j < nsamples; ++j) { obs[j] = f( rngs[j] ); } std::copy( obs.begin(), obs.end(), std::ostream_iterator<double>(fout, "\t") ); fout << '

'; } fout.close(); }

It takes the number of samples, the number of draws in each sample, the label for the generator (which is used to derive the name of the file in which to save the samples), and the picker function as arguments, and generates a file containing nsamples (K above) samples in columns and nobs rows ( N i above) from a given distribution. For example, in the original paper, the authors used 50 samples for sample sizes less than 500. This little routine generates 150 samples of size 1,000 in under half a second. Nice.

I wonder if there is a way to replace:

for(std::size_t j = 0; j < nsamples; ++j) { obs[j] = f( rngs[j] ); }

with something more elegant.

PS: You can discuss this post on r/programming.

PPS: Here is the complete program for your reference: