sample

Produce a sample of lines from files. The sample size is either fixed or proportional to the size of the file. Additionally, the header and footer can be included in the sample.

Red tape

no dependencies other than a POSIX system and a C99 compiler.

licensed under BSD3c

Resources

Features

proportional sampling of streams and files

header and footer can be included in the sample

reservoir sampling (fixed sample size) of streams and files

stable reservoir sampling (i.e. the order is preserved)

Motivation

Practically ubiquitous, there’s shuf -n of GNU coreutils, a tool that, in principle, solves the problem at hand. However, shuf buffers all input and is therefore useless for files that don’t fit in memory.

So, looking for alternatives one may come across paulgb’s subsample or earino’s fast_sample. They usually do the trick and everyone seems to agree (judged by github stars). However, both tools have short-comings: they try to make sense of the line data semantically, and secondly, they are slow!

The first issue is such a major problem that their bug trackers are full of reports. subsample needs lines to be UTF-8 strings and fast_sample wants CSV files whose correctness is checked along the way. This project’s tool, sample , on the other hand does not care about the line’s content, all it needs are those line breaks at the end.

The speed issue is addressed by

using the most appropriate programming language for the problem

using radix sort

using the PCG family to obtain randomness

oversampling

Examples

To get 10 random words from the words file:

$ sample -n 10 -H 0 /usr/share/dict/words ... benzopyrene calamondins cephalothorax copulate garbology's Kewadin Peter's reassembly Vienna's Wagnerism's ...

The -H 0 produces 0 lines of header output which defaults to 5.

For proportional sampling use -r|--rate :

$ wc -l /usr/share/dict/words 305089 $ sample -r 1% /usr/share/dict/words | wc -l 3080

which is close to the true result bearing in mind that by default the header and footer of the file is printed as well.

Sampling with a rate of 0 replaces awkward scripts that use multios and head and tail to produce the same result.

$ sample -r 0 /usr/share/dict/words A AA AAA Aachen aah ... Zyuganov Zyuganov's zyzzyva zyzzyvas ZZZ

Similar projects

In no particular order and without any claim to completeness: