Recently I wrote about C++’s standard binary search algorithms (yes, four of them!) which do such a fine job of:

specifying exactly what kind of range a binary search requires

separating the core algorithm from the details of the range it’s working on

delivering precise results

To support these claims I included an implementation of a file iterator, suitable for use with std::binary_search() etc. to efficiently locate values in very large files.

Now, there are a couple of issues with this approach:

we had to write a lot of code to make a file iterator suitable for use with standard algorithms this file iterator only works on highly structured files, where each value occupies a fixed number of bytes

In this follow up article we’ll consider each of these issues in a little more depth by working through two very different solutions to a related problem.

Suppose, once again, that we have a large file, a few gigabytes, say. The file contains numbers, in order, and we’re interested in testing if this file contains a given number. This time, though, the file is a text file, where the numbers are represented in the usual way as sequences of digits separated by whitespace.

$ less lots-of-numbers ... 10346 11467 11469 11472 11501 11662 12204 12290 ...

Note that a number in this file does not occupy a fixed number of bytes. If we jump to a new position in the file using a seek operation, we cannot expect to land exactly where a number starts. Thus the random access file iterator we developed last time won’t work.

In C++ an input file is an example of an input stream, and the standard library gives us istream_iterators which perform formatted input. In our case, an istream_iterator<int> effectively converts the file into a stream of numbers.

Istream iterators are input iterators. They progress through the input stream, item by item, with no repeating or rewinding allowed. Despite their limitations, the C++ standard library provides some algorithms which require nothing more than basic input iterators. For example, to count up even numbers in the file whose name is supplied on the command line we might use std::count_if .

#include <algorithm> #include <fstream> #include <iostream> #include <iterator> bool is_even(int x) { return x % 2 == 0; } int main(int argc, char * argv[]) { typedef std::istream_iterator<int> i_iter; typedef std::ostream_iterator<int> o_iter; std::ifstream in(argv[1]); std::cout << std::count_if(i_iter(in), i_iter(), is_even) << '

'; return 0; }

The next version of C++ supports lambda functions, so you’ll be able to put is_even right where it’s used, in the count_if() function call. Or, with the current version of C++, you could write:

Ouch! .... std::count_if(i_iter(in), i_iter(), std::not1(std::bind2nd(std::modulus<int>(), 2)))

Maybe not!

The very simplest search algorithm, std::find , needs nothing more than an input iterator. To determine if a number is in a file, we could just invoke std::find .

typedef std::istream_iterator<int> i_iter; bool is_number_in_file(char const * filename, int n) { std::ifstream in(filename); i_iter begin(in); i_iter end; return std::find(begin, end, n) != end; }

Here, the find algorithm advances through the numbers in the file, from start to finish, stopping as soon as it hits one equal to the supplied value, n . We can expect this function to be light on memory use — there will be some buffering at the lower levels of the file access, but nothing more — and the function is evidently correct.

It would be correct even if our file was unsorted, however. Is there any way we can do better?

In the previous article we developed a random access iterator for accessing binary files, and usable for efficient binary searches of sorted binary files. Now would be a good time to question the problem specification. Is this a one off? Or are we going to be testing the presence of more numbers in the file in future? And if so, can we convert the file to binary to save time in the long run?

Although I’m not going to pursue this option here, it may well be the best approach. For now, though, let’s assume we have a one-off problem to solve, and that we aren’t allowed to tinker with the input.

If we want to use std::binary_search we need, as a minimum, forward iterators. Like input iterators, forward iterators advance, one step at a time. Unlike input iterators, you can copy a forward iterator and dereference or advance that copy in future, independently of the original.

Forward iterators are suitable for multipass algorithms, such as std::search , which looks for the first occurrence of a sequence within a sequence (a generalised strstr , if you like), or std::adjacent_find and std::search_n which look for repeated elements; and of course std::binary_search , which is our immediate interest.

Wouldn’t it be nice if we could convert our istream iterators into forwards iterators? Then we could plug them directly into all these algorithms.

Other languages allow this. You can replicate streams in the Unix shell with tee . And you can do something similar in Python, thanks to one of the standard iterator tools. Independent iterators over the same sequence needed? Itertools.tee is your friend. The example below codes up adjacent find in Python.

from itertools import tee import sys def adjacent_find(xs): '''Does the supplied iterable contain any adjacent repeats? Returns True if xs contains two consecutive, equal items, False otherwise. ''' try: curr, next_ = tee(xs) next(next_) return any(c == n for c, n in zip(curr, next_)) except StopIteration: return False

Why, even Java has an iterator adaptors, courtesy of Jez Higgins’ Mango library.

What about C++? I couldn’t find any such iterator adaptors in the standard library, but I turned up something in the standard library research and development unit, also known as Boost.

Boost.Spirit is a remarkable C++ parser framework, which uses operator overloading to represent parsers directly as EBNF grammars in C++. Somewhere in its depths it tracks back, and hence must adapt input iterators into forward iterators — or multipass iterators, to use its own term.

The multi_pass iterator will convert any input iterator into a forward iterator suitable for use with Spirit.Qi. multi_pass will buffer data when needed and will discard the buffer when its contents is not needed anymore. This happens either if only one copy of the iterator exists or if no backtracking can occur.

What’s good enough for parsing is more than good enough for searching. Here’s a function which detects whether a number is in a file. Most of the code here just includes the right headers and defines some typedefs. By leaning on high quality support libraries we’ve overcome our first issue: we no longer have to write loads of code just to call binary search!

Boost spirit multipass iterators #include <fstream> #include <iterator> #include <algorithm> #include <boost/spirit/include/support_multi_pass.hpp> namespace spirit = boost::spirit; typedef long long number; typedef std::istream_iterator<number> in_it; typedef spirit::multi_pass<in_it> fwd_it; /* Returns true if the input number can be found in the named file, false otherwise. The file must contain ordered, whitespace separated numbers. */ bool is_number_in_file(number n, char const * filename) { std::ifstream in(filename); fwd_it begin = spirit::make_default_multi_pass(in_it(in)); fwd_it end = spirit::make_default_multi_pass(in_it()); return std::binary_search(begin, end, n); }

If this library-based solution looks too good to be true, that’s because it is! As we noted before, the standard binary search algorithm may indeed work with forward iterators, but it works far better with random access iterators. There’s no point reducing the number of integer comparisons to O(log(N)) if we’re going to advance our iterators O(N) times.

What’s worse, these multipass iterators aren’t magic. Did you read the smallprint concerning Python’s tee iterator?

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored).

If teed iterators diverge, intervening values have to be stored somewhere, and the same appears to be true of our inscrutable multipass iterators. Huge chunks of our large input file are buffered into memory. When I ran this function to confirm the presence of a single number somewhere near the middle of a 4.4GB input file, it took over 19 minutes.

real 19m13.675s user 5m19.219s sys 1m26.278s

Much of this time was spent paging.

As a comparison, testing for the same value using find took just under 3 minutes.

real 2m48.139s user 2m21.336s sys 0m7.252s

You’ll have noticed that we used default multipass iterators. These iterators permit multi-dimensional customisation. I wasn’t feeling brave enough to attempt a template storage policy class, and I very much doubt I could have beaten a simple linear find anyway; anything built on a generic input iterator is unlikely to solve our problem efficiently.

We can beat std::find with a bit of ingenuity. Standard istream iterators are useful but, in this case, not a good starting point. A better idea is to create a novel iterator which uses file seek operations to advance through the file, then fine-tunes the file position to point at a number.

Consider an imagine an iterator which can be positioned at any seekable position in the file, and which we dereference to be the first number in the file which ends at or after that position. The graphic below shows a file with 11 seekable positions, 0 through 10 inclusive.

positions 0 and 1 dereference to the number 42



positions 2, 3, 4 and 5 dereference to the number 57

positions 6, 7, 8 and 9 dereference to the number 133

it is an error to try and dereference position 10, at the end of the file

Now, this is a rather unusual iterator. It iterates over the numbers in the file, but each number gets repeated for every byte in the file it occupies. Despite this duality it’s perfectly usable — so long as we keep a clear head. Binary searches are fine.

How does this version perform?

Recall, a linear search for a single value in the middle of a 4.4GB took nearly 3 minutes. Running 10 binary searches through the same file took just 40 milliseconds — that’s a rate of 25 searches a second!

Here’s our weird new iterator. It should be usable on files containing whitespace separated items of any type for which the stream read operator>>() has been defined.

There’s quite a lot of code here, but much of it is random access iterator scaffolding. The interesting functions are the private implementation details towards the end of the class.

#include <fstream> #include <ios> #include <iterator> #include <stdexcept> #include <ctype.h> // File read error, thrown when low level file access fails. class file_read_error : public std::runtime_error { public: file_read_error(std::string const & what) : std::runtime_error(what) { } }; /* Here's an unusual iterator which can be used to binary search for whitespace-separated items in a text file. It masquerades as a random access iterator but a file is not usually a random access device. Nonetheless, file seek operations are quicker than stepping through the file item by item. The unusual thing is that the iterators correspond to file offsets rather than items within the file. Here's a short example where the items are numbers. +---+---+---+---+---+---+---+---+---+---+ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | +---+---+---+---+---+---+---+---+---+---+ |'4'|'2'| | |'5'|'7'| |'1'|'3'|'3'| +---+---+---+---+---+---+---+---+---+---+ The graphic shows a text file which contains 3 numbers, 42, 57, 133, separated by whitespace. The file itself is 10 bytes long, and hence there are 11 iterators over the file, corresponding to actual file positions (including the one-past-the end position). To dereference an iterator, we step back through the file until we reach either whitespace or the start of the file. Then we look forwards again and read in the next item. In the graphic above: - Iterators 0 and 1 point to number 42 - Iterators 2, 3, 4 and 5 point to number 57 - Iterators 6, 7, 8, 9 point to number 133 - Iterator 10 is the end, and must not be dereferenced Dereferencing an iterator always returns an item which is in the file, and all items in the file have iterators pointing to them, so std::binary_search based on these iterators is valid. The iterators also expose their underlying file positions directory (via the getpos() member function), and with a little thought we can make use of std::lower_bound() and std::upper_bound(). */ template <typename item> class text_file_item_iter { typedef text_file_item_iter<item> iter; private: // Sanity // Check things are OK, throwing an error on failure. void check(bool ok, std::string const & what) { if (!ok) { throw file_read_error(what); } } public: // Traits typedefs, which make this class usable with // algorithms which need a random access iterator. typedef std::random_access_iterator_tag iterator_category; typedef item value_type; typedef std::streamoff difference_type; typedef item * pointer; typedef item & reference; enum start_pos { begin, end }; public: // Lifecycle text_file_item_iter(iter const & other) : fname(other.fname) { open(); setpos(other.pos); } text_file_item_iter() : pos(-1) { } text_file_item_iter(std::string const & fname, start_pos where = begin) : fname(fname) , pos(-1) { open(); if (where == end) { seek_end(); } } ~text_file_item_iter() { close(); } iter & operator=(iter const & other) { close(); fname = other.fname; open(); setpos(other.pos); return *this; } public: // Comparison // Note: it's an error to compare iterators over different files. bool operator<(iter const & other) const { return pos < other.pos; } bool operator>(iter const & other) const { return pos > other.pos; } bool operator==(iter const & other) const { return pos == other.pos; } bool operator!=(iter const & other) const { return pos != other.pos; } public: // Iteration iter & operator++() { return *this += 1; } iter & operator--() { return *this -= 1; } iter operator++(int) { iter tmp(*this); ++(*this); return tmp; } iter operator--(int) { iter tmp(*this); --(*this); return tmp; } public: // Step iter & operator+=(difference_type n) { advance(n); return *this; } iter & operator-=(difference_type n) { advance(-n); return *this; } iter operator+(difference_type n) { iter result(*this); return result += n; } iter operator-(difference_type n) { iter result(*this); return result -= n; } public: // Distance difference_type operator-(iter & other) { return pos - other.pos; } public: // Access value_type operator*() { return (*this)[0]; } value_type operator[](difference_type n) { std::streampos restore = getpos(); advance(n); value_type const result = read(); setpos(restore); return result; } // Allow direct access to the underlying stream position std::streampos getpos() { std::streampos pos_ = in.tellg(); check(in, "getpos failed"); return pos_; } private: // Implementation details void open() { in.open(fname.c_str(), std::ios::binary); check(in, "open failed"); pos = getpos(); } void close() { if (in.is_open()) { in.close(); check(in, "close failed"); } } void advance(difference_type n) { check(in.seekg(n, std::ios_base::cur), "advance failed"); pos = getpos(); } void seek_end() { check(in.seekg(0, std::ios_base::end), "seek_end failed"); chop_whitespace(); pos = getpos(); } void chop_whitespace() { do { in.unget(); } while (isspace(in.peek())); in.get(); in.clear(); } void setpos(std::streampos newpos) { check(in.seekg(newpos), "setpos failed"); pos = newpos; } // Return the item at the current position value_type read() { item n = 0; // Reverse till we hit whitespace or the start of the file while (in && !isspace(in.peek())) { in.unget(); } in.clear(); check(in >> n, "read failed"); return n; } private: // State std::string fname; std::ifstream in; std::streampos pos; };

Model Name: MacBook Model Identifier: MacBook1,1 Processor Name: Intel Core Duo Processor Speed: 2 GHz Number Of Processors: 1 Total Number Of Cores: 2 L2 Cache (per processor): 2 MB Memory: 2 GB Bus Speed: 667 MHz

Initially the Boost.Spirit solution looked promising but we pushed it too hard. Suitable abstractions can remove complexity; but they can also hide it. When efficiency matters, we need a handle on what’s going on.

After this false start we did find a way to create a file iterator suitable for use with the standard binary search algorithms. Use it with care, though!