Description

The C++ String Toolkit Library (StrTk) consists of robust, optimized and portable string processing algorithms for the C++ language. StrTk is designed to be easy to use and integrate within existing code bases. Furthermore the library has a rich set of features that makes light work of any kind of string processing task.

Capabilities

The StrTk library has the following capabilities:

Generic string tokenizer and token iterators

Fast Split routines

User specified delimiter and splitter policies (simple and regex based etc.)

Conversions between data and hex and base-64

In-place removal and replace routines

Wild-card matching and globbing

Search and Replace

Fast token grid and Comma Separated Values (CSV and DSV) processing

Extremely fast String to Integer, Double and other POD conversions

Extensible string processing templates and algorithms (eg: combinatorics and randomizations)

Prefix tree and Bloom filter structures for efficient string matching and testing

Fast and simple-to-use binary serialization

Seamless integration with STL and Boost

Easy to use wrappers of common string processing usage patterns

Single header file solution requires no installation or building

C++ String Toolkit Library License

Free use of the String Toolkit Library is permitted under the guidelines and in accordance with the MIT License.

Compatibility

The C++ String Toolkit Library implementation is compatible with the following C++ compilers:

GNU Compiler Collection (4.1+)

Intel® C++ Compiler (9.x+)

Clang/LLVM (1.1+)

PGI C++ (10.x+)

Microsoft Visual Studio C++ Compiler (7.1+)

IBM XL C/C++ (10.x+)

C++ Builder (XE4+)

Download

The C++ String Toolkit Library (StrTk) is a general purpose string processing library. It's design principles centers around efficiency and ease of use. This tutorial will primarily cover string parsing and tokenization functionalities of the library.



StrTk provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.

The tokenization process can be further parametrized by switching between "compressed delimiters" or "no compressed delimiters" mode. This essentially means that consecutive delimiters will be compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input. The table below represent 'no compressed' and compressed tokenization processes respectively. The delimiter in this instance is a pipe symbol | and <> denotes an empty token.

No Compressed Delimiters Compressed Delimiters Input Token List Input Token List a a a a a|b a,b a||b a,b a||b a,<>,b |a||b <>,a,b |a <>,a ||a||b|| <>,a,b,<> a| a,<> | <>,<> |a||b <>,a,<>,b || <>,<> ||a||b|| <>,<>,a,<>,b,<>,<> ||| <>,<> | <>,<> || <>,<>,<> ||| <>,<>,<>,<>

Two forms of delimiters are supported and they are single delimiter predicate and multiple delimiters predicate otherwise known as SDP and MDP respectively. Essentially an SDP is where there is only one type that can break or fragment the sequence, where as with MDPs there is more than one unique type that can break the sequence. It is possible to represent a SDP using the MDP, however from a performance POV having separate predicates is far more efficient. Additionally for strings based on char or unsigned char (8-bit versions) there is a MDP that has a look-up complexity of O(1) making it greatly more efficient than the basic MDP.

Single Delimiter Predicate

Instantiation requires specialization of type and construction requires an instance of the type.

strtk::single_delimiter_predicate<typename T>(const T& t) strtk::single_delimiter_predicate<std::string::value_type> predicate('|');

Multiple Delimiter Predicate

Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.

strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end); std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+"; strtk::multiple_delimiter_predicate mdp1(str_delimiters.begin(),str_delimiters.end()); unsigned int uint_delimiters[5] = {1, 10, 101, 1010, 10101}; strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);

Multiple Char Delimiter Predicate

Instantiation requires an iterator range based on either unsigned char or char. This delimiter is more efficient than the simple MDP as it has a predicate evaluation of O(1) due to the use of a lookup-table as opposed to O(n) where n is the number of delimiters. Performance increase is seen as more delimiters are used.

strtk::multiple_char_delimiter_predicate(const std::string&) strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin, const std::string::const_iterator end) //usage: strtk::multiple_char_delimiter_predicate predicate(" .;?");

This is a function that performs tokenization over an entire sequence in one go. strtk::split takes a sequence through a range of iterators or in the case of a string through a reference to its instance, a delimiter predicate and an output iterator or otherwise known as a type sink. It populates the output iterator with the tokens it extracts. The tokens in this case are std::pairs of iterators for the sequence.

Split can be used in a "simple - no frills" manner by simply passing the necessary parameters:

std::string str = "abc|123|xyz|789"; strtk::std_string::token_list_type token_list; strtk::split(" |.;?", str, std::back_inserter(token_list));

strtk::split can also be used in a more explicit manner whereby the exact type of delimiter predicate can be specified by the user:

// split using strtk predicates { std::string str = "abc|123|xyz|789"; strtk::std_string::token_list_type token_list; strtk::single_delimiter_predicate predicate('|'); strtk::split(predicate, str, std::back_inserter(token_list)); } // split using a lambda as a predicate { std::string data = "abc|123|xyz|789"; std::deque<std::string> token_list; strtk::split([](const char c) { return '|' == c; }, data, strtk::range_to_type_back_inserter(token_list)); }

strtk::split provides an additional usage option that allows the user to specify if they would like to either compress the delimiters and whether or not they would like to include the delimiter as part of the token range. This enum parameter is called strtk::split_options and has the following values:

Split Option Definition strtk::split_options::default_mode Default options strtk::split_options::compress_delimiters Consecutive delimiters are treated as one strtk::split_options::include_1st_delimiter The first delimiter is included in the resulting token range strtk::split_options::include_delimiters All delimiters are included in the resulting token range

The simple example below demonstrates a split that will occur over a string given a predicate where the provided split options indicate that consecutive delimiters will be treated as one and also all delimiters encountered after each token will also be included in the token up until the next token begins.

strtk::split(predicate, str, std::back_inserter(token_list), strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);

Another way of using split may be to use the strtk::multiple_char_delimiter_predicate as follows:

std::string str = "abc?123;xyz.789"; strtk::std_string::token_list_type token_list; strtk::multiple_char_delimiter_predicate predicate(" .;?"); strtk::split(predicate, str, std::back_inserter(token_list));

The contents of the token_list can be printed out as follows:

strtk::std_string::token_list_type::iterator itr = token_list.begin(); while (token_list.end() != itr) { std::cout << (*itr) << '\t'; ++itr; }

A natural extension of strtk::split is strtk::split_n. This function provides the ability to tokenize a sequence up until a specific number of tokens have been encountered or until there are no more tokens left. The return value of the strtk::split_n would be the number of tokens encountered.

std::string data = "token1?token2,token3;token4,token5"; strtk::std_string::token_list_type token_list; const std::size_t token_count = 4; const std::string delimiters (" ,.;?"); strtk::split_n(delimiters, data, token_count, std::back_inserter(token_list)); strtk::std_string::token_list_type::iterator itr = token_list.begin(); while (token_list.end() != itr) { std::cout << "[" << (*itr) << "]\t"; ++itr; } std::cout << std::endl;

Another simple variant is the strtk::offset_splitter. This form of split takes a series of offsets and an iterator range or string and determines the tokens based on the offsets. This function can be set to perform a single pass of the offsets or to rotate them until the range has been completely traversed. The example below demonstrates how a string representing date and time can be tokenized into its constituent parts (year, month, day, hour, minutes, seconds, milliseconds)

std::string time_data = "20000101091011345"; // 2000-01-01 9:10:11sec 345ms const std::size_t offset_list_size = 7; const int offset_list[offset_list_size] = { 4, // year 2, // month 2, // day 2, // hour 2, // minute 2, // second 3 // ms }; const strtk::offset_predicate predicate(offset_list); strtk::std_string::token_list_type token_list; strtk::offset_splitter(time_data,predicate,std::back_inserter(token_list)); strtk::std_string::token_list_type::iterator itr = token_list.begin(); while (token_list.end() != itr) { std::cout << "[" << (*itr) << "] "; ++itr; } std::cout << std::endl;

Another form of splitter is based on the concept of using regular expressions as the delimiter predicate. Below is a simple example of splitting tokens wrapped in round-brackets.

std::string str = "(12)(345)(6789)(0ijkx)(yz)"; std::list<std::string> token_list; strtk::split_regex("\\(.*?\\)", s, std::back_inserter(token_list), strtk::regex_match_mode::match_1); std::list<std::string>::iterator itr = token_list.begin(); while (token_list.end() != itr) { std::cout << "[" << (*itr) << "]\t"; ++itr; } std::cout << std::endl;

Note: The parameter regex_match_mode represents the capture of the marked sub-pattern in the current match. By default it is strtk::regex_match_mode::match_all which in this case would provide the entire match including the bounding pattern, eg: Token3 would be (0ijkx).



However in the above example we are only interested in the sub-pattern between the round-brackets, hence strtk::regex_match_mode::match_1 is used resulting in Token3 being 0ijkx.

The following examples demonstrate the use of strtk::split_regex and strtk::split_regex_n routines in extracting specific types of data - in this case the POD types int and double.

int main() { { // Extract ints from data string std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+"; std::deque<int> int_list; strtk::split_regex("([+-]?([\\d]+))", data, strtk::range_to_type_back_inserter(int_list), strtk::regex_match_mode::match_1); } { // Extract doubles from data string std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^="; std::deque<double> double_list; strtk::split_regex(strtk::ieee754_expression, data, strtk::range_to_type_back_inserter(double_list), strtk::regex_match_mode::match_1); } { // Extract the first 3 ints from data string std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+"; std::deque<int> int_list; strtk::split_regex_n("([+-]?([\\d]+))", data, 3, strtk::range_to_type_back_inserter(int_list), strtk::regex_match_mode::match_1); } { // Extract the first 4 doubles from data string std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^="; std::deque<double> double_list; strtk::split_regex_n(strtk::ieee754_expression, data, 4, strtk::range_to_type_back_inserter(double_list), strtk::regex_match_mode::match_1); } return 0; }

The following table describes the various regex patterns built into StrTk which can be used seamlessly with the strtk::split_regex and strtk::split_regex_n routines.

Regex Definition strtk::uri_expression URI, URL address

e.g.: http://www.example.com, domain.example.net/index.html strtk::email_expression E-mail address

e.g.: some.one@example.com strtk::ip_expression IPv4 address

e.g.: 192.168.0.1, 127.0.0.1 strtk::ieee754_expression Floating point value

e.g.: 1.1, 1.234e-123, -1.0001E+10, 0.1234

The tokenizer is specialized on a sequence iterator and predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Definitions exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means for accessing the tokens in the sequence.

const unsigned int data_size = 12; unsigned int data[data_size] = {1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 0, 9}; strtk::single_delimiter_predicate predicate(0); typedef strtk::tokenizer > tokenizer_type; tokenizer_type tokenizer(data, data + data_size, predicate);

Similar to that of strtk::split, strtk::tokenizer provides tokenizing options that are passed in during construction. Below is a table depicting said options:

Tokenize Option Definition strtk::tokenize_options::default_mode Default options strtk::tokenize_options::compress_delimiters Consecutive delimiters are treated as one strtk::tokenize_options::include_1st_delimiter The first delimiter is included in the resulting token range strtk::tokenize_options::include_delimiters All delimiters are included in the resulting token range

typedef strtk::single_delimiter_predicate<unsigned int> sdp_t; typedef strtk::tokenizer<unsigned int*,sdp_t> tokenizer_type; tokenizer_type tokenizer(data, data + data_size, predicate, strtk::tokenize_options::compress_delimiters | strtk::tokenize_options::include_1st_delimiter);

Furthermore, iteration over tokens using strtk::tokenizer is performed as follows:

tokenizer_type::iterator itr = tokenizer.begin(); while (tokenizer.end() != itr) { std::copy((*itr).first,(*itr).second,std::ostream_iterator (std::cout," ")); std::cout << std::endl; ++itr; }

A typical std::string can be tokenized in the following manner:

std::string str = "abc|123|xyz|789"; strtk::std_string::tokenizer<>::type tokenizer(str,"|"); strtk::std_string::tokenizer<>::type::iterator itr = tokenizer.begin(); while (tokenizer.end() != itr) { std::cout << "[" << (*itr) << "]\t"; ++itr; } std::cout << std::endl;

Another common situation may be tokenizing a sequence of strings, such as the following:

const std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" , "mno" , "delimiter" , "rst" , "uvw" , "delimiter" , "xyz" }; const std::size_t str_list_size = sizeof(str_list) / sizeof(std::string); strtk::range_adapter<std::string> range(str_list,str_list_size); strtk::single_delimiter_predicate<std::string> predicate("delimiter"); typedef strtk::tokenizer <std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer_type; tokenizer_type tokenizer(range.begin(), range.end(), predicate); tokenizer_type::iterator itr = tokenizer.begin(); while (tokenizer.end() != itr) { std::copy((*itr).first, (*itr).second, std::ostream_iterator<std::string>(std::cout," ")); std::cout << std::endl; ++itr; }

Note: For performance and efficient resource management purposes the strtk::tokenizer does not take ownership or make an internal copy of the sequence being tokenized, as such the strtk::tokenizer expects the range to be valid during the entirety of the tokenization process, this is also the case for the specified predicate.

Till now the mentioned routines worked specifically with tokens, or in other words ranges of characters. The responsibility of managing the tokens and converting the tokens to user specified types was done manually via "range to type" oriented back inserters and converters. This can be a bit cumbersome and as such StrTk provides a series of helper routines called strtk::parse. Parse takes an std::string representing a tuple of delimited values as input data, a delimiter set, and a series of references to variables that are to be populated with the values from the parsed tokens. The following diagram demonstrates the flow of data, tokens and the corresponding relationships and conversions between each of the parameters.

Note: strtk::parse returns a boolean value of true upon successful parsing and false for all other results. Situations that cause strtk::parse to fail include:

Insufficient number of tokens for the given number of variables

Conversion failure from token range to variable type

Empty or null token(s)

strtk::parse can take an arbitrary number of variable references. The code below demonstrates the basic usage of strtk::parse taking various number of parameters.

std::string data = "abcde,-1234|567.890#1.1f"; std::string delimiters = ",|#"; std::string var0; int var1; double var2; float var3; strtk::parse(data, delimiters, var0); strtk::parse(data, delimiters, var0, var1); strtk::parse(data, delimiters, var0, var1, var2); strtk::parse(data, delimiters, var0, var1, var2, var3);

The following examples demonstrate parsing of PODs such as int and double into STL compatible sequences (std::vector, std::deque, std::list, std::set, std::queue, std::stack and std::priority_queue).

// Insert into std::vector std::string int_string = "1 +2 -3 4 +5 -6 7 +8 -9 10 +11 -12 13 +14 -15"; std::vector<int> int_vector; strtk::parse(int_string, " ", int_vector); // Insert into std::deque std::string double_string = "-123.456,789.012,-345.678,901.234,+567.890"; std::deque<double> double_deque; strtk::parse(double_string, ",", double_deque); // Insert into std::list std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz"; std::list<std::string> string_list; strtk::parse(data_string, ",", string_list); // Insert into std::set std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz"; std::set<std::string> string_set; strtk::parse(set_string, "#|", string_set); // Insert into std::queue std::string queue_string = "value1,value2,value3,value4,value5"; std::queue<std::string> string_queue; strtk::parse(queue_string, ",|", string_queue); // Insert into std::stack std::string stack_string = "value1|value2,value3|value4,value5"; std::stack<std::string> string_stack; strtk::parse(stack_string, ",|", string_stack); // Insert into std::priority_queue std::string priority_queue_string = "value1|value2,value3#value4,value5"; std::priority_queue<std::string> string_priority_queue; strtk::parse(priority_queue_string, ",|#", string_priority_queue);

Similar to what is described above, the following demonstrates parsing of up to "N" elements into an STL compatible sequence.

// Insert 5 elements into std::vector std::string int_string = "100,-200,+300,400,-500,+600,700,-800,+900"; std::vector<int> int_vector; strtk::parse_n(int_string, ",", 5, int_vector); // Insert 3 elements into std::deque std::string double_string = "123.456,+789.012,345.678,-901.234,567.890"; std::deque<double> double_deque; strtk::parse_n(double_string, ",", 3, double_deque); // Insert 6 elements into std::list std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz"; std::list<std::string> string_list; strtk::parse_n(data_string, ",", 6, string_list); // Insert 6 elements into std::set std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz"; std::set<std::string> string_set; strtk::parse_n(set_string, "#|", 6, string_set); // Insert 4 elements into std::queue std::string queue_string = "value0,value1,value2,value3,value4,value5"; std::queue<std::string> string_queue; strtk::parse_n(queue_string, ",|", 4, string_queue); // Insert 4 elements into std::stack std::string stack_string = "value0|value1|value2,value3|value4,value5"; std::stack<std::string> string_stack; strtk::parse_n(stack_string, ",|", 4, string_stack); // Insert 4 elements into std::priority_queue std::string priority_queue_string = "value0#value1|value2,value3#value4,value5"; std::priority_queue<std::string> string_priority_queue; strtk::parse_n(priority_queue_string, ",|#", 4, string_priority_queue);

Note: The return value of the routine strtk::parse_n indicates how many elements were parsed and placed into the specified sequence.

As a first example, we'll tackle the simple problem of reversing words in a sentence. That is given a sentence, to have the first word be the last and the last to be the first, the second word to be the second last so on and so forth. Using StrTk we come up with the following solution:

int main() { std::string sentence = "The quick brown fox jumps over the lazy dog"; std::cout << "Before: " << sentence << std::endl; strtk::split(" ", sentence, strtk::functional_inserter( [](const strtk::range::string& range) { strtk::reverse(range); }) ); strtk::reverse(sentence); std::cout << "After: " << sentence << std::endl; return 0; }

With an expected output of:

Before: The quick brown fox jumps over the lazy dog After: dog lazy the over jumps fox brown quick The

Another example, given a list of words to blank out and a sentence, transform the sentence such that the blank-out words are removed. Using StrTk we come up with the following solution:

int main() { std::string sentence = "The quick brown fox jumps over the lazy dog"; std::unordered_set<std::string> blankout_words; blankout_words.insert("quick"); blankout_words.insert("over"); blankout_words.insert("lazy"); std::cout << "Before: " << sentence << std::endl; strtk::split(" ", sentence, strtk::functional_inserter( [&blankout_words](const strtk::range::string& range) { auto itr = blankout_words.find(range); if (blankout_words.end() != itr) { strtk::fill(range,' '); } }) ); strtk::remove_consecutives_inplace(' ',sentence); std::cout << "After: " << sentence << std::endl; return 0; }

With an expected output of:

Before: The quick brown fox jumps over the lazy dog After: The brown fox jumps the dog

Another simple example would be given a text file to read each of the lines and populate a word list structure by tokenizing each line into words. The following is an example of how this can be achieved using StrTk:

int main() { std::deque<std::string> word_list; strtk::for_each_line("text.txt", [&word_list](const std::string& line) { static const std::string delimiters = "0123456789()[]{}<>" "\t\r

,,.;:'" "!@#$%^&*_-=+`~/"; strtk::parse(line,delimiters,word_list); }); return 0; }

The following simple example takes a user specified text file, proceeds to process it and returns information relating to the file, such as word, letter, uppercase character, lowercase character, vowel and consonant counts.

int main() { std::size_t word_count = 0; std::size_t letter_count = 0; std::size_t uppercase_count = 0; std::size_t lowercase_count = 0; std::size_t vowel_count = 0; std::size_t consonant_count = 0; using namespace strtk; typedef multiple_char_delimiter_predicate mcdp_t; static const mcdp_t is_vowel("AEIOUaeiou"); static const mcdp_t is_lowercase(ext_string::all_lowercase_letters()); static const std::string delimiters = ext_string::all_chars() - ext_string::all_lowercase_letters() - ext_string::all_uppercase_letters(); for_each_line("data.txt", [&](const std::string& line) { split(delimiters, line, functional_inserter( [&](const strtk::range::string& range) { if (0 == range.size()) return; ++word_count; letter_count += range.size(); std::size_t current_lowercase_count = 0; std::size_t current_vowel_count = 0; for (std::size_t i = 0; i < range.size(); ++i) { if (is_vowel(range[i])) ++current_vowel_count; if (is_lowercase(range[i])) ++current_lowercase_count; } uppercase_count += range.size() - current_lowercase_count; lowercase_count += current_lowercase_count; consonant_count += range.size() - current_vowel_count; vowel_count += current_vowel_count; }) ); }); std::cout << "Word count: " << word_count << std::endl; std::cout << "Letter count: " << letter_count << std::endl; std::cout << "Uppercase count: " << uppercase_count << std::endl; std::cout << "Lowercase count: " << lowercase_count << std::endl; std::cout << "Vowel count: " << vowel_count << std::endl; std::cout << "Consonant count: " << consonant_count << std::endl; return 0; }

For the next example, assume we have a text file with a list of names, one per line that represents the order of people that entered a building. Some of the people may enter and leave then reenter the building many times, hence their name will appear multiple times in the list. Our task is to reduce this list to a list of unique names but to also maintain the relative order of names found in the original list. The following is how this particular requirement can be accomplished by using StrTk:

int main() { strtk::for_each_line("file_name.txt", [](const std::string& line) { static std::unordered_set<std::string> line_set; if (line_set.end() != line_set.find(line)) return; line_set.insert(line); std::cout << line << std::endl; }); return 0; }

As a final simple example, we would like to calculate the word frequency model of a user specified text file. The process involves reading each line, splitting the line into words, then incrementing the relevant count for each word and maintaining a global word count. Once the file has been processed, the occurrence frequency of each word will be printed to stdout.

int main() { typedef std::unordered_map<std::string,unsigned int> map_t; map_t word_hit_list; unsigned int word_count = 0; strtk::for_each_line("data.txt", [&](const std::string& line) { static const std::string delimiters = strtk::ext_string::all_chars() - strtk::ext_string::all_lowercase_letters() - strtk::ext_string::all_uppercase_letters(); strtk::split(delimiters, line, strtk::functional_inserter( [&](const strtk::range::string& range) { if (range.begin() == range.end()) return; ++word_count; std::string word(range.begin(),range.end()); ++word_hit_list[word]; }) ); }); for (map_t::value_type v : word_hit_list) { printf("%s %10d %10.9f

", strtk::text::right_align(15,' ',v.first).c_str(), v.second, (1.0 * v.second) / word_count); } return 0; }

Lets assume we have been given an English text file to process, with the intention of extracting a lexicon from the file.

One solution would be to break the problem down to a line by line tokenization problem. In this case we would define a functional object such as the following which will take the container in which we plan on storing our tokens (words) and a predicate and insert the tokens as strings into our container.

template <typename Container, typename Predicate> struct parse_line { public: parse_line(Container& container, const Predicate& predicate) : container_(container), predicate_(predicate) {} void operator() (const std::string& str) { strtk::split(str, predicate_, strtk::range_to_type_back_inserter(container_), strtk::split_options::compress_delimiters); } private: Container& container_; const Predicate& predicate_; };

The whole thing together would include a process of opening the file and reading it line by line each time invoking the parse_line would be as follows:

template <typename Container> void parse_text(const std::string& file_name, Container& c) { static const std::string delimiters = " ,.;:<>'[]{}()_?/" "`~!@#$%^&*|-_\"=+\t\r

\0" "0123456789"; strtk::multiple_char_delimiter_predicate predicate(delimiters); strtk::for_each_line(file_name, parse_line (c,predicate)); } int main() { std::string text_file_name = "text.txt"; std::deque<std::string> word_list; parse_text(text_file_name,word_list); std::cout << "Token Count: " << word_list.size() << std::endl; return 0; }

Before we continue on with the example, a re-written version of the above code using C++11 lambdas is as follows:

int main() { std::string text_file_name = "text.txt"; std::deque<std::string> word_list; strtk::for_each_line(text_file_name, [&word_list](const std::string& line) { static const std::string delimiters = " ,.;:<>'[]{}()_?/" "`~!@#$%^&*|-_\"=+\t\r

\0" "0123456789"; strtk::parse(line,delimiters,word_list); }); std::cout << "Token Count: " << word_list.size() << std::endl; return 0; }

Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity lets define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. This type of list is commonly known as a Stop Word List. In this example the stop-word list definition will be as follows:

const std::string stop_word_list [] = { "as", "at", "but", "by", "for", "in", "like", "next", "of", "on", "opposite", "out", "past", "to", "up", "via", "" }; const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);

Some minor updates to the parse_line processor include using the filter_on_match predicate for determining if the currently processed token is a preposition and also the invocation of the range_to_type back_inserter to convert the tokens from their range iterator representation to a type representation compatible with the user defined container. For the new implementation to provide unique words of interest the simplest change that can be made is to replace the deque used as the container for the word_list to some kind of 1-1 associative container such as a set. The following is the improved version of the parse_line processor:

template <typename Container, typename Predicate> struct parse_line { public: parse_line(Container& container, const Predicate& predicate) : container_(container), predicate_(predicate), tmp_(" "), tokenizer_(tmp_,predicate_,true), filter_(stop_word_list,stop_word_list + stop_word_list_size, strtk::range_to_string_back_inserter_iterator<Container>(container_), true,false) {} void operator() (const std::string& s) { const filter_type& filter = filter_; strtk::for_each_token(s,tokenizer_,filter); } private: Container& container_; const Predicate& predicate_; std::string tmp_; typename strtk::std_string_tokenizer<Predicate>::type tokenizer_; strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_; };

The above described predicate can be greatly simplified by using binders and various lambda expressions.

When performing serialization or deserialization of an instance object such as a class or struct, a simple approach one could take would be to take each of the members and convert them into string representations and from those strings construct a larger string delimiting each member with a special character guaranteed not to exist in any of the string representations.

In this example we will assume that there exists a struct which represents the properties of a person, a person struct if you will:

struct person { unsigned int id; std::string name; unsigned int age; double height float weight; };

The process of populating a person struct would entail having an instance of a person and the necessary data string. The following is an example of how this would be done using the strtk::parse function.

Person Tuple Format

Token0 Delimiter0 Token1 Delimiter1 Token2 Delimiter2 Token3 Delimiter3 Token4 Unique ID(hex) | Name | Age | Height(m) | Weight(kg)

std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7"; person p; strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); // register id with the hex sink strtk::parse(data, "|", hex_sink, p.name, p.age, p.height, p.weight);

Batch processing of a text file comprised of one person tuple per-line is somewhat similar to the previous example. A predicate is defined that takes a container specialized on the person struct, and a delimiter predicate with which the strtk::parse function will be invoked. This predicate is then instantiated and coupled with the text file name, is fed to the strtk::for_each_line processor.

template <typename Container, typename Predicate> struct parse_person_tuple { public: parse_person_tuple(Container& container) : container_(container), hex_sink(p_.id) {} void operator() (const std::string& s) { if (strtk::parse(s, "|", hex_sink, p_.name, p_.age, p_.height, p_.weight)) container_.push_back(p_); else std::cerr << "Failed to parse: " << s << std::endl; } private: Container& container_; person p_; strtk::hex_to_number_sink<unsigned int> hex_sink; };

Bringing the above pieces together to process a file results in the following:

int main() { std::string text_file_name = "person_records.txt"; std::deque<person> person_list; strtk::for_each_line(file_name,predicate_type(person_list)); return 0; }

Before we continue on with the example, a re-written version of the above code using C++11 lambdas is as follows:

int main() { std::string text_file_name = "person_records.txt"; std::deque<person> person_list; person p; strtk::hex_to_number_sink<unsigned int> hex_sink; strtk::for_each_line( text_file_name, [&](const std::string& line) { if (strtk::parse(line, "|", hex_sink, p.name, p.age, p.height, p.weight)) container_.push_back(p); else std::cerr << "Failed to parse: " << line << std::endl; }); return 0; }

To make things easier one could adapt a struct (made up entirely of PODs) to a parser. This makes the usage syntax little easier to follow. An example of this adaption is as follows:

struct type { std::string s; double d; int i; char c; bool b; }; strtk_parse_begin(type) strtk_parse_type(s) strtk_parse_type(d) strtk_parse_type(i) strtk_parse_type(c) strtk_parse_type(b) strtk_parse_end() int main() { type t; std::string s = "abcdefghijklmnop|123.456|987654321|A|1"; strtk::parse(s, "|", t); return 0; }

Another similar example to the previous, would be parsing a text file of 3D coordinates into a sequence. This can be done easily and cleanly using lambdas and StrTk as follows:

struct point { double x,y,z; }; int main() { std::string point_data = "point_data.txt"; std::deque<point> points; point p; strtk::for_each_line(point_data, [&](const std::string& str) { if (strtk::parse(str, ",", p.x, p.y, p.z)) points.push_back(p); }); return 0; }

Assuming the datetime struct defined below, and a string representation of a combined date and time in the form of YYYY-MM-DD HH:MM:SS.MS eg: 2005-06-26 15:57:03.678

struct datetime { unsigned int year; unsigned int month; unsigned int day; unsigned int hour; unsigned int minute; unsigned int second; unsigned int msecond; };

The following assumes an input of date-time values separated by a pipe. To facilitate parsing of a date-time by the strtk::parse routine into an STL compatible sequence an implementation of string_to_type_converter_impl specific to the datetime type is required. The following demonstrates how such a routine can be generated and used within the strtk::parse context:

strtk_string_to_type_begin(datetime) static const std::string delimiters ("-:. "); return strtk::parse(begin, end, delimiters, t.year, t.month, t.day, t.hour, t.minute, t.second, t.msecond); strtk_string_to_type_end()

A simple example of using strtk::parse in conjunction with the newly integrated datetime variable mixed with variables of other types is as follows:

int main() { const std::string data = "abc 123 xyz,2000-01-10 03:01:16.123|+98765.43210"; std::string var0; datetime var1; double var2; strtk::parse(data, ",|", var0, var1, var2); return 0; }

Bringing the above pieces together, in the following we can then proceed to parse a sequence of date-time strings delimited by pipe "|" into a deque of type datetime.

int main() { static const std::string data = "2000-01-10 03:01:16.123|2001-02-22 05:12:24.234|" "2002-03-13 07:23:32.345|2003-04-24 09:34:47.456|" "2004-05-15 11:46:51.767|2005-06-26 15:57:03.678|" "2006-07-17 17:45:31.561|2007-08-26 19:06:02.809|" "2008-09-18 21:16:23.267|2009-10-26 23:12:03.798|" "2010-11-23 13:47:11.963|2011-12-26 15:35:08.168"; std::deque<datetime> datetime_list; strtk::parse(data,"|",datetime_list); for (std::size_t i = 0; i < datetime_list.size(); ++i) { datetime& dt = datetime_list[i]; std::cout << dt.year << "-" << strtk::text::right_align(2,'0', dt.month) << "-" << strtk::text::right_align(2,'0', dt.day) << " " << strtk::text::right_align(2,'0', dt.hour) << ":" << strtk::text::right_align(2,'0', dt.minute) << ":" << strtk::text::right_align(2,'0', dt.second) << "." << strtk::text::right_align(3,'0',dt.msecond) << std::endl; } return 0; }

As a side note, the more commonly used date, time and date-time formats can be easily parsed with a simple utilities library based on StrTk called Datetime_Utils. The library makes use of the technique described above in conjunction with the strtk::offset_splitter to provide efficient and high performance parsers for formats such as the ones denoted below:

Format Example YYYYMMDD

20060304

YYYYDDMM

20060403

YYYY/MM/DD

2006/03/04

YYYY/DD/MM

2006/04/03

DD/MM/YYYY

04/03/2006

HH:MM:SS.mss

13:27:54.123

HH:MM:SS

13:27:54

YYYYMMDD HH:MM:SS.mss

20060304 13:27:54.123

YYYY/MM/DD HH:MM:SS.mss 2006/03/04 13:27:54.123 DD/MM/YYYY HH:MM:SS.mss 04/03/2006 13:27:54.123 YYYYMMDD HH:MM:SS

20060304 13:27:54

YYYY/MM/DD HH:MM:SS

2006/03/04 13:27:54

DD/MM/YYYY HH:MM:SS

04/03/2006 13:27:54

YYYY-MM-DD HH:MM:SS.mss 2006-03-04 13:27:54.123 DD-MM-YYYY HH:MM:SS

04-03-2006 13:27:54

YYYY-MM-DDTHH:MM:SS

2006-03-04T13:27:54

YYYY-MM-DDTHH:MM:SS.mss 2006-03-04T13:27:54.123 YYYYMMDDTHH:MM:SS

20060304T13:27:54

YYYYMMDDTHH:MM:SS.mss

20060304T13:27:54.123

In the following simple example we have an array of data representing tuples of trade executions in CSV format. The objective is to populate the trade_list instance with the given data via the defined trade struct. In the example the dt_utils::datetime_format6 date-time parser is used, it populates a general date-time type instance called dt_utils::datetime. If the parse operation succeeds, then the date-time components the trade type requires are updated and the instance itself is subsequently added to the trade_list.

struct trade { std::string ticker; double price; unsigned int volume; unsigned short hr,min,sec,ms; }; int main() { std::string trade_data[] = { "2006-03-04 13:27:54.123,ABC,12.347,6676", "2006-03-04 13:27:54.231,XYZ,23.455,71547", "2006-03-04 13:27:54.312,IJK,34.562,514", "2006-03-04 13:27:55.263,PQR,67.893,5949", "2006-03-04 13:27:55.327,ABC,78.963,19", "2006-03-04 13:27:55.524,XYZ,45.677,1276", "2006-03-04 13:27:56.623,IJK,36.182,6676", "2006-03-04 13:27:56.877,PQR,62.339,1547" }; std::deque<trade> trade_list; trade t; dt_utils::datetime dt; dt_utils::datetime_format6 dt6(dt); for (std::size_t i = 0; i < sizeof(trade_data) / sizeof(std::string); ++i) { bool result = strtk::parse(trade_data[i], ",", dt6, t.ticker, t.price, t.volume); if (result) { t.hr = td.hour; t.min = td.minute; t.sec = td.second; t.ms = td.millisecond; trade_list.push_back(t); } } return 0; }

So far the demonstrated capabilities of the strtk::parse function has been based on passing a series of parameters that are populated in a linear fashion as the parser processes the tokens it encounters. That said, some formats have their own sub-structures, a simple example would be a list of values (such as integers) that need to be loaded into a deque or stack. StrTk provides a series of sink facilities that consume a range and an STL container which can be forwarded onto strtk::parse.

In the following example, the data string is comprised of 3 separate lists delimited by a pipe "|". An integer, a string and a double type list. Each list is to be parsed into an STL container of appropriate type. In this case a vector, a deque and a list. StrTk provides the ability to instantiate a sink for the specific container type that is compatible with strtk::parse.

int main() { std::string data = "1,+2,-3,4|abc,ijk,rst,xyz|123.456,+234.567,-345.678,456.789,567.890"; // Define containers std::vector<int> int_vector; std::deque<std::string> string_deque; std::list<double> double_list; std::set<int> int_set; std::queue<std::string> string_queue; std::stack<double> double_stack; std::priority_queue<int> int_priority_queue; // Define sinks strtk::vector_sink<int>::type vec_sink(","); strtk::deque_sink<std::string>::type deq_sink(","); strtk::list_sink<double>::type lst_sink(","); strtk::set_sink<int>::type set_sink(","); strtk::queue_sink<std::string>::type que_sink(","); strtk::stack_sink<double>::type stk_sink(","); strtk::priority_queue_sink<int>::type prq_sink(","); strtk::parse(data, "|", vec_sink( int_vector), deq_sink(string_deque), lst_sink( double_list)); strtk::parse(data, "|", set_sink( int_set), que_sink( string_queue), stk_sink( double_stack), prq_sink(int_priority_queue)); return 0; }

If only a certain number of elements in the list are required, for example only the first 3, then the element count on the sink can be set appropriately. The above example could be modified as follows:

int main() { std::string data = "1,+2,-3,4|string0|abc,ijk,rst,xyz|" "string1|123.456,+234.567,-345.678,456.789,567.890"; std::vector<int> int_vector; std::deque<std::string> string_deque; std::list<double> double_list; strtk::vector_sink<int>::type vec_sink(","); strtk::deque_sink<std::string>::type deq_sink(","); strtk::list_sink<double>::type lst_sink(","); std::string string_0; std::string string_1; strtk::parse(data, "|", vec_sink( int_vector).count(2), // Consume first 2 values string_0, deq_sink(string_deque).count(3), // Consume first 3 values string_1, lst_sink( double_list).count(4)); // Consume first 4 values return 0; }

Note: If there aren't enough elements in a particular list, then parsing of that list fails and subsequently the call to strtk::parse will fail.

Another way one might want to parse a tuple of values might be to parse a prefix of values into a specific number of possibly varying types, then to parse the remaining values (assuming they are all of the same type) into a sequence or list etc. StrTk provides the following simple solution to the given problem, as demonstrated below:

int main() { { std::string data = "A String Value,111.111,222.222,333.333,444.444,555.555"; std::string token; std::vector<double> double_list; strtk::parse(data, ",", token, double_list); } { std::string data = "A String Value,01-02-2003,111.111,222.222,333.333,444.444,555.555"; std::string token; std::string date; std::deque<double> double_list; strtk::parse(data, ",", token, date, double_list); } { std::string data = "A String Value,01-02-2003,123456789,111.111," "222.222,333.333,444.444,555.555"; std::string token; std::string date; int i; std::list<double> double_list; strtk::parse(data, ",", token, date, i, double_list); } { std::string data = "A String Value,01-02-2003,123456789,111.111," "222.222,333.333,444.444,555.555"; std::string token; std::string date; int i; double d; std::vector<double> double_list; strtk::parse(data, ",", token, date, i, d, double_list); } { std::string data = "A String Value,01-02-2003,123456789,111.111," "222.222,333.333,444.444,555.555"; std::string token; std::string date; int i; double d1; double d2; std::deque<double> double_list; strtk::parse(data, ",", token, date, i, d1, d2, double_list); } return 0; }

Building upon the previous datetime example, We are presented with a tuple of data that represents an astronomical event. The event defines a name, a location and a series of date-times in UTC the event was observed. In order to construct the necessary sink(s) that will be used for parsing the required type into a container, the macro strtk_register_userdef_type_sink with the specified type is invoked. The following is a definition of the struct one might construct:

struct event_information { std::size_t id; std::string name; std::string location; std::deque<datetime> observation_datetimes; }; strtk_register_userdef_type_sink(datetime)

Bringing the above together with a call to strtk::parse results in the following code which parses the event data tuple into the allocated event_information instance.

int main() { std::string event_data = "172493|Lunar Impact|Mare Tranquillitatis|" "2010-01-19 00:28:45.357,2010-02-18 00:57:07.109," "2010-03-20 01:15:11.261,2010-04-21 01:07:27.972"; strtk::deque_sink<datetime>::type deq_sink(","); event_information evt_info; strtk::parse(event_data, "|", evt_info.id, evt_info.name, evt_info.location, deq_sink(evt_info.observation_datetimes)); return 0; }

StrTk offers a set of convenient and simple to use token processing primitives that can be invoked during a call to the strtk::parse routine to perform various actions upon the tokens being parsed. These actions include such things as modifications and validations of tokens.

The following is a list of token processing primitives used for constraint and verification purposes:



strtk::ignore_token

strtk::expect

strtk::iexpect

strtk::like

strtk::inrange

strtk::trim

strtk::trim_leading

strtk::trim_trailing

strtk::as_lcase

strtk::as_ucase

There may be scenarios when given a delimited tuple of data, that one or more of the tokens need to be ignored or skipped. StrTk provides a mechanism called strtk::ignore_token that allows the parser to consume specific tokens easily without affecting overall performance. Below is an example of how strtk::ignore_token can be used in conjunction with strtk::parse to skip the 2nd and 4th tokens in the tuple:

int main() { static const std::string data = "+123,ignore0,123.456,ignore1,abcdef,ignore2"; int i; double d; std::string s; strtk::ignore_token ignore; strtk::parse(data, ",", i, ignore, d, ignore, s); std::cout << "i = " << i << std::endl; std::cout << "d = " << d << std::endl; std::cout << "s = " << s << std::endl; return 0; }

When parsing a tuple, one may want to ensure that specific tokens of the tuple are of a certain string value. StrTk provides this type of functionality via the strtk::expect and strtk::iexpect mechanisms. The strtk::expect form enforces an exact string match, whereas the strtk::iexpect enforces only a case-insensitive match. The following is an example where we attempt to parse a 'pascal-like' variable declaration and definition. The requirement is that the first token be "var" followed by a variable name and then a type name of 'Integer' which is not case sensitive.

int main() { static const std::string data = "var foo : InTeGeR = 3;"; std::string variable_name; int initial_value; bool result = strtk::parse(data, " ;", strtk::expect("var").ref(), variable_name, strtk::expect(":").ref(), strtk::iexpect("Integer").ref(), strtk::expect("=").ref(), initial_value); if (result) std::cout << variable_name << " = " << initial_value << std::endl; else std::cout << "Failed to parse statement!" << std::endl; return 0; }

Similar to the above mentioned strtk::expect and strtk::iexpect primitives, StrTk provides a simple wildcard matching of tokens functionality via the strtk::like mechanism. The special characters of '*' and '?' are used denoting 'zero or more' and 'zero or one' match modes respectively. The following example uses the strtk::like in conjunction with strtk::expect to parse a tuple of key/value pairs.

int main() { static const std::string data = "token0=+123;token1=abc;token2=-456.678;"; int i; std::string s; double d; strtk::parse(data, "=;", strtk::like("to*n?").ref(), i, strtk::like("token?").ref(), s, strtk::iexpect("tOkEn2").ref(), d); std::cout << "i = " << i << std::endl; std::cout << "s = " << s << std::endl; std::cout << "d = " << d << std::endl; return 0; }

When parsing tokens, one may wish to determine if the token when viewed in its target type resides within a specified range [r0,r1]. As the tokens can be of any type, not necessarily just string or numerical in nature, the type must have a less- than comparable attribute. The following example attempts to parse a key/value tuple that contains a temperature and a name component.

int main() { static const std::string data = "temperature=+123.456;name=Rumpelstilzchen"; double temperature; std::string name; strtk::parse(data, "=;", // Process temperature section strtk::expect("temperature").ref(), strtk::inrange(temperature, -432.1, +432.1).ref(), // Process name section strtk::expect("name").ref(), strtk::inrange(name, "AAAA", "zzzz").ref()); std::cout << "temperature = " << temperature << std::endl; std::cout << "name = " << name << std::endl; return 0; }

At times tokens within a tuple may have padding added to either the left, right or both ends. When processing the token it maybe necessary to remove the superfluous padding before attempting to convert the string or range representation of the token into its target type(int, double etc). The example below, demonstrates the use of various forms of token trimming in conjunction strtk:parse.

int main() { { std::string data = "****abc123****,****abc123****,****abc123****"; std::string s0; std::string s1; std::string s2; strtk::parse(data,",", strtk::trim("*",s0).ref(), strtk::trim_leading ("*",s1).ref(), strtk::trim_trailing("*",s2).ref()); std::cout << "s0 = [" << s0 << "]" << std::endl; std::cout << "s1 = [" << s1 << "]" << std::endl; std::cout << "s2 = [" << s2 << "]" << std::endl; // Expected Output: // s0 = [abc123] // s1 = [abc123****] // s2 = [****abc123] } { std::string data = "*?*?a string*?*?,*?*123456,123.456?*?*?"; std::string s; int i; double d; strtk::parse(data,",", strtk::trim("*?",s).ref(), strtk::trim_leading ("?*",i).ref(), strtk::trim_trailing("*?",d).ref()); std::cout << "s = [" << s << "]" << std::endl; std::cout << "i = [" << i << "]" << std::endl; std::cout << "d = [" << d << "]" << std::endl; // Expected Output: // s = [a string] // i = [123456] // d = [123.456] } return 0; }

Another pair of token processing mechanisms provided by StrTk, are the strtk::as_lcase and strtk::as_ucase mechanisms. They convert the string representation of the token to lowercase and uppercase characters respectively. The following example, parses a two token tuple, and converts the first token s0 to all lowercase and the second token s1 to all uppercase.

int main() { std::string data = "AbCd,EfGhI"; std::string s0; std::string s1; strtk::parse(data,",", strtk::as_lcase(s0).ref(), strtk::as_ucase(s1).ref()); std::cout << "s0 = [" << s0 << "]" << std::endl; std::cout << "s1 = [" << s1 << "]" << std::endl; // Expected Output: // s0 = [abcd] // s1 = [EFGHI] return 0; }

There may be times during parsing when a token which is intended to be parsed as an integral type (eg: int, short, unsigned int et al) may be represented using decimal notation (eg: 1234.00000).



Normally if the token were to be passed as-is it would cause a conversion error due to the fact that there are invalid characters within the token.



StrTk provides a facility called strtk::truncated_int that can be used with both signed and unsigned integral types. The type truncated_int is specialised with the required type, then an instance of the type is registered with it either prior to or inline with the conversion/parse call. When the conversion occurs strtk::truncated_int simply redefines the end of the token range to be the decimal point (if it is present) and then passes it along to the appropriate string_to_type_converter_impl call.

int main() { { // Convert decimal representation to an int int i = 0; std::string data = "-1234.0000"; strtk::truncated_int<int> ti; strtk::string_to_type_converter(data,ti(i)); } { // Parse a tuple of decimal values into ints i0 and i1 int i0 = 0; int i1 = 0; std::string data = "-1234.0000|456.7890"; strtk::truncated_int<int> ti0; strtk::truncated_int<int> ti1; strtk::parse(data, "|", ti0(i0), ti1(i1)); } }

An optional parameter that can be utilized is the 'fractional_size' which denotes the exact number of digits after the decimal place that is expected. In the event this condition is not met a conversion error is returned.

int main() { std::string data = "-1234.000"; strtk::truncated_int<int> ti; ti.fractional_size(3); int i = 0; if (strtk::string_to_type_converter(data,ti(i))) std::cout << "i: " << i << std::endl; else std::cout << "Error parsing: " << data << std::endl; return 0; }

In the following example, we have an array of trade execution tuples in csv format. The tuple is comprised of the following fields: ticker name (string), trade id (uint64 right aligned with space as padding), execution price (double), executed volume (unsigned int with 4 decimal place suffix). The struct trade will be used to store the tuples in memory. The objective is to parse each tuple and populate the trade_list structure with all the trades, noting any errors that occur along the way.

struct trade { unsigned long long id; std::string ticker; double price; unsigned int volume; }; int main() { std::string trade_data[] = { "AAA, 0,36.900,009352.0000", "BBB, 1,46.000,009800.0000", "CCC, 2,46.000,009400.0000", "DDD, 3,46.000,001500.0000", "AAA, 4,46.000,000600.0000", "BBB, 5,46.000,039800.0000", "CCC, 6,46.000,003200.0000", "DDD, 7,46.000,000100.0000", "AAA, 8,46.000,000800.0000", "BBB, 9,48.500,002400.0000", "CCC, 10,37.395,101200.0000", "DDD, 11,37.193,200000.0000", "AAA, 12,37.146,020000.0000", "BBB, 13,38.314,001752.0000", "CCC, 14,37.386,130000.0000", "DDD, 15,37.443,100000.0000" }; std::deque<trade> trade_list; strtk::truncated_int<unsigned int> tui; tui.fractional_size(4); for (std::size_t i = 0; i < sizeof(trade_data) / sizeof(std::string); ++i) { trade t; bool result = strtk::parse(trade_data[i],",", t.ticker, strtk::trim_leading(" ",t.id).ref(), t.price, tui(t.volume)); if (result) trade_list.push_back(t); else std::cout << "Tuple parse error: " << trade_data[i] << std::endl; } return 0; }

It should be noted that truncated_int can be used in conjunction with the various StrTk token processing primitives such as strtk::trim, strtk::trim_leading, strtk::as_lcase et al . The following example demonstrates parsing a tuple of values intended for int types, where the tokens have random space padding. The simple composition of strtk::truncated_int and strtk::trim allows for efficient and error free parsing of the tuple.

int main() { std::string data = " 123.0 | 456.00 | 789.000"; int i0,i1,i2; strtk::truncated_int<int> ti0; strtk::truncated_int<int> ti1; strtk::truncated_int<int> ti2; strtk::parse(data,"|", strtk::trim(" ",ti0(i0)).ref(), strtk::trim(" ",ti1(i1)).ref(), strtk::trim(" ",ti2(i2)).ref()); printf("i0 = [%d] i1 = [%d] i2 = [%d]

",i0,i1,i2); // i0 = [123] i1 = [456] i2 = [789] return 0; }

In the previous section the ability to ignore tokens in a tuple was discussed. The concept works well if only a few tokens need to be ignored. However problems arise when the tuples contain a large number of tokens and the tokens that are to be ignored are numerous and distributed uniformly over the entire tuple.



Situations such as this one are common, and using the ignore_token technique can not only make both the coding of the solution cumbersome and error prone but also make the parsing process itself quite



inefficient. A natural extension to ignore_token that scales and is also extremely efficient, can be found in the combined functionalities of the parse_columns and column_list. The column_list is a structure used to hold the indexes of the tokens in the tuple that are required. The indexes have to be valid, unique and in ascending order.

// Even indexes [0,12] auto cl_even = strtk::column_list(0, 2, 4, 6, 8, 10, 12); // Odd indexes [1,7] auto cl_odd = strtk::column_list(1, 3, 5, 7);

The strtk::parse_columns function takes a string of data representing a tuple, a delimiter to determine the tokens in the tuple, a strkt::column_list and a compatible number of types as target references.



The number of types has to be equal to the number of indexes in the column_list, and the types need to be convertible within the strtk namespace from a string range representation to the type. In the following example we have a tuple consisting of integers. We're only interested in the first four even numbered indexes in the tuple, the code below demonstrates how the tuple is parsed with the given constraints:

int main() { const std::string data = "1,22;333|4444 55555|666666;7777777,88888888 999;" "0000|1111,22222,333333,4444444"; int i0,i1,i2,i3,i4; strtk::parse_columns(data, ",;| ", strtk::column_list(0, 2, 4, 6, 8), i0, i1, i2, i3, i4); return 0; }

In the following example we have a tuple consisting of a mixture of types. We are only interested in the first, fifth and eighth indexes in the tuple, which happen to be of type int, double and std::string respectively. The code below demonstrates how the tuple is parsed with the given constraints:

int main() { const std::string data = "token0,1234;token2;token3,token4,11.22|token6;token7|" "text-i-text,1 2 3 4 5 6 7"; int i; double d; std::string s; strtk::parse_columns(data, ",;| ", strtk::column_list(1, 5, 8), i, d, s); return 0; }

Lets assume there is a file format for expressing a mesh. The format consists of a section that defines the unique vertexes in the mesh, and another section that defines the triangles in the mesh as a tuple of three indicies indicating the vertexes used. Each section is preceded by an integer that denotes the number of subsequent elements in that section. An example of such a file is the following:

5 +1.0,+1.0,+1.0 -1.0,+1.0,-1.0 -1.0,-1.0,+1.0 +1.0,-1.0,-1.0 +0.0,+0.0,+0.0 4 0,1,4 1,2,4 2,3,4 3,1,4

Code to parse such a file format is as follows:

struct point { double x,y,z; }; struct triangle { std::size_t i0,i1,i2; }; int main() { std::string mesh_file = "mesh.txt"; std::ifstream stream(mesh_file.c_str()); std::string s; // Process points section std::deque<point> points; point p; std::size_t point_count = 0; strtk::parse_line(stream, " ", point_count); strtk::for_each_line_n(stream, point_count, [&points,&p](const std::string& line) { if (strtk::parse(line, ",", p.x, p.y, p.z)) points.push_back(p); }); // Process triangles section std::deque<triangle> triangles; triangle t; std::size_t triangle_count = 0; strtk::parse_line(stream, " ", triangle_count); strtk::for_each_line_n(stream, triangle_count, [&triangles,&t](const std::string& line) { if (strtk::parse(line, ",", t.i0, t.i1, t.i2)) triangles.push_back(t); }); return 0; }

A semantic action is an action that is undertaken upon a token, it can be in the form of either an assessment or a manipulation. StrTk provides a very simplified semantic action capability, named strtk::util::semantic_action for types that are being parsed via the strtk::parse function. A function (stateful or stateless), taking an iterator range is used to construct the semantic_action. The body of the function performs whatever operations are required and also makes sure to maintain the contract with regards to the return status for the parse routine to complete successfully.



The following is an example where a comma delimited string is parsed into 3 types, an integer, a double and a string. The rules regarding parsing and updating of the variables is as follows, the int variable "i" will only be updated if the value parsed is odd, the double value "d" will only be updated if the parsed value is greater than or equal to 99.99 and the string value "s" will only be updated if the presented range contains the string "ring". Upon every successful update a corresponding counter will be incremented.

int main() { std::string data = "12345,123.456,A string"; int i = 0; double d = 0.0; std::string s = ""; std::size_t i_update_count = 0; std::size_t d_update_count = 0; std::size_t s_update_count = 0; typedef strtk::range::ustring::const_iterator itr_type; using strtk::util::semantic_action; strtk::parse(data,",", // Token_0 (i) - Parse and update if value is odd semantic_action([&i, &i_update_count] (itr_type begin, itr_type end) -> bool { int temp = 0; if (strtk::string_to_type_converter(begin, end, temp)) { if (temp % 2 != 0) { i = temp; ++i_update_count; } return true; } else return false; }).ref(), // Token_1 (d) - Parse and update if value is greater than 99.99 semantic_action([&d, &d_update_count] (itr_type begin, itr_type end) -> bool { double temp = 0.0; if (strtk::string_to_type_converter(begin, end, temp)) { if (temp >= 99.99) { d = temp; ++d_update_count; } return true; } else return false; }).ref(), // Token_2 (s) - Parse and update if value contains "ring" semantic_action([&s, &s_update_count] (itr_type begin, itr_type end) -> bool { static unsigned char pattern[] = "ring"; if (end != std::search(begin, end, pattern, pattern + 4)) { s.assign(begin,end); ++s_update_count; } return true; }).ref()); std::cout << "i: " << i << std::endl; std::cout << "d: " << d << std::endl; std::cout << "s: " << s << std::endl; return 0; }

A random line of text is to be selected from a user provided text file comprised of lines of varying length, such that the probability of the line being selected is 1/N where N is the number of lines in the text file. There are many trivial solutions to this problem, however if one were to further constrain the problem by indicating the file is very large (TBs) and that the system upon which the solution is to run is very limited memory-wise, most if not all trivial solutions such as storing indexes of all line offsets, or reading the entire file into memory etc will be eliminated.



However, there exists a very elegant solution to this problem of O(n), O(1) time and space complexities respectively, that involves scanning the entire file once line by line, and at every ith line choosing whether or not to replace the final result line with the current line by sampling a uniform distribution of range [0,1) and performing a replace if and only if the random value is less than 1 / i.

The logic behind this solution revolves around the fact that the probability of selecting the ith line will be 1/i and as such the total probability for selecting any of the previous i-1 lines will be 1 - (1/i) or (i - 1)/i. Because there are (i - 1) lines before the ith line, we divide the previous sum of probabilities by (i - 1), resulting in a selection probability of 1/i for any one of the lines up to and including the ith line. If the ith line were to be the last line of the text file, this then results in each of the lines having a selection probability of 1/N - simple and sweet, and so too is the following implementation of said solution:

int main(int argc, char* argv[]) { std::string file_name = argv[1]; std::string line; std::size_t i = 0; strtk::uniform_real_rng rng(static_cast (::time(0))); strtk::for_each_line(file_name, [&](const std::string& s) { if (rng() < (1.0 / ++i)) { line = s; } }); std::cout << line << std::endl; return 0; }

What changes to the above code would be required If the probability of line selection was changed to 1/(N - K) where 0 <= K <= N and K is the number of lines that will be randomly ignored.



Note: TAOCP Volume II section 3.4.2 has an in depth discussion about this problem, which is generally known as reservoir sampling, and other similar problems relating to random distributions. Also one should note that the above example has an inefficiency, in that upon every string replace, an actual string is being copied, normally all one needs to do is maintain a file offset to the beginning of the line, not doing this causes slow downs due to continuous memory allocations/reallocations which is made all the worse when large lines are encountered.

Assume we are given a piece of data in csv format which represents a time-series for the close prices of the SPDR S&P 500 ETF. The time-series ranges from 04/01/1999 to 11/11/2011. The objective is to select two dates, the first being the buy date and the second being the sell date, such that if we were to buy then sell X shares of the ETF we will have maximized our profit. It should be noted that of course the buy date must be before the sell date and that short selling is not an option in this strategy. The following is a chart that represents the price of the SPDR over the given period:

Through visual inspection we can approximate that the best buy date would be towards the end of 2002 and the corresponding sell date would be towards the end of 2007, as these time points seem to give the largest difference between the two prices. However it also seems that a buy roughly at the start of 2009 and a sell at the beginning of 2011 could also provide such a large price difference. As such the visual inspection approach has lead to an ambiguity, hence a more thorough and precise approach is required.



One could take the naive approach to solving the buy low sell high problem, by initially loading the entire time series into memory, then for each ith time point take the pricei and test it against every other pricej wherei < j, and maintain a "best profit encountered" structure that contains the best profit so far and the corresponding buy/sell prices and dates. This solution has a few problems, initially it is of O(N2) time complexity and O(N) space complexity. As an example for one million time points it will require one trillion comparisons and one million date/price units of storage. If memory and computational processing was limited on the hardware this solution can not be practically executed in a continuous online manner. Furthermore as suggested by the time -complexity as the size of the data grows, regardless of computational abilities, the compute times for the results would become astronomical and practically useless - specially for real-time reactive systems.



In these types of problems one tries to assess if an online or streaming based solution is feasible. That is a solution that does not require the data to be available all at once, can work on the data incrementally and requires no more than one-pass for each piece of data. Such a solution would typically have a time complexity of O(N) and a space complexity of O(1).



With regards to this problem the crucial insight required to convert the naive solution from O(N2) complexity to an online version of O(N) time complexity, is that every new global minima encountered is the beginning of a new period and an indicator of an end to the previous period. Looking at the chart, if one were to scan from left to right, the intuitive response is to find the point with the lowest price, ignore everything preceding, then try and find the next point with the highest price or global maxima. There are a few edge cases that need to be dealt with. The main one being the problem described above that there are two buy/sell points that could potentially be the solution. The way around this is to simply maintain the best encountered period, and compare the profit from any new period to the best so far, if it is better (more) then replace the best with the current period. Another edge case is when the data is in a continual decline, in a situation like this there will be no profitable buy/sell points. The following is a small subsection of the time-series in question:



Download spy500.csv



15/06/2000,148.16 16/06/2000,146.59 19/06/2000,148.47 20/06/2000,147.94 21/06/2000,147.88 22/06/2000,145.63 23/06/2000,144.38 26/06/2000,146.23 27/06/2000,145.16 28/06/2000,145.56 29/06/2000,144.19

The code below is a very simple single pass incremental solution to the given problem. It reads every line of the input file, parses each line into a date and price variable, checks to see if the current price is less than the current global minima price, if it is the case, it will set the current period start to the current date and set the buy price to be the current price, otherwise it checks to see if the current price is larger than the global maxima price, if that is the case then it updates the current profit, sell price and sell dates accordingly. If at the end, the buy price is not less than the sell price, it is indicative that there exists no two time points within the given time series for which a profitable transaction could occur, otherwise it prints out buy and sell dates for the required transaction and the expected profit per share.

struct period { period() : profit (std::numeric_limits<double>::min()), buy_price (std::numeric_limits<double>::max()), sell_price(std::numeric_limits<double>::min()) {} double profit; double buy_price; double sell_price; std::string buy_date; std::string sell_date; }; int main() { period best_period; period curr_period; strtk::for_each_line( "spy500.csv", [&best_period,&curr_period] (const std::string& line) { std::string date; double price; if (!strtk::parse(line, ",", date, price)) return; if (price < curr_period.buy_price) { if (curr_period.proft > best_period.profit) { best_period = curr_period; } curr_period.buy_date = date; curr_period.buy_price = price; curr_period.sell_price = std::numeric_limits<double>::min(); curr_period.sell_date = ""; } else if (price > curr_period.sell_price) { curr_period.sell_price = price; curr_period.sell_date = date; curr_period.profit = curr_period.sell_price - curr_period.buy_price; } }); if (best_period.buy_price >= best_period.sell_price) { std::cout << "No period in the time-series can be profitably exploited." << std::endl; } else { std::cout << "Buy: " << best_period.buy_date << std::endl; std::cout << "Sell: " << best_period.sell_date << std::endl; std::cout << "Profit per share: $" << best_period.profit << std::endl; } return 0; }

For the given data it is expected the above piece of code will produce the following output:

Buy: 09/10/2002 Sell: 09/10/2007 Profit per share: $78.38

Quick Quiz: What changes to the above piece of code would be required if:



Rather than prices, we instead are given percentage differences from the previous price?

We have the constraint that we can't hold a position for more than K days

A penalty of (1/K)% is applied to the profit for every day a position is held? (where K is the condition and value described above)

Short selling is allowed?

We want to process independent sections of the time series concurrently so as to speed up overall processing time.

The prices could be extremely large or small?

StrTk provides a means to easily parse and consume 2D grids of tokens in an efficient and simple manner. A grid is simply defined as a series of rows comprised of tokens, otherwise known as Delimiter Separated Values (DSV). The ith token of a row is grouped with every ith token of all other rows to produce a column. Tokens can be processed as either rows or columns.



An example of a simple token grid, where each numeric value is deemed to be a token:

1.1 2.2 3.3 4.4 5.5 1.1 2.2 3.3 4.4 5.5 1.1 2.2 3.3 4.4 5.5 1.1 2.2 3.3 4.4 5.5 1.1 2.2 3.3 4.4 5.5

A token grid can be either passed in via a file or a string buffer. The delimiters to be used for parsing the columns and rows can also be provided, if not provided standard common defaults will be used.



The following demonstrates how each cell in the grid can be access and cast to a specific type.

std::string data = "1,2,3,4,5,6

" "7,8,9,0,1,2

" "3,4,5,6,7,8

" "9,0,1,2,3,4

" "5,6,7,8,9,0

"; strtk::token_grid grid(data, data.size(), ","); for (std::size_t r = 0; r < grid.row_count(); ++r) { strtk::token_grid::row_type row = grid.row(r); for (std::size_t c = 0; c < row.size(); ++c) { std::cout << grid.get<int>(r,c) << '\t'; } std::cout << std::endl; }

The strtk::token_grid provides various helper functions for traversing rows and columns in batch mode. The functions are namely: for_each_row that is used for iterating either all or a sub-range of rows of the token_grid, and for_each_column that is used for iterating either all or a sub-range of columns of a row.

int main() { // Column 0 1 2 3 4 5 6 std::string data = "1.1,2.1,3.1,4.1,5.1,6.1,7.1

" // row_0 "1.2,2.2,3.2,4.2,5.2,6.2,7.2

" // row_1 "1.3,2.3,3.3,4.3,5.3,6.3,7.3

" // row_2 "1.4,2.4,3.4,4.4,5.4,6.4,7.4

" // row_3 "1.5,2.5,3.5,4.5,5.5,6.5,7.5

" // row_4 "1.6,2.6,3.6,4.6,5.6,6.6,7.6

" // row_5 "1.7,2.7,3.7,4.7,5.7,6.7,7.7

" // row_6 "1.8,2.8,3.8,4.8,5.8,6.8,7.8

" // row_7 "1.9,2.9,3.9,4.9,5.9,6.9,7.9

"; // row_8 strtk::token_grid grid(data,data.size(),","); { // Process each row of the token grid grid.for_each_row( [](const strtk::token_grid::row_type& row) { for (std::size_t i = 0; i < row.size(); ++i) { std::cout << row.get<double>(i) << '\t'; } std::cout << '

'; }); } { // Process rows in the range [2,6] of the token grid grid.for_each_row(grid.range(2,6), [](const strtk::token_grid::row_type& row) { for (std::size_t i = 0; i < row.size(); ++i) { std::cout << row.get<double>(i) << '\t'; } std::cout << '

'; }); } { // Process each row and column of the token grid grid.for_each_row( [](const strtk::token_grid::row_type& row) { row.for_each_column( [](const strtk::token_grid::row_type::range_type& range) { std::cout << std::string(range.first,range.second) << '\t'; }); std::cout << '

'; }); } { // Process rows in the range [2,6] and the columns in the range [1,5] of the token grid grid.for_each_row(grid.range(2,6), [](const strtk::token_grid::row_type& row) { row.for_each_column(row.range(1,5), [](const strtk::token_grid::row_type::range_type& range) { std::cout << std::string(range.first,range.second) << '\t'; }); std::cout << '

'; }); } return 0; }

The following example demonstrates how averages over rows and columns of a token grid can be computed:

std::string data = "1.1,1.1,1.1,1.1,1.1,1.1

" "2.2,2.2,2.2,2.2,2.2,2.2

" "3.3,3.3,3.3,3.3,3.3,3.3

" "4.4,4.4,4.4,4.4,4.4,4.4

" "5.5,5.5,5.5,5.5,5.5,5.5

" "6.6,6.6,6.6,6.6,6.6,6.6

"; strtk::token_grid grid(data,data.size(),","); std::vector<double> avg_c(grid.row(0).size(),0.0); std::vector<double> avg_r(grid.row_count(),0.0); std::vector<double> tmp(grid.row(0).size(),0.0); std::fill(avg_c.begin(), avg_c.end(), 0.0); for (std::size_t i = 0; i < grid.row_count(); ++i) { grid.row(i).parse<double>(tmp.begin()); std::transform(avg_c.begin(), avg_c.end(), tmp.begin(), avg_c.begin(), std::plus<double>()); avg_r[i] = std::accumulate(tmp.begin(), tmp.end(), 0.0) / tmp.size(); } for (std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count()); std::cout << "Column Averages:\t"; std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator<double>(std::cout,'\t')); std::cout << '

'; std::cout << "Row Averages:\t"; std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator<double>(std::cout,'\t')); std::cout << '

';

The original intent of the token grid was to support fast and efficient processing of simple data tuples, such as comma separated values (CSV) formats et. al. The following example demonstrates a simple summation of traded floor volume and average daily volume based on NASDAQ OHLC (Open-High-Low-Close) data.

// Date,Symbol,Open,Close,High,Low,Volume std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768

" "20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127

" "20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630

" "20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699

" "20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0

" "20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0

" "20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557

" "20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638

" "20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307

" "20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412

" "20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854

" "20090708,MSFT,22.3100,22.5600,22.6900,2200000,73023306

" "20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816

" "20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174

" "20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559

" "20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698

"; strtk::token_grid grid(market_data, market_data.size(), ","); struct stock_info { stock_info(const std::string& s = " ") : symbol(s), total_volume(0), day_count(0), average_daily_volume(0.0) {} std::string symbol; unsigned long long total_volume; std::size_t day_count; double average_daily_volume; }; stock_info goog("GOOG"); stock_info msft("MSFT"); static const std::size_t volume_column = 6; static const std::size_t symbol_column = 1; goog.day_count = grid.accumulate_column (volume_column, [](const strtk::token_grid::row_type& row) -> bool { static const std::string google_symbol("GOOG"); return google_symbol == row.get<std::string>(symbol_column); }, goog.total_volume); msft.day_count = grid.accumulate_column (volume_column, [](const strtk::token_grid::row_type& row) -> bool { static const std::string microsoft_symbol("MSFT"); return microsoft_symbol == row.get<std::string>(symbol_column); }, msft.total_volume); goog.average_daily_volume = (1.0 * goog.total_volume) / goog.day_count; msft.average_daily_volume = (1.0 * msft.total_volume) / msft.day_count; std::cout << "[GOOG] Total Volume: " << goog.total_volume << std::endl; std::cout << "[MSFT] Total Volume: " << msft.total_volume << std::endl; std::cout << "[GOOG] ADV: " << goog.average_daily_volume << std::endl; std::cout << "[MSFT] ADV: " << msft.average_daily_volume << std::endl;

The strtk::token_grid is thread-safe iff read operations are in play. As such the above calls to accumulate_column et al. can all be safely and easily executed concurrently using threads. This allows for a far more efficient data processing methodology.

Playing devil's advocate, another way of performing the above processing task, assuming only the specific values for computing the ADV are required and no further processing of the CSV data is needed, then the problem can be solved efficiently by utilizing a single pass of the data as follows:

std::string file_name = "market_data.csv"; std::unordered_map stock_map; stock_map.insert(std::make_pair ("GOOG",stock_info("GOOG"))); stock_map.insert(std::make_pair ("MSFT",stock_info("MSFT"))); strtk::for_each_line(file_name, [&](const std::string& line) { strtk::ignore_token ignore; stock_info temp; const bool result = strtk::parse(line, ",", ignore, temp.symbol, ignore, ignore, ignore, ignore, temp.total_volume); if (!result) return; auto itr = stock_map.find(temp.symbol); if (stock_map.end() == itr) return; (*itr).second.total_volume += temp.total_volume; (*itr).second.day_count++; }); auto itr = stock_map.begin(); auto end = stock_map.end(); while (end != itr) { stock_info& stock = (*itr++).second; stock.average_daily_volume = (1.0 * stock.total_volume) / stock.day_count; }

Playing the devil's other advocate, the above two examples, have both required that the filter condition be explicitly defined at compile time. However even though the condition maybe be set in stone at compile time, some of the underlyings (such as symbol) can be engineered to be modified at run-time. That still doesn't give us the freedom to perform arbitrarily complex filter expressions determined at run-time.



That said, an extremely efficient and very simple solution is at hand. The solution is called the C++ DSV Filter Library, it is based on StrTK and ExprTk libraries. It uses the strtk::token_grid as a CSV/DSV store and index, and ExprTk as the underlying expression evaluation engine. The example below takes the OHLC market data table defined above and performs a row-wise query. The expression's definition is:



'select all rows where the open price is greater than the close price and the symbol matches the wild-card pattern of "*FT*" and the date is equal to or after 20090101'

int main() { std::string file_name = "market_data.csv"; dsv_filter filter; filter.set_input_delimiter(","); if (!filter.load(file_name)) return 1; std::string expression = "(open > close) and (symbol like '*FT*') and (date >= '20090101')"; filter.add_filter(filter_expression); for (std::size_t row = 1; row < filter.row_count(); ++row) { if (dsv_filter::e_match == filter[row]) { // do something... } } return 0; }

Other example queries:

volume >= 1000000 and symbol == 'GOOG'

abs(open - close) > abs(high - low)

avg(open,close,high,low) * volume > 10^7 and inrange('20090702',date,'20090730')

(open > close) and (symbol like '*FT*') and (date >= '20090101')

It should be noted that in the example above, the rows begin at index 1. That is done because the dsv_filter expects the first row or row at index 0 to be a column definition header. The format of the column definitions is to simply add a suffix of "_s" if the values in the column are to be treated as strings or "_n" if they are to be treated as numbers. When defining expressions the suffixes should not be included when including the column names. The above mentioned OHLC csv file's header would be as follows:

Date_s,Symbol_s,Open_n,Close_n,High_n,Low_n,Volume_n

C++ DSV Filter and Dependencies

A typical operation carried out upon time-series data is to group tuples into buckets (or bins) based upon the time index value. For example grouping data into 2-minute buckets and then performing some kind of operation upon the grouped tuples such as a summation or an average etc. This process is sometimes also called: "discretization"



The strtk::token_grid class provides a method known as sequential_partition. The sequential_partition method requires a Transition Predicate, a Function and optionally a row-range. The Transition Predicate consumes a row and returns true only if the row is in the next partition. All other subsequent consecutive rows until the transition predicate returns a true are said to be in the current partition. Prior to transitioning to a new partition, the function predicate is invoked and provided with the range of rows in the current partition.



The following example takes a simple time-series (time and value), partitions the tuples into groups of Time-Buckets of period length 3 and then proceeds to compute the total sum of each group. The below summarizer class provides provides a Transition Predicate and Function.

class summarizer { public: enum column_index { tick_time_column = 0, tick_value_column = 1 }; summarizer(std::deque<double>& sum_value) : next_tick_time_(0), sum_value_(sum_value) {} // Transition Predicate bool operator()(const strtk::token_grid::row_type& row) { if (row.get (tick_time_column) >= next_tick_time_) { next_tick_time_ = row.get (tick_time_column) + 3; return true; } else return false; } // Function bool operator()(const strtk::token_grid& grid, const strtk::token_grid::row_range_type& range) { double bucket_sum = 0.0; if (!grid.accumulate_column(tick_value_column,range,bucket_sum)) { std::cout << "failed to accumulate!" << std::endl; return false; } else sum_value_.push_back(bucket_sum); return true; } private: summarizer& operator=(const summarizer&); std::size_t next_tick_time_; std::deque<double>& sum_value_; }; int main() { // Time index, Value std::string data = "10000,123.456

" "10001,612.345

" "10002,561.234

" "10003,456.123

" "10004,345.612

" "10005,234.561

" "10006,123.456

"; strtk::token_grid grid(data, data.size(), ","); std::deque<double> sum_value; summarizer s(sum_value); grid.sequential_partition(s,s); for (std::size_t i = 0; i < sum_value.size(); ++i) { std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl; } return 0; }

The expected output is as follows:

bucket[0] = 1297.035 bucket[1] = 1036.296 bucket[2] = 123.456

One of the simple extensions to the CSV format is the concept of double quoted tokens. Such tokens may contain column or row delimiters. When such a scenario is encountered, all subsequent delimiters are ignored, and kept as part of the token, until the corresponding closing double quote is encountered. The StrTk token_grid supports the parsing of such tokens. This parsing mode can be easily activated via the token_grid option set. Below is an example of a token_grid loading a CSV data set representing various airports from around the world and their specific codes and locations, in which some of the cells are double-quoted:

ICAO IATA Airport City Country AYGA GKA "Goroka Gatue" Goroka Papua New Guinea BGCO GCO "Nerlerit Inaat Constable Pynt" "Nerlerit Inaat" Greenland BZGD ZGD Godley Auckland New Zealand CYQM YQM "Greater Moncton International" Moncton Canada EDRK ZNV "Koblenz Winningen" Koblenz Germany FAHU AHU "HMS Bastard Memorial" Kwazulu-Natal South Africa FQMP MZB "Mocimboa Da Praia" "Mocimboa Da Praia" Mozambique KINS INS "Indian Springs AF AUX" Indian Springs USA UHNN HNN Nikolaevsk "Nikolaevsk Na Amure" Russia WBKK BKI "Kota Kinabalu International" Kota Kinabalu Malaysia ZSJD JDZ "Jingdezhen Airport" Jingdezhen China

The following is an StrTk code example using token_grid to parse the above CSV data set:

int main() { std::string airport_data_file_name = "airport_data.csv"; strtk::token_grid::options options; options.column_delimiters = "| ,"; options.support_dquotes = true; strtk::token_grid airport_grid(airport_data_file_name,options); // for each row r, for each column c, print cell[r,c] for (std::size_t r = 0; r < airport_grid.row_count(); ++r) { strtk::token_grid::row_type row = airport_grid.row(r); for (std::size_t c = 0; c < row.size(); ++c) { std::cout << "[" << row.get<std::string>(c) << "] "; } std::cout << std::endl; } return 0; }

As previously mentioned the concept of a delimiter based predicate can lead to some very interesting solutions. A predicate as has been defined so far, with the exception of the offset predicate, has been a stateless entity. Adding the ability to maintain a state based on what the predicate has encountered so far can allow it to behave differently from the simple single and multiple delimiter predicates.



For this example, lets assume a typical command line parsing problem which consists of double quotation mark groupings and escapable special characters, which can be considered being dual use as either delimiters or data. An example input and output is as follows:

Inputs

Data abc;"123, mno xyz",i\,jk Delimiters <space>;,.

Output Tokens

Token 0 abc Token 1 123, mno xyz Token 2 i\,jk

In order to tokenize the above described string, one can create a composite predicate using a multiple char delimiter predicate and some simple state rules. The following is an example of such an extended predicate:

class extended_predicate { public: extended_predicate(const std::string& delimiters) : escape_(false), in_bracket_range_(false), mdp_(delimiters) {} bool operator()(const unsigned char c) const { if (escape_) { escape_ = false; return false; } else if ('\\' == c) { escape_ = true; return false; } else if ('"' == c) { in_bracket_range_ = !in_bracket_range_; return true; } else if (in_bracket_range_) return false; else return mdp_(c); } void reset() { escape_ = false; in_bracket_range_ = false; } private: mutable bool escape_; mutable bool in_bracket_range_; mutable strtk::multiple_char_delimiter_predicate mdp_; };

Usage of the newly defined extended predicate is as follows:

int main() { std::string str = "abc;\"123, mno xyz\",i\\,jk"; strtk::std_string::token_list_type token_list; strtk::split(extended_predicate(".,; "), str, std::back_inserter(token_list), strtk::split_options::compress_delimiters); return 0; }

Taking our previous person struct as an example. It is clear that the tuple format has to be very specific with regards to the ordering of data. For most situations this is acceptable as the serializers and deserializers attempt to function in the simplest manner possible, however sometimes pieces of information may come in different orderings, or may be deemed optional and hence not be present in a particular tuple of data.



An approach is required that provides the means of mapping specific pieces of data to the corresponding variables (or members) that will store or make sense of them, in a very efficient and simple manner. For clarity purposes, the term "mapping" here means to populate the desired member with the given data. The association between the two being made with the key that is paired in-line with the data. Key-Value pairs or some times known as Attribute-Value pairs are one of the means by which such an association can be accomplished. The following is a diagram that demonstrates the mapping of various fields of a data struct to their corresponding data elements in a tuple comprised of key-value pairs.

In the diagram above, the key-value pairs are separated (delimited) by the pipe symbol "|". With regards to the key-value pairs themselves, the key is traditionally the first element in the pair and the value is the second element, they are separated in this case by a single equal sign "=".



Back to our original problem relating to the person struct. If we were to add keys to each of the pieces of data, we could then not only parse a tuple of data representing the various fields of the person struct, but those fields could be any order within the tuple. Furthermore any of the fields could be deemed optional, hence not necessarily be present in the tuple. It should be noted that what would denote a correct or successful parse of tuple may not only depend upon successful parsing of ranges into the various types, but it may also depend upon the mandatory presence of certain fields.



Our objective will be to successfully and in the most efficient way possible parse the following list of tuples that represent instances of our person struct.

Tuple0 = "ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7" Tuple1 = "NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12" Tuple2 = "AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin" Tuple3 = "HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397" Tuple4 = "WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31" Tuple5 = "ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31" Tuple6 = "WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12" Tuple7 = "AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7" Tuple8 = "NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397" Tuple9 = "HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin"

StrTk provides a means to achieve the above key-value pair parsing task, namely via the strtk::keyvalue::parser and associated key- mappers. The following is a table that depicts the various kind of key to value mappers that are available in the StrTk library:

Mapper Type Key Lookup Complexity Maximum Size strtk::keyvalue::stringkey_map std::string O(log(n)) Limited to available memory strtk::keyvalue::uintkey_map cardinal value O(1) Limited to expected maximum key value

In the code below we initially begin by defining the delimiters we expect to see between pairs of key-values and in between the key and value pairs themselves. In the example below the pair_block_delimiter field denotes the delimiter we expect between pairs (or blocks) of key-values and the field pair_delimiter denotes the delimiter we expect between a particular key and value.



Next we define p as an instance of the person struct and register each of the members of the instance p with a corresponding key with the keyvalue::parser. After which we then process each tuple of data, parsing the tuple, populating the instance p, then printing out the various fields to stdout.

struct person { unsigned int id; // key = ID std::string name; // key = NAME unsigned int age; // key = AGE double height // key = HEIGHT float weight; // key = WEIGHT }; int main() { const std::string person_data[] = { "ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7", "NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12", "AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin", "HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397", "WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31", "ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31", "WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12", "AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7", "NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397", "HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin", }; const std::size_t person_data_size = sizeof(person_data) / sizeof(std::string); // Basic type definition typedef unsigned char char_type; typedef strtk::keyvalue::parser kvp_type; typedef strtk::keyvalue::options<char_type> opts_type; // Setup the various delimiters opts_type options; options.pair_block_delimiter = '|'; options.pair_delimiter = '='; kvp_type kvp(options); person p; strtk::hex_to_number_sink h2ns(p.id); // Define the mapping between the key and the // members of the person struct kvp.register_keyvalue("ID" , h2ns); kvp.register_keyvalue("NAME" , p. name); kvp.register_keyvalue("AGE" , p. age); kvp.register_keyvalue("HEIGHT", p.height); kvp.register_keyvalue("WEIGHT", p.weight); // Go through each tuple for (std::size_t i = 0; i < person_data_size; ++i) { // If parsing of the ith tuple is successful // print the person struct out. if (kvp(person_data[i])) { std::cout << i << '\t' << "ID: " << p.id << '\t' << "Name: " << p.name << '\t' << "AGE: " << p.age << '\t' << "HEIGHT: " << p.height << '\t' << "WEIGHT: " << p.weight << '

'; } } return 0; }

It should be noted that the underlying associative container for the strtk::keyvalue::stringkey_map can be explicitly specified at compile time. By default it is std::map, however it can be easily changed to std::unordered_map giving it a key lookup complexity of O(1) or in fact any other container that is compatible with STL associative map semantics. The following are three examples of keyvalue_parsers specialized using std::unordered_map, the first being based on the std::string type, the second being based on the int type and the third being based on the double type:

int main() { // std::string type key to value mapper based on std::unordered_map { // Basic type definition typedef unsigned char char_type; typedef std::unordered_map key_to_val_map_t; typedef strtk::keyvalue::key_map stringhashkey_mapper; typedef strtk::keyvalue::parser kvp_type; typedef strtk::keyvalue::options<char_type> opts_type; //Setup the various delimiters opts_type options; options.pair_block_delimiter = '|'; options.pair_delimiter = '='; kvp_type kvp(options); } // int type key to value mapper based on std::unordered_map { // Basic type definition typedef unsigned char char_type; typedef std::unordered_map key_to_val_map_t; typedef strtk::keyvalue::key_map inthash_key_mapper; typedef strtk::keyvalue::parser kvp_type; typedef strtk::keyvalue::options<char_type> opts_type; // Setup the various delimiters opts_type options; options.pair_block_delimiter = '|'; options.pair_delimiter = '='; kvp_type kvp(options); } // double type key to value mapper based on std::unordered_map { // Basic type definition typedef unsigned char char_type; typedef std::unordered_map key_to_val_map_t; typedef strtk::keyvalue::key_map doublehash_key_mapper; typedef strtk::keyvalue::parser kvp_type; typedef strtk::keyvalue::options<char_type> opts_type; // Setup the various delimiters opts_type options; options.pair_block_delimiter = '|'; options.pair_delimiter = '='; kvp_type kvp(options); } return 0; }

To further the previous example, one might have a situation where a tuple contains values for different members of different types. The following example demonstrates parsing of key-value pairs that map to members of multiple types. The tuple in the example consists of values some of which are intended for the instance of data1 and others which are intended for the instance of data2.

struct data1 { int x0; std::string x1; double x2; float x3; } struct data2 { int y0; std::string y1; double y2; float y3; } int main() { typedef unsigned char char_type; typedef strtk::keyvalue::parser kvp_type; typedef strtk::keyvalue::options<char_type> opts_type; opts_type options; options.pair_block_delimiter = '|'; options.pair_delimiter = '='; kvp_type kvp(options); data1 d1; data2 d2; std::string data = "D1X0=123456789|D1X1=AbCdEfG123456|D1X2=123.456|D1X3=0101.0202f|" "D2Y0=987654321|D2Y1=tUvWxYz789012|D2Y2=789.012|D2Y3=0707.0909f"; // Register values for data1 type kvp.register_keyvalue("D1X0", d1.x0); kvp.register_keyvalue("D1X1", d1.x1); kvp.register_keyvalue("D1X2", d1.x2); kvp.register_keyvalue("D1X3", d1.x3); // Register values for data2 type kvp.register_keyvalue("D2Y0", d2.y0); kvp.register_keyvalue("D2Y1", d2.y1); kvp.register_keyvalue("D2Y2", d2.y2); kvp.register_keyvalue("D2Y3", d2.y3); // Parse the tuple kvp(data); std::cout << "D1X0: " << d1.x0 << '\t' << "D1X1: " << d1.x1 << '\t' << "D1X2: " << d1.x2 << '\t' << "D1X3: " << d1.x3 << '

'; std::cout << "D2Y0: " << d2.y0 << '\t' << "D2Y1: " << d2.y1 << '\t' << "D2Y2: " << d2.y2 << '\t' << "D2Y3: " << d2.y3 << '

'; return 0; }

The values in the key-value pairs need not always be singular. In some scenarios, a value could be a list of values. The StrTk key-value parser supports parsing of such key-value pairs through the previously demonstrated sequence sink mechanisms.



In the following example we have a complex_data struct that consists of some PODs but also a number of sequences (vector, deque, list). The tuple to be parsed consists of simple key-value pairs for the PODs and more complex looking pairs where the values for the specific sequence of values are separated by commas ",".

struct complex_data { unsigned int v0; std::vector<int> v1; double v2; std::deque<double> v3; std::string v4; std::list<std::string> v5; }; int main() { typedef unsigned char char_type; typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type; typedef strtk::keyvalue::options<char_type> opts_type; opts_type options; options.pair_block_delimiter = '|'; options.pair_delimiter = '='; kvp_type kvp(options); complex_data cd; std::string data = "V0=123456789|V1=-3,-2,-1,0,+1,2+3|V2=111.222|V3=1.1,2.2,3.3,4.4,5.5|" "V4=Some Text|V5=Text1,Text2,Text3,Text4"; strtk::vector_sink<int>::type vec_sink(","); strtk::deque_sink<double>::type deq_sink(","); strtk::list_sink<std::string>::type lst_sink(","); // Register values for data1 type kvp.register_keyvalue("V0", cd.v0); kvp.register_keyvalue("V1", vec_sink(cd.v1)); kvp.register_keyvalue("V2", cd.v2); kvp.register_keyvalue("V3", deq_sink(cd.v3)); kvp.register_keyvalue("V4", cd.v4); kvp.register_keyvalue("V5", lst_sink(cd.v5)); kvp(data); std::cout << "V0: " << cd.v0 << '

' << "V1: " << strtk::join(" ",cd.v1) << '

' << "V2: " << cd.v2 << '

' << "V3: " << strtk::join(" ",cd.v3) << '

' << "V4: " << cd.v4 << '

' << "V5: " << strtk::join(" ",cd.v5) << '

'; return 0; }

So far all the examples have assumed keys of arbitrary values. In some situations such as the FIX protocol, the keys are always guaranteed to be positive integer values. If this is the case then a different kind of key-mapper can be used that is much more efficient than the general purpose string key-mapper, providing a key lookup complexity of O(1). As such StrTk provides the strtk::keyvalue::uintkey_map type for this purpose. The only difference in terms of setting up the uintkey_map from the stringkey_map is that it requires a key_count to be set. This value represents the largest possible key value that can exist. The following is an example of how the strtk::keyvalue::uintkey_map key-mapper can be used.

struct data_store { char c; // key = 121 unsigned char uc; // key = 122 short s; // key = 123 unsigned short us; // key = 124 int i; // key = 125 std::string str; // key = 126 }; int main() { const std::string data = "121=A|122=z|123=-123|124=456|125=-12345678|126=Some simple text"; typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type; strtk::keyvalue::uintkey_map::options options; options.key_count = 127; // [0,126] --> 127 keys options.pair_block_delimiter = '|'; options.pair_delimiter = '='; kvp_type kvp(options); data_store ds; kvp.register_keyva