Parsing a List of Key-Value Pairs Using Spirit.Qi
By Hartmut Kaiser

One of the goals of this blog is to provide shrink-wrapped solutions for small, everyday parsing and output generation problems. We hope to show how Spirit can simplify your life as a C++ programmer when converting data from some external representation into your internal data structures and vice versa.

A task that comes up often is parsing arbitrary key/value pairs, delimited by some separator, into a std::map. The URL query format is one example: key1=value1;key2=value2;…;keyN=valueN, and that is what I would like to provide a reusable solution for.

Let’s start with the corresponding grammar (written using the notation of Parsing Expression Grammars):

    query ← pair ((';' / '&') pair)*
    pair  ← key ('=' value)?
    key   ← [a-zA-Z_][a-zA-Z_0-9]*
    value ← [a-zA-Z_0-9]+

We assume that a key has to start with a letter or an underscore, but otherwise may consist of letters, digits, or underscores. A value cannot be empty and may likewise consist of letters, digits, or underscores only. Further, we assume a query to be a sequence of at least one pair delimited by semicolons (‘;’) or ampersands (‘&’), and a pair to be a mandatory key optionally followed by a ‘=’ and a value.
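To make these rules concrete before moving on to Spirit, here is a minimal hand-written recognizer for the same grammar in plain C++. It is only a sketch for illustration and does not use Spirit; the namespace peg_sketch and the function names (match_key, match_value, match_pair, match_query, is_query) are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hand-written recognizer for the PEG above (illustration only, no Spirit).
// Each match_* function tries to match at position pos and advances pos
// past the matched text on success. All names here are hypothetical.
namespace peg_sketch {

inline bool is_key_start(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

inline bool is_word(char c) {
    return is_key_start(c) || (c >= '0' && c <= '9');
}

// key <- [a-zA-Z_][a-zA-Z_0-9]*
inline bool match_key(const std::string& s, std::size_t& pos) {
    if (pos >= s.size() || !is_key_start(s[pos])) return false;
    ++pos;
    while (pos < s.size() && is_word(s[pos])) ++pos;
    return true;
}

// value <- [a-zA-Z_0-9]+
inline bool match_value(const std::string& s, std::size_t& pos) {
    const std::size_t start = pos;
    while (pos < s.size() && is_word(s[pos])) ++pos;
    return pos > start;  // at least one character is required
}

// pair <- key ('=' value)?
inline bool match_pair(const std::string& s, std::size_t& pos) {
    if (!match_key(s, pos)) return false;
    const std::size_t save = pos;
    if (pos < s.size() && s[pos] == '=') {
        ++pos;
        if (!match_value(s, pos)) pos = save;  // optional part failed: backtrack
    }
    return true;
}

// query <- pair ((';' / '&') pair)*
inline bool match_query(const std::string& s, std::size_t& pos) {
    if (!match_pair(s, pos)) return false;
    for (;;) {
        const std::size_t save = pos;
        if (pos < s.size() && (s[pos] == ';' || s[pos] == '&')) {
            ++pos;
            if (match_pair(s, pos)) continue;
        }
        pos = save;  // the repetition stops at the last complete pair
        return true;
    }
}

// True only if the whole string is a query.
inline bool is_query(const std::string& s) {
    std::size_t pos = 0;
    return match_query(s, pos) && pos == s.size();
}

}  // namespace peg_sketch
```

With this sketch, peg_sketch::is_query("key1=value1;key2;key3=value3") is true, while "1abc=2" (key starts with a digit) and "a=" (empty value) are rejected.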

Converting any Parsing Expression Grammar (PEG) into an equivalent grammar for Spirit.Qi is a purely mechanical step. Let me provide the result first and explain some of the details later on:

    query = pair >> *((qi::lit(';') | '&') >> pair);
    pair  = key >> -('=' >> value);
    key   = qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
    value = +qi::char_("a-zA-Z_0-9");

All the differences we see are caused by limitations of the C++ language that Spirit has to live with. Direct juxtaposition is expressed using the right shift operator (‘>>’); the ordered choice (‘/’) is written as ‘|’; postfix operators such as the Kleene Star (‘*’) and the Plus (‘+’) are moved to the front of the corresponding expression and written as prefix operators; and the optional operator (question mark) is rewritten using the unary minus (‘-’). Otherwise the two grammars look fairly similar. The char_ is a predefined Qi parser primitive matching exactly one character based on the description provided as its argument. The lit is very similar to char_, except that it does not expose the matched value as its attribute.

The next required step is to understand what attribute type each of our grammar rules should expose. These attribute types allow us to map the external representation onto the internal C++ types we will use to store the parsing results. Let us store the keys and the values as std::string (assuming we have to deal with narrow character representations). The result of a parsed key/value pair can conveniently be stored in a std::pair<std::string, std::string>. Finally, the overall result of parsing the query string should be stored in a std::map<std::string, std::string>. This knowledge allows us to write the declarations of the rule variables used above:

    qi::rule<Iterator, std::map<std::string, std::string>()> query;
    qi::rule<Iterator, std::pair<std::string, std::string>()> pair;
    qi::rule<Iterator, std::string()> key, value;

The type qi::rule<> is a predefined non-terminal parser provided by Spirit usable for storing the grammar definitions above. We need to provide the iterator type of the underlying input data (Iterator) and the attribute type, which is the type of the data the rule is supposed to store its parsed data in.

A word about the unusual function declaration syntax used to specify the attribute type of the rule. Non-terminals in recursive descent parsers can be seen as being very similar to functions. They return a value, their (synthesized) attribute, while they optionally may take arguments, their (inherited) attributes. Spirit uses the function declaration syntax in order to emphasize this similarity.

In the beginning I promised to provide a shrink-wrapped solution, so what is still left is to encapsulate the whole functionality. Again, Spirit has a recommended way of doing that: the grammar<>. Spirit grammars are special non-terminals acting as containers for one or more rules, which makes it possible to encapsulate more complex parsers:

    #include <boost/spirit/include/qi.hpp>
    #include <boost/fusion/include/std_pair.hpp>  // lets std::pair act as an attribute
    #include <map>
    #include <string>

    namespace qi = boost::spirit::qi;

    template <typename Iterator>
    struct keys_and_values
      : qi::grammar<Iterator, std::map<std::string, std::string>()>
    {
        keys_and_values()
          : keys_and_values::base_type(query)   // query is the start rule
        {
            query = pair >> *((qi::lit(';') | '&') >> pair);
            pair  = key >> -('=' >> value);
            key   = qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
            value = +qi::char_("a-zA-Z_0-9");
        }

        qi::rule<Iterator, std::map<std::string, std::string>()> query;
        qi::rule<Iterator, std::pair<std::string, std::string>()> pair;
        qi::rule<Iterator, std::string()> key, value;
    };

The derivation from Qi’s grammar type converts the keys_and_values type into a parser. Its member rules define a grammar which makes it usable for recognizing URL query strings. The base class constructor is passed the start rule, which is the topmost rule of the grammar, to be executed when the grammar is invoked. The type keys_and_values has a template parameter so that the grammar can be used with arbitrary iterator types.

The last missing code piece shows how to invoke the newly created parser.

    std::string input("key1=value1;key2;key3=value3");  // input to parse
    std::string::iterator begin = input.begin();
    std::string::iterator end = input.end();

    keys_and_values<std::string::iterator> p;    // create instance of parser
    std::map<std::string, std::string> m;        // map to receive results
    bool result = qi::parse(begin, end, p, m);   // returns true if successful

The function qi::parse() is one of Spirit’s main API functions. In the simplest case it takes a pair of iterators pointing to the input sequence to parse (begin, end), an instance of the parser to invoke (p), and the attribute instance to be filled with the converted data (m). This function executes the actual parsing operation and returns true if it was successful.

The fact that attributes of certain types get filled on the fly might look like magic to you, or you might think I left out some essential code snippets. But it is not magic, nor did I leave anything out. In Spirit.Qi all parser components (such as char_ or lit) expose specific attribute types, and all compound parsers (such as sequences and alternatives) implement well-defined rules for attribute propagation and merging. Our example relies on these rules; if you want to understand how this works in more detail, please refer to Spirit’s documentation.

Here is an example: as you might expect, char_ exposes the matched character as its attribute. The Kleene Star (‘*’) and the Plus (‘+’) are compound parsers collecting the attributes of their embedded parser in an (STL compatible) container (in our case a std::string, which is a container of char). The non-terminal rule normally inherits the attribute of the parser expression on the right hand side of its assignment. Last but not least, sequences match the attributes of their elements to the corresponding parts of their attribute data structure. Since literals (such as '=' or lit(';')) do not expose any attribute, the expression

pair = key >> -('=' >> value);

naturally combines the two string attributes of key and value into the pair of strings as expected by the non-terminal pair, initializing the attribute of value with an empty string if it is not matched.

If you want to try it out for yourself, the complete source code for this example is available from the Boost SVN here. In the future this example will be distributed as part of the Spirit distribution, but for now it lives in the SVN only.