Parsing Escaped String Input Using Spirit.Qi By Jeroen Habraken

In a previous article you’ve learned how to generated escaped strings, today we’re going to do the reverse. We’re going to describe a Qi grammar you can use to parse quoted strings in which special characters are escaped.

The purpose of the unescaped_string grammar is to removed the enclosing quotes and un-escape the characters which were previously escaped (i.e. convert “\

” back to ‘

’ again). Unsurprisingly, the Parsing Expression Grammar (PEG) is quite like the one in the Karma article:

unesc_str = '"' (unesc_char / . / "\\x" hex)* '"' unesc_char = &"\\a" '\a' / &"\\b" '\b' / &"\\f" '\f' / &"\

" '

' / &"\\r" '\r' / &"\\t" '\t' / &"\\v" '\v' / &"\\\\" '\\' / &"\\\'" '\'' / &"\\\"" '"'

An escaped strings starts and ends with the quoting character (in this case ‘”‘) and all characters of the sequence are read either as an escaped character (unesc_char), a printable character or a “\\x” followed by the hexadecimal representation of a character code. The unesc_char will handle any of the listed character codes by parsing a backslash followed by the corresponding C-style encoding. Like in the Karma example, each of the listed alternatives (such as &”\\a” ‘\a’) reads as: if the input being parsed starts with “\\a”, return ‘\a’.

Now let’s convert this to a Spirit grammar:

unesc_str = '"' >> *(unesc_char | qi::alnum | "\\x" >> qi::hex) >> '"'; unesc_char.add("\\a", '\a')("\\b", '\b')("\\f", '\f')("\

", '

') ("\\r", '\r')("\\t", '\t')("\\v", '\v')("\\\\", '\\') ("\\\'", '\'')("\\\"", '\"');

Just like in the Karma example we’re using predefined facilities, but in this case Qi facilities. Where we were previously using karma::symbols<> to map special characters to their C-style representation, we’re doing the reverse now: mapping C-style representations to the special characters using qi::symbols<>. This too will conveniently fail if the representation is not in the symbol table. The next parser that will be tried is qi::alnum, which successfully accepts an alphanumeric character and fails on all other input. At this point you might be wondering why I’m not using qi::print, like we were using karma::print, I’ll come back to this later. Last but not least we’ll try converting a hexadecimal character representation to the actual character code if it’s prefixed with a “\\x”.

Generally, what is true for Karma generators holds for their Qi parser counterparts. They too have the ability to fail parsing if their preconditions are not met, allowing them to be used in alternatives just like the Karma generators. When one fails, the next alternative is tried until we’re out of alternatives and have to fail overall. The next alternative here is qi::alnum, which as we said accepts an alphanumeric character and fails otherwise.

Since we now have the grammar, the next step is to figure out which attributes to use for the rules. The attributes in Qi are the types and values we gat as the result of converting the input, thus it’s quite straightforward:

qi::rule<InputIterator, std::string()> unesc_str; qi::symbols<char const, char const> unesc_char;

Note that we’re assuming narrow character representation here, for wide character representation one would have used std::wstring and wchar_t const.

Akin to the Karma example, we use the Qi counterpart qi::rule<> as the non-terminal for storing the parsed data for the unesc_str, for which we again assume std::string to be its attribute. Again two of the parser alternatives inside the Kleene Star expose a single character as their attribute, the other (“\\x” >> qi::hex) exposing an unsigned int which can be implicitly casted to a char. If we strictly followed the rules we’ll see the rule actually exposes a vector<char>, but this is compatible with a std::string. Secondly the C-style representations with their character matches are stored in a qi::symbols<> instance. Now you might wonder why it’s a qi::symbols<char const, char const> and not qi::symbols<char const *, char const> since the Karma example was using karma<char const, char const *> the first parameter is the character type of the strings stored, not the type of the strings.

That’s all there’s to it!

Since the Karma example allows you to specify the quoting character, we’re going to do so too, and I’ll use this opportunity to show just how similar Karma and Qi really are. Now here’s the Karma code presented in the previous article:

karma::rule<OutputIterator, std::string(char const*)> esc_str = karma::lit(karma::_r1) << *(esc_char | karma::print | "\\x" << karma::hex) << karma::lit(karma::_r1);

Since we’re parsing data instead of generating it, we should use >> instead of <<, and obviously we’re using Qi thus we need to use that namespace, which gives us:

qi::rule<InputIterator, std::string(char const*)> unesc_str = qi::lit(qi::_r1) >> *(unesc_char | qi::alnum | "\\x" >> qi::hex) >> qi::lit(qi::_r1);

Easy huh?

This leads us back to the reason we’re using qi::alnum instead of qi::print, there are two reasons:

The alternatives are are tried in the order they are in, thus if we had *(unesc_char | qi::print | “\\x” >> qi::hex) the “\\x” >> qi::hex option would never be tried since “\\x” and hexadecimal characters are printable, and thus parsed using qi::print. Simply re-arranging the order solves this problem, thus we could use *(unesc_char | “\\x” >> qi::hex | qi::print) . The quoted character used in the example is “”'” , which means we’re actually parsing as follows

unesc_str = qi::lit("'''") >> *(unesc_char | "\\x" >> qi::hex | qi::print) >> qi::lit("'''");

but the Kleene Star is greedy (an so is the unary +, which means “one or more”), thus it would parse the closing “”'” too! The solution here is not to use qi::print, but (qi::print – “‘”) which means “all printable characters, except “‘”.

Putting this all together and wrapping it in a qi::grammar we get:

template <typename InputIterator> struct unescaped_string : qi::grammar<InputIterator, std::string(char const*)> { unescaped_string() : unescaped_string::base_type(unesc_str) { unesc_char.add("\\a", '\a')("\\b", '\b')("\\f", '\f')("\

", '

') ("\\r", '\r')("\\t", '\t')("\\v", '\v') ("\\\\", '\\')("\\\'", '\'')("\\\"", '\"') ; unesc_str = qi::lit(qi::_r1) >> *(unesc_char | qi::alnum | "\\x" >> qi::hex) >> qi::lit(qi::_r1) ; } qi::rule<InputIterator, std::string(char const*)> unesc_str; qi::symbols<char const, char const> unesc_char; };

which can be called as such:

typedef std::string::const_iterator iterator_type; std::string parsed; std::string str("'''string to unescape:\\x20\

\\r\\t\\\"\\'\\x41'''"); char const* quote = "'''"; iterator_type iter = str.begin(); iterator_type end = str.end(); client::unescaped_string<iterator_type> p; qi::parse(iter, end, p(quote), parsed); // this will result in: parsed == "string to unescape:

\r\t\"\'A"

Which uses qi::parse, one of Qi its main API functions, similar to karma::generate for generation. It takes the beginning and the end of the string to parse, an instance of the parser (p) and an attribute instance to which the parsed data is saved (parsed). We pass the quoting character sequence (the “”'” in quote) as an inherited attribute when invoking the grammar. When successful qi::parse returns true, while it returns false otherwise and one can compare iter to end to check if the complete input string has been parsed.

This leaves one minor enhancement, one can use parsed.reserve(str.length()) to reserve enough space in parsed to fit the un-escaped string, which encompasses a small speed enhancement.

If you want to try out this example for yourself, the complete source code is available from the Boost SVN here.