Introducing the Boost parser framework

One of the more complicated tasks for a C++ programmer is coding a parser within a reasonable time period. Using the GNU Flex/Bison or ANTLR parser generator typically makes sense when you're developing a compiler for a full-blown language like SQL or C++; but for grammars with a simpler Backus Naur Form (BNF), the steep learning curve of these tools don't always justify the investment. Another alternative is to use regular-expression libraries that come bundled with standard Linux® distributions or the Boost regex or tokenizer libraries, but these don't scale well with more involved grammars.

This article introduces the highly scalable Spirit parser framework from Boost. This parser generator works on an Extended Backus Naur Form (EBNF) specification coded in C++, significantly reducing development time. For further reading, see the very detailed Spirit documentation.

Install Spirit

You can download the Spirit framework for free from the Boost Web site (see the Related topics section). Note the following before you begin developing with Spirit:

You must include the <spirit.hpp> header in the sources. This header makes heavy use of template meta-programming and functors. All code in this article has been compiled using g++-3.4.4. Make sure you're using a compiler that supports these C++ features.

header in the sources. This header makes heavy use of template meta-programming and functors. All code in this article has been compiled using g++-3.4.4. Make sure you're using a compiler that supports these C++ features. Part of the Spirit framework internally uses the regular-expression library from Boost. After installation, check for the regex.h header in the installed code base.

header in the installed code base. Make sure the root directory of the Boost installation is in the compiler include search path.

search path. Spirit is a header-only library, so no extra libraries are needed at link time. This isn't true for the regex library. To include the regex sources as headers only, use the preprocessor directive define BOOST_SPIRIT_NO_REGEX_LIB in your code.

Your first Spirit program

Given a random list of words, in true C++ style your first Spirit program lists the number of unique occurrences of Hello World (that is, consecutive occurrences of the words Hello and World in the input stream) found in the list. See Listing 1; the output is 2.

Listing 1. Code to list the number of times the words Hello World occur in an input stream

#define BOOST_SPIRIT_NO_REGEX_LIB #include "regex.h" #include "spirit.hpp" #include "boost/spirit/actor.hpp" using namespace boost::spirit; const string input = "This Hello World program using Spirit counts the number of Hello World occurrences in the input"; int main () { int count = 0; parse (input.c_str(), *(str_p("Hello World") [ increment_a(count) ] | anychar_p) ); cout << count >> endl; return 0; }

The power of the Spirit framework lies in the fact that it provides built-in parsers for a host of basic types like individual characters, numbers, and strings. More complex parsers are typically created using these built-in parser objects. In Listing 1, str_p and anychar_p are predefined parsers from Spirit -- str_p matches the string it's supplied with (in this case, Hello World) and on success calls the increment_a routine to increment count by 1. anychar_p is another predefined parser that matches any character.

Let's now turn to the parse function, which is arguably the most important routine in the Spirit framework. It accepts an input stream and a grammar, and internally it runs the stream through this grammar. In this case, the input stream comes from input.c_str() , and str_p and anychar_p provide the semantics for the grammar. If you're familiar with parsing, you'll quickly realize that the second argument to the parse function is equivalent to providing a BNF.

Other predefined Spirit parsers

Consider a string that follows this pattern: <employee name: string> <employee id: int> <employee rating: float> . You need to populate the Employee data structure based on data extracted from this string. Here's a typical string: "Alex 8 9.2 Jim 91 5.6" .

Spirit has predefined parsers for characters ( alpha_p ), integers ( int_p ), and real numbers ( real_p ). Based on this, it's safe to argue that the parse routine should be called with syntax like this: parse(input.c_str(), alpha_p >> int_p >> real_p) . The logic is that parse will look first for a string, then for an integer, and finally for a real, in that order, in the input stream. Does this work? No. Listing 2 shows a working code snippet that can parse this data for you.

Listing 2. Using the alpha_p, int_p and real_p predefined parsers

#define BOOST_SPIRIT_NO_REGEX_LIB #include "regex.h" #include "spirit.hpp" #include "boost/spirit/actor/assign_actor.hpp" using namespace std; using namespace boost::spirit; const string input = "Alex 8 9.2 Jim 91 5.6"; typedef struct { string name; int idcode; float rating; } Employee; int main () { string name; int idcode; float rating; int status = parse (input.c_str(), *((+alpha_p) [assign_a(name)] >> ' ' >> int_p[assign_a(idcode)] >> ' ' >> real_p[assign_a(rating)] >> !blank_p) ).full; cout << status << endl; return 0; }

The original invocation fails for these reasons:

alpha_p parses a single character. To parse a string, you must use +alpha_p (this is similar to the EBNF + operator that implies one or more, except that Spirit uses it before and not after).

parses a single character. To parse a string, you must use (this is similar to the EBNF operator that implies one or more, except that Spirit uses it before and not after). Blank spaces separate the string, integer, and real numbers. This behavior must be accounted for. You can do so two ways: use ' ' ; or, better, use the blank_p predefined parser, which accounts for both spaces and tabs.

Here's the modified parse invocation:

parse(input.c_str(), *((+alpha_p) >> ' ' >> int_p >> ' ' >> real_p) >> !blank_p);

The second argument strictly matches a non-alphanumeric string followed by a space followed by an integer followed by another space and finally a real number. Once the parser reaches the real number, it finds a space/tab and either restarts matching the sequence or terminates. The ! operator means zero or one occurrence of a space/tab. The * operator implies zero or more occurrences of this sequence and therefore matches an empty string.

It's easy to see that a direct correlation exists between the second argument and the potential grammar rule that a traditional parser would use. Here's the typical grammar rule for the current requirement:

:S -> (ALPHA INT REAL)*

The tokens ALPHA , INT , and REAL are typically provided from the lexer. For example, INT is defined as (0-9)+. With Spirit, the steps are merged.

How do you know if something's wrong?

There are several ways to figure out if something's gone wrong with your parser. The easiest way to check is to test the data structure returned from the parse method. The return data structure is known as parse_info , and the hit field indicates whether the parsing is successful. Listing 3 shows the parse_info structure from the Boost sources.

Listing 3. The parse_info structure that the parse method returns

template <typename IteratorT = char const*> struct parse_info { IteratorT stop; // points to final parse position bool hit; // true when parsing is successful bool full; // when the parser consumed all the input std::size_t length; // number of characters consumed by parser parse_info( IteratorT const& stop_ = IteratorT(), bool hit_ = false, bool full_ = false, std::size_t length_ = 0) : stop(stop_) , hit(hit_) , full(full_) , length(length_) {} template <typename ParseInfoT> parse_info(ParseInfoT const& pi) : stop(pi.stop) , hit(pi.hit) , full(pi.full) , length(pi.length) {} };

What is assign_a? Once a predefined parser matches a string, the result needs to be stored somewhere. Using the assign_a construct, the parsed string is assigned to the corresponding variable. The general class of constructs that assign/modify variables in the Spirit framework are known as actors and are located in the boost/spirit/actor folder.

Spirit operators and their semantics

Spirit comes with several predefined operators. Table 1 summarizes these operators and their semantics. The following examples use these operators.

Table 1. Spirit operators and their semantics

Operator Semantics x >> y Match x followed by y x | y Match x or y x & y Match x and y x â y Match x but not y x ^ y Match x or y but not both *x Match x zero or more times +x Match x one or more times !x Match x zero or one time ( x ) Match x; used for priority-based grouping x [ function expression ] Execute the function/functor if x is matched x % y Match x one or more times, separated by occurrences of y

With the understanding you've developed thus far, you can now define the grammar for a floating-point number in C. The BNF is shown in Listing 4.

Listing 4. BNF for a floating-point number

Real-Number : Fractional-Part (Exponent-Part)? Fractional-Part : (DIGIT)* DOT (DIGIT)+ | (DIGIT)+ DOT Exponent-Part : ('e'|'E') ('+'|'-')? (DIGIT)+ DIGIT : ['0'-'9'] DOT : '.'

The equivalent Spirit grammar is provided in Listing 5.

Listing 5. Spirit grammar for a floating-point number, equivalent to the BNF in Listing 4

Real-Number = Fractional-Part >> ! Exponent-Part | +digit_p >> Exponent-Part ; Fractional-Part = *digit_p >> '.' >> +digit_p | +digit_p >> '.' ; Exponent-Part = ('e' | 'E') >> !('+' | '-') >> +digit_p;

Observe that Y = A >> B in the context of Spirit is the same as Y : A B in the context of parsers, where A and B can be either terminals or non-terminals. Also note that the user doesn't need to define grammar for such trivial stuff: Spirit already has the predefined parser real_p to parse real numbers.

Predefined parsers in Spirit

The flexibility of the Spirit framework comes from the fact that it has a number of predefined parsers for common situations. Table 2 provides a list of some of these parsers.

Table 2. Some of the predefined parsers in Spirit

Parser Semantics ch_p Matches a single character. range_p Matches a single character from a range of characters created from a low/high character pair. For example, range_p('a', 'z') matches all characters between a and z. anychar_p Matches any single character, including the NULL terminator \0 . str_p Matches a string: for example, str_p("mystring") matches the string mystring . blank_p Matches a continuous sequence of whitespace and tabs. space_p Similar to blank_p but also matches return and newline characters. digit_p Matches a single numeric digit. upper_p Matches any uppercase character. nothing_p Diagnostic tool; never matches anything and always fails.

Spirit directives

This section discusses directives, which are yet another powerful Spirit feature. The lexers for case-insensitive languages like Pascal and VHDL are slightly more complicated because they must parse, for example, begin and BEGin and generate the same token for the parser. Spirit accomplishes this feat by using parser directives. For instance, the predefined directive as_lower_d converts the input stream to lowercase (see Listing 6).

Listing 6. Using the as_lower_d directive for case-insensitive parsing

#define BOOST_SPIRIT_NO_REGEX_LIB #include "regex.h" #include "spirit.hpp" #include "boost/spirit/actor/assign_actor.hpp" using namespace std; using namespace boost::spirit; const string input = "THis iS a ranDOm sTRInG"; int main () { string val; int status = parse (input.c_str(), as_lower_d[str_p ("this is a random string") [assign_a(val)] ]).full; cout << status << endl; cout << val << endl; return 0; }

The output from Listing 6 is 1, THis iS a ranDOm sTRInG . It's important to understand the difference between the parsers and the parser directives -- the latter only modify the behavior of the enclosed parser, essentially augmenting the policies of that parser.

Spirit provides other predefined parser directives as well as ways to write some of your own. Let's look now at the longest_d parser directive. Consider Listing 7, and try to guess the output.

Listing 7. Parsing with ambiguous grammar

#define BOOST_SPIRIT_NO_REGEX_LIB #include "regex.h" #include "spirit.hpp" #include "boost/spirit/actor/assign_actor.hpp" using namespace std; using namespace boost::spirit; const string input = "20245.1"; int main () { int val; int status = parse (input.c_str(), int_p[assign_a(val)] | real_p).full; cout << status << " " << val << endl; return 0; }

The output from Listing 7 is 0 20245 . Why is this happening? Clearly, the entire input buffer hasn't been consumed while parsing, which is why status is 0 . To understand this, it's important to keep in mind how Spirit parses: given multiple alternatives to a rule example, S : R1 | R2 | .. | RN , the one to the left gets the maximum priority. This is similar to the way C/C++ handle conditionals: in an expression if (x && y) , if x is true, then y isn't evaluated. This behavior helps keep the tool fast.

In this case, int_p matches 20245 -- but after that it encounters a dot and has no rule to parse it. So, the parser exits.

You could solve this problem by regrouping all the available alternatives to a grammar rule, but manual regrouping is often error prone. The smarter way to handle this is to use the longest_d directive, which tries to match the rule that consumes the maximum length from the input stream. Listing 8 shows a modified call to the parse routine.

Listing 8. Using the longest_d predefined parser directive

int status = parse (input.c_str(), longest_d [int_p | real_p[assign_a(val)] ] ).full;

With this change, the output is now 1 20245.1 .

Develop a full-fledged grammar in Spirit

This section delves into designing a set of user-defined grammar rules using the Spirit framework. Here's what Spirit requires in order to design your own grammar:

Create a derived class inherited from the predefined grammar class. The grammar class is a template class that is parameterized by its derived class DerivedT and the context class ContextT . The declaration for the grammar class is as follows: template< typename DerivedT, typename ContextT = parser_context<> > struct grammar; The derived class that you design must have a nested template class/structure named definition (you may not change this name). The definition class has the following properties: It's a template class with the typename ScannerT .

. The grammar rules are defined in its constructor. The constructor is passed a reference to the actual grammar self .

. A member function named start must be present; it signifies the start rule.

Listing 9 shows the basic skeleton of a user-defined grammar.

Listing 9. Outline of the user-defined grammar class

struct my-grammar : public grammar<my-grammar> { template <typename ScannerT> struct definition { rule<ScannerT> startRule; definition(my-grammar const& self) { /* define grammar rules here */ } rule<ScannerT> const& start() const { return startRule; } }; };

Suppose you want to try to support the simple grammar shown in Listing 10, which partially parses C/C++ enumerations.

Listing 10. Simple grammar for C/C++ enums

enum_specifier : ENUM '{' enumerator_list '}' | ENUM IDENTIFIER '{' enumerator_list '}' | ENUM IDENTIFIER ; enumerator_list : enumerator | enumerator_list ',' enumerator ; enumerator : IDENTIFIER ; ENUM: "enum"; IDENTIFIER: ['a'..'z']+;

Listing 11 shows the corresponding Spirit code. The output of the program is 1 , indicating successful parsing.

Listing 11. Spirit code for parsing C/C++ enumerations

#define BOOST_SPIRIT_NO_REGEX_LIB #include "regex.h" #include "spirit.hpp" #include "boost/spirit/actor/assign_actor.hpp" using namespace std; using namespace boost::spirit; struct my_enum : public grammar<my_enum> { template <typename ScannerT> struct definition { definition(my_enum const& self) { enum_specifier = enum_p >> '{' >> enum_list >> '}'; enum_p = str_p("enum"); enum_list = +id_p >> *(',' >> +id_p); id_p = range_p('a','z'); } rule<ScannerT> enum_specifier, enum_p, enum_list, id_p; rule<ScannerT> const& start() const { return enum_specifier; } }; }; string input = "enum { ah, bk }"; int main () { my_enum e; int status = parse(input.c_str(), e, space_p).hit; cout << status << endl; return 0; }

Conclusion

This article has provided a reasonably detailed first look at the Boost Spirit framework for parsing. For more information, see the Related topics section.

Downloadable resources

Related topics