Chip Camden explains what the Backus-Naur Form syntax is and how it works. Take a look at his examples of this abstract programming concept.

Backus-Naur Form (BNF) is a syntax for describing syntax. It's used to write a formal representation of a context-free grammar. If that doesn't sound abstract enough for you, a grammar doesn't have to be a programming language or even a human language -- it can be any syntax that you want to describe.

I first encountered BNF when working on the 1992 ANSI standard for DIBOL, which contained a complete BNF syntactic specification for that language (it was 11 pages long). The exercise of defining the language in BNF helped the committee rigorously work out the details of its syntax. One of the language implementers also fed this BNF into Yet Another Compiler Compiler (yacc) to help build their compiler for their next version, so they could guarantee that it complied syntactically with the specification. That's where BNF really shines: Once you have a solid syntactic definition in BNF, you can use tools to rigorously parse the syntax it describes, without having to invent code that implements the specification, or rely on your mastery of Regular Expressions.

BNF uses a declarative syntax that allows you to define terms via "production rules." Each rule contains terms that each have more concrete rules until you get down to "terminals," which are terms that we can only describe as specific characters (literal values). Thus, for instance, if we wanted to describe the syntax for calling a shell command named "fred" that takes one required non-negative numeric value as its argument, we could express it like this:

<call-fred> ::= fred <whitespace> <number> <eol> <whitespace> ::= <space-char> | <space-char> <whitespace> <space-char> ::= \s | \t <number> ::= <digit> | <digit> <number> <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 <eol> ::=

| <whitespace>



Each of the identifiers in angle-brackets is the name of a term. A line beginning with the term name, followed by "::=" and the components that make up the term forms the term's production rule. The terminals in the above are the literal values that aren't enclosed in angle-brackets. The "|" operator provides a logical "or." As in functional programming languages, we use recursion to describe variable-length lists like <number> and <whitespace>.

Many variants to the above syntax have emerged. Extended Backus-Naur Form (EBNF) provides a number of enhancements and simplifications. By requiring quotation marks around literal values, we can dispense with a lot of other punctuation. We can therefore express our example above in a form that seems more simple and familiar to programmers:

call-fred = "fred", whitespace, number, eol whitespace = space-char, { space-char } space-char = " " | "\t" number = digit, { digit } digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" eol = [ whitespace ], "

"

Note the use of braces to indicate "zero or more," and square brackets for optional items. Some variants also use the Regex-like asterisk "*" trailing a term to indicate zero or more, as well as "+" for one or more.

For a more useful example, let's describe the syntax of comma-separated values (CSVs).

csv-file = { row } row = field-list, eol field-list = field, [ ",", field-list ] field = [ whitespace ], field-value, [ whitespace ] field-value = quoted-string | bare-string quoted-string = '"', quoted-content, '"' quoted-content = { quoted-char } quoted-char = (any char except '"' or eol) bare-string = { bare-char } bare-char = (any char except ',' or eol) whitespace = space-char, { space-char } space-char = " " | "\t" eol = "

"

In the first rule, we define a CSV file as a series of zero or more rows. A row is a list of fields, followed by an end-of-line character. The list of fields has at least one field (even though that field can have zero length). A comma and another list of fields can optionally follow (note the recursion). An individual field may have whitespace on either end, and the field value may be either a quoted string or a bare string. A quoted string may not contain a quote, and a bare string may not contain a comma.

In a future article, we'll see how we can transform this definition into code that parses a CSV file.