The Pegex grammar has its own syntax as described in Pegex::Syntax.

A grammar is a collection of rules and looks like this:

    %grammar etchosts
    %version 0.01

    hosts: host | blanks | comments
    comments: /- HASH ANY* EOL/
    blanks: /- EOL/
    host: ip - aliases /- EOL?/
    ip: ipv4 | ipv6
    aliases: alias+
    alias: - /(ALNUM (: WORD | DOT | DASH )*)/ -
    ipv4: /((: DIGIT{1,3} DOT ){3} DIGIT{1,3} )/
    ipv6: /((: HEX* COLON{1,2} HEX* )+ )/

The lines beginning with % are meta rules. They carry information about the grammar, such as its name and version, which allows the developer to manage multiple versioned grammars in a program.

The remaining lines are rules. Each begins with a rule name and a : followed by the definition of the rule, as described in the Pegex::Syntax document.

The first rule, hosts, is the top-level rule of the grammar. It has three alternatives: host, blanks and comments, which represent host definitions, blank lines and comments beginning with #, respectively. We need to handle blank lines and comments because many /etc/hosts files contain them, either by default or because a user added them.
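To make the three alternatives concrete, here is a minimal sketch in Python that sorts a single line into one of them. It is only an analogy using hand-written regular expressions, not what Pegex generates: the names COMMENT, BLANK and classify are hypothetical, and the sketch decides by elimination instead of trying host first as the grammar does.

```python
import re

# Illustrative stand-ins for the `comments` and `blanks` rules.
# These are assumptions for this sketch, not the regexes Pegex produces.
COMMENT = re.compile(r"[ \t]*#")    # optional whitespace, then '#'
BLANK = re.compile(r"[ \t]*\r?")    # nothing but whitespace on the line


def classify(line: str) -> str:
    """Return the name of the `hosts` alternative a line would fall under."""
    text = line.rstrip("\n")
    if COMMENT.match(text):
        return "comments"
    if BLANK.fullmatch(text):
        return "blanks"
    # Anything else is treated as a host definition in this sketch.
    return "host"
```

For example, classify("# local names\n") falls under comments, while classify("127.0.0.1 localhost\n") falls under host.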

The - is shorthand for whitespace, and EOL is shorthand for the end-of-line characters \r\n or \n. HASH is a named rule for the # symbol and COLON is a named rule for the : symbol. DIGIT represents the regular expression [0-9], HEX represents [0-9A-Fa-f], which matches hexadecimal digits, and ANY represents any character except a newline. WORD represents the regular expression \w, while DOT and DASH represent the . and - characters, respectively.
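The atoms described above can be pictured as a lookup table from names to regular-expression fragments. The Python sketch below is an illustration under stated assumptions, not Pegex's implementation: the ATOMS table and the expand helper are hypothetical, the ALNUM entry is inferred from its name, the (: group is assumed to stand for a non-capturing (?: group, and the - whitespace shorthand is not handled.

```python
import re

# Regex fragments for the atoms discussed in the text.
# ALNUM is an assumption based on its name; the others follow the description.
ATOMS = {
    "HASH": r"\#",
    "COLON": r":",
    "DOT": r"\.",
    "DASH": r"\-",
    "DIGIT": r"[0-9]",
    "HEX": r"[0-9A-Fa-f]",
    "WORD": r"\w",
    "ALNUM": r"[A-Za-z0-9]",
    "ANY": r".",
    "EOL": r"\r?\n",
}


def expand(rule_body: str) -> str:
    """Turn the inside of a /.../ rule into an ordinary regular expression."""
    # Assumption: "(:" opens a non-capturing group.
    body = rule_body.replace("(:", "(?:")
    # Swap every upper-case atom name for its regex fragment.
    body = re.sub(r"[A-Z]+", lambda m: ATOMS[m.group(0)], body)
    # The spaces between atoms only aid readability, so drop them.
    return body.replace(" ", "")
```

Calling expand("HEX* COLON{1,2} HEX*") yields a pattern that matches one segment of an IPv6 address, such as fe80::1.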

A rule body enclosed in / / defines a regular expression that Pegex generates from the atoms inside it. Such rules are useful for building the low-level rules of the grammar.

Detailed descriptions of all the atoms are available in Pegex::Grammar::Atoms.

High-level rules combine other rules, either as alternatives separated by the | (OR) operator or as a sequence, which is the default AND operation when rules are listed one after another.

Let's try to understand the ipv4 rule. As with a standard regular expression capture, we want to capture the IPv4 address on each line of the input, so we enclose the items to be captured in parentheses. An IPv4 address has the format xxx.xxx.xxx.xxx, where each xxx is a number between 0 and 255. Each number therefore has one to three digits, which we match with DIGIT{1,3}. The first three numbers are each followed by a . character, so we group DIGIT{1,3} DOT and repeat the group with {3}, then finish with one more DIGIT{1,3}. Note that the pattern checks only the number of digits, so it would also accept values above 255, such as 999.
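The same pattern can be written by hand as an ordinary regular expression. The Python sketch below is an illustration, not the code Pegex generates: the IPV4 constant and the capture_ipv4 helper are hypothetical names, and the (: group in the rule is assumed to correspond to a non-capturing (?: group.

```python
import re

# Hand-written counterpart of: /((: DIGIT{1,3} DOT ){3} DIGIT{1,3} )/
# The outer parentheses capture; the inner (?: ... ) group only repeats.
IPV4 = re.compile(r"((?:[0-9]{1,3}\.){3}[0-9]{1,3})")


def capture_ipv4(line: str):
    """Return the first IPv4-looking token on the line, or None."""
    match = IPV4.search(line)
    return match.group(1) if match else None
```

Because the pattern counts digits and does not check the numeric range, capture_ipv4("999.999.999.999 bogus") still returns a value. A stricter parser would need a separate range check on each number.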