2011-06-08 An RFC822 Parser¶

Background¶

In email we can divide headers into two broad classes: unstructured headers (like ‘subject’) and structured headers (like ‘to’). Unstructured headers have a fairly straightforward format. Structured headers are defined in terms of a formal grammar. A minimal, less formal version of this grammar was laid out in the earliest email RFC, and it has undergone successive refinement in the subsequent RFCs. At this time the current standard is RFC 2822, with RFC 5322 on track to be the new standard. Since 5322 updates 2822 to incorporate understandings from post-2822 RFCs and is not controversial, we have been basing the email6 work on RFC 5322.

The formal grammar in RFC 5322 has two major components. One is the definition of how structured email headers should be constructed and formatted when sending email. The other, referred to as the “obsolete syntax”, defines the additional grammar that should be parsed when reading email messages. The obsolete grammar attempts to capture the variety of header constructions that were permitted by the older RFCs, so that a conformant parser will be able to read messages produced by email agents adhering to the older standards.

The preferred grammar covers the bulk of the header forms produced by the email packages in common use (if you define “common use” to exclude packages used by spammers). The combination of the two, the preferred grammar and the obsolete additions, does not cover 100% of the email found in the wild. In addition to the badly broken forms produced by many spam mailers, the obsolete grammar intentionally omits certain forms that were legal in the oldest RFCs in order to make the grammar more parseable. By the Postel principle we should do our best to accept these other variants as well.

Note: throughout this article and in the code, I refer to the parser as an “rfc822 parser”. This follows the most-used convention when referring to email parsing.
Even though there have been subsequent standards, RFC 822 remains the most widely implemented, and the general outline of the grammar was pretty well established by that RFC. Thus “rfc822 parser” has become the generic term for an email header and/or address parser.

Why Write a Special Purpose Parser?¶

All of the foregoing means that parsing a structured email header according to the formal grammar is not as simple as feeding that grammar into an LALR(1) parser-generator. With, say, a compiler, a deviation from the formal grammar produces an error message and, often, a parsing halt. When parsing an email header, on the other hand, we have to do our best to keep right on parsing to the end of the header, extracting as much information from the non-compliant input as we can. This means that a lot of the parsing code is taken up with handling exceptions.

I started out this work by looking around for other RFC822 parser implementations. While I’m sure my survey was not comprehensive, I did not find any that suited our needs. Most of the ones I found were either narrowly focused on parsing addresses (which, granted, is the most important application) or were written in languages where the algorithms were buried under the complexity of the language, to the extent that it seemed easier to roll my own than to translate them.

When I made that decision I had underestimated the complexity of the parser: it is currently 1416 lines (including whitespace and comments), by far the largest sub-module in the email package. Looking at that statistic now, I wonder whether I made the right decision. I have not, after all, had much experience writing parsers. But at each step of the process it seemed I had just a little more work to do...so I kept going.

The current code is a bit messy and in need of refactoring. It doesn’t handle all the corner cases yet, and it only handles address headers, not other types of structured headers. It is doubtless very inefficient. It is the piece of this project that I have my greatest doubts about...yet it seems necessary. Fortunately, although writing the unit tests was rather tedious at times, overall I did enjoy writing it.

Design of the Parser¶

The most logical form for a full rfc822 parser is a recursive descent parser. I chose to implement it as a set of stateless functions in a module. Each function has the form:

    token, value = get_XXX(value)

The returned token is an instance of a grammar-item-specific class (e.g. Atom, Phrase, Address). The returned value is any remaining text from the input value that wasn’t consumed by the parsed token. A get_XXX routine will raise a HeaderParseError if it encounters a condition that it cannot or should not recover from.

Tokens come in two fundamental types: TokenList or Terminal. The former is a subclass of list, the latter a subclass of str. A TokenList is a list of other tokens, while a Terminal is, as its name implies, a terminal value. The terminals produced by the parser are at a slightly higher level than the terminals of the formal grammar: instead of each one being an individual character, a Terminal is a string of one or more characters. Single-character strings are used for specials, while multi-character strings are used for atext, qcontent, etc.

There is also not a strict one-to-one correspondence between token classes and types and the elements of the formal grammar. The exceptions are, however, few, and are designed to make it easier to manipulate the resulting parse tree to extract the meaningful information. The choices made in this first draft probably should be revisited and in some cases changed...I made the choices in the piecemeal development process and developed the guiding principles of the parser as I went along.

Each token has a token_type attribute. When the token corresponds exactly to an element of the formal grammar, the name from the grammar as presented in RFC 5322 is used. When the match is not exact, a variant name is used. Terminal special tokens also have unique token_types, though these are of dubious utility to the email module itself. Each token has a value attribute and a string representation.
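As a sketch of these conventions, here is what the basic machinery might look like. The class names and the get_XXX(value) -> (token, value) signature come from the description above; the bodies (and the get_atext example) are simplified assumptions of mine, not the actual email6 code.

```python
# A minimal sketch of the parser conventions described above; simplified
# assumptions, not the real email.rfc822_parser module.

class HeaderParseError(Exception):
    """Raised when a get_XXX routine hits a condition it cannot or
    should not recover from."""

class TokenList(list):
    """A token that is a list of other tokens."""
    token_type = None

    def __str__(self):
        return ''.join(str(token) for token in self)

    @property
    def value(self):
        return ''.join(token.value for token in self)

class Terminal(str):
    """A terminal token: a string of one or more characters."""

    def __new__(cls, value, token_type):
        self = super().__new__(cls, value)
        self.token_type = token_type
        return self

    @property
    def value(self):
        return str(self)

# The characters legal in an atom (atext), per RFC 5322 section 3.2.3.
ATEXT = set("abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "0123456789!#$%&'*+-/=?^_`{|}~")

def get_atext(value):
    """Consume a run of atext characters.  Every get_XXX follows this
    shape: return the parsed token plus whatever input was not consumed,
    or raise HeaderParseError if the expected token is not present."""
    i = 0
    while i < len(value) and value[i] in ATEXT:
        i += 1
    if i == 0:
        raise HeaderParseError('expected atext but found %r' % value[:10])
    return Terminal(value[:i], 'atext'), value[i:]
```

For example, `get_atext('foo@example.com')` returns an atext Terminal `'foo'` and the unconsumed remainder `'@example.com'`.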
The string representation almost always regenerates the input that produced the token. The exceptions are where error recovery introduces the missing member of a paired special (e.g. supplying a missing ‘]’ on a DomainLiteral). It is quite possible that the rule of reproducing the input should be made strict.

The value attribute attempts to provide the “semantic value” of the input. According to the RFC, this means that any run of cfws is treated semantically as a single space. Thus the value attribute reports the non-cfws characters, with each run of cfws replaced by a single space character. When I started out this seemed like the most useful representation of the token for use in extracting the useful data from the parse tree, but as matters turned out additional processing often needs to be done to extract the correct value. While cfws is equal to a single space semantically, single spaces are not everywhere semantically meaningful. So this is another area that can be revisited to see if the code can be simplified and made more useful. It may be that the value attribute is not generally useful and should be dropped.

CFWS tokens also have a comments attribute that provides a list of comment texts. I have not yet integrated this into the rest of the parser to make these values more accessible and useful, but plan to do so.

There are two types of Terminal tokens: ValueTerminal and WhiteSpaceTerminal. The former have a value equal to their string representation, while the latter have a value equal to a single space.

Each TokenList may have additional properties that extract the meaningful information from it. For example, the AddrSpec token has local_part, domain, and addr_spec attributes. The first two should be obvious; the last is the most useful representation of the addr-spec data as a whole (in this case, the value but with leading and trailing whitespace removed).

Each parser get_ method parses the input string looking for its specific token.
This will generally involve calling other get_ methods to parse sub-elements of the grammar. Fortunately, in most cases the grammar is unambiguous once leading whitespace is skipped, so the descent is often deterministic. In key places, however, more than one type of token can be present, in which case the parse for each is carried out in turn, with the raising of a HeaderParseError indicating that that token type is not present. In some cases the order of testing is critical to correct disambiguation.

Where possible, a given level of the parser will attempt error recovery if a valid token is not found. This is only possible when the error encountered is such that we can be heuristically confident that the error is an error in the specification of the token, and not a different token type. This means that in general error recovery happens one level higher than the place the error is encountered. The most significant example of this in the current code is the get_address_list() function, which treats any parsing error as a signal to scan the remaining input for the next unquoted comma and resume parsing there, looking for additional addresses.

All tokens have a defects attribute and an all_defects attribute. Any time error recovery is done, or an obsolete syntax form is parsed, a Defect is added to the token’s defects list. all_defects provides a list of all the defects on the token itself, plus all the defects on any tokens it contains.
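The two control-flow patterns just described (trying each candidate token parser in turn, and resynchronizing on the next unquoted comma for error recovery) might look roughly like this. These are hypothetical, heavily simplified helpers of my own, not the actual get_ routines from the module:

```python
import re

# Hypothetical, heavily simplified helpers illustrating the control flow
# described above; the real parser's get_ routines are far more involved.

class HeaderParseError(Exception):
    pass

def get_quoted_string(value):
    """Simplified: consume a double-quoted string (no escapes, no cfws)."""
    if not value.startswith('"'):
        raise HeaderParseError('not a quoted-string')
    end = value.find('"', 1)
    if end < 0:
        raise HeaderParseError('unterminated quoted-string')
    return value[:end + 1], value[end + 1:]

def get_atom(value):
    """Simplified: consume a run of non-special characters."""
    m = re.match(r'[^\s",]+', value)
    if m is None:
        raise HeaderParseError('not an atom')
    return m.group(), value[m.end():]

def get_word(value):
    # More than one token type is possible here, so try each in turn;
    # a HeaderParseError means "that token type is not present".
    # The order of testing matters for correct disambiguation.
    for getter in (get_quoted_string, get_atom):
        try:
            return getter(value)
        except HeaderParseError:
            continue
    raise HeaderParseError('expected word')

def find_next_unquoted_comma(value):
    """Error recovery: skip ahead to where the next address should start."""
    in_quote = False
    for i, ch in enumerate(value):
        if ch == '"':
            in_quote = not in_quote
        elif ch == ',' and not in_quote:
            return i
    return len(value)
```

Here `get_word('"Fred A. Bar" <bar@example.com>')` consumes the quoted string rather than an atom, while `find_next_unquoted_comma('"a,b" <x@y>, z')` correctly skips the comma inside the quotes.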

Examples¶

Here is a simple example of parsing a phrase:

>>> from email.rfc822_parser import get_phrase
>>> token, value = get_phrase("This is a (commented) phrase, "
...                           "in a comma list")
>>> value
', in a comma list'
>>> str(token)
'This is a (commented) phrase'
>>> token.value
'This is a phrase'
>>> token.pprint()
Phrase/phrase(
    Atom/atom(
        ValueTerminal/atext('This')
        CFWSList/cfws(
            WhiteSpaceTerminal/fws(' ')
        )
    )
    Atom/atom(
        ValueTerminal/atext('is')
        CFWSList/cfws(
            WhiteSpaceTerminal/fws(' ')
        )
    )
    Atom/atom(
        ValueTerminal/atext('a')
        CFWSList/cfws(
            WhiteSpaceTerminal/fws(' ')
            Comment/comment(
                WhiteSpaceTerminal/ptext('commented')
            )
            WhiteSpaceTerminal/fws(' ')
        )
    )
    Atom/atom(
        ValueTerminal/atext('phrase')
    )
)

There are several things to note about the parse tree. First, I’ve elided the word token from the formal grammar: a phrase that consists of both atoms and quoted-strings will contain instances of those tokens, not word tokens. Perhaps it would be better to stick strictly to the grammar.

Second, it is arbitrary to which atoms the whitespace gets attached, so the parser follows the rule of greedily consuming the whitespace after a word, since that is the most useful for simplifying further parsing.

Third, the pretty-printed version of the parse tree makes it clear that the tokens produced by the parser are specified by the combination of their class and the value of the token_type attribute. This is another area where refactoring and cleanup may be warranted. It may also be advisable to differentiate the different types of ptext in the token_type; currently ptext is used for terminals that come from several sources (quoted-string content, comment content). The reason they are all currently the same token_type is that they are all handled in the same way: a ptext token consists of arbitrary printables and, if there are non-printables included, the token has a NonPrintablesDefect in its defects attribute.
Here is a simple example of the most complicated parsing currently coded, the address-list:

>>> from email.rfc822_parser import get_address_list
>>> token, value = get_address_list(
...     'foo@example.com, "Fred A. Bar" <bar@example.com>')
>>> value
''
>>> str(token)
'foo@example.com, "Fred A. Bar" <bar@example.com>'
>>> token.value
'foo@example.com, "Fred A. Bar" <bar@example.com>'
>>> str(token.addresses[0])
'foo@example.com'
>>> str(token.addresses[1])
' "Fred A. Bar" <bar@example.com>'
>>> token.addresses[0].display_name is None
True
>>> token.addresses[0].mailboxes[0].local_part
'foo'
>>> token.addresses[0].mailboxes[0].domain
'example.com'
>>> token.addresses[1].mailboxes[0].display_name
'Fred A. Bar'
>>> token.addresses[1].mailboxes[0].local_part
'bar'
>>> token.addresses[1].mailboxes[0].domain
'example.com'
>>> token.pprint()
AddressList/address-list(
    Address/address(
        Mailbox/mailbox(
            AddrSpec/addr-spec(
                LocalPart/local-part(
                    DotAtom/dot-atom(
                        DotAtomText/dot-atom-text(
                            ValueTerminal/atext('foo')
                        )
                    )
                )
                ValueTerminal/address-at-symbol('@')
                Domain/domain(
                    DotAtom/dot-atom(
                        DotAtomText/dot-atom-text(
                            ValueTerminal/atext('example')
                            ValueTerminal/dot('.')
                            ValueTerminal/atext('com')
                        )
                    )
                )
            )
        )
    )
    ValueTerminal/list-separator(',')
    Address/address(
        Mailbox/mailbox(
            NameAddr/name-addr(
                DisplayName/display-name(
                    QuotedString/quoted-string(
                        CFWSList/cfws(
                            WhiteSpaceTerminal/fws(' ')
                        )
                        BareQuotedString/bare-quoted-string(
                            ValueTerminal/ptext('Fred')
                            WhiteSpaceTerminal/fws(' ')
                            ValueTerminal/ptext('A.')
                            WhiteSpaceTerminal/fws(' ')
                            ValueTerminal/ptext('Bar')
                        )
                        CFWSList/cfws(
                            WhiteSpaceTerminal/fws(' ')
                        )
                    )
                )
                AngleAddr/angle-addr(
                    ValueTerminal/angle-addr-start('<')
                    AddrSpec/addr-spec(
                        LocalPart/local-part(
                            DotAtom/dot-atom(
                                DotAtomText/dot-atom-text(
                                    ValueTerminal/atext('bar')
                                )
                            )
                        )
                        ValueTerminal/address-at-symbol('@')
                        Domain/domain(
                            DotAtom/dot-atom(
                                DotAtomText/dot-atom-text(
                                    ValueTerminal/atext('example')
                                    ValueTerminal/dot('.')
                                    ValueTerminal/atext('com')
                                )
                            )
                        )
                    )
                    ValueTerminal/angle-addr-end('>')
                )
            )
        )
    )
)

Integration with the Header Parser¶

Hooking this up to the header parser is relatively straightforward. The TokenList classes are mutable, so they aren’t suitable for use directly as the results of a header parse (even assuming we wanted to use them that way). So the header module provides classes to represent mailboxes and groups of mailboxes to hold the data returned by the parser.

At this level, part of the design of email6 is to remove the need for the library user to understand the details of the email RFCs in order to use the package. This is especially important because the RFC uses address to refer to something which can be either a single mailbox or a group, where groups are themselves lists of mailboxes. An address-list, which is the thing the library user wants to interact with, is thus a sequence of one or more mailboxes or groups. So at this level we stop using the RFC names for things, and use the names in more common use.

The API at this level provides two objects: Mailbox and Group. These line up more or less with both the RFC and intuitive expectation, in that a Mailbox is a single complete address, while a Group is a list of zero or more Mailboxes. Mailbox provides access to the components of a mailbox using more common names than those used by the RFC:

    name     : display-name
    username : local-part
    domain   : domain
    address  : addr-spec

Mailbox is a subclass of str, and its string value is the full, RFC formatted mailbox. Group provides only two attributes, name and mailboxes. For a true group, name will be the group display-name, while mailboxes is always the list of Mailboxes that make up the group. It may be the empty list.

Each header field that contains addresses has the same base API, regardless of whether it is supposed to contain only a single address, a mailbox-list, or a full address-list. There are two attributes, groups and mailboxes. mailboxes returns a composite list (in order) of all the mailbox objects mentioned by the header value.
groups also returns all of the mailboxes in the value, but as a list of Group objects. Individual mailboxes are turned into single-element Groups whose name is None. True groups are regular Group objects, with a non-None name. The idea here is that either attribute can be used to process all of the mailboxes, depending on whether or not one cares that there may be actual groups in the list, and the same logic can be used regardless of the number of addresses that are supposed to be in the header. This is because while the RFC says that Sender, for example, is limited to being a single mailbox, you know that some email out there in the wild is going to have more than one, and this allows that case to be handled in a sensible way.

So, the code currently in the feature branch allows one to do the following:

>>> import email
>>> msg = email.message_from_string("""\
... Date: Tue, 07 Jun 2011 16:27:46 -0400
... From: "Harry A. Card" <card@example.com>
... To: friends: foo@example.com, bar@example.com;,
...   "Barb" <ping@example.com>
... Subject: A test
...
... Howdy there.
... """)
>>> msg['from'].mailboxes[0].name
'Harry A. Card'
>>> len(msg['to'].groups)
2
>>> msg['to'].groups[0].name
'friends'
>>> msg['to'].groups[0].mailboxes[1].username
'bar'
>>> msg['to'].groups[1].name == None
True
>>> len(msg['to'].mailboxes)
3
>>> msg['to'].mailboxes[2].name
'Barb'
>>> to = msg['to']
>>> del msg['to']
>>> msg['to'] = to + ', dinsdale@python.org'
>>> msg['to'].mailboxes[3].username
'dinsdale'
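To make the shape of these wrapper objects concrete, here is a minimal sketch of what Mailbox and Group might look like. The attribute names follow the description above; the construction logic is a hypothetical simplification of mine, not the actual email6 classes (which, among other things, handle quoting and folding properly):

```python
# Hypothetical sketch of the Mailbox/Group wrappers described in this
# section; attribute names follow the article, the construction logic
# is a simplification.

class Mailbox(str):
    """A single complete address; the str value is the RFC formatted mailbox."""

    def __new__(cls, addr_spec, display_name=None):
        text = ('"%s" <%s>' % (display_name, addr_spec)
                if display_name else addr_spec)
        self = super().__new__(cls, text)
        self.name = display_name                                  # display-name
        self.username, _, self.domain = addr_spec.partition('@')  # local-part, domain
        self.address = addr_spec                                  # addr-spec
        return self

class Group:
    """A list of zero or more Mailboxes; name is None for a bare mailbox."""

    def __init__(self, name, mailboxes):
        self.name = name
        self.mailboxes = list(mailboxes)
```

With this sketch, a bare address becomes `Group(None, [Mailbox('foo@example.com')])`, while a true group carries its display-name, e.g. `Group('friends', [...])`, matching the uniform groups/mailboxes processing described above.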