My email6 work suffered a long hiatus while I worked on another project for QNX and RIM (the PIM component of the BlackBerry version 2.0 release, if you are curious). The release of that product happened just before PyCon, and that turned out to be fortuitous timing. My schedule now has a bit more space in it, enough that I should be able to continue the development work on email6 that I restarted at PyCon. More on the development work in a bit.

With the deadline for PyCon talk submissions approaching last fall, I decided that I should take a shot at doing a talk on email6. My proposal, “Email: Past, Present, and Future”, was accepted. This is the first time I’ve given a talk at a large convention, so I wasn’t at all sure how well I’d do. I ran a bit over time (shortening the Q&A period), but managed to cover everything I had prepared. In retrospect I should have shortened the intro, but you only know that kind of thing after the fact. (Next time I do something like this I’ll try doing the presentation before the local user group first.)

PyCon videotapes all presentations, so you can watch the talk if you are interested. Basically I cover how the email package used to work, how it works in 3.2 (marginally better) and how it will work with email6 (lots better), using various examples run in the interactive interpreter (and captured to the slides; I wasn’t crazy enough to do it live).

Sprint results

As always, for me the best part of PyCon is the sprints. Some people would think that was crazy, but I suspect none of them were in the sprint rooms with us. I did a fair bit of helping newcomers to Python core development (the sub-sprint I was part of), but in between I also got a fair amount of work done on email6.

I say a “fair amount”, but you wouldn’t know that from looking at the external results. What I did was to finish the folding algorithm that I was working on when I suspended work last year. Once I’d wrapped my head back around the codebase and where I’d left it, I spent pretty much two full days of my three-day sprint time working on that folding algorithm. As I say in the comments in the code somewhere, the RFC5322 folding algorithm is superficially simple, but dealing with the edge cases, and especially dealing with RFC2047 encoded words, is distinctly non-trivial.

Last year one of the things I did was to rewrite the old email4/5 folding algorithm. That one was also very complex, but I did manage to simplify it after staring at the code long enough. I’m hoping that if I come back to this code a few months from now, I’ll be able to find a similar simplification. I’m pretty sure it is possible, because there is a bunch of code that is almost identical (but not quite) scattered between four methods. There’s got to be some way to simplify that.

If anyone wants to take a look before I do, the code is in the new _header_value_parser module, in the _fold methods. I’ve checked all the code in to the email6 feature branch.

The new folding algorithm is more complex than the email4/5 one because it handles more edge cases, and does its best to be “smart” about using encoded words. The old algorithm, once any encoded words were involved, would encode everything. So if you put in:

Subject: This is á non-sense sentence.

What you’d get out would be:

Subject: =?utf-8?q?This_is_=C3=A1_non-sense_sentence=2E?=

With the new folding algorithm, what you get instead is:

Subject: This is =?utf-8?q?=C3=A1?= non-sense sentence.

In other words, it encodes the minimum it can. The tricky part of that is when there is more than one word that requires encoding. In that case encoding each one individually would expand the length of the line considerably due to the RFC2047 “chrome” around the CTE-encoded text. So in that case the algorithm looks back to the previously encoded word, and if encoding everything in between the two fits on the current line, it does it that way. Otherwise it starts a new line...and that’s where the tricky parts arise. Take a look at the code if you want to know more (and please tell me about any ways you see to simplify it...that pass the tests).
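To make the trade-off concrete, here is a minimal sketch of the two strategies. It is not the code from _header_value_parser: the helpers `q_encode`, `encode_each`, and `encode_span` are hypothetical names invented for this illustration, the Q encoding is a simplified hand-rolled version, and line-length limits are ignored entirely.

```python
# Simplified illustration only; the real folding code also tracks line length.
# Bytes that this sketch leaves unencoded inside an encoded word.
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-!*+/"
)

# The RFC2047 "chrome": characters added around every encoded word.
CHROME = len("=?utf-8?q?") + len("?=")


def q_encode(text):
    """Wrap text in a single utf-8 'q' encoded word (simplified)."""
    out = []
    for byte in text.encode("utf-8"):
        if byte in SAFE:
            out.append(chr(byte))
        elif byte == 0x20:
            out.append("_")  # a space is written as an underscore
        else:
            out.append("=%02X" % byte)
    return "=?utf-8?q?" + "".join(out) + "?="


def encode_each(text):
    """Encode only the words that contain non-ASCII characters."""
    words = text.split(" ")
    return " ".join(w if w.isascii() else q_encode(w) for w in words)


def encode_span(text):
    """Encode everything from the first to the last non-ASCII word as one unit."""
    words = text.split(" ")
    hits = [i for i, w in enumerate(words) if not w.isascii()]
    if not hits:
        return text
    first, last = hits[0], hits[-1]
    merged = q_encode(" ".join(words[first:last + 1]))
    return " ".join(words[:first] + [merged] + words[last + 1:])


if __name__ == "__main__":
    sample = "a é b ü c"
    print(encode_each(sample))  # two encoded words, chrome paid twice
    print(encode_span(sample))  # one encoded word covering "é b ü"
```

With a single non-ASCII word the two strategies give the same result. With two or more, the merged form pays for the chrome only once, which is the saving that the look-back step is after.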

There are a couple of other edge cases that the new algorithm handles that the old one didn’t. One is header spaces. The old algorithm would leave a space after the ‘:’ after the header name if it decided to wrap the whole line onto the next line. For example:

From: someimpossiblylongemailaddressthathasnoplaceinittobreaktheline@example.com

would end up as:

From:
 someimpossiblylongemailaddressthathasnoplaceinittobreaktheline@example.com

but with a space after the ‘:’. That means that when unfolded correctly according to RFC5322 rules, you’d get:

From:  someimpossiblylongemailaddressthathasnoplaceinittobreaktheline@example.com

That is, an extra space is introduced. The new folding algorithm gets this right, and does not introduce the extra space.
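The effect is easy to reproduce. RFC5322 unfolding just removes any line break that is immediately followed by whitespace, so whatever sits on either side of the break survives. The sketch below assumes a hypothetical helper named `unfold` and uses a short stand-in address; it is an illustration, not code from the email package.

```python
import re


def unfold(header):
    """Remove each line break that is immediately followed by a space or tab."""
    return re.sub(r"\r?\n(?=[ \t])", "", header)


if __name__ == "__main__":
    address = "someone@example.com"

    # Old behaviour: a space is left after the colon before the line break.
    old_style = "From: \n " + address
    # New behaviour: the line break comes directly after the colon.
    new_style = "From:\n " + address

    print(repr(unfold(old_style)))  # two spaces after the colon
    print(repr(unfold(new_style)))  # a single space after the colon
```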

Finally, there is the case where the token is too large to fit even on the next line (that is, the token itself is longer than the 78-character maximum...or whatever you have the maxlinelen set to). The new algorithm will keep such a token on the same line as the label, since the line is going to be too long anyway. The old one would place the overlong token on the next line, leaving the field label by itself on a line (with an extra space).
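The decision for a header whose value is one unbreakable token can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual _fold code: `fold_single_token` is a hypothetical name, and the sketch only handles a label followed by a single token.

```python
def fold_single_token(label, token, maxlen=78):
    """Decide where to put one unbreakable token after a header label.

    label is the field name including its colon, for example "From:".
    """
    one_line = label + " " + token
    if len(one_line) <= maxlen:
        # Everything fits on the first line.
        return one_line
    if 1 + len(token) <= maxlen:
        # The token fits on a continuation line (one leading space),
        # so break directly after the colon with no trailing space.
        return label + "\n " + token
    # The token is too long for any line, so wrapping gains nothing:
    # keep it next to the label.
    return one_line


if __name__ == "__main__":
    address = "someimpossiblylongemailaddressthathasnoplaceinittobreaktheline@example.com"
    print(fold_single_token("From:", address))
    print(fold_single_token("From:", "x" * 100))
```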

As I said, all of this is checked in to the feature repository, so you can try it out if you like. In this incarnation of the code, all of the header line wrapping that gets done (that is, any headers you’ve created or, if you’ve started with a parsed message, that you haven’t modified or that can’t be emitted in original form for one reason or another) will be folded by the new algorithm. This is in contrast to the version currently up on PyPI, where only headers that did not need RFC2047 encoded words were folded by the new algorithm.