Over the summer I did quite a bit of work on lxml.html. I’m pretty excited about it, because with just a little work HTML becomes something you can usefully manipulate. This isn’t how I’ve felt about HTML in the past, when all HTML emerged from templates and was consumed only by browsers.

The ElementTree representation (which lxml copies) is a bit of a nuisance when representing HTML. A few methods improve it, but it is still awkward for content that mixes tags and text (common in HTML, uncommon in most other XML). Looking at Genshi Transforms, there are some things I wish we could do, like simply “unwrap” text and then wrap it again. But once you remove a tag, its text is thoroughly merged into its neighbors. Another little nuisance is that el.text and el.tail can be None, which means you have to guard a lot of code.
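To make the text/tail issue concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree, the model that lxml copies. The `unwrap` function is a hypothetical helper written for this illustration, not a method of ElementTree or lxml, and it assumes the element being unwrapped has no child elements.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring('<body>Some <em>body</em> text.</body>')
em = body[0]

# Text before the first child lives on the parent's .text; text that
# follows a child lives on that child's .tail.
print(repr(body.text), repr(em.text), repr(em.tail))
# 'Some ' 'body' ' text.'

# With no text in those positions, the attributes are None, not ''.
empty = ET.fromstring('<p><b>x</b></p>')
print(empty.text, empty[0].tail)
# None None


def unwrap(parent, child):
    """Hypothetical helper: drop a leaf child's tag but keep its text."""
    # Guard against None on both attributes before concatenating.
    merged = (child.text or '') + (child.tail or '')
    index = list(parent).index(child)
    if index == 0:
        # No earlier sibling: the text joins the parent's leading text.
        parent.text = (parent.text or '') + merged
    else:
        # Otherwise it joins the tail of the sibling just before it.
        previous = parent[index - 1]
        previous.tail = (previous.tail or '') + merged
    parent.remove(child)


unwrap(body, em)
print(repr(body.text))
# 'Some body text.'
```

After `unwrap` runs, nothing records where the `<em>` used to start and end, so there is no way to wrap the same span again without searching the merged string.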

That said, here’s the Genshi example:

>>> html = HTML('''<html>
...   <head><title>Some Title</title></head>
...   <body>
...     Some <em>body</em> text.
...   </body>
... </html>''')
>>> print html | Transformer('body/em').map(unicode.upper, TEXT) \
...                                    .unwrap().wrap(tag.u).end() \
...                                    .select('body/u') \
...                                    .prepend('underlined ')

Here’s how you’d do it with lxml.html:

>>> html = fromstring('''... same thing ...''')
>>> def transform(doc):
...     for el in doc.xpath('body/em'):
...         el.text = (el.text or '').upper()
...         el.tag = 'u'
...     for el in doc.xpath('body/u'):
...         el.text = 'underlined ' + (el.text or '')

I’m not sure if Genshi works in-place here or makes a copy; apart from that, the two are pretty much equivalent. Which is better? Personally I prefer mine, and actually prefer it quite strongly, because it’s quite simple: it’s a function with loops and assignments. It’s practically pedestrian in comparison to the Genshi example, which uses methods to declaratively create a transformer.
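For anyone who wants to run the loop-and-assign version end to end, here is a sketch that should work on Python 3 without third-party packages. It swaps in the standard library's xml.etree.ElementTree for lxml.html and `findall` for `xpath`; both substitutions are assumptions made for portability, and the `SOURCE` name is introduced here. With lxml installed, `lxml.html.fromstring` and `doc.xpath` would take their place, as in the example above.

```python
import xml.etree.ElementTree as ET

# The same document used in the Genshi example.
SOURCE = """<html>
  <head><title>Some Title</title></head>
  <body>
    Some <em>body</em> text.
  </body>
</html>"""


def transform(doc):
    # Uppercase the text of each <em> under <body> and retag it as <u>.
    # findall returns a list, so retagging while looping is safe.
    for el in doc.findall('body/em'):
        el.text = (el.text or '').upper()
        el.tag = 'u'
    # Second pass, mirroring the Genshi pipeline: prepend to every <u>.
    for el in doc.findall('body/u'):
        el.text = 'underlined ' + (el.text or '')


doc = ET.fromstring(SOURCE)
transform(doc)  # modifies the tree in place and returns None
result = ET.tostring(doc, encoding='unicode')
print(result)
```

The printed body should contain `Some <u>underlined BODY</u> text.`, with the surrounding text left where it was because it lives on the parent's text and the element's tail rather than inside the element.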

Some of the things now in lxml.html include: