Recently on Stack Overflow, I answered a question that I thought worthy of a highlight here on the blog. In this forum we all know that one should never parse HTML with a regex, but even if we agree on that, there are still many options to choose from. The question as posed was: given some HTML, remove all <style> tags and their contents. The question was later amended: the asker also needed to remove <style> tags with attributes (the nail in the regex coffin) and <link> tags pointing to stylesheets.

While you could use an XML parser or an HTML tokenizer, personally I like using the Mojo::DOM parser. It gives you a Document Object Model interface to your HTML and supports CSS3 selectors, which makes it really flexible when you need it. The original problem is solved as easily as:

#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

my $content = <<'END';
<html>
<head>
<title> Example </title>
</head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END

my $dom = Mojo::DOM->new( $content );
$dom->find('style')->pluck('remove');
print $dom;

The pluck method is a little confusing, but it's really just shorthand for calling a method on each resulting object. The equivalent line would be

$dom->find('style')->each(sub{ $_->remove });

which is a little more understandable but less cute. To really understand what is going on, note that the call to find returns an instance of Mojo::Collection, a container object that has the array-filtering methods you would expect, but whose elements also stay connected to the original DOM object. Thus when we remove the resulting tags via the collection, they are gone from the DOM object!
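To make that connection concrete, here is a minimal sketch (the HTML snippet and variable names are made up for illustration) that counts the style elements before and after removing them through the collection. It uses each rather than pluck, since each is the form shown above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup, not taken from the original question
my $dom = Mojo::DOM->new('<div><style>p { color: red }</style><p>hi</p></div>');

# find returns a collection whose elements are Mojo::DOM objects
my $styles = $dom->find('style');
print ref($styles), "\n";

my $before = $dom->find('style')->size;

# Remove each element through the collection ...
$styles->each(sub { $_->remove });

# ... and a fresh search on the original object no longer finds them
my $after = $dom->find('style')->size;

print "style elements before: $before, after: $after\n";
print "$dom\n";
```

The second find is run against $dom itself, not against the collection, which is what shows that the removal really happened in the underlying document.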

Now let's say that the $content variable also contained these lines

<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
<link rel="icon" href="somefile.jpg">

where you want to remove the first one, and not the second. You can do this in one of two ways.

$dom->find('link')->each( sub{ $_->remove if ($_->{rel} // '') eq 'stylesheet' } );

This mechanism uses the object's methods (Mojo::DOM exposes attributes as hash keys) to remove only the link tags which have rel=stylesheet. However, since Mojo::DOM has full CSS3 selector support, you can instead ask for only those elements in the first place:

$dom->find('link[rel=stylesheet]')->pluck('remove');
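As a rough check that the two approaches agree, the sketch below runs both on the same made-up fragment and compares the serialized results. The third link tag (with no rel attribute) and the variable names are my own additions for illustration; the `// ''` guard is there only to avoid an "uninitialized value" warning under use warnings when rel is missing.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical fragment: one stylesheet link, one icon link, one link without rel
my $html = <<'END';
<head>
<link rel="stylesheet" type="text/css" href="gridsorting.css">
<link rel="icon" href="somefile.jpg">
<link href="no-rel.css">
</head>
END

# Approach 1: find every link, then filter in Perl via the attribute hash
my $by_attr = Mojo::DOM->new($html);
$by_attr->find('link')->each(sub {
    $_->remove if ($_->{rel} // '') eq 'stylesheet';
});

# Approach 2: let the CSS attribute selector do the filtering
my $by_selector = Mojo::DOM->new($html);
$by_selector->find('link[rel=stylesheet]')->each(sub { $_->remove });

print "both approaches agree\n" if "$by_attr" eq "$by_selector";
print $by_selector;
```

The selector version is shorter and keeps the filtering logic in one place; the Perl-side filter is still handy when the condition is something a selector cannot express.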

CSS3 selector statements can be joined with a comma to find all tags matching either selector, so we can simply include the line

$dom->find('style, link[rel=stylesheet]')->pluck('remove');

and get rid of all the offending stylesheets in one fell swoop!
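Putting it all together, a complete script might look like the sketch below. The sample HTML is a trimmed, made-up variant of the earlier example. It uses each instead of pluck on purpose: as far as I know, later Mojolicious releases dropped pluck in favor of map, so treat that as an assumption and check the Mojo::Collection documentation for your installed version; each should work either way.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample document for illustration
my $content = <<'END';
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="gridsorting.css">
<link rel="icon" href="somefile.jpg">
<style type="text/css">p { color: red }</style>
</head>
<body>
<p>hi I'm a paragraph.</p>
</body>
</html>
END

my $dom = Mojo::DOM->new($content);

# One comma-joined selector matches style tags (with or without attributes)
# and link tags whose rel attribute is exactly "stylesheet"
$dom->find('style, link[rel=stylesheet]')->each(sub { $_->remove });

print $dom;
```

The icon link is left alone because its rel value does not match the attribute selector.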