(See also A Modern Perl Fakebook.)

I needed to extract all hyperlinks from an HTML document today, and I needed to remove all markup except for the simplest formatting: paragraphs, emphasis, and bold. Any experienced Perl 5 programmer knows that multiple CPAN distributions exist for doing just this. You can choose whether you want an XS wrapper around an existing C library, or a smattering of regular expressions, or a DIY HTML parser, or a simple wrapper around an HTML parser.

Knowing that much still doesn't tell you how to do it.

I did my research and decided on HTML::Scrubber to remove markup. Its documentation suggests a strategy more complex than my current needs require, so I eventually produced:

    use HTML::Scrubber;

    # Allow only the simplest formatting tags; all other markup is removed.
    my $scrubber = HTML::Scrubber->new(
        allow => [qw( p br i u strong em hr )],
    );
    my $scrubbed = $scrubber->scrub( $content );

For extracting hyperlinks, I used HTML::LinkExtor and wrote:

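    use HTML::LinkExtor;

    sub get_links {
        my ($self, $content) = @_;

        my $p = HTML::LinkExtor->new();
        $p->parse($content);
        $p->eof();    # signal end of input so the parser flushes any buffered markup

        my @links;

        # Each entry is an array reference: [ $tag, attribute => value, ... ]
        for my $link ($p->links()) {
            my ($tag, %a) = @$link;

            # Keep only <a> tags with an absolute http: href
            # (as written, https: links are skipped too).
            next unless $tag eq 'a' && $a{href} && $a{href} =~ /^http:/;
            push @links, $a{href};
        }

        return \@links;
    }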

I don't mind doing the research and customizing snippets like these for my specific needs, but I can imagine countless other people needing examples like this. If I weren't already convinced that the world needs a new resource for copy-and-paste examples in Modern Perl, I would be now.

(If you're still not convinced, consider how much more easily a novice could find these examples than write a correct and comprehensive regular expression for either case.)