There are plenty of choices when you need to fetch a page or two from the Internet. We are going to see several simple examples using wget, curl, LWP::Simple, and HTTP::Tiny.

wget

While wget and curl are not Perl solutions, they can provide a quick solution for you. I think there are virtually no Linux distributions that don't come with either wget or curl. They are both command-line tools that can download files via various protocols, including HTTP and HTTPS.

You can use the system function of Perl to execute an external program, so you can write the following:

my $url = 'https://perlmaven.com/';
system "wget $url";

This will download the main page from the perlmaven.com domain and save it on the disk. You can then read that file into a variable of your Perl program.
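Reading the saved file back could look like the following minimal sketch. It assumes that wget stored the page as index.html in the current directory, which is its usual choice for a URL ending in a slash; slurp_file is a helper name made up for this illustration.

```perl
use strict;
use warnings;

# Read a whole file into a scalar and return it.
# (slurp_file is a hypothetical helper, not part of any module.)
sub slurp_file {
    my ($filename) = @_;

    open my $fh, '<', $filename
        or die "Could not open '$filename': $!";
    local $/;    # no input record separator: one read returns the whole file
    my $content = <$fh>;
    close $fh;

    return $content;
}

# Assumption: wget saved the main page under this name.
my $filename = 'index.html';
if (-e $filename) {
    my $html = slurp_file($filename);
    print "Read ", length($html), " bytes from $filename\n";
}
```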

However, there is another, more straightforward way to get the remote file into a variable. You can use the qx operator (which you might have seen written as backticks ``) instead of the system function, and you can ask wget to print the downloaded file to the standard output instead of saving it to a file. As qx will capture and return the standard output of the external command, this provides a convenient way to download a page directly into a variable:

my $url = 'https://perlmaven.com/';
my $html = qx{wget --quiet --output-document=- $url};

--output-document can tell wget where to save the downloaded file. As a special case, if you pass a dash - to it, wget will print the downloaded file to the standard output.

--quiet tells wget to avoid any output other than the actual content.

curl

For curl the default behavior is to print to the standard output, and the --silent flag can tell it to avoid any other output.

This is the solution with curl:

my $url = 'https://perlmaven.com/';
my $html = qx{curl --silent $url};

The drawback in both cases is that you rely on external tools, and you probably have less control over those than over Perl-based solutions.
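One way to regain a little control over either tool is to run it without a shell and to check its exit status. The sketch below uses the list form of a piped open for that. The capture_stdout helper is a name invented for this example, and the URL is taken from the command line.

```perl
use strict;
use warnings;

# Run an external command without going through the shell and
# return everything it printed to its standard output.
# (capture_stdout is a hypothetical helper written for this example.)
sub capture_stdout {
    my @cmd = @_;

    # With the list form each argument is handed to the program as-is,
    # so special characters in a URL are never interpreted by a shell.
    open my $fh, '-|', @cmd
        or die "Could not run '$cmd[0]': $!";
    my $output = do { local $/; <$fh> };

    # close on a piped handle waits for the command to finish and
    # returns false if it exited with a non-zero status (kept in $?).
    close $fh
        or die "'$cmd[0]' failed with exit code " . ($? >> 8) . "\n";

    return $output;
}

# Only runs when a URL is given on the command line.
if (@ARGV) {
    my $url  = $ARGV[0];
    my $html = capture_stdout('wget', '--quiet', '--output-document=-', $url);
    # The curl equivalent would be:
    # my $html = capture_stdout('curl', '--silent', $url);
    print length($html), " bytes downloaded\n";
}
```

Compared with qx, this version dies with a message when the external program cannot be started or reports a failure, instead of silently returning an empty string.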

Get one page using LWP::Simple

Probably the best-known Perl module implementing a web client is LWP and its sub-modules. LWP::Simple is, not surprisingly, a simple interface to the library.

The code to use it is very simple. It exports a function called get that fetches the content of a single URL:

use LWP::Simple qw(get);

my $url = 'https://perlmaven.com/';
my $html = get $url;

This is really simple, but in case of failure you don't know what really happened. The get function just returns undef.
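If you stay with LWP::Simple, you can at least detect the failure. The following sketch checks the return value of get, and it also shows getstore, which saves the page to a file and returns the HTTP status code. The fetch_to_file helper and the file name are assumptions made for this illustration.

```perl
use strict;
use warnings;
use 5.010;
use LWP::Simple qw(get getstore is_success);

# Save a URL to a file and die with the status code if that fails.
# (fetch_to_file is a hypothetical helper for this example.)
sub fetch_to_file {
    my ($url, $filename) = @_;

    my $status = getstore($url, $filename);
    die "Could not fetch $url: status $status\n"
        unless is_success($status);

    return $status;
}

# Only runs when a URL is given on the command line.
if (@ARGV) {
    my $url = $ARGV[0];

    my $html = get $url;
    if (defined $html) {
        say "Fetched $url, length: ", length $html;
    } else {
        say "Could not fetch $url";
    }

    # The file name is chosen arbitrarily for the example.
    my $status = fetch_to_file($url, 'page.html');
    say "Saved $url to page.html (status $status)";
}
```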

Get one page using HTTP::Tiny

If you need to know why a request failed, HTTP::Tiny is much better, even if the code is slightly longer:

use HTTP::Tiny;

my $url = 'https://perlmaven.com/';
my $response = HTTP::Tiny->new->get($url);
if ($response->{success}) {
    my $html = $response->{content};
}

HTTP::Tiny is object oriented, hence you first call the constructor new. It returns an object and on that object you can immediately call the get method.

The get method returns a reference to a hash with a number of interesting keys: success will be true or false, content will hold the actual HTML content, status is the HTTP status code (200 for success, 404 for not found, etc.), and reason is the text that goes with that status code.

Try printing it out using Data::Dumper. It is very useful!
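As a rough sketch of how these keys can be used together, the following function turns a response into a one-line summary. The name summarize_response is made up for this example. The check for status 599 relies on HTTP::Tiny's documented behaviour of using that pseudo-status when no HTTP response arrived at all, in which case the error text is stored in content.

```perl
use strict;
use warnings;
use 5.010;
use HTTP::Tiny;

# Turn an HTTP::Tiny response hash into a one-line summary.
# (summarize_response is a hypothetical helper for this example.)
sub summarize_response {
    my ($response) = @_;

    if ($response->{success}) {
        my $type = $response->{headers}{'content-type'} // 'unknown type';
        my $size = length($response->{content} // '');
        return "OK $response->{status}: $size bytes of $type";
    }

    # 599 means there was no HTTP response at all (for example the
    # connection failed); the error message is in the content field.
    if ($response->{status} == 599) {
        my $error = $response->{content} // '';
        $error =~ s/\s+\z//;    # drop the trailing newline
        return "No HTTP response: $error";
    }

    return "Failed: $response->{status} $response->{reason}";
}

# Only runs when URLs are given on the command line.
foreach my $url (@ARGV) {
    say "$url ", summarize_response(HTTP::Tiny->new->get($url));
}
```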

A fuller example with HTTP::Tiny

use strict;
use warnings;
use 5.010;

use HTTP::Tiny;
use Data::Dumper qw(Dumper);

my $url = 'https://perlmaven.com/';

my $response = HTTP::Tiny->new->get($url);
if ($response->{success}) {
    # A header that appears more than once is stored as an array reference
    while (my ($name, $v) = each %{$response->{headers}}) {
        for my $value (ref $v eq 'ARRAY' ? @$v : $v) {
            say "$name: $value";
        }
    }
    if (length $response->{content}) {
        say 'Length: ', length $response->{content};
        # Remove the page itself so the dump below stays short
        delete $response->{content};
    }
    print "\n";
    print Dumper $response;
} else {
    # The key holding the status text is called 'reason'
    say "Failed: $response->{status} $response->{reason}";
}

The first part of the output was generated by the while-loop on the headers hash, then we used Data::Dumper to print out the whole hash. Well, except for the content itself, which we deleted from the hash. It would have been too much for this article, and if you'd like to see the content, you can just visit the main page of the Perl Maven site.

content-type: text/html; charset=utf-8
set-cookie: dancer.session=8724695823418674906981871865731; path=/; HttpOnly
x-powered-by: Perl Dancer 1.3114
server: HTTP::Server::PSGI
server: Perl Dancer 1.3114
content-length: 21932
date: Fri, 19 Jul 2013 15:20:18 GMT

$VAR1 = {
          'protocol' => 'HTTP/1.0',
          'headers' => {
                         'content-type' => 'text/html; charset=utf-8',
                         'set-cookie' => 'dancer.session=8724695823418674906981871865731; path=/; HttpOnly',
                         'x-powered-by' => 'Perl Dancer 1.3114',
                         'server' => [
                                       'HTTP::Server::PSGI',
                                       'Perl Dancer 1.3114'
                                     ],
                         'content-length' => '21932',
                         'date' => 'Fri, 19 Jul 2013 15:20:18 GMT'
                       },
          'success' => 1,
          'reason' => 'OK',
          'url' => 'https://perlmaven.com.local:5000/',
          'status' => '200'
        };

Downloading many pages

Finally, here is an example of downloading many pages using HTTP::Tiny.

use strict;
use warnings;
use 5.010;

use HTTP::Tiny;

my @urls = qw(
    https://perlmaven.com/
    https://cn.perlmaven.com/
    https://br.perlmaven.com/
);

# One client object is reused for all the requests
my $ht = HTTP::Tiny->new;

foreach my $url (@urls) {
    say "Start $url";
    my $response = $ht->get($url);
    if ($response->{success}) {
        say 'Length: ', length $response->{content};
    } else {
        # The key holding the status text is called 'reason'
        say "Failed: $response->{status} $response->{reason}";
    }
}

The code is quite straightforward. We have a list of URLs in the @urls array. An HTTP::Tiny object is created and assigned to the $ht variable. Then, in a for-loop, we go over each URL and fetch it.

In order to save space in this article I only printed the size of each page.

This is the result:

Start https://perlmaven.com/
Length: 19959
Start https://cn.perlmaven.com/
Length: 13322
Start https://br.perlmaven.com/
Length: 12670

The simplicity has a price, of course. It means that we wait for each request to finish before we send out the next one. As most of the time is spent waiting for the request to travel to the remote server, then waiting for the remote server to process the request, and then waiting until the response reaches us, we waste quite a lot of time. We could have sent all 3 requests in parallel and we would get our results much sooner.

However, this is going to be covered in another article.
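To get a feeling for how much time the sequential loop spends waiting, you could measure each request with Time::HiRes. This is only a sketch: the timed helper is a name invented for this example, the URLs are taken from the command line, and the numbers will depend on your network.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use Time::HiRes qw(time);

# Call the given code reference and return how many seconds it took,
# followed by whatever the code returned.
# (timed is a hypothetical helper for this example.)
sub timed {
    my ($code) = @_;

    my $start  = time();
    my $result = $code->();

    return (time() - $start, $result);
}

my $ht    = HTTP::Tiny->new;
my $total = 0;

# Only does something when URLs are given on the command line.
foreach my $url (@ARGV) {
    my ($elapsed, $response) = timed(sub { $ht->get($url) });
    $total += $elapsed;
    printf "%-35s %3s  %.3f s\n", $url, $response->{status}, $elapsed;
}
printf "Total time for all requests: %.3f s\n", $total if @ARGV;
```

Because the requests run one after the other, the total is the sum of the individual times.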