Wallflower is my static website generator. Well, not really: it actually generates a static version of any Plack application, provided the application behaves reasonably when seen as a static site.

(Read on, I'm asking for help towards the end!)

Improvements

The latest release (version 1.008) finally added support for the Last-Modified response header. Combined with the support for the If-Modified-Since request header (added in version 1.002), it makes it possible to regenerate a static web site faster, by only generating those files that were actually modified.

It goes like this:

- during the first run, the application sends a Last-Modified header for pages where it knows the date of the latest changes for the source (think a blog with a publication/update date)

- wallflower saves the static file and sets the filesystem date to the value of the Last-Modified header

- during the following runs, if the target directory from the previous run still exists, wallflower adds an If-Modified-Since header set to the modification time of the files that are already there

- if the application figures out the content has not changed, it can send a 304 Not Modified response, which wallflower interprets as "keep the existing file and move on"
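The handshake above can be sketched as follows. Python stands in for the Perl/PSGI application here, purely for illustration, and the page store with its modification time is made up:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical source data: path -> (modification time, content)
PAGES = {
    "/": (datetime(2023, 5, 1, tzinfo=timezone.utc), "<html>home</html>"),
}

def respond(path, if_modified_since=None):
    """Return (status, headers, body) for one request in the handshake."""
    mtime, body = PAGES[path]
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if mtime <= since:
            # Source unchanged: wallflower keeps the existing file
            return 304, {}, ""
    return 200, {"Last-Modified": format_datetime(mtime, usegmt=True)}, body
```

On the first run the crawler gets a 200 with a Last-Modified header and stamps the saved file with that date; on the next run it sends that date back as If-Modified-Since and, if nothing changed, gets a 304 and skips the file.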

This will only make the generation of subsequent versions of the site faster if generating a page is a somewhat costly operation for the application, and if the application can decide that the content is still fresh before doing the work.

There's actually a Plack middleware that supports sending a 304 Not Modified response when the conditions apply: Plack::Middleware::ConditionalGET. However, adding it to an application processed with wallflower won't make the generation faster, because it lets the application generate the entire response before deciding to send a 304 instead. Since wallflower does not use the network at all, this gains nothing. The only case where I can imagine actual savings (in memory, and in the time saved by not copying the content to the filesystem) is when the application serves large static files and returns a filehandle to the content.
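To see why, here is a toy version of what a conditional-GET layer does when it wraps an application: the full response is built first, then downgraded to a 304. The names are illustrative and do not reflect Plack::Middleware::ConditionalGET's actual API:

```python
from email.utils import parsedate_to_datetime

def conditional_get(app):
    """Wrap an app; replace its response with a 304 when it is stale."""
    def wrapped(environ):
        # The full body is generated no matter what...
        status, headers, body = app(environ)
        since = environ.get("HTTP_IF_MODIFIED_SINCE")
        last = headers.get("Last-Modified")
        if (since and last
                and parsedate_to_datetime(last) <= parsedate_to_datetime(since)):
            # ...and only then discarded: the expensive work already happened
            return 304, {}, ""
        return status, headers, body
    return wrapped
```

The savings wallflower is after require the application to take the 304 decision *before* generating the page, as in the handshake described earlier.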

Remaining issues

Wallflower visits the application starting from / (by default), and then follows all the links found in text/html and text/css files. Since it behaves as a crawler, any content not reachable from the starting point will be missed.

Nowadays, even static web sites can be made quite dynamic with a little help from JavaScript. If the JavaScript loads more content from the application, wallflower won't see it. (There's no robust way to parse the JavaScript to find out which URLs it's going to load.)

This is easily fixed by extending the list of starting points (either on the command line or in a file passed as an argument), but it can become cumbersome.

I was thinking about ways of extending this: what if the application generated a list of links that are expected to exist, but are not necessarily linked from anywhere in the HTML or CSS? That page could be added to wallflower's initial crawling list, but we wouldn't want to save the page itself as part of the generated site.

The idea of reading but not archiving reminded me of a famous artifact from Usenet times: the X-No-Archive header.

Here's my first question / call for help: should wallflower interpret the X-No-Archive: yes header in an HTTP response as "you can follow the links, but don't save this file as part of your crawling"?
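The proposed behaviour is easy to sketch: crawl a page for links as usual, but skip writing it to disk when the response carries the header. Here fetch(), extract_links() and save() are hypothetical stand-ins for what wallflower already does internally:

```python
def process(url, fetch, extract_links, save):
    """Crawl one URL; honour an X-No-Archive: yes response header."""
    status, headers, body = fetch(url)
    links = []
    if "html" in headers.get("Content-Type", ""):
        links = extract_links(body)
    # Follow the links, but don't archive the page itself if asked not to
    if headers.get("X-No-Archive", "").lower() != "yes":
        save(url, body)
    return links
```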

Other possible improvements

It should be possible to add support for reading sitemaps, for example with WWW::Sitemap::XML, if the application produces one such file.
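As a rough idea of what that involves (the Perl implementation would presumably use WWW::Sitemap::XML; this sketch uses Python's standard XML parser and the standard sitemap namespace):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol at sitemaps.org
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract the <loc> URLs from a sitemap, to seed the crawl."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]
```

Each URL found this way would simply be appended to wallflower's list of starting points.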

There may be other such standard files that are not usually linked from the web site, but that browsers look for (robots.txt comes to mind) and that wallflower should at least poke for. Note: favicon.ico is not one of them, since it's linked from the <head> section of most HTML pages.

Can anyone think of other such well-known URLs?