If you analyze the requests users make to your web site, you'll have to deal with enormous numbers of bots, spiders, and other automated requests for your resources, none of which represent measurable users. As promised in Annotating User Events for Cohort Analysis, here's how I handle them.

I wrote a tiny piece of Plack middleware which I enabled in the .psgi file which bundles my application:

    package MyApp::Plack::Middleware::BotDetector;

    use Modern::Perl;
    use Plack::Request;
    use Regexp::Assemble;

    use parent 'Plack::Middleware';

    # Build the regex once, when the module loads
    my $bot_regex = make_bot_regex();

    sub call {
        my ($self, $env) = @_;
        my $req          = Plack::Request->new( $env );
        my $user_agent   = $req->user_agent;

        if ($user_agent) {
            $env->{'BotDetector.looks-like-bot'} = 1
                if $user_agent =~ qr/$bot_regex/;
        }

        return $self->app->( $env );
    }

    sub make_bot_regex {
        my $ra = Regexp::Assemble->new;

        # One user agent fragment per line of the DATA section
        while (<DATA>) {
            chomp;
            $ra->add( '\b' . quotemeta( $_ ) . '\b' );
        }

        return $ra->re;
    }

    1;

Plack middleware wraps around the application to examine and possibly modify the incoming request, to call the application (or the next piece of middleware), and to examine and possibly modify the outgoing response. Plack conforms to the PSGI specification to make this possible.

Update: This middleware is now available as Plack::Middleware::BotDetector from the CPAN. Thanks to Big Blue Marble and Trendshare for sponsoring its development and release.

All of that means that any piece of middleware gets activated by something which calls its call() method, passing in the incoming request as the first parameter. This request is a hash reference with keys defined by the PSGI specification. The application, or at least the next piece of middleware to call, is available from the object's accessor method app().

(I'm lazy. I use Plack::Request to turn $env into an object. This is not necessary.)
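To make the calling convention concrete, here's a minimal sketch that uses only core Perl, with no Plack at all. It's an illustration under assumptions, not the module above: wrap_with_bot_detector(), the stand-in application, and the single-word regex are all hypothetical. The one detail taken from PSGI itself is that the User-Agent request header arrives in the environment as HTTP_USER_AGENT, which is what Plack::Request's user_agent() reads for you.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real application. A PSGI application is a
# code reference which takes the environment hash reference and returns
# a three-element response: status, headers, body.
my $app = sub {
    my ($env) = @_;
    my $kind  = $env->{'BotDetector.looks-like-bot'} ? 'bot' : 'user';
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ $kind ] ];
};

# Middleware written as a plain closure (hypothetical helper): examine
# $env, possibly modify it, then call the wrapped application.
sub wrap_with_bot_detector {
    my ($inner, $bot_regex) = @_;

    return sub {
        my ($env) = @_;

        # PSGI stores the User-Agent request header under this key
        my $user_agent = $env->{HTTP_USER_AGENT};

        $env->{'BotDetector.looks-like-bot'} = 1
            if defined $user_agent && $user_agent =~ $bot_regex;

        return $inner->( $env );
    };
}

my $wrapped  = wrap_with_bot_detector( $app, qr/\bGooglebot\b/ );
my $response = $wrapped->( { HTTP_USER_AGENT => 'Googlebot/2.1' } );

print $response->[2][0], "\n";    # prints "bot"
```

Subclassing Plack::Middleware gives you the same shape with less ceremony: call() plays the role of the inner closure and app() holds what this sketch calls $inner.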

The rest of the code is really simple. I have a list of unique segments of the user agent strings I've seen in this application. I use Regexp::Assemble to turn these words into a single (efficient) regex. If the incoming request's user agent string matches anything in the regex, I add a new entry to the environment hash.
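If you want to see what the assembled pattern does without installing anything, here's a rough core-Perl stand-in. It joins the quoted fragments with a plain alternation instead of using Regexp::Assemble, so it should match the same strings but without the optimization that module performs. The fragment list, the looks_like_bot() helper, and the sample user agent strings are hypothetical examples, not the contents of my DATA section.

```perl
use strict;
use warnings;

# Hypothetical fragments; the real list lives in the module's DATA section
my @fragments = ( 'Googlebot', 'bingbot', 'Yahoo! Slurp' );

# Quote each fragment and anchor it on word boundaries, as
# make_bot_regex() does, then join everything into one alternation
my $alternation = join '|', map { '\b' . quotemeta( $_ ) . '\b' } @fragments;
my $bot_regex   = qr/$alternation/;

# Hypothetical helper: true if the user agent contains a known fragment
sub looks_like_bot {
    my ($user_agent) = @_;
    return defined $user_agent && $user_agent =~ $bot_regex ? 1 : 0;
}

my @samples = (
    'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
);

for my $user_agent (@samples) {
    printf "%-4s %s\n", ( looks_like_bot( $user_agent ) ? 'bot' : 'user' ),
        $user_agent;
}
```

The quotemeta() call matters because real user agent fragments contain characters such as !, +, and . which would otherwise be treated as regex syntax.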

With that in place, any other piece of middleware executed after this point in the request, or the application itself, can examine the environment and choose different behavior based on the bot-looking-ness of any request. My cohort event logger method looks like:

    sub log_cohort_event {
        my ($self, %event) = @_;

        return if $self->request->env->{'BotDetector.looks-like-bot'};

        $event{usertoken} ||= $self->sessionid || 'unknownuser';
        push @{ $self->cohort_events }, \%event;
    }

The return if line, which bails out early when the middleware has flagged the request as a bot, is all it took in my application to stop logging cohort events for spiders. If and when I see a new spider in the logs, I can exclude it by adding a line to the middleware's DATA section and restarting the server.

(You might rather store this information in a database, but I'd rather build the regex once than loop through a database with a LIKE query. I haven't found an ideal alternate solution, which is why I haven't put this on the CPAN. Perhaps this is two modules, one for the middleware and one which exports a regex to identify spider user agents.)

There's one more trick to this cohort event logging: traceability. That's the topic for next time.