Paola Kathuria

Log analysers are not accurate. They over-report visits and over-count some browsers while under-counting other browsers. They cannot accurately distinguish spiders and robots from human visitors and they do not use fool-proof techniques for counting visits and visitors.

Spiders and robots are programs which are sent out to web sites to index them, check links or fetch content.

Your web browser is also a program: spiders and robots are essentially no different to a web browser in terms of what they can do. And, as it happens, some browsers can be set up to send out link-checking robots.

There are thousands of spiders and robots visiting web sites. We've been compiling a database of them for the past 10 years. We use it to filter out spiders and robots from our own site stats.

A staggering 90% of page requests to this web site (www.limov.com), for example, are by spiders and robots.

There are a handful of legitimate spiders but many are programs created to harvest e-mail addresses, copy content or to look for vulnerabilities in your web server. It is in the originator's interests that their programs look like regular visitors so that they gain full access to your site.

I use examples from actual logs in this article, mostly from this web site.

Spiders visit web sites and follow links on a page, normally to collect content so that it can be indexed by search engines. They can also collect specific content such as e-mail addresses, images, and PDFs.

Robot is a term I use to describe programs which make single-hit visits, often hitting the same page at regular intervals. Robots don't follow links.

You don't need to be a computer expert to have your own spider. Source code is freely available online.

I will use 'bot' for the remainder of this article to refer to both spiders and robots.

In this section, I'll be covering server log structure, how bots are supposed to identify themselves and how you can find rogue bots in logs. If you know all this, skip to the next section.

You need to know the difference between requests and hits to be able to interpret web logs and stats. Requests refer to pages. Most web pages include a mixture of text and images. The images are included in the page as links to files on the server. If a web page includes 10 graphics, accessing the page will result in 1 request and 11 hits, with one log line per hit.

Every hit made to a web site is logged. A hit can vary from finding out whether a page or file has been updated to fetching web pages, style sheets, images and other files, such as PDFs.
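To make the difference between requests and hits concrete, here is a minimal Python sketch that counts both from a list of logged URLs. The set of extensions treated as pages is an assumption for illustration (.lml is the page extension that appears in this site's logs); any real site would need its own list.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed for illustration: URL extensions that count as pages on this site.
PAGE_EXTENSIONS = {"", ".html", ".htm", ".lml"}

def is_page(url):
    """Return True if the URL path looks like a page rather than an embedded file."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lower() in PAGE_EXTENSIONS

def count_requests_and_hits(urls):
    """Every logged URL is a hit; only pages count as requests."""
    requests = sum(1 for url in urls if is_page(url))
    return requests, len(urls)

# One page that embeds ten graphics (the graphic names are made up).
urls = ["/colour/"] + [f"/images/g{i}.gif" for i in range(10)]
print(count_requests_and_hits(urls))  # (requests, hits)
```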

Failed requests are also logged, for example when a page has been removed or when it's password-protected. Failed hits also include hacking attempts to invoke vulnerabilities in (mostly) Microsoft Windows servers.

This information is logged for each hit:

IP address or host - where the request comes from

username - the username of an authenticated user (via .htaccess)

date/time - date and time of access

request - request type (GET / POST / HEAD)

URL - what was requested

version - HTTP version

status code - success/failure code

size - number of bytes downloaded

referrer - URL of referring page

user-agent - how the browser identifies itself

'Agent' is a term used for tools sent out to act on your behalf. Browsers and bots are agents.
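As a rough sketch, the logged fields listed above can be pulled out of a log line with a regular expression. The pattern assumes the layout that Apache calls the combined log format, which is what the sample lines in this article look like; the names LOG_PATTERN and parse_line are mine and not part of any log tool.

```python
import re

# Assumed layout: host, identity, user, [time], "request", status, size,
# "referrer", "user-agent". Stray spaces inside the request quotes are tolerated.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"\s*(?P<method>\S+) (?P<url>\S+) (?P<version>[^"\s]+)\s*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = (
    '255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] '
    '"GET /css/screen.css HTTP/1.1" 200 6600 '
    '"http://www.limov.com/colour/" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) '
    'Gecko/20060111 Firefox/1.5.0.1"'
)
fields = parse_line(line)
print(fields["host"], fields["url"], fields["status"], fields["size"])
```

A size of "-" (as in the redirect lines) is kept as text rather than converted to a number, so nothing is lost for hits that return no body.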

Here are actual log lines resulting from a visitor displaying one web page, the Colour Selector entrance page (I've changed the IP address):

255.60.45.22 - - [04/Mar/2006:00:40:56 +0000] "GET /colour/ HTTP/1.1" 302 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.145.22 - - [04/Mar/2006:00:40:57 +0000] "GET /colour/?ID=WQVZHB7F7N30D00 HTTP/1.1" 302 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /colour/ HTTP/1.1" 200 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen.css HTTP/1.1" 200 6600 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/icons/colour-favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen-libr.css HTTP/1.1" 200 1087 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen-nav.css HTTP/1.1" 200 1560 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/print.css HTTP/1.1" 200 2167 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/g-libr.jpg HTTP/1.1" 200 4841 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/l-limov.gif HTTP/1.1" 200 3613 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /p.gif HTTP/1.1" 200 49 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-serv.gif HTTP/1.1" 200 751 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-port.gif HTTP/1.1" 200 565 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-abou.gif HTTP/1.1" 200 950 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-cont.gif HTTP/1.1" 200 932 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/p-my-yc-cm.gif HTTP/1.1" 200 2280 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-cs.gif HTTP/1.1" 200 106 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-nv.gif HTTP/1.1" 200 70 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-sw.gif HTTP/1.1" 200 133 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-bg.gif HTTP/1.1" 200 877 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"

By the user-agent

If it fetches robots.txt

By its behaviour

Spiders and robots usually identify themselves in the user-agent. However, there is no standard text that they're supposed to add to the user-agent (such as "I'm a spider") so that they can be found.

This means that programs which process logs to produce stats have no automatic way to detect them. A list of spider and robot user-agents and IP addresses must be maintained on an on-going basis so that these visitors are not included in your regular web site stats. This does not happen automatically.
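Such a filter can be as simple as a hand-maintained list of substrings. The sketch below is only an illustration: the three markers are taken from user-agents that appear in this article's log extracts, and the names KNOWN_BOT_MARKERS and is_known_bot are made up. A real list would be far longer and would still miss bots that pose as browsers.

```python
# Assumed, hand-maintained list of lower-case substrings seen in bot user-agents.
KNOWN_BOT_MARKERS = ["googlebot", "baiduspider", "linkwalker"]

def is_known_bot(user_agent):
    """Return True if the user-agent contains any marker from the list."""
    agent = user_agent.lower()
    return any(marker in agent for marker in KNOWN_BOT_MARKERS)

agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Baiduspider+(+http://www.baidu.com/search/spider.htm)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) "
    "Gecko/20060111 Firefox/1.5.0.1",
]
human_agents = [agent for agent in agents if not is_known_bot(agent)]
print(len(human_agents))
```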

Web developers can put instructions in text files for spiders to make some parts of the site off limits. This could be for performance reasons. The instructions are put on the server in a file called robots.txt. Spiders and robots are supposed to read this file at the start of each visit but there's no way to enforce that they do. Most bots ignore the file.

However, if a visitor does access robots.txt, it's most likely a spider.

host | date/time | requested file | user-agent
66.249.71.53 | 05/Mar/2006 @ 00:29:14 | /robots.txt | Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.71.53 | 05/Mar/2006 @ 00:29:15 | /contact.lml | Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.65.5 | 05/Mar/2006 @ 00:29:28 | /projects.lml?w=0&wo=1&p=bbr-5 | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.65.5 | 05/Mar/2006 @ 00:34:33 | /other-work.lml?p=disney-1 | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.71.69 | 05/Mar/2006 @ 00:39:31 | /projects.lml?w=0&wo=1&p=oup-6 | Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.64.42 | 05/Mar/2006 @ 01:06:50 | /other-work.lml?p=hp-1 | Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.71.32 | 05/Mar/2006 @ 01:20:26 | /projects.lml?p=crash-1 | Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.65.5 | 05/Mar/2006 @ 01:56:53 | / | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.64.30 | 05/Mar/2006 @ 02:18:11 | /projects.lml?p=whm-1 | Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.71.45 | 05/Mar/2006 @ 02:19:14 | /projects.lml?p=bbr-4 | Googlebot/2.1 (+http://www.google.com/bot.html)
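One way to act on the robots.txt signal is to collect every host that asked for the file and treat that host's other hits as spider traffic. The record layout (dictionaries with host and url keys) is an assumption made for this sketch. Note the limitation visible in the Googlebot extract: a spider that crawls from several IP addresses is only caught on the address that fetched the file.

```python
def hosts_that_read_robots_txt(records):
    """Return the set of hosts that requested /robots.txt at least once."""
    return {r["host"] for r in records if r["url"].split("?")[0] == "/robots.txt"}

def split_by_robots_txt(records):
    """Split records into (spider, other) using only the robots.txt signal."""
    spider_hosts = hosts_that_read_robots_txt(records)
    spider = [r for r in records if r["host"] in spider_hosts]
    other = [r for r in records if r["host"] not in spider_hosts]
    return spider, other

# The first three Googlebot hits from the extract above.
records = [
    {"host": "66.249.71.53", "url": "/robots.txt"},
    {"host": "66.249.71.53", "url": "/contact.lml"},
    {"host": "66.249.65.5", "url": "/projects.lml?w=0&wo=1&p=bbr-5"},
]
spider, other = split_by_robots_txt(records)
print(len(spider), len(other))
```

The hit from 66.249.65.5 lands in the second list even though its user-agent says Googlebot, which is why this signal has to be combined with the user-agent and behaviour checks.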

host | date/time | requested file | user-agent
202.108.22.72 | 05/Mar/2006 @ 03:30:17 | / | Baiduspider+(+http://www.baidu.com/search/spider.htm)
202.108.22.72 | 05/Mar/2006 @ 07:32:58 | / | Baiduspider+(+http://www.baidu.com/search/spider.htm)
202.108.22.72 | 05/Mar/2006 @ 12:10:02 | / | Baiduspider+(+http://www.baidu.com/search/spider.htm)

There are long gaps between visits. If Baiduspider hasn't been added to your web site stat program's filter file for spiders (assuming one exists), then this spider's visits will show up as regular one-page visits in your web site stats.

Next is a suspect series of visits. The log lines appear together in an uninterrupted block in the log file. The log lines of human visitors are usually interleaved as people take longer between requests compared to spiders.

The accesses shown below are from different IP addresses but they all refer to the same session id. The session id is a unique visitor id we add to the URL if we can't put it in a cookie. The IP addresses are from different countries.

host | date/time | requested file | gap
213.61.13.68 | 5/Mar/2006 @ 04:37:56 | /?ID=X46ZB0H3N5C00B4 | -
213.61.13.68 | 5/Mar/2006 @ 04:37:57 | /whatsnew.lml?ID=X46ZB0H3N5C00B4 | 1s
213.61.13.68 | 5/Mar/2006 @ 04:37:58 | /projects.lml?ID=X46ZB0H3N5C00B4 | 1s
213.61.13.68 | 5/Mar/2006 @ 04:38:01 | /journal/?ID=X46ZB0H3N5C00B4 | 3s
221.45.136.41 | 5/Mar/2006 @ 04:38:06 | /description.lml?sm=1&w=8&ID=X46ZB0H3N5C00B4 | 5s
196.40.26.246 | 5/Mar/2006 @ 04:38:11 | /contents.lml?ID=X46ZB0H3N5C00B4 | 5s
192.138.77.36 | 5/Mar/2006 @ 04:38:12 | /preferences.lml?ID=X46ZB0H3N5C00B4 | 1s
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /ico/pref-favicon.ico | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /css/screen-nav.css | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /css/screen-site.css | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /css/print.css | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /p.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/l-limov.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /css/screen.css | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-os-01.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/site-offsite.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-os-2.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-os-1.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-fs-s.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-lh-s.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-fs-l.gif | -
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-lh-t.gif | -
195.113.171.76 | 5/Mar/2006 @ 04:38:15 | /colour/tips.lml?ID=X46ZB0H3N5C00B4 | 3s
203.148.194.131 | 5/Mar/2006 @ 04:38:28 | /contact.lml?ID=X46ZB0H3N5C00B4 | 13s
211.106.21.155 | 5/Mar/2006 @ 04:39:20 | /description.lml?sm=1&w=1&ID=X46ZB0H3N5C00B4 | 52s
216.41.76.34 | 5/Mar/2006 @ 04:39:40 | /projects.lml?w=0&p=oup-7&ID=X46ZB0H3N5C00B4 | 20s
219.24.170.3 | 5/Mar/2006 @ 04:39:49 | /projects.lml?wm=1&p=oup-7&ID=X46ZB0H3N5C00B4 | 9s
220.84.214.190 | 5/Mar/2006 @ 04:39:59 | /about.lml?ID=X46ZB0H3N5C00B4 | 10s

Is it a coincidence that different people in different countries happened to visit this web site using the same session id in the URL within a few seconds of each other, each only fetching web pages and not the images and style sheets?

I'd say this was a spider. There is nothing in the host or user-agent information which allows us to recognise it. Only its odd behaviour gives it away.

To filter out this visitor from my custom stats in future, I have to block it by the session id and/or all the IP addresses it used.
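This kind of visitor can be surfaced by grouping requests by session id and counting the hosts that used each one. The sketch below assumes the session id travels in an ID query parameter, as in the URLs above; the threshold of three hosts is an arbitrary choice of mine, and visitors behind a pool of proxies can exceed it too, so a flagged session still needs a human look.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

def hosts_per_session(records):
    """Map each session id (the ID query parameter) to the hosts that used it."""
    sessions = defaultdict(set)
    for host, url in records:
        ids = parse_qs(urlsplit(url).query).get("ID")
        if ids:
            sessions[ids[0]].add(host)
    return sessions

def suspect_sessions(records, max_hosts=3):
    """Return session ids seen from more than max_hosts hosts (assumed threshold)."""
    return {sid for sid, hosts in hosts_per_session(records).items()
            if len(hosts) > max_hosts}

# A few page requests from the suspect visit above.
records = [
    ("213.61.13.68", "/?ID=X46ZB0H3N5C00B4"),
    ("213.61.13.68", "/whatsnew.lml?ID=X46ZB0H3N5C00B4"),
    ("221.45.136.41", "/description.lml?sm=1&w=8&ID=X46ZB0H3N5C00B4"),
    ("196.40.26.246", "/contents.lml?ID=X46ZB0H3N5C00B4"),
    ("192.138.77.36", "/preferences.lml?ID=X46ZB0H3N5C00B4"),
    ("192.138.77.36", "/css/screen.css"),
    ("195.113.171.76", "/colour/tips.lml?ID=X46ZB0H3N5C00B4"),
]
print(suspect_sessions(records))
```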

In addition to using algorithms to process standard server logs, people can develop custom logs with extra information. They track visitors by putting a generated unique session id in the URL or writing it to a cookie. The id is read back at every request so that the request can be logged against the session id.

If you don't use session ids, you can make some guesses about which requests are from the same visitor by looking at the server logs.

The same host IP address in a short period

The referring page is from another web site

The user-agent looks like a browser

However, any of these might change within a visit.

This is what I've found from reviewing server logs regularly.

Spiders can be sent cookies and allow them to be reread by a site on subsequent visits.

visit count | IP | date/time | requested file | user-agent | referring page
1 | 209.167.50.22 | 21-Oct-2005 @ 15:19:34 | / | LinkWalker | www.emlc.org.uk/Links.htm
2 | 209.167.50.22 | 24-Oct-2005 @ 12:22:36 | / | LinkWalker | www.emlc.org.uk/Links.htm
3 | 209.167.50.22 | 25-Oct-2005 @ 16:09:39 | / | LinkWalker | www.emlc.org.uk/Links.htm
1 | 209.167.50.22 | 26-Oct-2005 @ 14:05:53 | / | LinkWalker | www.emlc.org.uk/Links.htm
- | 209.167.50.22 | 27-Oct-2005 @ 15:19:10 | / | LinkWalker | www.emlc.org.uk/Links.htm
1 | 209.167.50.22 | 28-Oct-2005 @ 14:57:17 | / | LinkWalker | www.emlc.org.uk/Links.htm
2 | 209.167.50.22 | 31-Oct-2005 @ 12:26:23 | / | LinkWalker | www.emlc.org.uk/Links.htm
3 | 209.167.50.22 | 01-Nov-2005 @ 12:57:48 | / | LinkWalker | www.emlc.org.uk/Links.htm
1 | 209.167.50.22 | 02-Nov-2005 @ 14:34:34 | / | LinkWalker | www.emlc.org.uk/Links.htm

When examining raw logs, it is common to see a single visit in which each page access is from a different host. This is how visitors appear in logs when their connection is via a cacheing proxy.

Here is a visitor to the Colour Selector who made 17 page requests from ten different hosts during a single three-minute visit. Most log analysers will interpret these page requests as ten separate visitors.

host address | date/time | requested file | referring page
anchovy.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:35:56 | /colour/colour.html | -
mozzarella.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:36:17 | /colour/216.html | /colour/colour.html
ham.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:36:46 | /colour/216/33ccff.html | /colour/216.html
anchovy.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:24 | /colour/216/3399ff.html | /colour/216.html
fides.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:30 | /colour/216/33ffff.html | /colour/216.html
pineapple.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:35 | /colour/216/66ffff.html | /colour/216.html
thyme.cant.ac.uk | 10-Aug-2002 @ 14:37:40 | /colour/216/66ccff.html | /colour/216.html
basil.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:45 | /colour/216/6699ff.html | /colour/216.html
tomato.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:49 | /colour/216/0099ff.html | /colour/216.html
anchovy.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:04 | /colour/216/ffccff.html | /colour/216.html
mozzarella.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:12 | /colour/216/ffcc33.html | /colour/216.html
anchovy.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:17 | /colour/216/6600ff.html | /colour/216.html
thyme.cant.ac.uk | 10-Aug-2002 @ 14:38:27 | /colour/216/ccffcc.html | /colour/216.html
mozzarella.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:33 | /colour/216/ccff66.html | /colour/216.html
tomato.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:37 | /colour/216/ccffff.html | /colour/216.html
jalapeno.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:52 | /colour/216/ffffff.html | /colour/216.html
oregano.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:39:03 | /colour/216bg.html | /colour/216.html
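The miscount is easy to reproduce. This sketch applies the naive rule "one host equals one visitor" to the hosts in the visit above; it illustrates the rule itself and is not the code of any particular log analyser.

```python
SUFFIX = ".ulcc.wwwcache.ja.net"

# Host of each of the 17 page requests, in log order. Short names stand for
# hosts under SUFFIX; thyme.cant.ac.uk is already a full host name.
names = [
    "anchovy", "mozzarella", "ham", "anchovy", "fides", "pineapple",
    "thyme.cant.ac.uk", "basil", "tomato", "anchovy", "mozzarella",
    "anchovy", "thyme.cant.ac.uk", "mozzarella", "tomato", "jalapeno",
    "oregano",
]
hosts = [name if "." in name else name + SUFFIX for name in names]

# Naive rule: every distinct host is a separate visitor.
naive_visitors = len(set(hosts))
print(len(hosts), "page requests,", naive_visitors, "apparent visitors")
```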

This second example is from an AOL user. Four images were viewed during a visit, each from a different host (and IP) address.

host address | time | requested file | user-agent
cache-mtc-aa09.proxy.aol.com | 00:24:13 | /workshops/14th/lindsay.jpg | Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)
cache-mtc-ak07.proxy.aol.com | 00:24:38 | /workshops/14th/frank-lindsay.jpg | Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)
cache-mtc-am07.proxy.aol.com | 00:25:27 | /workshops/14th/rosa2-2002-06-18.jpg | Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)
cache-mtc-ak03.proxy.aol.com | 00:25:38 | /workshops/14th/lindsay-size.jpg | Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)

And, because of cacheing proxies or people who use services such as AOL, different visitors can look like the same visitor if you look only at their host address.

The referring URL is information that is sometimes made available to the web server. It is the URL of the page which included a link to your web site which was followed by the visitor (for example, a page on your site might get listed in a search engine).

For the first page request of a visit, the referring URL will tell you the address of the page on another web site from where a link was followed. For subsequent pages, referrers will be pages within the site.

It isn't always the case that the referring site will only appear as the referrer for the first page in a visit because people use the browser's Back button a lot.

You might expect log lines for a visit to follow the trend of each page being the referrer of the next page accessed. The log lines of a made-up visit are shown next.

time | requested file | referring page
12:44:05 | /colour/ | http://www.site.com/links.html
12:47:00 | /colour/tips.lml | /colour/
12:47:16 | /colour/colour.lml | /colour/tips.lml
12:47:29 | /colour/browse-palettes.lml | /colour/colour.lml
12:50:05 | /library/ | /colour/browse-palettes.lml
12:50:12 | /projects.lml | /library/
12:50:24 | /colour/ | /projects.lml
12:50:32 | /services.lml | /colour/
12:50:45 | /colour/colour.lml | /services.lml
12:51:10 | /colour/tools.lml | /colour/colour.lml

However, when you look at your web server logs, you will see that people manage to get to pages from a page which doesn't include any links to the new page.

That is because they got to the new page from a link on an earlier page that was cached. When a visitor uses their browser to go back to a cached earlier page, the page isn't logged at the web server; the redisplay of a cached page means the browser gets the page from cache and not from the server.

time | requested file | referring page
12:44:05 | /colour/ | http://www.web-graphics.com/feature-002.php
12:47:00 | /colour/tips.lml | /colour/
12:47:16 | /colour/colour.lml | /colour/
12:47:29 | /colour/browse-palettes.lml | /colour/
12:50:05 | /library/ | /colour/browse-palettes.lml
12:50:12 | /projects.lml | /colour/colour.lml
12:50:24 | /colour/ | /library/
12:50:32 | /services.lml | /projects.lml
12:50:45 | /colour/colour.lml | /colour/
12:51:10 | /colour/tools.lml | /colour/colour.lml
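Requests reached from a cached page can be picked out by comparing each referrer with the page requested just before it. This is a minimal sketch under my own assumptions: a visit is a list of (requested file, referring page) pairs in log order, and a request counts as a jump back when its referrer is an earlier page of the same visit rather than the previous one.

```python
def back_button_jumps(visit):
    """Return (url, referrer) pairs whose referrer is an earlier page of the
    visit rather than the page requested immediately before."""
    jumps = []
    seen = set()
    previous_url = None
    for url, referrer in visit:
        if (previous_url is not None
                and referrer != previous_url
                and referrer in seen):
            jumps.append((url, referrer))
        seen.add(url)
        previous_url = url
    return jumps

# The first five requests of the visit above.
visit = [
    ("/colour/", "http://www.web-graphics.com/feature-002.php"),
    ("/colour/tips.lml", "/colour/"),
    ("/colour/colour.lml", "/colour/"),
    ("/colour/browse-palettes.lml", "/colour/"),
    ("/library/", "/colour/browse-palettes.lml"),
]
for url, referrer in back_button_jumps(visit):
    print(url, "was reached from the cached page", referrer)
```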

Given this, it is possible that someone can reach your site from a link on another site, explore your site for a bit but then return to the entry page through the browser's Back button.

In this event, the first page request in the visit would have a referrer of another web site. Subsequent pages would have internal referrers but then the same outside referrer would reappear in the logs.

time | requested file | referring page
14:57:15 | /colour/ | http://uk.google.yahoo.com/bin/query_uk?p=216+colours
15:06:55 | /library/ | /colour/
15:07:06 | /projects.lml | /library/
15:07:50 | /services.lml | /projects.lml
15:07:58 | /colour/ | http://uk.google.yahoo.com/bin/query_uk?p=216+colours
15:08:14 | /colour/tips.lml | /colour/

Another scenario that explains such a visit pattern is when a web site is accessed through multiple browsers. The choice of earlier and later pages on screen to interact with will contribute to the lack of a coherent path through the site in the logs.

In addition, anecdotal evidence suggests that people with screen resolutions higher than 800x600 browse with multiple windows. In the case of a site like this, where the session id is carried around in the URL, this behaviour becomes apparent when the log includes visits from the same referrer and from the same host address and user-agent.

session id | time | page requested | referring page
7D9JHV75DE9H9G6 | 15:06:48 | /inetuk/notice.lml | -
GEL8G594EHL5SH4 | 15:07:06 | /inetuk/notice.lml | -
- | 15:07:55 | /ico/hide1-favicon.ico | -
GEL8G594EHL5SH4 | 15:08:05 | /inetuk/links.lml | /inetuk/notice.lml
GEL8G594EHL5SH4 | 15:08:21 | /services.lml | /inetuk/links.lml
7D9JHV75DE9H9G6 | 15:08:28 | /inetuk/about.lml | /inetuk/notice.lml
GEL8G594EHL5SH4 | 15:08:48 | /projects.lml | /services.lml
7D9JHV75DE9H9G6 | 15:11:04 | /projects.lml | /inetuk/about.lml
- | 15:11:14 | /ico/port-favicon.ico | -
GEL8G594EHL5SH4 | 15:11:43 | /projects.lml | /services.lml
GEL8G594EHL5SH4 | 15:12:17 | /projects.lml?s=t | /projects.lml
7D9JHV75DE9H9G6 | 15:12:22 | /inetuk/about.lml | /inetuk/notice.lml
7D9JHV75DE9H9G6 | 15:14:00 | /inetuk/notice.lml | /inetuk/about.lml

It is possible for a different referring site to appear during a visit.

The next visit is of a visitor to this site via Google but with a different referring site for the sixth page request.

time | page requested | referring page
01:17:36 | /colour/ | http://www.google.com/search?q=color+palettes
01:18:07 | /library/guidelines.lml | /colour/
01:18:39 | /library/promotion.lml | /library/guidelines.lml
01:18:43 | /journal/ | /library/promotion.lml
01:22:12 | /library/promotion.lml | /library/guidelines.lml
01:22:27 | /journal/ | http://www.thestudyofdesign.com/links_magazines_l.asp

During their visit, they created a link to our site from theirs (complete with the session id in the URL) and then presumably tested the link which explains the appearance of the second outside referrer.

You can't rely on time between requests to decide if it's a new visit. This is because people might start something at work in the afternoon - go home without closing their browser - then come back and expect to carry on with whatever's in their browser. In this event, a gap between page requests could easily be 17 hours. It is not uncommon to see gaps of an hour or two in logs.

If accesses from the same host have a gap of more than 30 mins, WebTrends counts them as being from different visitors.

time | gap | requested file | referring page
10:41:13 | - | /colour/ | http://www.google.com/search?q=color+selector
10:41:23 | 0:00:10 | /colour/mix.lml?c=9CF | /colour/
10:41:41 | 0:00:18 | /colour/mix.lml?c=3CF | /colour/mix.lml?c=9CF
10:41:48 | 0:00:07 | /colour/mix.lml?c=6FF | /colour/mix.lml?c=3CF
10:41:55 | 0:00:07 | /colour/mix.lml?c=0FF | /colour/mix.lml?c=6FF
10:42:01 | 0:00:06 | /colour/mix.lml?c=F93 | /colour/mix.lml?c=0FF
10:42:12 | 0:00:11 | /colour/mix.lml?c=FC6 | /colour/mix.lml?c=F93
10:42:55 | 0:00:43 | /colour/mix.lml?c=F66 | /colour/mix.lml?c=FC6
11:17:28 | 0:34:33 | /colour/mix.lml?c=F63 | /colour/mix.lml?c=F66
11:18:01 | 0:00:33 | /colour/mix.lml?c=F60 | /colour/mix.lml?c=F63
(50 page requests not shown - gap range: 2 secs - 11 mins (average: 1 min))
12:18:08 | 0:00:09 | /colour/swatch.lml?c=3F9 | /colour/swatch.lml?c=6F6
12:42:41 | 0:24:33 | /colour/swatch.lml?c=3F6 | /colour/swatch.lml?c=3F9
12:42:47 | 0:00:06 | /colour/swatch.lml?c=3F3 | /colour/swatch.lml?c=3F6
12:47:42 | 0:04:55 | /colour/swatch.lml?c=3C3 | /colour/swatch.lml?c=3F3
12:48:39 | 0:00:57 | /colour/swatch.lml?c=3C6 | /colour/swatch.lml?c=3C3
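Here is a sketch of what a 30-minute inactivity rule does to the visit above. The function is my own minimal version of such a rule and is not taken from WebTrends or any other product; the timestamps are six of the request times from the table.

```python
from datetime import datetime, timedelta

def split_visits(times, timeout=timedelta(minutes=30)):
    """Start a new visit whenever the gap since the last request exceeds timeout."""
    visits = []
    previous = None
    for t in times:
        if previous is None or t - previous > timeout:
            visits.append([])
        visits[-1].append(t)
        previous = t
    return visits

stamps = ["10:41:13", "10:41:23", "10:41:41", "10:42:55", "11:17:28", "11:18:01"]
times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
visits = split_visits(times)
print(len(visits), [len(v) for v in visits])
```

The 34-minute pause before 11:17:28 splits one person's session in two, even though the referrer shows they simply carried on from the page they had left on screen.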

It is possible for the user-agent to change within a visit. When it happens, it's usually a robot visitor but it can also happen with human visitors.

The log lines below are a 183-page visit from the same IP address. This can be recognised as a spider by the quick requests in a short space of time.

host IP | date/time | requested file | user-agent
63.144.65.58 | 18/Apr/2001 @ 01:00:23 | /inetuk/providers.html | Mozilla/4.03 [en] (Win95; I)
63.144.65.58 | 18/Apr/2001 @ 01:02:16 | /inetuk/providers/akhter.html | Mozilla/4.03 [en] (Win95; I)
63.144.65.58 | 18/Apr/2001 @ 01:02:19 | /inetuk/providers/agent-cd.html | Mozilla/3.01Gold (Win95; I; 16bit)
63.144.65.58 | 18/Apr/2001 @ 01:02:21 | /inetuk/providers/andover.html | Mozilla/3.01Gold (Win95; I; 16bit)
63.144.65.58 | 18/Apr/2001 @ 01:02:21 | /inetuk/providers/angel.html | Mozilla/2.0 (compatible; MSIE 3.02; Windows 95)
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/providers/aladdin.html | Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/providers/apanet.html | Mozilla/3.0 (Win16; I)
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/providers/amity.html | Mozilla/4.03 [en] (Win95; I)
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/notify.html | Mozilla/3.0 (Win16; I)
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/catch/ | Mozilla/2.0 (compatible; MSIE 3.02; Windows 95)

host IP | date/time | requested file | user-agent
njproxy4.avaya.com | 30/Apr/2001 @ 14:34:18 | /colour/navigate.lml | Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; DigExt; WebSite-Watcher (unreg.) http://aignes.net)
njproxy4.avaya.com | 30/Apr/2001 @ 14:36:48 | /colour/navigate.lml | Mozilla/3.01 (compatible;)

hit # | date/time | requested file | user-agent
1 | 29/May/2001 @ 14:59:58 | /innovations/images/b-home.gif | Mozilla/3.01 (compatible;)
4 | 29/May/2001 @ 14:59:58 | /innovations/library/requirements.html | Mozilla/4.77 [en] (Win95; U)
5 | 29/May/2001 @ 14:59:58 | /innovations/images/d-structure.gif | Mozilla/3.01 (compatible;)
9 | 29/May/2001 @ 15:00:02 | /innovations/innovate.css | Mozilla/4.77 [en] (Win95; U)
10 | 29/May/2001 @ 15:00:02 | /innovations/images/g-libr.jpg | Mozilla/3.01 (compatible;)
14 | 29/May/2001 @ 15:43:55 | /innovations/favicon.ico | Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
15 | 29/May/2001 @ 16:12:41 | /~paola/ | Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
16 | 29/May/2001 @ 16:12:41 | /~paola/pictures/icons/paola.jpg | Mozilla/3.01 (compatible;)
23 | 29/May/2001 @ 16:12:41 | /~paola/paola.css | Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
24 | 29/May/2001 @ 16:12:42 | /~paola/pictures/icons/contents.gif | Mozilla/3.01 (compatible;)

Users can change the user-agent sent to web servers in browsers (e.g., Internet Explorer, Mozilla, Opera, Konqueror and Lynx) and spiders.

If these visitors aren't known to stats programs, their visits will be counted in browser stats.

On average, 10% of human visitors to this web site can't or won't accept cookies.

With a cookie-enabled browser, web users have control of how cookies are used.

They can accept all cookies

They can only accept cookies from certain domains

They can only accept certain cookies from certain domains

They can reject third-party cookies, those from a domain different to the current site

They can remove one or more cookies

They can edit the cookie contents

Regardless of what browser is used, it is always possible to remove cookies within a visit. Cookies may be removed deliberately, perhaps in a big clearout, or become corrupted or lost during a disk crash.

One can't assume that a) cookies get written or that b) they'll remain on visitors' computers.

StatMarket's HitBox counts visitors by use of third-party cookies. People can easily configure their browsers to reject third-party cookies - those not originating from the web site they're visiting. If someone visits a site with a HitBox counter and their browser rejects the cookie, HitBox will count every page request as a new visit. HitBox over-counts visits.

Example: imagine a site that had 2,000 actual visits in one day with three requests on average. The true visit count is 2,000. If our 10% non-cookie figure is typical, HitBox would correctly count 1,800 (90% of 2,000) of the visits. However, it would process the 600 (10% of 2,000 visits x 3 pages) page requests as visits, producing an incorrect total of 2,400 visits.
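The arithmetic of that example can be written out as a short function. It only models the counting rule as described here (one visit per cookie-accepting visitor, one visit per page request otherwise); the function and parameter names are mine.

```python
def counted_visits(actual_visits, pages_per_visit, no_cookie_percent):
    """Visits reported when every page request without a cookie counts as a visit."""
    no_cookie_visits = actual_visits * no_cookie_percent // 100
    cookie_visits = actual_visits - no_cookie_visits
    # Cookie visits are counted once; non-cookie visits once per page request.
    return cookie_visits + no_cookie_visits * pages_per_visit

reported = counted_visits(actual_visits=2000, pages_per_visit=3, no_cookie_percent=10)
print(reported, "reported against 2000 actual visits")
```

With these figures the over-count is 20%, and it grows with both the share of visitors who reject cookies and the number of pages they view.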

A lot of web sites are optimised for Internet Explorer because it's easier for developers to ignore other browsers; until a couple of years ago, the Marks and Spencer web site turned away Mozilla users, telling them to get a better browser.

To get around this problem, modern browsers let users set the user-agent to something else. This is usually MSIE, since so many sites are optimised for IE.

Sometimes the only indication that a visitor is a robot is the time between accesses. Our sites have repeat visitors that accept cookies and have a user-agent that looks like a normal browser.

What gives them away as robots are:

They visit at regular intervals and access the same pages, such as all the links on the home page

They access all the links on a web page and in the order they appear

They access 5-10 pages within a second

The last behaviour is how they can be spotted in the logs as their log lines will appear in clumps.
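A clump like that can be found with a sliding window over one visitor's request times. This is a sketch under assumptions of my own: timestamps have one-second resolution, so "within a second" is taken to mean requests stamped no more than one second apart, and five requests is the lower bound from the list above. The spider timestamps are the ones from the 63.144.65.58 visit shown earlier.

```python
from datetime import datetime, timedelta

def has_clump(times, min_requests=5, window=timedelta(seconds=1)):
    """Return True if at least min_requests requests fall inside one window."""
    times = sorted(times)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans no more than `window`.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 >= min_requests:
            return True
    return False

def parse(stamps):
    return [datetime.strptime(s, "%H:%M:%S") for s in stamps]

spider = parse(["01:00:23", "01:02:16", "01:02:19", "01:02:21", "01:02:21",
                "01:02:22", "01:02:22", "01:02:22", "01:02:22", "01:02:22"])
human = parse(["10:41:13", "10:41:23", "10:41:41", "10:41:48", "10:41:55"])
print(has_clump(spider), has_clump(human))
```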

Because so many sites are optimised for Microsoft Internet Explorer (MSIE), bots send an MSIE user-agent. If they aren't detected as bots, they'll over-represent the proportion of IE users, misleading the site's developers into thinking they made the right decision to turn away other browsers.

Below are log lines from a single IP address to this site. It's a robot specifically designed to show up in the logs with certain referring sites. This has become a trend in robots since blogs, for example, started publishing trackback links to referring sites. These bots are basically getting other sites to publish links to their sites.

date/time | requested file | referring page
30/Dec/2005 @ 18:20:50 | /journal/?ID=H64SBVK4NKN00F8&jm=1&e=1061 | http://www.adsense-xpress.falling.net/forex777.htm
30/Dec/2005 @ 18:20:51 | /journal/?ID=XQKSBV74NKM00DR&jm=1&e=1061 | http://www.adsense-xpress.falling.net/swapclix.htm
24/Jan/2006 @ 12:54:45 | /inetuk/interop96.lml | http://www.tvinfomercials.com/
24/Jan/2006 @ 12:54:45 | /inetuk/interop96.lml | http://www.7dayplan.war-q.com
17/Feb/2006 @ 06:30:53 | /inetuk/interop96.lml | http://www.bugtraininginfo.com/
20/Feb/2006 @ 11:51:05 | /inetuk/interop96.lml | http://www.phoneconferences247.com/
20/Feb/2006 @ 11:51:06 | /inetuk/interop96.lml | http://www.bugtraininginfo.com
26/Feb/2006 @ 06:32:18 | /inetuk/interop96.lml | http://www.catcast2006.com/
03/Mar/2006 @ 14:21:49 | /inetuk/interop96.lml | http://www.war-q.com
03/Mar/2006 @ 14:21:51 | /inetuk/interop96.lml | http://www.200-free-4resale-products.numbers.com

I've spent more time than I care to admit looking at server logs and discovered unexpected behaviour by both human and spider visitors.

Because unique visitors can't be accurately detected, some browsers end up being over- or under-counted by stats.

There are still more complications. For example, many hit counters collect stats via accesses to a GIF placed on your web pages. Text-only browsers and screen-readers don't access the GIF and so are never included in the browser stats. This means that some disabled visitors are not included at all in browser stats.

I've concluded that you mustn't believe your web stats if they're based on log analysis - they'll tend to tell you good news when the reality is likely to be discouraging.