Why Go Through the Trouble of Analyzing Logs?

A server log is a file generated by a web server that records every request the server has received. In the context of SEO, an entry is made whenever a page is requested by a bot or a searcher. While the format can differ slightly depending on your server, most logs follow the Common Log Format.
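For reference, here's what a single entry looks like; this is the canonical Apache example of the Common Log Format:

```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
```

The fields are the client IP, identity, authenticated user, timestamp, request line, status code, and response size in bytes.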



Server logs excel at providing insight into potential issues crawlers and searchers face when accessing your site.



For example, consider the screenshot below. Within Search Console, Google only provides “examples” of pages that it is excluding from being indexed.

This is where server logs come into play. You can access your logs and parse them to uncover potential crawler traps. You can easily filter traffic by user agent string and look at the crawl frequency of those pages, then compare it to what you’re seeing in Search Console during the same time frame.
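To give a taste of what that filtering can look like outside a GUI, here's a minimal Python sketch that counts Googlebot requests per URL. It assumes the Combined Log Format (where the user agent is the last quoted field), and the file name access.log is illustrative:

```python
# Minimal sketch: count Googlebot hits per requested URL.
# Assumes Combined Log Format; "access.log" is an illustrative file name.
import re
from collections import Counter

hits = Counter()
with open("access.log") as f:
    for line in f:
        quoted = re.findall(r'"([^"]*)"', line)  # quoted fields in the entry
        # Request line is the first quoted field, user agent the last.
        if len(quoted) >= 3 and "Googlebot" in quoted[-1]:
            parts = quoted[0].split(" ")          # e.g. 'GET /page HTTP/1.1'
            if len(parts) == 3:
                hits[parts[1]] += 1               # tally by request URI

for uri, count in hits.most_common(10):
    print(count, uri)
```

Keep in mind the user agent string can be spoofed; verifying genuine Googlebot traffic requires a reverse DNS lookup.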

Server logs will allow you to make better-informed decisions when it comes to optimizing the crawlability of client sites.

Additional Use Cases for Server Logs

Much has already been written about why you would use server logs as part of your SEO strategy.



But here are a few additional use cases.



Troubleshoot 4XXs & 5XXs

Audit 1:1 redirects

Identify parameter URLs

Identify crawler traps

Site speed improvements

Identify malicious bots

In this guide I want to share my process for accessing and cleaning your server log data, along with possible next steps to take with it.

Intro to KNIME Analytics

KNIME is a free, open-source data analytics, reporting, and integration platform. There are a few reasons why you would want to use KNIME.

First, the biggest benefit: it’s free to use and has an active community of developers. It’s also great for working with larger datasets. Excel supports up to approximately 1 million rows and 16,384 columns of data; KNIME’s capabilities far exceed those limitations.

KNIME’s ability to handle large datasets is limited only by your hard drive space. The other huge benefit is that once you build a workflow, that’s it. You might need to tweak a few things, but you can easily save and reuse workflows. Over time this will build more efficiency into your process.

For our use case, it will work great for preparing server log data that can later be visualized. KNIME excels at letting users visually create data workflows without code. The core version already includes hundreds of modules for data integration, transformation, mining, and more.

Getting Started with KNIME

The first step before accessing server logs is to download and install the software. You can download KNIME Analytics here.

I highly recommend reviewing a few tutorials on YouTube and investing a bit of time understanding how the platform works. Udemy has a great KNIME Analytics Bootcamp course that will get you up to speed quickly.

Nodes in KNIME

KNIME Analytics is driven by nodes that connect to build a workflow. Below is an example of a simple workflow.



Each node in KNIME serves a dedicated task, from simple ones like filtering columns to complex ones like training a decision tree. After you create a node, you need to configure it before it can execute its task.

All nodes are contained within the node repository and can be dragged and dropped into the workflow.

Below is an example of the Node Repository:



To begin building a workflow you will need to drag a node from the repository and place it within the workflow editor. Once it’s in the workflow editor you will need to configure each node so it will run properly.



You can also get more information about how each individual node functions by clicking Configure and then the ? icon in the bottom-right corner of the dialog.



You will more than likely run into some roadblocks while learning how to set up workflows. And that’s OK! Just remember: whenever you get stuck, you can open a node’s dialog for more information about that specific node.

Below is an example overview of how to configure a workflow using the Excel Reader and Column Filter nodes.

Additional KNIME Analytics Resources

If you want to learn more about KNIME, here are some good resources to get you started.

Gain Access to Your Server Logs

The first step to analyzing server logs is to get the data off the server. Before continuing, make sure you have access to your server. If you can’t get access, just ask your IT team for your log files.

Two Methods for Accessing Server Logs

Method 1: Through the control panel of your hosting provider

Some hosting platforms have a built-in file manager. Look for something that says: “file management”, “files”, “file manager”.

Once you find the file management folder look for a file that says “logs” or “access logs”.

Method 2: FTP access

You will need an FTP client. I prefer FileZilla; it’s free and easy to set up.

You will also need an FTP address, login, and password to access the server via FTP. These can be found in the admin panel of your hosting provider.

Open the FTP client, set up a new connection to the server, and log in with your credentials. After you’ve entered the server’s file directory, look for your access logs.
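If you'd rather script the download, here's a minimal sketch using Python's standard ftplib; the host, credentials, and the /logs path are all placeholders for your own values:

```python
# Minimal sketch: download access logs over FTP.
# Host, credentials, and the "/logs" path are placeholders.
from ftplib import FTP

with FTP("ftp.example.com") as ftp:
    ftp.login(user="your-user", passwd="your-password")
    ftp.cwd("/logs")                          # log directory varies by host
    for name in ftp.nlst():                   # list files in the directory
        if "access" in name:                  # grab only the access logs
            with open(name, "wb") as local:
                ftp.retrbinary("RETR " + name, local.write)
```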

Download Your Server Logs

Once you’ve located your server logs, it’s time to download them from your server. For the following steps to work, you must store your log files in a central folder.

As you can see below I’ve created a dedicated folder called Access Logs.



This will keep your workflow organized and allow the loop in the next section to run.

Below are the nodes you will need to construct the loop that will read in all your server log files.

List Files: This node creates a list with the locations and URLs of the files contained in one or more given folders.

Table Row to Variable Loop Start: This node loops over each row of the input table. Here, that’s the list of all zip files (server logs) within the folder you set up previously.

File Reader: This node reads data from a file or URL location. In this case, it will read data from the server logs.

Loop End: This node marks the end of a loop. It ends the workflow and collects the data from the server log loop.

Setting up the workflow

Below is what the completed workflow should look like:



List Files

To configure this node, just enter the location of the folder that contains your access logs in the “Location” section.



Then connect the Table Row To Variable Loop Start node. Next, you’ll want to connect and configure the File Reader node.

The File Reader node can be a bit tricky to set up, but it’s probably one of the most important nodes you will need to connect.

To configure this node, click Browse at the top and navigate to the folder where your server logs are saved. Then be sure to set the column delimiter to a space. Next, click the Flow Variables tab at the top. Once there, you should see an icon with a dropdown that says DataURL.

Click into this drop down and select URL.



This will begin the first step in parsing the log files and pull in the data from the previous nodes.

If all goes well then you should see a preview that sort of resembles the one below:

Finally, you will need to prompt KNIME to close the loop. Select Loop End from the node repository and connect it to the File Reader node.

There’s no need to configure this node; just hit Execute and the workflow will read all the server log files in your folder, split them on the space delimiter, and close the loop.

Now it’s time to recap!

You have just created a KNIME workflow that reads log files from a folder, splits them on spaces, and consolidates them into a single table.

You can also save this workflow and use it in the future for any server-log-related data prep.

Parsing the RowID, Request Type, & Request URI

The next step is to clean up the request type, protocol, and request URI. To do this you’ll need the Column Filter and Cell Splitter nodes. You’ll want to filter for the column with the “GET” request; this column also contains your request URI and the protocol.



In my case, it happens to be Col5, but your mileage may vary. Also, make sure to keep the RowID so you can join the data later.

“A data join is when two data sets are combined in a side-by-side manner. To do this, one column in each data set must be the same.” – The Data School



Now connect the Cell Splitter node; the delimiter you’ll want to enter is a space (hit the spacebar). From there you shouldn’t have to tweak any other settings. If all goes according to plan, you should see this as the output:



You will see the RowID, the request type (GET), the request URI (the page that was requested), and finally the HTTP protocol.
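Continuing the earlier pandas sketch, the same split looks like this; the column index 5 mirrors Col5 above and may differ in your data:

```python
# Split the quoted request field into its three parts.
# Column index 5 mirrors Col5 above; adjust for your own logs.
req = logs[5].str.split(" ", n=2, expand=True)
req.columns = ["Request Type", "Request URI", "Protocol"]
```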



Your final workflow should resemble something similar to what’s below:



Parsing the Time Stamp

Next, you need to filter out the other columns to clean up the date and time stamp. The objective is to break the date and time stamp into their own respective columns.



Connect the Column Filter node and remove everything except the RowID and the column with the date.



Next, use the Cell Splitter By Position node. This node splits the content of a selected column into several separate new columns. Below you can see how I chose to configure it: on column Col3, I want to split at the 1st and 12th positions.



If everything goes right your output will look like this:



As you can see, you aren’t quite done. You’ll still need to remove the first “:” from the time column. Connect another Cell Splitter By Position node and select the time column to split. This will give us the desired output.
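In plain Python terms, those positional splits amount to string slicing. A CLF timestamp field reads like "[10/Oct/2000:13:55:36", so, continuing the sketch (with column index 3 standing in for Col3):

```python
# Positional split of the timestamp field, e.g. "[10/Oct/2000:13:55:36".
raw = logs[3]             # Col3 equivalent; adjust the index to your data
date = raw.str[1:12]      # "10/Oct/2000" (skips the leading "[")
time = raw.str[13:]       # "13:55:36"    (skips the leading ":")
```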



Now, to keep everything organized, connect the Column Filter node to remove unneeded columns.



Setting up Joins

Our next step is to join the Request URI and Date + Time branches of the workflow to each other.



To do the join, select the Joiner node. This node joins two tables in a database-like way. You’ll notice there are two inputs, so you’ll need to connect the outputs of the two previous branches to the Joiner node.



Set the Join mode to Inner Join and for the Joining Columns select Row ID. This will join each table by their unique Row ID so the data is properly aligned.



If everything goes according to plan, the two tables should now be combined on the Row ID field.
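For reference, the pandas analogue of this inner join uses the shared row index, which plays the role of KNIME's Row ID in the sketch:

```python
# Inner join of the request columns and the date/time columns on the index.
import pandas as pd

parsed = pd.concat([req, date.rename("Date"), time.rename("Time")],
                   axis=1, join="inner")
```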



Bringing all the Data Together

Now all that’s left is to join this new table with the data contained in our Loop End node.



To do this, drag another Joiner node into the workflow. Then connect the Loop End node and the Joiner node together. Just like before, you will need to configure an inner join on the Row ID field.



If done correctly, the two tables should now be joined on the Row ID field. The last thing to do is filter and rename the columns.



To rename the columns, select the Column Rename node and configure it according to your data. Finally, you can add the Column Resorter node along with the Column Filter node to organize your final dataset however you prefer.
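In the pandas sketch, this final join, rename, and reorder might look like the following; the column indices and names are illustrative and should be adapted to your own data:

```python
# Join the parsed columns back onto the full log table, then rename/reorder.
final = logs.join(parsed, how="inner")
final = final.rename(columns={0: "IP", 6: "Status Code"})  # indices may vary
final = final[["IP", "Date", "Time", "Request Type",
               "Request URI", "Protocol", "Status Code"]]
```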



If everything went well, the final product should look similar to the one below:



Here’s an overview of my completed KNIME workflow:



Post Data Prep: Taking Server Log Analysis Further

Now that your data is cleaned and in a manageable format, you have options. You can store your logs in a database or get started visualizing your data.



Using a Database

By storing your server log data in a database, you can analyze trends, query across specific date ranges, and combine it with data from other sources.



KNIME Analytics has nodes that will let you “write” data into a database. It’s a fairly straightforward process once your data is clean.



If you’re interested, some open-source databases I recommend are PostgreSQL, MySQL, and SQLite.
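To give a sense of how simple the "write" step can be, here's a sketch that persists the cleaned table from the earlier pandas example to SQLite using Python's standard library; the database and table names are illustrative:

```python
# Minimal sketch: write the cleaned log table into a SQLite database.
import sqlite3

con = sqlite3.connect("server_logs.db")        # illustrative file name
final.to_sql("access_logs", con, if_exists="append", index=False)
con.close()
```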



Visualizing Data

Once you have your clean dataset, you can begin visualizing your data to look for any crawlability concerns. Some good visualization options are Google Data Studio, Power BI, and Tableau.



To give you a sense of what’s possible – below is a dashboard I created using Tableau:



You can view the live interactive dashboard here.



As you can see, KNIME Analytics can be an extremely helpful tool for accessing, cleaning, and analyzing your clients’ server logs.

What do you think?

Drop me a line!