Magento robots.txt File

Magento is a very popular e-commerce platform that a growing number of online retailers use to build their stores. It has some useful SEO features built in, such as a sitemap.xml generator and canonical URL meta tag generation.

These features help, but telling search engine robots where your sitemap lives is something you have to do manually by creating a robots.txt file. You may also not want particular URLs to be indexed, so disallowing them in your robots.txt file can also support your SEO strategy.

Disallow URLs in Magento

The first declaration in the robots.txt file is the following line, which names the user agent that the rules apply to.

User-agent: *

This statement says that all user agents must follow the rules that come after it. If you want certain user agents to be allowed to index particular pages, create a separate User-agent declaration, with its own rules, for each of them.
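To see how separate declarations per user agent behave, here is a minimal sketch using Python's standard urllib.robotparser module. The two rule groups and the bot name "SomeOtherBot" are made up for illustration and are not part of the Magento file built in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for every crawler, one only for Googlebot.
RULES = """\
User-agent: *
Disallow: /checkout/

User-agent: Googlebot
Disallow: /customer/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# A crawler without its own group falls back to the "*" group.
print(parser.can_fetch("SomeOtherBot", "/checkout/cart/"))   # False
# Googlebot has its own group, so only that group's rules apply to it.
print(parser.can_fetch("Googlebot", "/checkout/cart/"))      # True
print(parser.can_fetch("Googlebot", "/customer/account/"))   # False
```

Note that a crawler with its own group ignores the "*" group entirely, so any general rules you still want it to follow need repeating inside its group.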

The next stage is to list the Magento store directories that we do not want to be indexed, starting each statement with the ‘Disallow’ declaration.

The first block of declarations stops Magento-specific directories from being indexed, along with the files and subdirectories contained in them.

# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
# Disallow: /js/
Disallow: /lib/
Disallow: /magento/
# Disallow: /media/ # *Remove or comment this directory if you require Google Merchant Centre Feeds to access product images
Disallow: /pkginfo/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
# Disallow: /skin/
Disallow: /stats/
Disallow: /var/

*Please note that if you are generating and submitting a product feed to Google Merchant Centre, you should leave the /media/ directory out of this file (or keep its line commented out) so that product images can be accessed and used for this purpose.

The next set of declarations in the Magento robots.txt file disallows specific clean URLs for pages that you do not want to be indexed. Many of these prevent issues with duplicate content, and others disallow the checkout and account-related URLs. If you have any specific page URLs that you do not want search engines to index, add them here as well.
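If you maintain this list for several stores, you could generate the directory block rather than edit it by hand. The following is only a sketch: the function name directory_rules and its keep_open parameter are my own invention, and the directory list is shortened to a few of the entries above.

```python
# A few directory names from the list above; "media" is the one to leave
# open when Google Merchant Centre needs to fetch product images.
MAGENTO_DIRS = ["app", "downloader", "lib", "media", "var"]

def directory_rules(dirs, keep_open=()):
    """Return robots.txt lines; names in keep_open are emitted commented out."""
    lines = ["# Directories"]
    for name in dirs:
        rule = f"Disallow: /{name}/"
        # A leading "#" turns the rule into a comment, so crawlers ignore it.
        lines.append(f"# {rule}" if name in keep_open else rule)
    return lines

print("\n".join(directory_rules(MAGENTO_DIRS, keep_open={"media"})))
```

Calling it with an empty keep_open would block /media/ as well, which is the setting to avoid when a product feed relies on those images.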

# Paths (clean URLs)
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /checkout/onepage/
Disallow: /checkout/onepage/billing/
Disallow: /checkout/onepage/shipping/
Disallow: /checkout/onepage/shipping_method/
Disallow: /checkout/onepage/payment/
Disallow: /checkout/onepage/review/
Disallow: /checkout/onepage/success/
Disallow: /onestepcheckout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /example-page.html
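Because a Disallow path matches every URL that starts with it, some of the lines in this block are already covered by a shorter one (for example, /checkout/ covers all of the /checkout/onepage/ entries). Listing them separately does no harm, but if you want to spot the overlaps, a rough sketch like this can help. It assumes plain prefix rules with no wildcards, and the helper name redundant_rules is hypothetical.

```python
def redundant_rules(paths):
    """Return the paths that are already covered by another, shorter prefix."""
    return [
        path
        for path in paths
        if any(other != path and path.startswith(other) for other in paths)
    ]

# A shortened sample of the clean-URL paths listed above.
CLEAN_URL_PATHS = [
    "/catalogsearch/",
    "/checkout/",
    "/checkout/onepage/",
    "/checkout/onepage/billing/",
    "/customer/",
]

print(redundant_rules(CLEAN_URL_PATHS))
# ['/checkout/onepage/', '/checkout/onepage/billing/']
```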

The next batch of disallow statements in our Magento robots.txt file excludes specific Magento files that are in the root directory. Please note that the licence files for Magento should not really be present, but many Magento developers (including myself) forget to remove them when moving from development to live environments.

# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
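Since the licence files are easy to forget, a quick check before going live can flag any that are still sitting in the store root. This is just a sketch: the function name leftover_files is my own, and the temporary directory below stands in for your real Magento root, which you would pass in instead.

```python
import tempfile
from pathlib import Path

# Licence files named in the block above.
LICENCE_FILES = ["LICENSE.html", "LICENSE.txt", "LICENSE_AFL.txt"]

def leftover_files(root, names=LICENCE_FILES):
    """Return the names from `names` that exist as files directly under root."""
    root = Path(root)
    return [name for name in names if (root / name).is_file()]

# Demo against a throwaway directory standing in for a Magento root.
with tempfile.TemporaryDirectory() as fake_root:
    (Path(fake_root) / "index.php").write_text("<?php // placeholder")
    (Path(fake_root) / "LICENSE.txt").write_text("placeholder")
    found = leftover_files(fake_root)

print(found)  # ['LICENSE.txt']
```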

The final stage of our Magento robots.txt file is to add a few statements that first disallow our included and structural files by type, such as .js, .css and .php files. The second part of these disallow statements stops the paged URLs, search result URLs and pager limit URLs that Magento generates dynamically when refining results from being indexed. In these patterns, * matches any sequence of characters and $ marks the end of the URL.

# Paths (no clean URLs)
Disallow: /*.js$
Disallow: /*.css$
Disallow: /*.php$
Disallow: /*?p=*&
Disallow: /*?SID=
Disallow: /*?limit=all
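Wildcard rules are easy to get subtly wrong, so it is worth testing them against a few real URLs. Python's built-in urllib.robotparser only does plain prefix matching, so the sketch below translates a rule into a regular expression instead. It assumes the common interpretation where * matches any run of characters and a trailing $ anchors the end of the URL; the function names and the sample URLs are made up for illustration.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule into a compiled regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_blocked(rule, url):
    """True if the rule matches the URL from its first character."""
    return rule_to_regex(rule).match(url) is not None

print(is_blocked("/*.js$", "/js/prototype/prototype.js"))     # True
print(is_blocked("/*.js$", "/js/app.js?v=2"))                 # False
print(is_blocked("/*?limit=all", "/shoes.html?limit=all"))    # True
print(is_blocked("/*?p=*&", "/shoes.html?p=2&limit=30"))      # True
print(is_blocked("/*?p=*&", "/shoes.html?p=2"))               # False
```

The last two lines show that, under this interpretation, the /*?p=*& rule only catches paged URLs that carry a further parameter after p, so a plain ?p=2 URL is not blocked by it.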

Magento Sitemap Reference

The final stage is to reference your sitemap.xml (or .gz) so that a visiting bot can detect your file's location. Simply add the following line at the end of your robots.txt file, changing the URL to your own.

Sitemap: http://www.yourdomain.co.uk/sitemap.xml
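To confirm that the sitemap line is picked up, you can parse the file and read the reference back. This small sketch uses urllib.robotparser from the Python standard library (the site_maps() method needs Python 3.8 or later); the rules string is a cut-down stand-in for the full file, using the placeholder domain from above.

```python
from urllib.robotparser import RobotFileParser

# Cut-down stand-in for the finished file, with the placeholder domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/

Sitemap: http://www.yourdomain.co.uk/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
print(sitemap_parser.site_maps())
# ['http://www.yourdomain.co.uk/sitemap.xml']
```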

Complete Magento robots.txt File

Your robots.txt file should now be complete, with exclusions for directories, paths with clean URLs, page URLs, files and paths without clean URLs, plus a reference to your sitemap.xml (remember to add your own domain to the sitemap reference). You can download a complete demo version of the robots.txt file here: Full Magento robots.txt file download.

Always be aware that this type of file instructs Googlebot and other search engine robots to exclude the URLs specified in it, so it should be set up by an SEO professional with previous knowledge.