Frequently Asked Questions - How can I prevent robots and spiders from accessing Web Crossing?

How can I prevent robots and spiders from accessing Web Crossing?

Created On: 10 May 1999 2:35 pm
Last Edited: 11 Apr 2011 10:33 am

Question

Robots, web crawlers, and spiders are various names for indexing software that may attempt to index an entire Web Crossing site for a search engine. These "bots" may use so much of the server's processing power that performance is adversely affected. They may also consume your Page Views, causing you to exceed the limits of your license certificate.

Answer

Using a "robots.txt" file is an example of proper web server management, regardless of the web server software you use.

If your site doesn't have a robots.txt file, you should create one. A good starting point for learning about robots and how to handle them is http://www.robotstxt.org/wc/exclusion.html.
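As a rough illustration of what a robots.txt file says and how a well-behaved robot reads it, the sketch below checks two sample rule sets with Python's standard urllib.robotparser module. The robot names ("BadBot", "GoodBot") and the example.com URL are made up for the example; they are not taken from Web Crossing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: ask every compliant robot to stay out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical rule set: exclude one robot by name and allow everyone else.
# An empty "Disallow:" line means nothing is disallowed.
BLOCK_ONE = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant robot named user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/webx"  # placeholder URL
    print("block-all, GoodBot:", is_allowed(BLOCK_ALL, "GoodBot", url))
    print("block-one, BadBot: ", is_allowed(BLOCK_ONE, "BadBot", url))
    print("block-one, GoodBot:", is_allowed(BLOCK_ONE, "GoodBot", url))
```

Keep in mind that this only models robots that choose to honor the file; it does nothing to stop a robot that ignores it.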

Turning off Guest access will block any non-registered user and effectively block all spiders and bots from walking your site. However, if your site relies on unregistered users browsing it, this is not an option and you should use a robots.txt file. NOTE: Spiders nearly always come back to revisit URLs they have previously loaded successfully, in order to keep their content current. So there may well be a period of time during which you continue to get hits (sometimes many!) from spiders on those URLs. However, they will be stopped at the login screen and will not be able to venture further into your site.

Practice proper administration of your site! Regularly examine your server access logs (usually called common.log in your root webx directory) with a log file analysis tool (many are available; a web search will turn them up) and look for unusual activity and other trends. Even as a hosted site, it is up to you to perform this task. Our Engineering group is responsible for installing and updating the server code and making sure your site is up and running; however, they are not the day-to-day admins of your site. You are.
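If you have no log analysis tool at hand, a few lines of script can give a first impression of who is hitting the site hardest. The sketch below counts requests per client host. It assumes that common.log follows the NCSA Common Log Format (host, ident, user, [date], "request", status, bytes); that is an assumption based on the file name, so check a few lines of your own log first. The function names are made up for this example.

```python
import re
from collections import Counter

# Assumed line layout (NCSA Common Log Format):
#   host ident authuser [date] "request" status bytes
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+')


def hits_per_host(lines):
    """Count requests per client host, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_hosts(lines, n=10):
    """Return the n busiest hosts as (host, hits) pairs, busiest first."""
    return hits_per_host(lines).most_common(n)
```

To try it, open the log and pass the file object in, for example `with open("common.log") as f: print(top_hosts(f))`. A single host with far more requests than any other is a candidate for a closer look.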

Ill-mannered robots may ignore the robots.txt file and attempt to index the site anyway. In this case, it will be necessary to turn off guest access in Web Crossing, or have the system administrator configure the server to disallow access from the problem IP address in the Web Services control panel. The operator of the site with the ill-mannered robot should be contacted as well and asked to correct the problem. If you are skilled with WCTL, you could also write an authenticateFilter to block specific user_agents from accessing your site (pre-5.0).

Alternatively, with version 5.0+ you may use our plugin to control robots, user agents, flooding, etc. You can find this at the plugin server. The "Flood, Robot, and userAgent Control" plugin is a suite of controls for managing who visits your site and how. It controls flooding, prevents bandwidth theft (leeching), manages URL requests and user agents, provides automatic static certificates for spiders and robots, and makes robots.txt easy to edit.

In Web Crossing, the robots.txt file needs to go into your root HTML directory. By default this is "/html" inside the "/webx" directory, unless you have changed it. If you update robots.txt, be sure to refresh your cache so the changes are reflected!

See also Exceeding License Limits