Ibou.io operates a crawler service named IbouBot, which builds and updates our graph representation of the World Wide Web. This database and the metrics derived from it power our search engine. We do not train AI models with the data.
IbouBot crawls URLs found on public pages, so it may visit any page that has been publicly linked somewhere.
Yes, we keep retrying those pages to make sure a missing page doesn't just reflect a temporary state or a faulty web server.
Google introduced nofollow links (rel="nofollow") to let a site indicate that certain links must not be taken into account when computing web metrics. But this doesn't prevent a bot from crawling the linked pages.
Yes. We respect the robots.txt file and its Disallow directives. If you believe we are not respecting your directives, please contact us.
We have a politeness policy of X seconds between two requests to the same host, and Y seconds between two requests to the same IP address, even when they target different domains. You can extend the crawl delay using the robots.txt file:
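    # The 30-second value below is only an example; choose a delay that suits your server.
    User-agent: IbouBot
    Crawl-delay: 30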
Note that the Crawl-delay directive only applies to a given host. If the same web server hosts websites under different domains, the per-IP rule above still applies.
The robots.txt file allows you to disallow IbouBot from crawling part or all of your website using the Disallow directive. For example, to prevent IbouBot from accessing the WordPress admin section:
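    User-agent: IbouBot
    Disallow: /wp-admin/

To block IbouBot from your whole site instead, use "Disallow: /".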
Even though IbouBot crawls web pages with a reasonable delay (X or Y seconds between requests), it is sometimes mistaken for a DDoS or brute-force attack. If we crawl a URL containing session parameters, it may also be interpreted as a login attempt. For these reasons, IbouBot can end up temporarily blacklisted. In that case, you can try whitelisting IbouBot directly in your security plugin, or contact us if you can't.