The link graph shows how crawlers move from one link to another.
As a bot crawls the web, a site's robots.txt file can block it from crawling certain content. We can still see the links pointing to that site, but we are blind to the content of the site itself, and we cannot see its outbound links. This immediately degrades the link graph, at least in terms of how closely it resembles Google's (assuming Googlebot is not blocked in the same way).
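To make the mechanics concrete, here is a minimal sketch of how a crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser. The URL and user-agent strings are illustrative examples, not taken from the study.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_crawl(page_url: str, user_agent: str) -> bool:
    """Return True if the site's robots.txt allows user_agent to fetch page_url."""
    parts = urlparse(page_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = RobotFileParser(robots_url)
    parser.read()  # downloads and parses the site's robots.txt
    return parser.can_fetch(user_agent, page_url)

# The same URL can be open to one crawler and closed to another.
print(can_crawl("https://example.com/some-page.html", "Googlebot"))
print(can_crawl("https://example.com/some-page.html", "dotbot"))
```

A bot that gets False here never sees the page, and therefore never sees any of the links on it.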
But that’s not the only problem. Blocking some bots with robots.txt also creates a clash of crawling priorities. As a bot crawls the web, it discovers links and has to prioritize which of them to crawl next. Say Google finds 100 links and prioritizes the top 50 to crawl. A different bot finds the same 100 links, but robots.txt blocks it from 10 of those 50 pages. It is forced to crawl around them, choosing a different set of 50 pages, and that different set of crawled pages returns, of course, a different set of links. On the next round of crawling, the blocked bot not only has a different set of pages it is allowed to crawl; the set of pages it has discovered is also different, because it crawled different pages the round before. A toy simulation of this drift follows below.
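The sketch below is not the article's model; it simply replays the 100-link example above, with one bot blocked from 10 of the top 50 pages, to show how little of the crawl the two bots end up sharing after a single round.

```python
import random

def crawl_round(discovered_links, blocked, budget=50):
    """Crawl the top `budget` links we are allowed to fetch, skipping blocked ones."""
    return [link for link in discovered_links if link not in blocked][:budget]

# Hypothetical numbers matching the example: 100 discovered links,
# and the second bot is blocked on 10 of the top 50.
random.seed(0)
links = [f"page-{i}" for i in range(100)]
blocked_for_other_bot = set(random.sample(links[:50], 10))

google_pages = crawl_round(links, blocked=set())
other_bot_pages = crawl_round(links, blocked=blocked_for_other_bot)

shared = len(set(google_pages) & set(other_bot_pages))
print(f"Pages both bots crawl this round: {shared} of 50")
# The 10 substituted pages expose different outbound links, so the two
# crawl frontiers drift further apart on every subsequent round.
```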
Long story short, just like the proverbial butterfly that flaps its wings and eventually causes a hurricane, small changes to robots.txt that block some bots and allow others ultimately produce link graphs that look very different from the one Google actually sees.
So, how are we doing?
You know I won't leave you hanging. Let's do some research. Let's analyze the top 1,000,000 websites on the Internet according to Quantcast and determine which bots are blocked, how often, and what impact it might have.
Procedure
The procedure is quite straightforward.
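In the spirit of that procedure, here is a hedged sketch of what such a survey could look like: download each site's robots.txt and test a handful of well-known user agents against it. The file name, the bot list, and the error handling are assumptions for illustration, not the article's actual code.

```python
import csv
from collections import Counter
from urllib.robotparser import RobotFileParser

# An illustrative list of well-known crawlers; the study's actual set may differ.
BOTS = ["Googlebot", "bingbot", "rogerbot", "AhrefsBot", "MJ12bot", "dotbot"]

def blocked_bots(domain):
    """Return which bots this domain's robots.txt blocks from the homepage."""
    parser = RobotFileParser(f"http://{domain}/robots.txt")
    try:
        parser.read()
    except OSError:
        return []  # unreachable robots.txt: treat as no restrictions
    return [bot for bot in BOTS if not parser.can_fetch(bot, f"http://{domain}/")]

block_counts = Counter()
# "quantcast-top-million.csv" stands in for a local copy of the Quantcast list.
with open("quantcast-top-million.csv") as f:
    for row in csv.reader(f):
        for bot in blocked_bots(row[0]):
            block_counts[bot] += 1

for bot, count in block_counts.most_common():
    print(f"{bot}: blocked on {count:,} sites")
```

Tallying the results this way answers the three questions above: which bots are blocked, and how often, with the link-graph impact following from the divergence described earlier.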