Robots.txt and 500 Server Errors: Toxic Combination
About a month ago, we got a call from a partner agency, worried that the site for a pretty recognizable brand had somehow run afoul of Google’s guidelines.
At first glance, it appeared to be exactly that. A search for the brand name turned up only a single deep page, on page 3 of the SERP. A site: query showed some deep pages indexed, but not the core (home and product) URLs.
The usual suspects in a case like this were accidental crawler exclusion and a penalty, but we also asked for access to Google Webmaster Tools. While waiting for GWT access, we ran a full crawl and asked the client for information about anything that had happened on (or to) the site over the past couple of months.
The crawl didn’t turn up anything odd. As for the site, there had been a recent push to drive some affiliate traffic, but nothing that set off any big alarms. Still, it was a lingering concern, given the sheer number of sites that had been receiving Google’s warnings about unnatural links.
There was no accidental exclusion. The robots.txt URL displayed a 404 error message, and there were no on-page meta robots directives. Surfing as Googlebot (with a user-agent spoofer, not through GWT) returned identical results, so there was no inadvertent cloaking going on, either.
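If you want to run the same cloaking check yourself, here’s a minimal sketch of that comparison in Python. The URL is a placeholder and the requests library is my assumption; any HTTP client would do. Keep in mind that dynamic pages can differ between fetches for reasons that have nothing to do with cloaking.

```python
import requests

URL = "https://www.example.com/"  # placeholder; use the page you want to test

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def fetch(url, user_agent=None):
    """Fetch a URL, optionally spoofing the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return requests.get(url, headers=headers, timeout=10)

normal = fetch(URL)
spoofed = fetch(URL, GOOGLEBOT_UA)

# Matching status codes and bodies suggest no user-agent-based cloaking.
print(normal.status_code, spoofed.status_code)
print("Bodies match:", normal.text == spoofed.text)
```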
We were leaning toward the affiliate linking as the cause and were preparing a full backlink analysis, but then we got GWT access, and that changed everything.
The robots.txt page was not returning a 404 status, as its error messaging suggested. Instead, it was returning a 500 server error. In GWT’s “Robots.txt Fetch” report, we learned that this had been the case since about February 17th.
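This is the part that’s easy to reproduce at home: fetch robots.txt yourself and look at the status code the server actually returns, not the text rendered on the error page. A quick sketch, with a placeholder domain and the same assumed requests library as above:

```python
import requests

# Placeholder domain. The point is to trust the HTTP status code,
# not the human-readable error page, which in our case *looked* like a 404.
resp = requests.get("https://www.example.com/robots.txt", timeout=10)
print(resp.status_code)  # on the affected site, this would have shown 500
```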
We quickly wrote up a robots.txt file with no exclusions (shown below) and asked the client to upload it immediately. As Google had attempted the fetch just two hours earlier (and it seems to document an attempt about once per day), we had a long wait ahead. At the next documented fetch, the following day, Google downloaded the new robots.txt file without any problems. More important, the Crawler Access report showed that the new file was valid.
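The exact file isn’t reproduced here, but a no-exclusions robots.txt is conventionally just this; an empty Disallow line permits crawling of everything:

```
User-agent: *
Disallow:
```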
Just in case you’re interested, the accompanying diagram shows some key points in the event:
A. This is the roughly 10-day gap between Google first getting a 500 error when fetching the robots.txt file and organic traffic to the home page crashing pretty hard.
B. This is a date when, inexplicably, the 500 errors subsided briefly. Notice the subsequent growth, then decline, of organic traffic.
C. This is the date when the new robots.txt file was uploaded. The errors drop, and traffic slowly begins to return to its normal state.
The takeaway here is pretty obvious: a 404 error and a 500 error could not be more different, especially when the page we’re talking about is the robots.txt file. A 404 tells Google the file simply doesn’t exist, so there are no crawling restrictions; a 500 says the file might exist but can’t be read, so Google plays it safe and stops crawling. One says, “Go ahead and crawl me,” while the other holds up a shotgun and says, “Get offa my porch.”
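To make the contrast concrete, here’s an illustrative sketch of how a crawler maps robots.txt fetch results to behavior, loosely based on how Google has documented its handling. This is my own illustration, not Googlebot’s actual logic.

```python
def crawl_policy(status_code: int) -> str:
    """Illustrative mapping of a robots.txt fetch result to crawl behavior.

    Loosely based on Google's documented handling; not Googlebot's real code.
    """
    if 200 <= status_code < 300:
        return "parse the file and obey its rules"
    if status_code in (404, 410):
        return "no robots.txt exists: assume everything is crawlable"
    if 500 <= status_code < 600:
        return "server error: assume the whole site is disallowed for now"
    return "anything else: handle case by case"

for code in (200, 404, 500):
    print(code, "->", crawl_policy(code))
```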
As seen here, the Crawl Stats charts more or less echo what we’ve already seen, but if you can explain how we have KB/day and “time spent” values greater than zero for the “non-crawling” days, I’m all ears. It may boil down to multiple crawling sessions that begin and end independently of one another, some of which may continue based on an older fetch of the robots.txt file.
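One plausible mechanism, sketched below: if each crawl session caches the robots.txt rules it fetched at startup (Google has said it generally caches robots.txt for up to a day), a session that began before the errors started could keep crawling against stale rules. The class and numbers here are hypothetical.

```python
import time

CACHE_TTL = 24 * 60 * 60  # hypothetical: cache robots.txt rules for ~a day

class RobotsCache:
    """Hypothetical per-session robots.txt cache, not Googlebot's real design."""

    def __init__(self):
        self.rules = None
        self.fetched_at = 0.0

    def is_fresh(self):
        return self.rules is not None and time.time() - self.fetched_at < CACHE_TTL

    def update(self, rules):
        self.rules = rules
        self.fetched_at = time.time()

# A session that cached valid rules before the 500s began could keep
# crawling until its cache expired, which would explain nonzero KB/day
# and "time spent" values on days when fresh fetches were failing.
```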