About a month ago, we got a call from a partner agency, worried that the site for a pretty recognizable brand had somehow run afoul of Google’s guidelines.
At first glance, it appeared to be exactly that. A search for the brand name turned up only a single deep page, on page 3 of the SERP. A site: query showed some deep pages indexed, but not the core (home and product) URLs.
The usual suspects in a case like this are accidental crawl exclusion and a penalty, but we also asked for access to Google Webmaster Tools. While waiting for GWT access, we ran a full crawl and asked the client for information about anything that had happened on (or to) the site over the past couple of months.
The crawl didn’t turn up anything odd. As for the site, there had been a push to drive some affiliate traffic recently, but nothing that set off any big alarms. Still, this was a lingering concern, due to the sheer number of sites that had been receiving Google’s warnings of unnatural links.
There was no accidental exclusion. The robots.txt URL displayed 404 messaging, and there were no on-page meta robots directives. Surfing as Googlebot (with a user-agent spoofer, not through GWT) showed identical results, so there was no inadvertent cloaking going on, either.
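If you want to script that kind of cloaking spot-check, here is a minimal sketch in Python. It assumes the requests library is installed; the URL and user-agent strings are placeholders to swap for your own. Meaningfully different responses for the two user agents would suggest the site is serving Googlebot something different.

```python
# Sketch of a user-agent comparison for spotting inadvertent cloaking.
# Assumes the `requests` library; the URL below is a placeholder.
import requests

URL = "http://www.example.com/"  # page to test

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 "
              "(KHTML, like Gecko) Chrome/16.0.912.75 Safari/535.7")

as_googlebot = requests.get(URL, headers={"User-Agent": GOOGLEBOT_UA})
as_browser = requests.get(URL, headers={"User-Agent": BROWSER_UA})

# If status codes or page sizes differ significantly, dig deeper --
# the site may be serving different content to Googlebot.
print("Googlebot:", as_googlebot.status_code, len(as_googlebot.text))
print("Browser:  ", as_browser.status_code, len(as_browser.text))
```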
We were leaning toward the affiliate linking and were preparing a full backlink analysis, but then we got GWT access, and that changed everything.
The robots.txt page was not actually returning a 404, as the on-page error messaging had suggested. Instead, it was returning a 500 error. In GWT’s “Robots.txt Fetch” report, we learned that this had been the case since about February 17th.
We quickly wrote up a robots.txt file with no exclusions and asked the client to upload it immediately. Because Google had just attempted a fetch two hours earlier (and it seems to document an attempt about once per day), we had a long wait ahead. At the next documented fetch, the following day, Google downloaded the new robots.txt file without any problems. More important, the Crawler Access report showed that the new file was valid.
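For reference, a “no exclusions” robots.txt doesn’t need much. A minimal, allow-everything file looks like this (the empty Disallow line means nothing is off-limits):

```
User-agent: *
Disallow:
```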
Just in case you’re interested, the preceding diagram shows some key points in the event:
A. This is a 10-day period between Google getting a 500 error when fetching the robots.txt file, and organic traffic to the home page crashing pretty hard.
B. This is a date when, inexplicably, the 500 errors subsided briefly. Notice the subsequent growth, then decline, of organic traffic.
C. This is the date when the new robots.txt file was uploaded. The errors drop, and traffic slowly begins to return to its normal state.
The takeaway here is pretty obvious: A 404 error and a 500 error could not be more different, especially when the page we’re talking about is the robots.txt file. One says to Google, “Go ahead and crawl me,” while the other holds up a shotgun and says “Get offa my porch.”
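If you’d rather catch this kind of failure before Google does, a trivial status check is enough. Here’s a rough sketch in Python; the domain is a placeholder, and how you schedule it or alert on it is up to you.

```python
# Minimal robots.txt status check. A 200 means crawlers can read the file;
# a 404 is generally treated as "no restrictions"; a 5xx tells Google to
# back off crawling entirely -- the failure described in this post.
import requests

ROBOTS_URL = "http://www.example.com/robots.txt"  # placeholder domain

response = requests.get(ROBOTS_URL)
if response.status_code >= 500:
    print("WARNING: robots.txt returned %d -- crawling may stop" % response.status_code)
elif response.status_code == 404:
    print("robots.txt is missing (404); crawlers will assume no restrictions")
else:
    print("robots.txt returned %d" % response.status_code)
```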
As seen here, the Crawl Stats charts more or less echo what we’ve already seen, but if you can explain to me how we have KB/day and “time spent” values greater than zero for the “non-crawling” days, I’m all ears. It may boil down to multiple crawling sessions that are begun and ended independently of one another, and which may continue based on an older fetch of the robots.txt file.
Pay attention. The 600 series had rubber skin. We spotted them easy. But these are new. They look human. Sweat, bad breath, everything. Very hard to spot. I had to wait ’til he moved on you before I could zero him.
— Kyle Reese, in “The Terminator”
Suppose you’re a company that has purchased a link-building campaign. And part of the deal was that your vendor promises you a specific number of links per month. 50. 75. 150. 200. Whatever. And that was part of what made the offer attractive, I’m sure: the promise of a guaranteed quantity of links in a given time period.
We’re going to be following that train of thought in the next few months, but for now, I want to focus on one specific aspect of a quantity-based link-building program: Links from auto-generated content.
It’s certainly not happening to you, because unlike everyone else, you get your link report each month, click each one, make all sorts of checks to ensure that it’s really a link, read the entire article in which it sits, and so on. But believe it or not, some clients just look at the Excel sheet, see 200 rows filled, and check it off their mental list. Let’s talk about what those people are missing.
We get offers all the time from vendors who want us to outsource our link-development programs to them. One recent offer came from a fellow who owns 18,000 domains and is offering us a tidy link package. I checked out some of the sample domains he listed in his message to see what style of link development he’s selling.
I’ll show my age here a bit: the interesting thing about the posts on his sample domains is not that they’re poorly written, but that they’re not “written” at all. Instead, they’re built, assembled, compiled, (use any verb you want, except for “written”) by a content generation program. The early generations of those programs produced some real garbage, but with these, it’s harder to detect. They look very much like they are simply written by someone who’s not a great writer, or perhaps someone who knows English as a second language, but there are other signs that they’re “generated.” Take a look at the following passage from one of his sample blog posts — the type of posts he is offering to use to link back to our clients:
Now, much more about this excellent product! This skin firming product is put on thoroughly clean pores and skin and during first minutes it dries and tightens the skin to some smooth delicate, satin -like finish. I totally adore the feeling- it really businesses the skin, it is quite amazing actually! Once dried, you merely apply make-up ( I propose mineral make-up) and you’re simply all set! Your skin can look organization, pores are decreased, as well as your skin’s appearance will you should be SMOOTH. I completely adore adore really like this stuff!
In the second sentence, the space between “satin” and “-like” is one clue that terms are changed and replaced regularly, programmatically, as variables. The same goes for inappropriate spaces before and after certain parentheses. To make the text seem more random, content generators use a thesaurus database. I’ve underlined some words that are pulled from such a database, and it’s clear that in these circumstances, the matching was off. Take the use of “organization,” which, on this thesaurus page, can be matched with “make-up.”
Then there’s the line “… it really businesses the skin.” It took me a while, but I think I’ve figured that out. Down in the list of synonyms for “business,” I saw “contract.” It makes the skin tighter, or contracts it. From “contract,” the program found “business” as a synonym, and voilà.
And then, of course, there’s the last sentence:
I completely adore adore really like this stuff.
It reminds me of the old Certs commercial: “… two, two, two mints in one.” But it leaves the opposite taste in my mouth.
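These artifacts are mechanical enough that you can scan for them programmatically. Here’s a rough, illustrative sketch in Python; the patterns are my own examples of the signals described above, not a definitive spam detector, and the sample text is drawn from the passage quoted earlier.

```python
# Illustrative heuristics for spotting spun/auto-generated text, based on
# the artifacts discussed above. These rules are examples, not a real detector.
import re

SPIN_SIGNALS = {
    "space before punctuation or hyphen": re.compile(r"\w\s+(?:-\w|[,.!?;:])"),
    "space just inside parentheses": re.compile(r"\(\s+\w|\w\s+\)"),
    "same word repeated back-to-back": re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE),
}

sample = ("It dries and tightens the skin to some smooth delicate, satin -like finish. "
          "Once dried, you merely apply make-up ( I propose mineral make-up). "
          "I completely adore adore really like this stuff!")

for label, pattern in SPIN_SIGNALS.items():
    for match in pattern.finditer(sample):
        print("%s: %r" % (label, match.group(0)))
```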
This isn’t new technology. But it may be new to you. Recently, several sales prospects have come to us complaining of having received Google’s warning of “possibly artificial or unnatural links pointing to your site that could be intended to manipulate PageRank.” When we examine their links, we see a lot of these auto-generated pages.
So watch for this when you’re looking at links your SEO company sends to you.
An enterprise-level client recently came to us with an issue we typically file under “good problems to have:” He’s running an organic campaign in which the number of unique referring phrases will soon exceed 50,000 per day.
The 50,000 number is important, because that’s the practical limit of data rows you’re allowed to export under the Google Analytics API.
His challenge was to find out whether any third-party API tools can circumvent that 50,000-row limit. As far as we could tell, none can. (Google Analytics Premium extends that limit to one million rows for $150,000 per year.)
They all talk about working “within” the limit, but none discusses breaking the ceiling, with the exception of a few that piece together multiple queries.
As a result, I was playing around with the keyword filters and found a bit of a hacky solution. Using some simple regular expressions (simple RegEx is the only RegEx I know), I found it pretty easy to break up a set that’s larger than 50,000 rows into two or more smaller pieces.
For example, suppose your Organic Search Traffic report for a given day has around 52,000 unique phrases. You can divide the list of terms alphabetically, in theory breaking the set into two roughly equal halves.
Following are the instructions for breaking a large (> 50K-row) dataset into two sets. Start at your Organic Search Traffic report in Google Analytics (Traffic Sources | Sources | Search | Organic).
To obtain the “first” half (words beginning with A-K or a digit), click the “Advanced” link, which will allow you to create a filter for your keywords. Configure the first filter like this:
Include -> Keyword -> Matching RegExp -> ^[a-k0-9]
Basically, this command tells GA to list all keywords that begin with any letter from A to K, or any digit (0-9).
When this filter is complete, click the “Apply” button. The resulting dataset will reflect the terms that we’re looking for — the “first half” of our large dataset. The “advanced” link will now say “edit” because there is a filter currently being used.
To find the “second” half (words beginning with L-Z or a non-alphanumeric character, such as a comma, colon, or other punctuation mark), click the “edit” button and set up the following criteria:
Include -> Keyword -> Matching RegExp -> ^[l-z\W]
As stated, this configuration will show you the remaining phrases — any queries that begin with letters L through Z or a non-alphanumeric character.
I tried this on a few random days and in all cases, the sum of the two segments equalled exactly the total number of visits, so I feel like these expressions cover all the character bases. (Incidentally, all your “(not provided)” terms will appear in the rows pulled from the second half of the dataset, since they technically begin with an opening parenthesis).
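If you want to convince yourself that the two expressions really do split a keyword list cleanly, a quick check in Python mimics the filters. This is a sketch with made-up sample keywords; re.IGNORECASE is used so capitalization doesn’t matter, and re.ASCII keeps \W behaving like a simple “not a letter, digit, or underscore” class.

```python
# Sanity check that the two GA filters partition a keyword list with no
# overlap and (almost) no leftovers. Sample keywords are made up.
import re

FIRST_HALF = re.compile(r"^[a-k0-9]", re.IGNORECASE)
SECOND_HALF = re.compile(r"^[l-z\W]", re.IGNORECASE | re.ASCII)

keywords = [
    "blue widgets", "zebra print shoes", "(not provided)",
    "2012 tax forms", "kitchen remodel", "\u00e9clair recipe",
]

first = [k for k in keywords if FIRST_HALF.match(k)]
second = [k for k in keywords if SECOND_HALF.match(k)]
leftover = [k for k in keywords if k not in first and k not in second]

print(len(first), len(second), len(leftover))
# 'leftover' should be empty. The one theoretical gap is a keyword that
# begins with an underscore, which neither character class catches.
```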
You can export both halves of the dataset, re-combine them in Excel, and have the entire set to work with. It’s a little clunky, but with traffic growing, it’s a good way to deal with days that contain more than 50,000 unique phrases.
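If Excel feels too clunky, the recombining step is just as easy to script. A small sketch, assuming each filtered view has been exported to CSV (the file names here are hypothetical) and that pandas is installed:

```python
# Recombine the two filtered exports into a single dataset.
# File names are hypothetical; adjust them to match your GA exports.
import pandas as pd

first_half = pd.read_csv("organic_keywords_a-k_0-9.csv")
second_half = pd.read_csv("organic_keywords_l-z_other.csv")

all_keywords = pd.concat([first_half, second_half], ignore_index=True)

# Sanity check: the row count should equal the day's total unique phrases.
print(len(all_keywords))
all_keywords.to_csv("organic_keywords_full_day.csv", index=False)
```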
RegEx, of course, can do far more than divide a large dataset into two smaller chunks. It’s a very powerful filtering mechanism and can help you with very complex sorts and advanced segmentation. A couple of hours of reading on the subject will enhance your analytics skills immensely.
Last Thursday, Google announced an algorithm update designed to target pages whose above-the-fold layouts are significantly ad-heavy:
…we’ve heard complaints from users that if they click on a result and it’s difficult to find the actual content, they aren’t happy with the experience. Rather than scrolling down the page past a slew of ads, users want to see content right away. So sites that don’t have much content “above-the-fold” can be affected by this change.
Characteristically, Google declines to say exactly what above-the-fold ratio of ads to content will trigger the algorithm. Instead, it suggests that readers run their own sites through Google Labs’ “Browser Size Tool,” which is designed to show (with rather crudely drawn lines) how a site appears at various resolutions, and what percentage of the browsing public can see various parts of your content based on known statistics of browser resolution distribution. Here, for example, is a shot of the home page of the Christian Science Monitor, pulled at a screen resolution of 1280 x 800:
The X and Y axes are aligned with various resolutions, and the curved lines represent the percent of users who live within those resolution confines. I’ve outlined the ad locations in red — a horizontal banner at the top, and a block in the lower right. As these two areas are the only paid ad space in this particular version of the “above-the-fold” area, one would infer that the CS Monitor has little to fear from this particular algo change.
As do most sites, according to Google’s Matt Cutts, who penned the blog post: “This algorithmic change noticeably affects less than 1% of searches globally. That means that in less than one in 100 searches, a typical user might notice a reordering of results on the search page.”
While trying to sidestep the euphemistic glory of a phrase like “reordering of results,” many people had a common question: How, exactly, is Google determining which content blocks are ads? What about site-specific (but, perhaps, not textual) images or widgets? After all, even Google isn’t immune from snagging a few dolphins in the tuna nets, as multiple updates to Panda have confirmed.
My advice, in the meantime, is fairly simple:
- Check your site with the tool.
- Try to look objectively at the content viewable by, say, 65% or so of the general population, to determine whether most people would find the layout annoying or distracting.
- If you have ad blocks taking up “lots” (sorry for the vagueness) of the viewable space, try to scale back, and confirm or refute the impact by segmenting your organic traffic and looking closely at measurements such as path analysis and bounce rate.
Google itself took some heat from the furnace of irony in, among others, Danny Sullivan’s column on Friday. Using several screen shots taken by himself and others, he illustrated that for many queries, Google’s own search results page can, depending on the resolution (which defines the “fold”), be taken up entirely with paid advertisements.
But don’t try to confirm that with Google Labs’ Browser Size tool. As Danny’s column pointed out (and I confirmed with numerous tests), one page that Google’s Browser Size tool will NOT diagnose is a traditional Google search results page. So there.
The Crawl Errors report in Google Webmaster Tools, which lists 404 URLs along with the pages that link to them, was a great feature, because, properly prioritized, addressing these erroneous URLs (either by fixing the page to show content, redirecting the error URL to a legitimate page, or canonicalizing the “bad” URL to a good one) meant a nearly instant increase in the number of inbound links to your site.
The concept is simple enough: if three different third-party sites link to your site, but the page on your site they linked to has moved or is otherwise not found, your site will return a 404 error when that URL is called. But if you redirect that old URL — the URL to which the external sites are pointing — to the correct page, then you’ve instantly increased the number of inbound links to your site by three.
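The redirect itself is usually a one-liner. As a minimal sketch, assuming an Apache server with mod_alias enabled (the paths here are hypothetical), the old URL that external sites still point to can be permanently redirected like this:

```apache
# Hypothetical example: send the old, externally linked URL to its replacement.
Redirect 301 /old-product-page.html /products/new-product-page.html
```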
Sometimes, however, Google reports URLs in its “Crawl Errors” section that really aren’t URLs at all — and never were. These are the types of URLs I want to discuss in this post: Why they exist, and how to deal with them.
Consider the following example:
One client site shows links to two different URLs from 5 and 10 different pages, respectively:
First, consider the top highlighted URL, /pdf/.
Determining the origin seems simple enough. Perhaps /pdf/ used to warehouse all the site’s PDF files, manuals, guides, and so on. But that’s not the case here. The list of pages that link to the /pdf/ page, listed in the far-right column, is where we need to begin the process. If I drill down into one of those pages, it’s pretty easy to see where Google found the idea that /pdf/ might be an actual URL:
It’s important to note here that the domain I’ve blurred out in the preceding code isn’t even the client’s domain. It’s another domain that holds the PDF assets. So not only did Google pull the /pdf/ out of the line somewhat randomly, it didn’t even append it to the domain that preceded it to assemble a more credible (but still incorrect) URL.
The second highlighted URL is a different story. Technically, this file does exist. But it’s not meant for site users, since it is simply a cog in the machine of presenting Flash files on the site. So a 404 is, in fact, the type of response we want when this file gets called. If the file were actually presented, it would make for a pretty nasty user experience.
The point of these examples is to show that, despite what you may read, some of the “errors” in Google Webmaster Tools’ reports aren’t the type of error that needs to be fixed. In fact, in the cases mentioned here, I wouldn’t do anything to them at all. Don’t be so eager to produce a clean Crawl Errors report that you inadvertently cause your site some problems.
The lure of “fixing” errors and recouping “lost” link equity is a strong one. But sometimes it’s built on false premises. These URLs, despite being listed as such, really aren’t being “linked to,” either externally or internally. It’s unclear exactly how much authority and PageRank such a link would carry when its existence relies solely on having been “discovered” through a faulty process, but my advice to clients wagers that it’s very little, and that they should not spend any time worrying about it.
What I hope you’ll take away from this post:
- When you see GWT’s list of 404 errors on your site, the number of links pointing to a URL should not be your sole criterion for determining priority for addressing them.
- Google’s URL discovery process sometimes pulls in URLs that don’t really exist.
- Check the URLs linking to the 404 URL before deciding how or whether to proceed in “fixing” the URL.
- If the URLs exist only in code or were “assembled” using faulty assumptions on Google’s part, it’s often best to ignore them and move on, addressing instead the URLs that are genuinely linked to from third-party sites.
We sometimes have fun at the expense of tactics in the Search industry that never seem to die. In a recent training session, our AJ Allen was struck by how some of the techniques we avoid map onto some of the more nefarious characters from comic books and graphic novels. While we are rather “risk-averse” when it comes to suggesting tactics like these (i.e., we don’t), the point of this post is fun, not a statement or referendum on hat shades. Thanks to AJ for matching these up and for the inspiration (and half the copy) for this post.
Cloakers (“Two-Face”):
Cloakers serve one version of a page to search engine crawlers and a different version to human visitors. Similarly, one of Batman’s greatest foes, Two-Face, was once a friend and the district attorney for Gotham City. Now saddled with a split personality and a real bad temper, Harvey Two-Face clearly represents the cloaking tactic.
Redirectors (“Mysterio”):
A subset of cloakers, redirectors get pages to rank well, but when a human visits, that visitor is quickly redirected (often with a meta refresh) to a page with a prominent call to action. The ultimate master of redirection, Mysterio has built his life of crime on illusion and a drive for revenge. One of Spider-Man’s classic villains, Mysterio “redirects” his adversary’s attention to accomplish his dastardly deeds.
Comment Spammers (“The Joker”):
The comment spammer drops links to his or her clients’ sites in blog comments, forums, or social networks, even though most of those links are useless, not algorithmically beneficial, and entirely out of place in the current thread. Typically, this is the result of an SEO contract that requires a “quota” of new links every month.
The Joker needs no introduction; he embodies spammers as the ultimate agent of chaos and clutter. Everything seems out of place when the Joker is in town, just as spammers degrade the user experience on sites with their irrelevant content.
Keyword Stuffers (“Catwoman”):
Keyword stuffers try to optimize a page by repeating targeted phrases over and over, in every location they can (URL, body copy, alt attributes, title tag, meta keywords, meta description). Even worse are the ones who still stuff tiny copy onto a page (typically below the fold) or “hide” text by rendering it in the same color as the background.
This type of practice still happens fairly often, but thanks to improving algorithms, we see such sites appear less and less on prominent results pages. As a specific tactic, it went out of vogue around 2005, even though its efficacy had started to degrade before that. The most devious and stealthy rogue of all time, Catwoman is a cat burglar who secretly moves about at night, looking for her next piece of treasure. Always causing Batman a lot of trouble (sometimes due to her lure of “wonderful results,” IYKWIM), she maps neatly onto the practice of secretly placing text in the background while looking for your next push in PageRank.
Doorway Pagers (“Apocalypse”):
Doorway pagers create content that appeals to engines, gets the click, and then, due to a lack of navigational depth, gives the user little choice but to click into the main site.
Similarly, enslaving all living beings has always been the goal of Apocalypse, the mutant villain of the X-Men. If he were a web developer, he would use this tactic, since it leaves the user no choice but to enter the main site after he gets the clicks he desires.
Special evil points go to coders who play with HTML to effectively “disable” the browser’s “Back” button, which is the natural thing that users will try to click when they’re on a page full of yuck.
What other tactics remind you of specific characters? Drop us a line in the comments.