Select a page

Breaking Google Analytics’ 50,000 Row Limit: Enterprise Analytics

[A]n enterprise-level client recently came to us with an issue we typically file under “good problems to have:” He’s running an organic campaign in which the number of unique referring phrases will soon exceed 50,000 per day.

The 50,000 number is important, because that’s the practical limit of data rows you’re allowed to export under the Google Analytics API.

His challenge was to find out whether a) any third-party API tools can circumvent that 50,000-row limit. As far as we could tell, none can. (Google Analytics Premium extends that limit to one million for $150,000 per year.)

They all talk about working “within” the limit but none discusses breaking the ceiling, with the exception of some that piece together multiple queries.

As a result, I was playing around with the keyword filters and found a bit of a hacky solution. Using some simple Regular Expressions (simple RegEx is the only RegEx I know), it became pretty easy to break up a set that’s larger than 50,000 rows into two or more smaller bits.

For example, suppose your Organic Search Traffic report for a given day has around 52,000 unique phrases. You can divide the list of terms alphabetically, in theory breaking the set into two roughly equal halves.

Following are the instructions to break a large (> 50K row) dataset into two sets. Start at your Google Analytics (Traffic Sources | Sources | Search | Organic).

To obtain the “first” half (words beginning with A-K or a digit), click the “Advanced” link, which will allow you to create a filter for your keywords. Configure the first filter like this:

Include -> Keyword -> Matching RegExp ->  ^[a-k0-9]

RegEx for the first half of our keywords

This configuration captures all words that begin with A-K or a digit.

Basically, this command tells GA to list all keywords that begin with any letter from A to K, or any digit (0-9).

When this filter is complete, click the “Apply” button. The resulting dataset will reflect the terms that we’re looking for — the “first half” of our large dataset. The “advanced” link will now say “edit” because there is a filter currently being used.

To find the “second” half (words beginning with L-Z or a non-alphanumeric character, such as a comma, colon, or other punctuation mark), click the “edit” button and set up the following criteria:

Include -> Keyword -> Matching RegExp -> ^[l-z\W]

As stated, this configuration will show you the remaining phrases — any queries that begin with letters L through Z or a non-alphanumeric character.

I tried this on a few random days and in all cases, the sum of the two segments equalled exactly the total number of visits, so I feel like these expressions cover all the character bases. (Incidentally, all your “(not provided)” terms will appear in the rows pulled from the second half of the dataset, since they technically begin with an opening parenthesis).

You can export both halves of the dataset and re-combine them in Excel and have the entire set to work with. It’s a little clunky, but with traffic growing, it’s a good way to deal with days that contain more than 50,000 unique phrases.

RegEx, of course, can do far more than divide a large dataset into two smaller chunks. It’s a very powerful filtering mechanism and can help you with very complex sorts and advanced segmentation. A couple hours of reading on the subject will enhance your analytics skills immensely.

[author] [author_image timthumb='on']http://blog.intrapromote.com/wp-content/uploads/2011/06/erik-dafforn.jpg[/author_image] [author_info]Erik joined Intrapromote full-time in 2002 after coming on as a contractor in 1999. Yeah, he’s been here a while. Currently, as President, he divides his time between overseeing the SEO department and managing organic SEO campaigns. He’s also one of Intrapromote’s Chief Big Idea Guys. Prior to working at Intrapromote, Erik worked in publishing as a development editor in the programming imprints of IDG Books and Prentice-Hall Computer Publishing so if anyone needs a keyboard shortcut, they ask Erik.

Among Erik’s professional bragging rights include the fact that he led the team that was awarded Honda’s prestigious “Premier Partner” vendor award for SEO Services. He also contributed on the recently published Search Engine Optimization Secrets with Danny Dover.
[/author_info] [/author]

Follow Erik Dafforn on Twitter:

Erik Dafforn
Erik joined Intrapromote full-time in 2002 after coming on as a contractor in 1999. Yeah, he's been here a while. Currently, as President, he divides his time between overseeing the SEO department and managing organic SEO campaigns. He's also one of Intrapromote's Chief Big Idea Guys. Prior to working at Intrapromote, Erik worked in publishing as a development editor in the programming imprints of IDG Books and Prentice-Hall Computer Publishing so if anyone needs a keyboard shortcut, they ask Erik. Among Erik's professional bragging rights include the fact that he led the team that was awarded Honda's prestigious "Premier Partner" vendor award for SEO Services. He also contributed on the recently published Search Engine Optimization Secrets with Danny Dover. Erik rivals Aaron Sorkin in his candid wit (although he will completely disagree with that statement) and has been a tremendous asset to Intrapromote since day one. He is incredibly intelligent and when you read any of his posts, you'll be able to sit back and say "Wow, I've just been Dafforned."
Erik Dafforn
Erik Dafforn
Erik Dafforn

1 Comment

  1. Tim Seitz May 29, 2012 Reply

    Another option that we use is the start-index and max-results fields in the Google Analytics Query Explorer. start-index sets the index for the first record in the dataset and then returns the next max-results records. The API only returns 10,000 records at a time, but with repeated indexing, you can retrieve a large dataset.

Leave a reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>