The Basics of Regular Expressions for Google Analytics

Regular expressions, commonly known as regex, form the basis of much of my work with Google Analytics. Quite simply, I could not do half of what I do without them. My knowledge is based on a lot of trial and error but it all started with an ebook from Luna Metrics, for which I am incredibly grateful: http://www.lunametrics.com/regex-book/Regular-Expressions-Google-Analytics.pdf.

This blog post is not meant to replace that ebook but to be a quick reference guide followed by some practical applications within Google Analytics of regular expressions.

Where you use regular expressions

Regular expressions end up being used in what feels like most Google Analytics functionality.  I am going to forget at least one use but here are the key areas:

  • Applying report filters (particularly with the following reports)
    • All Pages
    • Keywords (where provided)
    • All Referrals
  • Creating View (Profile) filters in the configuration – examples include:
    • Rename pages
    • Exclude IP Addresses or robot traffic
    • Rename traffic sources
  • Creating Segments
  • Setting up Goals
  • Including filters while creating Custom Reports or Dashboard widgets

With that list of uses, regex becomes something you MUST know if you are planning on using Google Analytics on a regular basis to get real value out of it.

The key characters

Regular expressions are built around a set of wildcard characters. You may think of * as a "match anything" character, but regex offers a lot more flexibility than that. The list below is not complete but covers what you need to know to get started.

Hat (technically called a caret) ^

  • This is a regular expression for “begins with”
  • Matches any string that starts with whatever follows the ^
  • E.g. ^aa will match “aaa”, “aab” but not “baa”
  • Use for identifying category of pages e.g. ^/services

Dollar sign $

  • This is the regular expression for “ends with”
  • It is the simple opposite of ^, matching any string that ends with whatever precedes the $
  • E.g. aa$ will match “aaa”, “baa” but not “aab”
  • Use for identifying category of pages e.g. html$ (no URL query parameters)
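
To see the two anchors in action, here is a minimal sketch using Python's re module. Python is only a stand-in for experimenting, not the engine Google Analytics runs, and the page paths are invented for illustration. re.search is used because it looks for the pattern anywhere in the string unless an anchor says otherwise.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text (a partial match)."""
    return re.search(pattern, text) is not None

begins_with_aa = [s for s in ["aaa", "aab", "baa"] if found(r"^aa", s)]
ends_with_aa = [s for s in ["aaa", "baa", "aab"] if found(r"aa$", s)]

# Hypothetical page paths, invented for this illustration.
pages = ["/services/seo", "/about/services", "/index.html", "/index.html?ref=1"]
service_pages = [p for p in pages if found(r"^/services", p)]
html_pages = [p for p in pages if found(r"html$", p)]

print(begins_with_aa)  # ['aaa', 'aab']
print(ends_with_aa)    # ['aaa', 'baa']
print(service_pages)   # ['/services/seo']
print(html_pages)      # ['/index.html']
```

Note that /about/services is excluded by ^/services because the match has to begin at the very start of the path.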

Period .

  • The simplest regex character, the dot matches any single character
  • E.g. a.b will match “aab”, “abb”, “a5b” or “a!b” but not “ab”
  • Use to correct for spelling mistakes e.g. ach..ve (is it ie or ei?)

Pipe |

  • The pipe means OR and is one of the simplest characters to use when learning regex
  • E.g. aa|bb will match “aa” or “bb” but not “ab” (it will also match “aabb”, as that string contains “aa”)
  • Use when identifying social media traffic e.g. facebook|twitter
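
The dot and the pipe can be tried out the same way. This is a rough sketch in Python (an assumption for testing purposes, not part of Google Analytics), and the referral sources are made up.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# The dot stands for exactly one character, so "ab" has nothing for it to match.
dot_hits = [s for s in ["aab", "abb", "a5b", "a!b", "ab"] if found(r"a.b", s)]

# Two dots cover both the correct spelling and the common misspelling.
spellings = [s for s in ["achieve", "acheive", "achive"] if found(r"ach..ve", s)]

# Hypothetical referral sources.
sources = ["facebook.com", "m.facebook.com", "twitter.com", "google.com"]
social = [s for s in sources if found(r"facebook|twitter", s)]

print(dot_hits)   # ['aab', 'abb', 'a5b', 'a!b']
print(spellings)  # ['achieve', 'acheive']
print(social)     # ['facebook.com', 'm.facebook.com', 'twitter.com']
```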

Question Mark ?

  • The previous character is optional and is not required for that string to match
  • So a? means match a or blank
  • e.g. ba?b will match “bb”, “bab” but not “baaab”
  • Use with multi word brand names where the space may not be included e.g. l3 ?analytics

Plus symbol +

  • The plus symbol means match one or more of the previous character
  • So a+ means match one or more a’s
  • e.g. ba+b will match “bab” and “baaab”

Asterisk symbol *

  • Similar to the plus symbol except it matches zero or more of the previous character
  • So a* means match zero or more a’s
  • e.g. ba*b will match “bab”, “baaab” and “bb”
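
The three repetition characters (?, + and *) are easiest to compare side by side. The sketch below uses Python's re.fullmatch, which requires the whole string to match, so the differences are not blurred by partial matches. Python is my choice for illustration only.

```python
import re

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

samples = ["bb", "bab", "baaab"]

optional = whole_matches(r"ba?b", samples)      # zero or one "a"
one_or_more = whole_matches(r"ba+b", samples)   # at least one "a"
zero_or_more = whole_matches(r"ba*b", samples)  # any number of "a", including none

# The optional space from the brand name example.
brand = whole_matches(r"l3 ?analytics", ["l3 analytics", "l3analytics", "l3  analytics"])

print(optional)      # ['bb', 'bab']
print(one_or_more)   # ['bab', 'baaab']
print(zero_or_more)  # ['bb', 'bab', 'baaab']
print(brand)         # ['l3 analytics', 'l3analytics']
```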

Square brackets []

  • Square brackets are where it starts to get complicated.  They mean match one of the characters listed inside
  • So [abc] could match any of a, b or c
  • Square brackets can also include numbers and symbols so [ab12_/] means match one of a, b, 1, 2, _ or /
  • And you can also use a range where [a-z0-9] means match one of any lowercase letter or any digit
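
A quick way to check what a set of square brackets will accept is to list every character in a test string that it matches. This is a small Python sketch (Python is an assumption here, used only for experimenting) and the test strings are arbitrary.

```python
import re

# re.findall returns every non-overlapping match; with a single
# bracket expression, that is every individual character that qualifies.
abc_chars = re.findall(r"[abc]", "cabbage")
mixed_chars = re.findall(r"[ab12_/]", "/a_3c")
range_chars = re.findall(r"[a-z0-9]", "A1-b2_C")

print(abc_chars)    # ['c', 'a', 'b', 'b', 'a']
print(mixed_chars)  # ['/', 'a', '_']
print(range_chars)  # ['1', 'b', '2']
```

In the last line the capital letters are skipped because Python's re module is case-sensitive unless told otherwise.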

Round brackets ()

  • These are like a mathematical formula in that their contents are processed independently
  • E.g. b(a|c)b will match “bab” or “bcb” but not “bacb”, “ba” or “bc”
  • In plain English, this example means match b, then either a OR c, then another b
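
The effect of the round brackets is clearest when the same pattern is tried with and without them. A minimal Python sketch (again, Python is just a convenient test bed) using whole-string matching:

```python
import re

candidates = ["bab", "bcb", "bacb", "ba", "bc", "cb"]

# With brackets: b, then either a OR c, then b.
grouped = [s for s in candidates if re.fullmatch(r"b(a|c)b", s)]

# Without brackets the pipe splits the whole pattern: "ba" OR "cb".
ungrouped = [s for s in candidates if re.fullmatch(r"ba|cb", s)]

print(grouped)    # ['bab', 'bcb']
print(ungrouped)  # ['ba', 'cb']
```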

Backslash \

  • With all of these wildcard characters, sometimes they are the very characters you need to identify in a string
  • In that situation, put a backslash in front of the character to indicate it is not a wildcard
  • E.g. ba\?b will match “ba?b” but not “bb” or “bab”
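
The difference an escaping backslash makes can be checked with a short Python sketch (Python is assumed here purely as a testing tool). The same three strings are run against the pattern with and without the backslash before the question mark.

```python
import re

samples = ["ba?b", "bb", "bab"]

# Escaped: the ? is a literal question mark.
escaped = [s for s in samples if re.search(r"ba\?b", s)]

# Unescaped: the ? makes the "a" optional, so the literal "ba?b" no longer matches.
unescaped = [s for s in samples if re.search(r"ba?b", s)]

# Python can add the backslashes for you when you need to match text literally.
auto_escaped = re.escape("ba?b")

print(escaped)       # ['ba?b']
print(unescaped)     # ['bb', 'bab']
print(auto_escaped)  # ba\?b
```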

More complicated examples

I gave pretty simple examples when describing these wildcards but the real power comes when you combine them.  Some examples of these combinations are:

  • .* will match any string
  • .+ will match any non-blank string
  • [a-z]+ will match one or more letters
  • [a-z0-9-_]+ will match one or more letters, numbers, dashes or underscores – as is found in most page names
  • a(cat|dog)?b will match “acatb”, “adogb” or (as the entire contents of the () are optional), “ab”
  • ^/[0-9]+/[0-9]+/ will match page names for blog posts that use the format /yyyy/mm/<blog post title>
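
The combinations above can be verified with a few lines of Python. This is a sketch for experimenting, not how Google Analytics evaluates them, and the page name and blog paths are invented for illustration.

```python
import re

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

any_string = [whole(r".*", s) for s in ["", "anything at all"]]
non_blank = [whole(r".+", s) for s in ["", "x"]]
page_name = whole(r"[a-z0-9-_]+", "my-page_01")  # hypothetical page name

optional_group = [s for s in ["acatb", "adogb", "ab", "acatdogb"]
                  if whole(r"a(cat|dog)?b", s)]

# Hypothetical blog paths; re.search is enough because the pattern is anchored with ^.
blog_paths = [p for p in ["/2013/10/regex-basics/", "/blog/2013/10/"]
              if re.search(r"^/[0-9]+/[0-9]+/", p)]

print(any_string)      # [True, True]
print(non_blank)       # [False, True]
print(page_name)       # True
print(optional_group)  # ['acatb', 'adogb', 'ab']
print(blog_paths)      # ['/2013/10/regex-basics/']
```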

Use cases in Google Analytics

Valid Traffic Only

Your live Google Analytics profiles (I can’t get used to calling them Views) should only include data from your live website.  If L3 Analytics had three subdomains (www, blog and support), the regular expression to use in the profile filter is:

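^(www\.|blog\.|support\.)?l3analytics\.com$

This translates as “starts with www. OR blog. OR support. OR is blank, followed by l3analytics.com, ending there”.  The dots are escaped with a backslash so that they only match a literal period rather than any character.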
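
Before saving a filter like this, it is worth testing the pattern against a handful of hostnames. The Python sketch below does that with the dots escaped so they only match a literal period; the hostnames beyond the three subdomains named above are made up, and Python is only a stand-in for the filter itself.

```python
import re

VALID_HOST = re.compile(r"^(www\.|blog\.|support\.)?l3analytics\.com$")
# The same pattern with unescaped dots, for comparison.
UNESCAPED = re.compile(r"^(www.|blog.|support.)?l3analytics.com$")

hostnames = [
    "www.l3analytics.com",
    "blog.l3analytics.com",
    "support.l3analytics.com",
    "l3analytics.com",
    "dev.l3analytics.com",          # hypothetical test subdomain
    "l3analytics.com.example.org",  # hypothetical copycat hostname
    "wwwxl3analytics.com",          # hypothetical near miss
]

valid = [h for h in hostnames if VALID_HOST.search(h)]
loose_only = [h for h in hostnames
              if UNESCAPED.search(h) and not VALID_HOST.search(h)]

print(valid)       # the first four hostnames
print(loose_only)  # ['wwwxl3analytics.com']
```

The second list shows why escaping matters: an unescaped dot lets the x in wwwxl3analytics.com slip through.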

GA Profile Filter for including only Valid Traffic

Identify Social Media traffic

Traffic from Social Media networks that is not tagged with campaign parameters will appear with a medium of “referral” and a source of the social media network domain name.  You can rename the medium to social media for all this traffic using the following regular expression for the source:

facebook|twitter|^t\.co$|linkedin|plus\.google\.com|digg|pinterest|instagram|stumbleupon
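
As a rough illustration of what this filter does, the Python sketch below applies the same pattern (with the dots escaped) to a few invented visits. The rename_medium function is hypothetical and only mimics the effect of the filter; it is not how Google Analytics implements it.

```python
import re

SOCIAL_SOURCE = re.compile(
    r"facebook|twitter|^t\.co$|linkedin|plus\.google\.com"
    r"|digg|pinterest|instagram|stumbleupon"
)

def rename_medium(source, medium):
    """Return 'social media' for referrals from a social network, else the original medium."""
    if medium == "referral" and SOCIAL_SOURCE.search(source):
        return "social media"
    return medium

# Hypothetical (source, medium) pairs.
visits = [
    ("facebook.com", "referral"),
    ("t.co", "referral"),
    ("scott.com", "referral"),  # contains "t.co" but is not Twitter's shortener
    ("plus.google.com", "referral"),
    ("google", "organic"),
]

renamed = [rename_medium(source, medium) for source, medium in visits]
print(renamed)
# ['social media', 'social media', 'referral', 'social media', 'organic']
```

The ^ and $ around t\.co are what keep a source such as scott.com from being counted as social media.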

GA profile filter for identifying Social Media traffic

Identify Subset of Pages

It can be easy to identify a subset of pages (e.g. product pages, article pages) if there is a good URL structure in place.  If not, it can be incredibly difficult but using regular expressions makes it possible/easier.  Let’s use Paperchase as an example:

To identify department pages (within the All Pages report), they will need to filter on

^/[a-z-]+/icat/

To identify product list pages (within the All Pages report), they will need to filter on

^/[a-z0-9-+_]+/[a-z0-9-+_]+/icat/

To identify product pages (within the All Pages report), they will need to filter on

^/invt/[0-9]+/$

These examples are not exact, as there appears to be some variation in the URLs, but they will cover most cases.  The logic is to look for one string prior to icat for department level pages and two strings for product list level pages.  There are many more details that could be identified from the filters that are applied to these pages.
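
To check that the three patterns do not overlap, they can be run against some sample paths. In the Python sketch below the paths are invented to fit the structure described above, not taken from the real site, and the classify function is purely illustrative.

```python
import re

# Checked in order; the first pattern that matches wins.
PAGE_TYPES = [
    ("product", re.compile(r"^/invt/[0-9]+/$")),
    ("product list", re.compile(r"^/[a-z0-9-+_]+/[a-z0-9-+_]+/icat/")),
    ("department", re.compile(r"^/[a-z-]+/icat/")),
]

def classify(path):
    """Return the page type for a path, or 'other' if no pattern matches."""
    for label, pattern in PAGE_TYPES:
        if pattern.search(path):
            return label
    return "other"

# Hypothetical paths following the one-string / two-string structure.
paths = [
    "/notebooks/icat/notebooks",
    "/stationery/notebooks/icat/a5",
    "/invt/00123456/",
    "/about-us",
]

page_types = [classify(p) for p in paths]
print(page_types)  # ['department', 'product list', 'product', 'other']
```

Because none of the bracketed sets include a forward slash, each [...]+ can only cover a single section of the path, which is what separates department pages from product list pages.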

Similar logic is used if you want to set Goals for viewing a Product page or for using a filter on a Product List page.

Rename Pages

Taking the previous logic to the next level, these pages can be renamed using Profile Filters and regular expressions.  A key change here is that anything within a () is remembered by Google Analytics and can be used within a string that is output.

So, to rename blog posts that use the format of /yyyy/mm/<blog post title>, use a Custom Advanced filter and select Request URI (page name) for both Field A and Output To.  The regular expression and output are then:

  • Field A – ^/[0-9]+/[0-9]+/(.+)
  • Output – /blog/post/$A1

Where $A1 is the blog post title

You can then identify all blog posts by simply filtering on “/blog/post”.
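
The same rewrite can be tried outside Google Analytics before the filter goes live. In the Python sketch below, \1 plays the role of $A1 (the contents of the first set of round brackets). The paths are invented and rename_page is a hypothetical helper, not part of any Google Analytics tooling.

```python
import re

BLOG_POST = re.compile(r"^/[0-9]+/[0-9]+/(.+)")

def rename_page(request_uri):
    """Rewrite /yyyy/mm/<title> to /blog/post/<title>; leave other paths untouched."""
    return BLOG_POST.sub(r"/blog/post/\1", request_uri)

# Hypothetical page paths.
before = ["/2013/10/regex-basics/", "/about/", "/2013/10/"]
after = [rename_page(p) for p in before]

print(after)
# ['/blog/post/regex-basics/', '/about/', '/2013/10/']
```

The last path is left alone because (.+) needs at least one character after the month, so an archive page for a month is not renamed as if it were a post.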

GA Profile Filter to rename Blog Posts

What would you like to know?

Ok, regular expressions are very powerful which means they can also get very complicated.  I tend to choose those advanced complicated examples myself but it would be more useful to provide examples that everyone can follow.

Please leave a comment or send me an email (peteroneill@l3analytics.com) if you have a use case for a regular expression but don’t know what the regular expression is.  I can answer it and use it in a future blog post (if ok with you).  This will stop me getting too advanced, focusing instead on practical everyday usage.

5 responses to “The Basics of Regular Expressions for Google Analytics”

  1. Great guide Peter!

    I use regular expressions in Google Analytics all the time – they’re fantastic.

    Unfortunately, I think that it is probably one of the most powerful, yet underused features within Google Analytics. As an example, within the marketing team I work with – no one knows what regular expressions are (no surprise). When I show them how to use them to solve a problem they’ve got, they are always in awe of what you can do with them but because it is a complex concept, the syntax is difficult to understand & it isn’t something they use all the time – it is often forgotten.

    On a side tangent, I’m excited that Google re-enabled regular expressions from the search/filter box not so long ago – it was annoying having to go into the advanced search section, select regular expression and enter the criteria when I could have just whacked it directly into the search/filter box above the tabulated data.

  2. colin says:

    Excellent Peter, just what I needed !
    Many thanks.

    Col

  3. Varun says:

    Hi,
    What can be the regex for email medium as campaign medium?
    Can you help me with that?
