Using Python to recover SEO site traffic (Part one)


Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.

The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.

Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.

This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions, and in part three we will group them automatically using machine learning techniques. Let's walk through part one and have some fun!

Winners vs losers

[Chart: SEO traffic takes a hit after the switch to Shopify]

Last June we signed up a client that moved from Ecommerce V3 to Shopify, and their SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites, but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.

When traffic drops, some parts of the site underperform while others don't. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.

I call this analysis the "Winners vs Losers" analysis. Here, winners are the parts that do well, and losers the ones that do badly.

[Chart: visual analysis of winners and losers to figure out why traffic changed]

A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages in Shopify) and found that the main problem was caused by the site owner merging and eliminating too many categories during the move.

Let's walk through the steps to put this kind of analysis together in Python.

You can reference my carefully documented Google Colab notebook here.

Getting the data

We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we are going to use the Google Analytics API to do it.

The Google Analytics Query Explorer provides the easiest way to set this up in Python.

  1. Head on over to the Google Analytics Query Explorer.
  2. Click on the button at the top that says "Click here to Authorize" and follow the steps provided.
  3. Use the dropdown menu to select the website you want to get data from.
  4. Fill in the "metrics" parameter with "ga:newUsers" in order to track new visits.
  5. Fill in the "dimensions" parameter with "ga:landingPagePath" in order to get the page URLs.
  6. Fill in the "segment" parameter with "gaid::-5" in order to track organic search visits.
  7. Hit "Run Query" and let it run.
  8. Scroll down to the bottom of the page and look for the text box that says "API Query URI."
    1. Check the box underneath it that says "Include current access_token in the Query URI (will expire in ~60 minutes)."
    2. At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as the variable called token (make sure to paste it inside the quotes).
  9. Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called "ids." You will use this in the code snippet below as the variable called "gaid." Again, it should go inside the quotes.
  10. Run the cell once you've filled in the gaid and token variables to instantiate them, and we're good to go!

First, let's define the placeholder variables to pass to the API.

metrics = ",".join(["ga:users", "ga:newUsers"])
dimensions = ",".join(["ga:landingPagePath", "ga:date"])
segment = "gaid::-5"

# Required, please fill in with your own GA view ID, example: ga:23322342
gaid = "ga:23322342"

# Example: string-of-text-here from step 8.2
token = ""

# Example: https://www.example.com or http://example.org
base_site_url = ""

# You can change the start and end dates as you like
start = "2017-06-01"
end = "2018-06-30"

The first function combines the placeholder variables we filled in above with an API URL to get the Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000-row limit.

import requests

def GAData(gaid, start, end, metrics, dimensions,
           segment, token, max_results=10000):
  """Creates a generator that yields GA API data
     in chunks of size `max_results`"""
  # build uri w/ params
  api_uri = "https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&" \
            "start-date={start}&end-date={end}&metrics={metrics}&" \
            "dimensions={dimensions}&segment={segment}&access_token={token}&" \
            "max-results={max_results}"

  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )

  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get("nextLink", None):
    while data.get("nextLink"):
      new_uri = data.get("nextLink")
      new_uri += "&access_token={token}".format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data

In the second function, we load the Google Analytics API response into a pandas DataFrame to simplify our analysis.

import pandas as pd

def to_df(gadata):
  """Takes in a generator from GAData()
     creates a dataframe from the rows"""
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data['rows'],
          columns=[x['name'] for x in data['columnHeaders']]
      )
    else:
      newdf = pd.DataFrame(
          data['rows'],
          columns=[x['name'] for x in data['columnHeaders']]
      )
      # append this chunk's rows onto the frame
      df = pd.concat([df, newdf], ignore_index=True)
    print("Gathered {} rows".format(len(df)))
  return df

Now, we can call the functions to load the Google Analytics data.

data = GAData(gaid=gaid, metrics=metrics, start=start,
              end=end, dimensions=dimensions, segment=segment,
              token=token)

data = to_df(data)

Analyzing the data

Let's start by just getting a look at the data. We'll use the .head() method of DataFrames to check out the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.

data.head(5)

This displays the first five rows of the data frame.

Most of the data is not in the right format for proper analysis, so let's perform some data transformations.
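If you want to confirm this for yourself, a quick look at the column types shows that every field comes back from the API as text:

# Quick check: each column is returned by the API as a string ("object" dtype)
print(data.dtypes)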

First, let's convert the date to a datetime object and the metrics to numeric values.

data['ga:date'] = pd.to_datetime(data['ga:date'])
data['ga:users'] = pd.to_numeric(data['ga:users'])
data['ga:newUsers'] = pd.to_numeric(data['ga:newUsers'])

Next, we need the landing page URLs, which are relative and include URL parameters, in two additional formats: 1) as absolute URLs, and 2) as relative paths (without the URL parameters).

from urllib.parse import urlparse, urljoin

data['path'] = data['ga:landingPagePath'].apply(lambda x: urlparse(x).path)
data['url'] = data['path'].apply(lambda p: urljoin(base_site_url, p))

Now the fun part begins.

The goal of our analysis is to see which pages lost traffic after a particular date (compared to the period before that date) and which gained traffic after it.

The example date chosen below corresponds to the exact midpoint of the start and end variables we used above to gather the data, so that the data before and after the date is equally sized.
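If you would rather not eyeball that midpoint, here is a small sketch (not part of the original notebook) that computes it from the start and end variables defined earlier:

# Compute the midpoint of the start/end window instead of hard-coding it
midpoint = pd.to_datetime(start) + (pd.to_datetime(end) - pd.to_datetime(start)) / 2
print(midpoint.date())  # 2017-12-15 for the dates used above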

We start the analysis by grouping the URLs by their path and adding up the newUsers for each URL. We do this with the built-in pandas method .groupby(), which takes a column name as input and groups together each unique value in that column.

The .sum() method then takes the sum of every other column in the data frame within each group.

For more information on these methods, please see the pandas documentation for groupby.

For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause.
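To make the groupby behavior concrete, here is a tiny toy example (the data is made up purely for illustration):

# Toy illustration of .groupby() + .sum(): rows with the same landing page collapse into one
toy = pd.DataFrame({
    "ga:landingPagePath": ["/a", "/a", "/b"],
    "ga:newUsers": [3, 2, 5],
})
print(toy.groupby("ga:landingPagePath").sum())
# ga:newUsers per path: /a -> 5, /b -> 5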

# Change this depending on your needs
MIDPOINT_DATE = "2017-12-15"

before = data[data['ga:date'] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data['ga:date'] >= pd.to_datetime(MIDPOINT_DATE)]

# Traffic totals before the Shopify switch
totals_before = before[["ga:landingPagePath", "ga:newUsers"]] \
                .groupby("ga:landingPagePath").sum()
totals_before = totals_before.reset_index() \
                .sort_values("ga:newUsers", ascending=False)

# Traffic totals after the Shopify switch
totals_after = after[["ga:landingPagePath", "ga:newUsers"]] \
               .groupby("ga:landingPagePath").sum()
totals_after = totals_after.reset_index() \
               .sort_values("ga:newUsers", ascending=False)

You can check the totals before and after with this code and double-check them against the Google Analytics numbers.

print("Traffic Totals Before: ")
print("Row count: ", len(totals_before))

print("Traffic Totals After: ")
print("Row count: ", len(totals_after))
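The snippet above only prints row counts; if you also want the traffic totals themselves for the comparison with the Google Analytics interface, a small addition like this works:

# Sum of newUsers in each period, for a quick comparison with the GA interface
print("Total newUsers before: ", totals_before["ga:newUsers"].sum())
print("Total newUsers after: ", totals_after["ga:newUsers"].sum())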

Next up, we merge the two data frames so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.

We have different options when merging (inner, outer, left and right joins). Here, we use an "outer" merge, because even if a URL didn't show up in the "before" period, we still want it to be part of this merged data frame. We'll fill in the blanks with zeros after the merge.

# Comparing pages from before and after the switch
change = totals_after.merge(totals_before,
                            left_on="ga:landingPagePath",
                            right_on="ga:landingPagePath",
                            suffixes=["_after", "_before"],
                            how="outer")
change.fillna(0, inplace=True)

Difference and percent change

Pandas data frames make simple calculations on whole columns easy. We can take the difference of two columns or divide one column by another, and pandas performs that operation on every row for us. We will take the difference of the two totals columns and divide it by the "before" column to get the percent change around our midpoint date.

Using this percent_change column, we can then filter our data frame to get the winners, the losers, and the URLs with no change.

change['difference'] = change['ga:newUsers_after'] - change['ga:newUsers_before']
change['percent_change'] = change['difference'] / change['ga:newUsers_before']

winners = change[change['percent_change'] > 0]
losers = change[change['percent_change'] < 0]
no_change = change[change['percent_change'] == 0]
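One edge case worth knowing about: a URL with zero traffic before the midpoint makes the division above a divide-by-zero, so its percent_change comes out as infinity and it lands in the winners group. If you want to look at those brand-new pages separately, here is a small sketch (the new_pages name is mine, not from the original notebook):

import numpy as np

# Pages that only received traffic after the midpoint have an infinite percent change
new_pages = change[np.isinf(change['percent_change'])]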

Sanity check

Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change data frame.

# Checking that the total traffic adds up
data['ga:newUsers'].sum() == change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum()

It should be True.
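If you prefer the notebook to fail loudly when that check is False, the same comparison can be written as an assertion (a small variation, not in the original notebook):

# Stop with an error if the traffic totals don't line up
assert data['ga:newUsers'].sum() == change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum(), \
    "Traffic totals don't match, re-check the merge and the before/after split"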

Results

Sorting our losers data frame by the difference and taking the .head(10), we can see the top 10 losers in the analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.

losers.sort_values("difference").head(10)

You can do the same to evaluate the winners and try to learn from them.

winners.sort_values("difference", ascending=False).head(10)

You can export the losing pages to a CSV or Excel file using this.

losers.to_csv("./losing-pages.csv")
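Since Excel was mentioned as well: pandas can write .xlsx directly, as long as the openpyxl package is installed (an optional extra, not part of the notebook code above):

# Requires openpyxl (pip install openpyxl)
losers.to_excel("./losing-pages.xlsx", index=False)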

This seems like a lot of work to analyze just one site, and it is!

The magic happens when you reuse this code on new clients and simply need to update the placeholder variables at the top of the script.

In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
