Python for Marketers: Do reviews impact click through rate?

  • What this is for: Exploring correlations between multiple variables
  • Requirements: Python Anaconda distribution, Google My Business or comparable data with multiple variables you want to test, Basic understanding of Pandas dataframe & statistics
  • Concepts covered: Creating a correlation matrix, basic dataframe calculations, scatter plots, exporting charts as PNG
  • Download the entire Python file

I’ve always been really interested in online reviews and how they influence consumer behavior. I’ve written a couple blog posts on some of the latest research into online reviews here and here.

For this tutorial, I wanted to test a hypothesis that a higher number of reviews and higher ratings correspond to higher click through rates. Using Python, I created a quick method of testing my hypothesis and looking for other correlations in the data.

Notice I said correspond to higher CTRs and not cause higher CTRs. An important reminder before we move on is that correlation does not equal causation.

Setting up your data file

First, we need to set up our data. This is the hardest and most time consuming part if you are starting from scratch.

I started with a random sample of businesses in Google My Business (n=52). I’m interested in how reviews impact behavior, so I’m going to track the total number of reviews, the total number of sites with reviews, and average star rating. Because some results are more visible than others, I’m going to look at these variables for both the Google search card and all Google results. These will be my independent variables.

For click through rate, I calculated the total number of actions on the Google My Business listing divided by total search impressions.
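If your raw export has actions and impressions in separate columns, the click through rate can be derived in one line. This is only a minimal sketch: the column names total_actions and search_impressions and the values are made up for illustration, so swap in whatever your own data file uses.

```python
import pandas as pd

# Hypothetical raw export: column names and values are assumptions for illustration
raw = pd.DataFrame({
    'record_id': [1, 2, 3],
    'total_actions': [30, 12, 0],
    'search_impressions': [600, 400, 250],
})

# CTR = total actions on the listing / total search impressions
raw['CTR'] = raw['total_actions'] / raw['search_impressions']
print(raw[['record_id', 'CTR']])
```

A listing with zero impressions would produce a division by zero here, so it is worth filtering those rows out before calculating.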

So here is the complete list of the variables that I collected:

  • Record ID (unique record ID)
  • Business type (I want to compare two business types, which will be recorded as 0 and 1)
  • Total number of citations (Number of 3rd party sites that have name, address, phone number for this business listing)
  • Total number of sites with reviews
  • Total number of reviews
  • Total number of sites with reviews in search card
  • Total number of reviews in search card
  • Cumulative rating in the search card
  • Click through rate for Google My Business listing (this will be my dependent variable)

Getting started in Python

First, we’ll import our libraries and our data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

#import CSV file
df = pd.read_csv('seo_data.csv', delimiter = ',')

Next, it’s very simple to print a correlation matrix with Pandas. Since we have 9 columns, we’ll make sure it shows all 9.

#Print the correlation coefficient matrix
pd.set_option('display.max_columns', 9)
print(df.corr())

If you’re using the Spyder IDE, it will look like this.

Understanding the correlation matrix

Remember, correlation coefficients are between -1 and 1. A -1 indicates a perfect negative correlation and a 1 indicates a perfect positive correlation. A 0 indicates no correlation between the variables. For more, read about correlation coefficients.
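To make those three cases concrete, here is a small sketch that calculates the Pearson correlation coefficient by hand on toy numbers. The pearson_r helper is my own illustration and not part of Pandas or NumPy.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()  # distance of each x from the mean of x
    y_dev = y - y.mean()  # distance of each y from the mean of y
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises in step with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in step with x
r_none = pearson_r(x, [2, 1, 0, 1, 2])       # y falls, then rises: no linear trend
print(r_positive, r_negative, r_none)
```

The last case is a useful reminder that a coefficient near 0 only rules out a straight-line relationship. The two variables can still be related in some other way.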

The variable we’re most interested in is our dependent variable, click through rate. So we’ll look at the last column to see how CTR correlates with the variables.

Based on the correlation matrix, CTR does not have any strong correlation with any of the other variables.
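With 9 columns it can be easier to pull out just the CTR column of the matrix and sort it by strength. The sketch below uses a tiny made-up dataframe in place of seo_data.csv, so the numbers it prints are not my results.

```python
import pandas as pd

# Small made-up dataframe standing in for seo_data.csv (illustrative values only)
sample = pd.DataFrame({
    'total_review_count_search_card': [5, 40, 12, 80, 33, 7],
    'business_type': [0, 1, 0, 1, 1, 0],
    'CTR': [0.04, 0.05, 0.03, 0.04, 0.06, 0.05],
})

# Take the CTR column of the correlation matrix and drop CTR's correlation with itself
ctr_corr = sample.corr()['CTR'].drop('CTR')

# Order the remaining variables from strongest to weakest, ignoring the sign
ranked = ctr_corr.reindex(ctr_corr.abs().sort_values(ascending=False).index)
print(ranked)
```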

Next, we can isolate one of the relationships. For example, if we wanted to isolate the correlation coefficient for CTR and the total number of reviews in the search card, we could use the np.corrcoef() function.

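#Print correlation coefficient
#np.corrcoef() returns a 2x2 matrix, so [0, 1] selects the coefficient between the two variables
cc = np.corrcoef(df['total_review_count_search_card'], y=df['CTR'])[0, 1]
print('Correlation Coefficient =',cc)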
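With a sample as small as this one, it can also help to ask whether a coefficient is distinguishable from zero. One option, sketched here on made-up numbers rather than my data, is stats.pearsonr() from SciPy, which returns the coefficient together with a two-sided p-value.

```python
import numpy as np
from scipy import stats

# Made-up values for illustration only; not the tutorial's data
reviews = np.array([5, 40, 12, 80, 33, 7, 21, 64])
ctr = np.array([0.04, 0.05, 0.03, 0.04, 0.06, 0.05, 0.04, 0.05])

# np.corrcoef returns a 2x2 matrix: the diagonal is each variable with itself,
# and the off-diagonal entries hold the coefficient between the two variables
matrix = np.corrcoef(reviews, ctr)
r_from_matrix = matrix[0, 1]

# stats.pearsonr returns the same coefficient plus a two-sided p-value
r, p_value = stats.pearsonr(reviews, ctr)
print('r =', r, 'p-value =', p_value)
```

A large p-value means a coefficient of that size could easily show up by chance in a sample this small.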

Scatter plot

To help visualize the data, we’ll create a scatter plot of the data with a linear regression line.

#Print scatter plot
plt.scatter(x=df['total_review_count_search_card'], y=df['CTR'])
plt.xlabel('Total Reviews in Search Card')
plt.ylabel('Actions Per Impression')
#Plot linear regression
iv = np.array(df['total_review_count_search_card']) #iv for independent variable
dv = np.array(df['CTR']) #dv for dependent variable
gradient, intercept, r_value, p_value, std_err = stats.linregress(iv,dv)
mn=np.min(iv)
mx=np.max(iv)
iv1=np.linspace(mn,mx,500)
dv1=gradient*iv1+intercept
#The data points are already drawn by plt.scatter() above, so only the regression line is added here
plt.plot(iv1,dv1,'-r')

We’ll also save the chart as a PNG file. Call plt.savefig() before plt.show(), because the figure may be cleared once it has been shown.

#Save plot as PNG
plt.savefig('correlation.png')
#Plot chart
plt.show()

This is what it returns:

scatter plot

The scatter plot makes it a little easier to see that there really is not a correlation between the total number of reviews and CTR. If our hypothesis were correct, the line would have an upward slope because an increased number of reviews would correlate to a higher CTR.
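If you would rather read the slope as a number than judge it by eye, the values returned by stats.linregress() can be printed directly. This is a rough sketch on made-up numbers, not the tutorial's data.

```python
import numpy as np
from scipy import stats

# Made-up values for illustration only; not the tutorial's data
reviews = np.array([5, 40, 12, 80, 33, 7, 21, 64])
ctr = np.array([0.04, 0.05, 0.03, 0.04, 0.06, 0.05, 0.04, 0.05])

gradient, intercept, r_value, p_value, std_err = stats.linregress(reviews, ctr)

# gradient: predicted change in CTR for each additional review
# r_value ** 2: share of the variation in CTR explained by the fitted line
print('Slope =', gradient)
print('R squared =', r_value ** 2)
```

A slope close to 0 together with a small R squared tells the same story as a flat regression line on the chart.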

Calculating mean & frequency

Now I’m curious if there is a difference in click through rates between the two types of business units I examined. So we’ll filter the dataframe and calculate the mean CTR for each subgroup.

#Filter dataframe to only show business type 1
unit1 = df.loc[df['business_type'] == 1]
unit1_avg = np.mean(unit1['CTR'])
print('Mean CTR for business type 1 =',unit1_avg)

#Filter dataframe to only show business type 0
unit0 = df.loc[df['business_type'] == 0]
unit0_avg = np.mean(unit0['CTR'])
print('Mean CTR for business type 0 =',unit0_avg)

To chart this, we can use this code:

#Plot bar chart comparing means
objects = ('Type 1', 'Type 0')
y_pos = np.arange(len(objects))
averages = [unit1_avg,unit0_avg]
plt.bar(y_pos, averages, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('CTR')
plt.title('Mean CTR')
plt.ylim(0, 1) #Manually set y-axis to max 100% CTR
plt.savefig('ctr.png')
plt.show()

We’ll also save this as a PNG, which looks like this:

Finally, I’m interested in how many of each type of business there were in that calculation. To calculate the frequency, we’ll count the rows in each filtered dataframe.

#Return frequency of data
unit1_frequency = unit1.shape[0] #Returns number of rows
print('Number of business type 1 = ',unit1_frequency)
unit0_frequency = unit0.shape[0] #Returns number of rows
print('Number of business type 0 = ',unit0_frequency)
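As an alternative to filtering each business type separately, Pandas can calculate the mean and the frequency for every group in one step with groupby(). The rows below are made up for illustration, but the column names match the ones used above.

```python
import pandas as pd

# Made-up rows for illustration; column names match the ones used in this tutorial
sample = pd.DataFrame({
    'business_type': [1, 0, 1, 1, 0, 0, 1],
    'CTR': [0.04, 0.05, 0.03, 0.05, 0.04, 0.03, 0.04],
})

# One row per business type, with the mean CTR and the number of rows in each group
summary = sample.groupby('business_type')['CTR'].agg(['mean', 'count'])
print(summary)
```

This scales better if you later compare more than two business types, because no extra filter is needed for each new type.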

Interpreting the results

From the data, we can see that click through rates are almost equal between the two types of businesses.

It also appears that ratings and the number of reviews are not related to click through rates, but we need to be very specific about this conclusion. For example, let’s take a look at some of the limitations of this experiment:

  • This only looks at the CTR on the Google My Business listing. Web users may have clicked through to the website through the main organic search results section.
  • Or maybe they’re not searching for a new business. We don’t know the intent of people searching. Are they researching businesses, or do they already know the business and are just looking up an address or phone number?
  • Or maybe they just saw the business listing because they were searching for something else. Google shows related results, so a lot of the impressions may have come from people who were actually looking for something else.

So we have to be very careful not to overgeneralize the results. We can’t say reviews don’t matter in the consumer buying process. We can only say reviews do not seem to have an impact on CTRs of Google My Business listings for the two types of businesses that were examined. That’s a very important distinction when you’re analyzing data like this.
