In this post I walk through one of the coolest features in Google BigQuery - the ability to train and evaluate machine learning models directly in BigQuery using SQL syntax. I'm going to import Google's public e-commerce dataset from Google Analytics and build a machine learning model that predicts return buyers.
In this article you will learn:
- What BigQuery ML is and what its advantages are
- How to import Google's open source e-commerce data
- How to perform some exploratory analysis using BigQuery
- How to train and identify a poor machine learning model in BigQuery
- How to improve on a poor model by adding additional features
Who should read this?
- You should already have a Google Cloud account.
- You should be familiar with SQL syntax.
- You should have some familiarity with Machine Learning models.
Table of Contents
- Getting into the BigQuery ML Environment
- Exploratory Analysis: What % of Visitors Made a Purchase?
- Exploratory Analysis: Top and Bottom Products by Sales
- Exploratory Analysis: Visitors Who Bought on a Return Visit
- Build a Machine Learning Model to Predict Future Buyers
- Create a BigQuery Dataset to Store ML Models
- Improving Model Performance with Better Features
BigQuery ML is a service in Google’s popular BigQuery platform that enables users to create and execute machine learning models with minimal code using standard SQL queries. If you use BigQuery to store and analyze data, BigQuery ML can be a huge value-add to your data analysis and analytics programs.
Google touts the following as the key advantages of BigQuery ML over other approaches to using ML with a cloud-based data warehouse:
BigQuery ML democratizes the use of ML by empowering data analysts, the primary data warehouse users, to build and run models using existing business intelligence tools and spreadsheets. This enables business decision making through predictive analytics across the organization. There is no need to program an ML solution using Python or Java. Models are trained and accessed in BigQuery using SQL, a language data analysts know.

BigQuery ML increases the speed of model development and innovation by removing the need to export data from the data warehouse. Instead, BigQuery ML brings ML to the data. Exporting and re-formatting the data:
- Increases complexity — Multiple tools are required.
- Reduces speed — Moving and formatting large amounts of data for Python-based ML frameworks takes longer than model training in BigQuery.
- Requires multiple steps to export data from the warehouse, restricting the ability to experiment on your data.
- Can be prevented by legal restrictions (such as HIPAA guidelines).
One of the major shortcomings of BigQuery ML, in my opinion, is its lack of support for some of the more popular and top-performing machine learning models. At the time of this writing, BigQuery ML only supports linear regression for forecasting, logistic regression for classification, and k-means clustering for data segmentation.
These models can take you a long way in the majority of applications - so if you're comfortable with these options, BigQuery ML could be a great fit for you.
To access BigQuery ML from your Google Cloud console, you'll need to navigate to the Big Data section in the Navigation menu. From there, click BigQuery.
From here, the next thing we're going to do is add the Google Analytics e-commerce dataset. This dataset is pretty awesome because it contains real-world analytics data for Google's Merchandise Store. The dataset contains real-world flaws and requires real-world steps to process and clean the data before training - which we're going to do!
Once you have the BigQuery interface open you can click this link to automatically pin the dataset to your own console.
Because this is Google Analytics data, there is awesome documentation for all the column names here: Google Analytics Schema. You should open this in a new tab and keep it open because we'll reference it several times in this article.
You should see the Data-to-Insights data resource in the menu on the left. It should look something like this…
We're going to use the web_analytics table to perform this analysis.
Quick Tip: When you have a dataset open, you can click Query Table and it will populate the query editor with the proper table reference. For the web_analytics table, you'll see this appear:
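The populated reference looks along these lines (the `data-to-insights.ecommerce.web_analytics` path assumes you pinned the public dataset linked above):

```sql
-- Auto-generated starter query from the Query Table button
SELECT
  -- add the columns you want here
FROM `data-to-insights.ecommerce.web_analytics`
LIMIT 1000
```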
To find out the percentage of visitors that made a purchase, we can use the following query.
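A version of that query, assuming the pinned `data-to-insights.ecommerce.web_analytics` table:

```sql
-- Conversion rate: distinct purchasers divided by distinct visitors
WITH visitors AS (
  SELECT COUNT(DISTINCT fullVisitorId) AS total_visitors
  FROM `data-to-insights.ecommerce.web_analytics`
),
purchasers AS (
  SELECT COUNT(DISTINCT fullVisitorId) AS total_purchasers
  FROM `data-to-insights.ecommerce.web_analytics`
  WHERE totals.transactions IS NOT NULL  -- at least one transaction in the session
)
SELECT
  total_visitors,
  total_purchasers,
  total_purchasers / total_visitors AS conversion_rate
FROM visitors, purchasers
```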
Breaking Down This Query:
The `WITH` clause contains one or more named subqueries which execute every time a subsequent `SELECT` statement references them. Any clause or subquery can reference subqueries you define in the `WITH` clause, including `SELECT` statements on either side of a set operator such as `UNION ALL`. We use this primarily for readability. Here we declare two named subqueries:
- `visitors` pulls all the unique `fullVisitorId`s from the web_analytics table and saves the count under that alias.
- `purchasers` is the same query as `visitors`, except it filters for rows where a transaction has taken place.
- We then calculate the conversion rate by dividing the number of purchasers by the number of visitors.
Click Run Query, and you should see a result like this:
We can see that the conversion rate on visitors to Google's Merchandise Store is 2.69% from 741,721 total visitors.
Another thing this data permits us to do is find the best-selling and worst-selling products in Google's Merchandise Store. To do this, we can run a query like the following:
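One way to write that query (same assumed table path as before; GA revenue fields are stored in micros, hence the division by 1,000,000):

```sql
-- Top 5 products by revenue; hits and hits.product are nested arrays,
-- so each must be unnested into rows before we can aggregate
SELECT
  p.v2ProductName,
  p.v2ProductCategory,
  SUM(p.productQuantity) AS units_sold,
  ROUND(SUM(p.localProductRevenue) / 1000000, 2) AS revenue
FROM `data-to-insights.ecommerce.web_analytics`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 5
```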
Breaking Down This Query:
You'll probably notice the `UNNEST` command here. `UNNEST` takes an array-type entry and converts it into a table. If you visit the Google Analytics Schema that we referenced before, you can see why this is needed. This data has a `RECORD` entry for `hits`, and you'll notice that in the schema there are sub-entries below it, like `hits.<subfield>` — in other words, `hits` itself is an array of fields. Calling `UNNEST(hits) AS h` converts this array into a table and saves it as `h`. Right after this we have another command, `UNNEST(h.product) AS p`. This is because each `hits.<subfield>` is broken down further into `hits.<subfield>.<deeper_subfield>`, so to access these we need to unpack that array into a table as well and call it `p`. Make sense? These are just nested entries that are stored as arrays in the parent table that we need to unpack to access.
The next line, `GROUP BY 1, 2`, just means to group by the first and second columns of the resulting table, regardless of their names. We could also have written out the column names explicitly and gotten the same results. The results should look like this:
We can see that the top selling products on Google’s Merchandise Store are:
- Nest® Learning Thermostat 3rd Gen-USA - Stainless Steel
- Nest® Cam Outdoor Security Camera - USA
- Nest® Cam Indoor Security Camera - USA
- Nest® Protect Smoke + CO White Wired Alarm-USA
- Nest® Protect Smoke + CO White Battery Alarm-USA
Conversely, if we flip the `DESC` flag to `ASC`, we get the worst-selling products:
- Google Youth Girl Tee Green
- Google Men’s Microfiber 1/4 Zip Pullover Grey/Black
- Google Men’s Short Sleeve Hero Tee Charcoal
- YouTube Twill Cap
- Android Twill Cap
Suppose we want to know how many visitors bought on a return visit after they had visited the site once (whether or not they bought on that first visit).
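A query in this shape answers the question; the subquery name `all_visitor_stats` and the exact predicates are one reasonable formulation:

```sql
-- Flag each visitor with 1 if they transacted on a non-first visit, else 0,
-- then count how many visitors fall in each bucket
WITH all_visitor_stats AS (
  SELECT
    fullVisitorId,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
      AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullVisitorId
)
SELECT
  COUNT(DISTINCT fullVisitorId) AS total_visitors,
  will_buy_on_return_visit
FROM all_visitor_stats
GROUP BY will_buy_on_return_visit
```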
Breaking Down This Query:
You can read this query as:
- Make a named subquery of all visitor stats.
- Retrieve the `fullvisitorid` column, and create a new column called `will_buy_on_return_visit`, which:
  - is 1 if the `fullvisitorid` has at least 1 transaction and it's not their first time on the site;
  - is 0 otherwise.
- Group these results by the `fullvisitorid`.
Note that in Google's Schema, `totals.newVisits` is defined as:
- Total number of new users in session (for convenience). If this is the first visit, this value is 1, otherwise it is null.
We can see that of 729,848 visitors, only 11,873 returned to make a purchase! That's only 1.6%. Note that this includes the subset of buyers who bought on their first visit and also returned to buy again.
Now we're going to dive into the magic of BigQuery ML to train a machine learning model that predicts whether a buyer will return in the future to make a purchase. Predicting this accurately allows you to be more precise with ad-spend dollars by retargeting the visitors with the highest probability of coming back to make a purchase.
To start, we’re going to train a model on a very small subset of features and see how well the model works. This is just to get a taste of how ML models are built in BigQuery.
We’ll build our model on the following three features:
- `totals.pageviews`: total number of pageviews within the session.
- `totals.bounces`: whether the visitor left the website immediately.
- `totals.timeOnSite`: how long the visitor was on the website.
We need to assemble this data along with the target variable that we want to predict: `will_buy_on_return_visit`. To do this, we're going to create two queries and then join them:
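A sketch of that assembled training data, with the three features joined to the label (the `IFNULL` defaults and the `LIMIT` preview are choices, not requirements):

```sql
SELECT
  * EXCEPT(fullVisitorId)
FROM (
  -- features: first-time visits only
  SELECT
    fullVisitorId,
    IFNULL(totals.pageviews, 0) AS pageviews,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
  FROM `data-to-insights.ecommerce.web_analytics`
  WHERE totals.newVisits = 1
)
JOIN (
  -- label: did this visitor buy on a later (non-first) visit?
  SELECT
    fullVisitorId,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
      AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullVisitorId
)
USING (fullVisitorId)
LIMIT 10
```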
In order to store ML models, you'll need to create a new dataset (just as you store data tables in your project). To do this, click on your project name in the left pane, and then click Create Dataset.
Once you've created a dataset, you can write the code for the ML model. Because we're doing classification (predicting a binary value for whether or not the visitor will return and purchase), we're going to use logistic regression.
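A sketch of the model definition; the dataset name `ecommerce` and model name `classification_model` are placeholders for whatever you created above, and the date filter trains on the first nine months so the later months can serve as test data:

```sql
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS (
  model_type = 'logistic_reg',
  labels = ['will_buy_on_return_visit']
) AS
SELECT
  * EXCEPT(fullVisitorId)
FROM (
  SELECT
    fullVisitorId,
    IFNULL(totals.pageviews, 0) AS pageviews,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
  FROM `data-to-insights.ecommerce.web_analytics`
  WHERE totals.newVisits = 1
    AND date BETWEEN '20160801' AND '20170430'  -- train on the first 9 months
)
JOIN (
  SELECT
    fullVisitorId,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
      AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullVisitorId
)
USING (fullVisitorId)
```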
Running this will take a few minutes…
When it’s done, we can explore the model’s performance parameters. BigQuery ML is actually pretty awesome in this regard - it provides analytics out of the box that give you a great sense of how well your model performs!
When the model finishes, you’ll see a Go to model button appear on the right side of the screen. Click on this. You’ll have four tabs of performance metrics to explore - click on the Evaluation tab and you’ll see the following:
Our AUC metric is 0.7788. Let’s test the model on out-of-sample data and see how well we perform.
Evaluating the model is as simple as calling the `ML.EVALUATE` command on a test data query. We can do this accordingly:
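Something like the following, assuming the model was stored as `ecommerce.classification_model` and trained on the first nine months, so the next two months serve as out-of-sample data:

```sql
SELECT
  roc_auc,
  -- bucket the score into a human-readable quality rating
  CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'not great'
    ELSE 'poor'
  END AS model_quality
FROM ML.EVALUATE(MODEL `ecommerce.classification_model`, (
  SELECT
    * EXCEPT(fullVisitorId)
  FROM (
    SELECT
      fullVisitorId,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(totals.bounces, 0) AS bounces,
      IFNULL(totals.timeOnSite, 0) AS time_on_site
    FROM `data-to-insights.ecommerce.web_analytics`
    WHERE totals.newVisits = 1
      AND date BETWEEN '20170501' AND '20170630'  -- out-of-sample months
  )
  JOIN (
    SELECT
      fullVisitorId,
      IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
        AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullVisitorId
  )
  USING (fullVisitorId)
))
```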
Here we set up thresholds for a performance column we'll call `model_quality`. When we run this evaluation query, we'll see something like this…
We can see that our roc_auc score on the test data was 0.74155, which equates to "not great" in our scoring rubric. Let's see if we can beat this score by adding a richer feature set to the training data!
With machine learning models, the judicious addition of features can dramatically improve model performance because it gives the algorithm more dimensions from which to find structure in the data.
Perhaps there is structure hidden in the following features that can help our model (you can add any features you like here!):
- How far the visitor got in the checkout process on their first visit.
- Where the visitor came from (traffic source: organic search, referring site, etc.)
- Device category (mobile, tablet, desktop)
- Geographic information (country)
Let's create a second model, called `classification_model02`, that uses this richer feature set. Note that the line referencing `hits.eCommerceAction.action_type` adds the feature for how far the customer got in the checkout process. We cast this as an integer because Google defines it as a set of levels from 0 to 8:
- Click through of product lists = 1,
- Product detail views = 2,
- Add product(s) to cart = 3,
- Remove product(s) from cart = 4,
- Check out = 5,
- Completed purchase = 6,
- Refund of purchase = 7,
- Checkout options = 8,
- Unknown = 0.
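Putting it together, the second model can be sketched like this; the exact feature list and aliases are one reasonable choice, and the date range matches the first model's training window:

```sql
CREATE OR REPLACE MODEL `ecommerce.classification_model02`
OPTIONS (
  model_type = 'logistic_reg',
  labels = ['will_buy_on_return_visit']
) AS
WITH all_visitor_stats AS (
  -- per-visitor label: bought on a later (non-first) visit?
  SELECT
    fullVisitorId,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
      AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullVisitorId
)
SELECT
  * EXCEPT(unique_session_id)
FROM (
  SELECT
    CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_session_id,
    will_buy_on_return_visit,
    -- how far the visitor got in the checkout funnel (levels 0-8 above)
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    IFNULL(totals.pageviews, 0) AS pageviews,
    trafficSource.source AS traffic_source,
    trafficSource.medium AS traffic_medium,
    channelGrouping AS channel_grouping,
    device.deviceCategory AS device_category,
    IFNULL(geoNetwork.country, '') AS country
  FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
  JOIN all_visitor_stats USING (fullVisitorId)
  WHERE totals.newVisits = 1
    AND date BETWEEN '20160801' AND '20170430'  -- same 9 months as the first model
  GROUP BY
    unique_session_id, will_buy_on_return_visit, bounces, time_on_site,
    pageviews, traffic_source, traffic_medium, channel_grouping,
    device_category, country
)
```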
After training, we can view this model’s performance characteristics on the training data again.
We can see that this model's roc_auc score is 0.9315 - significantly better than our prior model with fewer features. Note that we're training on the same 9 months of data as the first model so that we have an apples-to-apples comparison.
Let’s see how this model performs on the testing data. The code for building the testing set is essentially the same as the first set, with a few variables changed…
Our test set roc_auc metric is 0.912797 using this model, and our rating is good.
Great! So just to recap what we have done so far:
- We have created a logistic regression model to predict which new visitors will come back to the site and make a purchase.
- We did this with a small set of features, and with a larger set of features.
- We were able to demonstrate better roc_auc model performance with the larger set of features.
Now, to make this model even more useful, we should put it to work. Let's get this model to actually tell us which new visitors have the highest probability of coming back and making a purchase.
Using `classification_model02`, we'll query our e-commerce dataset and return a list of new visitors ranked highest to lowest by their probability of returning to the site to buy.
The contents of this query should be familiar by now!
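A sketch of the prediction query, reusing the second model's feature shape on the most recent month of first-time visits:

```sql
SELECT
  *
FROM ML.PREDICT(MODEL `ecommerce.classification_model02`, (
  WITH all_visitor_stats AS (
    SELECT
      fullVisitorId,
      IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
        AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullVisitorId
  )
  SELECT
    CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_session_id,
    will_buy_on_return_visit,
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    IFNULL(totals.pageviews, 0) AS pageviews,
    trafficSource.source AS traffic_source,
    trafficSource.medium AS traffic_medium,
    channelGrouping AS channel_grouping,
    device.deviceCategory AS device_category,
    IFNULL(geoNetwork.country, '') AS country
  FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
  JOIN all_visitor_stats USING (fullVisitorId)
  WHERE totals.newVisits = 1
    AND date BETWEEN '20170701' AND '20170801'  -- most recent month of visits
  GROUP BY
    unique_session_id, will_buy_on_return_visit, bounces, time_on_site,
    pageviews, traffic_source, traffic_medium, channel_grouping,
    device_category, country
))
ORDER BY predicted_will_buy_on_return_visit DESC
```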
When we look at our output, we can see that BigQuery ML automatically added three new columns for us:
- `predicted_will_buy_on_return_visit`: the yes (1) or no (0) decision from the model.
- `predicted_will_buy_on_return_visit_probs.label`: the binary classifier label for yes/no.
- `predicted_will_buy_on_return_visit_probs.prob`: the confidence the model has in the prediction belonging to each label.
Google BigQuery ML is a really powerful platform that leverages the parallel data processing capabilities of BigQuery and integrates the ability to generate machine learning models in SQL. It allows data analysts - typically the primary users of enterprise data - to build machine learning models using a language they're familiar with (SQL). In this post we covered the following:
- Importing a dataset into Google BigQuery
- Performing exploratory analysis on Google Analytics data.
- Building multiple machine learning models using BigQuery ML and analyzing their performance.
- Using a machine learning model to get the site visitors with the highest probability of returning to the site to buy.
If you’re confused at all by the SQL syntax used in this post, Google has an excellent documentation page on BigQuery syntax.
I hope you enjoyed this post!