Building a Brain - Distributing Machine Learning Models in NoSQL
Author :: Kevin Vecmanis
In this post I walk through an architecture model for building better operational intelligence into VanAurum by distributing and accessing many machine learning models in MongoDB, a popular open-source NoSQL database.
In this article you will learn:
- Terminology differences between NoSQL and SQL
- Pulling structured SQL data using SQLAlchemy
- Training a machine learning model using MLAutomator
- Storing the model in MongoDB with all the details required to load and make predictions
- How to load a model from MongoDB
Table of Contents
- Thoughts on System architecture
- Structure of the Distributed Model Cluster (Brain)
- Structure of VanAurum’s model cluster
- Pulling structured data from PostgreSQL
- Prepare data for machine learning algorithms
- Training the models with MLAutomator
- Saving the Model and Metadata to MongoDB
- Loading the Model and Metadata from MongoDB
- List of helper functions
Operational Intelligence (OI) is a term that describes one of the more immediate and practical use cases of machine learning and AI in business. As machine learning becomes more and more capable, the possibilities for OI will grow with it.
Wikipedia has this to say about OI:
The purpose of OI is to monitor business activities and identify and detect situations relating to inefficiencies, opportunities, and threats and provide operational solutions. Some definitions define operational intelligence as an event-centric approach to delivering information that empowers people to make better decisions, based on complete and actual information.
You can define operational intelligence however you like, but I regard it as just varying degrees of analysis automation. Questions that at one time would need to be answered manually by a person are now coming under the purview of OI. Consider the following questions:
- How does my customer purchasing behaviour change on days when it’s rainy?
- Which of my customers are most likely to convert to a premium service, given that they already purchase the standard service?
- Which 3 asset classes have the highest a priori probability of rising over the next 2 months?
These are all questions that require you to source data, build a model, and project that model forward into the future. There is real leg-work required for people to do this, so why should we expect anything different from an AI system delivering OI insights?
With any intelligent or semi-intelligent system, I think it’s unreasonable to expect it to know something until it knows it.
Consider yourself and everything that you currently know. You’re reading the words on this page without any thought or consideration for the individual letters in the words. You’re likely not even consciously thinking about individual words - you’re just reading, and the words are stimulating “concepts” in your mind that your brain is stringing together from past “models” it has learned from.
At one point in your life you didn’t know what the letter A was. A capital A has three lines in it, and at one point you didn’t even know what a line was. These are all things we learn to identify over time, through exposure and repetition. Learning is the process of piling one abstract concept on top of another:
- Lines make letters
- Letters make up words
- Words make up sentences
- Sentences express concepts
- The concept in a sentence can elicit even higher level concepts - humour, irony, paradox.
- Sentences can comprise paragraphs, pages, and books that can represent an entire life’s worth of abstraction and ideas.
The message here is that you don’t know something until you do. So if we’re going to build an operational intelligence framework, why should the process be any different?
In the rest of this post I’m going to walk through the implementation of an OI system I’m building for VanAurum. The code here is as robust as it needs to be for demonstration purposes, but no more. I hope you enjoy it.
Thoughts on System Architecture
Building an operational intelligence framework in the manner I’m going to discuss requires a solid data infrastructure and foundation. In my own opinion, 90-95% of the time spent designing any holistic machine learning system should be spent thinking through, designing, and building the data infrastructure. Its importance can’t be overstated - it’s like the foundation of a skyscraper.
It also requires extensive engagement with the business professionals in your organization. Regardless of how good your data infrastructure is, if you’re not collecting the data that’s needed to make smart business decisions or add customer value it’s all for nothing.
The framework we’re looking at here is as follows:
- Extract data from structured data sources: In this case I’m pulling data from a PostgreSQL database hosted on Heroku. This database contains all the structured data VanAurum uses to generate market analysis.
- Transform the data into a format applicable to each learning instance, which I’ll describe below.
- Fit an optimal machine learning pipeline to the data.
- Store the pipeline in a NoSQL database with all of the metadata required to access the same database the model was trained on (so that it can perform predictions in the future on an updated version of the database table).
- Retrieve the models and their metadata so that they can be used for OI.
So what kind of stuff do I want to do with this OI layer? In particular, I want VanAurum to be able to answer questions like:
- What is the probability that Gold prices will be higher in 5 days? 20 days? 161 days?
- What five assets have the highest probability of trading lower in one month?
How the models are trained, and the extent to which this data can be relied on in relatively non-deterministic systems like financial markets is a topic for another article.
Hypothetically speaking, these questions would be fairly easy to answer if you had a database table of probabilities with a schema roughly like this:
```
TABLE: Probabilities
SCHEMA:
    (PRIMARY KEY) AssetName : TYPE char
    1_Day_Probability       : TYPE float
    2_Day_Probability       : TYPE float
    3_Day_Probability       : TYPE float
    4_Day_Probability       : TYPE float
    ...
    N_Day_Probability       : TYPE float
```
You could answer the first question like:
```sql
SELECT 5_Day_Probability, 20_Day_Probability, 161_Day_Probability
FROM Probabilities
WHERE AssetName = 'GOLD';
```
And you could answer the second question like:
```sql
-- Note: VanAurum predicts the probability of a rise.
-- Probability of a fall = (1 - probability of a rise)
SELECT TOP 5 AssetName, 21_Day_Probability
FROM Probabilities
ORDER BY 21_Day_Probability ASC;
```
Simple enough, right? The goal of the OI system is to make intelligent insights easy enough to access that business professionals using standard analytics software (like Mode or Metabase) can get them straight from the user interface. This type of simple query could easily be handled by off-the-shelf OI software - and this is our goal.
Many business professionals don’t have the time, desire, or knowledge to perform complex SQL queries - and this is the value that OI analytics software can deliver. But what if our Probabilities table didn’t exist? Answering these questions en masse would become difficult.
This is where our distributed model cluster (brain) comes into play. The intention of this cluster of models is to push value-added information to a structured dataset with a schema that makes it more easily accessible by analytics software. With a robust underlying data-ingestion framework, a cluster of models can be built on top of it to facilitate any type of OI that the data permits.
Structure of the Distributed Model Cluster (Brain)
NoSQL databases lend themselves well to storing machine learning models. Some of the reasons are:
- You don’t need a predefined schema - not a deal breaker here, but if you don’t need one why make one?
- Very well suited to hierarchical data storage - you’ll see why this benefits us when you see VanAurum’s structure below.
- Horizontally scalable - which means you can easily cluster NoSQL databases. This is a huge benefit for accumulating machine learning models because you want to be able to quickly and easily store and access models. It wouldn’t be much of a “brain” if you couldn’t!
Quick note: Going forward, when I mention the following NoSQL terms you can think of the SQL equivalent:
- Collection = Table
- Document = Row/Record
All of the other shortcomings of NoSQL databases - like the lack of support for complex relations and queries - aren’t really a design concern for us. Why? Because all we want is a quick way to store and load machine learning models in a hierarchical format. Each model will be explicitly called when it needs to be, and there are no relationships between any of the models. We also want horizontal scalability so that we can add thousands of models without significant degradation in retrieval time. If this isn’t an application for NoSQL, I’m not sure what is!
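To make the Collection/Table and Document/Row mapping concrete, here is a minimal sketch of what one model entry might look like as a Python dict destined for PyMongo. The field names and values here are illustrative assumptions, not VanAurum’s actual schema:

```python
# Hypothetical sketch: a "document" (SQL: row) destined for the "GOLD"
# "collection" (SQL: table). Note that no schema is declared anywhere --
# the dict's keys ARE the structure.
model_document = {
    "model_name": "GOLD_25",                   # asset + return period
    "asset": "GOLD",                           # which collection it lives in
    "return_period": 25,                       # forward-return horizon in days
    "features": ["close", "rsi_14", "macd"],   # columns used in training
}

# With a live MongoDB connection this would be stored with, for example:
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017/")
#   client["vanaurum_brain"]["GOLD"].insert_one(model_document)
print(sorted(model_document.keys()))
```

Because documents are just key-value structures, adding a new field to future models requires no migration - older documents simply won’t have it.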
Structure of VanAurum’s Model Cluster
You can see the hierarchical and horizontal nature of this model cluster - this will make the cluster extremely scalable if we use the right cloud technology (like MongoDB on Heroku or DynamoDB on AWS).
What we want to do is train models on predicting a comprehensive variety of N-day return periods for whatever asset class we want to make predictions on. Assets could be things like: Gold, Copper, S&P 500 index, Bitcoin, ratios, or individual stocks.
In this article I’m going to walk through a demonstration for one asset class - Gold - and one return period - 25 days.
I’m using the python library for MongoDB - PyMongo. Let’s dive into the code!
Pulling Structured Data From PostgreSQL
The following function pulls data from a PostgreSQL database hosted on Heroku and transforms it into a training dataset with a single return period as the target.
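The original function isn’t reproduced here, but a minimal sketch of this step might look like the following. The function name, the `close` column, and the CSV naming convention are assumptions for illustration; the connection string would be your own Heroku PostgreSQL URI:

```python
import pandas as pd
from sqlalchemy import create_engine

def get_asset_data(asset, connection_string, return_period=25):
    """Pull one asset table from the database and build a binary target.

    `asset` is the table name (e.g. "GOLD"); the target flags whether the
    close is higher `return_period` trading days later.
    """
    engine = create_engine(connection_string)
    df = pd.read_sql_table(asset.lower(), con=engine)

    # Target: did the price rise over the next `return_period` days?
    df["target"] = (df["close"].shift(-return_period) > df["close"]).astype(int)

    # The last `return_period` rows have no forward return yet -- drop them.
    df = df.iloc[:-return_period]

    # Persist for the next step in the pipeline.
    df.to_csv(f"{asset}_{return_period}_day_training.csv", index=False)
    return df
```

SQLAlchemy handles the dialect differences for us, so the same function works against Heroku PostgreSQL in production and SQLite locally.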
Important: There are helper functions used by some of the following methods. All the helper functions are included at the end of this article: List of helper functions. I’ll include a `# Helper` tag beside each method that is a helper. I wanted to stick to the point here and not clutter the code sections too much.
The next thing we want to do is load this .csv file and split it into feature/target `ndarray` types that are readable by our machine learning algorithms.
Prepare Data for Machine Learning Algorithms
Note that in this method we return not only the X and Y training/target data, but also `features`. We’re going to store this in our NoSQL document entry so that when we retrieve models we know which columns from our base data table were used in the training process.
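A sketch of what this split might look like (function name and CSV naming convention are illustrative assumptions carried over from the previous step):

```python
import pandas as pd

def get_training_data(asset, return_period=25):
    """Load the saved .csv and split it into X, Y, and the feature list.

    Returns plain ndarrays that feed straight into scikit-learn-style
    estimators, plus `features` -- the column names used, which later
    gets stored in MongoDB so the training table can be reconstructed.
    """
    df = pd.read_csv(f"{asset}_{return_period}_day_training.csv")

    # Every column except the target is a feature.
    features = [col for col in df.columns if col != "target"]
    X = df[features].to_numpy()
    Y = df["target"].to_numpy()
    return X, Y, features
```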
Training the Model With MLAutomator
Here we’re going to do a few things:
- Declare several variables that get saved in the MongoDB document along with each model:
  - `asset`: the asset table to pull from our PostgreSQL database.
  - `sources`: VanAurum’s assets are split into their own database tables. If joins are required to reconstruct the training dataset, the complete list of tables is listed here. In this case we’re only using a single table.
  - `return_period`: used to create our target column. Also gets saved to MongoDB as an identifier for the model (see the diagram shown previously).
- Create the training dataset using our method, producing the X and Y training data along with `features`, which gets saved as metadata for our MongoDB model so that the training table can be reconstructed.
- Train our model using MLAutomator - this is my open source library for fast model selection using Bayesian optimization. See it on GitHub here: MLAutomator
Saving the Model and Metadata to MongoDB
As mentioned previously, there are a few breadcrumbs we want to save as metadata so that our models will be useful in the future.
- We want to be able to reconstruct the data the model was trained on so that we can make predictions on the same data in the future.
- MLAutomator saves a complete pipeline (pre-processing, feature selection, and training), so we don’t need to worry about those here.
- We do need to save:
  - The schema (column names) from the original structured data that were used.
  - `model_name`, so that we can reference the model later.
  - A list of all the source tables used to build the data.
  - The connection string to access the database that the training data was pulled from. Why? Because the new data we want to make predictions on will appear there in the future.
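One way to implement this step with `pickle` and PyMongo is sketched below. The database name, URI, and helper names are assumptions; the PyMongo import is kept local to `save_model` so the metadata-building half runs without a live server:

```python
import pickle

def build_model_document(model, model_name, asset, sources, features,
                         return_period, connection_string):
    """Bundle a fitted pipeline with the metadata needed to reuse it later."""
    return {
        "model_name": model_name,                # e.g. "GOLD_25"
        "asset": asset,                          # collection to store it under
        "sources": sources,                      # tables needed to rebuild the data
        "features": features,                    # schema columns used in training
        "return_period": return_period,          # target horizon / identifier
        "connection_string": connection_string,  # where fresh data will appear
        "model": pickle.dumps(model),            # the serialized pipeline itself
    }

def save_model(document, mongo_uri="mongodb://localhost:27017/"):
    """Insert the document into its asset's collection (one collection per asset)."""
    from pymongo import MongoClient  # local import: only needed with a live server
    client = MongoClient(mongo_uri)
    client["vanaurum_brain"][document["asset"]].insert_one(document)
```

`pickle.dumps` produces bytes, which PyMongo stores as BSON binary, so the whole pipeline rides along inside the document next to its metadata.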
Loading the Model and Metadata from MongoDB
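Loading is the mirror image of saving: fetch the document by `model_name`, deserialize the pipeline, and keep the metadata alongside it. As above, names and the URI are illustrative assumptions, with the PyMongo import kept local to the function that needs a live server:

```python
import pickle

def load_model(model_name, asset, mongo_uri="mongodb://localhost:27017/"):
    """Fetch a model document by name and return (model, metadata)."""
    from pymongo import MongoClient  # local import: only needed with a live server
    client = MongoClient(mongo_uri)
    document = client["vanaurum_brain"][asset].find_one({"model_name": model_name})
    return unpack_document(document)

def unpack_document(document):
    """Split a stored document into the fitted model and its metadata."""
    model = pickle.loads(document["model"])
    # Everything except the pickled blob and Mongo's internal _id is metadata.
    meta = {k: v for k, v in document.items() if k not in ("model", "_id")}
    return model, meta
```

The returned metadata tells us which tables and columns to query for fresh data, so the model can score an updated version of its original training table.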
Tying These Methods Together
To tie these last few code segments together, here’s what that complete process looks like:
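A hedged sketch of the end-to-end driver is below. The step functions are passed in as arguments so the flow is easy to test in isolation; in the real system they would be the pull/prepare/train/save methods discussed in this post (all names here are illustrative):

```python
def build_brain_entry(asset, return_period, pull, prepare, train, save,
                      connection_string="postgresql://..."):
    """End-to-end: pull data, build X/Y, train a model, and store it with metadata.

    `pull`, `prepare`, `train`, and `save` are the pipeline steps, injected
    as callables so the flow can be exercised without live databases.
    """
    # 1. Extract the structured data and write the training .csv.
    pull(asset, connection_string, return_period)

    # 2. Split into feature/target arrays plus the feature-name list.
    X, Y, features = prepare(asset, return_period)

    # 3. Fit the pipeline.
    model = train(X, Y)

    # 4. Store the model with the breadcrumbs needed to reuse it later.
    document = {
        "model_name": f"{asset}_{return_period}",
        "asset": asset,
        "sources": [asset],
        "features": features,
        "return_period": return_period,
        "connection_string": connection_string,
    }
    save(model, document)
    return document
```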
Checkpoint - What have we done so far?
So far we have basically built one branch of our original cluster structure:
- We have one collection, `GOLD`, and one document - our 25-day Gold model.
- We’ve built the methods necessary to:
- Pull structured data
- Build training data sets
- Train models
- Save models into our cluster
- Load models from our cluster
Filling in the rest of our cluster is just a matter of looping through the asset tables and desired return periods and executing these methods.
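That loop might be sketched like this (the asset names and horizons are example values, not VanAurum’s actual configuration):

```python
# Hypothetical sketch: enumerate every (asset, horizon) pair in the cluster.
ASSETS = ["GOLD", "COPPER", "SP500", "BITCOIN"]   # asset tables (examples)
RETURN_PERIODS = [1, 5, 10, 25, 63, 126, 252]     # trading-day horizons

def all_model_names(assets, return_periods):
    """One model per (asset, horizon) pair, named like 'GOLD_25'."""
    return [f"{asset}_{period}"
            for asset in assets
            for period in return_periods]

# In the real system, each name corresponds to one run of the full
# pull -> prepare -> train -> save pipeline shown in this post.
for name in all_model_names(ASSETS, RETURN_PERIODS):
    print(name)
```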
Looping back to our original objective…
What was the point of this again? Well, we want to break down as many barriers as possible between business professionals and intelligent data insights. Operational Intelligence systems don’t become intelligent on their own - they require robust data infrastructure, purposeful data collection, and a streamlined process for identifying areas where OI clusters like this should be built.
A lot can be accomplished from an analytics perspective just by having the right data structured in the right schema - machine learning doesn’t need to be applied everywhere. For some insights, you do need machine learning - and if you need to cover a wide breadth of questions, NoSQL clusters like this are one way to build it at scale!
I hope you enjoyed this post. Below this are the helper methods and additional libraries that were used in the main methods above. Feel free to check them out.