As a governance, risk and compliance (GRC) or audit professional, you may ask yourself: “Why bother with predictive analytics?” To answer that question, let’s start with a practical example.
Assume for a moment that you have 1,000 transactions you need to review for potential fraud. You know you only have a business day or so to review them, and you also know you have experienced fraud in the past for this particular type of transaction. The time limit you have to review the transactions would allow you to review maybe 100 of the 1,000 transactions.
A standard approach to this problem would often involve taking a random sample from the population. You might also perform a few analytic tests to pick out the riskier 100 transactions to try to get better results. This is where predictive modelling comes into play.
If you have past examples of fraud to look at, there is a good chance you could fit a predictive model to the data. Your past threshold, outlier and percentage-based tests are a good starting point for variables to try in the model. You might also consider other data for the transactions.
Assume that you then fit a model to the identified past fraud alongside some known “non-fraud” transactions. You could then use the model to score your 1,000 new transactions. Instead of selecting a random sample, or a sample based on a single test, you can select the records that the model predicts as fraud risks. This may greatly improve your chances of catching fraud, because the sample is both data-driven and risk-based. Your model would pick what it thinks are the “riskiest” transactions based on past data, rather than relying on random chance and single analytic tests.
That doesn’t mean your previously built analytic tests are no longer valuable. The best results may come from combining those simple analytic tests with an advanced predictive model. The model and the tests will both miss some fraud, but if you combine the outputs of the two you can learn even more about your data.
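To make the idea concrete, here is a minimal sketch of selecting a review sample from model scores and combining it with an existing test. Everything in it is made up for illustration: the field names (risk_score, amount), the randomly generated scores standing in for a fitted model's output, and the amount threshold standing in for one of your existing analytic tests.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# ASSUMPTION: 1,000 made-up transactions. "risk_score" stands in for the
# output of a fitted model (0 = low risk, 1 = high risk).
transactions = [
    {"id": i, "risk_score": random.random(), "amount": random.uniform(10, 5000)}
    for i in range(1000)
]

REVIEW_CAPACITY = 100    # transactions you have time to review
AMOUNT_THRESHOLD = 4900  # hypothetical existing threshold test

# Model-based sample: the 100 transactions with the highest scores.
by_score = sorted(transactions, key=lambda t: t["risk_score"], reverse=True)
model_picks = {t["id"] for t in by_score[:REVIEW_CAPACITY]}

# Rule-based picks: transactions that fail the existing threshold test.
rule_picks = {t["id"] for t in transactions if t["amount"] > AMOUNT_THRESHOLD}

# Combine the two outputs.
flagged_by_both = model_picks & rule_picks    # strongest candidates
flagged_by_either = model_picks | rule_picks  # widest net

print(len(model_picks), len(rule_picks), len(flagged_by_both), len(flagged_by_either))
```

Note that the combined list can be longer than your review capacity, so you may still need to prioritise within it, for example by starting with the transactions flagged by both.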
1. Understand the value of explainability
When you start your first predictive model project, you may be tempted to try the fanciest, newest machine learning algorithm you can find. Try to resist this temptation. You will need to get buy-in from management and from the end users who will ultimately be using the model predictions in their day-to-day work. The single most important factor in getting users to buy into a predictive modelling project is explainability. Share how the model works in everyday terms, and choose a simple model to work with so you can get acquainted with your chosen program.
A good algorithm to start with is the decision tree algorithm, because it produces a visualisation that shows exactly how it makes its predictions. Decision trees are also very friendly to new users and can generally handle many different types of data. Most predictive modelling utilities have a decision tree model option. The model works a lot like a flowchart in practice, by reading data fields in a dataset and making decisions based on them progressively. The top of the flowchart can be thought of as the “trunk” of the decision tree. You start at the “trunk” of the tree and, based on data values, you follow the tree down its branches to reach a conclusion/prediction. Take a look at the figure below to see a simple decision tree and how it makes predictions based on data. Many decision tree implementations will look similar to this graphic in form.
First (starting in the “Tenure” box), if a customer has been a customer for less than two years, then you follow the tree to the next box. The second box (“Monthly Usage”) considers the customer’s usage of their internet service. If they use 2,000 or more megabytes of data each month, then the model predicts that customer will renew their contract. This is indicated by following the “>= 2000” branch to the blue conclusion box at the bottom containing the “Renew” prediction (this box has a green border, indicating a positive outcome). Likewise, if the customer uses less than 2,000 megabytes of data per month, then the model predicts they will not renew their service subscription (“Churn”).
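Because a decision tree works like a flowchart, the walkthrough above can be written as a couple of nested if statements. This is only a hand-written sketch of the tree described in the text, not the output of any modelling tool. The function and parameter names are made up, and the text does not say what the tree predicts for customers with two or more years of tenure, so that branch is an assumption.

```python
def predict_renewal(tenure_years, monthly_usage_mb):
    """Follow the flowchart from the top box down to a prediction."""
    # First box: "Tenure".
    if tenure_years < 2:
        # Second box: "Monthly Usage", in megabytes per month.
        if monthly_usage_mb >= 2000:
            return "Renew"
        return "Churn"
    # ASSUMPTION: this branch is not described in the text; "Renew" is
    # used here purely so that the function always returns a prediction.
    return "Renew"


print(predict_renewal(tenure_years=1, monthly_usage_mb=2500))  # Renew
print(predict_renewal(tenure_years=1, monthly_usage_mb=500))   # Churn
```

This is what makes the model easy to explain: every prediction can be traced back to a short chain of plain comparisons.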
2. You need past data to predict a future outcome
A common misconception is that predictive models are magic. In reality, there is no voodoo going on behind the scenes. Conceptually, a model learns from past examples of outcomes (or values) and then uses what it learned to predict against unseen data. Whether you are doing classification (e.g., will the cat jump on the counter or not?) or numerical prediction (e.g., how high will the cat jump to get a snack?), the fact remains that you need past data to train a model to make predictions.
The idea sounds simple, but getting the data you need to build a model can be the most challenging part of a project. You can also combine a predictive model with other data analysis techniques to get better results. For example, you might build a model on past data and use it to score new data, flagging the transactions it considers risky. You could then augment this process with more traditional outlier-detection reports that ask questions like “is this transaction X percent over the average for its category?”
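The “X percent over the average for its category” test mentioned above is simple enough to sketch in a few lines. The transactions, category names and the 50 percent cut-off below are all invented for illustration; you would pick X to suit your own data.

```python
from collections import defaultdict
from statistics import mean

# ASSUMPTION: made-up transactions and categories.
transactions = [
    {"id": 1, "category": "travel", "amount": 100.0},
    {"id": 2, "category": "travel", "amount": 120.0},
    {"id": 3, "category": "travel", "amount": 400.0},
    {"id": 4, "category": "supplies", "amount": 50.0},
    {"id": 5, "category": "supplies", "amount": 55.0},
]
PERCENT_OVER = 50  # the "X" in "X percent over the average"

# Average amount per category.
amounts_by_category = defaultdict(list)
for t in transactions:
    amounts_by_category[t["category"]].append(t["amount"])
category_average = {c: mean(a) for c, a in amounts_by_category.items()}


def is_outlier(t):
    """True if the amount is more than X percent over its category average."""
    limit = category_average[t["category"]] * (1 + PERCENT_OVER / 100)
    return t["amount"] > limit


flagged = [t["id"] for t in transactions if is_outlier(t)]
print(flagged)  # [3]
```

One thing to keep in mind with a test like this is that a very large transaction also pulls up its own category average, so in small categories you may prefer a median or an average from a prior period.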
3. Don’t give up
Predictive modelling can be frustrating, especially when syntax isn’t working or results aren’t compelling, but the satisfaction of solving those problems is often well worth the frustration.
There are also many online resources for finding solutions to coding problems in common modelling languages such as R or Python.
1. Build a proof of concept model
Let’s start off with proof of concept models, since most projects will require buy-in from other stakeholders. A proof of concept model is also a great way to assess the viability of a particular modelling project.
The first thing you’ll want to do is test a proof of concept model on a small dataset to see whether the outcome is indeed predictable. This “pilot” project will show management that your model has the potential to function on larger-scale data.
Let’s walk through the steps of building a proof of concept model using a practical example. Assume you have a business problem identified: You want to know which customers in your customer database would buy your top-of-the-line product.
Step 1: Identify relevant historical data for training your model
To build a proof of concept, you will need some historical data on which to train the model. Generally, you will want to split this dataset into two parts: a training dataset and a test dataset.
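As a rough sketch of that split, the snippet below shuffles a list of records and holds back a share of them for testing. The 30 percent test share, the seed and the function name are arbitrary choices for this example, and most modelling tools offer their own built-in way of doing the same thing.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then hold back a fraction for testing."""
    shuffled = rows[:]                     # leave the original list untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# ASSUMPTION: ten made-up customer records.
customers = [{"customer_id": i} for i in range(10)]
train, test = split_train_test(customers)
print(len(train), len(test))  # 7 3
```

Shuffling before splitting matters: if the data is sorted (say, by date or by customer size), taking the first 70 percent as training data would give the model a lopsided view.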
Let’s look at how you could pull the needed historical data together to build a training dataset for the aforementioned purchase prediction example. One approach is to use your customer master table and include a data field that simply states “yes” if they bought the top-of-the-line product and “no” if they have not bought the top-of-the-line product. This “yes” or “no” data field is, in fact, the outcome that you would like to predict in your data.
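Here is one way that “yes”/“no” field could be derived, assuming you have a purchase history to look it up from. The tables, the product code PREMIUM-1 and the field name bought_top_product are all hypothetical.

```python
TOP_PRODUCT = "PREMIUM-1"  # hypothetical code for the top-of-the-line product

# ASSUMPTION: made-up customer master and purchase history tables.
customer_master = [
    {"customer_id": 1, "city": "Leeds"},
    {"customer_id": 2, "city": "Austin"},
    {"customer_id": 3, "city": "Toronto"},
]
purchases = [
    (1, "PREMIUM-1"),
    (1, "BASIC-1"),
    (2, "BASIC-1"),
]

# Customers with at least one purchase of the top product.
buyers = {cid for cid, product in purchases if product == TOP_PRODUCT}

# Add the outcome field to a copy of each customer record.
labelled = [
    dict(c, bought_top_product="yes" if c["customer_id"] in buyers else "no")
    for c in customer_master
]

for row in labelled:
    print(row["customer_id"], row["bought_top_product"])
```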
Since there are two possible outcomes in this example, your best bet is a classification model.
Step 2: Identify the independent variables on which to base predictions.
You then need to identify independent variables to test in order to identify potential predictors. Determining possible independent variables that can be used to make predictions may take some creative thinking. For example, you might try things like summarising data from the purchase history table, or location information from the customer master (e.g., zip code, city, state). You could also try descriptive information, like customer industry or firm size.
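As an illustration of “summarising data from the purchase history table”, the sketch below rolls purchases up to one row per customer and joins the result to descriptive fields from the customer master. The tables and the field names (purchase_count, total_spend, and so on) are invented; which summaries are actually useful is something you would find out by testing them.

```python
from collections import defaultdict

# ASSUMPTION: made-up purchase history and customer master tables.
purchase_history = [
    {"customer_id": 1, "amount": 200.0},
    {"customer_id": 1, "amount": 50.0},
    {"customer_id": 2, "amount": 75.0},
]
customer_master = [
    {"customer_id": 1, "state": "TX", "industry": "Retail"},
    {"customer_id": 2, "state": "NY", "industry": "Finance"},
    {"customer_id": 3, "state": "CA", "industry": "Retail"},
]

# Summarise the purchase history per customer.
purchase_count = defaultdict(int)
total_spend = defaultdict(float)
for p in purchase_history:
    purchase_count[p["customer_id"]] += 1
    total_spend[p["customer_id"]] += p["amount"]

# One row of candidate independent variables per customer.
features = [
    {
        **c,
        "purchase_count": purchase_count.get(c["customer_id"], 0),
        "total_spend": total_spend.get(c["customer_id"], 0.0),
    }
    for c in customer_master
]

for row in features:
    print(row)
```

Customers with no purchases still get a row, with zeros for the summary fields, so they are not silently dropped from the training data.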
Step 3: Train the model, and test predictions against your test dataset.
Once you’ve selected some relevant independent variables, the next step is to fit (train) a model on these variables using your training dataset. Then test the model’s predictions against the test dataset, where the outcomes are already known (i.e., historical data), and assess how many predictions the model gets right. Comparing the model’s predictions on the test dataset with the actual historical outcomes gives you a side-by-side view of how effective your variables are at predicting actual outcomes.
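The side-by-side comparison can be as simple as counting matches. In this sketch, the two lists are made-up values standing in for the known outcomes of a test dataset and a model's predictions for the same records.

```python
from collections import Counter

# ASSUMPTION: made-up outcomes for eight test records.
actual    = ["yes", "no", "no",  "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "yes", "no", "no", "no",  "no"]

# Share of predictions that match the known outcome.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{correct} of {len(actual)} correct ({accuracy:.0%})")  # 6 of 8 correct (75%)

# Break the results down by (actual, predicted) pair to see which
# kind of mistake the model makes.
breakdown = Counter(zip(actual, predicted))
for (a, p), n in sorted(breakdown.items()):
    print(f"actual={a:<3} predicted={p:<3} count={n}")
```

The breakdown is often more informative than the single accuracy figure, because missing a real “yes” and raising a false alarm usually carry different costs.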
A final thing to consider with proof of concept models is that sometimes they show that a particular model simply doesn’t work. You may find that the outcome you are trying to predict is not supported by data. It is also important to note that models are never going to be 100% accurate. Expect your proof of concept to make some wrong predictions. However, if the model is correct enough to provide actionable information to your organisation, then you can use the results to construct a business case for using the model to predict outcomes (with x% accuracy).
2. Don’t be intimidated by jargon
A lot of fancy terminology is thrown around when people talk about data science. When starting out, this can be overwhelming and confusing. I know firsthand from using different data mining tools that each tool has its own jargon for the same things. The best thing you can do is learn some basics, so that when you encounter new jargon you will at least have an idea conceptually of what the main parts of a model are.
Dependent Variable vs. Independent Variable
When someone says “Dependent Variable” (DV), the first thing that comes to mind is a scientist in a lab somewhere doing complicated things. Instead of thinking “Dependent Variable,” think “the outcome I want to predict.” You’re sure to encounter other names for DVs, but the concept is generally the same across many tools and algorithms.
Do the same for “Independent Variable” (IV), but instead think of this as “data fields that might have some useful information for prediction.”
For example, a dependent variable might be: had a car crash (“yes” or “no”). Some independent variables for this might be: driving speed, blood alcohol content and weather conditions.
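In code, the distinction usually comes down to which column you set aside as the outcome. The sketch below uses the car crash example with a few invented records; the field names and values are made up.

```python
# ASSUMPTION: made-up records for the car crash example.
records = [
    {"driving_speed_mph": 45, "blood_alcohol_content": 0.00,
     "weather": "clear", "had_car_crash": "no"},
    {"driving_speed_mph": 80, "blood_alcohol_content": 0.09,
     "weather": "rain", "had_car_crash": "yes"},
    {"driving_speed_mph": 60, "blood_alcohol_content": 0.00,
     "weather": "snow", "had_car_crash": "no"},
]

DEPENDENT_VARIABLE = "had_car_crash"  # the outcome you want to predict

# Outcomes (DV): one value per record.
outcomes = [r[DEPENDENT_VARIABLE] for r in records]

# Predictors (IVs): every other field in the record.
predictors = [
    {k: v for k, v in r.items() if k != DEPENDENT_VARIABLE}
    for r in records
]

print(outcomes)
print(sorted(predictors[0]))
```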
An easy way to think of dependent and independent variables is to consider them as outcomes (DV) and predictors (IV). Generalising concepts in predictive modelling like this opened me up to a much wider world of algorithms and tools to use, because I could figure out the basics on my own and then fill in the blanks on how to use a particular model by reviewing literature on it.
Dependent and independent variables
In the example table above, the blue column is the dependent variable (the value you want to predict), and the white columns are the independent variables (predictors). The greyed-out column at the end is a field that is not useful as a dependent or independent model variable—and is therefore obscured. You may recognise some of the fields from the decision tree flow chart in part one of this blog post series.
This is made-up data, but if you look online you can find many free datasets to hone your predictive modelling skills on.