
Topic 6

Predictive Modelling

An important aspect of data science is to find out what data can tell us about the future. What do
our Internet profiles say about our purchasing interests? How can a patient's medical history be used
to judge how well he or she will respond to a treatment [Adhikari and DeNero, Accessed on 2019]?
Credit card companies, insurance companies and the like want to detect the anomalies in time series data
that suggest fraud.
Data analysis and predictive modelling have always been important because governments and
companies use these techniques to decide the policies and strategies with which they handle change.
However, these topics have become "hot" again because giant companies such as Google and Facebook
use advanced data analytics for targeted advertising and have been earning a lot of money from it.
The components of the Kimball data warehousing/business intelligence architecture can be outlined as follows [Kimball and Ross, 2013].

• Data Sources (Topic 1): accounting system, marketing data, branch shop data, inventory system, customer feedback channel.
• Back Office (Topic 2), the ETL system: (1) transform from source to target; (2) normalisation; (3) etc. Design goals: throughput; integrity and consistency.
• Front Office (Topics 3, 4), the Enterprise Data Warehouse: (1) atomic and summary data; (2) organised by business logic; (3) business logic formal specification. Design goals: easy to use; query performance.
• Front Office (Topics 5, 6), Business Intelligence Applications: (1) ad hoc query; (2) standard reports; (3) descriptive analytics; (4) data mining and models; (5) predictive analytics.
"Predictive analytics" is a bit of a marketing term for various tasks that share the intent of deriving
predictive information directly from data. Three specific application areas stand out [Janert,
2011, Chapter 18]:
Supervised learning

Regression Models When the factors and outcomes are (continuous) numerical values, linear
algebra, nonlinear systems and statistics can be combined to identify how the outcomes
change as the factors vary. This leads to linear regression models and
nonlinear regression models [Kuhn and Johnson, 2013].
Classification Assign each record to exactly one of a set of predefined classes. For example,
classify credit card transactions as “valid” or “fraudulent”. Spam filtering is another ex-
ample. Classification is considered “supervised”, because the classes are known ahead of
time and don’t need to be inferred from the data. Algorithms are judged on their ability to
assign records to the correct class.


Unsupervised learning

Clustering Group records into clusters, where the number, size and shape of the clusters are unknown.
Clustering is considered "unsupervised", because no information about the clusters
is available ahead of the clustering procedure.
Association

Recommendation Recommend a suitable item based on past interest or behaviour. Recommendation
can be seen as a form of clustering, where you start with an anchor and then try to find items
that are similar or related to it.
The general steps to build a predictive model are as follows (a minimal end-to-end Python sketch appears after the list):
1. Collecting the data (Topic 1)

2. Preparing the data and fixing issues such as missing values and outliers (Topic 2)

3. Using exploratory analysis to study the content of your data and select an algorithm
that suits your needs (Topics 2 and 3).

4. Training a model using the algorithm you just selected. Start with a simple model that only uses
the most important variables / features.

5. Checking model performance using evaluation methods.

6. If the model is not satisfactory, choose another algorithm or introduce different variables into
the existing model.
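Purely as an illustration (not part of the original notes), the following minimal scikit-learn sketch walks through steps 2-6; the synthetic dataset, the logistic-regression model and the parameter values are placeholders chosen only to make the example self-contained.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Steps 1-2: obtain and prepare data (a clean synthetic dataset stands in for real data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 3: hold out a test set so performance can be checked later.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 4: train a simple model first.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: check model performance on the held-out data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 6: if unsatisfactory, try another algorithm or different features.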
Popular software tools: R machine learning libraries including stats, glmnet and caret; Python packages for
machine learning including scikit-learn and statsmodels; Weka (Java based); etc.
A few online resources related to predictive modelling:
• https://www.aisoma.de/10-statistical-techniques/
• https://www.dataversity.net/brief-history-data-science/
• https://www.cheatography.com/lulu-0012/cheat-sheets/test-ml/
• https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
• https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-cheat-sheet
• https://bookdown.org/jefftemplewebb/IS-6489/
• https://bookdown.org/egarpor/PM-UC3M/
• https://www.anotherbookondatascience.com/
The book references for predictive modelling, machine learning and data mining are Hastie et al.
[2008], Berry and Linoff [2004], http://cseweb.ucsd.edu/~elkan/255/dm.pdf, Hand et al. [2001], Grus
[2015] (Python), Kleinman and Horton [2010] (R and SAS), Witten et al. [2011] (Weka), Coelho and
Richert [2015], Bowles [2015], Lantz [2015], Abbott [2014], Lewis [2015], Kuhn and Johnson [2013]
(using R), Ruppert [2011], etc.

§6.1 Make Sense Out of Data


In this section it is interesting to observe how "data is money" is both "true" and "false" at the same time.
In business, "the customer is king" because customers pay for services. The "data" associated with customer
behaviour and the "data" associated with "costs" are essential for companies to find opportunities.
However, when "data becomes the king", many companies start to increase their "costs" by analysing
probably useless data (e.g. education institutes waste a huge amount of time archiving students'
assignments rather than helping them to learn). In the end the costs increase a lot, customer
satisfaction drops, and instead of increasing, revenue decreases.
According to https://www.inzata.com/2019/05/21/7-ways-to-grow-your-business-with-data-monetization/,
data is the new currency. In the past, businesses in the IT sector have always been deriving value from
data. However, the ability to effectively use and monetise data is now impacting virtually all types of
business.

This means that driving value from data is something you can implement in your own business
strategy. What many people may not realise is that this process can be extremely challenging.
1. Decision Architecture
When thinking about analytics, the majority of organisations want to know how their business is
performing, along with what information is needed to answer various performance questions. While
this can help to inform and to describe what is taking place in the organisation, it doesn’t enable any
type of action.
Instead, the goal needs to be to capture the decision architecture of specific business problems.
Once this is done, you can build analytics capabilities to create a diagnosis that enables decisions and
actions. Leaders need to focus on making decisions that are based on data, rather than just answering
questions about what already happened.
2. Stop Revenue Leaks
Busy healthcare providers, clinics, and hospitals can easily lose track of the services being rendered.
Every procedure has an assigned code and description. Each of these often includes errors.
By using analytics, the organisations can identify patterns associated with procedures and codes,
flagging patient invoices for possible errors or even missing charges. Intelligent data use can also help
the organisations improve the ROI of their collections process.
3. Data Aggregation
The method that is at the very bottom of the pyramid, but that represents the biggest opportunity
to earn, is data aggregation.
This means taking data from various sources, including your business, and merging it together to
create a larger, integrated picture. While the data sources on their own may be interesting, when they
are combined, they become valuable.
An example of this would be your credit report. The information credit bureaus aggregate, such as
the credit cards you have, if you have a mortgage, and if you pay your bills on time, can be sold for a
profit.
By aggregating this information into a single report, the information can be sold to interested
parties. While there isn’t a lot of money in this, it’s still money.
4. Infer Customer Satisfaction
Many organisations use social media and survey sentiment to understand the levels of customer
satisfaction. By combining data from several sources, airlines can now infer how satisfied a customer
is based on factors, like where they are sitting.
This process requires information to be aggregated from several sources. However, in the airline
example, you can use the information to determine if a customer is going to fly with you again, and if
not, offer a free upgrade or other incentives.
5. Embrace a New Revenue Model
Today, data is actively changing relationships companies have with customers. Manufacturers
of tangible goods are now supplementing the products they sell with flexible software options and
services to offer customers new choices and new revenue streams.
Additionally, these companies are providing much higher levels of personalisation. Across several
industries, new economic models are starting to be explored — like replacing an auto fleet with self-
driving cars.
In this example, rather than selling data, people are going to pay you to solve a problem or to
provide answers. This is a unique revenue model.
The value lies in the fact that you have married your data to the mission of a business and to solving
a problem that businesses have. This is what is going to generate revenue.
6. Detect Piracy and Fraud
Most online retailers sell products on several different websites. Supplemental sales channels typ-
ically include eBay, Amazon.com and other online marketplaces maintained by larger retailers, like
Best Buy and Walmart.

Selling through these channels is extremely data-intensive, since the customer types, products, and
pricing can vary greatly across the channels. In some cases, the price discrepancies are so large that
they signal possible piracy or fraud.
If you sell across dozens of e-commerce websites, then consider building databases of your own
products and your unique pricing. You can then compare this to existing expected pricing data, allow-
ing you to detect stolen goods or suppliers who are mispricing their goods.
With this information, it's possible to go to the marketplace and report that you believe someone
is selling stolen items.
7. Use Data Monetisation Methods for Your Business
Data monetisation is an ever-evolving beast that offers opportunities to earn profits by providing
information to others.
According to https://qz.com/1664575/is-data-science-legit/, after millennia of relying on anec-
dotes, instincts, and old wives’ tales as evidence of our opinions, most of us today demand that people
use data to support their arguments and ideas. Whether it’s curing cancer, solving workplace inequal-
ity, or winning elections, data is now perceived as being the Rosetta stone for cracking the code of
pretty much all of human existence.
But in the frenzy, we’ve conflated data with truth. And this has dangerous implications for our
ability to understand, explain, and improve the things we care about.
A professor of data science at NYU, Andrea Jones-Rooy consistently finds that whether she is
talking to students or clients, she has to remind them that data is not a perfect representation of reality:
It’s a fundamentally human construct, and therefore subject to biases, limitations, and other meaningful
and consequential imperfections.
The clearest expression of this misunderstanding is the question heard from boardrooms to class-
rooms when well-meaning people try to get to the bottom of tricky issues:
“What does the data say?”
Data doesn’t say anything. Humans say things. They say what they notice or look for in data—
data that only exists in the first place because humans chose to collect it, and they collected it using
human-made tools.
Data can’t say anything about an issue any more than a hammer can build a house or almond meal
can make a macaron. Data is a necessary ingredient in discovery, but you need a human to select it,
shape it, and then turn it into an insight. Data is therefore only as useful as its quality and the skills of
the person wielding it.
So if data on its own can’t do or say anything, then what is it?
What is data?
Data is an imperfect approximation of some aspect of the world at a certain time and place. It’s
what results when humans want to know something about something, try to measure it, and then
combine those measurements in particular ways.
Here are four big ways that we can introduce imperfections into data.

• random errors

• systematic errors

• errors of choosing what to measure

• and errors of exclusion

However, these errors don't mean that we should throw out all data and conclude that nothing is knowable.
They do mean approaching data collection with thoughtfulness, asking ourselves what we might be missing,
and welcoming the collection of further data.
This view is not anti-science or anti-data. To the contrary, the strength of both comes from being
transparent about the limitations of our work. Being aware of possible errors can make our inferences
stronger.

The first is random errors. This is when humans decide to measure something, and then either
due to broken equipment or their own mistakes, the data recorded is wrong. This could take the form
of hanging a thermometer on a wall to measure the temperature, or using a stethoscope to count
heartbeats. If the thermometer is broken, it might not tell you the right number of degrees. The
stethoscope might not be broken, but the human doing the counting might space out and miss a beat.
A big way this plays out in the rest of our lives (when we’re not assiduously logging temperatures
and heartbeats) is in the form of false positives in medical screenings. A false positive for, say, breast
cancer, means the results suggest we have cancer but we don’t. There are lots of reasons this might
happen, most of which boil down to a misstep in the process of turning a fact about the world (whether
or not we have cancer) into data (through mammograms and humans).
The consequences of this error are very real, too. Studies show a false positive can lead to years
of negative mental-health consequences, even though the patient turned out to be physically well. On
the bright side, the fear of false positives can also lead to more vigilant screening.
Generally speaking, as long as our equipment isn’t broken and we’re doing our best, we hope these
errors are statistically random and thus cancel out over time, though that's not a great consolation if
your medical screening is one of the errors.
The second is systematic errors. This refers to the possibility that some data is consistently making
its way into your dataset at the expense of others, thus potentially leading you to make faulty conclu-
sions about the world. This might happen for lots of different reasons: who you sample, when you
sample them, or who joins your study or fills out your survey.
A common kind of systematic error is selection bias. For example, using data from Twitter posts
to understand public sentiment about a particular issue is flawed because most of us don’t tweet —
and those who do don’t always post their true feelings. Instead, a collection of data from Twitter is
just that: a way of understanding what some people who have selected to participate in this particular
platform have selected to share with the world, and no more.
The 2016 US presidential election is an example where a series of systematic biases may have led
the polls to wrongly favour Hillary Clinton. It can be tempting to conclude that all polling is wrong —
and it is, but not in the general way we might think.
One possibility is that voters were less likely to report that they were going to vote for Trump due
to perceptions that this was the unpopular choice. We call this social desirability bias. It’s useful to
stop to think about this, because if we’d been more conscious of this bias ahead of time, we might have
been able to build it into our models and better predict the election results.
Medical studies are sadly riddled with systematic biases, too: They are often based on people who
are already sick and who have the means to get to a doctor or enrol in a clinical trial. There’s some
excitement about wearable technology as a way of overcoming this. If everyone who has an Apple
Watch, for example, could just send their heart rates and steps per day to the cloud, then we would
have tons more data with less bias. But this may introduce a whole new bias: The data will likely now
be skewed to wealthy members of the Western world.
The third is errors of choosing what to measure. This is when we think we’re measuring one thing,
but in fact we’re measuring something else.
She works with many companies that are interested in finding ways to make more objective hiring
and promotion decisions. The temptation is often to turn to technology: How can we get more data in
front of our managers so they make better decisions, and how can we apply the right filters to make
sure we are getting the best talent in front of our recruiters?
But very few pause to ask if their data is measuring what they think it’s measuring. For example, if
we are looking for top job candidates, we might prefer those who went to top universities. But rather
than that being a measure of talent, it might just be a measure of membership in a social network that
gave someone the “right” sequence of opportunities to get them into a good college in the first place.
A person’s GPA is perhaps a great measure of someone’s ability to select classes they’re guaranteed to
ace, and their SAT scores might be a lovely expression of the ability of their parents to pay for a private
tutor.
Companies are so obsessed with being on the cutting edge of methodologies that they’re skipping
the deeper question: Why are we measuring this in this way in the first place? Is there another way
we could more thoroughly understand people? And, given the data we have, how can we adjust our
filters to reduce some of this bias?
Finally, errors of exclusion. This happens when populations are systematically ignored in datasets,
which can set a precedent for further exclusion. We're making inferences about apples from data about oranges
— but with worse consequences than an unbalanced fruit salad.
For example, women are now more likely to die from heart attacks than men, which is thought
to be largely due to the fact that most cardiovascular data is based on men, who experience different
symptoms from women, thus leading to incorrect diagnoses.
Choosing to study something can also motivate further research on that topic, which is a bias in and
of itself. As it’s easier to build from existing datasets than create your own, researchers often gather
around certain topics — like white women running for office or male cardiovascular health — at the
expense of others. If you repeat this enough times, all of a sudden men are the default in heart-disease
studies and white women are the default in political participation studies.
Other examples abound. Measuring “leadership” might motivate people to be more aggressive in
meetings, thus breaking down communication in the long run. Adding an “adversity” score to the SATs
might motivate parents to move to different zip codes so their scores are worth more.
She also sees this play out in the diversity space: DiversityInc. and other organisations that try
to evaluate the diversity of companies have chosen a few metrics on which they reward companies — for
example, "leadership buy-in", which is measured by having a Chief Diversity Officer. The pressure to tick
this box has motivated a burst of behaviours that may not actually do anything, like appointing a
CDO who has no real power.
Why we still need to believe in data
In the age of anti-intellectualism, fake news, alternative facts, and pseudo-science, she is very
reluctant to say any of this. Sometimes it feels like we scientists are barely hanging on as it is. But she
believes that the usefulness of data and science comes not from the fact that it’s perfect and complete,
but from the fact that we recognise the limitations of our efforts. Just as we want to analyse data
carefully with statistics and algorithms, we also need to collect it carefully. We are only as strong
as our humility and awareness of our limitations.
This doesn’t mean throw out data. It means that when we include evidence in our analysis, we
should think about the biases that may have affected its reliability. We should not just ask "what does it
say?” but ask, “who collected it, how did they do it, and how did those decisions affect the results?”
We need to question data rather than assuming that just because we've assigned a number to
something it is suddenly the cold, hard Truth. When you encounter a study or dataset, she urges you
to ask: What might be missing from this picture? What’s another way to consider what happened?
And what does this particular measure rule in, rule out, or motivate?
We need to be as thoughtful about data as we are starting to be about statistics, algorithms, and
privacy. As long as data is considered cold, hard, infallible truth, we run the risk of generating and
reinforcing a lot of inaccurate understandings of the world around us.

§6.2 The AI Hierarchy


https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
AI has inspired massive fear-of-missing-out (FOMO) and fear-uncertainty-doubt (FUD). Some of
it is deserved, some of it not — but the industry is paying attention. From stealth hardware startups
to fintech giants to public institutions, teams are feverishly working on their AI strategy. It all comes
down to one crucial, high-stakes question: ‘How do we use AI and machine learning to get better at
what we do?’

More often than not, companies are not ready for AI. Maybe they hired their first data scientist to
less-than-stellar outcomes, or maybe data literacy is not central to their culture. But the most common
scenario is that they have not yet built the infrastructure to implement the most basic data science
algorithms and operations, much less machine learning.
Think of AI as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need
food, water and shelter (data literacy, collection and infrastructure).

At the bottom of the pyramid we have data collection. What data do you need, and what’s available?
If it’s a user-facing product, are you logging all relevant user interactions? If it’s a sensor, what data is
coming through and how? How easy is it to log an interaction that is not instrumented yet? After all,
the right dataset is what made recent advances in machine learning possible.
Next, how does the data flow through the system? Do you have reliable streams / ETL? Where do
you store it, and how easy is it to access and analyse? Jay Kreps has been saying (for about a decade)
that reliable data flow is key to doing anything with data.
Only when data is accessible can you explore and transform it. This includes the infamous 'data
cleaning’ (Topic 2). This is when you discover you’re missing a bunch of data, your sensors are unreli-
able, a version change meant your events are dropped, you’re misinterpreting a flag — and you go back
to making sure the base of the pyramid is solid.
When you’re able to reliably explore and clean the data, you can start building what’s traditionally
thought of as BI or analytics: define metrics to track, their seasonality and sensitivity to various factors.
Maybe do some rough user segmentation and see if anything jumps out. However, since your goal is
AI, you are now building what you’ll later think of as features to incorporate in your machine learning
model. At this stage, you also know what you’d like to predict or learn, and you can start preparing
your training data by generating labels, either automatically (which customers churned?) or with
humans in the loop.
We have training data — surely, now we can do machine learning? Maybe, if you’re trying to
internally predict churn; no, if the result is going to be customer-facing. We need to have a (however
primitive) A/B testing or experimentation framework in place, so we can deploy incrementally to avoid
disasters and get a rough estimate of the effects of the changes before they affect everybody. This is
also the right time to put a very simple baseline in place (for recommender systems, this would be e.g.
‘most popular’, then ‘most popular for your user segment’ — the very annoying but effective ‘stereotype
before personalization’).
Simple heuristics are surprisingly hard to beat, and they will allow you to debug the system end-
to-end without mysterious ML black boxes with hypertuned hyperparameters in the middle.

At this point, we can deploy a very simple ML algorithm (like logistic regression or, yes, division),
then think of new signals and features that might affect the results. Weather and census data are the
go-tos. And no  — as powerful as it is, deep learning doesn’t automatically do this for us. Bringing in
new signals (feature creation, not feature engineering) is what can improve our performance by leaps
and bounds. It’s worth spending some time here, even if as data scientists we’re excited about moving
on to the next level in the pyramid.
If all the foundations are properly instrumented, i.e. the ETL is humming, the data is organised and
cleaned, the dashboards, labels and good features are set up, and the right things are being measured
and experiments can be conducted daily, then the time is ripe for experimenting with machine
learning. One might get some big improvements in production, or one might not. Worst case, we learn
new methods, develop opinions and hands-on experience with them, and get to tell our investors and
clients about our AI efforts without feeling like an impostor. Best case, we make a huge difference to
our users, clients and our company — a true machine learning success story.

§6.3 Regression Models (Supervised Learning)


According to https://www.analyticsvidhya.com/blog/2015/06/machine-learning-basics/, supervised
learning methods are all predictive models because they estimate possible outcomes based on
historical or collected data.
When there is a functional relationship between the class label (dependent variable) and the set of
features (independent variables) and the outcome to be predicted is continuous, we can try a regression
model. Model validation can be done by checking for over-fitting and under-fitting using split data,
cross-validation, the confusion matrix, ROC curves and MSE [Lantz, 2015, Chapter 6].
Statistical methods such as R², adjusted R², MAE (mean absolute error), MSE (mean square
error), RMSE (root mean square error), AIC (Akaike information criterion), BIC (Bayesian information
criterion), residual analysis, goodness-of-fit tests and cross-validation are used to evaluate the effectiveness
of the regression model.
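As a hedged illustration (not from the notes), several of these quantities can be computed with sklearn.metrics roughly as follows; the synthetic regression data and the choice of LinearRegression are placeholders.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("R^2 :", r2_score(y_test, pred))
print("MAE :", mean_absolute_error(y_test, pred))
mse = mean_squared_error(y_test, pred)
print("MSE :", mse, " RMSE:", np.sqrt(mse))

# Cross-validation gives a less optimistic estimate than a single split.
print("5-fold CV R^2:", cross_val_score(model, X, y, cv=5, scoring="r2"))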

§6.3.1 Linear Regression


Use case: Revenue prediction, econometric predictions, modelling marketing responses

Linear regression is widely used for predicting numeric values. It trains and predicts fast, but it can be prone to
overfitting, so proper feature selection is often needed. It also suffers from outliers, and appropriate
transformations are needed to make it fit nonlinear functions.

Python

• sklearn.linear_model.LinearRegression
• sklearn.linear_model.{Ridge, Lasso, ElasticNet, SGDRegressor};
• http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

R

• library(stats): lm, glm
• library(MASS): lm.ridge
• library(lars): lars
• library(glmnet): glmnet
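A hedged sketch (not from the notes) of ordinary least squares and a ridge-penalised fit using the classes listed above; the synthetic data and the alpha value are illustrative only.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2 penalty helps curb overfitting

print("OLS   R^2:", ols.score(X_test, y_test))
print("Ridge R^2:", ridge.score(X_test, y_test))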

§6.3.2 Logistic Regression

Use case: Classification of email spam; Deciding whether an online transaction is fraudulent or not; Modelling
marketing responses

Logistic regression is a generalised linear model whose dependent variable is binary (0 or 1). It is mostly used to
predict whether an event is going to occur based on the independent variables.

Python

• sklearn.linear_model.LogisticRegression
• sklearn.linear_model.SGDClassifier(loss='log'); the default loss is 'hinge', which gives a linear SVM.
• http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

R

• library(stats): glm
• library(glmnet): glmnet
• https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf
• https://cran.r-project.org/web/packages/phylolm/index.html
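A hedged sketch (not from the notes) of a binary classifier built with sklearn.linear_model.LogisticRegression; the breast-cancer dataset and the scaling pipeline are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling helps the solver converge; the pipeline applies it consistently.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
print("P(class = 1) for the first test case:", clf.predict_proba(X_test[:1])[0, 1])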

§6.3.3 Support Vector Machine (SVM)

Use case: Character recognition, Image recognition, Text classification

It is a supervised learning model with associated learning algorithms that analyse data used for
classification and regression analysis. It uses a kernel function to map data points to a higher
dimensional space and finds a hyperplane to divide these points in that space. It is ideal for data
sets with high dimensions, or if you know the decision boundary is not linear. Given a set of training
examples, each marked as belonging to one or the other of two categories, an SVM training algorithm
builds a model that assigns new examples to one category or the other, making it a non-probabilistic
binary linear classifier. An SVM model is a representation of the examples as points in space, mapped
so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
New examples are then mapped into that same space and predicted to belong to a category based on
the side of the gap on which they fall.
It is difficult to interpret when nonlinear kernels are applied, and it also suffers from too many examples:
beyond about 10,000 examples it starts taking too long to train.

Python

• sklearn.svm.{SVC, LinearSVC, NuSVC, SVR, LinearSVR, NuSVR, OneClassSVM}


• http://scikit-learn.org/stable/modules/svm.html

R

• library(e1071): svm
• https://cran.r-project.org/web/packages/e1071/index.html
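A hedged sketch (not from the notes) of an RBF-kernel SVM on a small character-recognition dataset; the digits data, scaling step and C/gamma settings are illustrative.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # 8x8 digit images as 64 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel maps points to a higher-dimensional space implicitly.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))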

§6.4 Classification (Supervised Learning)


In the previous section, regression algorithms attempt to estimate the mapping function f from the
input variables x to numerical or continuous output variables y. In this section, we discuss classifica-
tion algorithms (some of which also have regression variants) which attempt to estimate the mapping
function f from the input variables x to discrete or categorical output variables y. This means that
classification is a class of methods to classify an observation into one of several known categories. It
involves feature weighting and selection and parameter optimisation.
To evaluate the correctness of the model, we study the accuracy, confusion matrix, sensitivity and
specificity, ROC (receiver operating characteristic), AUC (area under the curve) and cross-validation.
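A hedged sketch (not from the notes) of how these evaluation measures can be computed with sklearn.metrics; the synthetic data and the logistic-regression classifier are placeholders.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("confusion matrix:\n", confusion_matrix(y_test, pred))
# Sensitivity (recall) and specificity can be read from the report / matrix.
print(classification_report(y_test, pred))
# AUC needs predicted probabilities rather than hard labels.
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))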

§6.4.1 Decision Trees


Use case: Targeted advertising

It is a divide-and-conquer algorithm which requires little data preparation and can handle both
numeric and categorical data. It is easy to interpret and visualise, but susceptible to overfitting [Lantz,
2015, Chapter 5].

Python

• sklearn.tree.DecisionTreeClassifier
• https://scikit-learn.org/stable/modules/tree.html
• https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

R

• library(rpart): fit <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)
• https://cran.r-project.org/web/packages/rpart/index.html
• https://cran.r-project.org/web/packages/party/index.html
• https://www.tutorialspoint.com/r/r_decision_tree.htm
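A hedged sketch (not from the notes) using sklearn.tree.DecisionTreeClassifier; the iris data and the depth limit are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting the depth is a simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # the fitted tree as readable if/else rules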

§6.4.2 Random Forest


Use case: Credit card fraud detection, Bioinformatics

An ensemble method that combines many decision trees. It has all the pros of a basic decision tree,
can handle many features and usually has high accuracy. However, it is difficult to interpret, it is
weaker on regression when estimating values at the extremities of the distribution of response values,
and in multiclass problems it is biased toward the more frequent classes.

Python

• sklearn.ensemble.RandomForestClassifier; sklearn.ensemble.ExtraTreesClassifier;
• sklearn.ensemble.RandomForestRegressor; sklearn.ensemble.ExtraTreesRegressor;
• http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

R

• library(randomForest): randomForest
• https://cran.r-project.org/web/packages/randomForest/index.html
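A hedged sketch (not from the notes) using sklearn.ensemble.RandomForestClassifier; the dataset and the number of trees are illustrative.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("5-fold CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())

# Feature importances give a rough (if imperfect) look inside the ensemble.
forest.fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:3]
print("three most important feature indices:", top)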

§6.4.3 Gradient Boosting


Use case: Search engines (solving the problem of learning to rank)

It can approximate most nonlinear functions and is a best-in-class predictor. However, it can overfit if
run for too many iterations, it is sensitive to noisy data and outliers, and it doesn't work well without
parameter tuning.

Python

• sklearn.ensemble.{GradientBoostingClassifier, GradientBoostingRegressor};
• http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

R

• library(gbm): gbm
• https://cran.r-project.org/web/packages/gbm/index.html
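A hedged sketch (not from the notes) using sklearn.ensemble.GradientBoostingClassifier; the synthetic data and the modest number of boosting iterations are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keeping n_estimators and learning_rate modest guards against overfitting.
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbt.fit(X_train, y_train)
print("test accuracy:", gbt.score(X_test, y_test))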

§6.4.4 AdaBoosting
AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm formulated by Yoav Freund
and Robert Schapire in 1995 (the work won the 2003 Gödel Prize). It can be used in conjunction with many other types of learning
algorithms to improve performance. The output of the other learning algorithms (’weak learners’) is
combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is
adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassi-
fied by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems it can
be less susceptible to the overfitting problem than other learning algorithms. The individual learners
can be weak, but as long as the performance of each one is slightly better than random guessing, the
final model can be proven to converge to a strong learner.
Every learning algorithm tends to suit some problem types better than others, and typically has
many different parameters and configurations to adjust before it achieves optimal performance on a
dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-
the-box classifier. When used with decision tree learning, information gathered at each stage of the
AdaBoost algorithm about the relative ‘hardness’ of each training sample is fed into the tree growing
algorithm such that later trees tend to focus on harder-to-classify examples.

Python

• sklearn.ensemble.{AdaBoostClassifier, AdaBoostRegressor};
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

R

• library(ada): ada
• https://cran.r-project.org/web/packages/ada/index.html
• https://cran.r-project.org/web/packages/adabag/index.html
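A hedged sketch (not from the notes) using sklearn.ensemble.AdaBoostClassifier; the synthetic data and the number of weak learners are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default weak learner is a depth-1 decision tree (a "decision stump").
ada = AdaBoostClassifier(n_estimators=200, random_state=0)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))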

§6.4.5 Naïve Bayes


Use case: Face recognition, Sentiment analysis, Spam detection, Text classification

The idea behind Bayesian Classifiers (instance-based classifiers) is simple: to classify an unknown
instance, find an existing instance that is “most similar” to the new instance and assign the class label
of the known instance to the new one! [Janert, 2011, Chapter 18]
A Bayesian classifier is a kind of instance-based classifier that takes a probabilistic (i.e., nondeter-
ministic) view of classification. Given a set of attributes, it calculates the probability of the instance to
belong to this or that class. An instance is then assigned the class label with the highest probability.
[Lantz, 2015, Chapter 4]
It takes prior knowledge into account, doesn't require too much memory and can be used for
online learning. However, the feature independence assumption is unrealistic, it fails at estimating
rare occurrences, and it suffers from irrelevant features.

Python

• sklearn.naive_bayes.{GaussianNB, MultinomialNB, BernoulliNB}


• http://scikit-learn.org/stable/modules/naive_bayes.html

R

• library(klaR): NaiveBayes
• library(e1071): naiveBayes
• https://cran.r-project.org/web/packages/bnlearn/index.html
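A hedged sketch (not from the notes) using sklearn.naive_bayes.GaussianNB on continuous features; the synthetic data is a placeholder (MultinomialNB would be the usual choice for word-count text data).

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
# The class probabilities reflect the probabilistic view described above.
print("P(class | x) for the first test case:", nb.predict_proba(X_test[:1]))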

§6.4.6 Supervised Neural Network Models


Use case: Image recognition, Language recognition and translation, Speech recognition, Vision
recognition

Supervised neural network models can approximate any nonlinear function and are robust to outliers.
However, they are very difficult to set up, difficult to tune (there are many parameters and the
architecture of the network has to be decided), difficult to interpret, and easy to overfit.
A Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function mapping an
m-dimensional input to an n-dimensional output by training on a dataset. The algorithm can learn a
non-linear function approximator for either classification or regression. It differs from logistic regression
in that, between the input and the output layer, there can be one or more non-linear layers, called hidden layers [scikit-learn developers, 2019].

Python

• sklearn.neural_network.{MLPClassifier, MLPRegressor}
• http://scikit-learn.org/dev/modules/neural_networks_supervised.html

R

• library(neuralnet): neuralnet
• https://cran.r-project.org/web/packages/neuralnet/index.html
• library(AMORE) : train
• library(nnet) : nnet
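A hedged sketch (not from the notes) using sklearn.neural_network.MLPClassifier; the digits data, the single 64-unit hidden layer and the iteration limit are illustrative.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer with 64 units; scaling the inputs helps convergence.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                  random_state=0))
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))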

§6.4.7 K-Nearest Neighbour (KNN)


Use case: Bank credit risk analysis, Computer vision, Multilabel tagging, Recommender systems,
Spell checking problems

The KNN algorithm [Lantz, 2015, Chapter 3] is a lazy learning algorithm: it requires little work during
training, but it can be slow and cumbersome in the prediction phase if you have a large data set, and it
may fail to predict correctly due to the curse of dimensionality.

Python

• sklearn.neighbors.{KNeighborsClassifier, KNeighborsRegressor}
• http://scikit-learn.org/stable/modules/neighbors.html

R

• library(class): knn
• https://cran.r-project.org/web/packages/kknn/index.html
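A hedged sketch (not from the notes) using sklearn.neighbors.KNeighborsClassifier; the iris data and k = 5 are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" only stores the data; the real work happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))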

§6.5 Clustering (Unsupervised Learning)


A clustering model (or descriptive modelling) groups a set of objects into several unknown clusters.
These models can be externally evaluated using data that are not used for clustering but with known
class labels.

§6.5.1 K-Means Clustering


Use case: Customer segmentation

This method groups objects into k clusters [Lantz, 2015, Chapter 9]. The goal is to have the ob-
jects in one cluster more similar to each other than to any object in other clusters. When k is not
pre-determined, many methods can be used to find a good value of k, such as the elbow method (in
yellowbrick, a sklearn extension) and silhouette method (in sklearn.metrics).

Python

• sklearn.cluster.{KMeans, MiniBatchKMeans}
• http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

R

• library(stats): kmeans
• https://cran.r-project.org/web/packages/broom/vignettes/kmeans.html
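A hedged sketch (not from the notes) combining sklearn.cluster.KMeans with the silhouette method mentioned above; the blob data and the range of k values are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Try several values of k and keep the one with the highest silhouette score.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(silhouette_score(X, km.labels_), 3))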

§6.5.2 Unsupervised Neural Network Models


Restricted Boltzmann machines (RBM) are unsupervised nonlinear feature learners based on probabilistic
models. The RBM tries to maximise the likelihood of the data using a particular graphical model [scikit-learn developers, 2019].
At present, Python’s scikit-learn only provides the Bernoulli Restricted Boltzmann machine model,
which assumes the inputs to be binary or values between 0 and 1.

Python

• sklearn.neural_network.BernoulliRBM
• http://scikit-learn.org/dev/modules/neural_networks_unsupervised.html

R

• library(devtools), library(RBM): RBM


• https://github.com/TimoMatzen/RBM
• https://cran.r-project.org/web/packages/deepnet/

§6.5.3 SVD
Use case: Recommender systems

It is a dimension-reduction method which can restructure data in a meaningful way but it is difficult
to understand why data has been restructured in a certain way.

Python

• sklearn.decomposition.TruncatedSVD
• http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

R

• library(svd): svd
• https://cran.r-project.org/web/packages/svd/index.html

§6.5.4 PCA
Use case: Removing collinearity, Reducing dimensions of the dataset

The problem with PCA is its strong linearity assumption (components are weighted
summations of features). When the data lacks this linear property, the dimension reduction no longer
works.
Incremental PCA is a linear dimension-reduction method using SVD of the data, keeping only the
most significant singular vectors to project the data to a lower dimensional space. The input data is
centred but not scaled for each feature before applying the SVD. KernelPCA is a nonlinear dimension-
reduction method using “linear” and “nonlinear” kernels.

Python

• sklearn.decomposition.PCA, IncrementalPCA, KernelPCA


• http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

R

• library(stats): princomp, prcomp


• https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html
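A hedged sketch (not from the notes) using sklearn.decomposition.PCA; the iris data and the choice of two components are illustrative.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales

pca = PCA(n_components=2).fit(X_scaled)
X_2d = pca.transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_2d.shape)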

§6.5.5 NMF
Use case: dimensionality reduction, source separation or topic extraction.

Given a non-negative matrix X, a non-negative matrix factorisation (NMF) tries to find two non-negative
matrices (W, H) whose product approximates X.

Python

• sklearn.decomposition.NMF
• https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

R

• library(irlba): irlba
• https://cran.r-project.org/web/packages/irlba/index.html

§6.6 Time Series


Finally, we mention that there are stochastic models ar(), arima(), arima.sim(), arma() and garch()
for time series analysis. Fitting such a model can be summarised by the following steps:


Step 1: Visualise the Time Series
It is essential to analyse the trends prior to building any kind of time series model. The details
we are interested in pertains to any kind of trend, seasonality or random behaviour in the series.
We have covered this part in the second part of this series.
Step 2: Stationarise the Series
There are three commonly used techniques to make a time series stationary:

(a) Detrending: Here, we simply remove the trend component from the time series. For in-
stance, the equation of my time series is:

x(t) = (mean + trend ∗ t) + error.

We'll simply remove the part in the parentheses and build a model for the rest.
(b) Differencing: This is the commonly used technique to remove non-stationarity. Here we
try to model the differences of the terms and not the actual term. For instance,

x(t) − x(t − 1) = ARMA(p, q)

This differencing is called the Integration part in AR(I)MA. Now, we have three parameters:
p (AR), d (I), q (MA).

(c) Seasonality : Seasonality can easily be incorporated in the ARIMA model directly. More
on this has been discussed in the applications part below.

Step 3: Find Optimal Parameters


The parameters p, d, q can be found using ACF and PACF plots. In addition, if both the ACF and
PACF decrease gradually, it indicates that we need to make the time series stationary and introduce
a value for "d".

Step 4: Build ARIMA Model


With the parameters in hand, we can now try to build an ARIMA model. The values found in the
previous step might be approximate estimates and we need to explore more (p, d, q) combinations.
The combination with the lowest BIC and AIC should be our choice. We can also try some
models with a seasonal component, in case we notice any seasonality in the ACF/PACF plots.

Step 5: Make Predictions


Once we have the final ARIMA model, we are now ready to make predictions on the future time
points. We can also visualize the trends to cross validate if the model works fine.

Example 6.6.1. https://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/


data(AirPassengers)
start(AirPassengers)        # start of the index (year)
end(AirPassengers)          # end of the index
frequency(AirPassengers)    # number of observations per year
summary(AirPassengers)
plot.ts(AirPassengers)
abline(reg = lm(AirPassengers ~ time(AirPassengers)))
cycle(AirPassengers)        # print the cycle across years
plot(aggregate(AirPassengers, FUN = mean))
boxplot(AirPassengers ~ cycle(AirPassengers))
# Augmented Dickey-Fuller test (adf.test is provided by the tseries package)
library(tseries)
adf.test(diff(log(AirPassengers)), alternative = "stationary", k = 0)
# ACF and PACF plots
acf(log(AirPassengers))
acf(diff(log(AirPassengers)))
pacf(diff(log(AirPassengers)))
(fit <- arima(log(AirPassengers), c(0, 1, 1),
              seasonal = list(order = c(0, 1, 1), period = 12)))
pred <- predict(fit, n.ahead = 10 * 12)
# 2.718 approximates e: the forecast is back-transformed from the log scale
ts.plot(AirPassengers, 2.718^pred$pred, log = "y", lty = c(1, 3))
