
FEBRUARY 2018
WWW.SOLVER-INTERNATIONAL.COM

Solver International
Educating and Empowering the Business Analyst

TUTORIAL: Using Simulation and Optimization Together
DATA ANALYSIS: Location Data – The Missing Link
EXECUTIVE INTERVIEW: Miguel Martinez – Microsoft Power BI
TUTORIAL: Text Mining

Solver-International.com

Solver-International.com provides immediate information, exclusive articles, and updated news for businesses and academics. The website also offers daily and weekly access for everyone seeking the latest tools to improve their analytic capabilities.
Welcome to Analytic Solver®
Cloud-based Data and Text Mining that Integrates with Excel

Everything in Predictive and Prescriptive Analytics, Everywhere You Want, from Concept to Deployment. The Analytic Solver® suite makes powerful forecasting, data mining and text mining software available in your web browser (cloud-based software as a service), and in Microsoft Excel. And you can easily create models in our RASON® language for server, web and mobile apps.

Full-Power Data Mining and Predictive Analytics. It's all point-and-click: Text mining, latent semantic analysis, feature selection, principal components and clustering; exponential smoothing and ARIMA for forecasting; multiple regression, logistic regression, k-nearest neighbors, discriminant analysis, naïve Bayes, and ensembles of trees and neural networks for prediction; and association rules for affinity analysis.

Simulation/Risk Analysis, Powerful Optimization. Analytic Solver is also a full-power, point-and-click tool for Monte Carlo simulation and risk analysis, with 50 distributions, 50 statistics and risk measures, rank-order and copula correlation, distribution fitting, and charts and graphs. And it has full-power, point-and-click optimization, with large-scale linear and mixed-integer programming, nonlinear and simulation optimization, stochastic programming and robust optimization.

Find Out More, Start Your Free Trial Now. In your browser, in Excel, or in Visual Studio, Analytic Solver comes with everything you need: Wizards, Help, User Guides, 90 examples, even online training courses. Visit www.solver.com to learn more or ask questions, and visit analyticsolver.com to register and start a free trial – in the cloud, on your desktop, or both!

Tel 775 831 0300 • Fax 775 831 0314 • info@solver.com
February 2018 • Volume 2, Number 1 • www.solver-international.com

CONTENTS

Tutorial: Using Simulation and Optimization Together
Combining ideas from simulation and optimization to build and solve a simulation optimization model.

Data Analysis: Location Data – The Missing Link
Location data, and the intelligence that geospatial analysis can provide business decision makers, is the missing link that ties together various data sets from Big Data sources.

Tutorial: Text Mining
Some data analysis tasks cannot be completed in a reasonable time by a human, and that makes text mining a useful tool in modern data science and forensics.

Executive Interview: Miguel Martinez – Microsoft Power BI
Miguel Martinez is a Senior Product Marketing Manager for Microsoft Power BI.

Society News: 2018 INFORMS Business Analytics Conference
The place to gain real-world insights to implement a successful analytics strategy.

COLUMNS & DEPARTMENTS
Off the Top
Impact Analytix
Industry News
INFORMS Society News
Glossary
OFF THE TOP
Daniel Fylstra, editor
dfylstra@solver.com

Smart Move: Learn Analytics

Solver International has an editorial mission statement, "Educating and Empowering the Business Analyst," that you'll see throughout this issue. We're speaking to everyone employed in roles that do – or should – include "analyst" in the job title, as well as managers of people in those roles. But in truth our audience is broader than that: As Miguel Martinez from Microsoft says in this issue's Executive Interview, analytics and business intelligence is "a key piece of knowledge – it must be part of your skillset," even "if you're not in a hardcore data analysis industry or department." For MBA and undergrad business students, this message should be clear as well – in today's world, you can't leave data and analytics to the "quants in the back room." It's front and center, from the departmental meeting to the board room.

So by all means, take advantage of our free digital magazine to learn! This issue includes two advanced tutorials, on Text Mining and Simulation Optimization. You've probably heard these terms, but do you know what's really involved in these analytic methods? Our tutorials go into enough depth to be a little challenging, but the reward is that you'll find you do have a grasp of what these methods can do, and how they might be used in your own business or future career.

You'll also find an article on Location Data and geospatial analysis in these pages. It's remarkable just how pervasive and useful location data is in literally every industry. As Thomas Walk writes, "Every object exists in space and time and there is always something to be learned from analyzing those locations in conjunction with other data sets" – from retail sales to insurance claims to fuel usage. And don't miss Jen Underwood's always-fascinating column, about "Prescriptive Analytics at the Edge" – Jen cites examples from anticipating machine failure to preventing customer churn.

So why learn analytics? Quoting Miguel Martinez from Microsoft once more, "this is the right time in history to be exposed to analytics, because you're going to be very successful." We couldn't agree more. Si

SOLVER INTERNATIONAL DIGITAL MAGAZINE
A joint venture between Frontline Systems, Inc. and Lionheart Publishing, Inc.

Solver International Advertising and Editorial Office: Send all advertising submissions for Solver International to: Lionheart Publishing Inc., 1635 Old 41 Hwy., Suite 112-361, Kennesaw, GA 30152 USA. Tel.: 888.303.5639 • Fax: 770.432.6969 • Email: lpi@lionhrtpub.com • URL: www.lionheartpub.com

President: John Llewellyn, ext. 209, llewellyn@lionhrtpub.com • Direct: 404.918.3275
Editor: Daniel Fylstra, dfylstra@solver.com
News Submissions: editor@solver-international.com
Art Director: Alan Brubaker, ext. 218, albrubaker@lionhrtpub.com
Online Projects Manager, Reprints & Subscriptions: Patton McGinley, ext. 214, patton@lionhrtpub.com
Advertising Sales Manager: John Llewellyn, llewellyn@lionhrtpub.com • Direct: 404.918.3275
Frontline Systems, Inc.: P.O. Box 4288, Incline Village, NV 89450 • www.solver.com

Solver International is published bimonthly by Lionheart Publishing, Inc. in cooperation with Frontline Systems, Inc. Deadlines for contributions: Manuscripts and news items should arrive no later than three weeks prior to the first day of the month of publication. Address correspondence to: Editor, Solver International, Lionheart Publishing, Inc., 1635 Old 41 Hwy., Suite 112-361, Kennesaw, GA 30152. The opinions expressed in Solver International are those of the authors, and do not necessarily reflect the opinions of Lionheart Publishing Inc., Frontline Systems, Inc. or the editorial staff of Solver International. All rights reserved.


Welcome to Analytic Solver®
Cloud-based Simulation Modeling that Integrates with Excel

Everything in Predictive and Prescriptive Analytics, Everywhere You Want, from Concept to Deployment. The Analytic Solver® suite makes the fastest Monte Carlo simulation and risk analysis software available in your web browser (cloud-based software as a service), and in Microsoft Excel. And you can easily create models in our RASON® language for server, web and mobile apps.

Comprehensive Risk and Decision Analysis Tools. Use a point-and-click Distribution Wizard, 50 probability distributions, automatic distribution fitting, compound distributions, rank-order correlation and three types of copulas; 50 statistics, risk measures and Six Sigma functions; easy multiple parameterized simulations, decision trees, and a wide array of charts and graphs.

Optimization, Forecasting, Data and Text Mining. Analytic Solver is also a full-power, point-and-click tool for conventional and stochastic optimization, with powerful linear and mixed-integer programming, nonlinear optimization, simulation optimization, stochastic programming and robust optimization. And it's a full-power tool for forecasting, data mining and text mining, from time series methods to classification and regression trees, neural networks and more, with access to SQL databases and Spark Big Data clusters.

Find Out More, Start Your Free Trial Now. In your browser, in Excel, or in Visual Studio, Analytic Solver comes with everything you need: Wizards, Help, User Guides, 90 examples, even online training courses. Visit www.solver.com to learn more or ask questions, and visit analyticsolver.com to register and start a free trial – in the cloud, on your desktop, or both!

Tel 775 831 0300 • Fax 775 831 0314 • info@solver.com
IMPACT ANALYTIX
Jen Underwood

Prescriptive Analytics at the Edge

The explosion of the internet-of-things (IoT), along with sensor, social, and other streaming data associated with Industry 4.0, is driving organizations to use edge computing to provide them with real-time analytics. Edge computing—edge servers that control one or more field devices—provides unprecedented time-to-insight and time-to-action while reducing the amount of data being transmitted to central data centers. Applying prescriptive analytics "at the edge" optimizes operations.

EDGE ANALYTICS IN ACTION
Automated embedded artificial intelligence, predictive and prescriptive analytics can all be found in retail, logistics, security and cyber-security, energy, fleet maintenance, manufacturing, insurance, industrial, and other segments. Edge intelligence stems from artificial intelligence-based analytics over data collected from devices located in the edge server's smart objects and other passive or semi-passive devices—such as sensors and RFID tags—but also from edge computers and routers. As a result, edge processing can still discover intelligence patterns in near real-time, at a low latency, and without any essential loss of bandwidth.

Edge analytics is appropriate for use in cases involving fast, close-to-the-field processing with multiple devices. Here are several examples of edge analytics in action.

• Performing prescriptive maintenance: Developing maintenance plans based on mean time between failure (MTBF) statistics for reducing costly downtime periods. Improving product quality and process productivity by ensuring an unmatched degree of utilization to serve customers reliably.

• Managing and optimizing operations: Performing real-time anomaly and root cause detection; profiling behavior; improving resource allocation; aiding decisions on productivity and automating routine tasks to optimize smart factory, facility, asset, fleet, and personnel management.

• Reducing risks: Developing risk models, minimizing the potential for human bias, detecting fraudulent activities, refining processes with consistent, rules-based decision-making, and continuously monitoring environmental conditions.

• Preventing customer churn: Proactively resolving detected performance issues, intelligently learning usage or consumption patterns, and effectively targeting offers and services to keep customers happy.

• Enhancing competitive advantage: Better handling of market challenges in an ever-changing global business. Adapting to windows of opportunity, and scaling and automating processes while reducing costs. Recognizing variations and forecasting responses to changing conditions.

For example, you could use edge intelligence to proactively provide predicted customer churn information to customer success and sales teams so they can take action. In service or contract-driven organizations, customer acquisition is expensive. Every lost customer directly impacts your bottom line. Most unhappy customers will never tell you about performance issues. With simple, smart churn detection analytics, you can improve customer experience and retention.

Low-latency decision making is also ideal for remote asset monitoring. Edge intelligence enables monitoring operations to rapidly decide when to take needed actions, to repair or replace an asset. Based on the deployment of edge intelligence in the field, directly on the machine or parts, it is possible to identify asset degradation patterns. This can lead to proactive versus reactive replacement of the part, to avoid downtime or disruption in production operations. Edge intelligence also isolates the asset location or causes of problems, making it much easier for staff in the field to find and resolve issues.

Another benefit of edge intelligence is the collection of related conditional data. Operators can learn more about situational factors or alterations of an environment that may have caused an issue. That contextual knowledge can be used to better prepare for, or prescriptively prevent, issues in the future.

At the organizational level, richer datasets can be collected and persisted, including historic datasets from multiple sensors, parts, machines, and edge devices. Hence, learned intelligence can be reused across the enterprise, enabling proactive support services such as "Asset Condition Monitoring as a Service" and "Maintenance as a Service." It also can be continuously updated through measurement of model drift.

SMART THINGS
The emergence of "smart," connected things coupled with existing subject matter expert knowledge creates an opportunity for a new era of analytics. Modern analytical tools can not only predict what is likely to occur but also offer "what-if" analysis of alternatives to better guide decision-makers. Edge analytics solutions today augment humans – they do not replace them. It is crucial for humans to guide and monitor these systems.

Prescriptive analytics on the edge is an emerging frontier that has progressed from post-event analysis of historical data, real-time event analysis, and predictive analytics, to the forward-looking automation of actions to optimize operations. Industry-leading organizations, such as GE with Predix, are already powering prescriptive maintenance programs. Pilot programs to collect data from smart things have already been completed. Now we are seeing artificial intelligence being embedded along with prescriptive analytics.

By making connected things smarter and moving intelligence to the edge of the network or within the smart things themselves, automated edge prescriptive analytics can drive big wins for enterprises. As companies consider how best to apply this new wave of edge analytics technologies in their business, they are embracing design thinking methodology to reimagine digital processes. Si

Jen Underwood is Founder and Principal of Impact Analytix, LLC. Impact Analytix is a boutique integrated product research, consulting, technical marketing and creative digital media agency led by experienced hands-on practitioners. Jen can be tweeted at @idigdata.


Welcome to Analytic Solver®
Cloud-based Optimization Modeling that Integrates with Excel

Everything in Predictive and Prescriptive Analytics, Everywhere You Want, from Concept to Deployment. The Analytic Solver® suite makes the world's best optimization software available in your web browser (cloud-based software as a service), and in Microsoft Excel. And you can easily create models in our RASON® language for your server, web and mobile apps.

Linear Programming to Stochastic Optimization. It's all point-and-click: Fast, large-scale linear, quadratic and mixed-integer programming, conic, nonlinear, non-smooth and global optimization. Easily incorporate uncertainty and solve with simulation optimization, stochastic programming, and robust optimization.

Monte Carlo Simulation, Data and Text Mining. Analytic Solver is also a full-power, point-and-click tool for Monte Carlo simulation and decision analysis, with a Distribution Wizard, 50 distributions, 50 statistics and risk measures, and a wide array of charts and graphs. And it's a full-power, point-and-click tool for forecasting, data mining and text mining, from time series methods to classification and regression trees, neural networks and association rules – complete with access to SQL databases, Apache Spark Big Data clusters, and text document folders.

Find Out More, Start Your Free Trial Now. In your browser, in Excel, or in Visual Studio, Analytic Solver comes with everything you need: Wizards, Help, User Guides, 90 examples, even online training courses. Visit www.solver.com to learn more or ask questions, and visit analyticsolver.com to register and start a free trial – in the cloud, on your desktop, or both!

Tel 775 831 0300 • Fax 775 831 0314 • info@solver.com
INDUSTRY NEWS

Text Mining Uncovers Millions of 'Fake Comments' Sent to FCC

According to a November 29, 2017 report by Reuters News Agency, "More than half of the 21.7 million public comments submitted to the U.S. Federal Communications Commission about net neutrality this year used temporary or duplicate email addresses and appeared to include false or misleading information, the Pew Research Center said."

From April 27 to Aug. 30, the public was able to submit comments to the FCC on the topic, online or by e-mail. The Reuters article noted, "Of those, 57 percent used either duplicate e-mail addresses or temporary e-mail addresses, while many individual names appeared thousands of times in the submissions, Pew said. For example, 'Pat M' was listed on 5,910 submissions, and the e-mail address john_oliver@yahoo.com was used in 1,002 comments. TV host John Oliver supported keeping net neutrality on his HBO talk show."

Pew did not say how many of the comments supported or opposed the FCC's proposal. "They found that only six percent of submitted comments were unique while the rest had been submitted multiple times, in some cases, hundreds of thousands of times," the authors stated. "Thousands of identical comments were submitted in the same second on at least five occasions. On July 19 at precisely 2:57:15 p.m. ET, 475,482 comments were submitted, Pew said, adding that almost all were in favor of net neutrality."

In the same vein, data scientist Jeff Kao used a similar dataset and got a similar result. Writing on the blog Hackernoon, Kao reports, "I used natural language processing techniques to analyze net neutrality comments submitted to the FCC from April-October 2017, and the results were disturbing. My research found at least 1.3 million fake pro-repeal comments, with suspicions about many more. In fact, the sum of fake pro-repeal comments in the proceeding may number in the millions."

He continues, "It was clear from the start that the data was going to be duplicative and messy. If I wanted to do the analysis without having to set up the tools and infrastructure typically used for 'big data,' I needed to break down the 22-plus million comments and more than 60GB worth of text data and metadata into smaller pieces.

"Thus, I tallied up the many duplicate comments and arrived at 2,955,182 unique comments and their respective duplicate counts. I then mapped each comment into semantic space vectors and ran some clustering algorithms on the meaning of the comments. This method identified nearly 150 clusters of comment submission texts of various sizes."

After clustering comment categories and removing duplicates, Kao found "less than 800,000 of the 22 million comments submitted to the FCC (3-4 percent) could be considered truly unique." Si
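Kao's description amounts to a three-step pipeline: tally exact duplicates, map comments into a vector space, then cluster by meaning. The Python sketch below illustrates that pipeline at toy scale with scikit-learn; the TF-IDF vectorizer and k-means clusterer are stand-in choices for illustration, not the specific tools Kao used.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the 22-million-comment corpus.
comments = [
    "Repeal the Title II net neutrality rules now.",
    "Repeal the Title II net neutrality rules now.",
    "I support the open internet rules under Title II.",
    "Please keep the open internet protections of Title II.",
    "The FCC must repeal the Title II order immediately.",
    "I support the open internet rules under Title II.",
]

# Step 1: tally exact duplicates - unique texts and their submission counts.
counts = Counter(comments)
unique_texts = list(counts)

# Step 2: map each unique comment into a vector space.
vectors = TfidfVectorizer(stop_words="english").fit_transform(unique_texts)

# Step 3: cluster the vectors so reworded "form letter" variants land together.
# (Kao reports ~150 clusters at full scale; 2 is enough for this toy corpus.)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for text, label in zip(unique_texts, labels):
    print(label, counts[text], text)
```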

Frontline Systems Releases XLMiner® SDK V2018 for High-Performance Predictive Analytics

Frontline Systems has released XLMiner SDK V2018, a next-generation version of its Software Development Kit for data mining, text mining, forecasting, and predictive analytics. XLMiner SDK offers application developers working in C++, C#, Java, Python or R a powerful, high-level API for quickly creating applications that use predictive analytics. Developers can register for a free account at https://www.solver.com, and download and install a fully-functional version of XLMiner SDK with a free 15-day trial license.

"Data mining and machine learning software has proliferated, but there's a difference between common libraries and truly robust, high-performance software – especially if you're working in C++, C# or Java," said Daniel Fylstra, Frontline's President and CEO. "XLMiner SDK is a toolkit that developers can count on to build commercial-grade applications."

FULL SUPPORT FOR POPULAR PROGRAMMING LANGUAGES
XLMiner SDK provides full API support for five popular programming languages: C++ 11 or later, C# 4.0 or later, Java 8, Python 2.7 or 3.6 (both are supported), and R 3.4. In Microsoft Visual Studio and R Studio, developers will benefit from automatic recognition and "command completion" for XLMiner's objects, properties and methods. And the new SDK is ready for REPL (Read-Eval-Print-Loop) style execution with C# Interactive.

XLMiner SDK's R support uses R-native types, including R's own DataFrame type; hence it can be used easily with a wide range of R packages from CRAN. XLMiner SDK provides its own "R package" that can be loaded with one command from R itself, or from an IDE such as R Studio.

For C++, C# and Java developers, XLMiner SDK should be especially welcome, since quality data mining tools have been hard to find for these popular languages. But even R and Python developers will find that XLMiner SDK offers a far better integrated, comprehensive data mining and text mining toolkit.

SUPPORT FOR POPULAR DATABASES AND FILES, TEXT, AND BIG DATA
XLMiner SDK has built-in tools to read data from SQL databases using ODBC (Open Data Base Connectivity), with special support for Oracle, Microsoft SQL Server and Access databases; OData ("the web successor to ODBC") data sources exposing a REST API; JSON (JavaScript Object Notation); and CSV (Comma Separated Value) and Excel files.

The SDK also handles unstructured text data, and provides stemming, term normalization, vocabulary reduction, creation of a term-document matrix, and concept extraction with latent semantic indexing. It even has built-in facilities to draw a statistically representative sample from an Apache Spark Big Data cluster, running a Frontline-supplied component on one of the cluster nodes.

XLMiner SDK V2018 in Visual Studio

MODEL EXPORT IN PMML, AND EXPORT/IMPORT IN JSON
The new SDK release can export a wide range of trained/fitted models in industry-standard PMML (Predictive Modeling Markup Language) format: from data transformations to linear and logistic regression, decision trees, neural networks, and k nearest neighbors for both classification and prediction; discriminant analysis, naïve Bayes, time series models, association rules, and even ensembles with boosting, bagging, and random forest methods. Few other products provide such extensive PMML support.

XLMiner SDK also provides its own JSON serialization format, more general than PMML, for its full range of objects (DataFrames, Estimators and Models) and properties.

FASTER AND MORE ROBUST ALGORITHMS
Statistical and machine learning algorithms in XLMiner SDK are optimized for performance on current Intel-compatible processors. In the new release, the Naive Bayes algorithm is much faster and less memory-intensive, while K Nearest Neighbors is an order of magnitude faster in k-parameter tuning, and handles distance matrices that would exceed available memory in other software. Category Reduction and Missing Data Handling algorithms are also extended for multivariate use, with new "missing value options" for different data types, and One-Hot-Encoding is faster and enhanced for categorical variables. The new release even offers Vector and Matrix objects that enable developers to write high-level "linear algebra expressions" with high-performance, parallel multi-core execution. Si

XLMiner SDK V2018 in R Studio
TUTORIAL

USING SIMULATION AND OPTIMIZATION TOGETHER
Photo Courtesy of 123rf.com | © Wavebreak Media Ltd

Last summer in Solver International, we explored quantitative models to deal with uncertainty and risk using Monte Carlo simulation. And last fall, we explained the basics of mathematical optimization: finding the best way to allocate scarce resources, such as money, people and equipment. In many cases, what we really want is the best, or optimal, decision under conditions where there is uncertainty and risk. That's the topic of this article, where we'll combine ideas from simulation and optimization to build and solve a simulation optimization model.

BY DAN FYLSTRA, CEO, Frontline Systems
We still won't be covering the full scope of stochastic optimization, which aims to find the best decisions for an uncertain future. To do that, we would go beyond simulation optimization to explicitly model decisions we must make here and now, and other decisions where we can wait and see. Stay tuned for a tutorial where we take you further into the world of modeling and decision making using prescriptive analytics! But let's get started with the basics of simulation optimization in the present.

FROM OPTIMIZATION: DECISION VARIABLES, OBJECTIVE AND CONSTRAINTS
Our model will have decision variables: amounts of resources allocated to some use or purpose – for example, the number of call center employees working on each shift, or the number of packages of a given size loaded onto each truck. We'll have an objective, such as costs we'd like to minimize or profits to be maximized, and constraints – limits on the ways resources may be allocated – that reflect the real-world situation. As we'll see shortly, dealing with uncertainty means that we must reconsider what it means to maximize or minimize our objective, or to satisfy our constraints.

FROM SIMULATION: UNCERTAIN VARIABLES, DISTRIBUTIONS AND CORRELATION
Our models will also have uncertain variables. Where decision variables represent quantities that we can control or decide – such as how much to invest, or when to schedule call center employees – uncertain variables represent quantities that we cannot control or decide ourselves – such as stock prices, or the frequency of incoming calls to the call center. To simulate the behavior of uncertain variables, we'll draw random samples from a probability distribution for each of those variables. In advanced models, we would use rank-order correlation or copulas to deal with the fact that some uncertain quantities are related.

REVISITING A CALL CENTER SCHEDULING EXAMPLE
In our fall tutorial on optimization, we used a very simple example of a call center employee scheduling model, shown in Figure 1. Our problem is to schedule enough employees to work each day of the week (decision variables) to handle our predicted call volume (a constraint), while minimizing total payroll cost (our objective).

Since employees want to work 5 consecutive days and have 2 consecutive days off, there are 7 possible weekly schedules, each one starting on a different day of the week. These are labeled A, B, C through G in the Excel model. For example, employees on Schedule A have Sunday and Monday off, then they work Tuesday through Saturday. Our decision variables – the number of employees working on each schedule – are in cells D15:D21; they are summed in cell D22.

In this simple model, all employees are paid at the same rate, $96 per day at cell N15. So our objective, payroll costs to be minimized, is just =D22*N15*5 (5 working days per week).


In our fall tutorial, we met a constraint that the number of employees working each day of the week is greater than or equal to the "Minimum Required Per Day" figures in row 25. We assumed that we actually knew these numbers – we'll start there, but in most real-world call centers, the call volumes per day are uncertain, so we'll tackle that issue later in this tutorial.

The 1s and 0s in the middle of the worksheet help us calculate the number of employees we'll have in the call center on each day of the week. For example, on Sunday we'll have the employees on Schedules B through F, but those on Schedules A and G will have the day off. So the number of employees working Sunday is just =SUMPRODUCT(D15:D21,F15:F21) – and similarly for the other days of the week.

Our SUMPRODUCT formulas are in row 24, and we want each value to be greater than or equal to the corresponding "minimum required" number in row 25. We can express this as F24:L24 >= F25:L25, or using Excel defined names, as "Employees_per_day >= Required_per_day". You can see the Solver model taking shape in the right-hand Task Pane in Figure 1.

Figure 1: A simple Call Center scheduling example.
We cannot have a negative number of employees on any schedule, so we include a constraint D15:D21 >= 0 (or with defined names, "Employees_per_schedule >= 0"). Solver allows any whole number or fractional value for a decision variable, but we can't actually assign one-half or two-thirds of an employee to a schedule – so we add a constraint "Employees_per_schedule = integer".

SOLVING VIA CONVENTIONAL OPTIMIZATION
We can now solve the model by clicking the Optimize button on the Ribbon, or the "green arrow" on the Task Pane. The optimal solution ($12,000 payroll cost – a big improvement from $18,000 initially) is shown in Figure 2. Solver assigns 2, 5, 7, 4, 6, 1 and 0 employees, respectively, to Schedules A, B, C, D, E, F and G. Since our call volumes are heaviest on weekends, no employees will take both Saturday and Sunday off.

Figure 2: Optimal solution of the Call Center model without uncertainty.

Take a close look at the constraints: Even with the requirement for 5 consecutive working days and 2 consecutive days off, Solver found a way to allocate exactly the minimum required employees on 5 of 7 days; on the other 2 days, just one extra employee was needed.
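For readers who want to see the same deterministic model outside Excel, here is a minimal sketch in Python using the open-source PuLP modeler – our illustrative choice, not the tool the tutorial uses. It assumes schedule s has days s and s+1 (mod 7, with day 0 = Sunday) off, matching Schedule A's Sunday–Monday pattern.

```python
import pulp

required = [22, 17, 13, 14, 15, 18, 24]    # row 25: Sun..Sat minimums
schedules = range(7)                        # A..G: schedule s has days s
                                            # and s+1 (mod 7) off

model = pulp.LpProblem("call_center", pulp.LpMinimize)
x = [pulp.LpVariable(f"sched_{s}", lowBound=0, cat="Integer")
     for s in schedules]

# Objective: $96 per day, 5 working days, for every employee.
model += 96 * 5 * pulp.lpSum(x)

# Coverage: employees working on day d must meet that day's requirement.
for d, need in enumerate(required):
    on_duty = [x[s] for s in schedules if d not in (s, (s + 1) % 7)]
    model += pulp.lpSum(on_duty) >= need

model.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(model.objective))          # 12000.0, matching Figure 2
print([int(v.value()) for v in x])          # one optimal schedule assignment
```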


Your intuition might tell you that this is "too tight" – that's because you can imagine that call volumes, and hence the number of employees needed per day, are not so predictable. But in its current form, the model says that the number needed is known exactly – and Solver is taking full advantage of this.

Far too often in practice, even in large companies and consulting firms, optimization models are built in this way – assuming, implicitly, that the data or parameters of the model are known exactly – which is hardly ever true. If you take nothing else from this tutorial, think about how you can avoid this pitfall! But read on to learn how you can create a better model of a real-world business situation.

ADDING UNCERTAINTY TO THE MODEL
At present, we have constant numbers in row 25 for the minimum number of employees needed per day: 22 on Sunday, 17 on Monday, 13 on Tuesday, 14 on Wednesday, 15 on Thursday, 18 on Friday, and 24 on Saturday. These are based on the estimated incoming call volume per day, and the average number of calls one employee can handle per day. These quantities are uncertain, not constant: The actual number of calls, and hence the number of employees needed, will fluctuate each day. To keep things simple, we'll skip the calculations relating call volumes to employees needed, and focus on adjusting the model so the minimum number of employees required per day in row 25 is uncertain.

Drawing on our Monte Carlo simulation tutorial in last summer's Solver International, we'll replace these constant numbers with "generator functions" based on probability distributions. Instead of a fixed number like 22 employees needed on Sunday, we'll use a LogNormal distribution with a mean value of 22, and a standard deviation of 2. Analytic Solver shows us the "shape" or probability density of this distribution, as shown in Figure 3. Our model now says that we'll need an uncertain or variable number of employees on Sunday: most (90%) of the time we'll need between 19 and 25.

Figure 3: Probability distribution for employees needed on Sunday.
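If you want to experiment with this distribution outside Analytic Solver, the NumPy sketch below draws the same LogNormal. One wrinkle worth noting: NumPy parameterizes lognormal() by the mean and sigma of the underlying normal distribution, while the tutorial specifies the distribution's own mean (22) and standard deviation (2), so the sketch converts first.

```python
import numpy as np

m, s = 22.0, 2.0                            # desired LogNormal mean and std
sigma2 = np.log(1 + (s / m) ** 2)           # variance of the underlying normal
mu = np.log(m) - sigma2 / 2                 # mean of the underlying normal

rng = np.random.default_rng(12345)
draws = rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=100_000)

print(draws.mean(), draws.std())            # ~22 and ~2, as specified
print(np.percentile(draws, [5, 95]))        # ~[19, 25]: the 90% band in Figure 3
```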
In a similar way, we'll replace each of the fixed numbers in row 25 with LogNormal distributions, where the mean of the distribution is the same as the fixed number we had before.

To make the model a little more interesting, we'll add a commission plan for our call center employees. Such a plan would be based on our (uncertain and variable) call volume, but since we're reflecting that uncertainty in our distributions for the number of employees needed, we'll just calculate a total commission for the week, based on this number, and add it to our total payroll cost in cell D27. This makes our total payroll cost – which we want to minimize by a smart assignment of employees to schedules – itself an uncertain, variable quantity. Think about that for a minute.

IMPACT OF UNCERTAINTY: RUNNING A SIMULATION
To see the effect of uncertainty on our model, we can run a Monte Carlo simulation. If you read our Monte Carlo simulation tutorial last summer, you know how this works: Analytic Solver will cause Excel to calculate the model 1,000 or 10,000 times – each such calculation is called a Monte Carlo "trial". On each trial, Analytic Solver will randomly choose a "sample" value for each of the 7 uncertain variables in row 25, respecting the relative frequencies of the probability distributions – then collect and summarize the results.

To do this, we'd better tell Analytic Solver what we want to summarize: The easiest way is to select cell C27 (total payroll cost) and choose Results – Statistics – Mean, "dropping" the result in cell C28. Starting with our Figure 1 setup, assigning 5 employees to each of the 7 possible schedules, we choose Simulate – Run Once. Analytic Solver displays the simulated results for total payroll cost. Because of our commission plan, this is now a distribution of outcomes, peaking around $18,000, but sometimes less and sometimes more, as shown in Figure 4.

Figure 4: Results from Call Center model Monte Carlo simulation.

WHAT DOES IT MEAN TO OPTIMIZE AN UNCERTAIN MODEL?
If you weren't thinking about this earlier, ask yourself now: What does it mean to "minimize" this wide range of possible values for total payroll cost? We need to tell the software exactly what to do. There is more than one possible answer, but most managers would probably be happy if we can minimize the average (mean) payroll cost, or in the language of probability, the expected value of total payroll cost.
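In miniature, each Simulate – Run Once does something like the sketch below: draw the seven daily requirements on every trial, compute that trial's total payroll cost, and summarize. The commission formula here is a placeholder assumption – the tutorial doesn't spell out its commission plan – so take the shape of the computation, not the dollar figures, literally.

```python
import numpy as np

means = np.array([22, 17, 13, 14, 15, 18, 24], dtype=float)  # row 25 means
sigma2 = np.log(1 + (2.0 / means) ** 2)                      # std dev 2 each day
mu = np.log(means) - sigma2 / 2

rng = np.random.default_rng(0)
headcount = 35                          # 5 employees on each of the 7 schedules
base_pay = headcount * 96 * 5

# 10,000 Monte Carlo trials of the 7 uncertain daily requirements.
demand = rng.lognormal(mu, np.sqrt(sigma2), size=(10_000, 7))
commission = 10.0 * demand.sum(axis=1)  # placeholder: $10 per unit of demand
total_cost = base_pay + commission

print(total_cost.mean())                # the statistic =PsiMean(C27) reports
```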


Fortunately, in cell C28 we now have an Analytic Solver function =PsiMean(C27) which computes this quantity – so in the next step, we will make C28, not C27, the objective to be minimized in our optimization.

We have a similar issue with our primary constraint "Employees_per_day >= Required_per_day": For each day of the week, we want the number of employees working that day (in row 24) to cover the minimum number required (in row 25). But those fixed numbers in row 25 are now uncertain variables, representing a wide range of values – as shown in Figure 3. What does it mean to say that a fixed number of employees working on (say) Sunday in F24 is "greater than or equal to" the wide range of values in F25?

We could say that F24 should be greater than the maximum value that is ever sampled for F25 during a simulation – and Analytic Solver would let you do this, for example by placing =PsiMax(F25) in cell F26, and making the constraint F24 >= F26. But that would lead to a high staffing level and payroll cost (a mean of about $20,000, if you try it), and some employees are sure to be idle some of the time. Most managers would say something like "let's be sure we meet the demand at least 90% of the time" – and Analytic Solver will let you specify this, with =PsiPercentile(F25,90%) in cell F26, and F24 >= F26.

Figure 5: Complete setup of the Call Center simulation optimization model.

A SIMULATION OPTIMIZATION MODEL
Now that we know what it means to optimize the Call Center model with uncertainty, we can complete the model formulation as shown in Figure 5. Instead of minimizing total payroll cost at C27, we're minimizing the expected value of total payroll cost across all of our Monte Carlo trials, at C28. Instead of requiring that "Employees_per_day >= Required_per_day", where "Required_per_day" is now a set of LogNormal distributions, we require that "Employees_per_day >= Required_90th_Percentile". We still have a lower bound of 0 for each decision variable, but we've added an upper bound of 25: This is because Solver Engines suitable for this simulation optimization problem generally require such bounds on the variables, to limit the search space and time.

Figure 6: Optimal solution of the Call Center simulation optimization model.


Note that with our original Figure 1 setup, assigning 5 employees to each of the 7 possible schedules, we don't actually satisfy the constraint "Employees_per_day >= Required_90th_Percentile" on Saturday (and our expected total payroll cost is still about $18,000). The first task for Analytic Solver is to find a feasible solution – one that satisfies all of the constraints.

SOLVING THE SIMULATION OPTIMIZATION MODEL
We'll use the advanced Evolutionary Solver included as part of Analytic Solver to find a solution, using simulation optimization. The basic idea behind simulation optimization is very simple: An "outer loop" tests different possible values for the decision variables, and within that, an "inner loop" samples many different values for the uncertain variables, computes statistics such as PsiMean() and PsiPercentile(), and returns values that the "outer loop" search process can use.
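Here is that outer loop/inner loop structure as a runnable sketch. A crude random search stands in for the Evolutionary Solver's genetic algorithm, and the placeholder commission plan from the earlier sketch is reused; a real engine searches far more intelligently, so treat this only as a picture of the mechanism.

```python
import numpy as np

means = np.array([22, 17, 13, 14, 15, 18, 24], dtype=float)
sigma2 = np.log(1 + (2.0 / means) ** 2)
mu = np.log(means) - sigma2 / 2
rng = np.random.default_rng(1)

# works[d, s] is 1 if schedule s works on day d (days s and s+1 are off).
works = np.array([[d not in (s, (s + 1) % 7) for s in range(7)]
                  for d in range(7)], dtype=float)

# Inner-loop inputs: one set of Monte Carlo trials, reused for every candidate
# ("common random numbers"), plus the 90th percentile of each day's demand -
# the same statistic that =PsiPercentile(F25,90%) supplies on the worksheet.
demand = rng.lognormal(mu, np.sqrt(sigma2), size=(5_000, 7))
need90 = np.percentile(demand, 90, axis=0)

def inner_loop(x):
    """Simulate one candidate: mean total cost, plus 90%-coverage feasibility."""
    cost = 96 * 5 * x.sum() + 10.0 * demand.sum(axis=1)
    return cost.mean(), bool(np.all(works @ x >= need90))

best_x, best_cost = None, np.inf
for _ in range(20_000):                     # outer loop: propose candidates
    x = rng.integers(0, 26, size=7)         # integer staffing, bounds 0..25
    mean_cost, feasible = inner_loop(x)
    if feasible and mean_cost < best_cost:
        best_x, best_cost = x, mean_cost

print(best_x, best_cost)                    # best feasible candidate found
```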
If you read our fall tutorial on conventional optimization, where we discussed linear programming, linear mixed-integer, nonlinear and global optimization problems, you might wonder what type of problem we've created here. Analytic Solver can automatically analyze the algebraic structure of our model: It's shown in the bottom right of Figure 5, "Stochastic LP/MIP". This type of model is much harder to solve than a conventional LP/MIP problem, but it's not the "worst case" of a non-smooth non-convex model.

Figure 6 shows the optimal solution found by the Evolutionary Solver for this model, in about 5 seconds on an average PC. The Evolutionary Solver can use either "genetic algorithm" or "tabu and scatter search" methods – both of these find the same solution, but the genetic algorithm is faster in this case. Unlike our initial setup, we're now satisfying all of the constraints, and we've reduced the expected value of total payroll cost from about $18,000 to $15,150.

CONCLUSION: WELCOME TO PRESCRIPTIVE ANALYTICS
In this tutorial, we've introduced the approaches you can take to combine Monte Carlo simulation and mathematical optimization methods, to find an optimal solution for the kind of problem that occurs repeatedly in almost every kind of business: allocating scarce resources under conditions of uncertainty.

But we've only scratched the surface of what's possible with modern "prescriptive analytics" tools. There are many more tools available to help formulate models, choose probability distributions or fit them to past data, run multiple, parameterized simulations and optimizations, and much more. And we haven't even touched upon the topic of stochastic optimization, which aims to find the best decisions for an uncertain future. So we hope you'll stay tuned for more insights about real-world prescriptive analytics, in future issues of Solver International. Si
DATA ANALYSIS

LOCATION DATA – THE MISSING LINK

Location data, and the intelligence that geospatial analysis can provide business decision makers, is the missing link that ties together various data sets from Big Data sources.

Photo Courtesy of 123rf.com | © Thananit Suntiviriyanon

BY THOMAS WALK, COO, Turtler GPS Ltd., Dublin, Ireland
While the mantra of real estate has long been "Location, Location, Location," that focus is only now becoming obvious in data analysis. Location analytics has the potential to render the complex data landscape more orderly. Location is a common link between seemingly disconnected dumps of Big Data. Data sets that are disconnected and seem to have no relevance to each other can suddenly make sense once the dimension of location is added. Relationships between data sets with no obvious connection can and will emerge once you geo-enrich them, giving you a better view of customer behavior.

Geo-enriching disconnected data sets is an effective way to reveal relationships that are not obvious and, by doing so, to arrive at the types of analytic insights that help business decision makers improve their bottom lines. The following are just a few ways in which geo-enriching with location data can improve the bottom line:
1. Reduce costs
2. Augment address verification
3. Improve customer experience with in-store alerts
4. Develop civic and community participation in local government

Location data is now mainstream, with business intelligence platforms adding it to their offerings. However, to get deeper knowledge of location data and the analytic insights it offers, you still need specialized vendors who can provide more advanced analytics, such as demographics. For the best results, geo-spatial data must be combined with more traditional business data to unlock business value.

GEO-ENRICHING BUSINESS DATA
Location data is fundamental for enriching business data. It starts with geocoding — getting latitude and longitude from an address. With these sets of data, you can do geo-enriching. Geo-enrichment pairs the geocoded data points with authoritative attributes to present a more detailed understanding of the real environment of business data.

In the real estate insurance industry, an address or parcel of land is geocoded. Then it is geo-enriched with data about the land, such as the type and number of buildings, the property age, construction, residential or commercial use, sales value, etc. Accurately geocoding location data can put a property inside or outside a designated natural hazard area. On the ground, it is only a difference of a few hundred feet, but it could have huge cost implications if the insurer undercharges or overcharges for a policy, or exposes itself to too much risk.

Regardless of industry, raw data is much more useful if it is geo-enriched. In the financial sector, geo-enrichment of accounts can protect against fraud and identify links between different accounts. This is done by locating customers and transactions, incidents and areas of risk, thereby spotting geographic connections between accounts.
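At its simplest, this kind of geo-enrichment is a geocode followed by a point-in-polygon test. The sketch below uses the open-source Shapely library; the coordinates and the flood-zone boundary are invented for illustration, and in practice the (longitude, latitude) points would come from a geocoding service.

```python
from shapely.geometry import Point, Polygon

# A designated flood zone, as a (longitude, latitude) boundary ring.
flood_zone = Polygon([(-97.75, 30.25), (-97.70, 30.25),
                      (-97.70, 30.29), (-97.75, 30.29)])

# Two geocoded properties a short distance apart.
property_a = Point(-97.72, 30.27)
property_b = Point(-97.69, 30.27)

print(flood_zone.contains(property_a))   # True: inside the designated zone
print(flood_zone.contains(property_b))   # False: just outside it
```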


For civic organizations, raw pollution data can be monitored and analyzed against demographic information, such as the density of vehicles, which may be a factor in causing the pollution. Knowing the location of health problems caused by pollution allows officials to do something about it.

Creating data sets is such a massive undertaking that many businesses are choosing to buy them, with intelligence vendors differentiating themselves by the degree to which their data is geo-enriched. In the logistics industry, accurate addressing can make a huge difference in company performance. It sounds like a small thing, but it can have huge implications. Inputting correct and accurate addresses is a major bottleneck in logistics because of national differences in address standards. This pain point starts out small but can result in huge wastes of time and money.
For example, global addresses are not standardized across borders. Incorrect input of an address can result in futile trips, customs holds, compliance failures, unwanted costs and delays, substantial fines, and even the loss of trading rights. This can be fixed by geo-enriching data that already includes location coordinates, making sure the address metadata is correct and accurate.

UNDERSTANDING BUSINESS DATA
Every object exists in space and time, and there is always something to be learned from analyzing those locations in conjunction with other data sets. In fact, location is a common key to link different data sets, revealing previously unknowable relationships.

For example, it could link the path of an actual physical tornado to the resources an insurance company should marshal in its aftermath. People are influenced by geographic considerations when they consume products or services, so organizations are using location analytics to gain insight when choosing the placement of an asset, such as a company store, or what products to offer in different geographic regions. Consultants and business intelligence experts agree that the businesses which can link location analytics with traditional intelligence and data will receive the greatest benefits.
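That "common key" idea is exactly what a spatial join implements. Below is a small sketch with the open-source GeoPandas library – the claim locations and the storm-track polygon are invented – showing two tables with no shared ID being linked purely by where their records sit.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Insurance claims: located, but with no storm identifier of their own.
claims = gpd.GeoDataFrame(
    {"claim_id": [101, 102, 103]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(0.6, 0.4)],
    crs="EPSG:4326",
)

# The surveyed damage path of a tornado.
tornado_path = gpd.GeoDataFrame(
    {"storm": ["May 18 track"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
    crs="EPSG:4326",
)

# Location is the join key: match each claim to the path it falls within.
linked = gpd.sjoin(claims, tornado_path, how="inner", predicate="within")
print(linked[["claim_id", "storm"]])      # claims 101 and 103 lie on the track
```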
In retail, stores that use location technology such as Bluetooth Low Energy (BLE) beacons, Wi-Fi or other means to analyze a customer's in-store shopping behavior can use this data to offer automated in-store alerts and offerings based on the customer's past shopping history. This gives a personal touch to sales offerings, based on actual customer preference, in an individualized way.

Starting in 2014 and 2015, there was an increasing awareness among business leaders of the importance of leveraging location analytics. Geospatial analytics had expanded beyond its base in GIS (Geographic Information Systems) platforms. The companies that can link location analytics with other types of analysis, such as predictive and machine learning, are the ones that can provide the broadest value to customers. They are also the ones best able to handle the large data dumps from Big Data.

There are few enterprises at the geospatial level that can analyze location analytics in the context of broader data. The industry still relies primarily on open source platforms and, thus, there is a broad opportunity for vendors and ancillary companies to exploit this space.

Something like 80% of business data has a hidden location component, but traditionally few operators are working out this location intelligence and the connections it can illuminate.


The revolution of change brought by the advent of the Internet gave companies access to millions of IP addresses, and then the advent of cellphones—and especially smartphones—gave companies real-time access to the location of hundreds of millions of users. In addition, with the mainstreaming of the IoT (Internet of Things) revolution, a wider and wider array of everyday devices connected to the Internet can feed location analytics into the big location data dump.

PRESENTING BUSINESS DATA
Maps are a time-tested tool for visualizing data. Mercator, cylindrical, and conic projections have been around for hundreds of years to share information about location. This location data was used by explorers, traders and business leaders to share business data from antiquity to today.

Traditional business analytics that focus on text search, machine learning, and algorithm dumps tend to present information in boring, hard-to-read spreadsheets or charts. The use of maps to present location analytics, and the insight they bring to other business data, makes it easy to visualize the underlying relationships between different business vectors. Because maps are so widely used and readily recognizable by even the layperson, using them to present complex business analytics opens this business analysis to people far beyond the technical professionals who have analyzed this information in the past.

One example of how this form of visualization through maps can make complex analytics much easier for regular people to grasp is in the case of home insurance. Buying and selling houses is a process that involves a mountain of documents, all of which are related to a certain location. With an interactive map, an insurance organization can make it possible to see at a glance how close a given property is to natural hazards such as a flood zone, an earthquake zone, lands that are prone to forest fire, erosion areas, etc.

Location data and GPS analytics are the key that can tie seemingly unconnected data together to gain valuable business insight. Companies that can best utilize the vast stores of available location data to geo-enrich their business intelligence are best suited to thrive in the new economy. With these insights, they can make better decisions on how, what and where to provide products and services to their customers. They can improve their bottom lines by cutting unnecessary costs. And they can better and more accurately present information to their shareholders and customers alike. Si
TUTORIAL

TEXT MINING

Some data analysis tasks cannot be completed in a reasonable time by a human, and that makes text mining a useful tool in modern data science and forensics.

Text mining is the practice of automated analysis of one document or a collection of documents (a corpus) to extract non-trivial information. Text mining usually involves transforming unstructured textual data into a structured representation by analyzing the patterns derived from the text. The results can be analyzed to discover interesting information, some of which would only be found by a human carefully reading and analyzing the text. Typically, text mining includes, but is not limited to, Automatic Text Classification/Categorization, Topic Extraction, Concept Extraction, Documents/Terms Clustering, Sentiment Analysis, and Frequency-based Analysis.

BY NICOLE STEIDEL, Technical Support Analyst, Frontline Systems

While you may not deal with datasets in the double-digit millions, the text mining capabilities of modern analytics software can reveal similar patterns in your data and help you analyze it more effectively. Here is a brief tutorial on using a major analytics application to develop your text mining capabilities.


Photo Courtesy of 123rf.com | © Marina Putilova

USING TEXT MINER
Frontline's Analytic Solver Data Mining's Text Miner takes an integrated approach to text mining, as it does not totally separate the analysis of unstructured data from the traditional data mining techniques applicable to structured information. While Analytic Solver Data Mining is a very powerful tool for analyzing text only, it also offers automated treatment of mixed data—combinations of multiple unstructured and structured fields. This is a particularly useful feature with many real-world applications, such as analyzing maintenance reports, evaluation forms, insurance claims, and many other formats.

Text Miner uses the "bag of words" model – the simplified representation of text in which the precise grammatical structure and exact word order are disregarded. Instead, syntactic, frequency-based information is preserved and used for text representation. Although such assumptions might be harmful for some specific applications of Natural Language Processing (NLP), the model has been proven to work very well for applications such as Text Categorization, Concept Extraction, and other areas addressed by Text Miner's capabilities. It has been shown in many theoretical and empirical studies that syntactic similarity often implies semantic similarity. One way to access syntactic relationships is to represent text in terms of the Generalized Vector Space Model (GVSM). The advantage of such a representation is a meaningful mapping of text to the numeric space; the disadvantage is that some semantic elements, such as the order of words, are lost (recall the bag-of-words assumption).
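To see what the bag-of-words representation looks like concretely, here is a minimal sketch using scikit-learn's CountVectorizer as a stand-in for Text Miner's internal term-document matrix: word order disappears, while per-document term counts remain.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the spark plug misfires",
        "replace the spark plug",
        "the capacitor leaks current"]

vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(docs)   # document x term count matrix

print(vectorizer.get_feature_names_out())   # the extracted vocabulary
print(term_doc.toarray())                   # one row of counts per document
```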

Input to Text Miner can be of two main types: a few relatively large documents (for example, several books) or a relatively large number of smaller documents (a collection of emails, news articles, product reviews, comments, tweets, Facebook posts). While Text Miner can analyze large text documents, it is particularly effective for large corpuses of relatively small documents. Obviously, this functionality has a limitless number of applications: for instance, e-mail spam detection, topic extraction in articles, automatic rerouting of correspondence, sentiment analysis of product reviews, and even analyzing net neutrality postings!

The input for text mining is a dataset on a worksheet, with at least one column that contains free-form text—or file paths to documents in a file system containing free-form text—and, optionally, other columns that contain traditional structured data.

The output of text mining is a set of reports that contain general explorative information about the collection of documents, plus structured representations of the text. Free-form text columns are expanded to a set of new columns with numeric representation. The new columns will each correspond to either a single term (word) found in the "corpus" of documents or, if requested, a concept extracted from the corpus through Latent Semantic Indexing (LSI, also called LSA or Latent Semantic Analysis). Each concept represents an automatically derived complex combination of terms/words that have been identified as related to a topic in the corpus of documents.
The structural representation of text
can serve as an input to any traditional GETTING THE DATA
data mining techniques available in This example is based on the text
Text Miner: unsupervised/supervised, dataset at http://www.cs.cmu.edu/
affinity, visualization techniques, etc. afs/cs/project/theo-20/www/data/
In addition, Text Miner also news20.html, which consists of 20,000
presents a visual representation of messages, collected from 20 different
text mining results to allow the user to Internet newsgroups. We selected
interactively explore information that about 1,200 of these messages that were
would otherwise be extremely hard to posted to two interest groups, Autos
analyze manually. Typical visualizations and Electronics (about 500 documents

In Text Miner, select all the files in the folder and they will appear in the left listbox under Files. Move the files from the Files listbox to the Selected Files listbox. Repeat these steps for the other subfolders. When these steps are completed, 985 files will appear under Selected Files.

Select Sample from selected files to enable the Sampling Options; Text Miner will perform sampling from the files in the Selected Files field. Enter 300 for Desired sample size and Text Miner will select 300 files using Simple random sampling with a seed value of 12345. Under Output, leave the default setting of Write file paths. Rather than writing the file contents into the report, Text Miner will include the file paths.
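Outside of Text Miner, the same reproducible sampling step can be sketched in a few lines of Python (the folder name is hypothetical; the fixed seed is what guarantees that rerunning the analysis selects the same 300 files):

# Simple random sampling of document paths with a fixed seed.
import random
from pathlib import Path

paths = sorted(str(p) for p in Path("newsgroups").rglob("*") if p.is_file())
rng = random.Random(12345)        # the seed value used in the dialog
sample = rng.sample(paths, min(300, len(paths)))
print(len(sample), "files selected")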
The output XLM_SampleFiles will be inserted into the Data Mining task pane. The Data portion of the report displays the selections made on the Import From File System dialog. Here we see the path of the directories, the number of files written, our choice to write the paths or contents (File Paths), the sampling method, the desired sample size, the actual size of the sample, and the seed value (12345) (Figure 1).

Figure 1.

Underneath the Data portion are the paths to the 300 sampled text files, in random order. If Write file contents had been selected rather than Write file paths, the report would contain the RowID, the File Path, and the first 32,767 characters of each document.

The selected file paths are now in random order, but we will need to categorize the "Autos" and "Electronics" files to be able to identify them later. To do this, we'll use Excel to sort the rows by the file path. The file paths should now be sorted between Electronics and Autos files.

Confirm that XLM_SampleFiles is selected for Worksheet. Select TextVar in the Variables listbox and move it to the Selected Text Variables listbox. By doing so, we are selecting the text in the documents as input to the Text Miner model. Ensure that "Text variables contain file paths" is checked.

Leave the default setting of Analyze all terms selected under Mode. When this option is selected, Text Miner will examine all terms in the document. A "term" is defined as an individual entity in the text, which may or may not be an English word; a term can be a word, a number, an e-mail address, a URL, etc. Terms are separated by all possible delimiting characters (i.e., \, ?, ', `, ~, |, \r, \n, \t, :, !, @, #, $, %, ^, &, *, (, ), [, ], {, }, <, >, _, ;, =, -, +, /), with some exceptions related to stopwords, synonyms, exclusion terms, and boilerplate normalization (URLs, e-mails, monetary amounts, etc.), where Text Miner will not tokenize on these delimiters.

Note: These exceptions relate not to how terms are separated, but to whether a term is split on a given delimiter. For example, URLs contain many characters such as "/" or ";". Text Miner will not tokenize on those characters inside a URL; it treats the whole URL as a single term, and will remove it if URL removal is selected.
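The behavior just described, splitting on punctuation while keeping URLs and e-mail addresses intact, can be imitated with a regular expression that matches the special tokens before falling back to plain words. This is only an illustrative sketch, not Text Miner's actual tokenizer:

# Tokenize on delimiters, but keep URLs and e-mail addresses whole.
import re

TOKEN = re.compile(
    r"https?://\S+"               # a URL survives as a single term
    r"|[\w.+-]+@[\w-]+\.[\w.-]+"  # an e-mail address survives as a single term
    r"|[A-Za-z0-9]+"              # everything else splits on the delimiters
)

text = "Contact sales@example.com or visit https://example.com/specs today"
print(TOKEN.findall(text))
# -> ['Contact', 'sales@example.com', 'or', 'visit',
#     'https://example.com/specs', 'today']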
If Analyze specified terms only is selected, the Edit Terms button will be enabled. If you click this button, the Edit Exclusive Terms dialog opens. Here you can add and remove the terms to be considered for text mining; all other terms will be disregarded. For example, if we wanted to mine each document for a specific part name such as "alternator," we would click Add Term on the Edit Exclusive Terms dialog, then replace "New term" with "alternator" and click Done to return to the Pre-Processing dialog. During the text mining process, Text Miner would analyze each document for the term alternator, excluding all other terms.

Leave both Start term/phrase and End term/phrase empty under Text Location. If this option is used, text appearing before the first occurrence of the Start Phrase will be disregarded and, similarly, text appearing after the End Phrase (if used) will be disregarded. For example, if text mining the transcripts from a Live Chat service, you would not be interested in any text appearing before the heading "Chat Transcript" or after the heading "End of Chat Transcript." You would enter "Chat Transcript" into the Start Phrase field and "End of Chat Transcript" into the End Phrase field (Figure 2).

Figure 2.

Leave the default setting for Stopword removal. Click Edit to view a list of commonly used words that will be removed from the documents during pre-processing. To remove a word from the Stopword list, simply highlight the desired word, then click Remove Stopword. To add a new word to the list, click Add Stopword; a new term, "stopword," will be added, which you can then click to edit.

Text Miner also allows additional stopwords to be added, or existing ones removed, via a text document (*.txt), by using the Browse button to navigate to the file. Terms in the text document can be separated by a space, a comma, or both. If supplying three terms in a text document rather than in the Edit Stopwords dialog, the terms could be listed as: subject emailterm from, or subject,emailterm,from, or subject, emailterm, from. If we had a large list of additional stopwords, this would be the preferred way to enter the terms.
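Reading such a file is simple; the only wrinkle is accepting spaces, commas, or both as separators. A hypothetical sketch:

# Load extra stopwords from a *.txt file where terms may be separated
# by spaces, commas, or both.
import re

def load_stopwords(path):
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # split on any run of commas and/or whitespace, dropping empties
    return {term.lower() for term in re.split(r"[,\s]+", raw) if term}

# A file containing "subject emailterm from" or "subject,emailterm,from"
# (or a mix of both) yields the same set: {"subject", "emailterm", "from"}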
Advanced in the Term Normalization group allows us to indicate to Text Miner that (a short code sketch of these substitutions appears after the list):
• If stemming reduces a term's length to 2 or fewer characters, the term will be disregarded (Minimum stemmed term length).

• HTML tags, and the text they enclose, will be removed entirely. HTML tags and the text contained inside them often hold technical, computer-generated information that is not typically relevant to the goal of the text mining application.
• URLs will be replaced with the term "urltoken." The specific form of a URL does not normally add any meaning, but it is sometimes interesting to know how many URLs are included in a document.
• E-mail addresses will be replaced with the term "emailtoken." The documents in our collection all contain a great many email addresses, and the distinction between different e-mails often has little use in text mining.
• Numbers will be replaced with the term "numbertoken."
• Monetary amounts will be substituted with "moneytoken." (Figure 3)

Figure 3.
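These substitutions are straightforward to express as ordered regular-expression replacements. The sketch below illustrates the idea only; the patterns are deliberately simplified and are not Analytic Solver's implementation:

# Boilerplate normalization: collapse URLs, e-mail addresses, monetary
# amounts, and numbers into placeholder tokens before counting terms.
import re

def normalize(text):
    text = re.sub(r"https?://\S+", " urltoken ", text)
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", " emailtoken ", text)
    text = re.sub(r"\$\d[\d,]*(\.\d+)?", " moneytoken ", text)   # before plain numbers
    text = re.sub(r"\b\d[\d,]*(\.\d+)?\b", " numbertoken ", text)
    return text

print(normalize("Send $1,200 to parts@dealer.com via https://pay.example.com today"))
# roughly: "Send moneytoken to emailtoken via urltoken today"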

Recall that when we inspected an email from the document collection, we saw several terms such as "subject," "from," and e-mail addresses. Since all our documents contain these terms, including them in the analysis will not provide any benefit and could bias the analysis. As a result, we will exclude these terms from all documents by selecting Exclusion list and then clicking Edit. The label "exclusionterm" is added; click to edit and change it to "subject." Then repeat these same steps to add the term "from." We can take the e-mail issue one step further and completely remove the term "emailtoken" from the collection by editing "exclusionterm" to "emailtoken." We can also enter these terms into a text document (*.txt) and add the terms all at once.

Text Miner also allows the combining of synonyms and full phrases by clicking Advanced within Vocabulary Reduction. Select Synonym reduction at the top of the dialog to replace synonyms such as "car," "automobile," "convertible," "vehicle," "sedan," "coupe," "subcompact," and "jeep" with "auto." Click Add Synonym and replace "rootterm" with "auto," then replace "synonym list" with "car, automobile, convertible, vehicle, sedan, coupe, subcompact, jeep" (without the quotes) (Figure 4).

Figure 4.

If adding synonyms from a text file, each line must be of the form rootterm:synonymlist. Using our example: auto:car automobile convertible vehicle sedan coupe, or auto:car,automobile,convertible,vehicle,sedan,coupe. Note that the terms in the synonym list can be separated by a space, a comma, or both. If we had a large list of synonyms, this would be the preferred way to enter the terms.
Text Miner also allows the combining of words into phrases that indicate a singular meaning, such as "station wagon," which refers to a specific type of car rather than two distinct tokens, station and wagon. To add a phrase in the Vocabulary Reduction – Advanced dialog, select Phrase reduction and click Add Phrase. The term "phrasetoken" will appear; click to edit, and enter "wagon." Click "phrase" to edit and enter "station wagon." If supplying phrases through a text file (*.txt), each line of the file must be of the form phrasetoken:phrase, or, using our example, wagon:station wagon.
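Both file formats are easy to apply programmatically. Here is a hypothetical sketch of the same reductions (phrases are fused first so that "station wagon" becomes one token before synonym mapping runs):

# Vocabulary reduction: fuse phrases into single tokens and map
# synonyms onto a root term, mirroring the rootterm:synonymlist and
# phrasetoken:phrase file formats described above.
import re

synonyms = {w: "auto" for w in
            ["car", "automobile", "convertible", "vehicle",
             "sedan", "coupe", "subcompact", "jeep"]}
phrases = {"station wagon": "wagon"}

def reduce_vocabulary(text):
    text = text.lower()
    for phrase, token in phrases.items():
        text = text.replace(phrase, token)
    return [synonyms.get(w, w) for w in re.findall(r"[a-z]+", text)]

print(reduce_vocabulary("A station wagon is a vehicle, not a coupe"))
# -> ['a', 'wagon', 'is', 'a', 'auto', 'not', 'a', 'auto']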
If you enter 200 for Maximum Vocabulary Size, Text Miner will reduce the number of terms in the final vocabulary to the top 200 most frequently occurring in the collection.

Leave Perform stemming as the selected default. Stemming is the practice of stripping words down to their "stems" or "roots." For example, stemming terms such as "argue," "argued," "argues," "arguing," and "argus" would result in the stem "argu," while "argument" and "arguments" would stem to "argument." The stemming algorithm utilized in Text Miner is "smart" in the sense that while "running" would be stemmed to "run," "runner" would not. Text Miner uses the Porter Stemmer 2 algorithm for the English language. For more information on this algorithm, please see: http://tartarus.org/martin/PorterStemmer/
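You can reproduce the stemmer's behavior with the NLTK library's Snowball stemmer, which implements the Porter2 ("English") algorithm referenced above (NLTK is an assumption here; it is not part of Analytic Solver):

# Porter2 ("English") stemming via NLTK's Snowball stemmer.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["argue", "argued", "argues", "arguing",
             "argument", "running", "runner"]:
    print(word, "->", stemmer.stem(word))
# The "argue" family collapses to the stem "argu", "running" stems
# to "run", and "runner" is left essentially as-is.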

Leave the default selection for Normalize case. When this option is checked, Text Miner converts all text to a consistent (lower) case, so that Term, term, TERM, etc. are all normalized to a single token "term" before any processing, rather than being treated as three independent tokens with different cases. Skipping this simple step can dramatically distort the frequency distributions of the corpus, leading to biased results (Figure 5).

Figure 5.

For many text mining applications, the goal is to identify terms that are useful for discriminating between documents. If a term occurs in all or almost all documents, it may not be possible to highlight the differences. If a term occurs in very few documents, it will often indicate great specificity of this term, which is not very useful for some text mining purposes. Enter 3 for Remove terms occurring in less than _% of documents and 97 for Remove terms occurring in more than _% of documents.

When you enter 20 for Maximum term length, terms that contain more than 20 characters will be excluded from the text mining analysis and will not be present in the final reports. This option can be extremely useful for removing some parts of text which are not actual English words, for example, URLs or computer-generated tokens, or for excluding very rare terms such as Latin species or disease names, i.e., Pneumonoultramicroscopicsilicovolcanoconiosis.

Keep the default selection of TF-IDF (Term Frequency–Inverse Document Frequency) for Term-Document Matrix Scheme. A term-document matrix displays the frequency-based information of terms occurring in a document or collection of documents. Each column is assigned a term and each row a document. If a term appears in a document, a weight is placed in the corresponding column indicating the term's importance or contribution.

Text Miner offers four commonly used weighting schemes to represent each value in the matrix—Presence/Absence, Term Frequency, TF-IDF (the default), and Scaled term frequency. It's also possible to create your own scheme by selecting choices for local weighting, global weighting, and normalization.
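Under TF-IDF, a term's weight in a document grows with its frequency in that document and shrinks with the number of documents that contain it, so terms that are common in one document but rare across the collection score highest. Here is a hypothetical scikit-learn sketch that also mirrors the 3%/97% document-frequency cutoffs and the 200-term vocabulary cap chosen earlier (the four miniature documents are invented):

# TF-IDF term-document matrix with document-frequency pruning and a
# capped vocabulary, mirroring the dialog settings described above.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the alternator died and the battery drained",
    "replaced brake pads because the brakes were squeaky",
    "the amplifier circuit needs a new capacitor",
    "charged the battery and checked the alternator belt",
]
tfidf = TfidfVectorizer(
    min_df=0.03,       # drop terms in fewer than 3% of documents
    max_df=0.97,       # drop terms in more than 97% of documents
    max_features=200,  # keep at most the 200 most frequent terms
)
X = tfidf.fit_transform(docs)  # rows = documents, columns = weighted terms
print(X.shape)
print(tfidf.get_feature_names_out())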
The statistics produced and displayed in the Term-Document Matrix contain basic information on the frequency of terms appearing in the document collection. With this information we can "rank" the significance or importance of these terms relative to the collection and any individual document.

Latent Semantic Indexing, in comparison, uses singular value decomposition (SVD) to map the terms and documents into a common space to find patterns and relationships. For example, if we inspected our document collection, we might find that each time the term "alternator" appeared in an automobile document, the document also included the terms "battery" and "headlights." Or each time the term "brake" appeared in an automobile document, the terms "pads" and "squeaky" also appeared. However, there is no detectable pattern regarding the use of the terms "alternator" and "brake": documents including "alternator" might not include "brake," and documents including "brake" might not include "alternator." Our four terms—battery, headlights, pads, and squeaky—describe two different automobile repair issues: failing brakes and a bad alternator.

Latent Semantic Indexing will attempt to distinguish between these two different topics, identify the documents that deal with faulty brakes, alternator problems, or both, and map the terms into a common semantic space using singular value decomposition. SVD is the tool Text Miner uses to extract concepts that explain the main dimensions of meaning of the documents in the collection. The results of LSI are usually hard to examine because the construction of the concept representations is not fully explained; interpreting these results is more of an art than a science. However, Text Miner provides several visualizations that simplify this process greatly.
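In code terms, latent semantic analysis is a truncated SVD of the weighted term-document matrix, and each retained singular vector is one "concept." A hypothetical sketch continuing the miniature example above:

# Latent Semantic Analysis: truncated SVD of the TF-IDF matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the alternator died and the battery drained",
    "replaced brake pads because the brakes were squeaky",
    "the amplifier circuit needs a new capacitor",
    "charged the battery and checked the alternator belt",
]
X = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=12345)  # 2 "concepts" here
doc_concept = svd.fit_transform(X)  # Concept-Document matrix (row per document)
term_concept = svd.components_.T    # Concept-Term matrix (row per term)

print(svd.singular_values_)         # akin to the Concept Importance table
print(doc_concept.round(2))         # documents placed in concept space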
Select Maximum number of concepts and increment the counter to 20. Doing so tells Text Miner to retain the top 20 most significant concepts. If Automatic is selected, Text Miner will calculate the importance of each concept, take the difference between each, and report any concepts above the largest difference.

Keep Term frequency table selected (the default) under Preprocessing Summary and select Zipf's plot. Increase the Most frequent terms to 20 and select Maximum corresponding documents. The Term frequency table will include the top 20 most frequently occurring terms. The first column, Collection Frequency, displays the number of times the term appears in the collection. The second column, Document Frequency, displays the number of documents that include the term. The third column, Top Documents, displays the top 5 documents in which the corresponding term appears most frequently.

The Zipf Plot graphs document frequency against term rank in descending order of frequency. Zipf's law states that the frequency of terms used in free-form text drops exponentially; that is, people tend to use a relatively small number of words extremely frequently and a large number of words very rarely.
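Plotting collection frequency against term rank on log-log axes makes Zipf's law visible as a roughly straight, steeply falling line. A hypothetical matplotlib sketch:

# Rank-frequency (Zipf) plot for a small, invented document collection.
import re
from collections import Counter
import matplotlib.pyplot as plt

docs = [
    "the alternator died and the battery drained",
    "replaced brake pads because the brakes were squeaky",
    "the amplifier circuit needs a new capacitor",
]
counts = Counter(w for d in docs for w in re.findall(r"[a-z]+", d.lower()))
freqs = sorted(counts.values(), reverse=True)  # frequencies in rank order

plt.loglog(range(1, len(freqs) + 1), freqs, marker="o")
plt.xlabel("term rank")
plt.ylabel("collection frequency")
plt.title("Zipf plot (illustrative)")
plt.show()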
Text Miner will also produce a table displaying the document ID, the length of the document, the number of terms, and 20 characters of the text of the document. Run the Text Mining analysis: four pop-up charts will appear, and seven output sheets will be inserted into the Model tab of the Text Miner task pane.

The Term Count table shows that the original term count in the documents was reduced by 85.54 percent through the removal of stopwords, excluded terms, synonyms, phrases, and the other specified preprocessing procedures. The Documents table lists each document with its length and number of terms; if Keep a short excerpt is selected on the Output Options tab and a value is present for Number of characters, an excerpt from each document will also be displayed.

The Term–Document Matrix lists the 200 most frequently appearing terms across the top and the document IDs down the left. If a term appears in a document, a weight is placed in the corresponding column indicating the importance of the term, using our selection of TF-IDF on the Representation dialog (Figure 6).

Figure 6.

The Final List of Terms table contains the top 20 terms occurring in the document collection, the number of documents that include each term, and the top 5 document IDs where the corresponding term appears most frequently. In this list we would see terms such as "car," "power," "engine," "drive," and "dealer," which suggests that many of the documents, even the documents from the electronics newsgroup, were related to autos.

The Zipf Plot shows that our collection of documents follows the power law stated by Zipf. As we move from left to right on the graph, the documents that contain the most frequently appearing terms (when ranked from most frequent to least frequent) drop quite steeply. Hover over each data point to see detailed information about the term corresponding to that data point.

In our exercise, the term "numbertoken" is the most frequently occurring term in the document collection, appearing in 223 documents out of 300 and 1,083 times in total. Compare this to a less frequently occurring term such as "thing," which appears in only 64 documents and only 82 times in total (Figures 7 & 10).

Figure 7.

Figure 10.

The Concept Importance table lists each concept, its singular value, the cumulative singular value, and the percent of singular value explained. The number of concepts extracted is the minimum of the number of documents (985) and the number of terms (limited to 200). These values are used to determine which concepts should be used in the Concept–Document Matrix, the Concept–Term Matrix, and the Scree Plot. In this example, we entered "20" for Maximum number of concepts.

The Term Importance table lists the 200 most important terms. To increase the number of terms beyond 200, enter a larger value for Maximum Vocabulary on the Pre-processing tab of Text Miner (Figure 8).

Figure 8.

The Scree Plot gives a graphical representation of the contribution or importance of each concept. The largest "drop" or "elbow" in the plot appears between the first and second concepts. This suggests that the top concept explains the leading topic in our collection of documents, and that the remaining concepts have significantly reduced importance. However, we can always select more than one concept to increase the accuracy of the analysis. It is best to examine the Concept Importance table and the Cumulative Singular Value to identify how many top concepts capture enough information for your application.

The Concept-Document Scatter Plot is a visual representation of the Concept–Document matrix. Analytic Solver Data Mining normalizes each document representation so it lies on a unit hypersphere. Documents that appear in the middle of the plot, with concept coordinates near 0, are not explained well by either of the shown concepts. The further the magnitude of a coordinate from zero, the more effect that concept has for the corresponding document (Figure 11).

Figure 11.

In fact, two documents placed at the extremes of a concept (one close to -1 and the other to +1) indicate strong differentiation between these documents in terms of the extracted concept. This provides a means for understanding the actual meaning of the concept and for investigating which concepts have the largest discriminative power when used to represent the documents from the text collection.

The Concept–Term Matrix lists the top 20 most important concepts along the top of the matrix and the top 200 most frequently appearing terms down the side. The Term-Concept Scatter Plot visually represents the Concept–Term Matrix: it displays all terms from the final vocabulary in terms of two concepts. Similar to the Concept-Document scatter plot, it visualizes the distribution of vocabulary terms in the semantic space of meaning extracted with LSA. The coordinates are also normalized, so the range of the axes is always [-1, 1], where extreme values (close to +/-1) highlight the importance or "load" of a term on a particular concept.

Terms appearing in a zero-neighborhood of a concept's range do not contribute much to that concept's definition. In our example, if we identify a concept having a set of terms that can be divided into two groups, one related to "Autos" and the other to "Electronics," and these groups are distant from each other on the axis corresponding to the concept, this provides evidence that this particular concept has "caught" some pattern in the text collection that is capable of discriminating the topic of an article.

Therefore, the Term-Concept scatter plot is an extremely valuable tool for examining and understanding the main

topics in the collection of documents, finding similar words that indicate a similar concept, or finding terms that explain the concept from "opposite sides"—e.g., term1 can be related to cheap, affordable electronics and term2 to expensive luxury electronics.

From here, we can use any of the six classification algorithms to classify our documents according to some term or concept, using the Term–Document matrix, Concept–Document matrix, or Concept–Term matrix, where each document becomes a "record" and each term or concept becomes a "variable."

In this example, we will use the Logistic Regression classification method to create a classification model using the Concept–Document matrix. Recall that this matrix includes the top twenty concepts extracted from the document collection across the top of the matrix and each document in the sample down the left. Each concept will now become a "feature" and each document a "record."

First, we'll need to append a new column with the class to which each document is currently assigned: electronics or autos. Since we sorted our documents at the beginning of the example starting with Autos, we can simply enter "Autos" into the appropriate rows of the column.

Next, we'll need to partition our data into two datasets: a training dataset where the model will be "trained," and a validation dataset where the newly created model can be tested, or validated. When the model is being trained, the actual class label assignments are shown to the algorithm so it can learn which variables (or concepts) result in an "auto" or "electronic" assignment. When the model is being validated or tested, the known classification is only used to evaluate the performance of the algorithm.

Logistic Regression correctly classifies 47 out of the 60 documents in the validation partition of our sample, which translates to an overall error of 21.67 percent (Figure 9).

Figure 9.
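The whole classification step can also be sketched as a small pipeline: a TF-IDF representation, SVD concept extraction, then logistic regression, with a held-out validation set for the error estimate. This is a hypothetical scikit-learn equivalent of the point-and-click workflow, not Analytic Solver's internals, and it pulls the same two newsgroups directly rather than from sampled files, so its error rate will differ from the 21.67 percent above:

# Autos vs. Electronics classification on LSA concepts, with a
# 60-document validation partition, echoing the workflow above.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train",
                          categories=["rec.autos", "sci.electronics"],
                          remove=("headers", "footers", "quotes"))

X_train, X_valid, y_train, y_valid = train_test_split(
    data.data, data.target, test_size=60, random_state=12345,
    stratify=data.target)

model = make_pipeline(
    TfidfVectorizer(max_features=200, stop_words="english"),
    TruncatedSVD(n_components=20, random_state=12345),  # 20 concepts
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("validation error:", round(1 - model.score(X_valid, y_valid), 4))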
First, we’ll need to append a Analytic Solver Data Mining
new column with the class that the provides powerful tools for importing
document is currently assigned: a collection of documents for
electronics or autos. Since we sorted comprehensive text preprocessing,
our documents at the beginning of quantitation, and concept extraction,
the example starting with Autos, we to create a model that can be used
can simply enter “Autos” into the to process new documents–all
appropriate column. performed without any manual
We’ll need to partition our data intervention. When using Text Miner
into two datasets, a training dataset in conjunction with classification
where the model will be “trained,” algorithms, Analytic Solver Data
and a validation dataset where the Mining can be used to classify
newly created model can be tested, customer reviews as satisfied/not
or validated. When the model is satisfied, distinguish between which
being trained, the actual class label products garnered the least negative
assignments are shown to the algorithm reviews, extract the topics of articles,
for it to learn which variables (or cluster the documents/terms, etc.
concepts) result in an “auto” or The applications for text mining are
“electronic” assignment. When the boundless. Si

EXECUTIVE INTERVIEW

MIGUEL MARTINEZ —
Microsoft Power BI

Miguel Martinez is a Senior Product Marketing Manager for Microsoft Power BI. He oversees the digital strategy for the cloud business intelligence solution. His background includes formal training and practical experience in Marketing and Engineering.

Solver International: Microsoft's Power BI is both an application, Power BI Desktop, and an online SaaS (Software as a Service). It also connects to a host of other common applications such as Word and Excel. How did Power BI and its internal group develop?
Miguel Martinez: I usually default to what Satya Nadella, our current CEO, keeps stating in interviews, which is the motto of the company: "To empower everyone to achieve their maximum potential." The way people work is changing. When Microsoft realized that companies throughout the world are producing more data, and that we needed tools to tap into the

potential for making decisions based on that data, that is what drove us internally. Like the early years with SQL, as you build on top of that very basic data operation, how do you get insights and then make better decisions to make people more productive? I think that was the core of the beginning of the BI group that you see today.

And I would add, even though we have hundreds of offerings across the board for the consumer and for the enterprise, our groups need to come out with the best tool, where everything works together as a platform, integrates with each other, and actually accomplishes the maximum potential. Not only do you have hardcore data tools, like analysis services and reporting services, all very IT-centric in the beginning; as you move forward and more people need access to that data, you see groups working together to get the power of that data and the data analysis to everybody in an organization. It doesn't matter what skillset they have or how technical they are.

SI: Power BI is not a thing, it's a collection of things, right?

Martinez: I would say yes and no. Let me give you the reason why. Yes, it's a collection. You have several components within that portfolio, but they all serve the same purpose, which is giving people access to data, so they can get insights and do a better job on a day-to-day basis. But if you double click on that—and this is the no part—while it's a collection of things, you have an authoring tool, Power BI Desktop, that an analyst or whoever is tapping into that data can use to connect to the data, mash it up, clean it up, and make it ready for analysis; creative visualization is the channel that everybody in the organization can use to understand that data and make it as interactive and as friendly as possible. And then you have the Power BI service, which is the "cloud component," where you share—and because it is cloud-based you have more processing power and more access for people to tap into that analysis.

The third piece is the consumption piece. Basically, everybody has a phone, so it's not enough that you have a desktop application and a browser application. You need to make that analysis available anywhere people are consuming it. If you're in a warehouse and all you have is a tablet to see the real-time level of inventory, we allow you to do it on that tablet. We don't force you to go to a laptop or a desktop machine to consume that data. Those are the different moving parts that you see in Power BI.

Going back to the first part, it's a collection of tools, and I think what makes the case for it not being just one thing is the back end. The back end includes all the technology, all the patents, all the knowledge that we have on how to analyze data and scale it to a level where it doesn't matter if you're a 100,000-person enterprise or a 10-person startup. The back end serves the purpose of providing a lot of power, a lot of analysis, that comes right out of the box with familiar tools. Because they have been around for so long, we try to bring all the things that people know about those tools into the consumption layer of Power BI.

SI: Back in the early days of micro-computing, back in the 1980s, the big problem was always integration--integrating hardware and software, making it easy to get information. Is that one of the things that BI is now doing, integrating all the collections of data that are scattered around the world?

Martinez: Definitely, yes. That is one of the pillars, not only of the Power BI offering, but of every Microsoft offering today. If you look across our whole suite of products, including Office, what you're going to see is not only very tight integration between those different applications, but also very tight integration, or at least a communication funnel, between applications that do something else.

I think there are several examples of that within the Power BI world. Let me talk about the top three. Number one is connection to data. I would be a fool to assume that all the data for a company, or even for a particular analyst, lives in the same place and in the same format. So, a modern business intelligence tool, one that is not only relevant today but will be relevant in the future, is one that can connect to any source of data, any format of data. It could be structured or unstructured. It could live in an Excel or a CSV file. It could be in Salesforce or in Google Analytics. So that is key to a successful BI tool, and that's what Power BI is trying to accomplish.

The second piece is focused not only on the data but also on the execution. If you get a great insight out of something, or somebody gets an idea or an "A-ha moment!" out of an analysis, there is a big problem today. To put it very, very bluntly, if I'm on my laptop and I see that insight but then have to go to a different system or to a different machine and execute something based on that, there is a gap. What you've seen from Microsoft, not only in our enterprise software or productivity software but even on the consumer side of things, is that we will bring your day-to-day job into the application itself. It doesn't matter if it's operational, if it's financial, if it's physical, or if it's digital, we try to bring that into the application so it's easy for you and you don't lose time and you are more productive executing on the things that you learned from that BI-specific tool.

Some examples: under the umbrella of Power BI—the engineering organization that we call the Business Application Platform—you have things like Microsoft Flow, which is basically "if this, then do that" for enterprises. Not only does it let you know if sales went over a certain level, you can automate the way to react if sales go over that level, so it can email certain people because they need to execute something in response. If you have other applications, like CRM (customer relationship management) software, you can also automate the actual task to get it done immediately instead of having to check all the checkboxes to get that done.

Integration? Yes, in the '80s it was being studied and it was very important. Now it's a requirement. Nobody will use any piece of enterprise software if it doesn't integrate with the other things that they're using.

SI: We hear a lot of discussion of Big Data. Are BI and Big Data compatible?

Martinez: I would say they're definitely compatible, and I can speak from experience on that one. In one of my previous positions, in a previous life—I've always been in data-related roles—I was at an airline looking at all the operational data that we captured. You can use that data to figure out fuel consumption or adjust shifts for crews that needed to fly planes. The amounts of data back then—I'm talking 10 or 12 years ago—were already overwhelming, and I would think they would qualify as Big Data. There is no way to take advantage of that amount of data, that velocity, that volume, that real-time component, if you don't have a layer that allows the people who are making business and operational decisions to make sense out of it.

To bring it back to things that you see, not only in Power BI but in a lot of BI tools: how do you translate an insane amount of data into a visualization, into something that makes sense? I see two trends there. One is the data visualization piece, the sexy part that you see in all the BI tools, and the other is the intelligence behind it. It doesn't matter how many people you put into data analysis, you're still going to need the AI (artificial intelligence) component to filter that data and show you just the things that are relevant, to save you the time of moving from that amount of data to actual analysis.

So, BI and Big Data are 100 percent compatible. You need to deal with any type and any amount of data that you have available. But if you don't allow the user to execute on that amount of data, it's worthless. So those two things are very related, and that's what you need: to make both of them play well together and get the most value out of them.

SI: We usually talk about data and intelligence and information all being similar, but different. You mentioned velocity. I'd like to go back to that. Is current micro-level BI capable of dealing with high-velocity data, that is, data that changes frequently? Can it keep currency of the data so that people are working with what is current rather than what is old data?

Martinez: Definitely. This is where Microsoft is heading: how to quickly get the data into the hands of the people who need it. Are people accessing the one version of the truth at the right time? All those components are key, not only to Power BI but, again, to all our cloud and enterprise/data analytics solutions. For example, Power BI is a pioneer on the data visualization piece, to visualize data in real time. At our Data Insights Summit, which is where we bring data analysts together to talk about Excel and Power BI, the real-time component, like the visualization piece, is front and center. It's a big differentiator that we have. It doesn't matter that you can visualize every sensor that you have on a factory floor if there's no logic and intelligence behind that.

Because of the way data is coming into the hands of the people who use it, real-time is key. It doesn't matter if your data are on the same platform or if you're using different systems, data needs to come as close to real-time as possible.

SI: You talk about Big Data, about big companies that have a lot of sensors, as you point out, or are collecting data from many different sources—the Internet of Things. What about smaller companies, perhaps a two or three person information-based company? Do they get any value from business intelligence?

Martinez: Definitely. It doesn't matter what size the company is; it could be a company of one. First is the technical component, a way of speeding up access to a solution, so you can be up and running quickly. Basically, signing up for an account or downloading Power BI Desktop lowers the barriers of entry, the barriers to learning. It integrates familiar tools, the things that you're already using. You're going to see a lot of the Excel UI (user interface) and logic within Power BI without being in Excel, because there's no need to have two Excels within the same company. Those two are complementary and they work better together. So, number one is how accessible and how easy to use it is.

Number two, I would say, is the cost component, which is part of how hard it is to get the tools you need. Power BI is a leader in the market in that way. The scalability of the product on a per-user pricing basis is very, very accessible, and I would say it's one of the most competitive that's out there today.

And then number three is the data itself. I have a very good friend who owns a shop. They need to be available on every single online search that happens in the ZIP code where they offer their services. The providers of those search engines, like Bing or Google, make that data available for you, for free. It's part of their service. If you have a tool that allows you to make better sense out of that data, which could be a pie chart and maybe a bar chart, to see which key word is most popular, that can be of great benefit.

If you can easily tap into that data, in a familiar environment like Excel, and you can tap into it in real time, that will give a company as small as an individual the same power as an organization of 100,000 people tapping into sensors in every factory: the benefit of making the right decisions based on data. Microsoft has always been a great enterprise destination, but that doesn't mean that small companies, even students or individuals running a business, cannot take advantage of it.

SI: Is business intelligence a long-term investment for Microsoft?

Martinez: Absolutely, yes. It has been for the last three years. If you look at the pace of innovation in Power BI and in Microsoft Research, which is the leading edge of the innovation that we have, you're going to see that business intelligence is just one very specific piece of data analytics and of empowering productivity for people to achieve more. It's in the DNA of every single product we put out.

SI: Let's go back to the beginning. Tell me how you see business intelligence and, obviously, Microsoft's involvement in it? What does it mean to an MBA student or those in other programs looking forward to working in analytics?

Martinez: Great question. I think the number one advantage is

the familiarity. They don’t need to in a better way; and, three, coming
learn new things to take advantage up with actual better insights and
of what was, eight years ago when I solutions because you are empowering
took my MBA, an IT-focused or IT- people, as an audience for your business
bubble type of workload. Now you knowledge, to make better decisions,
can tap into any data that you want. more informed decisions.
You are easily able to match it up and
relate it to others without having a Si: Have you had any experience
lot of technical knowledge, and get with applications from other
visual representation that is easy to companies that work with Power
understand and will take you to a BI? Has that presented a benefit
finding or to an insight in a shorter or a problem for you?
time. This is what I see with MBAs, Martinez: We have a lot of
they’re usually going to have to interact experience with that situation. You
with a lot of executives, with a lot of cannot expect any customer, any user
people where the patience they have for of a particular piece of software, to
an analysis or for an insight is going to be exclusively on your platform. It
be very short. happens, but it’s very, very rare. That
If you look at the pain points of means when you build a tool like
any data analyst or business analyst, Power BI, or any other tool within
when they are presenting an analysis Microsoft, you need to consider
or a model to someone, it’s usually, several things: not everybody’s using
“How do you present it so it doesn’t only your tools, so it must be open
take a long time for that audience to source and, especially if it’s a data
understand? How do you tell a story analytics tool, you need to bring all
of the things that you’re finding in the data in regardless of where it is
that data?” generated. We accomplish a lot of
All the tools in modern BI, which that through working with other
we call the third wave of BI, are making companies.
it easier for the person creating those On the data connectivity side,
analyses and those reports, and also for Power BI connects natively to names
the consumer of those reports. You are like Salesforce, Google Analytics,
shortening the gap between whoever NetSuite, QuickBooks—it’s a list that
is analyzing the data—the business goes to the hundreds. When doing
analyst, the MBA student, the data analysis for our own team, I connect
scientist—and the person receiving it. to those company’s programs and that
And new ideas, new insights can come experience is very, very friendly.
out because it’s not only the analyst On the consumption side, even
doing the analysis, it’s everybody though Microsoft is a company that
together consuming that data and has a lot of resources, you are going
coming out with an analysis. to have a lot of custom visualizations
So, to summarize, number one, that people want to build for specific
not a lot of investment in learning new problems they’re trying to solve. For
tools; two, more power to present things example, using Narrative Sciences

to create a model and get a natural language representation, actual words that tell you what the data is telling you. With things like that, even though we can do them, we tap into partners to offer even better solutions.

As I mentioned before, I used to work for an airline. There's a custom visualization that one of our community members built—it's available on our website—that is an interactive, part-by-part visualization of an airplane engine. It only applies to companies that are in that industry, so you can count them on your fingers. The fact that you can easily accomplish something that specific, because of the open source component of a tool, is something you're going to see, not only in Power BI, but in most of the Microsoft offerings.

And then, the last piece is how you get the data out. While we may want you to go into Power BI and look at that analysis, for some customers, for some companies, that is not the channel where they want it delivered. You need to play well by embedding Power BI analysis and Microsoft analysis in other tools, and that is something that we do, too. We have offerings like Power BI Embedded, where instead of going to our SaaS experience, you can take visualizations from a Power BI report or model and consume them in another application.

A good example of that is SharePoint. It is one of the most widely used CMSs, especially in the business environment. What if a company doesn't want to use Power BI and send everybody to Power BI? They just want to embed that analysis in a SharePoint site. We allow people to do that. It's a benefit we have, internally and externally, so it's very easy to use Power BI with other tools.

SI: If you were going to sell the idea of business intelligence to an MBA student, one who is trying to decide what area of business to go into, what would you say?

Martinez: This is going to sound like a blunt approach, but if I'm really trying to convince them, I would say, "Take a look at any job posting––it doesn't matter the level, it doesn't matter the industry or the company––and see how they talk about their data analysis needs."

We have a case study internally where you look at the history of the world, of humanity, and we're seeing something very similar with data when you compare it to books. When books were very new, few people could read, because books were too expensive. It was very difficult and expensive to produce a book; you had to copy it by hand. Then came the printing press, and books exploded. After a few hundred years, you couldn't get a job if you didn't read.

We're seeing the same with data. Even if you're not in a hardcore data analysis industry or department, the fact that business intelligence is the way data analytics is reflected in business decisions—it's a key piece of knowledge—means it must be part of your skillset; there's no way you can escape it. So, either through fear that you have to learn it or because you enjoy data analysis, this is the right time in history to be exposed to analytics, because you're going to be very successful. Si

Society News
INFORMS

2018 INFORMS Business Analytics Conference
By Melissa Moore
The place to gain real-world insights to implement a successful analytics strategy

Every year, analytics professionals from around the world attend the INFORMS Business Analytics Conference to interact with and learn from leading analytics professionals and industry experts and to gain real-world insights on a successful analytics strategy. From April 15-17, 2018, nearly 1,000 attendees, at all stages of their analytics careers, will travel to Baltimore—the hometown of INFORMS and a vital hub of discovery and growth in Maryland—for the 2018 INFORMS Business Analytics Conference. The conference provides attendees many opportunities to share ideas, to network, and to learn about a range of current topics and trends that can help businesses and organizations improve their analytics prowess by applying science to the art of business.

ANALYTICS BEST PRACTICES

The 2018 conference will feature more than 150 presentations and perspectives on topics that include Real Time Decision Systems, Managing Risk, Supply Chain/Logistics, and Revenue Management. In addition, plenary speakers from leading organizations nationwide, including Bill Schmarzo, the Chief Technology Officer of Big Data Practice with Dell, will share how analytics has transformed their organizations and is continuing to help propel them forward.

Baltimore will host the 2018 INFORMS Conference on Business Analytics & O.R., along with the Marriott Waterfront Hotel. Photo Courtesy of 123rf.com | © Jon Bilous

The valuable insights shared in these presentations

will enable attendees, regardless of their industry, to leverage analytics best practices to optimize their business processes and help their organizations' key decision makers better understand the value of analytics applications and how to apply them to maximize their own data.

MAKE LASTING CONNECTIONS

In addition to a full meeting schedule, attendees will have many unique opportunities for networking and idea sharing, as well as to learn about the latest quantitative methodologies and technologies to sharpen their skill sets and help advance their analytics careers. Dozens of vendors and industry representatives will also be in attendance to share how their organizations are utilizing analytics, and a career fair offers a unique opportunity for companies to meet, network, and interview talented analytics professionals.

The Analytics Conference also celebrates unique achievements in analytics at the highest level with the Franz Edelman Awards Gala, in which a number of prizes are presented in recognition of individuals, universities, and organizations that are leading the way in furthering the field of analytics. The Gala culminates in the presentation of the Franz Edelman Award, which recognizes and rewards outstanding contributions of operations research (O.R.) and analytics in the for-profit and non-profit sectors worldwide. Each year, INFORMS honors finalist teams that have improved organizational efficiency, increased profits, brought better products to consumers, helped foster peace negotiations, and saved lives. To date, the cumulative benefits of the Edelman finalist projects have topped the $250 billion mark and have impacted nearly every industry.

Previous winners of the Franz Edelman Award include Holiday Retirement, UPS, IBM, Syngenta, The Centers for Disease Control and Prevention, Memorial Sloan Kettering Cancer Center, Hewlett-Packard, and General Motors, among others.

INFORMS provides networking and learning opportunities for individual professionals and organizations of all types and sizes, to better understand and use O.R. and analytics tools and methods to transform strategic visions and achieve better outcomes. Improve your analytics capabilities and hear from leaders in the field! Attend the 2018 INFORMS Business Analytics Conference in Baltimore.

For more information about the meeting and speakers, click here. Si

Melissa Moore is the Executive Director of INFORMS, the largest international association for operations research and analytics professionals and students.

Terms of the Trade

Glossary
A

Ambari
A web interface for managing Hadoop services and components.

Apache Kafka
A distributed streaming platform for building real-time data pipelines and streaming apps.

Apache Spark
An open-source cluster computing framework with highly performant in-memory analytics and a growing number of related projects.

C

Cassandra
A distributed database system.

Cubes
A cube is a set of related measures and dimensions that is used to analyze data.
• A measure is a transactional value or measurement that a user may want to aggregate. Measures are sourced from columns in one or more source tables, and are grouped into measure groups.
• A dimension is a group of attributes that represent an area of interest related to the measures in the cube, and which are used to analyze the measures in the cube. The attributes within each dimension can be organized into hierarchies to provide paths for analysis.

E

Edge computing
Edge computing is a method of optimizing cloud systems by performing data processing at the edge of the network, near the source of the data. Edge computing covers a range of technologies that includes mobile data acquisition and signature analysis, wireless sensor networks, and cooperative distributed peer-to-peer ad hoc networking and processing.

F

Flume
Software for streaming data into HDFS.

G

Google BigQuery
BigQuery is Google's fully managed, petabyte-scale enterprise data warehouse for analytics. BigQuery is serverless; there is no infrastructure to manage and no database administrator is required.

H

Hadoop
The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop Distributed File System (HDFS)
The scalable system that stores data across multiple machines without prior organization.

HBase
A non-relational, distributed database that runs on top of Hadoop.

HCatalog
A table and storage management layer.

Hive
A data warehousing and SQL-like query language.

M

MapReduce
A parallel processing software framework that takes inputs, partitions them into smaller problems, and distributes them to worker nodes.

O

ODBC
ODBC stands for Open Data Base Connectivity, a connection method to data sources.

Oozie
A Hadoop job scheduler.

P

Pig
A platform for manipulating data stored in HDFS.

Python
Python is a high-level programming language for general-purpose programming. Python emphasizes code readability and a syntax which allows programmers to express concepts in fewer lines of code than might be used in languages such as C++ or Java.

R

R
R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

S

Solr
A scalable search tool.

Sqoop
Moves data between Hadoop and relational databases.

W

Welch's Test
Welch's Test for Unequal Variances (also called Welch's t-test, Welch's adjusted T, or the unequal variances t-test) is used to see if two sample means are significantly different. The null hypothesis for the test is that the means are equal; the alternate hypothesis is that the means are not equal.
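For reference, the test statistic and its approximate degrees of freedom (the Welch–Satterthwaite equation) can be written, using the standard notation of sample means \bar{x}_i, sample variances s_i^2, and sample sizes n_i, as

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}},
\qquad
\nu \approx \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^{2}}{\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}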
Y

YARN
YARN (Yet Another Resource Negotiator) provides resource management for the processes running on Hadoop.

Z

Zookeeper
An application that coordinates distributed processing.