WWW.SOLVER-INTERNATIONAL.COM

Solver International

TUTORIAL: Using Simulation and Optimization Together
DATA ANALYSIS: Location Data – The Missing Link
EXECUTIVE INTERVIEW: Miguel Martinez – Microsoft Power BI
TUTORIAL: Text Mining
Everything in Predictive and Prescriptive Analytics, Everywhere You Want, from Concept to Deployment.

The Analytic Solver® suite makes powerful forecasting, data mining and text mining software available in your web browser (cloud-based software as a service), and in Microsoft Excel. And you can easily create models in our RASON® language for server, web and mobile apps.

Full-Power Data Mining and Predictive Analytics. It's all point-and-click: Text mining, latent semantic analysis, feature selection, principal components and clustering; exponential smoothing and ARIMA for forecasting; multiple regression, logistic regression, k-nearest neighbors, discriminant analysis, naïve Bayes, and ensembles of trees and neural networks for prediction; and association rules for affinity analysis.

Simulation/Risk Analysis, Powerful Optimization. Analytic Solver is also a full-power, point-and-click tool for Monte Carlo simulation and risk analysis, with 50 distributions, 50 statistics and risk measures, rank-order and copula correlation, distribution fitting, and charts and graphs. And it has full-power, point-and-click optimization, with large-scale linear and mixed-integer programming, nonlinear and simulation optimization, stochastic programming and robust optimization.

Find Out More, Start Your Free Trial Now. In your browser, in Excel, or in Visual Studio, Analytic Solver comes with everything you need: Wizards, Help, User Guides, 90 examples, even online training courses. Visit www.solver.com to learn more or ask questions, and visit analyticsolver.com to register and start a free trial – in the cloud, on your desktop, or both!

Tel 775 831 0300 • Fax 775 831 0314 • info@solver.com
February 2018 • Volume 2, Number 1 • www.solver-international.com
CONTENTS

14 Tutorial: Using Simulation and Optimization Together – Combining ideas from simulation and optimization to build and solve a simulation optimization model.

24 Data Analysis: Location Data – The Missing Link – Location data, and the intelligence that geospatial analysis can provide business decision makers, is the missing link that ties together various data sets from Big Data sources.

30 Executive Interview: Miguel Martinez – Microsoft Power BI – Miguel Martinez is a Senior Product Marketing Manager for Microsoft Power BI.

42 Society News: 2018 INFORMS Business Analytics Conference – The place to gain real-world insights to implement a successful analytics strategy.
Educating and Empowering the Business Analyst February 2018 | Solver International | 3
Smart Move: Learn Analytics

Solver International has an editorial mission statement, "Educating and Empowering the Business Analyst," that you'll see throughout this issue. We're speaking to everyone employed in roles that do – or should – include "analyst" in the job title, as well as managers of people in those roles. But in truth our audience is broader than that: As Miguel Martinez from Microsoft says in our interview, "this is the right time in history to be exposed to analytics, because you're going to be very successful." We couldn't agree more. Si

A JOINT VENTURE BETWEEN FRONTLINE SYSTEMS, INC. AND LIONHEART PUBLISHING, INC.

EDITORIAL OFFICE
Send all advertising submissions for Solver International to:
Lionheart Publishing Inc.
1635 Old 41 Hwy., Suite 112-361, Kennesaw, GA 30152 USA
Tel.: 888.303.5639 • Fax: 770.432.6969
Email: lpi@lionhrtpub.com
URL: www.lionheartpub.com

PRESIDENT
John Llewellyn, ext. 209
llewellyn@lionhrtpub.com
Direct: 404.918.3275

[…] than three weeks prior to the first day of the month of publication. Address correspondence to: Editor, Solver International, Lionheart Publishing, Inc., 1635 Old 41 Hwy., Suite 112-361, Kennesaw, GA 30152. The opinions expressed in Solver International are those of the authors, and do not necessarily reflect the opinions of Lionheart Publishing Inc., Frontline Systems, Inc. or the editorial staff of Solver International. All rights reserved.
Prescriptive Analytics at the Edge

The explosion of the internet-of-things (IoT), along with sensor, social, and other […] artificial intelligence-based analytics over data collected from devices located in the edge […] server's smart objects […] degree of utilization to serve customers reliably.

• […] performance issues, intelligently learning usage or consumption patterns, and effectively targeting offers and services to keep customers happy.

• Enhancing competitive advantage: Better handling of market challenges in an ever-changing global business. Adapting to windows of opportunity, and scaling and automating processes while reducing costs. Recognizing variations and forecasting responses to changing conditions.

For example, you could use edge intelligence to proactively provide predicted customer churn information to customer success and sales teams to take actions. In service or contract-driven organizations, customer acquisition is expensive. Every lost customer directly impacts your bottom line. Most unhappy customers will never tell you about performance issues. With simple, smart churn detection analytics, you can improve customer experience and retention.

Low-latency decision making is also ideal for remote asset monitoring. Edge intelligence enables monitoring operations to rapidly decide when to take needed actions, to repair or replace an asset. Based on the deployment of edge intelligence in the field, directly on the machine or parts, it is possible to identify asset degradation patterns. This can lead to proactive versus reactive replacement of the part to avoid downtime or disruption in production
operations. Edge intelligence also isolates the asset location or causes of problems, making it much easier for staff in the field to find and resolve issues.

Another benefit of edge intelligence is the collection of related conditional data. Operators can learn more about situational factors or alterations of an environment that may have caused an issue. That contextual knowledge can be used to better prepare for or prescriptively prevent issues in the future.

At the organizational level, richer datasets can be collected and persisted, including historic datasets from multiple sensors, parts, machines, and edge devices. Hence, learned intelligence can be reused across the enterprise, enabling proactive support services such as "Asset Condition Monitoring as a Service" and "Maintenance as a Service." It also can be continuously updated through measurement of model drift.

SMART THINGS
The emergence of "smart," connected things coupled with existing subject matter expert knowledge creates an opportunity for a new era of analytics. Modern analytical tools can not only predict what is likely to occur but also offer "what-if" analysis of alternatives to better guide decision-makers. Edge analytics solutions today augment humans – they do not replace them. It is crucial for humans to guide and monitor these systems.

Prescriptive analytics on the edge is an emerging frontier that has progressed from post-event analysis of historical data, real-time event analysis, and predictive analytics, to the forward-looking automation of actions to optimize operations. Industry-leading organizations, such as GE with Predix, are already powering prescriptive maintenance programs. Pilot programs to collect data from smart things have already been completed. Now we are seeing artificial intelligence being embedded along with prescriptive analytics.

By making connected things smarter and moving intelligence to the edge of the network or within the smart things, automated edge prescriptive analytics can drive big wins for enterprises. As companies consider how best to apply this new wave of edge analytics technologies into their business, they are embracing design thinking methodology to reimagine digital processes. Si

Jen Underwood is Founder and Principal of Impact Analytix, LLC. Impact Analytix is a boutique integrated product research, consulting, technical marketing and creative digital media agency led by experienced hands-on practitioners. Jen can be tweeted at @idigdata.
Everything in Predictive and Prescriptive Analytics, Everywhere You Want, from Concept to Deployment.

The Analytic Solver® suite makes the world's best optimization software available in your web browser (cloud-based software as a service), and in Microsoft Excel. And you can easily create models in our RASON® language for your server, web and mobile apps.

Linear Programming to Stochastic Optimization. It's all point-and-click: Fast, large-scale linear, quadratic and mixed-integer programming, conic, nonlinear, non-smooth and global optimization. Easily incorporate uncertainty and solve with simulation optimization, stochastic programming, and robust optimization.

And it's a full-power, point-and-click tool for forecasting, data mining and text mining, from time series methods to classification and regression trees, neural networks and association rules – complete with access to SQL databases, Apache Spark Big Data clusters, and text document folders.

Find Out More, Start Your Free Trial Now. In your browser, in Excel, or in Visual Studio, Analytic Solver comes with everything you need: Wizards, Help, User Guides, 90 examples, even online training courses. Visit www.solver.com to learn more or ask questions, and visit analyticsolver.com to register and start a free trial – in the cloud, on your desktop, or both!
NEWS

Text Mining Uncovers Millions of Fake Comments Sent to FCC

According to a November 29, 2017 report by Reuters News Agency, "More than half of the 21.7 million public comments submitted to the U.S. Federal Communications Commission about net neutrality this year used temporary or duplicate email addresses and appeared to include false or misleading information, the Pew Research Center said."

From April 27 to Aug. 30 the public was able to submit comments to the FCC on the topic, online or by e-mail. The Reuters article noted, "Of those, 57 percent used either duplicate e-mail addresses or temporary e-mail addresses, while many individual names appeared thousands of times in the submissions, Pew said. For example, 'Pat M' was listed on 5,910 submissions, and the e-mail address john_oliver@yahoo.com was used in 1,002 comments. TV host John Oliver supported keeping net neutrality on his HBO talk show."

Pew did not say how many of the comments supported or opposed the FCC's proposal. "They found that only six percent of submitted comments were unique while the rest had been submitted multiple times, in some cases, hundreds of thousands of times," the authors stated. "Thousands of identical comments were submitted in the same second on at least five occasions. On July 19 at precisely 2:57:15 p.m. ET, 475,482 comments were submitted, Pew said, adding that almost all were in favor of net neutrality."

In the same vein, data scientist Jeff Kao used a similar dataset and got a similar result. Writing on the blog Hackernoon, Kao reports, "I used natural language processing techniques to analyze net neutrality comments submitted to the FCC from April-October 2017, and the results were disturbing. My research found at least 1.3 million fake pro-repeal comments,
INDUSTRY NEWS

[…] it can be used easily with a wide range of R packages from CRAN. XLMiner SDK provides its own "R package" that can be loaded with one command from R itself, or an IDE such as R Studio.

For C++, C# and Java developers, XLMiner SDK should be especially welcome, since quality data mining tools have been hard to find for these popular languages. But even R and Python developers will find that XLMiner SDK offers a far better integrated, comprehensive data mining and text mining toolkit.

SUPPORT FOR POPULAR DATABASES AND FILES, TEXT, AND BIG DATA
XLMiner SDK has built-in tools to read data from SQL databases using ODBC (Open Data Base Connectivity), with special support for Oracle, Microsoft SQL Server and Access databases; OData ("the web successor to ODBC") data sources exposing a REST API; JSON (JavaScript Object Notation); and CSV (Comma Separated Value) and Excel files.

The SDK also handles unstructured text data, and provides stemming, term normalization, vocabulary reduction, creation of a term-document matrix, and concept extraction with latent semantic indexing. It even has built-in facilities to draw a statistically representative sample from an Apache Spark Big Data cluster,
TUTORIAL

Using Simulation and Optimization Together
We still won’t be covering the full to invest, or when to schedule call
scope of stochastic optimization, which center employees – uncertain variables
aims to find the best decisions for an represent quantities that we cannot
uncertain future. To do that, we would control or decide ourselves – such
go beyond simulation optimization as stock prices, or the frequency of
to explicitly model decisions we incoming calls to the call center. To
must make here and now, and other simulate the behavior of uncertain
decisions where we can wait and see. variables, we’ll draw random samples
Stay tuned for a tutorial where we take from a probability distribution for each
you further into the world of modeling of those variables. In advanced models,
and decision making using prescriptive we would use rank-order correlation or
analytics! But let’s get started with the copulas to deal with the fact that some
basics of simulation optimization in the uncertain quantities are related.
present.
REVISITING A CALL CENTER
FROM OPTIMIZATION: SCHEDULING EXAMPLE
DECISION VARIABLES, In our fall tutorial on optimization,
OBJECTIVE AND we used a very simple example of a call
CONSTRAINTS center employee scheduling model,
Our model will have decision shown in Figure 1. Our problem is
variables: amounts of resources to schedule enough employees to
allocated to some use or purpose – for work each day of the week (decision
example, the number of call center variables) to handle our predicted
employees working on each shift, or call volume (a constraint), while
the number of packages of a given size minimizing total payroll cost (our
loaded onto each truck. We’ll have an objective).
objective such as costs we’d like to Since employees want to work 5
minimize, or profits to be maximized, consecutive days and have 2 consecutive
and constraints, or limits on the ways days off, there are 7 possible weekly
resources may be allocated, that reflects schedules, each one starting on a
the real-world situation. As we’ll see different day of the week. These are
shortly, dealing with uncertainty means labeled A, B, C through G in the Excel
that we must reconsider what it means model. For example, employees on
to maximize or minimize our objective, Schedule A have Sunday and Monday
or to satisfy our constraints. off, then they work Tuesday through
Saturday. Our decision variables – the
FROM SIMULATION: number of employees working on each
UNCERTAIN VARIABLES, schedule – are in cells D15:D21; they
DISTRIBUTIONS AND are summed in cell D22.
CORRELATION In this simple model, all employees
Our models will also have uncertain are paid at the same rate, $96 per day at
variables. Where decision variables cell N15. So our objective, payroll costs
represent quantities that we can to be minimized, is just =D22*N15*5 (5
control or decide – such as how much working days per week).
We cannot have a negative number of employees on any schedule, so we include a constraint D15:D21 >= 0 (or with defined names, "Employees_per_schedule >= 0"). Solver allows any whole number or fractional value for a decision variable, but we can't actually assign one-half or two-thirds of an employee to a schedule – so we add a constraint "Employees_per_schedule = integer".

SOLVING VIA CONVENTIONAL OPTIMIZATION
We can now solve the model by clicking the Optimize button on the Ribbon, or the "green arrow" on the Task Pane. The optimal solution ($12,000 payroll cost – a big improvement from $18,000 initially) is shown in Figure 2. Solver assigns 2, 5, 7, 4, 6, 1 and 0 employees, respectively, to Schedules A, B, C, D, E, F and G. Since our call volumes are heaviest on weekends, no employees will take both Saturday and Sunday off.

Take a close look at the constraints: Even with the requirement for 5 consecutive working days and 2 consecutive days off, Solver found a way to allocate exactly the minimum required
LogNormal distributions, where the mean of the distribution is the same as the fixed number we had before.

To make the model a little more interesting, we'll add a commission plan for our call center employees. Such a plan would be based on our (uncertain and variable) call volume, but since we're reflecting that uncertainty in our distributions for the number of employees needed, we'll just calculate a total commission for the week, based on this number, and add it to our total payroll cost in cell D27. This makes our total payroll cost – which we want to minimize by a smart assignment of employees to schedules – itself an uncertain, variable quantity. Think about that for a minute.

IMPACT OF UNCERTAINTY: RUNNING A SIMULATION
To see the effect of uncertainty on our model, we can run a Monte Carlo simulation. If you read our Monte Carlo simulation tutorial last summer, you know how this works: Analytic Solver will cause Excel to calculate the model 1,000 or 10,000 times – each such calculation is called a Monte Carlo "trial". On each trial, Analytic Solver will randomly choose a "sample" value for each of the 7 uncertain variables in row 25, respecting the relative frequencies of the probability distributions – then collect and summarize the results. To do this, we'd better tell Analytic Solver what we want to summarize: The easiest way to do this is to select cell C27 (total payroll cost) and choose Results – Statistics – Mean, "dropping" the result in cell C28.

Starting with our Figure 1 setup, assigning 5 employees to each of the 7 possible schedules, we choose Simulate – Run Once. Analytic Solver displays the simulated results for total payroll cost. Because of our commission plan, this is now a distribution of outcomes, peaking around $18,000, but sometimes less and sometimes more, as shown in Figure 4.

WHAT DOES IT MEAN TO OPTIMIZE AN UNCERTAIN MODEL?
If you weren't thinking about this earlier, ask yourself now: What does it mean to "minimize" this wide range of possible values for total payroll cost? We need to tell the software exactly what to do. There is more than one possible answer, but most managers would probably be happy if we can minimize the average (mean) payroll cost, or in the language of probability, the expected value of total payroll
Figure 4: Results from Call Center model Monte Carlo simulation.
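The trial loop behind a simulation run like this can be sketched with the standard library alone. Every number here (a mean weekly call volume of 3,000, a $0.40-per-call commission, a sigma of 0.3) is a made-up stand-in, since the article does not publish its commission formula; the point is the mechanism: sample the uncertain inputs, recalculate cost, then summarize.

```python
import math
import random
import statistics

random.seed(42)  # reproducible demo
DAILY_WAGE, WORK_DAYS = 96, 5

def weekly_payroll_trial(n_employees, mean_calls=3000.0, sigma=0.3,
                         commission_per_call=0.40):
    """One Monte Carlo trial: base payroll plus a commission driven by an
    uncertain weekly call volume (hypothetical lognormal input)."""
    base = n_employees * DAILY_WAGE * WORK_DAYS
    # Choose mu so the lognormal's mean equals mean_calls, mirroring how the
    # article sets each distribution's mean to the old fixed value.
    mu = math.log(mean_calls) - sigma ** 2 / 2
    calls = random.lognormvariate(mu, sigma)
    return base + commission_per_call * calls

# 35 employees (5 on each of the 7 schedules), 10,000 trials.
trials = [weekly_payroll_trial(35) for _ in range(10_000)]
print(round(statistics.mean(trials)))  # clusters near $18,000, echoing Figure 4
```

With these invented parameters, base pay is $16,800 and the expected commission is about $1,200, so the distribution of outcomes peaks near $18,000 while individual trials land above or below it, which is the shape Figure 4 illustrates.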
some of the time. Most managers would say something like "let's be sure we meet the demand at least 90% of the time" – and Analytic Solver will let you specify this, with =PsiPercentile(F25,90%) in cell F26, and F24 >= F26.

A SIMULATION OPTIMIZATION MODEL
Now that we know what it means to optimize the Call Center model with uncertainty, we can complete the model formulation as shown in Figure 5. Instead of minimizing total payroll cost at C27, we're minimizing the expected value of total payroll cost across all of our Monte Carlo trials, at C28. Instead of requiring that "Employees_per_day >= Required_per_day", where "Required_per_day" is now a set of LogNormal distributions, we require that "Employees_per_day >= Required_90th_Percentile". We still have a lower bound of 0 for each decision variable, but we've added an upper bound of 25: This is because Solver Engines suitable for this simulation optimization problem generally require such bounds on the variables, to limit the search space and time.
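The percentile-based chance constraint can be illustrated with a small sampling sketch. The lognormal parameters and the candidate staffing level of 18 are invented for illustration, and `percentile` is a simple sorted-sample stand-in for what PsiPercentile computes, not its exact algorithm.

```python
import math
import random

random.seed(7)

def percentile(samples, pct):
    """Sample percentile via sorting -- a stand-in for =PsiPercentile(F25,90%)."""
    s = sorted(samples)
    idx = max(0, min(len(s) - 1, math.ceil(pct / 100 * len(s)) - 1))
    return s[idx]

# Hypothetical lognormal requirement for one day, with mean 14 employees.
sigma = 0.2
mu = math.log(14) - sigma ** 2 / 2
required = [random.lognormvariate(mu, sigma) for _ in range(10_000)]

required_90 = percentile(required, 90)   # the value cell F26 would hold
staffed = 18                             # a candidate staffing level (F24)
service_level = sum(r <= staffed for r in required) / len(required)
print(required_90, service_level)
```

Staffing at or above the sampled 90th percentile guarantees, by construction, that demand is met on at least about 90% of trials; the printed service level shows the actual fraction for this staffing choice.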
DATA ANALYSIS

Location Data – The Missing Link

Location data, and the intelligence that geospatial analysis can provide business decision makers, is the missing link that ties together various data sets from Big Data sources.

Photo courtesy of 123rf.com | © Thananit Suntiviriyanon: Busy city.
TUTORIAL

Text Mining

Some data analysis tasks cannot be completed in a reasonable time by a human, and that makes text mining a useful tool in modern data science and forensics.
it is particularly effective for large corpuses of relatively small documents. Obviously, this functionality has a limitless number of applications, for instance, e-mail spam detection, topic extraction in articles, automatic rerouting of correspondence, sentiment analysis of product reviews, and even analyzing net neutrality postings!

The input for text mining is a dataset on a worksheet, with at least one column that contains free-form text – or file paths to documents in a file system containing free-form text – and, optionally, other columns that contain traditional structured data.

The output of text mining is a set of reports that contain general explorative information about the collection of documents and structured representations of text. Free-form text columns are expanded to a set of new columns with numeric representation. The new columns will each correspond to either a single term (word) found in the "corpus" of documents, or, if requested, a concept extracted from the corpus through Latent Semantic Indexing (LSI, also called LSA or Latent Semantic Analysis). Each concept represents an automatically derived complex combination of terms/words that have been identified to be related to a topic in the corpus of documents. The structural representation of text can serve as an input to any traditional data mining techniques available in Text Miner: unsupervised/supervised, affinity, visualization techniques, etc.

In addition, Text Miner also presents a visual representation of text mining results to allow the user to interactively explore information that would otherwise be extremely hard to analyze manually. Typical visualizations that aid in understanding of Text Mining outputs and that are produced by Text Miner are:

• Zipf plot – for visual/interactive exploration of frequency-based information extracted by Text Miner

• Scree Plot, Term-Concept and Document-Concept 2D scatter plots – for visual/interactive exploration of Concept Extraction results

If you are interested in visualizing specific parts of text mining analysis outputs, Text Miner provides rich capabilities for charting – functionality that can be used to explore text mining results and supplement the standard charts mentioned.

Here is how to use Text Miner to process and analyze approximately 1000 text files and use the results for automatic topic categorization. This will be achieved by using the structured representation of text presented to Logistic Regression for building the model for classification. Obviously, it will help your understanding of the process if you have the software installed to follow along. The software is available for a trial period at https://www.solver.com/xlminer-platform.

GETTING THE DATA
This example is based on the text dataset at http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html, which consists of 20,000 messages, collected from 20 different Internet newsgroups. We selected about 1,200 of these messages that were posted to two interest groups, Autos and Electronics (about 500 documents
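The expansion of free-form text columns into numeric term columns can be sketched in a few lines of Python. This is a toy illustration with naive whitespace tokenization, not Text Miner's pipeline; the three miniature "documents" stand in for the newsgroup messages.

```python
from collections import Counter

# Three tiny documents standing in for the Autos/Electronics messages.
docs = [
    "new car engine for sale",
    "car dealer quotes a price",
    "voltage regulator circuit design",
]

tokenized = [d.lower().split() for d in docs]            # naive tokenization
vocab = sorted({t for toks in tokenized for t in toks})  # one column per term

# Rows = documents, columns = terms, cells = raw term counts.
matrix = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

print(vocab)
print(matrix)
```

Each row of `matrix` is the numeric representation of one document; downstream techniques such as concept extraction or classification operate on exactly this kind of term-document structure.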
e-mails, monetary amounts, etc.). Text Miner will not tokenize on these delimiters.

Note: Exceptions are related not to how terms are separated but as to whether they are split based on the delimiter. For example: URLs contain many characters such as "/" or ";" and Text Miner will not tokenize on these characters in the URL, but will consider the URL whole and will remove the URL if selected for removal.

If Analyze specified terms only is selected, the Edit Terms button will be enabled. If you click this button, the Edit Exclusive Terms dialog opens. Here you can add and remove terms to be considered for text mining. All other terms will be disregarded. For example, if we wanted to mine each document for a specific part name such as "alternator" we would click Add Term on the Edit Exclusive Terms dialog, then replace "New term" with "alternator" and click Done to return to the Pre-Processing dialog. During the text mining process, Text Miner would analyze each document for the term alternator, excluding all other terms.

Leave both Start term/phrase and End term/phrase empty under Text Location. If this option is used, text appearing before the first occurrence of the Start Phrase will be disregarded and, similarly, text appearing after End Phrase (if used) will be disregarded. For example, if text mining the transcripts from a Live Chat service, you would not be interested in any text appearing before the heading "Chat Transcript" or after the heading "End of Chat Transcript." You would enter "Chat Transcript" into the Start Phrase field and "End of Chat Transcript" into the End Phrase field (Figure 2).

Figure 2.

Leave the default setting for Stopword removal. Click Edit to view a list of commonly used words that will be removed from the documents during pre-processing. To remove a word from the Stopword list, simply highlight the desired word, then click Remove Stopword. To add a new word to the list, click Add Stopword, and a new term "stopword" will be added.

Text Miner also allows additional stopwords to be added, or existing ones removed, via a text document (*.txt) by using the Browse button to navigate to the file. Terms in the text document can be separated by a space, a comma, or both. If supplying the three terms in a text document, rather than in the Edit Stopwords dialog, the terms could be listed as: subject emailterm from or subject,emailterm,from or subject, emailterm, from. If we had a large list of additional stopwords, this would be the preferred way to enter the terms.

Advanced in the Term Normalization group allows us to indicate to Text Miner that:

• If stemming reduced term length to 2 or less characters, disregard the
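The "space, comma, or both" stopword-file format is easy to parse. This sketch is our own illustration of the file format described above, not Text Miner's code; it shows that all three listings from the paragraph yield the same stopword list.

```python
import re

def parse_stopwords(text):
    """Split a stopword listing on any run of commas and/or whitespace,
    dropping empty entries, so all three documented formats are equivalent."""
    return [t for t in re.split(r"[,\s]+", text) if t]

for sample in ("subject emailterm from",
               "subject,emailterm,from",
               "subject, emailterm, from"):
    print(parse_stopwords(sample))   # same three terms each time
```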
and replace “rootterm” with “auto” term “phrasetoken” will appear, click to
then replace “synonym list” with “car, edit, and enter “wagon.” Click “phrase”
automobile, convertible, vehicle, sedan, to edit and enter “station wagon.” If
coupe, subcompact, jeep” (without the supplying phrases through a text file
quotes) (Figure 4). (*.txt), each line of the file must be of the
If adding synonyms from a text form phrasetoken:phrase or using our
file, each line must be of the form example, wagon:station wagon.
rootterm:synonymlist or using our If you enter 200 for Maximum
example: auto:car automobile convertible Vocabulary Size, Text Miner will
vehicle sedan coup or auto:car,autom reduce the number of terms in the
obile,convertible,vehicle,sedan,coup. final vocabulary to the top 200 most
Note separation between the terms in frequently occurring in the collection.
the synonym list be either a space, a Leave Perform stemming as the
comma or both. If we had a large list of selected default. Stemming is the practice
synonyms, this would be the preferred of stripping words down to their “stems”
way to enter the terms. or “roots.” For example, stemming of
Text Miner also allows the terms such as “argue,” “argued,” “argues,”
combining of words into phrases that “arguing,” and “argus” would result in the
indicate a singular meaning such as stem “argu.” However, “argument” and
“station wagon” which refers to a specific “arguments” would stem to “argument.”
type of car rather than two distinct The stemming algorithm utilized in Text
tokens – station and wagon. To add a Miner is “smart” in the sense that while
phrase in the Vocabulary Reduction “running” would be stemmed to “run,”
– Advanced dialog, select Phrase “runner” would not. Text Miner uses
reduction and click Add Phrase. The the Porter Stemmer 2 algorithm for the
English Language. For more information
on this algorithm, please see: http://
tartarus.org/martin/PorterStemmer/
Leave the default selection for
Normalize case. When this option
is checked, Text Miner converts all
text to a consistent (lower) case, so
that Term, term, TERM, etc. are all
normalized to a single token “term”
before any processing, rather than
creating three independent tokens
with different cases. This simple
method can dramatically affect the
frequency distributions of the corpus,
leading to biased results (Figure 5).
For many text mining applications,
the goal is to identify terms that are
useful for discriminating between
documents. If a term occurs in all or
Figure 5. almost all documents, it may not be
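The rootterm:synonymlist format, and the substitution it drives during pre-processing, can be sketched as follows. This is our illustration of the file format described above, not Text Miner's implementation.

```python
import re

def parse_synonym_line(line):
    """Parse 'auto:car automobile ...' or 'auto:car,automobile,...' into
    (rootterm, [synonyms]); separators may be spaces, commas, or both."""
    root, _, rest = line.partition(":")
    return root.strip(), [t for t in re.split(r"[,\s]+", rest) if t]

def apply_synonyms(tokens, root, synonyms):
    """Replace any synonym with its root term, as pre-processing would."""
    table = {s: root for s in synonyms}
    return [table.get(t, t) for t in tokens]

root, syns = parse_synonym_line("auto:car,automobile,convertible,vehicle,sedan,coup")
print(root, syns)
print(apply_synonyms(["my", "sedan", "and", "convertible"], root, syns))
# -> ['my', 'auto', 'and', 'auto']
```

After substitution, all eight ways of saying "auto" collapse into one vocabulary term, which is exactly the vocabulary reduction the dialog is configuring.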
Text Miner provides several visualizations that simplify this process greatly.

Select Maximum number of concepts and increment the counter to 20. Doing so will tell Text Miner to retain the top 20 most significant concepts. If Automatic is selected, Text Miner will calculate the importance of each concept, take the difference between each, and report any concepts above the largest difference.

Keep Term frequency table selected (the default) under Preprocessing Summary and select Zipf's plot. Increase the Most frequent terms to 20 and select Maximum corresponding documents. The Term frequency table will include the top 20 most frequently occurring terms. The first column, Collection Frequency, displays the number of times the term appears in the collection. The second column, Document Frequency, displays the number of documents that include the term. The third column, Top Documents, displays the top 5 documents where the corresponding term appears the most frequently.

The Zipf Plot graphs the document frequency against the term ranks in descending order of frequency. Zipf's law states that the frequency of terms used in free-form text drops exponentially, i.e. that people tend to use a relatively small number of words extremely frequently and use a large number of words very rarely.

Text Miner will produce a table displaying the document ID, length of the document, number of terms, and 20 characters of the text of the document. Run the Text Mining analysis and four pop-up charts will appear, and seven output sheets will be inserted into the Model tab of the Text Miner task pane. The Term Count table shows that the original term count in the documents was reduced by 85.54 percent by the removal of stopwords, excluded terms, synonyms, phrase removal and other specified preprocessing procedures. The Documents table lists each Document with its length, number of terms, and, if Keep a short excerpt is selected on the Output Options tab and a value is present for Number of characters, an excerpt from each document.

The Term-Document Matrix lists the 200 most frequently appearing terms across the top and the document IDs down the left. If a term appears in a document, a weight is placed in the corresponding column indicating the importance of the term, using our selection of TF-IDF on the Representation dialog (Figure 6).

The Final List of Terms table contains the top 20 terms occurring in the document collection, the number of documents that include the term, and the top 5 document IDs where the corresponding term appears most frequently. In this list we would see terms such as "car," "power," "engine," "drive," and "dealer," which suggests that many of the documents, even the documents from the electronics newsgroup, were related to autos.

Figure 6.
Figure 8.
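Under TF-IDF weighting, a term's weight in a document grows with how often it appears there and shrinks with how many documents contain it. As a rough sketch of how such a term–document matrix is filled in — using a made-up three-document corpus and one common TF-IDF variant, tf × log(N/df), which is not necessarily the exact weighting Text Miner applies:

```python
# Sketch of a term-document matrix with TF-IDF weights.
# Weighting used here: tf * log(N / df), one common variant.
import math

docs = {
    "doc1": "car engine car dealer",
    "doc2": "phone circuit board",
    "doc3": "car phone mount",
}

tokenized = {d: text.split() for d, text in docs.items()}
vocab = sorted({t for toks in tokenized.values() for t in toks})
n_docs = len(docs)
df = {t: sum(t in toks for toks in tokenized.values()) for t in vocab}

def tfidf(term, toks):
    tf = toks.count(term)
    return tf * math.log(n_docs / df[term])

# Rows = documents, columns = terms; a nonzero weight marks a term whose
# presence helps characterize that document.
matrix = {d: [round(tfidf(t, toks), 3) for t in vocab]
          for d, toks in tokenized.items()}
for d, row in matrix.items():
    print(d, dict(zip(vocab, row)))
```

Note that a term like "car," which appears in two of three documents, receives a lower weight per occurrence than a term unique to one document — exactly the discriminating-power idea discussed earlier.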
The Scree Plot gives a graphical representation of the contribution or importance of each concept. The largest "drop" or "elbow" in the plot appears between the first and second concept. This suggests that the first top concept explains the leading topic in our collection of documents. Any remaining concepts have significantly reduced importance. However, we can always select more than one concept to increase the accuracy of the analysis. It is best to examine the Concept Importance table and the "Cumulative Singular Value" to identify how many top concepts capture enough information for your application.
The Concept-Document Scatter Plot is a visual representation of the Concept–Document matrix. Analytic Solver Data Mining normalizes each document representation so it lies on a unit hypersphere. Documents that appear in the middle of the plot, with concept coordinates near 0, are not explained well by either of the shown concepts. The further the magnitude of a coordinate from zero, the more effect that concept has for the corresponding document (Figure 11).
In fact, two documents placed at the extremes of a concept (one close to -1 and the other to +1) indicate strong differentiation between these documents in terms of the extracted concept. This provides a means for understanding the actual meaning of the concept and for investigating which concepts have the largest discriminative power when used to represent the documents from the text collection.
The Concept–Term Matrix lists the top 20 most important concepts along the top of the matrix and the top 200 most frequently appearing terms down the side of the matrix. The Term-Concept Scatter Plot visually represents the Concept–Term Matrix. It displays all terms from the final vocabulary in terms of two concepts. Similar to the Concept-Document scatter plot, the Concept-Term scatter plot visualizes the distribution of vocabulary terms in the semantic space of meaning extracted with LSA. The coordinates are also normalized, so the range of the axes is always [-1,1], where extreme values (close to +/-1) highlight the importance or "load" of each term for a particular concept.
The terms appearing in a zero-neighborhood of the concept range do not contribute much to a concept's definition. In our example, if we identify a concept having a set of terms that can be divided into two groups, one related to "Autos" and the other to "Electronics," and these groups are distant from each other on the axis corresponding to the concept, this would provide evidence that this particular concept "caught" some pattern in the text collection that is capable of discriminating the topic of an article.
Therefore, the Term-Concept scatter plot is an extremely valuable tool for examining and understanding the main concepts extracted from the text collection.

Figure 11.
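The LSA machinery behind these outputs — singular values for the scree plot and unit-normalized document coordinates for the Concept-Document scatter plot — can be sketched with a truncated SVD. The tiny matrix below is invented for illustration, and numpy's SVD stands in for whatever routine Analytic Solver uses internally:

```python
# Sketch: latent semantic analysis via a truncated SVD of the
# term-document matrix, producing the two outputs the tutorial plots:
# singular values (scree plot) and per-document concept coordinates
# normalized to unit length (the unit hypersphere).
import numpy as np

# rows = documents, columns = terms (e.g., TF-IDF weights); made up,
# with two obvious topic clusters (docs 0-1 vs. docs 2-3)
A = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                          # number of concepts to keep
scree = s                      # plot these to find the "elbow"
doc_coords = U[:, :k] * s[:k]  # documents in concept space

# Normalize each document vector to unit length, so coordinates lie
# on a unit hypersphere and extreme values approach +/-1.
doc_coords /= np.linalg.norm(doc_coords, axis=1, keepdims=True)
print(np.round(doc_coords, 3))
```

With this toy input, the two document clusters land on opposite sides of a concept axis — the same "strong differentiation at the extremes" pattern described above for the real output.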
EXECUTIVE INTERVIEW
MIGUEL MARTINEZ —
Microsoft Power BI
EXECUTIVE INTERVIEW: MIGUEL MARTINEZ
SI: We hear a lot of discussion of Big Data. Are BI and Big Data compatible?
Martinez: I would say they're definitely compatible, and I can speak from experience on that one. In one of my previous positions, in a previous life—I've always been in data-related roles—I was at an airline and looking at all the operational data that we captured. You can use that data to figure out fuel consumption or adjust shifts for crews that needed to fly planes. The amount of data back then—I'm talking 10 or 12 years ago—was already overwhelming, and I would think it would qualify as Big Data. There is no way to take advantage of that amount of data, that velocity, that volume, that real-time component, if you don't have a layer that allows people who are making business and operational decisions to make sense out of it.
To bring it back to things that you see, not only in Power BI but in a lot of BI tools, how do you translate an insane amount of data into a visualization, into something that makes sense? I see two trends there. One is the data visualization piece, the sexy part that you see in all the BI tools; the other is the intelligence behind it. It doesn't matter how many people you put into data analysis, you're still going to need the AI (artificial intelligence) component to filter that data and to show you just the things that are relevant, to save you time from moving that amount of data to actual analysis.
So, BI and Big Data are 100 percent compatible. You need to deal with any type and any amount of data that you have available. But if you don't allow the user to execute on that amount of data, it's worthless. So those two things are very related, and that's what you need, to make both of them play well together and get the most value out of them.

SI: We usually talk about data and intelligence and information all being similar, but different. You mentioned velocity. I'd like to go back to that. Is current micro-level BI capable of dealing with high-velocity data, that is, data that changes frequently? Can it keep currency of the data so that people are working with what is current rather than what is old data?
Martinez: Definitely. This is where Microsoft is heading, how to quickly get the data into the hands of people that need it. Are people accessing the one version of the truth at the right time? All those components are key, not only to Power BI but, again, to all our cloud and enterprise/data analytics solutions. For example, Power BI is a pioneer on the data visualization piece, to visualize data in real time. At our Data Insights Summit, which is where we bring data analysts to talk about Excel and Power BI, the real-time component, like the visualization piece, is front and center. It's a big differentiator that we have. It doesn't matter that you can visualize every sensor that you have on a factory floor if there's no logic and intelligence behind that.
Because of the way data is coming into the hands of people that use it, real-time is key. It doesn't matter if your data are on the same platform or if you're using different systems, data needs to come as close to real-time as possible.
the familiarity. They don't need to learn new things to take advantage of what was, eight years ago when I took my MBA, an IT-focused or IT-bubble type of workload. Now you can tap into any data that you want. You are easily able to match it up and relate it to others without having a lot of technical knowledge, and get a visual representation that is easy to understand and will take you to a finding or to an insight in a shorter time. This is what I see with MBAs: they're usually going to have to interact with a lot of executives, with a lot of people where the patience they have for an analysis or for an insight is going to be very short.
If you look at the pain points of any data analyst or business analyst, when they are presenting an analysis or a model to someone, it's usually, "How do you present it so it doesn't take a long time for that audience to understand? How do you tell a story of the things that you're finding in that data?"
All the tools in modern BI, which we call the third wave of BI, are making it easier for the person creating those analyses and those reports, and also for the consumer of those reports. You are shortening the gap between whoever is analyzing the data—the business analyst, the MBA student, the data scientist—and the person receiving it. And new ideas, new insights can come out because it's not only the analyst doing the analysis, it's everybody together consuming that data and coming out with an analysis.
So, to summarize: number one, not a lot of investment in learning new tools; two, more power to present things in a better way; and, three, coming up with actual better insights and solutions, because you are empowering people, as an audience for your business knowledge, to make better decisions, more informed decisions.

SI: Have you had any experience with applications from other companies that work with Power BI? Has that presented a benefit or a problem for you?
Martinez: We have a lot of experience with that situation. You cannot expect any customer, any user of a particular piece of software, to be exclusively on your platform. It happens, but it's very, very rare. That means when you build a tool like Power BI, or any other tool within Microsoft, you need to consider several things: not everybody's using only your tools, so it must be open, and, especially if it's a data analytics tool, you need to bring all the data in regardless of where it is generated. We accomplish a lot of that through working with other companies.
On the data connectivity side, Power BI connects natively to names like Salesforce, Google Analytics, NetSuite, QuickBooks—it's a list that goes to the hundreds. When doing analysis for our own team, I connect to those companies' programs and that experience is very, very friendly.
On the consumption side, even though Microsoft is a company that has a lot of resources, you are going to have a lot of custom visualizations that people want to build for specific problems they're trying to solve. For example, using Narrative Sciences
Terms of THE TRADE
Glossary
A
Ambari
A web interface for managing Hadoop services and components
H
HBase
A non-relational, distributed database that runs on top of Hadoop
Z Zookeeper
An application that coordinates distributed processing