
BUMA 30063

FUNDAMENTALS OF
DESCRIPTIVE ANALYTICS

COMPILED BY:

MR. ROGENE C. ESGUERRA


Introduction

“The most valuable commodity I know of is information.”


Gordon Gekko

This course aims to introduce students to the fundamentals of descriptive analytics.


Descriptive analytics makes use of current transactions to enable managers to visualize how the
company is performing. This course will teach students how to prepare reports using descriptive analytics tools.
Descriptive Analytics seeks to provide a depiction or “summary view” of facts and figures
in an understandable format, to either inform or prepare data for further analysis. It uses two
primary techniques, namely data aggregation and data mining to report past events. It presents
past data in an easily digestible format for the benefit of a wide business audience.

Common examples of Descriptive Analytics are company reports that simply provide a
historic review of an organization’s operations, sales, financials, customers, and stakeholders. It
is relevant to note that in the Big Data world, the “simple nuggets of information” provided by
Descriptive Analytics become prepared inputs for more advanced Predictive or Prescriptive
Analytics that deliver real-time insights for business decision making.

Descriptive Analytics helps to describe and present data in a format which can be easily
understood by a wide variety of business readers. Descriptive Analytics rarely attempts to
investigate or establish cause and effect relationships. Some common methods employed in
Descriptive Analytics are observations, case studies, and surveys. Thus, collection and
interpretation of large amounts of data may be involved in this type of analytics.

Examples of Descriptive Analytics include: summarizing past events such as regional sales, customer attrition, or success of marketing campaigns; tabulation of social metrics such as Facebook likes, Tweets, or followers; reporting of general trends like hot travel destinations or news trends.

According to Four Types of Big Data Analytics and Examples of Their Use, as soon as
the “volume, velocity, and variety” of Big Data invades the limited business data silos, the game
changes. Now, powered by the hidden intelligence of massive amounts of market data,
Descriptive Analytics takes on new meaning.

Whenever Big Data intervenes, vanilla-form Descriptive Analytics is combined with the
extensive capabilities of Prescriptive and Predictive Analytics to deliver highly-focused insights
into business issues and accurate future predictions based on past data patterns. Descriptive
Analytics mines and prepares the data for use by Predictive or Prescriptive Analytics.

Table of Contents

KNOWLEDGE MANAGEMENT
  Knowledge Management Definition
  Elements of Big Data
  Computer Support Systems
  The Evolution of Big Data
  Examples of Knowledge Management
  Data Mining Descriptive Applications
  Self-test Exercises
DATA VISUALIZATION
  Data Visualization Definition
  Overview
  Characteristics of Effective Graphical Displays
  Quantitative Messages
  Eight Types of Quantitative Messages
  Visual Perception and Data Visualization
  Human Perception/Cognition and Data Visualization
  History of Data Visualization
  Terminology
  Examples of Diagrams Used for Data Visualization
  Other Perspectives
  Data Presentation Architecture
  Objectives
  Scope
  Related Fields
  Self-test Exercises
ASSOCIATION RULE LEARNING
  Association Rule Learning Definition
  Useful Concepts
  Support
  Confidence
  Lift
  Conviction
  Process
  History
  Alternative Measures of Interestingness
  Statistically Sound Associations
  Algorithms
  Other Types of Association Mining
  Self-test Exercises
CLUSTER ANALYSIS
  Cluster Analysis Definition
  Algorithms
  Connectivity-based Clustering (Hierarchical Clustering)
  Centroid-based Clustering
  Distribution-based Clustering
  Density-based Clustering
  Recent Developments
  Evaluation and Assessment
  Internal Evaluation
  External Evaluation
  Applications
  Self-test Exercises
MARKET BASKET ANALYSIS
  Market Basket Analysis
  Creating Named Ranges
  Calculating Lifts for Multiple Two-way Product Combinations Simultaneously
  Computing Three-Way Lift
  Optimizing the Three-Way Lift
  Using Lift to Optimize Store Layout
  Self-test Exercises
LINK ANALYSIS
  Link Analysis Definition
  Link Analysis Terms
  Basic Network Graphics with NodeXL
  Link Analysis Application with PolyAnalyst (Olson and Shi 2007)
  Self-test Exercises

Republic of the Philippines
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES

College of Business Administration

Course Title : FUNDAMENTALS OF DESCRIPTIVE ANALYTICS


Course Code : BUMA 30023
Course Credit : 3 UNITS
Course Description : This course aims to introduce students to the fundamentals of descriptive
analytics. Descriptive analytics makes use of current transactions to enable managers to visualize
how the company is performing. This course will teach students how to prepare reports using descriptive analytics tools.
Course Objectives : Upon the successful completion of this course, the students will be able to:
1. Identify opportunities for descriptive analytics.
2. Discuss key concepts and techniques in descriptive analytics.
3. Utilize descriptive analytic tools to prepare reports and monitor performance.
4. Create an analysis based on reports.
5. Develop an awareness of the ethical norms as required under policies and applicable laws governing confidentiality and non-disclosure of data/information/documents and proper conduct in the learning process and application of business analytics.

Learning Outcomes:
1. Define business requirements.
2. Tailor data architecture.
3. Describe where to look for data in an organization and create required reports.
4. Perform analyses to support business optimization.
5. Understand data management concepts and the criticality of data availability in order to make reliable business decisions.
6. Demonstrate understanding of business intelligence, including the importance of data gathering, data storing, data analysis, and data access.
7. Apply different analytics modelling concepts on enterprise data.
8. Understand the functions of data access constraints of various departments within an organization and provide compliance reports.
9. Work on various analytics tools available in the market for various business functions.
10. Participate actively in business discussions with various departments and create common reports or specific/unit reports with regard to predictive and prescriptive analytics.
11. Understand the business processes as they relate to data analysis and optimization.
12. Convey results of data analysis to organizational stakeholders at various levels.

Semester and Academic Year: 1st SEMESTER / AY 2020-2021

COURSE PLAN

Week 1-3: Knowledge Management
Learning Outcomes:
• Understand the concept of knowledge management and its primary application to fact-based decision making.
• Describe the emergence and development of big data in relation to descriptive analytics.
• Understand the function of computer systems in facilitating data manipulation techniques.
Methodology: Lecture
Assessment: Checking of self-test exercises, assignments, and other activities

Week 4-6: Data Visualization
Learning Outcomes:
• Understand the nature and concept of data visualization.
• Enumerate examples of data visualization techniques.
• Describe effective graphical displays for data interpretation.
Methodology: Lecture
Assessment: Checking of self-test exercises, assignments, and other activities

Week 7-9: Association Rule Learning
Learning Outcomes:
• Understand the nature and concept of association rule learning.
• Enumerate examples of association rules and techniques.
• Describe confidence, lift, support, and conviction.
Methodology: Lecture
Assessment: Checking of self-test exercises, assignments, and other activities; Midterm Examination

Week 10-12: Cluster Analysis
Learning Outcomes:
• Understand the concept of cluster analysis and its importance to decision making.
• Enumerate different categories of cluster analysis and their characteristics.
• Comprehend the recent developments in the application of cluster analysis.
Methodology: Lecture
Assessment: Checking of self-test exercises, assignments, and other activities

Week 14-15: Market Basket Analysis
Learning Outcomes:
• Understand the nature and concept of market basket analysis.
• Assimilate the differences among categories of lift in the context of market basket analysis.
• Become familiar with market basket analysis through data manipulation in Excel spreadsheets.
Methodology: Lecture
Assessment: Checking of self-test exercises, assignments, and other activities

Week 16-17: Link Analysis
Learning Outcomes:
• Understand the nature and concept of link analysis.
• Assimilate the advantages of using link analysis to gather customer insights.
• Become familiar with the use of Excel spreadsheets and software to facilitate data manipulation.
Methodology: Lecture
Assessment: Checking of self-test exercises, assignments, and other activities; Final Examination

Resources (all topics):
• Camm, Cochran, Fry, Ohlmann, Anderson, Sweeney, Williams. Essentials of Business Analytics (1st Ed.). Cengage Learning, 2019.
• Albright, C. & Winston, W. (2005). Business Analytics: Data Analysis and Decision Making. Cengage Learning.
• Bentley, D. (2007). Business Intelligence and Analytics. Library Press.
• Olson, D. (2017). Descriptive Data Mining. Springer Nature Singapore Pte Ltd.

COURSE GRADING SYSTEM
Class Standing 70%
• Quizzes
• Assignments
• Case Analysis
• Activities
Midterm / Final Examinations 30%
Total 100%

FINAL GRADE = (Midterm Grade + Final Term Grade) / 2
Enhanced by: Mr. Rogene C. Esguerra, Instructor
Received by: Jenny Q. Estingor, MEM, Academic Head
Noted by: Fernando F. Estingor, MSIT, Campus Director
CHAPTER I
KNOWLEDGE MANAGEMENT

Objectives
After the completion of this module, you should be able to comprehend and demonstrate
the mastery of the following:
➢ Understand the concept of knowledge management and its primary application to fact-based decision making.
➢ Describe the emergence and development of big data in relation to descriptive analytics.
➢ Understand the function of computer systems in facilitating data manipulation techniques.
➢ Enumerate key elements of descriptive analytics.
➢ Enumerate descriptive analytics applications.

Knowledge Management

Knowledge management is an overarching term referring to the ability to identify, store, and retrieve knowledge. Identification requires gathering the information needed and analyzing available data to make effective decisions regarding whatever the organization does. This includes research, digging through records, or gathering data from wherever it can be found.
Storage and retrieval of data involves database management, using many tools developed by
computer science. Thus knowledge management involves understanding what knowledge is
important to the organization, understanding systems important to organizational decision
making, database management, and analytic tools of data mining. The era of big data is here.
Davenport (2014) defines big data as:

➢ Data too big to fit on a single server;


➢ Too unstructured to fit in a row-and-column database;
➢ Flowing too continuously to fit into a static data warehouse;
➢ Having the characteristic of lacking structure.

Knowledge management (KM) needs to cope with big data by identifying and managing
knowledge assets within organizations. KM is process oriented, thinking in terms of how
knowledge can be acquired, as well as tools to aid decision making. Rothberg and Erickson
(2005) give a framework defining data as observation, which when put into context becomes
information, which when processed by human understanding becomes knowledge. The point
of big data is to analyze, converting data into insights, innovation, and business value. It can add
value by providing real-time measures of performance, provide more timely analyses based on
more complete data, and lead to sounder decisions (Manyika et al. 2011). We live in an
environment driven by data. Amazon prospers by understanding what their customers want, and
delivering content in effective ways.

Wal-Mart has been very successful in using electronic means to gather sales in real time,
storing 65 weeks of data in massive data warehouse systems that they intensively mine to
support inventory, pricing, transportation, and other decisions related to their business. Data
management also is found in governmental operations. The National Weather Service has
collected unbelievable quantities of information related to weather, harnessing high-end
computing power to improve weather prediction. NASA has developed a knowledge base of
physical relationships enabling space activities. Waller and Fawcett (2013) describe big data in terms of volume,
velocity, and variety.
With respect to volume, retail sales data aggregates a firm’s many cash register
transactions in real time; such information can be aggregated by consumer to profile each and
generate recommendations for additional sales; the same input can update inventories by stock
keeping-unit and better manage shipping across the supply chain; sensor data can also track
misplaced inventory in stores and warehouses. With respect to velocity, sales data can be real
time, as well as aggregated to hourly, daily, weekly, and monthly form to support marketing
decisions. Inventory data obtained in real-time can be aggregated to hourly or monthly updates.
Location and time information can be organized to manage the supply chain. Variety is
magnified in this context by sales events from cash registers in brick and mortar locations, along
with Internet sales, wholesale activity, international activity, and activity by competitors. All of
this information can be combined with social media monitoring to better profile customers. Inventory activity can be monitored by type of outlet as well as by vendor. Sensor-
obtained data can be traced by workers involved, paths used, and locations.
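As a small illustration of how such transaction-level data can be rolled up to different time grains and to stock-keeping units, the sketch below uses pandas; the column names and sample values are invented for the example and are not taken from the cited sources.

```python
import pandas as pd

# Hypothetical point-of-sale records (column names and values invented for illustration).
sales = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-09-01 09:15", "2020-09-01 17:40",
        "2020-09-02 11:05", "2020-09-08 14:30",
    ]),
    "sku": ["A100", "B200", "A100", "C300"],
    "amount": [120.0, 75.5, 120.0, 240.0],
})

# Velocity: the same raw feed rolled up to different time grains.
daily = sales.resample("D", on="timestamp")["amount"].sum()
weekly = sales.resample("W", on="timestamp")["amount"].sum()

# Volume: aggregation by stock-keeping unit to support inventory decisions.
by_sku = sales.groupby("sku")["amount"].agg(["count", "sum"])

print(daily, weekly, by_sku, sep="\n\n")
```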

Computer Support Systems

Computer systems have been applied to support business decision making for decades
(Olson and Courtney 1992). When personal computers came out, they were used to provide
analytic tools for specific problems (decision support systems) (Sprague and Carlson 1982).
Commercial software firms (such as Execucom and Comshare) extended this idea to dedicated
systems to serve executives by providing them the key data they were expected to be interested
in at their fingertips (executive support systems). Another commercial application was on-line
analytic processing, developing database spreadsheet software capable of providing reports on
any of a number of available dimensions.

In a parallel universe, statisticians and students of artificial intelligence revolutionized the field of statistics to develop data mining, which when combined with database capabilities
evolving on the computer side led to business intelligence. The quantitative side of this
development is business analytics, focusing on providing better answers to business decisions
based on access to massive quantities of information ideally in real-time (big data).

The Evolution of Big Data

Davenport (2013) reviewed three eras of analytics (see Table 1.1). The first era involved
business intelligence, with focus on computer systems to support human decision making (for
instance, use of models and focused data on dedicated computer systems in the form of
decision support systems). The second era evolved in the early 21st Century introducing big
data, through internet and social media generation of masses of data. Davenport sees a third
era in a data-enriched environment where on-line real-time analysis can be conducted by firms in every
industry. This is accomplished through new tools, using Hadoop clusters and NoSQL databases
to enable data discovery, applying embedded analytics supporting cross-disciplinary data teams.
One source of all of this data is the Internet of Things. Not only do people send messages; now cars, phones, and machines communicate with each other (Kellmereit and Obodovski 2013). This enables much closer monitoring of patient health, including little wristbands that monitor the wearer's pulse, temperature, and blood pressure and forward them on to the patient's physician. How people ever survived until 2010 is truly a wonder. But it does indicate the tons of data in which a minuscule bit of important data exists.

Of course, signals are sent only when critical limits are reached, just as vending
machines can send signals to replenish stock at their critical limits. Monitors in homes can
reduce electricity use, thus saving the globe from excessive warming. Cars can send signals to
dealers about engine problems, so that they might send a tow truck to the location provided by
the car’s GPS. Insurance already advertise their ability to attach devices to cars to identify good
drivers, a euphemism for detection of bad driving so that they can cancel policies more likely to
call for claims.

Examples of Knowledge Management

There have been impressive accomplishments using big data. Google detected the
SARS epidemic much sooner than the US health system (Brynjolfsson and McAfee 2014). The
Harvard Medical School found Tweets to be as accurate as official reports in tracking cholera
after the 2010 Haitian earthquake, and two weeks faster. Movie firms have found Tweets to be
good at predicting box-office revenues. Wu et al. (2014) provided a knowledge management
framework for the product lifecycle, to include classification of knowledge types:

➢ Customer knowledge - CRM focus in data mining terms;


➢ Development knowledge - product design involving engineering expertise;
➢ Production knowledge - knowledge of production processes;
➢ Delivery and Service knowledge - knowledge of the processes needed to serve customers.

Knowledge of customers is a classical customer profiling matter. The other three bullets
are classical business process reengineering matters, often involving tacit knowledge which
organizations generate in the form of their employees’ expertise. Management of these forms of
knowledge requires:

➢ A mechanism to identify and access knowledge;


➢ A method for collaboration to identify who, how, and where knowledge is;
➢ A method to integrate knowledge for effectively making specific decisions.

Data can be found in statistics of production measures, which accounting provides and
which industrial engineers (and supply chain managers) analyze for decision making. Knowledge
also exists in the experience, intuition, and insight found in employees (tacit information). This
tacit knowledge includes organizational value systems. Thus expression of such knowledge is
only available through collaboration within organizations. With respect to knowledge
management, it means that the factual data found in accounting records needs to be
supplemented by expertise, and a knowledge management system is closely tied to the idea of
business process mapping. Business process mapping in turn is usually expressed in the form of
a flowchart of what decisions need to be made, where knowledge can be found, and the
approval authority in the organization's control system.
Kellmereit and Obodovski (2013) view this brave new world as a platform for new
industries, around intelligent buildings, long-distance data transmission, and expansion of
services in industries such as health care and utilities. Humans and machines are contended to
work best in tandem, with machines gathering data, providing analytics, and applying algorithms
to optimize or at least improve systems while humans provide creativity. (On the other hand,
computer scientists such as Kurzweil (2000) expect machines to develop learning capabilities
circa 2040 in the Great Singularity). Retail organizations (like Wal-Mart) analyze millions of data
sets, some fed by RFID signals, to lower costs and thus serve customers better. Use of all of this
data requires increased data storage, the next link in knowledge management. It also is
supported by a new data environment, allowing release from the old statistical reliance on
sampling, because masses of data usually preclude the need for sampling. This also leads to a
change in emphasis from hypothesis generation and testing to more reliance on pattern
recognition supported by machine learning.

A prime example of what this can accomplish is customer relationship management, where every detail of company interaction with each customer can be stored and recalled to
analyze for likely interest in other company products, or management of their credit, all designed
to optimize company revenue from every customer.

Knowledge is defined in dictionaries as the expertise obtained through experience or education, leading to understanding of a subject. Knowledge acquisition refers to the processes of perception, learning, and reasoning to capture, structure, and represent knowledge from all sources for the purpose of storing, sharing, and implementing this knowledge. Our current age has seen knowledge increasingly viewed as a means to improve society.
Knowledge discovery involves the process of obtaining knowledge, which of course can be
accomplished in many ways. Some learn by observing, others by theorizing, yet others by
listening to authority. Almost all of us learn in different combinations of these methods,
synthesizing different, often conflicting bits of data to develop our own view of the world.

Knowledge management takes knowledge, no matter how it is discovered, and provides a system to support organizational decision making. In a more specific sense,
knowledge discovery involves finding interesting patterns from data stored in large databases
through use of computer analysis. In this context, the term interesting implies non-trivial, implicit,
previously unknown, easily understood, useful and actionable knowledge. Information is defined
as the
patterns, correlations, rules, or relationships in data providing knowledge useful in decision
making.

We live in an age swamped with data. As if satellite feeds of weather data, military
intelligence, or transmission of satellite radio and television signals weren’t enough, the devices
of the modern generation including Twitter, Facebook, and their many competitors flood us with
information. We sympathize with the idea that parents can more closely monitor their children,
although whether these children will develop normally without this close supervision remains to
be seen. But one can’t help wonder how many signals containing useless information clutter up
the lives of those with all of these devices.

Data Mining Descriptive Applications

Knowledge management consists of the overall field of human knowledge (epistemology) as well as means to record and recall it (computer systems) and quantitative
analysis to understand it (in business contexts, business analytics). There are many applications
of quantitative analysis, falling within the overall framework of the term business analytics.
Analytics has been around since statistics became widespread. With the emergence of
computers, we see three types of analytic tools. Descriptive analytics focus on reports of what
has happened.

Statistics are a big part of that. Descriptive models are an example of unsupervised
learning, where the algorithm identifies relationships without user direction. They don’t predict
some target value, but rather try to provide clues to data structure, relationships, and
connectedness. Predictive analytics extend statistical and/or artificial intelligence to provide
forecasting capability. They are directed in the sense that a target is defined. This can be a
continuous variable to forecast. It also includes categorical output, especially classification
modeling that applies models to suggest better ways of doing things, to include identification of
the most likely customer profiles to send marketing materials, or to flag suspicious insurance
claims, or many other applications.

Diagnostic analytics can apply analysis to sensor input to direct control systems
automatically. This is especially useful in mechanical or chemical environments where speed
and safety considerations make it attractive to replace human monitors with automated systems
as much as possible. It can lead to some problems, such as bringing stock markets to their
knees
for short periods (until humans can regain control). Prescriptive analytics applies quantitative
models to optimize systems, or at least to identify improved systems. Data mining includes
descriptive and predictive modeling. Operations research includes all three. This book focuses
on the forecasting component of predictive modeling, with the classification portion of
prescriptive analytics demonstrated.

Predictive modeling is well-studied, starting with linear regression but extending through autoregressive integrated moving-average (ARIMA) to generalized autoregressive conditional heteroscedasticity (GARCH) models and many variants. The crux of data mining modeling is classification, accomplished by logistic regression, neural network, and decision tree predictive models. They are covered in other books. Prescriptive analytics is an emerging field, with many interesting developments to come. This book begins with a discussion of visualization, a simple but important step in data mining. Business analytics starts with getting
data, which can come from many sources, internal or external to organizations. We do not delve
into the even more important aspect of data transformation and cleansing, where the bulk of
data mining work occurs. We do provide simple explanations of some descriptive tools that are
more machine learning, supporting the evolving field of pattern recognition. Cluster analysis is
well known from statistical theory to identify groupings within datasets. We will then look at some methods
evolving from marketing, to include recency, frequency, and monetary expenditures (RFM)
models, an early and simplified means of classification. Link analysis deals with making sense
of social networks.
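As a brief preview of the RFM idea, the sketch below computes recency, frequency, and monetary values per customer from a hypothetical purchase history; the column names and figures are invented for illustration and are not drawn from the sources above.

```python
import pandas as pd

# Hypothetical purchase history, invented to illustrate RFM scoring.
purchases = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "C"],
    "date": pd.to_datetime(["2020-08-01", "2020-09-10", "2020-05-02",
                            "2020-09-01", "2020-09-15", "2020-09-20"]),
    "amount": [50.0, 20.0, 200.0, 35.0, 40.0, 25.0],
})
as_of = pd.Timestamp("2020-09-30")

rfm = purchases.groupby("customer").agg(
    recency=("date", lambda d: (as_of - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                          # number of purchases
    monetary=("amount", "sum"),                           # total spend
)
print(rfm)
```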

SELF-TEST EXERCISES

Review Questions:
1. Discuss the concept of knowledge management and its primary application to fact-based
decision making.
2. Describe the emergence and development of big data in relation to descriptive analytics.
3. Discuss the function of computer systems in facilitating data manipulation techniques.
4. Enumerate key elements of descriptive analytics.
5. Enumerate descriptive analytics applications.

CHAPTER II
DATA VISUALIZATION

Objectives
After the completion of this module, you should be able to comprehend and demonstrate
the mastery of the following:
➢ Understand the nature and concept of data visualization.
➢ Enumerate examples of data visualization techniques.
➢ Describe the effective graphical displays for data interpretation.
➢ Enlist the practical applications of data visualization techniques.
➢ Understand the concept and function of Data Presentation Architecture.

Data Visualization

Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual
representation of data, meaning “information that has been abstracted in some schematic form,
including attributes or variables for the units of information”. A primary goal of data visualization
is to communicate information clearly and efficiently via statistical graphics, plots and
information graphics. Numerical data may be encoded using dots, lines, or bars, to visually
communicate a quantitative message. Effective visualization helps users analyze and reason
about data and evidence. It makes complex data more accessible, understandable and usable.

Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or
showing causality) follows the task. Tables are generally used where users will look up a
specific measurement, while charts of various types are used to show patterns or relationships
in the data for one or more variables. Data visualization is both an art and a science. It is viewed
as a branch of descriptive statistics by some, but also as a grounded theory development tool
by others. The rate at which data is generated has increased. Data created by internet activity
and an expanding number of sensors in the environment, such as satellites, are referred to as
“Big Data”.

Processing, analyzing and communicating this data present a variety of ethical and analytical
challenges for data visualization. The field of data science and practitioners called data scientists
have emerged to help address this challenge.

Overview
Data visualization is one of the steps in analyzing data and presenting it to users. Data
visualization refers to the techniques used to communicate data or information by encoding it as
visual objects (e.g., points, lines or bars) contained in graphics. The goal is to communicate
information clearly and efficiently to users. It is one of the steps in data analysis or data science.
According to Friedman (2008) the “main goal of data visualization is to communicate information
clearly and effectively through graphical means. It doesn’t mean that data visualization needs to
look boring to be functional or extremely sophisticated to look beautiful. To convey ideas
effectively, both aesthetic form and functionality need to go hand in hand, providing insights into
a rather sparse and complex data set by communicating its key-aspects in a more intuitive way.
Yet designers often fail to achieve a balance between form and function, creating gorgeous data
visualizations which fail to serve their main purpose — to communicate information”. Indeed,
Fernanda Viegas and Martin M. Wattenberg have suggested that an ideal visualization should
not only communicate clearly but stimulate viewer engagement and attention. Not limited to the
communication of an information, a well-crafted data visualization is also a way to a better
understanding of the data (in a data-driven research perspective), as it helps uncover trends,
realize insights, explore sources, and tell stories. Data visualization is closely related to
information graphics, information visualization, scientific visualization, exploratory data analysis
and statistical graphics. In the new millennium, data visualization has become an active area of
research, teaching and development. According to Post et al. (2002), it has united scientific and
information visualization.

Characteristics of Effective Graphical Displays

Professor Edward Tufte explained that users of information displays are executing
analytical tasks such as making comparisons or determining causality. The design principle of
the information graphic should support the analytical task, showing the comparison or causality.
In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines ‘graphical

displays’ and principles for effective graphical display in the following passage: “Excellence in
statistical graphics consists of complex ideas communicated with clarity, precision and
efficiency.

Graphical displays should:

➢ Show the data
➢ Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
➢ Avoid distorting what the data has to say
➢ Present many numbers in a small space
➢ Make large data sets coherent
➢ Encourage the eye to compare different pieces of data
➢ Reveal the data at several levels of detail, from a broad overview to the fine structure
➢ Serve a reasonably clear purpose: description, exploration, tabulation or decoration
➢ Be closely integrated with the statistical and verbal descriptions of a data set.

Graphics reveal data. Indeed graphics can be more precise and revealing than
conventional statistical computations.” For example, the Minard diagram shows the losses
suffered by Napoleon’s army in the 1812–1813 period. Six variables are plotted: the size of the
army, its location on a two-dimensional surface (x and y), time, direction of movement, and
temperature. The line width illustrates a comparison (size of the army at points in time) while the
temperature axis suggests a cause of the change in army size. This multivariate display on a two
dimensional surface tells a story that can be grasped immediately while identifying the source
data to build credibility. Tufte wrote in 1983 that: “It may well be the best statistical graphic ever
drawn.” Not applying these principles may result in misleading graphs, which distort the
message or support an erroneous conclusion. According to Tufte, chartjunk refers to extraneous
interior decoration of the graphic that does not enhance the message, or gratuitous three
dimensional or perspective effects. Needlessly separating the explanatory key from the image
itself, requiring the eye to travel back and forth from the image to the key, is a form of
“administrative debris.” The ratio of “data to ink” should be maximized, erasing non-data ink
where feasible. The Congressional Budget Office summarized several best practices for
graphical displays in a June 2014 presentation. These included: a) Knowing your audience; b)
Designing graphics that can stand alone outside the context of the report; and c) Designing
graphics that communicate the key messages in the report.

Quantitative Messages

A time series illustrated with a line chart demonstrating trends in U.S. federal spending
and revenue over time.
A scatterplot illustrating negative correlation between two variables (inflation and
unemployment) measured at points in time.

Author Stephen Few described eight types of quantitative messages that users may
attempt to understand or communicate from a set of data and the associated graphs used to
help communicate the message:

1. Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
2. Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.
3. Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
4. Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show comparison of the actual versus the reference amount.
5. Frequency distribution: Shows the number of observations of a variable for a given interval, such as the number of years in which the stock market return is between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis. A boxplot helps visualize key statistics about the distribution, such as median, quartiles, outliers, etc.
6. Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
7. Nominal comparison: Comparing categorical subdivisions in no order, such as the sales volume by product code. A bar chart may be used for this comparison.
8. Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.

Analysts reviewing a set of data may consider whether some or all of the messages and
graphic types above are applicable to their task and audience. The process of trial and error to
identify meaningful relationships and messages in the data is part of exploratory data analysis.
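To make a few of these message types concrete, the following sketch draws a time-series line chart and a ranking bar chart with matplotlib; the data values are invented purely for illustration.

```python
import matplotlib.pyplot as plt

# Invented data for illustration only.
years = [2016, 2017, 2018, 2019, 2020]
unemployment_rate = [4.9, 4.4, 3.9, 3.7, 8.1]                              # time-series message
sales_by_person = {"Cruz": 120, "Reyes": 95, "Santos": 150, "Garcia": 80}  # ranking message

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Time-series: a single variable captured over time -> line chart.
ax1.plot(years, unemployment_rate, marker="o")
ax1.set_title("Unemployment rate over time")
ax1.set_xlabel("Year")
ax1.set_ylabel("Rate (%)")

# Ranking: categorical subdivisions ordered by a measure -> bar chart.
names = sorted(sales_by_person, key=sales_by_person.get, reverse=True)
ax2.bar(names, [sales_by_person[n] for n in names])
ax2.set_title("Sales performance by sales person")
ax2.set_ylabel("Units sold")

plt.tight_layout()
plt.show()
```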
Visual Perception and Data Visualization

A human can distinguish differences in line length, shape orientation, and color (hue)
readily without significant processing effort; these are referred to as “pre-attentive attributes.” For
example, it may require significant time and effort (“attentive processing”) to identify the number
of times the digit “5” appears in a series of numbers; but if that digit is different in size,
orientation, or color, instances of the digit can be noted quickly through pre-attentive processing.
Effective graphics take advantage of pre-attentive processing and attributes and the relative
strength of these attributes. For example, since humans can more easily process differences in
line length than surface area, it may be more effective to use a bar chart (which takes
advantage of line length to show comparison) rather than pie charts (which use surface area to
show comparison).

Human Perception/Cognition and Data Visualization

There is a human side to data visualization. With the “studying [of] human perception and
cognition...” we are better able to understand the target of the data which we display. Cognition
refers to processes in human beings like perception, attention, learning, memory, thought,
concept formation, reading, and problem solving. The basis of data visualization evolved
because as a picture is worth a thousand words, data displayed graphically allows for an easier
comprehension of the information. Proper visualization provides a different approach to show
potential connections, relationships, etc. which are not as obvious in non-visualized quantitative
data. Visualization becomes a means of data exploration. Human brain neurons involve multiple
functions, but 2/3 of the brain's neurons are dedicated to vision. With a well-developed sense of sight, analysis can be made of data, whether that data is quantitative or qualitative. Effective visualization follows from understanding the processes of human perception, and being able to apply this understanding to intuitive visualizations is important. Understanding how humans see and
organize the world is critical to effectively communicating data to the reader. This leads to more
intuitive designs.

History of Data Visualization

There is a history of data visualization: beginning in the 2nd century C.E. with data
arrangement into columns and rows and evolving to the initial quantitative representations in the
17th century. According to the Interaction Design Foundation, French philosopher and
mathematician René Descartes laid the groundwork for Scotsman William Playfair. Descartes developed a two-dimensional coordinate system for displaying values, in which Playfair, in the late 18th century, saw potential for the graphical communication of quantitative data. In the second
half of the 20th century, Jacques Bertin used quantitative graphs to represent information
“intuitively, clearly, accurately, and efficiently”.

John Tukey and, more notably, Edward Tufte pushed the bounds of data visualization. Tukey, with his new statistical approach of exploratory data analysis, and Tufte, with his book "The Visual Display of Quantitative Information", paved the path for refining data visualization techniques for more than statisticians. With the progression of technology came the progression of data visualization, starting with hand-drawn visualizations and evolving into more technical applications, including interactive designs leading to software visualization. Programs like SAS, SOFA, R, Minitab, and more allow for data visualization in the field of statistics. Other data visualization applications, more focused and unique to individuals, along with programming languages and libraries such as D3, Python, and JavaScript, help to make the visualization of quantitative data a possibility.

Terminology
Data visualization involves specific terminology, some of which is derived from statistics.
For example, author Stephen Few defines two types of data, which are used in combination to
support a meaningful analysis or visualization:

➢ Categorical: Text labels describing the nature of the data, such as “Name” or “Age”. This
term also covers qualitative (non-numerical) data.
➢ Quantitative: Numerical measures, such as “25” to represent the age in years.

Two primary types of information displays are tables and graphs.
o A table contains quantitative data organized into rows and columns with categorical
labels. It is primarily used to look up specific values. In the example above, the table
might have categorical column labels representing the name (a qualitative variable)
and age (a quantitative variable), with each row of data representing one person (the
sampled experimental unit or category subdivision).
o A graph is primarily used to show relationships among data and portrays values
encoded as visual objects (e.g., lines, bars, or points). Numerical values are
displayed within an area delineated by one or more axes. These axes provide scales
(quantitative and categorical) used to label and assign values to the visual objects.
Many graphs are also referred to as charts.

KPI Library has developed the “Periodic Table of Visualization Methods,” an interactive
chart displaying various data visualization methods. It includes six types of data visualization
methods: data, information, concept, strategy, metaphor and compound.

Examples of Diagrams used for Data Visualization

➢ Bar chart. Visual dimensions: length/count, category, (color). Example usage: comparison of values, such as sales performance for several persons or businesses in a single time period; for a single variable measured over time (trend), a line chart is preferable.
➢ Histogram. Visual dimensions: bin limits, count/length, (color). Example usage: determining the frequency of annual stock market percentage returns within ranges (bins) such as 0-10%, 11-20%, etc.; the height of the bar represents the number of observations (years) with a return percentage in the range represented by the bin.
➢ Scatter plot. Visual dimensions: x position, y position, (symbol/glyph), (color), (size). Example usage: determining the relationship (e.g., correlation) between unemployment (x) and inflation (y) for multiple time periods.
➢ Scatter plot (3D). Visual dimensions: position x, position y, position z, color.
➢ Network. Visual dimensions: node size, node color, tie thickness, tie color, spatialization. Example usages: finding clusters in the network (e.g., grouping Facebook friends into different clusters); determining the most influential nodes in the network (e.g., a company wants to target a small group of people on Twitter for a marketing campaign).
➢ Streamgraph. Visual dimensions: width, color, time (flow).
➢ Treemap. Visual dimensions: size, color. Example usage: disk space by location / file type.
➢ Gantt chart. Visual dimensions: color, time (flow). Example usage: schedule / progress, e.g. in project planning.
➢ Heat map. Visual dimensions: row, column, cluster, color. Example usage: analyzing risk, with green, yellow and red representing low, medium, and high risk, respectively.

Other Perspectives
There are different approaches on the scope of data visualization. One common focus is
on information presentation, such as Friedman (2008) presented it. In this way Friendly (2008)
presumes two main parts of data visualization: statistical graphics, and thematic cartography. In
this line the “Data Visualization: Modern Approaches” (2007) article gives an overview of seven
subjects of data visualization:

➢ Articles & resources


➢ Displaying connections
➢ Displaying data
➢ Displaying news
➢ Displaying websites
➢ Mind maps
➢ Tools and services

All these subjects are closely related to graphic design and information representation.
On the other hand, from a computer science perspective, Frits H. Post (2002) categorized the
field into a number of sub-fields:

➢ Information visualization
➢ Interaction techniques and architectures
➢ Modelling techniques
➢ Multiresolution methods
➢ Visualization algorithms and techniques
➢ Volume visualization

Data Presentation Architecture

Data presentation architecture (DPA) is a skill-set that seeks to identify, locate,


manipulate, format and present data in such a way as to optimally communicate meaning and
proper knowledge. Historically, the term data presentation architecture is attributed to Kelly
Lautt: “Data Presentation Architecture (DPA) is a rarely applied skill set critical for the success
and value of Business Intelligence. Data presentation architecture weds the science of
numbers, data and statistics in discovering valuable information from data and making it usable,
relevant and actionable with the arts of data visualization, communications, organizational
psychology and change management in order to provide business intelligence solutions with the
data scope, delivery timing, format and visualizations that will most effectively support and drive
operational, tactical and strategic behaviour toward understood business (or organizational)
goals. DPA is neither an IT nor a business skill set but exists as a separate field of expertise.
Often confused with data visualization, data presentation architecture is a much broader skill set
that includes determining what data on what schedule and in what exact format is to be
presented, not just the best way to present data that has already been chosen (which is data
visualization). Data visualization skills are one element of DPA.”

Objectives

➢ To use data to provide knowledge in the most efficient manner possible (minimize noise, complexity, and unnecessary data or detail given each audience’s needs and roles)
➢ To use data to provide knowledge in the most effective manner possible (provide relevant, timely and complete data to each audience member in a clear and understandable manner that conveys important meaning, is actionable and can affect understanding, behavior and decisions)

Scope

With the above objectives in mind, the actual work of data presentation architecture
consists of:

➢ Creating effective delivery mechanisms for each audience member depending on their role, tasks, locations and access to technology
➢ Defining important meaning (relevant knowledge) that is needed by each audience member in each context
➢ Determining the required periodicity of data updates (the currency of the data)
➢ Determining the right timing for data presentation (when and how often the user needs to see the data)
➢ Finding the right data (subject area, historical reach, breadth, level of detail, etc.)
➢ Utilizing appropriate analysis, grouping, visualization, and other presentation formats

Related Fields

DPA work shares commonalities with several other fields, including:

➢ Business analysis in determining business goals, collecting requirements, mapping processes.
➢ Business process improvement in that its goal is to improve and streamline actions and
decisions in furtherance of business goals
➢ Data visualization in that it uses well-established theories of visualization to add or highlight
meaning or importance in data presentation.
➢ Graphic or user design: As the term DPA is used, it falls just short of design in that it does not
consider such detail as colour palettes, styling, branding and other aesthetic concerns, unless
these design elements are specifically required or beneficial for communication of meaning,
impact, severity or other information of business value. For example:
o Choosing locations for various data presentation elements on a presentation page
(such as in a company portal, in a report or on a web page) in order to convey
hierarchy, priority, importance or a rational progression for the user is part of the DPA
skill-set.
o Choosing to provide a specific colour in graphical elements that represent data of
specific meaning or concern is part of the DPA skill-set
➢ Information architecture, but information architecture’s focus is on unstructured data and
therefore excludes both analysis (in the statistical/data sense) and direct transformation of
the actual content (data, for DPA) into new entities and combinations.
➢ Solution architecture in determining the optimal detailed solution, including the scope of data
to include, given the business goals
➢ Statistical analysis or data analysis in that it creates information and knowledge out of data.

SELF-TEST EXERCISES

Review Questions:
1. Discuss the nature and concept of data visualization.
2. Enumerate examples of data visualization techniques.
3. Describe the effective graphical displays for data interpretation.
4. Enlist the practical applications of data visualization techniques.
5. Discuss the concept and function of Data Presentation Architecture.

CHAPTER III
ASSOCIATION RULE LEARNING

Objectives
After the completion of this module, you should be able to comprehend and demonstrate
the mastery of the following:
➢ Understand the nature and concept of association rule learning.
➢ Enumerate examples of association rules and techniques.
➢ Describe confidence, lift, support and conviction.
➢ Enlist examples of applications that can facilitate association rule learning.
➢ Understand the concept and function of association mining.

Association Rule Learning

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules
discovered in databases using some measures of interestingness. Based on the concept of
strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities
between products in large-scale transaction data recorded by point-of-sale (POS) systems in
supermarkets. For example, the rule {onions, potatoes}⇒{burger} found in the sales data of a
supermarket would indicate that if a customer buys onions and potatoes together, they are likely
to also buy hamburger meat. Such information can be used as the basis for decisions about
marketing activities such as, e.g., promotional pricing or product placements. In addition to the
above example from market basket analysis association rules are employed today in many
application areas including Web usage mining, intrusion detection, Continuous production, and
bioinformatics. In contrast with sequence mining, association rule learning typically does not
consider the order of items either within a transaction or across transactions.

Definition

Following the original definition by Agrawal et al. the problem of association rule mining is
defined as:

Let I = {i1, i2, ..., in} be a set of n binary attributes called items.

Let D = {t1, t2, ..., tm} be a set of transactions called the database.

Each transaction in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form:

X ⇒ Y

where X, Y ⊆ I and X ∩ Y = ∅.

Every rule is composed of two different sets of items, also known as itemsets, X and Y, where X is called the antecedent or left-hand side (LHS) and Y the consequent or right-hand side (RHS). To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer, diapers}, and the table below shows a small database containing the items, where, in each entry, the value 1 means the presence of the item in the corresponding transaction, and the value 0 represents the absence of an item in that transaction.

Transaction ID   milk   bread   butter   beer   diapers
1                1      1       0        0      0
2                0      0       1        0      0
3                0      0       0        1      1
4                1      1       1        0      0
5                0      1       0        0      0

An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.

Note: this example is extremely small. In practical applications, a rule needs a support of
several hundred transactions before it can be considered statistically significant, and data-sets
often contain thousands or millions of transactions.

Useful Concepts

In order to select interesting rules from the set of all possible rules, constraints on various
measures of significance and interest are used. The best-known constraints are minimum
thresholds on support and confidence. Let X be an item-set, X ⇒Y an association rule and T a
set of transactions of a given database.

Support

Support is an indication of how frequently the item-set appears in the database. The support value of X with respect to T is defined as the proportion of transactions in the database which contain the item-set X. In formula: supp(X) = |{t ∈ T : X ⊆ t}| / |T|. In the example database, the item-set {beer, diapers} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions). The argument of supp() is a set of preconditions, and thus becomes more restrictive as it grows (instead of more inclusive).

Confidence

Confidence is an indication of how often the rule has been found to be true. The
confidence value of a rule, X ⇒Y, with respect to a set of transactions T, is the proportion of the
transactions that contains X which also contains Y. Confidence is defined as:

conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).

For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well). Note that supp(X ∪ Y) means the support of the union of the items in X and Y. This is somewhat confusing since we normally think in terms of probabilities of events and not sets of items. We can rewrite supp(X ∪ Y) as the joint probability P(E_X ∩ E_Y), where E_X and E_Y are the events that a transaction contains itemset X or Y, respectively.

Thus confidence can be interpreted as an estimate of the conditional probability P(E_Y | E_X), the probability of finding the RHS of the rule in transactions under the condition that
these transactions also contain the LHS.

Lift
The lift of a rule is defined as:

lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),

or the ratio of the observed support to that expected if X and Y were independent. For example, the rule {milk, bread} ⇒ {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.

If the rule had a lift of 1, it would imply that the probability of occurrence of the
antecedent and that of the consequent are independent of each other. When two events are
independent of each other, no rule can be drawn involving those two events.

If the lift is > 1, that lets us know the degree to which those two occurrences are
dependent on one another and makes those rules potentially useful for predicting the
consequent in future data sets. The value of lift is that it considers both the confidence of the
rule and the overall data set.

Conviction

The conviction of a rule is defined as:

conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)).

For example, the rule {milk, bread} ⇒ {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2, and can be interpreted as the ratio of the expected frequency that X occurs without Y
(that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent divided by the observed frequency of incorrect predictions. In this example, the
conviction value of 1.2 shows that the rule {milk, bread}⇒{butter}would be incorrect 20% more
often (1.2 times as often) if the association between X and Y was purely random chance.
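The four measures can be computed directly from the example database above. The sketch below is an illustrative Python implementation; the helper functions supp, conf, lift, and conviction are defined here for the example and are not taken from a library.

```python
# Illustrative computation of the four measures for the five-transaction
# supermarket database shown earlier in this chapter.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    """Confidence of the rule lhs => rhs."""
    return supp(set(lhs) | set(rhs)) / supp(lhs)

def lift(lhs, rhs):
    """Ratio of observed support to that expected if lhs and rhs were independent."""
    return supp(set(lhs) | set(rhs)) / (supp(lhs) * supp(rhs))

def conviction(lhs, rhs):
    """Expected frequency of incorrect predictions under independence divided
    by the observed frequency of incorrect predictions."""
    return (1 - supp(rhs)) / (1 - conf(lhs, rhs))

print(supp({"beer", "diapers"}))                  # ≈ 0.2
print(conf({"butter", "bread"}, {"milk"}))        # ≈ 1.0
print(lift({"milk", "bread"}, {"butter"}))        # ≈ 1.25
print(conviction({"milk", "bread"}, {"butter"}))  # ≈ 1.2
```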

Process
Frequent itemset lattice, where the color of the box indicates how many transactions
contain the combination of items. Note that lower levels of the lattice can contain at most the
minimum number of their parents’ items; e.g. {ac} can have only at most min(a,c) items. This is
called the downward-closure property. Association rules are usually required to satisfy a user
specified minimum support and a user-specified minimum confidence at the same time.
Association rule generation is usually split up into two separate steps:

1. A minimum support threshold is applied to find all frequent item-sets in a database.
2. A minimum confidence constraint is applied to these frequent item-sets in order to form rules.

While the second step is straightforward, the first step needs more attention. Finding all frequent
item-sets in a database is difficult since it involves searching all possible item sets (item combinations).
The set of possible item-sets is the power set over I and has size 2^n − 1 (excluding the empty set which is
not a valid item-set). Although the size of the power-set grows exponentially in the number of items n in
I, efficient search is possible using the downward-closure property of support (also called anti-
monotonicity) which guarantees that for a frequent itemset, all its subsets are also frequent and thus for an
infrequent item-set, all its super-sets must also be infrequent. Exploiting this property, efficient algorithms
(e.g., Apriori and Eclat) can find all frequent item-sets.
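
To make the two steps concrete, here is a small, unoptimized Apriori-style sketch in Python. The function names and structure are illustrative assumptions, not a reference implementation: the first function finds frequent itemsets, exploiting downward closure during candidate generation, and the second derives rules from them.

from itertools import combinations

def apriori(transactions, min_support):
    # Step 1: find all frequent itemsets and their support
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}   # level-1 candidates
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Candidate generation for level k+1: join frequent k-itemsets and keep
        # only candidates whose k-subsets are all frequent (downward closure)
        k += 1
        current = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(frozenset(s) in level
                                       for s in combinations(union, k - 1)):
                current.add(union)
    return frequent

def derive_rules(frequent, min_confidence):
    # Step 2: form rules X => Y from every frequent itemset
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return rules

freq = apriori([{"milk", "bread"}, {"butter"}, {"beer", "diapers"},
                {"milk", "bread", "butter"}, {"bread"}], min_support=0.4)
print(derive_rules(freq, min_confidence=0.6))   # [({'milk'}, {'bread'}, 1.0)]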

History

The concept of association rules was popularised particularly due to the 1993 article of
Agrawal et al., which has acquired more than 18,000 citations according to Google Scholar, as
of August 2015, and is thus one of the most cited papers in the Data Mining field. However, it is
possible that what is now called “association rules” is similar to what appears in the 1966 paper
on GUHA, a general data mining method developed by Petr Hájek et al. An early (circa 1989)
use of minimum support and confidence to find all association rules is the Feature Based
Modeling framework, which found all rules with supp(X) and conf(X ⇒ Y) greater than user-defined constraints.

Alternative Measures of Interestingness

In addition to confidence, other measures of interestingness for rules have been


proposed. Some popular measures are:

➢ All-confidence
➢ Collective strength
➢ Conviction
➢ Leverage
➢ Lift (originally called interest)

Several more measures are presented and compared by Tan et al. and by Hahsler.
Looking for techniques that can model what the user has known (and using these models as
interestingness measures) is currently an active research trend under the name of “Subjective
Interestingness.”

Statistically Sound Associations

One limitation of the standard approach to discovering associations is that by searching


massive numbers of possible associations to look for collections of items that appear to be
associated, there is a large risk of finding many spurious associations. These are collections of
items that co-occur with unexpected frequency in the data, but only do so by chance. For
example, suppose we are considering a collection of 10,000 items and looking for rules
containing two items in the left-hand side and one item in the right-hand side. There are
approximately 1,000,000,000,000 such rules. If we apply a statistical test for independence with
a significance level of 0.05, there is a 5% chance of accepting a rule even when there is no
association. If we assume there are no associations at all, we should nonetheless expect to find
50,000,000,000 spurious rules. Statistically sound association discovery controls this risk, in most cases
reducing the risk of finding any spurious associations to a user-specified significance level.

Algorithms

Many algorithms for generating association rules have been presented over time. Some well-known
algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are
algorithms for mining frequent itemsets. A further step needs to be done afterwards to generate rules
from the frequent itemsets found in a database.

Apriori Algorithm

Apriori uses a breadth-first search strategy to count the support of itemsets and uses a
candidate generation function which exploits the downward closure property of support.

Eclat Algorithm

Eclat (alt. ECLAT, stands for Equivalence Class Transformation) is a depth-first search
algorithm using set intersection. It is a naturally elegant algorithm suitable for both sequential as
well as parallel execution with locality enhancing properties. It was first introduced by Zaki,
Parthasarathy, Li and Ogihara in a series of papers written in 1997 (Mohammed Javeed Zaki,
Srinivasan Parthasarathy, M. Ogihara, Wei Li: “New Algorithms for Fast Discovery of Association
Rules”, KDD 1997; Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei
Li: “Parallel Algorithms for Discovery of Association Rules”, Data Min. Knowl. Discov. 1(4): 343–373, 1997).

FP-growth Algorithm

FP stands for frequent pattern. In the first pass, the algorithm counts the occurrences of items
(attribute-value pairs) in the dataset and stores them in a ‘header table’. In the second pass, it
builds the FP-tree structure by inserting instances. Items in each instance have to be sorted by
descending order of their frequency in the dataset, so that the tree can be processed quickly.
Items in each instance that do not meet minimum coverage threshold are discarded. If many
instances share most frequent items, FP-tree provides high compression close to tree root.
Recursive processing of this compressed version of main dataset grows large item sets directly,
instead of generating candidate items and testing them against the entire database. Growth
starts from the bottom of the header table (having longest branches), by finding all instances
matching the given condition. A new tree is created, with counts projected from the original tree
corresponding to the set of instances that are conditional on the attribute, with each node
getting the sum of its children's counts. Recursive growth ends when no individual items conditional
on the attribute meet minimum support threshold, and processing continues on the remaining
header items of the original FP-tree. Once the recursive process has completed, all large item
sets with minimum coverage have been found, and association rule creation begins.
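
In practice FP-growth is usually taken from a library rather than re-implemented. As an illustration only, the sketch below assumes the third-party mlxtend package (and pandas) is installed and that its frequent_patterns module provides fpgrowth and association_rules as described in the mlxtend documentation; verify the exact parameter names against that documentation.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["milk", "bread"], ["butter"], ["beer", "diapers"],
                ["milk", "bread", "butter"], ["bread"]]

# One-hot encode the transactions into a boolean DataFrame
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# Mine frequent itemsets with FP-growth, then derive rules from them
frequent = fpgrowth(onehot, min_support=0.2, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])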

AprioriDP

AprioriDP utilizes Dynamic Programming in Frequent itemset mining. The working


principle is to eliminate candidate generation in the same spirit as FP-tree, but it stores support counts in
a specialized data structure instead of a tree.

Context Based Association Rule Mining Algorithm

CBPNARM is an algorithm developed in 2013 to mine association rules on the basis of
context. It uses a context variable, on the basis of which the support of an itemset is changed,
and the resulting rules are then added to the rule set.

Node-set-based Algorithms

FIN, PrePost and PPV are three algorithms based on node sets. They use nodes in a
coding FP-tree to represent itemsets, and employ a depth-first search strategy to discover
frequent itemsets using the “intersection” of node sets.

GUHA Procedure ASSOC

GUHA is a general method for exploratory data analysis that has theoretical foundations
in observational calculi. The ASSOC procedure is a GUHA method which mines for generalized
association rules using fast bitstring operations. The association rules mined by this method are
more general than those output by Apriori; for example, “items” can be connected with both
conjunctions and disjunctions, and the relation between the antecedent and consequent of the rule is
not restricted to setting minimum support and confidence as in Apriori: an arbitrary combination
of supported interest measures can be used.

OPUS Search

OPUS is an efficient algorithm for rule discovery that, in contrast to most alternatives,
does not require either monotone or anti-monotone constraints such as minimum support.
Initially used to find rules for a fixed consequent, it has subsequently been extended to find rules
with any item as a consequent. OPUS search is the core technology in the popular Magnum
Opus association discovery system.
Lore

A famous story about association rule mining is the “beer and diaper” story. A purported
survey of behavior of supermarket shoppers discovered that customers (presumably young men)
who buy diapers tend also to buy beer. This anecdote became popular as an example of how
unexpected association rules might be found from everyday data. There are varying opinions as
to how much of the story is true. Daniel Powers says: In 1992, Thomas Blischok, manager of a
retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market
baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities.
The analysis “did discover that between 5:00 and 7:00 p.m. that consumers bought beer and
diapers”. Osco managers did NOT exploit the beer and diapers relationship by moving the
products closer together on the shelves.

Other Types of Association Mining

Multi-Relation Association Rules: Multi-Relation Association Rules (MRAR) are a class
of association rules in which, in contrast to primitive, simple and even multi-relational
association rules (that are usually extracted from multi-relational databases), each rule item
consists of one entity but several relations. These relations indicate indirect relationships between
the entities. Consider the following MRAR, where the first item consists of three relations live in,
nearby and humid: “Those who live in a place which is nearby a city with humid climate type and
also are younger than 20 -> their health condition is good”. Such association rules are
extractable from RDBMS data or semantic web data.

Context Based Association Rules is a form of association rule. Context Based
Association Rules claims more accuracy in association rule mining by considering a hidden
variable named the context variable, which changes the final set of association rules depending upon
the value of the context variables. For example, the composition of baskets in market basket analysis
shows an odd pattern in the early days of the month. This might be because of an abnormal context,
i.e., salaries are drawn at the start of the month.

Contrast set learning is a form of associative learning. Contrast set learners use rules
that differ meaningfully in their distribution across subsets.

Weighted class learning is another form of associative learning in which weight may be
assigned to classes to give focus to a particular issue of concern for the consumer of the data
mining results.

High-order pattern discovery facilitates the capture of high-order (polythetic) patterns or
event associations that are intrinsic to complex real-world data.

K-optimal pattern discovery provides an alternative to the standard approach to


association rule learning that requires that each pattern appear frequently in the data.

Approximate Frequent Itemset mining is a relaxed version of Frequent Itemset mining


that allows some of the items in some of the rows to be 0.

➢ Generalized Association Rules: hierarchical taxonomy (concept hierarchy)
➢ Quantitative Association Rules: categorical and quantitative data
➢ Interval Data Association Rules: e.g., partition the age into 5-year-increment ranges
➢ Maximal Association Rules

Sequential pattern mining discovers subsequences that are common to more than
minsup sequences in a sequence database, where minsup is set by the user. A sequence is an
ordered list of transactions. Sequential rule mining discovers relationships between items while
considering the time ordering; it is generally applied on a sequence database. For example, a
sequential rule found in a database of sequences of customer transactions could be that customers
who bought a computer and CD-ROMs later bought a webcam, with a given confidence and
support.

Subspace Clustering, a specific type of Clustering high-dimensional data, is in many


variants also based on the downward-closure property for specific clustering models.

Warmr is shipped as part of the ACE data mining suite. It allows association rule learning
for first order relational rules.

Expected float entropy minimisation simultaneously uncovers relationships between


nodes (referred to as items on this page) of a system and also between the different states that
the nodes can be in. The theory links associations inherent to systems such as the brain to
associations present in perception.


SELF-TEST EXERCISES

Review Questions:

1. Understand the nature and concept of association rule learning.


2. Enumerate examples of association rules and techniques.
3. Describe confidence, lift, support and conviction.
4. Enlist examples of applications that can facilitate association rule learning.
5. Understand the concept and function of association mining.

CHAPTER IV
CLUSTER ANALYSIS

Objectives
After the completion of this module, you should be able to comprehend and demonstrate
the mastery of the following:
➢ Understand the concept of cluster analysis and its importance to decision making.
➢ Enumerate different categories of cluster analysis and their characteristics.
➢ Comprehend the recent developments on the application of cluster analysis.
➢ Differentiate internal and external evaluation of cluster analysis.
➢ Enumerate some practical applications of cluster analysis.

Cluster Analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense or another) to each
other than to those in other groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields, including machine learning,
pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and
computer graphics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It
can be achieved by various algorithms that differ significantly in their notion of what constitutes a
cluster and how to efficiently find them. Popular notions of clusters include groups with small
distances among the cluster members, dense areas of the data space, intervals or particular
statistical distributions. Clustering can therefore be formulated as a multi-objective optimization
problem. The appropriate clustering algorithm and parameter settings (including values such as
the distance function to use, a density threshold or the number of expected clusters) depend on
the individual data set and intended use of the results. Cluster analysis as such is not an
automatic task, but an iterative process of knowledge discovery or interactive multi-objective
optimization that involves trial and failure. It is often necessary to modify data preprocessing and
model parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including
automatic classification, numerical taxonomy, botryology and typological analysis. The subtle
differences are often in the usage of the results: while in data mining, the resulting groups are
the matter of interest, in automatic classification the resulting discriminative power is of interest.
Cluster analysis was originated in anthropology by Driver and Kroeber in 1932 and introduced to
psychology by Zubin in 1938 and Robert Tryon in 1939 and famously used by Cattell beginning
in 1943 for trait theory classification in personality psychology.

Definition

The notion of a “cluster” cannot be precisely defined, which is one of the reasons why
there are so many clustering algorithms. There is a common denominator: a group of data
objects. However, different researchers employ different cluster models, and for each of these
cluster models again different algorithms can be given. The notion of a cluster, as found by
different algorithms, varies significantly in its properties. Understanding these “cluster models” is
key to understanding the differences between the various algorithms. Typical cluster models
include:

➢ Connectivity models: for example, hierarchical clustering builds models based on distance
connectivity.
➢ Centroid models: for example, the k-means algorithm represents each cluster by a single
mean vector.
➢ Distribution models: clusters are modeled using statistical distributions, such as multivariate
normal distributions used by the Expectation-maximization algorithm.
➢ Density models: for example, DBSCAN and OPTICS define clusters as connected dense
regions in the data space.
➢ Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering),
clusters are modeled with both cluster members and relevant attributes.
➢ Group models: some algorithms do not provide a refined model for their results and just
provide the grouping information.
➢ Graph-based models: a clique, that is, a subset of nodes in a graph such that every two
nodes in the subset are connected by an edge can be considered as a prototypical form of
cluster.

Relaxations of the complete connectivity requirement (a fraction of the edges can be


missing) are known as quasi-cliques, as in the HCS clustering algorithm. A “clustering” is
essentially a set of such clusters, usually containing all objects in the data set. Additionally, it
may specify the relationship of the clusters to each other, for example, a hierarchy of clusters
embedded in each other. Clusterings can be roughly distinguished as:

➢ Hard clustering: each object belongs to a cluster or not


➢ Soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain
degree (for example, a likelihood of belonging to the cluster)
➢ Strict partitioning clustering: here each object belongs to exactly one cluster
➢ Strict partitioning clustering with outliers: objects can also belong to no cluster and are
considered outliers
➢ Overlapping clustering (also: alternative clustering, multi-view clustering): while usually a
hard clustering, objects may belong to more than one cluster
➢ Hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster
➢ Subspace clustering: while an overlapping clustering, within a uniquely defined subspace,
clusters are not expected to overlap

Algorithms

Clustering algorithms can be categorized based on their cluster model, as listed above.
The following overview will only list the most prominent examples of clustering algorithms, as
there are possibly over 100 published clustering algorithms. Not all provide models for their
clusters and can thus not easily be categorized. There is no objectively “correct” clustering
algorithm, but as it was noted, “clustering is in the eye of the beholder.” The most appropriate
clustering algorithm for a particular problem often needs to be chosen experimentally, unless
there is a mathematical reason to prefer one cluster model over another. It should be noted that
an algorithm that is designed for one kind of model has no chance on a data set that contains a
radically different kind of model. For example, k-means cannot find non-convex clusters.

Connectivity-based Clustering (Hierarchical Clustering)

Connectivity based clustering, also known as hierarchical clustering, is based on the core
idea of objects being more related to nearby objects than to objects farther away. These
algorithms connect “objects” to form “clusters” based on their distance. A cluster can be
described largely by the maximum distance needed to connect parts of the cluster. At different
distances, different clusters will form, which can be represented using a dendrogram, which
explains where the common name “hierarchical clustering” comes from: these algorithms do not
provide a single partitioning of the data set, but instead provide an extensive hierarchy of
clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the
distance at which the clusters merge, while the objects are placed along the x-axis such that the
clusters don’t mix. Connectivity based clustering is a whole family of methods that differ by the
way distances are computed.

Apart from the usual choice of distance functions, the user also needs to decide on the
linkage criterion (since a cluster consists of multiple objects, there are multiple candidates to
compute the distance to) to use. Popular choices are known as single-linkage clustering (the
minimum of object distances), complete linkage clustering (the maximum of object distances) or
UPGMA (“Unweighted Pair Group Method with Arithmetic Mean”, also known as average linkage
clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single
elements and aggregating them into clusters) or divisive (starting with the complete data set and
dividing it into partitions). These methods will not produce a unique partitioning of the data set,
but a hierarchy from which the user still needs to choose appropriate clusters.
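
As a brief illustration (assuming NumPy and SciPy are installed; the synthetic data and parameter values are arbitrary), the sketch below builds agglomerative hierarchies with different linkage criteria and then cuts one hierarchy into a fixed number of clusters, since the hierarchy itself does not choose clusters for the user.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z_single = linkage(X, method="single")      # minimum of object distances
Z_complete = linkage(X, method="complete")  # maximum of object distances
Z_average = linkage(X, method="average")    # UPGMA (average linkage)

# Cut the average-linkage hierarchy into two clusters
labels = fcluster(Z_average, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z_average) would plot the hierarchy, with merge distances on the y-axis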

They are not very robust towards outliers, which will either show up as additional clusters
or even cause other clusters to merge (known as “chaining phenomenon”, with single-linkage
clustering). In the general case, the complexity is O(n³) for agglomerative clustering and O(2^(n−1))
for divisive clustering, which makes them too slow for large data sets. For some special cases, optimal
efficient methods (of complexity O(n²)) are known: SLINK for single-linkage and CLINK for
complete-linkage clustering. In the data mining community these methods are recognized as a
theoretical foundation of cluster analysis, but often considered obsolete. They did however
provide inspiration for many later methods such as density-based clustering.

Linkage Clustering Examples

Single-linkage on Gaussian data: at 35 clusters, the biggest cluster starts fragmenting into
smaller parts, while before it was still connected to the second largest due to the single-link
effect. Single-linkage on density-based clusters: 20 clusters extracted, most of which contain
single elements, since linkage clustering does not have a notion of “noise”.

Centroid-based Clustering

In centroid-based clustering, clusters are represented by a central vector, which may not
necessarily be a member of the data set. When the number of clusters is fixed to k, k-means
clustering gives a formal definition as an optimization problem: find the k cluster centers and
assign the objects to the nearest cluster center, such that the squared distances from the cluster
centers are minimized.

The optimization problem itself is known to be NP-hard, and thus the common approach
is to search only for approximate solutions. A particularly well known approximative method is
Lloyd’s algorithm, often actually referred to as “k-means algorithm”. It does however only find a
local optimum and is commonly run multiple times with different random initializations. Variations
of k-means often include such optimizations as choosing the best of multiple runs, but also
restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians
clustering), choosing the initial centers less randomly (K-means++) or allowing a fuzzy cluster
assignment (Fuzzy c-means).

Most k-means-type algorithms require the number of clusters - k - to be specified in


advance, which is one of the biggest drawbacks of these algorithms. Furthermore, the
algorithms prefer clusters of approximately similar size, as they will always assign an object to
the nearest centroid. This often leads to incorrectly cut borders between clusters (which is
not surprising, as the algorithm optimizes cluster centers, not cluster borders).

K-Means Clustering Examples

K-means separates the data into Voronoi cells, which assumes equal-sized clusters (not
adequate here). K-means cannot represent density-based clusters.

K-means has several interesting theoretical properties. First, it partitions the data space
into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest
neighbor classification, and as such is popular in machine learning. Third, it can be seen as a
variation of model-based classification, and Lloyd’s algorithm as a variation of the Expectation-
maximization algorithm for this model discussed below.
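
The following NumPy sketch of Lloyd's algorithm is a minimal illustration (not an optimized implementation): it alternates between assigning objects to the nearest center and recomputing the centers, and stops when the centers no longer move. Because it only reaches a local optimum, it would normally be run several times with different random initializations.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each object goes to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned objects
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centers = lloyd_kmeans(X, k=2)
print(centers)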

Distribution-based Clustering

The clustering model most closely related to statistics is based on distribution models.
Clusters can then easily be defined as objects belonging most likely to the same distribution. A
convenient property of this approach is that this closely resembles the way artificial data sets are
generated: by sampling random objects from a distribution. While the theoretical foundation of
these methods is excellent, they suffer from one key problem known as overfitting, unless
constraints are put on the model complexity.

A more complex model will usually be able to explain the data better, which makes
choosing the appropriate model complexity inherently difficult. One prominent method is known
as Gaussian mixture models (using the expectation-maximization algorithm). Here, the data set
is usually modelled with a fixed (to avoid overfitting) number of Gaussian distributions that are
initialized randomly and whose parameters are iteratively optimized to fit better to the data set.

This will converge to a local optimum, so multiple runs may produce different results. In order to
obtain a hard clustering, objects are often then assigned to the Gaussian distribution they most
likely belong to; for soft clusterings, this is not necessary. Distribution-based clustering produces
complex models for clusters that can capture correlation and dependence between attributes.
However, these algorithms put an extra burden on the user: for many real data sets,
there may be no concisely defined mathematical model (e.g. assuming Gaussian distributions is
a rather strong assumption on the data).
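
As an illustration (assuming scikit-learn is installed; the synthetic data and parameter values are assumptions), the sketch below fits a Gaussian mixture model with the EM algorithm: predict gives the hard clustering, predict_proba the soft one.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1.5, (100, 2))])

# The number of Gaussian components is fixed in advance to limit model complexity
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard clustering: most likely component
soft_labels = gmm.predict_proba(X)  # soft clustering: membership probabilities
print(hard_labels[:5])
print(soft_labels[:5].round(2))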

Expectation-Maximization (EM) Clustering Examples

On Gaussian-distributed data, EM works well, since it uses Gaussians for modelling clusters.
Density-based clusters cannot be modeled using Gaussian distributions.

Density-based Clustering

In density-based clustering, clusters are defined as areas of higher density than the
remainder of the data set. Objects in these sparse areas - that are required to separate clusters -
are usually considered to be noise and border points.

The most popular density based clustering method is DBSCAN. In contrast to many
newer methods, it features a well-defined cluster model called “density-reachability”. Similar to
linkage based clustering, it is based on connecting points within certain distance thresholds.
However, it only connects points that satisfy a density criterion, in the original variant defined as
a minimum number of other objects within this radius.

A cluster consists of all density-connected objects (which can form a cluster of an


arbitrary shape, in contrast to many other methods) plus all objects that are within these objects’
range. Another interesting property of DBSCAN is that its complexity is fairly low - it requires a
linear number of range queries on the database - and that it will discover essentially the same
results (it is deterministic for core and noise points, but not for border points) in each run,
therefore there is no need to run it multiple times. OPTICS is a generalization of DBSCAN that
removes the need to choose an appropriate value for the range parameter ε, and produces a
hierarchical result related to that of linkage clustering. DeLi-Clu (Density-Link-Clustering)
combines ideas from single-linkage clustering and OPTICS, eliminating the ε parameter entirely
and offering performance improvements over OPTICS by using an R-tree index.
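
A short DBSCAN sketch follows, assuming scikit-learn is installed; the eps and min_samples values are illustrative and normally need tuning. Objects that are not density-reachable from any core point receive the noise label -1.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense_blob = rng.normal(0, 0.3, (100, 2))
scattered_noise = rng.uniform(-3, 3, (30, 2))
X = np.vstack([dense_blob, scattered_noise])

# A point is a core point if at least min_samples points lie within radius eps
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # label -1 marks objects treated as noise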

The key drawback of DBSCAN and OPTICS is that they expect some kind of density
drop to detect cluster borders. Moreover, they cannot detect intrinsic cluster structures which
are prevalent in the majority of real life data. A variation of DBSCAN, EnDBSCAN, efficiently detects
such kinds of structures. On data sets with, for example, overlapping Gaussian distributions - a
common use case in artificial data - the cluster borders produced by these algorithms will often
look arbitrary, because the cluster density decreases continuously.

On a data set consisting of mixtures of Gaussians, these algorithms are nearly always
outperformed by methods such as EM clustering that are able to precisely model this kind of
data. Mean-shift is a clustering approach where each object is moved to the densest area in its
vicinity, based on kernel density estimation. Eventually, objects converge to local maxima of
density. Similar to k-means clustering, these “density attractors” can serve as representatives
for the data set, but mean-shift can detect arbitrary-shaped clusters similar to DBSCAN. Due to
the expensive iterative procedure and density estimation, mean-shift is usually slower than
DBSCAN or k-means.

Density-Based Clustering Examples

Density-based clustering with DBSCAN. DBSCAN assumes clusters of similar density, and may
have problems separating nearby clusters. OPTICS is a DBSCAN variant that handles different
densities.

Recent Developments

In recent years considerable effort has been put into improving the performance of
existing algorithms. Among them are CLARANS (Ng and Han, 1994), and BIRCH (Zhang et al.,
1996). With the recent need to process larger and larger data sets (also known as big data), the
willingness to trade semantic meaning of the generated clusters for performance has been
increasing. This led to the development of pre-clustering methods such as canopy clustering,
which can process huge data sets efficiently, but the resulting “clusters” are merely a rough
pre-partitioning of the data set to then analyze the partitions with existing slower methods such
as k-means clustering. Various other approaches to clustering have been tried, such as
seed-based clustering. For high-dimensional data, many of the existing methods fail due to the curse of
dimensionality, which renders particular distance functions problematic in high-dimensional
spaces. This led to new clustering algorithms for high-dimensional data that focus on subspace
clustering (where only some attributes are used, and cluster models include the relevant
attributes for the cluster) and correlation clustering that also looks for arbitrary rotated
(“correlated”) subspace clusters that can be modeled by giving a correlation of their attributes.

Examples for such clustering algorithms are CLIQUE and SUBCLU. Ideas from density
based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been
adopted to subspace clustering (HiSC, hierarchical subspace clustering and DiSH) and
correlation clustering (HiCO, hierarchical correlation clustering, 4C using “correlation
connectivity” and ERiC exploring hierarchical density-based correlation clusters). Several
different clustering systems based on mutual information have been proposed. One is Marina
Meilă’s variation of information metric; another provides hierarchical clustering. Using genetic
algorithms, a wide range of different fit-functions can be optimized, including mutual information.
Message passing algorithms, a recent development in computer science and statistical physics,
have also led to the creation of new types of clustering algorithms.

Evaluation and Assessment

Evaluation of clustering results sometimes is referred to as cluster validation. There have


been several suggestions for a measure of similarity between two clusterings. Such a measure
can be used to compare how well different data clustering algorithms perform on a set of data.
These measures are usually tied to the type of criterion being considered in assessing the
quality of a clustering method.

Internal Evaluation

When a clustering result is evaluated based on the data that was clustered itself, this is
called internal evaluation. These methods usually assign the best score to the algorithm that
produces clusters with high similarity within a cluster and low similarity between clusters. One
drawback of using internal criteria in cluster evaluation is that high scores on an internal
measure do not necessarily result in effective information retrieval applications. Additionally, this
evaluation is biased towards algorithms that use the same cluster model. For example, k-Means
clustering naturally optimizes object distances, and a distance-based internal criterion will likely
overrate the resulting clustering. Therefore, the internal evaluation measures are best suited to
get some insight into situations where one algorithm performs better than another, but this shall
not imply that one algorithm produces more valid results than another. Validity as measured by
such an index depends on the claim that this kind of structure exists in the data set. An
algorithm designed for one kind of model has no chance if the data set contains a radically different
kind of model, or if the evaluation measures a radically different criterion. For example, k-means
clustering can only find convex clusters, and many evaluation indexes assume convex clusters.
On a data set with non-convex clusters, neither the use of k-means nor of an evaluation criterion
that assumes convexity is sound. The following methods can be used to assess the quality of
clustering algorithms based on internal criteria:

Davies–Bouldin index
The Davies–Bouldin index can be calculated by the following formula:

DB = (1/n) Σ_{x=1..n} max_{y≠x} [ (σ_x + σ_y) / d(c_x, c_y) ]

where n is the number of clusters, c_x is the centroid of cluster x, σ_x is the average distance
of all elements in cluster x to centroid c_x, and d(c_x, c_y) is the distance between centroids c_x and c_y.
Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster
similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–
Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest
Davies–Bouldin index is considered the best algorithm based on this criterion.

Dunn index

The Dunn index aims to identify dense and well-separated clusters. It is defined as the
ratio between the minimal inter-cluster distance to maximal intra-cluster distance. For each
cluster partition, the Dunn index can be calculated by the following formula:

D = min_{1 ≤ i < j ≤ n} d(i, j) / max_{1 ≤ k ≤ n} d′(k)
where d(i,j) represents the distance between clusters i and j, and d ‘(k) measures the
intra-cluster distance of cluster k. The inter-cluster distance d(i, j) between two clusters may be
computed with any of a number of distance measures, such as the distance between the centroids
of the clusters. Similarly, the intra-cluster distance d′(k) may be measured in a variety of ways,
such as the maximal distance between any pair of elements in cluster k. Since internal criteria seek
clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce
clusters with a high Dunn index are more desirable.

Silhouette coefficient

The silhouette coefficient contrasts the average distance to elements in the same cluster
with the average distance to elements in other clusters. Objects with a high silhouette value are
considered well clustered, while objects with a low value may be outliers. This index works well
with k-means clustering and is also used to determine the optimal number of clusters.
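
As an illustration of internal evaluation (assuming scikit-learn is installed; the synthetic data are assumptions), the sketch below compares k-means clusterings for several values of k using the built-in Davies–Bouldin and silhouette scores; lower Davies–Bouldin and higher silhouette values indicate more compact, better-separated clusters. Keep in mind the caveat above that such distance-based internal criteria tend to favor distance-based algorithms.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(davies_bouldin_score(X, labels), 3),
          round(silhouette_score(X, labels), 3))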

External Evaluation

In external evaluation, clustering results are evaluated based on data that was not used
for clustering, such as known class labels and external benchmarks. Such benchmarks consist
of a set of pre-classified items, and these sets are often created by (expert) humans. Thus, the
benchmark sets can be thought of as a gold standard for evaluation. These types of evaluation
methods measure how close the clustering is to the predetermined benchmark classes.

However, it has recently been discussed whether this is adequate for real data, or only
on synthetic data sets with a factual ground truth, since classes can contain internal structure,
the attributes present may not allow separation of clusters or the classes may contain
anomalies. Additionally, from a knowledge discovery point of view, the reproduction of known
knowledge may not necessarily be the intended result.

In the special scenario of constrained clustering, where meta information (such as class
labels) is used already in the clustering process, the hold-out of information for evaluation
purposes is non-trivial. A number of measures are adapted from variants used to evaluate
classification tasks. In place of counting the number of times a class was correctly assigned to a
single data point (known as true positives), such pair counting metrics assess whether each pair
of data points that is truly in the same cluster is predicted to be in the same cluster. Some of the
measures of quality of a cluster algorithm using external criterion include:

Rand measure (William M. Rand 1971)

The Rand index computes how similar the clusters (returned by the clustering algorithm)
are to the benchmark classifications. One can also view the Rand index as a measure of the
percentage of correct decisions made by the algorithm. It can be computed using the following
formula:

RI = (TP + TN) / (TP + FP + FN + TN)
where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives. One issue with the Rand
index is that false positives and false negatives are equally weighted. This may be an
undesirable characteristic for some clustering applications. The F-measure addresses this
concern, as does the chance-corrected adjusted Rand index.

F-measure

The F-measure can be used to balance the contribution of false negatives by weighting recall
through a parameter β ≥ 0. Let precision and recall be defined as follows:

P = TP / (TP + FP),   R = TP / (TP + FN)

where P is the precision rate and R is the recall rate. We can calculate the F-measure by
using the following formula:

F_β = ((β² + 1) · P · R) / (β² · P + R)

Notice that when β = 0, F_0 = P. In other words, recall has no impact on the F-measure
when β = 0, and increasing β allocates an increasing amount of weight to recall in the final
F-measure.
Jaccard index

The Jaccard index is used to quantify the similarity between two datasets. The Jaccard
index takes on a value between 0 and 1. An index of 1 means that the two datasets are identical,
and an index of 0 indicates that the datasets have no common elements. The Jaccard index is
defined by the following formula:

J(A, B) = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN)
This is simply the number of unique elements common to both sets divided by the total
number of unique elements in both sets.

Fowlkes–Mallows index (E. B. Fowlkes & C. L. Mallows 1983)

The Fowlkes-Mallows index computes the similarity between the clusters returned by the
clustering algorithm and the benchmark classifications. The higher the value of the Fowlkes
Mallows index the more similar the clusters and the benchmark classifications are. It can be
computed using the following formula:

FM = sqrt( (TP / (TP + FP)) · (TP / (TP + FN)) )
where TP is the number of true positives, FP is the number of false positives, and FN is
the number of false negatives. The FM index is the geometric mean of the precision and recall P
and R, while the F-measure is their harmonic mean. Moreover, precision and recall are also
known as Wallace’s indices BI and BII.

Mutual Information

It is an information theoretic measure of how much information is shared between a


clustering and a ground-truth classification that can detect a non-linear similarity between two
clusterings. Adjusted mutual information is the corrected-for-chance variant of this that has a
reduced bias for varying cluster numbers.
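
A small external-evaluation sketch follows, assuming scikit-learn is installed: given gold-standard class labels and a clustering result (both label vectors here are made up for illustration), it computes the chance-corrected adjusted Rand index, the adjusted mutual information, and the Fowlkes–Mallows index.

from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             fowlkes_mallows_score)

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # external benchmark classes
pred  = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # labels produced by some clustering

print(adjusted_rand_score(truth, pred))         # chance-corrected Rand index
print(adjusted_mutual_info_score(truth, pred))  # chance-corrected mutual information
print(fowlkes_mallows_score(truth, pred))       # geometric mean of precision and recall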

Confusion matrix

A confusion matrix can be used to quickly visualize the results of a classification (or
clustering) algorithm. It shows how different a cluster is from the gold standard cluster.

Applications

Biology, computational biology and bioinformatics


➢ Plant and animal ecology
o Cluster analysis is used to describe and to make spatial and temporal comparisons of
communities (assemblages) of organisms in heterogeneous environments; it is also
used in plant systematics to generate artificial phylogenies or clusters of organisms
(individuals) at the species, genus or higher level that share a number of attributes.
➢ Transcriptomics
o Clustering is used to build groups of genes with related expression patterns (also
known as coexpressed genes) as in HCS clustering algorithm. Often such groups
contain functionally related proteins, such as enzymes for a specific pathway, or
genes that are co-regulated. High throughput experiments using expressed
sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome
annotation.
➢ Sequence analysis
o Clustering is used to group homologous sequences into gene families. This is a very
important concept in bioinformatics, and evolutionary biology in general.
➢ High-throughput genotyping platforms
o Clustering algorithms are used to automatically assign genotypes.
➢ Human genetic clustering
o The similarity of genetic data is used in clustering to infer population structures.

Medicine
➢ Medical imaging
o On PET scans, cluster analysis can be used to differentiate between different types of
tissue in a three-dimensional image for many different purposes.

Business and marketing


➢ Market research
o Cluster analysis is widely used in market research when working with multivariate data
from surveys and test panels. Market researchers use cluster analysis to partition the
general population of consumers into market segments and to better understand the
relationships between different groups of consumers/potential customers, and for use
in market segmentation, Product positioning, New product development and
Selecting test markets.
➢ Grouping of shopping items
o Clustering can be used to group all the shopping items available on the web into a set
of unique products. For example, all the items on eBay can be grouped into unique
products. (eBay doesn’t have the concept of a SKU)

World wide web


➢ Social network analysis
o In the study of social networks, clustering may be used to recognize communities
within large groups of people.
➢ Search result grouping
o In the process of intelligent grouping of the files and websites, clustering may be used
to create a more relevant set of search results compared to normal search engines
like Google.
➢ Slippy map optimization
o Flickr’s map of photos and other map sites use clustering to reduce the number of
markers on a map. This both makes the map faster and reduces the amount of visual clutter.

Computer science
➢ Software evolution
o Clustering is useful in software evolution as it helps to reduce legacy properties in
code by reforming functionality that has become dispersed. It is a form of
restructuring and hence is a way of direct preventative maintenance.
➢ Image segmentation
o Clustering can be used to divide a digital image into distinct regions for border
detection or object recognition.
➢ Evolutionary algorithms
o Clustering may be used to identify different niches within the population of an
evolutionary algorithm so that reproductive opportunity can be distributed more
evenly amongst the evolving species or subspecies.
➢ Recommender systems
o Recommender systems are designed to recommend new items based on a user’s
tastes. They sometimes use clustering algorithms to predict a user’s preferences
based on the preferences of other users in the user’s cluster.
➢ Markov chain Monte Carlo methods
o Clustering is often utilized to locate and characterize extrema in the target distribution.
➢ Anomaly detection
o Anomalies/outliers are typically - be it explicitly or implicitly - defined with respect to
clustering structure in data.

Social science
➢ Crime analysis
o Cluster analysis can be used to identify areas where there are greater incidences of
types of crime. By identifying these distinct areas or “hot spots” where a similar crime
has happened over a period, it is possible to manage law enforcement resources
more effectively.
➢ Educational data mining
o Cluster analysis is for example used to identify groups of schools or students with
similar properties.
➢ Typologies
o From poll data, projects such as those undertaken by the Pew Research Center use
cluster analysis to discern typologies of opinions, habits, and demographics that may
be useful in politics and marketing.

Others
➢ Field robotics
o Clustering algorithms are used for robotic situational awareness to track objects and
detect outliers in sensor data.
➢ Mathematical chemistry
o To find structural similarity, etc., for example, 3000 chemical compounds were
clustered in the space of 90 topological indices.
➢ Climatology
o To find weather regimes or preferred sea level pressure atmospheric patterns.
➢ Petroleum geology
o Cluster analysis is used to reconstruct missing bottom hole core data or missing log
curves to evaluate reservoir properties.
➢ Physical geography
o The clustering of chemical properties in different sample locations.

SELF-TEST EXERCISES

Review Questions:
1. Discuss the concept of cluster analysis and its importance to decision making.
2. Enumerate different categories of cluster analysis and their characteristics.
3. Discuss the recent developments on the application of cluster analysis.
4. Differentiate internal and external evaluation of cluster analysis.
5. Enumerate some practical applications of cluster analysis.

CHAPTER V
MARKET BASKET ANALYSIS

Objectives
After the completion of this module, you should be able to comprehend and demonstrate
the mastery of the following:
➢ Understand the nature and concept of market basket analysis.
➢ Assimilate the differences among categories of lifts in the context of market basket analysis.
➢ Become familiar with market basket analysis via data manipulation in Excel spreadsheets.
➢ Determine the practical applications of market basket analysis.
➢ Utilize the Solver window to facilitate data manipulation in Excel spreadsheets.

Market Basket Analysis

On a given visit to a store, a consumer’s market basket is simply the list of products
purchased by the customer. Market basket analysis is therefore concerned with extracting
information from consumers’ market baskets that can be used to increase a retailer’s profit.
Most of the time, market basket analysis draws actionable insights after looking at the
association between products bought during a given transaction.

For example, for most supermarket customers there is a positive association between
the purchases of cereal and bananas because a shopper who purchases cereal is more likely
than a typical shopper to purchase bananas. Lift is probably the most used tool in market
basket analysis. The concept of lift enables the analyst to easily identify combinations of
items (like handbags and makeup or cereal and bananas) that tend to be purchased together.
The lift for a combination of purchase items and/or day of week is defined by Equation 1:

Lift = (Actual number of times the combination occurs) / (Predicted number of times the
combination occurs if the items in the combination were independent)    (1)

A two-way product lift therefore is simply a lift involving two products and can easily be
computed in Excel. It can be generalized to situations involving the computation of lifts involving
more than two items or other transaction attributes (such as day of week). To practice computing
lift, you’ll use the superstore transaction data in the file marketbasket.xls. The figure below shows a
subset of the data. The day of the week is denoted by 1 = Monday, 2 = Tuesday, and so on, up to 7 = Sunday.
For example, the first transaction represents a person who bought vegetables, meat, and milk on
a Friday.

Market Basket Data


For the superstore data, the lift for meat and vegetables would equal:
(Actual number of transactions where meat and vegetables were purchased) / [(Total number of
transactions) × (Fraction of transactions involving meat) × (Fraction of transactions involving
vegetables)]

To be more specific, suppose that in 1,000 transactions, 300 involved a meat purchase,
400 involved a vegetable purchase, and 200 involved a purchase of meat and vegetables.
Independence of meat and vegetable purchases implies that the likelihood of a transaction
involving meat is 0.30 irrespective of a transaction involving a vegetable purchase. Thus,
independence implies that 1,000 (0.40) (0.30) = 120 transactions should involve purchase of
meat and vegetables. Because 200 transactions involved a purchase of meat and vegetables,
knowing that a transaction involves meat makes it 1.67 times (200/120) more likely that a
transaction involves vegetables. This is consistent with Equation 1, which tells you that the lift
for vegetables and meat is

200 / (1,000 × 0.40 × 0.30) = 1.67
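
The same arithmetic is easy to script. The short Python sketch below reproduces the meat-and-vegetables lift from the raw counts used above (1,000 transactions, 300 with meat, 400 with vegetables, 200 with both).

def two_way_lift(n_transactions, n_a, n_b, n_both):
    # Observed co-occurrences divided by the count expected under independence
    expected = n_transactions * (n_a / n_transactions) * (n_b / n_transactions)
    return n_both / expected

print(two_way_lift(1000, 300, 400, 200))   # 1.67 (approximately)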

Product combinations with lifts much greater than 1 indicate items tend to be purchased
together. This is valuable information for the retailer because placing products with large lifts
near each other in a store display can increase sales based on the assumption that the sales of
one product will stimulate sales of the other product. Because handbags and makeup have a
large lift, this explains why Bloomingdale’s placed handbags and makeup together. Promoting
cross-selling of products with high lifts can also stimulate profits. Therefore, in the
Bloomingdale’s example, giving a customer who purchases at least $50 of makeup a coupon for
20 percent off a handbag would likely yield increased profits.

Creating Named Ranges

Returning to the superstore data, you can now try to find the lift for all two-product
combinations. Before you begin computing all these lifts, though, it is convenient to create a few
named ranges. You can use the Name box to assign the name data to the range B9:H2936,
which contains all the transactions. Then select the range B8:H2936 and choose INSERT >
NAME > CREATE to name each column of data by its heading. For example, you can call Column
B day_week, Column C vegetables, and so on (refer to the marketbasket.xls file for the rest of the
heading names). Now perform the following steps to determine the fraction of all transactions
involving each type of product and the fraction of transactions taking place on each day of the
week. This information is needed to compute the denominator of 1.

1. In cell L7 compute the total number of transactions with the formula=COUNT(B:B). This
formula counts how many numbers occur in Column B, which gives you the number of
transactions.
2. Copy the formula =COUNTIF(INDIRECT(K9),1)/$L$7 from L9 to cells L10:L14 to compute the
fraction of transactions involving each product. Recall that COUNTIF counts the number of
entries in a range matching a given number or text string (in this case 1). Any cell reference
within an INDIRECT function is evaluated as the contents of the cell. Thus INDIRECT(K9)
becomes vegetables. This enables you to copy your COUNTIF statement and pick off the
range names. A neat trick! Thus 60.7 percent of all transactions involve vegetables, and so
on.
3. Copy the formula =COUNTIF(day_week,K17)/COUNT(day_week) from L17 to L18:L23 to
determine the fraction of all transactions occurring on each day of the week. For example,
13.9 percent of all transactions occur on Monday, and so on. These calculations will be used
in the next section when you compute three-way lifts.

Calculating Lifts for Multiple Two-way Product Combinations Simultaneously

1. Enter the range names for any two products in the range N9:O9. To ease the selection of
products you can use the drop-down list in cells N9 and O9.
2. In cell Q10 use the formula =IF(N9<>O9,VLOOKUP(N9,K9:L14,2,FALSE)*L7*VLOOKUP(O9,K9:L14,2,FALSE),0)
to compute the predicted number of transactions involving the two products assuming
independence. This formula computes the denominator of Equation 1. If you choose the same
product twice, the formula enters a 0.
3. In cell P10 use the array formula =SUM((INDIRECT(N9)=1)*(INDIRECT(O9)=1)) to compute
the number of times the combination of vegetables and fruit occur together. After typing in
the formula, press Control+Shift+Enter instead of just Enter. This formula creates two
arrays:
➢ An array containing 1 whenever the entry in the vegetable column is 1 and 0 otherwise.
➢ An array containing 1 whenever the entry in the fruit column is 1 and 0 otherwise.

This formula causes the arrays to be pairwise multiplied and then the entries in the
resulting array are added together. The pairwise multiplication yields the number of transactions
involving both fruits and vegetables (520).
4. In cell R10 compute the total lift for these categories with the formula=IF(Q10=0,1,P10/Q10).
If you chose the same item twice, simply set the lift to equal 1. Otherwise, divide actual
occurrence of fruits and vegetables together by the predicted number of occurrences
(assuming fruits and vegetables are purchased independently.)

The lift for fruits and vegetables does not indicate a lack of independence (.99 is near 1).
Taking this one step further, you can use a two-way data table to compute the lift for all two
product combinations simultaneously.

1. In cell O17 place the formula for lift (=R10) that you want to recalculate. R10 contains the lift
for a generic two-product combination.
2. Select the table range O17:U23.
3. Select What-If Analysis from the Data Tools group on the Data tab and choose Data Table.
4. From the Data Table dialog box enter N9 as the row input cell and O9 as the column input cell.
After clicking OK, you now have the lift for each two-product combination, as shown in the figure
below. For example, DVDs and baby goods have a relatively large lift of 1.4.
Computing Lifts for Market Basket Example

Computing Three-Way Lifts


To illustrate how the concept of lift applies to three or more attributes associated with a
transaction, consider calculating the lift for the purchase of baby goods and DVDs on Thursday.
This lift would be computed as follows:

(Actual number of Thursday transactions where baby goods and DVDs were purchased) /
[(Total number of transactions) × (Fraction of transactions on Thursday) × (Fraction of
transactions with baby goods) × (Fraction of transactions with DVDs)]

1. In cell Q14 use the array formula =SUM((INDIRECT(P13)=$P$14)*(INDIRECT(N13)=1)*(INDIRECT(O13)=1))
to compute the actual number of transactions involving vegetables and baby goods on Friday.
This formula computes three arrays:
➢ An array containing a 1 if the day of the week matches the number in P14 (here a 5) and a 0 otherwise.
➢ An array containing a 1 if the vegetables column contains a 1 and a 0 otherwise.
➢ An array containing a 1 if the baby column contains a 1 and a 0 otherwise.
2. For each row of data the array formula in Q14 creates a new array. Create the new array
element in any row by multiplying the three listed arrays. A 1 is obtained in a row of the
product array if and only if baby goods and vegetables were bought on Friday.
3. Sum up the entries in the product array to yield the actual number of Friday transactions
where baby goods and vegetables were purchased.
4. In cell R14 compute the predicted number of transactions involving baby goods and
vegetables purchased on Friday with the following formula:
=IF(N13<>O13,VLOOKUP(N13,K9:L14,2,FALSE)*L7*VLOOKUP(O13,K9:L14,2,FALSE)*VLOOKUP(P14,K17:L23,2),0)
5. If you enter the same product class twice, this formula yields a 0. Otherwise, multiply (total
number of transactions) * (fraction of baby goods transactions) * (fraction of vegetable transactions)
* (fraction of Friday transactions). This gives the predicted number of Friday baby goods and
vegetable transactions (assuming independence).
6. Finally, in cell S14, compute the lift with the formula =IF(R14=0,1,Q14/R14). The lift for
vegetables and baby goods on Friday is .85. This means that on Fridays vegetables and
baby goods are bought together less frequently than expected.
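
For readers who prefer scripting to array formulas, the following pandas sketch computes the same kind of three-way lift. The column names (day_week, vegetables, baby) mirror the worksheet headings, and the tiny example table is an assumption for illustration only.

import pandas as pd

def three_way_lift(df, item_a, item_b, day):
    n = len(df)
    actual = ((df["day_week"] == day) & (df[item_a] == 1) & (df[item_b] == 1)).sum()
    expected = (n
                * (df["day_week"] == day).mean()
                * (df[item_a] == 1).mean()
                * (df[item_b] == 1).mean())
    return 1.0 if expected == 0 else actual / expected

# Hypothetical mini data set (1 = Monday, ..., 5 = Friday)
df = pd.DataFrame({"day_week":   [5, 5, 2, 5, 3],
                   "vegetables": [1, 0, 1, 1, 0],
                   "baby":       [1, 0, 0, 1, 1]})
print(three_way_lift(df, "vegetables", "baby", day=5))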

Finding Three-Way Lifts



Solver Window for Maximizing Three-Way Lift

Optimizing the Three-Way Lift

In an actual situation with many products, there would be a huge number of three-way lifts. For
example, with 1,000 products, you can expect 1,000³ = 1 billion three-way lifts! Despite this, a
retailer is often interested in finding the largest three-way lifts. Intelligent use of the Evolutionary
Solver can ease this task. To illustrate the basic idea, you can use the Evolutionary Solver to
determine the combination of products and day of the week with maximum lift.

Maximum Three-Way Lift



The three-way lift indicates that roughly 6.32 times more people than would be expected under
an independence assumption buy DVDs and baby goods on Thursday. This suggests that on
Thursdays placing DVDs (often an impulse purchase) in the baby goods section will increase profits.

1. Use Evolutionary Solver with the changing cells being the day of the week (cell P14) and an
index reflecting the product classes (cells N12 and O12). Cells N12 and O12 are linked with
lookup tables to cells N13:O13. For instance, a 1 in cell N12 makes N13 display vegetables.
The figure below shows the Evolutionary Solver window.
2. Maximize lift (S14), and then choose N12 and O12 (product classes) to be integers between
1 and 6. P14 is an integer between 1 and 7.
3. Add a constraint that Q14 >= 20 to ensure you count only combinations that occur a
reasonable number of times. Set the Mutation Rate to .5.

Using Lift to Optimize Store Layout

Handbags and makeup are often purchased together. This information led
Bloomingdale’s to stimulate impulse buying by placing handbags next to makeup. This suggests
that to maximize revenues a store should be laid out so products with high lift are placed near
each other. Given a lift matrix for different product categories, you can use the Evolutionary
Solver to locate product categories to maximize the total lift of proximate product categories. To
illustrate the idea, consider a grocery store that stocks the six product categories shown below.
In rows 8 through 13, the two-way lifts are shown.

Optimizing Store Layout


1. In the cell range G16:I17, determine the locations of the product categories that maximize the
lifts of adjacent product categories. Assume that customers can travel only in a north-south or
east-west direction. This assumption is reasonable because any two store aisles are either
parallel or perpendicular to each other. This implies, for example, that location A1 is adjacent
to A2 and B1, whereas location A2 is adjacent to B2, A1, and A3.
2. Enter a trial assignment of product categories to locations in cell range G16:I17.
3. In cell range G21:G26, compute the lift for adjacent products for each location. For example, in
cell G21 the formula =INDEX(lifts,G17,G16)+INDEX(lifts,G17,H17) adds (lift for products assigned
to A1 and B1) + (lift for products assigned to A1 and A2). This gives the total lift for products
adjacent to A1. In cell G27 the formula =SUM(G21:G26) calculates the total lift generated by
adjacent product categories.
4. Use the Solver window, as shown in the figure below, to find the store layout that maximizes the
total lift for adjacent product categories.
The target cell to maximize is the total lift for adjacent products (cell G27). The changing
cells are the range G16:I17, which describes the location for each product category. The only
entered constraint is G16:I17 All Different. When a range of n changing cells is selected as All
Different, Excel assigns the changing cells the integers 1, 2, ..., n with each value assigned
exactly once. In the example, each cell in the range G16:I17 is assigned one of the integers 1
through 6, with each integer used exactly once. This is equivalent to having Excel try out all
6! = 720 permutations of the integers 1-6. Each permutation corresponds to a store layout. The
maximum sum of lifts for adjacent categories equals 16.9, and the optimal store layout is
summarized in the cell range K19:M20.
5. To make a range of changing cells All Different, select Add from the right side of the Solver
window, and fill in the Add Constraint dialog box.
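
Because the All Different constraint amounts to trying every permutation of the six categories, this small layout problem can also be solved by exhaustive enumeration. The sketch below uses an invented symmetric 6x6 lift matrix (not the worksheet values) and the same 2x3 grid of locations, A1-A3 over B1-B3. Each adjacent pair is counted once; the worksheet's per-location sums count each pair twice, which only rescales the objective and does not change the optimal layout.

```python
from itertools import permutations

# Hypothetical symmetric two-way lift matrix, indexed 0..5 by product category.
lifts = [
    [1.0, 2.1, 1.3, 0.9, 1.8, 1.1],
    [2.1, 1.0, 2.5, 1.2, 0.8, 1.6],
    [1.3, 2.5, 1.0, 2.2, 1.4, 0.7],
    [0.9, 1.2, 2.2, 1.0, 2.6, 1.5],
    [1.8, 0.8, 1.4, 2.6, 1.0, 2.0],
    [1.1, 1.6, 0.7, 1.5, 2.0, 1.0],
]

# Locations 0..5 correspond to A1, A2, A3, B1, B2, B3; adjacency is
# north-south or east-west only.
adjacent = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]

best_layout, best_total = None, float("-inf")
for layout in permutations(range(6)):          # all 6! = 720 assignments
    total = sum(lifts[layout[a]][layout[b]] for a, b in adjacent)
    if total > best_total:
        best_layout, best_total = layout, total

print(best_layout, round(best_total, 2))       # best category per location
```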

Optimizing Store Layout



SELF-TEST EXERCISES

Review Questions:

1. Discuss the nature and concept of market basket analysis.


2. Assimilate the differences among the categories of lifts in the context of market basket analysis.
3. Research the reasons why Netflix created the House of Cards TV series starring Kevin Spacey
and Robin Wright. How did the concept of lift factor into Netflix's decision to make the show?
4. How could Amazon.com use the concept of lift to make book or music recommendations?
5. A virtual lift includes a nonproduct element such as a demographic attribute (age, gender,
income, and so on). How could a retailer use knowledge of virtual lifts to increase profits?

CHAPTER VI
LINK ANALYSIS

Objectives
After the completion of this module, you should be able to comprehend and demonstrate
the mastery of the following:
➢ Understand the nature and concept of link analysis.
➢ Assimilate the advantages of using link analysis to gather customer insights.
➢ Familiarize the use of Excel spreadsheets and software to facilitate data manipulation.
➢ Assimilate the methodologies of link analysis through internet and social media websites.
➢ Comprehend link analysis terms.

Link Analysis

Link analysis considers the relationships between entities in a network. These relationships are
interesting in many contexts, including social network analysis (Knoke and Yang 2008), which
has been used to measure social relationships in settings such as social media and collaboration
networks. People (or customers) can be represented as nodes and the relationships between
them as links in a graph. In biological science, link analysis has been applied to analyze protein
interactions. It has also been applied to law enforcement and terrorism. In business, it is of
interest to marketing, especially in connection with product recommendation analysis. Amazon is
famous for its recommendation engine, supported by link analysis of customer selections.
Clickstream analysis gives Web business sites the ability to predict where customers are going,
potentially allowing systems to interact with these customers to increase the probability of a
purchase.

Link Analysis Terms

Link analysis in general provides tools for the analysis of large-scale graphs. The
effectiveness of such tools is bounded by graph dimensionality, yet large-scale network analysis
has been applied to Twitter, human neural networks, and other web graphs. Graph analysis is
challenging in terms of efficient computing, making scalability a problem. There are a number of
important concepts in the graph representation of networks. These include degree, which is the
number of connections a node has to other nodes. These connections are also referred to as
edges. Graphs can represent a number of situations.

A graph of social acquaintances would probably involve undirected links, where
relationships run both ways. Consider a network of seven people as in the table below, one of
whom (George) doesn't deal with any of the others. The figure below displays this network
graphically in bidirectional form. You can also place values on graphs. For instance, on e-business
web sites the nodes could be pages, and it would be important to know the volume of traffic on
each link. This can be used for purposes of pricing, as well as many kinds of analysis. The
second figure shows a valued graph. Note that the value in terms of traffic can depend on the
direction of the link.

Social Network Social Graph

The figure above is vastly simplified for demonstration purposes but shows how the main
flow of traffic on the site is to books, while the site allows movement from each node to all others.
Thus it demonstrates both direction and value. To demonstrate additional measures for graphs,
we generated a set of pseudo-Amazon data for 15 product categories. That data is aggregated
into the counts (degrees) for each pair of products (nodes) below.
Valued Graph

There are thus fifteen nodes. Automotive products (Auto) connect to two other products,
hard-back books (Hard) and health products (Health). Degree can be directional. Here no
direction is inherently present, so a connection could run from Auto to Health or from Health to
Auto. The bidirectional degree for Auto is thus 4. Baby products have bidirectional connections
to 11 other products, yielding a degree of 22.

Density is the number of connections divided by the number of possible connections
(disregarding loops that would create duplicates). The density for Auto in the pseudo-Amazon
data is 4/28, or 0.143. In contrast, the density for Baby is 22/28, or 0.786. The maximum density
is 1.000 for hard-back books (Hard), the only product whose purchasers (493 out of 1,000)
purchased every other product category.
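
Degree and per-node density are easy to reproduce with the networkx library. The edge list below is invented (it is not the full pseudo-Amazon table); an edge simply means two product categories were bought together, and the per-node density follows the definition above, connections divided by possible connections.

```python
import networkx as nx

# Hypothetical edge list for a handful of product nodes.
G = nx.Graph()
G.add_edges_from([
    ("Auto", "Hard"), ("Auto", "Health"),
    ("Baby", "Hard"), ("Baby", "Health"), ("Baby", "Toys"),
    ("Hard", "Health"), ("Hard", "Toys"),
])

n = G.number_of_nodes()
for node in G.nodes:
    deg = G.degree[node]                       # number of distinct neighbours
    bidirectional = 2 * deg                    # count each connection both ways, as in the text
    density = bidirectional / (2 * (n - 1))    # share of possible connections
    print(node, deg, bidirectional, round(density, 3))
```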

Connectedness refers to the degree to which the overall graph can reach all nodes. The
Krackhardt connectedness score is the sum of each node's unidirectional degree divided by the
number of nodes (n) times (n − 1). This measure has a maximum of 1.0, indicating that all nodes
connect to all other nodes, and a minimum of 0, where all nodes are completely isolated (there
are no node connections). The table below shows the degree (one-way in this case) for each node
in the pseudo-Amazon data. These fifteen degrees add to 161. The Krackhardt connectedness
score would thus be:

161/(15 × 14) = 0.767

This is a high degree of connectedness. We can also calculate density for each of the seven
participants in the social network. The connectedness of that graph equals the sum of connections
(16, the sum of the numerators in the second column of the densities table below) divided by 42,
that is, n × (n − 1) with n = 7, or 0.381.
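
The connectedness score as defined above is straightforward to compute once the one-way degree of each node is known. In the Python sketch below the fifteen degree values are placeholders chosen only so that they sum to 161, matching the pseudo-Amazon total quoted in the text; they are not the actual table values.

```python
def connectedness(degrees):
    """Krackhardt connectedness score as used above: the sum of each node's
    one-way degree divided by n * (n - 1), where n is the number of nodes."""
    n = len(degrees)
    return sum(degrees) / (n * (n - 1))

# Fifteen placeholder one-way degrees that sum to 161 (illustrative only).
degrees = [2, 11, 14, 13, 12, 10, 14, 9, 13, 12, 11, 10, 12, 9, 9]
print(round(connectedness(degrees), 3))   # 161 / (15 * 14) = 0.767
```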

Geodesic distance is the length of the shortest path between two nodes. Computing it
requires considering the paths between the two nodes in question and taking the one with the
smallest number of links. The table below gives geodesic distances (path lengths) for the social
network figure.
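
Shortest-path (geodesic) distances can be computed directly with networkx. The seven-person edge list below is a hypothetical stand-in for the acquaintance network in the figure (G is deliberately left isolated); it is not the exact set of relationships from the table.

```python
import networkx as nx

# Hypothetical seven-person acquaintance network; G has no connections.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"),
                  ("B", "E"), ("C", "F"), ("D", "E")])
G.add_node("G")

# Geodesic (shortest-path) distance between every reachable pair of nodes.
for source, lengths in nx.all_pairs_shortest_path_length(G):
    print(source, dict(lengths))

# A single pair: B -> A -> C -> F, so the geodesic distance is 3.
print(nx.shortest_path_length(G, "B", "F"))
```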

Pseudo-Amazon Node Degrees


Unidirectional degrees for Pseudo-Amazon data

Densities for social network (unidirectional)

Geodesic Distances
Betweenness centrality is a measure of the degree to which a particular node lies on the
shortest paths between other nodes in the graph. Thus it can reflect how a node controls or
mediates relations between pairs of other nodes that are not directly connected. The betweenness
centrality Cb of a node is the number of geodesics (shortest paths) between other nodes that pass
through this node, divided by the number of distinct geodesics in the graph. Considering the
linked set of nodes A through F in the figure, the table shows this calculation. Node A lies on 4
geodesics out of 15, for a betweenness centrality of 4/15 = 0.267; B 2/15 = 0.133; C 0/15 = 0.0;
D 2/15 = 0.133; E 0/15 = 0.0; and F 1/15 = 0.067.
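
networkx also provides betweenness centrality directly. Note that its built-in function uses the standard normalized definition (summing, over pairs of other nodes, the fraction of shortest paths that pass through the node), so the numbers will not exactly match the simplified "geodesics out of 15" ratios above, although the idea is the same. The edge list is the same hypothetical stand-in network used in the previous sketch, not the exact figure data.

```python
import networkx as nx

# Hypothetical six-node connected component plus the isolated node G.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"),
                  ("B", "E"), ("C", "F"), ("D", "E")])
G.add_node("G")

# Standard normalized betweenness centrality for every node; a node that lies
# on no geodesic between other nodes scores 0.0.
print(nx.betweenness_centrality(G))
```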

G connects with nothing and thus is not a member of any geodesic, so it has a
betweenness centrality of 0.0. If direction were considered, both numerator and denominator
would simply double, yielding the same betweenness calculations, since the figure was
bidirectional. If it were unidirectional, the number of "yes" values would be affected. Closeness
reflects how near a node is to the other nodes in a network.

Betweenness Calculations

This can indicate how quickly nodes can interact, with greater distances implying the
need to go through more intermediaries. Closeness centrality is based on the inverse of the sum
of geodesic distances between a node and the other nodes. If you add the noninfinite values in
the table, you get the sum of distances between all nodes (in this case 22). The closeness
centrality measure is:

Closeness Centrality = (Number of nodes − 1) / (sum of distances)

This measure decreases if the number of nodes reachable from the node in question
decreases, or if the distances between nodes increase. Here A has the most "closeness", while
G has the least.
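
The closeness formula above can be computed per node from the geodesic distances. The sketch below reuses the hypothetical seven-node network from the earlier sketches; for each node it divides the number of reachable nodes by the sum of finite distances, which matches the (n − 1)/(sum of distances) formula on a connected component, and assigns 0 to the isolated node G.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"),
                  ("B", "E"), ("C", "F"), ("D", "E")])
G.add_node("G")

# Closeness as defined above: reachable nodes divided by the sum of geodesic
# distances to them; an isolated node has no finite distances, so it gets 0.
for node in G.nodes:
    dist = nx.shortest_path_length(G, node)        # finite distances from node
    others = {k: v for k, v in dist.items() if k != node}
    closeness = len(others) / sum(others.values()) if others else 0.0
    print(node, round(closeness, 3))

# networkx's built-in nx.closeness_centrality(G) uses essentially the same
# ratio, rescaled for disconnected graphs.
```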

Betweenness Centrality

Basic Network Graphics with NodeXL

NodeXL (Network Overview, Discovery and Exploration for Excel) is a template for Microsoft Excel
that provides the ability to explore network graphs. NodeXL Excel Template opens from the
Windows Start Menu, creating a new workbook customized for network graph data. Figure below
displays the basic layout, with the NodeXL ribbon tab at the top, the Edges worksheet on the left,
and a graph pane to display networks on the right. At the bottom are tabs for the five worksheets
(Edges, Vertices, Groups, Group Vertices, and a hidden fifth worksheet, Overall Metrics). Note
that most metrics require the NodeXL Pro version or higher.

NodeXL Worksheets

Initial NodeXL screen
NodeXL Input for Social Network Example

Graph Display for Simple Social Network

NodeXL Vertices Output


NodeXL provides a simple way to work with link analysis, applicable in many contexts. It
is a bit touchy to load, and some features require investing nominal amounts of money. We now
discuss a well-developed system for applying link analysis to text data mining.

Link Analysis Application with PolyAnalyst (Olson and Shi 2007)

Text mining can be used for many applications. One such application is to analyze a
series of text messages, such as customer complaints. Let us assume a distributor of printers
deals with three products: inkjet black and white printers, inkjet color printers, and laser black
and white printers. The table below gives five such messages, providing the product, a quality
rating on a 1 (bad) to 5 (best) scale, the name of the representative involved, and comments.
The comments show some variety. In order to apply text mining, the first step is to generate a
set of key words. Some software products, such as PolyAnalyst by Megaputer, provide Text
Analyst capabilities that focus on words that appear at a high rate. The user of the system can
delete words that are not pertinent, and a file is developed that includes the sentences where
the selected key words appear.

One use of such files is to condense large documents to some percentage of their
original size, focusing on sentences containing words of interest to the user. Another use is to
identify terms of interest to the user. For instance, words such as those in bold could be selected
for link analysis. Note the ability to pick up different variants of the key terms (knowledge and
knowledgeable, break and breaks (or broken), kind and kindly). Then a data file can be
constructed using each of the key terms as a binary (yes/no, Boolean, 0/1) variable reflecting the
presence or absence of the key term in each message. The file containing these key-term
variables can include any other variables, such as the categorical variables for printer product
and representative, as well as numeric data for quality ratings. The figure below shows the screen
from PolyAnalyst for a small data set of 100 complaints. This is an initial output, in which all
variables with common entries are displayed. The link analysis utilizes all categorical and binary
data (not the numeric quality rating). Positive and negative correlations are color coded. The user
has the option of deleting either. The system uses a minimum count to include relationships. One
factor for this data set is that it was highly skewed, with 70 inkjet printers, 20 color inkjets, and
only 10 laser printers.
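
Turning free-text comments into the kind of 0/1 key-term file described above can be sketched in a few lines of Python. The complaint texts, key-word list, and column names below are all invented for illustration; a tool such as PolyAnalyst builds an equivalent table internally.

```python
import pandas as pd

# Hypothetical complaint texts and a hand-picked key-word list.
comments = [
    "The printer breaks constantly and the rep was not knowledgeable.",
    "Color cartridges smear, but the representative was kind and helpful.",
    "Cheap build, clogs often; service was abusive.",
]
keywords = ["break", "smear", "kind", "clog", "service", "knowledge"]

# One 0/1 column per key word. Simple substring matching picks up some
# variants (break/breaks, kind/kindly); a real system would also need
# stemming to catch forms such as "broken".
rows = [{kw: int(kw in text.lower()) for kw in keywords} for text in comments]
flags = pd.DataFrame(rows)

# Other variables (product, representative, quality rating) can be joined on
# before feeding the table to a link-analysis or correlation tool.
print(flags)
```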

NodeXL Overall Metrics for Simple Social Network Example
In-Degree and Out-Degree Displays from NodeXL
Betweenness Centrality Display


Clustering Coefficient Display


Therefore, the links shown between laser printers and representatives are based upon a
very small sample (and representatives like Emil had no chance to appear). This may be useful
in some cases (maybe Emil doesn't sell laser printers), but because the sample size is small, no
linkage is shown between Emil and laser printers. Since the initial display was very cluttered, the
analyst has the ability to reset the minimum count necessary for a relationship to show in the link
analysis display. The second figure below gives such an output for a minimum setting of 3
occurrences. This makes relationships much clearer. It appears that Chuck deals with a number
of inkjet printers, but his customers tend to think that they are junk. All that shows for Dennis is
that he often sells laser printers. (This could indicate that Chuck concentrates on the low end of
the business, and Dennis on the high end.) Color printers seem to have trouble with cartridges.

Printer Complaints

There are no other systematic patterns at the minimum setting of 3 between products
and key words. There are relationships among key words, in that those who complain about
cheap printers that clog also encounter abusive representatives. Those who compliment
representatives as being understanding also tend to include the word kind. Those who use the
word service often use the word smear. The identification of relationships through link analysis
can show patterns or trends that might be useful in identifying communication flaws of
representatives, or product defects.

PolyAnalyst Initial Link Analysis Output



PolyAnalyst Link Analysis Output at Minimum Setting of 3

SELF-TEST EXERCISES

Review Questions:
1. Discuss the nature and concept of link analysis.
2. Assimilate the advantages of using link analysis to gather customer insights.
3. Familiarize the use of Excel spreadsheets and software to facilitate data manipulation.
4. Discuss the methodologies of link analysis through internet and social media websites.
5. Discuss link analysis terms.
References:

➢ Camm, Cochran, Fry, Ohlmann, Anderson, Sweeney, Williams (2019). Essentials of Business
Analytics (1st Ed.). Cengage Learning.
➢ Albright, C. & Winston, W. (2005). Business Analytics: Data Analysis and Decision Making.
Cengage Learning.
➢ Bentley, D. (2007). Business Intelligence and Analytics. Library Press.
➢ Olson, D. (2017). Descriptive Data Mining. Springer Nature Singapore Pte Ltd.
