
MODULE-III

Business Intelligence & Data Mining


Overview and Techniques

Dr. Anjan Krishnamurthy


Associate Professor
Dept. of CSE, BMSIT&M

04/23/2020 1
Apache Ambari
• Apache Ambari is software for provisioning, managing, and
monitoring Hadoop clusters. The Hadoop stack it manages is sometimes described
in three layers: Core Hadoop, Essential Hadoop, and Hadoop Support.
• Many companies hire for roles that require Apache Ambari skills.
• An Apache Ambari job description might consist of developing tools, using the
methods of the technology, to monitor and support Hadoop clusters.

Apache Ambari
• Ambari enables System Administrators to:
• Provision a Hadoop Cluster
• Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
• Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop Cluster
• Ambari provides central management for starting, stopping, and reconfiguring Hadoop services
across the entire cluster.
• Monitor a Hadoop Cluster
• Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
• Ambari leverages Ambari Metrics System for metrics collection.
• Ambari leverages Ambari Alert Framework for system alerting and will notify you when your
attention is needed (e.g., a node goes down, remaining disk space is low, etc).

Advantages of Ambari
• Its flexible and scalable user interface allows a range of tools such as Pig, MapReduce,
Hive, etc. to be installed on the cluster and administers their performances in a user-
friendly fashion. Some of the key features of this technology can be highlighted as:
• Instantaneous insight into the health of the Hadoop cluster using preconfigured
operational metrics
• User-friendly configuration providing an easy step-by-step guide for installation
• Installation of Apache Ambari is possible through Hortonworks Data Platform (HDP)
• Monitoring dependencies and performances by visualizing and analyzing jobs and tasks
• Authentication, authorization, and auditing by installing Kerberos-based Hadoop clusters
• Flexible and adaptive technology fitting perfectly in the enterprise environment

Ambari Vs Zookeeper
• Therefore, though these tasks may
seem similar from a distance, actually,
these two technologies perform
different tasks on the same Hadoop
cluster making it agile, responsive,
scalable, and fault-tolerant in a big
way. 
• As an Apache Ambari Administrator,
you will be creating and managing
Ambari users and groups.
• Import users and groups from LDAP
systems into Ambari.
Ambari Installation
• To build the cluster, the Install Wizard needs some general information about it.
You must supply the fully qualified domain name (FQDN) of each
host.
• Additionally, the wizard needs access to the private key file the user created in Set Up
Passwordless SSH. This is used to locate all the hosts in the system and to access and interact
with them securely.
• The list of hostnames, one per line, can be entered using the Target Hosts text box
• Select Provide Your SSH Private Key if you want Ambari to automatically install the Ambari
Agent on all your hosts using SSH. In the Host Registration Information, you can use
the Choose File button to find the private key file matching the public key installed earlier on
all your hosts. Alternatively, you can cut and paste the key into the text box manually
• Select Perform Manual Registration if you do not wish Ambari to automatically install the
Ambari Agent
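
The one-hostname-per-line list for the Target Hosts box can be checked before it is pasted in. The following is a minimal sketch, not part of Ambari: the function names and the example.com hostnames are hypothetical, and the check is purely syntactic (it assumes an FQDN has at least two dot-separated labels and does not resolve names).

```python
import re

# One DNS label: letters, digits, hyphens; no leading/trailing hyphen; 1-63 chars.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_fqdn(name: str) -> bool:
    """Rough syntactic check (assumption: an FQDN has at least two labels)."""
    name = name.rstrip(".")
    labels = name.split(".")
    return (
        len(name) <= 253
        and len(labels) >= 2
        and all(_LABEL.match(label) for label in labels)
    )


def parse_target_hosts(text: str) -> tuple[list[str], list[str]]:
    """Split a one-host-per-line block into (valid, rejected) names."""
    valid, rejected = [], []
    for line in text.splitlines():
        host = line.strip()
        if not host:
            continue  # skip blank lines
        (valid if is_fqdn(host) else rejected).append(host)
    return valid, rejected


# Hypothetical host list, one name per line.
hosts_text = """
master1.example.com
worker1.example.com
worker2
"""
valid, rejected = parse_target_hosts(hosts_text)
print("valid:", valid)
print("rejected:", rejected)
```

Short names such as `worker2` are rejected here because the wizard asks for fully qualified names.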

Business Intelligence
• Every Business has data generated by the customers
or stakeholders that is recorded electronically.
• All this data can be analyzed and mined using special
tools and techniques to generate patterns and
intelligence.
• This reflects how the business is functioning.
• Any business organization needs to continually
monitor its business environment and its own
performance, and then rapidly adjust its future
plans.
• Key performance indicators (KPIs) or key result areas
(KRAs) are critical performance metrics for the
industries.
Business intelligence is a broad set of information
technology (IT) solutions that includes tools for gathering,
analyzing, and reporting information to the users about
performance of the organization and its environment.

History of Business Intelligence

• Started in the early 90’s, had a surge of popularity


• Stopped being used in late 90’s due to a variety of reasons, including
being over promised, the vision of BI not being supported by current
technology, and not knowing how to properly use the information
they obtain
• Recently has made a big comeback in businesses
BIDW Process Overview
BIDW Process (In-Depth)

• Raw data is stored


• Raw data are typically stored, retrieved,
and updated by an organization’s on-line
transaction processing (OLTP) system.
• Information is cleansed and optimized
• Info is cleansed and optimized for decision
support apps. It is usually “read only” and
stored on separate systems.
• Data mining, query and analytical tools
generate intelligence
• Enables companies to spot trends, enhance
business relationships, and create new
opportunities
BIDW Process Cont…

• Organizations use intelligence to make
strategic business decisions
• With this intelligence, organizations can
make effective decisions, and create
strategies and programs for competitive
advantage.
• The system is regulated by an overall
corporate security policy
• Business performance management
applications track results
• A well-run BIDW operation includes BPM
applications, which help track results of
the decisions made and the performance
of the programs created.
Gathering Business Intelligence

• Data Integration
• The process of combining two or more data sets for sharing and analysis, in
order to support information management inside a business.
• Data Mining
• The process of extracting previously unknown and useful information from
large data sets or databases.
• Mined data of a company are then stored in a “Data Warehouse”
Gathering Business Intelligence

• Data Warehousing
• A data warehouse is a special database where all the useful information is
stored in one location for easy access.
• A data mart is essentially the same as a data warehouse, but with more
specific information of predetermined select data.
• Makes for more efficient data analysis and reporting
Gathering Business Intelligence

• Business Performance Management (BPM)


• BPM is a set of processes that help organizations optimize business
performance. BPM is focused on business processes such as planning and
forecasting. It helps businesses discover efficient use of their business units,
financial, human, and material resources.
Simplicity
• You cannot assume the average business user
has the time or ability to use complex Business
Intelligence tools.
• Organizations tend to overlook the actual end
user of the business tools.
• If the user cannot properly use the tools
provided, they are worthless.
• Also, advanced users who can make sense of
the tools will spend hours writing a report and
get bogged down with information requests.
Common Mistakes & Problems
• Business Intelligence software
can be very useful for a
business if it is used correctly.
But there are common
mistakes to avoid when
implementing new business
tools.
Business Intelligence Improvements
• Data is becoming easier to access
and much more useful.
• Advances in BI make it easier for
companies to analyze data
immediately after it is collected.
• Technological advances also make
it possible to analyze data that
don’t easily fit into a traditional
database. This analysis of
“unstructured data” can give
organizations a competitive edge.
Business Intelligence Improvements
• These advances have resulted in
large part from improvements in
business intelligence software.
• Companies are moving toward more
of a self-service situation where
companies are putting BI tools in the
hands of hundreds of employees, not
just a few executives.
• BI is being added to standard
applications such as logistics and
inventory management. This
integration into regular operations
will allow managers to monitor
activity and quickly make more
informed decisions.
Uses of Business Intelligence

• Business Intelligence has significant relevance in business and government settings.


• Business Uses
• analyzing competitors
• market and industry research
• trying to gain a competitive advantage.
• Government uses
• sharing data between local and federal intelligence agencies to reduce crime levels, and lower
the risk of terrorist attacks.
• Helps further develop a uniform system of government BI resources, in which local,
state, and federal services are available online (such as the DMV, social security,
etc.)
More Uses of Business Intelligence
Excel as the default BI platform
• Excel is the most commonly used BI tool in the world. Small
businesses rely on Excel for most of their record keeping.
• Simple and easy to use, but can affect the quality and consistency of
information.
• Manual, Error-Prone Processes; data can be easily corrupted and
passed on.
• “7% of all data found in Excel spreadsheets is wrong.”
• This is an example of how useful tools can cause more problems than
they solve.
BI Software

• Cognos
• Cognos 8 BI
• Information Builders
• WebFOCUS
• Microsoft
• Maestro BI
• ClearForest
• ClearResearch
• Sun Microsystems
• Sun StorEdgeTM SAM-FS/QFS
Pattern Recognition
• A pattern is a design or model that helps grasp something
• Patterns help connect things that may not appear to be connected.
• Patterns help cut through complexity and reveal simpler understandable
trends.
• Patterns can be as definitive as hard scientific rules, like the rule that the
sun always rises in the east.
• A perfect pattern or model is one that
• accurately describes a situation,
• is broadly applicable, and
• can be described in a simple manner.
Pattern Recognition
• Patterns can be temporal, which is something that regularly occurs
over time.
• Patterns can also be spatial, such as things being organized in a
certain way.
• A spatial pattern, following the 80–20 rule, could be that the top 20 percent
of customers lead to 80 percent of the business.
• Patterns can be functional, in that doing certain things leads to certain
effects.
• A functional pattern may involve test-taking skills.
• Good patterns are often symmetric.
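
The 80–20 pattern mentioned above can be checked against actual figures. This is a minimal sketch with made-up revenue numbers; the function name `top_share` and the data are assumptions for illustration only.

```python
def top_share(revenues, top_fraction=0.2):
    """Fraction of total revenue contributed by the top `top_fraction` of customers."""
    ranked = sorted(revenues, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical revenue per customer (10 customers).
revenues = [4000, 4000, 250, 250, 250, 250, 250, 250, 250, 250]
share = top_share(revenues)
print(f"Top 20% of customers bring in {share:.0%} of revenue")
```

If the share comes out far from 80 percent, the pattern simply does not hold for that business.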
Pattern Recognition
• The economic meltdown in 2008 to 2009 was because of the collapse of the
accepted pattern, that is, “housing prices always go up.”
• Knowing the business domain well is very important.
• One can predict the traffic pattern on highways from the movement of cell phone
(in the car) locations on the highway.
• Some patterns may be so sparse that a very large amount of diverse data has to be
seen together to notice any connections.
• For instance, locating the debris of a flight that may have vanished midcourse would require
bringing together data from many sources, such as satellites, ships, and navigation systems.
• The raw data may come with various levels of quality, and may even be conflicting.
• The data at hand may or may not be adequate for finding good patterns.
• Additional dimensions of data may need to be added to help solve the problem.

Data Processing Chain
• Data is the new natural resource. Implicit in this statement is the recognition
of hidden value in data.
• Data lies at the heart of business intelligence.
• Data can be modeled and stored in a database.
• Relevant data can be extracted from the operational data stores according to
certain reporting and analyzing purposes, and stored in a data warehouse.
• The data from the warehouse can be combined with other sources of data,
and mined using data mining techniques to generate new insights.
• The insights need to be visualized and communicated to the right audience in
real time for competitive advantage.

Data Processing Chain
• Anything that is recorded is data.
• Observations and facts are data.
Anecdotes and opinions are also
data, of a different kind.
• Data can be numbers, such as
the record of daily weather or
daily sales.
• Data can be alphanumeric, such
as the names of employees and
customers.
Data
• Data could come from any number of sources.
• It could come from operational records inside an organization, and it can come from records compiled by the
industry bodies and government agencies.
• Data could come from individuals telling stories from memory and from people’s interaction in social contexts.
• Data could come from machines reporting their own status or from logs of web usage.
• Data can come in many ways.
• It may come as paper reports.
• It may come as a file stored on a computer.
• It may be words spoken over the phone.
• It may be e-mail or chat on the Internet.
• It may come as movies and songs in DVDs, and so on.
• There is also data about data. It is called metadata.
• For example, people regularly upload videos on YouTube.
• The format of the video file (whether it was a high-def file or lower resolution) is metadata.
• The information about the time of uploading is metadata.
• The account from which it was uploaded is also metadata.
• The record of downloads of the video is also metadata.

Data Types
• Data could be an unordered collection of values.
• For example, a retailer sells shirts of red, blue, and green colors.
• There is no intrinsic ordering among these color values.
• One can hardly argue that any one color is higher or lower than the other.
• This is called nominal (means names) data.
• Data could be ordered values like small, medium, and large.
• For example, the sizes of shirts could be extra-small, small, medium, and large.
• There is clarity that medium is bigger than small, and large is bigger than medium. But the differences
may not be equal.
• This is called ordinal (ordered) data.
• Another type of data has discrete numeric values defined in a certain range, with the
assumption of equal distance between the values.
• Customer satisfaction score may be ranked on a 10-point scale with 1 being lowest and 10 being highest.
• This requires the respondent to carefully calibrate the entire range as objectively as possible and place
his or her own measurement in that scale.
• This is called interval (equal intervals) data.

Data Types
• The highest level of numeric data is ratio data that can take on any numeric value.
• The weights and heights of all employees would be exact numeric values.
• The price of a shirt will also take any numeric value.
• It is called ratio (any fraction) data.
• There is another kind of data that does not lend itself to much mathematical analysis,
at least not directly.
• Such data needs to be first structured and then analyzed.
• This includes data like audio, video, and graph files, often called BLOBs (Binary Large Objects).
• These kinds of data lend themselves to different forms of analysis and mining.
• Songs can be described as happy or sad, fast-paced or slow, and so on.
• They may contain sentiment and intention, but these are not quantitatively precise.
• Datafication is a new term that means that almost every phenomenon is now being
observed and stored.
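
The four structured data types differ in which summaries are meaningful. A minimal sketch, using invented sample values, of a typical summary for each level (the variable names and data are assumptions, not from the slides):

```python
from collections import Counter
from statistics import mean

colors = ["red", "blue", "green", "blue"]                 # nominal
sizes = ["small", "large", "medium", "small", "medium"]   # ordinal
satisfaction = [7, 9, 4, 8]                               # interval (10-point scale)
prices = [19.99, 24.50, 15.00]                            # ratio

# Nominal: values can only be counted, so summarize with the most frequent one.
most_common_color = Counter(colors).most_common(1)[0][0]

# Ordinal: order is meaningful, so a median is valid once ranks are defined.
SIZE_ORDER = ["extra-small", "small", "medium", "large"]
ranks = sorted(SIZE_ORDER.index(s) for s in sizes)
median_size = SIZE_ORDER[ranks[len(ranks) // 2]]

# Interval: equal spacing is assumed, so differences and means are meaningful.
mean_satisfaction = mean(satisfaction)

# Ratio: a true zero exists, so "x times as expensive" is meaningful.
price_ratio = max(prices) / min(prices)

print(most_common_color, median_size, mean_satisfaction, round(price_ratio, 2))
```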
Data Warehousing

Need for Data Warehousing
• Integrated, company-wide view of high-quality information
(from disparate databases)
• Separation of operational and informational systems and
data (for improved performance)

IS 257 – Fall 2015


Warehouse is a Specialized DB

Standard (Operational) DB
• Mostly updates
• Many small transactions
• Mb--Gb of data
• Current snapshot
• Index/hash on p.k.
• Raw data
• Thousands of users (e.g., clerical users)

Warehouse (Informational)
• Mostly reads
• Queries are long and complex
• Gb--Tb of data
• History
• Lots of scans
• Summarized, reconciled data
• Hundreds of users (e.g., decision-makers, analysts)

Slide credit: J. Hammer


… Cont’d
• Large volume of data (Gb, Tb)
• Non-volatile
• Historical
• Time attributes are important
• Updates infrequent
• May be append-only
• Examples
• All transactions ever at WalMart
• Complete client histories at insurance firm
• Stockbroker financial information and portfolios
Slide credit: J. Hammer
History of data warehousing
• The concept of data warehousing dates back to the late 1980s
when IBM researchers Barry Devlin and Paul Murphy
developed the "business data warehouse".
• 1960s - General Mills and Dartmouth College, in a joint
research project, develop the terms dimensions and facts.
• 1970s - ACNielsen and IRI provide dimensional data marts for
retail sales.
• 1983 – Teradata introduces a database management system
specifically designed for decision support.
• 1988 - Barry Devlin and Paul Murphy publish the article "An
architecture for a business and information system" in IBM
Systems Journal, where they introduce the term "business data
warehouse".
Data Warehouse Evolution
[Figure: timeline from 1960 to 2000 (ticks at 1960, 1975, 1980, 1985, 1990, 1995, 2000). Eras: “Prehistoric Times”, “Middle Ages”, Data Revolution, Information-Based Management. Milestones: Relational Databases, PC’s and Spreadsheets, End-user Interfaces, 1st DW Article, Company DWs, “Building the DW” (Inmon, 1992), DW Confs., Data Replication Tools, Vendor DW Frameworks.]
What is a Data Warehouse?
“A Data Warehouse is a
• subject-oriented,
• integrated,
• time-variant,
• non-volatile
collection of data used in support of management
decision making processes.”
-- Inmon & Hackathorn, 1994: viz. Hoffer, Chap 11

• Data warehousing is combining data from multiple
and usually varied sources into one comprehensive
and easily manipulated database.
• Common accessing systems of data warehousing
include queries, analysis and reporting.
• Because data warehousing creates one database in
the end, the number of sources can be anything
you want it to be, provided that the system can
handle the volume, of course.
• The final result, however, is homogeneous data,
which can be more easily manipulated.



DW Definition…
• Subject-Oriented:
• The data warehouse is organized around the key subjects (or high-level
entities) of the enterprise. Major subjects include
• Customers
• Patients
• Students
• Products
• Etc.



DW Definition…
• Integrated
• The data housed in the data warehouse are defined using consistent
• Naming conventions
• Formats
• Encoding Structures
• Related Characteristics



DW Definition…
• Time-variant
• The data in the warehouse contain a time dimension so that they may be
used as a historical record of the business



DW Definition…
• Non-volatile
• Data in the data warehouse are loaded and refreshed from operational
systems, but cannot be updated by end-users



Data Warehousing -- a process

• It is a relational or multidimensional database
management system designed to support
management decision making.
• A data warehouse is a copy of transaction data
specifically structured for querying and reporting.
• Technique for assembling and managing data from
various sources for the purpose of answering
business questions, thus enabling decisions that were
not previously possible
What is Data Warehousing?

A process of transforming data into information and
making it available to users in a timely
enough manner to make a difference

[Figure: Data → Information]
OLTP vs Data Warehouse
OLTP | Warehouse (DSS)
Application Oriented | Subject Oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current up to date | Snapshot data
Isolated Data | Integrated Data
Clerical User | Knowledge User (Manager)
Few Records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/Update Access | Mostly Read (Batch Update)
No data redundancy | Redundancy present
Database Size 100MB -100 GB | Database Size 100 GB - few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets
Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational Data Store
• Logical Data Mart and @ctive Warehouse
• Three-Layer architecture

All involve some form of extraction, transformation and loading (ETL)
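
The ETL step common to all of these architectures can be sketched with two in-memory SQLite databases. This is an illustrative sketch only: the table names (`orders`, `sales_fact`), columns, and sample rows are assumptions, and a real warehouse load would handle far more cases.

```python
import sqlite3

# Hypothetical operational (OLTP) store with detailed order rows.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE orders (id INTEGER, order_date TEXT, region TEXT, amount_cents INTEGER)"
)
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2020-04-01", "north", 1250),
        (2, "2020-04-01", "NORTH ", 750),   # inconsistent encoding of the same region
        (3, "2020-04-02", "south", 2000),
    ],
)

# Extract: read the raw rows from the source system.
rows = oltp.execute("SELECT order_date, region, amount_cents FROM orders").fetchall()

# Transform: standardize the region encoding and convert cents to currency units.
cleaned = [(d, r.strip().lower(), cents / 100) for d, r, cents in rows]

# Load: append the cleaned rows into a warehouse fact table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (order_date TEXT, region TEXT, amount REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", cleaned)

# A typical warehouse query: summarized sales by day and region.
daily = dw.execute(
    "SELECT order_date, region, SUM(amount) FROM sales_fact "
    "GROUP BY order_date, region ORDER BY order_date, region"
).fetchall()
print(daily)
```

The transform step is where the "integrated" property of a warehouse is enforced: both spellings of the north region end up under one consistent value.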



Data Warehouse Architecture

[Figure: layered architecture, bottom to top: multiple Sources → Integration → Warehouse (with Metadata) → Query & Analysis → Clients]
Strategic uses of data warehousing
Industry | Functional areas of use | Strategic use
Airline | Operations; marketing | Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
Banking | Product development; Operations; marketing | Customer service, trend analysis, product and service promotions, reduction of IS expenses
Credit card | Product development; marketing | Customer service, new information service, fraud detection
Health care | Operations | Reduction of operational expenses
Investment and Insurance | Product development; Operations; marketing | Risk management, market movements analysis, customer tendencies analysis, portfolio management
Retail chain | Distribution; marketing | Trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel
Telecommunications | Product development; Operations; marketing | New product and service promotions, reduction of IS budget, profitability analysis
Personal care | Distribution; marketing | Distribution decisions, product promotions, sales decisions, pricing policy
Public sector | Operations | Intelligence gathering



Advantages of Warehousing Approach

• High query performance


• But not necessarily most current information
• Doesn’t interfere with local processing at sources
• Complex queries at warehouse
• OLTP at information sources
• Information copied at warehouse
• Can modify, annotate, summarize, restructure, etc.
• Can store historical information
• Security, no auditing
• Has caught on in industry
Slide credit: J. Hammer
Disadvantages of data warehouses

• Data warehouses are not the optimal environment for
unstructured data.
• Because data must be extracted, transformed and loaded into the
warehouse, there is an element of latency in data warehouse
data.
• Over their life, data warehouses can have high costs.
Maintenance costs are high.
• Data warehouses can get outdated relatively quickly. There is a
cost of delivering suboptimal information to the organization.
• There is often a fine line between data warehouses and
operational systems. Duplicate, expensive functionality may be
developed. Or, functionality may be developed in the data
warehouse that, in retrospect, should have been developed in the
operational systems and vice versa.
Data Marts
• A data mart is a scaled down version of a data warehouse that focuses on
a particular subject area.
• A data mart is a subset of an organizational data store, usually oriented to a
specific purpose or major data subject, that may be distributed to support
business needs.
• Data marts are analytical data stores designed to focus on specific business
functions for a specific community within an organization.
• Usually designed to support the unique business requirements of a specified
department or business process
• Implemented as the first step in proving the usefulness of the technologies
to solve business problems
Reasons for creating a data mart
• Easy access to frequently needed data
• Creates collective view by a group of users
• Improves end-user response time
• Ease of creation in less time
• Lower cost than implementing a full Data warehouse
• Potential users are more clearly defined than in a full Data warehouse
From the Data Warehouse to Data Marts

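[Figure: pyramid running from Data (bottom) to Information (top). Levels, bottom to top: Organizationally Structured (the Data Warehouse), Departmentally Structured, Individually Structured. A side scale runs from More (bottom) to Less (top) history, normalized, and detailed data.]
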
Warehouse vs. Data Mart



Warehousing and Industry
• Data Warehousing is big business
• $2 billion in 1995
• $3.5 billion in early 1997
• Predicted: $8 billion in 1998 [Metagroup]
• Wal-Mart said to have the largest warehouse
• 1000-CPU, 583 Terabyte, Teradata system
(InformationWeek, Jan 9, 2006)
• “Half a Petabyte” in warehouse (Ziff Davis Internet, October
13, 2004)
• 1 billion rows of data or more are updated every day
(InformationWeek, Jan 9, 2006)
• Reported to be 2.5 Petabytes in 2008
• http://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen



Other Large Data Warehouses

(InformationWeek, Jan 9, 2006)


More Information on DW
• Agosta, Lou, The Essential Guide to Data Warehousing. Prentice Hall
PTR, 1999.
• Devlin, Barry, Data Warehouse, from Architecture to Implementation.
Addison-Wesley, 1997.
• Inmon, W.H., Building the Data Warehouse. John Wiley, 1992.
• Widom, J., “Research Problems in Data Warehousing.” Proc. of the 4th
Intl. CIKM Conf., 1995.
• Chaudhuri, S., Dayal, U., “An Overview of Data Warehousing and OLAP
Technology.” ACM SIGMOD Record, March 1997.



Information Week magazine
https://www.informationweek.com/big-data-analytics.asp

Gigabit Magazine
https://www.gigabitmagazine.com/top10/top-10-biggest-data-centres-world

Data Mining
• Art and science of discovering useful novel patterns from data
• E.g. seasonality of products
• E.g. customer segments with unique needs
• Supervised learning (right answer is known)
• Decision-making, e.g. approve loan or not
• Predictive patterns, e.g. sales next month
• Exploratory patterns (no right answer)
• Clusters, e.g. customer segments
• Association rules, e.g. products that sell together

Data Mining
Characteristics
• Selecting the right business problem is key
• High value problem
• Data should exist to solve the problem
• Data is the most critical ingredient for DM
• May include soft/unstructured data in addition to
structured (rectangular) data
• Data miner can be an analyst or the end user
• Striking it rich requires creative thinking
• Need effective and easy data mining tools

Data Mining – Major Techniques
Supervised Learning (predictive ability based on past data)
• Classification – Machine Learning: Decision Trees, Neural Networks, Support Vector Machines, Naïve Bayes
• Classification – Statistics: Regression

Unsupervised Learning (exploratory analysis to discover patterns)
• Clustering Analysis: K-Means
• Association Rules: Apriori

What is data mining
• Data mining is the art and science of discovering
knowledge, insights and patterns in data.
• Predicting winning chances of a sports team
• Identifying friends and foes in warfare
• Forecasting rainfall patterns in a country or region
• Patterns must be valid, novel, potentially useful,
understandable
• E.g. “customers who buy cheese and milk also buy bread
90% of the time”
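
A rule such as the cheese-and-milk example is usually judged by its support (how often the items occur together) and its confidence (how often the rule holds when its left side occurs). A minimal sketch with five invented baskets; the function names and data are assumptions, and the resulting confidence is specific to this toy data, not the 90% quoted above.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
transactions = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread", "eggs"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"bread"},
]

rule_support = support(transactions, {"cheese", "milk", "bread"})
rule_confidence = confidence(transactions, {"cheese", "milk"}, {"bread"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```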

Why Data Mining
• Recognition of hidden value in data
• Field developed to help in science and defense
• Evolved to help develop competitive advantage in
business, fast, and at a global scale
• Ability to effectively gather quality data and
efficiently process it
• Availability of vast amounts of data on customers,
vendors, transactions, Web, machines, etc
• Technologies for consolidation and integration of data
sources into data warehouses
• Exponential increase in computing and storage
capabilities, and exponential decrease in costs
Supervised vs. unsupervised Learning
• Supervised learning: classification is seen as
supervised learning from examples.
• Supervision: The data (observations, measurements, etc.)
are labeled with pre-defined classes. It is as if a
“teacher” gives the classes.
• Test data are classified into these classes too, and
predictive accuracy is checked.
• Unsupervised learning: e.g. clustering
• Class labels of the data are unknown
• Given a set of data, the task is to establish the existence of
classes or clusters in the data

Supervised learning process: two steps
Learning (training): Learn a model using the training data
Testing: Test the model using unseen test data to assess the model accuracy

Accuracy = (Number of correct classifications) / (Total number of test cases)
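
The two steps can be sketched with a deliberately simple one-feature classifier. The threshold rule, the function names, and the loan data below are all assumptions chosen to keep the example short; they stand in for whatever model is actually being trained.

```python
def train_threshold(examples):
    """Learn a one-feature rule: the midpoint between the two class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(threshold, x):
    """Predict class 1 when the feature exceeds the learned threshold."""
    return 1 if x > threshold else 0


# Hypothetical labeled data: (monthly income in thousands, loan approved?)
train = [(20, 0), (25, 0), (30, 0), (60, 1), (70, 1), (80, 1)]
test = [(22, 0), (35, 0), (45, 1), (75, 1)]  # unseen during training

# Step 1, learning: fit the model on the training data only.
threshold = train_threshold(train)

# Step 2, testing: count correct classifications on the unseen test data.
correct = sum(predict(threshold, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"threshold={threshold}, accuracy={accuracy}")
```

Keeping the test cases out of training is what makes the accuracy figure an honest estimate.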

Data mining methods/goals
• Decision Trees
• Popular, easy to use, machine learning technique
• Regression Analysis
• Statistical Technique to predict
• Artificial Neural Networks
• Sophisticated, versatile machine-learning technique
• Clustering
• Identifying a set of similarity groups in the data
• Association rules
• Discovering rules of the form X → Y, where X and Y are
sets of data items.
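
Clustering can be illustrated with a bare-bones k-means on single numbers. This is a sketch under simplifying assumptions: the function name, the starting centers, and the spend figures are invented, and it runs a fixed number of iterations rather than testing for convergence.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Each new center is the mean of its cluster; keep the old center if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spend per customer; two starting centers chosen by hand.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
print(centers)
print(clusters)
```

No class labels are supplied; the two customer segments emerge from the data alone.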
Confusion Matrix
Confusion Matrix | True Class: Positive | True Class: Negative
Predicted Class: Positive | True Positive (TP) | False Positive (FP)
Predicted Class: Negative | False Negative (FN) | True Negative (TN)

Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN).
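
The four cells and the accuracy formula can be computed directly from paired actual and predicted labels. A minimal sketch; the function name and the label lists are assumptions for illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical true classes and model predictions for eight test cases.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

c = confusion_counts(actual, predicted)
accuracy = (c["TP"] + c["TN"]) / (c["TP"] + c["TN"] + c["FP"] + c["FN"])
print(dict(c), accuracy)
```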

Standard Data Mining Process
Generic Steps
• Understand the application
domain
• Identify data sources and
select target data
• Pre-process: cleaning,
attribute selection
• Data mining to extract
patterns or models
• Post-process: identifying
interesting or useful patterns
• Incorporate patterns in real
world tasks

(CRISP-DM)
Data Preparation – A Critical Task
• Quality of data is key to data
mining effectiveness
• Breadth of data
• Structure / Schema
• Sparse / Missing values
• Information density
• Extract, Transform, Load (ETL)
process
• Scripts for automation
• From operational to Data
Warehouses

[Figure: pipeline from Real-world Data to Well-formed Data in four stages.
Data Consolidation: collect data, select data, integrate data.
Data Cleaning: impute missing values, reduce noise in data, eliminate inconsistencies.
Data Transformation: normalize data, discretize/aggregate data, construct new attributes.
Data Reduction: reduce number of variables, reduce number of cases, balance skewed data.]
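
Two of the preparation steps, imputing missing values and normalizing, can be sketched in a few lines. Mean imputation and min-max scaling are only one common choice for each step; the function names and the sample ages are assumptions.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute with one missing value.
ages = [20, None, 40, 60]
filled = impute_mean(ages)           # data cleaning
scaled = min_max_normalize(filled)   # data transformation
print(filled)
print(scaled)
```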

Comparison of Popular Data Mining Platforms
Feature | Excel | IBM SPSS Modeler | Weka
Ownership | Commercial | Commercial, expensive | Open-source, free
Data Mining Features | Limited; extensible with add-on modules | Extensive features, unlimited data sizes | Extensive, performance issues with large data
Stand-alone | Stand-alone | Embedded in BI software suites | Stand-alone
User skills needed | End-users | For skilled BI analysts | Skilled BI analysts
User interface | Text and click, Easy | Drag & Drop use, colorful, beautiful GUI | GUI, mostly b&w text output
Data formats | Industry-standard | Variety of data sources accepted | Proprietary
Data in Data Mining
• Data: a collection of facts usually obtained as the result of
experiences, observations, or experiments
• Data may consist of numbers, words, images, …
• Data: lowest level of abstraction (from which information
and knowledge are derived)
[Figure: taxonomy of data. Data splits into Categorical (Nominal, Ordinal) and Numerical (Interval, Ratio).]

Data Mining Best Practices
• Asking the right business questions.
• Creative and open in proposing imaginative hypotheses
• Data should be clean and of high quality
• Continuously engaging with the data
• Dissemination and rollout of the solution

Data Mining Wisdom: Myths
• Data mining …
• provides instant solutions/predictions
• is not yet viable for business applications
• requires a separate, dedicated database
• can only be done by those with advanced degrees
• is only for large firms that have lots of customer data
• is just another name for good old statistics

Data Mining Wisdom: Common Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is
and what it really can/cannot do
3. Not leaving sufficient time for data acquisition,
selection and preparation
4. Looking only at aggregated results and not at
individual records/predictions
5. Being sloppy about keeping track of the data
mining procedure and results

Data Mining Wisdom: Common Mistakes
6. Ignoring suspicious (good or bad) findings and
quickly moving on
7. Running mining algorithms repeatedly and blindly,
without thinking about the next stage
8. Naively believing everything you are told about
the data
9. Naively believing everything you are told about
your own data mining analysis
10. Measuring your results differently from the way
your sponsor measures them

Dimensions of Data Mining
• DM Inputs
• Data Domains (industry, function, etc)
• Types of Data field (categorical, numerical, blobs)
• Data sources (operations, web)
• Data quality (missing values, outliers)
• DM Outputs/Goals
• Objective functions (prediction, cluster definition etc)
• Output description types (trees, rules, etc)
• Data representation types
• DM Processes
• Methods (Classification, Clustering, etc.)
• Statistical vs. AI/machine learning
• Algorithm types (decision trees, rules, neural nets, etc.)
• Reliability/Accuracy of results (ROC, Confusion matrix)
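As a small illustration of the reliability measures mentioned in the last bullet, the sketch below builds a two-class confusion matrix and derives accuracy plus the two rates that a ROC curve plots against each other (true positive rate versus false positive rate). The labels in `actual` and `predicted` are made-up values for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    counts = Counter(TP=0, FP=0, FN=0, TN=0)
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def accuracy(cm):
    """Share of all cases that were classified correctly."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())


def true_positive_rate(cm):
    """Share of actual positives that were found (y-axis of a ROC curve)."""
    return cm["TP"] / (cm["TP"] + cm["FN"])


def false_positive_rate(cm):
    """Share of actual negatives wrongly flagged (x-axis of a ROC curve)."""
    return cm["FP"] / (cm["FP"] + cm["TN"])


# Hypothetical ground truth and model output.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

cm = confusion_matrix(actual, predicted, positive="yes")
print(dict(cm))
print(accuracy(cm), true_positive_rate(cm), false_positive_rate(cm))
```

A full ROC curve repeats the two rate calculations at many decision thresholds; this sketch shows a single point on that curve.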
Review Questions
1. Describe the business intelligence and data mining cycle.
2. Describe the data processing chain.
3. What are the similarities between diamond mining and data mining?
4. What are the different data mining techniques? Which of these would
be relevant in your current work?
5. What is a dashboard? How does it help?
6. Create a visual to show the weather pattern in your city. Could you
show temperature, humidity, wind, and rain/snow together over a
period of time?

Data Visualization

What is visualization and data mining?

• Visualize: “To form a mental vision, image, or picture of
(something not visible or present to the sight, or of an
abstraction); to make visible to the mind or imagination.”
• Visualization is the use of computer graphics to create
visual images which aid in the understanding of complex,
often massive representations of data.
• Visual Data Mining is the process of discovering implicit
but useful knowledge from large data sets using
visualization techniques.
Tables vs graphs

A table is best when:
• You need to look up specific values
• Users need precise values
• You need to precisely compare related values
• You have multiple data sets with different units of measure

A graph is best when:
• The message is contained in the shape of the values
• You want to reveal relationships among multiple values
(similarities and differences)
• You want to show general trends
• You have large data sets

• Graphs and tables serve different purposes. Choose the
appropriate data display to fit your purpose.
Box Plots

• In some situations we have, not a single data value at a point,
but a number of data values, or even a probability distribution
• When might this occur?
• Tukey proposed the idea of a boxplot to visualize the
distribution of values
• For explanation and some history, see:
http://mathworld.wolfram.com/Box-and-WhiskerPlot.html
http://en.wikipedia.org/wiki/Box_plot

Legend: M – median; Q1, Q3 – quartiles; Whiskers – up to
1.5 * interquartile range beyond the quartiles; Dots – outliers
[Figure: Darwin’s plant study]
http://www.upscale.utoronto.ca/GeneralInterest/Harrison/Visualisation/Visualisation.html
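The quantities in the box plot legend (median, quartiles, whiskers at 1.5 * interquartile range, outliers) can be computed directly. The sketch below is one common convention, assuming the "inclusive" quartile method of Python's `statistics.quantiles`; plotting tools differ in how they interpolate quartiles, so their numbers may vary slightly. The sample values are invented.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the numbers a Tukey-style box plot would draw."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical measurements with one unusually large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(sample)
print(summary)
```

Here the interquartile range is 4, so anything above 8 + 6 = 14 is drawn as a dot rather than being covered by the whisker.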
Distribution visualisation – US Crime Story
Data Visualization – Common Display Types

– Bar Charts
– Line Charts
– Pie Charts
– Bubble Charts
– Stacked Charts
– Scatterplots
When to use which type?
Line Graph
– x-axis requires quantitative variable
– Variables have contiguous values
– Familiar/conventional ordering among ordinals
Bar Graph
– Comparison of relative point values
Scatter Plot
– Convey overall impression of relationship between two variables
(example figure annotated with R² = 0.87)
Pie Chart
– Emphasizing differences in proportion among a few numbers
Line Graph – Trend visualization

• Fundamental technique of data presentation
• Used to compare two variables
– X-axis is often the control variable
– Y-axis is the response variable
• Good at:
– Showing specific values
– Trends
– Trends in groups (using multiple line graphs)
[Figures: Students participating in sporting activities; Mobile phone use]

Note: graph labelling is fundamental


Time line graph – show dynamics of measurements
Stratified graphs

• Trends of values with respect to time and different qualitative
categories
Demo – Baby Names Voyager

http://www.babynamewizard.com/voyager
Scatter Plot – XY Scatter Charts

• Used to present measurements of two variables
• Effective if a relationship exists between the two variables

[Figure: Car ownership by household income]
[Figure: Example taken from NIST Handbook – evidence of strong
positive correlation]
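The strength of the linear relationship seen in a scatter plot is usually summarised by the Pearson correlation coefficient r (and its square, R²). The sketch below computes it from the definition; the income and car-ownership numbers are invented for illustration and are not taken from the figures.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(spread_x * spread_y)


# Hypothetical households: income (in thousands) and number of cars owned.
income = [10, 20, 30, 40]
cars = [1, 3, 2, 4]

r = pearson_r(income, cars)
print(f"r = {r:.2f}, R^2 = {r * r:.2f}")
```

Values of r near +1 correspond to points lying close to a rising straight line; values near 0 mean the plot shows no linear trend, although a non-linear relationship may still be visible.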
Simple Representations – Bar Graph

• Bar graph
– Presents categorical variables
– Height of bar indicates value
– Double bar graph allows comparison
– Note spacing between bars
– Can be horizontal (when would you use this?)

[Figure: Number of police officers]
[Figure: Internet use at a school – note more space for labels]
Dot Graph

• Very simple but effective…
• Horizontal to give more space for labelling
Bad Visualization: Spreadsheet

Year Sales
1999 2,110
2000 2,105
2001 2,120
2002 2,121
2003 2,124

[Chart: Sales by year, 1999 to 2003, drawn on a y-axis running from 2095 to 2130]
What is wrong with this graph?
Bad Visualization: Spreadsheet with misleading Y-axis

[Table and chart repeated from the previous slide: Sales by year, 1999 to 2003,
on a y-axis running from 2095 to 2130]

Y-axis scale gives WRONG impression of big change
Better Visualization

Year Sales
1999 2,110
2000 2,105
2001 2,120
2002 2,121
2003 2,124

[Chart: Sales by year on a y-axis running from 0 to 3000]

An axis that starts from 0 gives the correct impression of small change
(plus small formatting tricks)
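The effect of the truncated axis can be put in numbers by comparing how tall two bars are drawn with how different the values really are. The sketch below uses the sales figures and the two axis ranges from these slides (2095 to 2130, and 0 to 3000); the helper names are made up for the example.

```python
def drawn_height(value, axis_min, axis_max):
    """Fraction of the plot height that a bar for `value` occupies."""
    return (value - axis_min) / (axis_max - axis_min)


def visual_ratio(low, high, axis_min, axis_max):
    """How many times taller the bar for `high` looks than the bar for `low`."""
    return drawn_height(high, axis_min, axis_max) / drawn_height(low, axis_min, axis_max)


sales = {1999: 2110, 2000: 2105, 2001: 2120, 2002: 2121, 2003: 2124}

true_ratio = sales[2003] / sales[2000]                        # under 1% growth
truncated = visual_ratio(sales[2000], sales[2003], 2095, 2130)
zero_based = visual_ratio(sales[2000], sales[2003], 0, 3000)

print(f"actual ratio 2003/2000:    {true_ratio:.3f}")
print(f"looks like (axis 2095+):   {truncated:.3f}")
print(f"looks like (axis from 0):  {zero_based:.3f}")
```

With the truncated axis the 2003 bar is drawn almost three times as tall as the 2000 bar, although sales grew by less than one percent; with the zero-based axis the drawn ratio matches the real one.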
Integrating various graphs
Pie Chart

• Pie chart summarises a set of categorical/nominal data
• But use with care…
• … too many segments are harder to compare than in a bar chart

[Figures: Should we have a long lecture?; Favourite movie genres]


Visualizing in 4+ Dimensions

• Extensions of Scatterplots
• Parallel Coordinates
• Radar Figures
• Other tools
• …
Multiple Views

Give each variable its own display

  A B C D E
1 4 1 8 3 5
2 6 3 4 2 1
3 5 7 2 4 3
4 2 6 3 1 5

[Figure: one small display per variable, labelled A to E]
Problem: does not show correlations
Tableau bar comparisons
Business Analytics Tools – Manager Dashboards
Parallel Coordinates

• Encode variables along a horizontal row
• Vertical line specifies values

[Figures: Dataset in Cartesian coordinates; same dataset in parallel coordinates]
Invented by Alfred Inselberg while at IBM
Parallel Coordinates: 4 D

Sepal length | Sepal width | Petal length | Petal width
5.1 | 3.5 | 1.4 | 0.2

[Figure: the same observation drawn as a line across four parallel axes]
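A parallel coordinates plot turns each observation into a polyline: one vertex per variable, placed on that variable's vertical axis at a height given by the value. The sketch below computes those vertices without drawing anything. The first row is the observation shown on the slide; the other two rows are extra observations assumed here only so that each axis has a range to scale against.

```python
AXES = ["sepal length", "sepal width", "petal length", "petal width"]

ROWS = [
    (5.1, 3.5, 1.4, 0.2),  # observation from the slide
    (7.0, 3.2, 4.7, 1.4),  # assumed extra observation
    (6.3, 3.3, 6.0, 2.5),  # assumed extra observation
]


def to_polylines(rows):
    """Map each row to a list of (axis_index, height) points, height in [0, 1]."""
    columns = list(zip(*rows))
    ranges = [(min(col), max(col)) for col in columns]
    polylines = []
    for row in rows:
        points = []
        for axis_index, (value, (lo, hi)) in enumerate(zip(row, ranges)):
            # Scale each variable independently so every axis uses its full height.
            height = 0.5 if hi == lo else (value - lo) / (hi - lo)
            points.append((axis_index, height))
        polylines.append(points)
    return polylines


polylines = to_polylines(ROWS)
for row, line in zip(ROWS, polylines):
    print(row, "->", [(AXES[i], round(h, 2)) for i, h in line])
```

Because every axis is scaled separately, variables with very different units can share one plot, which is what lets the technique go beyond three dimensions.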
Parallel Coordinates Plots for Iris Data
Radar Figures

• Aggregate multidimensional observations
• Each observation gets a separate colour or graph symbol
• Variables correspond to angles

[Figure labels: Selected domain; Radar chart – indicator ratings
within the domain; Assessment level I]
F. Nightingale (1856) – abstract representation
Business Analytics Tools – Typical Reports

[Figures: More traditional report; Other forms]


Business Analytics Tools – Manager Dashboards
Bars in business dashboards – Tableau Software
Data analytics – manager dashboards
Multidimensional Stacking
Multidimensional presentation of nominal attributes

• VL1 diagrams (Michalski 70) for machine learning

STAGGER and concept drift


Hierarchical visualizations – Treemaps

• Treemaps display hierarchical data using rectangles. Each branch of the tree
is assigned a rectangle. Then each sub-branch gets assigned to a rectangle
and this continues recursively until a leaf node is found.
• Depending on choice the rectangle representing the leaf node is colored,
sized or both according to chosen attributes.
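The recursive assignment of rectangles described above can be sketched with the simple "slice-and-dice" layout, which splits a rectangle among the children in proportion to their size and alternates the split direction at each level. This is only one of several treemap layouts, and the tree, the `size` attribute, and the 100 x 100 canvas are assumptions for the example.

```python
def total_size(node):
    """Size of a leaf, or the sum of the sizes of all leaves below a branch."""
    if "children" in node:
        return sum(total_size(child) for child in node["children"])
    return node["size"]


def treemap(node, x, y, width, height, split_horizontally=True, layout=None):
    """Assign each leaf a rectangle (x, y, width, height) inside the given area."""
    if layout is None:
        layout = {}
    children = node.get("children")
    if not children:
        layout[node["name"]] = (x, y, width, height)  # leaf reached: stop recursing
        return layout
    total = total_size(node)
    offset = 0.0  # fraction of the parent rectangle already handed out
    for child in children:
        share = total_size(child) / total
        if split_horizontally:
            treemap(child, x + offset * width, y, share * width, height, False, layout)
        else:
            treemap(child, x, y + offset * height, width, share * height, True, layout)
        offset += share
    return layout


# Hypothetical hierarchy: one leaf and one branch with two leaves.
tree = {
    "name": "root",
    "children": [
        {"name": "a", "size": 6},
        {"name": "b", "children": [{"name": "b1", "size": 3}, {"name": "b2", "size": 1}]},
    ],
}

layout = treemap(tree, 0, 0, 100, 100)
for name, rect in layout.items():
    print(name, rect)
```

Each leaf ends up with an area proportional to its size; colouring the rectangles by a second attribute, as the slide mentions, would be a separate step on top of this layout.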
Gapminder – Motion Charts

Using bubble presentations
http://www.gapminder.org/


Hierarchical Techniques

Cone Trees [RMC91]

• animated 3D visualizations of hierarchical data
• file system structure visualized as a cone tree
Abstract → Hierarchical Information – Preview

[Figure panels: Traditional; Treemap; Hyperbolic Tree; Botanical; ConeTree; SunTree]
Visualization of Search Results & Inter-Document Similarities
Abstract → Text – MetaSearch Previews

[Figure panels: Grokker; Kartoo; MSN; Lycos; AltaVista; MetaCrystal → searchCrystal]
Brushing and Linking
Census Data
Visualization of Association Rules in SGI/MineSet 3.0

IBM Miner – visualization of mining results
SGI – other tools
Graph-based Techniques

Narcissus
• Visualization of a large number of web pages
• Visualization of complex, highly interconnected data
Visualization of knowledge discovery process

• A graphical tool for arranging components / steps of KDD
• Just a graph flow of actions
• Graphical objects – plug and place
• Parametrization
• Often you may produce a kind of script representing a
graphical flow of the KD process
Statsoft – Data mining graphical panel