UNIT 04 - Data Science - Final

Lecture PPT on
Data Science (Open Elective)

(BCSL425)
6th Semester
By
Prof. Amresh Kumar

(Asst. Professor, CSE)
Department of Computer Science & Engineering

Session: 2017-18 (Even)
G.H. Raisoni College of Engineering, Nagpur

(An autonomous Institute under UGC act 1956 & Affiliated to Rashtrasant Tukadoji Maharaj Nagpur University, Nagpur)
All Rights Reserved, Copyright © 2018 Prof. Amresh Kumar, GHRCE, Nagpur
Unit 04
Data Warehousing And Mining
Topics
Introduction to Data Warehouse, Data Warehouse Architecture,
Data Warehouse Models, Need for Data Warehousing, OLTP and
OLAP system design, Introduction to data mining, KDD Process,
Relational Vs Non-Relational databases.
Outline
1. Introduction to Data Warehouse,
2. Data Warehouse Architecture,
3. Data Warehouse Models,
4. Need for Data Warehousing,
5. OLTP and OLAP system design,
6. Introduction to data mining,
7. KDD Process,
8. Relational Vs Non-Relational databases.
Introduction to Data Warehouse
Introduction
• The term "Data Warehouse" was first coined by Bill Inmon in 1990.
• Data Warehouse:
o They are central repositories of integrated data from one or more disparate
sources.
o They store current and historical data in one single place, which is used for
creating analytical reports for workers throughout the enterprise (helping the
organization to analyze its business).
o A data warehouse helps executives to organize, understand, and use their data
to make strategic business decisions.
• Data warehousing:
o Data warehousing is the process that uses Data Warehouse to analyze and
transform data into information, thereby enabling the business to examine its
operations and performance.
o Data warehousing is the process of designing, constructing, and using data
warehouses, including the ETL process (Extraction, Transformation, Loading) and
reporting.
Introduction to Data Warehouse Conti…
Data Warehouse Applications
• As discussed before, a data warehouse helps business executives to
organize, analyze, and use their data for decision making.
• A data warehouse serves as part of a plan-execute-assess "closed-loop"
feedback system for enterprise management.
• Data warehouses are widely used in the following fields:

o Financial services
o Banking services
o Consumer goods
o Retail sectors
o Controlled manufacturing
Types of Data Warehouse Applications
Information processing, analytical processing, and data mining are the three
types of data warehouse applications that are discussed below:
1. Information Processing //Reporting
A data warehouse allows the data stored in it to be processed. The data can be
processed by means of querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs.
2. Analytical Processing //OLAP Analysis
A data warehouse supports analytical processing of the information
stored in it. The data can be analyzed by means of basic OLAP operations,
including slice-and-dice, drill down, drill up, and pivoting.
3. Data Mining
Data mining supports knowledge discovery by finding hidden patterns
and associations, constructing analytical models, performing
classification and prediction. These mining results can be presented using
visualization tools.
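The basic OLAP operations listed under Analytical Processing (slice-and-dice, drill down, drill up, pivoting) can be sketched on a tiny in-memory fact table. This is only an illustration in plain Python: the records, the field names (`region`, `quarter`, `product`, `sales`) and the helper functions are made up here and do not correspond to any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact records: one row per (region, quarter, product) with a sales measure.
FACTS = [
    {"region": "East", "quarter": "Q1", "product": "Pen", "sales": 100},
    {"region": "East", "quarter": "Q2", "product": "Pen", "sales": 120},
    {"region": "West", "quarter": "Q1", "product": "Pen", "sales": 80},
    {"region": "West", "quarter": "Q1", "product": "Ink", "sales": 50},
    {"region": "East", "quarter": "Q1", "product": "Ink", "sales": 30},
]

def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [f for f in facts if f[dim] == value]

def dice(facts, **allowed):
    """Dice: keep rows whose dimensions fall inside the given value sets."""
    return [f for f in facts if all(f[d] in vals for d, vals in allowed.items())]

def roll_up(facts, dims):
    """Roll-up (drill up): total the sales over the listed dimensions only."""
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f["sales"]
    return dict(totals)

def pivot(facts, row_dim, col_dim):
    """Pivot: lay out total sales as a row_dim x col_dim crosstab."""
    table = defaultdict(lambda: defaultdict(int))
    for f in facts:
        table[f[row_dim]][f[col_dim]] += f["sales"]
    return {r: dict(cols) for r, cols in table.items()}

q1 = slice_(FACTS, "quarter", "Q1")
east_pens = dice(FACTS, region={"East"}, product={"Pen"})
by_region = roll_up(FACTS, ["region"])
# Drill down: add a dimension to see finer detail.
by_region_quarter = roll_up(FACTS, ["region", "quarter"])
crosstab = pivot(FACTS, "region", "quarter")
```

Drill down is simply the reverse of roll-up: grouping by more dimensions (`region` and `quarter` instead of `region` alone) exposes finer detail.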
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities:
1. Data Extraction
Involves gathering data from multiple heterogeneous sources.
2. Data Cleaning
Involves finding and correcting the errors in data.
3. Data Transformation
Involves converting the data from legacy format to warehouse format.
4. Data Loading
Involves sorting, summarizing, consolidating, checking integrity, and
building indices and partitions.
5. Refreshing
Involves updating from data sources to warehouse.
Note: Data cleaning and data transformation are important steps in improving
the quality of data and data mining results.
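The five functions above can be sketched as a small pipeline. This is a minimal illustration, not a real warehouse tool: the CSV text, the `sales` table and all function names are assumptions made for the example, and Python's built-in `sqlite3` module stands in for the warehouse database.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from operational systems.
SOURCE_CSV = """id,name,amount
1, alice ,10.5
2,BOB,
3,carol,7
"""

def extract(text):
    """Extraction: read raw rows from a source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning: drop rows with a missing amount and trim stray whitespace."""
    return [
        {k: v.strip() for k, v in r.items()}
        for r in rows
        if r["amount"].strip()
    ]

def transform(rows):
    """Transformation: convert to the warehouse's types and naming style."""
    return [(int(r["id"]), r["name"].title(), float(r["amount"])) for r in rows]

def load(conn, rows):
    """Loading: write rows into the warehouse table and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_name ON sales (name)")
    conn.commit()

def refresh(conn, text):
    """Refreshing: re-run the pipeline to propagate source updates."""
    load(conn, transform(clean(extract(text))))

warehouse = sqlite3.connect(":memory:")
refresh(warehouse, SOURCE_CSV)
loaded = warehouse.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
```

The row with the missing amount is rejected during cleaning, so only two rows reach the warehouse table.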
Data Warehouse Architecture (01)
[Figure: architecture diagram with three panels: Data Sources, Data Warehousing, Users]
Here: The term Metadata represents data about the data. For example, the
index of a book serves as metadata for the contents of the book.
Or, Data Warehouse Architecture (02)

Or, Data Warehouse Architecture (03)

Data Warehouse Architecture Conti…
Data Warehouse Architecture is divided into three phases (tiers):
Phase 01: (Data warehouse database server.)
It is a relational database system. Back-end tools and utilities are used to
feed data into it. These back-end tools and utilities perform the Extract,
Clean, Load, and Refresh functions.
Phase 02: (OLAP Server)

OLAP Server that can be implemented in either of the following ways.
1. By Relational OLAP (ROLAP), which is an extended relational database
management system. The ROLAP maps the operations on multidimensional
data to standard relational operations.
2. By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Phase 03: (Front-end Client layer)

This tier is the front-end client layer. This layer holds the query tools and
reporting tools, analysis tools and data mining tools.
Data Warehouse Models
From the perspective of data warehouse architecture, we have
the following data warehouse models:
1. Virtual Warehouse
2. Data mart
3. Enterprise Warehouse
• Virtual Warehouse
o A set of views over operational databases is known as a virtual
warehouse.
o It is easy to build a virtual warehouse.
o Building a virtual warehouse requires excess capacity on operational database
servers.
• Data Mart
o Data marts contain a subset of organization-wide data that is valuable to
specific groups of people in an organization.
o For example, the marketing data mart may contain only data related to items,
customers, and sales. Data marts are confined to subjects.
Data Warehouse Models Conti…
• Enterprise Warehouse
o An enterprise warehouse collects all the information and the subjects
spanning an entire organization.
o It provides enterprise-wide data integration.
o The data is integrated from operational systems and external information
providers.
o This information can vary from a few gigabytes to hundreds of gigabytes,
terabytes or beyond.

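The difference between a virtual warehouse and a data mart can be sketched with Python's built-in `sqlite3` module. The `orders` table, its rows, and the names `v_sales_by_dept` and `marketing_mart` are hypothetical; the sketch assumes a single operational table for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical operational table.
conn.execute("CREATE TABLE orders (id INTEGER, dept TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "marketing", "banner", 200.0),
        (2, "finance", "audit", 900.0),
        (3, "marketing", "flyer", 50.0),
    ],
)

# Virtual warehouse: a view over the operational table. No data is copied,
# so every query against the view runs on the operational server.
conn.execute(
    "CREATE VIEW v_sales_by_dept AS "
    "SELECT dept, SUM(amount) AS total FROM orders GROUP BY dept"
)

# Data mart: a stored, subject-specific subset (here, marketing only).
conn.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT id, item, amount FROM orders WHERE dept = 'marketing'"
)

view_rows = conn.execute(
    "SELECT dept, total FROM v_sales_by_dept ORDER BY dept"
).fetchall()
mart_rows = conn.execute(
    "SELECT id, item, amount FROM marketing_mart ORDER BY id"
).fetchall()
```

Because the view is recomputed on each query, it is easy to build but consumes capacity on the operational server, which matches the trade-off noted for virtual warehouses.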
Need for Data Warehousing
• A data warehouse is the foundation for a successful BI (Business
Intelligence) program.
• The concept of data warehousing is to create a central location and
permanent storage space for the various data sources needed to support
a company’s analysis, reporting and other BI functions.
1. A Data Warehouse Delivers Enhanced Business Intelligence:
Because a data warehouse provides data from various sources, managers and
executives no longer need to make business decisions based on limited data.
2. A Data Warehouse Saves Time:
Since business users can quickly access critical data from a number of
sources, all in one place, they can rapidly make informed decisions on
key initiatives. They won’t waste precious time retrieving data from
multiple sources.
Need for Data Warehousing Conti…
3. A Data Warehouse Enhances Data Quality and Consistency
A data warehouse implementation includes the conversion of data
from numerous source systems into a common format.
4. A Data Warehouse Provides Historical Intelligence

A data warehouse stores large amounts of historical data so you can
analyze different time periods and trends in order to make future
predictions.
5. A Data Warehouse Generates a High ROI (Return On Investment)

Companies that have implemented data warehouses and complementary
BI systems have generated more revenue and saved more money than
companies that haven’t invested in BI systems and data warehouses.

OLTP and OLAP system design
• IT systems can broadly be divided into two types:
1. OLTP System (On-line Transaction Processing System) and
2. OLAP System (On-line Analytical Processing System).
• In general, OLTP systems provide source data to data
warehouses, whereas OLAP systems help to analyze it.
OLTP and OLAP system design Conti…
1. OLTP (On-line Transaction Processing) System
• It is characterized by a large number of short on-line transactions (INSERT,
UPDATE, DELETE).
• The main emphasis for OLTP systems is on very fast query processing and on
maintaining data integrity in multi-access environments; effectiveness is
measured by the number of transactions per second.
• An OLTP database holds detailed and current data, and the schema used to
store transactional data is the entity model (usually 3NF).
2. OLAP (On-line Analytical Processing) System
• It is characterized by a relatively low volume of transactions.
• Queries are often very complex and involve aggregations.
• For OLAP systems, response time is the effectiveness measure.
• OLAP applications are widely used by Data Mining techniques.
• An OLAP database holds aggregated, historical data, stored in multi-
dimensional schemas (usually star schema).
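The two workloads can be contrasted in a few lines using Python's built-in `sqlite3` module. The `sales` table and its rows are hypothetical; in practice the OLTP and OLAP sides would normally be separate systems, and a single in-memory database is used here only to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, year INTEGER, amount REAL)"
)

# OLTP-style work: many short transactions, each touching a few rows.
# "with conn" commits on success and rolls back on error.
with conn:
    conn.execute("INSERT INTO sales (store, year, amount) VALUES ('Nagpur', 2017, 120.0)")
with conn:
    conn.execute("INSERT INTO sales (store, year, amount) VALUES ('Nagpur', 2018, 80.0)")
with conn:
    conn.execute("INSERT INTO sales (store, year, amount) VALUES ('Pune', 2018, 200.0)")
with conn:
    conn.execute("UPDATE sales SET amount = 100.0 WHERE id = 2")

# OLAP-style work: one read-only query that aggregates over the history.
report = conn.execute(
    "SELECT store, COUNT(*), SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
```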
OLTP and OLAP system design Conti…
The following table summarizes the differences between OLTP and OLAP systems.
Data Cube
Introduction
• It is a multi-dimensional array of values
• It is used to represent data along some measure of interest.
• Even though it is called a 'cube', it can be 1-dimensional, 2-dimensional, 3-
dimensional, or higher-dimensional.
• Every dimension represents an attribute of the data, whereas the cells in the cube
hold the measure (fact) of interest.
• For example, they could contain a count for the number of times that
attribute combination occurs in the database, or the minimum, maximum,
sum or average value of some attribute.
• Queries are performed on the cube to retrieve decision support
information.

Data Cube Conti…
Example
We have a database that contains transaction information relating company sales of a
part to a customer at a store location. The data cube formed from this database is
a 3-dimensional representation, with each cell (p,c,s) of the cube representing a
combination of values from part, customer and store-location. A sample data cube
for this combination is shown in Figure 1. The content of each cell is the count of
the number of times that specific combination of values occurs together in the
database. Cells that appear blank in fact have a value of zero. The cube can then
be used to retrieve information within the database about, for example, which
store should be given a certain part to sell in order to make the greatest sales.

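The count cube from this example can be sketched with a `Counter` keyed by (part, customer, store). The transaction values below are invented for illustration, and `sales_by_store` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical transactions: (part, customer, store).
transactions = [
    ("P1", "C1", "S1"),
    ("P1", "C2", "S1"),
    ("P1", "C1", "S2"),
    ("P2", "C1", "S1"),
    ("P1", "C1", "S1"),
]

# The cube: cell (p, c, s) -> number of times that combination occurs.
# A Counter returns 0 for combinations that never occur (the "blank" cells).
cube = Counter(transactions)

def sales_by_store(cube, part):
    """Sum the cube over the customer dimension for one part."""
    totals = Counter()
    for (p, _customer, store), count in cube.items():
        if p == part:
            totals[store] += count
    return totals

p1_by_store = sales_by_store(cube, "P1")
# The store that sold part P1 most often.
best_store = max(p1_by_store, key=p1_by_store.get)
```

Summing out the customer dimension answers the question posed in the example: which store should be given a certain part to sell.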
Star and Snowflake Schema
Basis for comparison | Star Schema | Snowflake Schema
Structure of schema | Contains fact and dimension tables. | Contains sub-dimension tables in addition to fact and dimension tables.
Use of normalization | Doesn't use normalization. | Uses normalization and denormalization.
Ease of use | Simple to understand and easily designed. | Hard to understand and design.
Data model | Top-down | Bottom-up
Query complexity | Low | High
Foreign key join used | Fewer | Large in number
Space usage | More | Less
Time consumed in query execution | Less | More comparatively, due to excessive use of joins.
Ease of maintenance / change | Has redundant data and hence less easy to maintain/change. | No redundancy, so snowflake schemas are easier to maintain and change.
Star and Snowflake Schema Conti…
Star Schema
• The star schema is the simplest and most common modeling paradigm.
• The schema resembles a star, with dimension tables displayed in a radial
pattern around the central fact table.
• The dimensions in the fact table are connected to the dimension tables
through primary keys and foreign keys.
Snowflake Schema
• The snowflake schema is a variant of the star schema that includes a
hierarchical form of dimension tables.
• In this schema, there is a fact table together with various dimension and
sub-dimension tables, connected to the fact table through primary and foreign
keys.
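A star schema can be sketched with one fact table and two dimension tables, using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_store`), columns and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables, each with its own primary key.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
-- Central fact table: foreign keys to the dimensions plus a measure.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units      INTEGER
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Ink', 'Stationery');
INSERT INTO dim_store   VALUES (10, 'Nagpur'), (20, 'Pune');
INSERT INTO fact_sales  VALUES (1, 10, 5), (2, 10, 3), (1, 20, 7);
""")

# In a star schema, one join per dimension is enough.
units_by_city = conn.execute("""
    SELECT s.city, SUM(f.units)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
```

In a snowflake variant of the same design, the repeated `category` value would move into its own sub-dimension table referenced from `dim_product`, which removes the redundancy but adds one more join to any query that needs the category.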
Introduction to Data Mining
Introduction
• Data mining refers to the application of algorithms for extracting
patterns from data without the additional steps of the KDD process.
• Data Mining is defined as the process of extracting information from
huge sets of data.
• The data mining process finds patterns that are:
o valid: hold on new data with some certainty
o novel: non-obvious to the system
o useful: should be possible to act on the item
o understandable: humans should be able to interpret the pattern
• Extraction of information is not the only process we need to perform; data
mining also involves other processes such as Data Cleaning, Data
Integration, Data Transformation, Data Mining, Pattern Evaluation and
Data Presentation.
• Once all these processes are over, we would be able to use this
information in many applications.
Data Mining Applications
• Banking: loan/credit card approval
o predict good customers based on old customers
• Customer relationship management
o identify those who are likely to leave for a competitor.
• Targeted marketing
o identify likely responders to promotions
• Fraud detection: telecommunications, financial transactions
o identify fraudulent events from an online stream of events
• Manufacturing and production
o automatically adjust knobs when process parameters change
• Medicine: disease outcome, effectiveness of treatments
o analyze patient disease history: find relationship between diseases
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis
o identify new galaxies by searching for sub clusters
• Web site/store design and promotion
o find the affinity of visitors to pages and modify the layout
KDD Process
Fig. Diagram shows the process of knowledge discovery

KDD Process Conti…
Introduction
• KDD stands for Knowledge Discovery in Databases
• KDD is the overall process of discovering useful knowledge from data
• KDD process consists of many steps (one of them is Data Mining)
Steps involved in the knowledge discovery process
• Data Cleaning − In this step, noise and inconsistent data are removed.
• Data Integration − In this step, multiple data sources are combined.
• Data Selection − In this step, data relevant to the analysis task are
retrieved from the database.
• Data Transformation − In this step, data is transformed or consolidated
into forms appropriate for mining by performing summary or aggregation
operations.
• Data Mining − In this step, intelligent methods are applied in order to
extract data patterns.
• Pattern Evaluation − In this step, data patterns are evaluated.
• Knowledge Presentation − In this step, knowledge is represented.
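The seven steps can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the two purchase-record sources are invented, the "mining" step is a simple count of co-occurring item pairs, and every function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources of (customer, items) purchase records.
source_a = [("c1", ["milk", "bread"]), ("c2", ["milk", None, "bread"])]
source_b = [("c3", ["Milk", "eggs"]), ("c4", [])]

def clean(records):
    """Data cleaning: drop missing items (noise)."""
    return [(cid, [i for i in items if i]) for cid, items in records]

def integrate(*sources):
    """Data integration: combine multiple sources."""
    return [rec for src in sources for rec in src]

def select(records):
    """Data selection: keep only baskets relevant to pair analysis."""
    return [rec for rec in records if len(rec[1]) >= 2]

def transform(records):
    """Data transformation: normalize to sorted lowercase item lists."""
    return [sorted({i.lower() for i in items}) for _cid, items in records]

def mine(baskets):
    """Data mining: count co-occurring item pairs."""
    return Counter(pair for b in baskets for pair in combinations(b, 2))

def evaluate(patterns, min_support=2):
    """Pattern evaluation: keep pairs seen at least min_support times."""
    return {p: n for p, n in patterns.items() if n >= min_support}

def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{a} + {b}: {n} baskets" for (a, b), n in sorted(patterns.items())]

knowledge = present(
    evaluate(mine(transform(select(integrate(clean(source_a), clean(source_b))))))
)
```

Data mining is only the fifth function in the chain; the steps before it prepare the data and the steps after it filter and communicate the result.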
Relational Vs Non-Relational databases
Parameter | Relational databases | Non-relational databases
Example | Microsoft SQL Server, Oracle Database, IBM DB2 | Azure Cosmos DB, MongoDB, Cassandra
Category of Data | Structured data (data in tabular format) | Unstructured data (works with semi-structured data such as JSON, XML)
Transaction Processing | Relational databases use the ACID system | Non-relational databases (NoSQL) use the BASE system
Query Language | SQL | NoSQL (database-specific query languages/APIs)
Pros | Built-in data integrity; limitless indexing | Works with semi-structured data; good speed, due to not having to join tables
Cons | Difficulty in working with semi-structured data; low speed, due to lots of joins | Does not have built-in data integrity (must do in code); limited indexing
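The "built-in data integrity" row can be illustrated with a short sketch. The relational side uses Python's built-in `sqlite3` module; the non-relational side is only a stand-in, a plain list of JSON documents rather than a real document database, and the `users` table and field names are hypothetical.

```python
import json
import sqlite3

# Relational: fixed schema, integrity enforced by the database itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users VALUES (1, 'Asha')")
try:
    conn.execute("INSERT INTO users VALUES (2, NULL)")  # violates NOT NULL
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Document style: each record is a self-describing JSON document, so
# fields can vary per record and validation is left to application code.
documents = [
    json.loads('{"id": 1, "name": "Asha", "tags": ["admin"]}'),
    json.loads('{"id": 2, "email": "x@example.com"}'),  # no "name" field
]
missing_name = [d["id"] for d in documents if "name" not in d]
```

The database refuses the incomplete row on the relational side, while on the document side the incomplete record is stored and the check has to be written in code.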
Clustering Techniques
Introduction
• Clustering is the task of dividing the population or data points into a
number of groups such that data points in the same groups are more
similar to other data points in the same group than those in other groups.
• In simple words, the aim is to segregate groups with similar traits and
assign them into clusters.
Types of Clustering
1. Hard Clustering: In hard clustering, each data point either belongs to a
cluster completely or not. For example, if a retail store segments its
customers into 10 groups, each customer is put into exactly one of the 10 groups.
2. Soft Clustering: In soft clustering, instead of putting each data point into
a single cluster, a probability or likelihood of that data point being in
each cluster is assigned. For example, in the same retail scenario each
customer is assigned a probability of being in each of the 10 clusters of the
retail store.
Clustering Techniques Conti…
Types of clustering algorithms
• Connectivity models: Examples of these models are hierarchical clustering
algorithm and its variants.
• Centroid models: K-Means clustering algorithm is a popular algorithm that
falls into this category.
• Distribution models: A popular example of these models is Expectation-
maximization algorithm which uses multivariate normal distributions.
• Density Models: Popular examples of density models are DBSCAN and
OPTICS.
Two of the most popular clustering algorithms in detail:

1. Hierarchical Clustering
2. K Means Clustering

1. Hierarchical Clustering
• Hierarchical clustering, as the name suggests, is an algorithm that builds a
hierarchy of clusters. This algorithm starts with all the data points assigned
to a cluster of their own. Then the two nearest clusters are merged into the
same cluster. In the end, this algorithm terminates when there is only a
single cluster left.
• The results of hierarchical clustering can be shown using a dendrogram, a
tree diagram in which the height of each merge reflects the distance between
the two clusters being merged.
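The merge procedure described above can be sketched in a few lines. This is a naive illustration, not an efficient implementation: it assumes single linkage (cluster distance = distance between the closest members) and Euclidean distance via `math.dist` (Python 3.8+), and the four sample points are invented.

```python
import math

def agglomerative(points):
    """Bottom-up clustering with single linkage; returns the merge history."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
history = agglomerative(points)
```

Each entry of `history` records which two clusters were merged and at what distance; those distances are the merge heights a dendrogram would plot.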
Two important things that you should know about hierarchical clustering
are:
1. The algorithm described above uses the bottom-up approach.
It is also possible to follow a top-down approach, starting with all data
points assigned to the same cluster and recursively performing splits until
each data point is assigned a separate cluster.
2. The decision of merging two clusters is taken on the basis of closeness of
these clusters. There are multiple metrics for deciding the closeness of
two clusters :
o Euclidean distance: $\|a-b\|_2 = \sqrt{\sum_i (a_i-b_i)^2}$
o Squared Euclidean distance: $\|a-b\|_2^2 = \sum_i (a_i-b_i)^2$
o Manhattan distance: $\|a-b\|_1 = \sum_i |a_i-b_i|$
o Maximum distance: $\|a-b\|_\infty = \max_i |a_i-b_i|$
o Mahalanobis distance: $\sqrt{(a-b)^{T} S^{-1} (a-b)}$ {where $S$ is the covariance matrix}
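The distance metrics listed above translate directly into code. The sketch below uses only the standard library; the function names are chosen for this example, and the Mahalanobis version is restricted to 2-D points with a 2x2 covariance matrix so that the matrix inverse can be written out by hand.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def mahalanobis_2d(a, b, S):
    """Mahalanobis distance for 2-D points; S is a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = S
    det = s11 * s22 - s12 * s21
    # Inverse of a 2x2 matrix, written out to avoid a linear-algebra dependency.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    dx, dy = a[0] - b[0], a[1] - b[1]
    return math.sqrt(
        dx * (inv[0][0] * dx + inv[0][1] * dy)
        + dy * (inv[1][0] * dx + inv[1][1] * dy)
    )

a, b = (1.0, 2.0), (4.0, 6.0)
```

For these two points the coordinate differences are 3 and 4, so the Euclidean distance is 5, the Manhattan distance is 7 and the maximum distance is 4. With the identity matrix as `S`, the Mahalanobis distance reduces to the Euclidean distance.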
2. K Means Clustering
K means is an iterative clustering algorithm that moves toward a local
optimum in each iteration. This algorithm works in these 6 steps:
Step 01: Specify the desired number of clusters K: Let us choose k=2 for
these 5 data points in 2-D space.
Step 02: Randomly assign each data point to a cluster: Let’s assign three
points to cluster 1, shown in red, and two points to cluster 2, shown in
grey.
Step 03: Compute cluster centroids: The centroid of the data points in the
red cluster is shown using a red cross and that of the grey cluster using a
grey cross.
Step 04: Re-assign each point to the closest cluster centroid: Note that the
data point at the bottom is assigned to the red cluster even though it is
closer to the centroid of the grey cluster. Thus, we assign that data point
to the grey cluster.
Step 05: Re-compute cluster centroids: Now, re-compute the centroids for
both clusters.
Step 06: Repeat steps 4 and 5 until no improvements are possible: We repeat
the 4th and 5th steps until the assignments stop changing. When no data
point switches clusters between two successive repeats, the algorithm
terminates, unless a maximum number of iterations is explicitly specified.
The result is a local optimum and not necessarily the global one.
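The six steps can be sketched as follows. This is a minimal illustration under stated assumptions: the five points are invented, the "random" initial assignment of step 2 is fixed so the run is reproducible, and an empty cluster simply falls back to the first point, which a production implementation would handle more carefully. `math.dist` requires Python 3.8 or later.

```python
import math

def centroid(points):
    """Mean of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def k_means(points, labels, k, max_iter=100):
    """Steps 3-6: labels is the initial (step 2) assignment, one index per point."""
    for _ in range(max_iter):
        # Steps 3 and 5: compute a centroid for every cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster falls back to the first point.
            centroids.append(centroid(members) if members else points[0])
        # Step 4: re-assign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 6: stop when no point switches cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Step 1: k = 2 for five hypothetical 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0)]
# Step 2: a fixed stand-in for the random initial assignment.
initial = [0, 0, 1, 1, 0]
labels, centroids = k_means(points, initial, k=2)
```

Starting from a deliberately poor assignment, the first pass moves the third and fifth points to their nearer centroids, and the second pass changes nothing, so the loop stops with the three lower-left points in one cluster and the two upper-right points in the other.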
Best Wishes!!!