
Data Mining and Business

Intelligence

Unit 1: Data Warehousing

Taught By –
Mrs. Veena Kiragi
School of Computer Science Engg. and
Applications
DY Patil International University

What is KDD?
• Knowledge Discovery from Data/Databases (KDD), also known as Pattern Analysis or Knowledge Extraction.

• In today’s world, there is a large volume of data.

• Sources of data –
– Business such as ecommerce,
– Science such as biomedical field,
– Society such as news, digital cameras,
smartphones, apps.

Contd.
• Many people treat data mining as a synonym for
another popularly used term, knowledge
discovery from data, or KDD.

• Others view data mining as merely an essential step in the process of knowledge discovery.

• Aim of this subject – to learn


– To extract meaningful information from large-scale data.
– To discover interesting patterns or knowledge from a given dataset.
Why is KDD/Data Science becoming popular now?
• Converting data into meaningful information is the
work of a Data Scientist.

• Before the Internet era, data existed in very small amounts and could be managed in a few files, such as Excel sheets.

• Now, 10–15 years into the Internet era, enormous volumes of data are being generated.

• The emergence of Facebook in 2004, YouTube in 2005, and later Instagram, Twitter, etc. drew people onto the internet.
Contd.
• The affordability of the internet for the common man has revolutionized this field.

• As a result, a large amount of data is being added to databases every second.

• Significant developments in computers and high-performance systems have helped in handling this huge amount of data.

• There is a need for Data Scientists to develop and explore advanced algorithms so that meaningful information can be extracted from these large databases quickly.
Contd.
• Most data scientist tasks nowadays involve datasets collected from people's activity on the internet.

• Any activity by an individual, such as watching a video on YouTube, checking reels on Instagram, or browsing feeds on Facebook, leaves an imprint on the internet.

• That imprint is recorded by companies in order to provide better service to the customer. A data scientist does this by observing the activity of millions of people.
Introduction to KDD
Process
[Figure: Data mining as a step in the process of knowledge discovery: Data Cleaning → Data Integration → Data Selection → Data Transformation → Data Mining → Pattern Evaluation → Knowledge Presentation]

Source: Han et al., Data Mining: Concepts and Techniques
• Data Cleaning – to remove noise and inconsistent data

• Data Integration – where multiple data sources may be combined

• Data Selection – where data relevant to the analysis task are retrieved from the database

• Data Transformation – where data are transformed and consolidated into forms appropriate for mining

• Data Mining – an essential process where intelligent methods are applied to extract data patterns

• Pattern Evaluation – to identify the truly interesting patterns representing knowledge, based on interestingness measures

• Knowledge Presentation – where visualization and knowledge representation techniques are used to present mined knowledge to users
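To make these steps concrete, the short Python (pandas) sketch below runs a tiny, made-up dataset through the same sequence: cleaning, selection, transformation, a trivial mining step, and presentation. The column names and values are illustrative assumptions, not part of the slides.

# A minimal, hypothetical KDD pipeline sketch (column names are made up).
import pandas as pd

# Raw, possibly noisy source data (stand-in for the integrated database).
raw = pd.DataFrame({
    "customer": ["A", "B", "B", "C", None],
    "item":     ["tv", "tv", "phone", "phone", "tv"],
    "amount":   [400.0, 410.0, None, 95.0, 405.0],
})

# 1-2. Data cleaning / integration: drop rows missing a key, fill numeric gaps.
clean = raw.dropna(subset=["customer"])
clean = clean.assign(amount=clean["amount"].fillna(clean["amount"].mean()))

# 3. Data selection: keep only the attributes relevant to the analysis task.
selected = clean[["item", "amount"]]

# 4. Data transformation: scale amounts to the [0, 1] range (min-max).
lo, hi = selected["amount"].min(), selected["amount"].max()
transformed = selected.assign(amount=(selected["amount"] - lo) / (hi - lo))

# 5. Data mining: a trivial "pattern", the average scaled amount per item.
patterns = transformed.groupby("item")["amount"].mean()

# 6-7. Pattern evaluation / knowledge presentation.
print(patterns.sort_values(ascending=False))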
Introduction to Data
Mining
• Data mining is the process of discovering
interesting patterns and knowledge from large
amounts of data.

• Data is the most elementary descriptions of things,


events, activities etc.

• The data sources can include databases, data


warehouses, the Web, other information
repositories, or data that are streamed into the
system dynamically.

• The data mining field is continuously evolving and will continue to do so.


Why is there a need for Data Mining?

• Huge amount of data is being generated each


second.

• There is always competitive pressure on business owners to provide better services to the customer.

• Data mining helps to develop smart market decisions, run accurate campaigns, make predictions, and more.

• With the help of data mining, companies can analyze customer behavior and gain insights. This leads to greater success and data-driven decision making.
Database Management System
(DBMS)
• It consists of a collection of interrelated data,
known as a database, and a set of software
programs to manage and access the data.

• A database is a logical grouping of data or


collection of related data (e.g. Structured and
Unstructured).

• DBMS is a computerized data-keeping system.

• DBMS helps provide data security, data integrity,


concurrency, and uniform data administration
procedures.
Contd.

• A DBMS serves as an interface between an end-user


and a database, allowing users to create, read,
update, and delete data in the database.

• A DBMS manages the data, the database engine, and the database schema, allowing data to be manipulated or extracted by users and other programs.

• The most widely used types of DBMS software are


relational, distributed, hierarchical, object-oriented
& network.
Data Mining
Functionalities
• Data mining functionalities are used to specify the kinds of
patterns to be found in data mining tasks.

• In general, such tasks can be classified into two categories:

Descriptive and Predictive

Descriptive mining tasks characterize properties of the data in a


target data set.

Predictive mining tasks perform induction on the current data in


order to make predictions.

Data Mining Functionalities:

• Characterization and Discrimination

• Mining of frequent patterns, associations, and correlations

• Classification and Regression

• Clustering Analysis

• Outlier Analysis
Data Warehouse
• Data warehouse systems are valuable tools in today’s competitive, fast-
evolving world. A data warehouse is a central repository of information
that can be analyzed to make more informed decisions.

• A data warehouse refers to a data repository that is maintained


separately from an organization’s operational databases.

• Data warehouse systems allow for integration of a variety of application


systems. They support information processing by providing a solid
platform of consolidated historic data for analysis.

• Data warehousing provides architectures and tools for business


executives to systematically organize, understand, and use their data to
make strategic decisions.

A Data Warehouse is the place where the valuable data assets of an organization are stored.
Contd.
• A data warehouse is a type of data management system that is designed
to enable and support business intelligence (BI) activities.

• Data warehouses are solely intended to perform queries and analysis


and often contain large amounts of historical data.

• The data within a data warehouse is usually derived from a wide range
of sources such as application log files and transaction applications.

• A data warehouse centralizes and consolidates large amounts of data


from multiple sources.

• Its analytical capabilities allow organizations to derive valuable business


insights from their data to improve decision-making.

Contd.
• Over time, it builds a historical record that can be invaluable to data
scientists and business analysts.

• Because of these capabilities, a data warehouse can be considered an


organization’s “single source of truth.”

• According to William H. Inmon, a leading architect in the construction of


data
warehouse systems, “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support of
management’s decision making process”.

• The four keywords—subject-oriented, integrated, time-variant, and


nonvolatile—distinguish data warehouses from other data repository
systems, such as relational database systems, transaction processing
systems, and file systems.

Subject Oriented
• A data warehouse is organized around major subjects such as customer,
supplier, product, and sales.

• Rather than concentrating on the day-to-day operations and transaction


processing of an organization, a data warehouse focuses on the
modeling and analysis of data for decision makers.

• Hence, data warehouses typically provide a simple and concise view of


particular subject issues by excluding data that are not useful in the
decision support process.

Integrated
• A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and
online transaction records.

• Data cleaning and data integration techniques are applied to ensure


consistency in naming conventions, encoding structures, attribute
measures, and so on.

• Data warehouses create consistency among different data types from


disparate sources.

Time-variant
• Data are stored to provide information from an historic perspective (e.g.,
the past 5–10 years).

• Every key structure in the data warehouse contains, either implicitly or


explicitly, a time element.

• Data warehouse analysis looks at change over time.

Non-volatile
• A data warehouse is always a physically separate store of data
transformed from the application data found in the operational
environment.

• Due to this separation, a data warehouse does not require transaction


processing, recovery, and concurrency control mechanisms.

• It usually requires only two operations in data accessing: initial loading


of data and access of data.

• Once data is in a data warehouse, it’s stable and doesn’t change. It can
only be deleted.

What are the benefits of using
a data warehouse?
• Benefits of a data warehouse include the following:

a) Informed decision making

b) Consolidated data from many sources

c) Historical data analysis

d) Data quality, consistency, and accuracy

e) Separation of analytics processing from


transactional databases, which improves
performance of both systems
Data Warehousing Architecture
• A data warehouse is a semantically consistent data store that serves as a
physical implementation of a decision support data model.

• It stores the information an enterprise needs to make strategic decisions.

• A data warehouse is also often viewed as an architecture, constructed by


integrating data from multiple heterogeneous sources to support
structured and/or ad hoc queries, analytical reporting, and decision
making.

• Based on this information, data warehousing is the process of


constructing and using data warehouses.

Contd…
• A data warehouse architecture is made up of a 3-tier system.

• Data is stored in two different types of ways: 1) data that is accessed


frequently is stored in very fast storage (like SSD drives) and 2) data that
is infrequently accessed is stored in a cheap object store, like Amazon
S3.

• The data warehouse will automatically make sure that frequently


accessed data is moved into the “fast” storage so query speed is
optimized.

• The top tier is the front-end client that presents query results through reporting, analysis, and data mining tools.

• The middle tier is an OLAP server, which consists of the analytics engine used to access and analyze the data.

• It is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or (2) a multidimensional OLAP (MOLAP) model.

• The bottom tier of the architecture is


the warehouse database server (that is
almost always a relational database
system), where data is loaded and
stored.
Source: https://www.youtube.com/
watch?v=q9oAZwhuUy4

OLAP and OLTP Systems
• OLAP : On Line Analytical Processing

• OLTP : On Line Transaction Processing

Comparison of OLTP and OLAP
Systems

Source: https://www.youtube.com/
watch?v=q9oAZwhuUy4

OLTP Systems
• These systems are called online transaction processing (OLTP) systems.
• The major task of these systems is to perform online transaction and query processing.
• They cover most of the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
• An OLTP system is customer-oriented.
• An OLTP system manages current data that, typically, are too detailed to be easily used for decision making.

OLAP Systems
• These systems are known as online analytical processing (OLAP) systems.
• Data warehouse systems serve users or knowledge workers in the role of data analysis and decision making.
• Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users.
• An OLAP system is market-oriented.
• An OLAP system manages large amounts of historic data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity.
OLTP Systems
• An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
• An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations.
• The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.

OLAP Systems
• An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
• An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization.
• OLAP systems also deal with information that originates from different organizations, integrating information from many data stores.
• Because of their huge volume, OLAP data are stored on multiple storage media.
• Accesses to OLAP systems are mostly read-only operations (because most data warehouses store historic rather than up-to-date information).
Comparison of OLTP and OLAP Systems

Feature | OLTP | OLAP
Characteristic | operational processing | informational processing
Orientation | transaction | analysis
User | clerk, DBA, database professional | knowledge worker (e.g., manager, executive, analyst)
Function | day-to-day operations | long-term informational requirements, decision support
DB design | ER-based, application-oriented | star/snowflake, subject-oriented
Data | current, guaranteed up-to-date | historic, accuracy maintained over time
Summarization | primitive, highly detailed | summarized, consolidated
View | detailed, flat relational | summarized, multidimensional
Contd.

Feature | OLTP | OLAP
Access | read/write | mostly read
Focus | data in | information out
Number of records accessed | tens | millions
Number of users | thousands | hundreds
DB size | GB to high-order GB | ≥ TB
Priority | high performance, high availability | high flexibility, end-user autonomy
Metric | transaction throughput | query throughput, response time
Dimensional Data Models
• The entity-relationship data model is commonly used in the design of
relational
databases, where a database schema consists of a set of entities and the
relationships
between them.

• Such a data model is appropriate for online transaction processing.

• A data warehouse, however, requires a concise, subject-oriented schema


that facilitates
online data analysis.

• The most popular data model for a data warehouse is a


multidimensional model,
which can exist in the form of a star schema, a snowflake schema, or a
fact constellation schema.
1. Star Schema
• The most common modeling paradigm is the star schema, in which the
data warehouse contains:
(1) a large central table (fact table) containing the bulk of the data, with
no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each
dimension.

• The schema graph resembles a starburst, with the dimension tables


displayed in a radial pattern around the central fact table.

• A star schema for AllElectronics sales is shown in Figure here. Sales are
considered along four dimensions: time, item, branch, and location.

• The schema contains a


central fact table for
sales that contains keys
to each of the four
dimensions, along with
two measures: dollars
sold and units sold.

• Each dimension is
represented by only one
table, and
each table contains a set
of attributes.

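As a rough illustration of how such a schema is queried, the pandas sketch below builds a tiny fact table plus two dimension tables modeled on the AllElectronics example, joins them on their keys, and aggregates the dollars sold measure. All keys and values here are invented for illustration and are not taken from the textbook figure.

# Hypothetical star-schema tables for the AllElectronics example, using
# pandas DataFrames in place of relational tables (all values made up).
import pandas as pd

# Central fact table: one row per sale, keyed into each dimension.
sales_fact = pd.DataFrame({
    "time_key":     [1, 1, 2],
    "item_key":     [10, 11, 10],
    "branch_key":   [100, 100, 101],
    "location_key": [7, 7, 8],
    "dollars_sold": [400.0, 150.0, 420.0],
    "units_sold":   [4, 3, 4],
})

# Smaller attendant dimension tables, one per dimension (two shown here).
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2010, 2010]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["TV", "Phone"], "type": ["electronics", "electronics"]})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate a measure (dollars_sold) by dimension attributes.
report = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(item_dim, on="item_key")
          .groupby(["year", "quarter", "item_name"])["dollars_sold"]
          .sum())
print(report)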
2. Snowflake Schema
• The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a
snowflake.

• The major difference between the snowflake and star schema models is
that the
dimension tables of the snowflake model may be kept in normalized form
to reduce
redundancies.

• Such a table is easy to maintain and saves storage space. However, this
space savings is negligible in comparison to the typical magnitude of the
fact table.

• Furthermore, the snowflake structure can reduce the effectiveness of


browsing, since more joins will be needed to execute a query.
Contd.
• Hence, although the snowflake schema reduces redundancy, it is not as
popular as the star schema in data warehouse design.
• A snowflake schema for
AllElectronics sales is
given in Figure.
• Here, the sales fact table is
identical to that of the star
schema.
• The main difference
between the two schemas
is in the definition of
dimension tables.
• The single dimension table
for item in the star schema
is normalized in the
snowflake schema,
resulting in new item and supplier tables.
3. Fact Constellation
• Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of
stars, and hence is called a galaxy schema or a fact constellation.
• This schema specifies
two fact tables, sales and
shipping.
• The sales table definition
is identical to that of the
star schema.
• The shipping table has
five dimensions, or
keys—item key,
time key, shipper key,
from location, and to
location—and two
measures—dollars cost
and units_shipped.
Contd.
• A fact constellation schema allows dimension tables to be shared between fact
tables.

• For example, the dimensions tables for time, item, and location are shared
between the sales and shipping fact tables.

• In data warehousing, there is a distinction between a data warehouse and a data


mart. A data warehouse collects information about subjects that span the entire
organization, such as customers, items, sales, assets, and personnel, and thus its
scope is enterprise-wide.

• For data warehouses, the fact constellation schema is commonly used, since it
can model multiple, interrelated subjects. A data mart, on the other hand, is a
department subset of the data warehouse that focuses on selected subjects, and
thus its scope is departmentwide.

• For data marts, the star or snowflake schema is commonly used, since both are geared toward modeling single subjects, although the star schema is more popular and efficient.
Data Warehouse Models/
Schemas
• From the architecture point of view, there are three data warehouse
models:

• Enterprise Warehouse
• Data Mart
• Virtual Warehouse

1. Enterprise Warehouse
• An enterprise warehouse collects all of the information about subjects
spanning the entire organization.

• It provides corporate-wide data integration, usually from one or more


operational systems or external information providers, and is cross-
functional in scope.

• It typically contains detailed data as well as summarized data, and can


range in size from a few gigabytes to hundreds of gigabytes, terabytes,
or beyond.

• An enterprise data warehouse may be implemented on traditional


mainframes, computer superservers, or parallel architecture platforms.

• It requires extensive business modeling and may take years to design


and build.
2. Data Mart
• A data mart contains a subset of corporate-wide data that is of value to a
specific group of users.

• The scope is confined to specific selected subjects. For example, a


marketing data mart may confine its subjects to customer, item, and
sales.

• The data contained in data marts tend to be summarized.

• Data marts are usually implemented on low-cost departmental servers


that are Unix/Linux or Windows based.

• The implementation cycle of a data mart is more likely to be measured


in weeks rather than months or years.

Contd.
• However, it may involve complex integration in the long run if its design
and planning were not enterprise-wide.

• Depending on the source of data, data marts can be categorized as


independent
or dependent.

• Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from
data generated locally within a particular department or geographic
area.

• Dependent data marts are sourced directly from enterprise data


warehouses.

3. Virtual Warehouse
• A virtual warehouse is a set of views over operational databases.

• For efficient query processing, only some of the possible summary views
may be
materialized.

• A virtual warehouse is easy to build but requires excess capacity on


operational database servers.

Data Cube: A Multidimensional
Data Model
• Data warehouses and OLAP tools are based on a multidimensional data model.
This model views data in the form of a data cube.

• A data cube allows data to be modeled and viewed in multiple dimensions. It is


defined by dimensions and facts.

• In general terms, dimensions are the perspectives or entities with respect to


which an organization wants to keep records.

• Each dimension may have a table associated with it, called a dimension table,
which further describes the dimension.

• For example, a dimension table for item may contain the attributes item name,
brand, and type.

• Dimension tables can be specified by users or experts, or automatically generated


and adjusted based on data distributions.

Contd.
• A multidimensional data model is typically organized around a central
theme, such
as sales.

• Facts are numerical measures. Think of them as the quantities by which we want to analyze relationships between dimensions.

• Examples of facts for a sales data warehouse include dollars sold (sales
amount in dollars), units sold (number of units sold), and amount
budgeted.

Example

• Imagine that you have collected the data for your analysis. These data
consist of the AllElectronics sales per quarter, for the years 2008 to
2010.

Contd.

• You are, however, interested in the annual sales (total per year),
rather than the total per quarter.

• Thus, the data can be aggregated so that the resulting data summarize
the total sales per year instead of per quarter.

• The resulting data set is smaller in volume, without loss of


information necessary for the analysis task.

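A hedged sketch of that aggregation in pandas is shown below; the quarterly dollar figures are invented purely to illustrate rolling quarters up to years.

# Hypothetical quarterly sales aggregated to annual totals (values made up).
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "dollars_sold": [1623, 1998, 2512, 2939, 1890, 2190, 2730, 3100],
})

# Roll the quarterly figures up to one total per year.
annual = quarterly.groupby("year")["dollars_sold"].sum()
print(annual)   # smaller data set, same information for the annual analysis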
2-D DATA CUBE
• In this 2-D representation, the sales for Vancouver are shown with respect to the time
dimension (organized in quarters) and the item dimension (organized according to the
types of items sold). The fact or measure displayed is dollars sold (in thousands).

• A simple 2-D data cube


that is, in fact, a table
or spreadsheet for
sales data from
AllElectronics.

• In particular, we will
look at the
AllElectronics sales
data for items sold per
quarter in the city of
Vancouver.

3-D DATA CUBE
• Suppose we would like to view the data according to time and item, as well
as location, for the cities Chicago, New York, Toronto, and Vancouver.

• The 3-D data in the table are represented as a series of 2-D tables.

Contd.

• Data cubes store multidimensional aggregated information.

• For example, Figure below shows a data cube for multidimensional


analysis of sales data with respect to annual sales per item type for
each AllElectronics branch. Each cell holds an aggregate data value,
corresponding to the data point in multidimensional space.

Operations on Cubes
• Roll-Up
• Drill-Down
• Slice & Dice
• Pivot
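The pandas sketch below suggests what each operation amounts to on a small, hypothetical sales cube with time, location, and item dimensions. The data and column names are assumptions for illustration only; real OLAP servers perform these operations on pre-aggregated cubes.

# Rough pandas equivalents of the four cube operations (values made up).
import pandas as pd

cube = pd.DataFrame({
    "year":    [2010, 2010, 2010, 2010],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "item":    ["TV", "TV", "Phone", "TV"],
    "dollars_sold": [605, 818, 680, 925],
})

# Roll-up: climb to a coarser level, e.g. from quarter up to year.
rollup = cube.groupby(["year", "city"])["dollars_sold"].sum()

# Drill-down: go to a finer level, e.g. year -> quarter -> item.
drilldown = cube.groupby(["year", "quarter", "item"])["dollars_sold"].sum()

# Slice: fix one dimension to a single value (city = Vancouver).
sliced = cube[cube["city"] == "Vancouver"]

# Dice: select a sub-cube on two or more dimensions.
diced = cube[(cube["city"] == "Vancouver") & (cube["quarter"] == "Q1")]

# Pivot: rotate the view, e.g. cities as rows and items as columns.
pivoted = cube.pivot_table(index="city", columns="item",
                           values="dollars_sold", aggfunc="sum")
print(rollup, drilldown, sliced, diced, pivoted, sep="\n\n")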
Extraction, Transformation, and Loading (ETL) /ETL
Operations
• Data warehouse systems use back-end tools and utilities to populate and refresh
their data.
• These tools and utilities include the following functions:

– Data extraction, which typically gathers data from multiple,


heterogeneous, and external sources.

– Data cleaning, which detects errors in the data and rectifies them
when possible.

– Data transformation, which converts data from legacy or host format


to warehouse format.

– Load, which sorts, summarizes, consolidates, computes views, checks


integrity, and builds indices and partitions.

– Refresh, which propagates the updates from the data sources to the
warehouse.
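The sketch below shows, under assumed table and column names, how these back-end functions might look in a tiny Python ETL job: two differently formatted sources are extracted, cleaned and transformed into one common format, and loaded into a warehouse table (an in-memory SQLite database stands in for the warehouse).

# A minimal, hypothetical ETL sketch (sources, names, and rates are made up).
import pandas as pd
import sqlite3

# Extract: two heterogeneous sources (stand-ins for files / operational DBs).
orders_eu = pd.DataFrame({"order_date": ["31-01-2024", "01-02-2024"],
                          "amount": [100.0, None]})          # DD-MM-YYYY, EUR
orders_us = pd.DataFrame({"order_date": ["01/31/2024", "02/01/2024"],
                          "amount": [250.0, 80.0]})          # MM/DD/YYYY, USD

# Clean + transform: unify date formats, drop bad rows, convert currencies.
orders_eu["order_date"] = pd.to_datetime(orders_eu["order_date"], dayfirst=True)
orders_us["order_date"] = pd.to_datetime(orders_us["order_date"])
orders_eu["amount_usd"] = orders_eu["amount"] * 1.1           # assumed fixed rate
orders_us["amount_usd"] = orders_us["amount"]
combined = pd.concat([orders_eu, orders_us], ignore_index=True)
combined = combined.dropna(subset=["amount_usd"])[["order_date", "amount_usd"]]

# Load: write the consolidated data into the warehouse table.
with sqlite3.connect(":memory:") as conn:
    combined.to_sql("sales_fact", conn, if_exists="append", index=False)
    print(pd.read_sql("SELECT COUNT(*) AS n FROM sales_fact", conn))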

Data Preprocessing

What is the aim?

• How can the data be preprocessed in order to help improve the quality
of the data?

• How can the data be preprocessed in order to help improve the mining
results?

• How can the data be preprocessed so as to improve the efficiency and


ease of the mining process?

Why do we need Data Preprocessing? (Overview)
• In large real-world databases, three of the most common problems
are:

Inaccurate, Incomplete, Inconsistent → Data Preprocessing → Accurate, Complete, Consistent

Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or
names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of
quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness,
believability, value added, interpretability, accessibility
Why there are data problems?
• There are many possible reasons for inaccurate data (i.e., data having incorrect attribute values).

• The instrument used for data collection may be faulty, or there may be human or computer errors occurring at data entry.

• Users may purposely submit incorrect data values for mandatory


fields when they do not wish to submit personal information (e.g., by
choosing the default value “January 1” displayed for birthday). This is
known as disguised missing data. Errors in data transmission can also
occur.

• Incorrect data may also result from inconsistencies in naming


conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

• Relevant data may not be recorded due to a misunderstanding or


because of equipment malfunctions.
What are Data Pre-processing techniques/
tasks?
• Data cleaning can be applied to remove noise and correct
inconsistencies in data.

• Data integration merges data from multiple sources into a coherent


data store such as a data warehouse.

• Data transformations (e.g., normalization) may be applied, where


data are scaled to fall within a smaller range like 0.0 to 1.0. This can
improve the accuracy and efficiency of mining algorithms involving
distance measurements.

• Data reduction can reduce data size by, for instance, aggregating,
eliminating redundant features, or clustering.

• These techniques are not mutually exclusive; they may work together.
For example, data cleaning can involve transformations to correct
wrong data, such as by transforming all entries for a date field to a common format.
Forms of data
preprocessing
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy
data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming the task is classification—not effective in certain cases)

• Fill in the missing value manually: tedious + infeasible?


• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!

• Use the attribute mean to fill in the missing value


• Use the attribute mean for all samples of the same
class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based methods such as regression, Bayesian formula, or decision tree induction
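A short pandas sketch of the fill-in strategies is given below; the income values and class labels are hypothetical.

# Filling missing values with a constant, the attribute mean, and the
# class-wise attribute mean (columns and values are made up).
import pandas as pd

df = pd.DataFrame({
    "cls":    ["low", "low", "high", "high", "high"],
    "income": [30000.0, None, 90000.0, None, 110000.0],
})

# Global constant: fill with a sentinel such as -1 (or the label "unknown").
filled_const = df["income"].fillna(-1)

# Attribute mean: fill with the overall mean of the column.
filled_mean = df["income"].fillna(df["income"].mean())

# Class-wise mean: fill with the mean of samples in the same class (smarter).
filled_class_mean = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(filled_const, filled_mean, filled_class_mean, sep="\n\n")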
Noisy Data
• Q: What is noise?
• A: Random error in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which requires data
cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
– used also for discretization (discussed later)
• Clustering
– detect and remove outliers
• Semi-automated method: combined
computer and human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
Simple Discretization Methods:
Binning
• Equal-width (distance) partitioning:
– It divides the range into N intervals of equal size:
uniform grid
– if A and B are the lowest and highest values of
the attribute, the width of intervals will be: W =
(B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky.
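The difference between the two partitioning schemes can be seen with pandas cut (equal-width) and qcut (equal-depth) on the price list used in the smoothing example on the next slide; this pairing of functions is one common way to do it, not the only one.

# Equal-width vs. equal-depth partitioning of the same 12 prices into N = 3 bins.
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (B - A) / N = (34 - 4) / 3 = 10.
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals holding roughly the same number of samples (4 each).
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())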
Binning Methods for Data
Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21,
24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
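The NumPy sketch below reproduces the two smoothing results above from the same sorted prices; the reshape into three equi-depth bins of four values mirrors the partition shown on the slide.

# Smoothing the binned prices by bin means and by bin boundaries.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)            # three equi-depth bins of 4 values each

# Smoothing by bin means: replace every value with its (rounded) bin mean.
by_means = np.repeat(bins.mean(axis=1).round().astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearer bin boundary.
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)    # rows of 9s, 23s, 29s, as on the slide
print(by_bounds)   # [4 4 4 15], [21 21 25 25], [26 26 26 34]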
How to Handle Inconsistent
Data?
• Manual correction using external
references
• Semi-automatic using various tools
– To detect violation of known functional
dependencies and data constraints
– To correct redundant data
Data
Integration
• Data integration:
– combines data from multiple sources into a
coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world
entities from multiple data sources,
– Detecting and resolving data value conflicts
– for the same real world entity, attribute values
from different sources are different
– possible reasons: different representations,
different scales, e.g., metric vs. British units,
different currency
Handling Redundant
Data in Data Integration
• Redundant data occur often when integrating
multiple DBs
– The same attribute may have different names in
different databases
– One attribute may be a “derived” attribute in another
table, e.g., annual revenue

• Careful integration can help reduce/avoid


redundancies and inconsistencies and improve
mining speed and quality
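As a toy illustration of these points, the sketch below integrates two hypothetical sources in which the same attribute carries different names and one attribute is derivable from another, so keeping both would be redundant. The table and column names are invented.

# Integrating two made-up sources: rename to a common schema, merge,
# then drop an attribute that is derivable (hence redundant).
import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2], "monthly_revenue": [100.0, 250.0]})
erp = pd.DataFrame({"customer_id": [1, 2], "annual_revenue": [1200.0, 3000.0]})

# Schema integration: map differing attribute names onto one name.
erp = erp.rename(columns={"customer_id": "cust_id"})
merged = crm.merge(erp, on="cust_id")

# annual_revenue is derived (12 * monthly_revenue), so drop it after checking.
assert (merged["annual_revenue"] == 12 * merged["monthly_revenue"]).all()
merged = merged.drop(columns=["annual_revenue"])
print(merged)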
Data
Transformation
• Smoothing: remove noise from data (binning,
clustering, regression)
• Aggregation: summarization, data cube
construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small,
specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
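The three normalization methods listed above can be written out directly; a brief NumPy sketch follows, applied to a made-up attribute.

# Min-max, z-score, and decimal-scaling normalization (values are made up).
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: subtract the mean, divide by the standard deviation.
x_zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# the largest absolute scaled value is below 1 (here j = 4, since max|x| = 1000).
j = int(np.ceil(np.log10(np.abs(x).max() + 1)))
x_decimal = x / (10 ** j)

print(x_minmax, x_zscore, x_decimal, sep="\n")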
Data Reduction

• Problem:
Data Warehouse may store terabytes of
data: Complex data analysis/mining may
take a very long time to run on the
complete data set

• Solution?
– Data reduction…
Data Reduction
• Obtains a reduced representation of
the data set that is much smaller in
volume but yet produces the same (or
almost the same) analytical results
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Data compression
– Numerosity reduction
– Discretization and concept hierarchy
generation
Data Cube Aggregation
• Multiple levels of aggregation in data
cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation capable to
solve the task
• Queries regarding aggregated
information should be answered using
data cube, when possible
Data
Compression
• String compression
– There are extensive theories and well-tuned
algorithms
– Typically lossless
– But only limited manipulation is possible without
expansion
• Audio/video, image compression
– Typically lossy compression, with progressive
refinement
– Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
• Time sequence is not audio
– Typically short and vary slowly with time
Data Compression

[Figure: with lossless compression the original data can be exactly reconstructed from the compressed data; with lossy compression only an approximation of the original data is recovered]
END OF Module-1….
THANK YOU!!!

