
Year/Section: B.Tech III / C & D
Faculty: N. Swapna (Associate Professor)
 Unit 1: Data Management: Design Data Architecture and manage the data for analysis, understand various sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Pre-processing & Processing
 Unit 2: Data Analytics: Introduction to Analytics,
Introduction to Tools and Environment, Application of
Modeling in Business, Databases & Types of Data and
variables, Data Modeling Techniques, Missing
Imputations etc. Need for Business Modeling
 Unit 3: Regression – Concepts, BLUE property assumptions, Least Square Estimation, Variable Rationalization, and Model Building etc. Logistic Regression: Model Theory, Model Fit Statistics, Model Construction, Analytics applications to various Business Domains etc.
 Unit 4: Object Segmentation: Regression Vs Segmentation – Supervised and Unsupervised Learning, Tree Building – Regression, Classification, Overfitting, Pruning and Complexity, Multiple Decision Trees etc.
Time Series Methods: ARIMA, Measures of Forecast Accuracy, STL approach, Extract features from generated model such as Height, Average Energy etc. and Analyze for prediction.
 Unit 5 :Data Visualization: Pixel-Oriented Visualization
Techniques, Geometric Projection Visualization
Techniques, Icon-Based Visualization Techniques,
Hierarchical Visualization Techniques, Visualizing Complex
Data and Relations.
 Student’s Handbook for Associate Analytics – II, III.
 Data Mining: Concepts and Techniques, Han and Kamber, 3rd Edition, Morgan Kaufmann Publishers.
 Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
 Data Mining and Analysis: Fundamental Concepts and Algorithms, M. Zaki and W. Meira.
 https://lecturenotes.in/notes/6099-notes-for-data-analytics-da-by-prasanta-bal
 https://lecturenotes.in/notes/7775-notes-for-data-analytics-da-by-ranu-singh
 https://nptel.ac.in/courses/110/106/110106072/
 https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/#:~:text=Regression%20analysis%20is%20a%20form,effect%20relationship%20between%20the%20variables.
 https://www.analyticsvidhya.com/blog/2019/04/introduction-image-segmentation-techniques-python/
 https://en.wikipedia.org/wiki/Data_visualization
 https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm
 https://www.hackerearth.com/practice/data-structures/advanced-data-structures/segment-trees/tutorial/
 Assignments
 Tutorials
 Group discussions
 Online quiz
 Interactive sessions
 Seminars
 Guest lectures
Not limited to above methods
At the completion of the course, the student will be able to:
 Analyze a given data set.
 Develop a mini project.
 Data Management: Design Data Architecture and manage the data for analysis, understand various sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Pre-processing & Processing
 Enterprise is a collection of organizations or business units that
share a common set of goals.
 Architecture is the fundamental organization of components,
their relationships and the principles governing their design and
evolution.
 An Enterprise Architecture is a conceptual blueprint that defines the structure and operation of an organization.
 The intent of Enterprise architecture is to determine how an
organization can most effectively achieve its current and future
objectives.
 The Architecture is the structure and description of a system or
enterprise.
 It includes all principles, policies and standards that ensure that the system will work smoothly.
Architecture domains of enterprises architecture:
 Business architecture
 Data architecture
 Applications architecture
 Technology architecture
Enterprise Architecture Framework (The Other View):
Data architecture deals with designing and constructing the data resource:
Data architecture provides methods to design, construct and implement a fully integrated, business-driven data resource that includes real-world objects and events. It is one of the pillars of enterprise architecture.
Data architecture is the work of data architects. These professionals define the target state, align development with it and, once the data architecture is implemented, make enhancements tailored to the needs of the original blueprint.
A data architecture that is not stable or robust can lead to failures that cause the whole system to break down and result in significant monetary loss to the organization.
The data architecture is broken down into three traditional architectural processes:
The conceptual aspect represents all the business entities and their related attributes.
The logical aspect represents the entire logic of the entity relationships.
The physical aspect is the realization of the actual data mechanisms for particular types of functionality.
 The "data" column of the Zachman Framework for enterprise
architecture –
Layer View Data (What) Stakeholder

List of things and architectural


1 Scope/Contextual standards important to the Planner
business

Semantic model
2 Business Model/Conceptual or Conceptual/Enterprise Data Owner
Model

3 System Model/Logical Enterprise/Logical Data Model Designer

4 Technology Model/Physical Physical Data Model Builder

5 Detailed Representations Actual databases Subcontractor


 Data architecture includes a complete analysis of the relationships
among an organization's functions, available technologies, and data
types.
The general framework of the data architecture takes into account three broad aspects of the whole information system:
 The first is the physical data architecture which is focused on actual
tangible hardware elements. It depends on the bulk of data to be
processed and the number of data consumers who may be accessing
the data warehouse simultaneously. Investment in physical data
architecture includes buying computer servers, routers and other
network devices.
 The second aspect pertains to the elements of the data architecture. For example, the structure of the administrative section of the implementing company, and how data is used within that section, should be described in the data architecture.
 The third aspect pertains to the forces at work, which can exert various influences and constraints that will potentially affect the design. These forces include economics, business policies, enterprise requirements, technology drivers and data processing needs.
Various constraints and influences will have an effect on
data architecture design. These include:
 Enterprise requirements
 Technology drivers
 Economics
 Business policies and
 Data processing needs.
The view of architecture domains as layers can be presented as:

 Environment (the external entities and activities monitored, supported or directed by the business).
 Business Layer (business functions offering services to each other and to external entities).
 Data Layer (business information and other valuable stored data).
 Information System Layer (business applications offering information services to each other and to business functions).
 Technology Layer (generic hardware, network and platform applications offering platform services to each other and to business applications).
 Each layer delegates work to the layer below. In each layer, the
components, the processes and the services can be defined at a
coarse-grained level and decomposed into finer-grained components,
processes and services.
Data Management
Data management is the implementation of policies and
procedures that put organizations in control of their business
data regardless of where it resides.

What do I need to know about data management?
 Data management is concerned with the end-to-end lifecycle of data, from creation to retirement, and the controlled progression of data to and from each stage within its lifecycle. Data management minimizes the risks and costs of regulatory non-compliance, legal complications, and security breaches. It also provides access to accurate data when and where it is needed, without ambiguity or conflict, thereby avoiding miscommunication.
 What kind of data is managed via data management?
Any kind of business data is subject to data management principles and procedures, but it is particularly useful in rectifying conflict among data from duplicative sources. Organizations that use cloud-based applications in particular find it hard to keep data orchestrated across systems. Data management practices can prevent ambiguity and make sure that data conforms to organizational best practices for access, storage, backup, and retirement, among other things.

 What are the processes of data management?
A common approach to data management is to maintain a master data file, an approach called Master Data Management (MDM). This file provides a common definition of an asset and all its data properties, in an effort to eliminate ambiguous or competing data policies and give the organization comprehensive stewardship over its data.

 What are the benefits of data management?


The benefits of data management include enhanced compliance, greater
security, improved sales and marketing strategies, better product
classification, and improved data governance to reduce organizational risk.
A study includes developing and executing a plan for collecting data, as well as the analysis and interpretation of the data; a data analysis presumes the data have already been collected.
More specifically, a study includes:
 the development of a hypothesis or question
 the designing of the data collection process
 the collection of the data, and
 the analysis and interpretation of the data.
Because a data analysis presumes that the data have
already been collected, it includes development and refinement
of a question and the process of analyzing and interpreting the
data.
 In reality, data analysis is a highly iterative and non-linear
process, better reflected by a series of epicycles.
 An epicycle is a small circle whose centre moves around
the circumference of a larger circle.
 Managing data includes data preparation (data pre-processing), exploratory data analysis, statistical analysis, and descriptive analysis through plotting and visualisation, as well as merging and joining datasets/tables as per the needs of the analysis, before the actual data analysis happens.
 Data science is a concept used to tackle big data and includes
data cleansing, preparation, and analysis.

 A data scientist gathers data from multiple sources and applies machine learning, predictive analytics, and sentiment analysis to extract critical information from the collected data sets.
 They understand data from a business point of view and can provide accurate predictions and insights that can be used to power critical business decisions.
Managing Data Analysis describes the process of analyzing data and
how to manage that process.

We describe the iterative nature of data analysis and the role of stating
a sharp question, exploratory data analysis, inference, formal
statistical modeling, interpretation, and communication.

In addition, we will describe how to direct analytic activities within a team and to drive the data analysis process towards coherent and useful results.
There are 5 core activities of data analysis:
 Stating and refining the question
 Exploring the data
 Building formal statistical models
 Interpreting the results
 Communicating the results

These 5 activities can occur at different time scales: for example, you might go through all 5 in the course of a day, but also deal with each, for a large project, over the course of many months.
Epicycles of Analysis:
More specifically, for each of the five core activities, it is
critical that you engage in the following steps:
1. Setting Expectations,
2. Collecting information (data), comparing the data to
your expectations, and if the expectations don’t match,
3. Revising your expectations or fixing the data so your
data and your expectations match. Iterating through this
3-step process is what we call the “epicycle of data
analysis.”
1) Stating and refining the research question:
# Research Question: Is the mileage for automatic and manual cars the same?
2) Start exploratory data analysis (EDA)
 EDA is preliminary analysis done before the actual analysis.
(i) Check whether the dataset is completely loaded or not using the head and tail commands.
(ii) Get the structure of the dataset using the str() function.
(iii) Get the summary statistics of the dataset using summary().
(iv) Use data visualization techniques for this checking: histograms, boxplots, and density plots, to ensure that the analysis variable or “y” variable is numerical, normally distributed, and has no missing values and no outliers.
(v) Get the statistics for numeric attributes and frequency distribution tables for text/categorical attributes using summary() and table() respectively.
(vi) Create some new columns to suit the analysis question.
(vii) mtcars is a default dataset available in R.


dim(mtcars)
cardata <- mtcars               # copy the mtcars data set into a new variable
head(cardata)                   # first 6 rows
head(cardata, 10)               # first 10 rows
tail(cardata)                   # last 6 rows of data
str(cardata)                    # identify the variables: columns are variables (11), rows are observations (32)
summary(cardata)                # statistics for numeric columns and frequency tables for text/categorical columns
sd(cardata$mpg, na.rm = FALSE)  # standard deviation (for a specific variable)
lapply(cardata, sd)             # sd of every column of the data set (sapply returns a simplified vector)
lapply(cardata, median)         # median for all variables in the data set

# The analysis variable ("y" variable) should be numerical,
# normally distributed, with no missing values and no outliers.
# We use data visualization techniques for this checking:
# histograms, boxplots and density plots.
hist(cardata$mpg)
boxplot(cardata$mpg, horizontal = TRUE)  # horizontal box-and-whisker plot
plot(density(cardata$mpg))

library(psych)                  # or put a tick mark in the Packages list
describe(cardata)               # gives skew, kurtosis, etc. (skew = 0.61 for mpg)
describe(cardata$mpg)

# Data manipulation stage
summary(cardata$am)             # am has 0's and 1's (0 = automatic, 1 = manual)
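To connect the EDA back to the research question above, here is a minimal sketch (using the cardata copy created earlier); the t-test step is added for illustration and is not prescribed by the handbook:
cardata$am <- factor(cardata$am, labels = c("Automatic", "Manual"))  # label the transmission type
boxplot(mpg ~ am, data = cardata, horizontal = TRUE)                 # compare the mpg distributions
t.test(mpg ~ am, data = cardata)                                     # Welch two-sample t-test of mpg by transmission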
Data can be generated from two types of sources, namely primary and secondary.
Primary sources of data:
The sources of generating primary data are –
 Observation Method (questionnaires, depth interviews, focus group interviews, case studies)
 Survey Method – online surveys, offline surveys
 Online – all mandatory fields are filled in
 Use appropriate attributes to get a good survey
 Offline – prone to missing data
 Experimental Method

In form for example : 20 fields


atmost 10 fields
missing data is result of offline collecting
Experimental Method
There are a number of experimental designs that are used in carrying out an experiment.
However, market researchers have used 4 experimental designs
most frequently. These are -
1. CRD – Completely Randomized Design
2. RBD – Randomized Block Design
3. LSD – Latin Square Design
4. FD – Factorial Design
Tools: R (a statistical language), Python.

RBD - Randomized Block Design
 The term originated from agricultural research.

 In this design several treatments or variables are applied to different blocks of land to ascertain their effect on the yield of the crop.
 Blocks are formed in such a manner that each block contains as many plots as the number of treatments, so that one plot from each block is selected at random for each treatment. The production of each plot is measured after the treatment is given.
 These data are then interpreted and inferences are drawn by using the analysis of variance technique, so as to know the effect of various treatments like different doses of fertilizers, different types of irrigation, etc.
 http://www.r-tutor.com/elementary-statistics/analysis-
variance/randomized-block-design
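A hedged sketch of how such a randomized block experiment might be analyzed in R with aov(); the data frame yield_data and its columns are hypothetical names used only for illustration:
# `yield_data` is hypothetical: yield (response), treatment and block (factors)
fit <- aov(yield ~ treatment + block, data = yield_data)
summary(fit)   # ANOVA table showing the treatment and block effects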
LSD - Latin Square Design –
A Latin square is one of the experimental designs which has a balanced two-way classification scheme, for example a 4 x 4 arrangement. In this scheme each letter from A to D occurs only once in each row and also only once in each column. The balanced arrangement, it may be noted, will not get disturbed if any row is interchanged with another.

B C D A
C D A B
D A B C
A B C D
The balanced arrangement achieved in a Latin square is its main strength. In this design the comparisons among treatments will be free from both differences between rows and columns. Thus the magnitude of error will be smaller than in any other design.
FD - Factorial Designs –
This design allows the experimenter to test two or more
variables simultaneously. It also measures interaction
effects of the variables and analyzes the impacts of each
of the variables. In a true experiment, randomization is
essential so that the experimenter can infer cause and
effect without any bias.
Sources of Secondary Data:
Primary data can be collected through questionnaires, depth interviews, focus group interviews, case studies, experimentation and observation.

The secondary data can be obtained through:
 Internal Sources – These are within the organization
 External Sources – These are outside the organization
Internal Sources of Data
 Internal secondary data can be obtained with less time, effort
and money than the external secondary data.
 It is within the organization and hence more pertinent.

The internal sources include:
 Accounting resources – These give a lot of information which can be used by the marketing researcher. They give information about internal factors.
 Sales Force Reports – These give information about the sale of a product. The information provided relates to activity outside the organization.
 Internal Experts – These are people who head the various departments. They can give an idea of how a particular thing is working.
 Miscellaneous Reports – The information obtained from various operational reports.
If the data available within the organization are unsuitable
or inadequate, the marketer should extend the search to
external secondary data sources.

External Sources of Data


External Sources are sources which are outside the
company in a larger environment. Collection of external
data is more difficult because the data have much greater
variety and the sources are much more numerous.
External data can be divided into the following classes.
Government Publications – Government sources provide an extremely rich pool of data for researchers.
Data are available free of cost on internet websites. There are a number of government agencies generating data. These are:
Registrar General of India(office)-
Generates demographic data. It includes details of gender, age,
occupation etc.
Central Statistical Organization-
This organization publishes the national accounts statistics. It
contains estimates of national income for several years, growth
rate, and rate of major economic activities. Annual survey of
Industries is also published by the CSO. It gives information
about the total number of workers employed, production units,
material used and value added by the manufacturer.
 Director General of Commercial Intelligence – This office operates from Kolkata. It gives information about foreign trade, i.e. imports and exports. It provides data region-wise and country-wise.
 Ministry of Commerce and Industries – This ministry, through the office of the economic advisor, provides information on the wholesale price index. These indices may be related to a number of sectors like food, fuel, power, food grains etc. It also generates All India Consumer Price Index numbers for industrial workers, urban non-manual employees and agricultural labourers.
 Planning Commission- It provides the basic statistics of Indian
Economy.
 Reserve Bank of India- This provides information on
Banking Savings and investment. RBI also prepares
currency and finance reports.

 Labour Bureau – It provides information on skilled, unskilled, white-collar jobs etc.
 National Sample Survey – This is done by the Ministry of Planning and it provides social, economic, demographic, industrial and agricultural statistics.
 Department of Economic Affairs – It conducts the economic survey and also generates information on income, consumption, expenditure, investment, savings and foreign trade.

 State Statistical Abstract – This gives information on various types of activities related to the state, like commercial activities, education, occupation etc.
Non-Government Publications – These include publications of various industrial and trade associations, such as:
 The Indian Cotton Mill Association
 Various chambers of commerce
 The Bombay Stock Exchange (it publishes a directory containing financial accounts, key profitability and other relevant matter)
 Various Associations of Press Media
 Export Promotion Council
 Confederation of Indian Industries (CII)
 Small Industries Development Board of India
 Different mills like woolen mills, textile mills etc.

The only disadvantage of the above sources is that the data may
be biased. They are likely to colour their negative points.
Syndicate Services- These services are provided by certain organizations
which collect and tabulate the marketing information on a regular basis for a
number of clients who are the subscribers to these services. So the services
are designed in such a way that the information suits the subscriber. These
services are useful in television viewing, movement of consumer goods etc.
These syndicate services provide information data from both household as well
as institution.
In collecting data from households they use three approaches:
Survey – They conduct surveys regarding lifestyle, sociographics and general topics.
Mail Diary Panel – It may be related to 2 fields: purchase and media.
Electronic Scanner Services – These are used to generate data on volume.
They collect data for institutions from:
 Wholesalers
 Retailers, and
 Industrial firms
Various syndicate services are the Operations Research Group (ORG) and the Indian Marketing Research Bureau (IMRB).
Importance of Syndicate Services
Syndicate services are becoming popular since the constraints of decision making are
changing and we need more of specific decision-making in the light of changing
environment. Syndicate services provide information to the industries at a low unit cost.

Disadvantages of Syndicate Services
The information provided is not exclusive. A number of research agencies provide customized services which suit the requirements of each individual organization.

International Organizations – These include:
The International Labour Organization (ILO) – It publishes data on the total and active population, employment, unemployment, wages and consumer prices.
The Organization for Economic Co-operation and Development (OECD) – It publishes data on foreign trade, industry, food, transport, and science and technology.
The International Monetary Fund (IMF) – It publishes reports on national and international foreign exchange regulations.
 Data mining (extracting knowledge from available data) applications are often applied to data that was collected for another purpose, or for future but unspecified applications. By contrast, much of statistics deals with the design of experiments or surveys that achieve a prespecified level of data quality. The first step is the detection and correction of data quality problems, often called data cleaning.
 Specific aspects of data quality:
 Measurement and data collection issues
 Issues related to applications
 1. Measurement and data collection issues:
 It is unrealistic to expect that the data will be perfect. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process. Values or even entire data objects may be missing, or there may be duplicates, i.e. multiple data objects that all correspond to a single “real” object, as well as inconsistencies in the data.
Issues:
Measurement and data collection errors:
The term measurement error refers to any problem resulting from the measurement
process. The values recorded differ from true values. For continuous attributes
the numerical difference of the measured and true value is called the error.
Examples of attribute types (class notes):
Percentage = {1, 2, 3, 4, ..., 100}
Price = {100-200} – continuous data
Gender = {male, female} – categorical data
Ordinal = {infant, young, middle-aged, old-aged}; days of the week are also ordinal

Error = actual (true) value - measured (predicted) value
The term data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object. For certain types of data errors, for example keyboard errors, there often exist well-developed techniques for detecting and/or correcting them with human intervention.
Noise and Artifacts:

 Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects. The term noise is often used in connection with data that has a spatial or temporal component. Techniques from signal or image processing can frequently be used to reduce noise. Data errors may also be the result of a more deterministic phenomenon, such as a streak in the same place on a set of photographs. Such deterministic distortions of the data are often referred to as artifacts.
 Example: a true value of 10 recorded as 10.00001 reflects a small amount of noise in the measurement.
 Precision, Bias and Accuracy:
 In statistics and experimental science, the quality of the
measurement process and the resulting data are
measured by precision and bias.
 The precision is defined as the closeness of repeated
measurement of the same quantity to one another.
 The bias is a systematic variation of measurements from
the quantity being measured.
 Precision is often measured by the standard deviation
of a set of values while bias is measured by taking the
difference between the mean of the set of values and
the known values of the quantity being measured.
 Example: Suppose we have a standard laboratory weight with a mass of 1 g and we want to assess the precision and bias of our new laboratory scale.
 We weigh the mass five times and obtain the following five values: {1.015, 0.990, 1.013, 1.001, 0.986}.
 The mean of these values is 1.001 and hence the bias is 0.001 (1.001 - 1 g).
 The precision, measured by the standard deviation, is 0.013.
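As a small sketch (not part of the original handout), the same numbers can be checked in R:
weights <- c(1.015, 0.990, 1.013, 1.001, 0.986)
mean(weights)        # 1.001
mean(weights) - 1    # bias = 0.001 (true mass is 1 g)
sd(weights)          # precision, roughly 0.013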
 It’s important to note that measurement systems can suffer
from both accuracy and precision problems! A dart board can
help us visualize the difference between the two concepts:
 The standard deviation is a measure of the amount of
variation or dispersion of a set of values. A low
standard deviation indicates that the values tend to be
close to the mean of the set, while a high standard
deviation indicates that the values are spread out over a
wider range.
(Dartboard illustration panels: Accurate and Precise; Precise but not Accurate; Accurate but not Precise; Neither Accurate nor Precise.)


 Accuracy refers to the degree of measurement error in data. It is defined as the closeness of measurements to the true value of the quantity being measured. Accuracy depends on both precision and bias.
 Unbiased measurements are centred on the target; biased measurements are systematically shifted away from it.
 Outliers:
 Outliers are
 (1) Data objects that have characteristics that are different from
most of the other data objects in the data set.
 (2) Values of an attribute that are unusual with respect to the typical values for that attribute. Outliers are also called anomalous objects or values.
 It is important to distinguish between the notions of noise and outliers. Outliers can be legitimate data objects or values. Thus, unlike noise, outliers may sometimes be of interest, for example in fraud detection and network intrusion detection.
 1-100
 254--- 54 poor data quality
 {a-z}property of name
 Roll no {range}
 10--- 10.00003
 Handling Outliers with R—statistical language

 An outlier is a point or an observation that deviates significantly from the other observations.
 Due to experimental errors or “special circumstances”
 Outlier detection tests to check for outliers
 Outlier treatment –
Exclusion
Retention

 Other treatment methods
 We have the “outliers” package in R to detect and treat outliers in data.
Normally we use box plots and scatter plots to find outliers from a graphical representation.
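A minimal sketch of detecting outliers numerically, assuming the outliers package is installed; the sample vector is made up for illustration:
x <- c(12, 14, 15, 14, 13, 102)   # 102 is an obvious outlier
boxplot.stats(x)$out              # values falling outside the box plot whiskers
library(outliers)                 # install.packages("outliers") if needed
outlier(x)                        # observation with the largest deviation from the mean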
There are three commonly accepted ways of modifying outlier
values.
 Remove the case. If you have many cases and there does not
appear to be an explanation for the appearance of this value,
or if the explanation is that it is in error, you can simply get
rid of it.
 Assign the next value nearer to the median in place of the
outlier value. Here that would be 130.997, the next lower
value. You will find that this approach leaves the
distribution close to what it would be without the value. You
can use this approach if you have few cases and are trying
to maintain the number of observations you do have.
 Calculate the mean of the remaining values without the
outlier and assign that to the outlier case. While I have seen
this frequently, I don’t really understand its justification and
I think it distorts the distribution more than the previous
solution.
Method-1: Case Study

# Inject outliers into data.
cars1 <- cars[1:30, ]                                              # original data
cars_outliers <- data.frame(speed = c(19, 19, 20, 20, 20),
                            dist = c(190, 186, 210, 220, 218))     # introduce outliers
cars2 <- rbind(cars1, cars_outliers)                               # data with outliers

# Plot of data with outliers.
par(mfrow = c(1, 2))
plot(cars2$speed, cars2$dist, xlim = c(0, 28), ylim = c(0, 230),
     main = "With Outliers", xlab = "speed", ylab = "dist",
     pch = "*", col = "red", cex = 2)
abline(lm(dist ~ speed, data = cars2), col = "blue", lwd = 3, lty = 2)

# Plot of original data without outliers.
# Note the change in slope (angle) of the best-fit line.
plot(cars1$speed, cars1$dist, xlim = c(0, 28), ylim = c(0, 230),
     main = "Outliers removed \n A much better fit!",
     xlab = "speed", ylab = "dist", pch = "*", col = "red", cex = 2)
abline(lm(dist ~ speed, data = cars1), col = "blue", lwd = 3, lty = 2)
Notice the change in slope of the best-fit line after removing the outliers. Had we used the outliers to train the model (left chart), our predictions would be exaggerated (high error) for larger values of speed because of the larger slope.
Missing values:
 It is not unusual for an object to be missing one or more attribute values. In some cases, the information was not collected; for example, some people decline to give their phone number or age. In other cases, some attributes are not applicable to all objects. Regardless, missing values should be taken into account during the data analysis.
 Imagine that you need to analyze AllElectronics sales and
customer data. You note that many tuples have no recorded
value for several attributes such as customer income. How
can you go about filling in the missing values for this
attribute? Let’s look at the following methods.
There are several strategies for dealing with missing data.
 Eliminate data objects or attributes: A simple and effective strategy is to eliminate objects with missing values. But if many objects have missing values, then a reliable analysis can be difficult or impossible. If a data set has only a few objects that have missing values, then a related strategy is to omit them.
 Estimate missing values: Sometimes missing data can be
reliably estimated. For example consider a time series that
changes in a reasonably smooth fashion, but has a few widely
scattered missing values. In such cases the missing values can
be estimated by using the remaining values. Another example
is considering a data set that has many similar data points. In
this situation, the attribute values of the points closest to the
point with the missing values are often used to estimate the
missing value.
 If the attribute is continuous, then the average attribute value of the
nearest neighbors is used. If the attribute is categorical, then the most
commonly occurring attribute value can be taken.
 Ignore the missing values during analysis: Many data mining
approaches can be modified to ignore missing values. For example,
suppose that objects are being clustered and the similarity between pairs
of the data objects needs to be calculated. If one or both objects of a pair
have missing value for some attributes, then the similarity can be
calculated by using only the attributes that do not have missing values.
 Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many
missing values.
 Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant such as a label like “Unknown” or
−∞. If missing values are replaced by, say, “Unknown,” then the mining
program may mistakenly think that they form an interesting concept,
since they all have a value in common— that of “Unknown.” Hence,
although this method is simple, it is not foolproof.
 Use a measure of central tendency for the attribute (e.g.,
the mean or median) to fill in the missing value: For
example, suppose that the data distribution regarding the
income of AllElectronics customers is symmetric and that
the mean income is $56,000. Use this value to replace the
missing value for income.
 Use the attribute mean or median for all samples
belonging to the same class as the given tuple: For
example, if classifying customers according to credit risk,
we may replace the missing value with the mean income
value for customers in the same credit risk category as that
of the given tuple. If the data distribution for a given class is
skewed, the median value is a better choice.
 Use the most probable value to fill in the missing value:
This may be determined with regression, inferencebased
tools using a Bayesian formalism, or decision tree.
 In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., the result of 0/0) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.
 To test whether there are any missing values in a dataset we use the is.na() function.
For example, we define “y” and then check whether there is any missing value; TRUE means that there is a missing value.
Testing for missing values:
y <- c(1, 2, 3, NA)
is.na(y)   # returns the vector FALSE FALSE FALSE TRUE

 Arithmetic functions on missing values yield missing values.
For example:
mean(y)    # returns NA
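Many base R functions also accept an na.rm argument that drops the missing values before computing; this line is a small illustrative addition:
mean(y, na.rm = TRUE)   # returns 2: the NA is dropped before the mean is computed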
 2. MICE Package -> Multiple Imputation by Chained Equations
MICE uses PMM to impute missing values in a dataset.
PMM -> Predictive Mean Matching (PMM) is a semi-parametric imputation approach. It is similar to the regression method, except that for each missing value it fills in a value drawn randomly from among the observed donor values whose regression-predicted values are closest to the regression-predicted value for the missing value under the simulated regression model.
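A minimal sketch of PMM imputation with the mice package, assuming it is installed; `df` is a hypothetical data frame containing missing values:
library(mice)
imp <- mice(df, m = 5, method = "pmm", seed = 500)  # five imputed datasets via predictive mean matching
completed <- complete(imp, 1)                       # first completed (imputed) dataset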

 There are four ways you can handle missing values:
 1. Deleting the observations
 If you have a large number of observations in your dataset, where all the classes to be predicted are sufficiently represented in the training data, then try deleting (or not including while model building, for example by setting na.action=na.omit) those observations (rows) that contain missing values. Make sure that after deleting the observations you:
 still have sufficient data points, so the model doesn’t lose power;
 do not introduce bias (meaning disproportionate or non-representation of classes).
# Example: BostonHousing is available in the mlbench package
library(mlbench); data(BostonHousing)
lm(medv ~ ptratio + rad, data = BostonHousing, na.action = na.omit)  # though na.omit is the default in lm()
 2. Deleting the variable
 If a particular variable has more missing values than the rest of the variables in the dataset, and if by removing that one variable you can save many observations, then you are better off without that variable, unless it is a really important predictor that makes a lot of business sense. It is a matter of deciding between the importance of the variable and losing out on a number of observations.
 3. Imputation with mean / median / mode
 Replacing the missing values with the mean / median / mode
is a crude way of treating missing values. Depending on the
context, like if the variation is low or if the variable has low
leverage over the response, such a rough approximation is
acceptable and could possibly give satisfactory results.
library(Hmisc)
impute(BostonHousing$ptratio, mean)    # replace with the mean
impute(BostonHousing$ptratio, median)  # replace with the median
impute(BostonHousing$ptratio, 20)      # replace with a specific number
# or, if you want to impute manually:
BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <- mean(BostonHousing$ptratio, na.rm = T)  # not run
 Let’s compute the accuracy when the missing values are imputed with the mean:
library(DMwR)
# `original` holds the ptratio values before the missing values were introduced
actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- rep(mean(BostonHousing$ptratio, na.rm = T), length(actuals))
regr.eval(actuals, predicteds)
#> mae        mse        rmse       mape
#> 1.62324034 4.19306071 2.04769644 0.09545664
 Inconsistent values:
 Data can contain inconsistent values. Consider an
address field, where both a zip code and city are listed,
but the specified zip code is not contained in that city. It
may be that the individual entering this information
transposed two digits or perhaps a digit was misread
when the information was scanned from a handwritten
form.
 Regardless of the cause of the inconsistent values, it is
important to detect and if possible correct such
problems. Some types of inconsistencies are easy to
detect. Example a person's height cannot be negative.
Once an inconsistency has been detected, it is sometimes
possible to correct the data.
 Duplicate data: A data set may include data objects that are duplicates, or almost duplicates, of one another. Many people receive duplicate mailings because they appear in a database multiple times under slightly different names.
 To detect and eliminate such duplicates two main issues
must be addressed.
 First, if there are two objects that actually represent a single
object, then the values of corresponding attributes may
differ and these inconsistent values must be resolved.
 Second, care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates, such as two distinct people with identical names. The term deduplication is often used to refer to the process of dealing with these issues.
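A small illustrative sketch of detecting and removing duplicate rows in R; the toy data frame is made up:
df <- data.frame(name = c("Anita", "Anita", "Ravi"),
                 city = c("Hyderabad", "Hyderabad", "Pune"))
duplicated(df)         # FALSE TRUE FALSE: the second row repeats the first
df[!duplicated(df), ]  # keep only the first occurrence of each row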
 2. Issues related to Applications:
 Data quality issues can also be considered from an
application viewpoint as expressed by the statement.
“Data is of high quality if it is suitable for its intended
use”. This approach to data quality has proven quite
useful.
 As with quality issues at the measurement and data
collection level, there are many issues that are specific
to particular applications and fields.
 Few of the general issues are
 1. Timeliness: Some data starts to age as soon as it has been
collected. For example if the data provides a snapshot of
some ongoing phenomenon or process such as the purchasing
behavior of customers or web browsing patterns, then this
snapshot represents reality for only a limited time. If the data
is out of date, then so are the models and patterns that are
based on it.
 2. Relevance: The available data must contain the information necessary for the application. Consider, for example, the task of building a model that predicts the accident rate for drivers. If information about the age and gender of the driver is omitted, then it is likely that the model will have limited accuracy.
 Making sure that the objects in a data set are relevant is also
challenging. A common problem is sampling bias, which
occurs when a sample does not contain different types of
objects in proportion to their actual occurrence in the
population.
 3. Knowledge about the data: Ideally, datasets are accompanied by documentation that describes different aspects of the data. The quality of this documentation can either aid or hinder (obstruct) the subsequent analysis. Other important characteristics are the precision of the data, the type of features (nominal, ordinal, interval, ratio), the scale of measurement (for example, meters or feet for length) and the origin of the data.
Data Preprocessing
Why preprocessing?
Real world data are generally
 Incomplete: Lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 Noisy: Containing errors or outliers
 Inconsistent: Containing discrepancies in codes or names

Tasks in data preprocessing:


Data cleaning: Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies.
Data integration: Using multiple databases, data cubes, or files.
Data transformation: Normalization and aggregation.
Data reduction: Reducing the volume but producing the same or
similar analytical results.
 Data discretization: It is a part of data reduction, replacing numerical attributes with nominal ones.
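As an illustration of discretization, a small sketch using R's cut() to replace a numeric attribute with nominal categories (the breaks and labels are chosen only for illustration):
age <- c(5, 17, 25, 42, 67)
cut(age, breaks = c(0, 12, 19, 40, 60, 100),
    labels = c("child", "teen", "adult", "middle-aged", "senior"))  # numeric ages become nominal labels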
Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. The following are the basic approaches used in data cleaning as a process.
Missing Values:
Imagine that you need to analyze AllElectronics sales and customer data. You note
that many tuples have no recorded value for several attributes such as customer
income. How can you go about filling in the missing values for this attribute?
Let’s look at the following methods.

Ignore the tuple: This is usually done when the class label is missing (assuming
the mining task involves classification). This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially
poor when the percentage of missing values per attribute varies considerably. By
ignoring the tuple, we do not make use of the remaining attributes’ values in the
tuple. Such data could have been useful to the task at hand.

Fill in the missing value manually: In general, this approach is time consuming
and may not be feasible given a large data set with many missing values.
The remaining methods – using a global constant, using a measure of central tendency (the mean or median), using the class-wise mean or median, and using the most probable value – are the same as those described earlier under Missing values.
 Noisy Data:
 Noise is a random error or variance in a measured variable. Basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise.
 Given a numeric attribute such as, say, price, how can we “smooth” out the data to remove the noise? Let’s look at the following data smoothing techniques.
 Smoothing by:
 bin means
 bin medians
 bin boundaries
Binning:
 Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. The example below illustrates some binning techniques. Here, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9; therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15 (mean = (4 + 8 + 15)/3 = 9)
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34 (boundaries: 25 and 34)
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries (Bin 1 boundaries are 4 and 15):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
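A minimal R sketch of the smoothing-by-bin-means step above, using the same price values:
price <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)
bins  <- split(sort(price), rep(1:3, each = 3))    # three equal-frequency bins of size 3
lapply(bins, function(b) rep(mean(b), length(b)))  # bin means 9, 22, 29 repeated within each bin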
Regression:

Data smoothing can also be done by regression, a


technique that conforms data values to a function.
Linear regression involves finding the “best” line to fit
two attributes (or variables) so that one attribute can be
used to predict the other. Multiple linear regression is an
extension of linear regression, where more than two
attributes are involved and the data are fit to a
multidimensional surface.
Aggregate Sale—y- dependent variable
Item1—x1– independent variable
Item2—x2– independent variable
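A hedged sketch of regression-based smoothing for the example above; `sales`, `y`, `x1` and `x2` are hypothetical names:
# `sales` is a hypothetical data frame with columns y, x1 and x2
fit <- lm(y ~ x1 + x2, data = sales)  # multiple linear regression
fitted(fit)                           # smoothed (predicted) values of y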
Outlier analysis:
Outliers may be detected by clustering, for example,
where similar values are organized into groups, or
“clusters.” Intuitively, values that fall outside of the set
of clusters may be considered outliers.
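A small sketch, not from the handout, of flagging the points farthest from their k-means cluster centre as potential outliers, using two mtcars attributes:
m  <- as.matrix(mtcars[, c("mpg", "hp")])
km <- kmeans(m, centers = 3, nstart = 20)
d  <- sqrt(rowSums((m - km$centers[km$cluster, ])^2))  # distance of each car from its cluster centre
head(sort(d, decreasing = TRUE))                       # largest distances are outlier candidates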
Data Integration
Data mining often requires data integration—the merging of data
from multiple data stores. Careful integration can help reduce and
avoid redundancies and inconsistencies in the resulting data set.
This can help improve the accuracy and speed of the subsequent
data mining process.

The semantic heterogeneity and structure of data pose great challenges in data integration.
How can we match schema and objects from different sources? This
is the essence of the entity identification problem.
There are a number of issues to consider during data integration.
Schema integration and object matching can be tricky. How can
equivalent real world entities from multiple data sources be
matched up? This is referred to as the entity identification
problem.
 For example, how can the data analyst or the computer be sure that
customer id in one database and cust number in another refer to the
same attribute? Examples of metadata for each attribute include
the name, meaning, data type, and range of values permitted for
the attribute, and null rules for handling blank, zero, or null values
. Such metadata can be used to help avoid errors in schema
integration. The metadata may also be used to help transform the
data (e.g., where data codes for pay type in one database may be
“H” and “S” but 1 and 2 in another). When matching attributes
from one database to another during integration, special attention
must be paid to the structure of the data. This is to ensure that any
attribute functional dependencies and referential constraints in the
source system match those in the target system. For example, in
one system, a discount may be applied to the order, whereas in
another system it is applied to each individual line item within the
order. If this is not caught before integration, items in the target
system may be improperly discounted.
 Redundancy and Correlation Analysis:
Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For nominal data, we use the χ2 (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another.
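An illustrative sketch of these checks in R; `df` and its columns are hypothetical stand-ins for attributes from the integrated data:
chisq.test(table(df$pay_type, df$region))  # chi-square test of independence for two nominal attributes
cor(df$annual_revenue, df$total_sales)     # correlation coefficient for numeric attributes
cov(df$annual_revenue, df$total_sales)     # covariance for numeric attributes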
Data Reduction
 Imagine that you have selected data from the
AllElectronics data warehouse for analysis. The data set
will likely be huge! Complex data analysis and mining
on huge amounts of data can take a long time, making
such analysis impractical or infeasible. Data reduction
techniques can be applied to obtain a reduced
representation of the data set that is much smaller in
volume, yet closely maintains the integrity of the
original data. That is, mining on the reduced data set
should be more efficient yet produce the same (or
almost the same) analytical results.
 Data reduction strategies include dimensionality
reduction, numerosity reduction, and data compression.
 Dimensionality reduction is the process of reducing
the number of random variables or attributes under
consideration. Dimensionality reduction methods
include wavelet transforms and principal components
analysis , which transform or project the original data
onto a smaller space. Attribute subset selection is a
method of dimensionality reduction in which irrelevant,
weakly relevant, or redundant attributes or dimensions
are detected and removed.
 Numerosity reduction techniques replace the original
data volume by alternative, smaller forms of data
representation. These techniques may be parametric or
nonparametric.
 For parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored, instead
of the actual data. (Outliers may also be stored.) Regression and
loglinear models are examples.

 Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
 In data compression, transformations are applied so as to obtain a reduced or “compressed” representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several lossless algorithms for string compression; however, they typically allow only limited data manipulation. Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
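As a sketch of the principal components analysis mentioned above (one of the dimensionality reduction methods), using R's built-in prcomp() on the mtcars data:
pcs <- prcomp(mtcars, scale. = TRUE)  # standardize attributes, then project onto principal components
summary(pcs)                          # proportion of variance explained by each component
head(pcs$x[, 1:2])                    # data represented in the reduced two-component space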
Data Transformation
Normalization: Scaling attribute values to fall within a specified range.
Example: To transform V in [Min, Max] to V' in [0, 1], apply
V' = (V - Min) / (Max - Min)
Scaling by using the mean and standard deviation (useful when Min and Max are unknown or when there are outliers):
V' = (V - Mean) / StDev
Aggregation: Moving up in the concept hierarchy on numeric attributes.
Generalization: Moving up in the concept hierarchy on nominal attributes.
Attribute construction: Replacing or adding new attributes inferred from existing attributes.
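A small sketch of both scalings in R; the vector v is made up for illustration:
v <- c(200, 300, 400, 600, 1000)
(v - min(v)) / (max(v) - min(v))  # min-max normalization to [0, 1]
(v - mean(v)) / sd(v)             # scaling by mean and standard deviation (z-score)
scale(v)                          # built-in equivalent of the z-score scaling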
All the best
