
UNIT 2

DATA MINING
DATA PROCESSING
DATA CLEANING
Data Cleaning in Data Mining. The quality of your data is critical to the final analysis: data that are incomplete, noisy, or inconsistent can distort your results. Data cleaning in data mining is the process of detecting and removing corrupt or inaccurate records from a record set, table, or database.



Data cleaning is the process of identifying, deleting, and/or replacing inconsistent or incorrect information in the database. This technique ensures high quality of the processed data and minimizes the risk of wrong or inaccurate conclusions. As such, it is a foundational part of data science.
The standard data cleaning process consists of the following stages (a rough sketch follows the list):
• Importing data
• Merging data sets
• Rebuilding missing data
• Standardization
• Normalization
• Deduplication
• Verification & enrichment
• Exporting data
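A minimal end-to-end sketch of these stages in pandas; the column names, text values, and output file name are hypothetical and only illustrate the flow:

import pandas as pd

# Importing data (in practice: pd.read_csv("orders.csv"), etc. -- names are hypothetical)
orders = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                       "amount": [120.0, None, 80.0, 80.0],
                       "country": [" india", "India ", "INDIA", "India"]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Chen"]})

# Merging data sets on a shared key
df = orders.merge(customers, on="customer_id", how="left")

# Rebuilding missing data: fill missing amounts with the column median
df["amount"] = df["amount"].fillna(df["amount"].median())

# Standardization: consistent spacing and casing for a text field
df["country"] = df["country"].str.strip().str.upper()

# Normalization: rescale amounts to the 0-1 range
amt = df["amount"]
df["amount_norm"] = (amt - amt.min()) / (amt.max() - amt.min())

# Deduplication
df = df.drop_duplicates()

# Verification, then export
assert df["amount"].notna().all()
df.to_csv("orders_clean.csv", index=False)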
Standardize + Normalize
Standardization and normalization are crucial to the effectiveness of the
data cleaning process.
Why?
Because they make the data ready for statistical analysis and easy to compare.
Standardization is the process of making sure all your values adhere to a specific standard, such as deciding whether to record weights in kilograms or grams, or whether to use upper- or lower-case letters.
Normalization is the process of adjusting values to a common scale. For example, you can rescale values into the 0-1 range. This is necessary if you want to use statistical or machine-learning methods that are sensitive to the scale of the attributes.
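A minimal sketch of both operations with scikit-learn; note that "standardization" is shown here in its statistical (z-score) sense, while the unit/format standardization described above would be a separate conversion step:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50.0], [75.0], [100.0], [150.0]])  # e.g. weights, all already in kilograms

# Normalization: rescale values into the 0-1 range
X_minmax = MinMaxScaler().fit_transform(X)        # 0.0, 0.25, 0.5, 1.0

# Standardization (z-score): zero mean, unit variance
X_std = StandardScaler().fit_transform(X)

print(X_minmax.ravel(), X_std.ravel())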
Handle Missing Data

Action: There are three main methods of dealing with missing data, illustrated in the sketch after this list.


• Drop. When the missing values in a column are few and far between, the easiest way to handle them is to drop the rows that contain them.
• Impute. This method involves estimating the missing values from other observations. Statistical techniques such as the median, the mean, or linear regression are helpful if there aren't many missing values. You can also replace missing data with values taken from a "similar" record (a donor); this method is called hot-deck imputation.
• Flag. Missing data can be informative, especially if there is a pattern in play. For example, suppose you conduct a survey and most women refuse to answer a particular question. In such cases, simply flagging the data preserves that subtle insight:
  • For numeric data, fill in a sentinel value such as 0 (ideally alongside an indicator column).
  • For categorical data, introduce a 'missing' category.
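A minimal pandas sketch of the three options; the column names and values are hypothetical:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 58000],
    "gender": ["F", "M", None, "F"],
})

# Drop: remove rows containing any missing value
dropped = df.dropna()

# Impute: fill numeric gaps with the column median
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Flag: keep an indicator column and a 'missing' category
flagged = df.copy()
flagged["income_missing"] = flagged["income"].isna()
flagged["income"] = flagged["income"].fillna(0)          # sentinel for numeric data
flagged["gender"] = flagged["gender"].fillna("missing")  # 'missing' category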
DATA TRANSFORMATION

• Aggregation:
Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources and integrated into a single description for analysis. This is a crucial step, since the accuracy of data analysis insights depends heavily on the quantity and quality of the data used: gathering accurate, high-quality data in a large enough quantity is necessary to produce relevant results.
• Aggregated data is useful for everything from decisions about financing and business strategy to product, pricing, operations, and marketing decisions.
• For example, sales data may be aggregated to compute monthly and annual totals, as in the sketch below.
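A minimal pandas sketch of aggregating hypothetical sales records into monthly and annual totals:

import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2025-02-17"]),
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# Aggregation: monthly and annual total sales
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)
print(annual)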
Discretization:
Discretization is the process of transforming continuous data into a set of small intervals. Real-world data frequently contain continuous attributes, yet many existing data mining frameworks can handle only discrete (categorical) attributes.
Also, even when a data mining task can manage a continuous attribute directly, it can often become significantly more efficient when the continuous attribute is replaced by its discrete values.
For example, numeric ranges such as 1-10 and 11-20, or age mapped to young, middle-aged, and senior, as in the sketch below.
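A minimal sketch using pandas.cut; the bin boundaries and labels are illustrative:

import pandas as pd

ages = pd.Series([15, 23, 37, 45, 52, 70])

# Discretization: map continuous ages to labelled intervals
age_groups = pd.cut(ages, bins=[0, 30, 60, 120],
                    labels=["young", "middle-aged", "senior"])
print(age_groups.tolist())
# ['young', 'young', 'middle-aged', 'middle-aged', 'middle-aged', 'senior']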
Normalization: Data normalization involves converting data values into a given range; for example, min-max normalization maps a value v to (v − min) / (max − min), which lies between 0 and 1.

Concept hierarchy generation: low-level attribute values are replaced by higher-level concepts (for example, city → state → country).
DATA INTEGRATION
Data Integration is a data pre-processing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files (see the sketch after the checklist below). When evaluating a data integration tool, look for:
• A lot of connectors. There are many systems and applications in the world; the more pre-built connectors your Data Integration
tool has, the more time your team will save.
• Open source. Open source architectures typically provide more flexibility while helping to avoid vendor lock-in.
• Portability. It's important, as companies increasingly move to hybrid cloud models, to be able to build your data integrations once
and run them anywhere.
• Ease of use. Data integration tools should be easy to learn and easy to use with a GUI interface to make visualizing your
data pipelines simpler.
• A transparent price model. Your data integration tool provider should not ding you for increasing the number of connectors or
data volumes.
• Cloud compatibility. Your data integration tool should work natively in a single cloud, multi-cloud, or hybrid cloud environment.
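A minimal sketch of integrating two hypothetical heterogeneous sources (a flat-file extract and an in-memory table) into a unified view with pandas:

import pandas as pd

# Source 1: hypothetical flat file of customer records (in practice: pd.read_csv)
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Chen"]})

# Source 2: hypothetical extract from a transactional database
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [120.0, 80.0, 200.0]})

# Integration: join the sources on a shared key into one coherent view
unified = customers.merge(orders, on="customer_id", how="left")
print(unified)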



DATA WAREHOUSE
A data warehouse is a collection of databases that work together as a single, consolidated store. Distributed databases store a database across multiple computer sites to improve data access and processing. Data mining is the process of analysing data and summarizing it to produce useful information.
Data warehousing refers to the process of compiling and organizing data into one common database, whereas data mining refers to the process of extracting useful patterns from those databases. The data mining process depends on the data compiled in the data warehousing phase to recognize meaningful patterns.
Data Preparation

• Gathering: You may need to collect data relevant to the analysis from multiple sources.
• Cleansing: Data may have quality issues that you want to resolve before analysis.
• Formatting: Data is transformed and consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
• Blending: Data is combined, or blended, from various sources to build the analytical dataset.
• Sampling: A representative subset of the data may be selected to make exploration and modelling faster.
Validation
• Important steps:
• Observe the key results of the model.
• Ensure the results make sense in the context of the business problem.
• Determine whether to proceed or take a step back.
• Iterate.



FREQUENCY DISTRIBUTION

Class Interval   Frequency
1 – under 3      4
3 – 5            12
5 – 7            13
7 – 9            19
9 – 11           7
11 – 13          5
Total            60
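A minimal sketch of building such a frequency distribution from raw values with pandas.cut; the raw observations below are randomly generated, and only the class boundaries come from the table above:

import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).uniform(1, 13, size=60))  # hypothetical raw data

bins = [1, 3, 5, 7, 9, 11, 13]
classes = pd.cut(values, bins=bins, right=False)   # intervals [1,3), [3,5), ...
freq = classes.value_counts().sort_index()
print(freq)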
DATA ASSESSMENT

Class Interval   Frequency (fi)   Midpoint (Mi)   Relative Frequency (fi/N)   Cumulative Frequency   fi·Mi
1 – under 3      4                2               4/60 = .0667                4                      8
3 – 5            12               4               12/60 = .2000               16                     48
5 – 7            13               6               13/60 = .2167               29                     78
7 – 9            19               8               19/60 = .3167               48                     152
9 – 11           7                10              7/60 = .1167                55                     70
11 – 13          5                12              5/60 = .0833                60                     60

Mean = Σ(fi·Mi) / N = 416/60 = 6.93
Median of Grouped Data

Median = L + ((N/2 − cf) / fmed) × W
L = the lower limit of the median class interval = 7
cf = the cumulative total of the frequencies up to, but not including, the median class = 29
fmed = the frequency of the median class = 19
W = the width of the median class = 2
N = the total number of frequencies = 60

Median = 7 + ((60/2 − 29) / 19) × 2 ≈ 7.11
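A minimal sketch that reproduces the grouped mean and median computed above:

# Grouped-data statistics for the frequency table above
midpoints = [2, 4, 6, 8, 10, 12]
freqs = [4, 12, 13, 19, 7, 5]
N = sum(freqs)                                            # 60

mean = sum(f * m for f, m in zip(freqs, midpoints)) / N   # 416/60 ≈ 6.93

# Median class is 7-9: L = 7, cf = 29, fmed = 19, W = 2
L, cf, fmed, W = 7, 29, 19, 2
median = L + ((N / 2 - cf) / fmed) * W                    # ≈ 7.105
print(round(mean, 2), round(median, 3))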
DATA VISUALIZATION
TOOLS USED FOR DATA VISUALIZATION
Tableau
Looker
Zoho Analytics
Sisense
IBM Cognos Analytics
Qlik Sense
Domo
Microsoft Power BI
Klipfolio
SAP Analytics Cloud
Top 8 Python Libraries for Data Visualization

1. Matplotlib
2. Plotly
3. Seaborn
4. GGplot
5. Altair
6. Bokeh

7. Pygal
8. Geoplotlib
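As a quick illustration, a minimal sketch with Matplotlib (the first library in the list) plotting hypothetical monthly sales:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 80, 200, 150]   # hypothetical values

plt.bar(months, sales)
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Total amount")
plt.show()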
Data Mining Knowledge Representation
Background knowledge:
Concept hierarchies: concept hierarchies are induced by a partial order over the values of a given attribute. Depending on the type of the ordering relation, we can distinguish several types of concept hierarchies (see the sketch below).
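A minimal sketch of rolling attribute values up a hypothetical location hierarchy (city → state → country); the mappings are illustrative, not part of the original material:

# Hypothetical concept hierarchy: city -> state -> country
city_to_state = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Chennai": "Tamil Nadu"}
state_to_country = {"Maharashtra": "India", "Tamil Nadu": "India"}

cities = ["Mumbai", "Pune", "Chennai"]

# Roll up from the city level to the state level, then to the country level
states = [city_to_state[c] for c in cities]
countries = [state_to_country[s] for s in states]
print(states)     # ['Maharashtra', 'Maharashtra', 'Tamil Nadu']
print(countries)  # ['India', 'India', 'India']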
Representing Input Data And Output Knowledge
