
Session 2

Preprocessing
 It is a process of discovering or mining knowledge from large
amounts of data.
 Another term for data mining is KDD (knowledge discovery from
data).
 It attempts to extract patterns and trends from large databases.
 It finds hidden information in the database.
 Also called exploratory data analysis, data-driven discovery, and
deductive learning.
Need of Data Mining
 Extracting meaningful information from data.
 The need arose from the evolution in the size of databases.
 Databases grew into big data (huge databases) → manual analysis
no longer suffices → need for automatic analysis.

Evolution of Data Mining
 The term was introduced in the 1990s. The various data mining
technologies are the following:
 STATISTICS:
Regression analysis, cluster analysis, standard deviation, etc.
 ARTIFICIAL INTELLIGENCE:
Applying human-thought-like processing.
 MACHINE LEARNING:
Union of statistics and AI.
DATA MINING TOOLS
SPLUNK
 Splunk is used for monitoring and searching through big data. It
indexes and correlates information in a container that makes it
searchable, and makes it possible to generate alerts, reports, and
visualizations.
 Talend is an ETL tool for Data Integration.
 Apache Spark
 Power BI
 KNIME
 RapidMiner
 Tableau
 Excel
 R and Python
Summary
Data mining is a process of finding potentially useful patterns in
huge data sets. It is a multi-disciplinary skill that uses machine
learning, statistics, and AI to extract information and evaluate the
probability of future events.
Anaconda
 https://www.anaconda.com/products/individual
 https://jupyter.org/try
Data Mining Implementation Process
CRISP-DM
Business understanding:
 In this phase, business and data-mining goals are established.
 First, you need to understand the business and client objectives.
You need to define what your client wants (which many times even
they do not know themselves).
 Take stock of the current data mining scenario. Factor in
resources, assumptions, constraints, and other significant factors
into your assessment.
 Using the business objectives and current scenario, define your
data mining goals.
 A good data mining plan is very detailed and should be developed
to accomplish both business and data mining goals.
Data understanding:
 In this phase, a sanity check on the data is performed to check
whether it is appropriate for the data mining goals.
 First, data is collected from the multiple data sources available in
the organization.
 These data sources may include multiple databases, flat files, or
data cubes. Issues like object matching and schema integration can
arise during the data integration process. It is a quite complex and
tricky process, as data from various sources are unlikely to match
easily. For example, table A contains an entity named cust_no
whereas another table B contains an entity named cust-id (a small
sketch of this kind of matching follows after this list).
 Therefore, it is quite difficult to ensure that both of these given
objects refer to the same entity. Here, metadata should be used to
reduce errors in the data integration process.
 Next, the step is to search for properties of the acquired data. A
good way to explore the data is to answer the data mining
questions (decided in the business phase) using query, reporting,
and visualization tools.
 Based on the results of the queries, the data quality should be
ascertained. Missing data, if any, should be acquired.
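
A minimal Python sketch of the cust_no / cust-id schema-matching issue
above, assuming pandas and two invented tables (all values are made up
for illustration):

import pandas as pd

# Table A keys customers as "cust_no"; table B uses "cust-id".
# Both frames are invented purely to illustrate the mismatch.
table_a = pd.DataFrame({"cust_no": [101, 102], "name": ["Asha", "Bilal"]})
table_b = pd.DataFrame({"cust-id": [101, 102], "city": ["Pune", "Delhi"]})

# Metadata (here, a hand-written column mapping) records that the two
# columns refer to the same entity, so the schemas can be aligned.
table_b = table_b.rename(columns={"cust-id": "cust_no"})
customers = table_a.merge(table_b, on="cust_no", how="outer")
print(customers)

In real integration work the mapping would come from documented
metadata rather than a guess, which is exactly why the slide recommends
using metadata to reduce errors.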
Data preparation:
 In this phase, data is made production ready.
 The data preparation process consumes about 90% of the time of
the project.
 The data from different sources should be selected, cleaned,
transformed, formatted, anonymized, and constructed (if
required).
 Data cleaning is a process to “clean” the data by smoothing noisy
data and filling in missing values.
 For example, for a customer demographics profile, age data is
missing. The data is incomplete and should be filled in. In some
cases, there could be data outliers. For instance, age has a value of
300. Data could also be inconsistent. For instance, the name of the
customer is different in different tables.
 Data transformation operations change the data to make it useful
in data mining. Transformations such as smoothing, aggregation,
normalization, and attribute construction can be applied (a small
cleaning sketch follows below).
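
A minimal sketch of the cleaning and transformation steps just
described, assuming pandas and an invented customer table (the 0-120
age rule, the median fill, and the min-max scaling are illustrative
choices, not the only options):

import pandas as pd

# Invented demographics: one missing age and one outlier (300).
df = pd.DataFrame({"cust_no": [101, 102, 103, 104],
                   "age": [34, None, 300, 41]})

# Treat implausible ages as missing (0-120 is an assumed validity rule).
df.loc[~df["age"].between(0, 120), "age"] = None

# Fill missing values with the median age, one common simple strategy.
df["age"] = df["age"].fillna(df["age"].median())

# Example transformation: min-max normalization of age to [0, 1].
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
print(df)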
Modeling:
 In this phase, mathematical models are used to determine data
patterns.
 Based on the business objectives, suitable modeling techniques
should be selected for the prepared dataset.
 Create a scenario to test and check the quality and validity of the
model.
 Run the model on the prepared dataset.
 Results should be assessed by all stakeholders to make sure that
the model can meet the data mining objectives (a small modeling
sketch follows below).
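
As a hedged sketch of this phase, the snippet below trains a decision
tree on a bundled toy dataset and checks it on a held-out split; the
dataset, the 80/20 split, and the accuracy check are assumptions
standing in for a real prepared dataset and real objectives:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A bundled toy dataset stands in for the prepared dataset.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the test scenario for the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Stakeholders would compare this figure against the agreed objective.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))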
Evaluation:
 In this phase, the patterns identified are evaluated against the
business objectives.
 Results generated by the data mining model should be evaluated
against the business objectives.
 Gaining business understanding is an iterative process. In fact,
while understanding, new business requirements may be raised
because of data mining.
 A go or no-go decision is taken to move the model to the
deployment phase.
Deployment:
 In the deployment phase, you ship your data mining discoveries to
everyday business operations.
 The knowledge or information discovered during the data mining
process should be made easy to understand for non-technical
stakeholders.
 A detailed deployment plan for shipping, maintenance, and
monitoring of the data mining discoveries is created.
 A final project report is created with lessons learned and key
experiences during the project. This helps to improve the
organization’s business policy.
What Is Data Mining?
The knowledge discovery process is an iterative sequence of the
following steps (a small pipeline sketch follows after the list):
 1. Data cleaning (to remove noise and inconsistent data)
 2. Data integration (where multiple data sources may be
combined)
 3. Data selection (where data relevant to the analysis task are
retrieved from the database)
 4. Data transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations)
 5. Data mining (an essential process where intelligent methods are
applied to extract data patterns)
 6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on interestingness measures)
 7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined knowledge to
users)
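
One hedged way to see how the seven steps chain together is a pipeline
skeleton; every function below is a trivial placeholder, not a real
library call:

# Hypothetical skeleton mirroring the seven KDD steps above.
# Each stage is an identity stub; a real system plugs in real logic.
def step(data, note):
    print("running:", note)
    return data

def knowledge_discovery(raw_sources):
    data = step(raw_sources, "1. data cleaning")
    data = step(data, "2. data integration")
    data = step(data, "3. data selection")
    data = step(data, "4. data transformation")
    patterns = step(data, "5. data mining")
    knowledge = step(patterns, "6. pattern evaluation")
    return step(knowledge, "7. knowledge presentation")

knowledge_discovery(["db_table", "flat_file"])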
What Kinds of Data Can Be Mined?
 Database data
 Transactional data
 Data warehouse data
Data mining techniques
Outlier analysis
 Example: outlier analysis may uncover fraudulent usage of credit
cards by detecting purchases of unusually large amounts for a
given account number in comparison to regular charges incurred
by the same account. Outlier values may also be detected with
respect to the locations and types of purchase, or the purchase
frequency (a small detection sketch follows below).
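
A minimal sketch of this idea, assuming pandas, invented charges, and a
rule of thumb that flags any charge over five times the account's
median (the data and the multiplier are illustrative assumptions):

import pandas as pd

# Invented credit card charges; account 7 has one unusually large one.
tx = pd.DataFrame({
    "account": [7, 7, 7, 7, 9, 9, 9],
    "amount":  [25.0, 30.0, 27.0, 950.0, 12.0, 15.0, 14.0],
})

# Compare each charge to the account's typical (median) charge; the
# 5x multiplier is an assumed rule of thumb, not a standard constant.
typical = tx.groupby("account")["amount"].transform("median")
tx["outlier"] = tx["amount"] > 5 * typical
print(tx[tx["outlier"]])

The same idea extends to location, purchase type, or frequency by
grouping on those attributes instead.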
Classification
Clustering
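
Since the slide above is only a title, here is a minimal hedged sketch
of one common clustering technique, k-means; the points and the choice
k=2 are invented for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D points forming two loose groups.
points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# k=2 is an assumed choice; real work would validate it (e.g. silhouette).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster id assigned to each point
print(km.cluster_centers_)  # the two learned centroids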
Data mining adopts techniques from many domains
 What is a data set?
 A data set is a collection of numbers or values that relate to a
particular subject. For example, the test scores of each student in a
particular class form a data set. The number of fish eaten by each
dolphin at an aquarium is a data set.
 Data sets come in different forms.
 A data set has three general characteristics:

DIMENSIONALITY: The dimensionality of a data set is the number of
attributes that the objects in the data set have.

SPARSITY: For some data sets, such as those with asymmetric features,
most attributes of an object have values of 0; in many cases fewer than
1% of the entries are non-zero. Such data is called sparse data, or it
can be said that the data set has sparsity (a small sketch follows
after this list).

RESOLUTION: The patterns in the data depend on the level of resolution.
If the resolution is too fine, a pattern may not be visible or may be
buried in noise; if the resolution is too coarse, the pattern may
disappear.
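
To make the sparsity characteristic concrete, here is a minimal sketch
using SciPy's sparse matrices; the shape and the 1% density are
invented to match the figure quoted above:

import scipy.sparse as sp

# An invented 1000 x 500 data set where only 1% of entries are non-zero.
X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)

# Sparsity = fraction of zero entries; storing only non-zeros saves space.
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
print("non-zeros:", X.nnz, "sparsity:", round(sparsity, 4))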
 Types of data sets
 RECORD DATA
 ORDERED DATA
 GRAPH DATA
Free sites for accessing data sets
 https://www.kaggle.com/datasets
 https://github.com/awesomedata/awesome-public-datasets
 r/datasets
 Google Advanced Search: https://www.google.com/advanced_search
Can you make your own data sets? How? (A small sketch follows below.)
Google Flu Trends
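
As a hedged illustration of making your own data set, the sketch below
records invented observations (echoing the test-scores example earlier)
into a table and saves it as a CSV; the values and file name are made up:

import pandas as pd

# Invented observations, e.g. collected by hand or from a survey.
records = [
    {"student": "A", "score": 78},
    {"student": "B", "score": 91},
    {"student": "C", "score": 64},
]

# Turning the records into a table makes them a reusable data set.
scores = pd.DataFrame(records)
scores.to_csv("test_scores.csv", index=False)  # assumed output path
print(scores.describe())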
Useful links
 https://www.nature.com/articles/d41586-019-02755-6