The main objective of data mining is to identify patterns, trends, or rules that explain the behavior of data in context. Data mining uses mathematical analysis to derive patterns and trends that older methods of data exploration could not reveal, which makes it a practical methodology for dealing with vast volumes of data. In this article, we explore the main data mining functionalities used to identify the types of patterns in data sets.
Classification
Association Analysis
Cluster Analysis
Data Characterization
Data Discrimination
Prediction
Outlier Analysis
Evolution Analysis
Classification
Classification assigns data items to predefined classes or categories. A model is built from training data whose class labels are already known, and that model is then used to predict the class of new, unlabeled data, for example flagging an email as spam or not spam.
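The idea of classifying new data by comparing it with labeled examples can be sketched with a 1-nearest-neighbor rule. This is a minimal illustration, not a method named in the article, and all of the data points and labels below are made up:

```python
# Minimal 1-nearest-neighbour classifier: assign each new point the
# class label of its closest training example (Euclidean distance).
# All coordinates and labels are hypothetical.

import math

def nearest_neighbor(train, point):
    """train: list of ((x, y), label); returns the label of the closest example."""
    best = min(train, key=lambda item: math.dist(item[0], point))
    return best[1]

# Labelled training data: two well-separated groups.
train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
         ((8.0, 9.0), "high"), ((9.1, 8.7), "high")]

print(nearest_neighbor(train, (1.1, 1.0)))  # close to the "low" group
print(nearest_neighbor(train, (8.5, 9.2)))  # close to the "high" group
```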
Association Analysis
Association analysis uncovers relationships between items in a data set, usually expressed as if-then rules. An antecedent (if) found in the data points towards a degree of discovering a consequent (then), which suggests that the two are associated.
For example, if a person buys popcorn in the theatre, there may be a 60% chance that he will also buy a cold drink. In this way, predictions can be made about consumers' shopping behavior.
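The percentage in the popcorn example corresponds to a rule's confidence, which can be estimated directly from transaction data. A minimal sketch, assuming a small made-up set of theatre transactions:

```python
# Estimating an association rule's confidence from transactions:
# confidence(popcorn -> cold drink) = P(cold drink | popcorn).
# The transactions below are invented for illustration.

transactions = [
    {"popcorn", "cold drink"},
    {"popcorn", "cold drink", "nachos"},
    {"popcorn"},
    {"cold drink"},
    {"popcorn", "cold drink"},
]

has_popcorn = [t for t in transactions if "popcorn" in t]
both = [t for t in has_popcorn if "cold drink" in t]

confidence = len(both) / len(has_popcorn)  # 3 of 4 popcorn buyers
print(f"confidence(popcorn -> cold drink) = {confidence:.0%}")  # 75%
```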
Cluster Analysis
Cluster analysis is similar to classification: similar data items are grouped together, the difference being that the class labels are not known in advance. Clustering algorithms divide the data based on similarities, so that items within a group are more similar to each other than to items in other groups. Cluster analysis is used in machine learning, deep learning, image processing, pattern recognition, NLP, and more.
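How a clustering algorithm divides data by similarity can be sketched with a tiny k-means loop on one-dimensional data. The points and starting centroids below are made up, and real clustering libraries handle far more than this sketch does:

```python
# A tiny k-means sketch (k = 2) on one-dimensional data: points are
# repeatedly assigned to the nearer centroid, then each centroid is
# recomputed as its group's mean. All numbers are illustrative.

def kmeans_1d(points, c1, c2, iters=10):
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
low, high = kmeans_1d(points, c1=0.0, c2=5.0)
print(low)   # the cluster near 1-2
print(high)  # the cluster near 10-12
```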
Data Characterization
Data characterization summarizes the general features of the data, which can yield specific rules that define a target class. An attribute-oriented induction technique can characterize the data with little user intervention or interaction, and the resulting characterization can be visualized in the form of graphs, charts, or tables.
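Summarizing the general features of a target class can be sketched with simple aggregates. The records, field names, and class labels below are hypothetical:

```python
# Data characterization sketch: summarise the general features of a
# target class ("premium" customers) as aggregate statistics.
# All records and field names are made up for illustration.

from statistics import mean

records = [
    {"class": "premium", "age": 42, "spend": 310.0},
    {"class": "premium", "age": 38, "spend": 290.0},
    {"class": "basic",   "age": 25, "spend": 40.0},
    {"class": "basic",   "age": 31, "spend": 55.0},
]

target = [r for r in records if r["class"] == "premium"]
summary = {
    "count": len(target),
    "avg_age": mean(r["age"] for r in target),
    "avg_spend": mean(r["spend"] for r in target),
}
print(summary)
```

A real characterization would also report ranges, distributions, and counts per attribute; the table or chart the article mentions would then be drawn from such a summary.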
Data Discrimination
In data mining, data discrimination compares the general features of a target class with those of one or more contrasting classes, highlighting the attributes on which they differ. This functionality helps to separate distinctive data sets based on differences in their attribute values.
Prediction
Prediction is among the most popular data mining functionalities; it estimates missing or unknown values in a data set. Regression models built on previous data are used to make numeric predictions, which help businesses forecast the outcome of a given event, positive or negative. There are two types of prediction: numeric prediction, which estimates a continuous value, and class prediction, which estimates a missing class label.
Outlier Analysis
We use outlier analysis when some data cannot be grouped into any class. Outlier analysis also helps to assess data quality: an outlier usually indicates an abnormality in the data, and the more outliers a data set contains, the lower its quality. It is hard to identify patterns or draw conclusions from data sets with many outliers, so outlier analysis helps check whether the data can still be analyzed after some clean-up. Nevertheless, tracking unusual data and activity is essential in its own right, so that anomalies, and any business impact they signal, can be detected in advance.
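A common way to flag outliers is a z-score rule of thumb: values more than a few standard deviations from the mean are treated as anomalies. A minimal sketch with made-up readings and a threshold of 2 (the threshold is a choice, not a standard fixed value):

```python
# Outlier analysis sketch: flag values more than 2 standard deviations
# from the mean. The readings are invented; 95 is the planted anomaly.

from statistics import mean, stdev

values = [10, 12, 11, 13, 12, 11, 95]

m, s = mean(values), stdev(values)
outliers = [v for v in values if abs(v - m) / s > 2]
print(outliers)  # [95]
```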
Evolution Analysis
Evolution analysis refers to the study of data sets that have gone through a phase of transformation or change over time. Evolution analysis models capture evolutionary trends in the data, which in turn supports data characterization, classification, discrimination, and clustering of multivariate time series.
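One elementary way to expose a trend in a time series, in the spirit of evolution analysis, is a moving average. The series below is made up, and this is only a sketch of the idea, not a full evolution-analysis model:

```python
# Evolution analysis sketch: smooth a time series with a 3-point moving
# average so the underlying upward trend is easier to see.
# The series values are invented for illustration.

series = [5, 7, 6, 9, 11, 10, 14, 16]

window = 3
trend = [sum(series[i:i + window]) / window
         for i in range(len(series) - window + 1)]
print(trend)  # rises from 6.0 to about 13.3
```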
The Data Processing Cycle
Data processing is carried out either manually or automatically. Nowadays, most data is processed automatically by computer, which is faster and gives more accurate results. The data can be converted into different forms, such as graphics or audio, depending on the software and the processing methods used.
1. Data Collection
The collection of raw data is the first step of the data processing cycle. The raw data collected has a huge impact on the output produced, so it should be gathered from well-defined and accurate sources to ensure the subsequent findings are valid and usable. Raw data can include monetary figures, website cookies, a company's profit/loss statements, user behavior, and so on.
2. Data Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw data
to remove unnecessary and inaccurate data. Raw data is checked for errors,
duplication, miscalculations, or missing data and transformed into a suitable form for
further analysis and processing. This ensures that only the highest quality data is fed
into the processing unit.
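The sorting-and-filtering step can be sketched in a few lines: drop exact duplicates and records with missing values before anything downstream sees them. The records and field names below are hypothetical:

```python
# Data preparation sketch: remove duplicate records and records with
# missing values from raw input. All records are made up.

raw = [
    {"user": "alice", "spend": 30.0},
    {"user": "alice", "spend": 30.0},   # exact duplicate
    {"user": "bob",   "spend": None},   # missing value
    {"user": "carol", "spend": 12.5},
]

seen, clean = set(), []
for record in raw:
    key = tuple(sorted(record.items()))          # hashable fingerprint
    if key in seen or None in record.values():   # duplicate or incomplete
        continue
    seen.add(key)
    clean.append(record)

print(clean)  # only alice and carol survive
```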
3. Data Input
In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner, or
any other input source.
4. Data Processing
In this step, the raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate the desired
output. This step may vary slightly from process to process depending on the source
of data being processed (data lakes, online databases, connected devices, etc.) and
the intended use of the output.
5. Data Output
The data is finally transmitted and displayed to the user in a readable form such as graphs, tables, vector files, audio, video, or documents. This output can be stored and further processed in the next data processing cycle.
6. Data Storage
The last step of the data processing cycle is storage, where the data and its metadata are stored for further use. This allows information to be accessed and retrieved quickly whenever needed and, when done properly, supports compliance with data protection legislation such as the GDPR.
2. Attribute Subset Selection:
Only the most relevant attributes should be kept; the rest can be discarded. To perform attribute selection, one can use the significance level and the p-value of each attribute: attributes whose p-value is greater than the significance level can be discarded.
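The p-value filter described above can be sketched directly. The attribute names and p-values below are made up; in practice they would come from a statistical test of each attribute against the target:

```python
# Attribute subset selection sketch: keep only attributes whose p-value
# is at or below the significance level (alpha = 0.05 here).
# The attribute names and p-values are hypothetical.

p_values = {"age": 0.001, "income": 0.03,
            "favourite_colour": 0.62, "shoe_size": 0.41}

alpha = 0.05
selected = [attr for attr, p in p_values.items() if p <= alpha]
print(selected)  # ['age', 'income']
```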
3. Numerosity Reduction:
This stores a model of the data instead of the whole data, for example a regression model.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms, which can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
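The lossless case can be illustrated with run-length encoding, a much simpler scheme than the wavelet transforms or PCA named above (those typically discard information and are lossy in practice): the encoded form is smaller, yet the original data is recovered exactly.

```python
# Lossless encoding sketch: run-length encoding compresses runs of
# repeated characters and decodes back to the exact original.

def rle_encode(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                   # extend the current run
        out.append((s[i], j - i))    # (character, run length)
        i = j
    return out

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

data = "aaaabbbcca"
packed = rle_encode(data)
print(packed)                       # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(rle_decode(packed) == data)   # True: reconstruction is exact
```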