
Data Mining

Data mining is a process of discovering various models, summaries, and derived values from a given
collection of data.
Or
Mining is the process of extracting information from various data sources to identify different patterns
and anomalies. This technique is particularly useful for extracting knowledge from large databases and the internet.
In practice, the two primary goals of data mining tend to be prediction and description.

• Predictive data mining, which produces a model of the system described by the given data set.
It involves using historical data and statistical algorithms to build models that can predict
future outcomes or trends.
• Descriptive data mining, which produces new, nontrivial information based on the available data
set. It focuses on exploring and summarizing existing data to understand patterns and relationships
within the data.
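
As a rough illustration of the two goals, here is a minimal sketch (assuming Python with pandas and scikit-learn; the tiny data set is invented purely for illustration): a descriptive step summarizes what the existing data already shows, while a predictive step fits a model to forecast an unseen value.

```python
# Contrast of descriptive vs. predictive data mining on a tiny invented data set.
# Assumes pandas and scikit-learn are installed.
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "revenue":  [120, 190, 310, 390, 510],
})

# Descriptive: summarize and explore what the existing data already shows.
print(sales.describe())
print("correlation:\n", sales.corr())

# Predictive: build a model from historical data to forecast a future outcome.
model = LinearRegression().fit(sales[["ad_spend"]], sales["revenue"])
print("forecast revenue for ad_spend=60:",
      model.predict(pd.DataFrame({"ad_spend": [60]}))[0])
```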

Primary data mining tasks


1. Classification:
What it does: Learns how to predict and assign a data item to one of several predefined groups or categories.
Example: Sorting emails into "spam" or "not spam."
2. Regression:
What it does: Finds a formula that predicts a numerical value based on other data.
Example: Predicting house prices based on factors like size, location, and number of bedrooms.
3. Clustering:
What it does: Identifies natural groups or clusters within a dataset.
Example: Grouping customers based on their purchasing behavior.
4. Summarization:
What it does: Creates a concise description for a set of data, making it easier to understand.
Example: Condensing a large report into key findings and trends.
5. Dependency Modeling:
What it does: Reveals relationships and connections between different variables in a dataset.
Example: Understanding how weather patterns may affect sales of certain products.
6. Change and Deviation Detection:
What it does: Highlights significant changes or deviations in the dataset.
Example: Noticing a sudden drop in website traffic or a spike in user engagement.
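
To make a couple of these tasks concrete, here is a minimal sketch (assuming Python with scikit-learn; the bundled iris data set merely stands in for any tabular data) showing classification on labeled data and clustering on the same data without using the labels.

```python
# Minimal sketch of two data mining tasks: classification and clustering.
# Assumes scikit-learn is installed; the iris data set stands in for any tabular data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: predict a categorical label from labeled examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Clustering: find natural groups without using the labels at all.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
```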
Data mining roots
Data mining depends on both statistics and machine learning.
Statistics, rooted in mathematics, emphasizes theoretical rigor before practical application.
On the other hand, machine learning, born from computer science, prioritizes practical experimentation
and performance testing over formal theoretical proof.
In short:
Data mining = statistics + mathematics + machine learning + computer science
Basic modeling principles in data mining also have roots in control theory, which is primarily applied to
engineering systems and industrial processes.
System identification
The problem of determining a mathematical model for an unknown system by observing its input and
output data pairs is generally referred to as system identification.
The unknown system being modeled is called the target system.
System identification generally involves two top-down steps:

• Structure Identification: This step utilizes prior knowledge of the target system to define a class
of models, typically represented by a parameterized function y = f(u, t). This function is
determined based on the designer's expertise, intuition, and the governing laws of the system.

• Parameter Identification: Once the model structure is established, optimization techniques are
applied to find the parameter vector t that best fits the model to the system's behavior. This step
aims to determine the parameters t* for the model y* = f(u, t*) that most accurately describe the system.

Block diagram for parameter identification.
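
As a concrete illustration of the two steps, the following is a minimal sketch (assuming Python with NumPy and SciPy; the model structure and data are made up for illustration): the structure y = f(u, t) is chosen first, and optimization then identifies the parameter vector t* from observed input-output pairs.

```python
# Minimal parameter identification sketch: fit t in y = f(u, t) to observed data.
# Assumes NumPy and SciPy are available; the model structure and data are invented.
import numpy as np
from scipy.optimize import curve_fit

# Structure identification: a chosen model class, y = t0 * u + t1 * u**2.
def f(u, t0, t1):
    return t0 * u + t1 * u ** 2

# Observed input-output pairs from the (unknown) target system, with noise.
rng = np.random.default_rng(0)
u = np.linspace(0.0, 5.0, 50)
y = 2.0 * u + 0.5 * u ** 2 + rng.normal(scale=0.1, size=u.shape)

# Parameter identification: least-squares optimization finds t* = (t0*, t1*).
t_star, _ = curve_fit(f, u, y)
print("identified parameters t*:", t_star)
```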


Data mining process
1. Problem Formulation and Hypothesis: Involves defining a meaningful problem statement based on
domain-specific knowledge. This step requires the collaboration of both domain experts and data
mining specialists to specify variables and propose initial hypotheses regarding the dependency
between them.

2. Data Collection: Data can be obtained through designed experiments or observational approaches.
Understanding the data generation process is crucial, and ensuring consistency in the sampling
distribution between training and testing datasets is important for accurate model estimation and
application.

3. Data Preprocessing: Involves outlier detection and removal, as well as scaling, encoding, and feature
selection. These steps aim to enhance the quality of the data and ensure that variables have
appropriate weights for analysis.

4. Model Estimation: Selecting and implementing suitable data mining techniques to build models.
This process often involves comparing multiple models to identify the most effective one for the
given problem.

5. Model Interpretation and Conclusion: Data mining models should be interpretable to facilitate
decision-making. Balancing model accuracy with interpretability is important, especially considering
the complexity of modern high-dimensional models.
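
A minimal sketch of steps 3-5 (assuming Python with scikit-learn; the data set and candidate models are placeholders, not part of these notes): preprocessing is bundled with each model, two candidate models are estimated and compared, and interpretability is weighed against accuracy when choosing between them.

```python
# Sketch of data preprocessing, model estimation, and model comparison.
# Assumes scikit-learn; the breast cancer data set is only a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 3: preprocessing (scaling) is bundled with each model in a pipeline.
candidates = {
    "logistic regression (more interpretable)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=5000)
    ),
    "random forest (more complex)": make_pipeline(
        StandardScaler(), RandomForestClassifier(random_state=0)
    ),
}

# Steps 4-5: estimate each model, compare accuracy, and weigh interpretability.
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```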

Large data set


As we enter the age of digital information, the problem of data overload looms ominously ahead. Our
ability to analyze and understand massive data sets (what we call large data) lags far behind our ability
to gather and store the data.

Independent variables are the inputs or predictors used to predict or explain the behavior of the
dependent variable. Dependent variables, on the other hand, are the outcomes or responses being
predicted or explained by the independent variables.

For example, in a study analyzing housing prices, independent variables might include factors such as
square footage, number of bedrooms, and neighborhood quality. The dependent variable would be the
price of the house. The goal of data mining would be to understand how changes in the independent
variables affect the dependent variable, allowing for predictions or insights into housing prices.
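
Continuing the housing example as a minimal sketch (assuming Python with scikit-learn; the numbers are invented purely for illustration), the independent variables form the feature matrix and the dependent variable is the regression target.

```python
# Regression sketch: predict a dependent variable (price) from independent variables.
# Assumes scikit-learn; the tiny data set below is invented purely for illustration.
from sklearn.linear_model import LinearRegression

# Independent variables: square footage, number of bedrooms, neighborhood quality (1-10).
X = [
    [1200, 2, 6],
    [1800, 3, 7],
    [2400, 4, 8],
    [3000, 4, 9],
]
# Dependent variable: house price.
y = [200_000, 280_000, 360_000, 450_000]

model = LinearRegression().fit(X, y)
print("predicted price for a 2000 sq ft, 3-bed house in a quality-8 area:",
      model.predict([[2000, 3, 8]])[0])
```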

Data warehouse for data mining


The data warehouse is a collection of integrated, subject-oriented databases designed to support the
decision-support functions (DSF), where each unit of data is relevant to some moment in time.

or

A data warehouse can be defined as a central repository of integrated, structured, and preprocessed data
that is optimized for analysis and exploration.
A data warehouse includes the following categories of data, where the classification is accommodated to
the time-dependent data sources:

• old detail data
• current (new) detail data
• lightly summarized data
• highly summarized data
• meta-data (the data directory or guide).

Data transformation
1. Simple Transformations: Basic changes made to individual data fields, like converting data
types or replacing encoded values with decoded ones.
2. Cleansing and Scrubbing: Ensuring consistent formatting and accuracy of data, such as properly
formatting addresses or validating values within a specified range.
3. Integration: Combining data from different sources into a unified structure in the data
warehouse. Challenges include identifying the same entities across multiple systems and
resolving conflicts or missing values.
4. Aggregation and Summarization: Condensing operational data into fewer instances in the
warehouse. Summarization involves adding values along dimensions (e.g., daily sales to monthly
sales), while aggregation combines different business elements into a common total, depending
on the domain (e.g., combining sales of different products and services).
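
A minimal sketch of a few of these transformations (assuming Python with pandas; the table and column names are invented for illustration): a simple type conversion, a cleansing step that standardizes formatting and handles an invalid value, and summarization of daily sales into monthly totals.

```python
# Sketch of simple transformation, cleansing, and summarization with pandas.
# Assumes pandas is installed; the data and column names are invented.
import pandas as pd

daily_sales = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-28"],
    "region": [" north", "North", "south ", "SOUTH"],
    "amount": ["100.5", "200.0", "150.0", "-999"],  # -999 encodes a missing value
})

# Simple transformation: convert encoded strings to proper data types.
daily_sales["date"] = pd.to_datetime(daily_sales["date"])
daily_sales["amount"] = pd.to_numeric(daily_sales["amount"])

# Cleansing/scrubbing: standardize formatting and replace invalid values.
daily_sales["region"] = daily_sales["region"].str.strip().str.title()
daily_sales.loc[daily_sales["amount"] < 0, "amount"] = float("nan")

# Summarization: add daily amounts along the time dimension to get monthly totals.
monthly_sales = (daily_sales
                 .groupby([daily_sales["date"].dt.to_period("M"), "region"])["amount"]
                 .sum())
print(monthly_sales)
```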
