Enterprise DWH: -
The data is integrated from different operational systems and stored in the data warehouse.
Data mart: -
A data mart is a subset or part of a data warehouse that focuses mainly on a single subject
area.
Because of this narrow focus, it cannot give complete details for all subject areas.
Why do we need virtual warehouses?
Ans: If we do not have a data warehouse but want to access data from multiple sources,
users can create virtual databases.
When we use virtual warehouses, data access is very fast, and abstraction can also be applied.
Predictive task:
Here users predict the values of a target attribute by using a set of known attribute values.
The predicted value is an estimate; it is never guaranteed to be a hundred percent correct.
Classification
Regression
1. Classification:
Classification assigns a record to one of a fixed set of classes; by using this we can predict the target value.
It predicts a discrete (finite, fixed) target variable.
Ex: whether a person will buy a book or not.
C1 C2 Target attribute
- - True
- - False
In the above representation, the target attribute values are predicted based on the independent
attribute values.
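The notes stop at the table above, so as an illustration only, here is a minimal 1-nearest-neighbour classifier in plain Python. The training data, the features (age, income), and the function name are all made up for this sketch; they are not from the notes.

```python
def predict_1nn(train, x):
    """Predict the label of x as the label of its nearest training record.

    train: list of (features, label) pairs; x: a feature tuple.
    """
    def dist(a, b):
        # Squared Euclidean distance between two feature tuples.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda row: dist(row[0], x))[1]

# Hypothetical data: (age, income in thousands) -> buys a book or not.
train = [
    ((25, 30), True),
    ((45, 80), True),
    ((20, 15), False),
    ((60, 20), False),
]
print(predict_1nn(train, (40, 70)))  # True (nearest record is (45, 80))
```

The predicted value is discrete (True/False), which is what distinguishes classification from regression below.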
2. Regression:
It is used to predict a continuous target variable; because the target can take any numeric
value, we estimate it rather than assigning a fixed class.
It also relies on mathematical formulas. Ex: book price.
A book's price varies from day to day, week to week, etc.
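To make the book-price example concrete, here is a minimal least-squares line fit in plain Python. The day/price numbers are invented for the sketch; the notes do not give any data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical book prices over four days.
days = [1, 2, 3, 4]
prices = [100, 102, 104, 106]
slope, intercept = fit_line(days, prices)
print(slope * 5 + intercept)  # 108.0 (estimated price on day 5)
```

Note that the output is a continuous number (a price estimate), not a class label.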
Descriptive task:
Cluster analysis
Association rule analysis
Anomaly detection
Summarization
1. Cluster analysis:
It is also called grouping or segmentation.
In this model, users group records that have similar attribute values.
Ex: In a class, we group the students based on the marks attribute.
Once clustering is done, patterns and relationships can be found easily.
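The student-marks example can be sketched with a tiny one-dimensional k-means, written in plain Python. The marks and the initial centre guesses are made up for illustration.

```python
def kmeans_1d(values, centers, iters=20):
    """Tiny 1-D k-means: group numbers around the nearest centre,
    then move each centre to the mean of its group, and repeat."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each centre; keep the old one if its cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Hypothetical marks of six students; two initial centres.
marks = [35, 40, 42, 78, 85, 88]
print(kmeans_1d(marks, [35, 88]))  # [[35, 40, 42], [78, 85, 88]]
```

The two groups (low scorers and high scorers) emerge from the data itself; no target attribute is needed, which is why clustering is a descriptive task.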
2. Association rule analysis:
In this model, users discover patterns based on strongly associated features or
attributes.
It means users should know the relationships between the attributes.
Ex: Retail services. Here we should first know the relationship between the producer and the
consumer.
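For the retail example, association rules are usually scored by support (how often items occur together) and confidence (how often the consequent follows the antecedent). A minimal sketch in plain Python, with an invented basket of transactions:

```python
# Hypothetical retail transactions: each is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.666... -> "bread => milk"
```

A rule like "bread => milk" with high support and confidence is the kind of strongly associated pattern the notes describe.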
3. Anomaly detection:
This task identifies problems or anomalies in the data, which helps us check its correctness.
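One common way to flag anomalies, sketched here in plain Python, is to mark values that sit far from the mean (a z-score test). The data and the threshold are invented for illustration.

```python
def zscore_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations
    from the mean (population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# Hypothetical sensor readings with one suspicious value.
data = [10, 11, 10, 12, 11, 10, 50]
print(zscore_anomalies(data))  # [50]
```

Flagged values like 50 can then be inspected to decide whether they are genuine outliers or data-entry errors.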
4. Summarization:
It produces a short conclusion (summary) of large data, from which the user can determine
patterns very easily.
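A summary in this sense can be as simple as a few aggregate statistics. A minimal sketch in plain Python, with invented sales figures:

```python
def summarize(values):
    """Condense a list of numbers into a few aggregate statistics."""
    n = len(values)
    return {"count": n, "min": min(values), "max": max(values),
            "mean": sum(values) / n}

# Hypothetical daily sales figures.
sales = [120, 95, 130, 110, 145]
print(summarize(sales))  # {'count': 5, 'min': 95, 'max': 145, 'mean': 120.0}
```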
The goal of data mining and KDD is the same: the "process of discovering knowledge from large
amounts of data".
The difference is that KDD extracts knowledge from large databases with the help of data mining
methods.
It means KDD is a process with many phases, and data mining is one of its critical
steps.
Phases of KDD: There are six steps in the KDD process.
1. Data integration.
2. Data selection.
3. Data transformation.
4. Data mining.
5. Pattern evaluation.
6. Knowledge representation.
1. Data integration:
In this phase, data is collected from different sources and integrated into a single source
(DWH).
2. Data selection:
In this phase, the data relevant to the task is first selected and retrieved from the DWH.
3. Data transformation:
After selecting the data, it is transformed into other forms as per the requirements.
Data cleaning: it involves removing noisy and irrelevant data from the database.
4. Data mining:
Apply various techniques such as association rules, classification, clustering, regression, etc., to
extract the data patterns.
5. Pattern evaluation:
The different data patterns generated by data mining are evaluated using metrics.
6. Knowledge representation:
The final step of KDD, which presents the extracted knowledge in the forms the user requires.
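The six phases above can be sketched as a tiny pipeline in plain Python. The record layout, the function names, and the "mining" step (a simple item count) are all invented to illustrate the flow, not taken from the notes.

```python
def integrate(sources):
    """Phase 1: merge records from several sources into one store (the 'DWH')."""
    return [row for source in sources for row in source]

def select(rows, predicate):
    """Phase 2: keep only the rows relevant to the task."""
    return [r for r in rows if predicate(r)]

def transform(rows):
    """Phase 3: clean and reshape; here we simply drop rows with missing values."""
    return [r for r in rows if None not in r.values()]

def mine(rows):
    """Phase 4: a stand-in mining step; count purchases per item."""
    counts = {}
    for r in rows:
        counts[r["item"]] = counts.get(r["item"], 0) + 1
    return counts

# Hypothetical operational sources feeding the warehouse.
sources = [
    [{"item": "book", "qty": 1}, {"item": "pen", "qty": None}],
    [{"item": "book", "qty": 2}],
]
dwh = integrate(sources)
patterns = mine(transform(select(dwh, lambda r: r["qty"] != 0)))
print(patterns)  # {'book': 2}
```

Phases 5 and 6 (pattern evaluation and knowledge representation) would then score these counts against a metric and present them to the user.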
Issues of data mining
Data mining systems face many challenges/issues in today's world; some of them are:
1. Mining methodology:
Users should know what kind of methodology is used to retrieve the data.
2. Issues related to handling different types of data:
Data does not come in one format; it may be images, documents, XML, JPEG, etc.
When we get different types of data, there is a chance of issues in handling them.
3. Performance:
Once the data and its type are defined, the next major target is the performance of the operation.
Generally, performance is measured in terms of efficiency, effectiveness, and scalability.
4. Incorporation of background knowledge:
If users do not have domain (subject) knowledge, they cannot find the workflow and solution;
that is why we should first know the background knowledge of the particular domain.
5. Pattern evaluation:
If a user wants to retrieve patterns, they should first know the relationships between the
attributes.
Then they can retrieve patterns easily.
6. Handling noisy and incomplete data:
When we get data from different sources, there is a chance of it being noisy, distributed,
corrupted, or incomplete.
Metrics are a set of measurements that help determine the efficiency of a data mining
method/algorithm.
They help us decide/choose the right data mining algorithm.
Each data mining method has its own metrics.
For example, for web mining, the various metrics are websites, visitors, pages served, queries, etc.
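For the predictive tasks above, one of the most common metrics is accuracy. A minimal sketch in plain Python, with invented predictions:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical buy/not-buy predictions against the true outcomes.
print(accuracy([True, False, True, True],
               [True, False, False, True]))  # 0.75
```

Comparing such metric values across algorithms is how the "right" data mining method is chosen for a given task.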